Project Page for

CSR: Small: Collaborative Research: System research on persistent high-dimensional data access and its application to semiclassical molecular dynamics simulation

NSF Grant Nos. CNS-1526005 ($255,807) and CNS-1526190 ($231,546) , Sep. 2015 — Aug. 2020.

_____________________________________________________________________________

Disclaimer: Work reported in this website was supported in part by the National Science Foundation. However, any finding or conclusion reported in this website, or reported in documents listed on this site reflect only the scientific opinions of the researchers, and do not necessarily reflect the views of National Science Foundation.

Persistent High-Dimensional Data Access

The fast advancement in data generation/collection capabilities has lead to a deluge of data, however data understanding cannot adequately catch up in many cases. In addition to the volume and velocity, the complexity of data also poses challenge to big data processing and understanding. One of the challenges of data complexity is high dimensionality.

In many problems, the multi-dimensionality of the data comes from multiple features that combine to characterize essential properties of the objects under study, or multiple influencing or contributing factors of an outcome. Multi-dimensionality of data arises in many scientific and engineering studies, and examples include multiple observed symptoms for a disease, the performance of a hardware/software combination with multiple hardware/software specification parameters, physical systems with multiple controllable and observable parameters, image object identification, the potential energy of a molecular system as a function of atoms' positions, etc. The importance of high dimensional data is well reflected by the plethora of research activities. But for big datasets, high dimensionality brings more and bigger challenges, some of which may not be well understood or even discovered. This website documents efforts in tackling the challenges of accessing, organizing, searching disk-resident big sets of high-dimensional data from scientific studies, including semiclassical molecular dynamics studies and vulnerability studies of physical unclonable functions.

_____________________________________________________________________________

[Participants] [Goal/Motivations] [Data/Software/Publications] [Research] [Education]

_____________________________________________________________________________

Project Participants/Contributors

• Misha Ahmadian, Computer Science PhD student at Texas Tech University.

• Mohammed S. Alkatheiri, Assistant Professor, University of Jeddah, Saudi Arabia. Former PhD student at Texas Tech.

• Ahmad O. Aseeri, Assistant Professor, Prince Sattam bin Abdulaziz University, Saudi Arabia. Former PhD student at Texas Tech.

• Xin Chen, Former Computer Science MS student at Texas Tech University.

• Yu Mao, Former Computer Science MS student at Texas Tech University.

• Md. Raihan Majumder, Computer Science MS student at Texas Tech University.

• Khalid Mursi, Computer Science PhD student at Texas Tech University.

• Zohreh Safari, Computer Science PhD student at Texas Tech University.

• Weijun Xiao, Associate Professor, Electric & Computer Engineering, Virginia Commonwealth University.

• Huyen Vu, Computer Science undergraduate student at Texas tech University.

• Yu Zhuang, Associate Professor, Computer Science, Texas Tech University.

_____________________________________________________________________________

Project Goal

This project is to develop software for fast access of high-dimensional data on storages by developing algorithms and software for fast searches of high-dimensional data, and a simple, user-friendly API for non-blocking I/O access, and apply the software to semiclassical molecular dynamics and other scientific applications including high-dimensional data of challenge-response pairs from studies of physical unclonable functions.

Scientific Motivations

Semiclassical Molecular Dynamics. Quantum effects are inherent factors for material properties and chemical processes. Ab initio semiclassical molecular dynamics simulations capture quantum effects with good accuracy, for which data are calculated from quantum mechanics-based electronic structure theories. Calculations based on quantum mechanical electronic structure theories are called ab initio calculations, which are highly accurate but enormously expensive. For instance, a trajectory of 5000 time steps took over 11 days of parallel computing time on two quad-core Intel Xeon processors (8 cores total) for the 10-atom glycine. But due to the sequential nature of the computations at different time steps, parallelization can be applied only to the computation of each time step, where each time step takes about 3 minutes to compute. Thus, larger scale parallel computing can further reduce computation time but not much, and other ways that can combine with parallel computing to reduce computation time are sorely needed. The dominant computation cost of semiclassical dynamics is the Hessian at each time step. We recently developed a subspace Hessian approximation, making it possible to perform ab initio semiclassical dynamics for not-too-small ensembles of atoms. The Hessian approximation involves searching for the closest Hessian from a database of all ab initio Hessians saved from earlier time steps, calling for efficient search algorithm and software for the high-dimensionality data search.

Physical Unclonable Functions. Physical unclonable functions (PUFs) are emerging hardware primitives for security protocols. PUFs utilize integrated circuits’ manufacturing variations to produce responses unique for individual PUFs, and cannot be reproduced by even manufacturers. In addition, implementable with small number of transistors and operable with extremely low power, PUFs have great potential for resource-constrained devices. An important goal of security research is to discover all possible security vulnerabilities. One of the security vulnerabilities is that some PUFs are mathematically clonable by modeling methods though they are not physically clonable. Studies show that machine learning models can accurately predict the responses of some PUFs, indicating the possibility for attackers to develop malicious software to impersonate the identity of PUF-embedded devices by producing the same responses PUFs would give. In our recent studies, we generated large datasets challenge-response pairs (CRPs) for PUF vulnerability studies. Large data sizes have led to long hours to train machine learning models. We observed that different CRP data contribute to the machine learning process differently, and believed some CRP data can be identified and excluded for machine learning. The CRP data are of high dimensions, and the data accessing and searching methods out of the project could be of use.

_____________________________________________________________________________

Software, Data, and Publications

• LIBKM (Limited-Iteration Bisecting K-Means): algorithm for fast clustering of big data to enable fast search of high-dimensional big data. Paper. Software.

A general LIBKM-based exact nearest neighbor searches. Software (please cite the LIBKM paper).

• MemCAFC (Memory-Capacity-Aware Fast Clustering): algorithm based on LIBKM, for data-hardware-dependent clustering of big disk-resident datasets that can enable fast search of high-dimensional big data, applied to CRP datasets of physical unclonable functions on an SSD storage. Paper. Software.

• LaHiIO (Latency Hiding Input/Output): An easy-to-use API for non-blocking I/O operations for persistent big data access, applied to multiple datasets from different applications including CRP datasets of physical unclonable functions on a HDD storage. Paper. Software.

• A problem-tailored search algorithm for semiclassical molecular dynamics and incorporated it into a semiclassical dynamics simulation method. Paper.

• A dataset of molecular configurations (i.e. atom coordinates) of the amino acid glycine.

[Note: The dataset is in gzipped text file format, containing 20,000 data points, each of 30 dimensions]

• Datasets of challenge-response-pairs of simulated XOR PUFs at UCI data depository.

• Datasets of challenge-response-pairs of silicon XOR PUFs implemented on FGPAs.

• Publications out of this project include three conference poster [15,26,27], nineteen conference or workshop papers published or accepted [1,3,5-7,9-14,16-18,20-24], and five journal papers [2,4,8,19,25].

[1] K. Mursi, B. Thapaliya and Y. Zhuang (2020), “A hybrid-optimizer-enhanced neural network method for the security vulnerability study of multiplexer arbiter PUFs”, To appear in International Conference on Circuits, Systems and Devices, Kitakyushu, Japan, August 2020.

[2] K. Mursi and Y. Zhuang (2020), “Experimental study of component-differentially-challenged XOR PUFs as security primitives for Internet-of-Things", accepted by Journal of Comunication. August 2020.

[3] K. Mursi and Y. Zhuang (2020), “Experimental examination of component-differentially-challenged XOR PUF circuits", To appear in International Conference on Circuits, Systems and Devices, Kitakyushu, Japan, August 2020.

[4] M. Alamro, K. Mursi, Y. Zhuang, A. Aseeri and M. Alkatheiri (2020), “Robustness and unpredictability for double arbiter PUFs on silicon data: Risk assessment and performance evaluation”, Electronics, vol. 9, no.5, p. 870, 2020. Paper.

[5] Z. Safari, K. Mursi, and Y. Zhuang (2020), “Fast Automatic Determination of Cluster Numbers for High Dimensional Big Data”, International Conference on Compute and Data Analysis, March 2020, San Jose, California. Paper.

[6] M. A. Alamro, Y. Zhuang, A. O. Aseeri, M. S. Alkatheiri (2019), “Examination of double arbiter PUFs on security against machine learning attacks”, Proceedings of IEEE International Conference on Big Data. December 2019, Los Angeles, California. Paper.

[7] K. Mursi, Y. Zhuang, M. Alkatheiri, A. Aseeri (2019), “Extensive examination of XOR arbiter PUFs as security primitives for resource-constrained IoT devices”, 17th International Conference on Privacy, Security and Trust (PST2019), Fredericton, New Brunswick, Canada, August 2019. Paper.

[8] R. Conte, F. Gabas, G. Botti, Y. Zhuang, M. Ceotto (2019), “Semiclassical vibrational spectroscopy with Hessian databases”, Journal of Chemical Physics, Vol. 150 (24), June 2019, 244118. Paper.

[9] A. Aseeri, Y. Zhuang, and M.S. Alkatheiri (2018), “A subspace pre-learning approach to fast high-accuracy machine learning of large XOR PUFs with component-differential challenges”. IEEE International Conf. Big Data, December 2018, Seattle, Washington. Paper.

[10] A. Aseeri, Y. Zhuang, and M.S. Alkatheiri (2018), “LaHiIO: Accelerating persistent big data machine learning via latency hiding IOs”. IEEE International Conf. Big Data, December 2018, Seattle, Washington.

[11] Huikang Cao, Ruixuan Li, Wenlong Tian, Zhiyong Xu, and Weijun Xiao (2018). “A practical accountability scheme for oblivious RAM in cloud storage”. 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications (IEEE TrustCom-18), August 2018, New York, USA.

[12] A. Aseeri, Y. Zhuang, and M.S. Alkatheiri (2018), “A machine learning-based security vulnerability study on XOR PUFs for resource-constraint Internet of Things”, IEEE International Congress on Internet of Things, July 2-7, 2018, San Francisco, California. Paper.

[13] J. Zhou, D. Dai, Y. Mao, X. Chen, Y. Zhuang, and Y. Chen (2018), “I/O characteristics discovery in cloud storage systems”, IEEE International Conference on Cloud Computing, July 2-7, 2018, San Francisco, California.

[14] Wenlong Tian, Ruixuan Li, Weijun Xiao, Zhiyong Xu (2018). “PTS-Dep: A high-performance two-party secure deduplication for cloud storage”. 20th IEEE International Conferences on High Performance Computing and Communications (HPCC), Exeter, UK.

[15] Liang Xu, Andrew Ward, and Weijun Xiao (2018). “GraphAcc: A FPGA-based Computational Accelerator for Large-scale Graphs” (WiP and Poster). 16th Usenix Conference on File and Storage Technologies. Oakland, California, February, 2018.

[16] Wenlong Tian, Ruixuan Li, Zhiyong Xu, Weijun Xiao (2017). “Does the content defined chunking really solve the local boundary shift problem?” 36th International Performance Computing and Communications Conference (IPCCC 2017), December 2017, San Diego, California.

[17] A. Aseeri, Y. Zhuang, and M. Alkatheiri (2017), “A memory capacity-aware algorithm for fast clustering of disk-resident big datasets”, IEEE Cyber Science and Technology Congress, Orlando, Florida, November 2017.

[18] Liang Xu, Qianbin Xia, and Weijun Xiao (2017). “Locality-driven dynamic flash cache allocation”. 2017 IEEE Cyber Science and Technology Congress. November 2017, Orlando, Florida.

[19] M. Ahmadian, Y. Zhuang, W. L. Hase, and Y. Chen (2017), "Data reduction through increased data Utilization in chemical dynamics simulations", Big Data Research, Vol. 9, pp. 57-66 September 2017.

[20] M. Alkatheiri and Y. Zhuang (2017), “Towards fast and accurate machine learning attacks of feed-forward arbiter PUFs”, IEEE Conference on Dependable and Secure Computing, Taipei, Taiwan, August 2017.

[21] M.S. Alkatheiri, Y. Zhuang, M. Korobkov, and A. R. Sangi (2017), “An experimental study of the state-of-the-art PUFs implemented on FPGAs”, IEEE Conference on Dependable and Secure Computing, Taipei, Taiwan, August 2017.

[22] Yu Zhuang (2016), “Symmetric repositioning of bisecting K-means centers for increased reduction of distance calculations for big data clustering”, Proceedings of IEEE International Conference on Bigdata, Washington DC, December 2016.

[23] Qianbin Xia and Weijun Xiao (2016). “Improving MLC flash performance with workload-aware differentiated ECC”. 22nd IEEE International Conference on Parallel and Distributed Systems (ICPADS 2016). December 2016, Wuhan, China.

[24] Yu Zhuang, Yu Mao, and Xin Chen (2016), “A limited-iteration bisecting K-means for fast clustering large datasets”, Proceedings of the 10th IEEE International Conference on Big Data Science & Engineering (BigdataSE), Tianjin, China, August 2016.

[25] Q. Xia and W. Xiao (2016), “High-performance and endurable cache management for flash-based read caching”, IEEE Transactions on Parallel and Distributed Systems, 27, 3518-3531, 2016.

[26] M. Ahmadian, Y. Zhuang, W. L. Hase, and Y. Chen (2016), “Information-preserving data reduction for scientific applications through efficient data modeling” (Poster), Conference on Data Analysis (CoDA), Santa Fe, New Mexico, March 2016.

[27] Q. Xia and W. Xiao (2015). “Zero-Migration Garbage Collection Scheme for Flash Read Cache” (Poster. 24^th International Conference on Parallel Architectures and Compilation Techniques (PACT), San Francisco, CA, Oct. 18-21, 2015.

___________________________________________________________________________

Research

● Clustering and Searching of Big Data

High-frequency accesses of large disk-resident high dimensional datasets pose both I/O and algorithmic challenges. Investigators of the project are working on data sets of high dimensional data from semiclassical molecular dynamics studies, and high dimensional data from vulnerability studies of physical unclonable functions. There exist a plethora of methods for high-dimensional data clustering, searching, and learning. One of the most popular groups of search methods are Locality Sensitive Hashing (LSH), but LSH methods return only approximately nearest neighbors, not necessarily the nearest. There are many tree-based hierarchical space-partition methods like R-tree, SS-tree, and their variants, but as pointed out by studies on LSH methods, tree-based space-partition methods do not work well for high-dimensional data.

Observing that LSH methods are also space-partition methods where the space of the search index is partitioned into buckets. LSH differs from tree-based space-partition methods in that LSH methods use only one-level of space partition, not a hierarchy of nested partitions. A single level of partition obviously makes the code implementation easier, and we surmise that for high-dimensional data, multi-level of nested partition may be ineffective since a level of space partition needs to produce data subsets whose diameters reasonably smaller than that that of the whole dataset and it is almost impossible for every level of a hierarchical space partition method to produce subsets whose diameters are at least half of that of its parent for high-dimensional data. Along this line of surmise, to search high-dimensional datasets for nearest neighbors, we developed Limited-Iteration Bisecting K-means (LIBKM), a memory capacity-aware LIBKM clustering method called MemCAFC for disk-resident large datasets, and Center-Repositioned LIBKM which significantly improve the clustering efficiency while maintaining almost the same clustering quality, making them good candidates as data-organizing methods for indexing high-dimensional data. And a single-level space-partition-based search method (software) was developed using LIBKM for the space partition.

● Simplistic API of non-blocking I/O operations

A small set of simple, user-friend non-blocking I/O operations has been developed (prototype) for easy implementation of overlapping of computation and IOs. The API has been tested for data mining and machine learning applications. Paper. Software.

● SSD-based HDD I/O accelerator

Rapid advances in non-volatile memories create opportunities to alleviate the I/O bottleneck. We have leveraged flash memories as an I/O accelerator, and designed a hybrid storage system that uses a fast and small SSD as a bypassable cache to HDDs, with a goal to serve a majority of random I/O accesses by the SSD cache. We designed a new scheme that takes the full advantages of application knowledge to provide a unified storage system. We leverage the physical properties of flash memory for designing a low-cost ghost cache. The idea behind this scheme is that when data blocks on the SSD are updated, the new data is redirected to new flash blocks and the old flash blocks are invalidated, but the invalidated data blocks are removed only in metadata and can still be made accessible before it is erased and reclaimed by garbage collection and wear leveling algorithms. These invalidated but not-erased data blocks are identified to form the ghost cache, and are utilized to speed up the I/Os. Paper.

● Generation of data for experimental studies

To have adequate sets of high-dimensional data to test accessing, mining, and learning methods developed and to be developed, we generated different sets of data from semiclassical dynamics and from physical unclonable functions. See Section Data/Software/Publications for datasets.

____________________________________________________________________________

Education

● Courses developed/offered

CS 4331: Data Mining, Fall 2018, Texas Tech University

CS 5331: Mining Big Data with HPC, Fall 2017, Texas Tech University

CS 5331: Data Mining with HPC, Fall 2016, Texas Tech University

● Students who graduated with project-related work as part of the thesis/dissertation

Misha Ahmadian, graduated in December 2016 with a MS degree in Computer Science.

Ahmad O. Aseeri, graduated in August 2018 with a PhD degree in Computer Science.

Zohreh Safari, a PhD student at Texas Tech, obtained a MS degree in Computer Science in Aug 2020.

____________________________________________________________________________