Project Page for
CSR: Small: Collaborative Research: System
research on persistent high-dimensional data access and its application to semiclassical molecular dynamics simulation
NSF Grant Nos. CNS-1526005 ($255,807) and CNS-1526190 ($231,546) ,
Sep. 2015 — Aug. 2020.
_____________________________________________________________________________
Disclaimer: Work
reported in this website was supported in part by the National Science
Foundation. However, any finding or
conclusion reported in this website, or reported in
documents listed on this site reflect only the scientific opinions of
the researchers, and do not necessarily reflect the views of National
Science Foundation. |
Persistent
High-Dimensional Data Access
The fast advancement
in data generation/collection capabilities has lead to a deluge of data,
however data understanding cannot adequately catch up in many cases. In addition to the volume and velocity, the
complexity of data also poses challenge to big data processing and
understanding. One of the challenges of
data complexity is high dimensionality.
In many
problems, the multi-dimensionality of the data comes from multiple features
that combine to characterize essential properties of the objects under study,
or multiple influencing or contributing factors of an outcome. Multi-dimensionality of data arises in many
scientific and engineering studies, and examples include multiple observed
symptoms for a disease, the performance of a hardware/software combination with
multiple hardware/software specification parameters, physical systems with
multiple controllable and observable parameters, image object identification,
the potential energy of a molecular system as a function of atoms' positions,
etc. The importance of high dimensional
data is well reflected by the plethora of research activities. But for big
datasets, high dimensionality brings more and bigger challenges, some of which
may not be well understood or even discovered.
This website documents efforts in tackling the challenges of accessing,
organizing, searching disk-resident big sets of high-dimensional data from
scientific studies, including semiclassical molecular
dynamics studies and vulnerability studies of physical unclonable
functions.
_____________________________________________________________________________
[Participants]
[Goal/Motivations] [Data/Software/Publications] [Research] [Education]
_____________________________________________________________________________
Project
Participants/Contributors
• Misha Ahmadian, Computer
Science PhD student at Texas Tech University.
• Mohammed S. Alkatheiri,
Assistant Professor, University of Jeddah, Saudi Arabia. Former PhD student at
Texas Tech.
• Ahmad O. Aseeri,
Assistant Professor, Prince Sattam bin Abdulaziz University, Saudi Arabia. Former PhD student at Texas Tech.
• Xin Chen, Former
Computer Science MS student at Texas Tech University.
• Yu Mao, Former Computer Science MS student
at Texas Tech University.
• Md. Raihan Majumder, Computer Science MS student at Texas Tech
University.
• Khalid Mursi,
Computer Science PhD student at Texas Tech University.
• Zohreh Safari,
Computer Science PhD student at Texas Tech University.
• Weijun
Xiao, Associate Professor, Electric & Computer Engineering, Virginia
Commonwealth University.
• Huyen Vu, Computer
Science undergraduate student at Texas tech University.
• Yu
Zhuang, Associate Professor, Computer Science,
Texas Tech University.
_____________________________________________________________________________
This project is
to develop software for fast access of high-dimensional data on storages by
developing algorithms and software for fast searches of high-dimensional data,
and a simple, user-friendly API for non-blocking I/O access, and apply the
software to semiclassical molecular dynamics and
other scientific applications including high-dimensional data of
challenge-response pairs from studies of physical unclonable
functions.
Scientific Motivations
Semiclassical Molecular Dynamics. Quantum effects are inherent factors for
material properties and chemical processes. Ab initio
semiclassical molecular dynamics simulations capture
quantum effects with good accuracy, for which data are calculated from quantum
mechanics-based electronic structure theories. Calculations based on quantum
mechanical electronic structure theories are called ab
initio calculations, which are highly accurate but enormously expensive. For
instance, a trajectory of 5000 time steps took over 11 days of parallel
computing time on two quad-core Intel Xeon processors (8 cores total) for the
10-atom glycine. But due to the sequential nature of
the computations at different time steps, parallelization can be applied only
to the computation of each time step, where each time step takes about 3
minutes to compute. Thus, larger scale parallel computing can further reduce
computation time but not much, and other ways that can combine with parallel
computing to reduce computation time are sorely needed. The dominant
computation cost of semiclassical dynamics is the
Hessian at each time step. We recently developed a subspace Hessian
approximation, making it possible to perform ab
initio semiclassical dynamics for not-too-small
ensembles of atoms. The Hessian approximation involves searching for the
closest Hessian from a database of all ab initio
Hessians saved from earlier time steps, calling for efficient search algorithm
and software for the high-dimensionality data search.
Physical Unclonable Functions. Physical unclonable functions (PUFs) are emerging hardware primitives
for security protocols. PUFs utilize integrated circuits’ manufacturing
variations to produce responses unique for individual PUFs, and cannot be
reproduced by even manufacturers. In addition, implementable with small number
of transistors and operable with extremely low power, PUFs have great potential
for resource-constrained devices. An important goal of security research is to
discover all possible security vulnerabilities. One of the security
vulnerabilities is that some PUFs are mathematically clonable
by modeling methods though they are not physically clonable.
Studies show that machine learning models can accurately predict the responses
of some PUFs, indicating the possibility for attackers to develop malicious
software to impersonate the identity of PUF-embedded devices by producing the
same responses PUFs would give. In our recent studies, we generated large
datasets challenge-response pairs (CRPs) for PUF vulnerability studies. Large
data sizes have led to long hours to train machine learning models. We observed
that different CRP data contribute to the machine learning process differently,
and believed some CRP data can be identified and excluded for machine learning.
The CRP data are of high dimensions, and the data accessing and searching
methods out of the project could be of
use.
_____________________________________________________________________________
Software,
Data, and Publications
• LIBKM (Limited-Iteration Bisecting K-Means):
algorithm for fast clustering of big data to enable fast search of
high-dimensional big data. Paper. Software.
A general LIBKM-based exact nearest
neighbor searches. Software (please cite the
LIBKM paper).
• MemCAFC
(Memory-Capacity-Aware Fast Clustering): algorithm based on LIBKM, for
data-hardware-dependent clustering of big disk-resident datasets that can
enable fast search of high-dimensional big data, applied to CRP datasets of
physical unclonable functions on an SSD storage. Paper. Software.
• LaHiIO (Latency
Hiding Input/Output): An easy-to-use API for
non-blocking I/O operations for persistent big data access, applied to multiple
datasets from different applications including CRP datasets of physical unclonable functions on a HDD storage. Paper. Software.
• A problem-tailored search algorithm for semiclassical molecular dynamics and incorporated it into a
semiclassical dynamics simulation method. Paper.
• A dataset
of molecular configurations (i.e. atom coordinates) of the amino acid glycine.
[Note: The dataset is in gzipped text file format, containing 20,000 data points,
each of 30 dimensions]
• Datasets of challenge-response-pairs of
simulated XOR PUFs at UCI
data depository.
• Datasets
of challenge-response-pairs of silicon XOR PUFs implemented on FGPAs.
• Publications out of this project include
three conference poster [15,26,27], nineteen conference or workshop papers
published or accepted [1,3,5-7,9-14,16-18,20-24], and five journal papers
[2,4,8,19,25].
[1]
K.
Mursi, B. Thapaliya and Y. Zhuang (2020), “A hybrid-optimizer-enhanced neural network
method for the security vulnerability study of multiplexer arbiter PUFs”, To
appear in International Conference on Circuits, Systems and Devices,
Kitakyushu, Japan, August 2020.
[2]
K.
Mursi and Y. Zhuang (2020), “Experimental study of component-differentially-challenged
XOR PUFs as security primitives for Internet-of-Things", accepted by Journal of Comunication. August 2020.
[3]
K.
Mursi and Y. Zhuang (2020), “Experimental examination of
component-differentially-challenged XOR PUF circuits", To appear in International Conference on
Circuits, Systems and Devices, Kitakyushu, Japan, August 2020.
[4]
M.
Alamro, K. Mursi, Y. Zhuang, A. Aseeri and M. Alkatheiri (2020), “Robustness
and unpredictability for double arbiter PUFs on silicon data: Risk assessment
and performance evaluation”, Electronics, vol. 9, no.5, p. 870, 2020. Paper.
[5]
Z.
Safari, K. Mursi, and Y. Zhuang (2020), “Fast
Automatic Determination of Cluster Numbers for High Dimensional Big Data”, International Conference on Compute and Data
Analysis, March 2020, San Jose, California. Paper.
[6]
M. A. Alamro, Y. Zhuang, A. O. Aseeri, M. S. Alkatheiri (2019),
“Examination of double arbiter PUFs on security against machine learning
attacks”, Proceedings of IEEE International Conference on Big Data.
December 2019, Los Angeles, California. Paper.
[7]
K. Mursi, Y. Zhuang, M. Alkatheiri, A. Aseeri (2019),
“Extensive examination of XOR arbiter PUFs as security primitives for
resource-constrained IoT devices”, 17th International Conference on Privacy,
Security and Trust (PST2019), Fredericton, New
Brunswick, Canada, August 2019. Paper.
[8]
R. Conte, F. Gabas, G. Botti, Y. Zhuang, M. Ceotto (2019), “Semiclassical vibrational
spectroscopy with Hessian databases”, Journal
of Chemical Physics, Vol. 150 (24), June 2019, 244118. Paper.
[9]
A. Aseeri, Y. Zhuang, and M.S. Alkatheiri (2018), “A subspace pre-learning approach to
fast high-accuracy machine learning of large XOR PUFs with
component-differential challenges”. IEEE
International Conf. Big Data, December 2018, Seattle, Washington. Paper.
[10] A. Aseeri, Y. Zhuang, and M.S. Alkatheiri (2018), “LaHiIO:
Accelerating persistent big data machine learning via latency hiding IOs”. IEEE International Conf. Big Data, December 2018,
Seattle, Washington.
[11] Huikang Cao, Ruixuan Li, Wenlong Tian, Zhiyong
Xu, and Weijun Xiao (2018).
“A practical accountability scheme for oblivious RAM in
cloud storage”. 17th
IEEE International Conference On Trust, Security And Privacy In Computing And
Communications (IEEE TrustCom-18), August 2018, New York, USA.
[12] A. Aseeri, Y. Zhuang, and M.S. Alkatheiri
(2018), “A
machine learning-based security vulnerability study on XOR PUFs for resource-constraint
Internet of Things”, IEEE International Congress on Internet of
Things, July 2-7, 2018, San Francisco, California. Paper.
[13] J. Zhou, D.
Dai, Y. Mao, X. Chen, Y. Zhuang, and Y. Chen (2018),
“I/O characteristics discovery in cloud storage systems”, IEEE International Conference on Cloud
Computing, July 2-7, 2018, San Francisco, California.
[14] Wenlong Tian, Ruixuan
Li, Weijun Xiao, Zhiyong Xu (2018). “PTS-Dep: A high-performance two-party secure deduplication for cloud storage”. 20th
IEEE International Conferences on High Performance Computing and Communications
(HPCC), Exeter, UK.
[15] Liang Xu, Andrew Ward, and Weijun Xiao (2018). “GraphAcc:
A FPGA-based Computational Accelerator for Large-scale Graphs” (WiP and Poster). 16th Usenix
Conference on File and Storage Technologies. Oakland, California, February,
2018.
[16] Wenlong Tian, Ruixuan
Li, Zhiyong Xu, Weijun Xiao (2017). “Does the content defined chunking
really solve the local boundary shift problem?” 36th International Performance Computing and Communications Conference
(IPCCC 2017), December 2017, San Diego, California.
[17] A. Aseeri, Y. Zhuang, and M. Alkatheiri (2017),
“A memory capacity-aware algorithm for fast clustering of disk-resident big
datasets”, IEEE Cyber Science and
Technology Congress, Orlando, Florida, November 2017.
[18] Liang Xu, Qianbin Xia, and Weijun Xiao
(2017). “Locality-driven dynamic flash cache allocation”. 2017 IEEE Cyber Science and Technology Congress. November
2017, Orlando, Florida.
[19] M. Ahmadian, Y. Zhuang, W. L. Hase, and Y. Chen
(2017), "Data reduction through increased data Utilization in chemical
dynamics simulations", Big Data
Research, Vol. 9, pp. 57-66 September 2017.
[20] M. Alkatheiri and Y. Zhuang (2017), “Towards fast and accurate machine learning
attacks of feed-forward arbiter PUFs”, IEEE
Conference on Dependable and Secure Computing, Taipei, Taiwan, August 2017.
[21] M.S. Alkatheiri, Y. Zhuang, M. Korobkov, and A. R. Sangi (2017), “An experimental study of the
state-of-the-art PUFs implemented on FPGAs”, IEEE Conference on Dependable and Secure Computing, Taipei, Taiwan,
August 2017.
[22] Yu
Zhuang (2016), “Symmetric
repositioning of bisecting K-means centers for increased reduction of distance
calculations for big data clustering”, Proceedings
of IEEE International Conference on Bigdata, Washington DC, December 2016.
[23] Qianbin Xia and Weijun Xiao (2016).
“Improving
MLC flash performance with workload-aware differentiated ECC”. 22nd IEEE International Conference on
Parallel and Distributed Systems (ICPADS 2016). December 2016, Wuhan,
China.
[24] Yu
Zhuang, Yu Mao, and Xin
Chen (2016), “A limited-iteration bisecting K-means for fast clustering large
datasets”, Proceedings of the 10th IEEE International Conference on
Big Data Science & Engineering (BigdataSE),
Tianjin, China, August 2016.
[25] Q. Xia and W. Xiao (2016), “High-performance and
endurable cache management for flash-based read caching”, IEEE Transactions on Parallel and Distributed Systems, 27,
3518-3531, 2016.
[26] M. Ahmadian, Y. Zhuang, W. L. Hase, and Y. Chen
(2016), “Information-preserving data reduction for scientific applications
through efficient data modeling” (Poster), Conference
on Data Analysis (CoDA),
Santa Fe, New Mexico, March 2016.
[27] Q. Xia and W. Xiao (2015). “Zero-Migration Garbage
Collection Scheme for Flash Read Cache” (Poster. 24th
International Conference on Parallel Architectures and Compilation Techniques
(PACT), San Francisco, CA, Oct. 18-21, 2015.
___________________________________________________________________________
●
Clustering and Searching of Big Data
High-frequency accesses
of large disk-resident high dimensional datasets pose both I/O and algorithmic challenges.
Investigators of the project are working on data sets of high dimensional data
from semiclassical molecular dynamics studies, and
high dimensional data from vulnerability studies of physical unclonable functions. There exist a plethora of methods for
high-dimensional data clustering, searching, and learning. One of the most
popular groups of search methods are Locality Sensitive Hashing (LSH), but LSH
methods return only approximately nearest neighbors, not necessarily the
nearest. There are many tree-based hierarchical space-partition methods like
R-tree, SS-tree, and their variants, but as pointed out by studies on LSH
methods, tree-based space-partition methods do not work well for
high-dimensional data.
Observing that LSH
methods are also space-partition methods where the space of the search index is
partitioned into buckets. LSH differs from tree-based space-partition methods
in that LSH methods use only one-level of space partition, not a hierarchy of
nested partitions. A single level of partition obviously makes the code
implementation easier, and we surmise that for high-dimensional data,
multi-level of nested partition may be ineffective since a level of space
partition needs to produce data subsets whose diameters reasonably smaller than
that that of the whole dataset and it is almost impossible for every level of a
hierarchical space partition method to produce subsets whose diameters are at
least half of that of its parent for high-dimensional data. Along this line of
surmise, to search high-dimensional datasets for nearest neighbors, we
developed Limited-Iteration Bisecting K-means (LIBKM), a memory capacity-aware LIBKM clustering
method called MemCAFC for disk-resident large datasets, and
Center-Repositioned LIBKM which significantly improve the clustering efficiency
while maintaining almost the same clustering quality, making them good
candidates as data-organizing methods for indexing high-dimensional data. And a
single-level space-partition-based search method (software)
was developed using LIBKM for the space partition.
●
Simplistic API of non-blocking I/O operations
A small set of simple, user-friend non-blocking I/O operations has been developed (prototype) for easy implementation of overlapping of computation and IOs. The API has been tested for data mining and machine learning applications. Paper. Software.
●
SSD-based HDD I/O accelerator
Rapid advances in non-volatile memories create
opportunities to alleviate the I/O bottleneck. We have leveraged flash memories
as an I/O accelerator, and designed a hybrid storage system that uses a fast
and small SSD as a bypassable cache to HDDs, with a
goal to serve a majority of random I/O accesses by the SSD cache. We designed a
new scheme that takes the full advantages of application knowledge to provide a
unified storage system. We leverage the physical properties of flash memory for
designing a low-cost ghost cache. The idea behind this scheme is that when data
blocks on the SSD are updated, the new data is redirected to new flash blocks
and the old flash blocks are invalidated, but the invalidated data blocks are
removed only in metadata and can still be made accessible before it is erased
and reclaimed by garbage collection and wear leveling algorithms. These
invalidated but not-erased data blocks are identified to form the ghost cache,
and are utilized to speed up the I/Os. Paper.
●
Generation of data for experimental studies
To have adequate sets of high-dimensional data to test accessing, mining, and learning methods developed and to be developed, we generated different sets of data from semiclassical dynamics and from physical unclonable functions. See Section Data/Software/Publications for datasets.
____________________________________________________________________________
Education
● Courses developed/offered
CS 4331: Data Mining, Fall 2018, Texas
Tech University
CS 5331: Mining Big Data with HPC, Fall
2017, Texas Tech University
CS 5331: Data Mining with HPC, Fall 2016, Texas Tech University
● Students who graduated with project-related
work as part of the thesis/dissertation
Misha Ahmadian, graduated in
December 2016 with a MS degree in Computer Science.
Ahmad O. Aseeri,
graduated in
August 2018 with a PhD degree in Computer Science.
Zohreh Safari, a PhD student at
Texas Tech, obtained a MS degree in Computer Science in Aug 2020.
____________________________________________________________________________
Acknowledgement
The work is supported in
part by National Science Foundation under the project titled "CSR: Small: Collaborative Research: System research on
persistent high-dimensional data access and its application to semiclassical molecular dynamics simulation" (Grant No.
CNS-1526005 and Grant No. CNS-1526190).
____________________________________________________________________________