Search | VHL Regional Portal

1.

Deep learning at the edge enables real-time streaming ptychographic imaging.

Babu, Anakha V; Zhou, Tao; Kandel, Saugat; Bicer, Tekin; Liu, Zhengchun; Judge, William; Ching, Daniel J; Jiang, Yi; Veseli, Sinisa; Henke, Steven; Chard, Ryan; Yao, Yudong; Sirazitdinova, Ekaterina; Gupta, Geetika; Holt, Martin V; Foster, Ian T; Miceli, Antonino; Cherukara, Mathew J.

Nat Commun ; 14(1): 7059, 2023 Nov 03.

Article in English | MEDLINE | ID: mdl-37923741

ABSTRACT

Coherent imaging techniques provide an unparalleled multi-scale view of materials across scientific and technological fields, from structural materials to quantum devices, from integrated circuits to biological cells. Driven by the construction of brighter sources and high-rate detectors, coherent imaging methods like ptychography are poised to revolutionize nanoscale materials characterization. However, these advancements are accompanied by significant increase in data and compute needs, which precludes real-time imaging, feedback and decision-making capabilities with conventional approaches. Here, we demonstrate a workflow that leverages artificial intelligence at the edge and high-performance computing to enable real-time inversion on X-ray ptychography data streamed directly from a detector at up to 2 kHz. The proposed AI-enabled workflow eliminates the oversampling constraints, allowing low-dose imaging using orders of magnitude less data than required by traditional methods.

2.

Robotic pendant drop: containerless liquid for µs-resolved, AI-executable XPCS.

Ozgulbas, Doga Yamac; Jensen, Don; Butler, Rory; Vescovi, Rafael; Foster, Ian T; Irvin, Michael; Nakaye, Yasukazu; Chu, Miaoqi; Dufresne, Eric M; Seifert, Soenke; Babnigg, Gyorgy; Ramanathan, Arvind; Zhang, Qingteng.

Light Sci Appl ; 12(1): 196, 2023 Aug 18.

Article in English | MEDLINE | ID: mdl-37596264

ABSTRACT

The dynamics and structure of mixed phases in a complex fluid can significantly impact its material properties, such as viscoelasticity. Small-angle X-ray Photon Correlation Spectroscopy (SA-XPCS) can probe the spontaneous spatial fluctuations of the mixed phases under various in situ environments over wide spatiotemporal ranges (10-6-103 s /10-10-10-6 m). Tailored material design, however, requires searching through a massive number of sample compositions and experimental parameters, which is beyond the bandwidth of the current coherent X-ray beamline. Using 3.7-µs-resolved XPCS synchronized with the clock frequency at the Advanced Photon Source, we demonstrated the consistency between the Brownian dynamics of ~100 nm diameter colloidal silica nanoparticles measured from an enclosed pendant drop and a sealed capillary. The electronic pipette can also be mounted on a robotic arm to access different stock solutions and create complex fluids with highly-repeatable and precisely controlled composition profiles. This closed-loop, AI-executable protocol is applicable to light scattering techniques regardless of the light wavelength and optical coherence, and is a first step towards high-throughput, autonomous material discovery.

3.

Linking scientific instruments and computation: Patterns, technologies, and experiences.

Vescovi, Rafael; Chard, Ryan; Saint, Nickolaus D; Blaiszik, Ben; Pruyne, Jim; Bicer, Tekin; Lavens, Alex; Liu, Zhengchun; Papka, Michael E; Narayanan, Suresh; Schwarz, Nicholas; Chard, Kyle; Foster, Ian T.

Patterns (N Y) ; 3(10): 100606, 2022 Oct 14.

Article in English | MEDLINE | ID: mdl-36277824

ABSTRACT

Powerful detectors at modern experimental facilities routinely collect data at multiple GB/s. Online analysis methods are needed to enable the collection of only interesting subsets of such massive data streams, such as by explicitly discarding some data elements or by directing instruments to relevant areas of experimental space. Thus, methods are required for configuring and running distributed computing pipelines-what we call flows-that link instruments, computers (e.g., for analysis, simulation, artificial intelligence [AI] model training), edge computing (e.g., for analysis), data stores, metadata catalogs, and high-speed networks. We review common patterns associated with such flows and describe methods for instantiating these patterns. We present experiences with the application of these methods to the processing of data from five different scientific instruments, each of which engages powerful computers for data inversion,model training, or other purposes. We also discuss implications of such methods for operators and users of scientific facilities.

4.

2'-O methylation of RNA cap in SARS-CoV-2 captured by serial crystallography.

Wilamowski, Mateusz; Sherrell, Darren A; Minasov, George; Kim, Youngchang; Shuvalova, Ludmilla; Lavens, Alex; Chard, Ryan; Maltseva, Natalia; Jedrzejczak, Robert; Rosas-Lemus, Monica; Saint, Nickolaus; Foster, Ian T; Michalska, Karolina; Satchell, Karla J F; Joachimiak, Andrzej.

Proc Natl Acad Sci U S A ; 118(21)2021 05 25.

Article in English | MEDLINE | ID: mdl-33972410

ABSTRACT

The genome of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) coronavirus has a capping modification at the 5'-untranslated region (UTR) to prevent its degradation by host nucleases. These modifications are performed by the Nsp10/14 and Nsp10/16 heterodimers using S-adenosylmethionine as the methyl donor. Nsp10/16 heterodimer is responsible for the methylation at the ribose 2'-O position of the first nucleotide. To investigate the conformational changes of the complex during 2'-O methyltransferase activity, we used a fixed-target serial synchrotron crystallography method at room temperature. We determined crystal structures of Nsp10/16 with substrates and products that revealed the states before and after methylation, occurring within the crystals during the experiments. Here we report the crystal structure of Nsp10/16 in complex with Cap-1 analog (m7GpppAm2'-O). Inhibition of Nsp16 activity may reduce viral proliferation, making this protein an attractive drug target.

Subject(s)

RNA Caps/metabolism , RNA, Messenger/metabolism , RNA, Viral/metabolism , SARS-CoV-2/chemistry , Crystallography , Methylation , Methyltransferases/chemistry , Methyltransferases/metabolism , Multiprotein Complexes/chemistry , Multiprotein Complexes/metabolism , RNA Cap Analogs/chemistry , RNA Cap Analogs/metabolism , RNA Caps/chemistry , RNA, Messenger/chemistry , RNA, Viral/chemistry , S-Adenosylhomocysteine/chemistry , S-Adenosylhomocysteine/metabolism , S-Adenosylmethionine/chemistry , S-Adenosylmethionine/metabolism , SARS-CoV-2/genetics , SARS-CoV-2/metabolism , Synchrotrons , Viral Nonstructural Proteins/chemistry , Viral Nonstructural Proteins/metabolism , Viral Regulatory and Accessory Proteins/chemistry , Viral Regulatory and Accessory Proteins/metabolism

5.

Quantum-Chemically Informed Machine Learning: Prediction of Energies of Organic Molecules with 10 to 14 Non-hydrogen Atoms.

Dandu, Naveen; Ward, Logan; Assary, Rajeev S; Redfern, Paul C; Narayanan, Badri; Foster, Ian T; Curtiss, Larry A.

J Phys Chem A ; 124(28): 5804-5811, 2020 Jul 16.

Article in English | MEDLINE | ID: mdl-32539388

ABSTRACT

High-fidelity quantum-chemical calculations can provide accurate predictions of molecular energies, but their high computational costs limit their utility, especially for larger molecules. We have shown in previous work that machine learning models trained on high-level quantum-chemical calculations (G4MP2) for organic molecules with one to nine non-hydrogen atoms can provide accurate predictions for other molecules of comparable size at much lower costs. Here we demonstrate that such models can also be used to effectively predict energies of molecules larger than those in the training set. To implement this strategy, we first established a set of 191 molecules with 10-14 non-hydrogen atoms having reliable experimental enthalpies of formation. We then assessed the accuracy of computed G4MP2 enthalpies of formation for these 191 molecules. The error in the G4MP2 results was somewhat larger than that for smaller molecules, and the reason for this increase is discussed. Two density functional methods, B3LYP and ωB97X-D, were also used on this set of molecules, with ωB97X-D found to perform better than B3LYP at predicting energies. The G4MP2 energies for the 191 molecules were then predicted using these two functionals with two machine learning methods, the FCHL-Δ and SchNet-Δ models, with the learning done on calculated energies of the one to nine non-hydrogen atom molecules. The better-performing model, FCHL-Δ, gave atomization energies of the 191 organic molecules with 10-14 non-hydrogen atoms within 0.4 kcal/mol of their G4MP2 energies. Thus, this work demonstrates that quantum-chemically informed machine learning can be used to successfully predict the energies of large organic molecules whose size is beyond that in the training set.

6.

Prevalence of Inherited Mutations in Breast Cancer Predisposition Genes among Women in Uganda and Cameroon.

Adedokun, Babatunde; Zheng, Yonglan; Ndom, Paul; Gakwaya, Antony; Makumbi, Timothy; Zhou, Alicia Y; Yoshimatsu, Toshio F; Rodriguez, Alex; Madduri, Ravi K; Foster, Ian T; Sallam, Aminah; Olopade, Olufunmilayo I; Huo, Dezheng.

Cancer Epidemiol Biomarkers Prev ; 29(2): 359-367, 2020 02.

Article in English | MEDLINE | ID: mdl-31871109

ABSTRACT

BACKGROUND: Sub-Saharan Africa (SSA) has a high proportion of premenopausal hormone receptor negative breast cancer. Previous studies reported a strikingly high prevalence of germline mutations in BRCA1 and BRCA2 among Nigerian patients with breast cancer. It is unknown if this exists in other SSA countries. METHODS: Breast cancer cases, unselected for age at diagnosis and family history, were recruited from tertiary hospitals in Kampala, Uganda and Yaoundé, Cameroon. Controls were women without breast cancer recruited from the same hospitals and age-matched to cases. A multigene sequencing panel was used to test for germline mutations. RESULTS: There were 196 cases and 185 controls with a mean age of 46.2 and 46.6 years for cases and controls, respectively. Among cases, 15.8% carried a pathogenic or likely pathogenic mutation in a breast cancer susceptibility gene: 5.6% in BRCA1, 5.6% in BRCA2, 1.5% in ATM, 1% in PALB2, 0.5% in BARD1, 0.5% in CDH1, and 0.5% in TP53. Among controls, 1.6% carried a mutation in one of these genes. Cases were 11-fold more likely to carry a mutation compared with controls (OR = 11.34; 95% confidence interval, 3.44-59.06; P < 0.001). The mean age of cases with BRCA1 mutations was 38.3 years compared with 46.7 years among other cases without such mutations (P = 0.03). CONCLUSIONS: Our findings replicate the earlier report of a high proportion of mutations in BRCA1/2 among patients with symptomatic breast cancer in SSA. IMPACT: Given the high burden of inherited breast cancer in SSA countries, genetic risk assessment could be integrated into national cancer control plans.

Subject(s)

Biomarkers, Tumor/genetics , Breast Neoplasms/genetics , Genetic Predisposition to Disease , Germ-Line Mutation , Adult , BRCA1 Protein/genetics , BRCA2 Protein/genetics , Breast Neoplasms/epidemiology , Cameroon/epidemiology , Case-Control Studies , DNA Mutational Analysis/statistics & numerical data , Female , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Middle Aged , Molecular Epidemiology , Prevalence , Uganda/epidemiology

7.

Trace: a high-throughput tomographic reconstruction engine for large-scale datasets.

Bicer, Tekin; Gürsoy, Doga; Andrade, Vincent De; Kettimuthu, Rajkumar; Scullin, William; Carlo, Francesco De; Foster, Ian T.

Adv Struct Chem Imaging ; 3(1): 6, 2017.

Article in English | MEDLINE | ID: mdl-28261544

ABSTRACT

BACKGROUND: Modern synchrotron light sources and detectors produce data at such scale and complexity that large-scale computation is required to unleash their full power. One of the widely used imaging techniques that generates data at tens of gigabytes per second is computed tomography (CT). Although CT experiments result in rapid data generation, the analysis and reconstruction of the collected data may require hours or even days of computation time with a medium-sized workstation, which hinders the scientific progress that relies on the results of analysis. METHODS: We present Trace, a data-intensive computing engine that we have developed to enable high-performance implementation of iterative tomographic reconstruction algorithms for parallel computers. Trace provides fine-grained reconstruction of tomography datasets using both (thread-level) shared memory and (process-level) distributed memory parallelization. Trace utilizes a special data structure called replicated reconstruction object to maximize application performance. We also present the optimizations that we apply to the replicated reconstruction objects and evaluate them using tomography datasets collected at the Advanced Photon Source. RESULTS: Our experimental evaluations show that our optimizations and parallelization techniques can provide 158× speedup using 32 compute nodes (384 cores) over a single-core configuration and decrease the end-to-end processing time of a large sinogram (with 4501 × 1 × 22,400 dimensions) from 12.5 h to <5 min per iteration. CONCLUSION: The proposed tomographic reconstruction engine can efficiently process large-scale tomographic data using many compute nodes and minimize reconstruction times.

8.

Blending Education and Polymer Science: Semi Automated Creation of a Thermodynamic Property Database.

Tchoua, Roselyne B; Qin, Jian; Audus, Debra J; Chard, Kyle; Foster, Ian T; de Pablo, Juan.

J Chem Educ ; 93(9): 1561-1568, 2016 09 13.

Article in English | MEDLINE | ID: mdl-27795574

ABSTRACT

Structured databases of chemical and physical properties play a central role in the everyday research activities of scientists and engineers. In materials science, researchers and engineers turn to these databases to quickly query, compare, and aggregate various properties, thereby allowing for the development or application of new materials. The vast majority of these databases have been generated manually, through decades of labor-intensive harvesting of information from the literature; yet, while there are many examples of commonly used databases, a significant number of important properties remain locked within the tables, figures, and text of publications. The question addressed in our work is whether, and to what extent, the process of data collection can be automated. Students of the physical sciences and engineering are often confronted with the challenge of finding and applying property data from the literature, and a central aspect of their education is to develop the critical skills needed to identify such data and discern their meaning or validity. To address shortcomings associated with automated information extraction, while simultaneously preparing the next generation of scientists for their future endeavors, we developed a novel course-based approach in which students develop skills in polymer chemistry and physics and apply their knowledge by assisting with the semi-automated creation of a thermodynamic property database.

9.

Optimization of tomographic reconstruction workflows on geographically distributed resources.

Bicer, Tekin; Gürsoy, DogËa; Kettimuthu, Rajkumar; De Carlo, Francesco; Foster, Ian T.

J Synchrotron Radiat ; 23(Pt 4): 997-1005, 2016 07.

Article in English | MEDLINE | ID: mdl-27359149

ABSTRACT

New technological advancements in synchrotron light sources enable data acquisitions at unprecedented levels. This emergent trend affects not only the size of the generated data but also the need for larger computational resources. Although beamline scientists and users have access to local computational resources, these are typically limited and can result in extended execution times. Applications that are based on iterative processing as in tomographic reconstruction methods require high-performance compute clusters for timely analysis of data. Here, time-sensitive analysis and processing of Advanced Photon Source data on geographically distributed resources are focused on. Two main challenges are considered: (i) modeling of the performance of tomographic reconstruction workflows and (ii) transparent execution of these workflows on distributed resources. For the former, three main stages are considered: (i) data transfer between storage and computational resources, (i) wait/queue time of reconstruction jobs at compute resources, and (iii) computation of reconstruction tasks. These performance models allow evaluation and estimation of the execution time of any given iterative tomographic reconstruction workflow that runs on geographically distributed resources. For the latter challenge, a workflow management system is built, which can automate the execution of workflows and minimize the user interaction with the underlying infrastructure. The system utilizes Globus to perform secure and efficient data transfer operations. The proposed models and the workflow management system are evaluated by using three high-performance computing and two storage resources, all of which are geographically distributed. Workflows were created with different computational requirements using two compute-intensive tomographic reconstruction algorithms. Experimental evaluation shows that the proposed models and system can be used for selecting the optimum resources, which in turn can provide up to 3.13× speedup (on experimented resources). Moreover, the error rates of the models range between 2.1 and 23.3% (considering workflow execution times), where the accuracy of the model estimations increases with higher computational demands in reconstruction tasks.

10.

Choosing experiments to accelerate collective discovery.

Rzhetsky, Andrey; Foster, Jacob G; Foster, Ian T; Evans, James A.

Proc Natl Acad Sci U S A ; 112(47): 14569-74, 2015 Nov 24.

Article in English | MEDLINE | ID: mdl-26554009

ABSTRACT

A scientist's choice of research problem affects his or her personal career trajectory. Scientists' combined choices affect the direction and efficiency of scientific discovery as a whole. In this paper, we infer preferences that shape problem selection from patterns of published findings and then quantify their efficiency. We represent research problems as links between scientific entities in a knowledge network. We then build a generative model of discovery informed by qualitative research on scientific problem selection. We map salient features from this literature to key network properties: an entity's importance corresponds to its degree centrality, and a problem's difficulty corresponds to the network distance it spans. Drawing on millions of papers and patents published over 30 years, we use this model to infer the typical research strategy used to explore chemical relationships in biomedicine. This strategy generates conservative research choices focused on building up knowledge around important molecules. These choices become more conservative over time. The observed strategy is efficient for initial exploration of the network and supports scientific careers that require steady output, but is inefficient for science as a whole. Through supercomputer experiments on a sample of the network, we study thousands of alternatives and identify strategies much more efficient at exploring mature knowledge networks. We find that increased risk-taking and the publication of experimental failures would substantially improve the speed of discovery. We consider institutional shifts in grant making, evaluation, and publication that would help realize these efficiencies.

Subject(s)

Research , Science , Humans , Publications , Qualitative Research , Risk-Taking

11.

Experiences Building Globus Genomics: A Next-Generation Sequencing Analysis Service using Galaxy, Globus, and Amazon Web Services.

Madduri, Ravi K; Sulakhe, Dinanath; Lacinski, Lukasz; Liu, Bo; Rodriguez, Alex; Chard, Kyle; Dave, Utpal J; Foster, Ian T.

Concurr Comput ; 26(13): 2266-2279, 2014 Sep 10.

Article in English | MEDLINE | ID: mdl-25342933

ABSTRACT

We describe Globus Genomics, a system that we have developed for rapid analysis of large quantities of next-generation sequencing (NGS) genomic data. This system achieves a high degree of end-to-end automation that encompasses every stage of data analysis including initial data retrieval from remote sequencing centers or storage (via the Globus file transfer system); specification, configuration, and reuse of multi-step processing pipelines (via the Galaxy workflow system); creation of custom Amazon Machine Images and on-demand resource acquisition via a specialized elastic provisioner (on Amazon EC2); and efficient scheduling of these pipelines over many processors (via the HTCondor scheduler). The system allows biomedical researchers to perform rapid analysis of large NGS datasets in a fully automated manner, without software installation or a need for any local computing infrastructure. We report performance and cost results for some representative workloads.

12.

Lynx web services for annotations and systems analysis of multi-gene disorders.

Sulakhe, Dinanath; Taylor, Andrew; Balasubramanian, Sandhya; Feng, Bo; Xie, Bingqing; Börnigen, Daniela; Dave, Utpal J; Foster, Ian T; Gilliam, T Conrad; Maltsev, Natalia.

Nucleic Acids Res ; 42(Web Server issue): W473-7, 2014 Jul.

Article in English | MEDLINE | ID: mdl-24948611

ABSTRACT

Lynx is a web-based integrated systems biology platform that supports annotation and analysis of experimental data and generation of weighted hypotheses on molecular mechanisms contributing to human phenotypes and disorders of interest. Lynx has integrated multiple classes of biomedical data (genomic, proteomic, pathways, phenotypic, toxicogenomic, contextual and others) from various public databases as well as manually curated data from our group and collaborators (LynxKB). Lynx provides tools for gene list enrichment analysis using multiple functional annotations and network-based gene prioritization. Lynx provides access to the integrated database and the analytical tools via REST based Web Services (http://lynx.ci.uchicago.edu/webservices.html). This comprises data retrieval services for specific functional annotations, services to search across the complete LynxKB (powered by Lucene), and services to access the analytical tools built within the Lynx platform.

Subject(s)

Genetic Diseases, Inborn/genetics , Software , Databases, Factual , Genes , Humans , Internet , Knowledge Bases , Systems Biology

13.

Supercomputing for the parallelization of whole genome analysis.

Puckelwartz, Megan J; Pesce, Lorenzo L; Nelakuditi, Viswateja; Dellefave-Castillo, Lisa; Golbus, Jessica R; Day, Sharlene M; Cappola, Thomas P; Dorn, Gerald W; Foster, Ian T; McNally, Elizabeth M.

Bioinformatics ; 30(11): 1508-13, 2014 Jun 01.

Article in English | MEDLINE | ID: mdl-24526712

ABSTRACT

MOTIVATION: The declining cost of generating DNA sequence is promoting an increase in whole genome sequencing, especially as applied to the human genome. Whole genome analysis requires the alignment and comparison of raw sequence data, and results in a computational bottleneck because of limited ability to analyze multiple genomes simultaneously. RESULTS: We now adapted a Cray XE6 supercomputer to achieve the parallelization required for concurrent multiple genome analysis. This approach not only markedly speeds computational time but also results in increased usable sequence per genome. Relying on publically available software, the Cray XE6 has the capacity to align and call variants on 240 whole genomes in â¼50 h. Multisample variant calling is also accelerated. AVAILABILITY AND IMPLEMENTATION: The MegaSeq workflow is designed to harness the size and memory of the Cray XE6, housed at Argonne National Laboratory, for whole genome analysis in a platform designed to better match current and emerging sequencing volume.

Subject(s)

Computers , Genome, Human , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Humans , Software

14.

Cloud-based bioinformatics workflow platform for large-scale next-generation sequencing analyses.

Liu, Bo; Madduri, Ravi K; Sotomayor, Borja; Chard, Kyle; Lacinski, Lukasz; Dave, Utpal J; Li, Jianqiang; Liu, Chunchen; Foster, Ian T.

J Biomed Inform ; 49: 119-33, 2014 Jun.

Article in English | MEDLINE | ID: mdl-24462600

ABSTRACT

Due to the upcoming data deluge of genome data, the need for storing and processing large-scale genome data, easy access to biomedical analyses tools, efficient data sharing and retrieval has presented significant challenges. The variability in data volume results in variable computing and storage requirements, therefore biomedical researchers are pursuing more reliable, dynamic and convenient methods for conducting sequencing analyses. This paper proposes a Cloud-based bioinformatics workflow platform for large-scale next-generation sequencing analyses, which enables reliable and highly scalable execution of sequencing analyses workflows in a fully automated manner. Our platform extends the existing Galaxy workflow system by adding data management capabilities for transferring large quantities of data efficiently and reliably (via Globus Transfer), domain-specific analyses tools preconfigured for immediate use by researchers (via user-specific tools integration), automatic deployment on Cloud for on-demand resource allocation and pay-as-you-go pricing (via Globus Provision), a Cloud provisioning tool for auto-scaling (via HTCondor scheduler), and the support for validating the correctness of workflows (via semantic verification tools). Two bioinformatics workflow use cases as well as performance evaluation are presented to validate the feasibility of the proposed approach.

Subject(s)

Computational Biology , Information Storage and Retrieval , Sequence Analysis/instrumentation

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL