
1.

Short-read aligner performance in germline variant identification.

Wilton, Richard; Szalay, Alexander S.

Bioinformatics ; 39(8)2023 08 01.

Article En | MEDLINE | ID: mdl-37527006

MOTIVATION: Read alignment is an essential first step in the characterization of DNA sequence variation. The accuracy of variant-calling results depends not only on the quality of read alignment and variant-calling software but also on the interaction between these complex software tools. RESULTS: In this review, we evaluate short-read aligner performance with the goal of optimizing germline variant-calling accuracy. We examine the performance of three general-purpose short-read aligners (BWA-MEM, Bowtie 2, and Arioc) in conjunction with three germline variant callers: DeepVariant, FreeBayes, and GATK HaplotypeCaller. We discuss the behavior of the read aligners with regard to the data elements on which the variant callers rely, and illustrate how the runtime configurations of these software tools combine to affect variant-calling performance.

High-Throughput Nucleotide Sequencing , Software , High-Throughput Nucleotide Sequencing/methods , Germ Cells , Sequence Analysis, DNA/methods

2.

Comparing and Correcting Spectral Sensitivities between Multispectral Microscopes: A Prerequisite to Clinical Implementation.

Eminizer, Margaret; Nagy, Melinda; Engle, Elizabeth L; Soto-Diaz, Sigfredo; Jorquera, Andrew; Roskes, Jeffrey S; Green, Benjamin F; Wilton, Richard; Taube, Janis M; Szalay, Alexander S.

Cancers (Basel) ; 15(12)2023 Jun 08.

Article En | MEDLINE | ID: mdl-37370719

Multispectral, multiplex immunofluorescence (mIF) microscopy has been used to great effect in research to identify cellular co-expression profiles and spatial relationships within tissue, providing a myriad of diagnostic advantages. As these technologies mature, it is essential that image data from mIF microscopes is reproducible and standardizable across devices. We sought to characterize and correct differences in illumination intensity and spectral sensitivity between three multispectral microscopes. We scanned eight melanoma tissue samples twice on each microscope and calculated their average tissue region flux intensities. We found a baseline average standard deviation of 29.9% across all microscopes, scans, and samples, which was reduced to 13.9% after applying sample-specific corrections accounting for differences in the tissue shown on each slide. We used a basic calibration model to correct sample- and microscope-specific effects on overall brightness and relative brightness as a function of the image layer. We tested the generalizability of the calibration procedure and found that applying corrections to independent validation subsets of the samples reduced the variation to 2.9 ± 0.03%. Variations in the unmixed marker expressions were reduced from 15.8% to 4.4% by correcting the raw images to a single reference microscope. Our findings show that mIF microscopes can be standardized for use in clinical pathology laboratories using a relatively simple correction model.

3.

Whole-Slide Imaging, Mutual Information Registration for Multiplex Immunohistochemistry and Immunofluorescence.

Doyle, Joshua; Green, Benjamin F; Eminizer, Margaret; Jimenez-Sanchez, Daniel; Lu, Steve; Engle, Elizabeth L; Xu, Haiying; Ogurtsova, Aleksandra; Lai, Jonathan; Soto-Diaz, Sigfredo; Roskes, Jeffrey S; Deutsch, Julie S; Taube, Janis M; Sunshine, Joel C; Szalay, Alexander S.

Lab Invest ; 103(8): 100175, 2023 08.

Article En | MEDLINE | ID: mdl-37196983

Multiplex immunohistochemistry/immunofluorescence (mIHC/mIF) is a developing technology that facilitates the evaluation of multiple, simultaneous protein expressions at single-cell resolution while preserving tissue architecture. These approaches have shown great potential for biomarker discovery, yet many challenges remain. Importantly, streamlined cross-registration of multiplex immunofluorescence images with additional imaging modalities and immunohistochemistry (IHC) can help increase the plex and/or improve the quality of the data generated by potentiating downstream processes such as cell segmentation. To address this problem, a fully automated process was designed to perform a hierarchical, parallelizable, and deformable registration of multiplexed digital whole-slide images (WSIs). We generalized the calculation of mutual information as a registration criterion to an arbitrary number of dimensions, making it well suited for multiplexed imaging. We also used the self-information of a given IF channel as a criterion to select the optimal channels to use for registration. Additionally, as precise labeling of cellular membranes in situ is essential for robust cell segmentation, a pan-membrane immunohistochemical staining method was developed for incorporation into mIF panels or for use as an IHC followed by cross-registration. In this study, we demonstrate this process by registering whole-slide 6-plex/7-color mIF images with whole-slide brightfield mIHC images, including a CD3 and a pan-membrane stain. Our algorithm, WSI mutual information registration (WSIMIR), performed highly accurate registration, allowing the retrospective generation of an 8-plex/9-color WSI, and outperformed 2 alternative automated methods for cross-registration by Jaccard index and Dice similarity coefficient (WSIMIR vs automated WARPY, P < .01 and P < .01, respectively; WSIMIR vs HALO + transformix, P = .083 and P = .049, respectively). Furthermore, the addition of a pan-membrane IHC stain cross-registered to an mIF panel facilitated improved automated cell segmentation across mIF WSIs, as measured by significantly increased correct detections, Jaccard index (0.78 vs 0.65), and Dice similarity coefficient (0.88 vs 0.79).

Coloring Agents , Diagnostic Imaging , Immunohistochemistry , Retrospective Studies , Fluorescent Antibody Technique , Cell Membrane
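
The quantities named in this abstract can be illustrated with a minimal sketch. It assumes the standard two-image form of mutual information computed from a joint histogram of already-binned intensities, and masks represented as sets of pixel coordinates; it is not the authors' multi-dimensional generalization, and the function names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(a, b):
    """Mutual information (bits) between two equal-length sequences of
    discrete intensity bins, e.g. corresponding pixels of two images."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    joint = Counter(zip(a, b))   # joint histogram
    pa, pb = Counter(a), Counter(b)  # marginal histograms
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (pa[x] * pb[y]))
    return mi


def jaccard(mask_a, mask_b):
    """Jaccard index |A ∩ B| / |A ∪ B| for two sets of pixel coordinates."""
    union = len(mask_a | mask_b)
    return len(mask_a & mask_b) / union if union else 1.0


def dice(mask_a, mask_b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|)."""
    total = len(mask_a) + len(mask_b)
    return 2 * len(mask_a & mask_b) / total if total else 1.0
```

For any single pair of masks the two overlap scores are related by D = 2J / (1 + J). The values quoted in the abstract are consistent with that relation to rounding: 2(0.78)/1.78 ≈ 0.88 and 2(0.65)/1.65 ≈ 0.79.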

4.

First Organoid Intelligence (OI) workshop to form an OI community.

Morales Pantoja, Itzy E; Smirnova, Lena; Muotri, Alysson R; Wahlin, Karl J; Kahn, Jeffrey; Boyd, J Lomax; Gracias, David H; Harris, Timothy D; Cohen-Karni, Tzahi; Caffo, Brian S; Szalay, Alexander S; Han, Fang; Zack, Donald J; Etienne-Cummings, Ralph; Akwaboah, Akwasi; Romero, July Carolina; Alam El Din, Dowlette-Mary; Plotkin, Jesse D; Paulhamus, Barton L; Johnson, Erik C; Gilbert, Frederic; Curley, J Lowry; Cappiello, Ben; Schwamborn, Jens C; Hill, Eric J; Roach, Paul; Tornero, Daniel; Krall, Caroline; Parri, Rheinallt; Sillé, Fenna; Levchenko, Andre; Jabbour, Rabih E; Kagan, Brett J; Berlinicke, Cynthia A; Huang, Qi; Maertens, Alexandra; Herrmann, Kathrin; Tsaioun, Katya; Dastgheyb, Raha; Habela, Christa Whelan; Vogelstein, Joshua T; Hartung, Thomas.

Front Artif Intell ; 6: 1116870, 2023.

Article En | MEDLINE | ID: mdl-36925616

The brain is arguably the most powerful computation system known. It is extremely efficient in processing large amounts of information and can discern signals from noise, adapt, and filter faulty information, all while running on only 20 watts of power. The human brain's processing efficiency, progressive learning, and plasticity are unmatched by any computer system. Recent advances in stem cell technology have elevated the field of cell culture to higher levels of complexity, such as the development of three-dimensional (3D) brain organoids that recapitulate human brain functionality better than traditional monolayer cell systems. Organoid Intelligence (OI) aims to harness the innate biological capabilities of brain organoids for biocomputing and synthetic intelligence by interfacing them with computer technology. With the latest strides in stem cell technology, bioengineering, and machine learning, we can explore the ability of brain organoids to compute and store given information (input), execute a task (output), and study how this affects the structural and functional connections in the organoids themselves. Furthermore, understanding how learning generates and changes patterns of connectivity in organoids can shed light on the early stages of cognition in the human brain. Investigating and understanding these concepts is an enormous, multidisciplinary endeavor that necessitates the engagement of both the scientific community and the public. Thus, on Feb 22-24 of 2022, the Johns Hopkins University held the first Organoid Intelligence Workshop to form an OI Community and to lay out the groundwork for the establishment of OI as a new scientific discipline. The potential of OI to revolutionize computing, neurological research, and drug development was discussed, along with a vision and roadmap for its development over the coming decade.

5.

Data-Rich Spatial Profiling of Cancer Tissue: Astronomy Informs Pathology.

Szalay, Alexander S; Taube, Janis M.

Clin Cancer Res ; 28(16): 3417-3424, 2022 08 15.

Article En | MEDLINE | ID: mdl-35522154

Astronomy was among the first disciplines to embrace Big Data and use it to characterize spatial relationships between stars and galaxies. Today, medicine, in particular pathology, has similar needs with regard to characterizing the spatial relationships between cells, with an emphasis on understanding the organization of the tumor microenvironment. In this article, we chronicle the emergence of data-intensive science through the development of the Sloan Digital Sky Survey and describe how analysis patterns and approaches similarly apply to multiplex immunofluorescence (mIF) pathology image exploration. The lessons learned from astronomy are detailed, and the new AstroPath platform that capitalizes on these learnings is described. AstroPath is being used to generate and display tumor-immune maps that can be used for mIF immuno-oncology biomarker development. The development of AstroPath as an open resource for visualizing and analyzing large-scale spatially resolved mIF datasets is underway, akin to how publicly available maps of the sky have been used by astronomers and citizen scientists alike. Associated technical, academic, and funding considerations, as well as extended future development for inclusion of spatial transcriptomics and application of artificial intelligence, are also addressed.

Astronomy , Neoplasms , Artificial Intelligence , Astronomy/methods , Fluorescent Antibody Technique , Humans , Neoplasms/diagnosis , Neoplasms/genetics , Tumor Microenvironment

6.

Now Is the Time to Build a National Data Ecosystem for Materials Science and Chemistry Research Data.

Campo, Eva M; Shankar, Sadasivan; Szalay, Alexander S; Hanisch, Robert J.

ACS Omega ; 7(16): 13398-13402, 2022 Apr 26.

Article En | MEDLINE | ID: mdl-35505822

Research organizations are critically in need of directed growth toward future interoperability and federation. The purpose of this Viewpoint is to alert the government, academia, professional societies, foundations, and industries of a further need for consideration of data in chemistry and materials as a long-term and sustained development in the US. This paper is a call for coordinated action from the government, academia, and industry to establish a national strategy and concomitant infrastructure focused on research data.

7.

Molecular phenotypes associated with antipsychotic drugs in the human caudate nucleus.

Perzel Mandell, Kira A; Eagles, Nicholas J; Deep-Soboslay, Amy; Tao, Ran; Han, Shizhong; Wilton, Richard; Szalay, Alexander S; Hyde, Thomas M; Kleinman, Joel E; Jaffe, Andrew E; Weinberger, Daniel R.

Mol Psychiatry ; 27(4): 2061-2067, 2022 04.

Article En | MEDLINE | ID: mdl-35236959

Antipsychotic drugs are the current first line of treatment for schizophrenia and other psychotic conditions. However, their molecular effects on the human brain are poorly studied, due to difficulty of tissue access and confounders associated with disease status. Here we examine differences in gene expression and DNA methylation associated with positive antipsychotic drug toxicology status in the human caudate nucleus. We find no genome-wide significant differences in DNA methylation, but abundant differences in gene expression. These gene expression differences are overall quite similar to gene expression differences between schizophrenia cases and controls. Interestingly, gene expression differences based on antipsychotic toxicology are different between brain regions, potentially due to affected cell type differences. We finally assess similarities with effects in a mouse model, which finds some overlapping effects but many differences as well. As a first look at the molecular effects of antipsychotics in the human brain, the lack of epigenetic effects is unexpected, possibly because long-term treatment effects may be relatively stable for extended periods.

Antipsychotic Agents , Psychotic Disorders , Schizophrenia , Animals , Antipsychotic Agents/pharmacology , Antipsychotic Agents/therapeutic use , Caudate Nucleus , Humans , Mice , Phenotype , Psychotic Disorders/drug therapy , Schizophrenia/drug therapy , Schizophrenia/genetics

8.

Performance optimization in DNA short-read alignment.

Wilton, Richard; Szalay, Alexander S.

Bioinformatics ; 38(8): 2081-2087, 2022 04 12.

Article En | MEDLINE | ID: mdl-35139149

SUMMARY: Over the past decade, short-read sequence alignment has become a mature technology. Optimized algorithms, careful software engineering and high-speed hardware have contributed to greatly increased throughput and accuracy. With these improvements, many opportunities for performance optimization have emerged. In this review, we examine three general-purpose short-read alignment tools (BWA-MEM, Bowtie 2 and Arioc) with a focus on performance optimization. We analyze the performance-related behavior of the algorithms and heuristics each tool implements, with the goal of arriving at practical methods of improving processing speed and accuracy. We indicate where an aligner's default behavior may result in suboptimal performance, explore the effects of computational constraints such as end-to-end mapping and alignment scoring threshold, and discuss sources of imprecision in the computation of alignment scores and mapping quality. With this perspective, we describe an approach to tuning short-read aligner performance to meet specific data-analysis and throughput requirements while avoiding potential inaccuracies in subsequent analysis of alignment results. Finally, we illustrate how this approach avoids easily overlooked pitfalls and leads to verifiable improvements in alignment speed and accuracy. CONTACT: richard.wilton@jhu.edu. SUPPLEMENTARY INFORMATION: Appendices referenced in this article are available at Bioinformatics online.

High-Throughput Nucleotide Sequencing , Software , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods , Algorithms , Sequence Alignment

9.

Genome-wide sequencing-based identification of methylation quantitative trait loci and their role in schizophrenia risk.

Perzel Mandell, Kira A; Eagles, Nicholas J; Wilton, Richard; Price, Amanda J; Semick, Stephen A; Collado-Torres, Leonardo; Ulrich, William S; Tao, Ran; Han, Shizhong; Szalay, Alexander S; Hyde, Thomas M; Kleinman, Joel E; Weinberger, Daniel R; Jaffe, Andrew E.

Nat Commun ; 12(1): 5251, 2021 09 02.

Article En | MEDLINE | ID: mdl-34475392

DNA methylation (DNAm) is an epigenetic regulator of gene expression and a hallmark of gene-environment interaction. Using whole-genome bisulfite sequencing, we have surveyed DNAm in 344 samples of human postmortem brain tissue from neurotypical subjects and individuals with schizophrenia. We identify genetic influence on local methylation levels throughout the genome, both at CpG sites and CpH sites, with 86% of SNPs and 55% of CpGs being part of methylation quantitative trait loci (meQTLs). These associations can further be clustered into regions that are differentially methylated by a given SNP, highlighting the genes and regions with which these loci are epigenetically associated. These findings can be used to better characterize schizophrenia GWAS-identified variants as epigenetic risk variants. Regions differentially methylated by schizophrenia risk-SNPs explain much of the heritability associated with risk loci, despite covering only a fraction of the genomic space. We provide a comprehensive, single base resolution view of association between genetic variation and genomic methylation, and implicate schizophrenia GWAS-associated variants as influencing the epigenetic plasticity of the brain.

DNA Methylation , Genome, Human , Quantitative Trait Loci/genetics , Schizophrenia/genetics , Age Factors , Brain/metabolism , Brain/pathology , CpG Islands/genetics , Epigenesis, Genetic , Genetic Predisposition to Disease/genetics , Genetic Variation , Genome-Wide Association Study , Genotype , Humans , Polymorphism, Single Nucleotide

10.

Analysis of multispectral imaging with the AstroPath platform informs efficacy of PD-1 blockade.

Berry, Sneha; Giraldo, Nicolas A; Green, Benjamin F; Cottrell, Tricia R; Stein, Julie E; Engle, Elizabeth L; Xu, Haiying; Ogurtsova, Aleksandra; Roberts, Charles; Wang, Daphne; Nguyen, Peter; Zhu, Qingfeng; Soto-Diaz, Sigfredo; Loyola, Jose; Sander, Inbal B; Wong, Pok Fai; Jessel, Shlomit; Doyle, Joshua; Signer, Danielle; Wilton, Richard; Roskes, Jeffrey S; Eminizer, Margaret; Park, Seyoun; Sunshine, Joel C; Jaffee, Elizabeth M; Baras, Alexander; De Marzo, Angelo M; Topalian, Suzanne L; Kluger, Harriet; Cope, Leslie; Lipson, Evan J; Danilova, Ludmila; Anders, Robert A; Rimm, David L; Pardoll, Drew M; Szalay, Alexander S; Taube, Janis M.

Science ; 372(6547)2021 06 11.

Article En | MEDLINE | ID: mdl-34112666

Next-generation tissue-based biomarkers for immunotherapy will likely include the simultaneous analysis of multiple cell types and their spatial interactions, as well as distinct expression patterns of immunoregulatory molecules. Here, we introduce a comprehensive platform for multispectral imaging and mapping of multiple parameters in tumor tissue sections with high-fidelity single-cell resolution. Image analysis and data handling components were drawn from the field of astronomy. Using this "AstroPath" whole-slide platform and only six markers, we identified key features in pretreatment melanoma specimens that predicted response to anti-programmed cell death-1 (PD-1)-based therapy, including CD163+PD-L1- myeloid cells and CD8+FoxP3+PD-1low/mid T cells. These features were combined to stratify long-term survival after anti-PD-1 blockade. This signature was validated in an independent cohort of patients with melanoma from a different institution.

Antineoplastic Agents, Immunological/therapeutic use , Biomarkers, Tumor/analysis , Fluorescent Antibody Technique , Melanoma/drug therapy , Programmed Cell Death 1 Receptor/antagonists & inhibitors , Adult , Aged , Aged, 80 and over , Antigens, CD/analysis , Antigens, Differentiation, Myelomonocytic/analysis , B7-H1 Antigen/analysis , CD8 Antigens/analysis , Female , Forkhead Transcription Factors/analysis , Humans , Immune Checkpoint Proteins/analysis , Macrophages/chemistry , Male , Melanoma/chemistry , Melanoma/immunology , Melanoma/pathology , Middle Aged , Prognosis , Programmed Cell Death 1 Receptor/analysis , Progression-Free Survival , Receptors, Cell Surface/analysis , SOXE Transcription Factors/analysis , Single-Cell Analysis , T-Lymphocyte Subsets/chemistry , T-Lymphocyte Subsets/immunology , Treatment Outcome , Tumor Microenvironment

11.

Characterizing the dynamic and functional DNA methylation landscape in the developing human cortex.

Perzel Mandell, Kira A; Price, Amanda J; Wilton, Richard; Collado-Torres, Leonardo; Tao, Ran; Eagles, Nicholas J; Szalay, Alexander S; Hyde, Thomas M; Weinberger, Daniel R; Kleinman, Joel E; Jaffe, Andrew E.

Epigenetics ; 16(1): 1-13, 2021 01.

Article En | MEDLINE | ID: mdl-32602773

DNA methylation (DNAm) is a key epigenetic regulator of gene expression across development. The developing prenatal brain is a highly dynamic tissue, but our understanding of key drivers of epigenetic variability across development is limited. We, therefore, assessed genomic methylation at over 39 million sites in the prenatal cortex using whole-genome bisulfite sequencing and found loci and regions in which methylation levels are dynamic across development. We saw that DNAm at these loci was associated with nearby gene expression and enriched for enhancer chromatin states in prenatal brain tissue. Additionally, these loci were enriched for genes associated with neuropsychiatric disorders and genes involved with neurogenesis. We also found autosomal differences in DNAm between the sexes during prenatal development, though these have less clear functional consequences. We lastly confirmed that the dynamic methylation at this critical period is specifically CpG methylation, with generally low levels of CpH methylation. Our findings provide detailed insight into prenatal brain development as well as clues to the pathogenesis of psychiatric traits seen later in life.

Cerebral Cortex/metabolism , DNA Methylation , Cerebral Cortex/embryology , CpG Islands , Epigenesis, Genetic , Epigenome , Female , Fetus/metabolism , Genetic Loci , Humans , Male

12.

Digital Pathology Analysis Quantifies Spatial Heterogeneity of CD3, CD4, CD8, CD20, and FoxP3 Immune Markers in Triple-Negative Breast Cancer.

Mi, Haoyang; Gong, Chang; Sulam, Jeremias; Fertig, Elana J; Szalay, Alexander S; Jaffee, Elizabeth M; Stearns, Vered; Emens, Leisha A; Cimino-Mathews, Ashley M; Popel, Aleksander S.

Front Physiol ; 11: 583333, 2020.

Article En | MEDLINE | ID: mdl-33192595

Overwhelming evidence has shown the significant role of the tumor microenvironment (TME) in governing triple-negative breast cancer (TNBC) progression. Digital pathology can provide key information about the spatial heterogeneity within the TME using image analysis and spatial statistics. These analyses have been applied to CD8+ T cells, but quantitative analyses of other important markers and their correlations are limited. In this study, a digital pathology computational workflow is formulated for characterizing the spatial distributions of five immune markers (CD3, CD4, CD8, CD20, and FoxP3) and then the functionality is tested on whole slide images from patients with TNBC. The workflow is initiated by digital image processing to extract and colocalize immune marker-labeled cells and then convert this information to point patterns. Afterward, the invasive front (IF), central tumor (CT), and normal tissue (N) are characterized. For each region, we examine the intra-tumoral heterogeneity. The workflow is then repeated for all specimens to capture inter-tumoral heterogeneity. In this study, both intra- and inter-tumoral heterogeneities are observed for all five markers across all specimens. Among all regions, IF tends to have higher densities of immune cells, overall larger variations in spatial model fitting parameters, and higher density in cell clusters and hotspots compared to CT and N. Results suggest a distinct role of IF in the tumor immuno-architecture. Though the sample size is limited in the study, the computational workflow could be readily reproduced and scaled due to its automatic nature. Importantly, the value of the workflow also lies in its potential to be linked to treatment outcomes and identification of predictive biomarkers for responders/non-responders, and its application to parameterization and validation of computational immuno-oncology models.

13.

Arioc: High-concurrency short-read alignment on multiple GPUs.

Wilton, Richard; Szalay, Alexander S.

PLoS Comput Biol ; 16(11): e1008383, 2020 11.

Article En | MEDLINE | ID: mdl-33166275

In large DNA sequence repositories, archival data storage is often coupled with computers that provide 40 or more CPU threads and multiple GPU (general-purpose graphics processing unit) devices. This presents an opportunity for DNA sequence alignment software to exploit high-concurrency hardware to generate short-read alignments at high speed. Arioc, a GPU-accelerated short-read aligner, can compute WGS (whole-genome sequencing) alignments ten times faster than comparable CPU-only alignment software. When two or more GPUs are available, Arioc's speed increases proportionately because the software executes concurrently on each available GPU device. We have adapted Arioc to recent multi-GPU hardware architectures that support high-bandwidth peer-to-peer memory accesses among multiple GPUs. By modifying Arioc's implementation to exploit this GPU memory architecture we obtained a further 1.8x-2.9x increase in overall alignment speeds. With this additional acceleration, Arioc computes two million short-read alignments per second in a four-GPU system; it can align the reads from a human WGS sequencer run (over 500 million 150 nt paired-end reads) in less than 15 minutes. As WGS data accumulates exponentially and high-concurrency computational resources become widespread, Arioc addresses a growing need for timely computation in the short-read data analysis toolchain.

Sequence Alignment/methods , Software , Algorithms , Base Sequence , Computational Biology , Computer Graphics , Computers , Databases, Nucleic Acid , Humans , Information Storage and Retrieval , Sequence Alignment/statistics & numerical data , Sequence Analysis, DNA , Whole Genome Sequencing

14.

The Terabase Search Engine: a large-scale relational database of short-read sequences.

Wilton, Richard; Wheelan, Sarah J; Szalay, Alexander S; Salzberg, Steven L.

Bioinformatics ; 35(4): 665-670, 2019 02 15.

Article En | MEDLINE | ID: mdl-30052772

MOTIVATION: DNA sequencing archives have grown to enormous scales in recent years, and thousands of human genomes have already been sequenced. The size of these data sets has made searching the raw read data infeasible without high-performance data-query technology. Additionally, it is challenging to search a repository of short-read data using relational logic and to apply that logic across multiple whole-genome sequencing samples. RESULTS: We have built a compact, efficiently indexed database that contains the raw read data for over 250 human genomes, encompassing trillions of bases of DNA, and that allows users to search these data in real time. The Terabase Search Engine enables retrieval from this database of all the reads for any genomic location in a matter of seconds. Users can search using a range of positions or a specific sequence that is aligned to the genome on the fly. AVAILABILITY AND IMPLEMENTATION: Public access to the Terabase Search Engine database is available at http://tse.idies.jhu.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Databases, Genetic , Search Engine , Software , Genome, Human , Genomics , Humans , Sequence Analysis, DNA

15.

Arioc: GPU-accelerated alignment of short bisulfite-treated reads.

Wilton, Richard; Li, Xin; Feinberg, Andrew P; Szalay, Alexander S.

Bioinformatics ; 34(15): 2673-2675, 2018 08 01.

Article En | MEDLINE | ID: mdl-29554207

Motivation: The alignment of bisulfite-treated DNA sequences (BS-seq reads) to a large genome involves a significant computational burden beyond that required to align non-bisulfite-treated reads. In the analysis of BS-seq data, this can present an important performance bottleneck that can be mitigated by appropriate algorithmic and software-engineering improvements. One strategy is to modify the read-alignment algorithms by integrating the logic related to BS-seq alignment, with the goal of making the software implementation amenable to optimizations that lead to higher speed and greater sensitivity than might otherwise be attainable. Results: We evaluated this strategy using Arioc, a short-read aligner that uses GPU (general-purpose graphics processing unit) hardware to accelerate computationally-expensive programming logic. We integrated the BS-seq computational logic into both GPU and CPU code throughout the Arioc implementation. We then carried out a read-by-read comparison of Arioc's reported alignments with the alignments reported by well-known CPU-based BS-seq read aligners. With simulated reads, Arioc's accuracy is equal to or better than the other read aligners we evaluated. With human sequencing reads, Arioc's throughput is at least 10 times faster than existing BS-seq aligners across a wide range of sensitivity settings. Availability and implementation: The Arioc software is available for download at https://github.com/RWilton/Arioc. It is released under a BSD open-source license. Supplementary information: Supplementary data are available at Bioinformatics online.

Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Software , Algorithms , Humans , Sulfites
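
To make the extra computational burden of BS-seq alignment concrete, here is a minimal sketch of the asymmetric matching rule that bisulfite treatment imposes. It assumes forward-strand reads only and uses hypothetical function names; it illustrates the general problem and a widely used reduced-alphabet strategy, not Arioc's integrated GPU implementation.

```python
def c_to_t(seq: str) -> str:
    """In-silico bisulfite conversion of a forward-strand sequence:
    every C is written as T (reduced three-letter alphabet)."""
    return seq.upper().replace("C", "T")


def bisulfite_compatible(read: str, ref: str) -> bool:
    """True if `read` could derive from `ref` after bisulfite treatment.

    Unmethylated C is read as T, so a T in the read may sit opposite a C
    in the reference. The reverse (C in the read opposite T in the
    reference) is a genuine mismatch.
    """
    if len(read) != len(ref):
        return False
    for r, g in zip(read.upper(), ref.upper()):
        if r == g:
            continue
        if g == "C" and r == "T":
            continue  # consistent with an unmethylated, converted cytosine
        return False
    return True
```

In a reduced-alphabet approach, both reads and reference are passed through `c_to_t` so that an ordinary aligner can find candidate locations, and a check like `bisulfite_compatible` then rejects candidates that only match because of the collapsed alphabet.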

16.

Arioc: high-throughput read alignment with GPU-accelerated exploration of the seed-and-extend search space.

Wilton, Richard; Budavari, Tamas; Langmead, Ben; Wheelan, Sarah J; Salzberg, Steven L; Szalay, Alexander S.

PeerJ ; 3: e808, 2015.

Article En | MEDLINE | ID: mdl-25780763

When computing alignments of DNA sequences to a large genome, a key element in achieving high processing throughput is to prioritize locations in the genome where high-scoring mappings might be expected. We formulated this task as a series of list-processing operations that can be efficiently performed on graphics processing unit (GPU) hardware. We followed this approach in implementing a read aligner called Arioc that uses GPU-based parallel sort and reduction techniques to identify high-priority locations where potential alignments may be found. We then carried out a read-by-read comparison of Arioc's reported alignments with the alignments found by several leading read aligners. With simulated reads, Arioc has comparable or better accuracy than the other read aligners we tested. With human sequencing reads, Arioc demonstrates significantly greater throughput than the other aligners we evaluated across a wide range of sensitivity settings. The Arioc software is available at https://github.com/RWilton/Arioc. It is released under a BSD open-source license.
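
The idea of prioritizing genome locations through list operations can be sketched on the CPU as follows. This is an illustrative simplification under stated assumptions (exact k-mer seeds, a dictionary index, and hypothetical function names), not Arioc's GPU code: seed hits are turned into a list of candidate diagonals, the list is sorted, and runs of equal values are reduced to counts.

```python
from collections import defaultdict
from itertools import groupby


def build_index(ref: str, k: int) -> dict:
    """Map every k-mer in the reference to the positions where it occurs."""
    index = defaultdict(list)
    for p in range(len(ref) - k + 1):
        index[ref[p:p + k]].append(p)
    return index


def prioritize(read: str, index: dict, k: int) -> list:
    """Return (diagonal, seed_count) pairs, most-supported first.

    A seed at read offset i that hits reference position p votes for the
    diagonal p - i, i.e. the reference position where the read would start.
    """
    diagonals = []
    for i in range(len(read) - k + 1):
        for p in index.get(read[i:i + k], ()):
            diagonals.append(p - i)
    diagonals.sort()  # sort step: equal diagonals become adjacent
    counts = [(d, sum(1 for _ in run)) for d, run in groupby(diagonals)]  # reduce step
    counts.sort(key=lambda t: (-t[1], t[0]))
    return counts
```

The locations at the head of the returned list are the ones where a full (and more expensive) extension alignment would be attempted first.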

17.

From cosmos to connectomes: the evolution of data-intensive science.

Burns, Randal; Vogelstein, Joshua T; Szalay, Alexander S.

Neuron ; 83(6): 1249-52, 2014 Sep 17.

Article En | MEDLINE | ID: mdl-25233306

The analysis of data requires computation: originally by hand and more recently by computers. Different models of computing are designed and optimized for different kinds of data. In data-intensive science, the scale and complexity of data exceeds the comfort zone of local data stores on scientific workstations. Thus, cloud computing emerges as the preeminent model, utilizing data centers and high-performance clusters, enabling remote users to access and query subsets of the data efficiently. We examine how data-intensive computational systems originally built for cosmology, the Sloan Digital Sky Survey (SDSS), are now being used in connectomics, at the Open Connectome Project. We list lessons learned and outline the top challenges we expect to face. Success in computational connectomics would drastically reduce the time between idea and discovery, as SDSS did in cosmology.

Computers , Connectome/methods , Information Systems , Software , Statistics as Topic/methods , Animals , Computational Biology/methods , Humans

18.

The Open Connectome Project Data Cluster: Scalable Analysis and Vision for High-Throughput Neuroscience.

Burns, Randal; Roncal, William Gray; Kleissas, Dean; Lillaney, Kunal; Manavalan, Priya; Perlman, Eric; Berger, Daniel R; Bock, Davi D; Chung, Kwanghun; Grosenick, Logan; Kasthuri, Narayanan; Weiler, Nicholas C; Deisseroth, Karl; Kazhdan, Michael; Lichtman, Jeff; Reid, R Clay; Smith, Stephen J; Szalay, Alexander S; Vogelstein, Joshua T; Vogelstein, R Jacob.

Sci Stat Database Manag ; 2013.

Article En | MEDLINE | ID: mdl-24401992

We describe a scalable database cluster for the spatial analysis and annotation of high-throughput brain imaging data, initially for 3-d electron microscopy image stacks, but for time-series and multi-channel data as well. The system was designed primarily for workloads that build connectomes (neural connectivity maps of the brain) using the parallel execution of computer vision algorithms on high-performance compute clusters. These services and open-science data sets are publicly available at openconnecto.me. The system design inherits much from NoSQL scale-out and data-intensive computing architectures. We distribute data to cluster nodes by partitioning a spatial index. We direct I/O to different systems (reads to parallel disk arrays and writes to solid-state storage) to avoid I/O interference and maximize throughput. All programming interfaces are RESTful Web services, which are simple and stateless, improving scalability and usability. We include a performance evaluation of the production system, highlighting the effectiveness of spatial data organization.

19.

Toward Millions of File System IOPS on Low-Cost, Commodity Hardware.

Zheng, Da; Burns, Randal; Szalay, Alexander S.

ICS ; 2013.

Article En | MEDLINE | ID: mdl-24402052

We describe a storage system that removes I/O bottlenecks to achieve more than one million IOPS based on a user-space file abstraction for arrays of commodity SSDs. The file abstraction refactors I/O scheduling and placement for extreme parallelism and non-uniform memory and I/O. The system includes a set-associative, parallel page cache in the user space. We redesign page caching to eliminate CPU overhead and lock contention in non-uniform memory architecture machines. We evaluate our design on a 32-core NUMA machine with four eight-core processors. Experiments show that our design delivers 1.23 million 512-byte read IOPS. The page cache realizes the scalable IOPS of Linux asynchronous I/O (AIO) and increases user-perceived I/O performance linearly with cache hit rates. The parallel, set-associative cache matches the cache hit rates of the global Linux page cache under real workloads.
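
A set-associative page cache can be sketched in a few lines. The version below is single-threaded and makes several assumptions that the abstract does not state: pages are assigned to sets by page number modulo the number of sets, eviction is least-recently-used within a set, and the class and method names are hypothetical. One plausible benefit of this layout is that each small set can be protected independently rather than behind one global lock, although the abstract does not spell out the locking scheme.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Page cache split into many small, independent sets."""

    def __init__(self, num_sets: int, ways: int):
        self.num_sets = num_sets
        self.ways = ways  # maximum pages held per set
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def get(self, page_no: int, load):
        """Return the page, calling load(page_no) on a miss."""
        s = self.sets[page_no % self.num_sets]  # assumed set-selection rule
        if page_no in s:
            s.move_to_end(page_no)  # mark as most recently used
            self.hits += 1
            return s[page_no]
        self.misses += 1
        if len(s) >= self.ways:
            s.popitem(last=False)  # evict the least recently used page in this set
        s[page_no] = load(page_no)
        return s[page_no]
```

Because eviction decisions are made per set rather than globally, hit rates can in principle differ from a fully associative cache; the abstract reports that under real workloads they matched those of the global Linux page cache.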

20.

Long-range autocorrelations of CpG islands in the human genome.

Koester, Benjamin; Rea, Thomas J; Templeton, Alan R; Szalay, Alexander S; Sing, Charles F.

PLoS One ; 7(1): e29889, 2012.

Article En | MEDLINE | ID: mdl-22253817

In this paper, we use a statistical estimator developed in astrophysics to study the distribution and organization of features of the human genome. Using the human reference sequence we quantify the global distribution of CpG islands (CGI) in each chromosome and demonstrate that the organization of the CGI across a chromosome is non-random, exhibits surprisingly long range correlations (10 Mb) and varies significantly among chromosomes. These correlations of CGI summarize functional properties of the genome that are not captured when considering variation in any particular separate (and local) feature. The demonstration of the proposed methods to quantify the organization of CGI in the human genome forms the basis of future studies. The most illuminating of these will assess the potential impact on phenotypic variation of inter-individual variation in the organization of the functional features of the genome within and among chromosomes, and among individuals for particular chromosomes.

CpG Islands/genetics , Genome, Human/genetics , Base Sequence , Chromosomes, Human/genetics , Databases, Genetic , Humans
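
The abstract does not name the estimator it borrows from astrophysics. As an assumption, the sketch below uses the Landy-Szalay two-point correlation estimator, (DD - 2DR + RR) / RR, applied to one-dimensional positions such as feature coordinates along a chromosome. The function names are hypothetical, and the brute-force pair counting is only suitable for small inputs.

```python
def pair_count(xs, ys, lo, hi, same):
    """Count pairs whose separation falls in [lo, hi).

    When `same` is True, xs and ys are the same list and each unordered
    pair is counted once; otherwise all cross pairs are counted.
    """
    count = 0
    for i, x in enumerate(xs):
        start = i + 1 if same else 0
        for y in ys[start:]:
            if lo <= abs(x - y) < hi:
                count += 1
    return count


def landy_szalay(data, randoms, lo, hi):
    """Landy-Szalay estimate of the correlation in one separation bin.

    `randoms` is a catalog of positions drawn without clustering over the
    same region as `data`. Positive values indicate an excess of pairs at
    this separation relative to the random expectation.
    """
    nd, nr = len(data), len(randoms)
    dd = pair_count(data, data, lo, hi, True) / (nd * (nd - 1) / 2)
    rr = pair_count(randoms, randoms, lo, hi, True) / (nr * (nr - 1) / 2)
    dr = pair_count(data, randoms, lo, hi, False) / (nd * nr)
    if rr == 0:
        raise ValueError("no random pairs in this bin; use more randoms or a wider bin")
    return (dd - 2 * dr + rr) / rr
```

Evaluating the estimator over a series of separation bins gives the correlation as a function of distance, which is the kind of curve on which a long-range signal such as the reported 10 Mb scale would appear.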