ABSTRACT
Microscopy is a central method in life sciences. Many popular methods, such as antibody labeling, are used to add physical fluorescent labels to specific cellular constituents. However, these approaches have significant drawbacks, including inconsistency; limitations in the number of simultaneous labels because of spectral overlap; and necessary perturbations of the experiment, such as fixing the cells, to generate the measurement. Here, we show that a computational machine-learning approach, which we call "in silico labeling" (ISL), reliably predicts some fluorescent labels from transmitted-light images of unlabeled fixed or live biological samples. ISL predicts a range of labels, such as those for nuclei, cell type (e.g., neural), and cell state (e.g., cell death). Because prediction happens in silico, the method is consistent, is not limited by spectral overlap, and does not disturb the experiment. ISL generates biological measurements that would otherwise be problematic or impossible to acquire.
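As a rough illustration of the idea (and not the authors' implementation, which is a much larger multi-scale network), a minimal image-to-image regression model could map a stack of transmitted-light z-planes to per-pixel fluorescence predictions; the number of input planes and output labels below are assumptions:

# Minimal sketch, not the paper's architecture: a fully convolutional network
# that regresses per-pixel fluorescence intensity from transmitted-light
# z-planes. N_Z and N_LABELS are illustrative assumptions.
import tensorflow as tf

N_Z = 13        # assumed number of transmitted-light z-planes per field
N_LABELS = 3    # assumed number of fluorescent labels predicted jointly

inputs = tf.keras.Input(shape=(None, None, N_Z))
x = inputs
for filters in (32, 64, 64):
    x = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
outputs = tf.keras.layers.Conv2D(N_LABELS, 1, padding="same")(x)  # one channel per label

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mae")  # pixel-wise regression loss
model.summary()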
Subject(s)
Fluorescent Dyes/chemistry; Image Processing, Computer-Assisted/methods; Microscopy, Fluorescence/methods; Motor Neurons/cytology; Algorithms; Animals; Cell Line, Tumor; Cell Survival; Cerebral Cortex/cytology; Humans; Induced Pluripotent Stem Cells/cytology; Machine Learning; Neural Networks, Computer; Neurosciences; Rats; Software; Stem Cells/cytology
ABSTRACT
Large-scale reference data sets of human genetic variation are critical for the medical and functional interpretation of DNA sequence changes. Here we describe the aggregation and analysis of high-quality exome (protein-coding region) DNA sequence data for 60,706 individuals of diverse ancestries generated as part of the Exome Aggregation Consortium (ExAC). This catalogue of human genetic diversity contains an average of one variant every eight bases of the exome, and provides direct evidence for the presence of widespread mutational recurrence. We have used this catalogue to calculate objective metrics of pathogenicity for sequence variants, and to identify genes subject to strong selection against various classes of mutation, identifying 3,230 genes with near-complete depletion of predicted protein-truncating variants, 72% of which have no currently established human disease phenotype. Finally, we demonstrate that these data can be used for the efficient filtering of candidate disease-causing variants, and for the discovery of human 'knockout' variants in protein-coding genes.
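The constraint analysis rests on comparing observed with expected variant counts per gene; as a toy sketch of that comparison (not ExAC's actual model, which uses a context-specific mutation-rate model and the pLI metric), with made-up counts:

# Toy sketch of per-gene constraint: the ratio of observed to expected
# protein-truncating variants (PTVs); values near 0 indicate strong depletion.
# ExAC's published method is more elaborate, and these counts are made up.
def ptv_obs_exp_ratio(observed: int, expected: float) -> float:
    return observed / expected if expected > 0 else float("nan")

hypothetical_genes = {"GENE_A": (1, 25.0), "GENE_B": (18, 20.0)}
for gene, (obs, exp) in hypothetical_genes.items():
    print(gene, round(ptv_obs_exp_ratio(obs, exp), 3))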
Subject(s)
Exome/genetics; Genetic Variation/genetics; DNA Mutational Analysis; Datasets as Topic; Humans; Phenotype; Proteome/genetics; Rare Diseases/genetics; Sample Size
ABSTRACT
SUMMARY: Reference genomes are refined to reflect error corrections and other improvements. While this process improves novel data generation and analysis, incorporating data analyzed on an older reference genome assembly requires transforming the coordinates and representations of the data to the new assembly. Multiple tools exist to perform this transformation for coordinate-only data types, but none supports accurate transformation of genome-wide short variation. Here we present GenomeWarp, a tool for efficiently transforming variants between genome assemblies. GenomeWarp transforms regions and short variants in a conservative manner to minimize false positive and negative variants in the target genome, and converts over 99% of regions and short variants from a representative human genome. AVAILABILITY AND IMPLEMENTATION: GenomeWarp is written in Java. All source code and the user manual are freely available at https://github.com/verilylifesciences/genomewarp. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
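GenomeWarp itself is implemented in Java (see the repository above); as a simplified illustration of block-wise coordinate transformation between assemblies, a hypothetical Python sketch might map positions through co-linear aligned blocks like this (GenomeWarp additionally rewrites variant representations conservatively, which this sketch omits):

# Illustrative only: map a 0-based position from a source assembly to a target
# assembly through aligned blocks. Blocks and coordinates are hypothetical.
from typing import Optional, Tuple

BLOCKS = [  # (src_chrom, src_start, src_end, tgt_chrom, tgt_start)
    ("chr1", 1000, 2000, "chr1", 1100),
    ("chr1", 2500, 3000, "chr1", 2550),
]

def lift_position(chrom: str, pos: int) -> Optional[Tuple[str, int]]:
    for s_chrom, s_start, s_end, t_chrom, t_start in BLOCKS:
        if chrom == s_chrom and s_start <= pos < s_end:
            return t_chrom, t_start + (pos - s_start)
    return None  # outside any aligned block: the position cannot be transformed

print(lift_position("chr1", 1500))  # -> ('chr1', 1600)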
Subject(s)
Genomics; Software; Genome, Human; Humans
ABSTRACT
Autism spectrum disorders (ASD) are believed to have genetic and environmental origins, yet in only a modest fraction of individuals can specific causes be identified. To identify further genetic risk factors, here we assess the role of de novo mutations in ASD by sequencing the exomes of ASD cases and their parents (n = 175 trios). Fewer than half of the cases (46.3%) carry a missense or nonsense de novo variant, and the overall rate of mutation is only modestly higher than the expected rate. In contrast, the proteins encoded by genes that harboured de novo missense or nonsense mutations showed a higher degree of connectivity among themselves and to previous ASD genes as indexed by protein-protein interaction screens. The small increase in the rate of de novo events, when taken together with the protein interaction results, is consistent with an important but limited role for de novo point mutations in ASD, similar to that documented for de novo copy number variants. Genetic models incorporating these data indicate that most of the observed de novo events are unconnected to ASD; those that do confer risk are distributed across many genes and are incompletely penetrant (that is, not necessarily sufficient for disease). Our results support polygenic models in which spontaneous coding mutations in any of a large number of genes increase risk by 5- to 20-fold. Despite the challenge posed by such models, results from de novo events and a large parallel case-control study provide strong evidence in favour of CHD8 and KATNAL2 as genuine autism risk factors.
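The comparison of the observed de novo rate with expectation is essentially a Poisson rate test; a sketch with hypothetical counts (not the study's values):

# Sketch of a one-sided Poisson test for an excess of de novo events over a
# model-based expectation. The counts below are hypothetical.
from scipy.stats import poisson

observed_events = 200      # hypothetical de novo coding events observed in trios
expected_events = 180.0    # hypothetical expectation from a mutation-rate model

p_value = poisson.sf(observed_events - 1, expected_events)  # P(X >= observed)
print(f"one-sided Poisson p-value: {p_value:.3f}")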
Subject(s)
Autistic Disorder/genetics; DNA-Binding Proteins/genetics; Exons/genetics; Genetic Predisposition to Disease/genetics; Mutation/genetics; Transcription Factors/genetics; Case-Control Studies; Exome/genetics; Family Health; Humans; Models, Genetic; Multifactorial Inheritance/genetics; Phenotype; Poisson Distribution; Protein Interaction Maps
ABSTRACT
We report on results from whole-exome sequencing (WES) of 1,039 subjects diagnosed with autism spectrum disorders (ASD) and 870 controls selected from the NIMH repository to be of similar ancestry to cases. The WES data came from two centers using different methods to produce sequence and to call variants from it. Therefore, an initial goal was to ensure the distribution of rare variation was similar for data from different centers. This proved straightforward by filtering called variants by fraction of missing data, read depth, and balance of alternative to reference reads. Results were evaluated using seven samples sequenced at both centers and by results from the association study. Next we addressed how the data and/or results from the centers should be combined. Gene-based analyses of association were an obvious choice, but should statistics for association be combined across centers (meta-analysis) or should data be combined and then analyzed (mega-analysis)? Because of the nature of many gene-based tests, we showed by theory and simulations that mega-analysis has better power than meta-analysis. Finally, before analyzing the data for association, we explored the impact of population structure on rare variant analysis in these data. Like other recent studies, we found evidence that population structure can confound case-control studies by the clustering of rare variants in ancestry space; yet, unlike some recent studies, for these data we found that principal component-based analyses were sufficient to control for ancestry and produce test statistics with appropriate distributions. After using a variety of gene-based tests and both meta- and mega-analysis, we found no new risk genes for ASD in this sample. Our results suggest that standard gene-based tests will require much larger samples of cases and controls before being effective for gene discovery, even for a disorder like ASD.
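The meta- versus mega-analysis contrast can be made concrete with a toy burden test: pool individual-level carrier counts and test once (mega), or test within each center and combine the statistics, for example with Stouffer's method (meta). The sketch below uses simulated data and a simple Fisher's exact burden test, not the study's gene-based tests:

# Toy contrast of mega- vs meta-analysis for a single-gene burden test.
# All data are simulated; the study's tests are more sophisticated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulate_center(n_cases=500, n_controls=500, case_rate=0.04, control_rate=0.02):
    cases = rng.binomial(1, case_rate, n_cases)        # rare-variant carrier flags
    controls = rng.binomial(1, control_rate, n_controls)
    return cases, controls

centers = [simulate_center() for _ in range(2)]

# Mega-analysis: pool individual-level data, run one test.
cases = np.concatenate([c for c, _ in centers])
controls = np.concatenate([c for _, c in centers])
pooled = [[cases.sum(), len(cases) - cases.sum()],
          [controls.sum(), len(controls) - controls.sum()]]
_, p_mega = stats.fisher_exact(pooled, alternative="greater")

# Meta-analysis: test per center, then combine p-values (Stouffer's method).
z = []
for ca, co in centers:
    table = [[ca.sum(), len(ca) - ca.sum()], [co.sum(), len(co) - co.sum()]]
    _, p = stats.fisher_exact(table, alternative="greater")
    z.append(stats.norm.isf(p))
p_meta = stats.norm.sf(sum(z) / np.sqrt(len(z)))

print(f"mega-analysis p = {p_mega:.4f}, meta-analysis p = {p_meta:.4f}")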
Subject(s)
Child Development Disorders, Pervasive/genetics; Exome; Genome-Wide Association Study; Case-Control Studies; Child; Child Development Disorders, Pervasive/physiopathology; Genetic Predisposition to Disease; Genetic Variation; Humans; Population Control; Sequence Analysis, DNA; Software
ABSTRACT
BACKGROUND: Identifying insertion/deletion polymorphisms (INDELs) with high confidence has been intrinsically challenging in short-read sequencing data. Here we report our approach for improving INDEL calling accuracy by using a machine learning algorithm to combine call sets generated with three independent methods, and by leveraging the strengths of each individual pipeline. Utilizing this approach, we generated a consensus exome INDEL call set from a large dataset generated by the 1000 Genomes Project (1000G), maximizing both the sensitivity and the specificity of the calls. RESULTS: This consensus exome INDEL call set features 7,210 INDELs, from 1,128 individuals across 13 populations included in the 1000 Genomes Phase 1 dataset, with a false discovery rate (FDR) of about 7.0%. CONCLUSIONS: In our study we further characterize the patterns and distributions of these exonic INDELs with respect to density, allele length, and site frequency spectrum, as well as the potential mutagenic mechanisms of coding INDELs in humans.
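A simplified sketch of the combination step: derive per-call features from the individual pipelines, train a classifier on calls with known validation status, and keep calls whose scores clear a threshold chosen for the target FDR. The features, model, and data below are illustrative, not the study's exact setup:

# Sketch: combine INDEL calls from three pipelines with a classifier.
# Features and labels are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n_calls = 1000

X = np.column_stack([
    rng.integers(0, 2, n_calls),   # called by pipeline A (0/1)
    rng.integers(0, 2, n_calls),   # called by pipeline B (0/1)
    rng.integers(0, 2, n_calls),   # called by pipeline C (0/1)
    rng.normal(30, 10, n_calls),   # pipeline A quality score
    rng.normal(30, 10, n_calls),   # pipeline B quality score
])
y = rng.integers(0, 2, n_calls)    # 1 = validated true INDEL (synthetic)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
scores = clf.predict_proba(X)[:, 1]
consensus = scores > 0.5           # in practice the cutoff is tuned to the target FDR
print(f"calls retained in the consensus set: {consensus.sum()}")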
Subject(s)
Exome/genetics; INDEL Mutation/genetics; Mutagenesis; Computational Biology; Genome, Human; High-Throughput Nucleotide Sequencing; Human Genome Project; Humans; Machine Learning
ABSTRACT
Synthetic genetic polymers (xeno-nucleic acids, XNAs) have the potential to transition aptamers from laboratory tools to therapeutic agents, but additional functionality is needed to compete with antibodies. Here, we describe the evolution of a biologically stable artificial genetic system composed of α-l-threofuranosyl nucleic acid (TNA) that facilitates the production of backbone- and base-modified aptamers termed "threomers" that function as high quality protein capture reagents. Threomers were discovered against two prototypical protein targets implicated in human diseases through a combination of in vitro selection and next-generation sequencing using uracil nucleotides that are uniformly equipped with aromatic side chains commonly found in the paratope of antibody-antigen crystal structures. Kinetic measurements reveal that the side chain modifications are critical for generating threomers with slow off-rate binding kinetics. These findings expand the chemical space of evolvable non-natural genetic systems to include functional groups that enhance protein target binding by mimicking the structural properties of traditional antibodies.
Subject(s)
Aptamers, Nucleotide/chemistry; Nucleic Acids/chemistry; Polymers/chemistry; Tetroses/chemistry; Antibodies/chemistry; Kinetics; Proteins/chemistry
ABSTRACT
BACKGROUND: The recent development of metagenomic sequencing makes it possible to massively sequence microbial genomes including viral genomes without the need for laboratory culture. Existing reference-based and gene homology-based methods are not efficient in identifying unknown viruses or short viral sequences from metagenomic data. METHODS: Here we developed a reference-free and alignment-free machine learning method, DeepVirFinder, for identifying viral sequences in metagenomic data using deep learning. RESULTS: Trained based on sequences from viral RefSeq discovered before May 2015, and evaluated on those discovered after that date, DeepVirFinder outperformed the state-of-the-art method VirFinder at all contig lengths, achieving AUROC 0.93, 0.95, 0.97, and 0.98 for 300, 500, 1000, and 3000 bp sequences respectively. Enlarging the training data with additional millions of purified viral sequences from metavirome samples further improved the accuracy for identifying virus groups that are under-represented. Applying DeepVirFinder to real human gut metagenomic samples, we identified 51,138 viral sequences belonging to 175 bins in patients with colorectal carcinoma (CRC). Ten bins were found associated with the cancer status, suggesting viruses may play important roles in CRC. CONCLUSIONS: Powered by deep learning and high throughput sequencing metagenomic data, DeepVirFinder significantly improved the accuracy of viral identification and will assist the study of viruses in the era of metagenomics.
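A minimal sketch of the underlying idea, a one-dimensional convolutional network scoring one-hot-encoded contigs as viral or non-viral, is shown below; the layer sizes and fixed contig length are assumptions rather than DeepVirFinder's published architecture:

# Minimal sketch: 1D CNN over one-hot-encoded DNA (A/C/G/T channels).
# Hyperparameters are illustrative, not DeepVirFinder's published model.
import tensorflow as tf

CONTIG_LEN = 1000   # assumed fixed contig length after padding or trimming

model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(64, 8, activation="relu", input_shape=(CONTIG_LEN, 4)),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # P(contig is viral)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
model.summary()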
ABSTRACT
Copy number variants (CNVs) are an important type of genetic variation that play a causal role in many diseases. The ability to identify high quality CNVs is of substantial clinical relevance. However, CNVs are notoriously difficult to identify accurately from array-based methods and next-generation sequencing (NGS) data, particularly for small (<10 kbp) CNVs. Manual curation by experts remains the gold standard but cannot scale with the pace of sequencing, particularly in fast-growing clinical applications. We present the first proof-of-principle study demonstrating high throughput manual curation of putative CNVs by non-experts. We developed a crowdsourcing framework, called CrowdVariant, that leverages Google's high-throughput crowdsourcing platform to create a high confidence set of deletions for NA24385 (NIST HG002/RM 8391), an Ashkenazim reference sample developed in partnership with the Genome In A Bottle (GIAB) Consortium. We show that non-experts tend to agree both with each other and with experts on putative CNVs. We show that crowdsourced non-expert classifications can be used to accurately assign copy number status to putative CNV calls and identify 1,781 high confidence deletions in a reference sample. Multiple lines of evidence suggest these calls are a substantial improvement over existing CNV callsets and can also be useful in benchmarking and improving CNV calling algorithms. Our crowdsourcing methodology takes the first step toward showing the clinical potential for manual curation of CNVs at scale and can further guide other crowdsourcing genomics applications.
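The core aggregation step can be sketched as majority voting with an agreement threshold; the thresholds and labels below are illustrative and may differ from CrowdVariant's published settings:

# Sketch: aggregate non-expert votes on a putative deletion into a copy number
# call, keeping only sites with strong agreement. Thresholds are assumptions.
from collections import Counter

def aggregate_votes(votes, min_votes=3, min_agreement=0.8):
    """votes: list of labels such as 'deletion' or 'no_deletion'."""
    if len(votes) < min_votes:
        return "insufficient_votes"
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else "no_consensus"

print(aggregate_votes(["deletion", "deletion", "deletion", "no_deletion"]))
# -> 'no_consensus' (3/4 = 0.75 agreement, below the 0.8 threshold)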
Subject(s)
Crowdsourcing/methods; DNA Copy Number Variations; Algorithms; Computational Biology/methods; Data Curation; Genome, Human; Genomics/methods; Genomics/statistics & numerical data; High-Throughput Nucleotide Sequencing/statistics & numerical data; Humans; Sequence Analysis, DNA/statistics & numerical data
ABSTRACT
Traditionally, medical discoveries are made by observing associations, making hypotheses from them and then designing and running experiments to test the hypotheses. However, with medical images, observing and quantifying associations can often be difficult because of the wide variety of features, patterns, colours, values and shapes that are present in real data. Here, we show that deep learning can extract new knowledge from retinal fundus images. Using deep-learning models trained on data from 284,335 patients and validated on two independent datasets of 12,026 and 999 patients, we predicted cardiovascular risk factors not previously thought to be present or quantifiable in retinal images, such as age (mean absolute error within 3.26 years), gender (area under the receiver operating characteristic curve (AUC) = 0.97), smoking status (AUC = 0.71), systolic blood pressure (mean absolute error within 11.23 mmHg) and major adverse cardiac events (AUC = 0.70). We also show that the trained deep-learning models used anatomical features, such as the optic disc or blood vessels, to generate each prediction.
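Architecturally, predicting several risk factors from one image amounts to a shared backbone with separate regression and classification heads; the scaled-down sketch below is an assumption-laden illustration (the study used a large Inception-v3 backbone, and the input size, layers, and head set here are placeholders):

# Sketch of a multi-task model: one image backbone, separate heads for
# continuous risk factors (regression) and binary ones (classification).
import tensorflow as tf

inputs = tf.keras.Input(shape=(256, 256, 3))    # fundus image, assumed size
x = tf.keras.layers.Conv2D(32, 3, activation="relu")(inputs)
x = tf.keras.layers.GlobalAveragePooling2D()(x)

age = tf.keras.layers.Dense(1, name="age")(x)                             # years
sbp = tf.keras.layers.Dense(1, name="sbp")(x)                             # mmHg
smoker = tf.keras.layers.Dense(1, activation="sigmoid", name="smoker")(x)

model = tf.keras.Model(inputs, [age, sbp, smoker])
model.compile(optimizer="adam",
              loss={"age": "mae", "sbp": "mae", "smoker": "binary_crossentropy"})
model.summary()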
Subject(s)
Cardiovascular Diseases; Deep Learning; Image Interpretation, Computer-Assisted/methods; Retina/diagnostic imaging; Aged; Aged, 80 and over; Algorithms; Cardiovascular Diseases/diagnostic imaging; Cardiovascular Diseases/epidemiology; Female; Fundus Oculi; Humans; Male; Middle Aged; Risk Factors
ABSTRACT
Purpose: We evaluate how deep learning can be applied to extract novel information such as refractive error from retinal fundus imaging. Methods: Retinal fundus images used in this study were 45- and 30-degree field of view images from the UK Biobank and Age-Related Eye Disease Study (AREDS) clinical trials, respectively. Refractive error was measured by autorefraction in UK Biobank and subjective refraction in AREDS. We trained a deep learning algorithm to predict refractive error from a total of 226,870 images and validated it on 24,007 UK Biobank and 15,750 AREDS images. Our model used the "attention" method to identify features that are correlated with refractive error. Results: The resulting algorithm had a mean absolute error (MAE) of 0.56 diopters (95% confidence interval [CI]: 0.55-0.56) for estimating spherical equivalent on the UK Biobank data set and 0.91 diopters (95% CI: 0.89-0.93) for the AREDS data set. The baseline expected MAE (obtained by simply predicting the mean of this population) was 1.81 diopters (95% CI: 1.79-1.84) for UK Biobank and 1.63 (95% CI: 1.60-1.67) for AREDS. Attention maps suggested that the foveal region was one of the most important areas used by the algorithm to make this prediction, though other regions also contributed to the prediction. Conclusions: To our knowledge, it was not previously known that refractive error could be estimated with high accuracy from retinal fundus photos; this demonstrates that deep learning can be applied to make novel predictions from medical images.
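The baseline quoted in the Results is simply the mean absolute error of a predictor that always outputs the population mean; with made-up refraction values, the calculation is:

# Worked example of the baseline MAE (always predict the population mean)
# versus a model's MAE. All values are made up.
import numpy as np

spherical_equivalent = np.array([-4.0, -1.5, 0.0, 0.5, 2.0])   # diopters
model_predictions    = np.array([-3.6, -1.2, 0.3, 0.4, 1.5])

baseline_mae = np.mean(np.abs(spherical_equivalent - spherical_equivalent.mean()))
model_mae = np.mean(np.abs(spherical_equivalent - model_predictions))
print(f"baseline MAE = {baseline_mae:.2f} D, model MAE = {model_mae:.2f} D")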
Subject(s)
Deep Learning; Fundus Oculi; Refractive Errors/diagnosis; Retina/diagnostic imaging; Adult; Aged; Algorithms; Datasets as Topic; Female; Humans; Male; Middle Aged; Refraction, Ocular; Vision Tests; Visual Fields/physiology
ABSTRACT
Despite rapid advances in sequencing technologies, accurately calling genetic variants present in an individual genome from billions of short, errorful sequence reads remains challenging. Here we show that a deep convolutional neural network can call genetic variation in aligned next-generation sequencing read data by learning statistical relationships between images of read pileups around putative variant sites and true genotype calls. The approach, called DeepVariant, outperforms existing state-of-the-art tools. The learned model generalizes across genome builds and mammalian species, allowing nonhuman sequencing projects to benefit from the wealth of human ground-truth data. We further show that DeepVariant can learn to call variants in a variety of sequencing technologies and experimental designs, including deep whole genomes from 10X Genomics and Ion Ampliseq exomes, highlighting the benefits of using more automated and generalizable techniques for variant calling.
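Schematically, the method encodes the local read pileup as a multi-channel image and classifies the site into three genotype states; the sketch below is a heavily simplified stand-in (DeepVariant itself uses an Inception-style network and a richer pileup encoding, so the shapes and layers are assumptions):

# Heavily simplified sketch of a pileup-image genotype classifier.
# Tensor shape and architecture are illustrative assumptions.
import tensorflow as tf

HEIGHT, WIDTH, CHANNELS = 100, 221, 6   # assumed reads x window x feature channels

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu",
                           input_shape=(HEIGHT, WIDTH, CHANNELS)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(3, activation="softmax"),   # hom-ref, het, hom-alt
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()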
Subject(s)
Genome, Human; Mammals/genetics; Neural Networks, Computer; Polymorphism, Single Nucleotide; Animals; DNA Mutational Analysis; Genomics; Genotype; High-Throughput Nucleotide Sequencing; Humans; INDEL Mutation; Sequence Analysis, DNA; Software
ABSTRACT
Accurate prediction of the functional effect of genetic variation is critical for clinical genome interpretation. We systematically characterized the transcriptome effects of protein-truncating variants, a class of variants expected to have profound effects on gene function, using data from the Genotype-Tissue Expression (GTEx) and Geuvadis projects. We quantitated tissue-specific and positional effects on nonsense-mediated transcript decay and present an improved predictive model for this decay. We directly measured the effect of variants both proximal and distal to splice junctions. Furthermore, we found that robustness to heterozygous gene inactivation is not due to dosage compensation. Our results illustrate the value of transcriptome data in the functional interpretation of genetic variants.
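A widely used positional heuristic in this setting is the 50-nucleotide rule: a premature stop more than about 50 nt upstream of the last exon-exon junction is predicted to trigger nonsense-mediated decay. The paper's model refines this baseline, but the rule itself can be sketched as:

# Sketch of the classic 50-nt rule for nonsense-mediated decay (NMD).
# Positions are in transcript (cDNA) coordinates and purely illustrative;
# the study's predictive model incorporates additional positional and
# tissue-specific effects.
def predicted_to_trigger_nmd(stop_codon_pos: int,
                             last_junction_pos: int,
                             rule_nt: int = 50) -> bool:
    """True if the premature stop lies more than rule_nt upstream of the
    last exon-exon junction."""
    return (last_junction_pos - stop_codon_pos) > rule_nt

print(predicted_to_trigger_nmd(stop_codon_pos=300, last_junction_pos=1200))   # True
print(predicted_to_trigger_nmd(stop_codon_pos=1180, last_junction_pos=1200))  # False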
Subject(s)
Gene Expression Regulation; Genetic Variation; Genome, Human/genetics; Proteins/genetics; Transcriptome; Alternative Splicing; Gene Expression Profiling; Gene Silencing; Heterozygote; Humans; Nonsense Mediated mRNA Decay; Phenotype
ABSTRACT
This unit describes how to use BWA and the Genome Analysis Toolkit (GATK) to map genome sequencing data to a reference and produce high-quality variant calls that can be used in downstream analyses. The complete workflow includes the core NGS data processing steps that are necessary to make the raw data suitable for analysis by the GATK, as well as the key methods involved in variant discovery using the GATK.
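In outline, the workflow maps reads with BWA, sorts and indexes the alignments, and calls variants with the GATK. The orchestration sketch below uses modern GATK4-style command names and placeholder file names; the unit itself documents older GATK invocations plus additional processing (duplicate marking, base quality recalibration), so treat this as an assumption-laden skeleton rather than the unit's exact commands:

# Rough orchestration sketch of a map-then-call workflow (placeholder paths).
# Assumes bwa, samtools, and gatk are installed and the reference has already
# been indexed (bwa index, samtools faidx, and a sequence dictionary).
import subprocess

REF = "reference.fasta"
R1, R2 = "sample_R1.fastq.gz", "sample_R2.fastq.gz"

def run(cmd, **kwargs):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True, **kwargs)

# 1. Map reads to the reference with BWA-MEM.
with open("sample.sam", "w") as sam:
    run(["bwa", "mem", REF, R1, R2], stdout=sam)

# 2. Coordinate-sort and index the alignments.
run(["samtools", "sort", "-o", "sample.sorted.bam", "sample.sam"])
run(["samtools", "index", "sample.sorted.bam"])

# 3. Call variants with the GATK HaplotypeCaller (GATK4-style syntax).
run(["gatk", "HaplotypeCaller", "-R", REF,
     "-I", "sample.sorted.bam", "-O", "sample.vcf.gz"])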
Subject(s)
Genetic Variation; Genome, Human; Software; Calibration; Databases, Genetic; Haploidy; Haplotypes/genetics; Humans; Molecular Sequence Annotation; Polymorphism, Single Nucleotide/genetics; Sequence Alignment
ABSTRACT
Recent advances in sequencing technology make it possible to comprehensively catalog genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious, and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (i) initial read mapping; (ii) local realignment around indels; (iii) base quality score recalibration; (iv) SNP discovery and genotyping to find all potential variants; and (v) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. Here we discuss the application of these tools, instantiated in the Genome Analysis Toolkit, to deep whole-genome, whole-exome capture and multi-sample low-pass (~4×) 1000 Genomes Project datasets.
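Step (iii), base quality score recalibration, compares each reported quality bin's empirical mismatch rate (at sites that are not known variants) with its nominal error rate; a toy version of that calculation, far simpler than the GATK's covariate model, looks like:

# Toy sketch of base quality score recalibration (step iii above).
# Tallies are hypothetical; the GATK models many more covariates
# (machine cycle, dinucleotide context, read group).
import math

tallies = {20: (1_000_000, 15_000), 30: (2_000_000, 3_000)}  # Q -> (bases, mismatches)

for reported_q, (n_bases, n_mismatches) in sorted(tallies.items()):
    empirical_rate = (n_mismatches + 1) / (n_bases + 2)       # smoothed error rate
    empirical_q = -10 * math.log10(empirical_rate)
    print(f"reported Q{reported_q} -> empirical Q{empirical_q:.1f}")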