ABSTRACT
MOTIVATION: Genotype imputation has the potential to increase the amount of information that can be gained from the often limited biological material available in ancient samples. As many widely used tools have been developed with modern data in mind, their design is not necessarily reflective of the requirements in studies of ancient DNA. Here, we investigate whether an imputation method based on the full probabilistic Li and Stephens model of haplotype frequencies might be beneficial for the particular challenges posed by ancient data. RESULTS: We present an implementation called prophaser and compare imputation performance to two alternative pipelines that have been used in the ancient DNA community, both based on the Beagle software. Considering empirical ancient data downsampled to lower coverages as well as present-day samples with artificially thinned genotypes, we show that the proposed method is advantageous at lower coverages, where it yields improved accuracy and a better ability to capture rare variation. The software prophaser is optimized for running in a massively parallel manner and achieved reasonable runtimes in the experiments performed when executed on a GPU. AVAILABILITY AND IMPLEMENTATION: The C++ code for prophaser is available in the GitHub repository https://github.com/scicompuu/prophaser. SUPPLEMENTARY INFORMATION: Supplementary information is available at Bioinformatics online.
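The core of such an approach is the Li and Stephens haplotype-copying hidden Markov model, in which the target sequence is modeled as an imperfect mosaic of reference haplotypes. The sketch below is a minimal forward pass of that kind of model for a single target haplotype against a tiny reference panel; the panel, the recombination and mutation parameters, and the handling of missing sites are all hypothetical, and this is not the prophaser implementation.

```python
# Minimal sketch of a Li & Stephens forward pass over a reference panel.
# Illustrative only: parameter values and the panel are hypothetical,
# and this is not the prophaser implementation.
import numpy as np

def li_stephens_forward(target, ref_panel, rho=0.01, theta=0.01):
    """Rescaled forward probabilities of copying each reference haplotype at the last site.

    target:    (n_sites,) array of 0/1 alleles (np.nan for missing).
    ref_panel: (n_haps, n_sites) array of 0/1 alleles.
    rho:       per-site switch (recombination) probability.
    theta:     per-site copying-error (mutation) probability.
    """
    n_haps, n_sites = ref_panel.shape
    emit = np.where(np.isnan(target), 1.0,                      # missing site: uninformative
                    np.where(ref_panel == target, 1 - theta, theta))
    alpha = np.full(n_haps, 1.0 / n_haps) * emit[:, 0]
    alpha /= alpha.sum()
    for m in range(1, n_sites):
        # Copying process: stay on the same haplotype or switch uniformly.
        alpha = ((1 - rho) * alpha + rho / n_haps) * emit[:, m]
        alpha /= alpha.sum()                                     # rescale for numerical stability
    return alpha

# Toy example with a hypothetical 4-haplotype panel and one missing site.
panel = np.array([[0, 1, 1, 0, 1],
                  [0, 0, 1, 0, 1],
                  [1, 1, 0, 1, 0],
                  [1, 0, 0, 1, 0]], dtype=float)
target = np.array([0, 1, np.nan, 0, 1])
print(li_stephens_forward(target, panel))
```

A full imputation method would combine forward and backward passes, work with genotype likelihoods rather than hard calls, and marginalize over the copied haplotypes to obtain posterior genotype probabilities at the untyped sites.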
Subjects
Ancient DNA, Software, Animals, Dogs, Humans, Genotype, Haplotypes, Ethnicity
ABSTRACT
KEY MESSAGE: Pooling and imputation are computational methods that can be combined to achieve cost-effective and accurate high-density genotyping of both common and rare variants, as demonstrated in a MAGIC wheat population. The plant breeding industry has shown growing interest in using the genotype data of relevant markers for performing selection of new competitive varieties. The selection usually benefits from large amounts of marker data, and it is therefore crucial to have access to data collection methods that are both cost-effective and reliable. Computational methods such as genotype imputation have been proposed in several earlier plant science studies to address the cost challenge; however, genotype imputation methods have been used more frequently and investigated more extensively in human genetics research. The various existing algorithms show lower accuracy when inferring the genotypes of variants occurring at low frequency, even though these rare variants can have great significance and impact in the genetic studies that underlie selection. In contrast, pooling is a technique that can efficiently identify low-frequency items, and it has been successfully used for detecting the samples that carry rare variants in a population. In this study, we propose to combine pooling and imputation, and demonstrate this by simulating a hypothetical microarray for genotyping a population of recombinant inbred lines in a cost-effective and accurate manner, even for rare variants. We show that, with an adequate imputation model, it is feasible to predict the individual genotypes accurately, at lower cost than sample-wise genotyping and in a time-effective manner. Moreover, we provide code resources for reproducing the results presented in this study in the form of a containerized workflow.
Subjects
Single Nucleotide Polymorphism, Triticum, Humans, Genotype, Triticum/genetics, Bread, Plant Breeding, Genotyping Techniques/methods
ABSTRACT
BACKGROUND: Despite continuing technological advances, the cost of large-scale genotyping of a high number of samples can be prohibitive. The purpose of this study is to design a cost-saving strategy for SNP genotyping. We suggest making use of pooling, a group testing technique, to reduce the number of SNP arrays needed. We believe that this will be of greatest importance for non-model organisms with more limited resources in terms of cost-efficient large-scale chips and high-quality reference genomes, with applications such as wildlife monitoring and plant and animal breeding, but the approach is in essence species-agnostic. The proposed approach consists of grouping and mixing individual DNA samples into pools before testing these pools on bead-chips, such that the number of pools is smaller than the number of individual samples. We present a statistical estimation algorithm, based on the pooling outcomes, for inferring, marker by marker, the most likely genotype of every sample in each pool. Finally, we input these estimated genotypes into existing imputation algorithms. We compare the imputation performance on pooled data between the Beagle algorithm and a local likelihood-aware phasing algorithm that we implemented, closely modeled on MaCH. RESULTS: We conduct simulations based on human data from the 1000 Genomes Project, to aid comparison with other imputation studies. Based on the simulated data, we find that pooling impacts the genotype frequencies of the directly identifiable markers, even before imputation. We also demonstrate how a combinatorial estimation of the genotype probabilities from the pooling design can improve the prediction performance of imputation models. Our algorithm achieves 93% concordance in predicting unassayed markers from pooled data, thereby outperforming the Beagle imputation model, which reaches 80% concordance. We observe that the pooling design gives higher concordance for the rare variants than the traditional low-density to high-density imputation commonly used for cost-effective genotyping of large cohorts. CONCLUSIONS: We present promising results for combining a pooling scheme for SNP genotyping with computational genotype imputation on human data. These results could find potential applications in any context where genotyping costs are a limiting factor on the study size, such as marker-assisted selection in plant breeding.
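As a toy illustration of the pooling idea (not the block design or the statistical decoder used in the study), the sketch below arranges samples in a grid, reads out the row and column pools of one biallelic SNP, and marks which individual genotypes can be decoded directly; the remaining ambiguous entries are exactly those that would be handed to an imputation model as genotype probabilities.

```python
# Toy sketch of a square pooling design for one biallelic SNP.
# Illustrative only: the real study uses a specific block design and a
# statistical decoder, not this simple rule.
import numpy as np

rng = np.random.default_rng(0)
n = 4                                   # 16 samples in 4 row pools + 4 column pools
genotypes = rng.choice([0, 1, 2], size=(n, n), p=[0.7, 0.2, 0.1])  # 0/1/2 = ALT-allele count

def pool_readout(geno_1d):
    """What a pooled assay sees: which alleles are present in the pool."""
    has_ref = np.any(geno_1d < 2)
    has_alt = np.any(geno_1d > 0)
    return has_ref, has_alt

rows = [pool_readout(genotypes[i, :]) for i in range(n)]
cols = [pool_readout(genotypes[:, j]) for j in range(n)]

decoded = np.full((n, n), -1)           # -1 = ambiguous, to be resolved by imputation
for i in range(n):
    for j in range(n):
        for has_ref, has_alt in (rows[i], cols[j]):
            if not has_alt:             # pool carries no ALT allele -> sample is hom. REF
                decoded[i, j] = 0
            elif not has_ref:           # pool carries no REF allele -> sample is hom. ALT
                decoded[i, j] = 2

print("true:\n", genotypes)
print("decoded (-1 = ambiguous):\n", decoded)
print("directly decodable fraction:", np.mean(decoded >= 0))
```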
Subjects
Genome, Single Nucleotide Polymorphism, Algorithms, Animals, Dogs, Genotype, Genotyping Techniques/methods, Humans
ABSTRACT
Haploid, high-quality reference genomes are an important resource in genomic research projects, and sequencing reads are routinely mapped against them. A consequence is that DNA fragments carrying the reference allele will be more likely to map successfully, or to receive higher quality scores. This reference bias can affect downstream population genomic analyses when heterozygous sites are falsely considered homozygous for the reference allele. In palaeogenomic studies of human populations, mapping against the human reference genome is used to identify endogenous human sequences. Ancient DNA studies usually operate with low sequencing coverages, and fragmentation of DNA molecules causes a large proportion of the sequenced fragments to be shorter than 50 bp, reducing the number of accepted mismatches and increasing the probability of multiple matching sites in the genome. These ancient-DNA-specific properties potentially exacerbate the impact of reference bias on downstream analyses, especially since most studies of ancient human populations use pseudo-haploid data, i.e. they randomly sample only one sequencing read per site. We show that reference bias is pervasive in published ancient DNA sequence data of prehistoric humans, with some differences between individual genomic regions. We illustrate that the strength of reference bias is negatively correlated with fragment length. Most genomic regions we investigated show little to no mapping bias, but even a small proportion of sites with bias can impact analyses of those particular loci or slightly skew genome-wide estimates. Reference bias therefore has the potential to cause minor but significant differences in the results of downstream analyses such as population allele sharing, heterozygosity estimates, and estimates of archaic ancestry. These spurious results highlight how important it is to be aware of such technical artifacts and that strategies are needed to mitigate their effect. We therefore suggest post-mapping filtering strategies that substantially reduce the impact of reference bias.
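One simple way to quantify reference bias at known heterozygous sites is to test whether the fraction of reads supporting the reference allele deviates from the 50% expected in the absence of bias. The sketch below does this with a per-site binomial test; the read counts and the significance threshold are hypothetical, and the filtering strategies suggested in the paper operate on the mapped reads themselves rather than on such summary counts.

```python
# Sketch: flag heterozygous sites whose reference-allele read fraction
# deviates from the 0.5 expected without reference bias.
# Counts below are hypothetical; a real analysis would extract them from BAM pileups.
import numpy as np
from scipy.stats import binomtest

# (reference reads, alternative reads) at a few known heterozygous sites
counts = [(12, 11), (18, 6), (7, 8), (25, 9)]

for ref, alt in counts:
    frac_ref = ref / (ref + alt)
    p = binomtest(ref, ref + alt, 0.5).pvalue
    flagged = "biased?" if p < 0.05 else "ok"
    print(f"ref={ref:3d} alt={alt:3d} ref fraction={frac_ref:.2f} p={p:.3f} {flagged}")
```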
Subjects
Ancient DNA/analysis, Hominidae/genetics, Metagenomics/methods, Animals, Bias, Human Genome, High-Throughput Nucleotide Sequencing/methods, Humans, DNA Sequence Analysis/methods, Software
ABSTRACT
One hidden yet important issue in developing neural network potentials (NNPs) is the choice of training algorithm. In this article, we compare the performance of two popular training algorithms, the adaptive moment estimation algorithm (Adam) and the extended Kalman filter algorithm (EKF), using the Behler-Parrinello neural network and two publicly accessible datasets of liquid water [Morawietz et al., Proc. Natl. Acad. Sci. U. S. A. 113, 8368-8373 (2016) and Cheng et al., Proc. Natl. Acad. Sci. U. S. A. 116, 1110-1115 (2019)]. This is achieved by implementing EKF in TensorFlow. We find that NNPs trained with EKF are more transferable and less sensitive to the value of the learning rate than those trained with Adam. In both cases, error metrics on the validation set do not always serve as a good indicator of the actual performance of NNPs. Instead, we show that their performance correlates well with a Fisher-information-based similarity measure.
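The global extended Kalman filter treats the network weights as the state of a dynamical system and updates them from the prediction error and the Jacobian of the network output with respect to the weights. Below is a minimal numpy sketch of one such update for a scalar-output model, with hypothetical noise settings; it is not the TensorFlow implementation used in the paper.

```python
# Minimal sketch of a global extended Kalman filter (EKF) weight update for a
# scalar-output model. Noise parameters are hypothetical; this is not the
# paper's TensorFlow implementation.
import numpy as np

def ekf_update(w, P, x, y, model, jacobian, R=1e-2, Q=1e-6):
    """One EKF step: w = weights, P = weight covariance, (x, y) = one sample."""
    y_hat = model(w, x)
    H = jacobian(w, x).reshape(1, -1)          # d y_hat / d w, shape (1, n_weights)
    S = H @ P @ H.T + R                        # innovation covariance (1x1)
    K = P @ H.T / S                            # Kalman gain, shape (n_weights, 1)
    w = w + (K * (y - y_hat)).ravel()          # weight update from the prediction error
    P = P - K @ H @ P + Q * np.eye(len(w))     # covariance update with process noise
    return w, P

# Toy example: fit y = w0 + w1 * x with the EKF (the Jacobian is [1, x]).
model = lambda w, x: w[0] + w[1] * x
jacobian = lambda w, x: np.array([1.0, x])
w, P = np.zeros(2), np.eye(2)
rng = np.random.default_rng(1)
for _ in range(200):
    x = rng.uniform(-1, 1)
    y = 0.5 + 2.0 * x + rng.normal(scale=0.1)
    w, P = ekf_update(w, P, x, y, model, jacobian)
print("estimated weights:", w)                  # should approach [0.5, 2.0]
```

For a real NNP, the Jacobian would be computed by backpropagation and the covariance P would span all network weights, which is where the main cost of EKF training lies.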
ABSTRACT
Modern flash X-ray diffraction imaging (FXI) acquires diffraction signals from single biomolecules at a high repetition rate at X-ray free-electron lasers (XFELs), easily obtaining millions of 2D diffraction patterns from a single experiment. Due to the stochastic nature of FXI experiments and the massive volumes of data, retrieving 3D electron densities from raw 2D diffraction patterns is a challenging and time-consuming task. We propose a semi-automatic data analysis pipeline for FXI experiments, which includes four steps: hit-finding and preliminary filtering, pattern classification, 3D Fourier reconstruction, and post-analysis. We also include a recently developed bootstrap methodology in the post-analysis step for uncertainty analysis and quality control. To achieve the best possible resolution, we further suggest using background subtraction, signal windowing, and convex optimization techniques when retrieving the Fourier phases in the post-analysis step. As an application example, we reconstructed the 3D electron density of the PR772 virus using the proposed data analysis pipeline. The retrieved structure was resolved beyond the detector-edge resolution and clearly showed the pseudo-icosahedral capsid of PR772.
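The first step of such a pipeline, hit finding, amounts to keeping only the frames whose total photon count clearly exceeds the background. A minimal sketch on synthetic frames is shown below; the robust threshold rule is a hypothetical stand-in, not the pipeline's exact criterion.

```python
# Sketch of a photon-count-based hit finder on synthetic detector frames.
# The threshold rule is hypothetical; the actual pipeline uses its own criteria.
import numpy as np

rng = np.random.default_rng(2)
n_frames, shape = 1000, (64, 64)
frames = rng.poisson(0.01, size=(n_frames, *shape))              # mostly empty shots
hit_idx = rng.choice(n_frames, size=50, replace=False)
frames[hit_idx] += rng.poisson(0.05, size=(50, *shape))          # shots containing a particle

counts = frames.reshape(n_frames, -1).sum(axis=1)                # total photons per frame
mad = np.median(np.abs(counts - np.median(counts)))              # robust spread estimate
threshold = np.median(counts) + 6 * 1.4826 * mad                 # outlier cutoff
hits = np.where(counts > threshold)[0]
print(f"{len(hits)} candidate hits kept out of {n_frames} frames")
```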
ABSTRACT
BACKGROUND: The advent of next-generation sequencing (NGS) has made whole-genome sequencing of cohorts of individuals a reality. Primary datasets of raw or aligned reads of this sort can become very large. For scientific questions where curated variant calls are not sufficient, the sheer size of the datasets makes analysis prohibitively expensive. In order to make re-analysis of such data feasible without access to a large-scale computing facility, we have developed a highly scalable, storage-agnostic framework, an associated API, and an easy-to-use web user interface to execute custom filters on large genomic datasets. RESULTS: We present BAMSI, a Software-as-a-Service (SaaS) solution for filtering of the 1000 Genomes phase 3 set of aligned reads, with the possibility of extension and customization to other sets of files. Unique to our solution is the capability of simultaneously utilizing many different mirrors of the data to increase the speed of the analysis. In particular, if the data are available in private or public clouds - an increasingly common scenario for both academic and commercial cloud providers - our framework allows for seamless deployment of filtering workers close to the data. We show results indicating that such a setup improves the horizontal scalability of the system, and present a possible use case of the framework by performing an analysis of structural variation in the 1000 Genomes dataset. CONCLUSIONS: BAMSI constitutes a framework for efficient filtering of large genomic datasets that is flexible in its use of compute as well as storage resources. The data resulting from the filter are assumed to be greatly reduced in size, and can easily be downloaded or routed into, e.g., a Hadoop cluster for subsequent interactive analysis using Hive, Spark, or similar tools. In this respect, our framework also suggests a general model for making very large datasets of high scientific value more accessible, by offering organizations the possibility to share the cost of hosting data on hot storage without compromising the scalability of downstream analysis.
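A filter of this kind is essentially a predicate applied to each aligned read. As an illustration of the sort of filter relevant to the structural-variation use case, the hedged pysam sketch below keeps discordant read pairs (unexpected insert size, orientation, or mate placement); the file names and the insert-size cutoff are placeholders, and this is not BAMSI's internal code.

```python
# Sketch of a read-level filter for structural-variation signals, of the kind
# a filtering worker could apply. Paths and cutoffs are placeholders; this is
# not the framework's internal implementation.
import pysam

MAX_INSERT = 1000            # hypothetical insert-size cutoff

def is_discordant(read):
    """Keep paired reads with unexpected insert size, orientation or mate placement."""
    if not read.is_paired or read.is_unmapped or read.mate_is_unmapped:
        return False
    if not read.is_proper_pair:
        return True
    return abs(read.template_length) > MAX_INSERT

# Requires an index (.bai) next to the input BAM.
with pysam.AlignmentFile("sample.bam", "rb") as bam, \
     pysam.AlignmentFile("discordant.bam", "wb", template=bam) as out:
    for read in bam.fetch():
        if is_discordant(read):
            out.write(read)
```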
Subjects
Cloud Computing/standards, Genomics/methods, High-Throughput Nucleotide Sequencing/methods, Humans
ABSTRACT
In imaging modalities recording diffraction data, such as the imaging of viruses at X-ray free-electron laser facilities, the original image can be reconstructed if the phases are known. When phases are unknown, oversampling and a constraint on the support region of the original object can be used to solve a non-convex optimization problem with iterative alternating-projection methods. Such schemes are ill-suited to finding the optimal solution for sparse data, since the recorded pattern does not correspond exactly to the original wave function, and different iteration starting points can give rise to different solutions. We construct a convex optimization problem in which every local optimum is also a global optimum. This is achieved using a modified support constraint and a maximum-likelihood treatment of the recorded data as a sample from the underlying wave function. This relaxed problem is solved to provide a new set of most probable "healed" signal intensities, without sparseness or missing data. For these new intensities, it should be possible to satisfy the support constraint and the intensity constraint exactly, without conflicts between them. By making both constraints satisfiable, traditional phase retrieval with superior results becomes possible. On simulated data, we demonstrate the benefits of our approach visually and quantify the improvement in terms of the crystallographic R factor for the recovered scalar amplitudes relative to true simulations, from 0.405 to 0.097, as well as the mean-squared error in the reconstructed image, from 0.233 to 0.139. We also compare our approach, with regard to theory and simulation results, to other approaches for healing as well as for noise-tolerant phase retrieval. These tests indicate that the COACS pre-processing allows for best-in-class results.
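In outline, convexity is obtained by optimizing over the intensities themselves rather than the complex amplitudes: a Poisson log-likelihood of the recorded counts is concave in the intensities, and a support constraint on the object translates into a linear constraint on the autocorrelation (the inverse Fourier transform of the intensities). A hedged sketch of such a formulation, in our own notation and not necessarily the exact objective used in the paper, is

\[
\max_{I \ge 0} \;\sum_{k \in \mathrm{recorded}} \bigl( n_k \log I_k - I_k \bigr)
\quad \text{subject to} \quad
\bigl( \mathcal{F}^{-1} I \bigr)(x) = 0 \ \text{for } x \notin S_{\mathrm{auto}},
\]

where \(n_k\) are the recorded photon counts, \(I_k\) the healed intensities, and \(S_{\mathrm{auto}}\) the autocorrelation support implied by the (modified) support constraint.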
ABSTRACT
The existence of noise and of column-wise artifacts in the CSPAD-140K detector and in a module of the CSPAD-2.3M large camera, respectively, is reported for the L730 and L867 experiments performed at the CXI instrument at the Linac Coherent Light Source (LCLS), in a low-flux and low signal-to-noise ratio regime. Possible remedies are discussed and an additional step in the preprocessing of the data is introduced, which consists of performing a median subtraction along the columns of the detector modules. In this way, we reduce the overall variation in the photon count distribution, lowering the mean false-positive photon detection rate by about 4% (from 5.57 × 10⁻⁵ to 5.32 × 10⁻⁵ photon counts pixel⁻¹ frame⁻¹ in L867, cxi86715) and 7% (from 1.70 × 10⁻³ to 1.58 × 10⁻³ photon counts pixel⁻¹ frame⁻¹ in L730, cxi73013), and the standard deviation in the false-positive photon count per shot by 15% and 35%, respectively, while not making our average photon detection threshold more stringent. Such improvements in detector noise reduction and artifact removal constitute a step forward in the development of flash X-ray imaging techniques for high-resolution, low-signal experiments and for serial nano-crystallography at X-ray free-electron laser facilities.
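The preprocessing step itself is simple: subtract, per module and per frame, the median of each detector column from that column. A minimal numpy sketch is shown below; the module shape and the injected artifact are hypothetical.

```python
# Minimal sketch of the column-wise median subtraction, applied per frame.
# The module shape below is hypothetical.
import numpy as np

def subtract_column_medians(frame):
    """Remove common-mode column offsets from one detector-module frame."""
    return frame - np.median(frame, axis=0, keepdims=True)

frame = np.random.default_rng(3).normal(0.0, 1.0, size=(185, 388))  # one CSPAD-like module
frame[:, 100] += 0.8                                                 # simulated column artifact
cleaned = subtract_column_medians(frame)
print(cleaned[:, 100].mean())   # artifact column now centred near zero
```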
ABSTRACT
We use extremely bright and ultrashort pulses from an x-ray free-electron laser (XFEL) to measure correlations in x rays scattered from individual bioparticles. This allows us to go beyond the traditional crystallography and single-particle imaging approaches for structure investigations. We employ angular correlations to recover the three-dimensional (3D) structure of nanoscale viruses from x-ray diffraction data measured at the Linac Coherent Light Source. Correlations provide us with a comprehensive structural fingerprint of a 3D virus, which we use both for model-based and ab initio structure recovery. The analyses reveal a clear indication that the structure of the viruses deviates from the expected perfect icosahedral symmetry. Our results anticipate exciting opportunities for XFEL studies of the structure and dynamics of nanoscale objects by means of angular correlations.
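The central quantity in such analyses is the angular intensity correlation on a ring of constant momentum transfer q, C(q, Δφ) = ⟨I(q, φ) I(q, φ + Δφ)⟩, averaged over the azimuth φ and over diffraction patterns. The sketch below computes it via the FFT along the azimuthal coordinate, assuming (hypothetically) that the patterns have already been interpolated onto a polar grid; it is a generic illustration, not the analysis code used in the study.

```python
# Sketch: angular cross-correlation C(q, dphi) on polar-interpolated patterns,
# averaged over frames. Input shape is assumed to be (n_frames, n_q, n_phi).
import numpy as np

def angular_correlation(polar_patterns):
    """C(q, dphi) via the FFT along the azimuthal axis (correlation theorem)."""
    F = np.fft.fft(polar_patterns, axis=-1)
    corr = np.fft.ifft(F * np.conj(F), axis=-1).real      # per-frame circular autocorrelation in phi
    corr /= polar_patterns.shape[-1]                       # average over phi
    return corr.mean(axis=0)                               # average over frames

# Toy input: random positive "intensities" on a (frames, q-rings, phi-bins) grid.
patterns = np.random.default_rng(4).poisson(1.0, size=(100, 20, 360)).astype(float)
C = angular_correlation(patterns)
print(C.shape)          # (n_q, n_phi): one correlation curve per q ring
```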
Subjects
Viruses/ultrastructure, X-Ray Diffraction, Lasers, Radiography, Viruses/chemistry
ABSTRACT
BACKGROUND: This paper describes a combined heuristic and hidden Markov model (HMM) method to accurately impute missing genotypes in livestock datasets. Genomic selection in breeding programs requires high-density genotyping of many individuals, so algorithms that generate this information economically are crucial. There are two common classes of imputation methods, heuristic methods and probabilistic methods, the latter being largely based on hidden Markov models. Heuristic methods are robust, but fail to impute markers in regions where the thresholds of the heuristic rules are not met, or where the pedigree is inconsistent. Hidden Markov models are probabilistic methods that typically do not require specific family structures or pedigree information, making them very flexible, but they are computationally expensive and, in some cases, less accurate. RESULTS: We implemented a new hybrid imputation method that combines the heuristic method AlphaImpute and the HMM method MaCH, and compared the computation time and imputation accuracy of the three methods. AlphaImpute was the fastest, followed by the hybrid method and then the HMM. The computation time of the hybrid method and the HMM increased linearly with the number of iterations used in the hidden Markov model; however, the computation time of the hybrid method increased almost linearly, and that of the HMM quadratically, with the number of template haplotypes. The hybrid method was the most accurate imputation method for low-density panels when pedigree information was missing, especially if the minor allele frequency was also low. The accuracy of the hybrid method and the HMM increased with the number of template haplotypes. The imputation accuracy of all three methods increased with the marker density of the low-density panels. Excluding the pedigree information reduced imputation accuracy for the hybrid method and AlphaImpute. Finally, the imputation accuracy of the three methods decreased with decreasing minor allele frequency. CONCLUSIONS: The hybrid heuristic and probabilistic imputation method is able to impute all markers for all individuals in a population, like the HMM. The hybrid method is usually more accurate, and never significantly less accurate, than a purely heuristic method or a purely probabilistic method, and it is faster than a standard probabilistic method.
Subjects
Breeding/methods, Genome-Wide Association Study/methods, Livestock/genetics, Software, Animals, Breeding/standards, Gene Frequency, Genome-Wide Association Study/standards, Genotype
ABSTRACT
We present a proof-of-concept three-dimensional reconstruction of the giant mimivirus particle from experimentally measured diffraction patterns from an x-ray free-electron laser. Three-dimensional imaging requires the assembly of many two-dimensional patterns into an internally consistent Fourier volume. Since each particle is randomly oriented when exposed to the x-ray pulse, the relative orientations have to be retrieved from the diffraction data alone. We achieve this with a modified version of the expand, maximize and compress (EMC) algorithm and validate our result using new methods.
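The expand-maximize-compress idea can be illustrated on a much simpler problem: recovering a 2D image from noisy copies observed at unknown in-plane rotations. The sketch below is a toy of that kind, not the modified 3D algorithm used for the mimivirus data: it assigns each pattern a probability over a grid of rotations from its Poisson likelihood and updates the model as the probability-weighted average of the back-rotated patterns.

```python
# Toy expand-maximize-compress (EMC) sketch: recover a 2D model from
# Poisson-noisy copies observed at unknown in-plane rotations. Illustrative
# only; the actual reconstruction assembles 3D orientations with a modified EMC.
import numpy as np
from scipy.ndimage import rotate

rng = np.random.default_rng(5)
truth = np.zeros((32, 32))
truth[10:22, 14:18] = 4.0                                        # simple asymmetric object
angles = np.arange(0, 360, 30)                                   # orientation grid (degrees)

# Simulate noisy "patterns" at random, unknown orientations.
data = np.stack([rng.poisson(rotate(truth, rng.choice(angles), reshape=False, order=1).clip(0))
                 for _ in range(200)]).astype(float)

# Start from the data average plus a small perturbation to break rotational symmetry.
model = data.mean(axis=0) * (1 + 0.1 * rng.random(truth.shape))
for _ in range(20):
    # Expand: predicted pattern for every orientation on the grid.
    slices = np.stack([rotate(model, a, reshape=False, order=1).clip(1e-6) for a in angles])
    # Maximize: Poisson log-likelihood of each pattern against each orientation.
    loglik = data.reshape(len(data), -1) @ np.log(slices).reshape(len(angles), -1).T \
             - slices.reshape(len(angles), -1).sum(axis=1)
    P = np.exp(loglik - loglik.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)                            # orientation probabilities
    # Compress: probability-weighted average of back-rotated patterns.
    new_model = np.zeros_like(model)
    for r, a in enumerate(angles):
        back = np.stack([rotate(d, -a, reshape=False, order=1) for d in data])
        new_model += (P[:, r, None, None] * back).sum(axis=0)
    model = new_model / len(data)

# Compare with the truth, maximizing over the unknown global rotation.
best = max(np.corrcoef(rotate(model, a, reshape=False, order=1).ravel(), truth.ravel())[0, 1]
           for a in angles)
print(f"correlation with ground truth (best over the angle grid): {best:.2f}")
```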
Subjects
Three-Dimensional Imaging/methods, Mimiviridae/ultrastructure, X-Ray Diffraction/methods, Algorithms, Electrons, Lasers, X-Ray Diffraction/instrumentation
ABSTRACT
The idea of using ultrashort X-ray pulses to obtain images of single proteins frozen in time has fascinated and inspired many. It was one of the arguments for building X-ray free-electron lasers. According to theory, the extremely intense pulses provide sufficient signal to dispense with using crystals as an amplifier, and the ultrashort pulse duration permits capturing the diffraction data before the sample inevitably explodes. This was first demonstrated on a biological sample, the giant mimivirus, a decade ago. Since then, a large collaboration has been pushing the limit of the smallest sample that can be imaged. The ability to capture snapshots on the timescale of atomic vibrations, while keeping the sample at room temperature, may allow probing the entire conformational phase space of macromolecules. Here we show the first observation of an X-ray diffraction pattern from a single protein, that of Escherichia coli GroEL, which at 14 nm in diameter is the smallest biological sample ever imaged by X-rays, and demonstrate that the concept of diffraction before destruction extends to single proteins. From the pattern, it is possible to determine the approximate orientation of the protein. Our experiment demonstrates the feasibility of ultrafast imaging of single proteins, opening the way to single-molecule time-resolved studies on the femtosecond timescale.
ABSTRACT
BACKGROUND: In many contexts, pedigrees for individuals are known even though not all individuals have been fully genotyped. In one extreme case, the genotypes for a set of full siblings are known, with no knowledge of the parental genotypes. We propose a method for inferring phased haplotypes and genotypes for all individuals in such pedigrees, even those with missing data, allowing a multitude of classic and recent methods for linkage and genome analysis to be used more efficiently. RESULTS: By artificially removing the founder-generation genotype data from a well-studied simulated dataset, the quality of reconstructed genotypes in that generation can be verified. For the full structure of repeated matings with 15 offspring per mating and 10 dams per sire, 99.89% of all founder markers were phased correctly, given only the unphased genotypes of the offspring. The accuracy was reduced only slightly, to 99.51%, when introducing a 2% error rate in the offspring genotypes. When reduced to only 5 full-sib offspring from a single sire-dam mating, the corresponding figure was 92.62%, which compares favorably with the 89.28% achieved by the leading Merlin package. Furthermore, Merlin is unable to handle more than approximately 10 sibs, as the number of states tracked rises exponentially with family size, while our approach has no such limit and handled 150 half-sibs with ease in our experiments. CONCLUSIONS: Our method is able to reconstruct genotypes for parents when genotype data are only available for offspring individuals, as well as haplotypes for all individuals. Compared to the Merlin package, we can handle larger pedigrees and produce superior results, mainly because Merlin uses the Viterbi algorithm on the state space to infer the genotype sequence. Tracking of haplotype and allele origin can be used in any application where the marker set does not directly influence the genetic variation underlying the traits of interest. Inference of genotypes can also reduce the effects of genotyping errors and missing data. The cnF2freq codebase implementing our approach is available under a BSD-style license.
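At a single biallelic marker, the flavor of the inference can be illustrated by scoring every possible sire-dam genotype pair against the Mendelian segregation probabilities of the observed full-sib genotypes. The sketch below does exactly that for one marker in isolation; it is only an illustration, since the actual method also exploits linkage between markers, which is where the underlying HMM adds its power.

```python
# Sketch: maximum-likelihood parental genotypes at one biallelic marker, given
# only full-sib offspring genotypes. Single-marker only; the actual method also
# exploits linkage across markers.
import itertools
import math

def transmit(g, a):
    """Probability that a parent with genotype g (count of allele 1) transmits allele a."""
    p1 = g / 2.0
    return p1 if a == 1 else 1.0 - p1

def offspring_prob(sire, dam, child):
    """P(child genotype | parental genotypes) under Mendelian segregation."""
    return sum(transmit(sire, a) * transmit(dam, b)
               for a in (0, 1) for b in (0, 1) if a + b == child)

def infer_parents(offspring):
    best, best_ll = None, -math.inf
    for sire, dam in itertools.product((0, 1, 2), repeat=2):
        probs = [offspring_prob(sire, dam, c) for c in offspring]
        if any(p == 0 for p in probs):
            continue                       # inconsistent with Mendelian inheritance
        ll = sum(math.log(p) for p in probs)
        if ll > best_ll:
            best, best_ll = (sire, dam), ll
    return best                            # note: (sire, dam) only determined up to swapping

# Five full sibs with genotypes 0/1/2 (allele counts); the parents are unknown.
print(infer_parents([1, 1, 2, 1, 0]))      # both parents must be heterozygous -> (1, 1)
```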
Subjects
Genotype, Haplotypes, Statistical Models, Parents, Pedigree, Siblings, Humans, Markov Chains
ABSTRACT
Dimensionality reduction is a data transformation technique widely used in various fields of genomics research. Applied to genotype data, dimensionality reduction is known to capture genetic similarity between individuals and is used for visualization of genetic variation, identification of population structure, and ancestry mapping. Among frequently used methods are principal component analysis, a linear transform that often misses more fine-scale structure, and neighbor-graph-based methods, which focus on local relationships rather than large-scale patterns. Deep learning models are a class of nonlinear machine learning methods in which the features used in the data transformation are decided by the model in a data-driven manner rather than by the researcher, and they have been shown to present a promising alternative to traditional statistical methods for various applications in omics research. In this study, we propose a deep learning model based on a convolutional autoencoder architecture for dimensionality reduction of genotype data. Using a highly diverse cohort of human samples, we demonstrate that the model can identify population clusters and provide richer visual information than principal component analysis, while preserving global geometry to a higher extent than t-SNE and UMAP, yielding results comparable to an alternative deep learning approach based on variational autoencoders. We also discuss the use of the methodology for more general characterization of genotype data, showing that it preserves spatial properties in the form of the decay of linkage disequilibrium with distance along the genome, and demonstrating its use as a genetic clustering method, with results compared to the ADMIXTURE software frequently used in population genetic studies.
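As an illustration of the general architecture only (the layer sizes, window length, and hyperparameters below are hypothetical, not those of the published model), the sketch defines a 1D convolutional autoencoder in PyTorch that maps a window of SNP genotypes to a two-dimensional embedding and back.

```python
# Sketch of a 1D convolutional autoencoder for genotype data (values 0/1/2 per SNP).
# Layer sizes and hyperparameters are hypothetical, not those of the published model.
import torch
import torch.nn as nn

N_SNPS = 1024                                   # SNPs per input window (assumed)

class GenotypeAutoencoder(nn.Module):
    def __init__(self, latent_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * (N_SNPS // 4), latent_dim),           # 2D embedding for visualization
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32 * (N_SNPS // 4)), nn.ReLU(),
            nn.Unflatten(1, (32, N_SNPS // 4)),
            nn.ConvTranspose1d(32, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose1d(16, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

# One forward/backward pass on random stand-in genotypes (batch, channel, SNPs).
model = GenotypeAutoencoder()
x = torch.randint(0, 3, (8, 1, N_SNPS)).float()
recon, embedding = model(x)
loss = nn.functional.mse_loss(recon, x)
loss.backward()
print(embedding.shape, recon.shape)             # torch.Size([8, 2]) torch.Size([8, 1, 1024])
```

The two-dimensional bottleneck is what gets plotted in place of the first two principal components; a published model would also need a training loop, a genotype-appropriate loss, and handling of missing calls.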
Subjects
Computer Simulation, Deep Learning, Cluster Analysis, Human Genome/genetics, Genotype, Humans, Software
ABSTRACT
With the ability to sequence ancient DNA to high coverage often limited by sample quality or cost, imputation of missing genotypes offers a way to increase both the power of inference and the cost-effectiveness of analyses of ancient data. However, the high degree of uncertainty often associated with ancient DNA poses several methodological challenges, and the performance of imputation methods in this context has not been fully explored. To gain further insights, we performed a systematic evaluation of imputation of ancient data using Beagle v4.0 and reference data from phase 3 of the 1000 Genomes Project, investigating the effects of coverage, phased reference, and study sample size. Making use of five ancient individuals with high-coverage data available, we evaluated imputed data for accuracy, reference bias, and genetic affinities as captured by principal component analysis. We obtained genotype concordance levels of over 99% for data with 1× coverage, and similar accuracy and reference bias at coverages as low as 0.75×. Our findings suggest that using imputed data can be a realistic option for various population genetic analyses even for data in coverage ranges below 1×. We also show that a large and varied phased reference panel, as well as the inclusion of low- to moderate-coverage ancient individuals in the study sample, can increase imputation performance, particularly for rare alleles. In-depth analysis of the imputed data with respect to genetic variants and allele frequencies gave further insight into the nature of the errors arising during imputation and can provide practical guidelines for postprocessing and validation prior to downstream analysis.
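The central evaluation metric is genotype concordance between the imputed calls and the high-coverage "truth" genotypes, ideally stratified by allele frequency, since rare variants are the hardest to impute. A minimal sketch of such an evaluation on random stand-in data (not the study's actual data or thresholds) is shown below.

```python
# Sketch: genotype concordance between imputed and high-coverage "truth" calls,
# stratified by minor allele frequency (MAF). All data below are random stand-ins.
import numpy as np

rng = np.random.default_rng(6)
n_sites = 10_000
truth = rng.choice([0, 1, 2], size=n_sites, p=[0.8, 0.15, 0.05])
imputed = truth.copy()
flip = rng.random(n_sites) < 0.02                        # pretend 2% of calls are wrong
imputed[flip] = rng.choice([0, 1, 2], size=flip.sum())
maf = rng.beta(0.2, 2.0, size=n_sites) / 2               # stand-in reference-panel MAF

bins = [0, 0.01, 0.05, 0.5]
for lo, hi in zip(bins[:-1], bins[1:]):
    sel = (maf >= lo) & (maf < hi)
    conc = np.mean(truth[sel] == imputed[sel])
    print(f"MAF [{lo:.2f}, {hi:.2f}): concordance {conc:.3f} over {sel.sum()} sites")
```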
Subjects
Ancient DNA, Genotype, Alleles, Gene Frequency, Humans, Software
ABSTRACT
Single Particle Imaging (SPI) with intense coherent X-ray pulses from X-ray free-electron lasers (XFELs) has the potential to produce molecular structures without the need for crystallization or freezing. Here we present a dataset of 285,944 diffraction patterns from aerosolized Coliphage PR772 virus particles injected into the femtosecond X-ray pulses of the Linac Coherent Light Source (LCLS). Additional exposures with background information are also deposited. The diffraction data were collected at the Atomic, Molecular and Optical Science Instrument (AMO) of the LCLS in four experimental beam times over a period of four years. The photon energy was either 1.2 or 1.7 keV, and the pulse energy was between 2 and 4 mJ in a focal spot of about 1.3 µm × 1.7 µm full width at half maximum (FWHM). The X-ray laser pulses captured the particles in random orientations. The data offer insight into aerosolized virus particles in the gas phase, contain information relevant to improving experimental parameters, and provide a basis for developing algorithms for image analysis and reconstruction.
Subjects
Coliphages, Lasers, Particle Accelerators, Virion, X-Ray Diffraction
ABSTRACT
[This corrects the article DOI: 10.1107/S2052252517003591.].
ABSTRACT
The possibility of imaging single proteins constitutes an exciting challenge for x-ray lasers. Despite encouraging results on large particles, imaging small particles has proven difficult for two reasons: insufficient pulse intensity from currently available x-ray lasers and, as we demonstrate here, contamination of the aerosolized molecules by nonvolatile contaminants in the solution. The amount of contamination on the sample depends on the initial droplet size during aerosolization. Here, we show that, with our electrospray injector, we can decrease the size of aerosol droplets and demonstrate virtually contaminant-free sample delivery of organelles, small virions, and proteins. The results presented here, together with the increased performance of next-generation x-ray lasers, constitute an important stepping stone toward the ultimate goal of protein structure determination from imaging at room temperature and high temporal resolution.
ABSTRACT
Modern technology for producing extremely bright and coherent x-ray laser pulses provides the possibility to acquire a large number of diffraction patterns from individual biological nanoparticles, including proteins, viruses, and DNA. These two-dimensional diffraction patterns can be practically reconstructed and retrieved down to a resolution of a few angstroms. In principle, a sufficiently large collection of diffraction patterns will contain the information required for a full three-dimensional reconstruction of the biomolecule. The computational methodology for this reconstruction task is still under development, and highly resolved reconstructions have not yet been produced. We analyze the expansion-maximization-compression scheme, the current state-of-the-art approach for this very challenging application, by isolating different sources of resolution-limiting factors. Through numerical experiments on synthetic data, we evaluate their respective impact. We reach conclusions of relevance for handling actual experimental data, and we also point out certain improvements to the underlying estimation algorithm. We also introduce a practically applicable computational methodology in the form of bootstrap procedures for assessing reconstruction uncertainty in the real-data case. We evaluate the sharpness of this approach and argue that this type of procedure will be critical in the near future when handling the increasing amounts of data.
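A bootstrap of this kind amounts to repeating the reconstruction on resampled sets of diffraction patterns and summarizing the spread of the results. The skeleton below shows the resampling loop with a trivial placeholder reconstruct() function standing in for a full reconstruction run; the per-pixel standard deviation across replicates is one possible uncertainty summary, and none of the names or settings here come from the paper.

```python
# Skeleton of a bootstrap over diffraction patterns for reconstruction
# uncertainty. reconstruct() is a hypothetical placeholder for an expensive
# reconstruction; here it is a trivial stand-in so the skeleton executes.
import numpy as np

def reconstruct(patterns):
    """Placeholder for a full reconstruction from a set of patterns."""
    return patterns.mean(axis=0)            # trivial stand-in

rng = np.random.default_rng(7)
patterns = rng.poisson(1.0, size=(500, 16, 16)).astype(float)   # stand-in data

n_boot = 20
replicates = []
for _ in range(n_boot):
    idx = rng.integers(0, len(patterns), size=len(patterns))    # resample with replacement
    replicates.append(reconstruct(patterns[idx]))
replicates = np.stack(replicates)

uncertainty = replicates.std(axis=0)         # per-pixel spread across bootstrap replicates
print("mean bootstrap std:", uncertainty.mean())
```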