ABSTRACT
Reports of machine learning implementations in veterinary imaging are infrequent, but advances in machine learning architectures and access to increased computing power will likely prompt increased interest. This diagnostic accuracy study describes a particular form of machine learning, a deep learning convolutional neural network (ConvNet), for hip joint detection and classification of hip dysplasia from ventrodorsal (VD) pelvis radiographs submitted for hip dysplasia screening. A total of 11,759 pelvis images were available together with their Fédération Cynologique Internationale (FCI) scores. The dataset was dichotomized into images showing no signs of hip dysplasia (FCI grades "A" and "B", the "A-B" group) and images showing signs of dysplasia (FCI grades "C", "D", and "E", the "C-E" group). In a transfer learning approach, an existing pretrained ConvNet was fine-tuned to provide models that detect hip joints in VD pelvis images and classify them according to their FCI score grouping. The results yielded two models. The first successfully detected hip joints in the VD pelvis images (intersection over union of 85%). The second yielded a sensitivity of 0.53, a specificity of 0.92, a positive predictive value of 0.91, and a negative predictive value of 0.81 for classifying detected hip joints as belonging to the "C-E" group. ConvNets and transfer learning are applicable to veterinary imaging. The models obtained could become a tool to aid hip screening protocols if hip dysplasia classification performance were improved through access to more data and, possibly, model optimization.
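The detection and screening metrics quoted above follow directly from confusion-matrix counts and bounding-box overlap. A minimal sketch of how they are computed (function names and example numbers are illustrative, not from the study):

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard screening metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true-positive rate
        "specificity": tn / (tn + fp),  # true-negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes,
    the detection metric reported for the hip-joint model."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union
```

Read this way, the reported PPV of 0.91 means that roughly nine of ten hips flagged as "C-E" truly belonged to that group, while the sensitivity of 0.53 means nearly half of the dysplastic hips were missed.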
Subject(s)
Deep Learning , Hip Dislocation/veterinary , Hip Joint/diagnostic imaging , Computer-Assisted Image Processing/methods , Pelvis/diagnostic imaging , Radiography/veterinary , Animals , Hip Dislocation/diagnostic imaging , Humans , Mass Screening/veterinary , Predictive Value of Tests
ABSTRACT
Methods of protein structure determination based on NMR chemical shifts are becoming increasingly common. The most widely used approaches adopt the molecular fragment replacement strategy, in which structural fragments are repeatedly reassembled into different complete conformations in molecular simulations. Although these approaches are effective in generating individual structures consistent with the chemical shift data, they do not enable the sampling of the conformational space of proteins with correct statistical weights. Here, we present a method of molecular fragment replacement that makes it possible to perform equilibrium simulations of proteins, and hence to determine their free energy landscapes. This strategy is based on the encoding of the chemical shift information in a probabilistic model in Markov chain Monte Carlo simulations. First, we demonstrate that with this approach it is possible to fold proteins to their native states starting from extended structures. Second, we show that the method satisfies the detailed balance condition and hence it can be used to carry out an equilibrium sampling from the Boltzmann distribution corresponding to the force field used in the simulations. Third, by comparing the results of simulations carried out with and without chemical shift restraints we describe quantitatively the effects that these restraints have on the free energy landscapes of proteins. Taken together, these results demonstrate that the molecular fragment replacement strategy can be used in combination with chemical shift information to characterize not only the native structures of proteins but also their conformational fluctuations.
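The detailed-balance property verified above is the defining feature of Metropolis-type samplers: symmetric proposals plus the Metropolis acceptance rule guarantee equilibrium sampling from the Boltzmann distribution. A toy one-dimensional sketch, with a quadratic "restraint" term standing in for the chemical-shift likelihood (all names and energy terms here are illustrative, not the paper's model):

```python
import math
import random

def metropolis(n_steps, energy, x0=0.0, step=0.5, seed=1):
    """Minimal Metropolis sampler (a toy 1-D stand-in for the Markov
    chain Monte Carlo simulations described above). Symmetric proposals
    plus the Metropolis acceptance rule satisfy detailed balance, so
    the samples follow exp(-energy(x))."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_steps):
        x_new = x + rng.uniform(-step, step)
        # Accept with probability min(1, exp(-(E_new - E_old))).
        if rng.random() < math.exp(min(0.0, energy(x) - energy(x_new))):
            x = x_new
        samples.append(x)
    return samples

# Hypothetical target: a quadratic "force-field" term plus a quadratic
# "restraint" term playing the role of the chemical-shift likelihood.
def total_energy(x, k_ff=1.0, k_restraint=4.0, x_obs=1.0):
    return 0.5 * k_ff * x**2 + 0.5 * k_restraint * (x - x_obs)**2
```

With both terms quadratic, the stationary distribution is Gaussian with mean k_restraint * x_obs / (k_ff + k_restraint) = 0.8, so the restraint pulls the equilibrium ensemble toward the "observed" value without destroying the correct statistical weights.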
Subject(s)
Computer Simulation , Molecular Models , Biomolecular Nuclear Magnetic Resonance/methods , Proteins/chemistry , Markov Chains
ABSTRACT
BACKGROUND: Modern DNA sequencing methods produce vast amounts of data that often require mapping to a reference genome. Most existing programs use the number of mismatches between the read and the genome as a measure of quality. This approach lacks a statistical foundation and can, for some data types, result in many wrongly mapped reads. Here we present a probabilistic mapping method based on position-specific scoring matrices, which can take into account not only the quality scores of the reads but also user-specified models of evolution and data-specific biases. RESULTS: We show how evolution, data-specific biases, and sequencing errors are naturally dealt with probabilistically. Our method achieves better results than Bowtie and BWA on simulated and real ancient and PAR-CLIP reads, as well as on simulated reads from the AT-rich organism P. falciparum, when modeling the biases of these data. For simulated Illumina reads, the method has consistently higher sensitivity for both single-end and paired-end data. We also show that our probabilistic approach can limit the problem of random matches from short reads of contamination and that it improves the mapping of real reads from one organism (D. melanogaster) to a related genome (D. simulans). CONCLUSION: The presented work is an implementation of a novel approach to short read mapping in which quality scores, prior mismatch probabilities, and mapping qualities are handled in a statistically sound manner. The resulting implementation provides not only a tool for biologists working with low quality and/or biased sequencing data but also a demonstration of the feasibility of using a probability based alignment method on real and simulated data sets.
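The core idea, scoring each alignment column with a probability derived from the base's quality score rather than counting mismatches, can be sketched as follows (a simplified illustration, not the authors' implementation, which additionally models evolution and data-specific biases):

```python
import math

def phred_to_error(q):
    """Phred quality Q -> probability that the base call is wrong."""
    return 10 ** (-q / 10)

def read_log_prob(read, quals, ref_window):
    """Log P(read | position): each column contributes a probability
    built from the base's quality score, i.e. a position-specific
    score rather than a mismatch count."""
    logp = 0.0
    for base, q, ref in zip(read, quals, ref_window):
        e = phred_to_error(q)
        # Correct call with probability 1 - e; each of the three
        # wrong bases with probability e / 3.
        logp += math.log(1 - e if base == ref else e / 3)
    return logp

def best_hit(read, quals, genome):
    """Scan every position and return (best position, log probability)."""
    k = len(read)
    scores = [(read_log_prob(read, quals, genome[i:i + k]), i)
              for i in range(len(genome) - k + 1)]
    logp, pos = max(scores)
    return pos, logp
```

Under this scoring, a mismatch at a low-quality base costs far less than one at a high-quality base, which is exactly the information a plain mismatch count throws away.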
Subject(s)
Position-Specific Scoring Matrices , Animals , Drosophila , Molecular Evolution , Genome , Humans , Probability , DNA Sequence Analysis , Software
ABSTRACT
We propose a method to formulate probabilistic models of protein structure in atomic detail, for a given amino acid sequence, based on Bayesian principles, while retaining a close link to physics. We start from two previously developed probabilistic models of protein structure on a local length scale, which concern the dihedral angles in main chain and side chains, respectively. Conceptually, this constitutes a probabilistic and continuous alternative to the use of discrete fragment and rotamer libraries. The local model is combined with a nonlocal model that involves a small number of energy terms according to a physical force field, and some information on the overall secondary structure content. In this initial study we focus on the formulation of the joint model and the evaluation of the use of an energy vector as a descriptor of a protein's nonlocal structure; hence, we derive the parameters of the nonlocal model from the native structure without loss of generality. The local and nonlocal models are combined using the reference ratio method, which is a well-justified probabilistic construction. For evaluation, we use the resulting joint models to predict the structure of four proteins. The results indicate that the proposed method and the probabilistic models show considerable promise for probabilistic protein structure prediction and related applications.
Subject(s)
Molecular Models , Statistical Models , Algorithms , Amino Acid Sequence , Bacterial Proteins/chemistry , Bayes Theorem , Hydrogen Bonding , Protein Secondary Structure , Protein Tertiary Structure , Protein Structural Homology , Thermodynamics
ABSTRACT
We present the theoretical foundations of a general principle to infer structure ensembles of flexible biomolecules from spatially and temporally averaged data obtained in biophysical experiments. The central idea is to compute the Kullback-Leibler optimal modification of a given prior distribution τ(x) with respect to the experimental data and its uncertainty. This principle generalizes the successful inferential structure determination method and recently proposed maximum entropy methods. Tractability of the protocol is demonstrated through the analysis of simulated nuclear magnetic resonance spectroscopy data of a small peptide.
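In the discrete, noise-free case, the Kullback-Leibler optimal modification reduces to exponentially reweighting the prior so that an averaged observable matches its measured value. A toy sketch of that special case (the paper's continuous, uncertainty-aware formulation is more general):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def reweight(prior, f, f_obs, lam_lo=-50.0, lam_hi=50.0):
    """Among all distributions reproducing the averaged observable
    <f> = f_obs, the one closest to the prior in KL divergence has the
    form p_i ∝ prior_i * exp(lam * f_i); solve for the Lagrange
    multiplier lam by bisection (a toy discrete sketch)."""
    def mean_f(lam):
        w = [pi * math.exp(lam * fi) for pi, fi in zip(prior, f)]
        z = sum(w)
        return (sum(wi * fi for wi, fi in zip(w, f)) / z,
                [wi / z for wi in w])
    for _ in range(200):
        lam = 0.5 * (lam_lo + lam_hi)
        m, p = mean_f(lam)
        if m < f_obs:   # <f> grows monotonically with lam
            lam_lo = lam
        else:
            lam_hi = lam
    return p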
Subject(s)
Biophysics , Theoretical Models , Algorithms , Computer Simulation
ABSTRACT
The development of high-throughput sequencing technologies has revolutionized the way we study genomes and gene regulation. In a single experiment, millions of reads are produced. To gain knowledge from these experiments, the first step is to find the genomic origin of the reads, i.e., to map the reads to a reference genome. In this new situation, conventional alignment tools are obsolete, as they cannot handle this huge amount of data in a reasonable amount of time. Thus, new mapping algorithms have been developed, which are fast at the expense of a small decrease in accuracy. In this chapter we discuss the current problems in short read mapping and show that mapping reads correctly is a nontrivial task. Through simple experiments with both real and synthetic data, we demonstrate that different mappers can give different results depending on the type of data, and that a considerable fraction of uniquely mapped reads is potentially mapped to an incorrect location. Furthermore, we provide simple statistical results on the expected number of random matches in a genome (E-value) and the probability of a random match as a function of read length. Finally, we show that quality scores contain valuable information for mapping and why mapping quality should be evaluated in a probabilistic manner. We conclude by discussing the potential of improving the performance of current methods by considering these quality scores in a probabilistic mapping program.
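The expected number of random matches mentioned above has a simple closed form under idealized assumptions (uniform base composition, independent positions); a sketch:

```python
import math

def expected_random_matches(genome_length, read_length):
    """E-value: expected number of perfect random matches of a read
    against both strands of a genome, assuming uniform, independent
    bases: ~2L start positions, each matching with probability (1/4)^k."""
    return 2 * genome_length * 0.25 ** read_length

def prob_random_match(genome_length, read_length):
    """Probability of at least one random match, treating start
    positions as independent (Poisson approximation)."""
    return 1 - math.exp(-expected_random_matches(genome_length, read_length))
```

For a human-sized (3 Gb) genome, a 16-mer already has an E-value of about 1.4 (a random perfect match is likely), while a 30-mer has an E-value of roughly 5e-9, which is why reads must be long enough for random matches to be negligible.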
Subject(s)
Chromosome Mapping/methods , High-Throughput Nucleotide Sequencing/methods , Genomics/methods , Humans , Probability
ABSTRACT
We present a new software framework for Markov chain Monte Carlo sampling for simulation, prediction, and inference of protein structure. The software package contains implementations of recent advances in Monte Carlo methodology, such as efficient local updates and sampling from probabilistic models of local protein structure. These models form a probabilistic alternative to the widely used fragment and rotamer libraries. Combined with an easily extendible software architecture, this makes PHAISTOS well suited for Bayesian inference of protein structure from sequence and/or experimental data. Currently, two force fields are available within the framework: PROFASI and OPLS-AA/L, the latter including the generalized Born surface area solvent model. A flexible command-line and configuration-file interface allows users to quickly set up simulations with the desired configuration. PHAISTOS is released under the GNU General Public License v3.0. Source code and documentation are freely available from http://phaistos.sourceforge.net. The software is implemented in C++ and has been tested on Linux and OS X platforms.
Subject(s)
Markov Chains , Monte Carlo Method , Proteins/chemistry , Software , Bayes Theorem , Computer Simulation , Chemical Models , Protein Conformation
ABSTRACT
Conventional methods for protein structure determination from NMR data rely on the ad hoc combination of physical force fields and experimental data, along with heuristic determination of free parameters such as the weight of the experimental data relative to the force field. Recently, a theoretically rigorous approach was developed that treats structure determination as a problem of Bayesian inference. In this case, the force fields are brought in as a prior distribution in the form of a Boltzmann factor. Due to its high computational cost, the approach has been only sparsely applied in practice. Here, we demonstrate that the use of generative probabilistic models instead of physical force fields in the Bayesian formalism is not only conceptually attractive but also improves precision and efficiency. Our results open new vistas for the use of sophisticated probabilistic models of biomolecular structure in structure determination from experimental data.
Subject(s)
Statistical Models , Biomolecular Nuclear Magnetic Resonance/methods , Protein Conformation , Proteins/chemistry , Algorithms , Bayes Theorem , Electromagnetic Fields , Molecular Models , Protein Tertiary Structure , Temperature
ABSTRACT
Understanding protein structure is of crucial importance in science, medicine and biotechnology. For about two decades, knowledge-based potentials based on pairwise distances--so-called "potentials of mean force" (PMFs)--have been center stage in the prediction and design of protein structure and the simulation of protein folding. However, the validity, scope and limitations of these potentials are still vigorously debated and disputed, and the optimal choice of the reference state--a necessary component of these potentials--is an unsolved problem. PMFs are loosely justified by analogy to the reversible work theorem in statistical physics, or by a statistical argument based on a likelihood function. Both justifications are insightful but leave many questions unanswered. Here, we show for the first time that PMFs can be seen as approximations to quantities that do have a rigorous probabilistic justification: they naturally arise when probability distributions over different features of proteins need to be combined. We call these quantities "reference ratio distributions" deriving from the application of the "reference ratio method." This new view is not only of theoretical relevance but leads to many insights that are of direct practical use: the reference state is uniquely defined and does not require external physical insights; the approach can be generalized beyond pairwise distances to arbitrary features of protein structure; and it becomes clear for which purposes the use of these quantities is justified. We illustrate these insights with two applications, involving the radius of gyration and hydrogen bonding. In the latter case, we also show how the reference ratio method can be iteratively applied to sculpt an energy funnel. Our results considerably increase the understanding and scope of energy functions derived from known biomolecular structures.
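For discrete distributions, the reference ratio construction is a one-liner: divide out the feature distribution the base model already implies and multiply in the target one. A toy sketch (the paper works with continuous structural features such as the radius of gyration; all names here are illustrative):

```python
def reference_ratio(q, f, p_f):
    """Reference ratio update: combine a base distribution q(x) over
    states with a target distribution p_f over a coarse feature f(x),
    via p(x) ∝ q(x) * p_f(f(x)) / q_f(f(x)), where q_f is the feature
    distribution induced by q (the uniquely defined reference state)."""
    # Feature distribution implied by the base model.
    q_f = {}
    for x, qx in q.items():
        q_f[f[x]] = q_f.get(f[x], 0.0) + qx
    # Reweight each state by the ratio of target to induced feature
    # probability, then renormalize.
    p = {x: qx * p_f[f[x]] / q_f[f[x]] for x, qx in q.items()}
    z = sum(p.values())
    return {x: px / z for x, px in p.items()}
```

The result reproduces the target feature distribution exactly while leaving the base model's conditional distribution within each feature class untouched, which is the probabilistic role the abstract assigns to PMF-like quantities.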
Subject(s)
Algorithms , Computational Biology/methods , Protein Conformation , Protein Folding , Hydrogen Bonding , Molecular Models , Reproducibility of Results , Thermodynamics
ABSTRACT
BACKGROUND: Accurately covering the conformational space of amino acid side chains is essential for important applications such as protein design, docking and high resolution structure prediction. Today, the most common way to capture this conformational space is through rotamer libraries - discrete collections of side chain conformations derived from experimentally determined protein structures. The discretization can be exploited to efficiently search the conformational space. However, discretizing this naturally continuous space comes at the cost of losing detailed information that is crucial for certain applications. For example, rigorously combining rotamers with physical force fields is associated with numerous problems. RESULTS: In this work we present BASILISK: a generative, probabilistic model of the conformational space of side chains that makes it possible to sample in continuous space. In addition, sampling can be conditional upon the protein's detailed backbone conformation, again in continuous space - without involving discretization. CONCLUSIONS: A careful analysis of the model and a comparison with various rotamer libraries indicates that the model forms an excellent, fully continuous model of side chain conformational space. We also illustrate how the model can be used for rigorous, unbiased sampling with a physical force field, and how it improves side chain prediction when used as a pseudo-energy term. In conclusion, BASILISK is an important step forward on the way to a rigorous probabilistic description of protein structure in continuous space and in atomic detail.
Subject(s)
Statistical Models , Proteins/chemistry , Molecular Models , Protein Conformation
ABSTRACT
The increasing importance of non-coding RNA in biology and medicine has led to a growing interest in the problem of RNA 3-D structure prediction. As is the case for proteins, RNA 3-D structure prediction methods require two key ingredients: an accurate energy function and a conformational sampling procedure. Both are only partly solved problems. Here, we focus on the problem of conformational sampling. The current state of the art solution is based on fragment assembly methods, which construct plausible conformations by stringing together short fragments obtained from experimental structures. However, the discrete nature of the fragments necessitates the use of carefully tuned, unphysical energy functions, and their non-probabilistic nature impairs unbiased sampling. We offer a solution to the sampling problem that removes these important limitations: a probabilistic model of RNA structure that allows efficient sampling of RNA conformations in continuous space, and with associated probabilities. We show that the model captures several key features of RNA structure, such as its rotameric nature and the distribution of the helix lengths. Furthermore, the model readily generates native-like 3-D conformations for 9 out of 10 test structures, solely using coarse-grained base-pairing information. In conclusion, the method provides a theoretical and practical solution for a major bottleneck on the way to routine prediction and simulation of RNA structure and dynamics in atomic detail.
Subject(s)
Statistical Models , Nucleic Acid Conformation , RNA/chemistry , Algorithms , Bayes Theorem , Computer Simulation , Nucleic Acid Databases , Three-Dimensional Imaging/methods , Markov Chains , Molecular Models , Monte Carlo Method , Software
ABSTRACT
BACKGROUND: In studies of gene regulation, the efficient computational detection of over-represented transcription factor binding sites is an increasingly important aspect. Several published methods can be used for testing whether a set of hypothesised co-regulated genes share a common regulatory regime based on the occurrence of the modelled transcription factor binding sites. However, there is little or no information available for guiding the end user's choice of method. Furthermore, it would be necessary to obtain several different software programs from various sources to make a well-founded choice. METHODOLOGY: We introduce a software package, Asap, for fast searching with position weight matrices that includes several standard methods for assessing over-representation. We have compared the ability of these methods to detect over-represented transcription factor binding sites in artificial promoter sequences. Controlling all aspects of our input data, we are able to identify the optimal statistics across multiple threshold values and for sequence sets containing different distributions of transcription factor binding sites. CONCLUSIONS: We show that our implementation is significantly faster than more naïve scanning algorithms when searching with many weight matrices in large sequence sets. When comparing the various statistics, we show that those based on binomial over-representation and Fisher's exact test perform almost equally well and outperform the others. An online server is available at http://servers.binf.ku.dk/asap/.
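The Fisher's exact statistic singled out above is the hypergeometric upper tail of a 2×2 table. A from-scratch sketch for a motif over-representation test (variable names are illustrative, not from Asap):

```python
from math import comb

def fisher_exact_enrichment(hits_fg, n_fg, hits_bg, n_bg):
    """One-sided Fisher's exact test for over-representation: the
    p-value that the foreground set contains >= hits_fg motif-hit
    sequences by chance, given the pooled counts, computed as the
    upper tail of the hypergeometric distribution."""
    k_total = hits_fg + hits_bg   # hits in the pooled sets
    n_total = n_fg + n_bg
    denom = comb(n_total, n_fg)
    p = 0.0
    for k in range(hits_fg, min(k_total, n_fg) + 1):
        # P(exactly k of the pooled hits fall in the foreground set)
        p += comb(k_total, k) * comb(n_total - k_total, n_fg - k) / denom
    return p
```

For example, if all 3 of 3 foreground promoters contain the motif and 0 of 3 background promoters do, the one-sided p-value is C(3,3)·C(3,0)/C(6,3) = 1/20 = 0.05.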