Results 1 - 20 of 81
1.
Article in English | MEDLINE | ID: mdl-38621825

ABSTRACT

Over the years, many computational methods have been created for the analysis of the impact of single amino acid substitutions resulting from single-nucleotide variants in genome coding regions. Historically, all methods have been supervised and thus limited by the inadequate sizes of experimentally curated data sets and by the lack of a standardized definition of variant effect. The emergence of unsupervised, deep learning (DL)-based methods raised an important question: Can machines learn the language of life from the unannotated protein sequence data well enough to identify significant errors in the protein "sentences"? Our analysis suggests that some unsupervised methods perform as well as or better than existing supervised methods. Unsupervised methods are also faster and can, thus, be useful in large-scale variant evaluations. For all other methods, however, performance varies both by evaluation metric and by the type of variant effect being predicted. We also note that the evaluation of method performance is still lacking on less-studied, nonhuman proteins where unsupervised methods hold the most promise.


Subject(s)
Computational Biology , Machine Learning , Computational Biology/methods , Humans , Proteins , Amino Acid Substitution , Polymorphism, Single Nucleotide , Deep Learning
2.
Pac Symp Biocomput ; 29: 446-449, 2024.
Article in English | MEDLINE | ID: mdl-38160298

ABSTRACT

Precision medicine, also often referred to as personalized medicine, targets the development of treatments and preventative measures specific to the individual's genomic signatures, lifestyle, and environmental conditions. The series of Precision Medicine sessions in PSB has continuously highlighted the advances in this field. Our 2024 collection of manuscripts showcases algorithmic advances that integrate data from distinct modalities and introduce innovative approaches to extract new, medically relevant information from existing data. These evolving technologies and analytical methods promise to bring closer the goals of precision medicine to improve health and increase lifespan.


Subject(s)
Computational Biology , Precision Medicine , Humans , Precision Medicine/methods , Genomics
3.
Nucleic Acids Res ; 51(19): 10162-10175, 2023 10 27.
Article in English | MEDLINE | ID: mdl-37739408

ABSTRACT

Determining the repertoire of a microbe's molecular functions is a central question in microbial biology. Modern techniques achieve this goal by comparing microbial genetic material against reference databases of functionally annotated genes/proteins or known taxonomic markers such as 16S rRNA. Here, we describe a novel approach to exploring bacterial functional repertoires without reference databases. Our Fusion scheme establishes functional relationships between bacteria and assigns organisms to Fusion-taxa that differ from otherwise defined taxonomic clades. Three key findings of our work stand out. First, bacterial functional comparisons outperform marker genes in assigning taxonomic clades. Fusion profiles are also better for this task than other functional annotation schemes. Second, Fusion-taxa are robust to addition of novel organisms and are, arguably, able to capture the environment-driven bacterial diversity. Finally, our alignment-free nucleic acid-based Siamese Neural Network model, created using Fusion functions, enables finding shared functionality of very distant, possibly structurally different, microbial homologs. Our work can thus help annotate functional repertoires of bacterial organisms and further guide our understanding of microbial communities.


Subject(s)
Bacteria , Bacteria/cytology , Bacteria/genetics , Databases, Factual , Microbiota , Phylogeny , RNA, Ribosomal, 16S/genetics , Bacterial Physiological Phenomena
5.
Genes (Basel) ; 13(5)2022 04 27.
Article in English | MEDLINE | ID: mdl-35627162

ABSTRACT

Synonymous single nucleotide variants (sSNVs) are often considered functionally silent, but a few cases of cancer-causing sSNVs have been reported. From available databases, we collected four categories of sSNVs: germline, somatic in normal tissues, somatic in cancerous tissues, and putative cancer drivers. We found that screening sSNVs for recurrence among patients, conservation of the affected genomic position, and synVep prediction (synVep is a machine learning-based sSNV effect predictor) recovers cancer driver variants (termed proposed drivers) and previously unknown putative cancer genes. Of the 2.9 million somatic sSNVs found in the COSMIC database, we identified 2111 proposed cancer driver sSNVs. Of these, 326 sSNVs could be further tagged for possible RNA splicing effects, RNA structural changes, and affected RBP motifs. This list of proposed cancer driver sSNVs provides computational guidance in prioritizing the experimental evaluation of synonymous mutations found in cancers. Furthermore, our list of novel potential cancer genes, galvanized by synonymous mutations, may highlight yet unexplored cancer mechanisms.


Subject(s)
Neoplasms , Silent Mutation , Genomics , Humans , Neoplasms/genetics , Oncogenes , RNA Splicing
6.
Hum Genet ; 141(10): 1615-1627, 2022 Oct.
Article in English | MEDLINE | ID: mdl-35347416

ABSTRACT

Infertility is a major reproductive health issue that affects about 12% of women of reproductive age in the United States. Aneuploidy in eggs accounts for a significant proportion of early miscarriage and in vitro fertilization failure. Recent studies have shown that genetic variants in several genes affect chromosome segregation fidelity and predispose women to a higher incidence of egg aneuploidy. However, the exact genetic causes of aneuploid egg production remain unclear, making it difficult to diagnose infertility based on individual genetic variants in the mother's genome. In this study, we evaluated machine learning-based classifiers for predicting the embryonic aneuploidy risk in female IVF patients using whole-exome sequencing data. Using two exome datasets, we obtained an area under the receiver operating characteristic curve of 0.77 and 0.68, respectively. High precision could be traded off for high specificity in classifying patients by selecting different prediction score cutoffs. For example, a strict prediction score cutoff of 0.7 identified 29% of patients as high-risk with 94% precision. In addition, we identified MCM5, FGGY, and DDX60L as potential aneuploidy risk genes that contribute the most to the predictive power of the model. These candidate genes and their molecular interaction partners are enriched for meiotic-related gene ontology categories and pathways, such as microtubule organizing center and DNA recombination. In summary, we demonstrate that sequencing data can be mined to predict patients' aneuploidy risk, thus improving clinical diagnosis. The candidate genes and pathways we identified are promising targets for future aneuploidy studies.


Subject(s)
Infertility , Preimplantation Diagnosis , Aneuploidy , DNA , Female , Fertilization in Vitro , Humans , Pregnancy , Exome Sequencing
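
The cutoff trade-off mentioned in the abstract above can be illustrated with a minimal sketch. The scores and labels below are made-up toy values, and `precision_at_cutoff` is a hypothetical helper, not code from the study; it only shows how a stricter score cutoff flags fewer patients while precision among the flagged patients can rise.

```python
def precision_at_cutoff(scores, labels, cutoff):
    """Return (fraction of patients flagged, precision among flagged).

    scores: predicted risk scores in [0, 1]
    labels: 1 for truly high-risk patients, 0 otherwise
    """
    flagged = [label for score, label in zip(scores, labels) if score >= cutoff]
    if not flagged:
        return 0.0, None  # precision is undefined when nothing is flagged
    return len(flagged) / len(scores), sum(flagged) / len(flagged)


# Toy data (hypothetical, for illustration only).
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

for cutoff in (0.5, 0.7):
    fraction, precision = precision_at_cutoff(scores, labels, cutoff)
    print(f"cutoff={cutoff}: flagged={fraction:.3f}, precision={precision:.2f}")
```

On this toy data, the stricter cutoff flags a smaller share of patients but every flagged patient is a true positive, which mirrors the kind of trade-off the abstract reports.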
7.
Nat Rev Genet ; 23(6): 322-323, 2022 06.
Article in English | MEDLINE | ID: mdl-35338359

Subject(s)
Algorithms
8.
Sci Adv ; 8(2): eabj3984, 2022 Jan 14.
Article in English | MEDLINE | ID: mdl-35030025

ABSTRACT

Biological redox reactions drive planetary biogeochemical cycles. Using a novel, structure-guided sequence analysis of proteins, we explored the patterns of evolution of enzymes responsible for these reactions. Our analysis reveals that the folds that bind transition metal-containing ligands have similar structural geometry and amino acid sequences across the full diversity of proteins. Similarity across folds reflects the availability of key transition metals over geological time and strongly suggests that transition metal-ligand binding had a small number of common peptide origins. We observe that structures central to our similarity network come primarily from oxidoreductases, suggesting that ancestral peptides may have also facilitated electron transfer reactions. Last, our results reveal that the earliest biologically functional peptides were likely available before the assembly of fully functional protein domains over 3.8 billion years ago.

Thus, life is a special, very complex form of motion of matter, but this form did not always exist, and it is not separated from inorganic nature by an impassable abyss; rather, it arose from inorganic nature as a new property in the process of evolution of the world. We must study the history of this evolution if we want to solve the problem of the origin of life. [A. I. Oparin (1)]

9.
Nucleic Acids Res ; 49(22): 12673-12691, 2021 12 16.
Article in English | MEDLINE | ID: mdl-34850938

ABSTRACT

Synonymous single nucleotide variants (sSNVs) are common in the human genome but are often overlooked. However, sSNVs can have significant biological impact and may lead to disease. Existing computational methods for evaluating the effect of sSNVs suffer from the lack of gold-standard training/evaluation data and exhibit over-reliance on sequence conservation signals. We developed synVep (synonymous Variant effect predictor), a machine learning-based method that overcomes both of these limitations. Our training data was a combination of variants reported by gnomAD (observed) and those unreported, but possible in the human genome (generated). We used positive-unlabeled learning to purify the generated variant set of any likely unobservable variants. We then trained two sequential extreme gradient boosting models to identify subsets of the remaining variants putatively enriched and depleted in effect. Our method attained 90% precision/recall on a previously unseen set of variants. Furthermore, although synVep does not explicitly use conservation, its scores correlated with evolutionary distances between orthologs in cross-species variation analysis. synVep was also able to differentiate pathogenic vs. benign variants, as well as splice-site disrupting variants (SDV) vs. non-SDVs. Thus, synVep provides an important improvement in annotation of sSNVs, allowing users to focus on variants that most likely harbor effects.


Subject(s)
Genetic Variation , Machine Learning , Disease/genetics , Genome, Human , Humans , Proteins/genetics , RNA Splicing
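
To make the idea of "generated" variants (unreported, but possible) concrete, here is a minimal sketch that enumerates every single-nucleotide substitution in a short coding sequence and marks the synonymous ones using the standard genetic code. This is an illustration under stated assumptions, not the synVep pipeline: `enumerate_snvs` and the example sequence are hypothetical, and the real method works on annotated human transcripts.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def enumerate_snvs(cds):
    """List every single-nucleotide substitution in a coding sequence.

    Returns tuples (position, ref_base, alt_base, is_synonymous) with
    0-based positions. Assumes len(cds) is a multiple of 3.
    """
    variants = []
    for pos, ref in enumerate(cds):
        offset = pos % 3
        ref_codon = cds[pos - offset:pos - offset + 3]
        for alt in BASES:
            if alt == ref:
                continue
            alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
            synonymous = CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]
            variants.append((pos, ref, alt, synonymous))
    return variants


variants = enumerate_snvs("ATGGGT")  # toy sequence encoding Met-Gly
synonymous = [v for v in variants if v[3]]
print(len(variants), len(synonymous))
```

In this toy sequence only third-position changes in the glycine codon are synonymous; methionine has a single codon, so none of its substitutions are.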
11.
Nucleic Acids Res ; 49(W1): W535-W540, 2021 07 02.
Article in English | MEDLINE | ID: mdl-33999203

ABSTRACT

Since 1992, PredictProtein (https://predictprotein.org) has been a one-stop online resource for protein sequence analysis, with its main site hosted at the Luxembourg Centre for Systems Biomedicine (LCSB) and queried monthly by over 3,000 users in 2020. PredictProtein was the first Internet server for protein predictions. It pioneered combining evolutionary information and machine learning. Given a protein sequence as input, the server outputs multiple sequence alignments, predictions of protein structure in 1D and 2D (secondary structure, solvent accessibility, transmembrane segments, disordered regions, protein flexibility, and disulfide bridges) and predictions of protein function (functional effects of sequence variation or point mutations, Gene Ontology (GO) terms, subcellular localization, and protein-, RNA-, and DNA binding). PredictProtein's infrastructure has moved to the LCSB, increasing throughput; the use of MMseqs2 sequence search reduced runtime five-fold (apparently without lowering performance of prediction methods); user interface elements improved usability; and new prediction methods were added. PredictProtein recently included predictions from deep learning embeddings (GO and secondary structure) and a method for the prediction of proteins and residues binding DNA, RNA, or other proteins. PredictProtein.org aspires to provide reliable predictions to computational and experimental biologists alike. All scripts and methods are freely available for offline execution in high-throughput settings.


Subject(s)
Protein Conformation , Software , Binding Sites , Coronavirus Nucleocapsid Proteins/chemistry , DNA-Binding Proteins/chemistry , Phosphoproteins/chemistry , Protein Structure, Secondary , Proteins/chemistry , Proteins/physiology , RNA-Binding Proteins/chemistry , Sequence Alignment , Sequence Analysis, Protein
12.
Front Mol Biosci ; 8: 635382, 2021.
Article in English | MEDLINE | ID: mdl-33816556

ABSTRACT

Non-synonymous Single Nucleotide Variants (nsSNVs), resulting in single amino acid variants (SAVs), are important drivers of evolutionary adaptation across the tree of life. Humans carry on average over 10,000 SAVs per individual genome, many of which likely have little to no impact on the function of the protein they affect. Experimental evidence for protein function changes as a result of SAVs remains sparse, a situation that can be somewhat alleviated by predicting their impact using computational methods. Here, we used SNAP to examine both observed and in silico generated human variation in a set of 1,265 proteins that are consistently found across a number of diverse species. The number of SAVs that are predicted to have any functional effect on these proteins is smaller than expected, suggesting sequence/function optimization over evolutionary timescales. Additionally, we find that only a few of the yet-unobserved SAVs could drastically change the function of these proteins, while nearly a quarter would have only a mild functional effect. We observed that variants common in the human population localized to less conserved protein positions and carried mild to moderate functional effects more frequently than rare variants. As expected, rare variants carried severe effects more frequently than common variants. In line with current assumptions, we demonstrated that the change of the human reference sequence amino acid to the reference of another species (a cross-species variant) is unlikely to significantly impact protein function. However, we also observed that many cross-species variants may be weakly non-neutral for the purposes of quick adaptation to environmental changes, but may not be identified as such by current state-of-the-art methodology.

13.
J Lipid Res ; 62: 100046, 2021.
Article in English | MEDLINE | ID: mdl-33587919

ABSTRACT

Lecithin:retinol acyltransferase and retinol-binding protein enable vitamin A (VA) storage and transport, respectively, maintaining tissue homeostasis of retinoids (VA derivatives). The precarious VA status of the lecithin:retinol acyltransferase-deficient (Lrat-/-) retinol-binding protein-deficient (Rbp-/-) mice rapidly deteriorates upon dietary VA restriction, leading to signs of severe vitamin A deficiency (VAD). As retinoids impact gut morphology and functions, VAD is often linked to intestinal pathological conditions and microbial dysbiosis. Thus, we investigated the contribution of VA storage and transport to intestinal retinoid homeostasis and functionalities. We showed the occurrence of intestinal VAD in Lrat-/-Rbp-/- mice, demonstrating the critical role of both pathways in preserving gut retinoid homeostasis. Moreover, in the mutant colon, VAD resulted in a compromised intestinal barrier as manifested by reduced mucins and antimicrobial defense, leaky gut, increased inflammation and oxidative stress, and altered mucosal immunocytokine profiles. These perturbations were accompanied by fecal dysbiosis, revealing that the VA status (sufficient vs. deficient), rather than the amount of dietary VA per se, is likely a major initial discriminant of the intestinal microbiome. Our data also pointed to a specific fecal taxonomic profile and distinct microbial functionalities associated with VAD. Overall, our findings revealed the suitability of the Lrat-/-Rbp-/- mice as a model to study intestinal dysfunctions and dysbiosis promoted by changes in tissue retinoid homeostasis induced by the host VA status and/or intake.


Subject(s)
Vitamin A
15.
Microbiologyopen ; 9(9): e1100, 2020 09.
Article in English | MEDLINE | ID: mdl-32762019

ABSTRACT

Microbes active in extreme cold are not as well explored as those of other extreme environments. Studies have revealed a substantial microbial diversity and identified cold-specific microbiome molecular functions. We analyzed the metagenomes and metatranscriptomes of 20 snow samples collected in early and late spring in Svalbard, Norway using mi-faser, our read-based computational microbiome function annotation tool. Our results reveal a more diverse microbiome functional capacity and activity in the early- vs. late-spring samples. We also find that functional dissimilarity between the same-sample metagenomes and metatranscriptomes is significantly higher in early than late spring samples. These findings suggest that early spring samples may contain a larger fraction of DNA of dormant (or dead) organisms, while late spring samples reflect a new, metabolically active community. We further show that the abundance of sequencing reads mapping to the fatty acid synthesis-related microbial pathways in late spring metagenomes and metatranscriptomes is significantly correlated with the organic acid levels measured in these samples. Similarly, the organic acid levels correlate with the pathway read abundances of geraniol degradation and inversely correlate with those of styrene degradation, suggesting a possible nutrient change. Our study thus highlights the activity of microbial degradation pathways of complex organic compounds previously unreported at low temperatures.


Subject(s)
Bacteria/metabolism , Microbiota/physiology , Organic Chemicals/metabolism , Snow/microbiology , Acyclic Monoterpenes/metabolism , Carbon/metabolism , Fatty Acids/biosynthesis , Metabolic Networks and Pathways , Metagenome , Microbiota/genetics , Norway , Seasons , Styrene/metabolism , Transcriptome
16.
BMC Bioinformatics ; 21(1): 235, 2020 Jun 09.
Article in English | MEDLINE | ID: mdl-32517697

ABSTRACT

BACKGROUND: The number of applications of deep learning algorithms in bioinformatics is increasing as they usually achieve superior performance over classical approaches, especially when larger training datasets are available. In deep learning applications, discrete data, e.g. words or n-grams in language, or amino acids or nucleotides in bioinformatics, are generally represented as a continuous vector through an embedding matrix. Recently, learning this embedding matrix directly from the data as part of the continuous iteration of the model to optimize the target prediction (a process called 'end-to-end learning') has led to state-of-the-art results in many fields. Although usage of embeddings is well described in the bioinformatics literature, the potential of end-to-end learning for single amino acids, as compared to more classical manually-curated encoding strategies, has not been systematically addressed. To this end, we compared classical encoding matrices, namely one-hot, VHSE8 and BLOSUM62, to end-to-end learning of amino acid embeddings for two different prediction tasks using three widely used architectures, namely recurrent neural networks (RNN), convolutional neural networks (CNN), and the hybrid CNN-RNN. RESULTS: By using different deep learning architectures, we show that end-to-end learning is on par with classical encodings for embeddings of the same dimension even when limited training data is available, and might allow for a reduction in the embedding dimension without performance loss, which is critical when deploying the models to devices with limited computational capacities. We found that the embedding dimension is a major factor in controlling the model performance. Surprisingly, we observed that deep learning models are capable of learning from random vectors of appropriate dimension. CONCLUSION: Our study shows that end-to-end learning is a flexible and powerful method for amino acid encoding. Further, due to the flexibility of deep learning systems, amino acid encoding schemes should be benchmarked against random vectors of the same dimension to disentangle the information content provided by the encoding scheme from the distinguishability effect provided by the scheme.


Subject(s)
Amino Acids/metabolism , Computational Biology/methods , Deep Learning/standards , Humans
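
The comparison the abstract recommends, a classical encoding versus random vectors of the same dimension, can be sketched as follows. This is a minimal illustration only: the helper names are hypothetical, VHSE8 and BLOSUM62 are omitted, and a trained embedding layer (the end-to-end case) is not shown because it would require a deep learning framework.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot_table():
    """Map each amino acid to a 20-dimensional one-hot vector."""
    return {
        aa: [1.0 if i == j else 0.0 for j in range(len(AMINO_ACIDS))]
        for i, aa in enumerate(AMINO_ACIDS)
    }


def random_table(dim, seed=0):
    """Map each amino acid to a fixed random vector of the given dimension."""
    rng = random.Random(seed)
    return {aa: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for aa in AMINO_ACIDS}


def encode(sequence, table):
    """Encode a protein sequence as a list of per-residue vectors."""
    return [table[aa] for aa in sequence]


peptide = "MKV"  # toy input
one_hot = encode(peptide, one_hot_table())
baseline = encode(peptide, random_table(dim=20))
print(len(one_hot), len(one_hot[0]), len(baseline[0]))
```

Both encodings give each residue a distinct vector of the same dimension, so a model trained on the random baseline indicates how much performance comes from distinguishability alone rather than from information carried by the encoding.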
17.
Biol Direct ; 14(1): 19, 2019 10 30.
Article in English | MEDLINE | ID: mdl-31666099

ABSTRACT

BACKGROUND: Accumulating evidence suggests that the human microbiome impacts individual and public health. City subway systems are human-dense environments, where passengers often exchange microbes. The MetaSUB project participants collected samples from subway surfaces in different cities and performed metagenomic sequencing. Previous studies focused on taxonomic composition of these microbiomes and no explicit functional analysis had been done until now. RESULTS: As a part of the 2018 CAMDA challenge, we functionally profiled the available ~400 subway metagenomes and built a predictor of city origin. In cross-validation, our model reached 81% accuracy when only the top-ranked city assignment was considered and 95% accuracy if the second city was taken into account as well. Notably, this performance was only achievable if the distribution of cities in the training and testing sets was similar. To ensure that our methods are applicable without such biased assumptions, we balanced our training data to account for all represented cities equally well. After balancing, the performance of our method was slightly lower (76/94%, respectively, for one or two top-ranked cities), but still consistently high. Here we attained an added benefit of independence of training set city representation. In testing, our unbalanced model thus reached (an over-estimated) performance of 90/97%, while our balanced model was at a more reliable 63/90% accuracy. While, by definition of our model, we were not able to predict previously unseen microbiome origins, our balanced model correctly judged them to be NOT-from-training-cities over 80% of the time. Our function-based outlook on microbiomes also allowed us to note similarities between both regionally close and far-away cities. Curiously, we identified the depletion in mycobacterial functions as a signature of cities in New Zealand, while photosynthesis-related functions fingerprinted New York, Porto and Tokyo. CONCLUSIONS: We demonstrated the power of our high-speed function annotation method, mi-faser, by analysing ~400 shotgun metagenomes in 2 days, with the results recapitulating functional signals of different city subway microbiomes. We also showed the importance of balanced data in avoiding over-estimated performance. Our results revealed similarities between both geographically close (Ofa and Ilorin) and distant (Boston and Porto, Lisbon and New York) city subway microbiomes. The photosynthesis-related functional signatures of NYC were previously unseen in taxonomy studies, highlighting the strength of functional analysis.


Subject(s)
Metagenome , Microbiota , Railroads , Cities
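
The paired accuracies in the abstract above (top-ranked city only versus top two cities) correspond to top-1 and top-2 accuracy. A minimal sketch of that metric follows; the ranked predictions are invented toy values and `top_k_accuracy` is a hypothetical helper, not code from the study.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of samples whose true label is among the k top-ranked predictions."""
    hits = sum(
        true in ranked[:k]
        for ranked, true in zip(ranked_predictions, true_labels)
    )
    return hits / len(true_labels)


# Toy data (hypothetical): each inner list ranks cities from most to least likely.
ranked = [
    ["Boston", "Porto"],
    ["Tokyo", "New York"],
    ["Porto", "Lisbon"],
    ["New York", "Tokyo"],
]
truth = ["Boston", "New York", "Lisbon", "Porto"]

print(top_k_accuracy(ranked, truth, 1), top_k_accuracy(ranked, truth, 2))
```

Top-2 accuracy can never be lower than top-1 accuracy on the same predictions, which is why the second figure in each reported pair is the larger one.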
18.
Front Genet ; 10: 914, 2019.
Article in English | MEDLINE | ID: mdl-31649718

ABSTRACT

Recent advances in high-throughput experimentation have put the exploration of genome sequences at the forefront of precision medicine. In an effort to interpret the sequencing data, numerous computational methods have been developed for evaluating the effects of genome variants. Interestingly, despite the fact that every person has as many synonymous (sSNV) as non-synonymous single nucleotide variants, our ability to predict their effects is limited. The paucity of experimentally tested sSNV effects appears to be the limiting factor in development of such methods. Here, we summarize the details and evaluate the performance of nine existing computational methods capable of predicting sSNV effects. We used a set of observed and artificially generated variants to approximate large scale performance expectations of these tools. We note that the distribution of these variants across amino acid and codon types suggests purifying evolutionary selection retaining generated variants out of the observed set; i.e., we expect the generated set to be enriched for deleterious variants. Closer inspection of the relationship between the observed variant frequencies and the associated prediction scores identifies predictor-specific scoring thresholds of reliable effect predictions. Notably, across all predictors, the variants scoring above these thresholds were significantly more often generated than observed, which confirms our assumption that the generated set is enriched for deleterious variants. Finally, we find that while the methods differ in their ability to identify severe sSNV effects, no predictor appears capable of definitively recognizing subtle effects of such variants on a large scale.

19.
Nucleic Acids Res ; 47(21): e142, 2019 12 02.
Article in English | MEDLINE | ID: mdl-31584091

ABSTRACT

Evaluating the impact of non-synonymous genetic variants is essential for uncovering disease associations and mechanisms of evolution. An in-depth understanding of sequence changes is also fundamental for synthetic protein design and stability assessments. However, the variant effect predictor performance gain observed in recent years has not kept up with the increased complexity of new methods. One likely reason for this might be that most approaches use similar sets of gene and protein features for modeling variant effects, often emphasizing sequence conservation. While high levels of conservation highlight residues essential for protein activity, much of the variation observable in vivo is arguably weaker in its impact, thus requiring evaluation at a higher level of resolution. Here, we describe function Neutral/Toggle/Rheostat predictor (funtrp), a novel computational method that categorizes protein positions based on the position-specific expected range of mutational impacts: Neutral (weak/no effects), Rheostat (function-tuning positions), or Toggle (on/off switches). We show that position types do not correlate strongly with familiar protein features such as conservation or protein disorder. We also find that position type distribution varies across different protein functions. Finally, we demonstrate that position types can improve performance of existing variant effect predictors and suggest a way forward for the development of new ones.


Subject(s)
Computational Biology/methods , Conserved Sequence/genetics , Mutation/genetics , Proteins , Amino Acid Sequence/genetics , Base Sequence/genetics , Databases, Protein , Humans , Models, Molecular , Proteins/chemistry , Proteins/genetics , Structure-Activity Relationship
20.
Genome Med ; 11(1): 59, 2019 09 30.
Article in English | MEDLINE | ID: mdl-31564248

ABSTRACT

BACKGROUND: After years of concentrated research efforts, the exact cause of Crohn's disease (CD) remains unknown. Its accurate diagnosis, however, helps in management and preventing the onset of disease. Genome-wide association studies have identified 241 CD loci, but these carry small log odds ratios and are thus diagnostically uninformative. METHODS: Here, we describe a machine learning method, AVA,Dx (Analysis of Variation for Association with Disease), that uses exonic variants from whole exome or genome sequencing data to extract CD signal and predict CD status. Using the person-specific coding variation in genes from a panel of only 111 individuals, we built disease-prediction models informative of previously undiscovered disease genes. By additionally accounting for batch effects, we were able to accurately predict CD status for thousands of previously unseen individuals from other panels. RESULTS: AVA,Dx highlighted known CD genes including NOD2 and new potential CD genes. AVA,Dx identified 16% (at strict cutoff) of CD patients at 99% precision and 58% of the patients (at default cutoff) with 82% precision in over 3000 individuals from separately sequenced panels. CONCLUSIONS: Larger training panels and additional features, including other types of genetic variants and environmental factors, e.g., human-associated microbiota, may improve model performance. However, the results presented here already position AVA,Dx as both an effective method for revealing pathogenesis pathways and as a CD risk analysis tool, which can improve clinical diagnostic time and accuracy. Links to the AVA,Dx Docker image and the BitBucket source code are at https://bromberglab.org/project/avadx/.


Subject(s)
Crohn Disease/diagnosis , Exome/genetics , Genetic Markers , Genetic Predisposition to Disease , Metagenome , Polymorphism, Single Nucleotide , Crohn Disease/genetics , Crohn Disease/microbiology , Genome-Wide Association Study , Humans , Machine Learning , Prognosis