1.
J Biomed Inform ; 125: 103971, 2022 01.
Article in English | MEDLINE | ID: mdl-34920127

ABSTRACT

OBJECTIVE: Quantify tradeoffs in performance, reproducibility, and resource demands across several strategies for developing clinically relevant word embeddings. MATERIALS AND METHODS: We trained separate embeddings on all full-text manuscripts in the PubMed Central (PMC) Open Access subset, case reports therein, the English Wikipedia corpus, the Medical Information Mart for Intensive Care (MIMIC) III dataset, and all notes in the University of Pennsylvania Health System (UPHS) electronic health record. We tested embeddings in six clinically relevant tasks including mortality prediction and de-identification, and assessed performance using the scaled Brier score (SBS) and the proportion of notes successfully de-identified, respectively. RESULTS: Embeddings from UPHS notes best predicted mortality (SBS 0.30, 95% CI 0.15 to 0.45) while Wikipedia embeddings performed worst (SBS 0.12, 95% CI -0.05 to 0.28). Wikipedia embeddings most consistently (78% of notes) and the full PMC corpus embeddings least consistently (48%) de-identified notes. Across all six tasks, the full PMC corpus demonstrated the most consistent performance, and the Wikipedia corpus the least. Corpus size ranged from 49 million tokens (PMC case reports) to 10 billion (UPHS). DISCUSSION: Embeddings trained on published case reports performed at least as well as embeddings trained on other corpora in most tasks, and clinical corpora consistently outperformed non-clinical corpora. No single corpus produced a strictly dominant set of embeddings across all tasks and so the optimal training corpus depends on intended use. CONCLUSION: Embeddings trained on published case reports performed comparably on most clinical tasks to embeddings trained on larger corpora. Open access corpora allow training of clinically relevant, effective, and reproducible embeddings.
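
The abstract reports mortality prediction with the scaled Brier score (SBS) but does not spell out the formula. A minimal sketch, assuming the common definition SBS = 1 - BS / BS_ref, where BS_ref is the Brier score of a model that always predicts the observed event rate. The function names and toy data are illustrative and are not taken from the paper.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome (0 or 1)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def scaled_brier_score(y_true, y_prob):
    """Scaled Brier score: 1 is perfect, 0 matches the prevalence-only
    reference model, and negative values are worse than that reference.

    Assumes the reference model predicts the observed event rate for
    every case (a common convention, not confirmed by the abstract).
    """
    prevalence = sum(y_true) / len(y_true)
    reference = brier_score(y_true, [prevalence] * len(y_true))
    if reference == 0:
        raise ValueError("SBS is undefined when all outcomes are identical")
    return 1.0 - brier_score(y_true, y_prob) / reference


if __name__ == "__main__":
    outcomes = [1, 0, 1, 0]              # toy labels, e.g. died / survived
    predictions = [0.8, 0.3, 0.6, 0.1]   # toy predicted probabilities
    print(scaled_brier_score(outcomes, predictions))
```

Under this definition an SBS of 0.30 means the model's Brier score is 30% lower than that of the prevalence-only reference, and a confidence interval that includes 0 (as reported for the Wikipedia embeddings) is compatible with no improvement over that reference.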


Subject(s)
Electronic Health Records , Publications , Humans , Natural Language Processing , PubMed , Reproducibility of Results
2.
Nucleic Acids Res ; 47(W1): W183-W190, 2019 07 02.
Article in English | MEDLINE | ID: mdl-31069376

ABSTRACT

High-throughput experiments produce increasingly large datasets that are difficult to analyze and integrate. While most data integration approaches focus on aligning metadata, data integration can be achieved by abstracting experimental results into gene sets. Such gene sets can be made available for reuse through gene set enrichment analysis tools such as Enrichr. Enrichr currently only supports gene sets compiled from human and mouse, limiting accessibility for investigators who study other model organisms. modEnrichr is an expansion of Enrichr for four model organisms: fish, fly, worm and yeast. The gene set libraries within FishEnrichr, FlyEnrichr, WormEnrichr and YeastEnrichr are created from the Gene Ontology, mRNA expression profiles, GeneRIF, pathway databases, protein domain databases and other organism-specific resources. Additionally, libraries were created by predicting gene function from RNA-seq co-expression data processed uniformly from the Gene Expression Omnibus for each organism. The modEnrichr suite of tools provides the ability to convert gene lists across species using an ortholog conversion tool that automatically detects the species. For complex analyses, modEnrichr provides API access that enables submitting batch queries. In summary, modEnrichr leverages existing model organism databases and other resources to facilitate comprehensive hypothesis generation through data integration.
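
To make the idea of gene set enrichment concrete, here is a minimal sketch of a one-sided over-representation test based on the hypergeometric distribution (equivalent to a one-sided Fisher's exact test). It is a generic illustration, not modEnrichr's actual scoring or API; the function name, gene identifiers, and background size are hypothetical.

```python
from math import comb


def enrichment_p_value(query_genes, library_gene_set, background_size):
    """Return (overlap, p) where p is the probability of seeing at least
    `overlap` shared genes if the query were drawn at random, without
    replacement, from a background of `background_size` genes.

    Assumes both gene collections are subsets of the same background.
    """
    query = set(query_genes)
    gene_set = set(library_gene_set)
    overlap = len(query & gene_set)
    n, k_total, n_total = len(query), len(gene_set), background_size
    if n > n_total or k_total > n_total:
        raise ValueError("background must be at least as large as each gene list")
    # Upper tail of the hypergeometric distribution, using exact integers.
    favorable = sum(
        comb(k_total, i) * comb(n_total - k_total, n - i)
        for i in range(overlap, min(n, k_total) + 1)
    )
    return overlap, favorable / comb(n_total, n)


if __name__ == "__main__":
    # Hypothetical identifiers; a real query would use organism gene symbols.
    query = ["g1", "g2", "g3"]
    library_term = ["g1", "g2", "g3", "g4"]
    print(enrichment_p_value(query, library_term, background_size=10))
```

In practice a query is tested against every term in a library, so the resulting p-values would also need a multiple-testing correction.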


Subject(s)
Genetic Databases , Gene Expression/genetics , Gene Library , Genomic Library , Software , Animals , Computational Biology , Gene Ontology , Humans , Metadata
3.
Nat Microbiol ; 9(2): 537-549, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38287147

ABSTRACT

Viral genomes are poorly annotated in metagenomic samples, representing an obstacle to understanding viral diversity and function. Current annotation approaches rely on alignment-based sequence homology methods, which are limited by the paucity of characterized viral proteins and divergence among viral sequences. Here we show that protein language models can capture prokaryotic viral protein function, enabling new portions of viral sequence space to be assigned biologically meaningful labels. When applied to global ocean virome data, our classifier expanded the annotated fraction of viral protein families by 29%. Among previously unannotated sequences, we highlight the identification of an integrase defining a mobile element in marine picocyanobacteria and a capsid protein that anchors globally widespread viral elements. Furthermore, improved high-level functional annotation provides a means to characterize similarities in genomic organization among diverse viral sequences. Protein language models thus enhance remote homology detection of viral proteins, serving as a useful complement to existing approaches.


Subject(s)
Prokaryotic Cells , Viral Proteins , Viral Proteins/genetics , Genomics , Capsid Proteins/genetics , Metagenomics
4.
Res Sq ; 2023 May 02.
Article in English | MEDLINE | ID: mdl-37205395

ABSTRACT

Viral sequences are poorly annotated in environmental samples, a major roadblock to understanding how viruses influence microbial community structure. Current annotation approaches rely on alignment-based sequence homology methods, which are limited by available viral sequences and sequence divergence in viral proteins. Here, we show that protein language model representations capture viral protein function beyond the limits of remote sequence homology by targeting two axes of viral sequence annotation: systematic labeling of protein families and function identification for biologic discovery. Protein language model representations capture protein functional properties specific to viruses and expand the annotated fraction of ocean virome viral protein sequences by 37%. Among unannotated viral protein families, we identify a novel DNA editing protein family that defines a new mobile element in marine picocyanobacteria. Protein language models thus significantly enhance remote homology detection of viral proteins and can be utilized to enable new biological discovery across diverse functional categories.

5.
Sci Adv ; 9(25): eade5492, 2023 06 23.
Article in English | MEDLINE | ID: mdl-37343092

ABSTRACT

Stem cells in many systems, including Drosophila germline stem cells (GSCs), increase ribosome biogenesis and translation during terminal differentiation. Here, we show that the H/ACA small nuclear ribonucleoprotein (snRNP) complex that promotes pseudouridylation of ribosomal RNA (rRNA) and ribosome biogenesis is required for oocyte specification. Reducing ribosome levels during differentiation decreased the translation of a subset of messenger RNAs that are enriched for CAG trinucleotide repeats and encode polyglutamine-containing proteins, including differentiation factors such as RNA-binding Fox protein 1. Moreover, ribosomes were enriched at CAG repeats within transcripts during oogenesis. Increasing target of rapamycin (TOR) activity to elevate ribosome levels in H/ACA snRNP complex-depleted germlines suppressed the GSC differentiation defects, whereas germlines treated with the TOR inhibitor rapamycin had reduced levels of polyglutamine-containing proteins. Thus, ribosome biogenesis and ribosome levels can control stem cell differentiation via selective translation of CAG repeat-containing transcripts.


Subject(s)
Small Nuclear Ribonucleoproteins , Ribosomes , Small Nuclear Ribonucleoproteins/metabolism , Ribosomes/metabolism , Ribosomal RNA , Proteins/metabolism , Sirolimus
6.
Cell Syst ; 9(5): 417-421, 2019 11 27.
Article in English | MEDLINE | ID: mdl-31677972

ABSTRACT

As more digital resources are produced by the research community, it is becoming increasingly important to harmonize and organize them for synergistic utilization. The findable, accessible, interoperable, and reusable (FAIR) guiding principles have prompted many stakeholders to consider strategies for tackling this challenge. The FAIRshake toolkit was developed to enable the establishment of community-driven FAIR metrics and rubrics paired with manual and automated FAIR assessments. FAIR assessments are visualized as an insignia that can be embedded within digital-resources-hosting websites. Using FAIRshake, a variety of biomedical digital resources were manually and automatically evaluated for their level of FAIRness.


Subject(s)
Information Dissemination/methods , Internet/trends , Online Systems/standards , Health Resources/standards , Humans