Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 390
Filtrar
Más filtros

Intervalo de año de publicación
1.
Am J Hum Genet ; 111(7): 1405-1419, 2024 07 11.
Artículo en Inglés | MEDLINE | ID: mdl-38906146

RESUMEN

Genome-wide association studies (GWASs) have identified numerous lung cancer risk-associated loci. However, decoding molecular mechanisms of these associations is challenging since most of these genetic variants are non-protein-coding with unknown function. Here, we implemented massively parallel reporter assays (MPRAs) to simultaneously measure the allelic transcriptional activity of risk-associated variants. We tested 2,245 variants at 42 loci from 3 recent GWASs in East Asian and European populations in the context of two major lung cancer histological types and exposure to benzo(a)pyrene. This MPRA approach identified one or more variants (median 11 variants) with significant effects on transcriptional activity at 88% of GWAS loci. Multimodal integration of lung-specific epigenomic data demonstrated that 63% of the loci harbored multiple potentially functional variants in linkage disequilibrium. While 22% of the significant variants showed allelic effects in both A549 (adenocarcinoma) and H520 (squamous cell carcinoma) cell lines, a subset of the functional variants displayed a significant cell-type interaction. Transcription factor analyses nominated potential regulators of the functional variants, including those with cell-type-specific expression and those predicted to bind multiple potentially functional variants across the GWAS loci. Linking functional variants to target genes based on four complementary approaches identified candidate susceptibility genes, including those affecting lung cancer cell growth. CRISPR interference of the top functional variant at 20q13.33 validated variant-to-gene connections, including RTEL1, SOX18, and ARFRP1. Our data provide a comprehensive functional analysis of lung cancer GWAS loci and help elucidate the molecular basis of heterogeneity and polygenicity underlying lung cancer susceptibility.


Asunto(s)
Predisposición Genética a la Enfermedad , Estudio de Asociación del Genoma Completo , Neoplasias Pulmonares , Polimorfismo de Nucleótido Simple , Humanos , Neoplasias Pulmonares/genética , Neoplasias Pulmonares/patología , Desequilibrio de Ligamiento , Herencia Multifactorial/genética , Línea Celular Tumoral , Alelos , Células A549
2.
RNA ; 30(3): 189-199, 2024 Feb 16.
Artículo en Inglés | MEDLINE | ID: mdl-38164624

RESUMEN

Aptamers have emerged as research hotspots of the next generation due to excellent performance benefits and application potentials in pharmacology, medicine, and analytical chemistry. Despite the numerous aptamer investigations, the lack of comprehensive data integration has hindered the development of computational methods for aptamers and the reuse of aptamers. A public access database named AptaDB, derived from experimentally validated data manually collected from the literature, was hence developed, integrating comprehensive aptamer-related data, which include six key components: (i) experimentally validated aptamer-target interaction information, (ii) aptamer property information, (iii) structure information of aptamer, (iv) target information, (v) experimental activity information, and (vi) algorithmically calculated similar aptamers. AptaDB currently contains 1350 experimentally validated aptamer-target interactions, 1230 binding affinity constants, 1293 aptamer sequences, and more. Compared to other aptamer databases, it contains twice the number of entries found in available databases. The collection and integration of the above information categories is unique among available aptamer databases and provides a user-friendly interface. AptaDB will also be continuously updated as aptamer research evolves. We expect that AptaDB will become a powerful source for aptamer rational design and a valuable tool for aptamer screening in the future. For access to AptaDB, please visit http://lmmd.ecust.edu.cn/aptadb/.


Asunto(s)
Aptámeros de Nucleótidos , Oligonucleótidos , Bases de Datos Factuales , Aptámeros de Nucleótidos/química , Técnica SELEX de Producción de Aptámeros
3.
Brief Bioinform ; 25(4)2024 May 23.
Artículo en Inglés | MEDLINE | ID: mdl-38842510

RESUMEN

Accurate and comprehensive annotation of microprotein-coding small open reading frames (smORFs) is critical to our understanding of normal physiology and disease. Empirical identification of translated smORFs is carried out primarily using ribosome profiling (Ribo-seq). While effective, published Ribo-seq datasets can vary drastically in quality and different analysis tools are frequently employed. Here, we examine the impact of these factors on identifying translated smORFs. We compared five commonly used software tools that assess open reading frame translation from Ribo-seq (RibORFv0.1, RibORFv1.0, RiboCode, ORFquant, and Ribo-TISH) and found surprisingly low agreement across all tools. Only ~2% of smORFs were called translated by all five tools, and ~15% by three or more tools when assessing the same high-resolution Ribo-seq dataset. For larger annotated genes, the same analysis showed ~74% agreement across all five tools. We also found that some tools are strongly biased against low-resolution Ribo-seq data, while others are more tolerant. Analyzing Ribo-seq coverage revealed that smORFs detected by more than one tool tend to have higher translation levels and higher fractions of in-frame reads, consistent with what was observed for annotated genes. Together these results support employing multiple tools to identify the most confident microprotein-coding smORFs and choosing the tools based on the quality of the dataset and the planned downstream characterization experiments of the predicted smORFs.


Asunto(s)
Sistemas de Lectura Abierta , Programas Informáticos , Ribosomas/metabolismo , Ribosomas/genética , Anotación de Secuencia Molecular/métodos , Humanos , Biosíntesis de Proteínas , Biología Computacional/métodos , Perfilado de Ribosomas
4.
Brief Bioinform ; 25(5)2024 Jul 25.
Artículo en Inglés | MEDLINE | ID: mdl-39120646

RESUMEN

Cell-type annotation is a critical step in single-cell data analysis. With the development of numerous cell annotation methods, it is necessary to evaluate these methods to help researchers use them effectively. Reference datasets are essential for evaluation, but currently, the cell labels of reference datasets mainly come from computational methods, which may have computational biases and may not reflect the actual cell-type outcomes. This study first constructed an experimentally labeled immune cell-subtype single-cell dataset of the same batch and systematically evaluated 18 cell annotation methods. We assessed those methods under five scenarios, including intra-dataset validation, immune cell-subtype validation, unsupervised clustering, inter-dataset annotation, and unknown cell-type prediction. Accuracy and ARI were evaluation metrics. The results showed that SVM, scBERT, and scDeepSort were the best-performing supervised methods. Seurat was the best-performing unsupervised clustering method, but it couldn't fully fit the actual cell-type distribution. Our results indicated that experimentally labeled immune cell-subtype datasets revealed the deficiencies of unsupervised clustering methods and provided new dataset support for supervised methods.


Asunto(s)
Análisis de la Célula Individual , Análisis de la Célula Individual/métodos , Humanos , Análisis por Conglomerados , Biología Computacional/métodos , Anotación de Secuencia Molecular , RNA-Seq/métodos , Análisis de Expresión Génica de una Sola Célula
5.
Brief Bioinform ; 25(2)2024 Jan 22.
Artículo en Inglés | MEDLINE | ID: mdl-38426325

RESUMEN

Accurate metabolite annotation and false discovery rate (FDR) control remain challenging in large-scale metabolomics. Recent progress leveraging proteomics experiences and interdisciplinary inspirations has provided valuable insights. While target-decoy strategies have been introduced, generating reliable decoy libraries is difficult due to metabolite complexity. Moreover, continuous bioinformatics innovation is imperative to improve the utilization of expanding spectral resources while reducing false annotations. Here, we introduce the concept of ion entropy for metabolomics and propose two entropy-based decoy generation approaches. Assessment of public databases validates ion entropy as an effective metric to quantify ion information in massive metabolomics datasets. Our entropy-based decoy strategies outperform current representative methods in metabolomics and achieve superior FDR estimation accuracy. Analysis of 46 public datasets provides instructive recommendations for practical application.


Asunto(s)
Algoritmos , Espectrometría de Masas en Tándem , Entropía , Espectrometría de Masas en Tándem/métodos , Metabolómica/métodos , Biología Computacional/métodos , Bases de Datos de Proteínas
6.
Brief Bioinform ; 25(3)2024 Mar 27.
Artículo en Inglés | MEDLINE | ID: mdl-38600664

RESUMEN

Small open reading frames (smORFs) have been acknowledged to play various roles on essential biological pathways and affect human beings from diabetes to tumorigenesis. Predicting smORFs in silico is quite a prerequisite for processing the omics data. Here, we proposed the smORF-coding-potential-predicting framework, sOCP, which provides functions to construct a model for predicting novel smORFs in some species. The sOCP model constructed in human was based on in-frame features and the nucleotide bias around the start codon, and the small feature subset was proved to be competent enough and avoid overfitting problems for complicated models. It showed more advanced prediction metrics than previous methods and could correlate closely with experimental evidence in a heterogeneous dataset. The model was applied to Rattus norvegicus and exhibited satisfactory performance. We then scanned smORFs with ATG and non-ATG start codons from the human genome and generated a database containing about a million novel smORFs with coding potential. Around 72 000 smORFs are located on the lncRNA regions of the genome. The smORF-encoded peptides may be involved in biological pathways rare for canonical proteins, including glucocorticoid catabolic process and the prokaryotic defense system. Our work provides a model and database for human smORF investigation and a convenient tool for further smORF prediction in other species.


Asunto(s)
Genoma Humano , Péptidos , Animales , Humanos , Ratas , Sistemas de Lectura Abierta , Péptidos/genética , Proteínas/genética
7.
Brief Bioinform ; 25(2)2024 Jan 22.
Artículo en Inglés | MEDLINE | ID: mdl-38279653

RESUMEN

Cluster analysis is one of the most widely used exploratory methods for visualization and grouping of gene expression patterns across multiple samples or treatment groups. Although several existing online tools can annotate clusters with functional terms, there is no all-in-one webserver to effectively prioritize genes/clusters using gene essentiality as well as congruency of mRNA-protein expression. Hence, we developed CAP-RNAseq that makes possible (1) upload and clustering of bulk RNA-seq data followed by identification, annotation and network visualization of all or selected clusters; and (2) prioritization using DepMap gene essentiality and/or dependency scores as well as the degree of correlation between mRNA and protein levels of genes within an expression cluster. In addition, CAP-RNAseq has an integrated primer design tool for the prioritized genes. Herein, we showed using comparisons with the existing tools and multiple case studies that CAP-RNAseq can uniquely aid in the discovery of co-expression clusters enriched with essential genes and prioritization of novel biomarker genes that exhibit high correlations between their mRNA and protein expression levels. CAP-RNAseq is applicable to RNA-seq data from different contexts including cancer and available at http://konulabapps.bilkent.edu.tr:3838/CAPRNAseq/ and the docker image is downloadable from https://hub.docker.com/r/konulab/caprnaseq.


Asunto(s)
Proteómica , Análisis de Secuencia de ARN/métodos , RNA-Seq , ARN Mensajero/genética
8.
Brief Bioinform ; 25(2)2024 Jan 22.
Artículo en Inglés | MEDLINE | ID: mdl-38300514

RESUMEN

Somatic copy number alterations (SCNAs) are a predominant type of oncogenomic alterations that affect a large proportion of the genome in the majority of cancer samples. Current technologies allow high-throughput measurement of such copy number aberrations, generating results consisting of frequently large sets of SCNA segments. However, the automated annotation and integration of such data are particularly challenging because the measured signals reflect biased, relative copy number ratios. In this study, we introduce labelSeg, an algorithm designed for rapid and accurate annotation of CNA segments, with the aim of enhancing the interpretation of tumor SCNA profiles. Leveraging density-based clustering and exploiting the length-amplitude relationships of SCNA, our algorithm proficiently identifies distinct relative copy number states from individual segment profiles. Its compatibility with most CNA measurement platforms makes it suitable for large-scale integrative data analysis. We confirmed its performance on both simulated and sample-derived data from The Cancer Genome Atlas reference dataset, and we demonstrated its utility in integrating heterogeneous segment profiles from different data sources and measurement platforms. Our comparative and integrative analysis revealed common SCNA patterns in cancer and protein-coding genes with a strong correlation between SCNA and messenger RNA expression, promoting the investigation into the role of SCNA in cancer development.


Asunto(s)
Variaciones en el Número de Copia de ADN , Neoplasias , Humanos , Neoplasias/genética , Algoritmos , Análisis por Conglomerados , Análisis de Datos
9.
Brief Bioinform ; 25(4)2024 May 23.
Artículo en Inglés | MEDLINE | ID: mdl-38941113

RESUMEN

This study describes the development of a resource module that is part of a learning platform named "NIGMS Sandbox for Cloud-based Learning" (https://github.com/NIGMS/NIGMS-Sandbox). The overall genesis of the Sandbox is described in the editorial NIGMS Sandbox at the beginning of this Supplement. This module delivers learning materials on de novo transcriptome assembly using Nextflow in an interactive format that uses appropriate cloud resources for data access and analysis. Cloud computing is a powerful new means by which biomedical researchers can access resources and capacity that were previously either unattainable or prohibitively expensive. To take advantage of these resources, however, the biomedical research community needs new skills and knowledge. We present here a cloud-based training module, developed in conjunction with Google Cloud, Deloitte Consulting, and the NIH STRIDES Program, that uses the biological problem of de novo transcriptome assembly to demonstrate and teach the concepts of computational workflows (using Nextflow) and cost- and resource-efficient use of Cloud services (using Google Cloud Platform). Our work highlights the reduced necessity of on-site computing resources and the accessibility of cloud-based infrastructure for bioinformatics applications.


Asunto(s)
Nube Computacional , Transcriptoma , Biología Computacional/métodos , Biología Computacional/educación , Programas Informáticos , Humanos , Perfilación de la Expresión Génica/métodos , Internet
10.
Brief Bioinform ; 25(3)2024 Mar 27.
Artículo en Inglés | MEDLINE | ID: mdl-38752856

RESUMEN

Enhancing the reproducibility and comprehension of adaptive immune receptor repertoire sequencing (AIRR-seq) data analysis is critical for scientific progress. This study presents guidelines for reproducible AIRR-seq data analysis, and a collection of ready-to-use pipelines with comprehensive documentation. To this end, ten common pipelines were implemented using ViaFoundry, a user-friendly interface for pipeline management and automation. This is accompanied by versioned containers, documentation and archiving capabilities. The automation of pre-processing analysis steps and the ability to modify pipeline parameters according to specific research needs are emphasized. AIRR-seq data analysis is highly sensitive to varying parameters and setups; using the guidelines presented here, the ability to reproduce previously published results is demonstrated. This work promotes transparency, reproducibility, and collaboration in AIRR-seq data analysis, serving as a model for handling and documenting bioinformatics pipelines in other research domains.


Asunto(s)
Biología Computacional , Programas Informáticos , Humanos , Biología Computacional/métodos , Reproducibilidad de los Resultados , Receptores Inmunológicos/genética , Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Inmunidad Adaptativa/genética , Guías como Asunto
11.
Brief Bioinform ; 25(2)2024 Jan 22.
Artículo en Inglés | MEDLINE | ID: mdl-38271481

RESUMEN

Next-generation sequencing (NGS) has revolutionized the field of rare disease diagnostics. Whole exome and whole genome sequencing are now routinely used for diagnostic purposes; however, the overall diagnosis rate remains lower than expected. In this work, we review current approaches used for calling and interpretation of germline genetic variants in the human genome, and discuss the most important challenges that persist in the bioinformatic analysis of NGS data in medical genetics. We describe and attempt to quantitatively assess the remaining problems, such as the quality of the reference genome sequence, reproducible coverage biases, or variant calling accuracy in complex regions of the genome. We also discuss the prospects of switching to the complete human genome assembly or the human pan-genome and important caveats associated with such a switch. We touch on arguably the hardest problem of NGS data analysis for medical genomics, namely, the annotation of genetic variants and their subsequent interpretation. We highlight the most challenging aspects of annotation and prioritization of both coding and non-coding variants. Finally, we demonstrate the persistent prevalence of pathogenic variants in the coding genome, and outline research directions that may enhance the efficiency of NGS-based disease diagnostics.


Asunto(s)
Biología Computacional , Enfermedades Raras , Humanos , Enfermedades Raras/diagnóstico , Enfermedades Raras/genética , Genómica , Genoma Humano , Células Germinativas , Secuenciación de Nucleótidos de Alto Rendimiento
12.
Brief Bioinform ; 25(2)2024 Jan 22.
Artículo en Inglés | MEDLINE | ID: mdl-38446740

RESUMEN

Protein annotation has long been a challenging task in computational biology. Gene Ontology (GO) has become one of the most popular frameworks to describe protein functions and their relationships. Prediction of a protein annotation with proper GO terms demands high-quality GO term representation learning, which aims to learn a low-dimensional dense vector representation with accompanying semantic meaning for each functional label, also known as embedding. However, existing GO term embedding methods, which mainly take into account ancestral co-occurrence information, have yet to capture the full topological information in the GO-directed acyclic graph (DAG). In this study, we propose a novel GO term representation learning method, PO2Vec, to utilize the partial order relationships to improve the GO term representations. Extensive evaluations show that PO2Vec achieves better outcomes than existing embedding methods in a variety of downstream biological tasks. Based on PO2Vec, we further developed a new protein function prediction method PO2GO, which demonstrates superior performance measured in multiple metrics and annotation specificity as well as few-shot prediction capability in the benchmarks. These results suggest that the high-quality representation of GO structure is critical for diverse biological tasks including computational protein annotation.


Asunto(s)
Benchmarking , Biología Computacional , Ontología de Genes , Aprendizaje , Anotación de Secuencia Molecular
13.
Brief Bioinform ; 25(3)2024 Mar 27.
Artículo en Inglés | MEDLINE | ID: mdl-38600668

RESUMEN

Microbial community analysis is an important field to study the composition and function of microbial communities. Microbial species annotation is crucial to revealing microorganisms' complex ecological functions in environmental, ecological and host interactions. Currently, widely used methods can suffer from issues such as inaccurate species-level annotations and time and memory constraints, and as sequencing technology advances and sequencing costs decline, microbial species annotation methods with higher quality classification effectiveness become critical. Therefore, we processed 16S rRNA gene sequences into k-mers sets and then used a trained DNABERT model to generate word vectors. We also design a parallel network structure consisting of deep and shallow modules to extract the semantic and detailed features of 16S rRNA gene sequences. Our method can accurately and rapidly classify bacterial sequences at the SILVA database's genus and species level. The database is characterized by long sequence length (1500 base pairs), multiple sequences (428,748 reads) and high similarity. The results show that our method has better performance. The technique is nearly 20% more accurate at the species level than the currently popular naive Bayes-dominated QIIME 2 annotation method, and the top-5 results at the species level differ from BLAST methods by <2%. In summary, our approach combines a multi-module deep learning approach that overcomes the limitations of existing methods, providing an efficient and accurate solution for microbial species labeling and more reliable data support for microbiology research and application.


Asunto(s)
Aprendizaje Profundo , Microbiota , ARN Ribosómico 16S/genética , Teorema de Bayes , Microbiota/genética , Bacterias/genética , Filogenia
14.
Brief Bioinform ; 25(3)2024 Mar 27.
Artículo en Inglés | MEDLINE | ID: mdl-38581418

RESUMEN

Following the milestone success of the Human Genome Project, the 'Encyclopedia of DNA Elements (ENCODE)' initiative was launched in 2003 to unearth information about the numerous functional elements within the genome. This endeavor coincided with the emergence of numerous novel technologies, accompanied by the provision of vast amounts of whole-genome sequences, high-throughput data such as ChIP-Seq and RNA-Seq. Extracting biologically meaningful information from this massive dataset has become a critical aspect of many recent studies, particularly in annotating and predicting the functions of unknown genes. The core idea behind genome annotation is to identify genes and various functional elements within the genome sequence and infer their biological functions. Traditional wet-lab experimental methods still rely on extensive efforts for functional verification. However, early bioinformatics algorithms and software primarily employed shallow learning techniques; thus, the ability to characterize data and features learning was limited. With the widespread adoption of RNA-Seq technology, scientists from the biological community began to harness the potential of machine learning and deep learning approaches for gene structure prediction and functional annotation. In this context, we reviewed both conventional methods and contemporary deep learning frameworks, and highlighted novel perspectives on the challenges arising during annotation underscoring the dynamic nature of this evolving scientific landscape.


Asunto(s)
Aprendizaje Profundo , Humanos , Genoma , Algoritmos , Programas Informáticos , Biología Computacional/métodos , Anotación de Secuencia Molecular
15.
Brief Bioinform ; 25(2)2024 Jan 22.
Artículo en Inglés | MEDLINE | ID: mdl-38324622

RESUMEN

Liquid chromatography coupled with high-resolution mass spectrometry data-independent acquisition (LC-HRMS/DIA), including MSE, enable comprehensive metabolomics analyses though they pose challenges for data processing with automatic annotation and molecular networking (MN) implementation. This motivated the present proposal, in which we introduce DIA-IntOpenStream, a new integrated workflow combining open-source software to streamline MSE data handling. It provides 'in-house' custom database construction, allows the conversion of raw MSE data to a universal format (.mzML) and leverages open software (MZmine 3 and MS-DIAL) all advantages for confident annotation and effective MN data interpretation. This pipeline significantly enhances the accessibility, reliability and reproducibility of complex MSE/DIA studies, overcoming previous limitations of proprietary software and non-universal MS data formats that restricted integrative analysis. We demonstrate the utility of DIA-IntOpenStream with two independent datasets: dataset 1 consists of new data from 60 plant extracts from the Ocotea genus; dataset 2 is a publicly available actinobacterial extract spiked with authentic standard for detailed comparative analysis with existing methods. This user-friendly pipeline enables broader adoption of cutting-edge MS tools and provides value to the scientific community. Overall, it holds promise for speeding up metabolite discoveries toward a more collaborative and open environment for research.


Asunto(s)
Metabolómica , Programas Informáticos , Reproducibilidad de los Resultados , Flujo de Trabajo , Metabolómica/métodos , Espectrometría de Masas/métodos , Cromatografía Liquida/métodos
16.
Brief Bioinform ; 25(2)2024 Jan 22.
Artículo en Inglés | MEDLINE | ID: mdl-38279647

RESUMEN

MOTIVATION: The rapid development of spatial transcriptome technologies has enabled researchers to acquire single-cell-level spatial data at an affordable price. However, computational analysis tools, such as annotation tools, tailored for these data are still lacking. Recently, many computational frameworks have emerged to integrate single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics datasets. While some frameworks can utilize well-annotated scRNA-seq data to annotate spatial expression patterns, they overlook critical aspects. First, existing tools do not explicitly consider cell type mapping when aligning the two modalities. Second, current frameworks lack the capability to detect novel cells, which remains a key interest for biologists. RESULTS: To address these problems, we propose an annotation method for spatial transcriptome data called SPANN. The main tasks of SPANN are to transfer cell-type labels from well-annotated scRNA-seq data to newly generated single-cell resolution spatial transcriptome data and discover novel cells from spatial data. The major innovations of SPANN come from two aspects: SPANN automatically detects novel cells from unseen cell types while maintaining high annotation accuracy over known cell types. SPANN finds a mapping between spatial transcriptome samples and RNA data prototypes and thus conducts cell-type-level alignment. Comprehensive experiments using datasets from various spatial platforms demonstrate SPANN's capabilities in annotating known cell types and discovering novel cell states within complex tissue contexts. AVAILABILITY: The source code of SPANN can be accessed at https://github.com/ddb-qiwang/SPANN-torch. CONTACT: dengmh@math.pku.edu.cn.


Asunto(s)
Análisis de Expresión Génica de una Sola Célula , Transcriptoma , Análisis de Secuencia de ARN/métodos , Análisis de la Célula Individual/métodos , Perfilación de la Expresión Génica/métodos , Programas Informáticos
17.
Brief Bioinform ; 25(4)2024 May 23.
Artículo en Inglés | MEDLINE | ID: mdl-38935069

RESUMEN

MOTIVATION: In the past decade, single-cell RNA sequencing (scRNA-seq) has emerged as a pivotal method for transcriptomic profiling in biomedical research. Precise cell-type identification is crucial for subsequent analysis of single-cell data. And the integration and refinement of annotated data are essential for building comprehensive databases. However, prevailing annotation techniques often overlook the hierarchical organization of cell types, resulting in inconsistent annotations. Meanwhile, most existing integration approaches fail to integrate datasets with different annotation depths and none of them can enhance the labels of outdated data with lower annotation resolutions using more intricately annotated datasets or novel biological findings. RESULTS: Here, we introduce scPLAN, a hierarchical computational framework designed for scRNA-seq data analysis. scPLAN excels in annotating unlabeled scRNA-seq data using a reference dataset structured along a hierarchical cell-type tree. It identifies potential novel cell types in a systematic, layer-by-layer manner. Additionally, scPLAN effectively integrates annotated scRNA-seq datasets with varying levels of annotation depth, ensuring consistent refinement of cell-type labels across datasets with lower resolutions. Through extensive annotation and novel cell detection experiments, scPLAN has demonstrated its efficacy. Two case studies have been conducted to showcase how scPLAN integrates datasets with diverse cell-type label resolutions and refine their cell-type labels. AVAILABILITY: https://github.com/michaelGuo1204/scPLAN.


Asunto(s)
Biología Computacional , Perfilación de la Expresión Génica , Análisis de la Célula Individual , Análisis de la Célula Individual/métodos , Perfilación de la Expresión Génica/métodos , Biología Computacional/métodos , Humanos , Programas Informáticos , Transcriptoma , Análisis de Secuencia de ARN/métodos , RNA-Seq/métodos , Anotación de Secuencia Molecular/métodos
18.
Brief Bioinform ; 25(2)2024 Jan 22.
Artículo en Inglés | MEDLINE | ID: mdl-38366803

RESUMEN

The evolution in single-cell RNA sequencing (scRNA-seq) technology has opened a new avenue for researchers to inspect cellular heterogeneity with single-cell precision. One crucial aspect of this technology is cell-type annotation, which is fundamental for any subsequent analysis in single-cell data mining. Recently, the scientific community has seen a surge in the development of automatic annotation methods aimed at this task. However, these methods generally operate at a steady-state total cell-type capacity, significantly restricting the cell annotation systems'capacity for continuous knowledge acquisition. Furthermore, creating a unified scRNA-seq annotation system remains challenged by the need to progressively expand its understanding of ever-increasing cell-type concepts derived from a continuous data stream. In response to these challenges, this paper presents a novel and challenging setting for annotation, namely cell-type incremental annotation. This concept is designed to perpetually enhance cell-type knowledge, gleaned from continuously incoming data. This task encounters difficulty with data stream samples that can only be observed once, leading to catastrophic forgetting. To address this problem, we introduce our breakthrough methodology termed scEVOLVE, an incremental annotation method. This innovative approach is built upon the methodology of contrastive sample replay combined with the fundamental principle of partition confidence maximization. Specifically, we initially retain and replay sections of the old data in each subsequent training phase, then establish a unique prototypical learning objective to mitigate the cell-type imbalance problem, as an alternative to using cross-entropy. To effectively emulate a model that trains concurrently with complete data, we introduce a cell-type decorrelation strategy that efficiently scatters feature representations of each cell type uniformly. We constructed the scEVOLVE framework with simplicity and ease of integration into most deep softmax-based single-cell annotation methods. Thorough experiments conducted on a range of meticulously constructed benchmarks consistently prove that our methodology can incrementally learn numerous cell types over an extended period, outperforming other strategies that fail quickly. As far as our knowledge extends, this is the first attempt to propose and formulate an end-to-end algorithm framework to address this new, practical task. Additionally, scEVOLVE, coded in Python using the Pytorch machine-learning library, is freely accessible at https://github.com/aimeeyaoyao/scEVOLVE.


Asunto(s)
Algoritmos , Análisis de Expresión Génica de una Sola Célula , Benchmarking , Entropía , Biblioteca de Genes , Análisis de Secuencia de ARN , Perfilación de la Expresión Génica , Análisis por Conglomerados
19.
Brief Bioinform ; 25(3)2024 Mar 27.
Artículo en Inglés | MEDLINE | ID: mdl-38747283

RESUMEN

The analysis and comparison of gene neighborhoods is a powerful approach for exploring microbial genome structure, function, and evolution. Although numerous tools exist for genome visualization and comparison, genome exploration across large genomic databases or user-generated datasets remains a challenge. Here, we introduce AnnoView, a web server designed for interactive exploration of gene neighborhoods across the bacterial and archaeal tree of life. Our server offers users the ability to identify, compare, and visualize gene neighborhoods of interest from 30 238 bacterial genomes and 1672 archaeal genomes, through integration with the comprehensive Genome Taxonomy Database and AnnoTree databases. Identified gene neighborhoods can be visualized using pre-computed functional annotations from different sources such as KEGG, Pfam and TIGRFAM, or clustered based on similarity. Alternatively, users can upload and explore their own custom genomic datasets in GBK, GFF or CSV format, or use AnnoView as a genome browser for relatively small genomes (e.g. viruses and plasmids). Ultimately, we anticipate that AnnoView will catalyze biological discovery by enabling user-friendly search, comparison, and visualization of genomic data. AnnoView is available at http://annoview.uwaterloo.ca.


Asunto(s)
Programas Informáticos , Bases de Datos Genéticas , Genoma Bacteriano , Genoma Arqueal , Genómica/métodos , Archaea/genética , Genes Microbianos/genética , Biología Computacional/métodos , Bacterias/genética , Bacterias/clasificación
20.
Brief Bioinform ; 25(2)2024 Jan 22.
Artículo en Inglés | MEDLINE | ID: mdl-38388681

RESUMEN

MOTIVATION: Cell-type annotation of single-cell RNA-sequencing (scRNA-seq) data is a hallmark of biomedical research and clinical application. Current annotation tools usually assume the simultaneous acquisition of well-annotated data, but without the ability to expand knowledge from new data. Yet, such tools are inconsistent with the continuous emergence of scRNA-seq data, calling for a continuous cell-type annotation model. In addition, by their powerful ability of information integration and model interpretability, transformer-based pre-trained language models have led to breakthroughs in single-cell biology research. Therefore, the systematic combining of continual learning and pre-trained language models for cell-type annotation tasks is inevitable. RESULTS: We herein propose a universal cell-type annotation tool, called CANAL, that continuously fine-tunes a pre-trained language model trained on a large amount of unlabeled scRNA-seq data, as new well-labeled data emerges. CANAL essentially alleviates the dilemma of catastrophic forgetting, both in terms of model inputs and outputs. For model inputs, we introduce an experience replay schema that repeatedly reviews previous vital examples in current training stages. This is achieved through a dynamic example bank with a fixed buffer size. The example bank is class-balanced and proficient in retaining cell-type-specific information, particularly facilitating the consolidation of patterns associated with rare cell types. For model outputs, we utilize representation knowledge distillation to regularize the divergence between previous and current models, resulting in the preservation of knowledge learned from past training stages. Moreover, our universal annotation framework considers the inclusion of new cell types throughout the fine-tuning and testing stages. We can continuously expand the cell-type annotation library by absorbing new cell types from newly arrived, well-annotated training datasets, as well as automatically identify novel cells in unlabeled datasets. Comprehensive experiments with data streams under various biological scenarios demonstrate the versatility and high model interpretability of CANAL. AVAILABILITY: An implementation of CANAL is available from https://github.com/aster-ww/CANAL-torch. CONTACT: dengmh@pku.edu.cn. SUPPLEMENTARY INFORMATION: Supplementary data are available at Journal Name online.


Asunto(s)
Perfilación de la Expresión Génica , Programas Informáticos , Perfilación de la Expresión Génica/métodos , Análisis de Expresión Génica de una Sola Célula , Análisis de la Célula Individual/métodos , Lenguaje , Análisis de Secuencia de ARN/métodos
SELECCIÓN DE REFERENCIAS
DETALLE DE LA BÚSQUEDA