Results 1 - 20 of 956
1.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38706315

ABSTRACT

In UniProtKB, to date, there are more than 251 million proteins deposited. However, only 0.25% have been annotated with one of the more than 15000 possible Pfam family domains. The current annotation protocol integrates knowledge from manually curated family domains, obtained using sequence alignments and hidden Markov models. This approach has been successful for automatically growing the Pfam annotations, although at a low rate in comparison to protein discovery. Just a few years ago, deep learning models were proposed for automatic Pfam annotation. However, these models demand a considerable amount of training data, which can be a challenge with poorly populated families. To address this issue, we propose and evaluate here a novel protocol based on transfer learning. This requires the use of protein large language models (LLMs), trained with self-supervision on big unannotated datasets in order to obtain sequence embeddings. Then, the embeddings can be used with supervised learning on a small and annotated dataset for a specialized task. In this protocol we have evaluated several cutting-edge protein LLMs together with machine learning architectures to improve the current prediction of protein domain annotations. Results are significantly better than state-of-the-art for protein family classification, reducing the prediction error by an impressive 60% compared to standard methods. We explain how LLM embeddings can be used for protein annotation in a concrete and easy way, and provide the pipeline in a GitHub repo. Full source code and data are available at https://github.com/sinc-lab/llm4pfam.


Subject(s)
Databases, Protein; Proteins; Proteins/chemistry; Molecular Sequence Annotation/methods; Computational Biology/methods; Machine Learning
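
The transfer-learning idea in the abstract above (precomputed sequence embeddings fed to a small supervised model) can be illustrated with a minimal sketch. This is not the authors' pipeline: the nearest-centroid classifier, the three-dimensional toy vectors, and the family labels `PF_A` and `PF_B` are all assumptions made up for illustration. Real protein LLM embeddings would come from a pretrained model and have far more dimensions.

```python
import math

def fit_centroids(embeddings, labels):
    """Average the embeddings of each family into one centroid per label."""
    grouped = {}
    for vec, label in zip(embeddings, labels):
        grouped.setdefault(label, []).append(vec)
    return {
        label: [sum(column) / len(vecs) for column in zip(*vecs)]
        for label, vecs in grouped.items()
    }

def predict(centroids, embedding):
    """Return the family whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(embedding, centroids[label]))

# Toy "embeddings" (hypothetical values, standing in for protein LLM output).
train_x = [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1], [0.0, 1.0, 0.9], [0.1, 0.9, 1.0]]
train_y = ["PF_A", "PF_A", "PF_B", "PF_B"]  # hypothetical family labels

centroids = fit_centroids(train_x, train_y)
predicted = predict(centroids, [0.8, 0.2, 0.1])
print(predicted)
```

The point of the sketch is the division of labour: the expensive self-supervised model is used only to produce fixed vectors, so the supervised step can stay small even when a family has few annotated members.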
2.
Curr Protoc ; 4(5): e1046, 2024 May.
Article in English | MEDLINE | ID: mdl-38717471

ABSTRACT

Whole-genome sequencing is widely used to investigate population genomic variation in organisms of interest. Assorted tools have been independently developed to call variants from short-read sequencing data aligned to a reference genome, including single nucleotide polymorphisms (SNPs) and structural variations (SVs). We developed SNP-SVant, an integrated, flexible, and computationally efficient bioinformatic workflow that predicts high-confidence SNPs and SVs in organisms without benchmarked variants, which are traditionally used for distinguishing sequencing errors from real variants. In the absence of these benchmarked datasets, we leverage multiple rounds of statistical recalibration to increase the precision of variant prediction. The SNP-SVant workflow is flexible, with user options to tradeoff accuracy for sensitivity. The workflow predicts SNPs and small insertions and deletions using the Genome Analysis ToolKit (GATK) and predicts SVs using the Genome Rearrangement IDentification Software Suite (GRIDSS), and it culminates in variant annotation using custom scripts. A key utility of SNP-SVant is its scalability. Variant calling is a computationally expensive procedure, and thus, SNP-SVant uses a workflow management system with intermediary checkpoint steps to ensure efficient use of resources by minimizing redundant computations and omitting steps where dependent files are available. SNP-SVant also provides metrics to assess the quality of called variants and converts between VCF and aligned FASTA format outputs to ensure compatibility with downstream tools to calculate selection statistics, which are commonplace in population genomics studies. By accounting for both small and large structural variants, users of this workflow can obtain a wide-ranging view of genomic alterations in an organism of interest. 
Overall, this workflow advances our capabilities in assessing the functional consequences of different types of genomic alterations, ultimately improving our ability to associate genotypes with phenotypes. © 2024 The Authors. Current Protocols published by Wiley Periodicals LLC. Basic Protocol: Predicting single nucleotide polymorphisms and structural variations. Support Protocol 1: Downloading publicly available sequencing data. Support Protocol 2: Visualizing variant loci using Integrated Genome Viewer. Support Protocol 3: Converting between VCF and aligned FASTA formats.


Subject(s)
Polymorphism, Single Nucleotide; Software; Workflow; Polymorphism, Single Nucleotide/genetics; Computational Biology/methods; Genomics/methods; Molecular Sequence Annotation/methods; Whole Genome Sequencing/methods
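
The abstract above describes checkpoint steps that omit work when the dependent files already exist. A minimal sketch of that general idea follows. It does not reproduce SNP-SVant or any workflow manager: the step names, file names, and the `run_step` helper are hypothetical.

```python
import os
import tempfile

def run_step(name, output_path, action, log):
    """Run `action` only if its output file is missing, and record the outcome."""
    if os.path.exists(output_path):
        log.append((name, "skipped"))
        return
    action(output_path)
    log.append((name, "ran"))

def write_placeholder(path):
    """Stand-in for a real tool invocation: it only creates the output file."""
    with open(path, "w") as handle:
        handle.write("placeholder output\n")

workdir = tempfile.mkdtemp()
steps = [  # hypothetical step names and output files
    ("align", os.path.join(workdir, "aligned.txt")),
    ("call_variants", os.path.join(workdir, "variants.txt")),
]

log = []
for _ in range(2):  # the second pass finds every output already present
    for name, path in steps:
        run_step(name, path, write_placeholder, log)
print(log)
```

Checking for outputs rather than tracking state in memory means an interrupted run can resume from the last completed step, which matters when each step is computationally expensive.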
3.
PLoS Biol ; 22(5): e3002405, 2024 May.
Article in English | MEDLINE | ID: mdl-38713717

ABSTRACT

We report a new visualization tool for analysis of whole-genome assembly-assembly alignments, the Comparative Genome Viewer (CGV) (https://ncbi.nlm.nih.gov/genome/cgv/). CGV visualizes pairwise same-species and cross-species alignments provided by National Center for Biotechnology Information (NCBI) using assembly alignment algorithms developed by us and others. Researchers can examine large structural differences spanning chromosomes, such as inversions or translocations. Users can also navigate to regions of interest, where they can detect and analyze smaller-scale deletions and rearrangements within specific chromosome or gene regions. RefSeq or user-provided gene annotation is displayed where available. CGV currently provides approximately 800 alignments from over 350 animal, plant, and fungal species. CGV and related NCBI viewers are undergoing active development to further meet needs of the research community in comparative genome visualization.


Subject(s)
Genome; Software; Animals; Genome/genetics; Sequence Alignment/methods; Genomics/methods; Algorithms; United States; Humans; Eukaryota/genetics; Databases, Genetic; National Library of Medicine (U.S.); Molecular Sequence Annotation/methods
4.
Nat Genet ; 56(5): 767-777, 2024 May.
Article in English | MEDLINE | ID: mdl-38689000

ABSTRACT

We develop a method, SBayesRC, that integrates genome-wide association study (GWAS) summary statistics with functional genomic annotations to improve polygenic prediction of complex traits. Our method is scalable to whole-genome variant analysis and refines signals from functional annotations by allowing them to affect both causal variant probability and causal effect distribution. We analyze 50 complex traits and diseases using ∼7 million common single-nucleotide polymorphisms (SNPs) and 96 annotations. SBayesRC improves prediction accuracy by 14% in European ancestry and up to 34% in cross-ancestry prediction compared to the baseline method SBayesR, which does not use annotations, and outperforms other methods, including LDpred2, LDpred-funct, MegaPRS, PolyPred-S and PRS-CSx. Investigation of factors affecting prediction accuracy identifies a significant interaction between SNP density and annotation information, suggesting whole-genome sequence variants with annotations may further improve prediction. Functional partitioning analysis highlights a major contribution of evolutionary constrained regions to prediction accuracy and the largest per-SNP contribution from nonsynonymous SNPs.


Subject(s)
Genome-Wide Association Study; Molecular Sequence Annotation; Multifactorial Inheritance; Polymorphism, Single Nucleotide; Multifactorial Inheritance/genetics; Genome-Wide Association Study/methods; Humans; Molecular Sequence Annotation/methods; Genomics/methods; Genome, Human; Models, Genetic
5.
Bioinformatics ; 40(4)2024 Mar 29.
Article in English | MEDLINE | ID: mdl-38640488

ABSTRACT

MOTIVATION: The ENCODE project generated a large collection of eCLIP-seq RNA binding protein (RBP) profiling data with accompanying RNA-seq transcriptomes of shRNA knockdown of RBPs. These data could have utility in understanding the functional impact of genetic variants; however, their potential has not been fully exploited. We implement INCA (Integrative annotation scores of variants for impact on RBP activities) as a multi-step genetic variant scoring approach that leverages the ENCODE RBP data together with ClinVar and integrates multiple computational approaches to aggregate evidence. RESULTS: INCA evaluates variant impacts on RBP activities by leveraging genotypic differences in cell lines used for eCLIP-seq. We show that INCA provides critical specificity, beyond generic scoring for RBP binding disruption, for candidate variants and their linkage-disequilibrium partners. As a result, it can, on average, augment scoring of 46.2% of the candidate variants beyond generic scoring for RBP binding disruption and aid in variant prioritization for follow-up analysis. AVAILABILITY AND IMPLEMENTATION: INCA is implemented in R and is available at https://github.com/keleslab/INCA.


Subject(s)
RNA-Binding Proteins; Humans; RNA-Binding Proteins/metabolism; RNA-Binding Proteins/genetics; Software; Genetic Variation; Computational Biology/methods; Molecular Sequence Annotation/methods
6.
BMC Bioinformatics ; 25(1): 165, 2024 Apr 25.
Article in English | MEDLINE | ID: mdl-38664627

ABSTRACT

BACKGROUND: The annotation of protein sequences in public databases has long posed a challenge in molecular biology. This issue is particularly acute for viral proteins, which demonstrate limited homology to known proteins when using alignment, k-mer, or profile-based homology search approaches. A novel methodology employing Large Language Models (LLMs) addresses this methodological challenge by annotating protein sequences based on embeddings. RESULTS: Central to our contribution is the soft alignment algorithm, drawing from traditional protein alignment but leveraging embedding similarity at the amino acid level to bypass the need for conventional scoring matrices. This method not only surpasses pooled embedding-based models in efficiency but also in interpretability, enabling users to easily trace homologous amino acids and delve deeper into the alignments. Far from being a black box, our approach provides transparent, BLAST-like alignment visualizations, combining traditional biological research with AI advancements to elevate protein annotation through embedding-based analysis while ensuring interpretability. Tests using the Virus Orthologous Groups and ViralZone protein databases indicated that the novel soft alignment approach recognized and annotated sequences that both blastp and pooling-based methods, which are commonly used for sequence annotation, failed to detect. CONCLUSION: The embeddings approach shows the great potential of LLMs for enhancing protein sequence annotation, especially in viral genomics. These findings present a promising avenue for more efficient and accurate protein function inference in molecular biology.


Subject(s)
Algorithms; Molecular Sequence Annotation; Sequence Alignment; Molecular Sequence Annotation/methods; Sequence Alignment/methods; Viral Proteins/genetics; Viral Proteins/chemistry; Genes, Viral; Databases, Protein; Computational Biology/methods; Amino Acid Sequence
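
The abstract above describes scoring an alignment with embedding similarity at the amino acid level in place of a conventional scoring matrix. One way to picture this is a Smith-Waterman-style local alignment whose match score is the cosine similarity between per-residue embeddings. The sketch below is an assumption-laden illustration, not the authors' soft alignment algorithm: the `gap` and `offset` values and the two-dimensional toy embeddings are invented for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def soft_local_alignment_score(emb_a, emb_b, gap=0.5, offset=0.5):
    """Best local alignment score, using (cosine - offset) as the match score.

    `offset` makes dissimilar residue pairs score negatively, so that the
    local alignment recurrence can reset to zero as in Smith-Waterman.
    """
    rows, cols = len(emb_a) + 1, len(emb_b) + 1
    table = [[0.0] * cols for _ in range(rows)]
    best = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            match = table[i - 1][j - 1] + cosine(emb_a[i - 1], emb_b[j - 1]) - offset
            table[i][j] = max(0.0, match, table[i - 1][j] - gap, table[i][j - 1] - gap)
            best = max(best, table[i][j])
    return best

# Hypothetical per-residue embeddings for three-residue sequences.
seq_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
seq_b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]        # same as seq_a
seq_c = [[-1.0, 0.0], [0.0, -1.0], [-1.0, -1.0]]    # pointing the opposite way

score_same = soft_local_alignment_score(seq_a, seq_b)
score_diff = soft_local_alignment_score(seq_a, seq_c)
print(score_same, score_diff)
```

Because the dynamic-programming table is kept, a traceback could recover which residues were paired, which is the kind of residue-level interpretability that pooled (whole-sequence) embeddings cannot offer.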
7.
Nat Methods ; 21(5): 793-797, 2024 May.
Article in English | MEDLINE | ID: mdl-38509328

ABSTRACT

SQANTI3 is a tool designed for the quality control, curation and annotation of long-read transcript models obtained with third-generation sequencing technologies. Leveraging its annotation framework, SQANTI3 calculates quality descriptors of transcript models, junctions and transcript ends. With this information, potential artifacts can be identified and replaced with reliable sequences. Furthermore, the integrated functional annotation feature enables subsequent functional iso-transcriptomics analyses.


Subject(s)
Molecular Sequence Annotation; Transcriptome; Humans; Molecular Sequence Annotation/methods; Software; Gene Expression Profiling/methods; Sequence Analysis, RNA/methods; Protein Isoforms/genetics; High-Throughput Nucleotide Sequencing/methods
8.
J Mol Biol ; 436(4): 168416, 2024 02 15.
Article in English | MEDLINE | ID: mdl-38143020

ABSTRACT

Neuropeptides act not only through the nervous system; some of them also act peripherally. They are important in the regulation of numerous physiological processes, including growth, reproduction, social behavior, inflammation, fluid homeostasis, cardiovascular function, and energy homeostasis. The various roles of neuropeptides make them promising candidates for prospective therapeutics of different diseases. NeuroPep has now been updated to version 2.0 and holds 11,417 unique neuropeptide entries, nearly double the number in the first version of NeuroPep. When available, we collected information about the receptor for each neuropeptide entry and predicted the 3D structures of those neuropeptides without known experimental structure using AlphaFold2 or APPTEST according to the peptide sequence length. In addition, DeepNeuropePred and NeuroPred-PLM, two neuropeptide prediction tools developed by us recently, were also integrated into NeuroPep 2.0 to facilitate the identification of new neuropeptides. NeuroPep 2.0 is freely accessible at https://isyslab.info/NeuroPepV2/.


Subject(s)
Databases, Protein; Molecular Sequence Annotation; Neuropeptides; Amino Acid Sequence; Neuropeptides/chemistry; Molecular Sequence Annotation/methods
9.
J Biol Chem ; 299(9): 105130, 2023 09.
Article in English | MEDLINE | ID: mdl-37543366

ABSTRACT

Long noncoding RNAs (lncRNAs) are increasingly being recognized as modulators in various biological processes. However, due to their low expression, their systematic characterization is difficult to determine. Here, we performed transcript annotation by a newly developed computational pipeline, termed RNA-seq and small RNA-seq combined strategy (RSCS), in a wide variety of cellular contexts. Thousands of high-confidence potential novel transcripts were identified by the RSCS, and the reliability of the transcriptome was verified by analysis of transcript structure, base composition, and sequence complexity. Evidenced by the length comparison, the frequency of the core promoter and the polyadenylation signal motifs, and the locations of transcription start and end sites, the transcripts appear to be full length. Furthermore, taking advantage of our strategy, we identified a large number of endogenous retrovirus-associated lncRNAs, and a novel endogenous retrovirus-lncRNA that was functionally involved in control of Yap1 expression and essential for early embryogenesis was identified. In summary, the RSCS can generate a more complete and precise transcriptome, and our findings greatly expanded the transcriptome annotation for the mammalian community.


Subject(s)
Molecular Sequence Annotation; RNA, Long Noncoding; RNA-Seq; Animals; Embryonic Development/genetics; Mammals/embryology; Mammals/genetics; Molecular Sequence Annotation/methods; Promoter Regions, Genetic/genetics; Reproducibility of Results; Retroviridae/genetics; RNA, Long Noncoding/genetics; RNA-Seq/methods; Transcription Initiation Site; Transcriptome/genetics; YAP-Signaling Proteins/genetics; YAP-Signaling Proteins/metabolism
10.
Genome Biol ; 24(1): 135, 2023 06 08.
Article in English | MEDLINE | ID: mdl-37291671

ABSTRACT

BACKGROUND: In every living species, the function of a protein depends on its organization of structural domains, and the length of a protein is a direct reflection of this. Because every species evolved under different evolutionary pressures, the protein length distribution, much like other genomic features, is expected to vary across species but has so far been scarcely studied. RESULTS: Here we evaluate this diversity by comparing protein length distribution across 2326 species (1688 bacteria, 153 archaea, and 485 eukaryotes). We find that proteins tend to be on average slightly longer in eukaryotes than in bacteria or archaea, but that the variation of length distribution across species is low, especially compared to the variation of other genomic features (genome size, number of proteins, gene length, GC content, isoelectric points of proteins). Moreover, most cases of atypical protein length distribution appear to be due to artifactual gene annotation, suggesting the actual variation of protein length distribution across species is even smaller. CONCLUSIONS: These results open the way for developing a genome annotation quality metric based on protein length distribution to complement conventional quality measures. Overall, our findings show that protein length distribution between living species is more uniform than previously thought. Furthermore, we also provide evidence for a universal selection on protein length, yet its mechanism and fitness effect remain intriguing open questions.


Subject(s)
Molecular Sequence Annotation; Proteins; Sequence Analysis, Protein; Amino Acid Sequence; Molecular Sequence Annotation/methods; Proteins/chemistry; Proteins/classification; Proteome; Sequence Analysis, Protein/methods; Eukaryota; Bacteria; Archaea
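
The abstract above suggests protein length distribution as a basis for an annotation quality metric. As a rough illustration of the first step, the sketch below computes per-protein lengths and summary statistics from FASTA-formatted text. The tiny proteome and the choice of summary statistics are assumptions for illustration; the study's actual metric is not specified in the abstract.

```python
import statistics

def fasta_lengths(fasta_text):
    """Return the sequence length of each record in FASTA-formatted text."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths

# Hypothetical three-protein "proteome"; the first sequence spans two lines.
toy_fasta = """>p1
MKT
AYI
>p2
MK
>p3
MKTAYIAKQR
"""

lengths = fasta_lengths(toy_fasta)
summary = {
    "n": len(lengths),
    "mean": statistics.mean(lengths),
    "median": statistics.median(lengths),
}
print(lengths, summary)
```

Comparing such summaries across genomes of related species is one simple way to flag an annotation whose length distribution looks atypical, for example because of many fragmented gene models.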
11.
Science ; 380(6643): eabn3107, 2023 04 28.
Article in English | MEDLINE | ID: mdl-37104600

ABSTRACT

Annotating coding genes and inferring orthologs are two classical challenges in genomics and evolutionary biology that have traditionally been approached separately, limiting scalability. We present TOGA (Tool to infer Orthologs from Genome Alignments), a method that integrates structural gene annotation and orthology inference. TOGA implements a different paradigm to infer orthologous loci, improves ortholog detection and annotation of conserved genes compared with state-of-the-art methods, and handles even highly fragmented assemblies. TOGA scales to hundreds of genomes, which we demonstrate by applying it to 488 placental mammal and 501 bird assemblies, creating the largest comparative gene resources so far. Additionally, TOGA detects gene losses, enables selection screens, and automatically provides a superior measure of mammalian genome quality. TOGA is a powerful and scalable method to annotate and compare genes in the genomic era.


Subject(s)
Eutheria; Genomics; Molecular Sequence Annotation; Animals; Female; Mice; Eutheria/genetics; Genome; Genomics/methods; Molecular Sequence Annotation/methods; Birds/genetics
12.
Science ; 379(6639): 1358-1363, 2023 03 31.
Article in English | MEDLINE | ID: mdl-36996195

ABSTRACT

Enzyme function annotation is a fundamental challenge, and numerous computational tools have been developed. However, most of these tools cannot accurately predict functional annotations, such as enzyme commission (EC) number, for less-studied proteins or those with previously uncharacterized functions or multiple activities. We present a machine learning algorithm named CLEAN (contrastive learning-enabled enzyme annotation) to assign EC numbers to enzymes with better accuracy, reliability, and sensitivity compared with the state-of-the-art tool BLASTp. The contrastive learning framework empowers CLEAN to confidently (i) annotate understudied enzymes, (ii) correct mislabeled enzymes, and (iii) identify promiscuous enzymes with two or more EC numbers, functions that we demonstrate by systematic in silico and in vitro experiments. We anticipate that this tool will be widely used for predicting the functions of uncharacterized enzymes, thereby advancing many fields, such as genomics, synthetic biology, and biocatalysis.


Subject(s)
Enzymes; Machine Learning; Molecular Sequence Annotation; Proteins; Sequence Analysis, Protein; Algorithms; Computational Biology; Enzymes/chemistry; Genomics; Proteins/chemistry; Reproducibility of Results; Molecular Sequence Annotation/methods; Sequence Analysis, Protein/methods; Biocatalysis
13.
Sci Rep ; 13(1): 1417, 2023 01 25.
Article in English | MEDLINE | ID: mdl-36697464

ABSTRACT

We report here a new application, CustomProteinSearch (CusProSe), whose purpose is to help users to search for proteins of interest based on their domain composition. The application is customizable. It consists of two independent tools, IterHMMBuild and ProSeCDA. IterHMMBuild allows the iterative construction of Hidden Markov Model (HMM) profiles for conserved domains of selected protein sequences, while ProSeCDA scans a proteome of interest against an HMM profile database, and annotates identified proteins using user-defined rules. CusProSe was successfully used to identify, in fungal genomes, genes encoding key enzyme families involved in secondary metabolism, such as polyketide synthases (PKS), non-ribosomal peptide synthetases (NRPS), hybrid PKS-NRPS and dimethylallyl tryptophan synthases (DMATS), as well as to characterize distinct terpene synthases (TS) sub-families. The highly configurable characteristics of this application make it a generic tool, which allows the user to refine the function of predicted proteins and to extend detection to new enzyme families, and which may also be applied to biological systems other than fungi and to proteins other than those involved in secondary metabolism.


Subject(s)
Fungi; Molecular Sequence Annotation; Secondary Metabolism; Software; Amino Acid Sequence; Molecular Sequence Annotation/methods; Peptide Synthases/genetics; Polyketide Synthases/genetics; Secondary Metabolism/genetics; Fungi/enzymology; Fungi/genetics; Tryptophan Synthase/genetics; Conserved Sequence/genetics
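
The abstract above describes annotating proteins from their domain composition using user-defined rules. The sketch below shows the general shape of such rule matching. It is not ProSeCDA's rule syntax or logic: the rule format (a set of required domains per class), the domain abbreviations, and the protein identifiers are hypothetical simplifications.

```python
# Hypothetical rules: a class name and the domains that must all be present.
# Order matters: the most specific rule is listed first.
RULES = [
    ("hybrid PKS-NRPS", {"KS", "AT", "A", "C"}),
    ("PKS", {"KS", "AT"}),
    ("NRPS", {"A", "C"}),
]

def annotate(domains, rules=RULES):
    """Return the first rule name whose required domains are all present."""
    found = set(domains)
    for name, required in rules:
        if required <= found:  # subset test: every required domain was detected
            return name
    return "unclassified"

# Hypothetical domain hits per protein, e.g. as reported by an HMM profile scan.
proteome_hits = {
    "prot1": ["KS", "AT", "ACP"],
    "prot2": ["A", "PCP", "C"],
    "prot3": ["KS", "AT", "ACP", "C", "A", "PCP"],
    "prot4": ["ACP"],
}

annotations = {protein: annotate(hits) for protein, hits in proteome_hits.items()}
print(annotations)
```

Keeping the rules as data rather than code is what makes such a tool adaptable: extending detection to a new enzyme family only requires adding a rule, not changing the matching logic.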
14.
Nucleic Acids Res ; 50(W1): W57-W65, 2022 07 05.
Article in English | MEDLINE | ID: mdl-35640593

ABSTRACT

The Annotation Query (AnnoQ) (http://annoq.org/) is designed to provide comprehensive and up-to-date functional annotations for human genetic variants. The system is supported by an annotation database with ∼39 million human variants from the Haplotype Reference Consortium (HRC) pre-annotated with sequence feature annotations by WGSA and functional annotations to Gene Ontology (GO) and pathways in PANTHER. The database operates on an optimized Elasticsearch framework to support real-time complex searches. This implementation enables users to annotate data with the most up-to-date functional annotations via simple queries instead of setting up individual tools. A web interface allows users to interactively browse the annotations, annotate variants and search variant data. Its easy-to-use interface and search capabilities are well-suited for scientists with fewer bioinformatics skills such as bench scientists and statisticians. AnnoQ also has an API for users to access and annotate the data programmatically. Packages for programming languages, such as the R package, are available for users to embed the annotation queries in their scripts. AnnoQ serves researchers with a wide range of backgrounds and research interests as an integrated annotation platform.


Subject(s)
Genetic Variation; Molecular Sequence Annotation; Software; Humans; Databases, Genetic; Internet; Molecular Sequence Annotation/methods; User-Computer Interface; Genetic Variation/genetics; Haplotypes/genetics; Programming Languages
15.
Gene ; 807: 145952, 2022 Jan 10.
Article in English | MEDLINE | ID: mdl-34500049

ABSTRACT

Extreme temperature is one of the serious threats to crop production in present and future scenarios of global climate change. Lentil (Lens culinaris) is an important crop, and there is a serious lack of genetic information regarding its responses to environmental and temperature stresses. This study is the first report evaluating key genes and molecular mechanisms related to temperature stresses in lentil using the RNA sequencing technique. De novo transcriptome assembly created 44,673 contigs, and differential gene expression analysis revealed 7494 differentially expressed genes between the temperature stress and control groups. Basic annotation of the transcriptome assembly generated in our study led to the identification of 2765 novel transcripts that have not yet been identified in lentil genome draft v1.2. In addition, several unigenes involved in mechanisms of temperature sensing, calcium and hormone signaling and DNA-binding transcription factor activity were identified. Also, common mechanisms in response to temperature stresses, including proline biosynthesis, balancing of the photosynthetic light reactions, chaperone activity and circadian rhythms, were determined from the hub genes through protein-protein interaction network analysis. Deciphering the mechanisms of extreme temperature tolerance would be a new way of developing crops with enhanced plasticity against climate change. In general, this study has identified a set of mechanisms and various genes related to cold and heat stresses which will be useful for a better understanding of the lentil's reaction to temperature stresses.


Subject(s)
Lens Plant/growth & development; Lens Plant/genetics; Stress, Physiological/genetics; Climate Change; Cold Temperature/adverse effects; Cold-Shock Response/genetics; Crops, Agricultural/genetics; Gene Expression Profiling/methods; Gene Expression Regulation, Plant/genetics; Heat-Shock Response/genetics; Heat-Shock Response/physiology; Hot Temperature/adverse effects; Molecular Sequence Annotation/methods; Photosynthesis; Protein Interaction Maps/genetics; Temperature; Transcriptome/genetics
16.
Gene ; 808: 145996, 2022 Jan 15.
Article in English | MEDLINE | ID: mdl-34634440

ABSTRACT

Russula griseocarnosa is a well-known ectomycorrhizal mushroom, which is mainly distributed in southern China. Although several scholars have attempted to isolate and cultivate fungal strains, no accurate method for culture of artificial fruiting bodies has been presented owing to difficulties associated with mycelium growth on artificial media. Herein, we sequenced the R. griseocarnosa genome using second- and third-generation sequencing technologies, followed by de novo assembly of high-throughput sequencing reads, and GeneMark-ES, BLAST, CAZy, and other databases were utilized for functional gene annotation. We also constructed a phylogenetic tree using different species of fungi, and conducted comparative genomics analysis of R. griseocarnosa against its four representative species. In addition, we evaluated the accuracy of one already sequenced genome of R. griseocarnosa based on the internal transcribed spacer (ITS) sequencing of that type of species. The assembly process resulted in identification of 230 scaffolds with a total genome size of 50.67 Mbp. The gene prediction showed that the R. griseocarnosa genome included 14,229 coding sequences (CDSs). In addition, 470 RNAs were predicted with 155 transfer RNAs (tRNAs), 49 ribosomal RNAs (rRNAs), 41 small noncoding RNAs (sRNAs), 42 small nuclear RNAs (snRNAs), and 183 microRNAs (miRNAs). The predicted protein sequences of R. griseocarnosa were analyzed to indicate the existence of carbohydrate-active enzymes (CAZymes), and the results revealed that 153 genes encoded CAZymes, which were distributed in 58 CAZyme families. These enzymes included 78 glycoside hydrolases (GHs), 34 glycosyl transferases (GTs), 30 auxiliary activities (AAs), 2 carbohydrate esterases (CEs), 8 carbohydrate-binding modules (CBMs), and only one polysaccharide lyase (PL). Compared with other fungi, R. griseocarnosa had fewer CAZymes, and the number and distribution of CAZymes were similar to other mycorrhizal fungi, such as Tricholoma matsutake and Suillus luteus. Well-defined effector proteins that were associated with mycorrhiza-induced small-secreted proteins (MiSSPs) were not found in R. griseocarnosa, which indicated that there may be some special effector proteins to interact with host plants in R. griseocarnosa. The genome of R. griseocarnosa may provide new insights into the energy metabolism of ectomycorrhizal (ECM) fungi, a reference to study ecosystem and evolutionary diversification of R. griseocarnosa, as well as promoting the study of artificial domestication.


Subject(s)
Basidiomycota/genetics; Basidiomycota/metabolism; Agaricales/genetics; China; Genome, Fungal/genetics; Genomics/methods; Molecular Sequence Annotation/methods; Mycorrhizae/genetics; Mycorrhizae/metabolism; Phylogeny; Whole Genome Sequencing/methods
17.
Genomics Proteomics Bioinformatics ; 20(3): 455-465, 2022 06.
Article in English | MEDLINE | ID: mdl-34954426

ABSTRACT

The genetic basis of human infertility is currently under intensive investigation. However, only a handful of genes have been validated in animal models as disease-causing genes in infertile men. Thus, to better understand the genetic basis of human spermatogenesis and bridge the knowledge gap between humans and other animal species, we constructed FertilityOnline, a database integrating the literature-curated functional genes during spermatogenesis into an existing spermatogenic database, SpermatogenesisOnline 1.0. Additional features, including the functional annotation and genetic variants of human genes, are also incorporated into FertilityOnline. By searching this database, users can browse the functional genes involved in spermatogenesis and instantly narrow down the number of candidates of genetic mutations underlying male infertility in a user-friendly web interface. Clinical application of this database was exemplified by the identification of novel causative mutations in synaptonemal complex central element protein 1 (SYCE1) and stromal antigen 3 (STAG3) in azoospermic men. In conclusion, FertilityOnline is not only an integrated resource for spermatogenic genes but also a useful tool facilitating the exploration of the genetic basis of male infertility. FertilityOnline can be freely accessed at http://mcg.ustc.edu.cn/bsc/spermgenes2.0/index.html.


Subject(s)
DNA Mutational Analysis; Databases, Genetic; Infertility, Male; Molecular Sequence Annotation; Spermatogenesis; Humans; Male; Cell Cycle Proteins/genetics; Infertility, Male/genetics; Molecular Sequence Annotation/methods; Mutation; DNA Mutational Analysis/methods; Spermatogenesis/genetics; Online Systems
18.
PLoS Biol ; 19(12): e3001464, 2021 12.
Article in English | MEDLINE | ID: mdl-34871295

ABSTRACT

The UniProt knowledgebase is a public database for protein sequence and function, covering the tree of life and over 220 million protein entries. Now, the whole community can use a new crowdsourcing annotation system to help scale up UniProt curation and receive proper attribution for their biocuration work.


Subject(s)
Crowdsourcing/methods; Data Curation/methods; Molecular Sequence Annotation/methods; Amino Acid Sequence/genetics; Computational Biology/methods; Databases, Protein/trends; Humans; Literature; Proteins/metabolism; Stakeholder Participation
20.
Genes (Basel) ; 12(10)2021 10 19.
Article in English | MEDLINE | ID: mdl-34681040

ABSTRACT

Functional annotation allows adding biologically relevant information to predicted features in genomic sequences, and it is, therefore, an important procedure of any de novo genome sequencing project. It is also useful for proofreading and improving gene structural annotation. Here, we introduce FA-nf, a pipeline implemented in Nextflow, a versatile computational workflow management engine. The pipeline integrates different annotation approaches, such as NCBI BLAST+, DIAMOND, InterProScan, and KEGG. It starts from a protein sequence FASTA file and, optionally, a structural annotation file in GFF format, and produces several files, such as GO assignments, output summaries of the abovementioned programs and final annotation reports. The pipeline can be broken easily into smaller processes for the purpose of parallelization and easily deployed in a Linux computational environment, thanks to software containerization, thus helping to ensure full reproducibility.


Subject(s)
Genome/genetics; Molecular Sequence Annotation/methods; Software; Chromosome Mapping; Computational Biology; Genomics