1.
Sci Data ; 11(1): 577, 2024 Jun 04.
Article En | MEDLINE | ID: mdl-38834611

Solanum pimpinellifolium, the closest wild relative of the domesticated tomato, has high potential for use in breeding programs aimed at developing multi-pathogen resistance and quality improvement. We generated a chromosome-level genome assembly of S. pimpinellifolium LA1589, with a size of 833 Mb and a contig N50 of 31 Mb. We anchored 98.80% of the contigs into 12 pseudo-chromosomes, and identified 74.47% of the sequences as repetitive sequences. The genome evaluation revealed BUSCO and LAI scores of 98.3% and 14.49, respectively, indicating the high quality of this assembly. A total of 41,449 protein-coding genes were predicted in the genome, of which 89.17% were functionally annotated. This high-quality genome assembly serves as a valuable resource for accelerating the biological discovery and molecular breeding of this important horticultural crop.


Chromosomes, Plant , Genome, Plant , Solanum , Solanum/genetics , Molecular Sequence Annotation
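
The contig N50 quoted in the record above is a standard contiguity statistic: the length L such that contigs of length L or longer together cover at least half of the assembly. A minimal sketch of the calculation follows; the toy contig lengths are hypothetical and are not taken from the LA1589 assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy contig lengths (hypothetical, for illustration only).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # prints 70: 80 + 70 = 150, half of the 300 total
```

The same function applies to scaffold lengths to obtain a scaffold N50.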
2.
Sci Data ; 11(1): 576, 2024 Jun 04.
Article En | MEDLINE | ID: mdl-38834644

Exopalaemon carinicauda, a eurythermal and euryhaline shrimp, contributes one third of the total biomass production of polyculture ponds in eastern China and is considered a potentially ideal experimental animal for research on crustaceans. We generated a high-quality chromosome-level genome assembly of E. carinicauda by combining PacBio HiFi and Hi-C sequencing data. The total assembly size was 5.86 Gb, with a contig N50 of 235.52 kb and a scaffold N50 of 138.24 Mb. Approximately 95.29% of the assembled sequences were anchored onto 45 pseudochromosomes. BUSCO analysis revealed that 92.89% of 1,013 single-copy genes were highly conserved orthologs. A total of 44,288 protein-coding genes were predicted, of which 70.53% were functionally annotated. Given its high heterozygosity (2.62%) and large proportion of repeat sequences (71.49%), it is one of the most complex genome assemblies. This chromosome-scale genome will be a valuable resource for future molecular breeding and functional genomics research on E. carinicauda.


Chromosomes , Genome , Palaemonidae , Animals , Palaemonidae/genetics , China , Molecular Sequence Annotation
3.
BMC Genomics ; 25(1): 552, 2024 Jun 03.
Article En | MEDLINE | ID: mdl-38825700

BACKGROUND: The disputed phylogenetic position of Aerides flabellata Rolfe ex Downie, due to morphological overlaps with related species, was investigated based on evidence from complete chloroplast (cp) genomes. The structural characteristics of the complete cp genomes of A. flabellata and A. rosea Lodd. ex Lindl. & Paxton were analyzed and compared with those of six related species in the "Vanda-Aerides alliance" to provide genomic information on taxonomy and phylogeny. RESULTS: The cp genomes of A. flabellata and A. rosea exhibited conserved quadripartite structures, 148,145 bp and 147,925 bp in length, with similar GC content (36.7 ~ 36.8%). Gene annotations revealed 110 single-copy genes, 18 duplicated in inverted regions, and ten with introns. Comparative analysis across related species confirmed stable sequence identity and higher variation in single-copy regions. However, there are notable differences in the IR regions between the two Aerides Lour. species and the other six related species. The phylogenetic analysis based on CDS from complete cp genomes indicated that Aerides species except A. flabellata formed a monophyletic clade nested in the subtribe Aeridinae, being a sister group to Renanthera Lour., consistent with previous studies. Meanwhile, a separate clade consisting of A. flabellata and six Vanda R. Br. species was formed, as a sister taxon to Holcoglossum Schltr. CONCLUSIONS: This is the first report on the complete cp genome of A. flabellata. The results provide insights into plastome evolution and the phylogenetic relationships of Aerides. The phylogenetic analysis based on complete cp genomes showed that A. flabellata should be placed in Vanda rather than in Aerides.


Genome, Chloroplast , Orchidaceae , Phylogeny , Orchidaceae/genetics , Orchidaceae/classification , Base Composition , Molecular Sequence Annotation
4.
BMC Genomics ; 25(1): 560, 2024 Jun 05.
Article En | MEDLINE | ID: mdl-38840265

BACKGROUND: Nitzschia closterium f. minutissima is a commonly available diatom that plays important roles in marine aquaculture. It was originally classified as Nitzschia (Bacillariaceae, Bacillariophyta) but is currently regarded as a heterotypic synonym of Phaeodactylum tricornutum. The aim of this study was to obtain the draft genome of the marine microalga N. closterium f. minutissima to understand its phylogenetic placement and evolutionary specialization. Given that the ornate hierarchical silicified cell walls (frustules) of diatoms have immense applications in nanotechnology for biomedical fields, biosensors and optoelectric devices, transcriptomic data were generated by using reference genome-based read mapping to identify significantly differentially expressed genes (DEGs) and elucidate the molecular processes involved in diatom biosilicification. RESULTS: In this study, we generated 13.81 Gb of pass reads from the PromethION sequencer. The draft genome of N. closterium f. minutissima has a total length of 29.28 Mb, and contains 28 contigs with an N50 value of 1.23 Mb. The GC content was 48.55%, and approximately 18.36% of the genome assembly contained repeat sequences. Gene annotation revealed 9,132 protein-coding genes. The results of comparative genomic analysis showed that N. closterium f. minutissima was clustered as a sister lineage of Phaeodactylum tricornutum and the divergence time between them was estimated to be approximately 17.2 million years ago (Mya). CAFF analysis demonstrated that 220 gene families that significantly changed were unique to N. closterium f. minutissima and that 154 were specific to P. tricornutum; moreover, only 26 gene families overlapped between these two species. A total of 818 DEGs in response to silicon were identified in N. closterium f. minutissima through RNA sequencing; these genes are involved in various molecular processes such as transcription regulator activity. Several genes encoding proteins, including silicon transporters, heat shock factors, methyltransferases, ankyrin repeat domains, cGMP-mediated signaling pathways-related proteins, cytoskeleton-associated proteins, polyamines, glycoproteins and saturated fatty acids may contribute to the formation of frustules in N. closterium f. minutissima. CONCLUSIONS: Here, we described a draft genome of N. closterium f. minutissima and compared it with those of eight other diatoms, which provided new insight into its evolutionary features. Transcriptome analysis to identify DEGs in response to silicon will help to elucidate the underlying molecular mechanism of diatom biosilicification in N. closterium f. minutissima.


Diatoms , Gene Expression Profiling , Phylogeny , Diatoms/genetics , Diatoms/metabolism , Diatoms/classification , Genome , Transcriptome , Molecular Sequence Annotation
5.
Brief Bioinform ; 25(4), 2024 May 23.
Article En | MEDLINE | ID: mdl-38842510

Accurate and comprehensive annotation of microprotein-coding small open reading frames (smORFs) is critical to our understanding of normal physiology and disease. Empirical identification of translated smORFs is carried out primarily using ribosome profiling (Ribo-seq). While effective, published Ribo-seq datasets can vary drastically in quality and different analysis tools are frequently employed. Here, we examine the impact of these factors on identifying translated smORFs. We compared five commonly used software tools that assess open reading frame translation from Ribo-seq (RibORFv0.1, RibORFv1.0, RiboCode, ORFquant, and Ribo-TISH) and found surprisingly low agreement across all tools. Only ~2% of smORFs were called translated by all five tools, and ~15% by three or more tools when assessing the same high-resolution Ribo-seq dataset. For larger annotated genes, the same analysis showed ~74% agreement across all five tools. We also found that some tools are strongly biased against low-resolution Ribo-seq data, while others are more tolerant. Analyzing Ribo-seq coverage revealed that smORFs detected by more than one tool tend to have higher translation levels and higher fractions of in-frame reads, consistent with what was observed for annotated genes. Together these results support employing multiple tools to identify the most confident microprotein-coding smORFs and choosing the tools based on the quality of the dataset and the planned downstream characterization experiments of the predicted smORFs.


Open Reading Frames , Software , Ribosomes/metabolism , Ribosomes/genetics , Molecular Sequence Annotation/methods , Humans , Protein Biosynthesis , Computational Biology/methods , Ribosome Profiling
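
The cross-tool agreement figures in the record above can be understood as counting, for each candidate ORF, how many tools call it translated. The sketch below shows one way to compute such fractions. It assumes the denominator is the union of all calls, which the abstract does not specify, and the tool names and ORF identifiers are hypothetical.

```python
from collections import Counter


def agreement_summary(calls_by_tool, min_tools=3):
    """Summarize how many ORFs are called translated by several tools.

    calls_by_tool maps a tool name to the set of ORF identifiers it calls
    translated. Fractions are relative to the union of all calls
    (an assumption; other denominators are possible).
    """
    support = Counter()
    for orfs in calls_by_tool.values():
        support.update(set(orfs))  # each tool counts at most once per ORF
    n_tools = len(calls_by_tool)
    total = len(support)
    if total == 0:
        return {"all_tools": 0.0, "min_tools": 0.0}
    by_all = sum(1 for n in support.values() if n == n_tools)
    by_min = sum(1 for n in support.values() if n >= min_tools)
    return {"all_tools": by_all / total, "min_tools": by_min / total}


# Hypothetical calls from three tools on six candidate smORFs.
toy_calls = {
    "tool_a": {"orf1", "orf2", "orf3", "orf4"},
    "tool_b": {"orf1", "orf2", "orf5"},
    "tool_c": {"orf1", "orf3", "orf6"},
}
result = agreement_summary(toy_calls, min_tools=2)
print(result)
```

In this toy input only orf1 is supported by all three tools, while orf1, orf2 and orf3 are each supported by at least two.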
6.
Sci Data ; 11(1): 568, 2024 Jun 01.
Article En | MEDLINE | ID: mdl-38824125

Technological advances in massively parallel sequencing have led to an exponential growth in the number of known protein sequences. Much of this growth originates from metagenomic projects producing new sequences from environmental and clinical samples. The Unified Human Gastrointestinal Proteome (UHGP) catalogue is one of the most relevant metagenomic datasets with applications ranging from medicine to biology. However, the low levels of sequence annotation may impair its usability. This work aims to produce a family classification of UHGP sequences to facilitate downstream structural and functional annotation. This is achieved through the release of the DPCfam-UHGP50 dataset containing 10,778 putative protein families generated using DPCfam clustering, an unsupervised pipeline grouping sequences into single or multi-domain architectures. DPCfam-UHGP50 considerably improves family coverage at protein and residue levels compared to the manually curated repository Pfam. In the hope that DPCfam-UHGP50 will foster future discoveries in the field of metagenomics of the human gut, we release a FAIR-compliant database of our results that is easily accessible via a searchable web server and Zenodo repository.


Proteome , Humans , Gastrointestinal Tract/metabolism , Cluster Analysis , Molecular Sequence Annotation , Metagenomics , Databases, Protein
7.
BMC Genom Data ; 25(1): 53, 2024 Jun 06.
Article En | MEDLINE | ID: mdl-38844844

OBJECTIVES: The new data provide an important genomic resource for the Critically Endangered Cuban crocodile (Crocodylus rhombifer). Cuban crocodiles are restricted to the Zapata Swamp in southern Matanzas Province, Cuba, and readily hybridize with the widespread American crocodile (Crocodylus acutus) in areas of sympatry. The reported de novo assembly will contribute to studies of crocodylian evolutionary history and provide a resource for informing Cuban crocodile conservation. DATA DESCRIPTION: The final 2.2 Gb draft genome for C. rhombifer consists of 41,387 scaffolds (contig N50 = 104.67 kb; scaffold N50 = 518.55 kb). Benchmarking Universal Single-Copy Orthologs (BUSCO) identified 92.3% of the 3,354 genes in the vertebrata_odb10 database. Approximately 42% of the genome (960 Mbp) comprises repeat elements. We predicted 30,138 unique protein-coding sequences (17,737 unique genes) in the genome assembly. Functional annotation found the top Gene Ontology annotations for Biological Processes, Molecular Function, and Cellular Component were regulation, protein, and intracellular, respectively. This assembly will support future macroevolutionary, conservation, and molecular studies of the Cuban crocodile.


Alligators and Crocodiles , Genome , Molecular Sequence Annotation , Alligators and Crocodiles/genetics , Animals , Genome/genetics , Cuba , Genomics/methods
8.
BMC Genomics ; 25(1): 546, 2024 Jun 01.
Article En | MEDLINE | ID: mdl-38824587

BACKGROUND: Purple flowering stalk (Brassica rapa var. purpuraria) is a widely cultivated plant with high nutritional and medicinal value that exhibits strong adaptability during growth. Mitochondria (mt) play an important role in energy production in plant cells and have an independent genetic system. It is therefore meaningful to assemble and annotate the mt genomes of plants independently. Although there have been several reports on the mt genomes of Brassica species, the mt genome of B. rapa var. purpuraria and its functional gene variations relative to closely related species have not yet been addressed. RESULTS: The mt genome of B. rapa var. purpuraria was assembled using the Illumina and Nanopore sequencing platforms, revealing a length of 219,775 bp and a typical circular structure. The base composition of the whole B. rapa var. purpuraria mt genome was A (27.45%), T (27.31%), C (22.91%), and G (22.32%). A total of 59 functional genes, comprising 33 protein-coding genes (PCGs), 23 tRNA genes, and 3 rRNA genes, were annotated. Sequence repeats, codon usage, RNA editing, nucleotide diversity and gene transfer between the chloroplast (cp) genome and the mt genome were examined in the B. rapa var. purpuraria mt genome. Phylogenetic analysis showed that B. rapa var. purpuraria was closely related to B. rapa subsp. oleifera and B. juncea. Ka/Ks analysis indicated that most of the PCGs in B. rapa var. purpuraria were under negative selection, illustrating that these mt genes were conserved during evolution. CONCLUSIONS: Our findings provide valuable information on the B. rapa var. purpuraria mt genome, which might facilitate molecular breeding, genetic variation and evolutionary research on Brassica species in the future.


Brassica rapa , Genome, Mitochondrial , Phylogeny , Brassica rapa/genetics , Molecular Sequence Annotation , Genome, Plant , RNA, Transfer/genetics , Base Composition
9.
Curr Protoc ; 4(5): e1046, 2024 May.
Article En | MEDLINE | ID: mdl-38717471

Whole-genome sequencing is widely used to investigate population genomic variation in organisms of interest. Assorted tools have been independently developed to call variants from short-read sequencing data aligned to a reference genome, including single nucleotide polymorphisms (SNPs) and structural variations (SVs). We developed SNP-SVant, an integrated, flexible, and computationally efficient bioinformatic workflow that predicts high-confidence SNPs and SVs in organisms without benchmarked variants, which are traditionally used for distinguishing sequencing errors from real variants. In the absence of these benchmarked datasets, we leverage multiple rounds of statistical recalibration to increase the precision of variant prediction. The SNP-SVant workflow is flexible, with user options to trade off accuracy against sensitivity. The workflow predicts SNPs and small insertions and deletions using the Genome Analysis ToolKit (GATK) and predicts SVs using the Genome Rearrangement IDentification Software Suite (GRIDSS), and it culminates in variant annotation using custom scripts. A key utility of SNP-SVant is its scalability. Variant calling is a computationally expensive procedure, and thus, SNP-SVant uses a workflow management system with intermediary checkpoint steps to ensure efficient use of resources by minimizing redundant computations and omitting steps where dependent files are available. SNP-SVant also provides metrics to assess the quality of called variants and converts between VCF and aligned FASTA format outputs to ensure compatibility with downstream tools to calculate selection statistics, which are commonplace in population genomics studies. By accounting for both small and large structural variants, users of this workflow can obtain a wide-ranging view of genomic alterations in an organism of interest. Overall, this workflow advances our capabilities in assessing the functional consequences of different types of genomic alterations, ultimately improving our ability to associate genotypes with phenotypes. © 2024 The Authors. Current Protocols published by Wiley Periodicals LLC. Basic Protocol: Predicting single nucleotide polymorphisms and structural variations. Support Protocol 1: Downloading publicly available sequencing data. Support Protocol 2: Visualizing variant loci using Integrated Genome Viewer. Support Protocol 3: Converting between VCF and aligned FASTA formats.


Polymorphism, Single Nucleotide , Software , Workflow , Polymorphism, Single Nucleotide/genetics , Computational Biology/methods , Genomics/methods , Molecular Sequence Annotation/methods , Whole Genome Sequencing/methods
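
The record above describes checkpoint steps that omit work whose output files already exist. The sketch below illustrates that general idea only; it is not SNP-SVant's code or its workflow manager, and `run_step`, `demo`, and the file name `variants.vcf` are hypothetical names chosen for illustration.

```python
import tempfile
from pathlib import Path


def run_step(name, output, action, log):
    """Run `action` only if its output file does not already exist."""
    output = Path(output)
    if output.exists():
        log.append(f"skip {name}")  # checkpoint hit: reuse the existing file
        return False
    action(output)
    log.append(f"run {name}")
    return True


def write_placeholder_vcf(path):
    """Stand-in for an expensive variant-calling step (hypothetical)."""
    path.write_text("##fileformat=VCFv4.2\n")


def demo():
    """Call the same step twice; the second call is skipped."""
    log = []
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "variants.vcf"
        run_step("call_variants", target, write_placeholder_vcf, log)
        run_step("call_variants", target, write_placeholder_vcf, log)
    return log


print(demo())
```

Production workflow managers typically also compare timestamps or checksums of inputs, which this sketch leaves out.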
10.
Sci Rep ; 14(1): 10520, 2024 May 08.
Article En | MEDLINE | ID: mdl-38714765

The hemibiotrophic Basidiomycete pathogen Ganoderma boninense (Gb) is the dominant causal agent of oil palm basal stem rot disease. Here, we report a complete chromosomal genome map of Gb using a combination of short-read Illumina and long-read Pacific Biosciences (PacBio) sequencing platforms combined with chromatin conformation capture data from the Chicago and Hi-C platforms. The genome was 55.87 Mb in length and assembled to a high contiguity (N50: 304.34 kb) of 12 chromosomes built from 112 scaffolds, with a total of only 4.34 Mb (~ 7.77%) remaining unplaced. The final assemblies were evaluated for genome completeness using Benchmarking Universal Single Copy Orthologs (BUSCO) v4.1.4. Of the 4464 BUSCOs searched in the Polyporales group, the assemblies recovered 4264 (95.52%) of the conserved orthologs as complete, with only 42 (0.94%) fragmented and 158 (3.53%) missing. Genome annotation predicted a total of 21,074 coding genes, with a GC content of 59.2%. The genome features were analyzed with different databases, which revealed 2471 Gene Ontology/GO (11.72%), 5418 KEGG (Kyoto Encyclopedia of Genes and Genomes) Orthologous/KO (25.71%), 13,913 Cluster of Orthologous Groups of proteins/COG (66.02%), 60 ABC transporter (0.28%), 1049 Carbohydrate-Active Enzymes/CAZy (4.98%), 4005 pathogen-host interactions/PHI (19%), and 515 fungal transcription factor/FTFD (2.44%) genes. The results of this study provide a genomic resource for further studies of Gb.


Arecaceae , Ganoderma , Genome, Fungal , Plant Diseases , Whole Genome Sequencing , Ganoderma/genetics , Whole Genome Sequencing/methods , Plant Diseases/microbiology , Arecaceae/microbiology , Arecaceae/genetics , Molecular Sequence Annotation
11.
BMC Genomics ; 25(1): 430, 2024 May 01.
Article En | MEDLINE | ID: mdl-38693501

BACKGROUND: Although multiple chicken genomes have been assembled and annotated, the numbers of protein-coding genes in chicken genomes and their variation among breeds are still uncertain due to the low quality of these genome assemblies and limited resources used in their gene annotations. To fill these gaps, we recently assembled genomes of four indigenous chicken breeds with distinct traits at chromosome-level. In this study, we annotated genes in each of these assembled genomes using a combination of RNA-seq- and homology-based approaches. RESULTS: We identified varying numbers (17,497-17,718) of protein-coding genes in the four indigenous chicken genomes, while recovering 51 of the 274 "missing" genes in birds in general, and 36 of the 174 "missing" genes in chickens in particular. Intriguingly, based on deeply sequenced RNA-seq data collected in multiple tissues in the four breeds, we found 571-627 protein-coding genes in each genome that were missing in the annotations of the reference chicken genomes (GRCg6a and GRCg7b/w). After removing redundancy, we ended up with a total of 1,420 newly annotated genes (NAGs). The NAGs tend to be found in subtelomeric regions of macro-chromosomes (chr1 to chr5, plus chrZ) and middle chromosomes (chr6 to chr13, plus chrW), as well as in micro-chromosomes (chr14 to chr39) and unplaced contigs, where G/C contents are high. Moreover, the NAGs have elevated G-quadruplex frequencies, while both G/C contents and G-quadruplex frequencies in their surrounding regions are also high. The NAGs showed tissue-specific expression, and we were able to verify 39 (92.9%) of 42 randomly selected ones in various tissues of the four chicken breeds using RT-qPCR experiments. Most of the NAGs were also encoded in the reference chicken genomes; thus, these genomes might harbor more genes than previously thought. CONCLUSION: The NAGs are widely distributed in wild, indigenous and commercial chickens, and they might play critical roles in chicken physiology. Counting these new genes, chicken genomes harbor more genes than originally thought.


Chickens , Genome , Molecular Sequence Annotation , Animals , Chickens/genetics , Base Composition , Telomere/genetics , Chromosomes/genetics , Genomics/methods
12.
Brief Bioinform ; 25(3), 2024 Mar 27.
Article En | MEDLINE | ID: mdl-38706315

To date, more than 251 million proteins have been deposited in UniProtKB. However, only 0.25% have been annotated with one of the more than 15000 possible Pfam family domains. The current annotation protocol integrates knowledge from manually curated family domains, obtained using sequence alignments and hidden Markov models. This approach has been successful for automatically growing the Pfam annotations, although at a low rate in comparison to protein discovery. Just a few years ago, deep learning models were proposed for automatic Pfam annotation. However, these models demand a considerable amount of training data, which can be a challenge with poorly populated families. To address this issue, we propose and evaluate here a novel protocol based on transfer learning. This requires the use of protein large language models (LLMs), trained with self-supervision on big unannotated datasets in order to obtain sequence embeddings. Then, the embeddings can be used with supervised learning on a small and annotated dataset for a specialized task. In this protocol we have evaluated several cutting-edge protein LLMs together with machine learning architectures to improve the current prediction of protein domain annotations. Results are significantly better than the state of the art for protein family classification, reducing the prediction error by an impressive 60% compared to standard methods. We explain how LLM embeddings can be used for protein annotation in a concrete and easy way, and provide the pipeline in a GitHub repository. Full source code and data are available at https://github.com/sinc-lab/llm4pfam.


Databases, Protein , Proteins , Proteins/chemistry , Molecular Sequence Annotation/methods , Computational Biology/methods , Machine Learning
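
The transfer-learning idea in the record above has two stages: a pretrained model turns each sequence into an embedding vector, and a small supervised model is then fitted on labeled embeddings. The sketch below covers only the second stage and uses a nearest-centroid classifier as a deliberately simple stand-in; it is not one of the architectures evaluated in the paper. The two-dimensional vectors and the family labels `FAM_A` and `FAM_B` are hypothetical toy values, not real embeddings or Pfam accessions.

```python
import math
from collections import defaultdict


def fit_centroids(embeddings, labels):
    """Average the embedding vectors of each family (supervised step)."""
    sums = {}
    counts = defaultdict(int)
    for vector, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vector)
        sums[label] = [s + v for s, v in zip(sums[label], vector)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(centroids, vector):
    """Assign the family whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vector))


# Toy stand-ins for precomputed sequence embeddings (hypothetical values).
train_vectors = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
train_labels = ["FAM_A", "FAM_A", "FAM_B", "FAM_B"]

centroids = fit_centroids(train_vectors, train_labels)
print(predict(centroids, [0.2, 0.4]))
print(predict(centroids, [5.0, 6.0]))
```

Because the embedding model is frozen, only the small classifier has to be fitted, which is why the approach can work with poorly populated families.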
13.
Proc Natl Acad Sci U S A ; 121(23): e2403750121, 2024 Jun 04.
Article En | MEDLINE | ID: mdl-38805269

Haplotype-resolved genome assemblies were produced for Chasselas and Ugni Blanc, two heterozygous Vitis vinifera cultivars by combining high-fidelity long-read sequencing and high-throughput chromosome conformation capture (Hi-C). The telomere-to-telomere full coverage of the chromosomes allowed us to assemble separately the two haplo-genomes of both cultivars and revealed structural variations between the two haplotypes of a given cultivar. The deletions/insertions, inversions, translocations, and duplications provide insight into the evolutionary history and parental relationship among grape varieties. Integration of de novo single long-read sequencing of full-length transcript isoforms (Iso-Seq) yielded a highly improved genome annotation. Given its higher contiguity, and the robustness of the IsoSeq-based annotation, the Chasselas assembly meets the standard to become the annotated reference genome for V. vinifera. Building on these resources, we developed VitExpress, an open interactive transcriptomic platform, that provides a genome browser and integrated web tools for expression profiling, and a set of statistical tools (StatTools) for the identification of highly correlated genes. Implementation of the correlation finder tool for MybA1, a major regulator of the anthocyanin pathway, identified candidate genes associated with anthocyanin metabolism, whose expression patterns were experimentally validated as discriminating between black and white grapes. These resources and innovative tools for mining genome-related data are anticipated to foster advances in several areas of grapevine research.


Genome, Plant , Haplotypes , Transcriptome , Vitis , Vitis/genetics , Haplotypes/genetics , Transcriptome/genetics , Molecular Sequence Annotation/methods , Gene Expression Profiling/methods , Software
14.
Database (Oxford) ; 2024, 2024 May 07.
Article En | MEDLINE | ID: mdl-38713862

Germline and somatic mutations can give rise to proteins with altered activity, including both gain and loss-of-function. The effects of these variants can be captured in disease-specific reactions and pathways that highlight the resulting changes to normal biology. A disease reaction is defined as an aberrant reaction in which a variant protein participates. A disease pathway is defined as a pathway that contains a disease reaction. Annotation of disease variants as participants of disease reactions and disease pathways can provide a standardized overview of molecular phenotypes of pathogenic variants that is amenable to computational mining and mathematical modeling. Reactome (https://reactome.org/), an open source, manually curated, peer-reviewed database of human biological pathways, in addition to providing annotations for >11 000 unique human proteins in the context of ∼15 000 wild-type reactions within more than 2000 wild-type pathways, also provides annotations for >4000 disease variants of close to 400 genes as participants of ∼800 disease reactions in the context of ∼400 disease pathways. Functional annotation of disease variants proceeds from normal gene functions, described in wild-type reactions and pathways, through disease variants whose divergence from normal molecular behaviors has been experimentally verified, to extrapolation from molecular phenotypes of characterized variants to variants of unknown significance using criteria of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Reactome's data model enables mapping of disease variant datasets to specific disease reactions within disease pathways, providing a platform to infer pathway output impacts of numerous human disease variants and model organism orthologs, complementing computational predictions of variant pathogenicity. Database URL: https://reactome.org/.


Molecular Sequence Annotation , Phenotype , Humans , Databases, Genetic , Disease/genetics
15.
BMC Plant Biol ; 24(1): 417, 2024 May 17.
Article En | MEDLINE | ID: mdl-38760756

BACKGROUND: The Polygonaceae is a family well-known for its weeds, and edible plants, Fagopyrum (buckwheat) and Rheum (rhubarb), which are primarily herbaceous and temperate in distribution. Yet, the family also contains a number of lineages that are principally distributed in the tropics and subtropics. Notably, these lineages are woody, unlike their temperate relatives. To date, full-genome sequencing has focused on the temperate and herbaceous taxa. In an effort to increase breadth of genetic knowledge of the Polygonaceae, we here present six fully assembled and annotated chloroplast genomes from six of the tropical, woody genera: Coccoloba rugosa (a narrow and endangered Puerto Rican endemic), Gymnopodium floribundum, Neomillspaughia emarginata, Podopterus mexicanus, Ruprechtia coriacea, and Triplaris cumingiana. RESULTS: These assemblies represent the first publicly-available assembled and annotated plastomes for the genera Podopterus, Gymnopodium, and Neomillspaughia, and the first assembled and annotated plastomes for the species Coccoloba rugosa, Ruprechtia coriacea, and Triplaris cumingiana. We found the assembled chloroplast genomes to be above the median size of Polygonaceae plastomes, but otherwise exhibit features typical of the family. The features of greatest sequence variation are found among the ndh genes and in the small single copy (SSC) region of the plastome. The inverted repeats show high GC content and little sequence variation across genera. When placed in a phylogenetic context, our sequences were resolved within the Eriogonoideae. CONCLUSIONS: These six plastomes from among the tropical woody Polygonaceae appear typical within the family. The plastome assembly of Ruprechtia coriacea presented here calls into question the sequence identity of a previously published plastome assembly of R. albida.


Genome, Chloroplast , Polygonaceae , Polygonaceae/genetics , Polygonaceae/classification , Phylogeny , Molecular Sequence Annotation
16.
Protein Sci ; 33(6): e4988, 2024 Jun.
Article En | MEDLINE | ID: mdl-38757367

Identifying unknown functional properties of proteins is essential for understanding their roles in both health and disease states. The domain composition of a protein can reveal critical information in this context, as domains are structural and functional units that dictate how the protein should act at the molecular level. The expensive and time-consuming nature of wet-lab experimental approaches prompted researchers to develop computational strategies for predicting the functions of proteins. In this study, we proposed a new method called Domain2GO that infers associations between protein domains and function-defining gene ontology (GO) terms, thus redefining the problem as domain function prediction. Domain2GO uses documented protein-level GO annotations together with proteins' domain annotations. Co-annotation patterns of domains and GO terms in the same proteins are examined using statistical resampling to obtain reliable associations. As a use-case study, we evaluated the biological relevance of examples selected from the Domain2GO-generated domain-GO term mappings via literature review. Then, we applied Domain2GO to predict unknown protein functions by propagating domain-associated GO terms to proteins annotated with these domains. For function prediction performance evaluation and comparison against other methods, we employed Critical Assessment of Function Annotation 3 (CAFA3) challenge datasets. The results demonstrated the high potential of Domain2GO, particularly for predicting molecular function and biological process terms, along with advantages such as producing interpretable results and having an exceptionally low computational cost. The approach presented here can be extended to other ontologies and biological entities to investigate unknown relationships in complex and large-scale biological data. The source code, datasets, results, and user instructions for Domain2GO are available at https://github.com/HUBioDataLab/Domain2GO. 
Additionally, we offer a user-friendly online tool at https://huggingface.co/spaces/HUBioDataLab/Domain2GO, which simplifies the prediction of functions of previously unannotated proteins solely using amino acid sequences.


Molecular Sequence Annotation , Protein Domains , Proteins , Proteins/chemistry , Proteins/metabolism , Proteins/genetics , Databases, Protein , Computational Biology/methods , Gene Ontology , Humans , Software
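
The record above describes two steps: counting how often a domain and a GO term are annotated to the same protein, and then propagating domain-associated GO terms to proteins that carry those domains. The sketch below illustrates both steps in their simplest form. It omits the statistical resampling that Domain2GO uses to select reliable associations, and all protein, domain, and GO identifiers are hypothetical placeholders.

```python
from collections import Counter
from itertools import product


def co_annotation_counts(protein_domains, protein_go_terms):
    """Count how often each (domain, GO term) pair occurs in the same protein."""
    pair_counts = Counter()
    for protein, domains in protein_domains.items():
        go_terms = protein_go_terms.get(protein, set())
        pair_counts.update(product(set(domains), set(go_terms)))
    return pair_counts


def propagate(protein_domains, domain_to_go, protein):
    """Predict GO terms for a protein from the terms mapped to its domains."""
    predicted = set()
    for domain in protein_domains.get(protein, set()):
        predicted |= domain_to_go.get(domain, set())
    return predicted


# Hypothetical annotations: P3 has a domain but no GO terms yet.
toy_domains = {"P1": {"D1"}, "P2": {"D1", "D2"}, "P3": {"D2"}}
toy_go = {"P1": {"GO:A"}, "P2": {"GO:A", "GO:B"}}

counts = co_annotation_counts(toy_domains, toy_go)
print(counts[("D1", "GO:A")])

# A mapping that would normally be filtered statistically (assumed here).
toy_mapping = {"D2": {"GO:B"}}
print(propagate(toy_domains, toy_mapping, "P3"))
```

Raw counts alone favor frequent domains and frequent terms, which is why a significance filter of some kind is needed before a pair is treated as an association.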
17.
Sci Data ; 11(1): 523, 2024 May 22.
Article En | MEDLINE | ID: mdl-38778061

Remora albescens, also known as the white suckerfish, is recognized for its distinctive suction-cup attachment behavior and medicinal significance. In this study, we produced a high-quality chromosome-level genome assembly of R. albescens through the integration of 23.87 Gb PacBio long reads, 64.54 Gb T7 short reads, and 88.63 Gb Hi-C data. Initially, we constructed a contig-level genome assembly totaling 605.30 Mb with a contig N50 of 23.12 Mb. Subsequently, employing Hi-C technology, approximately 99.68% (603.38 Mb) of the contig-level genome was successfully assigned to 23 pseudo-chromosomes. Through the integration of homology-based predictions, ab initio predictions, and RNA-sequencing methods, we identified a comprehensive set of 22,445 protein-coding genes. Notably, 96.36% (21,629 genes) of these were annotated with functional information. The genome assembly achieved an estimated completeness of 98.1% according to BUSCO analysis. This work promotes the applicability of the R. albescens genome, laying a solid foundation for future investigations into genomics, biology, and medicinal importance within this species.


Chromosomes , Decapodiformes , Genome , Animals , Decapodiformes/genetics , Molecular Sequence Annotation
18.
BMC Genom Data ; 25(1): 48, 2024 May 23.
Article En | MEDLINE | ID: mdl-38783174

OBJECTIVES: Ottelia Pers. is in the Hydrocharitaceae family. Species in the genus are aquatic, and China is their centre of origin in Asia. Ottelia alismoides (L.) Pers., which is distributed worldwide, is a distinguishing element in China, while other species of this genus are endemic to China. However, O. alismoides is also considered endangered due to habitat loss and pollution in some Asian countries. Ottelia alismoides is the only submerged macrophyte that contains three carbon dioxide-concentrating mechanisms, i.e. bicarbonate (HCO3-) use, crassulacean acid metabolism and the C4 pathway. In this study, we present its first genome assembly to help illustrate the various carbon metabolism mechanisms and to enable genetic conservation in the future. DATA DESCRIPTION: Using DNA and RNA extracted from one O. alismoides leaf, this work produced ∼ 73.4 Gb HiFi reads, ∼ 126.4 Gb whole genome sequencing short reads and ∼ 21.9 Gb RNA-seq reads. The de novo genome assembly was 6,455,939,835 bp in length, with 11,923 scaffolds/contigs and an N50 of 790,733 bp. Genome assembly completeness assessment with Benchmarking Universal Single-Copy Orthologs revealed a score of 94.4%. The repetitive sequence in the assembly was 4,875,817,144 bp (75.5%). A total of 116,176 genes were predicted. The protein sequences were functionally annotated against multiple databases, facilitating comparative genomic analysis.


Carbon , Genome, Plant , Hydrocharitaceae , Hydrocharitaceae/genetics , Hydrocharitaceae/metabolism , Carbon/metabolism , Molecular Sequence Annotation , Whole Genome Sequencing , China
19.
BMC Genomics ; 25(1): 510, 2024 May 23.
Article En | MEDLINE | ID: mdl-38783193

Domesticated safflower (Carthamus tinctorius L.) is a widely cultivated edible oil crop. However, despite its economic importance, the genetic basis underlying key traits such as oil content, resistance to biotic and abiotic stresses, and flowering time remains poorly understood. Here, we present the genome assembly for C. tinctorius variety Jihong01, which was obtained by integrating Oxford Nanopore Technologies (ONT) and BGI-SEQ500 sequencing results. The assembled genome was 1,061.1 Mb and contained 32,379 protein-coding genes, 97.71% of which were functionally annotated. Safflower underwent a recent whole genome duplication (WGD) event in its evolutionary history and diverged from sunflower approximately 37.3 million years ago. Through comparative genomic analysis at five seed development stages, we unveiled the pivotal roles of fatty acid desaturase 2 (FAD2) and fatty acid desaturase 6 (FAD6) in linoleic acid (LA) biosynthesis. Similarly, the differential gene expression analysis further reinforced the significance of these genes in regulating LA accumulation. Moreover, our investigation of seed fatty acid composition at different seed developmental stages unveiled the crucial roles of FAD2 and FAD6 in LA biosynthesis. These findings offer important insights into enhancing breeding programs for the improvement of quality traits and provide a reference resource for further research on the natural properties of safflower.


Carthamus tinctorius , Fatty Acid Desaturases , Fatty Acids, Unsaturated , Genome, Plant , Carthamus tinctorius/genetics , Carthamus tinctorius/metabolism , Fatty Acids, Unsaturated/biosynthesis , Fatty Acids, Unsaturated/metabolism , Fatty Acid Desaturases/genetics , Fatty Acid Desaturases/metabolism , Seeds/genetics , Seeds/metabolism , Seeds/growth & development , Genomics/methods , Gene Expression Regulation, Plant , Molecular Sequence Annotation
20.
PLoS Biol ; 22(5): e3002405, 2024 May.
Article En | MEDLINE | ID: mdl-38713717

We report a new visualization tool for analysis of whole-genome assembly-assembly alignments, the Comparative Genome Viewer (CGV) (https://ncbi.nlm.nih.gov/genome/cgv/). CGV visualizes pairwise same-species and cross-species alignments provided by the National Center for Biotechnology Information (NCBI) using assembly alignment algorithms developed by us and others. Researchers can examine large structural differences spanning chromosomes, such as inversions or translocations. Users can also navigate to regions of interest, where they can detect and analyze smaller-scale deletions and rearrangements within specific chromosome or gene regions. RefSeq or user-provided gene annotation is displayed where available. CGV currently provides approximately 800 alignments from over 350 animal, plant, and fungal species. CGV and related NCBI viewers are undergoing active development to further meet the needs of the research community in comparative genome visualization.


Genome , Software , Animals , Genome/genetics , Sequence Alignment/methods , Genomics/methods , Algorithms , United States , Humans , Eukaryota/genetics , Databases, Genetic , National Library of Medicine (U.S.) , Molecular Sequence Annotation/methods
...