Results 1 - 20 of 111
1.
Nucleic Acids Res ; 52(D1): D762-D769, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37962425

ABSTRACT

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains over 315 000 bacterial and archaeal genomes and 236 million proteins with up-to-date and consistent annotation. In the past 3 years, we have expanded the diversity of the RefSeq collection by including the best quality metagenome-assembled genomes (MAGs) submitted to INSDC (DDBJ, ENA and GenBank), while maintaining its quality by adding validation checks. Assemblies are now more stringently evaluated for contamination and for completeness of annotation prior to acceptance into RefSeq. MAGs now account for over 17 000 assemblies in RefSeq, split over 165 orders and 362 families. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP), which is used to annotate nearly all RefSeq assemblies, include better detection of protein-coding genes. Nearly 83% of RefSeq proteins are now named by a curated Protein Family Model, a 4.7% increase over the past three years. In addition to literature citations, Enzyme Commission numbers, and gene symbols, Gene Ontology terms are now assigned to 48% of RefSeq proteins, allowing for easier multi-genome comparison. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/. PGAP is available as a stand-alone tool able to produce GenBank-ready files at https://github.com/ncbi/pgap.


Subject(s)
Archaea , Bacteria , Databases, Nucleic Acid , Metagenome , Archaea/genetics , Bacteria/genetics , Databases, Nucleic Acid/standards , Databases, Nucleic Acid/trends , Genome, Archaeal/genetics , Genome, Bacterial/genetics , Internet , Molecular Sequence Annotation , Proteins/genetics
2.
Nucleic Acids Res ; 52(D1): D919-D928, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37986229

ABSTRACT

Long non-coding RNAs (lncRNAs) possess a wide range of biological functions, and research has demonstrated their significance in regulating major biological processes such as development, differentiation, and immune response. The accelerating accumulation of lncRNA research has greatly expanded our understanding of lncRNA functions. Here, we introduce LncSEA 2.0 (http://bio.liclab.net/LncSEA/index.php), aiming to provide a more comprehensive set of functional lncRNAs and enhanced enrichment analysis capabilities. Compared with LncSEA 1.0, we have made the following improvements: (i) We updated the lncRNA sets for 11 categories and greatly expanded the scope of each set. (ii) We introduced 15 new functional lncRNA categories from multiple resources. This update not only included a significant amount of downstream regulatory data for lncRNAs, but also covered numerous epigenetic regulatory data sets, including lncRNA-related transcription co-factor binding, chromatin regulator binding, and chromatin interaction data. (iii) We incorporated two new lncRNA set enrichment analysis functions based on GSEA and GSVA. (iv) We adopted the snakemake analysis pipeline to track data processing and analysis. In summary, LncSEA 2.0 offers a more comprehensive collection of lncRNA sets and a greater variety of enrichment analysis modules, assisting researchers in studying the functional mechanisms of lncRNAs more thoroughly.


Subject(s)
Databases, Nucleic Acid , RNA, Long Noncoding , Databases, Nucleic Acid/standards , RNA, Long Noncoding/genetics , Data Analysis
3.
BMC Bioinformatics ; 21(Suppl 16): 540, 2020 Dec 16.
Article in English | MEDLINE | ID: mdl-33323107

ABSTRACT

BACKGROUND: Single-cell RNA sequencing can help determine cell types, which is beneficial to the medical field, as seen in the many recent studies on COVID-19. Generally, single-cell RNA data analysis pipelines include data normalization, dimensionality reduction, and unsupervised clustering. However, different normalization and dimensionality reduction methods significantly affect the results of clustering and cell type enrichment analysis. The choice of preprocessing path is crucial in scRNA-Seq data mining, because a proper preprocessing path can extract more important information from complex raw data and lead to more accurate clustering results. RESULTS: We proposed a method called NDRindex (Normalization and Dimensionality Reduction index) to evaluate the data quality of the outcomes of normalization and dimensionality reduction methods. The method includes a function to calculate the degree of data aggregation, which is the key to measuring data quality before clustering. For the five single-cell RNA sequence datasets we tested, the results proved the efficacy and accuracy of our index. CONCLUSIONS: The method we introduce fills a gap in the selection of preprocessing paths, and the results support its effectiveness and accuracy. Our research provides useful indicators for the evaluation of RNA-Seq data.


Subject(s)
Computational Biology/methods , Databases, Nucleic Acid/classification , Databases, Nucleic Acid/standards , RNA-Seq/methods , COVID-19/virology , Cluster Analysis , Humans , SARS-CoV-2/genetics
4.
Genes (Basel) ; 11(10), 2020 09 23.
Article in English | MEDLINE | ID: mdl-32977385

ABSTRACT

Science and technology are extensively used in criminal investigation. From the mid- to late-1980s, one of the scientific discoveries that has had a particularly remarkable impact on this field has been the use of highly variable DNA sequence regions (minisatellites) in the human genome for individual identification. The technique was initially referred to as DNA fingerprinting, but is now more widely referred to as DNA profiling. Since then, many new developments have occurred within this area of science. These include the introduction of new genetic markers (microsatellites also known as short tandem repeats/STRs), the use of the polymerase chain reaction for target amplification, the development of DNA databases (databanking), and the advancement and/or improvement of genotyping protocols and technologies. In 2019, we described the progress of DNA profiling and DNA databanking in Malaysia for the first time. This report included information on DNA analysis regulations and legislation, STR genotyping protocols, database management, and accreditation status. Here, we provide an update on the performance of our DNA databank (numbers of DNA profiles and hits) plus the technical issues associated with correctly assigning the weight of evidence for DNA profiles in an ethnically diverse population, and the potential application of rapid DNA testing in the country. A total of 116,534 DNA profiles were obtained and stored in the Forensic DNA Databank of Malaysia (FDDM) by 2019, having increased from 70,570 in 2017. The number of hits increased by more than three-fold in just two years, where 17 and 69 hits between the DNA profiles stored in the FDDM and those from crime scenes, suspects, detainees, drug users, convicts, missing persons, or volunteers were recorded in 2017 and 2019, respectively. Forensic DNA analysis and databanking are thus progressing well in Malaysia and have already contributed to many criminal investigations. 
However, several other issues are discussed here, including the need for STR population data for uncharacterized population groups, and pilot trials for adopting rapid DNA profiling technology. These aspects should be considered by policy makers and law enforcement agencies in order to increase the reliability and efficiency of DNA profiling in criminal cases and in kinship analysis in Malaysia.


Subject(s)
DNA Fingerprinting/methods , DNA/genetics , Databases, Nucleic Acid/statistics & numerical data , Forensic Genetics/methods , Genetic Markers , Microsatellite Repeats , DNA/analysis , Databases, Nucleic Acid/standards , Genetics, Population , Genotyping Techniques , Humans , Malaysia
5.
Klin Lab Diagn ; 64(9): 578-584, 2019.
Article in Russian | MEDLINE | ID: mdl-31610112

ABSTRACT

The topic is relevant because of the need to resolve legal problems concerning the observance of human and civil rights and freedoms in the storage, access and protection of whole-genome sequencing data. The purpose of this study is to formulate conceptual criteria on which a new model of legal regulation of this sphere of public relations can be built. To this end, the regulatory legal acts in force in Russia and a number of foreign countries were studied. General scientific, specific scientific and special methods of inquiry (system-structural, formal-legal) were used. In order to formulate conceptual criteria of practical importance for storing, accessing and protecting whole-genome sequencing data in Russia and foreign countries, it was proposed to develop clarifying characteristics, or a gradation, of human and civil rights and freedoms in the context of the realization of public and state interests. It is also necessary to unify the terminology of normative acts, taking into account the peculiarities of genetic information, to work out the procedure for accessing data, and to provide for a system for its depersonalization. For the first time, the authors substantiate the need to transform the content of the human rights declared by the state (to life, freedom, personal and family secrets, and others) in light of the development of new technologies in the field of DNA scanning. The basic criteria of practical importance for the storage, access and protection of whole-genome sequencing data indicate the need to improve normative concepts, to establish the categories of persons entitled to access such data, to fix in law the conditions for anonymous testing and for the right to decline to be informed of the results, and to develop mechanisms for the depersonalization of the genetic information obtained.


Subject(s)
Computer Security , Confidentiality , Databases, Nucleic Acid/standards , Information Storage and Retrieval , Genome, Human , Genomics , Humans , Russia
6.
Nat Commun ; 10(1): 3313, 2019 07 25.
Article in English | MEDLINE | ID: mdl-31346170

ABSTRACT

FDA proactively invests in tools to support innovation of emerging technologies, such as infectious disease next generation sequencing (ID-NGS). Here, we introduce FDA-ARGOS quality-controlled reference genomes as a public database for diagnostic purposes and demonstrate its utility on the example of two use cases. We provide quality control metrics for the FDA-ARGOS genomic database resource and outline the need for genome quality gap filling in the public domain. In the first use case, we show more accurate microbial identification of Enterococcus avium from metagenomic samples with FDA-ARGOS reference genomes compared to non-curated GenBank genomes. In the second use case, we demonstrate the utility of FDA-ARGOS reference genomes for Ebola virus target sequence comparison as part of a composite validation strategy for ID-NGS diagnostic tests. The use of FDA-ARGOS as an in silico target sequence comparator tool combined with representative clinical testing could reduce the burden for completing ID-NGS clinical trials.


Subject(s)
Communicable Diseases/diagnosis , Databases, Nucleic Acid/standards , Genome , Access to Information , Communicable Diseases/microbiology , Databases, Nucleic Acid/organization & administration , High-Throughput Nucleotide Sequencing , Humans , United States , United States Food and Drug Administration
7.
BMC Bioinformatics ; 20(1): 48, 2019 Jan 23.
Article in English | MEDLINE | ID: mdl-30674273

ABSTRACT

BACKGROUND: With sequencing technologies becoming cheaper and easier to use, more groups are able to obtain whole genome sequences of viruses of public health and scientific importance. Submission of genomic data to NCBI GenBank is a requirement prior to publication and plays a critical role in making scientific data publicly available. GenBank currently has automatic prokaryotic and eukaryotic genome annotation pipelines but has no viral annotation pipeline beyond influenza virus. Annotation and submission of viral genome sequences is a non-trivial task, especially for groups that do not routinely interact with GenBank for data submissions. RESULTS: We present Viral Annotation Pipeline and iDentification (VAPiD), a portable and lightweight command-line tool for annotation and GenBank deposition of viral genomes. VAPiD supports annotation of nearly all unsegmented viral genomes. The pipeline has been validated on human immunodeficiency virus, human parainfluenza virus 1-4, human metapneumovirus, human coronaviruses (229E/OC43/NL63/HKU1/SARS/MERS), human enteroviruses/rhinoviruses, measles virus, mumps virus, Hepatitis A-E Virus, Chikungunya virus, dengue virus, and West Nile virus, as well as the human polyomaviruses BK/JC/MCV, human adenoviruses, and human papillomaviruses. The program can handle individual or batch submissions of different viruses to GenBank and correctly annotates multiple viruses, including those that contain ribosomal slippage or RNA editing, without prior knowledge of the virus to be annotated. VAPiD is programmed in Python and is compatible with Windows, Linux, and Mac OS systems. CONCLUSIONS: We have created a portable, lightweight, user-friendly, internet-enabled, open-source, command-line genome annotation and submission package to facilitate virus genome submissions to NCBI GenBank. Instructions for downloading and installing VAPiD can be found at https://github.com/rcs333/VAPiD .


Subject(s)
Databases, Nucleic Acid/standards , Genome, Viral/genetics , Genomics/methods , Humans
8.
Genomics ; 111(4): 759-761, 2019 07.
Article in English | MEDLINE | ID: mdl-29842948

ABSTRACT

The typical wet lab user often annotates smaller sequences in the GenBank format, but the resulting files are not accepted for database submission by NCBI. This makes submission of such annotations a cumbersome task. Here we present "GB2sequin", an easy-to-use web application that converts custom annotations in the GenBank format into the NCBI direct submission format Sequin. Additionally, the program generates a "five-column, tab-delimited feature table" and a FASTA file. These are required for submission through BankIt or for the update of an existing GenBank entry. We specifically developed "GB2sequin" for the regular wet lab researcher, with a strong focus on user-friendliness and flexibility. The application is equipped with an intuitive graphical interface and comprehensive documentation. It can be employed to prepare any GenBank file for database submission and is freely available online at https://chlorobox.mpimp-golm.mpg.de/GenBank2Sequin.html.


Subject(s)
Databases, Nucleic Acid/standards , Molecular Sequence Annotation/methods , Software , Molecular Sequence Annotation/standards , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/standards
9.
BMC Bioinformatics ; 19(1): 478, 2018 Dec 12.
Article in English | MEDLINE | ID: mdl-30541436

ABSTRACT

BACKGROUND: Researchers typically sequence a given individual multiple times, either re-sequencing the same DNA sample (technical replication) or sequencing different DNA samples collected from the same individual (biological replication) or both. Before merging the data from these replicate sequence runs, it is important to verify that no errors, such as DNA contamination or mix-ups, occurred during the data collection pipeline. Methods to detect such errors exist but are often ad hoc and cannot handle missing data, and several require phased data. Because they require some combination of genotype calling, imputation, and haplotype phasing, these methods are unsuitable for error detection in low- to moderate-depth sequence data where such tasks are difficult to perform accurately. Additionally, because most existing methods employ a pairwise-comparison approach for error detection rather than joint analysis of the putative replicates, results may be difficult to interpret. RESULTS: We introduce a new method for error detection suitable for shallow-, moderate-, and high-depth sequence data. Using Bayes' theorem, we calculate the posterior probability distribution over the set of relations describing the putative replicates and infer which of the samples originated from an identical genotypic source. CONCLUSIONS: Our method addresses key limitations of existing approaches and produced highly accurate results in simulation experiments. Our method is implemented as an R package called BIGRED (Bayes Inferred Genotype Replicate Error Detector), which is freely available for download: https://github.com/ac2278/BIGRED .
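
The Bayesian step described in the Results can be sketched generically. This is not the BIGRED implementation (which is an R package). It only shows how Bayes' theorem turns per-hypothesis likelihoods and priors into a posterior over a finite set of relations among putative replicates. The function name `posterior`, the partition labels, and all numeric values are hypothetical.

```python
import math


def posterior(log_likelihoods, priors):
    """Bayes' theorem over a finite hypothesis set: P(h | D) is proportional
    to P(D | h) * P(h). Works in log space to avoid numerical underflow."""
    log_joint = {h: ll + math.log(priors[h]) for h, ll in log_likelihoods.items()}
    m = max(log_joint.values())
    unnormalized = {h: math.exp(v - m) for h, v in log_joint.items()}
    z = sum(unnormalized.values())
    return {h: u / z for h, u in unnormalized.items()}


# Hypothetical example: three putative replicates A, B, C. Each hypothesis is
# one way of grouping them by genotypic source ("AB|C" means A and B share a
# source and C comes from a different one). The log-likelihoods are invented.
relations = ["ABC", "AB|C", "AC|B", "A|BC", "A|B|C"]
uniform_prior = {r: 1.0 / len(relations) for r in relations}
made_up_loglik = {"ABC": -120.0, "AB|C": -95.0, "AC|B": -130.0,
                  "A|BC": -131.0, "A|B|C": -99.0}

post = posterior(made_up_loglik, uniform_prior)
best_relation = max(post, key=post.get)
```

Evaluating all relations jointly, rather than comparing samples pair by pair, yields a single distribution whose most probable member directly states which samples share a source.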


Subject(s)
Databases, Nucleic Acid/standards , Sequence Analysis, DNA/methods , Humans
11.
Bioinformatics ; 34(5): 755-759, 2018 03 01.
Article in English | MEDLINE | ID: mdl-29069347

ABSTRACT

Motivation: Nucleic acid sequences in public databases should not contain vector contamination, but many sequences in GenBank do (or did) contain vectors. The National Center for Biotechnology Information uses the program VecScreen to screen submitted sequences for contamination. Additional tools are needed to distinguish true-positive (contamination) from false-positive (not contamination) VecScreen matches. Results: A principal reason for false-positive VecScreen matches is that the sequence and the matching vector subsequence originate from closely related or identical organisms (for example, both originate in Escherichia coli). We collected information on the taxonomy of sources of vector segments in the UniVec database used by VecScreen. We used that information in two overlapping software pipelines for retrospective analysis of contamination in GenBank and for prospective analysis of contamination in new sequence submissions. Using the retrospective pipeline, we identified and corrected over 8000 contaminated sequences in the nonredundant nucleotide database. The prospective analysis pipeline has been in production use since April 2017 to evaluate some new GenBank submissions. Availability and implementation: Data on the sources of UniVec entries were included in release 10.0 (ftp://ftp.ncbi.nih.gov/pub/UniVec/). The main software is freely available at https://github.com/aaschaffer/vecscreen_plus_taxonomy. Contact: aschaffe@helix.nih.gov. Supplementary information: Supplementary data are available at Bioinformatics online.
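
The taxonomy-based reasoning in the Results can be illustrated with a small sketch. It is not the authors' pipeline: the function names, the abbreviated lineage lists, and the `min_shared_ranks` threshold are all assumptions chosen for illustration. The idea is only that a VecScreen match is more plausibly a false positive when the submitted sequence and the source of the matching vector segment come from closely related or identical organisms.

```python
def shared_ranks(lineage_a, lineage_b):
    """Count the leading taxonomic ranks that two root-to-species lineages share."""
    count = 0
    for a, b in zip(lineage_a, lineage_b):
        if a != b:
            break
        count += 1
    return count


def likely_false_positive(seq_lineage, vector_source_lineage, min_shared_ranks=5):
    """Hypothetical rule: treat a vector match as a probable false positive when
    the two organisms share at least min_shared_ranks leading ranks."""
    return shared_ranks(seq_lineage, vector_source_lineage) >= min_shared_ranks


# Abbreviated lineages (selected ranks only), used purely as example input.
E_COLI = ["Bacteria", "Pseudomonadota", "Gammaproteobacteria",
          "Enterobacterales", "Enterobacteriaceae", "Escherichia",
          "Escherichia coli"]
S_ENTERICA = E_COLI[:5] + ["Salmonella", "Salmonella enterica"]
H_SAPIENS = ["Eukaryota", "Chordata", "Mammalia", "Primates",
             "Hominidae", "Homo", "Homo sapiens"]

# A vector segment derived from E. coli matching an E. coli submission is
# suspicious as a false positive; the same segment in a human sequence is not.
same_organism = likely_false_positive(E_COLI, E_COLI)
distant_organism = likely_false_positive(H_SAPIENS, E_COLI)
```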


Subject(s)
Databases, Nucleic Acid/standards , Sequence Analysis, DNA/methods , Software , Bacteria , Eukaryota
13.
Planta Med ; 83(18): 1420-1430, 2017 Dec.
Article in English | MEDLINE | ID: mdl-28651291

ABSTRACT

Precise, species-level identification of plants in foods and dietary supplements is difficult. While the use of DNA barcoding regions (short regions of DNA with diagnostic utility) has been effective for many inquiries, it is not always a robust approach for closely related species, especially in highly processed products. The use of fully sequenced chloroplast genomes, as an alternative to short diagnostic barcoding regions, has demonstrated utility for closely related species. The U. S. Food and Drug Administration (FDA) has also developed species-specific DNA-based assays targeting plant species of interest by utilizing chloroplast genome sequences. Here, we introduce a repository of complete chloroplast genome sequences called GenomeTrakrCP, which will be publicly available at the National Center for Biotechnology Information (NCBI). Target species for inclusion are plants found in foods and dietary supplements, toxin producers, common contaminants and adulterants, and their close relatives. Publicly available data will include annotated assemblies, raw sequencing data, and voucher information with each NCBI accession associated with an authenticated reference herbarium specimen. To date, 40 complete chloroplast genomes have been deposited in GenomeTrakrCP (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA325670/), and this will be expanded in the future.


Subject(s)
Databases, Nucleic Acid/standards , Genome, Chloroplast/genetics , Plants/classification , DNA Barcoding, Taxonomic , DNA, Chloroplast/chemistry , DNA, Chloroplast/genetics , Molecular Sequence Annotation , Plant Leaves/classification , Plant Leaves/genetics , Plants/genetics , Reference Standards , Species Specificity , United States , United States Food and Drug Administration
15.
Article in English | MEDLINE | ID: mdl-28077566

ABSTRACT

GenBank, the EMBL European Nucleotide Archive and the DNA DataBank of Japan, known collectively as the International Nucleotide Sequence Database Collaboration or INSDC, are the three most significant nucleotide sequence databases. Their records are derived from laboratory work undertaken by different individuals, by different teams, with a range of technologies and assumptions and over a period of decades. As a consequence, they contain a great many duplicates, redundancies and inconsistencies, but neither the prevalence nor the characteristics of various types of duplicates have been rigorously assessed. Existing duplicate detection methods in bioinformatics only address specific duplicate types, with inconsistent assumptions; and the impact of duplicates in bioinformatics databases has not been carefully assessed, making it difficult to judge the value of such methods. Our goal is to assess the scale, kinds and impact of duplicates in bioinformatics databases, through a retrospective analysis of merged groups in INSDC databases. Our outcomes are threefold: (1) We analyse a benchmark dataset consisting of duplicates manually identified in INSDC (67 888 merged groups with 111 823 duplicate pairs across 21 organisms from INSDC databases) in terms of the prevalence, types and impacts of duplicates. (2) We categorize duplicates at both sequence and annotation level, with supporting quantitative statistics, showing that different organisms have different prevalence of distinct kinds of duplicate. (3) We show that the presence of duplicates has practical impact via a simple case study on duplicates, in terms of GC content and melting temperature. We demonstrate that duplicates not only introduce redundancy, but can lead to inconsistent results for certain tasks.
Our findings lead to a better understanding of the problem of duplication in biological databases. Database URL: the merged records are available at https://cloudstor.aarnet.edu.au/plus/index.php/s/Xef2fvsebBEAv9w.
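
The case study on GC content and melting temperature can be made concrete with a minimal sketch. The abstract does not say which melting-temperature formula the authors used; the Wallace rule for short oligonucleotides is swapped in here as an assumption, and the two records are invented sequences, not INSDC data.

```python
def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence (case-insensitive)."""
    s = seq.upper()
    if not s:
        raise ValueError("empty sequence")
    return (s.count("G") + s.count("C")) / len(s)


def wallace_tm(seq):
    """Melting temperature in degrees Celsius by the Wallace rule,
    Tm = 2*(A+T) + 4*(G+C), a rough estimate for short oligonucleotides."""
    s = seq.upper()
    at = s.count("A") + s.count("T")
    gc = s.count("G") + s.count("C")
    return 2 * at + 4 * gc


# Two hypothetical duplicate records of the "same" sequence that differ
# only in their final base.
record_a = "ATGCGCATTA"
record_b = "ATGCGCATTG"

gc_a, gc_b = gc_content(record_a), gc_content(record_b)
tm_a, tm_b = wallace_tm(record_a), wallace_tm(record_b)
```

Even a single differing base changes both statistics, so an analysis that picks one record of a duplicate pair arbitrarily can report different values from one that picks the other.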


Subject(s)
Computational Biology , Databases, Nucleic Acid/standards , Molecular Sequence Annotation , Computational Biology/methods , Computational Biology/standards , Databases, Nucleic Acid/organization & administration , Japan , Molecular Sequence Annotation/methods
16.
Arch Virol ; 162(4): 1141-1144, 2017 Apr.
Article in English | MEDLINE | ID: mdl-28040837

ABSTRACT

Nucleotide sequences are the fundamental basis for work on molecular mechanisms and for phylogenetic analysis. Recently, we identified sequence errors in all of the LTR sequences of the prototypic foamy virus stored in the GenBank database. Here, we report the resequencing of the proviral plasmids pHSRV13 and pHSRV2. Sequence comparisons revealed an error rate for the foamy virus sequences stored in the database of up to 10 errors per 1000 bp. Even the newest sequences of the codon-optimized foamy virus synthetic Gag, Pol, and Env amino acid sequences showed exchanges compared to the new proviral pHSRV13n sequence. Our results provide evidence that some prototypic foamy virus sequences contain errors and should be revised.


Subject(s)
Databases, Nucleic Acid/standards , Plasmids/genetics , Sequence Analysis, DNA/standards , Spumavirus/genetics , Base Sequence , Molecular Sequence Data , Phylogeny
17.
Genet Med ; 19(7): 772-777, 2017 07.
Article in English | MEDLINE | ID: mdl-27906200

ABSTRACT

BACKGROUND: Genetic FBN1 testing is pivotal for confirming the clinical diagnosis of Marfan syndrome. In an effort to evaluate variant causality, FBN1 databases are often used. We evaluated the current databases regarding FBN1 variants and validated associated phenotype records with a new Marfan syndrome geno-phenotyping tool called the Marfan score. METHODS AND RESULTS: We evaluated four databases (UMD-FBN1, ClinVar, the Human Gene Mutation Database (HGMD), and Uniprot) containing 2,250 FBN1 variants supported by 4,904 records presented in 307 references. The Marfan score calculated for phenotype data from the records quantified variant associations with Marfan syndrome phenotype. We calculated a Marfan score for 1,283 variants, of which we confirmed the database diagnosis of Marfan syndrome in 77.1%. This represented only 35.8% of the total registered variants; 18.5-33.3% (UMD-FBN1 versus HGMD) of variants associated with Marfan syndrome in the databases could not be confirmed by the recorded phenotype. CONCLUSION: FBN1 databases can be imprecise and incomplete. Data should be used with caution when evaluating FBN1 variants. At present, the UMD-FBN1 database seems to be the biggest and best curated; therefore, it is the most comprehensive database. However, the need for better genotype-phenotype curated databases is evident, and we hereby present such a database. Genet Med advance online publication 01 December 2016.


Subject(s)
Fibrillin-1/genetics , Genetic Association Studies/methods , Marfan Syndrome/diagnosis , Databases, Factual , Databases, Nucleic Acid/standards , Female , Fibrillins , Genetic Testing/methods , Genetic Variation/genetics , Genotype , Humans , Male , Marfan Syndrome/genetics , Microfilament Proteins/genetics , Mutation , Phenotype
18.
Genome ; 59(11): 913-932, 2016 Nov.
Article in English | MEDLINE | ID: mdl-27829306

ABSTRACT

Fungi are ubiquitous in both natural and human-made environments. They play important roles in the health of plants, animals, and humans, and in broad ecosystem functions. Thus, having an efficient species-level identification system could significantly enhance our ability to treat fungal diseases and to monitor the spatial and temporal patterns of fungal distributions and migrations. DNA barcoding is a potent approach for rapid identification of fungal specimens, generating novel species hypotheses, and guiding biodiversity and ecological studies. In this mini-review, I briefly summarize (i) the history of DNA sequence-based fungal identification; (ii) the emergence of the ITS region as the consensus primary fungal barcode; (iii) the use of the ITS barcodes to address a variety of issues on fungal diversity from local to global scales, including generating a large number of species hypotheses; and (iv) the problems with the ITS barcode region and the approaches to overcome these problems. Similar to DNA barcoding research on plants and animals, significant progress has been achieved over the last few years in terms of both the questions being addressed and the foundations being laid for future research endeavors. However, significant challenges remain. I suggest three broad areas of research to enhance the usefulness of fungal DNA barcoding to meet the current and future challenges: (i) develop a common set of primers and technologies that allow the amplification and sequencing of all fungi at both the primary and secondary barcode loci; (ii) compile a centralized reference database that includes all recognized fungal species as well as species hypotheses, and allows regular updates from the research community; and (iii) establish a consensus set of new species recognition criteria based on barcode DNA sequences that can be applied across the fungal kingdom.


Subject(s)
DNA Barcoding, Taxonomic , Fungi/classification , Fungi/genetics , Animals , Biodiversity , DNA Barcoding, Taxonomic/methods , DNA Barcoding, Taxonomic/standards , DNA, Fungal , DNA, Intergenic , Databases, Nucleic Acid/standards , Environmental Microbiology , Food Microbiology , Fungi/metabolism , Gastrointestinal Microbiome , Humans , Metagenome , Metagenomics/methods , Microbiota , Mouth/microbiology , Research , Sequence Analysis, DNA/methods , Soil Microbiology
19.
PLoS One ; 11(8): e0159644, 2016.
Article in English | MEDLINE | ID: mdl-27489953

ABSTRACT

MOTIVATION: First identified as an issue in 1996, duplication in biological databases introduces redundancy and even leads to inconsistency when contradictory information appears. The amount of data makes purely manual de-duplication impractical, and existing automatic systems cannot detect duplicates as precisely as experts can. Supervised learning has the potential to address such problems by building automatic systems that learn from expert curation to detect duplicates precisely and efficiently. While machine learning is a mature approach in other duplicate detection contexts, it has seen only preliminary application in genomic sequence databases. RESULTS: We developed and evaluated a supervised duplicate detection method based on an expert-curated dataset of duplicates, containing over one million pairs across five organisms derived from genomic sequence databases. We selected 22 features to represent distinct attributes of the database records, and developed a binary model and a multi-class model. Both models achieve promising performance; under cross-validation, the binary model had over 90% accuracy in each of the five organisms, while the multi-class model maintains high accuracy and is more robust in generalisation. We performed an ablation study to quantify the impact of different sequence record features, finding that features derived from meta-data, sequence identity, and alignment quality impact performance most strongly. The study demonstrates that machine learning can be an effective additional tool for de-duplication of genomic sequence databases. All data are available as described in the supplementary material.


Subject(s)
Computational Biology/methods , Databases, Nucleic Acid , Supervised Machine Learning , Animals , Base Sequence , Caenorhabditis elegans/genetics , Computational Biology/standards , Databases, Nucleic Acid/standards , Escherichia coli/genetics , Zea mays/genetics , Zebrafish/genetics