2.
Nucleic Acids Res ; 47(D1): D721-D728, 2019 01 08.
Article in English | MEDLINE | ID: mdl-30289549

ABSTRACT

One of the most fundamental questions in biology is what types of cells form different tissues and organs in a functionally coordinated fashion. Larger-scale single-cell sequencing and biology experiment studies are now rapidly opening up new ways to track this question by revealing substantial cell markers for distinguishing different cell types in tissues. Here, we developed the CellMarker database (http://biocc.hrbmu.edu.cn/CellMarker/ or http://bio-bigdata.hrbmu.edu.cn/CellMarker/), aiming to provide a comprehensive and accurate resource of cell markers for various cell types in tissues of human and mouse. By manually curating over 100 000 published papers, 4124 entries including the cell marker information, tissue type, cell type, cancer information and source, were recorded. In total, 13 605 cell markers of 467 cell types in 158 human tissues/sub-tissues and 9148 cell markers of 389 cell types in 81 mouse tissues/sub-tissues were collected and deposited in CellMarker. CellMarker provides a user-friendly interface for browsing, searching and downloading markers of diverse cell types of different tissues. Furthermore, a summarized marker prevalence in each cell type is graphically and intuitively presented through a vivid statistical graph. We believe that CellMarker is a comprehensive and valuable resource for cell research in precisely identifying and characterizing cells, especially at the single-cell level.


Subject(s)
Databases, Genetic , Sequence Analysis/methods , Single-Cell Analysis/methods , Software , Animals , Humans , Mice , Sequence Analysis/standards , Single-Cell Analysis/standards
3.
BMC Genomics ; 21(1): 863, 2020 Dec 04.
Article in English | MEDLINE | ID: mdl-33276717

ABSTRACT

BACKGROUND: The global COVID-19 pandemic has led to an urgent need for scalable methods for clinical diagnostics and viral tracking. Next generation sequencing technologies have enabled large-scale genomic surveillance of SARS-CoV-2 as thousands of isolates are being sequenced around the world and deposited in public data repositories. A number of methods using both short- and long-read technologies are currently being applied for SARS-CoV-2 sequencing, including amplicon approaches, metagenomic methods, and sequence capture or enrichment methods. Given the small genome size, the ability to sequence SARS-CoV-2 at scale is limited by the cost and labor associated with making sequencing libraries. RESULTS: Here we describe a low-cost, streamlined, all amplicon-based method for sequencing SARS-CoV-2, which bypasses costly and time-consuming library preparation steps. We benchmark this tailed amplicon method against both the ARTIC amplicon protocol and sequence capture approaches and show that an optimized tailed amplicon approach achieves comparable amplicon balance, coverage metrics, and variant calls to the ARTIC v3 approach. CONCLUSIONS: The tailed amplicon method we describe represents a cost-effective and highly scalable method for SARS-CoV-2 sequencing.


Subject(s)
COVID-19 Nucleic Acid Testing/methods , COVID-19/virology , Genome, Viral/genetics , SARS-CoV-2/genetics , Benchmarking , COVID-19/diagnosis , COVID-19/epidemiology , COVID-19 Nucleic Acid Testing/standards , Humans , Molecular Epidemiology , Mutation , RNA, Viral/genetics , SARS-CoV-2/isolation & purification , Sequence Analysis/methods , Sequence Analysis/standards
4.
Plant Dis ; 103(9): 2199-2203, 2019 Sep.
Article in English | MEDLINE | ID: mdl-31322493

ABSTRACT

Viral diseases are a limiting factor to wheat production. Viruses are difficult to diagnose in the early stages of disease development and are often confused with nutrient deficiencies or other abiotic problems. Immunological methods are useful to identify viruses, but specific antibodies may not be available or require high virus titer for detection. In 2015 and 2017, wheat plants containing Wheat streak mosaic virus (WSMV) resistance gene, Wsm2, were found to have symptoms characteristic of WSMV. Serologically, WSMV was detected in all four samples. Additionally, High Plains wheat mosaic virus (HPWMoV) was also detected in one of the samples. Barley yellow dwarf virus (BYDV) was not detected, and a detection kit was not readily available for Triticum mosaic virus (TriMV). Initially, cDNA cloning and Sanger sequencing were used to determine the presence of WSMV; however, the process was time-consuming and expensive. Subsequently, cDNA from infected wheat tissue was sequenced with single-strand, Oxford Nanopore sequencing technology (ONT). ONT was able to confirm the presence of WSMV. Additionally, TriMV was found in all of the samples and BYDV in three of the samples. Deep coverage sequencing of full-length, single-strand WSMV revealed variation compared with the WSMV Sidney-81 reference strain and may represent new variants which overcome Wsm2. These results demonstrate that ONT can more accurately identify causal virus agents and has sufficient resolution to provide evidence of causal variants.


Subject(s)
Plant Diseases , Plant Viruses , Sequence Analysis , Triticum , Bunyaviridae/classification , Bunyaviridae/genetics , Luteovirus/classification , Luteovirus/genetics , Nanopores , Plant Diseases/virology , Plant Viruses/classification , Plant Viruses/genetics , Potyviridae/classification , Potyviridae/genetics , Sequence Analysis/standards , Triticum/virology
5.
Nucleic Acids Res ; 40(Database issue): D130-5, 2012 Jan.
Article in English | MEDLINE | ID: mdl-22121212

ABSTRACT

The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database is a collection of genomic, transcript and protein sequence records. These records are selected and curated from public sequence archives and represent a significant reduction in redundancy compared to the volume of data archived by the International Nucleotide Sequence Database Collaboration. The database includes over 16,000 organisms, 2.4 × 10(6) genomic records, 13 × 10(6) proteins and 2 × 10(6) RNA records spanning prokaryotes, eukaryotes and viruses (RefSeq release 49, September 2011). The RefSeq database is maintained by a combined approach of automated analyses, collaboration and manual curation to generate an up-to-date representation of the sequence, its features, names and cross-links to related sources of information. We report here on recent growth, the status of curating the human RefSeq data set, more extensive feature annotation and current policy for eukaryotic genome annotation via the NCBI annotation pipeline. More information about the resource is available online (see http://www.ncbi.nlm.nih.gov/RefSeq/).


Subject(s)
Databases, Genetic , Molecular Sequence Annotation , Sequence Analysis/standards , Genomics/standards , Humans , Reference Standards , Sequence Analysis, DNA/standards , Sequence Analysis, Protein/standards , Sequence Analysis, RNA/standards
6.
BMC Evol Biol ; 13: 161, 2013 Aug 01.
Article in English | MEDLINE | ID: mdl-23914788

ABSTRACT

The intention of this editorial is to steer researchers through methodological choices in molecular evolution, drawing on the combined expertise of the authors. Our aim is not to review the most advanced methods for a specific task. Rather, we define several general guidelines to help with methodology choices at different stages of a typical phylogenetic 'pipeline'. We are not able to provide exhaustive citation of a literature that is vast and plentiful, but we point the reader to a set of classical textbooks that reflect the state-of-the-art. We do not wish to appear overly critical of outdated methodology but rather provide some practical guidance on the sort of issues which should be considered. We stress that a reported study should be well-motivated and evaluate a specific hypothesis or scientific question. However, a publishable study should not be merely a compilation of available sequences for a protein family of interest followed by some standard analyses, unless it specifically addresses a scientific hypothesis or question. The rapid pace at which sequence data accumulate quickly outdates such publications, although discoveries stemming from data mining, reports of new tools and databases, and review papers are clearly also desirable.


Subject(s)
Classification/methods , Phylogeny , Genetics, Population , Sequence Analysis/standards
8.
Genet Epidemiol ; 35 Suppl 1: S22-8, 2011.
Article in English | MEDLINE | ID: mdl-22128054

ABSTRACT

Next-generation sequencing of large numbers of individuals presents challenges in data preparation, quality control, and statistical analysis because of the rarity of the variants. The Genetic Analysis Workshop 17 (GAW17) data provide an opportunity to survey existing methods and compare these methods with novel ones. Specifically, the GAW17 Group 2 contributors investigate existing and newly proposed methods and study design strategies to identify rare variants, predict functional variants, and/or examine quality control. We introduce the eight Group 2 papers, summarize their approaches, and discuss their strengths and weaknesses. For these investigations, some groups used only the genotype data, whereas others also used the simulated phenotype data. Although the eight Group 2 contributions covered a wide variety of topics under the general idea of identifying rare variants, they can be grouped into three broad categories according to their common research interests: functionality of variants and quality control issues, family-based analyses, and association analyses of unrelated individuals. The aims of the first subgroup were quite different. These were population structure analyses that used rare variants to predict functionality and examine the accuracy of genotype calls. The aims of the family-based analyses were to select which families should be sequenced and to identify high-risk pedigrees; the aim of the association analyses was to identify variants or genes with regression-based methods. However, power to detect associations was low in all three association studies. Thus this work shows opportunities for incorporating rare variants into the genetic and statistical analyses of common diseases.


Subject(s)
Genetic Variation , Molecular Epidemiology/methods , Molecular Epidemiology/standards , Algorithms , Exome/genetics , Genetic Predisposition to Disease , Human Genome Project , Humans , Quality Control , Regression Analysis , Sequence Analysis/standards
9.
Nat Microbiol ; 7(1): 108-119, 2022 01.
Article in English | MEDLINE | ID: mdl-34907347

ABSTRACT

The global spread and continued evolution of SARS-CoV-2 have driven an unprecedented surge in viral genomic surveillance. Amplicon-based sequencing methods provide a sensitive, low-cost and rapid approach but suffer a high potential for contamination, which can undermine laboratory processes and results. This challenge will increase with the expanding global production of sequences across a variety of laboratories for epidemiological and clinical interpretation, as well as for genomic surveillance of emerging diseases in future outbreaks. We present SDSI + AmpSeq, an approach that uses 96 synthetic DNA spike-ins (SDSIs) to track samples and detect inter-sample contamination throughout the sequencing workflow. We apply SDSIs to the ARTIC Consortium's amplicon design, demonstrate their utility and efficiency in a real-time investigation of a suspected hospital cluster of SARS-CoV-2 cases and validate them across 6,676 diagnostic samples at multiple laboratories. We establish that SDSI + AmpSeq provides increased confidence in genomic data by detecting and correcting for relatively common, yet previously unobserved modes of error, including spillover and sample swaps, without impacting genome recovery.
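
The spike-in idea in this abstract can be illustrated with a minimal Python sketch. It is not the SDSI + AmpSeq implementation: the function name, the status labels, and the 1% foreign-read threshold are assumptions chosen for illustration. Each sample is expected to carry exactly one spike-in, so reads from any other spike-in suggest spillover, and a complete absence of the expected spike-in suggests a sample swap.

```python
def check_sample(expected_sdsi, sdsi_read_counts, max_foreign_fraction=0.01):
    """Classify one sample from its per-spike-in read counts.

    expected_sdsi: ID of the spike-in added to this sample (hypothetical naming).
    sdsi_read_counts: dict mapping spike-in ID -> number of reads observed.
    max_foreign_fraction: illustrative threshold, not a published value.
    """
    total = sum(sdsi_read_counts.values())
    if total == 0:
        return "no_spike_in_reads"
    expected = sdsi_read_counts.get(expected_sdsi, 0)
    if expected == 0:
        # Only other spike-ins were seen: the tube may hold a different sample.
        return "possible_swap"
    foreign_fraction = (total - expected) / total
    if foreign_fraction > max_foreign_fraction:
        return "possible_contamination"
    return "ok"
```

A real workflow would also need to choose the threshold empirically and to account for index hopping on the sequencer, which this sketch ignores.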


Subject(s)
DNA Primers/standards , SARS-CoV-2/genetics , Sequence Analysis/standards , COVID-19/diagnosis , DNA Primers/chemical synthesis , Genome, Viral/genetics , Humans , Quality Control , RNA, Viral/genetics , Reproducibility of Results , Sequence Analysis/methods , Whole Genome Sequencing , Workflow
10.
Nucleic Acids Res ; 37(Database issue): D32-6, 2009 Jan.
Article in English | MEDLINE | ID: mdl-18927115

ABSTRACT

NCBI's Reference Sequence (RefSeq) database (http://www.ncbi.nlm.nih.gov/RefSeq/) is a curated non-redundant collection of sequences representing genomes, transcripts and proteins. RefSeq records integrate information from multiple sources and represent a current description of the sequence, the gene and sequence features. The database includes over 5300 organisms spanning prokaryotes, eukaryotes and viruses, with records for more than 5.5 x 10(6) proteins (RefSeq release 30). Feature annotation is applied by a combination of curation, collaboration, propagation from other sources and computation. We report here on the recent growth of the database, recent changes to feature annotations and record types for eukaryotic (primarily vertebrate) species and policies regarding species inclusion and genome annotation. In addition, we introduce RefSeqGene, a new initiative to support reporting variation data on a stable genomic coordinate system.


Subject(s)
Databases, Genetic , Sequence Analysis/standards , Animals , Exons , Genomics/standards , Humans , Mice , Proteins/chemistry , Pseudogenes , RNA, Untranslated/chemistry , Reference Standards
11.
Nucleic Acids Res ; 37(Web Server issue): W634-42, 2009 Jul.
Article in English | MEDLINE | ID: mdl-19483099

ABSTRACT

Human immunodeficiency virus type-1 (HIV-1), hepatitis B and C and other rapidly evolving viruses are characterized by extremely high levels of genetic diversity. To facilitate diagnosis and the development of prevention and treatment strategies that efficiently target the diversity of these viruses, and other pathogens such as human T-lymphotropic virus type-1 (HTLV-1), human herpes virus type-8 (HHV8) and human papillomavirus (HPV), we developed a rapid high-throughput-genotyping system. The method involves the alignment of a query sequence with a carefully selected set of pre-defined reference strains, followed by phylogenetic analysis of multiple overlapping segments of the alignment using a sliding window. Each segment of the query sequence is assigned the genotype and sub-genotype of the reference strain with the highest bootstrap (>70%) and bootscanning (>90%) scores. Results from all windows are combined and displayed graphically using color-coded genotypes. The new Virus-Genotyping Tools provide accurate classification of recombinant and non-recombinant viruses and are currently being assessed for their diagnostic utility. They have been incorporated into several HIV drug resistance algorithms including the Stanford (http://hivdb.stanford.edu) and two European databases (http://www.umcutrecht.nl/subsite/spread-programme/ and http://www.hivrdb.org.uk/) and have been successfully used to genotype a large number of sequences in these and other databases. The tools are a PHP/JAVA web application and are freely accessible on a number of servers including: http://bioafrica.mrc.ac.za/rega-genotype/html/, http://lasp.cpqgm.fiocruz.br/virus-genotype/html/, http://jose.med.kuleuven.be/genotypetool/html/.
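
The sliding-window step described in this abstract can be sketched in Python as follows. This is only an illustration of the windowing and thresholding logic, not the tool's code: the window size and step are made-up values, the function names are hypothetical, and the per-window support scores are assumed to come from a separate phylogenetic analysis that is not shown here.

```python
def windows(alignment_length, size, step):
    """Yield (start, end) half-open coordinates of overlapping windows."""
    for start in range(0, alignment_length - size + 1, step):
        yield start, start + size


def call_window(support_by_genotype, threshold):
    """Return the best-supported genotype, or None if support is too low.

    support_by_genotype: dict mapping genotype label -> support in percent,
    assumed to be produced by an external bootstrap/bootscan analysis.
    """
    if not support_by_genotype:
        return None
    best = max(support_by_genotype, key=support_by_genotype.get)
    return best if support_by_genotype[best] > threshold else None


# Illustrative use: three windows, the middle one supporting another genotype.
supports = [{"B": 98.0, "C": 2.0}, {"B": 5.0, "C": 95.0}, {"B": 60.0, "C": 40.0}]
calls = [call_window(s, threshold=90.0) for s in supports]
```

A change of call between adjacent windows, as between the first two entries above, is the kind of pattern that would point to a recombinant sequence; the third window stays unassigned because no genotype exceeds the threshold.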


Subject(s)
Genetic Variation , Software , Viruses/classification , Base Sequence , Genotype , HIV-1/classification , HIV-1/genetics , Hepacivirus/classification , Hepacivirus/genetics , Hepatitis B virus/classification , Hepatitis B virus/genetics , Phylogeny , Recombination, Genetic , Reference Standards , Sequence Alignment , Sequence Analysis/standards , Viruses/genetics
12.
Adv Exp Med Biol ; 680: 693-700, 2010.
Article in English | MEDLINE | ID: mdl-20865556

ABSTRACT

Next Generation Sequencing technologies are limited by the lack of standard bioinformatics infrastructures that can reduce data storage, increase data processing performance, and integrate diverse information. HDF technologies address these requirements and have a long history of use in data-intensive science communities. They include general data file formats, libraries, and tools for working with the data. Compared to emerging standards, such as the SAM/BAM formats, HDF5-based systems demonstrate significantly better scalability, can support multiple indexes, store multiple data types, and are self-describing. For these reasons, HDF5 and its BioHDF extension are well suited for implementing data models to support the next generation of bioinformatics applications.


Subject(s)
Sequence Alignment/statistics & numerical data , Sequence Analysis/statistics & numerical data , Computational Biology , Computer Simulation , Database Management Systems , Databases, Genetic , Sequence Alignment/standards , Sequence Alignment/trends , Sequence Analysis/standards , Sequence Analysis/trends , Software/standards , Software/trends , Software Design , User-Computer Interface
13.
Gigascience ; 9(3)2020 03 01.
Article in English | MEDLINE | ID: mdl-32170312

ABSTRACT

BACKGROUND: Over the past few years the variety of experimental designs and protocols for sequencing experiments increased greatly. To ensure the wide usability of the produced data beyond an individual project, rich and systematic annotation of the underlying experiments is crucial. FINDINGS: We first developed an annotation structure that captures the overall experimental design as well as the relevant details of the steps from the biological sample to the library preparation, the sequencing procedure, and the sequencing and processed files. Through various design features, such as controlled vocabularies and different field requirements, we ensured a high annotation quality, comparability, and ease of annotation. The structure can be easily adapted to a large variety of species. We then implemented the annotation strategy in a user-hosted web platform with data import, query, and export functionality. CONCLUSIONS: We present here an annotation structure and user-hosted platform for sequencing experiment data, suitable for lab-internal documentation, collaborations, and large-scale annotation efforts.


Subject(s)
Molecular Sequence Annotation/methods , Sequence Analysis/methods , Software , Molecular Sequence Annotation/standards , Sequence Analysis/standards
14.
Genes (Basel) ; 10(9)2019 08 28.
Article in English | MEDLINE | ID: mdl-31466373

ABSTRACT

Shotgun metagenomics using next generation sequencing (NGS) is a promising technique to analyze both DNA and RNA microbial material from patient samples. Mostly used in a research setting, it is now increasingly being used in the clinical realm as well, notably to support diagnosis of viral infections, thereby calling for quality control and the implementation of ring trials (RT) to benchmark pipelines and ensure comparable results. The Swiss NGS clinical virology community therefore decided to conduct a RT in 2018, in order to benchmark current metagenomic workflows used at Swiss clinical virology laboratories, and thereby contribute to the definition of common best practices. The RT consisted of two parts (increments), in order to disentangle the variability arising from the experimental compared to the bioinformatics parts of the laboratory pipeline. In addition, the RT was also designed to assess the impact of databases compared to bioinformatics algorithms on the final results, by asking participants to perform the bioinformatics analysis with a common database, in addition to using their own in-house database. Five laboratories participated in the RT (seven pipelines were tested). We observed that the algorithms had a stronger impact on the overall performance than the choice of the reference database. Our results also suggest that differences in sample preparation can lead to significant differences in the performance, and that laboratories should aim for at least 5-10 million reads per sample and use depth of coverage in addition to other interpretation metrics such as the percent of coverage. Performance was generally lower when increasing the number of viruses per sample. The lessons learned from this pilot study will be useful for the development of larger-scale RTs to serve as regular quality control tests for laboratories performing NGS analyses of viruses in a clinical setting.
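
The two interpretation metrics named in this abstract, depth of coverage and percent of coverage, can be told apart with a small Python sketch. The function name and the input format (half-open alignment intervals on a single reference genome) are assumptions for illustration; the ring trial's pipelines are not described at this level of detail.

```python
def coverage_metrics(genome_length, alignments):
    """Return (mean depth, percent of positions covered at least once).

    alignments: iterable of (start, end) half-open intervals, 0-based,
    one per aligned read (a simplification that ignores gaps and clipping).
    """
    depth = [0] * genome_length
    for start, end in alignments:
        for pos in range(max(start, 0), min(end, genome_length)):
            depth[pos] += 1
    covered = sum(1 for d in depth if d > 0)
    mean_depth = sum(depth) / genome_length
    percent_covered = 100.0 * covered / genome_length
    return mean_depth, percent_covered
```

The two numbers answer different questions: many reads piled onto one short region give a high mean depth but a low percent of coverage, which is why reporting both is more informative than either alone.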


Subject(s)
Clinical Laboratory Services/standards , Genome, Viral , Laboratory Proficiency Testing/methods , Metagenome , Metagenomics/standards , Sequence Analysis/standards , Genome, Human , Humans , Metagenomics/methods , Sequence Analysis/methods , Switzerland
15.
Breast ; 45: 29-35, 2019 Jun.
Article in English | MEDLINE | ID: mdl-30822622

ABSTRACT

Multigene panel testing for breast and ovarian cancer predisposition diagnosis is a useful tool as it makes it possible to sequence a considerable number of genes in a large number of individuals. More than 200 different multigene panels in which the two major BRCA1 and BRCA2 breast cancer predisposing genes are included are proposed by public or commercial laboratories. We review the clinical validity and clinical utility of the 26 genes most often included in these panels. Because clinical validity and utility are not established for all genes and due to the heterogeneity of tumour risk levels, there is a substantial difficulty in the routine use of multigene panels if management guidelines and recommendations for testing relatives are not previously defined for each gene. Besides, the classification of variants of unknown significance (VUS) is a particular limitation and challenge. Efforts to classify VUSs and also to identify factors that modify cancer risks are now needed to produce personalised risk estimates. The complexity of information, the capacity to come back to patients when VUS are re-classified as pathogenic, and the expected large increase in the number of individuals to be tested especially when the aim of multigene panel testing is not only prevention but also treatment are challenging both for physicians and patients. Quality of tests, interpretation of results, information and accompaniment of patients must be at the heart of the guidelines of multigene panel testing.


Subject(s)
Breast Neoplasms/genetics , Early Detection of Cancer/standards , Genetic Predisposition to Disease , Genetic Testing/standards , Sequence Analysis/standards , Biomarkers, Tumor/genetics , Early Detection of Cancer/methods , Female , Genes, BRCA1 , Genes, BRCA2 , Genetic Testing/methods , Genetic Variation , Humans , Ovarian Neoplasms/genetics , Reproducibility of Results , Sequence Analysis/methods
16.
Gigascience ; 6(8): 1-11, 2017 08 01.
Article in English | MEDLINE | ID: mdl-28637310

ABSTRACT

Metagenomics data analyses from independent studies can only be compared if the analysis workflows are described in a harmonized way. In this overview, we have mapped the landscape of data standards available for the description of essential steps in metagenomics: (i) material sampling, (ii) material sequencing, (iii) data analysis, and (iv) data archiving and publishing. Taking examples from marine research, we summarize essential variables used to describe material sampling processes and sequencing procedures in a metagenomics experiment. These aspects of metagenomics dataset generation have been to some extent addressed by the scientific community, but greater awareness and adoption is still needed. We emphasize the lack of standards relating to reporting how metagenomics datasets are analysed and how the metagenomics data analysis outputs should be archived and published. We propose best practice as a foundation for a community standard to enable reproducibility and better sharing of metagenomics datasets, leading ultimately to greater metagenomics data reuse and repurposing.


Subject(s)
Computational Biology/methods , Computational Biology/standards , Metagenomics/methods , Metagenomics/standards , Data Mining/methods , Data Mining/standards , Databases, Genetic , Metagenome , Sequence Analysis/methods , Sequence Analysis/standards , Workflow
17.
ACS Synth Biol ; 5(6): 449-51, 2016 06 17.
Article in English | MEDLINE | ID: mdl-27267452

ABSTRACT

Research is communicated more effectively and reproducibly when articles depict genetic designs consistently and fully disclose the complete sequences of all reported constructs. ACS Synthetic Biology is now providing authors with updated guidance and piloting a new tool and publication workflow that facilitate compliance with these recommended practices and standards for visual representation and data exchange.


Subject(s)
Genetics/standards , Publishing/standards , Research/standards , Sequence Analysis/standards , Synthetic Biology/standards , Humans , Workflow
18.
Methods Mol Biol ; 1418: 3-17, 2016.
Article in English | MEDLINE | ID: mdl-27008007

ABSTRACT

A next-generation sequencing experiment can generate billions of short reads for each sample, and processing of the raw reads adds more information. Various file formats have been introduced/developed in order to store and manipulate this information. This chapter presents an overview of the file formats including FASTQ, FASTA, SAM/BAM, GFF/GTF, BED, and VCF that are commonly used in analysis of next-generation sequencing data.
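
As a concrete example of one of the formats listed above, the following Python sketch reads FASTQ records. It assumes the common four-line layout (header, sequence, separator, quality) without line wrapping and Phred+33 quality encoding; the function name is hypothetical and the chapter itself may present the format differently.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, phred_scores) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it, None)
        plus = next(it, None)
        qual = next(it, None)
        if seq is None or plus is None or qual is None:
            raise ValueError("truncated FASTQ record")
        seq, plus, qual = seq.rstrip("\n"), plus.rstrip("\n"), qual.rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Phred+33: quality score = ASCII code of the character minus 33.
        yield header[1:], seq, [ord(c) - 33 for c in qual]
```

The function accepts any iterable of lines, so an open file handle can be passed directly, for example `parse_fastq(open(path))` for an uncompressed file.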


Subject(s)
Molecular Sequence Data , Sequence Analysis/methods , Sequence Analysis/standards , Computational Biology/methods , Computational Biology/standards , Genomics/methods , Sequence Alignment/methods , Sequence Alignment/standards
19.
Methods Mol Biol ; 1418: 39-66, 2016.
Article in English | MEDLINE | ID: mdl-27008009

ABSTRACT

Once a biochemical method has been devised to sample RNA or DNA of interest, sequencing can be used to identify the sampled molecules with high fidelity and low bias. High-throughput sequencing has therefore become the primary data acquisition method for many genomics studies and is being used more and more to address molecular biology questions. By applying principles of statistical experimental design, sequencing experiments can be made more sensitive to the effects under study as well as more biologically sound, hence more replicable.


Subject(s)
High-Throughput Nucleotide Sequencing , Research Design , Sequence Analysis , Animals , High-Throughput Nucleotide Sequencing/methods , High-Throughput Nucleotide Sequencing/standards , Humans , Sequence Analysis/methods , Sequence Analysis/standards , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/standards , Sequence Analysis, RNA/methods , Sequence Analysis, RNA/standards
20.
PLoS One ; 10(3): e0119123, 2015.
Article in English | MEDLINE | ID: mdl-25741706

ABSTRACT

Next generation sequencing technologies, like ultra-deep pyrosequencing (UDPS), allow detailed investigation of complex populations, like RNA viruses, but their utility is limited by errors introduced during sample preparation and sequencing. By tagging each individual cDNA molecule with barcodes, referred to as Primer IDs, before PCR and sequencing these errors could theoretically be removed. Here we evaluated the Primer ID methodology on 257,846 UDPS reads generated from an HIV-1 SG3Δenv plasmid clone and plasma samples from three HIV-infected patients. The Primer ID consisted of 11 randomized nucleotides (4^11 = 4,194,304 combinations) in the primer for cDNA synthesis that introduced a unique sequence tag into each cDNA molecule. Consensus template sequences were constructed for reads with Primer IDs that were observed three or more times. Despite high numbers of input template molecules, the number of consensus template sequences was low. With 10,000 input molecules for the clone as few as 97 consensus template sequences were obtained due to highly skewed frequency of resampling. Furthermore, the number of sequenced templates was overestimated due to PCR errors in the Primer IDs. Finally, some consensus template sequences were erroneous due to hotspots for UDPS errors. The Primer ID methodology has the potential to provide highly accurate deep sequencing. However, it is important to be aware that there are remaining challenges with the methodology. In particular it is important to find ways to obtain a more even frequency of resampling of template molecules as well as to identify and remove artefactual consensus template sequences that have been generated by PCR errors in the Primer IDs.
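
The consensus step described in this abstract can be sketched in Python. This is a simplified illustration rather than the authors' pipeline: the function names are hypothetical, reads are assumed to be already aligned and of equal length, and the consensus is a plain per-position majority vote. The Primer ID length (11) and the minimum of three reads per Primer ID come from the abstract.

```python
from collections import Counter, defaultdict

PRIMER_ID_LENGTH = 11   # randomized nucleotides, as stated in the abstract
MIN_READS_PER_ID = 3    # Primer IDs seen fewer times are discarded


def primer_id_space(length=PRIMER_ID_LENGTH):
    """Number of distinct Primer IDs: four bases at each position."""
    return 4 ** length


def consensus(seqs):
    """Per-position majority base; assumes aligned, equal-length reads."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def consensus_templates(reads, min_reads=MIN_READS_PER_ID):
    """Group (primer_id, sequence) pairs and build one consensus per ID."""
    groups = defaultdict(list)
    for primer_id, seq in reads:
        groups[primer_id].append(seq)
    return {
        pid: consensus(seqs)
        for pid, seqs in groups.items()
        if len(seqs) >= min_reads
    }
```

Grouping by exact Primer ID is also where the overestimation reported in the abstract arises: a PCR error inside the Primer ID creates a new, spurious group, which this sketch makes no attempt to detect.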


Subject(s)
Sequence Analysis/methods , Base Sequence , DNA Primers , HIV-1/genetics , Molecular Sequence Data , Polymerase Chain Reaction , Sequence Analysis/standards , Sequence Homology, Nucleic Acid