
Robust and scalable barcoding for massively parallel long-read sequencing.

Ezpeleta, Joaquín; Garcia Labari, Ignacio; Villanova, Gabriela Vanina; Bulacio, Pilar; Lavista-Llanos, Sofía; Posner, Victoria; Krsticevic, Flavia; Arranz, Silvia; Tapia, Elizabeth.

Sci Rep ; 12(1): 7619, 2022 05 10.

Article in English | MEDLINE | ID: mdl-35538127

ABSTRACT

Nucleic-acid barcoding is an enabling technique for many applications, but its use remains limited in emerging long-read sequencing technologies with intrinsically low raw accuracy. Here, we apply so-called NS-watermark barcodes, whose error correction capability was previously validated in silico, in a proof of concept where we synthesize 3840 NS-watermark barcodes and use them to asymmetrically tag and simultaneously sequence amplicons from two evolutionarily distant species (namely Bordetella pertussis and Drosophila mojavensis) on the ONT MinION platform. To our knowledge, this is the largest number of distinct, non-random tags ever sequenced in parallel and the first report of microarray-based synthesis as a source for large oligonucleotide pools for barcoding. We recovered the identity of more than 86% of the barcodes, with a crosstalk rate of 0.17% (i.e., one misassignment every 584 reads). This falls in the range of the index hopping rate of established, high-accuracy Illumina sequencing, despite the increased number of tags and the relatively low accuracy of both microarray-based synthesis and long-read sequencing. The robustness of NS-watermark barcodes, together with their scalable design and compatibility with low-cost massive synthesis, makes them promising for present and future sequencing applications requiring massive labeling, such as long-read single-cell RNA-Seq.
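The two headline figures in this abstract are linked by simple arithmetic, which the following minimal sketch checks. The function names are illustrative only and do not come from the paper; the inputs (584 reads, 3840 barcodes, 86%) are taken from the abstract.

```python
import math


def crosstalk_percent(reads_per_misassignment: int) -> float:
    """Convert 'one misassignment every N reads' into a percentage."""
    return 100.0 / reads_per_misassignment


def min_recovered(total_barcodes: int, fraction: float) -> int:
    """Lower bound on recovered barcodes when 'more than' `fraction` were recovered."""
    return math.floor(total_barcodes * fraction)


# One misassignment every 584 reads corresponds to the reported 0.17%.
print(f"{crosstalk_percent(584):.2f}%")

# "More than 86%" of 3840 synthesized barcodes.
print(min_recovered(3840, 0.86))
```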

Subject(s)

DNA Barcoding, Taxonomic , High-Throughput Nucleotide Sequencing , DNA Barcoding, Taxonomic/methods , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods

FGGA-lnc: automatic gene ontology annotation of lncRNA sequences based on secondary structures.

Spetale, Flavio E; Murillo, Javier; Villanova, Gabriela V; Bulacio, Pilar; Tapia, Elizabeth.

Interface Focus ; 11(4): 20200064, 2021 Jun.

Article in English | MEDLINE | ID: mdl-34123354

ABSTRACT

The study of long non-coding RNAs (lncRNAs), greater than 200 nucleotides, is central to understanding the development and progression of many complex diseases. Unlike proteins, the functionality of lncRNAs is only subtly encoded in their primary sequence. Current in-silico lncRNA annotation methods mostly rely on annotations inferred from interaction networks. But extensive experimental studies are required to build these networks. In this work, we present a graph-based machine learning method called FGGA-lnc for the automatic gene ontology (GO) annotation of lncRNAs across the three GO subdomains. We build upon FGGA (factor graph GO annotation), a computational method originally developed to annotate protein sequences from non-model organisms. In the FGGA-lnc version, a coding-based approach is introduced to fuse primary sequence and secondary structure information of lncRNA molecules. As a result, lncRNA sequences become sequences of a higher-order alphabet allowing supervised learning methods to assess individual GO-term annotations. Raw GO annotations obtained in this way are unaware of the GO structure and therefore likely to be inconsistent with it. The message-passing algorithm embodied by factor graph models overcomes this problem. Evaluations of the FGGA-lnc method on lncRNA data, from model and non-model organisms, showed promising results suggesting it as a candidate to satisfy the huge demand for functional annotations arising from high-throughput sequencing technologies.

Consistency of the Tools That Predict the Impact of Single Nucleotide Variants (SNVs) on Gene Functionality: The BRCA1 Gene.

Murillo, Javier; Spetale, Flavio; Guillaume, Serge; Bulacio, Pilar; Garcia Labari, Ignacio; Cailloux, Olivier; Destercke, Sebastien; Tapia, Elizabeth.

Biomolecules ; 10(3), 2020 03 20.

Article in English | MEDLINE | ID: mdl-32244891

ABSTRACT

Single nucleotide variants (SNVs) occurring in a protein-coding gene may disrupt its function in multiple ways. Predicting this disruption has been recognized as an important problem in bioinformatics research. Many tools, hereafter p-tools, have been designed to perform these predictions, and many of them are now in common use in scientific research and even in clinical applications. This highlights the importance of understanding the semantics of their outputs. To shed light on this issue, two questions are formulated: (i) do p-tools provide similar predictions (inner consistency)? and (ii) are these predictions consistent with the literature (outer consistency)? To answer these, six p-tools are evaluated with exhaustive SNV datasets from the BRCA1 gene. Two indices, called K_all and K_strong, are proposed to quantify the inner consistency of pairs of p-tools, while the outer consistency is quantified by standard information retrieval metrics. The inner consistency analysis reveals that most of the p-tools are not consistent with each other, and the outer consistency analysis reveals that they are characterized by low prediction performance. Although this result highlights the need to improve the prediction performance of individual p-tools, the inner consistency results pave the way to the systematic design of truly diverse ensembles of p-tools that can overcome the limitations of individual members.
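The abstract does not say which information retrieval metrics were used for the outer consistency analysis. As a hedged illustration, the sketch below computes the most common ones (precision, recall and F1) for one p-tool against a literature-derived reference set. The variant labels and the function name are hypothetical, and the two consistency indices proposed in the paper are not reproduced here because their definitions are not given in the abstract.

```python
def retrieval_metrics(predicted, reference):
    """Precision, recall and F1 of a predicted set against a reference set."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: variants a p-tool flags as damaging vs. variants
# reported as damaging in the literature (labels are made up).
flagged_by_tool = {"v1", "v2", "v3", "v4"}
reported_in_literature = {"v2", "v4", "v5"}

metrics = retrieval_metrics(flagged_by_tool, reported_in_literature)
print(metrics)
```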

Subject(s)

BRCA1 Protein , Computational Biology , Models, Genetic , Polymorphism, Single Nucleotide , BRCA1 Protein/genetics , BRCA1 Protein/metabolism , Humans

Consistent prediction of GO protein localization.

Spetale, Flavio E; Arce, Debora; Krsticevic, Flavia; Bulacio, Pilar; Tapia, Elizabeth.

Sci Rep ; 8(1): 7757, 2018 05 17.

Article in English | MEDLINE | ID: mdl-29773825

ABSTRACT

The GO-Cellular Component (GO-CC) ontology provides a controlled vocabulary for the consistent description of the subcellular compartments or macromolecular complexes where proteins may act. Current machine learning-based methods used for the automated GO-CC annotation of proteins suffer from the inconsistency of individual GO-CC term predictions. Here, we present FGGA-CC+, a class of hierarchical graph-based classifiers for the consistent GO-CC annotation of protein-coding genes at the subcellular compartment or macromolecular complex levels. To boost the accuracy of GO-CC predictions, we make use of the protein localization knowledge in the GO-Biological Process (GO-BP) annotations. As a result, FGGA-CC+ classifiers are built from annotation data in both the GO-CC and GO-BP ontologies. Due to their graph-based design, FGGA-CC+ classifiers are fully interpretable and their predictions amenable to expert analysis. Promising results on protein annotation data from five model organisms were obtained. Additionally, successful validation results in the annotation of a challenging subset of tandem duplicated genes in the tomato non-model organism were accomplished. Overall, these results suggest that FGGA-CC+ classifiers can indeed be useful for satisfying the huge demand for GO-CC annotation arising from ubiquitous high-throughput sequencing and proteomic projects.

Subject(s)

Arabidopsis/metabolism , Computational Biology/methods , Drosophila melanogaster/metabolism , Gene Ontology , Proteins/metabolism , Saccharomyces cerevisiae/metabolism , Solanum lycopersicum/metabolism , Animals , Databases, Protein , Molecular Sequence Annotation , Proteins/analysis , Proteomics , Software

Designing robust watermark barcodes for multiplex long-read sequencing.

Ezpeleta, Joaquín; Krsticevic, Flavia J; Bulacio, Pilar; Tapia, Elizabeth.

Bioinformatics ; 33(6): 807-813, 2017 03 15.

Article in English | MEDLINE | ID: mdl-27259539

ABSTRACT

Motivation: To attain acceptable sample misassignment rates, current approaches to multiplex single-molecule real-time sequencing require upstream quality improvement, which is obtained from multiple passes over the sequenced insert and significantly reduces the effective read length. In order to fully exploit the raw read length on multiplex applications, robust barcodes capable of dealing with the full single-pass error rates are needed. Results: We present a method for designing sequencing barcodes that can withstand a large number of insertion, deletion and substitution errors and are suitable for use in multiplex single-molecule real-time sequencing. The manuscript focuses on the design of barcodes for full-length single-pass reads, impaired by challenging error rates in the order of 11%. The proposed barcodes can multiplex hundreds or thousands of samples while achieving sample misassignment probabilities as low as 10^-7 under the above conditions, and are designed to be compatible with chemical constraints imposed by the sequencing process. Availability and Implementation: Software tools for constructing watermark barcode sets and demultiplexing barcoded reads, together with example sets of barcodes and synthetic barcoded reads, are freely available at www.cifasis-conicet.gov.ar/ezpeleta/NS-watermark . Contact: ezpeleta@cifasis-conicet.gov.ar.
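To make concrete why insertions and deletions matter for demultiplexing, here is a minimal sketch of a naive baseline: assigning a read to the barcode with the smallest edit (Levenshtein) distance, and rejecting reads that are too far from every barcode or tied between two. This is not the watermark decoding described in the paper, and the barcode sequences, sample names and threshold are made up for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def assign_sample(read_prefix, barcodes, max_dist):
    """Return the closest sample name, or None if too distant or ambiguous."""
    scored = sorted((edit_distance(read_prefix, seq), name)
                    for name, seq in barcodes.items())
    best_dist, best_name = scored[0]
    if best_dist > max_dist:
        return None  # read is discarded (a "read loss")
    if len(scored) > 1 and scored[1][0] == best_dist:
        return None  # tie between two barcodes: refuse to guess
    return best_name


# Hypothetical 10-nt barcodes; real sets would be longer and far larger.
barcodes = {"sample_1": "ACGTACGTAC", "sample_2": "TTGACCAGTA"}

# A read prefix that lost one base of sample_1's barcode (a deletion).
print(assign_sample("ACGTCGTAC", barcodes, max_dist=2))
```

A brute-force comparison like this scales linearly with the number of barcodes and offers no designed-in error-correction guarantees, which is the gap that purpose-built barcode designs aim to close.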

Subject(s)

High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Software , Computer Simulation

A Factor Graph Approach to Automated GO Annotation.

Spetale, Flavio E; Tapia, Elizabeth; Krsticevic, Flavia; Roda, Fernando; Bulacio, Pilar.

PLoS One ; 11(1): e0146986, 2016.

Article in English | MEDLINE | ID: mdl-26771463

ABSTRACT

As the volume of genomic data grows, computational methods become essential for providing a first glimpse into gene annotations. Automated Gene Ontology (GO) annotation methods based on hierarchical ensemble classification techniques are particularly interesting when interpretability of annotation results is a main concern. In these methods, raw GO-term predictions computed by base binary classifiers are leveraged by checking the consistency of predefined GO relationships. Both formal leveraging strategies, with a main focus on annotation precision, and heuristic alternatives, with a main focus on scalability issues, have been described in the literature. In this contribution, a factor graph approach to the hierarchical ensemble formulation of the automated GO annotation problem is presented. In this formal framework, a core factor graph is first built based on the GO structure and then enriched to take into account the noisy nature of GO-term predictions. Hence, starting from raw GO-term predictions, an iterative message passing algorithm between nodes of the factor graph is used to compute marginal probabilities of target GO-terms. Evaluations on Saccharomyces cerevisiae, Arabidopsis thaliana and Drosophila melanogaster protein sequences from the GO Molecular Function domain showed significant improvements over competing approaches, even when protein sequences were naively characterized by their physicochemical and secondary structure properties or when loose noisy annotation datasets were considered. Based on these promising results and using Arabidopsis thaliana annotation data, we extend our approach to the identification of the most promising molecular function annotations for a set of proteins of unknown function in Solanum lycopersicum.
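The idea of turning raw, possibly inconsistent GO-term predictions into consistent marginal probabilities can be illustrated on a toy hierarchy. The sketch below is an assumption-laden simplification: it uses a hard constraint (a child term can only hold if its parent holds), treats each raw score as an independent weight, and computes marginals by brute-force enumeration rather than by the iterative message passing used in the paper. Term names, scores and the factor design are hypothetical.

```python
from itertools import product


def consistent_marginals(raw, parent_of):
    """Marginal P(term = 1) under a 'child implies parent' constraint.

    raw:       dict term -> raw classifier probability that the term applies
    parent_of: dict child term -> parent term
    """
    terms = list(raw)
    marginals = {t: 0.0 for t in terms}
    total = 0.0
    for bits in product((0, 1), repeat=len(terms)):
        state = dict(zip(terms, bits))
        # Hard consistency factor: skip assignments where a child is on
        # while its parent is off.
        if any(state[c] and not state[p] for c, p in parent_of.items()):
            continue
        weight = 1.0
        for t in terms:
            weight *= raw[t] if state[t] else 1.0 - raw[t]
        total += weight
        for t in terms:
            if state[t]:
                marginals[t] += weight
    return {t: m / total for t, m in marginals.items()}


# Inconsistent raw scores: the child looks likelier than its parent.
raw_scores = {"parent_term": 0.3, "child_term": 0.8}
hierarchy = {"child_term": "parent_term"}

marg = consistent_marginals(raw_scores, hierarchy)
print(marg)
```

Enumeration is exponential in the number of terms, so it is only viable for toy cases; message passing on a factor graph is what makes the same kind of computation tractable on a real ontology.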

Subject(s)

Drosophila melanogaster/genetics , Gene Ontology , Algorithms , Animals , Arabidopsis/genetics , Computational Biology , Solanum lycopersicum/genetics , Saccharomyces cerevisiae/genetics , Software

DNA Barcoding through Quaternary LDPC Codes.

Tapia, Elizabeth; Spetale, Flavio; Krsticevic, Flavia; Angelone, Laura; Bulacio, Pilar.

PLoS One ; 10(10): e0140459, 2015.

Article in English | MEDLINE | ID: mdl-26492348

ABSTRACT

For many parallel applications of Next-Generation Sequencing (NGS) technologies, short barcodes able to accurately multiplex a large number of samples are in demand. To address these competing requirements, the use of error-correcting codes is advised. Current barcoding systems are mostly built from short random error-correcting codes, a feature that strongly limits their multiplexing accuracy and experimental scalability. To overcome these problems on sequencing systems impaired by mismatch errors, the alternative use of binary BCH and pseudo-quaternary Hamming codes has been proposed. However, these codes either fail to provide a fine scale of barcode sizes (BCH) or have intrinsically poor error-correcting abilities (Hamming). Here, the design of barcodes from shortened binary BCH codes and quaternary Low Density Parity Check (LDPC) codes is introduced. Simulation results show that although accurate barcoding systems of high multiplexing capacity can be obtained with any of these codes, using quaternary LDPC codes may be particularly advantageous due to the lower rates of read losses and undetected sample misidentification errors. Even at mismatch error rates of 10^-2 per base, 24-nt LDPC barcodes can be used to multiplex roughly 2000 samples with a sample misidentification error rate in the order of 10^-9 at the expense of a rate of read losses just in the order of 10^-6.
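The closing figures of this abstract can be put in perspective with a short calculation: how much of the 24-nt sequence space 2000 barcodes occupy, and what the quoted rates would mean for a sequencing run. The sketch below only restates numbers from the abstract; the run size of one billion reads is an arbitrary assumption chosen for illustration.

```python
import math

BARCODE_LENGTH = 24   # nucleotides, from the abstract
NUM_SAMPLES = 2000    # "roughly 2000 samples", from the abstract

# Size of the quaternary sequence space for 24-nt barcodes.
sequence_space = 4 ** BARCODE_LENGTH

# Information symbols needed to index the samples, and the resulting code
# rate; the remaining symbols are redundancy available for error control.
info_symbols = math.log(NUM_SAMPLES, 4)
code_rate = info_symbols / BARCODE_LENGTH

print(f"sequence space: {sequence_space}")
print(f"information symbols: {info_symbols:.2f} of {BARCODE_LENGTH}")
print(f"code rate: {code_rate:.3f}")

# Expected events in a hypothetical run of one billion reads at the
# order-of-magnitude rates quoted in the abstract.
reads = 1_000_000_000
expected_lost = reads * 1e-6
expected_misidentified = reads * 1e-9
print(f"expected read losses: {expected_lost:.0f}")
print(f"expected misidentifications: {expected_misidentified:.0f}")
```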

Subject(s)

DNA Barcoding, Taxonomic/methods , Probability

Multiclass classification of microarray data samples with a reduced number of genes.

Tapia, Elizabeth; Ornella, Leonardo; Bulacio, Pilar; Angelone, Laura.

BMC Bioinformatics ; 12: 59, 2011 Feb 22.

Article in English | MEDLINE | ID: mdl-21342522

ABSTRACT

BACKGROUND: Multiclass classification of microarray data samples with a reduced number of genes is a rich and challenging problem in Bioinformatics research. The problem gets harder as the number of classes is increased. In addition, the performance of most classifiers is tightly linked to the effectiveness of mandatory gene selection methods. Critical to gene selection is the availability of estimates about the maximum number of genes that can be handled by any classification algorithm. Lack of such estimates may lead to either computationally demanding explorations of a search space with thousands of dimensions or classification models based on gene sets of unrestricted size. In the former case, unbiased but possibly overfitted classification models may arise. In the latter case, biased classification models unable to support statistically significant findings may be obtained. RESULTS: A novel bound on the maximum number of genes that can be handled by binary classifiers in binary mediated multiclass classification algorithms of microarray data samples is presented. The bound suggests that high-dimensional binary output domains might favor the existence of accurate and sparse binary mediated multiclass classifiers for microarray data samples. CONCLUSIONS: A comprehensive experimental work shows that the bound is indeed useful to induce accurate and sparse multiclass classifiers for microarray data samples.

Subject(s)

Algorithms , Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis/methods , Pattern Recognition, Automated/methods , Computational Biology/methods , Humans , Neoplasms/genetics
