1.
Genome Biol ; 21(1): 261, 2020 10 13.
Article in English | MEDLINE | ID: mdl-33050927

ABSTRACT

iMOKA (interactive multi-objective k-mer analysis) is software that enables comprehensive analysis of sequencing data from large cohorts to generate robust classification models or explore specific genetic elements associated with disease etiology. iMOKA uses a fast and accurate feature reduction step that combines a Naïve Bayes classifier augmented by an adaptive entropy filter and a graph-based filter to rapidly reduce the search space. By using a flexible file format and distributed indexing, iMOKA can easily integrate data from multiple experiments while reducing disk space requirements, and it identifies changes in transcript levels and single nucleotide variants. iMOKA is available at https://github.com/RitchieLabIGH/iMOKA and on Zenodo at https://doi.org/10.5281/zenodo.4008947.
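As a rough illustration of what an entropy-based k-mer filter can look like, the sketch below computes the Shannon entropy of each k-mer's count distribution across samples and keeps k-mers at or above a fixed threshold. This is not iMOKA's implementation: the function names, the fixed (non-adaptive) threshold, the direction of the filter and the toy counts are all assumptions made for illustration.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (in bits) of one k-mer's counts across samples."""
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


def entropy_filter(kmer_counts, min_entropy):
    """Keep k-mers whose entropy is >= min_entropy.

    min_entropy is a hypothetical fixed threshold; an adaptive filter
    would update it while scanning the data.
    """
    return {
        kmer: counts
        for kmer, counts in kmer_counts.items()
        if shannon_entropy(counts) >= min_entropy
    }


# Toy counts for three k-mers over four samples (illustrative values only).
kmer_counts = {
    "ACGTA": [5, 5, 5, 5],   # spread evenly over all samples
    "TTGCA": [20, 0, 0, 0],  # seen in a single sample
    "GGATC": [8, 8, 0, 0],   # seen in two samples
}
kept = entropy_filter(kmer_counts, min_entropy=1.0)
```

With these toy values, the k-mer observed in a single sample is discarded and the other two are kept.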


Subject(s)
Sequence Analysis, DNA , Software , Algorithms , Breast Neoplasms/classification , Breast Neoplasms/drug therapy , Breast Neoplasms/genetics , Drug Resistance, Neoplasm/genetics , Female , Humans , Ovarian Neoplasms/drug therapy , Ovarian Neoplasms/genetics , Pharmacogenomic Variants
2.
Commun Biol ; 2: 222, 2019.
Article in English | MEDLINE | ID: mdl-31240260

ABSTRACT

Comparative analysis of high-throughput sequencing data between multiple conditions often involves mapping of sequencing reads to a reference and downstream bioinformatics analyses. Both of these steps may introduce heavy bias and potential data loss. This is especially true in studies where patient transcriptomes or genomes may vary from their references, such as in cancer. Here we describe a novel approach and associated software that makes use of advances in genetic algorithms and feature selection to comprehensively explore massive volumes of sequencing data to classify and discover new sequences of interest without a mapping step and without intensive use of specialized bioinformatics pipelines. We demonstrate that our approach, called GECKO (GEnetic Classification using k-mer Optimization), is effective at classifying and extracting meaningful sequences from multiple types of sequencing approaches, including mRNA, microRNA, and DNA methylome data.
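To make the idea of working without a mapping step concrete, the following minimal sketch builds a k-mer count matrix directly from raw reads, with one row per k-mer and one column per sample. It only illustrates the kind of input a k-mer based classifier could consume; it does not reproduce GECKO's genetic algorithm, and the function names, sample names and reads are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in a list of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_matrix(samples, k):
    """Return (sample_names, {kmer: [count in each sample]}).

    No reference genome is involved: the features are simply the
    k-mers observed in the reads.
    """
    names = list(samples)
    per_sample = {name: count_kmers(samples[name], k) for name in names}
    all_kmers = sorted(set().union(*per_sample.values()))
    matrix = {
        kmer: [per_sample[name][kmer] for name in names]
        for kmer in all_kmers
    }
    return names, matrix


# Two toy samples with one short read each (illustrative data only).
samples = {"tumor": ["ACGTAC"], "normal": ["ACGTTT"]}
names, matrix = kmer_matrix(samples, k=3)
```

Here `matrix["GTA"]` is `[1, 0]` because that k-mer occurs only in the hypothetical tumor read, while `matrix["ACG"]` is `[1, 1]`.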


Subject(s)
Algorithms , High-Throughput Nucleotide Sequencing/methods , Blood Cells , Breast Neoplasms/classification , Breast Neoplasms/genetics , Computational Biology/methods , DNA Methylation , Humans , MicroRNAs , Mutation , RNA, Messenger , Software
3.
Nucleic Acids Res ; 42(5): 2820-32, 2014 Mar.
Article in English | MEDLINE | ID: mdl-24357408

ABSTRACT

Recent sequencing technologies that allow massive parallel production of short reads are the method of choice for transcriptome analysis. In particular, digital gene expression (DGE) technologies produce a large dynamic range of expression data by generating short tag signatures for each cell transcript. These tags can be mapped back to a reference genome to identify new transcribed regions that can be further covered by RNA-sequencing (RNA-Seq) reads. Here, we applied an integrated bioinformatics approach that combines DGE tags, RNA-Seq, tiling array expression data and species comparison to explore new transcriptional regions and their specific biological features, particularly tissue expression or conservation. We analysed tags from a large DGE data set (designated as 'TranscriRef'). We then annotated 750,000 tags that were uniquely mapped to the human genome according to Ensembl. We retained transcripts originating from both DNA strands and categorized tags corresponding to protein-coding genes, antisense, intronic- or intergenic-transcribed regions and computed their overlap with annotated non-coding transcripts. Using this bioinformatics approach, we identified ∼34,000 novel transcribed regions located outside the boundaries of known protein-coding genes. As demonstrated using sequencing data from human pluripotent stem cells for biological validation, the method could be easily applied for the selection of tissue-specific candidate transcripts. DigitagCT is available at http://cractools.gforge.inria.fr/softwares/digitagct.


Subject(s)
Gene Expression Profiling/methods , Genome, Human , RNA, Untranslated/analysis , Sequence Analysis, RNA/methods , Cell Line , Humans , Molecular Sequence Annotation , Poly A/analysis , Software , Transcription, Genetic
4.
J Comput Biol ; 18(9): 1141-54, 2011 Sep.
Article in English | MEDLINE | ID: mdl-21899421

ABSTRACT

Chaining fragments is a crucial step in genome alignment. Existing chaining algorithms compute a maximum weighted chain with no overlaps allowed between adjacent fragments. In practice, using local alignments as fragments, instead of Maximal Exact Matches (MEMs), generates frequent overlaps between fragments, due to combinatorial reasons and biological factors, i.e., variable tandem repeat structures that differ in number of copies between genomic sequences. In this article, to lift this limitation, we formulate a novel definition of a chain, allowing overlaps proportional to the fragment lengths, and present an efficient algorithm for computing such a maximum weighted chain. We tested our algorithm on a dataset composed of 694 genome pairs and observed significant improvements in coverage, while keeping the running times within reasonable limits. Moreover, experiments with different ratios of allowed overlaps showed the robustness of the chains with respect to these ratios. Our algorithm is implemented in a tool called OverlapChainer (OC), which is available upon request from the authors.
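The notion of a chain that tolerates proportional overlaps can be sketched with a simple quadratic dynamic program. This is not the OverlapChainer algorithm, which the abstract describes as efficient: the fragment representation, the rule that an overlap may not exceed `ratio` times the shorter fragment's length, and the penalty that subtracts the larger overlap from the score are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    start1: int    # inclusive start on genome 1
    end1: int      # exclusive end on genome 1
    start2: int    # inclusive start on genome 2
    end2: int      # exclusive end on genome 2
    weight: float


def overlap(prev_end, next_start):
    """Length by which the earlier fragment runs into the later one."""
    return max(0, prev_end - next_start)


def can_precede(a, b, ratio):
    """True if a may come directly before b in a chain (assumed rule)."""
    ordered = (a.start1 < b.start1 and a.end1 < b.end1
               and a.start2 < b.start2 and a.end2 < b.end2)
    if not ordered:
        return False
    min_len1 = min(a.end1 - a.start1, b.end1 - b.start1)
    min_len2 = min(a.end2 - a.start2, b.end2 - b.start2)
    return (overlap(a.end1, b.start1) <= ratio * min_len1
            and overlap(a.end2, b.start2) <= ratio * min_len2)


def best_chain(fragments, ratio):
    """O(n^2) search for a maximum weighted chain with bounded overlaps."""
    frags = sorted(fragments, key=lambda f: (f.start1, f.start2))
    if not frags:
        return 0.0, []
    score = [f.weight for f in frags]
    prev = [None] * len(frags)
    for i, cur in enumerate(frags):
        for j in range(i):
            if can_precede(frags[j], cur, ratio):
                # Assumed penalty: do not count the overlapped part twice.
                penalty = max(overlap(frags[j].end1, cur.start1),
                              overlap(frags[j].end2, cur.start2))
                candidate = score[j] + cur.weight - penalty
                if candidate > score[i]:
                    score[i], prev[i] = candidate, j
    i = max(range(len(frags)), key=score.__getitem__)
    best, chain = score[i], []
    while i is not None:
        chain.append(frags[i])
        i = prev[i]
    return best, chain[::-1]


a = Fragment(0, 10, 0, 10, 10.0)
b = Fragment(8, 20, 8, 20, 12.0)   # overlaps a by 2 positions on both genomes
c = Fragment(25, 30, 25, 30, 5.0)
strict_score, strict_chain = best_chain([a, b, c], ratio=0.0)
relaxed_score, relaxed_chain = best_chain([a, b, c], ratio=0.25)
```

With `ratio=0.0` the overlapping pair cannot be chained, so the best chain uses only `b` and `c`; with `ratio=0.25` all three fragments fit in one chain, which illustrates how allowing overlaps can increase coverage.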


Subject(s)
Algorithms , Genome, Bacterial , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Software
5.
Nucleic Acids Res ; 39(15): e101, 2011 Aug.
Article in English | MEDLINE | ID: mdl-21646341

ABSTRACT

Genome comparison is now a crucial step for genome annotation and identification of regulatory motifs. Genome comparison aims, for instance, to find genomic regions that are either specific to, or in one-to-one correspondence between, individuals, strains or species. It serves, for example, to pre-annotate a new genome by automatically transferring annotations from a known one. However, the efficiency, flexibility and objectives of current methods do not suit the whole spectrum of applications, genome sizes and organizations. Innovative approaches are still needed. Hence, we propose an alternative way of comparing multiple genomes based on segmentation by similarity. In this framework, rather than being formulated as a complex optimization problem, genome comparison is seen as a segmentation question for which a single optimal solution can be found in almost linear time. We apply our method to analyse three strains of a virulent pathogenic bacterium, Ehrlichia ruminantium, and identify 92 new genes. We also find that a substantial number of genes thought to be strain specific have potential orthologs in the other strains. Our solution is implemented in an efficient program, qod, equipped with a user-friendly interface, and enables the automatic transfer of annotations between compared genomes or contigs (Video in Supplementary Data). Because it to some extent disregards the relative order of genomic blocks, qod can handle unfinished genomes, which, given the difficulty of completing genome sequences, may prove a valuable feature in the future. Availability: http://www.atgc-montpellier.fr/qod.


Subject(s)
Genomics/methods , Software , Algorithms , Ehrlichia ruminantium/classification , Ehrlichia ruminantium/genetics , Genes, Bacterial , Genome, Bacterial , Species Specificity