Results 1 - 20 of 82
1.
Bioinformatics ; 38(Suppl 1): i404-i412, 2022 06 24.
Article in English | MEDLINE | ID: mdl-35758819

ABSTRACT

MOTIVATION: Intra-sample heterogeneity describes the phenomenon where a genomic sample contains a diverse set of genomic sequences. In practice, the true string sets in a sample are often unknown due to limitations in sequencing technology. In order to compare heterogeneous samples, genome graphs can be used to represent such sets of strings. However, a genome graph is generally able to represent a string set universe that contains multiple sets of strings in addition to the true string set. This difference between genome graphs and string sets is not well characterized. As a result, a distance metric between genome graphs may not match the distance between true string sets. RESULTS: We extend a genome graph distance metric, Graph Traversal Edit Distance (GTED) proposed by Ebrahimpour Boroojeny et al., to FGTED to model the distance between heterogeneous string sets and show that GTED and FGTED always underestimate the Earth Mover's Edit Distance (EMED) between string sets. We introduce the notion of string set universe diameter of a genome graph. Using the diameter, we are able to upper-bound the deviation of FGTED from EMED and to improve FGTED so that it reduces the average error in empirically estimating the similarity between true string sets. On simulated T-cell receptor sequences and actual Hepatitis B virus genomes, we show that the diameter-corrected FGTED reduces the average deviation of the estimated distance from the true string set distances by more than 250%. AVAILABILITY AND IMPLEMENTATION: Data and source code for reproducing the experiments are available at: https://github.com/Kingsford-Group/gtedemedtest/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
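The Earth Mover's Edit Distance compared against here can be illustrated with a toy sketch: for two equal-size, uniformly weighted string sets, EMED reduces to a minimum-cost perfect matching under edit distance. The brute-force matching below is our illustration under that simplifying assumption, not the paper's implementation (which handles general weights), and all function names are ours.

```python
from itertools import permutations

def edit_distance(a: str, b: str) -> int:
    # Classic single-row Levenshtein DP.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

def emed(set_a, set_b):
    # Earth Mover's Edit Distance for two equal-size, uniformly weighted
    # string sets: minimum-cost perfect matching under edit distance.
    # Brute force over permutations; viable for toy set sizes only.
    assert len(set_a) == len(set_b)
    return min(
        sum(edit_distance(a, b) for a, b in zip(set_a, perm))
        for perm in permutations(set_b)
    )

print(emed(["ACGT", "TTTT"], ["ACGA", "TTAT"]))  # best matching: 1 + 1 = 2
```

A genome-graph distance such as FGTED can underestimate this quantity because the graph's string set universe admits string sets other than the true one.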


Subjects
Algorithms; Genome; Genomics; Sequence Analysis, DNA; Software
2.
Bioinformatics ; 37(Suppl_1): i205-i213, 2021 07 12.
Article in English | MEDLINE | ID: mdl-34252955

ABSTRACT

MOTIVATION: The size of a genome graph-the space required to store the nodes, node labels and edges-affects the efficiency of operations performed on it. For example, the time complexity to align a sequence to a graph without a graph index depends on the total number of characters in the node labels and the number of edges in the graph. This raises the need for approaches to construct space-efficient genome graphs. RESULTS: We point out similarities in the string encoding mechanisms of genome graphs and the external pointer macro (EPM) compression model. We present a pair of linear-time algorithms that transform between genome graphs and EPM-compressed forms. The algorithms result in an upper bound on the size of the genome graph constructed in terms of an optimal EPM compression. To further reduce the size of the genome graph, we propose the source assignment problem that optimizes over the equivalent choices during compression and introduce an ILP formulation that solves that problem optimally. As a proof-of-concept, we introduce RLZ-Graph, a genome graph constructed based on the relative Lempel-Ziv algorithm. Using RLZ-Graph, across all human chromosomes, we are able to reduce the disk space to store a genome graph on average by 40.7% compared to colored compacted de Bruijn graphs constructed by Bifrost under the default settings. The RLZ-Graph scales well in terms of running time and graph sizes with an increasing number of human genome sequences compared to Bifrost and variation graphs produced by VGtoolkit. AVAILABILITY: The RLZ-Graph software is available at: https://github.com/Kingsford-Group/rlzgraph. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
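The relative Lempel-Ziv idea behind RLZ-Graph can be sketched as a greedy parse: each phrase of the target is the longest substring found in the reference, falling back to a literal character. This naive O(n·m) version is our sketch of the compression model only, not the RLZ-Graph construction itself.

```python
def rlz_parse(reference: str, target: str):
    # Greedy relative Lempel-Ziv: at each position, copy the longest
    # substring of `target` that occurs in `reference`; fall back to a
    # single literal character. Naive quadratic matching for clarity.
    phrases, i = [], 0
    while i < len(target):
        best_pos, best_len = -1, 0
        for j in range(len(reference)):
            l = 0
            while (i + l < len(target) and j + l < len(reference)
                   and target[i + l] == reference[j + l]):
                l += 1
            if l > best_len:
                best_pos, best_len = j, l
        if best_len == 0:
            phrases.append(("lit", target[i])); i += 1
        else:
            phrases.append(("ref", best_pos, best_len)); i += best_len
    return phrases

def rlz_decode(reference, phrases):
    out = []
    for p in phrases:
        out.append(p[1] if p[0] == "lit" else reference[p[1]:p[1] + p[2]])
    return "".join(out)

ref, tgt = "ACGTACGTTT", "ACGTTTACGA"
ph = rlz_parse(ref, tgt)
print(ph)
assert rlz_decode(ref, ph) == tgt  # parse round-trips
```

Each `("ref", pos, len)` phrase corresponds to the external-pointer-macro copies that the paper relates to genome graph edges.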


Assuntos
Compressão de Dados , Software , Algoritmos , Genoma Humano , Humanos , Análise de Sequência de DNA
3.
Bioinformatics ; 37(Suppl_1): i334-i341, 2021 07 12.
Article in English | MEDLINE | ID: mdl-34252927

ABSTRACT

MOTIVATION: Despite numerous RNA-seq samples available at large databases, most RNA-seq analysis tools are evaluated on a limited number of RNA-seq samples. This drives a need for methods to select a representative subset from all available RNA-seq samples to facilitate comprehensive, unbiased evaluation of bioinformatics tools. In sequence-based approaches for representative set selection (e.g. a k-mer counting approach that selects a subset based on k-mer similarities between RNA-seq samples), because of the large number of available RNA-seq samples and of k-mers/sequences in each sample, computing the full similarity matrix using k-mers/sequences for the entire set of RNA-seq samples in a large database (e.g. the SRA) has memory and runtime challenges; this makes direct representative set selection infeasible with limited computing resources. RESULTS: We developed a novel computational method called 'hierarchical representative set selection' to handle this challenge. Hierarchical representative set selection is a divide-and-conquer-like algorithm that breaks representative set selection into sub-selections and hierarchically selects representative samples through multiple levels. We demonstrate that hierarchical representative set selection can achieve summarization quality close to that of direct representative set selection, while largely reducing the runtime and memory requirements of computing the full similarity matrix (up to an 8.4× runtime reduction and a 5.35× memory reduction on 10,000 and 12,000 samples, respectively, the largest set sizes that could still practically be run with direct selection). We show that hierarchical representative set selection substantially outperforms random sampling on the entire SRA set of RNA-seq samples, making it a practical solution to representative set selection on large databases like the SRA.
AVAILABILITY AND IMPLEMENTATION: The code is available at https://github.com/Kingsford-Group/hierrepsetselection and https://github.com/Kingsford-Group/jellyfishsim. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
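The divide-and-conquer idea can be sketched in miniature: select within fixed-size chunks first, then select among the chunk winners, so no full pairwise similarity matrix over all samples is ever materialized. The greedy farthest-point criterion and k-mer Jaccard similarity below are stand-ins of ours for the paper's summarization objective, not the actual tool.

```python
def kmers(seq, k=3):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def select_representatives(samples, m):
    # Greedy farthest-point selection: pick m mutually dissimilar samples.
    sets = [kmers(s) for s in samples]
    chosen = [0]
    while len(chosen) < m and len(chosen) < len(samples):
        best = max((i for i in range(len(samples)) if i not in chosen),
                   key=lambda i: min(1 - jaccard(sets[i], sets[c]) for c in chosen))
        chosen.append(best)
    return chosen

def hierarchical_select(samples, m, chunk=4):
    # Level 1: select within chunks (only chunk-local similarities computed);
    # level 2: select among the chunk winners.
    level1 = []
    for start in range(0, len(samples), chunk):
        block = samples[start:start + chunk]
        level1 += [start + i for i in select_representatives(block, m)]
    final = select_representatives([samples[i] for i in level1], m)
    return [level1[i] for i in final]

samples = ["AAAAAA", "AAAAAT", "CCCCCC", "CCCCCG", "GGGGGG", "TTTTTT"]
print(hierarchical_select(samples, 2, chunk=3))
```

The point of the hierarchy is that each level-1 call touches only `chunk`² similarities instead of n², which is what makes selection tractable on SRA-scale collections.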


Subjects
Algorithms; Software; Databases, Factual; RNA-Seq; Sequence Analysis, RNA; Exome Sequencing
4.
Bioinformatics ; 37(8): 1083-1092, 2021 05 23.
Article in English | MEDLINE | ID: mdl-33135733

ABSTRACT

MOTIVATION: The study of the evolutionary history of biological networks enables deep functional understanding of various bio-molecular processes. Network growth models, such as the Duplication-Mutation with Complementarity (DMC) model, provide a principled approach to characterizing the evolution of protein-protein interactions (PPIs) based on duplication and divergence. Current methods for model-based ancestral network reconstruction primarily use greedy heuristics and yield sub-optimal solutions. RESULTS: We present a new Integer Linear Programming (ILP) solution for maximum likelihood reconstruction of ancestral PPI networks using the DMC model. We prove the correctness of our formulation, which is designed to find an optimal solution. It can also use efficient heuristics from general-purpose ILP solvers to obtain multiple optimal and near-optimal solutions that may be useful in many applications. Experiments on synthetic data show that our ILP obtains solutions with higher likelihood than those from previous methods, and is robust to noise and model mismatch. We evaluate our algorithm on two real PPI networks, with proteins from the families of bZIP transcription factors and the Commander complex. On both networks, solutions from our ILP have higher likelihood and are in better agreement with independent biological evidence from other studies. AVAILABILITY AND IMPLEMENTATION: A Python implementation is available at https://bitbucket.org/cdal/network-reconstruction. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms; Programming, Linear; Probability; Proteins
5.
Bioinformatics ; 37(Suppl_1): i187-i195, 2021 07 12.
Article in English | MEDLINE | ID: mdl-34252928

ABSTRACT

MOTIVATION: Minimizers are efficient methods to sample k-mers from genomic sequences that unconditionally preserve sufficiently long matches between sequences. Well-established methods to construct efficient minimizers focus on sampling fewer k-mers on a random sequence and use universal hitting sets (k-mer sets that every sufficiently long sequence must contain) to upper bound the sketch size. In contrast, the problem of sequence-specific minimizers, which is to construct efficient minimizers to sample fewer k-mers on a specific sequence such as the reference genome, is less studied. Currently, the theoretical understanding of this problem is lacking, and existing methods do not specialize well to sketch specific sequences. RESULTS: We propose the concept of polar sets, complementary to the existing idea of universal hitting sets. Polar sets are k-mer sets that are spread out enough on the reference, and provably specialize well to specific sequences. Link energy measures how well spread out a polar set is, and with it, the sketch size can be bounded from above and below in a theoretically sound way. This allows for direct optimization of sketch size. We propose efficient heuristics to construct polar sets, and via experiments on the human reference genome, show their practical superiority in designing efficient sequence-specific minimizers. AVAILABILITY AND IMPLEMENTATION: A reference implementation and code for analyses under an open-source license are at https://github.com/kingsford-group/polarset. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms; Software; Genome, Human; Genomics; Humans; Sequence Analysis, DNA
6.
BMC Bioinformatics ; 22(1): 174, 2021 Apr 01.
Article in English | MEDLINE | ID: mdl-33794760

ABSTRACT

BACKGROUND: Supervised learning from high-throughput sequencing data presents many challenges. For one, the curse of dimensionality often leads to overfitting as well as issues with scalability. This can bring about inaccurate models or those that require extensive compute time and resources. Additionally, variant calls may not be the optimal encoding for a given learning task, which also contributes to poor predictive capabilities. To address these issues, we present HARVESTMAN, a method that takes advantage of hierarchical relationships among the possible biological interpretations and representations of genomic variants to perform automatic feature learning, feature selection, and model building. RESULTS: We demonstrate that HARVESTMAN scales to thousands of genomes comprising more than 84 million variants by processing phase 3 data from the 1000 Genomes Project, one of the largest publicly available collections of whole genome sequences. Using breast cancer data from The Cancer Genome Atlas, we show that HARVESTMAN selects a rich combination of representations that are adapted to the learning task, and performs better than a binary representation of SNPs alone. We compare HARVESTMAN to existing feature selection methods and demonstrate that our method is more parsimonious: it selects smaller and less redundant feature subsets while maintaining accuracy of the resulting classifier. CONCLUSION: HARVESTMAN is a hierarchical feature selection approach for supervised model building from variant call data. By building a knowledge graph over genomic variants and solving an integer linear program, HARVESTMAN automatically and optimally finds the right encoding for genomic variants. Compared to other hierarchical feature selection methods, HARVESTMAN is faster and selects features more parsimoniously.


Assuntos
Neoplasias da Mama , Aprendizado Profundo , Sequenciamento Completo do Genoma , Neoplasias da Mama/genética , Genoma , Genômica , Humanos
7.
Bioinformatics ; 36(Suppl_1): i119-i127, 2020 07 01.
Article in English | MEDLINE | ID: mdl-32657376

ABSTRACT

MOTIVATION: Minimizers are methods to sample k-mers from a string, with the guarantee that similar sets of k-mers will be chosen on similar strings. A minimizer scheme is parameterized by the k-mer length k, a window length w and an order on the k-mers. Minimizers are used in a large number of software tools and pipelines to improve computation efficiency and decrease memory usage. Despite the method's popularity, many theoretical questions regarding its performance remain open. The core metric for measuring performance of a minimizer is the density, which measures the sparsity of sampled k-mers. The theoretical optimal density for a minimizer is 1/w, provably not achievable in general. For given k and w, little is known about asymptotically optimal minimizers, that is, minimizers with density O(1/w). RESULTS: We derive a necessary and sufficient condition for existence of asymptotically optimal minimizers. We also provide a randomized algorithm, called the Miniception, to design minimizers with the best theoretical guarantee to date on density in practical scenarios. Constructing and using the Miniception is as easy as constructing and using a random minimizer, which allows the design of efficient minimizers that scale to the values of k and w used in current bioinformatics software programs. AVAILABILITY AND IMPLEMENTATION: Reference implementation of the Miniception and the codes for analysis can be found at https://github.com/kingsford-group/miniception. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms; Software; Sequence Analysis, DNA
8.
Nat Methods ; 14(4): 417-419, 2017 Apr.
Article in English | MEDLINE | ID: mdl-28263959

ABSTRACT

We introduce Salmon, a lightweight method for quantifying transcript abundance from RNA-seq reads. Salmon combines a new dual-phase parallel inference algorithm and feature-rich bias models with an ultra-fast read mapping procedure. It is the first transcriptome-wide quantifier to correct for fragment GC-content bias, which, as we demonstrate here, substantially improves the accuracy of abundance estimates and the sensitivity of subsequent differential expression analysis.


Assuntos
Algoritmos , Análise de Sequência de RNA/métodos , Composição de Bases , Teorema de Bayes , Perfilação da Expressão Gênica/métodos , Perfilação da Expressão Gênica/estatística & dados numéricos , Análise de Sequência de RNA/estatística & dados numéricos
9.
Bioinformatics ; 35(14): i127-i135, 2019 07 15.
Article in English | MEDLINE | ID: mdl-31510667

ABSTRACT

MOTIVATION: Sequence alignment is a central operation in bioinformatics pipelines and, despite many improvements, remains a computationally challenging problem. Locality-sensitive hashing (LSH) is one method used to estimate the likelihood that two sequences have a proper alignment. Using an LSH, it is possible to separate, with high probability and relatively low computation, the pairs of sequences that do not have high-quality alignments from those that may. Therefore, an LSH reduces the overall computational requirement while not introducing many false negatives (i.e. omitting to report a valid alignment). However, current LSH methods treat sequences as a bag of k-mers and do not take into account the relative ordering of k-mers in sequences. In addition, due to the lack of a practical LSH method for edit distance, in practice, LSH methods for Jaccard similarity or Hamming similarity are used as a proxy. RESULTS: We present an LSH method, called Order Min Hash (OMH), for the edit distance. This method is a refinement of the minHash LSH used to approximate the Jaccard similarity, in that OMH is sensitive not only to the k-mer contents of the sequences but also to the relative order of the k-mers in the sequences. We present theoretical guarantees of the OMH as a gapped LSH. AVAILABILITY AND IMPLEMENTATION: The code to generate the results is available at http://github.com/Kingsford-Group/omhismb2019. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
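The order-sensitivity of OMH can be sketched as follows: hash (k-mer, occurrence-index) pairs, keep the l smallest hashes, but report the selected k-mers in their order of appearance, so two sequences with the same k-mer content in a different order produce different sketches. This is our simplified reading of the scheme, not the authors' code.

```python
import hashlib

def omh_sketch(seq, k, l, seed):
    # Hash each (k-mer, occurrence index) pair, keep the l smallest hashes,
    # and report those k-mers in order of appearance in the sequence.
    occ, keyed = {}, []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        n = occ.get(kmer, 0); occ[kmer] = n + 1
        h = hashlib.sha1(f"{seed}:{kmer}:{n}".encode()).hexdigest()
        keyed.append((h, i, kmer))
    smallest = sorted(keyed)[:l]
    return tuple(kmer for _, i, kmer in sorted(smallest, key=lambda t: t[1]))

def omh_similarity(a, b, k=3, l=2, reps=200):
    # Fraction of hash seeds whose ordered sketches agree: sensitive to both
    # k-mer content and k-mer order, unlike plain minHash.
    hits = sum(omh_sketch(a, k, l, r) == omh_sketch(b, k, l, r) for r in range(reps))
    return hits / reps

print(omh_similarity("ACGTACGTAC", "ACGTACGTAC"))  # identical sequences -> 1.0
```

A reversed sequence shares k-mers with the original but yields reversed sketch order, so the estimator correctly reports low similarity where bag-of-k-mers minHash would not.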


Subjects
Algorithms; Sequence Alignment; Software
10.
Bioinformatics ; 34(13): i475-i483, 2018 07 01.
Article in English | MEDLINE | ID: mdl-29949963

ABSTRACT

Motivation: Three-dimensional chromosome structure has been increasingly shown to influence various levels of cellular and genomic functions. Through Hi-C data, which maps contact frequency on chromosomes, it has been found that structural elements termed topologically associating domains (TADs) are involved in many regulatory mechanisms. However, we have little understanding of the level of similarity or variability of chromosome structure across cell types and disease states. In this study, we present a method to quantify resemblance and identify structurally similar regions between any two sets of TADs. Results: We present an analysis of 23 human Hi-C samples representing various tissue types in normal and cancer cell lines. We quantify global and chromosome-level structural similarity, and compare the relative similarity between cancer and non-cancer cells. We find that cancer cells show higher structural variability around commonly mutated pan-cancer genes than normal cells at these same locations. Availability and implementation: Software for the methods and analysis can be found at https://github.com/Kingsford-Group/localtadsim.


Subjects
Chromatin/ultrastructure; Genomics/methods; Software; Cell Line; Cell Line, Tumor; Chromatin/metabolism; Chromosomes, Human/metabolism; Chromosomes, Human/ultrastructure; Humans; Neoplasms/metabolism; Neoplasms/ultrastructure; Sequence Analysis, DNA/methods
11.
Bioinformatics ; 34(13): i13-i22, 2018 07 01.
Article in English | MEDLINE | ID: mdl-29949995

ABSTRACT

Motivation: The minimizers technique is a method to sample k-mers that is used in many bioinformatics software tools to reduce computation, memory usage and run time. The number of applications using minimizers keeps growing steadily. Despite its many uses, the theoretical understanding of minimizers is still very limited. In many applications, selecting as few k-mers as possible (i.e. having a low density) is beneficial. The density is highly dependent on the choice of the order on the k-mers. Different applications use different orders, but none of these orders are optimal. A better understanding of minimizer schemes, and the related local and forward schemes, will allow designing schemes with lower density, thereby making existing and future bioinformatics tools even more efficient. Results: From the analysis of the asymptotic behavior of minimizer, forward and local schemes, we show that the previously believed lower bound on minimizer schemes does not hold, and that schemes with density lower than thought possible actually exist. The proof is constructive and leads to an efficient algorithm to compare k-mers. These orders are the first known orders that are asymptotically optimal. Additionally, we give improved bounds on the density achievable by the three types of schemes.


Subjects
Algorithms; Computational Biology/methods
12.
Methods ; 137: 67-70, 2018 03 15.
Article in English | MEDLINE | ID: mdl-29330118

ABSTRACT

Ribosome profiling has emerged as a powerful technique to study mRNA translation. Ribosome profiling has the potential to determine the relative quantities and locations of ribosomes on mRNA genome wide. Taking full advantage of this approach requires accurate measurement of ribosome locations. However, experimental inconsistencies often obscure the positional information encoded in ribosome profiling data. Here, we describe the Ribodeblur pipeline, a computational analysis tool that uses a maximum likelihood framework to infer ribosome positions from heterogeneous datasets. Ribodeblur is simple to install, and can be run on an average modern Mac or Linux-based laptop. We detail the process of applying the pipeline to high-coverage ribosome profiling data in yeast, and discuss important considerations for potential extension to other organisms.


Subjects
Computational Biology/methods; Protein Biosynthesis; Ribosomes/genetics; Saccharomyces cerevisiae/genetics; High-Throughput Nucleotide Sequencing; RNA, Messenger/genetics; Software
13.
Nucleic Acids Res ; 45(7): 3663-3673, 2017 04 20.
Article in English | MEDLINE | ID: mdl-28334818

ABSTRACT

Understanding the three-dimensional (3D) architecture of chromatin and its relation to gene expression and regulation is fundamental to understanding how the genome functions. Advances in Hi-C technology now permit us to study 3D genome organization, but we still lack an understanding of the structural dynamics of chromosomes. The dynamic couplings between regions separated by large genomic distances (>50 Mb) have yet to be characterized. We adapted a well-established protein-modeling framework, the Gaussian Network Model (GNM), to model chromatin dynamics using Hi-C data. We show that the GNM can identify spatial couplings at multiple scales: it can quantify the correlated fluctuations in the positions of gene loci, find large genomic compartments and smaller topologically-associating domains (TADs) that undergo en bloc movements, and identify dynamically coupled distal regions along the chromosomes. We show that the predictions of the GNM correlate well with genome-wide experimental measurements. We use the GNM to identify novel cross-correlated distal domains (CCDDs) representing pairs of regions distinguished by their long-range dynamic coupling and show that CCDDs are associated with increased gene co-expression. Together, these results show that GNM provides a mathematically well-founded unified framework for modeling chromatin dynamics and assessing the structural basis of genome-wide observations.
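The GNM machinery described above has a compact linear-algebra core: build the Kirchhoff (Laplacian) matrix from a contact map, take its pseudo-inverse to get the covariance of loci fluctuations, and normalize to get cross-correlations between loci. The 4-locus contact map below is a made-up toy, not Hi-C data.

```python
import numpy as np

# Toy contact map for 4 genomic loci (symmetric contact frequencies).
contacts = np.array([
    [0, 5, 1, 0],
    [5, 0, 4, 1],
    [1, 4, 0, 6],
    [0, 1, 6, 0],
], dtype=float)

# Kirchhoff/Laplacian matrix of the contact graph.
kirchhoff = np.diag(contacts.sum(axis=1)) - contacts

# Pseudo-inverse gives fluctuation covariances up to a scale factor;
# normalizing by the diagonal yields cross-correlations between loci,
# the quantity used to find dynamically coupled regions.
cov = np.linalg.pinv(kirchhoff)
d = np.sqrt(np.diag(cov))
cross_corr = cov / np.outer(d, d)
print(np.round(cross_corr, 2))
```

Strongly positive off-diagonal entries correspond to pairs of loci that fluctuate together, the analogue of the cross-correlated distal domains in the paper.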


Subjects
Chromatin/chemistry; Models, Genetic; Cell Line; Gene Expression; Genetic Loci; Genome; Genomics; Humans; Sequence Analysis, DNA
14.
Bioinformatics ; 33(14): i110-i117, 2017 Jul 15.
Article in English | MEDLINE | ID: mdl-28881970

ABSTRACT

MOTIVATION: The minimizers scheme is a method for selecting k-mers from sequences. It is used in many bioinformatics software tools to bin comparable sequences or to sample a sequence in a deterministic fashion at approximately regular intervals, in order to reduce memory consumption and processing time. Although very useful, the minimizers selection procedure has undesirable behaviors (e.g. too many k-mers are selected when processing certain sequences). Some of these problems were already known to the authors of the minimizers technique, and the natural lexicographic ordering of k-mers used by minimizers was recognized as their origin. Many software tools using minimizers employ ad hoc variations of the lexicographic order to alleviate those issues. RESULTS: We provide an in-depth analysis of the effect of k-mer ordering on the performance of the minimizers technique. By using small universal hitting sets (a recently defined concept), we show how to significantly improve the performance of minimizers and avoid some of its worst behaviors. Based on these results, we encourage bioinformatics software developers to use an ordering based on a universal hitting set or, if not possible, a randomized ordering, rather than the lexicographic order. This analysis also settles negatively a conjecture (by Schleimer et al.) on the expected density of minimizers in a random sequence. AVAILABILITY AND IMPLEMENTATION: The software used for this analysis is available on GitHub: https://github.com/gmarcais/minimizers.git. CONTACT: gmarcais@cs.cmu.edu or carlk@cs.cmu.edu.


Assuntos
Genoma Humano , Genômica/métodos , Análise de Sequência de DNA/métodos , Software , Algoritmos , Humanos
15.
PLoS Comput Biol ; 13(10): e1005777, 2017 Oct.
Article in English | MEDLINE | ID: mdl-28968408

ABSTRACT

With the rapidly increasing volume of deep sequencing data, more efficient algorithms and data structures are needed. Minimizers are a central recent paradigm that has improved various sequence analysis tasks, including hashing for faster read overlap detection, sparse suffix arrays for creating smaller indexes, and Bloom filters for speeding up sequence search. Here, we propose an alternative paradigm that can lead to substantial further improvement in these and other tasks. For integers k and L > k, we say that a set of k-mers is a universal hitting set (UHS) if every possible L-long sequence must contain a k-mer from the set. We develop a heuristic called DOCKS to find a compact UHS, which works in two phases: The first phase is solved optimally, and for the second we propose several efficient heuristics, trading set size for speed and memory. The use of heuristics is motivated by showing the NP-hardness of a closely related problem. We show that DOCKS works well in practice and produces UHSs that are very close to a theoretical lower bound. We present results for various values of k and L and by applying them to real genomes show that UHSs indeed improve over minimizers. In particular, DOCKS uses less than 30% of the 10-mers needed to span the human genome compared to minimizers. The software and computed UHSs are freely available at github.com/Shamir-Lab/DOCKS/ and acgt.cs.tau.ac.il/docks/, respectively.
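The UHS definition above is directly checkable by brute force for toy parameters: a k-mer set is a universal hitting set for (k, L) exactly when every L-long string contains at least one k-mer from the set. The checker below is exponential in L and is meant only to make the definition concrete, not to replace DOCKS.

```python
from itertools import product

def is_universal_hitting_set(uhs, k, L, alphabet="ACGT"):
    # Brute-force check of the UHS definition: every L-long string over the
    # alphabet must contain at least one k-mer from `uhs`.
    for chars in product(alphabet, repeat=L):
        s = "".join(chars)
        if not any(s[i:i + k] in uhs for i in range(L - k + 1)):
            return False  # found an L-long string the set misses
    return True

# The set of ALL 2-mers is trivially universal; a tiny set is not.
all_kmers = {"".join(p) for p in product("ACGT", repeat=2)}
print(is_universal_hitting_set(all_kmers, k=2, L=4))      # True
print(is_universal_hitting_set({"AA", "AC"}, k=2, L=4))   # False: "GGGG" is missed
```

DOCKS's contribution is finding sets far smaller than the trivial one that still pass this check, which is what makes them useful replacements for minimizer orderings.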


Assuntos
Algoritmos , Biologia Computacional/métodos , Genoma Bacteriano , Genoma Humano , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Análise de Sequência de DNA/métodos , Software , Animais , Caenorhabditis elegans/genética , Heurística Computacional , Humanos
16.
Bioinformatics ; 32(12): 1880-2, 2016 06 15.
Article in English | MEDLINE | ID: mdl-27153676

ABSTRACT

UNLABELLED: Ribosome profiling is a recently developed high-throughput sequencing technique that captures approximately 30 bp long ribosome-protected mRNA fragments during translation. Because of alternative splicing and repetitive sequences, a ribosome-protected read may map to many places in the transcriptome, leading to discarded or arbitrary mappings when standard approaches are used. We present a technique and software that addresses this problem by assigning reads to potential origins proportional to estimated transcript abundance. This yields a more accurate estimate of ribosome profiles compared with a naïve mapping. AVAILABILITY AND IMPLEMENTATION: Ribomap is available as open source at http://www.cs.cmu.edu/~ckingsf/software/ribomap. CONTACT: carlk@cs.cmu.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Ribosomes; Alternative Splicing; High-Throughput Nucleotide Sequencing; Protein Isoforms; Sequence Analysis, RNA; Software
17.
Bioinformatics ; 32(11): 1632-42, 2016 06 01.
Article in English | MEDLINE | ID: mdl-26568624

ABSTRACT

MOTIVATION: Optimizing seed selection is an important problem in read mapping. The number of non-overlapping seeds a mapper selects determines the sensitivity of the mapper while the total frequency of all selected seeds determines the speed of the mapper. Modern seed-and-extend mappers usually select seeds with either an equal and fixed-length scheme or with an inflexible placement scheme, both of which limit the ability of the mapper in selecting less frequent seeds to speed up the mapping process. Therefore, it is crucial to develop a new algorithm that can adjust both the individual seed length and the seed placement, as well as derive less frequent seeds. RESULTS: We present the Optimal Seed Solver (OSS), a dynamic programming algorithm that discovers the least frequently-occurring set of x seeds in an L-base-pair read in [Formula: see text] operations on average and in [Formula: see text] operations in the worst case, while generating a maximum of [Formula: see text] seed frequency database lookups. We compare OSS against four state-of-the-art seed selection schemes and observe that OSS provides a 3-fold reduction in average seed frequency over the best previous seed selection optimizations. AVAILABILITY AND IMPLEMENTATION: We provide an implementation of the Optimal Seed Solver in C++ at: https://github.com/CMU-SAFARI/Optimal-Seed-Solver CONTACT: hxin@cmu.edu, calkan@cs.bilkent.edu.tr or onur@cmu.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
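The objective OSS optimizes can be sketched with a simpler dynamic program: choose x non-overlapping, variable-length seeds from a read so that their total database frequency is minimal. The O(x·L²) recursion below shares the objective but not OSS's optimized average-case analysis, and the frequency function is a toy of ours, not a real seed database.

```python
from functools import lru_cache

def optimal_seeds(read, x, freq, min_len=2):
    # DP over (position, seeds remaining): either skip position i, or start
    # a seed of any feasible length at i, minimizing total seed frequency.
    L = len(read)

    @lru_cache(maxsize=None)
    def best(i, remaining):
        if remaining == 0:
            return 0, ()
        if i > L - remaining * min_len:
            return float("inf"), ()       # not enough read left for the seeds
        cost, seeds = best(i + 1, remaining)        # option 1: skip position i
        for j in range(i + min_len, L + 1):         # option 2: seed read[i:j]
            sub_cost, sub = best(j, remaining - 1)
            total = freq(read[i:j]) + sub_cost
            if total < cost:
                cost, seeds = total, ((i, read[i:j]),) + sub
        return cost, seeds

    return best(0, x)

# Toy frequency: pretend seeds containing 'GG' are very common in the index.
toy_freq = lambda s: 100 if "GG" in s else len(s)
cost, seeds = optimal_seeds("ACGGTACGT", x=2, freq=toy_freq)
print(cost, seeds)  # the DP routes both seeds around the frequent 'GG'
```

Lower total seed frequency means fewer candidate locations to verify in the seed-and-extend step, which is exactly why flexible seed placement speeds up mapping.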


Subjects
Algorithms
18.
BMC Bioinformatics ; 17: 155, 2016 Apr 08.
Article in English | MEDLINE | ID: mdl-27059896

ABSTRACT

BACKGROUND: Understanding the interactions between antibodies and the linear epitopes that they recognize is an important task in the study of immunological diseases. We present a novel computational method for the design of linear epitopes of specified binding affinity to Intravenous Immunoglobulin (IVIg). RESULTS: We show that the method, called Pythia-design, can accurately design peptides with both high binding affinity and low binding affinity to IVIg. To show this, we experimentally constructed and tested the computationally constructed designs. We further show experimentally that these designed peptides are more accurate than those produced by a recent method for the same task. Pythia-design is based on combining random walks with an ensemble of probabilistic support vector machine (SVM) classifiers, and we show that it produces a diverse set of designed peptides, an important property for developing robust sets of candidates for construction. We show that by combining Pythia-design and the method of (PloS ONE 6(8):23616, 2011), we are able to produce an even more accurate collection of designed peptides. Analysis of the experimental validation of Pythia-design peptides indicates that binding of IVIg is favored by epitopes that contain tryptophan and cysteine. CONCLUSIONS: Our method, Pythia-design, is able to generate a diverse set of binding and non-binding peptides, and its designs have been experimentally shown to be accurate.


Subjects
Computational Biology/methods; Epitopes/chemistry; Immunoglobulins, Intravenous/chemistry; Peptides, Cyclic/chemistry; Citrulline/chemistry; Cysteine/chemistry; Humans; Models, Molecular; Reproducibility of Results; Support Vector Machine; Tryptophan/chemistry
19.
Bioinformatics ; 31(17): 2770-7, 2015 Sep 01.
Article in English | MEDLINE | ID: mdl-25910696

ABSTRACT

MOTIVATION: The storage and transmission of high-throughput sequencing data consumes significant resources. As our capacity to produce such data continues to increase, this burden will only grow. One approach to reduce storage and transmission requirements is to compress this sequencing data. RESULTS: We present a novel technique to boost the compression of sequencing data that is based on the concept of bucketing similar reads so that they appear nearby in the file. We demonstrate that, by adopting a data-dependent bucketing scheme and employing a number of encoding ideas, we can achieve substantially better compression ratios than existing de novo sequence compression tools, including other bucketing and reordering schemes. Our method, Mince, achieves up to a 45% reduction in file sizes (28% on average) compared with existing state-of-the-art de novo compression schemes. AVAILABILITY AND IMPLEMENTATION: Mince is written in C++11, is open source and has been made available under the GPLv3 license. It is available at http://www.cs.cmu.edu/~ckingsf/software/mince. CONTACT: carlk@cs.cmu.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
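The bucketing idea can be demonstrated with off-the-shelf compression: group reads by a shared key (here, the read's smallest k-mer) so similar reads sit near each other in the stream, then let a generic compressor exploit the locality. This toy uses zlib on simulated reads and is only an illustration of the reordering principle, not Mince's encoder.

```python
import random
import zlib

def minimizer(read, k=4):
    # Bucket key: the lexicographically smallest k-mer in the read, so reads
    # sharing substrings tend to receive the same key and land together.
    return min(read[i:i + k] for i in range(len(read) - k + 1))

random.seed(1)
genome = "".join(random.choice("ACGT") for _ in range(2000))
reads = [genome[s:s + 50] for s in
         (random.randrange(len(genome) - 50) for _ in range(3000))]

plain = zlib.compress("".join(reads).encode(), 9)
# Reorder: group reads by bucket key, then sort within each bucket so that
# identical and near-identical reads become adjacent in the stream.
bucketed = "".join(sorted(reads, key=lambda r: (minimizer(r), r)))
packed = zlib.compress(bucketed.encode(), 9)
print(len(plain), len(packed))  # the reordered stream typically compresses better
```

Real tools additionally store the permutation (or drop read order entirely) and use encodings tuned to the bucketed layout.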


Subjects
Algorithms; Computational Biology/methods; Data Compression/methods; High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, DNA/methods; Software; Genomics; Humans; Sequence Alignment
20.
Bioinformatics ; 31(12): 1920-8, 2015 Jun 15.
Article in English | MEDLINE | ID: mdl-25649622

ABSTRACT

MOTIVATION: Storing, transmitting and archiving data produced by next-generation sequencing is a significant computational burden. New compression techniques tailored to short-read sequence data are needed. RESULTS: We present here an approach to compression that reduces the difficulty of managing large-scale sequencing data. Our novel approach sits between pure reference-based compression and reference-free compression and combines much of the benefit of reference-based approaches with the flexibility of de novo encoding. Our method, called path encoding, draws a connection between storing paths in de Bruijn graphs and context-dependent arithmetic coding. Supporting this method is a system to compactly store sets of k-mers that is of independent interest. We are able to encode RNA-seq reads using 3-11% of the space of the sequence in raw FASTA files, which is on average more than 34% smaller than competing approaches. We also show that even if the reference is very poorly matched to the reads that are being encoded, good compression can still be achieved. AVAILABILITY AND IMPLEMENTATION: Source code and binaries freely available for download at http://www.cs.cmu.edu/~ckingsf/software/pathenc/, implemented in Go and supported on Linux and Mac OS X.


Subjects
Algorithms; Computational Biology/methods; Data Compression/methods; Genomics/methods; High-Throughput Nucleotide Sequencing/methods; Software; Animals; Humans; Mice; Programming Languages; Pseudomonas aeruginosa/genetics; Sequence Alignment