Results 1 - 20 of 88
1.
J Comput Biol ; 29(1): 19-22, 2022 01.
Article in English | MEDLINE | ID: mdl-34985990

ABSTRACT

Although the availability of various sequencing technologies allows us to capture different genome properties at single-cell resolution, with the exception of a few co-assaying technologies, applying different sequencing assays on the same single cell is impossible. Single-cell alignment using optimal transport (SCOT) is an unsupervised algorithm that addresses this limitation by using optimal transport to align single-cell multiomics data. First, it preserves the local geometry by constructing a k-nearest neighbor (k-NN) graph for each data set (or domain) to capture the intra-domain distances. SCOT then finds a probabilistic coupling matrix that minimizes the discrepancy between the intra-domain distance matrices. Finally, it uses the coupling matrix to project one single-cell data set onto another through barycentric projection, thus aligning them. SCOT requires tuning only two hyperparameters and is robust to the choice of one. Furthermore, the Gromov-Wasserstein distance in the algorithm can guide SCOT's hyperparameter tuning in a fully unsupervised setting when no orthogonal alignment information is available. Thus, SCOT is a fast and accurate alignment method that provides a heuristic for hyperparameter selection in a real-world unsupervised single-cell data alignment scenario. We provide a tutorial for SCOT and make its source code publicly available on GitHub.
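
The final step described above, barycentric projection through a coupling matrix, can be illustrated with a minimal sketch. It assumes the coupling matrix has already been computed by some optimal transport solver; the function name `barycentric_projection` and the toy numbers are illustrative assumptions and are not taken from the SCOT source code.

```python
def barycentric_projection(coupling, target):
    """Project each source sample into the target domain.

    coupling: n x m matrix of non-negative transport weights
              (row i = source sample i, column j = target sample j).
    target:   m x d matrix of target-domain coordinates.
    Each source sample is mapped to the coupling-weighted average
    of the target samples it is matched with.
    """
    dim = len(target[0])
    projected = []
    for row in coupling:
        mass = sum(row)
        if mass == 0:
            raise ValueError("a source sample has no coupling mass")
        projected.append([
            sum(w * y[k] for w, y in zip(row, target)) / mass
            for k in range(dim)
        ])
    return projected


# Toy example (hypothetical values): two source cells, two target cells.
coupling = [[0.5, 0.0],
            [0.25, 0.25]]
target = [[0.0, 0.0],
          [2.0, 4.0]]
projected = barycentric_projection(coupling, target)
```

In this toy case the first source cell is matched entirely to the first target cell, and the second is mapped to the midpoint of the two target cells.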


Subject(s)
Algorithms , Sequence Alignment/statistics & numerical data , Single-Cell Analysis/statistics & numerical data , Computational Biology , Genetic Databases/statistics & numerical data , Genomics/statistics & numerical data , Heuristics , Humans , Computer Neural Networks , Sequence Analysis/statistics & numerical data , Software , Unsupervised Machine Learning
2.
Clin Transl Med ; 11(11): e589, 2021 11.
Article in English | MEDLINE | ID: mdl-34842356

ABSTRACT

BACKGROUND: Few studies have discussed the contradictory roles of mutated-PI3Kα in HER2-positive (HER2+) breast cancer. Thus, we characterised the adaptive roles of PI3Kα mutations during HER2+ tumour progression. METHODS: We conducted prospective clinical sequencing of 1923 Chinese breast cancer patients and illustrated the clinical significance of PIK3CA mutations in the locally advanced and advanced HER2+ cohort. A high-throughput PIK3CA mutation-barcoding screen was performed to reveal impactful mutation sites in tumour growth and drug responses. RESULTS: PIK3CA mutations acted as a protective factor in treatment-naïve patients; however, advanced/locally advanced patients harbouring mutated-PI3Kα exhibited a higher progressive disease rate (100% vs. 15%, p = .000053) and a lower objective response rate (81.7% vs. 95.4%, p = .0008) in response to trastuzumab-based therapy. Meanwhile, patients exhibiting anti-HER2 resistance had a relatively high variant allele fraction (VAF) of PIK3CA mutations; we defined a VAF > 12.23% as a predictor of poor anti-HER2 neoadjuvant treatment efficacy. The pooled mutation screen revealed that specific PI3Kα mutation alleles mediated their own biological effects. PIK3CA functional mutations suppressed the growth of HER2+ cells but conferred anti-HER2 resistance, which can be reversed by the PI3Kα-specific inhibitor BYL719. CONCLUSIONS: We propose adaptive treatment strategies in which mutated PIK3CA and amplified ERBB2 should be concomitantly inhibited during exposure to continuous anti-HER2 therapy, whereas the combination of anti-HER2 and anti-PI3Kα treatment was not essential for anti-HER2 treatment-naïve patients. These findings improve the understanding of genomics-guided treatment at different stages of HER2+ breast cancer progression.


Subject(s)
Breast Neoplasms/drug therapy , ErbB-2 Receptor/genetics , Sequence Analysis/statistics & numerical data , Physiological Adaptation/drug effects , Physiological Adaptation/genetics , Breast Neoplasms/genetics , Breast Neoplasms/physiopathology , China , Cohort Studies , Female , Humans , Prospective Studies , Sequence Analysis/methods
3.
Methods Mol Biol ; 2212: 277-289, 2021.
Article in English | MEDLINE | ID: mdl-33733362

ABSTRACT

We report a step-by-step protocol for using pysster, a TensorFlow-based package for building deep neural networks on a broad range of epistatic sequences such as DNA, RNA, or annotated secondary structure sequences. Pysster provides users with comprehensive support for developing, training, and evaluating self-defined deep neural networks on sequence data. Moreover, pysster allows users to easily visualize the resulting predictions, which helps to open the "black box" of deep neural networks. Here, we describe a step-by-step application of pysster to classify RNA A-to-I editing regions and interpret the model predictions. To further demonstrate the generalizability of pysster, we used it to build and evaluate a new deep neural network on an artificial epistatic sequence dataset.


Subject(s)
Deep Learning , Genetic Epistasis , Genetic Models , RNA/genetics , Software , Base Sequence , Datasets as Topic , Humans , RNA Editing , ROC Curve , Sequence Analysis/statistics & numerical data
4.
Microbiome ; 8(1): 134, 2020 09 16.
Article in English | MEDLINE | ID: mdl-32938501

ABSTRACT

BACKGROUND: Sequencing prokaryotic genomes has revolutionized our understanding of the many roles played by microorganisms. However, the cell and taxon proportions of genome-sequenced bacteria or archaea on earth remain unknown. This study aimed to explore this basic question using large-scale alignment between the sequences released by the Earth Microbiome Project and 155,810 prokaryotic genomes from public databases. RESULTS: Our results showed that the median proportions of the genome-sequenced cells and taxa (at 100% identities in the 16S-V4 region) in different biomes reached 38.1% (16.4-86.3%) and 18.8% (9.1-52.6%), respectively. The sequenced proportions of the prokaryotic genomes in biomes were significantly negatively correlated with the alpha diversity indices, and the proportions sequenced in host-associated biomes were significantly higher than those in free-living biomes. Due to a set of cosmopolitan OTUs that are found in multiple samples and preferentially sequenced, only 2.1% of the global prokaryotic taxa are represented by sequenced genomes. Most of the biomes were occupied by a few predominant taxa with a high relative abundance and much higher genome-sequenced proportions than numerous rare taxa. CONCLUSIONS: These results reveal the current situation of prokaryotic genome sequencing for earth biomes, provide a more reasonable and efficient exploration of prokaryotic genomes, and promote our understanding of microbial ecological functions. Video Abstract.


Subject(s)
Earth (Planet) , Genome/genetics , Genomics/statistics & numerical data , Microbiota/genetics , Prokaryotic Cells/classification , Prokaryotic Cells/metabolism , Sequence Analysis/statistics & numerical data , Archaea/classification , Archaea/genetics , Archaea/isolation & purification , Bacteria/classification , Bacteria/genetics , Bacteria/isolation & purification , Genetic Databases , Sequence Alignment
5.
Brief Bioinform ; 20(1): 222-234, 2019 01 18.
Article in English | MEDLINE | ID: mdl-29028876

ABSTRACT

High-throughput sequencing technologies have exposed the possibilities for the in-depth evaluation of T-cell receptor (TCR) repertoires. These studies are highly relevant to gain insights into human adaptive immunity and to decipher the composition and diversity of antigen receptors in physiological and disease conditions. The major objective of TCR sequencing data analysis is the identification of V, D and J gene segments, complementarity-determining region 3 (CDR3) sequence extraction and clonality analysis. With the advancement in sequencing technologies, new TCR analysis approaches and programs have been developed. However, there is still a deficit of systematic comparative studies to assist in the selection of an optimal analysis approach. Here, we present a detailed comparison of 10 state-of-the-art TCR analysis tools on samples with different complexities by taking into account many aspects such as clonotype detection [unique V(D)J combination], CDR3 identification or accuracy in error correction. We used our in silico and experimental data sets with known clonalities enabling the identification of potential tool biases. We also established a new strategy, named clonal plane, which allows quantifying and comparing the clonality of multiple samples. Our results provide new insights into the effect of method selection on analysis results, and it will assist users in the selection of an appropriate analysis method.


Subject(s)
T-Cell Antigen Receptors/genetics , Amino Acid Sequence , Base Sequence , Computational Biology/methods , Computer Simulation , Genetic Databases/statistics & numerical data , HeLa Cells , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Jurkat Cells , Sequence Analysis/statistics & numerical data , T-Lymphocytes/immunology
6.
Brief Bioinform ; 20(4): 1280-1294, 2019 07 19.
Article in English | MEDLINE | ID: mdl-29272359

ABSTRACT

With the avalanche of biological sequences generated in the post-genomic age, one of the most challenging problems is how to computationally analyze their structures and functions. Machine learning techniques are playing key roles in this field. Typically, predictors based on machine learning techniques involve three main steps: feature extraction, predictor construction and performance evaluation. Although several Web servers and stand-alone tools have been developed to facilitate biological sequence analysis, each focuses on only an individual step. In this study, a powerful Web server called BioSeq-Analysis (http://bioinformatics.hitsz.edu.cn/BioSeq-Analysis/) is proposed to automatically complete the three main steps for constructing a predictor. The user only needs to upload the benchmark data set. BioSeq-Analysis can generate the optimized predictor based on the benchmark data set and report the performance measures as well. Furthermore, to maximize users' convenience, a stand-alone program was also released, which can be downloaded from http://bioinformatics.hitsz.edu.cn/BioSeq-Analysis/download/ and run directly on Windows, Linux and UNIX. Applied to three sequence analysis tasks, experimental results showed that the predictors generated by BioSeq-Analysis even outperformed some state-of-the-art methods. It is anticipated that BioSeq-Analysis will become a useful tool for biological sequence analysis.


Subject(s)
Machine Learning , Sequence Analysis/methods , Software , Algorithms , Computational Biology/methods , Nucleic Acid Databases/statistics & numerical data , Protein Databases/statistics & numerical data , Humans , Internet , Sequence Analysis/statistics & numerical data , DNA Sequence Analysis/methods , Protein Sequence Analysis/methods , RNA Sequence Analysis/methods
7.
Bioinformatics ; 34(16): 2870-2878, 2018 08 15.
Article in English | MEDLINE | ID: mdl-29608657

ABSTRACT

Motivation: Although seldom acknowledged explicitly, count data generated by sequencing platforms exist as compositions for which the abundance of each component (e.g. gene or transcript) is only coherently interpretable relative to other components within that sample. This property arises from the assay technology itself, whereby the number of counts recorded for each sample is constrained by an arbitrary total sum (i.e. library size). Consequently, sequencing data, as compositional data, exist in a non-Euclidean space that, without normalization or transformation, renders invalid many conventional analyses, including distance measures, correlation coefficients and multivariate statistical models. Results: The purpose of this review is to summarize the principles of compositional data analysis (CoDA), provide evidence for why sequencing data are compositional, discuss compositionally valid methods available for analyzing sequencing data, and highlight future directions with regard to this field of study. Supplementary information: Supplementary data are available at Bioinformatics online.
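
As an illustration of the kind of transformation that compositional data analysis relies on, the sketch below applies the centered log-ratio (CLR) transform to a vector of counts. The CLR is a standard CoDA transform, but its use here, the function name `clr`, and the default pseudocount of 0.5 for handling zeros are assumptions made for the example, not recommendations drawn from the review.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's counts.

    Each component is expressed relative to the geometric mean of the
    sample, so the result does not depend on the library size.
    The pseudocount (an assumption of this sketch) avoids log(0).
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Two hypothetical samples with the same composition but a tenfold
# difference in library size; pseudocount=0 is safe here (no zeros).
shallow = clr([1, 2, 4], pseudocount=0)
deep = clr([10, 20, 40], pseudocount=0)
```

Because both samples have the same relative abundances, their CLR values agree up to floating-point error, and the CLR values of each sample sum to zero.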


Subject(s)
Sequence Analysis , Gene Library , Humans , Statistical Models , Sequence Analysis/statistics & numerical data
8.
Brief Bioinform ; 15(3): 354-68, 2014 May.
Article in English | MEDLINE | ID: mdl-24096012

ABSTRACT

With the massive production of genomic and proteomic data, the number of available biological sequences in databases has reached a level that is not feasible anymore for exact alignments even when just a fraction of all sequences is used. To overcome this inevitable time complexity, ultrafast alignment-free methods are studied. Within the past two decades, a broad variety of nonalignment methods have been proposed including dissimilarity measures on classical representations of sequences like k-words or Markov models. Furthermore, articles were published that describe distance measures on alternative representations such as compression complexity, spectral time series or chaos game representation. However, alignments are still the standard method for real world applications in biological sequence analysis, and the time efficient alignment-free approaches are usually applied in cases when the accustomed algorithms turn out to fail or be too inconvenient.
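
A dissimilarity measure on k-word representations, as mentioned above, can be sketched as follows. The choice of Euclidean distance between k-word frequency vectors is only one of many measures in this family and is an assumption of the example, as are the function names and the DNA alphabet.

```python
import math
from collections import Counter
from itertools import product


def kword_frequencies(seq, k, alphabet="ACGT"):
    """Relative frequencies of all k-words over the alphabet, in a fixed order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[w] for w in words)
    if total == 0:
        raise ValueError("sequence contains no valid k-words")
    return [counts[w] / total for w in words]


def kword_distance(seq_a, seq_b, k):
    """Euclidean distance between the k-word frequency vectors of two sequences."""
    fa = kword_frequencies(seq_a, k)
    fb = kword_frequencies(seq_b, k)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fa, fb)))


# Toy sequences (hypothetical); no alignment is computed at any point.
same = kword_distance("ACGTACGT", "ACGTACGT", 2)
different = kword_distance("AAAA", "ACAC", 2)
```

Each sequence is scanned once, which is what makes measures of this kind attractive when exact alignment is too slow.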


Subject(s)
Computational Biology/methods , Automated Pattern Recognition/methods , Sequence Analysis/methods , Genomics/statistics & numerical data , Markov Chains , Statistical Models , Phylogeny , Proteomics/statistics & numerical data , Sequence Alignment , Sequence Analysis/statistics & numerical data , Software
9.
Brief Bioinform ; 15(3): 343-53, 2014 May.
Article in English | MEDLINE | ID: mdl-24064230

ABSTRACT

With the development of next-generation sequencing (NGS) technologies, a large amount of short read data has been generated. Assembly of these short reads can be challenging for genomes and metagenomes without template sequences, making alignment-based genome sequence comparison difficult. In addition, sequence reads from NGS can come from different regions of various genomes and they may not be alignable. Sequence signature-based methods for genome comparison based on the frequencies of word patterns in genomes and metagenomes can potentially be useful for the analysis of short reads data from NGS. Here we review the recent development of alignment-free genome and metagenome comparison based on the frequencies of word patterns with emphasis on the dissimilarity measures between sequences, the statistical power of these measures when two sequences are related and the applications of these measures to NGS data.


Subject(s)
Computational Biology/methods , Sequence Analysis/methods , Algorithms , Computational Biology/trends , Genomics/methods , Genomics/statistics & numerical data , High-Throughput Nucleotide Sequencing , Markov Chains , Statistical Models , Sequence Alignment , Sequence Analysis/statistics & numerical data
10.
Brief Bioinform ; 15(3): 376-89, 2014 May.
Article in English | MEDLINE | ID: mdl-24058049

ABSTRACT

Information theory (IT) addresses the analysis of communication systems and has been widely applied in molecular biology. In particular, alignment-free sequence analysis and comparison greatly benefited from concepts derived from IT, such as entropy and mutual information. This review covers several aspects of IT applications, ranging from genome global analysis and comparison, including block-entropy estimation and resolution-free metrics based on iterative maps, to local analysis, comprising the classification of motifs, prediction of transcription factor binding sites and sequence characterization based on linguistic complexity and entropic profiles. IT has also been applied to high-level correlations that combine DNA, RNA or protein features with sequence-independent properties, such as gene mapping and phenotype analysis, and has also provided models based on communication systems theory to describe information transmission channels at the cell level and also during evolutionary processes. While not exhaustive, this review attempts to categorize existing methods and to indicate their relation with broader transversal topics such as genomic signatures, data compression and complexity, time series analysis and phylogenetic classification, providing a resource for future developments in this promising area.
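
The entropy concept referred to above can be made concrete with a short sketch that computes the Shannon entropy of the k-word (block) distribution of a sequence. This is the plain plug-in estimate; the function name and the toy sequences are assumptions of the example, and the estimators discussed in the review may differ.

```python
import math
from collections import Counter


def block_entropy(seq, k=1):
    """Shannon entropy (in bits) of the empirical k-word distribution of seq."""
    words = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not words:
        raise ValueError("sequence is shorter than k")
    total = len(words)
    counts = Counter(words)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical toy sequences: uniform base usage vs. a homopolymer.
uniform = block_entropy("ACGT")
homopolymer = block_entropy("AAAA")
```

A sequence that uses all four bases equally reaches the maximum of 2 bits per symbol, while a homopolymer carries no information under this measure.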


Subject(s)
Computational Biology/methods , Information Theory , Sequence Analysis/methods , Binding Sites/genetics , Genomics/methods , Genomics/statistics & numerical data , Humans , Statistical Models , Nonlinear Dynamics , Phylogeny , Saccharomyces cerevisiae/genetics , Sequence Alignment , Sequence Analysis/statistics & numerical data , Software , Transcription Factors/metabolism
11.
Brief Bioinform ; 15(3): 369-75, 2014 May.
Article in English | MEDLINE | ID: mdl-24162172

ABSTRACT

Among alignment-free methods, Iterated Maps (IMs) are at a particular extreme: they are also scale free (order free). The use of IMs for sequence analysis is also distinct from other alignment-free methodologies in being rooted in statistical mechanics instead of computational linguistics. Both of these roots go back over two decades to the use of fractal geometry in the characterization of phase-space representations. The time series analysis origin of the field is betrayed by the title of the manuscript that started this alignment-free subdomain in 1990, 'Chaos Game Representation'. The clash between the analysis of sequences as continuous series and the better established use of Markovian approaches to discrete series was almost immediate, with a defining critique published in the same journal 2 years later. The rest of that decade would go by before the scale-free nature of the IM space was uncovered. The ensuing decade saw this scalability generalized for non-genomic alphabets as well as an interest in its use for graphic representation of biological sequences. Finally, in the past couple of years, in step with the emergence of BigData and MapReduce as a new computational paradigm, there is a surprising third act in the IM story. Multiple reports have described gains in computational efficiency of multiple orders of magnitude over more conventional sequence analysis methodologies. The stage now appears to be set for a recasting of IMs with a central role in processing next-generation sequencing results.
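
The iterated map behind the Chaos Game Representation can be sketched in a few lines: starting from the centre of the unit square, each nucleotide moves the current point halfway towards the corner assigned to that nucleotide. The corner assignment below is one common convention and is an assumption of this sketch, as is the function name.

```python
# Corner assignment for the unit square (a common convention, assumed here).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def chaos_game_representation(seq):
    """Return the list of CGR coordinates visited while reading seq."""
    x, y = 0.5, 0.5  # start at the centre of the square
    points = []
    for base in seq:
        cx, cy = CORNERS[base]
        # Move halfway from the current point to the corner of this base.
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


# Hypothetical two-base example.
points = chaos_game_representation("AG")
```

Each point encodes the entire prefix read so far, with the most recent bases determining the coarsest position, which is why a single map carries information about all word lengths at once.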


Subject(s)
Computational Biology/methods , Sequence Analysis/methods , Computational Biology/trends , Fractals , Statistical Models , Nonlinear Dynamics , Sequence Alignment , Sequence Analysis/statistics & numerical data
12.
Pac Symp Biocomput ; : 320-31, 2013.
Article in English | MEDLINE | ID: mdl-23424137

ABSTRACT

We have developed a novel approach called ChIPModule to systematically discover transcription factors and their cofactors from ChIP-seq data. Given a ChIP-seq dataset and the binding patterns of a large number of transcription factors, ChIPModule can efficiently identify groups of transcription factors, whose binding sites significantly co-occur in the ChIP-seq peak regions. By testing ChIPModule on simulated data and experimental data, we have shown that ChIPModule identifies known cofactors of transcription factors, and predicts new cofactors that are supported by literature. ChIPModule provides a useful tool for studying gene transcriptional regulation.


Subject(s)
Chromatin Immunoprecipitation/statistics & numerical data , Sequence Analysis/statistics & numerical data , Transcription Factors/genetics , Transcription Factors/metabolism , Binding Sites/genetics , Computational Biology , Genetic Databases/statistics & numerical data , Humans
13.
Pac Symp Biocomput ; : 356-67, 2013.
Article in English | MEDLINE | ID: mdl-23424140

ABSTRACT

Human genetics recently transitioned from GWAS to studies based on NGS data. For GWAS, small effects dictated large sample sizes, typically made possible through meta-analysis by exchanging summary statistics across consortia. NGS studies groupwise-test for association of multiple potentially causal alleles along each gene. They are subject to similar power constraints and are therefore likely to resort to meta-analysis as well. The problem arises when considering the privacy of the genetic information during the data-exchange process. Many scoring schemes for NGS association rely on the frequency of each variant, thus requiring the exchange of the identity of the sequenced variant. Because such variants are often rare, this exchange can reveal the identity of their carriers and jeopardize privacy. We have thus developed MetaSeq, a protocol for meta-analysis of genome-wide sequencing data by multiple collaborating parties, scoring association for rare variants pooled per gene across all parties. We tackle the challenge of tallying frequency counts of rare, sequenced alleles for meta-analysis of sequencing data without disclosing the allele identity and counts, thereby protecting sample identity. This apparently paradoxical exchange of information is achieved through cryptographic means. The key idea is that parties encrypt the identity of genes and variants. When they transfer information about frequency counts in cases and controls, the exchanged data does not convey the identity of a mutation and therefore does not expose carrier identity. The exchange relies on a third party, trusted to follow the protocol although not trusted to learn about the raw data. We show applicability of this method to publicly available exome-sequencing data from multiple studies, simulating phenotypic information for powerful meta-analysis. The MetaSeq software is publicly available as open source.


Subject(s)
Genetic Privacy , Genome-Wide Association Study/statistics & numerical data , Meta-Analysis as Topic , Computational Biology , Computer Security/statistics & numerical data , Computer Simulation , Gene Frequency , Humans , Sequence Analysis/statistics & numerical data , Software
14.
Brief Bioinform ; 14(2): 193-202, 2013 Mar.
Article in English | MEDLINE | ID: mdl-22445902

ABSTRACT

The advent of second-generation sequencing (2GS) has provided a range of significant new challenges for the visualization of sequence assemblies. These include the large volume of data being generated, short read lengths, and the different data types and data formats associated with the diversity of new sequencing technologies. This article illustrates how Tablet, a high-performance graphical viewer for visualization of 2GS assemblies and read mappings, plays an important role in the analysis of these data. We present Tablet and, through a selection of use cases, demonstrate its value in quality assurance and scientific discovery, through features such as whole-reference coverage overviews, variant highlighting, paired-end read mark-up, GFF3-based feature tracks and protein translations. We discuss the computing and visualization techniques utilized to provide a rich and responsive graphical environment that enables users to view a range of file formats with ease. Tablet installers can be freely downloaded from http://bioinf.hutton.ac.uk/tablet in 32 or 64-bit versions for Windows, OS X, Linux or Solaris. For further details on Tablet, contact tablet@hutton.ac.uk.


Subject(s)
Computer Graphics , Data Display , Genetic Databases/statistics & numerical data , Animals , Computational Biology , Genomics/statistics & numerical data , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Internet , Sequence Analysis/statistics & numerical data , Software
15.
PLoS Comput Biol ; 8(6): e1002541, 2012.
Article in English | MEDLINE | ID: mdl-22685393

ABSTRACT

We provide a novel method, DRISEE (duplicate read inferred sequencing error estimation), to assess sequencing quality (alternatively referred to as "noise" or "error") within and/or between sequencing samples. DRISEE provides positional error estimates that can be used to inform read trimming within a sample. It also provides global (whole sample) error estimates that can be used to identify samples with high or varying levels of sequencing error that may confound downstream analyses, particularly in the case of studies that utilize data from multiple sequencing samples. For shotgun metagenomic data, we believe that DRISEE provides estimates of sequencing error that are more accurate and less constrained by technical limitations than existing methods that rely on reference genomes or the use of scores (e.g. Phred). Here, DRISEE is applied to (non amplicon) data sets from both the 454 and Illumina platforms. The DRISEE error estimate is obtained by analyzing sets of artifactual duplicate reads (ADRs), a known by-product of both sequencing platforms. We present DRISEE as an open-source, platform-independent method to assess sequencing error in shotgun metagenomic data, and utilize it to discover previously uncharacterized error in de novo sequence data from the 454 and Illumina sequencing platforms.
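
The idea of deriving positional error estimates from duplicate reads can be illustrated with a deliberately simplified sketch. It assumes that a bin of artifactual duplicate reads has already been identified and that all reads in the bin have equal length; it then takes the majority base at each position as the consensus and reports the fraction of reads that disagree. The abstract does not specify how DRISEE bins reads or computes its estimates, so the function name and the procedure below are assumptions for illustration only.

```python
from collections import Counter


def positional_error_rates(duplicate_reads):
    """Per-position fraction of reads that differ from the majority base.

    duplicate_reads: equal-length reads assumed to be copies of one template.
    """
    if not duplicate_reads:
        raise ValueError("need at least one read")
    length = len(duplicate_reads[0])
    if any(len(read) != length for read in duplicate_reads):
        raise ValueError("this sketch assumes reads of equal length")
    rates = []
    for position in range(length):
        column = Counter(read[position] for read in duplicate_reads)
        consensus_count = column.most_common(1)[0][1]
        rates.append(1 - consensus_count / len(duplicate_reads))
    return rates


# Hypothetical bin of four duplicate reads with one mismatch at the last base.
rates = positional_error_rates(["ACGT", "ACGT", "ACGA", "ACGT"])
```

A profile like this, rising towards the 3' end, is the kind of positional signal that could inform read trimming; averaging it over many bins would give a whole-sample figure.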


Subject(s)
Metagenomics/statistics & numerical data , Sequence Analysis/statistics & numerical data , Computational Biology , Statistical Data Interpretation , Genomics/statistics & numerical data , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans
16.
Pac Symp Biocomput ; : 259-70, 2012.
Article in English | MEDLINE | ID: mdl-22174281

ABSTRACT

Homology-based approaches are often used for the annotation of microbial communities, providing functional profiles that are used to characterize and compare the content and the functionality of microbial communities. Metagenomic reads are the starting data for these studies; however, considerable differences are observed between the functional profiles built from sequencing reads produced by different sequencing techniques, even for the same microbial community. Using simulation experiments, we show that such functional differences are likely to be caused by the actual difference in read lengths, and are not the result of a sampling bias of the sequencing techniques. Furthermore, the functional differences derived from different sequencing techniques cannot be fully explained by the read-count bias, i.e. 1) the higher fraction of unannotated shorter reads (i.e., "read length matters"), and 2) the different lengths of proteins in different functional categories. Instead, we show here that specific functional categories are under-annotated, because similarity-search-based functional annotation tools tend to miss more reads from functional categories that contain less conserved genes/proteins. In addition, the accuracy of functional annotation of short reads for different functions varies, further skewing the functional profiles. To address these issues, we present a simple yet efficient method to improve the frequency estimates of different functional categories in the functional profiles of metagenomes, based on the functional annotation of simulated reads from complete microbial genomes.


Subject(s)
Metagenomics/statistics & numerical data , Microbiota/genetics , Sequence Analysis/statistics & numerical data , Animals , Bacteria/genetics , Bacteria/isolation & purification , Bacterial Proteins/classification , Bacterial Proteins/genetics , Computational Biology , Feces/microbiology , Mice , Obesity/microbiology , Thinness/microbiology
18.
Adv Exp Med Biol ; 680: 411-7, 2010.
Article in English | MEDLINE | ID: mdl-20865526

ABSTRACT

Efforts have been devoted to accelerating the construction of suffix trees. However, little attention has been given to post-construction operations on suffix trees. Therefore, we investigate the effects of improved spatial locality on certain post-construction operations on suffix trees. We used a maximal exact repeat finding algorithm, MERF, on which the software REPuter is based, as an example, and conducted experiments on the 16 chromosomes of the yeast Saccharomyces cerevisiae. Two versions of suffix trees were customized for the algorithm and two variants of MERF were implemented accordingly. We showed that in all cases, the optimal cache-oblivious MERF is faster and displays consistently lower cache miss rates than its non-optimized counterpart.


Subject(s)
Algorithms , Sequence Analysis/statistics & numerical data , Fungal Chromosomes/genetics , Computational Biology , Fungal Genome , Nucleic Acid Repetitive Sequences , Saccharomyces cerevisiae/genetics , Software
19.
Adv Exp Med Biol ; 680: 693-700, 2010.
Article in English | MEDLINE | ID: mdl-20865556

ABSTRACT

Next Generation Sequencing technologies are limited by the lack of standard bioinformatics infrastructures that can reduce data storage, increase data processing performance, and integrate diverse information. HDF technologies address these requirements and have a long history of use in data-intensive science communities. They include general data file formats, libraries, and tools for working with the data. Compared to emerging standards, such as the SAM/BAM formats, HDF5-based systems demonstrate significantly better scalability, can support multiple indexes, store multiple data types, and are self-describing. For these reasons, HDF5 and its BioHDF extension are well suited for implementing data models to support the next generation of bioinformatics applications.


Subject(s)
Sequence Alignment/statistics & numerical data , Sequence Analysis/statistics & numerical data , Computational Biology , Computer Simulation , Database Management Systems , Genetic Databases , Sequence Alignment/standards , Sequence Alignment/trends , Sequence Analysis/standards , Sequence Analysis/trends , Software/standards , Software/trends , Software Design , User-Computer Interface
20.
J Mol Biol ; 396(5): 1439-50, 2010 Mar 12.
Article in English | MEDLINE | ID: mdl-20043919

ABSTRACT

Chimeric, humanized and human antibodies have successively been exploited as therapeutics because their increasing human ('self') character is expected to correspond with decreased immunogenicity, which is critical for their clinical development. Thus, humanness has been inferred to predict antibody immunogenicity. Humanness of antibody variable regions (V-regions) has recently been studied using a parameter (here referred to as the H-score) that evaluates similarity to expressed human sequences. Macaque (Macaca fascicularis) antibody sequences are of particular interest because they have been suggested to have extremely human-like character and, recently, macaque single-chain variable fragments with very high affinity for various antigens have been isolated. In this study, the H-scores of all macaque antibody V-regions available in sequence data banks were compared with those of their human counterparts using statistical tests. The results were found to be influenced by the relative size of the human families to which the macaque V-regions are related. As the relevance of families to immunogenicity is suspected but unproven, a new parameter (the 'G-score') was derived from the H-score to avoid this influence, and macaque V-region sequences were reanalyzed using the G-score. Both parameters show that these regions cannot be regarded as human when they derive from heavy chains, but the humanness of light chains is variable. It was shown that 'germline humanization' of a macaque V-region favourably influenced its humanness, as evaluated by both H-score and G-score. In addition, the humanness of macaque sequences presented in patents has been analyzed. The H-score and G-score define objectively the humanness of antibody V-regions, and their use is exemplified here.


Subject(s)
Immunoglobulin Genes , Immunoglobulins/genetics , Macaca fascicularis/genetics , Macaca fascicularis/immunology , Animals , Antibody Diversity , Genetic Databases , Immunoglobulin Heavy Chain Genes , Humans , Immunoglobulin Fab Fragments/genetics , Immunoglobulin Heavy Chains/genetics , Immunoglobulin Variable Region/genetics , Immunoglobulin kappa-Chains/genetics , Immunoglobulin lambda-Chains/genetics , Multigene Family , Sequence Analysis/statistics & numerical data , Species Specificity