Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 23
Filtrar
1.
Bioinformatics ; 33(17): 2750-2752, 2017 Sep 01.
Artículo en Inglés | MEDLINE | ID: mdl-28482046

RESUMEN

MOTIVATION: Long read sequencing technologies provide new opportunities to investigate genome structural variations (SVs) more accurately. However, the state-of-the-art SV calling pipelines are computational intensive and the applications of long reads are restricted. RESULTS: We propose a local region match-based filter (rMFilter) to efficiently nail down chimeric noisy long reads based on short token matches within local genomic regions. rMFilter is able to substantially accelerate long read-based SV calling pipelines without loss of effectiveness. It can be easily integrated into current long read-based pipelines to facilitate SV studies. AVAILABILITY AND IMPLEMENTATION: The C ++ source code of rMFilter is available at https://github.com/hitbc/rMFilter . CONTACT: ydwang@hit.edu.cn. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Variación Estructural del Genoma , Genómica/métodos , Análisis de Secuencia de ADN/métodos , Programas Informáticos , Genoma Humano , Humanos
2.
Bioinformatics ; 28(18): i356-i362, 2012 Sep 15.
Artículo en Inglés | MEDLINE | ID: mdl-22962452

RESUMEN

MOTIVATION: Metagenomic binning remains an important topic in metagenomic analysis. Existing unsupervised binning methods for next-generation sequencing (NGS) reads do not perform well on (i) samples with low-abundance species or (ii) samples (even with high abundance) when there are many extremely low-abundance species. These two problems are common for real metagenomic datasets. Binning methods that can solve these problems are desirable. RESULTS: We proposed a two-round binning method (MetaCluster 5.0) that aims at identifying both low-abundance and high-abundance species in the presence of a large amount of noise due to many extremely low-abundance species. In summary, MetaCluster 5.0 uses a filtering strategy to remove noise from the extremely low-abundance species. It separate reads of high-abundance species from those of low-abundance species in two different rounds. To overcome the issue of low coverage for low-abundance species, multiple w values are used to group reads with overlapping w-mers, whereas reads from high-abundance species are grouped with high confidence based on a large w and then binning expands to low-abundance species using a relaxed (shorter) w. Compared to the recent tools, TOSS and MetaCluster 4.0, MetaCluster 5.0 can find more species (especially those with low abundance of say 6× to 10×) and can achieve better sensitivity and specificity using less memory and running time. AVAILABILITY: http://i.cs.hku.hk/~alse/MetaCluster/ CONTACT: chin@cs.hku.hk.


Asunto(s)
Metagenómica/métodos , Programas Informáticos , Algoritmos , Sensibilidad y Especificidad , Análisis de Secuencia de ADN/métodos
3.
Bioinformatics ; 28(11): 1420-8, 2012 Jun 01.
Artículo en Inglés | MEDLINE | ID: mdl-22495754

RESUMEN

MOTIVATION: Next-generation sequencing allows us to sequence reads from a microbial environment using single-cell sequencing or metagenomic sequencing technologies. However, both technologies suffer from the problem that sequencing depth of different regions of a genome or genomes from different species are highly uneven. Most existing genome assemblers usually have an assumption that sequencing depths are even. These assemblers fail to construct correct long contigs. RESULTS: We introduce the IDBA-UD algorithm that is based on the de Bruijn graph approach for assembling reads from single-cell sequencing or metagenomic sequencing technologies with uneven sequencing depths. Several non-trivial techniques have been employed to tackle the problems. Instead of using a simple threshold, we use multiple depthrelative thresholds to remove erroneous k-mers in both low-depth and high-depth regions. The technique of local assembly with paired-end information is used to solve the branch problem of low-depth short repeat regions. To speed up the process, an error correction step is conducted to correct reads of high-depth regions that can be aligned to highconfident contigs. Comparison of the performances of IDBA-UD and existing assemblers (Velvet, Velvet-SC, SOAPdenovo and Meta-IDBA) for different datasets, shows that IDBA-UD can reconstruct longer contigs with higher accuracy. AVAILABILITY: The IDBA-UD toolkit is available at our website http://www.cs.hku.hk/~alse/idba_ud


Asunto(s)
Algoritmos , Metagenómica/métodos , Análisis de Secuencia de ADN/métodos , Análisis de la Célula Individual/métodos , Bacterias/genética , Genoma , Secuenciación de Nucleótidos de Alto Rendimiento
4.
Bioinformatics ; 27(13): i94-101, 2011 Jul 01.
Artículo en Inglés | MEDLINE | ID: mdl-21685107

RESUMEN

MOTIVATION: Next-generation sequencing techniques allow us to generate reads from a microbial environment in order to analyze the microbial community. However, assembling of a set of mixed reads from different species to form contigs is a bottleneck of metagenomic research. Although there are many assemblers for assembling reads from a single genome, there are no assemblers for assembling reads in metagenomic data without reference genome sequences. Moreover, the performances of these assemblers on metagenomic data are far from satisfactory, because of the existence of common regions in the genomes of subspecies and species, which make the assembly problem much more complicated. RESULTS: We introduce the Meta-IDBA algorithm for assembling reads in metagenomic data, which contain multiple genomes from different species. There are two core steps in Meta-IDBA. It first tries to partition the de Bruijn graph into isolated components of different species based on an important observation. Then, for each component, it captures the slight variants of the genomes of subspecies from the same species by multiple alignments and represents the genome of one species, using a consensus sequence. Comparison of the performances of Meta-IDBA and existing assemblers, such as Velvet and Abyss for different metagenomic datasets shows that Meta-IDBA can reconstruct longer contigs with similar accuracy. AVAILABILITY: Meta-IDBA toolkit is available at our website http://www.cs.hku.hk/~alse/metaidba. CONTACT: chin@cs.hku.hk.


Asunto(s)
Algoritmos , Metagenómica/métodos , Programas Informáticos , Escherichia coli/clasificación , Escherichia coli/genética , Escherichia coli/aislamiento & purificación , Genoma Bacteriano , Análisis de Secuencia de ADN/métodos
5.
Bioinformatics ; 27(11): 1489-95, 2011 Jun 01.
Artículo en Inglés | MEDLINE | ID: mdl-21493653

RESUMEN

MOTIVATION: With the rapid development of next-generation sequencing techniques, metagenomics, also known as environmental genomics, has emerged as an exciting research area that enables us to analyze the microbial environment in which we live. An important step for metagenomic data analysis is the identification and taxonomic characterization of DNA fragments (reads or contigs) resulting from sequencing a sample of mixed species. This step is referred to as 'binning'. Binning algorithms that are based on sequence similarity and sequence composition markers rely heavily on the reference genomes of known microorganisms or phylogenetic markers. Due to the limited availability of reference genomes and the bias and low availability of markers, these algorithms may not be applicable in all cases. Unsupervised binning algorithms which can handle fragments from unknown species provide an alternative approach. However, existing unsupervised binning algorithms only work on datasets either with balanced species abundance ratios or rather different abundance ratios, but not both. RESULTS: In this article, we present MetaCluster 3.0, an integrated binning method based on the unsupervised top--down separation and bottom--up merging strategy, which can bin metagenomic fragments of species with very balanced abundance ratios (say 1:1) to very different abundance ratios (e.g. 1:24) with consistently higher accuracy than existing methods. AVAILABILITY: MetaCluster 3.0 can be downloaded at http://i.cs.hku.hk/~alse/MetaCluster/.


Asunto(s)
Algoritmos , Metagenómica/métodos , Análisis de Secuencia de ADN , Análisis por Conglomerados
6.
Bioinformatics ; 24(6): 791-7, 2008 Mar 15.
Artículo en Inglés | MEDLINE | ID: mdl-18227115

RESUMEN

MOTIVATION: Recent experimental studies on compressed indexes (BWT, CSA, FM-index) have confirmed their practicality for indexing very long strings such as the human genome in the main memory. For example, a BWT index for the human genome (with about 3 billion characters) occupies just around 1 G bytes. However, these indexes are designed for exact pattern matching, which is too stringent for biological applications. The demand is often on finding local alignments (pairs of similar substrings with gaps allowed). Without indexing, one can use dynamic programming to find all the local alignments between a text T and a pattern P in O(|T||P|) time, but this would be too slow when the text is of genome scale (e.g. aligning a gene with the human genome would take tens to hundreds of hours). In practice, biologists use heuristic-based software such as BLAST, which is very efficient but does not guarantee to find all local alignments. RESULTS: In this article, we show how to build a software called BWT-SW that exploits a BWT index of a text T to speed up the dynamic programming for finding all local alignments. Experiments reveal that BWT-SW is very efficient (e.g. aligning a pattern of length 3 000 with the human genome takes less than a minute). We have also analyzed BWT-SW mathematically for a simpler similarity model (with gaps disallowed), and we show that the expected running time is O(/T/(0.628)/P/) for random strings. As far as we know, BWT-SW is the first practical tool that can find all local alignments. Yet BWT-SW is not meant to be a replacement of BLAST, as BLAST is still several times faster than BWT-SW for long patterns and BLAST is indeed accurate enough in most cases (we have used BWT-SW to check against the accuracy of BLAST and found that only rarely BLAST would miss some significant alignments). AVAILABILITY: www.cs.hku.hk/~ckwong3/bwtsw CONTACT: twlam@cs.hku.hk.


Asunto(s)
Algoritmos , Mapeo Cromosómico/métodos , ADN/genética , Genoma Humano/genética , Alineación de Secuencia/métodos , Análisis de Secuencia de ADN/métodos , Secuencia de Bases , Compresión de Datos/métodos , Humanos , Datos de Secuencia Molecular
7.
BMC Bioinformatics ; 9 Suppl 12: S3, 2008 Dec 12.
Artículo en Inglés | MEDLINE | ID: mdl-19091026

RESUMEN

BACKGROUND: MicroRNAs are small non-coding RNA gene products that play diversified roles from species to species. The explosive growth of microRNA researches in recent years proves the importance of microRNAs in the biological system and it is believed that microRNAs have valuable therapeutic potentials in human diseases. Continual efforts are therefore required to locate and verify the unknown microRNAs in various genomes. As many miRNAs are found to be arranged in clusters, meaning that they are in close proximity with their neighboring miRNAs, we are interested in utilizing the concept of microRNA clustering and applying it in microRNA computational prediction. RESULTS: We first validate the microRNA clustering phenomenon in the human, mouse and rat genomes. There are 45.45%, 51.86% and 48.67% of the total miRNAs that are clustered in the three genomes, respectively. We then conduct sequence and secondary structure similarity analyses among clustered miRNAs, non-clustered miRNAs, neighboring sequences of clustered miRNAs and random sequences, and find that clustered miRNAs are structurally more similar to one another, and the RNAdistance score can be used to assess the structural similarity between two sequences. We therefore design a clustering-based approach which utilizes this observation to filter false positives from a list of candidates generated by a selected microRNA prediction program, and successfully raise the positive predictive value by a considerable amount ranging from 15.23% to 23.19% in the human, mouse and rat genomes, while keeping a reasonably high sensitivity. CONCLUSION: Our clustering-based approach is able to increase the effectiveness of currently available microRNA prediction program by raising the positive predictive value while maintaining a high sensitivity, and hence can serve as a filtering step. We believe that it is worthwhile to carry out further experiments and tests with our approach using data from other genomes and other prediction software tools. Better results may be achieved with fine-tuning of parameters.


Asunto(s)
Biología Computacional/métodos , MicroARNs/química , Algoritmos , Animales , Análisis por Conglomerados , Simulación por Computador , Reacciones Falso Positivas , Genoma , Humanos , Ratones , MicroARNs/genética , Valor Predictivo de las Pruebas , Ratas , Programas Informáticos
8.
J Bioinform Comput Biol ; 6(5): 1021-33, 2008 Oct.
Artículo en Inglés | MEDLINE | ID: mdl-18942164

RESUMEN

We consider the problem of predicting alternative splicing patterns from a set of expressed sequences (cDNAs and ESTs). Some of these expressed sequences may be errorous, thus forming incorrect exons/introns. These incorrect exons/introns may cause a lot of false positives. For example, we examined a popular alternative splicing database, ECgene, which predicts alternate splicing patterns from expressed sequences. The result shows that about 81.3%-81.6% (sensitivity) of known patterns are found, but the specificity can be as low as 5.9%. Based on the idea that errorous sequences are usually not consistent with other sequences, in this paper we provide an alternative approach for finding alternative splicing patterns which ensures that individual exons/introns of the reported patterns have enough support from the expressed sequences. On the same dataset, our approach can achieve a much higher specificity and a slight increase in sensitivity (38.9% and 84.9%, respectively). Our approach also gives better results compared with popular alternative splicing databases (ASD, ECgene, SpliceNest) and the software ClusterMerge.


Asunto(s)
Algoritmos , Empalme Alternativo/genética , Exones/genética , Expresión Génica/genética , Intrones/genética , Sitios de Empalme de ARN/genética , Análisis de Secuencia de ADN/métodos , Secuencia de Bases , Datos de Secuencia Molecular
9.
Forensic Sci Int ; 259: 200-9, 2016 Feb.
Artículo en Inglés | MEDLINE | ID: mdl-26804669

RESUMEN

Most of the existing image modification detection methods which are based on DCT coefficient analysis model the distribution of DCT coefficients as a mixture of a modified and an unchanged component. To separate the two components, two parameters, which are the primary quantization step, Q1, and the portion of the modified region, α, have to be estimated, and more accurate estimations of α and Q1 lead to better detection and localization results. Existing methods estimate α and Q1 in a completely blind manner, without considering the characteristics of the mixture model and the constraints to which α should conform. In this paper, we propose a more effective scheme for estimating α and Q1, based on the observations that, the curves on the surface of the likelihood function corresponding to the mixture model is largely smooth, and α can take values only in a discrete set. We conduct extensive experiments to evaluate the proposed method, and the experimental results confirm the efficacy of our method.

10.
J Comput Biol ; 12(6): 686-701, 2005.
Artículo en Inglés | MEDLINE | ID: mdl-16108711

RESUMEN

A molecule called transcription factor usually binds to a set of promoter sequences of coexpressed genes. As a result, these promoter sequences contain some short substrings, or binding sites, with similar patterns. The motif discovering problem is to find these similar patterns and motifs in a set of sequences. Most existing algorithms find the motifs based on strong-signal sequences only (i.e., those containing binding sites very similar to the motif). In this paper, we use a probability matrix to represent a motif to calculate the minimum total number of binding sites required to be in the input dataset in order to confirm that the discovered motifs are not artifacts. Next, we introduce a more general and realistic energy-based model, which considers all sequences with varying degrees of binding strength to the transcription factors (as measured experimentally). By treating sequences with varying degrees of binding strength, we develop a heuristic algorithm called EBMF (Energy-Based Motif Finding Algorithm) to find the motif, which can handle sequences ranging from those that contain more than one binding site to those that contain none. EBMF can find motifs for datasets that do not even have the required minimum number of binding sites as previously derived. EBMF compares favorably with common motif-finding programs AlignACE and MEME. In particular, for some simulated and real datasets, EBMF finds the motif when both AlignACE and MEME fail to do so.


Asunto(s)
Algoritmos , Sitios de Unión , Modelos Estadísticos , Proteínas/química , Análisis de Secuencia de Proteína/métodos , Factores de Transcripción/química , Secuencias de Aminoácidos , Modelos Químicos , Unión Proteica
11.
Sci China Life Sci ; 57(11): 1140-8, 2014 Nov.
Artículo en Inglés | MEDLINE | ID: mdl-25326069

RESUMEN

Sequence assembling is an important step for bioinformatics study. With the help of next generation sequencing (NGS) technology, high throughput DNA fragment (reads) can be randomly sampled from DNA or RNA molecular sequence. However, as the positions of reads being sampled are unknown, assembling process is required for combining overlapped reads to reconstruct the original DNA or RNA sequence. Compared with traditional Sanger sequencing methods, although the throughput of NGS reads increases, the read length is shorter and the error rate is higher. It introduces several problems in assembling. Moreover, paired-end reads instead of single-end reads can be sampled which contain more information. The existing assemblers cannot fully utilize this information and fails to assemble longer contigs. In this article, we will revisit the major problems of assembling NGS reads on genomic, transcriptomic, metagenomic and metatranscriptomic data. We will also describe our IDBA package for solving these problems. IDBA package has adopted several novel ideas in assembling, including using multiple k, local assembling and progressive depth removal. Compared with existence assemblers, IDBA has better performance on many simulated and real sequencing datasets.


Asunto(s)
Biología Computacional/métodos , ADN/química , ARN/química , Análisis de Secuencia de ADN/métodos , Algoritmos , Mapeo Contig/métodos , Escherichia coli/genética , Reacciones Falso Positivas , Genoma , Genoma Bacteriano , Humanos , Lactobacillus plantarum/genética , Metagenómica , Programas Informáticos , Transcripción Genética , Transcriptoma
12.
J Comput Biol ; 19(4): 365-78, 2012 Apr.
Artículo en Inglés | MEDLINE | ID: mdl-22468707

RESUMEN

Structural alignment is useful in identifying members of ncRNAs. Existing tools are all based on the secondary structures of the molecules. There is evidence showing that tertiary interactions (the interaction between a single-stranded nucleotide and a base-pair) in triple helix structures are critical in some functions of ncRNAs. In this article, we address the problem of structural alignment of RNAs with the triple helix. We provide a formal definition to capture a simplified model of a triple helix structure, then develop an algorithm of O(mn(3)) time to align a query sequence (of length m) with known triple helix structure with a target sequence (of length n) with an unknown structure. The resulting algorithm is shown to be useful in identifying ncRNA members in a simulated genome.


Asunto(s)
Algoritmos , ARN no Traducido/química , Alineación de Secuencia/métodos , Análisis de Secuencia de ARN/métodos , Secuencia de Bases , Modelos Moleculares , Conformación de Ácido Nucleico
13.
Artículo en Inglés | MEDLINE | ID: mdl-21464506

RESUMEN

In this paper, we consider the problem of structural alignment of a target RNA sequence of length n and a query RNA sequence of length m with known secondary structure that may contain simple pseudoknots or embedded simple pseudoknots. The best known algorithm for solving this problem runs in O(mn3) time for simple pseudoknot or O(mn4) time for embedded simple pseudoknot with space complexity of O(mn3) for both structures, which require too much memory making it infeasible for comparing noncoding RNAs (ncRNAs) with length several hundreds or more. We propose memory efficient algorithms to solve the same problem. We reduce the space complexity to O(n3) for simple pseudoknot and O(mn2 + n3) for embedded simple pseudoknot while maintaining the same time complexity. We also show how to modify our algorithm to handle a restricted class of recursive simple pseudoknot which is found abundant in real data with space complexity of O(mn2 + n3) and time complexity of O(mn4). Experimental results show that our algorithms are feasible for comparing ncRNAs of length more than 500.


Asunto(s)
Algoritmos , Conformación de Ácido Nucleico , ARN no Traducido/química , Alineación de Secuencia/métodos , Análisis de Secuencia de ARN/métodos
14.
J Comput Biol ; 19(2): 241-9, 2012 Feb.
Artículo en Inglés | MEDLINE | ID: mdl-22300323

RESUMEN

Next-generation sequencing (NGS) technologies allow the sequencing of microbial communities directly from the environment without prior culturing. The output of environmental DNA sequencing consists of many reads from genomes of different unknown species, making the clustering together reads from the same (or similar) species (also known as binning) a crucial step. The difficulties of the binning problem are due to the following four factors: (1) the lack of reference genomes; (2) uneven abundance ratio of species; (3) short NGS reads; and (4) a large number of species (can be more than a hundred). None of the existing binning tools can handle all four factors. No tools, including both AbundanceBin and MetaCluster 3.0, have demonstrated reasonable performance on a sample with more than 20 species. In this article, we introduce MetaCluster 4.0, an unsupervised binning algorithm that can accurately (with about 80% precision and sensitivity in all cases and at least 90% in some cases) and efficiently bin short reads with varying abundance ratios and is able to handle datasets with 100 species. The novelty of MetaCluster 4.0 stems from solving a few important problems: how to divide reads into groups by a probabilistic approach, how to estimate the 4-mer distribution of each group, how to estimate the number of species, and how to modify MetaCluster 3.0 to handle a large number of species. We show that Meta Cluster 4.0 is effective for both simulated and real datasets. Supplementary Material is available at www.liebertonline.com/cmb.


Asunto(s)
Algoritmos , Secuenciación de Nucleótidos de Alto Rendimiento , Análisis de Secuencia de ADN/métodos , Programas Informáticos , Bacterias/genética , Secuencia de Bases , Análisis por Conglomerados , Interpretación Estadística de Datos , Genoma Bacteriano , Modelos Estadísticos
15.
Artículo en Inglés | MEDLINE | ID: mdl-22848134

RESUMEN

Structural alignment has been shown to be an effective computational method to identify structural noncoding RNA(ncRNA) candidates as ncRNAs are known to be conserved in secondary structures. However, the complexity of the structural alignment algorithms becomes higher when the structure has pseudoknots. Even for the simplest type of pseudoknots (simple pseudoknots), the fastest algorithm runs in O(mn3) time, where m, n are the length of the query ncRNA (with known structure) and the length of the target sequence (with unknown structure), respectively. In practice, we are usually given a long DNA sequence and we try to locate regions in the sequence for possible candidates of a particular ncRNA. Thus, we need to run the structural alignment algorithm on every possible region in the long sequence. For example, finding candidates for a known ncRNA of length 100 on a sequence of length 50,000, it takes more than one day. In this paper, we provide an efficient algorithm to solve the problem for simple pseudoknots and it is shown to be 10 times faster. The speedup stems from an effective pruning strategy consisting of the computation of a lower bound score for the optimal alignment and an estimation of the maximum score that a candidate can achieve to decide whether to prune the current candidate or not.


Asunto(s)
Algoritmos , Biología Computacional/métodos , Genoma , Conformación de Ácido Nucleico , Análisis de Secuencia de ADN/métodos , ADN/química , ADN/genética , Modelos Genéticos , ARN no Traducido/química , ARN no Traducido/genética , Programas Informáticos
16.
J Comput Biol ; 18(1): 97-108, 2011 Jan.
Artículo en Inglés | MEDLINE | ID: mdl-21210732

RESUMEN

The secondary structure of an ncRNA molecule is known to play an important role in its biological functions. Aligning a known ncRNA to a target candidate to determine the sequence and structural similarity helps in identifying de novo ncRNA molecules that are in the same family of the known ncRNA. However, existing algorithms cannot handle complex pseudoknot structures which are found in nature. In this article, we propose algorithms to handle two types of complex pseudoknots: simple non-standard pseudoknots and recursive pseudoknots. Although our methods are not designed for general pseudoknots, it already covers all known ncRNAs in both Rfam and PseudoBase databases. An evaluation of our algorithms shows that it is useful to identify ncRNA molecules in other species which are in the same family of a known ncRNA.


Asunto(s)
Simulación por Computador , Modelos Moleculares , ARN no Traducido/química , Algoritmos , Secuencia de Bases , Humanos , Datos de Secuencia Molecular , Conformación de Ácido Nucleico , ARN no Traducido/clasificación , Alineación de Secuencia
17.
Int J Bioinform Res Appl ; 6(6): 542-55, 2010.
Artículo en Inglés | MEDLINE | ID: mdl-21354961

RESUMEN

In this paper, we consider the problem of reconstructing a pathway for a given set of proteins based on available genomics and proteomics information such as gene expression data. In all previous approaches, the scoring function for a candidate pathway usually only depends on adjacent proteins in the pathway. We propose to also consider proteins that are of distance two in the pathway (we call them Level-2 neighbours). We derive a scoring function based on both adjacent proteins and Level-2 neighbours in the pathway and show that our scoring function can increase the accuracy of the predicted pathways through a set of experiments. The problem of computing the pathway with optimal score, in general, is NP-hard. We thus extend a randomised algorithm to make it work on our scoring function to compute the optimal pathway with high probability.


Asunto(s)
Algoritmos , Péptidos y Proteínas de Señalización Intracelular/metabolismo , Transducción de Señal , Biología Computacional , Bases de Datos de Proteínas , Genómica , Péptidos y Proteínas de Señalización Intracelular/química , Péptidos y Proteínas de Señalización Intracelular/genética , Proteómica
18.
J Comput Biol ; 16(2): 133-44, 2009 Feb.
Artículo en Inglés | MEDLINE | ID: mdl-19193141

RESUMEN

UNLABELLED: Protein complexes play a critical role in many biological processes. Identifying the component proteins in a protein complex is an important step in understanding the complex as well as the related biological activities. This paper addresses the problem of predicting protein complexes from the protein-protein interaction (PPI) network of one species using a computational approach. Most of the previous methods rely on the assumption that proteins within the same complex would have relatively more interactions. This translates into dense subgraphs in the PPI network. However, the existing software tools have limited success. Recently, Gavin et al. (2006) provided a detailed study on the organization of protein complexes and suggested that a complex consists of two parts: a core and an attachment. Based on this core-attachment concept, we developed a novel approach to identify complexes from the PPI network by identifying their cores and attachments separately. We evaluated the effectiveness of our proposed approach using three different datasets and compared the quality of our predicted complexes with three existing tools. The evaluation results show that we can predict many more complexes and with higher accuracy than these tools with an improvement of over 30%. To verify the cores we identified in each complex, we compared our cores with the mediators produced by Andreopoulos et al. (2007), which were claimed to be the cores, based on the benchmark result produced by Gavin et al. (2006). We found that the cores we produced are of much higher quality ranging from 10- to 30-fold more correctly predicted cores and with better accuracy. AVAILABILITY: (http://alse.cs.hku.hk/complexes/).


Asunto(s)
Modelos Teóricos , Complejos Multiproteicos , Mapeo de Interacción de Proteínas , Programas Informáticos , Cadenas de Markov , Matemática , Complejos Multiproteicos/química , Complejos Multiproteicos/metabolismo , Proteínas/química , Proteínas/metabolismo
19.
Int J Bioinform Res Appl ; 5(2): 224-37, 2009.
Artículo en Inglés | MEDLINE | ID: mdl-19324607

RESUMEN

In the sequencing process, reads of the sequence are generated, then assembled to form contigs. New technologies can produce reads faster with lower cost and higher coverage. However, these reads are shorter. With errors, short reads make the assembly step more difficult. Chaisson et al. (2004) proposed an algorithm to correct the reads prior to the assembly step. The result is not satisfactory when the error rate is high (e.g., >or=3%). We improve their approach to handle reads of higher error rates. Experimental results show that our approach is much more effective in correcting errors, producing contigs of higher quality.


Asunto(s)
Biología Computacional/métodos , Análisis de Secuencia de ADN/métodos , Algoritmos , ADN/química , Bases de Datos Genéticas , Alineación de Secuencia
20.
Artículo en Inglés | MEDLINE | ID: mdl-17951817

RESUMEN

Finding motif pairs from a set of protein sequences based on the protein-protein interaction data is a challenging computational problem. Existing effective approaches usually rely on additional information such as some prior knowledge on protein groupings based on protein domains. In reality, this kind of knowledge is not always available. Novel approaches without using this knowledge is much desirable. Recently, Tan et al. proposed such an approach. However, there are two problems with their approach. The scoring function (using chi(2) testing) used in their approach is not adequate. Random motif pairs may have higher scores than the correct ones. Their approach is also not scalable. It may take days to process a set of 5000 protein sequences with about 20,000 interactions. In this paper, our contribution is two-fold. We first introduce a new scoring method, which is shown to be more accurate than the chi-score used in Tan et al. Then, we present two efficient algorithms, one exact algorithm and a heuristic version of it, to solve the problem of finding motif pairs. Based on experiments on real datasets, we show that our algorithms are efficient and can accurately locate the motif pairs. We have also evaluated the sensitivity and efficiency of our heuristics algorithm using simulated datasets, the results show that the algorithm is very efficient with reasonably high sensitivity.


Asunto(s)
Modelos Biológicos , Modelos Químicos , Mapeo de Interacción de Proteínas/métodos , Proteínas/química , Proteínas/metabolismo , Análisis de Secuencia de Proteína/métodos , Transducción de Señal/fisiología , Secuencias de Aminoácidos , Secuencia de Aminoácidos , Sitios de Unión , Simulación por Computador , Interpretación Estadística de Datos , Modelos Estadísticos , Datos de Secuencia Molecular , Unión Proteica
SELECCIÓN DE REFERENCIAS
DETALLE DE LA BÚSQUEDA