1.
Front Cell Dev Biol ; 11: 1224069, 2023.
Article in English | MEDLINE | ID: mdl-37655157

ABSTRACT

Background: An increasing number of patients are being diagnosed with lung adenocarcinoma, but there remains limited progress in improving prognostic outcomes and survival rates for these patients. Genome instability is considered a contributing factor, as it enables other hallmarks of cancer to acquire functional capabilities, allowing cancer cells to survive, proliferate, and disseminate. Despite the importance of genome instability in cancer development, few studies have explored prognostic signatures associated with genome instability in lung adenocarcinoma. Methods: In this study, we randomly divided 397 lung adenocarcinoma patients from The Cancer Genome Atlas database into a training group (n = 199) and a testing group (n = 198). By calculating the cumulative counts of genomic alterations for each patient in the training group, we distinguished the top 25% and bottom 25% of patients. We then compared their gene expression profiles to identify genome instability-related genes. Next, we used univariate and multivariate Cox regression analyses to identify the prognostic signature. We also performed Kaplan-Meier survival analysis and the log-rank test to evaluate the performance of the identified signature. The performance of the signature was further validated in the testing group, in The Cancer Genome Atlas dataset, and in external datasets. We also conducted a time-dependent receiver operating characteristic analysis to compare our signature with established prognostic signatures and demonstrate its potential clinical value. Results: From 42 genome instability-related genes, we identified GULPsig, which includes IGF2BP1, IGF2BP3, SMC1B, CLDN6, and LY6K, as a prognostic signature for lung adenocarcinoma patients. Based on the risk score of the risk model with GULPsig, we successfully stratified the patients into high- and low-risk groups according to Kaplan-Meier survival analysis and the log-rank test.
We further validated the performance of GULPsig as an independent prognostic signature and observed that it outperformed established signatures. Conclusion: We provide new insights into the clinical application of genome instability and identify GULPsig as a potential prognostic signature for lung adenocarcinoma patients.
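The risk-model stratification described above — a weighted sum of the expression of the signature genes, split at the median — can be sketched as follows. The coefficients here are hypothetical placeholders, not the published GULPsig model weights; only the five gene names come from the abstract.

```python
# Sketch of risk-score stratification for a gene-expression signature:
# score = sum(coefficient * expression) over the signature genes,
# then split the cohort at the (upper) median score.

GENES = ["IGF2BP1", "IGF2BP3", "SMC1B", "CLDN6", "LY6K"]
# Hypothetical Cox-regression coefficients, for illustration only.
COEF = {"IGF2BP1": 0.21, "IGF2BP3": 0.18, "SMC1B": 0.35, "CLDN6": 0.12, "LY6K": 0.27}

def risk_score(expression):
    """expression: dict gene -> (normalized) expression value."""
    return sum(COEF[g] * expression[g] for g in GENES)

def stratify(patients):
    """patients: dict patient_id -> expression dict.
    Returns (high_risk, low_risk) patient-id lists, split at the upper median score."""
    scores = {pid: risk_score(expr) for pid, expr in patients.items()}
    median = sorted(scores.values())[len(scores) // 2]
    high = [pid for pid, s in scores.items() if s >= median]
    low = [pid for pid, s in scores.items() if s < median]
    return high, low
```

In the actual study the two groups would then be compared with Kaplan-Meier curves and a log-rank test; this sketch only covers the scoring and splitting step.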

2.
Article in English | MEDLINE | ID: mdl-35120007

ABSTRACT

In this paper, we explore using the data-centric approach to tackle the Multiple Sequence Alignment (MSA) construction problem. Unlike the algorithm-centric approach, which reduces the construction problem to a combinatorial optimization problem based on an abstract mathematical model, the data-centric approach explores using classification models trained from existing benchmark data to guide the construction. We identified two simple classifications to help us choose a better alignment tool and determine whether and how much to carry out realignment. We show that shallow machine-learning algorithms suffice to train sensitive models for these classifications. Based on these models, we implemented a new multiple sequence alignment pipeline, called MLProbs. Compared with 10 other popular alignment tools over four benchmark databases (namely, BAliBASE, OXBench, OXBench-X and SABMark), MLProbs consistently gives the highest TC score. More importantly, MLProbs shows non-trivial improvement for protein families with low similarity; in particular, when evaluated against the 1,356 protein families with similarity ≤ 50%, MLProbs achieves a TC score of 56.93, while the next best three tools are in the range of [55.41, 55.91] (increased by more than 1.8%). We also compared the performance of MLProbs and other MSA tools in two real-life applications - Phylogenetic Tree Construction Analysis and Protein Secondary Structure Prediction - and MLProbs also had the best performance. In our study, we used only shallow machine-learning algorithms to train our models. It would be interesting to study whether deep-learning methods can help make further improvements, so we suggest some possible research directions in the conclusion section.
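The data-centric idea — cheap features of the input drive a shallow model that picks the alignment strategy — can be illustrated with a toy dispatcher. The feature (average pairwise percent identity) and the threshold are assumptions for illustration; the paper's trained classifiers and the tools they choose between are not reproduced here.

```python
# Toy version of a data-centric aligner chooser: compute a cheap feature of
# the input sequences and let a simple rule (standing in for a trained shallow
# classifier) decide which alignment tool to run.

from itertools import combinations

def percent_identity(a, b):
    """Crude ungapped identity over the aligned prefix (a simplification)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return 100.0 * sum(x == y for x, y in zip(a, b)) / n

def avg_identity(seqs):
    pairs = list(combinations(seqs, 2))
    return sum(percent_identity(a, b) for a, b in pairs) / len(pairs)

def choose_aligner(seqs, threshold=50.0):
    """Stand-in for the trained model: dispatch by input similarity.
    Tool names and the 50% threshold are illustrative placeholders."""
    return "aligner_for_similar" if avg_identity(seqs) > threshold else "aligner_for_divergent"
```

A real pipeline would train the classifier on benchmark alignments and use richer features; the point of the sketch is only the dispatch structure.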


Subjects
Algorithms; Computational Biology; Sequence Alignment; Phylogeny; Computational Biology/methods; Proteins/genetics; Software
3.
NAR Genom Bioinform ; 3(3): lqab062, 2021 Sep.
Article in English | MEDLINE | ID: mdl-34235433

ABSTRACT

Relation extraction (RE) is a fundamental task for extracting gene-disease associations from biomedical text. Many state-of-the-art tools have limited capacity, as they can extract gene-disease associations only from single sentences or abstracts. A few studies have explored extracting gene-disease associations from full-text articles, but there remains substantial room for improvement. In this work, we propose RENET2, a deep learning-based RE method that implements Section Filtering and ambiguous-relation modeling to extract gene-disease associations from full-text articles. We designed a novel iterative training data expansion strategy to build an annotated full-text dataset and resolve the scarcity of labels on full-text articles. In our experiments, RENET2 achieved an F1-score of 72.13% for extracting gene-disease associations from an annotated full-text dataset, which was 27.22, 30.30, 29.24 and 23.87% higher than BeFree, DTMiner, BioBERT and RENET, respectively. We applied RENET2 to (i) ∼1.89M full-text articles from PubMed Central, finding ∼3.72M gene-disease associations; and (ii) the LitCovid articles, ranking the top 15 proteins associated with COVID-19, supported by recent articles. RENET2 is an efficient and accurate method for full-text gene-disease association extraction. The source code, manually curated abstract/full-text training data, and results of RENET2 are available at GitHub.

4.
BMC Genomics ; 21(Suppl 6): 500, 2020 Dec 21.
Article in English | MEDLINE | ID: mdl-33349238

ABSTRACT

BACKGROUND: Next-generation sequencing (NGS) enables unbiased detection of pathogens by mapping the sequencing reads of a patient sample to the known reference sequences of bacteria and viruses. However, for a new pathogen without a reference sequence of a close relative, or with a high load of mutations compared to its predecessors, read mapping fails due to the low similarity between the pathogen and the reference sequence, which in turn leads to insensitive and inaccurate pathogen detection. RESULTS: We developed MegaPath, which runs fast and provides high sensitivity in detecting new pathogens. In MegaPath, we have implemented and tested a combination of polishing techniques to remove non-informative human reads and spurious alignments. MegaPath applies a global optimization to the read alignments and reassigns reads incorrectly aligned to multiple species to a unique species. The reassignment not only significantly increased the number of reads aligned to distant pathogens, but also significantly reduced incorrect alignments. To run fast, MegaPath implements an enhanced maximum-exact-match prefix seeding strategy and a SIMD-accelerated Smith-Waterman algorithm. CONCLUSIONS: In our benchmarks, MegaPath demonstrated superior sensitivity by detecting eight times more reads from a low-similarity pathogen than other tools. Meanwhile, MegaPath ran much faster than the other state-of-the-art alignment-based pathogen detection tools (and comparably to the less sensitive profile-based detection tools). The running time of MegaPath is about 20 min on a typical 1 Gb dataset.
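For reference, the Smith-Waterman algorithm mentioned above (which MegaPath accelerates with SIMD instructions) computes the best local alignment by dynamic programming. A minimal, unvectorized sketch with illustrative scoring parameters:

```python
# Minimal Smith-Waterman local alignment (score only), pure Python.
# The scoring values are illustrative; production aligners use tuned
# parameters and SIMD-vectorized inner loops.

def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # DP matrix, clamped at 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never goes below 0 (restart option).
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

The SIMD trick in real tools is to evaluate many cells of this matrix (along anti-diagonals or stripes) in parallel; the recurrence itself is exactly the one above.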


Subjects
Metagenomics; Software; Algorithms; High-Throughput Nucleotide Sequencing; Humans; Metagenome; Sequence Alignment; Sequence Analysis, DNA
5.
Bioinformatics ; 34(21): 3744-3746, 2018 11 01.
Article in English | MEDLINE | ID: mdl-29771282

ABSTRACT

Summary: AC-DIAMOND (v1) is a DNA-protein alignment tool designed to tackle the efficiency challenge of aligning large amounts of reads or contigs to protein databases. Compared with DIAMOND, previously the most efficient method, AC-DIAMOND gains a 6- to 7-fold speed-up while retaining a similar degree of sensitivity. The improvement is rooted in two aspects: first, a compressed index of adaptive-length seeds speeds up the matching between query and reference sequences; second, a compact form of dynamic programming fully utilizes the parallelism of SIMD capabilities. Availability and implementation: Source code and binaries are available at https://github.com/Maihj/AC-DIAMOND/. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Software; DNA; Databases, Protein; Proteins; Sequence Analysis, DNA
6.
BMC Bioinformatics ; 18(Suppl 12): 408, 2017 Oct 16.
Article in English | MEDLINE | ID: mdl-29072142

ABSTRACT

BACKGROUND: The recent release of the gene-targeted metagenomics assembler Xander has demonstrated that using a trained Hidden Markov Model (HMM) to guide the traversal of a de Bruijn graph gives an obvious advantage over other assembly methods. Xander, as a pilot study, still has a lot of room for improvement. Apart from its slow speed, Xander uses only one k-mer size for graph construction, and any choice of k compromises either sensitivity or accuracy. Xander uses a Bloom-filter representation of the de Bruijn graph to achieve a lower memory footprint. Bloom filters introduce false positives, and it is not clear how these impact the quality of assembly. Moreover, Xander does not keep track of the multiplicity of k-mers, which would have been an effective way to differentiate between erroneous and correct k-mers. RESULTS: In this paper, we present a new gene-targeted assembler, MegaGTA, which attempts to improve on Xander in different aspects. Quality-wise, it utilizes iterative de Bruijn graphs to take full advantage of multiple k-mer sizes, achieving both sensitivity and accuracy. Computation-wise, it employs succinct de Bruijn graphs (SdBG) to achieve a low memory footprint and high speed (the latter benefits from a highly efficient parallel algorithm for constructing SdBG). Unlike Bloom filters, an SdBG is an exact representation of a de Bruijn graph. It enables MegaGTA to avoid false-positive contigs and to easily incorporate the multiplicity of k-mers for building a better HMM model. We have compared MegaGTA and Xander on an HMP-defined mock metagenomic dataset, and showed that MegaGTA excelled in both sensitivity and accuracy. On a large rhizosphere soil metagenomic sample (327 Gbp), MegaGTA produced 9.7-19.3% more contigs than Xander, and these contigs were assigned to 10-25% more gene references. In our experiments, MegaGTA, depending on the number of k-mers used, was two to ten times faster than Xander.
CONCLUSION: MegaGTA improves on the algorithm of Xander and achieves higher sensitivity, accuracy and speed. Moreover, it is capable of assembling gene sequences from ultra-large metagenomic datasets. Its source code is freely available at https://github.com/HKU-BAL/megagta .


Subjects
Algorithms; Genes; Metagenomics/methods; Software; Databases, Genetic; Humans; Pilot Projects; Reference Standards; Rhizosphere; Soil; Statistics as Topic
7.
BMC Genomics ; 18(Suppl 4): 362, 2017 05 24.
Article in English | MEDLINE | ID: mdl-28589863

ABSTRACT

BACKGROUND: Recent advances in whole-genome alignment software have made it possible to align two genomes very efficiently with only a small sacrifice in sensitivity, yet such software becomes very slow when extra sensitivity is needed. This paper proposes a simple but effective method to improve the sensitivity of existing whole-genome alignment software at little extra running time. RESULTS AND CONCLUSIONS: We applied our method to the popular whole-genome alignment tool LAST and call the resulting tool LASTM. Experimental results showed that LASTM finds more high-quality alignments with a little extra running time. For example, when comparing the human and mouse genomes, to produce a similar number of alignments with similar average length and similarity, LASTM was about three times faster than LAST. We conclude that our method improves sensitivity at a small cost in running time, and is thus worth implementing in existing tools.


Subjects
Sequence Alignment/methods; Whole Genome Sequencing/methods; Animals; Humans; Time Factors
8.
BMC Bioinformatics ; 17 Suppl 8: 285, 2016 Aug 31.
Article in English | MEDLINE | ID: mdl-27585754

ABSTRACT

BACKGROUND: This paper describes a new MSA tool, PnpProbs, which constructs better multiple sequence alignments through better handling of guide trees. It classifies input sequences into two types: normally related and distantly related. For normally related sequences, it uses an adaptive approach to construct the guide tree needed for progressive alignment: it first estimates the input's discrepancy by computing the standard deviation of their percent identities and, based on this estimate, chooses the better method to construct the guide tree. For distantly related sequences, PnpProbs abandons the guide tree and instead uses a non-progressive alignment method to generate the alignment. RESULTS: To evaluate PnpProbs, we compared it with thirteen other popular MSA tools; PnpProbs had the best alignment scores in all but one test. We also used it for phylogenetic analysis and found that the phylogenetic trees constructed from PnpProbs' alignments were closest to the model trees. CONCLUSIONS: By combining the strengths of progressive and non-progressive alignment methods, we have developed an MSA tool called PnpProbs. In our comparison with thirteen other popular MSA tools, it usually constructed the best alignments.
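The dispatch rule described in the abstract — mean percent identity to detect distantly related inputs, standard deviation of percent identities to estimate discrepancy — can be sketched as below. The cutoffs and the names of the two guide-tree methods are hypothetical placeholders, not PnpProbs's actual parameters.

```python
# Sketch of a PnpProbs-style dispatcher: distantly related inputs skip the
# guide tree entirely; otherwise the discrepancy (std. dev. of pairwise
# percent identities) picks between two guide-tree construction methods.

from itertools import combinations
from statistics import mean, pstdev

def percent_identity(a, b):
    """Crude ungapped identity over the aligned prefix (a simplification)."""
    n = min(len(a), len(b))
    return 100.0 * sum(x == y for x, y in zip(a, b)) / n if n else 0.0

def choose_strategy(seqs, distant_cutoff=30.0, discrepancy_cutoff=15.0):
    """Cutoffs and method names are illustrative, not the tool's real values."""
    ids = [percent_identity(a, b) for a, b in combinations(seqs, 2)]
    if mean(ids) < distant_cutoff:
        return "non-progressive"       # distantly related: abandon the guide tree
    if pstdev(ids) > discrepancy_cutoff:
        return "guide-tree-method-A"   # high discrepancy (hypothetical choice)
    return "guide-tree-method-B"       # low discrepancy (hypothetical choice)
```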


Subjects
Algorithms; Phylogeny; Sequence Alignment/methods; Amino Acid Sequence; Computer Simulation; Databases, Protein; Software; Time Factors
9.
BMC Genomics ; 17 Suppl 5: 499, 2016 08 31.
Article in English | MEDLINE | ID: mdl-27586129

ABSTRACT

BACKGROUND: De novo genome assembly using NGS data remains a computation-intensive task, especially for large genomes. In practice, efficiency is often a primary concern and favors using a more efficient assembler like SOAPdenovo2. Yet SOAPdenovo2, based on the de Bruijn graph, fails to take full advantage of longer NGS reads (say, 150 bp to 250 bp from Illumina HiSeq and MiSeq). Assemblers based on string graphs (e.g., SGA), though less popular and also very slow, are more favorable for longer reads. METHODS: This paper presents a new de novo assembler called BASE. It enhances the classic seed-extension approach by indexing the reads efficiently to generate adaptive seeds that have a high probability of appearing uniquely in the genome. Such seeds form the basis for BASE to build extension trees and then use reverse validation to remove branches based on read coverage and paired-end information, resulting in high-quality consensus sequences of the reads sharing the seeds. These consensus sequences are then extended to contigs. RESULTS: Experiments on two bacterial and four human datasets show the advantage of BASE in both contig quality and speed in dealing with longer reads. In the experiment on bacteria, two datasets with read lengths of 100 bp and 250 bp were used. Especially for the 250 bp dataset, BASE gives much better quality than SOAPdenovo2 and SGA and is similar to SPAdes. Regarding speed, BASE is consistently a few times faster than SPAdes and SGA, but still slower than SOAPdenovo2. BASE and SOAPdenovo2 were further compared using human datasets with read lengths of 100 bp, 150 bp and 250 bp. BASE shows a higher N50 for all datasets, and the improvement becomes more significant as read length reaches 250 bp. Besides, BASE is more memory-efficient than SOAPdenovo2 when the sequencing data contain errors. CONCLUSIONS: BASE is a practically efficient tool for constructing contigs, with significant improvement in quality for long NGS reads.
It is relatively easy to extend BASE to include scaffolding.
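The "adaptive seed" idea above — extend a seed until it occurs (near-)uniquely — can be sketched with a toy substring index. The dictionary index and the minimum seed length are stand-ins for BASE's real read index and parameters.

```python
# Sketch of adaptive seeding: starting at a fixed position in a read, grow the
# seed until it occurs at most once in the indexed sequence. A plain dict of
# substring counts stands in for the tool's real (efficient) index.

def build_index(text, max_len):
    """Toy index: count occurrences of every substring up to max_len."""
    counts = {}
    for i in range(len(text)):
        for L in range(1, min(max_len, len(text) - i) + 1):
            sub = text[i:i + L]
            counts[sub] = counts.get(sub, 0) + 1
    return counts

def adaptive_seed(read, start, index, min_len=4):
    """Extend from `start` until the seed is unique in the index or the read ends."""
    end = start + min_len
    while end <= len(read) and index.get(read[start:end], 0) > 1:
        end += 1
    return read[start:end]
```

Seeds that land uniquely make the subsequent extension-tree step far less ambiguous, which is the motivation the abstract gives for adaptive (rather than fixed-length) seeds.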


Subjects
High-Throughput Nucleotide Sequencing; Sequence Analysis, DNA/methods; Algorithms; Humans; Software; Staphylococcus aureus/genetics; Vibrio parahaemolyticus/genetics
10.
Methods ; 102: 3-11, 2016 06 01.
Article in English | MEDLINE | ID: mdl-27012178

ABSTRACT

The study of metagenomics has benefited greatly from low-cost, high-throughput sequencing technologies, yet the tremendous amount of data generated makes analyses such as de novo assembly consume excessive computational resources. In late 2014 we released MEGAHIT v0.1 (together with a brief note, Li et al. (2015) [1]), the first NGS metagenome assembler that can assemble genome sequences from metagenomic datasets of hundreds of giga base-pairs (Gbp) in a time- and memory-efficient manner on a single server. The core of MEGAHIT is an efficient parallel algorithm for constructing succinct de Bruijn graphs (SdBG), implemented on a graphics processing unit (GPU). The software has been well received by the assembly community, and there is interest in how to adapt the algorithms to integrate popular assembly practices so as to improve assembly quality, as well as how to speed up the software using better CPU-based algorithms (instead of the GPU). In this paper we first describe the details of the core algorithms in MEGAHIT v0.1, and then present the new modules that upgrade MEGAHIT to version v1.0, which gives better assembly quality, runs faster and uses less memory. For the Iowa Prairie Soil dataset (252 Gbp after quality trimming), the assembly quality of MEGAHIT v1.0, compared with v0.1, shows a significant improvement, namely, a 36% increase in assembly size and 23% in N50. More interestingly, MEGAHIT v1.0 is no slower than before (even running with the extra modules). This is primarily due to a new CPU-based algorithm for SdBG construction that is faster and requires less memory. Using CPU only, MEGAHIT v1.0 can assemble the Iowa Prairie Soil sample in about 43 h, reducing the running time of v0.1 by at least 25% and memory usage by up to 50%. With its smaller memory footprint, MEGAHIT v1.0 can process even larger datasets.
The Kansas Prairie Soil sample (484 Gbp), the largest publicly available dataset, can now be assembled using no more than 500 GB of memory in 7.5 days. The assemblies of these datasets (and other large metagenomic datasets), as well as the software, are available at the website https://hku-bal.github.io/megabox.
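For readers unfamiliar with the data structure at MEGAHIT's core, a de Bruijn graph links overlapping (k-1)-mers and can track k-mer multiplicities. A toy dictionary-based builder is below; the real assembler uses a succinct (SdBG) representation constructed in parallel, which this sketch does not attempt to reproduce.

```python
# Toy de Bruijn graph with k-mer multiplicities: each k-mer contributes one
# edge from its (k-1)-length prefix to its (k-1)-length suffix.

from collections import defaultdict

def build_dbg(reads, k):
    """Map each (k-1)-mer node to {successor (k-1)-mer: edge multiplicity}."""
    edges = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]][kmer[1:]] += 1
    return edges
```

Multiplicities like these are what allow an assembler to distinguish well-supported edges from likely sequencing errors, the capability the MegaGTA abstract above also highlights.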


Subjects
Metagenome; Sequence Analysis/methods; Software; Algorithms; Datasets as Topic; Metagenomics/methods; Soil
11.
PLoS One ; 10(9): e0136264, 2015.
Article in English | MEDLINE | ID: mdl-26368134

ABSTRACT

The problem of finding k-edge-connected components is fundamental in computer science. Given a graph G = (V, E), the problem is to partition the vertex set V into {V1, V2, …, Vh}, where each Vi is maximal, such that for any two vertices x and y in Vi, there are k edge-disjoint paths connecting them. In this paper, we present an algorithm that solves this problem for all k. The algorithm preprocesses the input graph to construct an auxiliary graph storing information about the edge connectivity of every vertex pair, in O(Fn) time, where F is the time complexity of finding the maximum flow between two vertices in G and n = |V|. For any value of k, the k-edge-connected components can then be determined by traversing the auxiliary graph in O(n) time. The input graph can be directed or undirected, a simple graph or a multigraph. Previous work on this problem mainly focused on fixed values of k.
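The definition can be illustrated with a brute-force baseline: compute pairwise edge connectivity by max-flow (edge-disjoint paths equal the max flow under unit capacities, by Menger's theorem) and group vertices whose connectivity is at least k. This quadratic sketch is what the paper's auxiliary-graph preprocessing avoids; it is not the paper's algorithm.

```python
# Brute-force k-edge-connected components for an undirected graph on
# vertices 0..n-1: Edmonds-Karp max-flow per pair, then grouping.
# Grouping pairwise is valid because edge connectivity satisfies
# lambda(x, z) >= min(lambda(x, y), lambda(y, z)).

from collections import deque

def edge_connectivity(n, edges, s, t):
    """Max number of edge-disjoint s-t paths (unit-capacity max flow)."""
    cap, nbrs = {}, [set() for _ in range(n)]
    for u, v in edges:
        cap[(u, v)] = cap.get((u, v), 0) + 1
        cap[(v, u)] = cap.get((v, u), 0) + 1
        nbrs[u].add(v)
        nbrs[v].add(u)
    flow = 0
    while True:
        parent, q = {s: None}, deque([s])
        while q and t not in parent:            # BFS for an augmenting path
            u = q.popleft()
            for v in nbrs[u]:
                if v not in parent and cap.get((u, v), 0) > 0:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        v = t
        while parent[v] is not None:            # push one unit along the path
            u = parent[v]
            cap[(u, v)] -= 1
            cap[(v, u)] = cap.get((v, u), 0) + 1
            v = u
        flow += 1

def k_edge_components(n, edges, k):
    """Label vertices so that labels agree iff pairwise connectivity >= k."""
    comp = list(range(n))
    for x in range(n):
        for y in range(x + 1, n):
            if comp[x] != comp[y] and edge_connectivity(n, edges, x, y) >= k:
                old, new = comp[y], comp[x]
                comp = [new if c == old else c for c in comp]
    return comp
```

Each call recomputes a max flow from scratch, which is exactly the O(n²) redundancy the paper's O(Fn) preprocessing plus O(n) queries eliminates.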


Subjects
Algorithms; Informatics/methods
12.
Article in English | MEDLINE | ID: mdl-26357079

ABSTRACT

This paper introduces a simple and effective approach to improve the accuracy of multiple sequence alignment. We use a natural measure to estimate the similarity of the input sequences, and based on this measure, we align the input sequences differently. For example, for inputs with high similarity, we consider the whole sequences and align them globally, while for those with moderately low similarity, we may ignore the flanking regions and align them locally. To test the effectiveness of this approach, we implemented a multiple sequence alignment tool called GLProbs and compared its performance with about a dozen leading alignment tools on three benchmark alignment databases; GLProbs's alignments have the best scores in almost all tests. We also evaluated the practicality of GLProbs's alignments by applying the tool to three biological applications, namely phylogenetic tree construction, protein secondary structure prediction and the detection of high-risk members of the HPV-E6 family for cervical cancer, and the results are very encouraging.


Subjects
Computational Biology/methods; Sequence Alignment/methods; Sequence Analysis, Protein/methods; Software; Algorithms; Amino Acid Sequence; Markov Chains; Molecular Sequence Data; Phylogeny; Protein Structure, Secondary; Proteins/chemistry; Proteins/classification
13.
BMC Bioinformatics ; 16 Suppl 5: S4, 2015.
Article in English | MEDLINE | ID: mdl-25859903

ABSTRACT

Progressive sequence alignment is one of the most commonly used methods for multiple sequence alignment. Roughly speaking, the method first builds a guide tree and then aligns the sequences progressively according to the topology of the tree. Guide trees are believed to be very important to progressive alignment: a better guide tree gives an alignment with higher accuracy. Recently, we proposed an adaptive method for constructing guide trees. This paper studies the quality of the guide trees constructed by this method. Our study shows that the adaptive method can be used to improve the accuracy of many different progressive MSA tools. In fact, we give evidence that the guide trees constructed by the adaptive method are among the best.


Subjects
Computational Biology/methods; Sequence Alignment/methods; Sequence Analysis, DNA; Computer Simulation; Databases, Genetic; Molecular Evolution; Humans; Phylogeny; Software
14.
PLoS One ; 8(5): e65632, 2013.
Article in English | MEDLINE | ID: mdl-23741504

ABSTRACT

To tackle the exponentially increasing throughput of next-generation sequencing (NGS), most existing short-read aligners can be configured to favor speed at the expense of accuracy and sensitivity. SOAP3-dp, by leveraging the computational power of both CPU and GPU with optimized algorithms, delivers high speed and sensitivity simultaneously. Compared with widely adopted aligners including BWA, Bowtie2, SeqAlto, CUSHAW2, GEM and the GPU-based aligners BarraCUDA and CUSHAW, SOAP3-dp was found to be two to tens of times faster, while maintaining the highest sensitivity and lowest false discovery rate (FDR) on Illumina reads of different lengths. Transcending its predecessor SOAP3, which does not allow gapped alignment, SOAP3-dp by default tolerates alignment similarity as low as 60%. Evaluation on real human genome data demonstrates SOAP3-dp's power to enable more authentic variants and longer indels to be discovered. Fosmid sequencing shows a 9.1% FDR on newly discovered deletions. SOAP3-dp natively supports the BAM file format and provides the same scoring scheme as BWA, which enables it to be integrated into existing analysis pipelines. SOAP3-dp has been deployed on Amazon EC2, NIH Biowulf and Tianhe-1A.


Subjects
Sequence Alignment/methods; Sequence Analysis, DNA/methods; Software; Computational Biology/methods; High-Throughput Nucleotide Sequencing; Humans; Reproducibility of Results; Sensitivity and Specificity; Workflow