1.
Brief Bioinform ; 24(6)2023 09 22.
Article in English | MEDLINE | ID: mdl-37930023

ABSTRACT

Local associations refer to spatial-temporal correlations that emerge from the biological realm, such as time-dependent gene co-expression or seasonal interactions between microbes. One can reveal the intricate dynamics and inherent interactions of biological systems by examining the biological time series data for these associations. To accomplish this goal, local similarity analysis algorithms and statistical methods that facilitate the local alignment of time series and assess the significance of the resulting alignments have been developed. Although these algorithms were initially devised for gene expression analysis from microarrays, they have been adapted and accelerated for multi-omics next generation sequencing datasets, achieving high scientific impact. In this review, we present an overview of the historical developments and recent advances for local similarity analysis algorithms, their statistical properties, and real applications in analyzing biological time series data. The benchmark data and analysis scripts used in this review are freely available at http://github.com/labxscut/lsareview.


Subjects
Algorithms, Gene Expression Profiling, Time Factors, Gene Expression Profiling/methods, High-Throughput Nucleotide Sequencing, Benchmarking
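
The local alignment of two time series described in the abstract above can be illustrated with a small dynamic-programming sketch. This is a simplified illustration, not the code from the review: it assumes both series are already normalized, it omits the significance testing step, and the function name `local_similarity` and the `max_delay` parameter are hypothetical names chosen for this example.

```python
def local_similarity(x, y, max_delay=0):
    """Signed local similarity score of two equal-length, pre-normalized series.

    Illustrative sketch only: normalization and significance testing are
    assumed to happen elsewhere.
    """
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("series must be non-empty and of equal length")
    # pos[i][j] / neg[i][j]: best positively / negatively correlated
    # local alignment score ending at x[i-1], y[j-1].
    pos = [[0.0] * (n + 1) for _ in range(n + 1)]
    neg = [[0.0] * (n + 1) for _ in range(n + 1)]
    best_pos = best_neg = 0.0
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            if abs(i - j) > max_delay:
                continue  # only allow time shifts up to max_delay
            prod = x[i - 1] * y[j - 1]
            pos[i][j] = max(0.0, pos[i - 1][j - 1] + prod)
            neg[i][j] = max(0.0, neg[i - 1][j - 1] - prod)
            best_pos = max(best_pos, pos[i][j])
            best_neg = max(best_neg, neg[i][j])
    score = best_pos if best_pos >= best_neg else -best_neg
    return score / n
```

With `max_delay=0` only aligned time points are compared; a larger value lets the score pick up associations that appear with a time lag.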
2.
Cereb Cortex ; 34(1)2024 01 14.
Article in English | MEDLINE | ID: mdl-38037843

ABSTRACT

Human brain structure shows heterogeneous patterns of change across adult aging and is associated with cognition. However, the relationship between cortical structural changes during aging and gene transcription signatures remains unclear. Here, using structural magnetic resonance imaging data of two separate cohorts of healthy participants from the Cambridge Centre for Aging and Neuroscience (n = 454, 18-87 years) and Dallas Lifespan Brain Study (n = 304, 20-89 years) and a transcriptome dataset, we investigated the link between cortical morphometric similarity network and brain-wide gene transcription. In both cohorts, we found reproducible morphometric similarity network change patterns: decreased morphological similarity with age in cognition-related areas (mainly located in superior frontal and temporal cortices), and increased morphological similarity in sensorimotor-related areas (postcentral and lateral occipital cortices). Changes in the morphometric similarity network showed significant spatial correlation with the expression of age-related genes enriched in synapse-related biological processes, with synaptic abnormalities likely accounting for cognitive decline. Transcription changes in astrocytes, microglia, and neuronal cells explained most of the age-related morphometric similarity network changes, which suggests potential intervention and therapeutic targets for cognitive decline. Taken together, by linking gene transcription signatures to the cortical morphometric similarity network, our findings might provide molecular and cellular substrates for cortical structural changes related to cognitive decline across adult aging.


Subjects
Aging, Brain, Adult, Humans, Brain/physiology, Aging/physiology, Cognition/physiology, Temporal Lobe, Magnetic Resonance Imaging/methods
3.
Cereb Cortex ; 34(1)2024 01 14.
Article in English | MEDLINE | ID: mdl-38044469

ABSTRACT

Brain function changes affect cognitive functions in older adults, yet the relationship between cognition and the dynamic changes of brain networks during naturalistic stimulation is not clear. Here, we recruited young, middle-aged and older groups from the Cambridge Center for Aging and Neuroscience to investigate the relationship between dynamic metrics of brain networks and cognition using functional magnetic resonance imaging data during movie-watching. We found six reliable co-activation pattern (CAP) states of brain networks, grouped into three pairs with opposite activation patterns, in the three age groups. Compared with young and middle-aged adults, older adults dwelled for a shorter time in CAP state 4, with deactivated default mode network (DMN) and activated salience, frontoparietal and dorsal-attention networks (DAN), and for a longer time in state 6, with deactivated DMN and activated DAN and visual network, suggesting that altered dynamic interaction between DMN and other brain networks might contribute to cognitive decline in older adults. Meanwhile, older adults transitioned more easily from state 6 to state 3 (activated DMN and deactivated sensorimotor network), suggesting that the fragile antagonism between DMN and other cognitive networks might contribute to cognitive decline in older adults. Our findings provide novel insights into aberrant brain network dynamics associated with cognitive decline.


Subjects
Brain, Magnetic Resonance Imaging, Magnetic Resonance Imaging/methods, Brain/diagnostic imaging, Brain/physiology, Cognition/physiology, Brain Mapping, Nerve Net/diagnostic imaging, Nerve Net/physiology
4.
PLoS Comput Biol ; 19(10): e1010608, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37844077

ABSTRACT

Heterogeneity in different genomic studies compromises the performance of machine learning models in cross-study phenotype predictions. Overcoming heterogeneity when incorporating different studies in terms of phenotype prediction is a challenging and critical step for developing machine learning algorithms with reproducible prediction performance on independent datasets. We investigated the best approaches to integrate different studies of the same type of omics data under a variety of different heterogeneities. We developed a comprehensive workflow to simulate a variety of different types of heterogeneity and evaluate the performances of different integration methods together with batch normalization by using ComBat. We also demonstrated the results through realistic applications on six colorectal cancer (CRC) metagenomic studies and six tuberculosis (TB) gene expression studies, respectively. We showed that heterogeneity in different genomic studies can markedly reduce the machine learning classifier's reproducibility. ComBat normalization improved the prediction performance of the machine learning classifier when heterogeneous populations were present, and could successfully remove batch effects within the same population. We also showed that the machine learning classifier's prediction accuracy can decrease markedly as the underlying disease model becomes more different between training and test populations. Comparing different merging and integration methods, we found that merging and integration methods can outperform each other in different scenarios. In the realistic applications, we observed that the prediction accuracy improved when applying ComBat normalization with merging or integration methods in both CRC and TB studies. We illustrated that batch normalization is essential for mitigating both population differences of different studies and batch effects. We also showed that both the merging strategy and integration methods can achieve good performance when combined with batch normalization. In addition, we explored the potential of boosting phenotype prediction performance by rank aggregation methods and showed that rank aggregation methods performed similarly to other ensemble learning approaches.


Subjects
Algorithms, Machine Learning, Reproducibility of Results, Genomics, Phenotype
5.
Cereb Cortex ; 33(13): 8645-8653, 2023 06 20.
Article in English | MEDLINE | ID: mdl-37143182

ABSTRACT

Sex differences in episodic memory (EM), remembering past events based on when and where they occurred, have been reported, but the neural mechanisms are unclear. T1-weighted images of 111 females and 61 males were acquired from the Dallas Lifespan Brain Study. Using surface-based morphometry and structural covariance (SC) analysis, we constructed structural covariance networks (SCN) based on cortical volume, and the global efficiency (Eglob) was computed to characterize network integration. The relationship between SCN and EM was examined by SC analysis among the top-n brain regions that were most relevant to EM performance. The number of SC connections (females: 3306; males: 437, P = 0.0212) and Eglob (females: 0.1845; males: 0.0417, P = 0.0408) of SCN in females were higher than those in males. The top-n brain regions with the strongest SC in females were located in the auditory network, cingulo-opercular network (CON), and default mode network (DMN), and in males, they were located in the frontoparietal network, CON, and DMN. These results confirmed that the Eglob of SCN was higher in females than in males; sex differences in EM performance might therefore be related to differences in network-level integration. Our study highlights the importance of sex as a research variable in brain science.


Subjects
Episodic Memory, Humans, Male, Female, Sex Characteristics, Brain, Magnetic Resonance Imaging, Brain Mapping
6.
Proc Natl Acad Sci U S A ; 118(36)2021 09 07.
Article in English | MEDLINE | ID: mdl-34480002

ABSTRACT

We propose a deep learning-based knockoffs inference framework, DeepLINK, that guarantees the false discovery rate (FDR) control in high-dimensional settings. DeepLINK is applicable to a broad class of covariate distributions described by the possibly nonlinear latent factor models. It consists of two major parts: an autoencoder network for the knockoff variable construction and a multilayer perceptron network for feature selection with the FDR control. The empirical performance of DeepLINK is investigated through extensive simulation studies, where it is shown to achieve FDR control in feature selection with both high selection power and high prediction accuracy. We also apply DeepLINK to three real data applications to demonstrate its practical utility.


Subjects
Computational Biology/methods, Deep Learning, Genomics, Algorithms, Computer Simulation, Computer Neural Networks
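
The feature selection step with FDR control mentioned in the abstract above can be illustrated with the standard knockoff+ threshold from the general knockoff filter literature. This is a minimal sketch, not DeepLINK's code: it assumes the knockoff statistics `w` (one per feature, large positive values favoring the original feature over its knockoff) have already been computed by some model, and the function name `knockoff_plus_select` is a hypothetical name for this example.

```python
def knockoff_plus_select(w, q=0.2):
    """Return indices of features selected at target FDR level q.

    w: list of knockoff statistics, one per feature (assumed precomputed).
    Uses the knockoff+ rule: pick the smallest threshold t such that
    (1 + #{w_j <= -t}) / max(1, #{w_j >= t}) <= q.
    """
    candidates = sorted({abs(v) for v in w if v != 0})
    for t in candidates:
        false_estimate = 1 + sum(1 for v in w if v <= -t)
        selected = sum(1 for v in w if v >= t)
        if false_estimate / max(1, selected) <= q:
            return [j for j, v in enumerate(w) if v >= t]
    return []  # no threshold meets the target level
```

How the statistics are built (for example, from autoencoder-generated knockoffs and a multilayer perceptron, as the abstract outlines) determines the power of the procedure; the thresholding rule itself is independent of that choice.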
7.
Bioinformatics ; 38(11): 2973-2979, 2022 05 26.
Article in English | MEDLINE | ID: mdl-35482530

ABSTRACT

MOTIVATION: Metagenomic binning aims to retrieve microbial genomes directly from ecosystems by clustering metagenomic contigs assembled from short reads into draft genomic bins. Traditional shotgun-based binning methods depend on the contigs' composition and abundance profiles and are impaired by the paucity of samples needed to construct reliable co-abundance profiles. When applied to a single sample, shotgun-based binning methods struggle to distinguish closely related species using only composition information. As an alternative binning approach, Hi-C-based binning employs the metagenomic Hi-C technique to measure the proximity contacts between metagenomic fragments. However, spurious inter-species Hi-C contacts, inevitably generated by incorrect ligations of DNA fragments between species, link contigs from different genomes, weakening the purity of the final draft genomic bins. Therefore, it is imperative to develop a binning pipeline to overcome the shortcomings of both types of binning methods on a single sample. RESULTS: We develop HiFine, a novel binning pipeline to refine the binning results of metagenomic contigs by integrating both Hi-C-based and shotgun-based binning tools. HiFine designs a strategy of fragmentation for the original bin sets derived from the Hi-C-based and shotgun-based binning methods, which considerably increases the purity of initial bins, followed by merging fragmented bins and recruiting unbinned contigs. We demonstrate that HiFine significantly improves the existing binning results of both types of binning methods and achieves better performance in constructing species genomes on publicly available datasets. To the best of our knowledge, HiFine is the first pipeline to integrate different types of tools for the binning of metagenomic contigs. AVAILABILITY AND IMPLEMENTATION: HiFine is available at https://github.com/dyxstat/HiFine. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Ecosystem, Metagenomics, Metagenomics/methods, Metagenome, Cluster Analysis, Microbial Genome, Algorithms, DNA Sequence Analysis/methods
8.
Bioinformatics ; 38(Suppl 1): i45-i52, 2022 06 24.
Article in English | MEDLINE | ID: mdl-35758806

ABSTRACT

MOTIVATION: Phage-host associations play important roles in microbial communities. But in natural communities, where phages are discovered and characterized metagenomically rather than through culture-based lab studies, their hosts are generally not known. Several programs have been developed for predicting which phage infects which host based on various sequence similarity measures or machine learning approaches. These are often based on whole viral and host genomes, but in metagenomics-based studies, we rarely have whole genomes but rather must rely on contigs that are sometimes only hundreds of bp long. Therefore, we need programs that predict hosts of phage contigs on the basis of these short contigs. Although most existing programs can be applied to metagenomic datasets for these predictions, their accuracies are generally low. Here, we develop ContigNet, a convolutional neural network-based model capable of predicting phage-host matches based on relatively short contigs, and compare it to the previously published VirHostMatcher (VHM) and WIsH. RESULTS: On the validation set, ContigNet achieves 72-85% area under the receiver operating characteristic curve (AUROC) scores, compared to a maximum of 68% by VHM or WIsH, for contigs of lengths between 200 bp and 50 kbp. We also apply the model to the Metagenomic Gut Virus (MGV) catalogue, a dataset containing a wide range of draft genomes from metagenomic samples, and achieve 60-70% AUROC scores compared to 52% for VHM and WIsH. Surprisingly, ContigNet can also be used to predict plasmid-host contig associations with high accuracy, indicating a similar genetic exchange between mobile genetic elements and their hosts. AVAILABILITY AND IMPLEMENTATION: The source code of ContigNet and related datasets can be downloaded from https://github.com/tianqitang1/ContigNet.


Subjects
Bacteriophages, Bacteria/genetics, Bacteriophages/genetics, Metagenome, Metagenomics, Computer Neural Networks
9.
PLoS Comput Biol ; 18(7): e1010184, 2022 07.
Article in English | MEDLINE | ID: mdl-35830390

ABSTRACT

Confounding factors exist widely in various biological data owing to technical variations, population structures and experimental conditions. Such factors may mask the true signals and lead to spurious associations in the respective biological data, making it necessary to adjust for confounding factors accordingly. However, existing confounder correction methods were mainly developed based on the original data or the pairwise Euclidean distance, neither of which is adequate for analyzing different types of data, such as sequencing data. In this work, we proposed a method called Adjustment for Confounding factors using Principal Coordinate Analysis, or AC-PCoA, which reduces data dimension and extracts the information from different distance measures using principal coordinate analysis, and adjusts for confounding factors across multiple datasets by minimizing the associations between lower-dimensional representations and confounding variables. Application of the proposed method was further extended to classification and prediction. We demonstrated the efficacy of AC-PCoA on three simulated datasets and five real datasets. Compared to the existing methods, AC-PCoA shows better results in visualization, statistical testing, clustering, and classification.


Subjects
Research Design, Epidemiologic Confounding Factors
10.
Neuroimage ; 255: 119166, 2022 07 15.
Article in English | MEDLINE | ID: mdl-35398282

ABSTRACT

Magnetic Resonance Imaging (MRI) technology has been increasingly used in neuroscience studies. Reproducibility of statistically significant findings generated by MRI-based studies, especially association studies (phenotype vs. MRI metric) and task-induced brain activation, has recently been heavily debated. However, most currently available reproducibility measures depend on thresholds for the test statistics and cannot be used to evaluate overall study reproducibility. It is also crucial to elucidate the relationship between overall study reproducibility and sample size in an experimental design. In this study, we proposed a model-based reproducibility index to quantify reproducibility, which could be used in large-scale high-throughput MRI-based studies including both association studies and task-induced brain activation. We performed the model-based reproducibility assessments for a few association studies and task-induced brain activation by using several recent large sMRI/fMRI databases. For large sample size association studies between brain structure/function features and some basic physiological phenotypes (i.e. sex, BMI), we demonstrated that the model-based reproducibility of these studies is more than 0.99. For MID task activation, similar results could be observed. Furthermore, we proposed a model-based analytical tool to evaluate the minimal sample size needed to achieve a desirable model-based reproducibility. Additionally, we evaluated the model-based reproducibility of gray matter volume (GMV) changes for UK Biobank (UKB) vs. Parkinson Progression Marker Initiative (PPMI) and UK Biobank (UKB) vs. Human Connectome Project (HCP). We demonstrated that both sample size and study-specific experimental factors play important roles in the model-based reproducibility assessments for different experiments. In summary, a systematic assessment of reproducibility is fundamental and important in the current large-scale high-throughput MRI-based studies.


Subjects
Connectome, Magnetic Resonance Imaging, Brain/diagnostic imaging, Gray Matter, Humans, Magnetic Resonance Imaging/methods, Reproducibility of Results
11.
Brief Bioinform ; 21(3): 777-790, 2020 05 21.
Article in English | MEDLINE | ID: mdl-30860572

ABSTRACT

In metagenomic studies of microbial communities, the short reads come from mixtures of genomes. Read assembly is usually an essential first step for the follow-up studies in metagenomic research. Understanding the power and limitations of various read assembly programs in practice is important for researchers to choose which programs to use in their investigations. Many studies evaluating different assembly programs used either simulated metagenomes or real metagenomes with unknown genome compositions. However, the simulated datasets may not reflect the real complexities of metagenomic samples and the estimated assembly accuracy could be misleading due to the unknown genomes in real metagenomes. Therefore, hybrid strategies are required to evaluate the various read assemblers for metagenomic studies. In this paper, we benchmark the metagenomic read assemblers by mixing reads from real metagenomic datasets with reads from known genomes and evaluating the integrity, contiguity and accuracy of the assembly using the reads from the known genomes. We selected four advanced metagenome assemblers, MEGAHIT, MetaSPAdes, IDBA-UD and Faucet, for evaluation. We showed the strengths and weaknesses of these assemblers in terms of integrity, contiguity and accuracy for different variables, including the genetic difference of the real genomes with the genome sequences in the real metagenomic datasets and the sequencing depth of the simulated datasets. Overall, MetaSPAdes performs best in terms of integrity and contiguity at the species level, followed by MEGAHIT. Faucet performs best in terms of accuracy at the cost of the worst integrity and contiguity, especially at low sequencing depth. MEGAHIT has the highest genome fractions at the strain level and MetaSPAdes has the overall best performance at the strain level. MEGAHIT is the most efficient in our experiments. Availability: The source code is available at https://github.com/ziyewang/MetaAssemblyEval.


Subjects
Computational Biology/methods, Metagenomics, Algorithms, Datasets as Topic, High-Throughput Nucleotide Sequencing/methods, Microbiota/genetics
12.
Bioinformatics ; 37(6): 759-766, 2021 05 05.
Article in English | MEDLINE | ID: mdl-33119059

ABSTRACT

MOTIVATION: The rapid development of sequencing technologies has enabled us to generate a large number of metagenomic reads from genetic materials in microbial communities, making it possible to gain deep insights into the differences between the genetic materials of different groups of microorganisms, such as bacteria, viruses, plasmids, etc. Computational methods based on k-mer frequencies have been shown to be highly effective for classifying metagenomic sequencing reads into different groups. However, such methods usually use all the k-mers as features for prediction without selecting relevant k-mers for the different groups of sequences, i.e. unique nucleotide patterns containing biological significance. RESULTS: To select k-mers for distinguishing different groups of sequences with guaranteed false discovery rate (FDR) control, we develop KIMI, a general framework based on model-X Knockoffs, regarded as the state-of-the-art statistical method for FDR control, for sequence motif discovery with arbitrary target FDR level, such that reproducibility can be theoretically guaranteed. KIMI is shown through simulation studies to be effective in simultaneously controlling FDR and yielding high power, outperforming the broadly used Benjamini-Hochberg procedure and the q-value method for FDR control. To illustrate the usefulness of KIMI in analyzing real datasets, we take the viral motif discovery problem as an example and implement KIMI on a real dataset consisting of viral and bacterial contigs. We show that the accuracy of predicting viral and bacterial contigs can be increased by training the prediction model only on relevant k-mers selected by KIMI. AVAILABILITY AND IMPLEMENTATION: Our implementation of KIMI is available at https://github.com/xinbaiusc/KIMI. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Metagenomics, Microbiota, Algorithms, Computer Simulation, Metagenome, Reproducibility of Results, DNA Sequence Analysis
13.
Bioinformatics ; 37(2): 155-161, 2021 04 19.
Article in English | MEDLINE | ID: mdl-32766810

ABSTRACT

MOTIVATION: Rapid developments in sequencing technologies have boosted the generation of high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. RESULTS: We report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high throughput sequencing data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With a 10^2- to 10^4-fold reduction of storage space, CRAFT performs fast queries for gigabytes of data within seconds or minutes, achieving performance comparable to six state-of-the-art alignment-free measures. AVAILABILITY AND IMPLEMENTATION: CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/CRAFT. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms, Software, Genomics, High-Throughput Nucleotide Sequencing, DNA Sequence Analysis
14.
Virol J ; 19(1): 114, 2022 06 28.
Article in English | MEDLINE | ID: mdl-35765099

ABSTRACT

BACKGROUND: Chronic infection with hepatitis B virus (HBV) has been shown to be strongly associated with the development of hepatocellular carcinoma (HCC). AIMS: The purpose of the study is to investigate the association between HBV preS region quasispecies and HCC development, as well as to develop an HCC diagnosis model using HBV preS region quasispecies. METHODS: A total of 104 chronic hepatitis B (CHB) patients and 117 HBV-related HCC patients were enrolled. The HBV preS region was sequenced using next generation sequencing (NGS) and the nucleotide entropy was calculated for quasispecies evaluation. Sparse logistic regression (SLR) was used to predict HCC development and prediction performances were evaluated using receiver operating characteristic curves. RESULTS: Entropy of the HBV preS1 and preS2 regions and of several nucleotide positions showed significant divergence between CHB and HCC patients. Using SLR, the classification of HCC/CHB groups achieved a mean area under the receiver operating characteristic curve (AUC) of 0.883 in the training data and 0.795 in the test data. The prediction model was also validated by a completely independent dataset from Hong Kong. The 10 selected nucleotide positions showed significantly different entropy between CHB and HCC patients. The HBV quasispecies also classified three clinical parameters, including HBeAg, HBVDNA, and alkaline phosphatase (ALP), with AUC values greater than 0.6 in the test data. CONCLUSIONS: Using NGS and SLR, the association between HBV preS region nucleotide entropy and HCC development was validated in our study, and this could promote the understanding of the HCC progression mechanism.


Subjects
Hepatocellular Carcinoma, Liver Neoplasms, Hepatitis B Surface Antigens/genetics, Hepatitis B virus/genetics, Humans, Logistic Models, Nucleotides, Quasispecies
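
The per-position nucleotide entropy used for quasispecies evaluation in the abstract above can be illustrated as a column-wise Shannon entropy over aligned reads. This is a sketch under stated assumptions, not the study's pipeline: the reads are assumed to be already aligned and of equal length, the logarithm base (2) is an assumption, and the function name `positional_entropy` is hypothetical.

```python
import math
from collections import Counter


def positional_entropy(reads):
    """Shannon entropy (in bits) of the nucleotide distribution at each position.

    reads: list of equal-length, pre-aligned sequences (assumption).
    """
    if not reads:
        raise ValueError("at least one read is required")
    length = len(reads[0])
    if any(len(r) != length for r in reads):
        raise ValueError("reads must be aligned to the same length")
    entropies = []
    for pos in range(length):
        counts = Counter(r[pos] for r in reads)
        total = sum(counts.values())
        # H = sum over nucleotides of p * log2(1 / p)
        h = sum((c / total) * math.log2(total / c) for c in counts.values())
        entropies.append(h)
    return entropies
```

A fully conserved position scores 0 bits, and a position split evenly between two nucleotides scores 1 bit, so higher values mark more diverse sites within a patient's viral population.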
15.
Nucleic Acids Res ; 47(W1): W379-W387, 2019 07 02.
Article in English | MEDLINE | ID: mdl-31106361

ABSTRACT

Automated function prediction (AFP) of proteins is of great significance in biology. AFP can be regarded as a large-scale multi-label classification problem in which a protein can be associated with multiple gene ontology terms as its labels. Based on our GOLabeler, a state-of-the-art method for the third critical assessment of functional annotation (CAFA3), in this paper we propose NetGO, a web server that is able to further improve the performance of large-scale AFP by incorporating massive protein-protein network information. Specifically, the advantages of NetGO are threefold in using network information: (i) NetGO relies on a powerful learning to rank framework from machine learning to effectively integrate both sequence and network information of proteins; (ii) NetGO uses the massive network information of all species (>2000) in STRING (rather than only some specific species) and (iii) NetGO can still use network information to annotate a protein by homology transfer, even if it is not contained in STRING. Separating training and testing data with the same time-delayed settings as CAFA, we comprehensively examined the performance of NetGO. Experimental results have clearly demonstrated that NetGO significantly outperforms GOLabeler and other competing methods. The NetGO web server is freely available at http://issubmission.sjtu.edu.cn/netgo/.


Subjects
Computational Biology/methods, Machine Learning, Molecular Sequence Annotation, Proteins/chemistry, Software, Amino Acid Sequence, Animals, Benchmarking, Protein Databases, Gene Ontology, Humans, Internet, Molecular Models, Plants/genetics, Prokaryotic Cells/metabolism, Protein Interaction Mapping, Proteins/physiology, Sequence Alignment, Protein Sequence Analysis, Amino Acid Sequence Homology, Structure-Activity Relationship
16.
PLoS Genet ; 14(2): e1007206, 2018 02.
Article in English | MEDLINE | ID: mdl-29474353

ABSTRACT

Hepatitis B virus (HBV) infection is a common problem in the world, especially in China. More than 60-80% of hepatocellular carcinoma (HCC) cases can be attributed to HBV infection in high HBV prevalent regions. Although traditional Sanger sequencing has been extensively used to investigate HBV sequences, next generation sequencing (NGS) is becoming more commonly used. Further, it is unknown whether word pattern frequencies of HBV reads from NGS can be used to investigate HBV genotypes and predict HCC status. In this study, we used NGS to sequence the pre-S region of the HBV sequence of 94 HCC patients and 45 chronic HBV (CHB) infected individuals. Word pattern frequencies among the sequence data of all individuals were calculated and compared using the Manhattan distance. The individuals were grouped using principal coordinate analysis (PCoA) and hierarchical clustering. Word pattern frequencies were also used to build prediction models for HCC status using both K-nearest neighbors (KNN) and support vector machine (SVM). We showed the extremely high power of analyzing HBV sequences using word patterns. Our key findings include that the first principal coordinate of the PCoA analysis was highly associated with the fraction of genotype B (or C) sequences and the second principal coordinate was significantly associated with the probability of having HCC. Hierarchical clustering first groups the individuals according to their major genotypes followed by their HCC status. Using cross-validation, high areas under the receiver operating characteristic curve (AUC) of around 0.88 for KNN and 0.92 for SVM were obtained. In the independent data set of 46 HCC patients and 31 CHB individuals, a good AUC score of 0.77 was obtained using SVM. It was further shown that 3000 reads for each individual can yield stable prediction results for SVM. Thus, another key finding is that word patterns can be used to predict HCC status with high accuracy. Therefore, our study shows clearly that word pattern frequencies of HBV sequences contain much information about the composition of different HBV genotypes and the HCC status of an individual.


Subjects
Hepatocellular Carcinoma/virology, Genetic Heterogeneity, Hepatitis B Surface Antigens/genetics, Hepatitis B virus/genetics, Chronic Hepatitis B/virology, Liver Neoplasms/virology, Hepatocellular Carcinoma/epidemiology, Hepatocellular Carcinoma/genetics, DNA Fingerprinting, Viral DNA/analysis, Gene Frequency, Genetic Association Studies/methods, Genotype, Hepatitis B virus/classification, Chronic Hepatitis B/complications, Chronic Hepatitis B/epidemiology, Chronic Hepatitis B/genetics, High-Throughput Nucleotide Sequencing, Humans, Liver Neoplasms/epidemiology, Liver Neoplasms/genetics, Phylogeny, Protein Precursors/genetics
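
The comparison of word pattern frequencies with the Manhattan distance described in the abstract above can be sketched as follows. This is an illustration only, not the study's code: the word length `k`, the choice to skip words containing non-ACGT characters, and the function names `kmer_frequencies` and `manhattan` are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(reads, k=3):
    """Relative frequency of every DNA word of length k across a set of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            word = read[i:i + k]
            if set(word) <= set("ACGT"):  # skip ambiguous bases (assumption)
                counts[word] += 1
    total = sum(counts.values())
    # Fixed lexicographic word order so vectors from different samples align.
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[w] / total if total else 0.0 for w in words]


def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length frequency vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))
```

Computing `manhattan` for every pair of individuals gives a distance matrix that could then be passed to ordination or clustering methods such as those named in the abstract.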
17.
Bioinformatics ; 35(22): 4596-4606, 2019 11 01.
Article in English | MEDLINE | ID: mdl-30993316

ABSTRACT

MOTIVATION: Detecting sequences containing repetitive regions is a basic bioinformatics task with many applications. Several methods have been developed for various types of repeat detection tasks. An efficient generic method for detecting most types of repetitive sequences is still desirable. Inspired by the excellent properties and successful applications of the D2 family of statistics in comparative analyses of genomic sequences, we developed a new statistic D2R that can efficiently discriminate sequences with or without repetitive regions. RESULTS: Using the statistic, we developed an algorithm of linear time and space complexity for detecting most types of repetitive sequences in multiple scenarios, including finding candidate clustered regularly interspaced short palindromic repeats regions from bacterial genomic or metagenomics sequences. Simulation and real data experiments show that the method works well on both assembled sequences and unassembled short reads. AVAILABILITY AND IMPLEMENTATION: The codes are available at https://github.com/XuegongLab/D2R_codes under GPL 3.0 license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Clustered Regularly Interspaced Short Palindromic Repeats , Genomics , Algorithms , Bacterial Genome , Metagenomics
18.
Bioinformatics ; 35(21): 4229-4238, 2019 11 01.
Article in English | MEDLINE | ID: mdl-30977806

ABSTRACT

MOTIVATION: Metagenomic contig binning is an important computational problem in metagenomic research, which aims to cluster contigs from the same genome into the same group. Unlike the classical clustering problem, contig binning can utilize known relationships among some of the contigs or the taxonomic identity of some contigs. However, the current state-of-the-art contig binning methods do not make full use of additional biological information beyond the coverage and sequence composition of the contigs. RESULTS: We developed a novel contig binning method, Semi-supervised Spectral Normalized Cut for Binning (SolidBin), based on semi-supervised spectral clustering. Using sequence feature similarity and/or additional biological information, such as the reliable taxonomy assignments of some contigs, SolidBin constructs two types of prior information: must-link and cannot-link constraints. Must-link constraints mean that the pair of contigs should be clustered into the same group, while cannot-link constraints mean that the pair of contigs should be clustered into different groups. These constraints are then integrated into a classical spectral clustering approach, normalized cut, for improved contig binning. The performance of SolidBin is compared with five state-of-the-art genome binners, CONCOCT, COCACOLA, MaxBin, MetaBAT and BMC3C, on five next-generation sequencing benchmark datasets including simulated multi- and single-sample datasets and real multi-sample datasets. The experimental results show that SolidBin achieved the best performance in terms of F-score, Adjusted Rand Index and Normalized Mutual Information, especially on the real datasets and the single-sample dataset. AVAILABILITY AND IMPLEMENTATION: https://github.com/sufforest/SolidBin. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
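
To make the role of must-link and cannot-link constraints concrete, here is a minimal sketch. It is not SolidBin's actual formulation: it assumes the simplest possible integration, in which constraints overwrite entries of a Gaussian affinity matrix, and then evaluates the standard normalized cut objective for a candidate bipartition. The function names, the one-dimensional toy features, and the constraint pairs are all hypothetical.

```python
import math


def gaussian_affinity(points, sigma=1.0):
    """Pairwise Gaussian similarity between feature vectors (zero diagonal)."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            w[i][j] = w[j][i] = math.exp(-sq_dist / (2.0 * sigma ** 2))
    return w


def apply_constraints(w, must_link, cannot_link):
    """Return a copy of w with constraints written in (an assumed, simplified scheme)."""
    w = [row[:] for row in w]
    for i, j in must_link:
        w[i][j] = w[j][i] = 1.0  # force maximal similarity
    for i, j in cannot_link:
        w[i][j] = w[j][i] = 0.0  # remove the edge entirely
    return w


def normalized_cut(w, group_a):
    """Ncut(A, B) = cut(A, B) / assoc(A, V) + cut(A, B) / assoc(B, V)."""
    n = len(w)
    a = set(group_a)
    b = set(range(n)) - a
    cut = sum(w[i][j] for i in a for j in b)
    assoc_a = sum(w[i][j] for i in a for j in range(n))
    assoc_b = sum(w[i][j] for i in b for j in range(n))
    return cut / assoc_a + cut / assoc_b


# Four toy "contigs" with one hypothetical feature each.
features = [[0.0], [1.0], [2.0], [3.0]]
affinity = gaussian_affinity(features)
# Hypothetical prior knowledge: contigs 1 and 2 look similar but belong to different genomes.
constrained = apply_constraints(affinity, must_link=[(0, 1), (2, 3)], cannot_link=[(1, 2)])

print(normalized_cut(affinity, [0, 1]))     # objective without constraints
print(normalized_cut(constrained, [0, 1]))  # lower objective once constraints are applied
```

In this toy case the constraints make the split {0, 1} versus {2, 3} cheaper under the normalized cut objective, which is the intended effect of adding prior information to a spectral method.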


Subjects
Metagenome , Cluster Analysis , High-Throughput Nucleotide Sequencing , Metagenomics , DNA Sequence Analysis , Software
19.
J Transl Med ; 18(1): 5, 2020 01 06.
Article in English | MEDLINE | ID: mdl-31906978

ABSTRACT

BACKGROUND: Sepsis remains a major challenge in intensive care units, causing unacceptably high mortality rates due to the lack of rapid diagnostic tools with sufficient sensitivity. Therefore, there is an urgent need to replace time-consuming blood cultures with a new method. Ideally, such a method also provides comprehensive profiling of pathogenic bacteria to facilitate the treatment decision. METHODS: We developed a Random Forest with balanced subsampling to screen for pathogenic bacteria and diagnose sepsis based on cell-free DNA (cfDNA) sequencing data in a small blood sample. In addition, we constructed a bacterial co-occurrence network, based on a set of normal and sepsis samples, to infer unobserved bacteria. RESULTS: Based solely on cfDNA sequencing information from three independent datasets of sepsis, we distinguish sepsis from healthy samples with a satisfactory performance. This strategy also provides comprehensive bacteria profiling, permitting doctors to choose the best treatment strategy for a sepsis case. CONCLUSIONS: The combination of sepsis identification and bacteria-inferring strategies is a success for noninvasive cfDNA-based diagnosis, which has the potential to greatly enhance efficiency in disease detection and provide a comprehensive understanding of pathogens. For comparison, where a culture-based analysis of pathogens takes up to 5 days and is effective for only a third to a half of patients, cfDNA sequencing can be completed in just 1 day and our method can identify the majority of pathogens in all patients.


Subjects
Cell-Free Nucleic Acids , Sepsis , Bacteria/genetics , Bacterial DNA/genetics , Humans , Intensive Care Units , Sepsis/diagnosis
20.
BMC Bioinformatics ; 20(1): 53, 2019 Jan 28.
Article in English | MEDLINE | ID: mdl-30691412

ABSTRACT

BACKGROUND: Local similarity analysis (LSA) of time series data has been extensively used to investigate the dynamics of biological systems in a wide range of environments. Recently, a theoretical method was proposed to approximately calculate the statistical significance of local similarity (LS) scores. However, the method assumes that the time series data are independent and identically distributed, an assumption that can be violated in many problems. RESULTS: In this paper, we develop a novel approach to accurately approximate the statistical significance of LSA for dependent time series data using nonparametric kernel estimated long-run variance. We also investigate an alternative method for LSA statistical significance approximation by computing the local similarity score of the residuals based on a predefined statistical model. We show by simulations that both methods have controllable type I errors for dependent time series, while other approaches for statistical significance can be grossly oversized. We apply both methods to human and marine microbial datasets, where most of the possible significant associations are captured and false positives are efficiently controlled. CONCLUSIONS: Our methods provide fast and effective approaches for evaluating statistical significance of dependent time series data with controllable type I error. They can be applied to a variety of time series data to reveal inherent relationships among the different factors.
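
The abstract does not define the LS score itself. As a rough illustration, the sketch below follows one commonly described dynamic-programming formulation: the best-scoring contiguous, possibly time-shifted alignment of two series, with the shift bounded by a maximum delay. It assumes the inputs are already normalized, returns only the magnitude of the score, and does not implement the significance approximations discussed in the abstract. The function name and the toy series are hypothetical.

```python
def local_similarity_score(x, y, max_delay):
    """Magnitude of the best local alignment score between x and y, divided by length.

    pos tracks runs of positively associated products, neg tracks runs of
    negatively associated products; both reset to zero when a run stops paying off.
    """
    n = len(x)
    pos = [[0.0] * (n + 1) for _ in range(n + 1)]
    neg = [[0.0] * (n + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            if abs(i - j) > max_delay:
                continue  # alignments shifted by more than max_delay are not allowed
            product = x[i - 1] * y[j - 1]
            pos[i][j] = max(0.0, pos[i - 1][j - 1] + product)
            neg[i][j] = max(0.0, neg[i - 1][j - 1] - product)
            best = max(best, pos[i][j], neg[i][j])
    return best / n


# y is x delayed by one time step, so allowing a delay of 1 raises the score.
x = [1, 2, 3, 0]
y = [0, 1, 2, 3]
print(local_similarity_score(x, y, max_delay=0))
print(local_similarity_score(x, y, max_delay=1))
```

The delay bound is what distinguishes this from a plain correlation: an association that only appears after a time lag is still picked up, provided the lag does not exceed the bound.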


Subjects
Algorithms , Statistical Models , Aquatic Organisms/microbiology , Databases as Topic , Female , Humans , Male , Microbiota , Time Factors