Results 1 - 20 of 168
1.
Brief Bioinform ; 24(6)2023 09 22.
Article in English | MEDLINE | ID: mdl-37930023

ABSTRACT

Local associations refer to spatial-temporal correlations that emerge from the biological realm, such as time-dependent gene co-expression or seasonal interactions between microbes. One can reveal the intricate dynamics and inherent interactions of biological systems by examining the biological time series data for these associations. To accomplish this goal, local similarity analysis algorithms and statistical methods that facilitate the local alignment of time series and assess the significance of the resulting alignments have been developed. Although these algorithms were initially devised for gene expression analysis from microarrays, they have been adapted and accelerated for multi-omics next generation sequencing datasets, achieving high scientific impact. In this review, we present an overview of the historical developments and recent advances for local similarity analysis algorithms, their statistical properties, and real applications in analyzing biological time series data. The benchmark data and analysis scripts used in this review are freely available at http://github.com/labxscut/lsareview.
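
The local alignment idea described above can be sketched as a small dynamic program. This is a minimal illustration, not the code from the review's repository: it assumes z-score normalization (published implementations typically use a rank-based normal transform), equal-length series with non-zero variance, and a hypothetical `max_delay` parameter bounding the allowed time shift. Significance testing of the score is not covered.

```python
from statistics import mean, pstdev


def zscore(series):
    """Standardize a series to zero mean and unit variance (assumed normalization)."""
    mu, sigma = mean(series), pstdev(series)
    return [(v - mu) / sigma for v in series]


def local_similarity(x, y, max_delay=1):
    """Signed local similarity score of two equal-length time series.

    pos[i][j] / neg[i][j] hold the best positively / negatively associated
    local alignment ending at x[i-1], y[j-1]; only cells with
    |i - j| <= max_delay are filled.
    """
    n = len(x)
    x, y = zscore(x), zscore(y)
    pos = [[0.0] * (n + 1) for _ in range(n + 1)]
    neg = [[0.0] * (n + 1) for _ in range(n + 1)]
    best, sign = 0.0, 1
    for i in range(1, n + 1):
        for j in range(max(1, i - max_delay), min(n, i + max_delay) + 1):
            prod = x[i - 1] * y[j - 1]
            pos[i][j] = max(0.0, pos[i - 1][j - 1] + prod)
            neg[i][j] = max(0.0, neg[i - 1][j - 1] - prod)
            if pos[i][j] > best:
                best, sign = pos[i][j], 1
            if neg[i][j] > best:
                best, sign = neg[i][j], -1
    return sign * best / n
```

With identical inputs and `max_delay=0` the score reaches 1; with a sign-flipped copy it reaches -1.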


Subject(s)
Algorithms, Gene Expression Profiling, Time Factors, Gene Expression Profiling/methods, High-Throughput Nucleotide Sequencing, Benchmarking
2.
Cereb Cortex ; 34(1)2024 01 14.
Article in English | MEDLINE | ID: mdl-38044469

ABSTRACT

Brain function changes affect cognitive functions in older adults, yet the relationship between cognition and the dynamic changes of brain networks during naturalistic stimulation is not clear. Here, we recruited young, middle-aged and older groups from the Cambridge Center for Aging and Neuroscience to investigate the relationship between dynamic metrics of brain networks and cognition using functional magnetic resonance imaging data acquired during movie-watching. We found six reliable co-activation pattern (CAP) states of brain networks, grouped into three pairs with opposite activation patterns, in all three age groups. Compared with young and middle-aged adults, older adults spent less time in CAP state 4, with deactivated default mode network (DMN) and activated salience, frontoparietal and dorsal-attention networks (DAN), and more time in state 6, with deactivated DMN and activated DAN and visual network, suggesting that altered dynamic interaction between the DMN and other brain networks might contribute to cognitive decline in older adults. Meanwhile, older adults transitioned more readily from state 6 to state 3 (activated DMN and deactivated sensorimotor network), suggesting that the fragile antagonism between the DMN and other cognitive networks might contribute to cognitive decline in older adults. Our findings provide novel insights into aberrant brain network dynamics associated with cognitive decline.


Subject(s)
Brain, Magnetic Resonance Imaging, Magnetic Resonance Imaging/methods, Brain/diagnostic imaging, Brain/physiology, Cognition/physiology, Brain Mapping, Nerve Net/diagnostic imaging, Nerve Net/physiology
3.
Cereb Cortex ; 34(1)2024 01 14.
Article in English | MEDLINE | ID: mdl-38037843

ABSTRACT

Human brain structure shows heterogeneous patterns of change across adult aging and is associated with cognition. However, the relationship between cortical structural changes during aging and gene transcription signatures remains unclear. Here, using structural magnetic resonance imaging data of two separate cohorts of healthy participants from the Cambridge Centre for Aging and Neuroscience (n = 454, 18-87 years) and Dallas Lifespan Brain Study (n = 304, 20-89 years) and a transcriptome dataset, we investigated the link between the cortical morphometric similarity network and brain-wide gene transcription. In both cohorts, we found reproducible morphometric similarity network change patterns of decreased morphological similarity with age in cognition-related areas (mainly located in superior frontal and temporal cortices), and increased morphological similarity in sensorimotor-related areas (postcentral and lateral occipital cortices). Changes in the morphometric similarity network showed significant spatial correlation with the expression of age-related genes enriched in synapse-related biological processes, with synaptic abnormalities likely accounting for cognitive decline. Transcription changes in astrocytes, microglia, and neuronal cells explained most of the age-related morphometric similarity network changes, which suggests potential intervention and therapeutic targets for cognitive decline. Taken together, by linking gene transcription signatures to the cortical morphometric similarity network, our findings might provide molecular and cellular substrates for cortical structural changes related to cognitive decline across adult aging.


Subject(s)
Aging, Brain, Adult, Humans, Brain/physiology, Aging/physiology, Cognition/physiology, Temporal Lobe, Magnetic Resonance Imaging/methods
4.
BMC Bioinformatics ; 25(1): 266, 2024 Aug 14.
Article in English | MEDLINE | ID: mdl-39143554

ABSTRACT

BACKGROUND: Construction of co-occurrence networks in metagenomic data often employs correlation to infer pairwise relationships between microbes. However, biological systems are complex and often behave non-linearly. Therefore, the reliance on correlation alone may overlook important relationships and fail to capture the full breadth of intricacies presented in underlying interaction networks. It is of interest to incorporate metrics that are robust in detecting not only linear relationships but non-linear ones as well. RESULTS: In this paper, we explore the use of various mutual information (MI) estimation approaches for quantifying pairwise relationships in biological data and compare their performances against two traditional measures: Pearson's correlation coefficient, r, and Spearman's rank correlation coefficient, ρ. Metrics are tested on both simulated data designed to mimic pairwise relationships that may be found in ecological systems and real data from a previous study on C. diff infection. The results demonstrate that, in the case of asymmetric relationships, mutual information estimators can provide better detection ability than Pearson's or Spearman's correlation coefficients. Specifically, we find that these estimators have elevated performances in the detection of exploitative relationships, demonstrating the potential benefit of including them in future metagenomic studies. CONCLUSIONS: Mutual information (MI) can uncover complex pairwise relationships in biological data that may be missed by traditional measures of association. The inclusion of such relationships when constructing co-occurrence networks can result in a more comprehensive analysis than the use of correlation alone.
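
To make the contrast between correlation and mutual information concrete, here is a minimal sketch using a plug-in (histogram) MI estimator. The estimator choice, the equal-width binning and the `n_bins` default are assumptions for illustration; the paper compares several MI estimators, and this is not claimed to be one of them.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson's correlation coefficient r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def bin_index(values, n_bins):
    """Assign each value to one of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant series
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(x, y, n_bins=4):
    """Plug-in estimate of mutual information in nats from binned counts."""
    bx, by = bin_index(x, n_bins), bin_index(y, n_bins)
    n = len(x)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )
```

For a symmetric quadratic relationship such as y = x² on an interval centered at zero, `pearson` is essentially zero while `mutual_information` is clearly positive, which is the kind of non-linear dependence correlation alone can miss.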


Subject(s)
Metagenomics, Metagenomics/methods, Algorithms, Metagenome/genetics
5.
PLoS Comput Biol ; 19(10): e1010608, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37844077

ABSTRACT

Heterogeneity in different genomic studies compromises the performance of machine learning models in cross-study phenotype predictions. Overcoming heterogeneity when incorporating different studies in terms of phenotype prediction is a challenging and critical step for developing machine learning algorithms with reproducible prediction performance on independent datasets. We investigated the best approaches to integrate different studies of the same type of omics data under a variety of different heterogeneities. We developed a comprehensive workflow to simulate a variety of different types of heterogeneity and evaluate the performances of different integration methods together with batch normalization by using ComBat. We also demonstrated the results through realistic applications on six colorectal cancer (CRC) metagenomic studies and six tuberculosis (TB) gene expression studies, respectively. We showed that heterogeneity in different genomic studies can markedly impair the machine learning classifier's reproducibility. ComBat normalization improved the prediction performance of the machine learning classifier when heterogeneous populations were present, and could successfully remove batch effects within the same population. We also showed that the machine learning classifier's prediction accuracy can decrease markedly as the underlying disease models in the training and test populations become more different. Comparing different merging and integration methods, we found that merging and integration methods can outperform each other in different scenarios. In the realistic applications, we observed that the prediction accuracy improved when applying ComBat normalization with merging or integration methods in both CRC and TB studies. We illustrated that batch normalization is essential for mitigating both population differences of different studies and batch effects. We also showed that both merging strategy and integration methods can achieve good performances when combined with batch normalization. In addition, we explored the potential of boosting phenotype prediction performance by rank aggregation methods and showed that rank aggregation methods performed similarly to other ensemble learning approaches.


Subject(s)
Algorithms, Machine Learning, Reproducibility of Results, Genomics, Phenotype
6.
Cereb Cortex ; 33(13): 8645-8653, 2023 06 20.
Article in English | MEDLINE | ID: mdl-37143182

ABSTRACT

Sex differences in episodic memory (EM), remembering past events based on when and where they occurred, have been reported, but the neural mechanisms are unclear. T1-weighted images of 111 females and 61 males were acquired from the Dallas Lifespan Brain Study. Using surface-based morphometry and structural covariance (SC) analysis, we constructed structural covariance networks (SCN) based on cortical volume, and the global efficiency (Eglob) was computed to characterize network integration. The relationship between SCN and EM was examined by SC analysis among the top-n brain regions that were most relevant to EM performance. The number of SC connections (females: 3306; males: 437, P = 0.0212) and Eglob (females: 0.1845; males: 0.0417, P = 0.0408) of the SCN in females were higher than those in males. The top-n brain regions with the strongest SC in females were located in the auditory network, cingulo-opercular network (CON), and default mode network (DMN), and in males, they were located in the frontoparietal network, CON, and DMN. These results confirmed that the Eglob of the SCN was higher in females than in males and suggest that sex differences in EM performance might be related to differences in network-level integration. Our study highlights the importance of sex as a research variable in brain science.
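
Global efficiency, used above to characterize network integration, is commonly defined as the mean inverse shortest-path length over all node pairs. The sketch below assumes an unweighted, undirected network given as an edge list; how the study thresholds covariance into edges is not stated in the abstract, so that step is left out.

```python
from collections import deque


def global_efficiency(n_nodes, edges):
    """Mean of 1/d(u, v) over all ordered pairs of distinct nodes.

    Unreachable pairs contribute 0. Nodes are labeled 0..n_nodes-1 and
    `edges` is an iterable of undirected (u, v) pairs.
    """
    adj = {v: set() for v in range(n_nodes)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    total = 0.0
    for src in range(n_nodes):
        # Breadth-first search gives shortest hop counts from src.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(1.0 / d for node, d in dist.items() if node != src)
    return total / (n_nodes * (n_nodes - 1))
```

A fully connected triangle has efficiency 1, a three-node path has 5/6, and a network without edges has 0, so denser, better-integrated networks score higher.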


Subject(s)
Episodic Memory, Humans, Male, Female, Sex Characteristics, Brain, Magnetic Resonance Imaging, Brain Mapping
7.
Proc Natl Acad Sci U S A ; 118(36)2021 09 07.
Article in English | MEDLINE | ID: mdl-34480002

ABSTRACT

We propose a deep learning-based knockoffs inference framework, DeepLINK, that guarantees the false discovery rate (FDR) control in high-dimensional settings. DeepLINK is applicable to a broad class of covariate distributions described by the possibly nonlinear latent factor models. It consists of two major parts: an autoencoder network for the knockoff variable construction and a multilayer perceptron network for feature selection with the FDR control. The empirical performance of DeepLINK is investigated through extensive simulation studies, where it is shown to achieve FDR control in feature selection with both high selection power and high prediction accuracy. We also apply DeepLINK to three real data applications to demonstrate its practical utility.
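
The FDR-controlled selection step of a knockoff procedure can be sketched independently of how the knockoffs are built. The code below assumes knockoff statistics `w` have already been computed (large positive values favor the real feature over its knockoff) and applies the standard knockoff+ threshold of Barber and Candès; whether DeepLINK uses exactly this variant is not stated in the abstract, and the autoencoder and perceptron stages are not shown.

```python
def knockoff_plus_threshold(w, q):
    """Smallest t with (1 + #{w_j <= -t}) / max(1, #{w_j >= t}) <= q.

    Returns infinity when no threshold meets the target FDR level q.
    """
    candidates = sorted({abs(v) for v in w if v != 0})
    for t in candidates:
        n_neg = sum(1 for v in w if v <= -t)
        n_pos = sum(1 for v in w if v >= t)
        if (1 + n_neg) / max(1, n_pos) <= q:
            return t
    return float("inf")


def select_features(w, q):
    """Indices of features whose statistic reaches the knockoff+ threshold."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, v in enumerate(w) if v >= t]
```

The ratio in the threshold is a conservative estimate of the false discovery proportion: negative statistics stand in for false positives, because a null feature is equally likely to produce either sign.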


Subject(s)
Computational Biology/methods, Deep Learning, Genomics, Algorithms, Computer Simulation, Computer Neural Networks
8.
Bioinformatics ; 38(11): 2973-2979, 2022 05 26.
Article in English | MEDLINE | ID: mdl-35482530

ABSTRACT

MOTIVATION: Metagenomic binning aims to retrieve microbial genomes directly from ecosystems by clustering metagenomic contigs assembled from short reads into draft genomic bins. Traditional shotgun-based binning methods depend on the contigs' composition and abundance profiles and are impaired by the paucity of enough samples to construct reliable co-abundance profiles. When applied to a single sample, shotgun-based binning methods struggle to distinguish closely related species only using composition information. As an alternative binning approach, Hi-C-based binning employs metagenomic Hi-C technique to measure the proximity contacts between metagenomic fragments. However, spurious inter-species Hi-C contacts inevitably generated by incorrect ligations of DNA fragments between species link the contigs from varying genomes, weakening the purity of final draft genomic bins. Therefore, it is imperative to develop a binning pipeline to overcome the shortcomings of both types of binning methods on a single sample. RESULTS: We develop HiFine, a novel binning pipeline to refine the binning results of metagenomic contigs by integrating both Hi-C-based and shotgun-based binning tools. HiFine designs a strategy of fragmentation for the original bin sets derived from the Hi-C-based and shotgun-based binning methods, which considerably increases the purity of initial bins, followed by merging fragmented bins and recruiting unbinned contigs. We demonstrate that HiFine significantly improves the existing binning results of both types of binning methods and achieves better performance in constructing species genomes on publicly available datasets. To the best of our knowledge, HiFine is the first pipeline to integrate different types of tools for the binning of metagenomic contigs. AVAILABILITY AND IMPLEMENTATION: HiFine is available at https://github.com/dyxstat/HiFine. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Ecosystem, Metagenomics, Metagenomics/methods, Metagenome, Cluster Analysis, Microbial Genome, Algorithms, DNA Sequence Analysis/methods
9.
Bioinformatics ; 38(Suppl 1): i45-i52, 2022 06 24.
Article in English | MEDLINE | ID: mdl-35758806

ABSTRACT

MOTIVATION: Phage-host associations play important roles in microbial communities. However, in natural communities, where phages are discovered and characterized metagenomically rather than through culture-based lab studies, their hosts are generally not known. Several programs have been developed for predicting which phage infects which host based on various sequence similarity measures or machine learning approaches. These are often based on whole viral and host genomes, but in metagenomics-based studies, we rarely have whole genomes and must instead rely on contigs that are sometimes only hundreds of bp long. Therefore, we need programs that predict hosts of phage contigs on the basis of these short contigs. Although most existing programs can be applied to metagenomic datasets for these predictions, their accuracies are generally low. Here, we develop ContigNet, a convolutional neural network-based model capable of predicting phage-host matches based on relatively short contigs, and compare it to the previously published VirHostMatcher (VHM) and WIsH. RESULTS: On the validation set, ContigNet achieves 72-85% area under the receiver operating characteristic curve (AUROC) scores, compared to a maximum of 68% by VHM or WIsH, for contigs between 200 bp and 50 kbp in length. We also apply the model to the Metagenomic Gut Virus (MGV) catalogue, a dataset containing a wide range of draft genomes from metagenomic samples, and achieve 60-70% AUROC scores, compared with 52% for VHM and WIsH. Surprisingly, ContigNet can also be used to predict plasmid-host contig associations with high accuracy, indicating a similar genetic exchange between mobile genetic elements and their hosts. AVAILABILITY AND IMPLEMENTATION: The source code of ContigNet and related datasets can be downloaded from https://github.com/tianqitang1/ContigNet.
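
The AUROC scores reported above can be read as the probability that a true phage-host pair receives a higher score than a false pair. A minimal sketch of that pairwise definition follows; the function name and the 1/0 label encoding are illustrative assumptions, and this is not the evaluation code from the ContigNet repository.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    `labels` holds 1 for true matches and 0 for non-matches; ties between a
    positive and a negative score count as half a win.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

The quadratic pairwise loop is fine for small examples; a rank-based formulation would be preferable for large validation sets.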


Subject(s)
Bacteriophages, Bacteria/genetics, Bacteriophages/genetics, Metagenome, Metagenomics, Computer Neural Networks
10.
PLoS Comput Biol ; 18(7): e1010184, 2022 07.
Article in English | MEDLINE | ID: mdl-35830390

ABSTRACT

Confounding factors exist widely in various biological data owing to technical variations, population structures and experimental conditions. Such factors may mask the true signals and lead to spurious associations in the respective biological data, making it necessary to adjust confounding factors accordingly. However, existing confounder correction methods were mainly developed based on the original data or the pairwise Euclidean distance, either one of which is inadequate for analyzing different types of data, such as sequencing data. In this work, we proposed a method called Adjustment for Confounding factors using Principal Coordinate Analysis, or AC-PCoA, which reduces data dimension and extracts the information from different distance measures using principal coordinate analysis, and adjusts confounding factors across multiple datasets by minimizing the associations between lower-dimensional representations and confounding variables. Application of the proposed method was further extended to classification and prediction. We demonstrated the efficacy of AC-PCoA on three simulated datasets and five real datasets. Compared to the existing methods, AC-PCoA shows better results in visualization, statistical testing, clustering, and classification.


Subject(s)
Research Design, Epidemiologic Confounding Factors
11.
Neuroimage ; 255: 119166, 2022 07 15.
Article in English | MEDLINE | ID: mdl-35398282

ABSTRACT

Magnetic Resonance Imaging (MRI) technology has been increasingly used in neuroscience studies. Reproducibility of statistically significant findings generated by MRI-based studies, especially association studies (phenotype vs. MRI metric) and task-induced brain activation, has recently been heavily debated. However, most currently available reproducibility measures depend on thresholds for the test statistics and cannot be used to evaluate overall study reproducibility. It is also crucial to elucidate the relationship between overall study reproducibility and sample size in an experimental design. In this study, we proposed a model-based reproducibility index to quantify reproducibility which could be used in large-scale high-throughput MRI-based studies including both association studies and task-induced brain activation. We performed the model-based reproducibility assessments for a few association studies and task-induced brain activation by using several recent large sMRI/fMRI databases. For large sample size association studies between brain structure/function features and some basic physiological phenotypes (e.g. sex, BMI), we demonstrated that the model-based reproducibility of these studies is more than 0.99. For MID task activation, similar results could be observed. Furthermore, we proposed a model-based analytical tool to evaluate the minimal sample size needed to achieve a desirable model-based reproducibility. Additionally, we evaluated the model-based reproducibility of gray matter volume (GMV) changes for UK Biobank (UKB) vs. Parkinson Progression Marker Initiative (PPMI) and UK Biobank (UKB) vs. Human Connectome Project (HCP). We demonstrated that both sample size and study-specific experimental factors play important roles in the model-based reproducibility assessments for different experiments. In summary, a systematic assessment of reproducibility is fundamental and important in the current large-scale high-throughput MRI-based studies.


Subject(s)
Connectome, Magnetic Resonance Imaging, Brain/diagnostic imaging, Gray Matter, Humans, Magnetic Resonance Imaging/methods, Reproducibility of Results
12.
Brief Bioinform ; 21(3): 777-790, 2020 05 21.
Article in English | MEDLINE | ID: mdl-30860572

ABSTRACT

In metagenomic studies of microbial communities, the short reads come from mixtures of genomes. Read assembly is usually an essential first step for the follow-up studies in metagenomic research. Understanding the power and limitations of various read assembly programs in practice is important for researchers to choose which programs to use in their investigations. Many studies evaluating different assembly programs used either simulated metagenomes or real metagenomes with unknown genome compositions. However, the simulated datasets may not reflect the real complexities of metagenomic samples and the estimated assembly accuracy could be misleading due to the unknown genomes in real metagenomes. Therefore, hybrid strategies are required to evaluate the various read assemblers for metagenomic studies. In this paper, we benchmark the metagenomic read assemblers by mixing reads from real metagenomic datasets with reads from known genomes and evaluating the integrity, contiguity and accuracy of the assembly using the reads from the known genomes. We selected four advanced metagenome assemblers, MEGAHIT, MetaSPAdes, IDBA-UD and Faucet, for evaluation. We showed the strengths and weaknesses of these assemblers in terms of integrity, contiguity and accuracy for different variables, including the genetic difference of the real genomes with the genome sequences in the real metagenomic datasets and the sequencing depth of the simulated datasets. Overall, MetaSPAdes performs best in terms of integrity and continuity at the species-level, followed by MEGAHIT. Faucet performs best in terms of accuracy at the cost of worst integrity and continuity, especially at low sequencing depth. MEGAHIT has the highest genome fractions at the strain-level and MetaSPAdes has the overall best performance at the strain-level. MEGAHIT is the most efficient in our experiments. Availability: The source code is available at https://github.com/ziyewang/MetaAssemblyEval.


Subject(s)
Computational Biology/methods, Metagenomics, Algorithms, Datasets as Topic, High-Throughput Nucleotide Sequencing/methods, Microbiota/genetics
13.
Bioinformatics ; 37(6): 759-766, 2021 05 05.
Article in English | MEDLINE | ID: mdl-33119059

ABSTRACT

MOTIVATION: The rapid development of sequencing technologies has enabled us to generate a large number of metagenomic reads from genetic materials in microbial communities, making it possible to gain deep insights into understanding the differences between the genetic materials of different groups of microorganisms, such as bacteria, viruses, plasmids, etc. Computational methods based on k-mer frequencies have been shown to be highly effective for classifying metagenomic sequencing reads into different groups. However, such methods usually use all the k-mers as features for prediction without selecting relevant k-mers for the different groups of sequences, i.e. unique nucleotide patterns containing biological significance. RESULTS: To select k-mers for distinguishing different groups of sequences with guaranteed false discovery rate (FDR) control, we develop KIMI, a general framework based on model-X Knockoffs regarded as the state-of-the-art statistical method for FDR control, for sequence motif discovery with arbitrary target FDR level, such that reproducibility can be theoretically guaranteed. KIMI is shown through simulation studies to be effective in simultaneously controlling FDR and yielding high power, outperforming the broadly used Benjamini-Hochberg procedure and the q-value method for FDR control. To illustrate the usefulness of KIMI in analyzing real datasets, we take the viral motif discovery problem as an example and implement KIMI on a real dataset consisting of viral and bacterial contigs. We show that the accuracy of predicting viral and bacterial contigs can be increased by training the prediction model only on relevant k-mers selected by KIMI. AVAILABILITY AND IMPLEMENTATION: Our implementation of KIMI is available at https://github.com/xinbaiusc/KIMI. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Metagenomics, Microbiota, Algorithms, Computer Simulation, Metagenome, Reproducibility of Results, DNA Sequence Analysis
14.
Bioinformatics ; 37(2): 155-161, 2021 04 19.
Article in English | MEDLINE | ID: mdl-32766810

ABSTRACT

MOTIVATION: Rapid developments in sequencing technologies have boosted the generation of high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. RESULTS: We report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high throughput sequencing data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With a 10^2-10^4-fold reduction of storage space, CRAFT performs fast query for gigabytes of data within seconds or minutes, achieving performance comparable to six state-of-the-art alignment-free measures. AVAILABILITY AND IMPLEMENTATION: CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/CRAFT. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms, Software, Genomics, High-Throughput Nucleotide Sequencing, DNA Sequence Analysis
15.
Virol J ; 19(1): 114, 2022 06 28.
Article in English | MEDLINE | ID: mdl-35765099

ABSTRACT

BACKGROUND: Chronic infection with hepatitis B virus (HBV) has been shown to be strongly associated with the development of hepatocellular carcinoma (HCC). AIMS: The purpose of the study is to investigate the association between HBV preS region quasispecies and HCC development, as well as to develop an HCC diagnosis model using HBV preS region quasispecies. METHODS: A total of 104 chronic hepatitis B (CHB) patients and 117 HBV-related HCC patients were enrolled. The HBV preS region was sequenced using next generation sequencing (NGS) and the nucleotide entropy was calculated for quasispecies evaluation. Sparse logistic regression (SLR) was used to predict HCC development and prediction performances were evaluated using receiver operating characteristic curves. RESULTS: Entropy of the HBV preS1 and preS2 regions and of several nucleotide positions differed significantly between CHB and HCC patients. Using SLR, the classification of HCC/CHB groups achieved a mean area under the receiver operating characteristic curve (AUC) of 0.883 in the training data and 0.795 in the test data. The prediction model was also validated by a completely independent dataset from Hong Kong. The 10 selected nucleotide positions showed significantly different entropy between CHB and HCC patients. The HBV quasispecies also classified three clinical parameters, including HBeAg, HBV DNA, and alkaline phosphatase (ALP), with AUC values greater than 0.6 in the test data. CONCLUSIONS: Using NGS and SLR, the association between HBV preS region nucleotide entropy and HCC development was validated in our study, and this could promote the understanding of the HCC progression mechanism.
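
The nucleotide entropy used for quasispecies evaluation can be illustrated with per-position Shannon entropy over aligned reads. This is a sketch under assumptions: reads are taken to be pre-aligned strings of equal length, and the logarithm base (bits here) is a choice the abstract does not specify.

```python
import math
from collections import Counter


def positional_entropy(reads):
    """Shannon entropy in bits at each alignment position.

    `reads` is a list of equal-length strings; position i is summarized by
    the frequencies of the characters observed in column i.
    """
    entropies = []
    for column in zip(*reads):
        counts = Counter(column)
        total = len(column)
        entropies.append(
            -sum((c / total) * math.log2(c / total) for c in counts.values())
        )
    return entropies
```

A fully conserved position scores 0 and a position split evenly between two nucleotides scores 1 bit, so higher values mark more heterogeneous sites that could serve as features for a classifier.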


Subject(s)
Hepatocellular Carcinoma, Liver Neoplasms, Hepatitis B Surface Antigens/genetics, Hepatitis B virus/genetics, Humans, Logistic Models, Nucleotides, Quasispecies
16.
Nucleic Acids Res ; 47(W1): W379-W387, 2019 07 02.
Article in English | MEDLINE | ID: mdl-31106361

ABSTRACT

Automated function prediction (AFP) of proteins is of great significance in biology. AFP can be regarded as a large-scale multi-label classification problem in which a protein can be associated with multiple gene ontology terms as its labels. Building on our GOLabeler, a state-of-the-art method in the third critical assessment of functional annotation (CAFA3), in this paper we propose NetGO, a web server that is able to further improve the performance of large-scale AFP by incorporating massive protein-protein network information. Specifically, the advantages of NetGO in using network information are threefold: (i) NetGO relies on a powerful learning to rank framework from machine learning to effectively integrate both sequence and network information of proteins; (ii) NetGO uses the massive network information of all species (>2000) in STRING (rather than only some specific species) and (iii) NetGO can still use network information to annotate a protein by homology transfer, even if it is not contained in STRING. Separating training and testing data with the same time-delayed settings as CAFA, we comprehensively examined the performance of NetGO. Experimental results have clearly demonstrated that NetGO significantly outperforms GOLabeler and other competing methods. The NetGO web server is freely available at http://issubmission.sjtu.edu.cn/netgo/.


Subject(s)
Computational Biology/methods, Machine Learning, Molecular Sequence Annotation, Proteins/chemistry, Software, Amino Acid Sequence, Animals, Benchmarking, Protein Databases, Gene Ontology, Humans, Internet, Molecular Models, Plants/genetics, Prokaryotic Cells/metabolism, Protein Interaction Mapping, Proteins/physiology, Sequence Alignment, Protein Sequence Analysis, Amino Acid Sequence Homology, Structure-Activity Relationship
17.
PLoS Genet ; 14(2): e1007206, 2018 02.
Article in English | MEDLINE | ID: mdl-29474353

ABSTRACT

Hepatitis B virus (HBV) infection is common worldwide, especially in China. In regions of high HBV prevalence, 60-80% or more of hepatocellular carcinoma (HCC) cases can be attributed to HBV infection. Although traditional Sanger sequencing has been extensively used to investigate HBV sequences, next-generation sequencing (NGS) is becoming more commonly used. Furthermore, it is unknown whether word pattern frequencies of HBV reads obtained by NGS can be used to investigate HBV genotypes and predict HCC status. In this study, we used NGS to sequence the pre-S region of the HBV genome in 94 HCC patients and 45 individuals with chronic HBV (CHB) infection. Word pattern frequencies in the sequence data of all individuals were calculated and compared using the Manhattan distance. The individuals were grouped using principal coordinate analysis (PCoA) and hierarchical clustering. Word pattern frequencies were also used to build prediction models for HCC status using both K-nearest neighbors (KNN) and support vector machine (SVM). Word patterns proved highly informative for analyzing HBV sequences. Our key findings include that the first principal coordinate of the PCoA was highly associated with the fraction of genotype B (or C) sequences and the second principal coordinate was significantly associated with the probability of having HCC. Hierarchical clustering first grouped the individuals according to their major genotypes and then by their HCC status. Using cross-validation, high area under the receiver operating characteristic curve (AUC) values of around 0.88 for KNN and 0.92 for SVM were obtained. In an independent data set of 46 HCC patients and 31 CHB individuals, an AUC of 0.77 was obtained using SVM. It was further shown that 3000 reads per individual can yield stable prediction results for SVM. Thus, another key finding is that word patterns can be used to predict HCC status with high accuracy. Overall, our study shows that word pattern frequencies of HBV sequences contain much information about the composition of different HBV genotypes and the HCC status of an individual.
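
The comparison step described in this abstract (word pattern frequencies compared by Manhattan distance) can be illustrated with a minimal sketch. The function names, the word length `k`, and the toy reads below are assumptions made for illustration; the abstract does not state the word length or the implementation used in the study.

```python
from collections import Counter
from itertools import product


def word_frequencies(reads, k=3):
    """Return relative frequencies of all 4**k DNA words across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            word = read[i:i + k]
            if set(word) <= set("ACGT"):  # skip words containing ambiguous bases
                counts[word] += 1
    total = sum(counts.values())
    # Fixed word order so that vectors from different individuals are comparable.
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[w] / total if total else 0.0 for w in words]


def manhattan(u, v):
    """Sum of absolute coordinate differences between two frequency vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Toy reads for two hypothetical individuals (not real HBV data).
sample_a = ["ACGTACGT", "ACGTTT"]
sample_b = ["TTTTACGA", "GGGACG"]
freq_a = word_frequencies(sample_a, k=2)
freq_b = word_frequencies(sample_b, k=2)
distance = manhattan(freq_a, freq_b)
```

A matrix of such pairwise distances is the kind of input that PCoA, hierarchical clustering, or a KNN classifier would then take.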


Subject(s)
Carcinoma, Hepatocellular/virology , Genetic Heterogeneity , Hepatitis B Surface Antigens/genetics , Hepatitis B virus/genetics , Hepatitis B, Chronic/virology , Liver Neoplasms/virology , Carcinoma, Hepatocellular/epidemiology , Carcinoma, Hepatocellular/genetics , DNA Fingerprinting , DNA, Viral/analysis , Gene Frequency , Genetic Association Studies/methods , Genotype , Hepatitis B virus/classification , Hepatitis B, Chronic/complications , Hepatitis B, Chronic/epidemiology , Hepatitis B, Chronic/genetics , High-Throughput Nucleotide Sequencing , Humans , Liver Neoplasms/epidemiology , Liver Neoplasms/genetics , Phylogeny , Protein Precursors/genetics
18.
Bioinformatics ; 35(22): 4596-4606, 2019 11 01.
Article in English | MEDLINE | ID: mdl-30993316

ABSTRACT

MOTIVATION: Detecting sequences containing repetitive regions is a basic bioinformatics task with many applications. Several methods have been developed for various types of repeat detection tasks, but an efficient generic method for detecting most types of repetitive sequences is still desirable. Inspired by the excellent properties and successful applications of the D2 family of statistics in comparative analyses of genomic sequences, we developed a new statistic, D2R, that can efficiently discriminate sequences with repetitive regions from those without. RESULTS: Using this statistic, we developed an algorithm of linear time and space complexity for detecting most types of repetitive sequences in multiple scenarios, including finding candidate clustered regularly interspaced short palindromic repeats (CRISPR) regions in bacterial genomic or metagenomic sequences. Simulation and real data experiments show that the method works well on both assembled sequences and unassembled short reads. AVAILABILITY AND IMPLEMENTATION: The code is available at https://github.com/XuegongLab/D2R_codes under the GPL 3.0 license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
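
The abstract does not define D2R, so the sketch below shows only the classical D2 statistic from the family that inspired it: the sum, over all words of length k, of the product of that word's counts in the two sequences. The function names and the value of `k` are assumptions for illustration, and this is not the authors' D2R implementation.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k=3):
    """Classical D2: sum over words w of count_a(w) * count_b(w)."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Words absent from seq_b contribute zero (Counter returns 0 for them).
    return sum(n * counts_b[w] for w, n in counts_a.items())


shared = d2("ACGACG", "ACGTTT", k=3)   # only the word "ACG" is shared
unrelated = d2("AAAA", "CCCC", k=2)    # no shared words
```

Counting words with a hash table takes time and space linear in the sequence lengths, which is in the same spirit as the linear complexity the abstract reports for the D2R-based algorithm.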


Subject(s)
Clustered Regularly Interspaced Short Palindromic Repeats , Genomics , Algorithms , Genome, Bacterial , Metagenomics
19.
Bioinformatics ; 35(21): 4229-4238, 2019 11 01.
Article in English | MEDLINE | ID: mdl-30977806

ABSTRACT

MOTIVATION: Metagenomic contig binning is an important computational problem in metagenomic research, which aims to cluster contigs from the same genome into the same group. Unlike the classical clustering problem, contig binning can utilize known relationships among some of the contigs or the taxonomic identity of some contigs. However, the current state-of-the-art contig binning methods do not make full use of additional biological information beyond the coverage and sequence composition of the contigs. RESULTS: We developed a novel contig binning method, Semi-supervised Spectral Normalized Cut for Binning (SolidBin), based on semi-supervised spectral clustering. Using sequence feature similarity and/or additional biological information, such as the reliable taxonomy assignments of some contigs, SolidBin constructs two types of prior information: must-link and cannot-link constraints. A must-link constraint means that a pair of contigs should be clustered into the same group, while a cannot-link constraint means that a pair of contigs should be clustered into different groups. These constraints are then integrated into a classical spectral clustering approach, normalized cut, for improved contig binning. The performance of SolidBin is compared with that of five state-of-the-art genome binners (CONCOCT, COCACOLA, MaxBin, MetaBAT and BMC3C) on five next-generation sequencing benchmark datasets, including simulated multi- and single-sample datasets and real multi-sample datasets. The experimental results show that SolidBin achieved the best performance in terms of F-score, Adjusted Rand Index and Normalized Mutual Information, especially on the real datasets and the single-sample dataset. AVAILABILITY AND IMPLEMENTATION: https://github.com/sufforest/SolidBin. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
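
One way to picture the must-link and cannot-link constraints is to derive them from partial taxonomy assignments, as in the minimal sketch below. The function name, the input layout, and the contig and taxon names are hypothetical; the abstract does not describe how SolidBin actually builds or weights its constraints, and the integration into normalized cut is not shown.

```python
from itertools import combinations


def build_constraints(taxonomy):
    """Derive pairwise constraints from partial taxonomy labels.

    taxonomy maps a contig id to a taxon label, or to None when unknown.
    Contigs with no label take part in no constraint.
    """
    labelled = sorted((c, t) for c, t in taxonomy.items() if t is not None)
    must_link, cannot_link = [], []
    for (c1, t1), (c2, t2) in combinations(labelled, 2):
        if t1 == t2:
            must_link.append((c1, c2))    # same taxon: should share a bin
        else:
            cannot_link.append((c1, c2))  # different taxa: should be separated
    return must_link, cannot_link


# Hypothetical contigs; contig4 has no reliable taxonomy assignment.
taxonomy = {
    "contig1": "Escherichia",
    "contig2": "Escherichia",
    "contig3": "Bacillus",
    "contig4": None,
}
must_link, cannot_link = build_constraints(taxonomy)
```

In a semi-supervised spectral method, pairs like these would typically adjust the similarity matrix or the objective before the clustering step.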


Subject(s)
Metagenome , Cluster Analysis , High-Throughput Nucleotide Sequencing , Metagenomics , Sequence Analysis, DNA , Software
20.
J Transl Med ; 18(1): 5, 2020 01 06.
Article in English | MEDLINE | ID: mdl-31906978

ABSTRACT

BACKGROUND: Sepsis remains a major challenge in intensive care units, causing unacceptably high mortality rates due to the lack of rapid diagnostic tools with sufficient sensitivity. Therefore, there is an urgent need to replace time-consuming blood cultures with a new method. Ideally, such a method would also provide comprehensive profiling of pathogenic bacteria to facilitate treatment decisions. METHODS: We developed a Random Forest model with balanced subsampling to screen for pathogenic bacteria and diagnose sepsis based on cell-free DNA (cfDNA) sequencing data from a small blood sample. In addition, we constructed a bacterial co-occurrence network, based on a set of normal and sepsis samples, to infer unobserved bacteria. RESULTS: Based solely on cfDNA sequencing information from three independent sepsis datasets, we distinguished sepsis samples from healthy samples with satisfactory performance. This strategy also provides comprehensive bacterial profiling, permitting doctors to choose the best treatment strategy for a sepsis case. CONCLUSIONS: The combination of sepsis identification and bacteria-inferring strategies proved successful for noninvasive cfDNA-based diagnosis, which has the potential to greatly enhance efficiency in disease detection and provide a comprehensive understanding of pathogens. By comparison, whereas culture-based pathogen analysis takes up to 5 days and is effective for only a third to a half of patients, cfDNA sequencing can be completed in just 1 day, and our method can identify the majority of pathogens in all patients.
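
The abstract does not spell out the balanced subsampling scheme, so the sketch below shows one common interpretation: before each tree is trained, an equal number of samples is drawn from every class so that the larger class cannot dominate. The function name, the class sizes, and the seed are assumptions for illustration, and the Random Forest training itself is omitted.

```python
import random


def balanced_subsample(labels, rng):
    """Return sorted sample indices with equal numbers from each class."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    # Every class is downsampled to the size of the smallest class.
    n = min(len(indices) for indices in by_class.values())
    chosen = []
    for indices in by_class.values():
        chosen.extend(rng.sample(indices, n))  # sampling without replacement
    return sorted(chosen)


# Hypothetical imbalanced cohort: 3 sepsis samples and 10 healthy samples.
labels = ["sepsis"] * 3 + ["healthy"] * 10
rng = random.Random(0)  # fixed seed so the draw is reproducible
subset = balanced_subsample(labels, rng)
```

Repeating the draw with a fresh random state for each tree gives every tree a different, class-balanced view of the data.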


Subject(s)
Cell-Free Nucleic Acids , Sepsis , Bacteria/genetics , DNA, Bacterial/genetics , Humans , Intensive Care Units , Sepsis/diagnosis