Results 1 - 20 of 24
1.
Biom J ; 66(6): e202300185, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39101657

ABSTRACT

There has been growing research interest in developing methodology to evaluate health care providers' performance with respect to a patient outcome. Random and fixed effects models are traditionally used for such a purpose. We propose a new method, using a fusion penalty to cluster health care providers based on quasi-likelihood. Without any a priori knowledge of grouping information, our method provides a desirable data-driven approach for automatically clustering health care providers into different groups based on their performance. Further, the quasi-likelihood is more flexible and robust than the regular likelihood in that no distributional assumption is needed. An efficient alternating direction method of multipliers algorithm is developed to implement the proposed method. We show that the proposed method enjoys the oracle properties; namely, it performs as well as if the true group structure were known in advance. The consistency and asymptotic normality of the estimators are established. Simulation studies and analysis of the national kidney transplant registry data demonstrate the utility and validity of our method.


Subject(s)
Biometry , Health Personnel , Cluster Analysis , Likelihood Functions , Humans , Health Personnel/statistics & numerical data , Biometry/methods , Kidney Transplantation , Algorithms
2.
Front Genet ; 15: 1369628, 2024.
Article in English | MEDLINE | ID: mdl-38903761

ABSTRACT

Genotype-to-phenotype mapping is an essential problem in the current genomic era. While qualitative case-control predictions have received significant attention, less emphasis has been placed on predicting quantitative phenotypes. This emerging field holds great promise in revealing intricate connections between microbial communities and host health. However, the presence of heterogeneity in microbiome datasets poses a substantial challenge to the accuracy of predictions and undermines the reproducibility of models. To tackle this challenge, we investigated 22 normalization methods that aimed at removing heterogeneity across multiple datasets, conducted a comprehensive review of them, and evaluated their effectiveness in predicting quantitative phenotypes in three simulation scenarios and 31 real datasets. The results indicate that none of these methods demonstrate significant superiority in predicting quantitative phenotypes or attain a noteworthy reduction in Root Mean Squared Error (RMSE) of the predictions. Given the frequent occurrence of batch effects and the satisfactory performance of batch correction methods in predicting datasets affected by these effects, we strongly recommend utilizing batch correction methods as the initial step in predicting quantitative phenotypes. In summary, the performance of normalization methods in predicting metagenomic data remains a dynamic and ongoing research area. Our study contributes to this field by undertaking a comprehensive evaluation of diverse methods and offering valuable insights into their effectiveness in predicting quantitative phenotypes.

3.
Bioinformatics ; 40(4)2024 Mar 29.
Article in English | MEDLINE | ID: mdl-38597887

ABSTRACT

MOTIVATION: Discovering disease causative pathogens, particularly viruses without reference genomes, poses a technical challenge as they are often unidentifiable through sequence alignment. Machine learning prediction of patient high-throughput sequences unmappable to human and pathogen genomes may reveal sequences originating from uncharacterized viruses. Currently, there is a lack of software specifically designed for accurately predicting such viral sequences in human data. RESULTS: We developed a fast XGBoost method and software VirusPredictor leveraging an in-house viral genome database. Our two-step XGBoost models first classify each query sequence into one of three groups: infectious virus, endogenous retrovirus (ERV) or non-ERV human. The prediction accuracies increased as the sequences became longer, i.e. 0.76, 0.93, and 0.98 for 150-350 (Illumina short reads), 850-950 (Sanger sequencing data), and 2000-5000 bp sequences, respectively. Then, sequences predicted to be from infectious viruses are further classified into one of six virus taxonomic subgroups, and the accuracies increased from 0.92 to >0.98 when query sequences increased from 150-350 to >850 bp. The results suggest that Illumina short reads should be de novo assembled into contigs (e.g. ∼1000 bp or longer) before prediction whenever possible. We applied VirusPredictor to multiple real genomic and metagenomic datasets and obtained high accuracies. VirusPredictor, a user-friendly open-source Python software, is useful for predicting the origins of patients' unmappable sequences. This study is the first to classify ERVs in infectious viral sequence prediction. This is also the first study combining virus sub-group predictions. AVAILABILITY AND IMPLEMENTATION: www.dllab.org/software/VirusPredictor.html.


Subject(s)
Genome, Viral , Software , Humans , Viruses/genetics , Sequence Analysis, DNA/methods , Sequence Alignment/methods , Machine Learning
4.
Sci Rep ; 14(1): 7024, 2024 03 25.
Article in English | MEDLINE | ID: mdl-38528097

ABSTRACT

The human microbiome, comprising microorganisms residing within and on the human body, plays a crucial role in various physiological processes and has been linked to numerous diseases. To analyze microbiome data, it is essential to account for inherent heterogeneity and variability across samples. Normalization methods have been proposed to mitigate these variations and enhance comparability. However, the performance of these methods in predicting binary phenotypes remains understudied. This study systematically evaluates different normalization methods in microbiome data analysis and their impact on disease prediction. Our findings highlight the strengths and limitations of scaling, compositional data analysis, transformation, and batch correction methods. Scaling methods like TMM show consistent performance, while compositional data analysis methods exhibit mixed results. Transformation methods, such as Blom and NPN, demonstrate promise in capturing complex associations. Batch correction methods, including BMC and Limma, consistently outperform other approaches. However, the influence of normalization methods is constrained by population effects, disease effects, and batch effects. These results provide insights for selecting appropriate normalization approaches in microbiome research, improving predictive models, and advancing personalized medicine. Future research should explore larger and more diverse datasets and develop tailored normalization strategies for microbiome data analysis.


Subject(s)
Microbiota , Humans , Microbiota/genetics , Metagenome , Metagenomics , Research Design , Phenotype
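As an illustration of two of the method families named in this abstract, the sketch below applies total-sum scaling followed by batch mean centering (BMC) to a toy count table. It is a minimal sketch under stated assumptions, not the study's pipeline: the function names, the toy counts, the batch labels, and the choice to centre log relative abundances are all hypothetical.

```python
import math

def total_sum_scale(counts):
    """Convert each sample (a row of counts) to relative abundances."""
    return [[c / sum(row) for c in row] for row in counts]

def batch_mean_center(rows, batches):
    """Subtract, feature by feature, the mean of the batch each sample belongs to."""
    centered = []
    for row, batch in zip(rows, batches):
        members = [r for r, b in zip(rows, batches) if b == batch]
        means = [sum(col) / len(members) for col in zip(*members)]
        centered.append([v - m for v, m in zip(row, means)])
    return centered

# Hypothetical toy data: 4 samples x 3 taxa, two batches, no zero counts.
counts = [[10, 30, 60], [20, 20, 60], [5, 5, 90], [15, 5, 80]]
batches = ["A", "A", "B", "B"]

relative = total_sum_scale(counts)
# Assumption: centring is done on the log scale; zero counts would need a pseudocount.
log_relative = [[math.log(v) for v in row] for row in relative]
centered = batch_mean_center(log_relative, batches)
```

After centring, every feature has mean zero within each batch, so between-batch location shifts no longer reach a downstream predictor.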
5.
Brief Bioinform ; 24(6)2023 09 22.
Article in English | MEDLINE | ID: mdl-37930023

ABSTRACT

Local associations refer to spatial-temporal correlations that emerge from the biological realm, such as time-dependent gene co-expression or seasonal interactions between microbes. One can reveal the intricate dynamics and inherent interactions of biological systems by examining the biological time series data for these associations. To accomplish this goal, local similarity analysis algorithms and statistical methods that facilitate the local alignment of time series and assess the significance of the resulting alignments have been developed. Although these algorithms were initially devised for gene expression analysis from microarrays, they have been adapted and accelerated for multi-omics next generation sequencing datasets, achieving high scientific impact. In this review, we present an overview of the historical developments and recent advances for local similarity analysis algorithms, their statistical properties, and real applications in analyzing biological time series data. The benchmark data and analysis scripts used in this review are freely available at http://github.com/labxscut/lsareview.


Subject(s)
Algorithms , Gene Expression Profiling , Time Factors , Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing , Benchmarking
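The local alignment of two time series that this abstract refers to is commonly described as a dynamic programme over the products of paired observations. The sketch below follows that commonly described recurrence. It is an assumption-laden illustration and not the code from the review's repository: the function name and the `max_delay` parameter are hypothetical, and the inputs are assumed to be already normalised.

```python
def local_similarity(x, y, max_delay=0):
    """Signed local similarity score of two equal-length series.

    Accumulates products x[i] * y[j] along diagonals with |i - j| <= max_delay,
    resetting a run to zero whenever it would become negative.
    """
    if len(x) != len(y):
        raise ValueError("series must have equal length")
    n = len(x)
    pos = [[0.0] * (n + 1) for _ in range(n + 1)]  # positively associated runs
    neg = [[0.0] * (n + 1) for _ in range(n + 1)]  # negatively associated runs
    best_pos = best_neg = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - max_delay), min(n, i + max_delay) + 1):
            product = x[i - 1] * y[j - 1]
            pos[i][j] = max(0.0, pos[i - 1][j - 1] + product)
            neg[i][j] = max(0.0, neg[i - 1][j - 1] - product)
            best_pos = max(best_pos, pos[i][j])
            best_neg = max(best_neg, neg[i][j])
    score = best_pos if best_pos >= best_neg else -best_neg
    return score / n

# Hypothetical toy series: y leads x by one time point.
shifted = local_similarity([0, 1, 2, 0], [1, 2, 0, 0], max_delay=1)
unshifted = local_similarity([0, 1, 2, 0], [1, 2, 0, 0], max_delay=0)
```

Allowing a delay of one time point raises the score in this toy case, which is the kind of time-shifted association an ordinary correlation coefficient would miss.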
6.
Article in English | MEDLINE | ID: mdl-39165756

ABSTRACT

Ulcerative colitis (UC) is an immune-mediated inflammation of the colonic mucosa. Gut microbiota dysbiosis may play a significant role in disease pathogenesis by causing shifts in metabolomic profiles within the gut. This study aimed to identify differences and trends in the metabolomic profile of paediatric UC patients before and after faecal microbiota transplantation (FMT). Forty-six paediatric patients with mild-to-moderate UC and 30 healthy paediatric patients were enrolled in this study. Stool samples were collected at baseline, prior to FMT initiation, and at months 1, 3, 6, and 12 post-FMT. Pediatric Ulcerative Colitis Activity Index (PUCAI) scores were calculated at baseline and at months 1, 3, 6, and 12 after FMT. The average Bray-Curtis dissimilarities to healthy subjects decreased after FMT. In principal coordinate analysis plots, UC patients' centroids drew nearer to those of healthy individuals. The variance explained by phenotype (Healthy versus UC) decreased but remained significant. From 1 to 3 months after FMT, PUCAI scores showed a statistically significant decreasing trend. PUCAI scores remained flat from 6 months after FMT. This study concludes that paediatric UC patients have a significantly different baseline metabolite profile than healthy controls. Although time-limited, FMT significantly altered these metabolite profiles and shifted them towards those of healthy controls.
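The "average Bray-Curtis dissimilarity to healthy subjects" reported in this abstract can be illustrated with the standard Bray-Curtis formula, the sum of absolute differences divided by the sum of all values. The sketch below is hypothetical throughout: the function names and the toy profiles are illustrative only and are not data from the study.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative profiles of equal length."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator

def mean_dissimilarity_to_reference(samples, reference):
    """Average Bray-Curtis dissimilarity over all (sample, reference) pairs."""
    values = [bray_curtis(s, r) for s in samples for r in reference]
    return sum(values) / len(values)

# Hypothetical toy profiles (three features each).
healthy = [[5, 5, 0], [4, 6, 0]]
before = [[0, 2, 8], [1, 1, 8]]
after = [[3, 5, 2], [4, 4, 2]]

d_before = mean_dissimilarity_to_reference(before, healthy)
d_after = mean_dissimilarity_to_reference(after, healthy)
```

In this toy example `d_after` is smaller than `d_before`, which is the pattern the abstract describes as patients moving towards healthy controls.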

7.
Front Genet ; 13: 729011, 2022.
Article in English | MEDLINE | ID: mdl-35559007

ABSTRACT

Biological time series data plays an important role in exploring the dynamic changes of biological systems, while the determinate patterns of association between various biological factors can further deepen the understanding of biological system functions and the interactions between them. At present, local trend analysis (LTA) is commonly conducted in many biological fields, where the biological time series data can be a sequence at the level of gene expression, OTU abundance, etc. A local trend score can be obtained by taking the degree of similarity of the upward, constant or downward trends of time series data as an indicator of the correlation between different biological factors. However, a major limitation of local trend analysis is that the permutation test conducted to calculate its statistical significance is time-consuming. Therefore, a problem attracting much attention from bioinformatics scientists is to develop a method of evaluating the statistical significance of local trend scores quickly and effectively. In this paper, a new approach is proposed to efficiently approximate the statistical significance of local trend scores for dependent time series, and the effectiveness of the new method is demonstrated through simulation and real data set analysis.
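The first step this abstract describes, reducing a time series to upward, constant or downward trends, can be sketched as follows. The rule used here, an absolute change compared against a fixed threshold, is an assumption made for illustration, and the function name and threshold value are hypothetical. The paper's exact definition of a trend may differ.

```python
def trend_series(values, threshold=0.0):
    """Map consecutive changes to +1 (up), -1 (down) or 0 (constant).

    Assumption: a change counts as a trend only if its absolute size
    exceeds `threshold`.
    """
    trends = []
    for previous, current in zip(values, values[1:]):
        change = current - previous
        if change > threshold:
            trends.append(1)
        elif change < -threshold:
            trends.append(-1)
        else:
            trends.append(0)
    return trends

# Hypothetical toy series: a rise, a negligible change, then a fall.
example = trend_series([1.0, 2.0, 2.05, 1.0], threshold=0.1)
```

Two such trend series can then be compared point by point, which is what makes the resulting score a measure of shared direction and not of shared magnitude.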

8.
Sci Rep ; 12(1): 6421, 2022 04 19.
Article in English | MEDLINE | ID: mdl-35440670

ABSTRACT

Dysbiosis of human gut microbiota has been reported in association with ulcerative colitis (UC) in both children and adults using either 16S rRNA gene or shotgun sequencing data. However, these studies used either 16S rRNA or metagenomic shotgun sequencing but not both. We sequenced fecal samples from 19 pediatric UC patients and 23 healthy children aged 7 to 21 years using both 16S rRNA and metagenomic shotgun sequencing. The samples were analyzed using three different types of data: 16S rRNA genus level abundance, microbial species and pathway abundance profiles. We demonstrated that (a) the alpha diversity of pediatric UC cases is lower than that of healthy controls; (b) the beta diversity within children with UC is more variable than within the healthy children; (c) several microbial families including Akkermansiaceae, Clostridiaceae, Eggerthellaceae, Lachnospiraceae, and Oscillospiraceae, contain species that are depleted in pediatric UC compared to controls; (d) a few associated species unique to pediatric UC, but not adult UC, were also identified, e.g. some species in the Christensenellaceae family were found to be depleted and some species in the Enterobacteriaceae family were found to be enriched in pediatric UC; and (e) both 16S rRNA and shotgun sequencing data can predict pediatric UC status with area under the receiver operating characteristic curve (AUROC) of close to 0.90 based on cross validation. We showed that 16S rRNA data yielded results similar to those of shotgun data in terms of alpha diversity, beta diversity, and prediction accuracy. Our study demonstrated that pediatric UC subjects harbor a dysbiotic and less diverse gut microbial population with distinct differences from healthy children. We also showed that 16S rRNA data yielded accurate disease prediction results in comparison to shotgun data, which can be more expensive and laborious. These conclusions were confirmed in an independent data set of 7 pediatric UC cases and 8 controls.


Subject(s)
Colitis, Ulcerative , Gastrointestinal Microbiome , Adolescent , Adult , Child , Colitis, Ulcerative/genetics , Dysbiosis/genetics , Feces , Gastrointestinal Microbiome/genetics , Humans , Metagenome , RNA, Ribosomal, 16S/genetics , Young Adult
9.
Physiol Rep ; 9(14): e14918, 2021 07.
Article in English | MEDLINE | ID: mdl-34278738

ABSTRACT

BACKGROUND: It is known that patients with ulcerative colitis (UC) have reduced numbers of short-chain fatty acid (SCFA) producing bacteria and reduced SCFA concentration in feces. There is also evidence that Hispanic patients have increased incidence of UC and increased likelihood of developing disease at a younger age. To understand why this might be, we compared fiber intake and fecal SCFA concentrations in Hispanic children with UC and non-Hispanic children with UC. METHODS: In this cross-sectional study conducted at the Children's Hospital of Los Angeles, stool was collected from 22 Hispanic and 31 non-Hispanic children with UC. SCFAs in the stool were quantified using mass spectrometry. Diet information was collected at the time of stool collection using food frequency questionnaires. RESULTS: Acetic acid, butyric acid, isovaleric acid, and propionic acid concentrations are significantly lower in Hispanic children with UC compared to age, gender, and disease activity matched non-Hispanic children with UC (p < 0.001). Butyric acid showed the most significant decrease (p = 1.6e-7). There was no significant difference in fiber intake between Hispanic and non-Hispanic children with UC. CONCLUSION: To our knowledge, this is the first study to find that Hispanic children with UC had further reduced SCFAs, independent of disease activity and fiber intake. It is possible that the reduction in SCFAs is related to the colonic disease in Hispanic patients with UC. This may provide more evidence to support the use of SCFA targeted therapies for UC.


Subject(s)
Colitis, Ulcerative/epidemiology , Colitis, Ulcerative/metabolism , Fatty Acids, Volatile/analysis , Fatty Acids, Volatile/metabolism , Feces/chemistry , Hispanic or Latino , Adolescent , Child , Colitis, Ulcerative/diagnosis , Colitis, Ulcerative/diet therapy , Cross-Sectional Studies , Dietary Fiber/administration & dosage , Female , Humans , Los Angeles/epidemiology , Male
10.
Aliment Pharmacol Ther ; 54(6): 792-804, 2021 09.
Article in English | MEDLINE | ID: mdl-34218431

ABSTRACT

BACKGROUND: Patients with ulcerative colitis (UC) have an increased risk of Clostridioides difficile infection (CDI). There is a well-documented relationship between bile acids and CDI. AIMS: To evaluate faecal bile acid profiles and gut microbial changes associated with CDI in children with UC. METHODS: This study was conducted at Children's Hospital Los Angeles. Faecal bile acids and gut microbial genes related to bile acid metabolism were measured in 29 healthy children, 23 children with mild to moderate UC without prior CDI (UC group), 16 children with mild to moderate UC with prior CDI (UC+CDI group) and 10 children without UC with prior CDI (CDI group). RESULTS: Secondary faecal bile acids, especially lithocholic acid (3.296 vs 10.793, P ≤ 0.001) and ursodeoxycholic acid (7.414 vs 10.617, P ≤ 0.0001), were significantly lower in children with UC+CDI when compared to UC alone. Secondary faecal bile acids can predict disease status between these groups with 84.6% accuracy. Additionally, gut microbial genes coding for bile salt hydrolase, 7α-hydroxysteroid dehydrogenase and 7α/β-dehydroxylation were all diminished in children with UC+CDI compared to children with UC alone. CONCLUSIONS: Bile acids can distinguish between children with UC based on their prior CDI status. Bile acid profile changes can be explained by gut microbial genes encoding bile salt hydrolase, 7α-hydroxysteroid dehydrogenase and 7α/β-dehydroxylation. Bile acid profiles may be helpful as biomarkers to identify UC children who have had CDI and may serve as future therapeutic targets.


Subject(s)
Clostridioides difficile , Clostridium Infections , Colitis, Ulcerative , Bile Acids and Salts , Child , Clostridioides , Clostridium Infections/diagnosis , Colitis, Ulcerative/diagnosis , Humans
11.
BMC Bioinformatics ; 20(1): 53, 2019 Jan 28.
Article in English | MEDLINE | ID: mdl-30691412

ABSTRACT

BACKGROUND: Local similarity analysis (LSA) of time series data has been extensively used to investigate the dynamics of biological systems in a wide range of environments. Recently, a theoretical method was proposed to approximately calculate the statistical significance of local similarity (LS) scores. However, the method assumes that the time series data are independent and identically distributed, an assumption that can be violated in many problems. RESULTS: In this paper, we develop a novel approach to accurately approximate the statistical significance of LSA for dependent time series data using a nonparametric kernel estimate of the long-run variance. We also investigate an alternative method for LSA statistical significance approximation by computing the local similarity score of the residuals based on a predefined statistical model. We show by simulations that both methods have controllable type I errors for dependent time series, while other approaches for statistical significance can be grossly oversized. We apply both methods to human and marine microbial datasets, where most of the possible significant associations are captured and false positives are efficiently controlled. CONCLUSIONS: Our methods provide fast and effective approaches for evaluating statistical significance of dependent time series data with controllable type I error. They can be applied to a variety of time series data to reveal inherent relationships among the different factors.


Subject(s)
Algorithms , Models, Statistical , Aquatic Organisms/microbiology , Databases as Topic , Female , Humans , Male , Microbiota , Time Factors
12.
Stat Appl Genet Mol Biol ; 17(6)2018 11 17.
Article in English | MEDLINE | ID: mdl-30447151

ABSTRACT

In recent years, a large amount of time series microbial community data has been produced in molecular biological studies, especially in metagenomics. Among the statistical methods for time series, local similarity analysis is used in a wide range of environments to capture potential local and time-shifted associations that cannot be distinguished by traditional correlation analysis. Initially, the permutation test was widely applied to obtain the statistical significance of local similarity analysis. More recently, a theoretical method has also been developed to achieve this aim. However, all these methods require the assumption that the time series are independent and identically distributed. In this paper, we propose a new approach based on moving block bootstrap to approximate the statistical significance of local similarity scores for dependent time series. Simulations show that our method can control the type I error rate reasonably, while theoretical approximation and the permutation test perform less well. Finally, our method is applied to human and marine microbial community datasets, indicating that it can identify potential relationships among operational taxonomic units (OTUs) and significantly decrease the rate of false positives.


Subject(s)
Metagenomics , Models, Statistical , Algorithms , Databases, Genetic , Humans , Metagenomics/methods , Metagenomics/standards
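The resampling step behind a moving block bootstrap can be sketched generically: overlapping blocks of consecutive observations are drawn with replacement and concatenated, so that short-range dependence inside each block is preserved. The code below shows only that generic step. The function name, block length and seed are hypothetical, and how the paper turns the resampled series into a significance level for local similarity scores is not reproduced here.

```python
import random

def moving_block_bootstrap(series, block_length, rng):
    """Return one resampled series of the same length as `series`.

    Blocks of `block_length` consecutive points are drawn with replacement
    from all overlapping blocks and concatenated; the tail is trimmed.
    """
    n = len(series)
    if not 1 <= block_length <= n:
        raise ValueError("block_length must be between 1 and len(series)")
    n_blocks = -(-n // block_length)  # ceiling division
    resampled = []
    for _ in range(n_blocks):
        start = rng.randrange(n - block_length + 1)
        resampled.extend(series[start:start + block_length])
    return resampled[:n]

# Hypothetical usage with a fixed seed so the draw is repeatable.
rng = random.Random(0)
series = list(range(10))
resampled = moving_block_bootstrap(series, 3, rng)
```

A block length of 1 would reduce this to ordinary resampling with replacement, which discards exactly the dependence the abstract says the earlier methods ignore.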
13.
Med Biol Eng Comput ; 53(11): 1113-27, 2015 Nov.
Article in English | MEDLINE | ID: mdl-26429348

ABSTRACT

High-resolution fetal electrocardiogram (FECG) plays an important role in assisting physicians to detect fetal changes in the womb and to make clinical decisions. However, in real situations, clear FECG is difficult to extract because it is usually overwhelmed by the dominant maternal ECG and other contaminating noise such as baseline wander and high-frequency noise. In this paper, we proposed a novel integrated adaptive algorithm based on independent component analysis (ICA), ensemble empirical mode decomposition (EEMD), and wavelet shrinkage (WS) denoising, denoted as ICA-EEMD-WS, for FECG separation and noise reduction. First, the ICA algorithm was used to separate the mixed abdominal ECG signal and to obtain the noisy FECG. Second, the noise in FECG was reduced by a three-step integrated algorithm comprising EEMD, statistical inference of useful subcomponents and WS processing, and partial reconstruction for baseline wander reduction. Finally, we evaluated the proposed algorithm using simulated data sets. The results indicated that the proposed ICA-EEMD-WS outperformed the conventional algorithms in signal denoising.


Subject(s)
Algorithms , Electrocardiography/methods , Fetal Monitoring/methods , Wavelet Analysis , Computer Simulation , Female , Humans , Models, Statistical , Pregnancy
14.
Comput Math Methods Med ; 2014: 203871, 2014.
Article in English | MEDLINE | ID: mdl-24899916

ABSTRACT

Based on the detailed hydrophobic-hydrophilic (HP) model of amino acids, we propose a dual-vector curve (DV-curve) representation of protein sequences, which uses two vectors to represent one alphabet of protein sequences. This graphical representation not only avoids degeneracy, but also has good visualization no matter how long these sequences are, and can reflect the length of the protein sequence. Then we transform the 2D graphical representation into a numerical characterization that can facilitate quantitative comparison of protein sequences. The utility of this approach is illustrated by two examples: one is a similarity/dissimilarity comparison among different ND6 protein sequences based on their DV-curve figures; the other is a phylogenetic analysis of coronaviruses based on their spike proteins.


Subject(s)
Amino Acids/chemistry , Proteins/chemistry , Algorithms , Animals , Computational Biology/methods , Computer Graphics , Computer Simulation , Coronavirus/metabolism , DNA/chemistry , Humans , Hydrophobic and Hydrophilic Interactions , Mice , Models, Theoretical , Phylogeny , Rats , Reproducibility of Results
15.
J Comput Biol ; 19(6): 839-54, 2012 Jun.
Article in English | MEDLINE | ID: mdl-22697250

ABSTRACT

Next generation sequencing (NGS) technologies are now widely used in many biological studies. In NGS, sequence reads are randomly sampled from the genome sequence of interest. Most computational approaches for NGS data first map the reads to the genome and then analyze the data based on the mapped reads. Since many organisms have unknown genome sequences and many reads cannot be uniquely mapped to the genomes even if the genome sequences are known, alternative analytical methods are needed for the study of NGS data. Here we suggest using word patterns to analyze NGS data. Word pattern counting (the study of the probabilistic distribution of the number of occurrences of word patterns in one or multiple long sequences) has played an important role in molecular sequence analysis. However, no studies are available on the distribution of the number of occurrences of word patterns in NGS reads. In this article, we build probabilistic models for the background sequence and the sampling process of the sequence reads from the genome. Based on the models, we provide normal and compound Poisson approximations for the number of occurrences of word patterns from the sequence reads, with bounds on the approximation error. The main challenge is to consider the randomness in generating the long background sequence, as well as in the sampling of the reads using NGS. We show the accuracy of these approximations under a variety of conditions for different patterns with various characteristics. Under realistic assumptions, the compound Poisson approximation seems to outperform the normal approximation in most situations. These approximate distributions can be used to evaluate the statistical significance of the occurrence of patterns from NGS data. The theory and the computational algorithm for calculating the approximate distributions are then used to analyze ChIP-Seq data using transcription factor GABP. 
Software is available online (www-rcf.usc.edu/~fsun/Programs/NGS_motif_power/NGS_motif_power.html). In addition, Supplementary Material can be found online (www.liebertonline.com/cmb).


Subject(s)
Algorithms , Chromosome Mapping/statistics & numerical data , Sequence Analysis, DNA/statistics & numerical data , Software , Chromosome Mapping/methods , GA-Binding Protein Transcription Factor/genetics , Genome, Human , High-Throughput Nucleotide Sequencing , Humans , Models, Statistical , Poisson Distribution , Sequence Analysis, DNA/methods , Trans-Activators/genetics
16.
Front Biosci (Schol Ed) ; 4(4): 1333-43, 2012 06 01.
Article in English | MEDLINE | ID: mdl-22652875

ABSTRACT

Metagenomics commonly refers to the study of genetic materials directly derived from environments without culturing. Several ongoing large-scale metagenomics projects related to human and marine life, as well as pedology studies, have generated enormous amounts of data, posing a key challenge for efficient analysis, as we try to 1) understand microbial organism assemblage under different conditions, 2) compare different communities, and 3) understand how microbial organisms associate with each other and the environment. To address such questions, investigators are using new sequencing technologies, including Sanger, Illumina Solexa, and Roche 454, to sequence either particular genes, called tag sequences, mostly 16S or 18S ribosomal RNA sequences or other conserved genes, or whole metagenome shotgun sequences of all the genetic materials in a given community. In this paper, we review computational methods used for the analysis of tag sequences.


Subject(s)
Expressed Sequence Tags , Metagenomics/methods , Algorithms , Animals , Environment , Humans , Metagenome , RNA, Ribosomal, 16S/genetics
17.
BMC Bioinformatics ; 12: 118, 2011 Apr 25.
Article in English | MEDLINE | ID: mdl-21518444

ABSTRACT

BACKGROUND: Beta diversity, which involves the assessment of differences between communities, is an important problem in ecological studies. Many statistical methods have been developed to quantify beta diversity, and among them, UniFrac and weighted-UniFrac (W-UniFrac) are widely used. The W-UniFrac is a weighted sum of branch lengths in a phylogenetic tree of the sequences from the communities. However, W-UniFrac does not consider the variation of the weights under random sampling resulting in less power detecting the differences between communities. RESULTS: We develop a new statistic termed variance adjusted weighted UniFrac (VAW-UniFrac) to compare two communities based on the phylogenetic relationships of the individuals. The VAW-UniFrac is used to test if the two communities are different. To test the power of VAW-UniFrac, we first ran a series of simulations which revealed that it always outperforms W-UniFrac, as well as UniFrac when the individuals are not uniformly distributed. Next, all three methods were applied to analyze three large 16S rRNA sequence collections, including human skin bacteria, mouse gut microbial communities, microbial communities from hypersaline soil and sediments, and a tropical forest census data. Both simulations and applications to real data show that VAW-UniFrac can satisfactorily measure differences between communities, considering not only the species composition but also abundance information. CONCLUSIONS: VAW-UniFrac can recover biological insights that cannot be revealed by other beta diversity measures, and it provides a novel alternative for comparing communities.


Subject(s)
Bacteria/classification , Metagenome , Animals , Bacteria/genetics , Ecology , Gastrointestinal Tract/microbiology , Humans , Mice , Phylogeny , RNA, Bacterial/genetics , RNA, Ribosomal, 16S/genetics , Skin/microbiology , Soil Microbiology
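The "weighted sum of branch lengths" mentioned in this abstract can be made concrete with the usual raw weighted UniFrac sum: each branch contributes its length times the absolute difference between the fractions of the two communities that descend from it. The sketch below assumes the tree has already been reduced to per-branch descendant counts. The function name, the tuple layout and the toy tree are hypothetical, and the variance adjustment that defines VAW-UniFrac is not reproduced.

```python
def weighted_unifrac(branches, total_a, total_b):
    """Raw (unnormalised) weighted UniFrac.

    `branches` is an iterable of (length, count_a, count_b), where the counts
    are the individuals from communities A and B descending from that branch.
    """
    return sum(
        length * abs(count_a / total_a - count_b / total_b)
        for length, count_a, count_b in branches
    )

# Hypothetical two-clade tree: community A sits entirely in one clade,
# community B entirely in the other; both branches have length 1.0.
disjoint = weighted_unifrac([(1.0, 3, 0), (1.0, 0, 3)], total_a=3, total_b=3)

# Same tree, but both communities are split evenly between the clades.
identical = weighted_unifrac([(1.0, 2, 2), (1.0, 2, 2)], total_a=4, total_b=4)
```

Because each term depends on sampled counts, the weights vary from sample to sample, and that sampling variation is what the abstract says the unadjusted statistic ignores.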
18.
J Comput Biol ; 17(4): 581-92, 2010 Apr.
Article in English | MEDLINE | ID: mdl-20426691

ABSTRACT

The identification of binding sites of transcription factors (TF) and other regulatory regions, referred to as motifs, located in a set of molecular sequences is of fundamental importance in genomic research. Many computational and experimental approaches have been developed to locate motifs. The set of sequences of interest can be concatenated to form a long sequence of length n. One of the successful approaches for motif discovery is to identify statistically over- or under-represented patterns in this long sequence. A pattern refers to a fixed word W over the alphabet. In the example of interest, W is a word in the set of patterns of the motif. Despite extensive studies on motif discovery, no studies have been carried out on the power of detecting statistically over- or under-represented patterns. Here we address the issue of how the known presence of random instances of a known motif affects the power of detecting patterns, such as patterns within the motif. Let N(W)(n) be the number of possibly overlapping occurrences of a pattern W in the sequence that contains instances of a known motif; such a sequence is modeled here by a Hidden Markov Model (HMM). First, efficient computational methods for calculating the mean and variance of N(W)(n) are developed. Second, efficient computational methods for calculating parameters involved in the normal approximation of N(W)(n) for frequent patterns and compound Poisson approximation of N(W)(n) for rare patterns are developed. Third, an easy to use web program is developed to calculate the power of detecting patterns and the program is used to study the power of detection in several interesting biological examples.


Subject(s)
Markov Chains , Pattern Recognition, Automated/methods , Sequence Analysis, DNA/methods , Base Composition/genetics , Base Sequence , CpG Islands/genetics , Internet , Numerical Analysis, Computer-Assisted , Poisson Distribution
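The quantity N(W)(n) defined in this abstract, the number of possibly overlapping occurrences of a word W in a sequence of length n, can be counted directly. The sketch below also gives the expected count under a much simpler background than the paper's: independent, identically distributed letters, which is an assumption made here purely for illustration in place of the hidden Markov model. The function names and the toy sequence are hypothetical.

```python
def count_overlapping(sequence, word):
    """Count occurrences of `word` in `sequence`, allowing overlaps."""
    count = 0
    start = sequence.find(word)
    while start != -1:
        count += 1
        start = sequence.find(word, start + 1)  # advance by one to allow overlaps
    return count

def expected_count_iid(n, word, letter_probs):
    """Expected overlapping count of `word` in an i.i.d. sequence of length n.

    Each of the n - len(word) + 1 start positions matches with probability
    equal to the product of the word's letter probabilities.
    """
    match_prob = 1.0
    for letter in word:
        match_prob *= letter_probs[letter]
    return max(0, n - len(word) + 1) * match_prob

# Hypothetical toy example with a uniform DNA background.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
observed = count_overlapping("AAAA", "AA")
expected = expected_count_iid(4, "AA", uniform)
```

Comparing an observed count with its expectation is the basic idea behind calling a pattern over- or under-represented; the approximations in the abstract supply the distribution needed to attach a significance level to that comparison.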
19.
BMC Bioinformatics ; 10: 277, 2009 Sep 03.
Article in English | MEDLINE | ID: mdl-19728875

ABSTRACT

BACKGROUND: Many aspects of biological functions can be modeled by biological networks, such as protein interaction networks, metabolic networks, and gene coexpression networks. Studying the statistical properties of these networks in turn allows us to infer biological function. Complex statistical network models can potentially more accurately describe the networks, but it is not clear whether such complex models are better suited to find biologically meaningful subnetworks. RESULTS: Recent studies have shown that the degree distribution of the nodes is not an adequate statistic in many molecular networks. We sought to extend this statistic with 2nd and 3rd order degree correlations and developed a pseudo-likelihood approach to estimate the parameters. The approach was used to analyze the MIPS and BIOGRID yeast protein interaction networks, and two yeast coexpression networks. We showed that 2nd order degree correlation information gave better predictions of gene interactions in both protein interaction and gene coexpression networks. However, in the biologically important task of predicting functionally homogeneous modules, degree correlation information performs marginally better in the case of the MIPS and BIOGRID protein interaction networks, but worse in the case of gene coexpression networks. CONCLUSION: Our use of dK models showed that incorporation of degree correlations could increase predictive power in some contexts, albeit sometimes marginally, but, in all contexts, the use of third-order degree correlations decreased accuracy. However, it is possible that other parameter estimation methods, such as maximum likelihood, will show the usefulness of incorporating 2nd and 3rd degree correlations in predicting functionally homogeneous modules.


Subject(s)
Computational Biology/methods , Gene Regulatory Networks , Models, Statistical , Protein Interaction Mapping/methods
20.
Biostatistics ; 9(1): 100-13, 2008 Jan.
Article in English | MEDLINE | ID: mdl-17513311

ABSTRACT

One important problem in genomic research is to identify genomic features such as gene expression data or DNA single nucleotide polymorphisms (SNPs) that are related to clinical phenotypes. Often these genomic data can be naturally divided into biologically meaningful groups such as genes belonging to the same pathways or SNPs within genes. In this paper, we propose group additive regression models and a group gradient descent boosting procedure for identifying groups of genomic features that are related to clinical phenotypes. Our simulation results show that by dividing the variables into appropriate groups, we can obtain better identification of the group features that are related to the phenotypes. In addition, the prediction mean square errors are also smaller than the component-wise boosting procedure. We demonstrate the application of the methods to pathway-based analysis of microarray gene expression data of breast cancer. Results from analysis of a breast cancer microarray gene expression data set indicate that the pathways of metalloendopeptidases (MMPs) and MMP inhibitors, as well as cell proliferation, cell growth, and maintenance are important to breast cancer-specific survival.


Subject(s)
Data Interpretation, Statistical , Genome, Human , Models, Genetic , Regression Analysis , Breast Neoplasms/enzymology , Breast Neoplasms/genetics , Female , Gene Expression Profiling/methods , Humans , Matrix Metalloproteinase Inhibitors , Matrix Metalloproteinases/genetics , Polymorphism, Single Nucleotide , Survival Analysis