1.
Artif Life ; 30(1): 16-27, 2024 Feb 01.
Article in English | MEDLINE | ID: mdl-38358121

ABSTRACT

In the mid-20th century, two new scientific disciplines emerged forcefully: molecular biology and information-communication theory. At the beginning, cross-fertilization was so deep that the term genetic code was universally accepted for describing the meaning of triplets of mRNA (codons) as amino acids. However, today, such synergy has not taken advantage of the vertiginous advances in the two disciplines and presents more challenges than answers. These challenges not only are of great theoretical relevance but also represent unavoidable milestones for next-generation biology: from personalized genetic therapy and diagnosis to Artificial Life to the production of biologically active proteins. Moreover, the matter is intimately connected to a paradigm shift needed in theoretical biology, pioneered a long time ago, that requires combined contributions from disciplines well beyond the biological realm. The use of information as a conceptual metaphor needs to be turned into quantitative and predictive models that can be tested empirically and integrated in a unified view. Successfully achieving these tasks requires a wide multidisciplinary approach, including Artificial Life researchers, to address such an endeavour.


Subject(s)
Biology, Genetic Code
2.
PLoS Comput Biol ; 20(1): e1011809, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38295113

ABSTRACT

Data integration methods are used to obtain a unified summary of multiple datasets. For multi-modal data, we propose a computational workflow to jointly analyze datasets from cell lines. The workflow comprises a novel probabilistic data integration method, named POPLS-DA, for multi-omics data. The workflow is motivated by a study on synucleinopathies where transcriptomics, proteomics, and drug screening data are measured in affected LUHMES cell lines and controls. The aim is to highlight potentially druggable pathways and genes involved in synucleinopathies. First, POPLS-DA is used to prioritize genes and proteins that best distinguish cases and controls. For these genes, an integrated interaction network is constructed where the drug screen data is incorporated to highlight druggable genes and pathways in the network. Finally, functional enrichment analyses are performed to identify clusters of synaptic and lysosome-related genes and proteins targeted by the protective drugs. POPLS-DA is compared to other single- and multi-omics approaches. We found that HSPA5, a member of the heat shock protein 70 family, was one of the most targeted genes by the validated drugs, in particular by AT1-blockers. HSPA5 and AT1-blockers have been previously linked to α-synuclein pathology and Parkinson's disease, showing the relevance of our findings. Our computational workflow identified new directions for therapeutic targets for synucleinopathies. POPLS-DA provided a larger interpretable gene set than other single- and multi-omic approaches. An implementation based on R and markdown is freely available online.


Subject(s)
Computational Biology, Synucleinopathies, Humans, Computational Biology/methods, Multiomics, Preclinical Drug Evaluation, Proteomics/methods
3.
J Am Soc Mass Spectrom ; 33(5): 813-822, 2022 May 04.
Article in English | MEDLINE | ID: mdl-35385652

ABSTRACT

Experimental measurement of time-dependent spontaneous exchange of amide protons with deuterium of the solvent provides information on the structure and dynamical structural variation in proteins. Two experimental techniques are used to probe the exchange: NMR, which relies on different magnetic properties of hydrogen and deuterium, and MS, which exploits the change in mass due to deuteration. NMR provides residue-specific information, that is, the rate of exchange or, analogously, the protection factor (i.e., the unitless ratio between the rate of exchange for a completely unstructured state and the observed rate). MS provides information that is specific to peptides obtained by proteolytic digestion. The spatial resolution of HDX-MS measurements depends on the proteolytic pattern of the protein, the fragmentation method used, and the overlap between peptides. Different computational approaches have been proposed to extract residue-specific information from peptide-level HDX-MS measurements. Here, we demonstrate the advantages of a method recently proposed that exploits self-consistency and classifies the possible sets of protection factors into a finite number of alternative solutions compatible with experimental data. The degeneracy of the solutions can be reduced (or completely removed) by exploiting the additional information encoded in the shape of the isotopic envelopes. We show how sparse and noisy MS data can provide high-resolution protection factors that correlate with NMR measurements probing the same protein under the same conditions.


Subject(s)
Deuterium Exchange Measurement, Hydrogen, Deuterium/chemistry, Deuterium Exchange Measurement/methods, Hydrogen/chemistry, Magnetic Resonance Spectroscopy, Mass Spectrometry/methods, Peptides/chemistry, Proteins/chemistry
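
As a rough illustration of the protection factor defined in the abstract above (the ratio between the exchange rate of a completely unstructured state and the observed rate), the sketch below computes the expected deuterium uptake of a peptide and shows why peptide-level data alone can leave several residue-level solutions. It is not the authors' method: it assumes independent first-order exchange at each amide, full deuteration, and no back-exchange, and the function names and all numeric values are hypothetical.

```python
import math


def observed_rate(k_int, protection_factor):
    """Observed exchange rate: intrinsic (unstructured) rate divided by the protection factor."""
    return k_int / protection_factor


def peptide_uptake(t, k_ints, protection_factors):
    """Expected number of deuterons taken up by a peptide at time t.

    Assumption: each amide exchanges independently with first-order kinetics.
    """
    return sum(
        1.0 - math.exp(-observed_rate(k, p) * t)
        for k, p in zip(k_ints, protection_factors)
    )


# Hypothetical peptide with three amides that share the same intrinsic rate.
k_ints = [1.0, 1.0, 1.0]
# Two different residue-level assignments of the same protection factors.
assignment_a = [10.0, 1000.0, 100.0]
assignment_b = [1000.0, 100.0, 10.0]
times = [1.0, 10.0, 100.0, 1000.0]

for t in times:
    # Both assignments give the same peptide-level uptake curve, so this
    # peptide alone cannot tell them apart.
    print(t, peptide_uptake(t, k_ints, assignment_a), peptide_uptake(t, k_ints, assignment_b))
```

Under these assumptions the two assignments are indistinguishable from a single peptide; overlapping peptides or isotopic envelope shapes, as the abstract describes, are what narrow the alternatives.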
4.
Biochemistry ; 60(39): 2932-2942, 2021 Oct 05.
Article in English | MEDLINE | ID: mdl-34519197

ABSTRACT

Cytochrome P450cam (CYP101A1) catalyzes the regio- and stereo-specific 5-exo-hydroxylation of camphor via a multistep catalytic cycle that involves two electron transfer steps, with an absolute requirement that the second electron be donated by the ferredoxin, putidaredoxin (Pdx). Whether P450cam, once camphor has bound to the active site and the substrate entry channel has closed, opens up upon Pdx binding during the second electron transfer step or remains closed is still a matter of debate. A potential allosteric site for camphor binding has been identified and postulated to play a role in the binding of Pdx. Here, we have revisited paramagnetic NMR spectroscopy data and determined a heterogeneous ensemble of structures that explains the data, provides a complete representation of the P450cam/Pdx complex in solution, and reconciles alternative hypotheses. The allosteric camphor binding site is always present, and the conformational changes induced by camphor binding to this site facilitate Pdx binding. We also determined that the state to which Pdx binds comprises an ensemble of structures that have features of both the open and closed states. These results demonstrate that there is a finely balanced interaction between allosteric camphor binding and the binding of Pdx at high camphor concentrations.


Subject(s)
Bacterial Proteins/chemistry, Bacterial Proteins/metabolism, Camphor 5-Monooxygenase/chemistry, Camphor 5-Monooxygenase/metabolism, Camphor/chemistry, Ferredoxins/metabolism, Pseudomonas putida/enzymology, Allosteric Regulation, Camphor/metabolism, Catalytic Domain, X-Ray Crystallography/methods, Magnetic Resonance Spectroscopy/methods, Molecular Models, Protein Binding, Protein Conformation, Pseudomonas putida/chemistry
5.
BMC Bioinformatics ; 22(1): 131, 2021 Mar 18.
Article in English | MEDLINE | ID: mdl-33736604

ABSTRACT

BACKGROUND: Nowadays, multiple omics data are measured on the same samples in the belief that these different omics datasets represent various aspects of the underlying biological systems. Integrating these omics datasets will facilitate the understanding of the systems. For this purpose, various methods have been proposed, such as Partial Least Squares (PLS), decomposing two datasets into joint and residual subspaces. Since omics data are heterogeneous, the joint components in PLS will contain variation specific to each dataset. To account for this, Two-way Orthogonal Partial Least Squares (O2PLS) captures the heterogeneity by introducing orthogonal subspaces and better estimates the joint subspaces. However, the latent components spanning the joint subspaces in O2PLS are linear combinations of all variables, while it might be of interest to identify a small subset relevant to the research question. To obtain sparsity, we extend O2PLS to Group Sparse O2PLS (GO2PLS) that utilizes biological information on group structures among variables and performs group selection in the joint subspace. RESULTS: The simulation study showed that introducing sparsity improved the feature selection performance. Furthermore, incorporating group structures increased robustness of the feature selection procedure. GO2PLS performed optimally in terms of accuracy of joint score estimation, joint loading estimation, and feature selection. We applied GO2PLS to datasets from two studies: TwinsUK (a population study) and CVON-DOSIS (a small case-control study). In the first, we incorporated biological information on the group structures of the methylation CpG sites when integrating the methylation dataset with the IgG glycomics data. The targeted genes of the selected methylation groups turned out to be relevant to the immune system, in which the IgG glycans play important roles. In the second, we selected regulatory regions and transcripts that explained the covariance between regulomics and transcriptomics data. The corresponding genes of the selected features appeared to be relevant to heart muscle disease. CONCLUSIONS: GO2PLS integrates two omics datasets to help understand the underlying system that involves both omics levels. It incorporates external group information and performs group selection, resulting in a small subset of features that best explain the relationship between two omics datasets for better interpretability.


Subject(s)
Computational Biology, Genomics, Case-Control Studies, Least-Squares Analysis
6.
Theor Biol Forum ; 114(1-2): 15-28, 2021 Jan 01.
Article in English | MEDLINE | ID: mdl-35502728

ABSTRACT

Longitudinal functional data are increasingly common in the health domain. The motivating dataset for this paper comprises H-NMR spectra of kidney transplant patients [8]. Our aim is to cluster patients into different clinical outcome subgroups to reveal the success of the transplantation. The NMR spectra of each patient at each time point are functional data, and the data are longitudinally collected at up to nine different time points. Existing methods are available for functional data collected at one time point, but not for longitudinal functional data collected at a grid of time points subject to missingness. We therefore first apply a method to extract the same number of functional features for each subject. Next, we propose a novel nonparametric clustering method for multivariate functional data. We applied our proposed clustering method to the kidney transplant dataset, both to a subset of the raw data with only two time points and to the extracted functional features. It appeared that the proposed method achieves better clustering performance on the extracted functional features than on the subset of raw data. A simulation study was performed to further evaluate the method. The design mimicked the kidney transplant dataset but with a larger sample size. Scenarios with different levels of noise were considered. The simulation study showed the accuracy of our proposed method.


Subject(s)
Kidney Transplantation, Cluster Analysis, Humans, Magnetic Resonance Imaging, Magnetic Resonance Spectroscopy, Research Design
7.
Theor Biol Forum ; 114(1-2): 29-44, 2021 Jan 01.
Article in English | MEDLINE | ID: mdl-35502729

ABSTRACT

Down syndrome (DS) is a condition that leads to precocious and accelerated aging in affected subjects. Several alterations in DS cases have been reported at a molecular level, particularly in methylation and glycosylation. Investigating the relation between methylation, glycomics and DS can lead to new insights underlying the atypical aging. We consider a data integration approach, where we investigate how DS affects the parts of glycomics and methylation which are correlated, and which CpG sites and glycans are relevant. Our motivating datasets consist of methylation and glycomics data, measured on 29 DS patients and their unaffected siblings and mothers. The family-based case-control design needs to be taken into account when studying the relationship between methylation, glycomics and DS. We propose a two-stage approach to first integrate methylation and glycomics data, and then link the joint information to Down syndrome. For the data integration step, we consider probabilistic two-way orthogonal partial least squares (PO2PLS). PO2PLS models two omics datasets in terms of low-dimensional joint and omic-specific latent components, and takes into account heterogeneity across the omics data. The relationship between the omics data can be statistically tested. The joint components represent the joint information in methylation and glycomics. In the second stage, we apply a linear mixed model to the relationship between DS and the joint methylation and glycomics components. For the components that are significantly associated with DS, we identify the most important CpG sites and glycans. A simulation study is conducted to evaluate the performance of our approach. The results showed that the effects of DS on the omics data can be detected in a large sample size, and the accuracy of the feature selection was high in both small and large sample sizes. Our approach is applied to the DS datasets, and a significant effect of DS on the joint components is found. The identified CpG sites and glycans appeared to be related to DS. Our proposed method that jointly analyzes multiple omics data with an outcome variable may provide new insight into the molecular implications of DS at different omics levels.


Subject(s)
Down Syndrome, Glycomics, DNA Methylation, Down Syndrome/genetics, Female, Glycomics/methods, Humans, Polysaccharides, Post-Translational Protein Processing
8.
Theor Biol Forum ; 114(1-2): 45-58, 2021 Jan 01.
Article in English | MEDLINE | ID: mdl-35502730

ABSTRACT

In survival analysis, the effect of a covariate on the outcome is reported in a hazard rate. However, hazard rates are hard to interpret. Here we consider differences in survival probabilities instead. Using data on twins is interesting because many observed and unobserved factors are controlled or matched. To model the correlation between twins, some authors have proposed survival models with frailties or random effects. However, there is a potential danger of bias in the estimation if the frailty distribution is misspecified. Frailties are often assumed to follow a gamma distribution. To safeguard against the impact of misspecifying this distribution, we consider a flexible non-parametric baseline hazard in addition to a parametric one. We will apply this methodology to the TwinsUK cohort to predict the probability of experiencing a fracture in the next five or ten years, given their bone mineral densities (BMD) and their frailty index. The models with parametric and non-parametric baseline hazards yield very close results in estimating survival probabilities and thus a choice of parametric baseline hazard is generally preferred. We find that bone mineral density is a significant predictor in the model whereas frailty index is not. Low BMD leads to a larger probability of fracture; e.g., in 10 years, the probability of fracture is 21% for the low BMD group, 16% for the medium BMD group and 8% for the high BMD group.


Subject(s)
Frailty, Humans, Likelihood Functions, Statistical Models, Proportional Hazards Models, Survival Analysis
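
To make concrete why the abstract above reports differences in survival probabilities rather than hazards, the sketch below turns a log hazard ratio into a difference in event probability at a fixed horizon. It is a simplified illustration, not the paper's model: it assumes proportional hazards with a constant (exponential) parametric baseline hazard and no frailty term, and the function names and parameter values are hypothetical.

```python
import math


def survival(t, baseline_rate, log_hazard_ratio, x):
    """Survival probability at time t for covariate value x.

    Assumptions: constant baseline hazard and proportional hazards, so the
    cumulative hazard is baseline_rate * t * exp(log_hazard_ratio * x).
    """
    cumulative_hazard = baseline_rate * t * math.exp(log_hazard_ratio * x)
    return math.exp(-cumulative_hazard)


def risk_difference(t, baseline_rate, log_hazard_ratio, x_a, x_b):
    """Difference in event probability by time t between covariate values x_a and x_b."""
    risk_a = 1.0 - survival(t, baseline_rate, log_hazard_ratio, x_a)
    risk_b = 1.0 - survival(t, baseline_rate, log_hazard_ratio, x_b)
    return risk_a - risk_b


# Hypothetical inputs: baseline hazard per year and log hazard ratio per covariate unit.
baseline_rate = 0.01
log_hazard_ratio = 0.5

for horizon in (5.0, 10.0):
    # The same hazard ratio gives a different absolute risk difference at each horizon.
    print(horizon, risk_difference(horizon, baseline_rate, log_hazard_ratio, 1.0, 0.0))
```

The hazard ratio is the same at both horizons, while the absolute difference in event probability changes with the horizon, which is the quantity a reader can interpret directly.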
9.
Theor Biol Forum ; 114(1-2): 59-73, 2021 Jan 01.
Article in English | MEDLINE | ID: mdl-35502731

ABSTRACT

Multiple technologies exist that measure the same omics data set but are based on different aspects of the molecules. In practice, studies use different technologies and therefore have different biomarkers. An example is the glycan age index, which is constructed from three different ultra-performance liquid chromatography (UPLC) IgG glycans and is a biomarker for biological age. A second technology is liquid chromatography-mass spectrometry (LCMS). To estimate the effect of a biomarker on an outcome variable, two issues need to be addressed. Firstly, a measurement error model is needed to map one technology to the other using a calibration study. Here, we consider two approaches, namely one based on the chemical properties of the two technologies and one based on the estimation of this relationship using O2PLS. Secondly, the use of an approximation of the biomarker in the main study needs to be taken into account by use of a regression calibration method. The performance of the two approaches is studied via simulations. The methods are used to estimate the relationship between glycan age and menopause. We have data from two cohorts, namely Korcula and Vis. In conclusion, (1) both measurement error models give similar results and suggest that there is an association between the glycan age index and the menopause status, (2) the chemical mapping approach outperforms O2PLS when the measurement error variance is low, while O2PLS works better when the measurement error variance is larger, (3) statistical efficiency is lost due to increased noise level by adding irrelevant information.


Subject(s)
Polysaccharides, Biomarkers, Calibration, Female, Humans, Mass Spectrometry/methods, Regression Analysis
10.
Theor Biol Forum ; 114(1-2): 75-88, 2021 Jan 01.
Article in English | MEDLINE | ID: mdl-35502732

ABSTRACT

The use of statistical methods to predict outcomes using high dimensional datasets in medicine is becoming increasingly popular for forecasting and monitoring patient health. Our work is motivated by a longitudinal dataset containing 1H NMR spectra of metabolites of 18 patients undergoing a kidney transplant alongside their graft outcomes that fall into one of three categories: acute rejection, delayed graft function and primary function. We proposed a functional partial least squares (FPLS) model that extends existing PLS methods for the analysis of longitudinally measured scalar omics datasets to the case of longitudinally measured functional datasets. We designed an iterative algorithm to link multiple time points, and then applied our proposed method to analyse the data from kidney transplant patients. Finally, we compared the AUC of our method to the AUC of the univariate methods, which use information from only one time point. It appeared that our method outperforms the existing methods. A simulation study was performed to mimic the kidney transplant dataset but with a larger sample size, and different scenarios were used to evaluate the performance of the new method in larger datasets. We consider scenarios which vary in the difficulty of distinguishing the two groups. It appeared that the three time-points model performs better than any of the individual models with average AUCs of 0.909 and 0.811 respectively.


Subject(s)
Algorithms, Computer Simulation, Humans, Least-Squares Analysis, Proton Magnetic Resonance Spectroscopy, Sample Size
11.
Biom J ; 63(4): 745-760, 2021 Apr.
Article in English | MEDLINE | ID: mdl-33350510

ABSTRACT

Advancement of gene expression measurements in longitudinal studies enables the identification of genes associated with disease severity over time. However, problems arise when the technology used to measure gene expression differs between time points. Observed differences between the results obtained at different time points can be caused by technical differences. Modeling the two measurements jointly over time might provide insight into the causes of these different results. Our work is motivated by a study of gene expression data of blood samples from Huntington disease patients, which were obtained using two different sequencing technologies. At time point 1, DeepSAGE technology was used to measure the gene expression, with a subsample also measured using RNA-Seq technology. At time point 2, all samples were measured using RNA-Seq technology. Significant associations between gene expression measured by DeepSAGE and disease severity using data from the first time point could not be replicated by the RNA-Seq data from the second time point. We modeled the relationship between the two sequencing technologies using the data from the overlapping samples. We used linear mixed models with either DeepSAGE or RNA-Seq measurements as the dependent variable and disease severity as the independent variable. In conclusion, (1) for one out of 14 genes, the initial significant result could be replicated with both technologies using data from both time points; (2) statistical efficiency is lost due to disagreement between the two technologies, measurement error when predicting gene expressions, and the need to include additional parameters to account for possible differences.


Subject(s)
Huntington Disease, Gene Expression Profiling, Humans, Huntington Disease/genetics, Longitudinal Studies, Technology
12.
Mol Omics ; 16(3): 231-242, 2020 Jun 15.
Article in English | MEDLINE | ID: mdl-32211690

ABSTRACT

Rapid progress in high-throughput glycomics analysis enables researchers to conduct large sample studies. Typically, the between-subject differences in total abundance of raw glycomics data are very large, and it is necessary to reduce the differences, making measurements comparable across samples. Essentially there are two ways to approach this issue: row-wise and column-wise normalization. In glycomics, the differences per subject are usually forced to be exactly zero by scaling each sample so that the sum of all glycan intensities equals 100%. This total area (row-wise) normalization (TA) results in so-called compositional data, rendering many standard multivariate statistical methods inappropriate or inapplicable. Ignoring the compositional nature of the data, moreover, may lead to spurious results. Alternatively, a log-transformation of the raw data can be performed prior to column-wise normalization and the application of standard statistical tools. Until now, there has been no clear consensus on the appropriate normalization method for glycomics data, nor is a systematic investigation of the impact of TA on downstream analysis available to justify the choice of TA. Our motivation lies in efficient variable selection to identify glycan biomarkers with regard to accurate prediction as well as interpretability of the model chosen. Via extensive simulations we investigate how different normalization methods affect the performance of variable selection, and compare their performance. We also address the effect of various types of measurement error in glycans: additive, multiplicative and two-component error. We show that when sample-wise differences are not large, row-wise normalization (like TA) can have deleterious effects on variable selection and prediction.


Subject(s)
Biomarkers/analysis, Glycomics/methods, Algorithms, Calibration, Mass Spectrometry
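
The two normalization routes contrasted in the abstract above can be sketched in a few lines. The example is illustrative only: the intensity values are hypothetical, and mean-centering each log-transformed glycan is just one assumed form of column-wise normalization, since the abstract does not specify which one the authors used.

```python
import math


def total_area_normalize(sample):
    """Row-wise total area (TA) normalization: scale one sample so its intensities sum to 100%."""
    total = sum(sample)
    return [100.0 * x / total for x in sample]


def log_column_center(samples):
    """Log-transform raw intensities, then center each glycan (column) at its mean.

    Assumption: mean-centering stands in for column-wise normalization here.
    """
    logged = [[math.log(x) for x in row] for row in samples]
    n_samples = len(logged)
    n_glycans = len(logged[0])
    col_means = [sum(row[j] for row in logged) / n_samples for j in range(n_glycans)]
    return [[v - m for v, m in zip(row, col_means)] for row in logged]


# Hypothetical raw intensities for two samples and three glycans.
# Only the first glycan differs between the samples.
raw = [
    [10.0, 20.0, 70.0],
    [40.0, 20.0, 70.0],
]

ta = [total_area_normalize(row) for row in raw]
centered = log_column_center(raw)

# After TA, the unchanged second and third glycans appear lower in the second sample.
print(ta)
# After log-transform and centering, the unchanged glycans show no difference.
print(centered)
```

Because TA forces every row to sum to 100%, a change in one glycan shifts the relative values of all the others, which is one way the spurious results mentioned in the abstract can arise.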
13.
Front Genet ; 10: 1028, 2019.
Article in English | MEDLINE | ID: mdl-31781154

ABSTRACT

Background: Soil-transmitted helminths have been shown to have immune regulatory capacity, which they use to enhance their long-term survival within their host. As these parasites reside in the gastrointestinal tract, they might modulate the immune system by altering the gut bacterial composition. Although the relationships of helminth infections and of the microbiome with the immune system have been studied separately, their combined interactions are largely unknown. In this study we aim to analyze the relationship between bacterial communities and cytokine responses in the presence or absence of helminth infections. Results: For 66 subjects from a randomized placebo-controlled trial, stool and blood samples were available at both baseline and 21 months after starting three-monthly albendazole treatment. The stool samples were used to identify the helminth infection status and fecal microbiota composition, while whole blood samples were cultured to obtain cytokine responses to innate and adaptive stimuli. When subjects were free of helminth infection (helminth-negative), an increasing proportion of Bacteroidetes was associated with lower levels of IL-10 response to LPS {estimate [95% confidence interval (CI)] -1.96 (-3.05, -0.87)}. This association was significantly diminished when subjects were helminth-infected (helminth-positive) (p-value for the difference between helminth-negative and helminth-positive was 0.002). Higher diversity was associated with greater IFN-γ responses to PHA in helminth-negative [0.95 (0.15, 1.75)] versus helminth-positive [-0.07 (-0.88, 0.73); p-value = 0.056] subjects. Albendazole treatment showed no direct effect on the association between bacterial proportion and cytokine responses, although the Bacteroidetes' effect on IL-10 responses to LPS tended downward in the albendazole-treated group [-1.74 (-4.08, 0.59)] versus placebo [-0.11 (-0.84, 0.62); p-value = 0.193]. Conclusion: We observed differences in the relationship between gut microbiome composition and immune responses when comparing individuals infected or uninfected with geohelminths. Although these findings are part of a preliminary exploration, the data support the hypothesis that intestinal helminths may modulate immune responses in unison with the gut microbiota. Trial Registration: ISRCTN, ISRCTN83830814. Registered 27 February 2008 - Retrospectively registered, http://www.isrctn.com/ISRCTN83830814.

14.
Genet Med ; 21(12): 2706-2712, 2019 Dec.
Article in English | MEDLINE | ID: mdl-31204389

ABSTRACT

PURPOSE: Biallelic pathogenic variants in the mismatch repair (MMR) genes cause a recessive childhood cancer predisposition syndrome known as constitutional mismatch repair deficiency (CMMRD). Family members with a heterozygous MMR variant have Lynch syndrome. We aimed at estimating cancer risk in these heterozygous carriers as a novel approach to avoid complicated statistical methods to correct for ascertainment bias. METHODS: Cumulative colorectal cancer incidence was estimated in a cohort of PMS2- and MSH6-associated families, ascertained by the CMMRD phenotype of the index, by using mutation probabilities based on kinship coefficients as analytical weights in a proportional hazard regression on the cause-specific hazards. Confidence intervals (CIs) were obtained by bootstrapping at the family level. RESULTS: The estimated cumulative colorectal cancer risk at age 70 years for heterozygous PMS2 variant carriers was 8.7% (95% CI 4.3-12.7%) for both sexes combined, and 9.9% (95% CI 4.9-15.3%) for men and 5.9% (95% CI 1.6-11.1%) for women separately. For heterozygous MSH6 variant carriers these estimates are 11.8% (95% CI 4.5-22.7%) for both sexes combined, 10.0% (95% CI 1.83-24.5%) for men and 11.7% (95% CI 2.10-26.5%) for women. CONCLUSION: Our findings are consistent with previous reports that used more complex statistical methods to correct for ascertainment bias. These results underline the need for MMR gene-specific surveillance protocols for Lynch syndrome.


Subject(s)
Hereditary Nonpolyposis Colorectal Neoplasms/complications, Colorectal Neoplasms/etiology, Risk Assessment/methods, Adult, Aged, Cohort Studies, Hereditary Nonpolyposis Colorectal Neoplasms/genetics, Hereditary Nonpolyposis Colorectal Neoplasms/metabolism, DNA Mismatch Repair, DNA-Binding Proteins/genetics, DNA-Binding Proteins/metabolism, Female, Genetic Predisposition to Disease/genetics, Germ-Line Mutation, Humans, Incidence, Male, Middle Aged, Mismatch Repair Endonuclease PMS2/genetics, Mismatch Repair Endonuclease PMS2/metabolism, Mutation, Risk Factors
15.
Biophys J ; 116(7): 1194-1203, 2019 Apr 02.
Article in English | MEDLINE | ID: mdl-30885379

ABSTRACT

Hydrogen/deuterium exchange monitored by mass spectrometry is a promising technique for rapidly fingerprinting structural and dynamical properties of proteins. The time-dependent change in the mass of any fragment of the polypeptide chain depends uniquely on the rate of exchange of its amide hydrogens, but determining the latter from the former is generally not possible. Here, we show that, if time-resolved measurements are available for a number of overlapping peptides that cover the whole sequence, rate constants for each amide hydrogen exchange (or equivalently, their protection factors) may be extracted, with the uniqueness of the solutions obtained depending on the degree of peptide overlap. However, in most cases, the solution is not unique, and multiple alternatives must be considered. We provide a statistical method that clusters the solutions to further reduce their number. Such analysis always provides meaningful constraints on protection factors and can be used in situations in which obtaining more refined experimental data is impractical. It also provides a systematic way to improve data collection strategies to obtain unambiguous information at single-residue level (e.g., for assessing protein structure predictions at atomistic level).


Subject(s)
Deuterium/chemistry, Mass Spectrometry/methods, Peptides/chemistry, Amides/chemistry, Complement C3/chemistry, Hydrogen Bonding, Mass Spectrometry/standards
16.
Stat Med ; 38(12): 2248-2268, 2019 May 30.
Article in English | MEDLINE | ID: mdl-30761571

ABSTRACT

Clustered overdispersed multivariate count data are challenging to model due to the presence of correlation within and between samples. Typically, the first source of correlation needs to be addressed but its quantification is of less interest. Here, we focus on the correlation between time points. In addition, the effects of covariates on the multivariate counts distribution need to be assessed. To fulfill these requirements, a regression model based on the Dirichlet-multinomial distribution for association between covariates and the categorical counts is extended by using random effects to deal with the additional clustering. This model is the Dirichlet-multinomial mixed regression model. Alternatively, a negative binomial regression mixed model can be deployed where the corresponding likelihood is conditioned on the total count. It appears that these two approaches are equivalent when the total count is fixed and independent of the random effects. We consider both subject-specific and categorical-specific random effects. However, the latter has a larger computational burden when the number of categories increases. Our work is motivated by microbiome data sets obtained by sequencing of the amplicon of the bacterial 16S rRNA gene. These data have a compositional structure and are typically overdispersed. The microbiome data set is from an epidemiological study carried out in a helminth-endemic area in Indonesia. The conclusions are as follows: time has no statistically significant effect on microbiome composition, the correlation between subjects is statistically significant, and treatment has a significant effect on the microbiome composition only in infected subjects who remained infected.


Subject(s)
Multivariate Analysis, Regression Analysis, Computer Simulation, Humans, Microbiota, Statistical Models
17.
Biom J ; 61(3): 747-768, 2019 May.
Article in English | MEDLINE | ID: mdl-30693553

ABSTRACT

Marginal tests based on individual SNPs are routinely used in genetic association studies. Studies have shown that haplotype-based methods may provide more power in disease mapping than methods based on single markers when, for example, multiple disease-susceptibility variants occur within the same gene. A limitation of haplotype-based methods is that the number of parameters increases exponentially with the number of SNPs, inducing a commensurate increase in the degrees of freedom and weakening the power to detect associations. To address this limitation, we introduce a hierarchical linkage disequilibrium model for disease mapping, based on a reparametrization of the multinomial haplotype distribution, where every parameter corresponds to the cumulant of each possible subset of a set of loci. This hierarchy present in the parameters enables us to employ flexible testing strategies over a range of parameter sets: from standard single SNP analyses through the full haplotype distribution tests, reducing degrees of freedom and increasing the power to detect associations. We show via extensive simulations that our approach maintains the type I error at nominal level and has increased power under many realistic scenarios, as compared to single SNP and standard haplotype-based studies. To evaluate the performance of our proposed methodology in real data, we analyze genome-wide data from the Wellcome Trust Case-Control Consortium.


Subject(s)
Biometry/methods , Haplotypes , Linkage Disequilibrium , Arthritis, Rheumatoid/genetics , Genetic Loci/genetics , Genome-Wide Association Study , Humans , Liver Cirrhosis, Biliary/genetics , Polymorphism, Single Nucleotide
18.
BMC Bioinformatics ; 19(1): 371, 2018 Oct 11.
Article in English | MEDLINE | ID: mdl-30309317

ABSTRACT

BACKGROUND: With the exponential growth in available biomedical data, there is a need for data integration methods that can extract information about relationships between the data sets. However, these data sets might have very different characteristics. For interpretable results, data-specific variation needs to be quantified. For this task, Two-way Orthogonal Partial Least Squares (O2PLS) has been proposed. To facilitate application and development of the methodology, free and open-source software is required. However, such software is not available for O2PLS. RESULTS: We introduce OmicsPLS, an open-source implementation of the O2PLS method in R. It can handle both low- and high-dimensional datasets efficiently. Generic methods for inspecting and visualizing results are implemented. Both a standard cross-validation method and a faster alternative are available to determine the number of components. A simulation study shows good performance of OmicsPLS compared to alternatives, in terms of accuracy and CPU runtime. We demonstrate OmicsPLS by integrating genetic and glycomic data. CONCLUSIONS: We propose the OmicsPLS R package: a free and open-source implementation of O2PLS for statistical data integration. OmicsPLS is available at https://cran.r-project.org/package=OmicsPLS and can be installed in R via install.packages("OmicsPLS").


Subject(s)
Genomics/methods , Metabolomics/methods , Humans , Least-Squares Analysis , Software
19.
BMC Proc ; 12(Suppl 9): 33, 2018.
Article in English | MEDLINE | ID: mdl-30275885

ABSTRACT

The main goal of this paper is to estimate the effect of triglyceride levels on methylation of cytosine-phosphate-guanine (CpG) sites in multiple-case families. These families are selected because they have 2 or more cases of metabolic syndrome (primary phenotype). The methylations at the CpG sites are the secondary phenotypes. Ascertainment corrections are needed when there is an association between the primary and secondary phenotype. We will apply the newly developed secondary phenotype analysis for multiple-case family studies to identify CpG sites where methylations are influenced by triglyceride levels. Our second goal is to compare the performance of the naïve approach, which ignores the sampling of the families, SOLAR (Sequential Oligogenic Linkage Analysis Routines), which adjusts for ascertainment via probands, and the secondary phenotype approach. The analysis of possible CpG sites associated with triglyceride levels shows results consistent with the literature when using the secondary phenotype approach. Overall, the secondary phenotype approach performed well, but the comparison of the different approaches does not show significant differences between them. However, for genome-wide applications, we recommend using the secondary phenotype approach when there is an association between primary and secondary phenotypes, and to use the naïve approach otherwise.

20.
Genet Sel Evol ; 50(1): 49, 2018 Oct 10.
Article in English | MEDLINE | ID: mdl-30314431

ABSTRACT

BACKGROUND: Genomic prediction (GP) accuracy in numerically small breeds is limited by the small size of the reference population. Our objective was to test a multi-breed multiple genomic relationship matrices (GRM) GP model (MBMG) that weighs pre-selected markers separately, uses the remaining markers to explain the remaining genetic variance that can be explained by markers, and weighs information of breeds in the reference population by their genetic correlation with the validation breed. METHODS: Genotype and phenotype data were used on 595 Jersey bulls from New Zealand and 5503 Holstein bulls from the Netherlands, all with deregressed proofs for stature. Different sets of markers were used, containing either pre-selected markers from a meta-genome-wide association analysis on stature, remaining markers or both. We implemented a multi-breed bivariate GREML model in which we fitted either a single multi-breed GRM (MBSG), or two distinct multi-breed GRM (MBMG), one made with pre-selected markers and the other with remaining markers. Accuracies of predicting stature for Jersey individuals using the multi-breed models (Holstein and Jersey combined reference population) were compared to those obtained using either the Jersey (within-breed) or Holstein (across-breed) reference population. All the models were subsequently fitted in the analysis of simulated phenotypes, with a simulated genetic correlation between breeds of 1, 0.5, and 0.25. RESULTS: The MBMG model always gave better prediction accuracies for stature compared to MBSG, within-, and across-breed GP models. For example, with MBSG, accuracies obtained by fitting 48,912 unselected markers (0.43), 357 pre-selected markers (0.38) or a combination of both (0.43), were lower than accuracies obtained by fitting pre-selected and unselected markers in separate GRM in MBMG (0.49). This improvement was further confirmed by results from a simulation study, with MBMG performing on average 23% better than MBSG with all markers fitted. CONCLUSIONS: With the MBMG model, it is possible to use information from numerically large breeds to improve prediction accuracy of numerically small breeds. The superiority of MBMG is mainly due to its ability to use information on pre-selected markers, explain the remaining genetic variance and weigh information from a different breed by the genetic correlation between breeds.
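
The idea of building separate relationship matrices from pre-selected and remaining markers can be sketched as follows. The abstract does not state how the GRMs were computed, so this example assumes a commonly used construction (centred 0/1/2 genotype codes scaled by twice the sum of p(1 - p)); the function name, the toy genotypes and the marker split are all hypothetical, and no multi-breed weighting or GREML fitting is shown.

```python
def grm(genotypes, marker_idx):
    """Genomic relationship matrix from a subset of markers.

    genotypes:  list of rows, one per animal, with allele counts 0/1/2.
    marker_idx: column indices of the markers to include.
    Assumed construction: G = Z Z' / (2 * sum_j p_j (1 - p_j)),
    where Z holds genotypes centred by twice the allele frequency.
    """
    n = len(genotypes)
    # Allele frequency of each selected marker in this sample
    freqs = [sum(row[j] for row in genotypes) / (2 * n) for j in marker_idx]
    scale = 2 * sum(p * (1 - p) for p in freqs)
    if scale == 0:
        raise ValueError("all selected markers are monomorphic")
    # Centre each genotype by its expected value 2p
    centred = [[row[j] - 2 * p for j, p in zip(marker_idx, freqs)]
               for row in genotypes]
    return [[sum(a * b for a, b in zip(zi, zj)) / scale for zj in centred]
            for zi in centred]


# Toy data: 3 animals, 3 markers; marker 0 plays the role of a pre-selected marker
toy = [[0, 2, 1], [2, 0, 1], [1, 1, 0]]
grm_selected = grm(toy, [0])      # GRM from the pre-selected marker
grm_remaining = grm(toy, [1, 2])  # GRM from the remaining markers
```

Fitting the two matrices as separate random effects lets each marker set receive its own variance component, which is the mechanism the abstract credits for the gain over a single combined GRM.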


Subject(s)
Breeding/methods , Models, Genetic , Polymorphism, Genetic , Animals , Breeding/standards , Cattle/genetics , Genetic Markers , Sample Size , Selection, Genetic