1.
Bioinformatics ; 38(4): 1075-1086, 2022 01 27.
Article in English | MEDLINE | ID: mdl-34788368

ABSTRACT

MOTIVATION: Accurate disease diagnosis and prognosis based on omics data rely on the effective identification of robust prognostic and diagnostic markers that reflect the states of the biological processes underlying the disease pathogenesis and progression. In this article, we present GCNCC, a Graph Convolutional Network-based approach for Clustering and Classification, that can identify highly effective and robust network-based disease markers. Based on a geometric deep learning framework, GCNCC learns deep network representations by integrating gene expression data with protein interaction data to identify highly reproducible markers with consistently accurate prediction performance across independent datasets possibly from different platforms. GCNCC identifies these markers by clustering the nodes in the protein interaction network based on latent similarity measures learned by the deep architecture of a graph convolutional network, followed by a supervised feature selection procedure that extracts clusters that are highly predictive of the disease state. RESULTS: By benchmarking GCNCC based on independent datasets from different diseases (psychiatric disorder and cancer) and different platforms (microarray and RNA-seq), we show that GCNCC outperforms other state-of-the-art methods in terms of accuracy and reproducibility. AVAILABILITY AND IMPLEMENTATION: https://github.com/omarmaddouri/GCNCC. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Neural Networks, Computer , Protein Interaction Maps , Humans , Reproducibility of Results
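The graph convolution that the GCNCC abstract builds on can be illustrated with a minimal sketch. It assumes the widely used symmetric-normalized propagation rule, ReLU(D^-1/2 (A + I) D^-1/2 X W), which may differ from the architecture in the GCNCC repository. The function name `gcn_layer` and the toy inputs are hypothetical.

```python
import math


def gcn_layer(adj, features, weights):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adj)
    # Add self-loops so each node keeps its own signal.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric degree normalization of the adjacency matrix.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    n_in, n_out = len(weights), len(weights[0])
    # Aggregate neighbor features, then apply the linear map and ReLU.
    agg = [[sum(norm[i][k] * features[k][f] for k in range(n))
            for f in range(n_in)] for i in range(n)]
    return [[max(0.0, sum(agg[i][f] * weights[f][h] for f in range(n_in)))
             for h in range(n_out)] for i in range(n)]


# Toy example (assumed data): two interacting proteins, one expression value each.
adjacency = [[0.0, 1.0], [1.0, 0.0]]
expression = [[1.0], [3.0]]
identity = [[1.0]]
embedding = gcn_layer(adjacency, expression, identity)
```

Each node's output mixes its own expression value with its neighbors' values. That mixing is how expression data and interaction topology end up in one representation.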
2.
Plant Cell ; 32(2): 470-485, 2020 02.
Article in English | MEDLINE | ID: mdl-31852774

ABSTRACT

Among many glycoproteins within the plant secretory system, KORRIGAN1 (KOR1), a membrane-anchored endo-β-1,4-glucanase involved in cellulose biosynthesis, provides a link between N-glycosylation, cell wall biosynthesis, and abiotic stress tolerance. After insertion into the endoplasmic reticulum, KOR1 cycles between the trans-Golgi network (TGN) and the plasma membrane (PM). From the TGN, the protein is targeted to growing cell plates during cell division. These processes are governed by multiple sequence motifs and also host genotypes. Here, we investigated the interaction and hierarchy of known and newly identified sorting signals in KOR1 and how they affect KOR1 transport at various stages in the secretory pathway. Conventional steady-state localization showed that structurally compromised KOR1 variants were directed to tonoplasts. In addition, a tandem fluorescent timer technology allowed for differential visualization of young versus aged KOR1 proteins, enabling the analysis of single-pass transport through the secretory pathway. Observations suggest the presence of multiple checkpoints/branches during KOR1 trafficking, where the destination is determined based on KOR1's sequence motifs and folding status. Moreover, growth analyses of dominant PM-confined KOR1-L48L49→A48A49 variants revealed the importance of active removal of KOR1 from the PM during salt stress, which otherwise interfered with stress acclimation.


Subject(s)
Arabidopsis Proteins/metabolism , Arabidopsis/metabolism , Cellulase/metabolism , Endoplasmic Reticulum/metabolism , Membrane Proteins/metabolism , Salt Stress/physiology , Salt Tolerance/physiology , trans-Golgi Network/metabolism , Arabidopsis/genetics , Arabidopsis Proteins/genetics , Cell Membrane/metabolism , Cell Wall/metabolism , Cellulase/genetics , Cellulose/metabolism , Gene Expression Regulation, Plant , Glycosylation , Golgi Apparatus/metabolism , Membrane Proteins/genetics , Mutation , Plant Roots/growth & development , Plants, Genetically Modified , Protein Transport , Quality Control , Salt Stress/genetics , Salt Tolerance/genetics , Salts/metabolism , Sequence Alignment , Transcriptome
3.
Sci Stud Read ; 27(1): 5-20, 2023.
Article in English | MEDLINE | ID: mdl-36843656

ABSTRACT

Purpose: Researchers have developed a constellation model of decoding-related reading disabilities (RD) to improve the RD risk determination. The model's hallmark is its inclusion of various RD indicators to determine RD risk. Classification methods such as logistic regression (LR) might be one way to determine RD risk within the constellation model framework. However, some issues may arise with applying the logistic regression method (e.g., multicollinearity). Machine learning techniques, such as random forest (RF), might assist in overcoming these limitations. They can better deal with complex data relations than traditional approaches. We examined the prediction performance of RF and compared it against LR to determine RD risk. Method: The sample comprised 12,171 students from Florida whose third-grade RD risk was operationalized using the constellation model with one, two, three, or four RD indicators in first and second grade. Results: Results revealed that LR and RF performed on par in accurately predicting RD risk. Regarding predictor importance, reading fluency was consistently the most critical predictor for RD risk. Conclusion: Findings suggest that RF does not outperform LR in RD prediction accuracy in models with multiple linearly related predictors. Findings also highlight including reading fluency in early identification batteries for later RD determination.

4.
Bioinformatics ; 37(19): 3212-3219, 2021 Oct 11.
Article in English | MEDLINE | ID: mdl-33822889

ABSTRACT

MOTIVATION: When learning to subtype complex disease based on next-generation sequencing data, the amount of available data is often limited. Recent works have tried to leverage data from other domains to design better predictors in the target domain of interest with varying degrees of success. But they are either limited to the cases requiring the outcome label correspondence across domains or cannot leverage the label information at all. Moreover, the existing methods usually cannot benefit from other information available a priori such as gene interaction networks. RESULTS: In this article, we develop a generative optimal Bayesian supervised domain adaptation (OBSDA) model that can integrate RNA sequencing (RNA-Seq) data from different domains along with their labels for improving prediction accuracy in the target domain. Our model can be applied in cases where different domains share the same labels or have different ones. OBSDA is based on a hierarchical Bayesian negative binomial model with parameter factorization, for which the optimal predictor can be derived by marginalization of likelihood over the posterior of the parameters. We first provide an efficient Gibbs sampler for parameter inference in OBSDA. Then, we leverage the gene-gene network prior information and construct an informed and flexible variational family to infer the posterior distributions of model parameters. Comprehensive experiments on real-world RNA-Seq data demonstrate the superior performance of OBSDA, in terms of accuracy in identifying cancer subtypes by utilizing data from different domains. Moreover, we show that by taking advantage of the prior network information we can further improve the performance. AVAILABILITY AND IMPLEMENTATION: The source code for the implementations of OBSDA and SI-OBSDA is available at https://github.com/SHBLK/BSDA. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

5.
J Biomed Inform ; 117: 103691, 2021 05.
Article in English | MEDLINE | ID: mdl-33610882

ABSTRACT

Survival data analysis has been leveraged in medical research to study disease morbidity and mortality, and to discover significant bio-markers affecting them. A crucial objective in studying high dimensional medical data is the development of inherently interpretable models that can efficiently capture sparse underlying signals while retaining a high predictive accuracy. Recently developed rule ensemble models have been shown to effectively accomplish this objective; however, they are computationally expensive when applied to survival data and do not account for sparsity in the number of variables included in the generated rules. To address these gaps, we present SURVFIT, a "doubly sparse" rule extraction formulation for survival data. This doubly sparse method can induce sparsity both in the number of rules and in the number of variables involved in the rules. Our method has the computational efficiency needed to realistically solve the problem of rule-extraction from survival data if we consider both rule sparsity and variable sparsity, by adopting a quadratic loss function with an overlapping group regularization. Further, a systematic rule evaluation framework that includes statistical testing, decomposition analysis and sensitivity analysis is provided. We demonstrate the utility of SURVFIT via experiments carried out on a synthetic dataset and a sepsis survival dataset from MIMIC-III.


Subject(s)
Algorithms , Learning
6.
BMC Genomics ; 21(Suppl 10): 615, 2020 Nov 18.
Article in English | MEDLINE | ID: mdl-33208103

ABSTRACT

BACKGROUND: The current computational methods on identifying conserved protein complexes across multiple Protein-Protein Interaction (PPI) networks suffer from the lack of explicit modeling of the desired topological properties within conserved protein complexes as well as their scalability. RESULTS: To overcome those issues, we propose a scalable algorithm, ClusterM, for identifying conserved protein complexes across multiple PPI networks through the integration of network topology and protein sequence similarity information. ClusterM overcomes the computational barrier that existed in previous methods, where the complexity escalates exponentially when handling an increasing number of PPI networks; and it is able to detect conserved protein complexes with both topological separability and cohesive protein sequence conservation. On two independent compendiums of PPI networks from Saccharomyces cerevisiae (Sce, yeast), Drosophila melanogaster (Dme, fruit fly), Caenorhabditis elegans (Cel, worm), and Homo sapiens (Hsa, human), we demonstrate that ClusterM outperforms other state-of-the-art algorithms by a significant margin and is able to identify de novo conserved protein complexes across four species that are missed by existing algorithms. CONCLUSIONS: ClusterM can better capture the desired topological property of a typical conserved protein complex, which is densely connected within the complex while being well-separated from the rest of the networks. Furthermore, our experiments have shown that ClusterM is highly scalable and efficient when analyzing multiple PPI networks.


Subject(s)
Computational Biology , Protein Interaction Mapping , Protein Interaction Maps , Algorithms , Animals , Caenorhabditis elegans/genetics , Drosophila melanogaster/genetics , Humans , Polycomb Repressive Complex 1 , Saccharomyces cerevisiae/genetics
7.
BMC Genomics ; 21(Suppl 9): 585, 2020 Sep 09.
Article in English | MEDLINE | ID: mdl-32900358

ABSTRACT

BACKGROUND: Single-cell RNA sequencing (scRNA-seq) is a powerful profiling technique at the single-cell resolution. Appropriate analysis of scRNA-seq data can characterize molecular heterogeneity and shed light into the underlying cellular process to better understand development and disease mechanisms. The unique analytic challenge is to appropriately model highly over-dispersed scRNA-seq count data with prevalent dropouts (zero counts), making zero-inflated dimensionality reduction techniques popular for scRNA-seq data analyses. Employing zero-inflated distributions, however, may place extra emphasis on zero counts, leading to potential bias when identifying the latent structure of the data. RESULTS: In this paper, we propose a fully generative hierarchical gamma-negative binomial (hGNB) model of scRNA-seq data, obviating the need for explicitly modeling zero inflation. At the same time, hGNB can naturally account for covariate effects at both the gene and cell levels to identify complex latent representations of scRNA-seq data, without the need for commonly adopted pre-processing steps such as normalization. Efficient Bayesian model inference is derived by exploiting conditional conjugacy via novel data augmentation techniques. CONCLUSION: Experimental results on both simulated data and several real-world scRNA-seq datasets suggest that hGNB is a powerful tool for cell cluster discovery as well as cell lineage inference.


Subject(s)
RNA , Single-Cell Analysis , Bayes Theorem , Gene Expression Profiling , Sequence Analysis, RNA
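The hGNB abstract above argues that a negative binomial hierarchy can absorb dropouts without an explicit zero-inflation term. The sketch below shows only the basic building block, the negative binomial as a gamma-Poisson mixture, and is not the hGNB model itself. The function names, the seed, and the mean and dispersion values are illustrative assumptions.

```python
import math
import random


def sample_poisson(rng, lam):
    """Knuth's multiplication method; adequate for the small rates used here."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def sample_negative_binomial(rng, mean, dispersion):
    """Draw a gamma-distributed rate, then a Poisson count with that rate."""
    # gammavariate takes (shape, scale); shape * scale equals the requested mean.
    rate = rng.gammavariate(dispersion, mean / dispersion)
    return sample_poisson(rng, rate)


rng = random.Random(0)  # fixed seed so the run is repeatable
counts = [sample_negative_binomial(rng, mean=2.0, dispersion=0.5)
          for _ in range(5000)]

sample_mean = sum(counts) / len(counts)
sample_var = sum((c - sample_mean) ** 2 for c in counts) / len(counts)
zero_fraction = counts.count(0) / len(counts)
```

With a small dispersion parameter the sample variance far exceeds the mean. The share of zeros is also much larger than a Poisson model with the same mean would give, although no separate zero-inflation component is present.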
8.
Bioinformatics ; 35(7): 1133-1141, 2019 04 01.
Article in English | MEDLINE | ID: mdl-30169792

ABSTRACT

MOTIVATION: Non-coding RNAs (ncRNAs) are known to play crucial roles in various biological processes, and there is a pressing need for accurate computational detection methods that could be used to efficiently scan genomes to detect novel ncRNAs. However, unlike coding genes, ncRNAs often lack distinctive sequence features that could be used for recognizing them. Although many ncRNAs are known to have a well conserved secondary structure, which provides useful cues for computational prediction, it has been also shown that a structure-based approach alone may not be sufficient for detecting ncRNAs in a single sequence. Currently, the most effective ncRNA detection methods combine structure-based techniques with a comparative genome analysis approach to improve the prediction performance. RESULTS: In this paper, we propose RNAdetect, a computational method incorporating novel features for accurate detection of ncRNAs in combination with comparative genome analysis. Given a sequence alignment, RNAdetect can accurately detect the presence of functional ncRNAs by incorporating novel predictive features based on the concept of generalized ensemble defect (GED), which assesses the degree of structure conservation across multiple related sequences and the conformation of the individual folding structures to a common consensus structure. Furthermore, n-gram models (NGMs) are used to extract features that can effectively capture sequence homology to known ncRNA families. Utilization of NGMs can enhance the detection of ncRNAs that have sparse folding structures with many unpaired bases. Extensive performance evaluation based on the Rfam database and bacterial genomes demonstrate that RNAdetect can accurately and reliably detect novel ncRNAs, outperforming the current state-of-the-art methods. AVAILABILITY AND IMPLEMENTATION: The source code for RNAdetect and the benchmark data used in this paper can be downloaded at https://github.com/bjyoontamu/RNAdetect.


Subject(s)
Genome, Bacterial , RNA, Untranslated/genetics , Comparative Genomic Hybridization , Computational Biology , Nucleic Acid Conformation , Sequence Alignment , Software
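The n-gram models (NGMs) mentioned in the RNAdetect abstract can be illustrated with a small sketch. It assumes a plain n-gram language model with add-one smoothing over the alphabet ACGU. The actual feature construction in RNAdetect may differ, and every name below is hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGU"


def ngram_counts(seq, n):
    """Count every overlapping n-gram in a sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def train_ngram_model(seqs, n):
    """Return P(last base | previous n-1 bases) with add-one smoothing."""
    gram_counts, context_counts = Counter(), Counter()
    for s in seqs:
        for gram, c in ngram_counts(s, n).items():
            gram_counts[gram] += c
            context_counts[gram[:-1]] += c

    def prob(gram):
        return (gram_counts[gram] + 1) / (context_counts[gram[:-1]] + len(ALPHABET))

    return prob


def log_likelihood(seq, prob, n):
    """Sum of log-probabilities of the n-grams in seq under the model."""
    return sum(math.log(prob(seq[i:i + n])) for i in range(len(seq) - n + 1))


# Toy "family" (assumed data) used to train a bigram model.
family_model = train_ngram_model(["GGGGGG", "GGGGGA"], n=2)
similar = log_likelihood("GGGG", family_model, n=2)
dissimilar = log_likelihood("CCCC", family_model, n=2)
```

A sequence that resembles the training family scores a higher log-likelihood than an unrelated one. A classifier can use that score as one sequence-homology feature alongside structural features.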
9.
Bioinformatics ; 35(17): 2941-2948, 2019 09 01.
Article in English | MEDLINE | ID: mdl-30629122

ABSTRACT

MOTIVATION: For many RNA families, the secondary structure is known to be better conserved among the member RNAs compared to the primary sequence. For this reason, it is important to consider the underlying folding structures when aligning RNA sequences, especially for those with relatively low sequence identity. Given a set of RNAs with unknown structures, simultaneous RNA alignment and folding algorithms aim to accurately align the RNAs by jointly predicting their consensus secondary structure and the optimal sequence alignment. Despite the improved accuracy of the resulting alignment, the computational complexity of simultaneous alignment and folding for a pair of RNAs is O(N^6), which is too costly to be used for large-scale analysis. RESULTS: In order to address this shortcoming, in this work, we propose a novel network-based scheme for pairwise structural alignment of RNAs. The proposed algorithm, TOPAS, builds on the concept of topological networks that provide structural maps of the RNAs to be aligned. For each RNA sequence, TOPAS first constructs a topological network based on the predicted folding structure, which consists of sequential edges and structural edges weighted by the base-pairing probabilities. The obtained networks can then be efficiently aligned by using probabilistic network alignment techniques, thereby yielding the structural alignment of the RNAs. The computational complexity of our proposed method is significantly lower than that of the Sankoff-style dynamic programming approach, while yielding favorable alignment results. Furthermore, another important advantage of the proposed algorithm is its capability of handling RNAs with pseudoknots while predicting the RNA structural alignment. We demonstrate that TOPAS generally outperforms previous RNA structural alignment methods on RNA benchmarks in terms of both speed and accuracy.
AVAILABILITY AND IMPLEMENTATION: Source code of TOPAS and the benchmark data used in this paper are available at https://github.com/bjyoontamu/TOPAS.


Subject(s)
Algorithms , RNA , Sequence Alignment , Base Pairing , Nucleic Acid Conformation , Sequence Analysis, RNA
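The topological network described in the TOPAS abstract can be sketched as a simple data structure. Two points are assumptions rather than details from the abstract: sequential edges are given weight 1.0, and structural edges below a cutoff `min_prob` are dropped. The base-pairing probabilities are supplied by hand here; in practice they would come from an RNA folding tool.

```python
def build_topological_network(sequence, pair_probs, min_prob=0.05):
    """Build sequential and structural edge lists for one RNA sequence."""
    n = len(sequence)
    # Sequential edges link neighboring bases along the backbone.
    sequential = [(i, i + 1, 1.0) for i in range(n - 1)]
    # Structural edges link bases that may pair, weighted by pairing probability.
    structural = [(i, j, p) for (i, j), p in sorted(pair_probs.items())
                  if 0 <= i < j < n and p >= min_prob]
    return {"nodes": list(sequence),
            "sequential": sequential,
            "structural": structural}


# Toy hairpin with hand-written pairing probabilities (assumed data).
hairpin = "GGCAAAGCC"
probs = {(0, 8): 0.95, (1, 7): 0.90, (2, 6): 0.60, (3, 5): 0.01}
network = build_topological_network(hairpin, probs)
```

Because structural edges are stored as arbitrary index pairs, crossing pairs such as pseudoknots fit in the same structure without special handling.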
10.
BMC Bioinformatics ; 20(Suppl 12): 321, 2019 Jun 20.
Article in English | MEDLINE | ID: mdl-31216989

ABSTRACT

BACKGROUND: Missing values frequently arise in modern biomedical studies due to various reasons, including missing tests or complex profiling technologies for different omics measurements. Missing values can complicate the application of clustering algorithms, whose goals are to group points based on some similarity criterion. A common practice for dealing with missing values in the context of clustering is to first impute the missing values, and then apply the clustering algorithm on the completed data. RESULTS: We consider missing values in the context of optimal clustering, which finds an optimal clustering operator with reference to an underlying random labeled point process (RLPP). We show how the missing-value problem fits neatly into the overall framework of optimal clustering by incorporating the missing value mechanism into the random labeled point process and then marginalizing out the missing-value process. In particular, we demonstrate the proposed framework for the Gaussian model with arbitrary covariance structures. Comprehensive experimental studies on both synthetic and real-world RNA-seq data show the superior performance of the proposed optimal clustering with missing values when compared to various clustering approaches. CONCLUSION: Optimal clustering with missing values obviates the need for imputation-based pre-processing of the data, while at the same time possessing smaller clustering errors.


Subject(s)
Algorithms , Breast Neoplasms/genetics , Cluster Analysis , Computer Simulation , Female , Gene Expression Profiling , Humans , Models, Theoretical , Normal Distribution , Probability
11.
BMC Genomics ; 20(Suppl 6): 435, 2019 Jun 13.
Article in English | MEDLINE | ID: mdl-31189480

ABSTRACT

BACKGROUND: Single-cell gene expression measurements offer opportunities in deriving mechanistic understanding of complex diseases, including cancer. However, due to the complex regulatory machinery of the cell, gene regulatory network (GRN) model inference based on such data still manifests significant uncertainty. RESULTS: The goal of this paper is to develop optimal classification of single-cell trajectories accounting for potential model uncertainty. Partially-observed Boolean dynamical systems (POBDS) are used for modeling gene regulatory networks observed through noisy gene-expression data. We derive the exact optimal Bayesian classifier (OBC) for binary classification of single-cell trajectories. The application of the OBC becomes impractical for large GRNs, due to computational and memory requirements. To address this, we introduce a particle-based single-cell classification method that is highly scalable for large GRNs with much lower complexity than the optimal solution. CONCLUSION: The performance of the proposed particle-based method is demonstrated through numerical experiments using a POBDS model of the well-known T-cell large granular lymphocyte (T-LGL) leukemia network with noisy time-series gene-expression data.


Subject(s)
Algorithms , Bayes Theorem , Computational Biology/methods , Gene Regulatory Networks , Leukemia, Large Granular Lymphocytic/genetics , Single-Cell Analysis/methods , Gene Expression Profiling , Humans , Models, Biological , Models, Genetic , Uncertainty
12.
Bioinformatics ; 34(13): i61-i69, 2018 07 01.
Article in English | MEDLINE | ID: mdl-29949981

ABSTRACT

Motivation: High-throughput sequencing technologies, in particular RNA sequencing (RNA-seq), have become the basic practice for genomic studies in biomedical research. In addition to studying genes individually, for example, through differential expression analysis, investigating coordinated expression variations of genes may help reveal the underlying cellular mechanisms to derive better understanding and more effective prognosis and intervention strategies. Although there exists a variety of co-expression network based methods to analyze microarray data for this purpose, instead of blindly extending these methods for microarray data that may introduce unnecessary bias, it is crucial to develop methods well adapted to RNA-seq data to identify the functional modules of genes with similar expression patterns. Results: We have developed a fully Bayesian covariate-dependent negative binomial factor analysis (dNBFA) method for RNA-seq count data, to capture coordinated gene expression changes, while considering effects from covariates reflecting different influencing factors. Unlike existing co-expression network based methods, our proposed model does not require multiple ad-hoc choices on data processing, transformation, as well as co-expression measures and can be directly applied to RNA-seq data. Furthermore, being capable of incorporating covariate information, the proposed method can tackle setups with complex confounding factors in different experiment designs. Finally, the natural model parameterization removes the need for a normalization preprocessing step, as commonly adopted to compensate for the effect of sequencing-depth variations. Efficient Bayesian inference of model parameters is derived by exploiting conditional conjugacy via novel data augmentation techniques.
Experimental results on several real-world RNA-seq datasets on complex diseases suggest dNBFA as a powerful tool for discovering the gene modules with significant differential expression and meaningful biological insight. Availability and implementation: dNBFA is implemented in R language and is available at https://github.com/siamakz/dNBFA.


Subject(s)
Gene Expression Profiling/methods , Gene Regulatory Networks , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, RNA/methods , Software , Autistic Disorder/genetics , Bayes Theorem , Factor Analysis, Statistical , Humans , Neoplasms/genetics
13.
Bioinformatics ; 34(19): 3349-3356, 2018 10 01.
Article in English | MEDLINE | ID: mdl-29688254

ABSTRACT

Motivation: Rapid adoption of high-throughput sequencing technologies has enabled better understanding of genome-wide molecular profile changes associated with phenotypic differences in biomedical studies. Often, these changes are due to multiple interacting factors. Existing methods mostly consider differential expression across two conditions studying one main factor without considering other confounding factors. In addition, they are often coupled with essential sophisticated ad-hoc pre-processing steps such as normalization, restricting their adaptability to general experimental setups. Complex multi-factor experimental design to accurately decipher genotype-phenotype relationships signifies the need for developing effective statistical tools for genome-scale sequencing data profiled under multi-factor conditions. Results: We have developed a novel Bayesian negative binomial regression (BNB-R) method for the analysis of RNA sequencing (RNA-seq) count data. In particular, the natural model parameterization removes the need for the normalization step, while the method is capable of tackling complex experimental design involving multi-variate dependence structures. Efficient Bayesian inference of model parameters is obtained by exploiting conditional conjugacy via novel data augmentation techniques. Comprehensive studies on both synthetic and real-world RNA-seq data demonstrate the superior performance of BNB-R in terms of the areas under both the receiver operating characteristic and precision-recall curves. Availability and implementation: BNB-R is implemented in R language and is available at https://github.com/siamakz/BNBR. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Bayes Theorem , High-Throughput Nucleotide Sequencing/methods , Humans , Research Design , Sequence Analysis, RNA , Software
14.
J Biomed Inform ; 96: 103231, 2019 08.
Article in English | MEDLINE | ID: mdl-31202936

ABSTRACT

Early detection and risk assessment of complex chronic disease based on longitudinal clinical data is helpful for doctors to make early diagnosis and monitor the disease progression. Disease diagnosis with computer-aided methods has been extensively studied. However, early detection and contemporaneous risk assessment based on partially labeled irregular longitudinal measurements is relatively unexplored. In this paper, we propose a flexible mixed-kernel framework for training a contemporaneous disease risk detector to predict the onset of disease and monitor the disease progression. Moreover, we address the label insufficiency problem by identifying the pattern of disease-induced progression over time with longitudinal data. Our method is based on a Structured Output Support Vector Machine (SOSVM), extended to longitudinal data analysis. Extensive experiments are conducted on several datasets of varying complexity, including the contemporaneous risk assessment with simulated irregular longitudinal data; the identification of the onset of Type 1 Diabetes (T1D) with irregularly sampled longitudinal RNA-Seq gene expression dataset; as well as the monitoring of the drug long-term effects on patients using longitudinal RNA-Seq dataset containing missing time points, demonstrating that our method enhances the accuracy in both early diagnosis and risk estimation with partially labeled irregular longitudinal clinical data.


Subject(s)
Chronic Disease , Diabetes Mellitus, Type 1/diagnosis , Risk Assessment/methods , Algorithms , Computer Simulation , Data Analysis , Diabetes Mellitus, Type 1/genetics , Diagnosis, Computer-Assisted , Disease Progression , Early Diagnosis , Humans , Interferon-beta/therapeutic use , Longitudinal Studies , Models, Statistical , RNA-Seq , Support Vector Machine
15.
Proc Natl Acad Sci U S A ; 113(47): 13301-13306, 2016 11 22.
Article in English | MEDLINE | ID: mdl-27821777

ABSTRACT

An outstanding challenge in the nascent field of materials informatics is to incorporate materials knowledge in a robust Bayesian approach to guide the discovery of new materials. Utilizing inputs from known phase diagrams, features or material descriptors that are known to affect the ferroelectric response, and Landau-Devonshire theory, we demonstrate our approach for BaTiO3-based piezoelectrics with the desired target of a vertical morphotropic phase boundary. We predict, synthesize, and characterize a solid solution, (Ba0.5Ca0.5)TiO3-Ba(Ti0.7Zr0.3)O3, with piezoelectric properties that show better temperature reliability than other BaTiO3-based piezoelectrics in our initial training data.

16.
BMC Genomics ; 19(Suppl 4): 170, 2018 03 21.
Article in English | MEDLINE | ID: mdl-29589561

ABSTRACT

BACKGROUND: Genotype-phenotype association has been one of the long-standing problems in bioinformatics. Identifying both the marginal and epistatic effects among genetic markers, such as Single Nucleotide Polymorphisms (SNPs), has been extensively integrated in Genome-Wide Association Studies (GWAS) to help derive "causal" genetic risk factors and their interactions, which play critical roles in life and disease systems. Identifying "synergistic" interactions with respect to the outcome of interest can help accurate phenotypic prediction and understand the underlying mechanism of system behavior. Many statistical measures for estimating synergistic interactions have been proposed in the literature for such a purpose. However, except for empirical performance, there is still no theoretical analysis on the power and limitation of these synergistic interaction measures. RESULTS: In this paper, it is shown that the existing information-theoretic multivariate synergy depends on a small subset of the interaction parameters in the model, sometimes on only one interaction parameter. In addition, an adjusted version of multivariate synergy is proposed as a new measure to estimate the interactive effects, with experiments conducted over both simulated data sets and a real-world GWAS data set to show the effectiveness. CONCLUSIONS: We provide rigorous theoretical analysis and empirical evidence on why the information-theoretic multivariate synergy helps with identifying genetic risk factors via synergistic interactions. We further establish the rigorous sample complexity analysis on detecting interactive effects, confirmed by both simulated and real-world data sets.


Subject(s)
Computational Biology/methods , Diabetes Mellitus, Type 1/genetics , Genome-Wide Association Study , Polymorphism, Single Nucleotide , Algorithms , Case-Control Studies , Computer Simulation , Genetic Markers , Genetic Predisposition to Disease , Humans , Logistic Models , Models, Genetic
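The information-theoretic synergy discussed in the abstract above can be made concrete for two markers. The sketch uses the standard bivariate definition, Syn(X1, X2; Y) = I(X1, X2; Y) - I(X1; Y) - I(X2; Y), estimated from empirical frequencies. It does not implement the adjusted measure the paper proposes, and the function names and toy data are illustrative assumptions.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired samples."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())


def bivariate_synergy(x1, x2, y):
    """Information the pair carries about y beyond the two markers alone."""
    pair = list(zip(x1, x2))
    return (mutual_information(pair, y)
            - mutual_information(x1, y)
            - mutual_information(x2, y))


# Purely epistatic toy phenotype (assumed data): y is the XOR of two markers.
snp_a = [0, 0, 1, 1]
snp_b = [0, 1, 0, 1]
phenotype = [0, 1, 1, 0]
synergy = bivariate_synergy(snp_a, snp_b, phenotype)
```

In the XOR case neither marker alone carries any information about the phenotype, yet the pair determines it completely, so the synergy is one full bit.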
17.
BMC Bioinformatics ; 18(Suppl 14): 500, 2017 12 28.
Article in English | MEDLINE | ID: mdl-29297279

ABSTRACT

BACKGROUND: Functional modules in biological networks consist of numerous biomolecules and their complicated interactions. Recent studies have shown that biomolecules in a functional module tend to have similar interaction patterns and that such modules are often conserved across biological networks of different species. As a result, such conserved functional modules can be identified through comparative analysis of biological networks. RESULTS: In this work, we propose a novel network querying algorithm based on the CUFID (Comparative network analysis Using the steady-state network Flow to IDentify orthologous proteins) framework combined with an efficient seed-and-extension approach. The proposed algorithm, CUFID-query, can accurately detect conserved functional modules as small subnetworks in the target network that are expected to perform functions similar to those of the given query functional module. The CUFID framework was recently developed for probabilistic pairwise global comparison of biological networks, and it has been applied to pairwise global network alignment, where it was shown to yield accurate alignment results. In the proposed CUFID-query algorithm, we adopt the CUFID framework and extend it to local network alignment, specifically to solve network querying problems. First, in the seed selection phase, the proposed method uses the CUFID framework to compare the query and target networks and to predict the probabilistic node-to-node correspondence between them. Next, the algorithm selects a seed in the target network and greedily extends it by iteratively adding nodes that have frequent interactions with other nodes in the seed network, such that the conductance of the extended network is maximally reduced. Finally, CUFID-query removes irrelevant nodes from the querying results based on the personalized PageRank vector for the induced network that includes the fully extended network and its neighboring nodes. CONCLUSIONS: Through extensive performance evaluation based on biological networks with known functional modules, we show that CUFID-query outperforms the existing state-of-the-art algorithms in terms of prediction accuracy and biological significance of the predictions.
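
The greedy extension step described in this abstract can be illustrated with a minimal sketch. It is not the CUFID-query implementation: the function names, the toy graph, and the stopping rule (stop once no neighboring node lowers the conductance) are assumptions made for illustration, and the seed is given by hand rather than derived from CUFID correspondence scores.

```python
from typing import Dict, Set

Graph = Dict[str, Set[str]]  # undirected graph as adjacency sets


def conductance(graph: Graph, nodes: Set[str]) -> float:
    """Edges leaving `nodes`, divided by the smaller of the two side volumes."""
    cut = sum(1 for u in nodes for v in graph[u] if v not in nodes)
    vol_in = sum(len(graph[u]) for u in nodes)
    vol_out = sum(len(graph[u]) for u in graph) - vol_in
    denom = min(vol_in, vol_out)
    return cut / denom if denom > 0 else 1.0


def greedy_extend(graph: Graph, seed: Set[str], max_size: int) -> Set[str]:
    """Add one neighbor at a time, always the one giving the lowest conductance.

    Assumed stopping rule: halt when no candidate strictly lowers conductance.
    """
    module = set(seed)
    while len(module) < max_size:
        frontier = {v for u in module for v in graph[u]} - module
        if not frontier:
            break
        current = conductance(graph, module)
        # sorted() keeps tie-breaking deterministic
        best = min(sorted(frontier),
                   key=lambda v: conductance(graph, module | {v}))
        if conductance(graph, module | {best}) >= current:
            break
        module.add(best)
    return module


# Hypothetical toy network: two triangles joined by the bridge c-d.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"),
         ("d", "e"), ("d", "f"), ("e", "f")]
toy: Graph = {}
for u, v in edges:
    toy.setdefault(u, set()).add(v)
    toy.setdefault(v, set()).add(u)

module = greedy_extend(toy, {"a"}, max_size=6)
print(sorted(module))
```

Starting from the single seed node `a`, the sketch grows the module to the first triangle and stops at the bridge, because crossing it would raise the conductance again.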


Subject(s)
Algorithms , Protein Interaction Mapping/methods , Search Engine , Animals , Drosophila melanogaster/genetics , Humans , Saccharomyces cerevisiae/genetics , Time Factors
18.
BMC Bioinformatics ; 18(Suppl 14): 552, 2017 12 28.
Article in English | MEDLINE | ID: mdl-29297278

ABSTRACT

BACKGROUND: Phenotypic classification is problematic because small samples are ubiquitous, and for these the use of prior knowledge is critical. If knowledge concerning the feature-label distribution - for instance, genetic pathways - is available, then it can be used in learning. Optimal Bayesian classification provides optimal classification under model uncertainty. It differs from classical Bayesian methods, in which a classification model is assumed and prior distributions are placed on model parameters. With optimal Bayesian classification, uncertainty is treated directly on the feature-label distribution, which assures full utilization of prior knowledge and is guaranteed to outperform classical methods. RESULTS: The salient problem confronting optimal Bayesian classification is prior construction. In this paper, we propose a new prior construction methodology based on a general framework of constraints in the form of conditional probability statements. We call this prior the maximal knowledge-driven information prior (MKDIP). The new constraint framework is more flexible than our previous methods, as it naturally handles the potential inconsistency in archived regulatory relationships, and conditioning can be augmented by other knowledge, such as population statistics. We also extend the application of prior construction to a multinomial mixture model when labels are unknown, which often occurs in practice. The performance of the proposed methods is examined on two important pathway families, the mammalian cell-cycle and a set of p53-related pathways, and also on a publicly available gene expression dataset of non-small cell lung cancer when combined with the existing prior knowledge on relevant signaling pathways. CONCLUSION: The newly proposed general prior construction framework extends the prior construction methodology to a more flexible framework that results in better inference when proper prior knowledge exists. Moreover, the extension of optimal Bayesian classification to multinomial mixtures where data sets are both small and unlabeled enables superior classifier design using small, unstructured data sets. We have demonstrated the effectiveness of our approach using pathway information and available knowledge of gene regulating functions; however, the underlying theory can be applied to a wide variety of knowledge types and to other applications in which samples are small.
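
To make concrete how a prior can drive the decision when samples are small, the following is a minimal sketch of Bayesian classification for a discrete (multinomial) feature with Dirichlet priors on the bin probabilities of each class. It does not construct an MKDIP. The hyperparameters, class names, counts, and function names are all hypothetical, and the class probabilities are assumed known.

```python
from typing import Dict, List


def effective_density(counts: List[int], alpha: List[float]) -> List[float]:
    """Posterior-predictive bin probabilities under a Dirichlet(alpha) prior."""
    total = sum(counts) + sum(alpha)
    return [(n + a) / total for n, a in zip(counts, alpha)]


def classify(bin_index: int,
             counts: Dict[str, List[int]],
             alpha: Dict[str, List[float]],
             class_prob: Dict[str, float]) -> str:
    """Pick the class with the largest prior-weighted predictive probability."""
    scores = {
        label: class_prob[label]
        * effective_density(counts[label], alpha[label])[bin_index]
        for label in counts
    }
    return max(scores, key=scores.get)


# Hypothetical data: three observations per class over three feature bins.
counts = {"healthy": [2, 1, 0], "disease": [0, 1, 2]}
class_prob = {"healthy": 0.5, "disease": 0.5}

# Hypothetical prior knowledge: bin 1 is believed to be typical of disease.
alpha = {"healthy": [1.0, 1.0, 1.0], "disease": [1.0, 4.0, 1.0]}

label_bin1 = classify(1, counts, alpha, class_prob)
label_bin0 = classify(0, counts, alpha, class_prob)
print(label_bin1, label_bin0)
```

In this toy case the data alone cannot separate the classes for bin 1 (one observation each), so the informative Dirichlet prior decides the label; for bin 0 the observed counts dominate.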


Subject(s)
Algorithms , Animals , Bayes Theorem , Carcinoma, Non-Small-Cell Lung/genetics , Cell Cycle , Entropy , Humans , Information Theory , Lung Neoplasms/genetics , Mammals/metabolism , Probability , Signal Transduction , Tumor Suppressor Protein p53/metabolism
19.
BMC Bioinformatics ; 18(Suppl 14): 517, 2017 12 28.
Article in English | MEDLINE | ID: mdl-29297285

ABSTRACT

BACKGROUND: Piwi-interacting RNAs (piRNAs) are a new class of small non-coding RNAs that are known to be associated with RNA silencing. The piRNAs play an important role in protecting the genome from invasive transposons in the germline. Recent studies have shown that piRNAs are linked to genome stability and a variety of human cancers. Due to their clinical importance, there is a pressing need for effective computational methods for identifying piRNAs. However, piRNAs lack conserved structural motifs and show relatively low sequence similarity across different species, which makes accurate computational prediction of piRNAs challenging. RESULTS: In this paper, we propose a novel method, piRNAdetect, for reliable computational prediction of piRNAs in genome sequences. In the proposed method, we first classify piRNA sequences in the training dataset that share similar sequence motifs and extract effective predictive features through the use of n-gram models (NGMs). The extracted NGM-based features are then used to construct a support vector machine that can be used for accurate prediction of novel piRNAs. CONCLUSIONS: We demonstrate the effectiveness of the proposed piRNAdetect algorithm through extensive performance evaluation based on piRNAs in three different species - H. sapiens, R. norvegicus, and M. musculus - obtained from piRBase and show that piRNAdetect outperforms the current state-of-the-art methods in terms of efficiency and accuracy.
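
As a rough illustration of n-gram-based sequence features, the sketch below turns an RNA sequence into a fixed-length vector of normalized n-gram frequencies, which is the kind of input a support vector machine could be trained on. The abstract does not specify the exact NGM features used by piRNAdetect, so the function name, the choice of n, and the normalization are assumptions.

```python
from collections import Counter
from itertools import product
from typing import List

ALPHABET = "ACGU"


def ngram_features(seq: str, n: int = 2) -> List[float]:
    """Normalized frequency of every possible n-gram over A, C, G, U."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    grams = ["".join(p) for p in product(ALPHABET, repeat=n)]
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts[g] for g in grams)  # ignores n-grams with other symbols
    return [counts[g] / total if total else 0.0 for g in grams]


# Hypothetical short sequence, for illustration only.
features = ngram_features("UGAU", n=2)
print(len(features))
```

With n = 2 the vector has 16 entries in a fixed order (AA, AC, AG, AU, CA, ...), so sequences of different lengths map to vectors of the same dimension.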


Subject(s)
Computational Biology/methods , RNA, Small Interfering/genetics , Support Vector Machine , Animals , Area Under Curve , Databases, Genetic , Humans , ROC Curve
20.
BMC Genomics ; 18(Suppl 6): 677, 2017 Oct 03.
Article in English | MEDLINE | ID: mdl-28984191

ABSTRACT

BACKGROUND: Flux Balance Analysis (FBA)-based mathematical modeling enables in silico prediction of systems behavior for genome-scale metabolic networks. Computational methods have been derived in the FBA framework to solve bi-level optimization for deriving "optimal" mutant microbial strains with targeted biochemical overproduction. The common inherent assumption of these methods is that the surviving mutants will always cooperate with the engineering objective by overproducing the maximum desired biochemicals. However, it has been shown that this optimistic assumption may not be valid in practice. METHODS: We study the validity and robustness of existing bi-level methods for strain optimization under uncertainty and in a non-cooperative environment. More importantly, we propose new pessimistic optimization formulations, P-ROOM and P-OptKnock, aiming to derive robust mutants with the desired overproduction under two different mutant cell survival models: (1) ROOM, which assumes that mutants have the minimum changes in reaction fluxes from wild-type flux values, and (2) the one considered by OptKnock, which maximizes the biomass production yield. When optimizing for desired overproduction, our pessimistic formulations derive more robust mutant strains by considering the uncertainty of the cell survival models at the inner level and the cooperation between the outer- and inner-level decision makers. For both P-ROOM and P-OptKnock, by converting multi-level formulations into single-level Mixed Integer Programming (MIP) problems based on the strong duality theorem, we can derive exact optimal solutions that are highly scalable with large networks. RESULTS: Our robust formulations P-ROOM and P-OptKnock are tested with a small E. coli core metabolic network and a large-scale E. coli iAF1260 network. We demonstrate that the original bi-level formulations (ROOM and OptKnock) derive mutants that may not achieve the predicted overproduction under uncertainty and in a non-cooperative environment. The knockouts obtained by the proposed pessimistic formulations yield higher chemical production rates than those obtained by the optimistic formulations. Moreover, with higher uncertainty levels, both cellular models under pessimistic approaches produce the same mutant strains. CONCLUSIONS: In this paper, we propose a new pessimistic optimization framework for mutant strain design. Our pessimistic strain optimization methods produce more robust solutions regardless of the inner-level mutant survival models, which is desirable because the models for cell survival are often only approximations of real-world systems. Such robust and reliable knockout strategies obtained by the pessimistic formulations would provide confidence for in vivo experimental design of microbial mutants of interest.
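
The difference between the optimistic and pessimistic viewpoints can be written in generic bi-level notation. The symbols below are assumptions chosen for illustration and are not taken from the paper: S is the stoichiometric matrix, v the flux vector, y a binary vector of reaction knockouts (y_j = 0 removes reaction j), K the knockout budget, and the inner objective is biomass maximization as in OptKnock. The paper's formulations additionally model uncertainty in the inner-level survival model, which this sketch omits.

```latex
% Inner-level (cell survival) solution set for a knockout strategy y
\Psi(y) = \arg\max_{v} \Big\{ v_{\mathrm{biomass}} \;:\;
    S v = 0,\;\; \ell_j y_j \le v_j \le u_j y_j \;\; \forall j \Big\}

% Optimistic bi-level design: the cell is assumed to pick the inner
% optimum that is best for the engineer
\max_{y \in \{0,1\}^n,\; \sum_j (1 - y_j) \le K} \;\;
    \max_{v \in \Psi(y)} \; v_{\mathrm{chemical}}

% Pessimistic bi-level design: the engineer guards against the inner
% optimum that is worst for the target chemical
\max_{y \in \{0,1\}^n,\; \sum_j (1 - y_j) \le K} \;\;
    \min_{v \in \Psi(y)} \; v_{\mathrm{chemical}}
```

When the inner problem has a unique optimum the two designs coincide; they differ only when several flux distributions are equally good for the cell, which is where the optimistic prediction can overstate production.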


Subject(s)
Models, Biological , Mutation , Computer Simulation , Escherichia coli/genetics , Escherichia coli/metabolism , Metabolic Flux Analysis , Succinic Acid/metabolism , Uncertainty