Results 1 - 20 of 44
1.
Proc Natl Acad Sci U S A ; 117(22): 12411-12418, 2020 06 02.
Article in English | MEDLINE | ID: mdl-32430323

ABSTRACT

Genetic factors and socioeconomic status (SES) inequalities play a large role in educational attainment, and both have been associated with variations in brain structure and cognition. However, genetics and SES are correlated, and no prior study has assessed their neural associations independently. Here we used a polygenic score for educational attainment (EduYears-PGS), as well as SES, in a longitudinal study of 551 adolescents to tease apart genetic and environmental associations with brain development and cognition. Subjects received a structural MRI scan at ages 14 and 19. At both time points, they performed three working memory (WM) tasks. SES and EduYears-PGS were correlated (r = 0.27) and had both common and independent associations with brain structure and cognition. Specifically, lower SES was related to less total cortical surface area and lower WM. EduYears-PGS was also related to total cortical surface area, but in addition had a regional association with surface area in the right parietal lobe, a region related to nonverbal cognitive functions, including mathematics, spatial cognition, and WM. SES, but not EduYears-PGS, was related to a change in total cortical surface area from age 14 to 19. This study demonstrates a regional association of EduYears-PGS, and an independent association of SES, with cognitive function and brain development. It suggests that SES inequalities, in particular parental education, are related to global aspects of cortical development and exert a persistent influence on brain development during adolescence.
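Because SES and EduYears-PGS are correlated, separating their associations amounts to fitting both predictors jointly rather than one at a time. A minimal sketch of that idea on synthetic data (all numbers are made up for illustration; this is not the study's analysis pipeline):

```python
# Toy illustration: with two correlated predictors, a joint regression
# recovers each one's independent association, unlike two separate
# simple regressions. Effect sizes 0.5 and 0.3 are invented.
import random

random.seed(0)
n = 2000
pgs = [random.gauss(0, 1) for _ in range(n)]
# SES correlated with PGS (r ~ 0.27, as reported in the abstract)
ses = [0.27 * p + (1 - 0.27 ** 2) ** 0.5 * random.gauss(0, 1) for p in pgs]
# Outcome depends on both, with independent effects 0.5 (SES) and 0.3 (PGS)
y = [0.5 * s + 0.3 * p + random.gauss(0, 1) for s, p in zip(ses, pgs)]

def ols2(x1, x2, y):
    """Least squares for y = b1*x1 + b2*x2 via the 2x2 normal equations."""
    s11 = sum(a * a for a in x1)
    s22 = sum(a * a for a in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * b for a, b in zip(x1, y))
    s2y = sum(a * b for a, b in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

b_ses, b_pgs = ols2(ses, pgs, y)  # both close to their true values
```

Fitting each predictor alone would inflate both coefficients, since each absorbs part of the other's effect through their correlation.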


Subjects
Brain/growth & development, Cognition, Educational Status, Academic Success, Adolescent, Adult, Brain/diagnostic imaging, Brain/physiology, Female, Humans, Longitudinal Studies, Magnetic Resonance Imaging, Male, Short-Term Memory, Multifactorial Inheritance, Social Class, Young Adult
2.
BMC Bioinformatics ; 22(1): 487, 2021 Oct 09.
Article in English | MEDLINE | ID: mdl-34627154

ABSTRACT

BACKGROUND: Alignment-free methods are a popular approach for comparing biological sequences, including complete genomes. The methods range from probability distributions of sequence composition to first and higher-order Markov chains, where a k-th order Markov chain over DNA has [Formula: see text] formal parameters. To circumvent this exponential growth in parameters, variable-length Markov chains (VLMCs) have gained popularity for applications in molecular biology and other areas. VLMCs adapt the depth depending on sequence context and thus curtail excesses in the number of parameters. The scarcity of fast, let alone parallel, software tools prompted the development of a parallel implementation using lazy suffix trees and a hash-based alternative. RESULTS: An extensive evaluation was performed on genomes ranging from 12 Mbp to 22 Gbp. Relevant learning parameters were chosen guided by the Bayesian Information Criterion (BIC) to avoid over-fitting. Our implementation greatly improves upon the state-of-the-art even in serial execution. It exhibits very good parallel scaling, with speed-ups for long sequences close to the optima indicated by Amdahl's law: 3 for 4 threads and about 6 for 16 threads. CONCLUSIONS: Our parallel implementation, released as open source under the GPLv3 license, provides a practically useful alternative to the state-of-the-art, allowing the construction of VLMCs even for very large genomes significantly faster than previously possible. Additionally, our parameter selection based on BIC gives guidance to end-users comparing genomes.
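The parameter-growth problem can be made concrete in a few lines (a toy sketch, not the paper's lazy-suffix-tree implementation): a full order-k chain over DNA has 4**k contexts, each with three free next-symbol parameters, whereas a VLMC only ever needs the contexts that actually occur in the training sequence.

```python
# Hypothetical illustration of why variable-length contexts save parameters.
from collections import Counter, defaultdict

def context_counts(seq, k):
    """Count next-symbol frequencies for every length-k context that occurs."""
    counts = defaultdict(Counter)
    for i in range(len(seq) - k):
        counts[seq[i:i + k]][seq[i + k]] += 1
    return counts

seq = "ACGTACGTTTACGGACGT" * 10      # invented training sequence
full = 4 ** 3                        # contexts of a full order-3 chain
params_full = full * 3               # 3 free parameters per context
used = len(context_counts(seq, 3))   # contexts actually observed in the data
```

On real genomes the gap widens with k: the full model grows exponentially while the number of observed, frequent contexts is bounded by the sequence itself.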


Subjects
Genome, Software, Bayes Theorem, DNA, Markov Chains
3.
Drug Discov Today Technol ; 32-33: 65-72, 2019 Dec.
Article in English | MEDLINE | ID: mdl-33386096

ABSTRACT

The application of AI technologies to synthesis prediction has developed very rapidly in recent years. We attempt here to give a comprehensive summary of the latest advances in retro-synthesis planning, forward synthesis prediction, and quantum chemistry-based reaction prediction models. Besides an introduction to the AI/ML models for addressing various synthesis-related problems, the sources of the reaction datasets used in model building are also covered. In addition to the predictive models, robotics-based high-throughput experimentation technology will be another crucial factor for conducting synthesis in an automated fashion. Some state-of-the-art high-throughput experimentation practices carried out in the pharmaceutical industry are highlighted in this chapter to give the reader a sense of how future chemistry will be conducted to make compounds faster and cheaper.


Subjects
Artificial Intelligence, Computer-Aided Design, Synthetic Drugs/chemistry, Humans
4.
PLoS Comput Biol ; 12(5): e1004871, 2016 05.
Article in English | MEDLINE | ID: mdl-27177143

ABSTRACT

By integrating Haar wavelets with Hidden Markov Models, we achieve drastically reduced running times for Bayesian inference using Forward-Backward Gibbs sampling. We show that this improves detection of genomic copy number variants (CNV) in array CGH experiments compared to the state-of-the-art, including standard Gibbs sampling. The method concentrates computational effort on chromosomal segments which are difficult to call, by dynamically and adaptively recomputing consecutive blocks of observations likely to share a copy number. This makes routine diagnostic use and re-analysis of legacy data collections feasible; to this end, we also propose an effective automatic prior. An open source software implementation of our method is available at http://schlieplab.org/Software/HaMMLET/ (DOI: 10.5281/zenodo.46262). This paper was selected for oral presentation at RECOMB 2016, and an abstract is published in the conference proceedings.
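The wavelet-block idea can be sketched in a few lines (a toy illustration under simplifying assumptions, not the HaMMLET implementation): for a piecewise-constant signal, Haar detail coefficients vanish away from breakpoints, so consecutive observations between breakpoints can be grouped into blocks and processed together.

```python
# Toy sketch: Haar details of a piecewise-constant signal are zero except
# where a pair of samples straddles a breakpoint, revealing block structure.

def haar_details(x):
    """One level of the Haar transform: pairwise differences (unnormalized)."""
    return [x[i] - x[i + 1] for i in range(0, len(x) - 1, 2)]

signal = [2.0] * 7 + [5.0] * 9     # two constant segments, one breakpoint
details = haar_details(signal)
breaks = sum(1 for d in details if d != 0.0)
```

With noise, the same idea applies after thresholding small coefficients; the sampler can then recompute only the few blocks whose assignment is uncertain.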


Subjects
Comparative Genomic Hybridization/statistics & numerical data, DNA Copy Number Variations, Genetic Models, Bayes Theorem, Breast Neoplasms/genetics, Cell Line, Computational Biology, Computer Simulation, Data Compression, Female, Human Genome, Humans, Markov Chains, Software
5.
BMC Bioinformatics ; 17(1): 224, 2016 May 28.
Article in English | MEDLINE | ID: mdl-27233515

ABSTRACT

BACKGROUND: Discovery of microRNAs (miRNAs) relies on predictive models for characteristic features from miRNA precursors (pre-miRNAs). The short length of miRNA genes and the lack of pronounced sequence features complicate this task. To accommodate the peculiarities of plant and animal miRNA systems, tools for the two systems have evolved differently. However, these tools are biased towards the species for which they were primarily developed and, consequently, their predictive performance on data sets from other species of the same kingdom might be lower. While these biases are intrinsic to the species, their characterization can lead to computational approaches capable of diminishing their negative effect on the accuracy of pre-miRNA predictive models. We investigate in this study how 45 predictive models, induced for data sets from 45 species distributed in eight subphyla/classes, perform when applied to species different from those used in their induction. RESULTS: Our computational experiments show that the separability of pre-miRNA and pseudo pre-miRNA instances is species-dependent and no feature set performs well for all species, even within the same subphylum/class. Mitigating this species dependency, we show that an ensemble of classifiers reduced the classification errors for all 45 species. As the ensemble members were obtained using meaningful, and yet computationally viable, feature sets, the ensembles also have a lower computational cost than individual classifiers that rely on energy stability parameters, which are of prohibitive computational cost in large-scale applications. CONCLUSION: In this study, the combination of multiple pre-miRNA feature sets and multiple learning biases enhanced the predictive accuracy of pre-miRNA classifiers for 45 species. This is certainly a promising approach to be incorporated in miRNA discovery tools towards more accurate and less species-dependent tools.
The material to reproduce the results from this paper can be downloaded from http://dx.doi.org/10.5281/zenodo.49754.
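Why a vote over classifiers with different biases can beat each member is easy to see on a toy example (an invented setup, not the paper's pre-miRNA classifiers): when the members' errors fall on different instances, the majority is right even where individual members fail.

```python
# Toy majority-vote ensemble: each "rule" reads a different feature and is
# only 75% accurate, yet the vote classifies every instance correctly,
# because the label is by construction the majority of the three features.
from itertools import product

samples = [(f, int(sum(f) >= 2)) for f in product([0, 1], repeat=3)]
rules = [lambda f, i=i: f[i] for i in range(3)]   # one weak rule per feature

def vote(f):
    return int(sum(r(f) for r in rules) >= 2)

acc_rules = [sum(r(f) == y for f, y in samples) / len(samples) for r in rules]
acc_vote = sum(vote(f) == y for f, y in samples) / len(samples)
```

The benefit disappears if the members make correlated errors, which is why combining feature sets with different learning biases, as above, matters.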


Subjects
Algorithms, Computational Biology/methods, MicroRNAs/genetics, RNA Precursors/genetics, Animals, Humans, MicroRNAs/chemistry, RNA Precursors/chemistry, Software, Species Specificity
6.
Bioinformatics ; 30(14): 1950-7, 2014 Jul 15.
Article in English | MEDLINE | ID: mdl-24618471

ABSTRACT

MOTIVATION: Counting the frequencies of k-mers in read libraries is often a first step in the analysis of high-throughput sequencing data. Infrequent k-mers are assumed to be a result of sequencing errors. The frequent k-mers constitute a reduced but error-free representation of the experiment, which can inform read error correction or serve as the input to de novo assembly methods. Ideally, the memory requirement for counting should be linear in the number of frequent k-mers and not in the, typically much larger, total number of k-mers in the read library. RESULTS: We present a novel method that balances time, space and accuracy requirements to efficiently extract frequent k-mers even for high-coverage libraries and large genomes such as human. Our method minimizes cache misses by using a pattern-blocked Bloom filter to remove infrequent k-mers from consideration, in combination with a novel sort-and-compact scheme, instead of a hash, for the actual counting. Although this increases theoretical complexity, the savings in cache misses reduce the empirical running times. A variant of the method can resort to a counting Bloom filter for even larger savings in memory, at the expense of false-negative rates in addition to the false-positive rates common to all Bloom filter-based approaches. A comparison with the state-of-the-art shows reduced memory requirements and running times. AVAILABILITY AND IMPLEMENTATION: The tools are freely available for download at http://bioinformatics.rutgers.edu/Software/Turtle and http://figshare.com/articles/Turtle/791582.
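The Bloom-filter prescreening idea can be sketched as follows (a simplified, hypothetical sketch; the tool itself uses a pattern-blocked filter and a sort-and-compact counter rather than this plain bit array and dictionary): a k-mer is admitted to the exact counting table only after it has been seen once, so singletons, which are presumed sequencing errors, never occupy counter memory.

```python
# Simplified Bloom-filter prefilter: the bit array remembers "seen once",
# the Counter stores occurrences beyond the first (i.e. repeated k-mers).
import hashlib
from collections import Counter

BITS = 1 << 16
seen = bytearray(BITS // 8)

def probe(kmer):
    """Return True if kmer was possibly seen before; mark it as seen."""
    h = int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")
    hit = True
    for i in range(3):                      # three derived hash functions
        b = (h >> (i * 20)) % BITS
        if not (seen[b >> 3] >> (b & 7)) & 1:
            hit = False
            seen[b >> 3] |= 1 << (b & 7)
    return hit

reads, k = ["ACGTACGTAC", "ACGTACGTAC", "TTTTTTTTTT"], 4  # toy read library
counts = Counter()
for read in reads:
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if probe(kmer):                     # only count repeated k-mers
            counts[kmer] += 1
```

Each stored count is occurrences minus one, and Bloom false positives can inflate a count by one; both are acceptable when the goal is separating frequent k-mers from singletons.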


Subjects
Algorithms, High-Throughput Nucleotide Sequencing/methods, DNA Sequence Analysis/methods, Software, Human Genome, Humans
7.
BMC Bioinformatics ; 15: 124, 2014 May 02.
Article in English | MEDLINE | ID: mdl-24884650

ABSTRACT

BACKGROUND: Computational discovery of microRNAs (miRNA) is based on pre-determined sets of features from miRNA precursors (pre-miRNA). Some feature sets are composed of sequence-structure patterns commonly found in pre-miRNAs, while others are a combination of more sophisticated RNA features. In this work, we analyze the discriminant power of seven feature sets, which are used in six pre-miRNA prediction tools. The analysis is based on the classification performance achieved with these feature sets for the training algorithms used in these tools. We also evaluate feature discrimination through the F-score and feature importance in the induction of random forests. RESULTS: Small or non-significant differences were found among the estimated classification performances of classifiers induced using sets with diversification of features, despite the wide differences in their dimension. Inspired by these results, we obtained a lower-dimensional feature set, which achieved a sensitivity of 90% and a specificity of 95%. These estimates are within 0.1% of the maximal values obtained with any feature set (SELECT, Section "Results and discussion"), while being 34 times faster to compute than FS2 (see Section "Results and discussion"), the computationally least expensive feature set from the literature that performs within 0.1% of the maximal values. Of the six tools used as references in our experiments, five showed lower sensitivity or specificity. CONCLUSION: In miRNA discovery, the number of putative miRNA loci is in the order of millions. Analysis of putative pre-miRNAs using a computationally expensive feature set would be wasteful or even unfeasible for large genomes.
In this work, we propose a relatively inexpensive feature set and explore most of the learning aspects implemented in current ab-initio pre-miRNA prediction tools, which may lead to the development of efficient ab-initio pre-miRNA discovery tools. The material to reproduce the main results from this paper can be downloaded from http://bioinformatics.rutgers.edu/Static/Software/discriminant.tar.gz.
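One of the measures mentioned above, the F-score for feature discrimination, has a simple closed form; the sketch below shows an assumed variant of it on invented values (between-class separation of the feature means over within-class spread):

```python
# F-score for a single feature: large when the positive- and negative-class
# means are far apart relative to the spread inside each class.

def f_score(pos, neg):
    allv = pos + neg
    m = sum(allv) / len(allv)
    mp = sum(pos) / len(pos)
    mn = sum(neg) / len(neg)
    num = (mp - m) ** 2 + (mn - m) ** 2
    den = (sum((x - mp) ** 2 for x in pos) / (len(pos) - 1)
           + sum((x - mn) ** 2 for x in neg) / (len(neg) - 1))
    return num / den

good = f_score([5.0, 5.1, 4.9], [1.0, 1.1, 0.9])   # well-separated feature
poor = f_score([5.0, 1.0, 3.0], [4.9, 1.1, 3.1])   # overlapping feature
```

Ranking features by such a score is cheap, which is exactly what makes it attractive for trimming high-dimensional feature sets before training.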


Subjects
MicroRNAs/chemistry, RNA Precursors/chemistry, Algorithms, Artificial Intelligence, Base Composition, Computational Biology/methods, Humans, Software
8.
Bioinformatics ; 28(18): i325-i332, 2012 Sep 15.
Article in English | MEDLINE | ID: mdl-22962448

ABSTRACT

MOTIVATION: Mapping billions of reads from next generation sequencing experiments to reference genomes is a crucial task, which can require hundreds of hours of running time on a single CPU even for the fastest known implementations. Traditional approaches have difficulties dealing with matches of large edit distance, particularly in the presence of frequent or large insertions and deletions (indels). This is a serious obstacle both in determining the spectrum and abundance of genetic variations and in personal genomics. RESULTS: For the first time, we adopt the approximate string matching paradigm of geometric embedding to read mapping, thus rephrasing it as nearest neighbor queries in a q-gram frequency vector space. Using the L1 distance between frequency vectors has the benefit of providing lower bounds for an edit distance with affine gap costs. Using a cache-oblivious kd-tree, we realize running times that match the state-of-the-art. Additionally, running time and memory requirements are about constant for read lengths between 100 and 1000 bp. We provide a first proof-of-concept that geometric embedding is a promising paradigm for read mapping and that L1 distance might serve to detect structural variations. TreQ, our initial implementation of that concept, performs more accurately than many popular read mappers over a wide range of structural variants. AVAILABILITY AND IMPLEMENTATION: TreQ will be released under the GNU General Public License (GPL), and precomputed genome indices will be provided for download at http://treq.sf.net. CONTACT: pavelm@cs.rutgers.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
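The embedding itself is easy to sketch (a minimal illustration of the general q-gram idea, not the TreQ index): map each string to its q-gram frequency vector and compare vectors with L1 distance. A single substitution changes at most q q-grams, so the L1 distance between profiles stays small for similar strings, which is what makes it usable as a lower-bound filter for edit distance.

```python
# q-gram frequency profiles compared under L1 distance.
from collections import Counter

def qgram_profile(s, q=3):
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def l1(p, r):
    # Counter returns 0 for missing keys, so this sums over the union.
    return sum(abs(p[g] - r[g]) for g in set(p) | set(r))

a = "ACGTACGTACGT"
b = "ACGTACCTACGT"                 # one substitution relative to a
c = "TTTTTTTTTTTT"                 # unrelated string
d_ab = l1(qgram_profile(a), qgram_profile(b))
d_ac = l1(qgram_profile(a), qgram_profile(c))
```

Here the substitution moves the profile by exactly 2*q = 6, far less than the distance to the unrelated string, so cheap vector comparisons can prune most candidate positions before any expensive alignment.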


Subjects
High-Throughput Nucleotide Sequencing/methods, INDEL Mutation, DNA Sequence Analysis/methods, Chromosome Mapping, Genetic Variation, Human Genome, Genomics/methods, Humans, Nucleotides/chemistry
9.
Bioinformatics ; 28(22): 2875-82, 2012 Nov 15.
Article in English | MEDLINE | ID: mdl-23060616

ABSTRACT

MOTIVATION: Next-generation sequencing techniques have facilitated a large-scale analysis of human genetic variation. Despite the advances in sequencing speed, the computational discovery of structural variants is not yet standard. It is likely that many variants have remained undiscovered in most sequenced individuals. RESULTS: Here, we present a novel internal segment size based approach, which organizes all, including concordant, reads into a read alignment graph, where max-cliques represent maximal contradiction-free groups of alignments. A novel algorithm then enumerates all max-cliques and statistically evaluates them for their potential to reflect insertions or deletions. For the first time in the literature, we compare a large range of state-of-the-art approaches using simulated Illumina reads from a fully annotated genome and present relevant performance statistics. We achieve superior performance, in particular, for deletions or insertions (indels) of length 20-100 nt. This has been previously identified as a remaining major challenge in structural variation discovery, in particular, for insert size based approaches. In this size range, we even outperform split-read aligners. We achieve competitive results also on biological data, where our method is the only one to make a substantial amount of correct predictions, which, additionally, are disjoint from those by split-read aligners. AVAILABILITY: CLEVER is open source (GPL) and available from http://clever-sv.googlecode.com. CONTACT: as@cwi.nl or tm@cwi.nl. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms, Genetic Variation, Human Genome, Computer Simulation, Humans, INDEL Mutation
10.
PLoS One ; 18(6): e0286074, 2023.
Article in English | MEDLINE | ID: mdl-37279196

ABSTRACT

Compression as an accelerant of computation is increasingly recognized as an important component in engineering fast real-world machine learning methods for big data; cf. its impact on genome-scale approximate string matching. Previous work showed that compression can accelerate algorithms for Hidden Markov Models (HMM) with discrete observations, both for the classical frequentist HMM algorithms (Forward Filtering, Backward Smoothing, and Viterbi) and for Gibbs sampling in Bayesian HMM. For Bayesian HMM with continuous-valued observations, compression was shown to greatly accelerate computations for specific types of data. For instance, data from large-scale experiments interrogating structural genetic variation can be assumed to be piece-wise constant with noise, or, equivalently, data generated by HMM with dominant self-transition probabilities. Here we extend the compressive computation approach to the classical frequentist HMM algorithms on continuous-valued observations, providing the first compressive approach for this problem. In a large-scale simulation study, we demonstrate empirically that in many settings compressed HMM algorithms very clearly outperform the classical algorithms, with no, or only an insignificant, effect on the computed probabilities and inferred maximum-likelihood state paths. This provides an efficient approach to big data computations with HMM. An open-source implementation of the method is available from https://github.com/lucabello/wavelet-hmms.
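The core compression trick for the forward algorithm can be sketched directly (a simplified sketch with discretized observations, not the paper's wavelet-based scheme): when an observation value repeats L times, the L identical forward updates collapse into one matrix power, computable by repeated squaring in O(log L) multiplications.

```python
# Forward algorithm on run-length-encoded observations: identical updates
# become a matrix power. Model parameters below are invented.

def mat_mul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(X, p):
    n = len(X)
    R = [[float(i == j) for j in range(n)] for i in range(n)]
    while p:                      # exponentiation by squaring
        if p & 1:
            R = mat_mul(R, X)
        X = mat_mul(X, X)
        p >>= 1
    return R

A = [[0.99, 0.01], [0.02, 0.98]]  # dominant self-transitions
B = {0: [0.9, 0.2], 1: [0.1, 0.8]}  # emission probs per symbol, per state
pi = [0.5, 0.5]

def forward_compressed(runs):
    """Likelihood from run-length-encoded observations [(symbol, length)]."""
    alpha = [pi[s] * B[runs[0][0]][s] for s in range(2)]
    first = True
    for sym, length in runs:
        steps = length - 1 if first else length
        first = False
        M = [[A[i][j] * B[sym][j] for j in range(2)] for i in range(2)]
        P = mat_pow(M, steps)
        alpha = [sum(alpha[i] * P[i][j] for i in range(2)) for j in range(2)]
    return sum(alpha)

def forward_naive(obs):
    alpha = [pi[s] * B[obs[0]][s] for s in range(2)]
    for o in obs[1:]:
        alpha = [B[o][j] * sum(alpha[i] * A[i][j] for i in range(2))
                 for j in range(2)]
    return sum(alpha)

lik_fast = forward_compressed([(0, 50), (1, 30)])
lik_slow = forward_naive([0] * 50 + [1] * 30)      # same likelihood
```

For piecewise-constant data the number of runs is far smaller than the number of observations, which is where the speed-up comes from; in practice log-space or scaling is needed to avoid underflow on long sequences.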


Subjects
Algorithms, Markov Chains, Bayes Theorem, Probability, Computer Simulation
11.
Bioinformatics ; 27(7): 946-52, 2011 Apr 01.
Article in English | MEDLINE | ID: mdl-21266444

ABSTRACT

MOTIVATION: Analyzing short time-courses is a frequent and relevant problem in molecular biology, as, for example, 90% of gene expression time-course experiments span at most nine time-points. The biological or clinical questions addressed are elucidating gene regulation by identification of co-expressed genes, predicting response to treatment in clinical, trial-like settings, or classifying novel toxic compounds based on similarity of gene expression time-courses to those of known toxic compounds. The latter problem is characterized by irregular and infrequent sample times and a total lack of prior assumptions about the incoming query, which comes in stark contrast to clinical settings and requires implicitly performing a local, gapped alignment of time series. The current state-of-the-art method (SCOW) uses a variant of dynamic time warping and models time series as higher order polynomials (splines). RESULTS: We suggest modeling time-courses monitoring response to toxins by piecewise constant functions, represented as left-right Hidden Markov Models. A Bayesian approach to parameter estimation and inference helps to cope with the short, but highly multivariate, time-courses. We improve prediction accuracy by 7% and 4%, respectively, when classifying toxicology and stress response data. We also reduce running times by at least a factor of 140; note that reasonable running times are crucial when classifying response to toxins. In conclusion, we have demonstrated that appropriate reduction of model complexity can result in substantial improvements both in classification performance and running time. AVAILABILITY: A Python package implementing the methods described is freely available under the GPL from http://bioinformatics.rutgers.edu/Software/MVQueries/.


Subjects
Gene Expression Profiling/methods, Animals, Bayes Theorem, Classification, Gene Expression/drug effects, Kinetics, Mice, Biological Toxins/pharmacology
12.
Bioinformatics ; 27(12): 1645-52, 2011 Jun 15.
Article in English | MEDLINE | ID: mdl-21511716

ABSTRACT

MOTIVATION: Changes in gene expression levels play a central role in tumors. Additional information about the distribution of gene expression levels and distances between adjacent genes on chromosomes should be integrated into the analysis of tumor expression profiles. RESULTS: We use a Hidden Markov Model with distance-scaled transition matrices (DSHMM) to incorporate chromosomal distances of adjacent genes into the identification of differentially expressed genes in breast cancer. We train the DSHMM by integrating prior knowledge about potential distributions of expression levels of differentially expressed and unchanged genes in tumor. We find that especially the combination of these data, and to a lesser extent the modeling of distances between adjacent genes, contributes to a substantial improvement in the identification of differentially expressed genes in comparison to other existing methods. This performance benefit is also supported by the identification of genes well known to be associated with breast cancer. This suggests applying DSHMMs to screening of other tumor expression profiles. AVAILABILITY: The DSHMM is available as part of the open-source Java library Jstacs (www.jstacs.de/index.php/DSHMM).


Subjects
Gene Expression Profiling/methods, Neoplastic Gene Expression Regulation, Markov Chains, Breast Neoplasms/genetics, Breast Neoplasms/metabolism, Chromosome Mapping, Female, Gene Expression, Neoplasm Genes, Humans, Genetic Models
13.
Mol Inform ; 41(12): e2200043, 2022 12.
Article in English | MEDLINE | ID: mdl-35732584

ABSTRACT

Computer aided synthesis planning, suggesting synthetic routes for molecules of interest, is a rapidly growing field. The machine learning methods used are often dependent on access to large datasets for training, but finite experimental budgets limit how much data can be obtained from experiments. This suggests the use of schemes for data collection such as active learning, which identifies the data points of highest impact for model accuracy, and which has been used in recent studies with success. However, little has been done to explore the robustness of the methods predicting reaction yield when used together with active learning to reduce the amount of experimental data needed for training. This study aims to investigate the influence of machine learning algorithms and the number of initial data points on reaction yield prediction for two public high-throughput experimentation datasets. Our results show that active learning based on output margin reached a pre-defined AUROC faster than random sampling on both datasets. Analysis of feature importance of the trained machine learning models suggests active learning had a larger influence on the model accuracy when only a few features were important for the model prediction.
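Output-margin selection, as used above, can be stated in a few lines (a generic sketch on invented probabilities, not the paper's yield models): query the unlabeled experiments whose predicted class probability is closest to the decision boundary.

```python
# Margin-based active learning: pick the least confident predictions,
# i.e. those with predicted probability closest to 0.5.

def select_by_margin(probs, budget):
    """Return indices of the `budget` least confident predictions."""
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:budget]

probs = [0.95, 0.52, 0.10, 0.47, 0.80]   # hypothetical model outputs
picked = select_by_margin(probs, 2)      # the two points nearest 0.5
```

Random sampling would spend the same experimental budget on points the model already classifies confidently, which is why margin selection tends to reach a target AUROC with fewer labels.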


Subjects
Machine Learning
14.
BMC Bioinformatics ; 12: 428, 2011 Nov 02.
Article in English | MEDLINE | ID: mdl-22047014

ABSTRACT

BACKGROUND: Hidden Markov Models (HMM) are often used for analyzing Comparative Genomic Hybridization (CGH) data to identify chromosomal aberrations or copy number variations by segmenting observation sequences. For efficiency reasons the parameters of a HMM are often estimated with maximum likelihood and a segmentation is obtained with the Viterbi algorithm. This introduces considerable uncertainty in the segmentation, which can be avoided with Bayesian approaches integrating out parameters using Markov Chain Monte Carlo (MCMC) sampling. While the advantages of Bayesian approaches have been clearly demonstrated, the likelihood-based approaches are still preferred in practice for their lower running times; datasets coming from high-density arrays and next generation sequencing amplify these problems. RESULTS: We propose an approximate sampling technique, inspired by compression of discrete sequences in HMM computations and by kd-trees to leverage spatial relations between data points in typical data sets, to speed up the MCMC sampling. CONCLUSIONS: We test our approximate sampling method on simulated and biological ArrayCGH datasets and high-density SNP arrays, and demonstrate speed-ups of 10 to 60 and of about 90, respectively, while achieving results competitive with state-of-the-art Bayesian approaches. AVAILABILITY: An implementation of our method will be made available as part of the open source GHMM library from http://ghmm.org.


Subjects
DNA Copy Number Variations, Genetic Models, Algorithms, Base Sequence, Bayes Theorem, Comparative Genomic Hybridization, Humans, Mantle-Cell Lymphoma/genetics, Markov Chains, Monte Carlo Method, Probability
15.
PeerJ Comput Sci ; 7: e397, 2021.
Article in English | MEDLINE | ID: mdl-33817043

ABSTRACT

The Alternating Direction Method of Multipliers (ADMM) is a popular and promising distributed framework for solving large-scale machine learning problems. We consider decentralized consensus-based ADMM, in which nodes may only communicate with one-hop neighbors; this may cause slow convergence. We investigate the impact of network topology on the performance of ADMM-based learning of a Support Vector Machine using expander and mean-degree graphs, as well as some common modern network topologies. In particular, we investigate to which degree the expansion property of the network influences convergence in terms of iterations, training time, and communication time, and we suggest which topologies are preferable. Additionally, we provide an implementation that makes these theoretical advances easily available. The results show that the convergence of decentralized ADMM-based learning of SVMs is improved by using graphs with large spectral gaps and higher, homogeneous degrees.

16.
Alzheimers Res Ther ; 13(1): 151, 2021 09 06.
Article in English | MEDLINE | ID: mdl-34488882

ABSTRACT

BACKGROUND: In Alzheimer's disease, amyloid-β (Aβ) peptides aggregate in the brain, lowering CSF amyloid levels - a key pathological hallmark of the disease. However, lowered CSF amyloid levels may also be present in cognitively unimpaired elderly individuals. Therefore, it is of great value to explain the variance in disease progression among patients with Aβ pathology. METHODS: A cohort of n=2293 participants, of whom n=749 were Aβ positive, was selected from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database to study heterogeneity in disease progression for individuals with Aβ pathology. The analysis used baseline clinical variables including demographics, genetic markers, and neuropsychological data to predict how the cognitive ability and AD diagnosis of subjects progressed using statistical models and machine learning. Due to the relatively low prevalence of Aβ pathology, models fit only to Aβ-positive subjects were compared to models fit to an extended cohort including subjects without established Aβ pathology, adjusting for covariate differences between the cohorts. RESULTS: Aβ pathology status was determined based on the Aβ42/Aβ40 ratio. The best predictive model of change in cognitive test scores for Aβ-positive subjects at the 2-year follow-up achieved an R2 score of 0.388, while the best model predicting adverse changes in diagnosis achieved a weighted F1 score of 0.791. Aβ-positive subjects declined faster on average than those without Aβ pathology, but the specific level of CSF Aβ was not predictive of progression rate. When predicting cognitive score change 4 years after baseline, the best model achieved an R2 score of 0.325, and fitting models to the extended cohort improved performance. Moreover, using all clinical variables outperformed the best model based only on a suite of cognitive test scores, which achieved an R2 score of 0.228.
CONCLUSION: Our analysis shows that CSF levels of Aβ are not strong predictors of the rate of cognitive decline in Aβ-positive subjects when adjusting for other variables. Baseline assessments of cognitive function account for the majority of variance explained in the prediction of 2-year decline but are insufficient for achieving optimal results in longer-term predictions. Predicting changes both in cognitive test scores and in diagnosis provides multiple perspectives on the progression of potential AD subjects.


Subjects
Alzheimer Disease, Cognitive Dysfunction, Aged, Alzheimer Disease/complications, Amyloid beta-Peptides, Biomarkers, Cognition, Cognitive Dysfunction/diagnosis, Disease Progression, Humans, Neuropsychological Tests, tau Proteins
17.
BMC Bioinformatics ; 11: 9, 2010 Jan 06.
Article in English | MEDLINE | ID: mdl-20053276

ABSTRACT

BACKGROUND: Cluster analysis is an important technique for the exploratory analysis of biological data. Such data is often high-dimensional, inherently noisy and contains outliers. This makes clustering challenging. Mixtures are versatile and powerful statistical models which perform robustly for clustering in the presence of noise and have been successfully applied in a wide range of applications. RESULTS: PyMix - the Python mixture package - implements algorithms and data structures for clustering with basic and advanced mixture models. The advanced models include context-specific independence mixtures, mixtures of dependence trees and semi-supervised learning. PyMix is licensed under the GNU General Public License (GPL). PyMix has been successfully used for the analysis of biological sequence, complex disease and gene expression data. CONCLUSIONS: PyMix is a useful tool for cluster analysis of biological data. Due to the general nature of the framework, PyMix can be applied to a wide range of applications and data sets.
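The model family behind such packages is easy to illustrate from scratch (a toy EM fit of a two-component one-dimensional Gaussian mixture on synthetic data; this is not PyMix code, and real mixtures need log-space computations and better initialization):

```python
# From-scratch EM for a two-component 1-D Gaussian mixture.
import math
import random

random.seed(1)
data = ([random.gauss(0, 1) for _ in range(300)]
        + [random.gauss(6, 1) for _ in range(300)])

def em(data, iters=50):
    mu = [min(data), max(data)]        # crude initialization at the extremes
    var, w = [1.0, 1.0], [0.5, 0.5]
    for _ in range(iters):
        # E-step: per-point responsibilities of each component
        resp = []
        for x in data:
            p = [w[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) for k in range(2)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M-step: responsibility-weighted means, variances and weights
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            w[k] = nk / len(data)
    return mu, var, w

mu, var, w = em(data)   # means approach the true centers 0 and 6
```

The soft responsibilities are what make mixtures robust to noise and overlap, compared with hard assignments as in k-means.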


Subjects
Cluster Analysis, Computational Biology/methods, Software, Genetic Databases, Gene Expression Profiling/methods, Automated Pattern Recognition, DNA Sequence Analysis
18.
Bioinformatics ; 25(12): i6-14, 2009 Jun 15.
Article in English | MEDLINE | ID: mdl-19478017

ABSTRACT

MOTIVATION: Personalized medicine based on molecular aspects of diseases, such as gene expression profiling, has become increasingly popular. However, one faces multiple challenges when analyzing clinical gene expression data; most of the well-known theoretical issues such as high dimension of feature spaces versus few examples, noise and missing data apply. Special care is needed when designing classification procedures that support personalized diagnosis and choice of treatment. Here, we particularly focus on classification of interferon-beta (IFNbeta) treatment response in Multiple Sclerosis (MS) patients, which has attracted substantial attention in the recent past. Half of the patients remain unaffected by IFNbeta treatment, which is still the standard; for them, the treatment should be ceased in a timely manner to mitigate side effects. RESULTS: We propose constrained estimation of mixtures of hidden Markov models as a methodology to classify patient response to IFNbeta treatment. The advantages of our approach are that it takes the temporal nature of the data into account and that it is robust with respect to noise, missing data and mislabeled samples. Moreover, mixture estimation enables exploring the presence of response sub-groups of patients on the transcriptional level. We clearly outperformed all prior approaches in terms of prediction accuracy, raising it, for the first time, to above 90%. Additionally, we were able to identify potentially mislabeled samples and to sub-divide the good responders into two sub-groups that exhibited different transcriptional response programs. This is supported by recent findings on MS pathology and therefore may raise interesting clinical follow-up questions. AVAILABILITY: The method is implemented in the GQL framework and is available at http://www.ghmm.org/gql. Datasets are available at http://www.cin.ufpe.br/~igcf/MSConst. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Computational Biology/methods, Gene Expression Profiling/methods, Classification/methods, Humans, Interferon-beta/chemistry, Interferon-beta/pharmacology, Markov Chains, Multiple Sclerosis/genetics, Multiple Sclerosis/metabolism
19.
J Healthc Inform Res ; 4(1): 1-18, 2020 Mar.
Article in English | MEDLINE | ID: mdl-35415439

ABSTRACT

Many factors affect blood glucose levels in type 1 diabetics, several of which vary widely in both the magnitude and the delay of their effect. Modern rapid-acting insulins generally have a peak effect after 60-90 min, while carbohydrate intake can affect blood glucose levels more rapidly for high-glycemic-index foods and more slowly for other carbohydrate sources. Good estimates of near-future glucose levels are important both for diabetic patients managing their insulin dosing manually and for closed-loop systems making dosing decisions. Modern continuous glucose monitoring systems provide excellent sources of data for training machine learning models to predict future glucose levels. In this paper, we present an approach for predicting blood glucose levels for diabetics up to 1 h into the future. The approach is based on recurrent neural networks trained in an end-to-end fashion, requiring nothing but the glucose level history for the patient. Our approach obtains results that are comparable to the state of the art on the Ohio T1DM dataset for blood glucose level prediction. In addition to predicting the future glucose value, our model provides an estimate of its certainty, helping users to interpret the predicted levels. This is realized by training the recurrent neural network to parameterize a univariate Gaussian distribution over the output. The approach needs no feature engineering or data preprocessing and is computationally inexpensive. We evaluate our method using the standard root-mean-squared error (RMSE) metric, along with a blood glucose-specific metric called the surveillance error grid (SEG). We further study the properties of the distribution learned by the model, using experiments that determine the nature of the certainty estimate the model is able to capture.
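The certainty mechanism described above amounts to giving the network two output units per prediction, a mean and a (positivity-constrained) standard deviation, and training with the Gaussian negative log-likelihood instead of squared error. A minimal sketch of that loss, assuming a softplus constraint on the raw standard-deviation output (the paper may use a different parameterization, and the example values are invented):

```python
import numpy as np

def gaussian_nll(y, mu, raw_sigma):
    """Mean negative log-likelihood of targets y under N(mu, sigma^2),
    with sigma = softplus(raw_sigma) to keep it strictly positive.
    This is the training loss placed on the network's two output units."""
    sigma = np.log1p(np.exp(raw_sigma))          # softplus
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                   + (y - mu)**2 / (2 * sigma**2))

# At prediction time, mu is the glucose forecast (mg/dL) and sigma its
# uncertainty; an approximate 95% interval is mu +/- 1.96 * sigma.
y   = np.array([110.0, 142.0, 155.0])            # observed glucose values
mu  = np.array([112.0, 138.0, 150.0])            # hypothetical network means
raw = np.array([2.0, 2.5, 2.3])                  # hypothetical raw sigma outputs
print(gaussian_nll(y, mu, raw))
```

Minimizing this loss penalizes both inaccurate means and miscalibrated variances: the model is rewarded for reporting a wide sigma exactly when its point forecast is unreliable, which is what makes the certainty estimate interpretable.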

20.
PeerJ ; 8: e8225, 2020.
Article in English | MEDLINE | ID: mdl-32025365

ABSTRACT

Natural history museums are unique spaces for interdisciplinary research and educational innovation. Through extensive exhibits and public programming, and by hosting rich communities of amateurs, students, and researchers at all stages of their careers, they provide a place-based window for the integration of science and discovery, as well as a locus for community engagement. At the same time, like a synthesis radio telescope, when joined together through emerging digital resources, the global community of museums (the 'Global Museum') is more than the sum of its parts, allowing insights and answers to diverse biological, environmental, and societal questions at the global scale, across eons of time, and spanning the vast diversity of the Tree of Life. We argue that, whereas natural history collections and museums began with a focus on describing the diversity and peculiarities of species on Earth, they are now increasingly leveraged in new ways that significantly expand their impact and relevance. These new directions include the possibility of asking new, often interdisciplinary questions in basic and applied science, such as in biomimetic design, and of contributing to solutions to climate change, global health, and food security challenges. As institutions, they have long been incubators for cutting-edge research in biology while simultaneously providing core infrastructure for research on present and future societal needs. Here we explore how the intersection between pressing issues in environmental and human health and rapid technological innovation has reinforced the relevance of museum collections. We do this by providing examples, as food for thought for both the broader academic community and museum scientists, on the evolving role of museums. We also identify challenges to realizing the full potential of natural history collections and the Global Museum for science and society, and discuss the critical need to grow these collections. We then focus on mapping and modelling of museum data (including place-based approaches and discovery), and explore the main projects, platforms, and databases enabling this growth. Finally, we consider how to improve protocols for the long-term storage of specimens and tissues, ensuring proper connection with tomorrow's technologies and hence further increasing the relevance of natural history museums.
