Results 1 - 20 of 41
1.
Hum Genet ; 141(9): 1515-1528, 2022 Sep.
Article in English | MEDLINE | ID: mdl-34862561

ABSTRACT

Genetic data have become increasingly complex within the past decade, leading researchers to pursue increasingly complex questions, such as those involving epistatic interactions and protein prediction. Traditional methods are ill-suited to answer these questions, but machine learning (ML) techniques offer an alternative solution. ML algorithms are commonly used in genetics to predict or classify subjects, but some methods evaluate which features (variables) are responsible for creating a good prediction; this is called feature importance. This is critical in genetics, as researchers are often interested in which features (e.g., SNP genotype or environmental exposure) are responsible for a good prediction. This allows for deeper analysis beyond simple prediction, including the determination of risk factors associated with a given phenotype. Feature importance further permits the researcher to peer inside the black box of many ML algorithms to see how they work and which features are critical in informing a good prediction. This review focuses on ML methods that provide feature importance metrics for the analysis of genetic data. Five major categories of ML algorithms are described: k nearest neighbors, artificial neural networks, deep learning, support vector machines, and random forests. The review ends with a discussion of how to choose the best machine for a data set. This review will be particularly useful for genetic researchers looking to use ML methods to answer questions beyond basic prediction and classification.
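
The idea of feature importance can be illustrated with a small sketch. Permutation importance is one common metric of this kind; it is used here only as an example and is not claimed to be the metric the review recommends. The function names, the toy data, and the stand-in model are all hypothetical.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return hits / len(labels)

def permutation_importance(predict, rows, labels, column, seed=0):
    """Drop in accuracy after shuffling one feature column.

    A drop near zero suggests the model does not rely on that feature.
    """
    rng = random.Random(seed)
    shuffled = [row[column] for row in rows]
    rng.shuffle(shuffled)
    # Rebuild each row with the shuffled value in the chosen column.
    permuted = [row[:column] + [value] + row[column + 1:]
                for row, value in zip(rows, shuffled)]
    return accuracy(predict, rows, labels) - accuracy(predict, permuted, labels)

# Toy data (hypothetical): the label copies feature 0; feature 1 is noise.
rows = [[0, 1], [1, 1], [0, 0], [1, 0], [0, 1], [1, 0]]
labels = [0, 1, 0, 1, 0, 1]
toy_model = lambda row: row[0]  # stands in for any fitted classifier

for column in (0, 1):
    print(column, permutation_importance(toy_model, rows, labels, column))
```

Because the stand-in model ignores feature 1, shuffling that column cannot change its predictions, so its importance is exactly zero; shuffling feature 0 can only lower accuracy.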


Subjects
Machine Learning , Support Vector Machine , Algorithms , Humans , Neural Networks, Computer
2.
J Am Assoc Lab Anim Sci ; 60(3): 298-305, 2021 05 01.
Article in English | MEDLINE | ID: mdl-33653438

ABSTRACT

Over the past 2 decades, zebrafish, Danio rerio, have become a mainstream laboratory animal model, yet zebrafish husbandry practices remain far from standardized. Feeding protocols play a critical role in the health, wellbeing, and productivity of zebrafish laboratories, yet they vary significantly between facilities. In this study, we compared our current feeding protocol for juvenile zebrafish (30 dpf to 75 dpf), a 3:1 mixture of fish flake and freeze-dried krill fed twice per day with live artemia twice per day (FKA), to a diet of Gemma Micro 300 fed once per day with live artemia once per day (GMA). Our results showed that juvenile EK wild-type zebrafish fed GMA were longer and heavier than juveniles fed FKA. As compared with FKA-fed juveniles, fish fed GMA as juveniles showed better reproductive performance as measured by spawning success, fertilization rate, and clutch size. As adults, fish from both feeding protocols were acclimated to our standard adult feeding protocol, and the long-term effects of juvenile diet were assessed. At 2 y of age, the groups showed no difference in mortality or fecundity. Reproductive performance is a crucial aspect of zebrafish research, as much of the research focuses on the developing embryo. Here we show that switching juvenile zebrafish from a mixture of flake and krill to Gemma Micro 300 improves reproductive performance, even with fewer feedings of live artemia, thus simplifying husbandry practices.


Subjects
Reproduction , Zebrafish , Animal Feed , Animals , Artemia , Diet/veterinary , Fertility
3.
BioData Min ; 9: 14, 2016.
Article in English | MEDLINE | ID: mdl-27053949

ABSTRACT

BACKGROUND: Identifying gene-gene interactions is essential to understand disease susceptibility and to detect genetic architectures underlying complex diseases. Here, we aimed at developing a permutation-based methodology relying on a machine learning method, random forest (RF), to detect gene-gene interactions. Our approach, called permuted random forest (pRF), identified the top interacting single nucleotide polymorphism (SNP) pairs by estimating how much the power of a random forest classification model is influenced by removing pairwise interactions. RESULTS: We systematically tested our approach on a simulation study with datasets possessing various genetic constraints including heritability, number of SNPs, sample size, etc. Our methodology showed high success rates for detecting the interacting SNP pair. We also applied our approach to two bladder cancer datasets, which showed consistent results with well-studied methodologies, such as multifactor dimensionality reduction (MDR) and statistical epistasis network (SEN). Furthermore, we built permuted random forest networks (PRFN), in which we used nodes to represent SNPs and edges to indicate interactions. CONCLUSIONS: We successfully developed a scale-invariant methodology to detect pure gene-gene interactions based on permutation strategies and the machine learning method random forest. This methodology showed great potential for detecting gene-gene interactions to study underlying genetic architectures in a scale-free way, which could help uncover complex disease mechanisms.

4.
BioData Min ; 9: 7, 2016.
Article in English | MEDLINE | ID: mdl-26839594

ABSTRACT

BACKGROUND: Machine learning methods and in particular random forests (RFs) are a promising alternative to standard single SNP analyses in genome-wide association studies (GWAS). RFs provide variable importance measures (VIMs) to rank SNPs according to their predictive power. However, in contrast to the established genome-wide significance threshold, no clear criteria exist to determine how many SNPs should be selected for downstream analyses. RESULTS: We propose a new variable selection approach, recurrent relative variable importance measure (r2VIM). Importance values are calculated relative to an observed minimal importance score for several runs of RF and only SNPs with large relative VIMs in all of the runs are selected as important. Evaluations on simulated GWAS data show that the new method controls the number of false-positives under the null hypothesis. Under a simple alternative hypothesis with several independent main effects it is only slightly less powerful than logistic regression. In an experimental GWAS data set, the same strong signal is identified while the approach selects none of the SNPs in an underpowered GWAS. CONCLUSIONS: The novel variable selection method r2VIM is a promising extension to standard RF for objectively selecting relevant SNPs in GWAS while controlling the number of false-positive results.
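
The selection rule described in this abstract might be sketched as follows. This is a hedged illustration, not the authors' implementation: the function name `select_r2vim`, the `factor` threshold of 3, and the reading of "relative" as division by the absolute value of the smallest observed score are all assumptions, and the importance scores are taken as given rather than computed by a forest.

```python
def select_r2vim(runs, factor=3.0):
    """Keep variables whose relative importance passes `factor` in every run.

    runs   -- list of dicts, one per random forest run, mapping a variable
              name to its importance score in that run (hypothetical input)
    factor -- assumed threshold on the relative importance
    """
    selected = None
    for scores in runs:
        # Assumption: the smallest observed score estimates the noise level.
        baseline = abs(min(scores.values()))
        if baseline == 0:
            raise ValueError("minimal importance is zero; cannot rescale")
        passing = {name for name, score in scores.items()
                   if score / baseline >= factor}
        # A variable must pass in all runs, so intersect across runs.
        selected = passing if selected is None else selected & passing
    return sorted(selected or [])

# Made-up importance scores for three SNPs over two runs.
runs = [
    {"snp1": 9.0, "snp2": 0.5, "snp3": -1.0},
    {"snp1": 8.0, "snp2": 4.0, "snp3": -2.0},
]
print(select_r2vim(runs))  # → ['snp1']
```

Requiring a variable to pass in every run is what makes the rule conservative: `snp2` scores well in the second run but is dropped because it fails the first.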

5.
Behav Res Methods ; 47(1): 235-50, 2015 Mar.
Article in English | MEDLINE | ID: mdl-24706080

ABSTRACT

The System for Continuous Observation of Rodents in Home-cage Environment (SCORHE) was developed to demonstrate the viability of compact and scalable designs for quantifying activity levels and behavior patterns for mice housed within a commercial ventilated cage rack. The SCORHE in-rack design provides day- and night-time monitoring with the consistency and convenience of the home-cage environment. The dual-video camera custom hardware design makes efficient use of space, does not require home-cage modification, and is animal-facility user-friendly. Given the system's low cost and suitability for use in existing vivariums without modification to the animal husbandry procedures or housing setup, SCORHE opens up the potential for the wider use of automated video monitoring in animal facilities. SCORHE's potential uses include day-to-day health monitoring, as well as advanced behavioral screening and ethology experiments, ranging from the assessment of the short- and long-term effects of experimental cancer treatments to the evaluation of mouse models. When used for phenotyping and animal model studies, SCORHE aims to eliminate the concerns often associated with many mouse-monitoring methods, such as circadian rhythm disruption, acclimation periods, lack of night-time measurements, and short monitoring periods. Custom software integrates two video streams to extract several mouse activity and behavior measures. Studies comparing the activity levels of ABCB5 knockout and HMGN1 overexpresser mice with their respective C57BL parental strains demonstrate SCORHE's efficacy in characterizing the activity profiles for singly- and doubly-housed mice. Another study was conducted to demonstrate the ability of SCORHE to detect a change in activity resulting from administering a sedative.


Subjects
Behavior, Animal/drug effects , Housing, Animal , Hypnotics and Sedatives/pharmacology , Video Recording/methods , Adaptation, Psychological , Animals , Circadian Rhythm , Computer-Aided Design , Mice , Mice, Inbred C57BL , Models, Animal
6.
Phys Lett A ; 378(35): 2611-2613, 2014 Jul 11.
Article in English | MEDLINE | ID: mdl-25197159

ABSTRACT

We show that a reduced form of the structural requirements for deterministic hidden variables used in Bell-Kochen-Specker theorems is already sufficient for the no-go results. Those requirements are captured by the following principle: an observable takes a spectral value x if and only if the spectral projector associated with x takes the value 1. We show that the "only if" part of this condition suffices. The proof identifies an important structural feature behind the no-go results; namely, if at least one projector is assigned the value 1 in any resolution of the identity, then at most one is.

7.
BioData Min ; 7: 13, 2014.
Article in English | MEDLINE | ID: mdl-25076985
8.
Genome Res ; 24(7): 1209-23, 2014 Jul.
Article in English | MEDLINE | ID: mdl-24985915

ABSTRACT

Accurate gene model annotation of reference genomes is critical for making them useful. The modENCODE project has improved the D. melanogaster genome annotation by using deep and diverse high-throughput data. Since transcriptional activity that has been evolutionarily conserved is likely to have an advantageous function, we have performed large-scale interspecific comparisons to increase confidence in predicted annotations. To support comparative genomics, we filled in divergence gaps in the Drosophila phylogeny by generating draft genomes for eight new species. For comparative transcriptome analysis, we generated mRNA expression profiles on 81 samples from multiple tissues and developmental stages of 15 Drosophila species, and we performed cap analysis of gene expression in D. melanogaster and D. pseudoobscura. We also describe conservation of four distinct core promoter structures composed of combinations of elements at three positions. Overall, each type of genomic feature shows a characteristic divergence rate relative to neutral models, highlighting the value of multispecies alignment in annotating a target genome that should prove useful in the annotation of other high priority genomes, especially human and other mammalian genomes that are rich in noncoding sequences. We report that the vast majority of elements in the annotation are evolutionarily conserved, indicating that the annotation will be an important springboard for functional genetic testing by the Drosophila community.


Subjects
Computational Biology/methods , Drosophila melanogaster/genetics , Gene Expression Profiling , Molecular Sequence Annotation , Transcriptome , Animals , Cluster Analysis , Drosophila melanogaster/classification , Evolution, Molecular , Exons , Female , Genome, Insect , Humans , Male , Nucleotide Motifs , Phylogeny , Position-Specific Scoring Matrices , Promoter Regions, Genetic , RNA Editing , RNA Splice Sites , RNA Splicing , Reproducibility of Results , Transcription Initiation Site
9.
BioData Min ; 7: 12, 2014.
Article in English | MEDLINE | ID: mdl-25057294
10.
J Child Adolesc Psychopharmacol ; 24(7): 366-73, 2014 Sep.
Article in English | MEDLINE | ID: mdl-25019955

ABSTRACT

OBJECTIVE: Among children <13 years of age with persistent psychosis and contemporaneous decline in functioning, it is often difficult to determine if the diagnosis of childhood onset schizophrenia (COS) is warranted. Despite decades of experience, we have up to a 44% false positive screening diagnosis rate among patients identified as having probable or possible COS; final diagnoses are made following inpatient hospitalization and medication washout. Because our lengthy medication-free observation is not feasible in clinical practice, we constructed diagnostic classifiers using screening data to assist clinicians practicing in the community or academic centers. METHODS: We used cross-validation, logistic regression, receiver operating characteristic (ROC) analysis, and random forest to determine the best algorithm for classifying COS (n=85) versus histories of psychosis and impaired functioning in children and adolescents who, at screening, were considered likely to have COS, but who did not meet diagnostic criteria for schizophrenia after medication washout and inpatient observation (n=53). We used demographics, clinical history measures, intelligence quotient (IQ) and screening rating scales, and number of typical and atypical antipsychotic medications as our predictors. RESULTS: Logistic regression models using nine, four, and two predictors performed well with positive predictive values>90%, overall accuracy>77%, and areas under the curve (AUCs)>86%. CONCLUSIONS: COS can be distinguished from alternate disorders with psychosis in children and adolescents; greater levels of positive and negative symptoms and lower levels of depression combine to make COS more likely. We include a worksheet so that clinicians in the community and academic centers can predict the probability that a young patient may be schizophrenic, using only two ratings.
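
The three performance measures reported in this abstract (positive predictive value, overall accuracy, and area under the ROC curve) can be computed as in the sketch below. The labels, scores, and the 0.5 cutoff are made-up illustrations, not data from the study, and the function names are hypothetical.

```python
def positive_predictive_value(labels, predictions):
    """Share of predicted positives that are true positives."""
    true_pos = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    called_pos = sum(predictions)
    return true_pos / called_pos

def overall_accuracy(labels, predictions):
    """Share of all cases classified correctly."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

def area_under_curve(labels, scores):
    """AUC as the probability that a random positive case outscores
    a random negative case (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: 1 = case, 0 = non-case; scores are predicted probabilities.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
predictions = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cutoff

print(positive_predictive_value(labels, predictions))  # → 0.5
print(overall_accuracy(labels, predictions))           # → 0.5
print(area_under_curve(labels, scores))                # → 0.75
```

Positive predictive value and accuracy depend on the chosen cutoff, whereas the AUC summarizes ranking quality across all cutoffs, which is why the three figures can differ for the same model.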


Subjects
Algorithms , Predictive Value of Tests , Psychotic Disorders/diagnosis , Schizophrenia, Childhood/diagnosis , Adolescent , Antipsychotic Agents/therapeutic use , Child , Female , Humans , Intelligence Tests , Logistic Models , Male , Psychiatric Status Rating Scales , Psychotic Disorders/drug therapy , ROC Curve , Schizophrenia, Childhood/drug therapy
11.
BioData Min ; 7(1): 2, 2014 Mar 01.
Article in English | MEDLINE | ID: mdl-24581306

ABSTRACT

BACKGROUND: Logistic regression has been the de facto, and often the only, model used in the description and analysis of relationships between a binary outcome and observed features. It is widely used to obtain the conditional probabilities of the outcome given predictors, as well as predictor effect size estimates using conditional odds ratios. RESULTS: We show how statistical learning machines for binary outcomes, provably consistent for the nonparametric regression problem, can be used to provide both consistent conditional probability estimation and conditional effect size estimates. Effect size estimates from learning machines leverage our understanding of counterfactual arguments central to the interpretation of such estimates. We show that, if the data generating model is logistic, we can recover accurate probability predictions and effect size estimates with nearly the same efficiency as a correct logistic model, both for main effects and interactions. We also propose a method using learning machines to scan for possible interaction effects quickly and efficiently. Simulations using random forest probability machines are presented. CONCLUSIONS: The models we propose make no assumptions about the data structure, and capture the patterns in the data by just specifying the predictors involved and not any particular model structure. So they do not run the same risks of model mis-specification and the resultant estimation biases as a logistic model. This methodology, which we call a "risk machine", will share properties from the statistical machine that it is derived from.
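
The counterfactual reading of an effect size mentioned in this abstract can be sketched roughly as follows: predict each subject's outcome probability with a feature set to one value, then to another, and average the difference. This is only an illustration under stated assumptions. The risk-difference scale, the function name, and the toy predictor are hypothetical and are not taken from the paper; any fitted probability model could be passed in as `predict`.

```python
def counterfactual_risk_difference(predict, rows, feature, high=1, low=0):
    """Average change in predicted probability when `feature` is switched
    from `low` to `high` for every subject, other features held fixed.

    predict -- callable mapping a dict of features to a probability
    rows    -- list of dicts, one per subject
    """
    differences = []
    for row in rows:
        exposed = dict(row, **{feature: high})    # copy with feature = high
        unexposed = dict(row, **{feature: low})   # copy with feature = low
        differences.append(predict(exposed) - predict(unexposed))
    return sum(differences) / len(differences)

# Toy predictor standing in for a fitted probability machine (hypothetical).
def toy_predict(row):
    return 0.1 + 0.2 * row["x"] + 0.05 * row["z"]

subjects = [{"x": 0, "z": 1}, {"x": 1, "z": 0}, {"x": 1, "z": 2}]
effect = counterfactual_risk_difference(toy_predict, subjects, "x")
print(round(effect, 6))  # → 0.2
```

Because the toy predictor is additive, the estimate recovers its coefficient for `x`; with a model that contains interactions, the per-subject differences would vary and the average would summarize them.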

12.
Genet Epidemiol ; 38(3): 209-19, 2014 Apr.
Article in English | MEDLINE | ID: mdl-24535726

ABSTRACT

As the cost of genome-wide genotyping decreases, the number of genome-wide association studies (GWAS) has increased considerably. However, the transition from GWAS findings to the underlying biology of various phenotypes remains challenging. As a result, due to its system-level interpretability, pathway analysis has become a popular tool for gaining insights on the underlying biology from high-throughput genetic association data. In pathway analyses, gene sets representing particular biological processes are tested for significant associations with a given phenotype. Most existing pathway analysis approaches rely on single-marker statistics and assume that pathways are independent of each other. As biological systems are driven by complex biomolecular interactions, embracing the complex relationships between single-nucleotide polymorphisms (SNPs) and pathways needs to be addressed. To incorporate the complexity of gene-gene interactions and pathway-pathway relationships, we propose a system-level pathway analysis approach, synthetic feature random forest (SF-RF), which is designed to detect pathway-phenotype associations without making assumptions about the relationships among SNPs or pathways. In our approach, the genotypes of SNPs in a particular pathway are aggregated into a synthetic feature representing that pathway via Random Forest (RF). Multiple synthetic features are analyzed using RF simultaneously and the significance of a synthetic feature indicates the significance of the corresponding pathway. We further complement SF-RF with pathway-based Statistical Epistasis Network (SEN) analysis that evaluates interactions among pathways. By investigating the pathway SEN, we hope to gain additional insights into the genetic mechanisms contributing to the pathway-phenotype association. We apply SF-RF to a population-based genetic study of bladder cancer and further investigate the mechanisms that help explain the pathway-phenotype associations using SEN. 
The bladder cancer associated pathways we found are both consistent with existing biological knowledge and reveal novel and plausible hypotheses for future biological validations.


Subjects
Epistasis, Genetic/genetics , Models, Genetic , Phenotype , Genome-Wide Association Study , Genotype , Humans , Logistic Models , Polymorphism, Single Nucleotide/genetics , Reproducibility of Results , Urinary Bladder Neoplasms/genetics
13.
Biom J ; 56(4): 534-63, 2014 Jul.
Article in English | MEDLINE | ID: mdl-24478134

ABSTRACT

Probability estimation for binary and multicategory outcome using logistic and multinomial logistic regression has a long-standing tradition in biostatistics. However, biases may occur if the model is misspecified. In contrast, outcome probabilities for individuals can be estimated consistently with machine learning approaches, including k-nearest neighbors (k-NN), bagged nearest neighbors (b-NN), random forests (RF), and support vector machines (SVM). Because machine learning methods are rarely used by applied biostatisticians, the primary goal of this paper is to explain the concept of probability estimation with these methods and to summarize recent theoretical findings. Probability estimation in k-NN, b-NN, and RF can be embedded into the class of nonparametric regression learning machines; therefore, we start with the construction of nonparametric regression estimates and review results on consistency and rates of convergence. In SVMs, outcome probabilities for individuals are estimated consistently by repeatedly solving classification problems. For SVMs we review the classification problem and then dichotomous probability estimation. Next we extend the algorithms for estimating probabilities using k-NN, b-NN, and RF to multicategory outcomes and discuss approaches for the multicategory probability estimation problem using SVM. In simulation studies for dichotomous and multicategory dependent variables we demonstrate the general validity of the machine learning methods and compare them with logistic regression. However, each method fails in at least one simulation scenario. We conclude with a discussion of the failures and give recommendations for selecting and tuning the methods. Applications to real data and example code are provided in a companion article (doi:10.1002/bimj.201300077).


Subjects
Artificial Intelligence , Probability , Models, Theoretical , Regression Analysis , Statistics, Nonparametric , Support Vector Machine
14.
BioData Min ; 7(1): 28, 2014.
Article in English | MEDLINE | ID: mdl-25614764

ABSTRACT

BACKGROUND: Using a collection of different terminal nodesize constructed random forests, each generating a synthetic feature, a synthetic random forest is defined as a kind of hyperforest, calculated using the new input synthetic features, along with the original features. RESULTS: Using a large collection of regression and multiclass datasets we show that synthetic random forests outperform both conventional random forests and the optimized forest from the regression portfolio. CONCLUSIONS: Synthetic forests remove the need for tuning random forests with no additional effort on the part of the researcher. Importantly, the synthetic forest does this with evidently no loss in prediction compared to a well-optimized single random forest.

15.
17.
BioData Min ; 6(1): 10, 2013 May 11.
Article in English | MEDLINE | ID: mdl-23663551
18.
Medicine (Baltimore) ; 92(1): 25-41, 2013 Jan.
Article in English | MEDLINE | ID: mdl-23263716

ABSTRACT

The juvenile idiopathic inflammatory myopathies (JIIM) are systemic autoimmune diseases characterized by skeletal muscle weakness, characteristic rashes, and other systemic features. Although juvenile dermatomyositis (JDM), the most common form of JIIM, has been well studied, the other major clinical subgroups of JIIM, including juvenile polymyositis (JPM) and juvenile myositis overlapping with another autoimmune or connective tissue disease (JCTM), have not been well characterized, and their similarity to the adult clinical subgroups is unknown. We enrolled 436 patients with JIIM, including 354 classified as JDM, 33 as JPM, and 49 as JCTM, in a nationwide registry study. The aim of the study was to compare demographics; clinical features; laboratory measures, including myositis autoantibodies; and outcomes among these clinical subgroups, as well as with published data on adult patients with idiopathic inflammatory myopathies (IIM) enrolled in a separate natural history study. We used random forest classification and logistic regression modeling to compare clinical subgroups, following univariate analysis. JDM was characterized by typical rashes, including Gottron papules, heliotrope rash, malar rash, periungual capillary changes, and other photosensitive and vasculopathic skin rashes. JPM was characterized by more severe weakness, higher creatine kinase levels, falling episodes, and more frequent cardiac disease. JCTM had more frequent interstitial lung disease, Raynaud phenomenon, arthralgia, and malar rash. Differences in autoantibody frequency were also evident, with anti-p155/140, anti-MJ, and anti-Mi-2 seen more frequently in patients with JDM, anti-signal recognition particle and anti-Jo-1 in JPM, and anti-U1-RNP, PM-Scl, and other myositis-associated autoantibodies more commonly present in JCTM. Mortality was highest in patients with JCTM, whereas hospitalizations and wheelchair use were highest in JPM patients. 
Several demographic and clinical features were shared between juvenile and adult IIM subgroups. However, JDM and JPM patients had a lower frequency of interstitial lung disease, Raynaud phenomenon, "mechanic's hands" and carpal tunnel syndrome, and lower mortality than their adult counterparts. We conclude that juvenile myositis is a heterogeneous group of illnesses with distinct clinical subgroups, defined by varying clinical and demographic characteristics, laboratory features, and outcomes.


Subjects
Myositis/classification , Age of Onset , Autoantibodies/analysis , Chi-Square Distribution , Child , Female , Humans , Logistic Models , Male , Myositis/enzymology , Myositis/epidemiology , Myositis/immunology , Phenotype , Prognosis , Registries , Statistics, Nonparametric
19.
Front Psychiatry ; 3: 53, 2012.
Article in English | MEDLINE | ID: mdl-22675310

ABSTRACT

INTRODUCTION: Multivariate machine learning methods can be used to classify groups of schizophrenia patients and controls using structural magnetic resonance imaging (MRI). However, machine learning methods to date have not been extended beyond classification and contemporaneously applied in a meaningful way to clinical measures. We hypothesized that brain measures would classify groups, and that increased likelihood of being classified as a patient using regional brain measures would be positively related to illness severity, developmental delays, and genetic risk. METHODS: Using 74 anatomic brain MRI sub regions and Random Forest (RF), a machine learning method, we classified 98 childhood onset schizophrenia (COS) patients and 99 age, sex, and ethnicity-matched healthy controls. We also used RF to estimate the probability of being classified as a schizophrenia patient based on MRI measures. We then explored relationships between brain-based probability of illness and symptoms, premorbid development, and presence of copy number variation (CNV) associated with schizophrenia. RESULTS: Brain regions jointly classified COS and control groups with 73.7% accuracy. Greater brain-based probability of illness was associated with worse functioning (p = 0.0004) and fewer developmental delays (p = 0.02). Presence of CNV was associated with lower probability of being classified as schizophrenia (p = 0.001). The regions that were most important in classifying groups included left temporal lobes, bilateral dorsolateral prefrontal regions, and left medial parietal lobes. CONCLUSION: Schizophrenia and control groups can be well classified using RF and anatomic brain measures, and brain-based probability of illness has a positive relationship with illness severity and a negative relationship with developmental delays/problems and CNV-based risk.

20.
Genet Epidemiol ; 35 Suppl 1: S5-11, 2011.
Article in English | MEDLINE | ID: mdl-22128059

ABSTRACT

Genetic Analysis Workshop 17 provided common and rare genetic variants from exome sequencing data and simulated binary and quantitative traits in 200 replicates. We provide a brief review of the machine learning and regression-based methods used in the analyses of these data. Several regression and machine learning methods were used to address different problems inherent in the analyses of these data, which are high-dimension, low-sample-size data typical of many genetic association studies. Unsupervised methods, such as cluster analysis, were used for data segmentation and subset selection. Supervised learning methods, which include regression-based methods (e.g., generalized linear models, logic regression, and regularized regression) and tree-based methods (e.g., decision trees and random forests), were used for variable selection (selecting genetic and clinical features most associated or predictive of outcome) and prediction (developing models using common and rare genetic variants to accurately predict outcome), with the outcome being case-control status or quantitative trait value. We include a discussion of cross-validation for model selection and assessment, and a description of available software resources for these methods.


Subjects
Molecular Epidemiology/methods , Regression Analysis , Algorithms , Artificial Intelligence , Cluster Analysis , Congresses as Topic , Decision Trees , Genetics , Humans