1.
BMC Bioinformatics ; 23(1): 502, 2022 Nov 23.
Article in English | MEDLINE | ID: mdl-36424541

ABSTRACT

As genotype databases increase in size, so too does the number of detectable segments of identity by descent (IBD): segments of the genome where two individuals share an identical copy of one of their two parental haplotypes, due to shared ancestry. We show that given a large enough genotype database, these segments of IBD collectively overlap entire chromosomes, including instances of IBD that span multiple chromosomes, and can be used to accurately separate the alleles inherited from each parent across the entire genome. The resulting phase is not an improvement over state-of-the-art local phasing methods, but provides accurate long-range phasing that indicates which of two haplotypes in different regions of the genome, including different chromosomes, was inherited from the same parent. We are able to separate the DNA inherited from each parent completely, across the entire genome, with 98% median accuracy in a test set of 30,000 individuals. We estimate the IBD data requirements for accurate genome-wide phasing, and we propose a method for estimating confidence in the resulting phase. We show that our methods do not require the genotypes of close family, and that they are robust to genotype errors and missing data. In fact, our method can impute missing data accurately and correct genotype errors.


Subjects
Genotype ; Humans ; Haplotypes ; Alleles ; Databases, Factual
2.
BMC Bioinformatics ; 22(1): 459, 2021 Sep 25.
Article in English | MEDLINE | ID: mdl-34563119

ABSTRACT

BACKGROUND: We present ARCHes, a fast and accurate haplotype-based approach for inferring an individual's ancestry composition. Our approach works by modeling haplotype diversity from a large, admixed cohort of hundreds of thousands, then annotating those models with population information from reference panels of known ancestry. RESULTS: The running time of ARCHes does not depend on the size of a reference panel because training and testing are separate processes, and the inferred population-annotated haplotype models can be written to disk and reused to label large test sets in parallel (in our experiments, it averages less than one minute to assign ancestry from 32 populations using 10 CPU). We test ARCHes on public data from the 1000 Genomes Project and the Human Genome Diversity Project (HGDP) as well as simulated examples of known admixture. CONCLUSIONS: Our results demonstrate that ARCHes outperforms RFMix at correctly assigning both global and local ancestry at finer population scales regardless of the amount of population admixture.


Subjects
Genetics, Population ; Genome, Human ; Haplotypes ; Humans ; Polymorphism, Single Nucleotide
3.
PLoS Comput Biol ; 10(5): e1003578, 2014 May.
Article in English | MEDLINE | ID: mdl-24874013

ABSTRACT

Identifying molecular connections between developmental processes and disease can lead to new hypotheses about health risks at all stages of life. Here we introduce a new approach to identifying significant connections between gene sets and disease genes, and apply it to several gene sets related to human development. To overcome the limits of incomplete and imperfect information linking genes to disease, we pool genes within disease subtrees in the MeSH taxonomy, and we demonstrate that such pooling improves the power and accuracy of our approach. Significance is assessed through permutation. We created a web-based visualization tool to facilitate multi-scale exploration of this large collection of significant connections (http://gda.cs.tufts.edu/development). High-level analysis of the results reveals expected connections between tissue-specific developmental processes and diseases linked to those tissues, and widespread connections to developmental disorders and cancers. Yet interesting new hypotheses may be derived from examining the unexpected connections. We highlight and discuss the implications of three such connections, linking dementia with bone development, polycystic ovary syndrome with cardiovascular development, and retinopathy of prematurity with lung development. Our results provide additional evidence that TGFB plays a key role in the early pathogenesis of polycystic ovary syndrome. Our evidence also suggests that the VEGF pathway and downstream NFKB signaling may explain the complex relationship between bronchopulmonary dysplasia and retinopathy of prematurity, and may form a bridge between two currently-competing hypotheses about the molecular origins of bronchopulmonary dysplasia. Further data exploration and similar queries about other gene sets may generate a variety of new information about the molecular relationships between additional diseases.
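
The abstract above says only that significance is assessed through permutation and does not give the test statistic. As a rough illustration, the sketch below assumes the statistic is the overlap count between a gene set and a pooled disease gene set, and builds the null distribution by drawing random gene sets of the same size from a background. The function name, parameters, and gene identifiers are hypothetical and are not taken from the paper.

```python
import random


def overlap_p_value(gene_set, disease_genes, background, n_perm=2000, seed=0):
    """Permutation p-value for the overlap between two gene sets.

    Assumption: the test statistic is the raw overlap count, and the null
    is formed by sampling random gene sets of the same size from
    `background`. The paper may use a different statistic or null.
    """
    rng = random.Random(seed)
    observed = len(gene_set & disease_genes)
    pool = sorted(background)  # fixed order so the seed gives repeatable draws
    hits = 0
    for _ in range(n_perm):
        sample = set(rng.sample(pool, len(gene_set)))
        if len(sample & disease_genes) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical data: 1000 background genes, 20 disease genes,
# and a 20-gene developmental set that shares 10 genes with the disease set.
background = {f"g{i}" for i in range(1000)}
disease_genes = {f"g{i}" for i in range(20)}
gene_set = {f"g{i}" for i in range(10)} | {f"g{i}" for i in range(500, 510)}

observed, p_value = overlap_p_value(gene_set, disease_genes, background)
print(observed, p_value)
```

Pooling genes across a MeSH disease subtree, as the abstract describes, would correspond here to taking the union of the gene sets of all diseases in the subtree before calling the function.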


Subjects
Chromosome Mapping/methods ; Gene Expression Regulation, Developmental/genetics ; Genetic Predisposition to Disease/genetics ; Genome-Wide Association Study/methods ; Models, Genetic ; Proteome/genetics ; Animals ; Computer Simulation ; Genetic Markers/genetics ; Humans
4.
BMC Bioinformatics ; 12 Suppl 8: S3, 2011 Oct 03.
Article in English | MEDLINE | ID: mdl-22151929

ABSTRACT

BACKGROUND: Determining usefulness of biomedical text mining systems requires realistic task definition and data selection criteria without artificial constraints, measuring performance aspects that go beyond traditional metrics. The BioCreative III Protein-Protein Interaction (PPI) tasks were motivated by such considerations, trying to address aspects including how the end user would oversee the generated output, for instance by providing ranked results, textual evidence for human interpretation or measuring time savings by using automated systems. Detecting articles describing complex biological events like PPIs was addressed in the Article Classification Task (ACT), where participants were asked to implement tools for detecting PPI-describing abstracts. Therefore the BCIII-ACT corpus was provided, which includes a training, development and test set of over 12,000 PPI relevant and non-relevant PubMed abstracts labeled manually by domain experts and recording also the human classification times. The Interaction Method Task (IMT) went beyond abstracts and required mining for associations between more than 3,500 full text articles and interaction detection method ontology concepts that had been applied to detect the PPIs reported in them. RESULTS: A total of 11 teams participated in at least one of the two PPI tasks (10 in ACT and 8 in the IMT) and a total of 62 persons were involved either as participants or in preparing data sets/evaluating these tasks. Per task, each team was allowed to submit five runs offline and another five online via the BioCreative Meta-Server. From the 52 runs submitted for the ACT, the highest Matthews correlation coefficient (MCC) score measured was 0.55 at an accuracy of 89% and the best AUC iP/R was 68%. Most ACT teams explored machine learning methods, some of them also used lexical resources like MeSH terms, PSI-MI concepts or particular lists of verbs and nouns, some integrated NER approaches. For the IMT, a total of 42 runs were evaluated by comparing systems against manually generated annotations done by curators from the BioGRID and MINT databases. The highest AUC iP/R achieved by any run was 53%, the best MCC score 0.55. In case of competitive systems with an acceptable recall (above 35%) the macro-averaged precision ranged between 50% and 80%, with a maximum F-Score of 55%. CONCLUSIONS: The results of the ACT task of BioCreative III indicate that classification of large unbalanced article collections reflecting the real class imbalance is still challenging. Nevertheless, text-mining tools that report ranked lists of relevant articles for manual selection can potentially reduce the time needed to identify half of the relevant articles to less than 1/4 of the time when compared to unranked results. Detecting associations between full text articles and interaction detection method PSI-MI terms (IMT) is more difficult than might be anticipated. This is due to the variability of method term mentions, errors resulting from pre-processing of articles provided as PDF files, and the heterogeneity and different granularity of method term concepts encountered in the ontology. However, combining the sophisticated techniques developed by the participants with supporting evidence strings derived from the articles for human interpretation could result in practical modules for biological annotation workflows.
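
The abstract above reports an MCC of 0.55 alongside an accuracy of 89%, which shows how far the two metrics can diverge on an unbalanced collection. The sketch below computes both from a confusion matrix using the standard MCC definition. The counts are invented for illustration and are not the BioCreative III results.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when any marginal total is zero
    return (tp * tn - fp * fn) / denom


def accuracy(tp, fp, tn, fn):
    """Fraction of all items classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical unbalanced test set: 100 relevant and 900 non-relevant abstracts.
tp, fp, tn, fn = 50, 50, 850, 50
print(accuracy(tp, fp, tn, fn))  # 0.9
print(round(mcc(tp, fp, tn, fn), 3))  # 0.444
```

In this made-up case the classifier finds only half of the relevant abstracts, yet accuracy stays at 90% because the non-relevant class dominates. MCC, which uses all four cells of the confusion matrix, drops to about 0.44.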


Subjects
Algorithms ; Data Mining ; Proteins/metabolism ; Animals ; Databases, Protein ; Humans ; Periodicals as Topic ; PubMed
5.
Nucleic Acids Res ; 37(Database issue): D274-8, 2009 Jan.
Article in English | MEDLINE | ID: mdl-19022853

ABSTRACT

The Transporter Classification Database (TCDB), freely accessible at http://www.tcdb.org, is a relational database containing sequence, structural, functional and evolutionary information about transport systems from a variety of living organisms, based on the International Union of Biochemistry and Molecular Biology-approved transporter classification (TC) system. It is a curated repository for factual information compiled largely from published references. It uses a functional/phylogenetic system of classification, and currently encompasses about 5000 representative transporters and putative transporters in more than 500 families. We here describe novel software designed to support and extend the usefulness of TCDB. Our recent efforts render it more user friendly, incorporate machine learning to input novel data in a semiautomatic fashion, and allow analyses that are more accurate and less time consuming. The availability of these tools has resulted in recognition of distant phylogenetic relationships and tremendous expansion of the information available to TCDB users.


Subjects
Databases, Protein ; Membrane Transport Proteins/classification ; Artificial Intelligence ; Membrane Transport Proteins/chemistry ; Membrane Transport Proteins/genetics ; Phylogeny ; Sequence Homology, Amino Acid
6.
G3 (Bethesda) ; 9(9): 2863-2878, 2019 09 04.
Article in English | MEDLINE | ID: mdl-31484785

ABSTRACT

We present a massive investigation into the genetic basis of human lifespan. Beginning with a genome-wide association (GWA) study using a de-identified snapshot of the unique AncestryDNA database - more than 300,000 genotyped individuals linked to pedigrees of over 400,000,000 people - we mapped six genome-wide significant loci associated with parental lifespan. We compared these results to a GWA analysis of the traditional lifespan proxy trait, age, and found only one locus, APOE, to be associated with both age and lifespan. By combining the AncestryDNA results with those of an independent UK Biobank dataset, we conducted a meta-analysis of more than 650,000 individuals and identified fifteen parental lifespan-associated loci. Beyond just those significant loci, our genome-wide set of polymorphisms accounts for up to 8% of the variance in human lifespan; this value represents a large fraction of the heritability estimated from phenotypic correlations between relatives.


Subjects
Genome-Wide Association Study/methods ; Longevity/genetics ; Aged ; Aged, 80 and over ; Apolipoproteins E/genetics ; Carrier Proteins/genetics ; Databases, Genetic ; Female ; Humans ; Male ; Nuclear Proteins/genetics ; Pedigree ; Polymorphism, Single Nucleotide ; Prospective Studies ; Proto-Oncogene Proteins/genetics
7.
Bioinformatics ; 23(2): e156-62, 2007 Jan 15.
Article in English | MEDLINE | ID: mdl-17237085

ABSTRACT

MOTIVATION: The process of transcription is controlled by systems of factors which bind in specific arrangements, called cis-regulatory modules (CRMs), in promoter regions. We present a discriminative learning algorithm which simultaneously learns the DNA binding site motifs as well as the logical structure and spatial aspects of CRMs. RESULTS: Our results on yeast datasets show better predictive accuracy than a current state-of-the-art approach on the same datasets. Our results on yeast, fly and human datasets show that the inclusion of logical and spatial aspects improves the predictive accuracy of our learned models. AVAILABILITY: Source code is available at http://www.cs.wisc.edu/~noto/crm


Subjects
Algorithms ; Chromosome Mapping/methods ; Genome, Fungal/genetics ; Models, Genetic ; Regulatory Elements, Transcriptional/genetics ; Sequence Analysis, DNA/methods ; Transcription Factors/genetics ; Artificial Intelligence ; Base Sequence ; Computer Simulation ; Models, Statistical ; Molecular Sequence Data ; Pattern Recognition, Automated/methods ; Reproducibility of Results ; Sensitivity and Specificity ; Sequence Alignment/methods
8.
Genetics ; 210(3): 1109-1124, 2018 11.
Article in English | MEDLINE | ID: mdl-30401766

ABSTRACT

Human life span is a phenotype that integrates many aspects of health and environment into a single ultimate quantity: the elapsed time between birth and death. Though it is widely believed that long life runs in families for genetic reasons, estimates of life span "heritability" are consistently low (∼15-30%). Here, we used pedigree data from Ancestry public trees, including hundreds of millions of historical persons, to estimate the heritability of human longevity. Although "nominal heritability" estimates based on correlations among genetic relatives agreed with prior literature, the majority of that correlation was also captured by correlations among nongenetic (in-law) relatives, suggestive of highly assortative mating around life span-influencing factors (genetic and/or environmental). We used structural equation modeling to account for assortative mating, and concluded that the true heritability of human longevity for birth cohorts across the 1800s and early 1900s was well below 10%, and that it has been generally overestimated due to the effect of assortative mating.


Subjects
Longevity/genetics ; Reproduction ; Female ; Humans ; Male ; Models, Genetic ; Pedigree
9.
Nat Commun ; 8: 14238, 2017 02 07.
Article in English | MEDLINE | ID: mdl-28169989

ABSTRACT

Despite strides in characterizing human history from genetic polymorphism data, progress in identifying genetic signatures of recent demography has been limited. Here we identify very recent fine-scale population structure in North America from a network of over 500 million genetic (identity-by-descent, IBD) connections among 770,000 genotyped individuals of US origin. We detect densely connected clusters within the network and annotate these clusters using a database of over 20 million genealogical records. Recent population patterns captured by IBD clustering include immigrants such as Scandinavians and French Canadians; groups with continental admixture such as Puerto Ricans; settlers such as the Amish and Appalachians who experienced geographic or cultural isolation; and broad historical trends, including reduced north-south gene flow. Our results yield a detailed historical portrait of North America after European settlement and support substantial genetic heterogeneity in the United States beyond that uncovered by previous studies.


Subjects
Demography/statistics & numerical data ; Genetics, Population/methods ; Population Dynamics/trends ; Population/genetics ; Cluster Analysis ; Demography/methods ; Emigrants and Immigrants ; Gene Flow/genetics ; Genotyping Techniques ; Haplotypes/genetics ; Humans ; Polymorphism, Single Nucleotide ; Population Dynamics/statistics & numerical data ; Sequence Analysis, DNA ; United States/ethnology
10.
BMC Bioinformatics ; 7: 528, 2006 Dec 05.
Article in English | MEDLINE | ID: mdl-17147812

ABSTRACT

BACKGROUND: The process of transcription is controlled by systems of transcription factors, which bind to specific patterns of binding sites in the transcriptional control regions of genes, called cis-regulatory modules (CRMs). We present an expressive and easily comprehensible CRM representation which is capable of capturing several aspects of a CRM's structure and distinguishing between DNA sequences which do or do not contain it. We also present a learning algorithm tailored for this domain, and a novel method to avoid overfitting by controlling the expressivity of the model. RESULTS: We are able to find statistically significant CRMs more often than a current state-of-the-art approach on the same data sets. We also show experimentally that each aspect of our expressive CRM model space makes a positive contribution to the learned models on yeast and fly data. CONCLUSION: Structural aspects are an important part of CRMs, both in terms of interpreting them biologically and learning them accurately. Source code for our algorithm is available at: http://www.cs.wisc.edu/~noto/crm.


Subjects
Artificial Intelligence ; Pattern Recognition, Automated/methods ; Regulatory Elements, Transcriptional/genetics ; Regulatory Sequences, Nucleic Acid/genetics ; Sequence Analysis, DNA/methods ; Transcription Factors/genetics ; Transcription, Genetic/genetics ; Algorithms ; Base Sequence ; Molecular Sequence Data
11.
J Comput Biol ; 22(5): 402-13, 2015 May.
Article in English | MEDLINE | ID: mdl-25651392

ABSTRACT

Methods for translating gene expression signatures into clinically relevant information have typically relied upon having many samples from patients with similar molecular phenotypes. Here, we address the question of what can be done when it is relatively easy to obtain healthy patient samples, but when abnormalities corresponding to disease states may be rare and one-of-a-kind. The associated computational challenge, anomaly detection, is a well-studied machine-learning problem. However, due to the dimensionality and variability of expression data, existing methods based on feature space analysis or individual anomalously expressed genes are insufficient. We present a novel approach, CSAX, that identifies pathways in an individual sample in which the normal expression relationships are disrupted. To evaluate our approach, we have compiled and released a compendium of public expression data sets, reformulated to create a test bed for anomaly detection. We demonstrate the accuracy of CSAX on the data sets in our compendium, compare it to other leading methods, and show that CSAX aids in both identifying anomalies and explaining their underlying biology. We describe an approach to characterizing the difficulty of specific expression anomaly detection tasks. We then illustrate CSAX's value in two developmental case studies. Confirming prior hypotheses, CSAX highlights disruption of platelet activation pathways in a neonate with retinopathy of prematurity and identifies, for the first time, dysregulated oxidative stress response in second trimester amniotic fluid of fetuses with obese mothers. Our approach provides an important step toward identification of individual disease patterns in the era of precision medicine.


Subjects
Algorithms ; Obesity/genetics ; Retinopathy of Prematurity/genetics ; Software ; Transcriptome ; Adult ; Amniotic Fluid/chemistry ; Blood Platelets/metabolism ; Blood Platelets/pathology ; Databases, Genetic ; Datasets as Topic ; Female ; Fetus ; Gene Expression Profiling ; Humans ; Infant, Newborn ; Obesity/diagnosis ; Obesity/pathology ; Oxidative Stress ; Phenotype ; Platelet Activation/genetics ; Pregnancy ; Pregnancy Trimester, Second ; Retinopathy of Prematurity/diagnosis ; Retinopathy of Prematurity/pathology
12.
Data Min Knowl Discov ; 25(1): 109-133, 2012.
Article in English | MEDLINE | ID: mdl-22639542

ABSTRACT

Anomaly detection involves identifying rare data instances (anomalies) that come from a different class or distribution than the majority (which are simply called "normal" instances). Given a training set of only normal data, the semi-supervised anomaly detection task is to identify anomalies in the future. Good solutions to this task have applications in fraud and intrusion detection. The unsupervised anomaly detection task is different: Given unlabeled, mostly-normal data, identify the anomalies among them. Many real-world machine learning tasks, including many fraud and intrusion detection tasks, are unsupervised because it is impractical (or impossible) to verify all of the training data. We recently presented FRaC, a new approach for semi-supervised anomaly detection. FRaC is based on using normal instances to build an ensemble of feature models, and then identifying instances that disagree with those models as anomalous. In this paper, we investigate the behavior of FRaC experimentally and explain why FRaC is so successful. We also show that FRaC is a superior approach for the unsupervised as well as the semi-supervised anomaly detection task, compared to well-known state-of-the-art anomaly detection methods, LOF and one-class support vector machines, and to an existing feature-modeling approach.

13.
Article in English | MEDLINE | ID: mdl-21393656

ABSTRACT

With well over 1,000 specialized biological databases in use today, the task of automatically identifying novel, relevant data for such databases is increasingly important. In this paper, we describe practical machine learning approaches for identifying MEDLINE documents and Swiss-Prot/TrEMBL protein records, for incorporation into a specialized biological database of transport proteins named TCDB. We show that both learning approaches outperform rules created by hand by a human expert. As one of the first case studies involving two different approaches to updating a deployed database, both the methods compared and the results will be of interest to curators of many specialized databases.


Subjects
Algorithms ; Artificial Intelligence ; Data Mining/methods ; Databases, Genetic ; Genomics/methods ; Carrier Proteins ; Cluster Analysis ; Humans ; MEDLINE ; Proteins/classification ; Proteins/genetics
14.
Proc IEEE Int Conf Data Min ; : 953-958, 2010 Dec 13.
Article in English | MEDLINE | ID: mdl-22020249

ABSTRACT

We present a new approach to semi-supervised anomaly detection. Given a set of training examples believed to come from the same distribution or class, the task is to learn a model that will be able to distinguish examples in the future that do not belong to the same class. Traditional approaches typically compare the position of a new data point to the set of "normal" training data points in a chosen representation of the feature space. For some data sets, the normal data may not have discernible positions in feature space, but do have consistent relationships among some features that fail to appear in the anomalous examples. Our approach learns to predict the values of training set features from the values of other features. After we have formed an ensemble of predictors, we apply this ensemble to new data points. To combine the contribution of each predictor in our ensemble, we have developed a novel, information-theoretic anomaly measure that our experimental results show selects against noisy and irrelevant features. Our results on 47 data sets show that for most data sets, this approach significantly improves performance over current state-of-the-art feature space distance and density-based approaches.
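
The idea described above, predicting each feature from the others and flagging instances that break the learned relationships, can be sketched in a few lines. This is a heavily simplified illustration, not the published method. It assumes numeric features, uses one single-predictor least-squares line per ordered feature pair as the ensemble, and scores with plain Gaussian surprisal of the residual as a stand-in for the information-theoretic measure, whose formula the abstract does not give. All names and data are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept, plus residual std dev."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = my - slope * mx
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    sigma = max(math.sqrt(sum(r * r for r in residuals) / n), 1e-6)
    return slope, intercept, sigma


def train_ensemble(rows):
    """Fit one predictor for every ordered (source, target) feature pair."""
    n_features = len(rows[0])
    models = {}
    for target in range(n_features):
        for source in range(n_features):
            if source != target:
                xs = [row[source] for row in rows]
                ys = [row[target] for row in rows]
                models[(source, target)] = fit_line(xs, ys)
    return models


def surprisal(residual, sigma):
    """Negative log density of the residual under a zero-mean Gaussian."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + residual ** 2 / (2 * sigma ** 2)


def anomaly_score(models, row):
    """Sum of surprisals over the ensemble; higher means more anomalous."""
    total = 0.0
    for (source, target), (slope, intercept, sigma) in models.items():
        predicted = slope * row[source] + intercept
        total += surprisal(row[target] - predicted, sigma)
    return total


# Hypothetical "normal" data: the second feature is about twice the first.
train = [(float(x), 2.0 * x + (0.1 if x % 2 == 0 else -0.1)) for x in range(10)]
models = train_ensemble(train)

normal_point = (4.5, 9.0)   # respects the relationship
odd_point = (2.0, 16.0)     # each value is in range, but the relationship is broken
print(anomaly_score(models, normal_point), anomaly_score(models, odd_point))
```

The second test point lies inside the range of every individual feature, so its position alone would not stand out. It scores far higher only because the relationship between the two features fails, which is the situation the abstract singles out.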

15.
Uncertain Artif Intell ; 2008: 444-451, 2008 Jul 09.
Article in English | MEDLINE | ID: mdl-21785575

ABSTRACT

We consider the task of learning mappings from sequential data to real-valued responses. We present and evaluate an approach to learning a type of hidden Markov model (HMM) for regression. The learning process involves inferring the structure and parameters of a conventional HMM, while simultaneously learning a regression model that maps features that characterize paths through the model to continuous responses. Our results, in both synthetic and biological domains, demonstrate the value of jointly learning the two components of our approach.
