Results 1 - 15 of 15
1.
BMC Bioinformatics ; 23(1): 502, 2022 Nov 23.
Article in English | MEDLINE | ID: mdl-36424541

ABSTRACT

As genotype databases increase in size, so too does the number of detectable segments of identity by descent (IBD): segments of the genome where two individuals share an identical copy of one of their two parental haplotypes, due to shared ancestry. We show that given a large enough genotype database, these segments of IBD collectively overlap entire chromosomes, including instances of IBD that span multiple chromosomes, and can be used to accurately separate the alleles inherited from each parent across the entire genome. The resulting phase is not an improvement over state-of-the-art local phasing methods, but provides accurate long-range phasing that indicates which of two haplotypes in different regions of the genome, including different chromosomes, was inherited from the same parent. We are able to separate the DNA inherited from each parent completely, across the entire genome, with 98% median accuracy in a test set of 30,000 individuals. We estimate the IBD data requirements for accurate genome-wide phasing, and we propose a method for estimating confidence in the resulting phase. We show that our methods do not require the genotypes of close family, and that they are robust to genotype errors and missing data. In fact, our method can impute missing data accurately and correct genotype errors.


Subject(s)
Genotype ; Humans ; Haplotypes ; Alleles ; Databases, Factual
2.
BMC Bioinformatics ; 22(1): 459, 2021 Sep 25.
Article in English | MEDLINE | ID: mdl-34563119

ABSTRACT

BACKGROUND: We present ARCHes, a fast and accurate haplotype-based approach for inferring an individual's ancestry composition. Our approach works by modeling haplotype diversity from a large, admixed cohort of hundreds of thousands, then annotating those models with population information from reference panels of known ancestry. RESULTS: The running time of ARCHes does not depend on the size of a reference panel because training and testing are separate processes, and the inferred population-annotated haplotype models can be written to disk and reused to label large test sets in parallel (in our experiments, it averages less than one minute to assign ancestry from 32 populations using 10 CPUs). We test ARCHes on public data from the 1000 Genomes Project and the Human Genome Diversity Project (HGDP) as well as simulated examples of known admixture. CONCLUSIONS: Our results demonstrate that ARCHes outperforms RFMix at correctly assigning both global and local ancestry at finer population scales regardless of the amount of population admixture.


Subject(s)
Genetics, Population ; Genome, Human ; Haplotypes ; Humans ; Polymorphism, Single Nucleotide
3.
G3 (Bethesda) ; 9(9): 2863-2878, 2019 Sep 04.
Article in English | MEDLINE | ID: mdl-31484785

ABSTRACT

We present a massive investigation into the genetic basis of human lifespan. Beginning with a genome-wide association (GWA) study using a de-identified snapshot of the unique AncestryDNA database - more than 300,000 genotyped individuals linked to pedigrees of over 400,000,000 people - we mapped six genome-wide significant loci associated with parental lifespan. We compared these results to a GWA analysis of the traditional lifespan proxy trait, age, and found only one locus, APOE, to be associated with both age and lifespan. By combining the AncestryDNA results with those of an independent UK Biobank dataset, we conducted a meta-analysis of more than 650,000 individuals and identified fifteen parental lifespan-associated loci. Beyond just those significant loci, our genome-wide set of polymorphisms accounts for up to 8% of the variance in human lifespan; this value represents a large fraction of the heritability estimated from phenotypic correlations between relatives.


Subject(s)
Genome-Wide Association Study/methods ; Longevity/genetics ; Aged ; Aged, 80 and over ; Apolipoproteins E/genetics ; Carrier Proteins/genetics ; Databases, Genetic ; Female ; Humans ; Male ; Nuclear Proteins/genetics ; Pedigree ; Polymorphism, Single Nucleotide ; Prospective Studies ; Proto-Oncogene Proteins/genetics
4.
Genetics ; 210(3): 1109-1124, 2018 Nov.
Article in English | MEDLINE | ID: mdl-30401766

ABSTRACT

Human life span is a phenotype that integrates many aspects of health and environment into a single ultimate quantity: the elapsed time between birth and death. Though it is widely believed that long life runs in families for genetic reasons, estimates of life span "heritability" are consistently low (∼15-30%). Here, we used pedigree data from Ancestry public trees, including hundreds of millions of historical persons, to estimate the heritability of human longevity. Although "nominal heritability" estimates based on correlations among genetic relatives agreed with prior literature, the majority of that correlation was also captured by correlations among nongenetic (in-law) relatives, suggestive of highly assortative mating around life span-influencing factors (genetic and/or environmental). We used structural equation modeling to account for assortative mating, and concluded that the true heritability of human longevity for birth cohorts across the 1800s and early 1900s was well below 10%, and that it has been generally overestimated due to the effect of assortative mating.


Subject(s)
Longevity/genetics ; Reproduction ; Female ; Humans ; Male ; Models, Genetic ; Pedigree
5.
Nat Commun ; 8: 14238, 2017 Feb 07.
Article in English | MEDLINE | ID: mdl-28169989

ABSTRACT

Despite strides in characterizing human history from genetic polymorphism data, progress in identifying genetic signatures of recent demography has been limited. Here we identify very recent fine-scale population structure in North America from a network of over 500 million genetic (identity-by-descent, IBD) connections among 770,000 genotyped individuals of US origin. We detect densely connected clusters within the network and annotate these clusters using a database of over 20 million genealogical records. Recent population patterns captured by IBD clustering include immigrants such as Scandinavians and French Canadians; groups with continental admixture such as Puerto Ricans; settlers such as the Amish and Appalachians who experienced geographic or cultural isolation; and broad historical trends, including reduced north-south gene flow. Our results yield a detailed historical portrait of North America after European settlement and support substantial genetic heterogeneity in the United States beyond that uncovered by previous studies.


Subject(s)
Demography/statistics & numerical data ; Genetics, Population/methods ; Population Dynamics/trends ; Population/genetics ; Cluster Analysis ; Demography/methods ; Emigrants and Immigrants ; Gene Flow/genetics ; Genotyping Techniques ; Haplotypes/genetics ; Humans ; Polymorphism, Single Nucleotide ; Population Dynamics/statistics & numerical data ; Sequence Analysis, DNA ; United States/ethnology
6.
J Comput Biol ; 22(5): 402-13, 2015 May.
Article in English | MEDLINE | ID: mdl-25651392

ABSTRACT

Methods for translating gene expression signatures into clinically relevant information have typically relied upon having many samples from patients with similar molecular phenotypes. Here, we address the question of what can be done when it is relatively easy to obtain healthy patient samples, but when abnormalities corresponding to disease states may be rare and one-of-a-kind. The associated computational challenge, anomaly detection, is a well-studied machine-learning problem. However, due to the dimensionality and variability of expression data, existing methods based on feature space analysis or individual anomalously expressed genes are insufficient. We present a novel approach, CSAX, that identifies pathways in an individual sample in which the normal expression relationships are disrupted. To evaluate our approach, we have compiled and released a compendium of public expression data sets, reformulated to create a test bed for anomaly detection. We demonstrate the accuracy of CSAX on the data sets in our compendium, compare it to other leading methods, and show that CSAX aids in both identifying anomalies and explaining their underlying biology. We describe an approach to characterizing the difficulty of specific expression anomaly detection tasks. We then illustrate CSAX's value in two developmental case studies. Confirming prior hypotheses, CSAX highlights disruption of platelet activation pathways in a neonate with retinopathy of prematurity and identifies, for the first time, dysregulated oxidative stress response in second trimester amniotic fluid of fetuses with obese mothers. Our approach provides an important step toward identification of individual disease patterns in the era of precision medicine.


Subject(s)
Algorithms ; Obesity/genetics ; Retinopathy of Prematurity/genetics ; Software ; Transcriptome ; Adult ; Amniotic Fluid/chemistry ; Blood Platelets/metabolism ; Blood Platelets/pathology ; Databases, Genetic ; Datasets as Topic ; Female ; Fetus ; Gene Expression Profiling ; Humans ; Infant, Newborn ; Obesity/diagnosis ; Obesity/pathology ; Oxidative Stress ; Phenotype ; Platelet Activation/genetics ; Pregnancy ; Pregnancy Trimester, Second ; Retinopathy of Prematurity/diagnosis ; Retinopathy of Prematurity/pathology
7.
PLoS Comput Biol ; 10(5): e1003578, 2014 May.
Article in English | MEDLINE | ID: mdl-24874013

ABSTRACT

Identifying molecular connections between developmental processes and disease can lead to new hypotheses about health risks at all stages of life. Here we introduce a new approach to identifying significant connections between gene sets and disease genes, and apply it to several gene sets related to human development. To overcome the limits of incomplete and imperfect information linking genes to disease, we pool genes within disease subtrees in the MeSH taxonomy, and we demonstrate that such pooling improves the power and accuracy of our approach. Significance is assessed through permutation. We created a web-based visualization tool to facilitate multi-scale exploration of this large collection of significant connections (http://gda.cs.tufts.edu/development). High-level analysis of the results reveals expected connections between tissue-specific developmental processes and diseases linked to those tissues, and widespread connections to developmental disorders and cancers. Yet interesting new hypotheses may be derived from examining the unexpected connections. We highlight and discuss the implications of three such connections, linking dementia with bone development, polycystic ovary syndrome with cardiovascular development, and retinopathy of prematurity with lung development. Our results provide additional evidence that TGFB plays a key role in the early pathogenesis of polycystic ovary syndrome. Our evidence also suggests that the VEGF pathway and downstream NFKB signaling may explain the complex relationship between bronchopulmonary dysplasia and retinopathy of prematurity, and may form a bridge between two currently-competing hypotheses about the molecular origins of bronchopulmonary dysplasia. Further data exploration and similar queries about other gene sets may generate a variety of new information about the molecular relationships between additional diseases.
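
The two steps named in this abstract, pooling genes within a disease subtree and assessing significance by permutation, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the overlap statistic or the null model, so the sketch assumes a simple overlap count and a null of random gene sets of equal size drawn from a background. The names `pooled_disease_genes`, `permutation_p_value`, and all toy identifiers are hypothetical.

```python
import random

def pooled_disease_genes(tree, disease_genes, root):
    """Union of genes annotated to `root` and to every descendant term.

    `tree` maps a term to its child terms; `disease_genes` maps a term
    to a set of gene identifiers. Both structures are assumptions.
    """
    pooled, stack = set(), [root]
    while stack:
        node = stack.pop()
        pooled |= disease_genes.get(node, set())
        stack.extend(tree.get(node, []))
    return pooled

def permutation_p_value(gene_set, target_genes, background, n_perm=1000, seed=0):
    """Empirical p-value for the overlap between gene_set and target_genes.

    Assumed null: gene sets of the same size sampled uniformly from
    `background`. The +1 terms keep the estimate strictly above zero.
    """
    rng = random.Random(seed)
    observed = len(gene_set & target_genes)
    pool = sorted(background)  # fixed order so a given seed is reproducible
    hits = 0
    for _ in range(n_perm):
        sample = set(rng.sample(pool, len(gene_set)))
        if len(sample & target_genes) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy taxonomy: one parent term with two child terms (hypothetical labels).
tree = {"parent": ["child_a", "child_b"]}
disease_genes = {"child_a": {"g0", "g1", "g2"}, "child_b": {"g3", "g4"}}
background = {f"g{i}" for i in range(100)}

pooled = pooled_disease_genes(tree, disease_genes, "parent")
p_value = permutation_p_value({"g0", "g1", "g2", "g3", "g4"}, pooled, background)
```

Pooling gives the parent term five genes even though it has no direct annotations of its own, which is the situation the abstract describes as incomplete information linking genes to disease.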


Subject(s)
Chromosome Mapping/methods ; Gene Expression Regulation, Developmental/genetics ; Genetic Predisposition to Disease/genetics ; Genome-Wide Association Study/methods ; Models, Genetic ; Proteome/genetics ; Animals ; Computer Simulation ; Genetic Markers/genetics ; Humans
8.
Data Min Knowl Discov ; 25(1): 109-133, 2012.
Article in English | MEDLINE | ID: mdl-22639542

ABSTRACT

Anomaly detection involves identifying rare data instances (anomalies) that come from a different class or distribution than the majority (which are simply called "normal" instances). Given a training set of only normal data, the semi-supervised anomaly detection task is to identify anomalies in the future. Good solutions to this task have applications in fraud and intrusion detection. The unsupervised anomaly detection task is different: Given unlabeled, mostly-normal data, identify the anomalies among them. Many real-world machine learning tasks, including many fraud and intrusion detection tasks, are unsupervised because it is impractical (or impossible) to verify all of the training data. We recently presented FRaC, a new approach for semi-supervised anomaly detection. FRaC is based on using normal instances to build an ensemble of feature models, and then identifying instances that disagree with those models as anomalous. In this paper, we investigate the behavior of FRaC experimentally and explain why FRaC is so successful. We also show that FRaC is a superior approach for the unsupervised as well as the semi-supervised anomaly detection task, compared to well-known state-of-the-art anomaly detection methods, LOF and one-class support vector machines, and to an existing feature-modeling approach.

9.
BMC Bioinformatics ; 12 Suppl 8: S3, 2011 Oct 03.
Article in English | MEDLINE | ID: mdl-22151929

ABSTRACT

BACKGROUND: Determining usefulness of biomedical text mining systems requires realistic task definition and data selection criteria without artificial constraints, measuring performance aspects that go beyond traditional metrics. The BioCreative III Protein-Protein Interaction (PPI) tasks were motivated by such considerations, trying to address aspects including how the end user would oversee the generated output, for instance by providing ranked results, textual evidence for human interpretation or measuring time savings by using automated systems. Detecting articles describing complex biological events like PPIs was addressed in the Article Classification Task (ACT), where participants were asked to implement tools for detecting PPI-describing abstracts. Therefore the BCIII-ACT corpus was provided, which includes a training, development and test set of over 12,000 PPI relevant and non-relevant PubMed abstracts labeled manually by domain experts and also recording the human classification times. The Interaction Method Task (IMT) went beyond abstracts and required mining for associations between more than 3,500 full text articles and interaction detection method ontology concepts that had been applied to detect the PPIs reported in them. RESULTS: A total of 11 teams participated in at least one of the two PPI tasks (10 in ACT and 8 in the IMT) and a total of 62 persons were involved either as participants or in preparing data sets/evaluating these tasks. Per task, each team was allowed to submit five runs offline and another five online via the BioCreative Meta-Server. From the 52 runs submitted for the ACT, the highest Matthews correlation coefficient (MCC) score measured was 0.55 at an accuracy of 89% and the best AUC iP/R was 68%. Most ACT teams explored machine learning methods; some also used lexical resources like MeSH terms, PSI-MI concepts or particular lists of verbs and nouns, and some integrated NER approaches.
For the IMT, a total of 42 runs were evaluated by comparing systems against manually generated annotations done by curators from the BioGRID and MINT databases. The highest AUC iP/R achieved by any run was 53%, the best MCC score 0.55. In case of competitive systems with an acceptable recall (above 35%) the macro-averaged precision ranged between 50% and 80%, with a maximum F-Score of 55%. CONCLUSIONS: The results of the ACT task of BioCreative III indicate that classification of large unbalanced article collections reflecting the real class imbalance is still challenging. Nevertheless, text-mining tools that report ranked lists of relevant articles for manual selection can potentially reduce the time needed to identify half of the relevant articles to less than 1/4 of the time when compared to unranked results. Detecting associations between full text articles and interaction detection method PSI-MI terms (IMT) is more difficult than might be anticipated. This is due to the variability of method term mentions, errors resulting from pre-processing of articles provided as PDF files, and the heterogeneity and different granularity of method term concepts encountered in the ontology. However, combining the sophisticated techniques developed by the participants with supporting evidence strings derived from the articles for human interpretation could result in practical modules for biological annotation workflows.
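
The abstract reports MCC alongside accuracy for an unbalanced collection. A short Python sketch of the standard MCC formula shows why the two can diverge. The confusion-matrix counts below are invented for illustration and are not taken from the BioCreative III evaluation; returning 0.0 when a marginal count is zero is a common convention, adopted here as an assumption.

```python
import math

def matthews_corrcoef(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero (assumed convention),
    since the coefficient is undefined in that case.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denominator == 0 else numerator / denominator

def accuracy(tp, fp, tn, fn):
    """Fraction of all items classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

# Hypothetical collection: 100 relevant and 900 non-relevant abstracts.
# A classifier that labels everything non-relevant looks accurate
# but carries no information about the relevant class.
all_negative = dict(tp=0, fp=0, tn=900, fn=100)
acc_trivial = accuracy(**all_negative)           # 0.9
mcc_trivial = matthews_corrcoef(**all_negative)  # 0.0

# A classifier that actually finds relevant abstracts (again hypothetical).
informative = dict(tp=60, fp=40, tn=860, fn=40)
mcc_informative = matthews_corrcoef(**informative)
```

Under these assumed counts, accuracy alone cannot separate the trivial classifier from the informative one as clearly as MCC does, which is consistent with the abstract's point that real class imbalance makes the task harder to evaluate and to solve.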


Subject(s)
Algorithms ; Data Mining ; Proteins/metabolism ; Animals ; Databases, Protein ; Humans ; Periodicals as Topic ; PubMed
10.
Article in English | MEDLINE | ID: mdl-21393656

ABSTRACT

With well over 1,000 specialized biological databases in use today, the task of automatically identifying novel, relevant data for such databases is increasingly important. In this paper, we describe practical machine learning approaches for identifying MEDLINE documents and Swiss-Prot/TrEMBL protein records, for incorporation into a specialized biological database of transport proteins named TCDB. We show that both learning approaches outperform rules created by hand by a human expert. As one of the first case studies involving two different approaches to updating a deployed database, both the methods compared and the results will be of interest to curators of many specialized databases.


Subject(s)
Algorithms ; Artificial Intelligence ; Data Mining/methods ; Databases, Genetic ; Genomics/methods ; Carrier Proteins ; Cluster Analysis ; Humans ; MEDLINE ; Proteins/classification ; Proteins/genetics
11.
Proc IEEE Int Conf Data Min ; : 953-958, 2010 Dec 13.
Article in English | MEDLINE | ID: mdl-22020249

ABSTRACT

We present a new approach to semi-supervised anomaly detection. Given a set of training examples believed to come from the same distribution or class, the task is to learn a model that will be able to distinguish examples in the future that do not belong to the same class. Traditional approaches typically compare the position of a new data point to the set of "normal" training data points in a chosen representation of the feature space. For some data sets, the normal data may not have discernible positions in feature space, but do have consistent relationships among some features that fail to appear in the anomalous examples. Our approach learns to predict the values of training set features from the values of other features. After we have formed an ensemble of predictors, we apply this ensemble to new data points. To combine the contribution of each predictor in our ensemble, we have developed a novel, information-theoretic anomaly measure that our experimental results show selects against noisy and irrelevant features. Our results on 47 data sets show that for most data sets, this approach significantly improves performance over current state-of-the-art feature space distance and density-based approaches.
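
The idea in this abstract, predicting each feature from other features and flagging instances whose features are poorly predicted, can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the published method: it assumes numeric features, uses one least-squares line per ordered feature pair as the ensemble, estimates error spread in-sample, and scores with the Gaussian negative log-density of each residual as a stand-in for the paper's information-theoretic measure. All function names are hypothetical.

```python
import math
from statistics import mean

def fit_line(xs, ys):
    """Least-squares line ys ~ slope * xs + intercept, plus residual spread."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = 0.0 if sxx == 0 else sxy / sxx
    intercept = my - slope * mx
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    # Floor sigma so a perfect fit does not produce a zero-width error model.
    sigma = max(math.sqrt(mean(r * r for r in residuals)), 1e-6)
    return slope, intercept, sigma

def train_models(normal_rows):
    """One predictor per ordered (source, target) feature pair."""
    columns = list(zip(*normal_rows))
    n_features = len(columns)
    return {
        (source, target): fit_line(columns[source], columns[target])
        for target in range(n_features)
        for source in range(n_features)
        if source != target
    }

def anomaly_score(models, row):
    """Sum of surprisals: higher means the row disagrees with more predictors."""
    score = 0.0
    for (source, target), (slope, intercept, sigma) in models.items():
        residual = row[target] - (slope * row[source] + intercept)
        # Negative log-density of the residual under N(0, sigma^2).
        score += 0.5 * math.log(2 * math.pi * sigma ** 2) + residual ** 2 / (2 * sigma ** 2)
    return score

# Toy "normal" data in which the second feature is roughly twice the first.
normal_rows = [(float(x), 2.0 * x + 0.1 * (-1) ** x) for x in range(10)]
models = train_models(normal_rows)

typical = anomaly_score(models, (5.0, 10.0))
# Each value lies inside the range seen in training, but the relationship is broken.
broken = anomaly_score(models, (5.0, 0.0))
```

The second test point would look unremarkable to a method that considers each feature's range separately, yet it scores far higher here because it violates the learned relationship between features, which is the case the abstract singles out.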

12.
Nucleic Acids Res ; 37(Database issue): D274-8, 2009 Jan.
Article in English | MEDLINE | ID: mdl-19022853

ABSTRACT

The Transporter Classification Database (TCDB), freely accessible at http://www.tcdb.org, is a relational database containing sequence, structural, functional and evolutionary information about transport systems from a variety of living organisms, based on the International Union of Biochemistry and Molecular Biology-approved transporter classification (TC) system. It is a curated repository for factual information compiled largely from published references. It uses a functional/phylogenetic system of classification, and currently encompasses about 5000 representative transporters and putative transporters in more than 500 families. We here describe novel software designed to support and extend the usefulness of TCDB. Our recent efforts render it more user friendly, incorporate machine learning to input novel data in a semiautomatic fashion, and allow analyses that are more accurate and less time consuming. The availability of these tools has resulted in recognition of distant phylogenetic relationships and tremendous expansion of the information available to TCDB users.


Subject(s)
Databases, Protein ; Membrane Transport Proteins/classification ; Artificial Intelligence ; Membrane Transport Proteins/chemistry ; Membrane Transport Proteins/genetics ; Phylogeny ; Sequence Homology, Amino Acid
13.
Uncertain Artif Intell ; 2008: 444-451, 2008 Jul 09.
Article in English | MEDLINE | ID: mdl-21785575

ABSTRACT

We consider the task of learning mappings from sequential data to real-valued responses. We present and evaluate an approach to learning a type of hidden Markov model (HMM) for regression. The learning process involves inferring the structure and parameters of a conventional HMM, while simultaneously learning a regression model that maps features that characterize paths through the model to continuous responses. Our results, in both synthetic and biological domains, demonstrate the value of jointly learning the two components of our approach.

14.
Bioinformatics ; 23(2): e156-62, 2007 Jan 15.
Article in English | MEDLINE | ID: mdl-17237085

ABSTRACT

MOTIVATION: The process of transcription is controlled by systems of factors which bind in specific arrangements, called cis-regulatory modules (CRMs), in promoter regions. We present a discriminative learning algorithm which simultaneously learns the DNA binding site motifs as well as the logical structure and spatial aspects of CRMs. RESULTS: Our results on yeast datasets show better predictive accuracy than a current state-of-the-art approach on the same datasets. Our results on yeast, fly and human datasets show that the inclusion of logical and spatial aspects improves the predictive accuracy of our learned models. AVAILABILITY: Source code is available at http://www.cs.wisc.edu/~noto/crm


Subject(s)
Algorithms ; Chromosome Mapping/methods ; Genome, Fungal/genetics ; Models, Genetic ; Regulatory Elements, Transcriptional/genetics ; Sequence Analysis, DNA/methods ; Transcription Factors/genetics ; Artificial Intelligence ; Base Sequence ; Computer Simulation ; Models, Statistical ; Molecular Sequence Data ; Pattern Recognition, Automated/methods ; Reproducibility of Results ; Sensitivity and Specificity ; Sequence Alignment/methods
15.
BMC Bioinformatics ; 7: 528, 2006 Dec 05.
Article in English | MEDLINE | ID: mdl-17147812

ABSTRACT

BACKGROUND: The process of transcription is controlled by systems of transcription factors, which bind to specific patterns of binding sites in the transcriptional control regions of genes, called cis-regulatory modules (CRMs). We present an expressive and easily comprehensible CRM representation which is capable of capturing several aspects of a CRM's structure and distinguishing between DNA sequences which do or do not contain it. We also present a learning algorithm tailored for this domain, and a novel method to avoid overfitting by controlling the expressivity of the model. RESULTS: We are able to find statistically significant CRMs more often than a current state-of-the-art approach on the same data sets. We also show experimentally that each aspect of our expressive CRM model space makes a positive contribution to the learned models on yeast and fly data. CONCLUSION: Structural aspects are an important part of CRMs, both in terms of interpreting them biologically and learning them accurately. Source code for our algorithm is available at: http://www.cs.wisc.edu/~noto/crm.


Subject(s)
Artificial Intelligence ; Pattern Recognition, Automated/methods ; Regulatory Elements, Transcriptional/genetics ; Regulatory Sequences, Nucleic Acid/genetics ; Sequence Analysis, DNA/methods ; Transcription Factors/genetics ; Transcription, Genetic/genetics ; Algorithms ; Base Sequence ; Molecular Sequence Data