Results 1 - 20 of 38
1.
Cell ; 148(6): 1293-307, 2012 Mar 16.
Article in English | MEDLINE | ID: mdl-22424236

ABSTRACT

Personalized medicine is expected to benefit from combining genomic information with regular monitoring of physiological states by multiple high-throughput methods. Here, we present an integrative personal omics profile (iPOP), an analysis that combines genomic, transcriptomic, proteomic, metabolomic, and autoantibody profiles from a single individual over a 14 month period. Our iPOP analysis revealed various medical risks, including type 2 diabetes. It also uncovered extensive, dynamic changes in diverse molecular components and biological pathways across healthy and diseased conditions. Extremely high-coverage genomic and transcriptomic data, which provide the basis of our iPOP, revealed extensive heteroallelic changes during healthy and diseased states and an unexpected RNA editing mechanism. This study demonstrates that longitudinal iPOP can be used to interpret healthy and diseased states by connecting genomic information with additional dynamic omics activity.


Subject(s)
Genome, Human; Genomics; Precision Medicine; Diabetes Mellitus, Type 2/genetics; Female; Gene Expression Profiling; Humans; Male; Metabolomics; Middle Aged; Mutation; Proteomics; Respiratory Syncytial Viruses/isolation & purification; Rhinovirus/isolation & purification
2.
Bioinformatics ; 39(1)2023 01 01.
Article in English | MEDLINE | ID: mdl-36477833

ABSTRACT

MOTIVATION: While many quantum computing (QC) methods promise theoretical advantages over classical counterparts, quantum hardware remains limited. Exploiting near-term QC in computer-aided drug design (CADD) thus requires judicious partitioning between classical and quantum calculations. RESULTS: We present HypaCADD, a hybrid classical-quantum workflow for finding ligands binding to proteins, while accounting for genetic mutations. We explicitly identify modules of our drug-design workflow currently amenable to replacement by QC: non-intuitively, we identify the mutation-impact predictor as the best candidate. HypaCADD thus combines classical docking and molecular dynamics with quantum machine learning (QML) to infer the impact of mutations. We present a case study with the coronavirus (SARS-CoV-2) protease and associated mutants. We map a classical machine-learning module onto QC, using a neural network constructed from qubit-rotation gates. We have implemented this in simulation and on two commercial quantum computers. We find that the QML models can perform on par with, if not better than, classical baselines. In summary, HypaCADD offers a successful strategy for leveraging QC for CADD. AVAILABILITY AND IMPLEMENTATION: Jupyter Notebooks with Python code are freely available for academic use on GitHub: https://www.github.com/hypahub/hypacadd_notebook. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
COVID-19; Software; Humans; Workflow; Computing Methodologies; Quantum Theory; SARS-CoV-2; Drug Design; Molecular Dynamics Simulation
3.
Genome Res ; 28(4): 423-431, 2018 04.
Article in English | MEDLINE | ID: mdl-29567674

ABSTRACT

Over a decade ago, the Atacama humanoid skeleton (Ata) was discovered in the Atacama region of Chile. The Ata specimen carried a strange phenotype (6-in stature, fewer than expected ribs, elongated cranium, and accelerated bone age), leading to speculation that this was a preserved nonhuman primate, a human fetus harboring genetic mutations, or even an extraterrestrial. We previously reported that it was human by DNA analysis with an estimated bone age of about 6-8 yr at the time of demise. To determine the possible genetic drivers of the observed morphology, DNA from the specimen was subjected to whole-genome sequencing using the Illumina HiSeq platform with an average 11.5× coverage of 101-bp, paired-end reads. In total, 3,356,569 single nucleotide variations (SNVs), 518,365 insertions and deletions (indels), and 1047 structural variations (SVs) were detected relative to the human reference genome. Here, we present the detailed whole-genome analysis showing that Ata is a female of human origin, likely of Chilean descent, and its genome harbors mutations in genes (COL1A1, COL2A1, KMT2D, FLNB, ATR, TRIP11, PCNT) previously linked with diseases of small stature, rib anomalies, cranial malformations, premature joint fusion, and osteochondrodysplasia (also known as skeletal dysplasia). Together, these findings provide a molecular characterization of Ata's peculiar phenotype, which likely results from multiple known and novel putative gene mutations affecting bone development and ossification.
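The sequencing depth quoted above follows the usual relation: mean coverage = number of reads × read length / genome size. The following is a minimal sketch of that arithmetic only. The haploid genome size of about 3.1 Gb and both function names are assumptions for illustration; neither comes from the abstract.

```python
import math

# Assumed haploid human genome size in bases (not stated in the abstract).
ASSUMED_GENOME_SIZE = 3_100_000_000


def mean_coverage(n_reads: int, read_length: int,
                  genome_size: int = ASSUMED_GENOME_SIZE) -> float:
    """Mean depth of coverage: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def reads_needed(coverage: float, read_length: int,
                 genome_size: int = ASSUMED_GENOME_SIZE) -> int:
    """Smallest number of reads that reaches the requested mean coverage."""
    return math.ceil(coverage * genome_size / read_length)


# Reads of 101 bp required for 11.5x mean coverage under the assumed genome size.
n = reads_needed(11.5, 101)
print(n, round(mean_coverage(n, 101), 2))
```

The result scales linearly with the assumed genome size, so a different reference length changes the read count proportionally.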


Subject(s)
DNA, Ancient/analysis; Genome, Human/genetics; Osteochondrodysplasias/genetics; Whole Genome Sequencing; Animals; Female; High-Throughput Nucleotide Sequencing; Humans; INDEL Mutation; Molecular Sequence Annotation; Mutation/genetics; Osteochondrodysplasias/physiopathology; Phenotype; Polymorphism, Single Nucleotide/genetics
4.
Nature ; 526(7571): 75-81, 2015 Oct 01.
Article in English | MEDLINE | ID: mdl-26432246

ABSTRACT

Structural variants are implicated in numerous diseases and make up the majority of varying nucleotides among human genomes. Here we describe an integrated set of eight structural variant classes comprising both balanced and unbalanced variants, which we constructed using short-read DNA sequencing data and statistically phased onto haplotype blocks in 26 human populations. Analysing this set, we identify numerous gene-intersecting structural variants exhibiting population stratification and describe naturally occurring homozygous gene knockouts that suggest the dispensability of a variety of human genes. We demonstrate that structural variants are enriched on haplotypes identified by genome-wide association studies and exhibit enrichment for expression quantitative trait loci. Additionally, we uncover appreciable levels of structural variant complexity at different scales, including genic loci subject to clusters of repeated rearrangement and complex structural variants with multiple breakpoints likely to have formed through individual mutational events. Our catalogue will enhance future studies into structural variant demography, functional impact and disease association.


Subject(s)
Genetic Variation/genetics; Genome, Human/genetics; Physical Chromosome Mapping; Amino Acid Sequence; Genetic Predisposition to Disease; Genetics, Medical; Genetics, Population; Genome-Wide Association Study; Genomics; Genotype; Haplotypes/genetics; Homozygote; Humans; Molecular Sequence Data; Mutation Rate; Polymorphism, Single Nucleotide/genetics; Quantitative Trait Loci/genetics; Sequence Analysis, DNA; Sequence Deletion/genetics
5.
Hum Mutat ; 38(9): 1155-1168, 2017 09.
Article in English | MEDLINE | ID: mdl-28397312

ABSTRACT

The CAGI-4 Hopkins clinical panel challenge was an attempt to assess state-of-the-art methods for clinical phenotype prediction from DNA sequence. Participants were provided with exonic sequences of 83 genes for 106 patients from the Johns Hopkins DNA Diagnostic Laboratory. Five groups participated in the challenge, predicting both the probability that each patient had each of the 14 possible classes of disease, as well as one or more causal variants. In cases where the Hopkins laboratory reported a variant, at least one predictor correctly identified the disease class in 36 of the 43 patients (84%). Even in cases where the Hopkins laboratory did not find a variant, at least one predictor correctly identified the class in 39 of the 63 patients (62%). Each prediction group correctly diagnosed at least one patient that was not successfully diagnosed by any other group. We discuss the causal variant predictions by different groups and their implications for further development of methods to assess variants of unknown significance. Our results suggest that clinically relevant variants may be missed when physicians order small panels targeted on a specific phenotype. We also quantify the false-positive rate of DNA-guided analysis in the absence of prior phenotypic indication.
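The two success rates quoted in the abstract can be checked directly from the patient counts it gives. The short sketch below only reproduces that arithmetic; the helper name `percent` is hypothetical.

```python
def percent(hits: int, total: int) -> int:
    """Return hits/total as a percentage rounded to the nearest integer."""
    return round(100 * hits / total)


# Patients for whom the laboratory reported a variant: 36 of 43 classified.
with_variant = percent(36, 43)
# Patients for whom the laboratory found no variant: 39 of 63 classified.
without_variant = percent(39, 63)

print(with_variant, without_variant)  # 84 62
print(43 + 63)  # 106, the total number of patients in the challenge
```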


Subject(s)
Computational Biology/methods; Sequence Analysis, DNA/methods; Databases, Genetic; Genetic Predisposition to Disease; Genetic Testing; Humans; Phenotype
6.
Bioinformatics ; 32(24): 3829-3832, 2016 12 15.
Article in English | MEDLINE | ID: mdl-27667791

ABSTRACT

LongISLND is a software package designed to simulate sequencing data according to the characteristics of third generation, single-molecule sequencing technologies. The general software architecture is easily extendable, as demonstrated by the emulation of Pacific Biosciences (PacBio) multi-pass sequencing with P5 and P6 chemistries, producing data in FASTQ, H5, and the latest PacBio BAM format. We demonstrate its utility by downstream processing with consensus building and variant calling. AVAILABILITY AND IMPLEMENTATION: LongISLND is implemented in Java and available at http://bioinform.github.io/longislnd CONTACT: hugo.lam@roche.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Computational Biology/methods; High-Throughput Nucleotide Sequencing/methods; Software; Computer Simulation; Sequence Alignment
7.
Nature ; 470(7332): 59-65, 2011 Feb 03.
Article in English | MEDLINE | ID: mdl-21293372

ABSTRACT

Genomic structural variants (SVs) are abundant in humans, differing from other forms of variation in extent, origin and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs (that is, copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analysing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differences in the size spectra of SVs originating from distinct formation mechanisms, and constructed a map of SV hotspots formed by common mechanisms. Our analytical framework and SV map serve as a resource for sequencing-based association studies.


Subject(s)
DNA Copy Number Variations/genetics; Genetics, Population; Genome, Human/genetics; Genomics; Gene Duplication/genetics; Genetic Predisposition to Disease/genetics; Genotype; Humans; Mutagenesis, Insertional/genetics; Reproducibility of Results; Sequence Analysis, DNA; Sequence Deletion/genetics
8.
BMC Genomics ; 17: 64, 2016 Jan 16.
Article in English | MEDLINE | ID: mdl-26772178

ABSTRACT

BACKGROUND: The human genome contains variants ranging in size from small single nucleotide polymorphisms (SNPs) to large structural variants (SVs). High-quality benchmark small variant calls for the pilot National Institute of Standards and Technology (NIST) Reference Material (NA12878) have been developed by the Genome in a Bottle Consortium, but no similar high-quality benchmark SV calls exist for this genome. Since SV callers output highly discordant results, we developed methods to combine multiple forms of evidence from multiple sequencing technologies to classify candidate SVs into likely true or false positives. Our method (svclassify) calculates annotations from one or more aligned bam files from many high-throughput sequencing technologies, and then builds a one-class model using these annotations to classify candidate SVs as likely true or false positives. RESULTS: We first used pedigree analysis to develop a set of high-confidence breakpoint-resolved large deletions. We then used svclassify to cluster and classify these deletions as well as a set of high-confidence deletions from the 1000 Genomes Project and a set of breakpoint-resolved complex insertions from Spiral Genetics. We find that likely SVs cluster separately from likely non-SVs based on our annotations, and that the SVs cluster into different types of deletions. We then developed a supervised one-class classification method that uses a training set of random non-SV regions to determine whether candidate SVs have abnormal annotations different from most of the genome. To test this classification method, we use our pedigree-based breakpoint-resolved SVs, SVs validated by the 1000 Genomes Project, and assembly-based breakpoint-resolved insertions, along with semi-automated visualization using svviz. 
CONCLUSIONS: We find that candidate SVs with high scores from multiple technologies have high concordance with PCR validation and an orthogonal consensus method MetaSV (99.7 % concordant), and candidate SVs with low scores are questionable. We distribute a set of 2676 high-confidence deletions and 68 high-confidence insertions with high svclassify scores from these call sets for benchmarking SV callers. We expect these methods to be particularly useful for establishing high-confidence SV calls for benchmark samples that have been characterized by multiple technologies.


Subject(s)
Genome, Human; Genomic Structural Variation; Software; Benchmarking; Genomics; High-Throughput Nucleotide Sequencing; Humans; Molecular Sequence Annotation; Pedigree; Polymorphism, Single Nucleotide/genetics
9.
Bioinformatics ; 31(16): 2741-4, 2015 Aug 15.
Article in English | MEDLINE | ID: mdl-25861968

ABSTRACT

Structural variations (SVs) are large genomic rearrangements that vary significantly in size, making them challenging to detect with the relatively short reads from next-generation sequencing (NGS). Different SV detection methods have been developed; however, each is limited to specific kinds of SVs with varying accuracy and resolution. Previous works have attempted to combine different methods, but they still suffer from poor accuracy particularly for insertions. We propose MetaSV, an integrated SV caller which leverages multiple orthogonal SV signals for high accuracy and resolution. MetaSV proceeds by merging SVs from multiple tools for all types of SVs. It also analyzes soft-clipped reads from alignment to detect insertions accurately since existing tools underestimate insertion SVs. Local assembly in combination with dynamic programming is used to improve breakpoint resolution. Paired-end and coverage information is used to predict SV genotypes. Using simulation and experimental data, we demonstrate the effectiveness of MetaSV across various SV types and sizes. AVAILABILITY AND IMPLEMENTATION: Code in Python is at http://bioinform.github.io/metasv/. CONTACT: rd@bina.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
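The abstract does not say how calls from different tools are merged. One common approach is to cluster calls of the same type whose intervals share a minimum reciprocal overlap, and the sketch below illustrates that idea only. It is not MetaSV's implementation: the `SV` record, the function names, the 50% threshold and the tool labels are all assumptions.

```python
from typing import List, NamedTuple


class SV(NamedTuple):
    chrom: str
    start: int
    end: int      # assumed to be greater than start
    svtype: str
    tool: str     # hypothetical label of the caller that produced the record


def reciprocal_overlap(a: SV, b: SV) -> float:
    """Overlap length as a fraction of each call; return the smaller fraction."""
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))


def merge_calls(calls: List[SV], min_ro: float = 0.5) -> List[List[SV]]:
    """Greedily group calls that match a cluster's first member."""
    clusters: List[List[SV]] = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.svtype, c.start)):
        for cluster in clusters:
            rep = cluster[0]
            if (rep.chrom == call.chrom and rep.svtype == call.svtype
                    and reciprocal_overlap(rep, call) >= min_ro):
                cluster.append(call)
                break
        else:
            clusters.append([call])
    return clusters


calls = [
    SV("chr1", 100, 200, "DEL", "toolA"),
    SV("chr1", 110, 210, "DEL", "toolB"),
    SV("chr1", 1000, 1500, "DEL", "toolA"),
]
print([len(c) for c in merge_calls(calls)])  # [2, 1]
```

Zero-length records such as insertions would need a separate rule, for example matching breakpoints within a window, because reciprocal overlap is undefined for them.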


Subject(s)
Genetic Variation; High-Throughput Nucleotide Sequencing/methods; Software; Mutagenesis, Insertional; Sequence Deletion
10.
Bioinformatics ; 31(9): 1469-71, 2015 May 01.
Article in English | MEDLINE | ID: mdl-25524895

ABSTRACT

SUMMARY: VarSim is a framework for assessing alignment and variant calling accuracy in high-throughput genome sequencing through simulation or real data. In contrast to simulating a random mutation spectrum, it synthesizes diploid genomes with germline and somatic mutations based on a realistic model. This model leverages information such as previously reported mutations to make the synthetic genomes biologically relevant. VarSim simulates and validates a wide range of variants, including single nucleotide variants, small indels and large structural variants. It is an automated, comprehensive compute framework supporting parallel computation and multiple read simulators. Furthermore, we developed a novel map data structure to validate read alignments, a strategy to compare variants binned in size ranges and a lightweight, interactive, graphical report to visualize validation results with detailed statistics. Thus far, it is the most comprehensive validation tool for secondary analysis in next generation sequencing. AVAILABILITY AND IMPLEMENTATION: Code in Java and Python along with instructions to download the reads and variants is at http://bioinform.github.io/varsim. CONTACT: rd@bina.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
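The idea of comparing variants binned in size ranges can be illustrated with a small sketch that reports sensitivity per bin. The bin boundaries, labels and function names below are hypothetical and do not reproduce VarSim's own code or defaults.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical inclusive upper bounds (bp) of the first three bins.
UPPER_BOUNDS = [1, 50, 1000]
LABELS = ["1 bp", "2-50 bp", "51-1000 bp", ">1000 bp"]


def size_bin(length: int) -> str:
    """Map a variant length in bp to the label of its size bin."""
    return LABELS[bisect_left(UPPER_BOUNDS, length)]


def sensitivity_by_bin(truth: Iterable[Tuple[int, bool]]) -> Dict[str, float]:
    """truth holds (variant length, was the variant recovered by the caller)."""
    totals: Dict[str, int] = defaultdict(int)
    found: Dict[str, int] = defaultdict(int)
    for length, was_called in truth:
        label = size_bin(length)
        totals[label] += 1
        found[label] += int(was_called)
    return {label: found[label] / totals[label] for label in totals}


truth_set = [(1, True), (1, False), (30, True), (5000, False)]
print(sensitivity_by_bin(truth_set))
# {'1 bp': 0.5, '2-50 bp': 1.0, '>1000 bp': 0.0}
```

Reporting per bin keeps the far more numerous small variants from masking poor performance on large structural variants.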


Subject(s)
Genetic Variation; High-Throughput Nucleotide Sequencing/methods; Software; Computer Simulation; Genomics; Humans; Mutation; Neoplasms/genetics; Sequence Alignment
11.
PLoS Genet ; 7(8): e1002236, 2011 Aug.
Article in English | MEDLINE | ID: mdl-21876680

ABSTRACT

As a consequence of the accumulation of insertion events over evolutionary time, mobile elements now comprise nearly half of the human genome. The Alu, L1, and SVA mobile element families are still duplicating, generating variation between individual genomes. Mobile element insertions (MEI) have been identified as causes for genetic diseases, including hemophilia, neurofibromatosis, and various cancers. Here we present a comprehensive map of 7,380 MEI polymorphisms from the 1000 Genomes Project whole-genome sequencing data of 185 samples in three major populations detected with two detection methods. This catalog enables us to systematically study mutation rates, population segregation, genomic distribution, and functional properties of MEI polymorphisms and to compare MEI to SNP variation from the same individuals. Population allele frequencies of MEI and SNPs are described, broadly, by the same neutral ancestral processes despite vastly different mutation mechanisms and rates, except in coding regions where MEI are virtually absent, presumably due to strong negative selection. A direct comparison of MEI and SNP diversity levels suggests a differential mobile element insertion rate among populations.


Subject(s)
DNA Transposable Elements; Genome, Human; Polymorphism, Single Nucleotide; Gene Frequency; Genotype; Heterozygote; Humans; Mutagenesis, Insertional; Mutation Rate
12.
Nucleic Acids Res ; 39(16): 7058-76, 2011 Sep 01.
Article in English | MEDLINE | ID: mdl-21596777

ABSTRACT

In the human genome, it has been estimated that considerably more sequence is under natural selection in non-coding regions [such as transcription-factor binding sites (TF-binding sites) and non-coding RNAs (ncRNAs)] compared to protein-coding ones. However, less attention has been paid to them. To study selective pressure on non-coding elements, we use next-generation sequencing data from the recently completed pilot phase of the 1000 Genomes Project, which, compared to traditional methods, allows for the characterization of a full spectrum of genomic variations, including single-nucleotide polymorphisms (SNPs), short insertions and deletions (indels) and structural variations (SVs). We develop a framework for combining these variation data with non-coding elements, calculating various population-based metrics to compare classes and subclasses of elements, and developing element-aware aggregation procedures to probe the internal structure of an element. Overall, we find that TF-binding sites and ncRNAs are less selectively constrained for SNPs than coding sequences (CDSs), but more constrained than a neutral reference. We also determine that the relative amounts of constraint for the three types of variations are, in general, correlated, but there are some differences: counter-intuitively, TF-binding sites and ncRNAs are more selectively constrained for indels than for SNPs, compared to CDSs. After inspecting the overall properties of a class of elements, we analyze selective pressure on subclasses within an element class, and show that the extent of selection is associated with the genomic properties of each subclass. We find, for instance, that ncRNAs with higher expression levels tend to be under stronger purifying selection, and the actual regions of TF-binding motifs are under stronger selective pressure than the corresponding peak regions. 
Further, we develop element-aware aggregation plots to analyze selective pressure across the linear structure of an element, with the confidence intervals evaluated using both simple bootstrapping and block bootstrapping techniques. We find, for example, that both micro-RNAs (particularly the seed regions) and their binding targets are under stronger selective pressure for SNPs than their immediate genomic surroundings. In addition, we demonstrate that substitutions in TF-binding motifs inversely correlate with site conservation, and SNPs unfavorable for motifs are under more selective constraints than favorable SNPs. Finally, to further investigate intra-element differences, we show that SVs have the tendency to use distinctive modes and mechanisms when they interact with genomic elements, such as enveloping whole gene(s) rather than disrupting them partially, as well as duplicating TF motifs in tandem.
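The abstract mentions confidence intervals from simple bootstrapping. A minimal percentile-bootstrap sketch for the mean of a per-position metric might look like the following. The function name, the resample count and the example values are assumptions, and the block-bootstrap variant, which resamples contiguous runs to preserve local correlation, is not shown.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def bootstrap_ci(values: Sequence[float], n_resamples: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement, recording the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo_idx = int(round(alpha / 2 * n_resamples))
    hi_idx = int(round((1 - alpha / 2) * n_resamples)) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical per-site diversity values for one class of elements.
diversity = [0.8, 1.1, 0.9, 1.4, 0.7, 1.0, 1.2, 0.95]
low, high = bootstrap_ci(diversity)
print(low <= high)  # True
```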


Subject(s)
DNA, Intergenic/chemistry; Genetic Variation; Genome, Human; Binding Sites; Gene Frequency; Genomics; Humans; INDEL Mutation; MicroRNAs/chemistry; Polymorphism, Single Nucleotide; Proteins/genetics; Pseudogenes; RNA, Untranslated/metabolism; Transcription Factors/metabolism
13.
PLoS Genet ; 6(2): e1000848, 2010 Feb 19.
Article in English | MEDLINE | ID: mdl-20174564

ABSTRACT

Transcription factors are key components of regulatory networks that control development, as well as the response to environmental stimuli. We have established an experimental pipeline in Caenorhabditis elegans that permits global identification of the binding sites for transcription factors using chromatin immunoprecipitation and deep sequencing. We describe and validate this strategy, and apply it to the transcription factor PHA-4, which plays critical roles in organ development and other cellular processes. We identified thousands of binding sites for PHA-4 during formation of the embryonic pharynx, and also found a role for this factor during the starvation response. Many binding sites were found to shift dramatically between embryos and starved larvae, from developmentally regulated genes to genes involved in metabolism. These results indicate distinct roles for this regulator in two different biological processes and demonstrate the versatility of transcription factors in mediating diverse biological roles.


Subject(s)
Caenorhabditis elegans Proteins/metabolism; Caenorhabditis elegans/growth & development; Caenorhabditis elegans/genetics; Environment; Genome, Helminth/genetics; Trans-Activators/metabolism; Animals; Binding Sites; Caenorhabditis elegans Proteins/genetics; Chromatin Immunoprecipitation; Embryo, Nonmammalian/metabolism; Gene Expression Regulation, Developmental; Genes, Helminth/genetics; Green Fluorescent Proteins/metabolism; Larva/metabolism; Protein Binding; RNA Polymerase II/metabolism; Recombinant Fusion Proteins/metabolism; Starvation; Survival Analysis; Trans-Activators/genetics; Transcription Factors/metabolism
14.
J Thorac Cardiovasc Surg ; 166(1): 141-152.e1, 2023 07.
Article in English | MEDLINE | ID: mdl-34689984

ABSTRACT

OBJECTIVES: We examined differences in pre-left ventricular assist device (LVAD) implantation myocardial transcriptome signatures among patients with different degrees of mitral regurgitation (MR). METHODS: Between January 2018 and October 2019, we collected left ventricular (LV) cores during durable LVAD implantation (n = 72). A retrospective chart review was performed. Total RNA was isolated from LV cores and used to construct cDNA sequence libraries. The libraries were sequenced with the NovaSeq system, and data were quantified using Kallisto. Gene Set Enrichment Analysis (GSEA) and Gene Ontology analyses were performed, with a false discovery rate <0.05 considered significant. RESULTS: Comparing patients with preoperative mild or less MR (n = 30) and those with moderate-severe MR (n = 42), the moderate-severe MR group weighed less (P = .004) and had more tricuspid valve repairs (P = .043), without differences in demographics or comorbidities. We then compared both groups with a group of human donor hearts without heart failure (n = 8). Compared with the donor hearts, there were 3985 differentially expressed genes (DEGs) for mild or less MR and 4587 DEGs for moderate-severe MR. Specifically altered genes included 448 DEGs specific for mild or less MR and 1050 DEGs specific for moderate-severe MR. On GSEA, common regulated genes showed increased immune gene expression and reduced expression of contraction and energetic genes. Of the 1050 genes specific for moderate-severe MR, there were additional up-regulated genes related to inflammation and reduced expression of genes related to cellular proliferation. CONCLUSIONS: Patients undergoing durable LVAD implantation with moderate-severe MR had increased activation of genes related to inflammation and reduction of cellular proliferation genes. This may have important implications for myocardial recovery.
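The abstract applies a false discovery rate cutoff of 0.05 but does not name the correction procedure. The Benjamini-Hochberg adjustment is a widely used choice, so the sketch below assumes it purely for illustration. The function name and the example p-values are hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.5]
q = benjamini_hochberg(raw)
print([value < 0.05 for value in q])  # [True, False, False, False]
```

In this toy input only the first gene passes the 0.05 cutoff after adjustment, although three raw p-values are below 0.05.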


Subject(s)
Heart Failure; Heart Transplantation; Heart-Assist Devices; Mitral Valve Insufficiency; Humans; Mitral Valve Insufficiency/diagnostic imaging; Mitral Valve Insufficiency/genetics; Mitral Valve Insufficiency/surgery; Transcriptome; Retrospective Studies; Treatment Outcome; Tissue Donors; Heart Failure/genetics; Heart Failure/surgery; Inflammation
15.
PLoS Comput Biol ; 7(1): e1001050, 2011 Jan 06.
Article in English | MEDLINE | ID: mdl-21253555

ABSTRACT

We have accumulated a large amount of biological network data and expect even more to come. Soon, we anticipate being able to compare many different biological networks as we commonly do for molecular sequences. It has long been believed that many of these networks change, or "rewire", at different rates. It is therefore important to develop a framework to quantify the differences between networks in a unified fashion. We developed such a formalism based on analogy to simple models of sequence evolution, and used it to conduct a systematic study of network rewiring on all the currently available biological networks. We found that, similar to sequences, biological networks show a decreased rate of change at large time divergences, because of saturation in potential substitutions. However, different types of biological networks consistently rewire at different rates. Using comparative genomics and proteomics data, we found a consistent ordering of the rewiring rates: transcription regulatory, phosphorylation regulatory, genetic interaction, miRNA regulatory, protein interaction, and metabolic pathway network, from fast to slow. This ordering was found in all comparisons we did of matched networks between organisms. To gain further intuition on network rewiring, we compared our observed rewirings with those obtained from simulation. We also investigated how readily our formalism could be mapped to other network contexts; in particular, we showed how it could be applied to analyze changes in a range of "commonplace" networks such as family trees, co-authorships and linux-kernel function dependencies.
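The abstract does not spell out its formalism. As a rough intuition only, the difference between two networks over matched nodes can be summarized as the fraction of edges present in one network but not the other. The sketch below computes that simple quantity; it is not the authors' model, and the function name and example edges are hypothetical.

```python
from typing import Iterable, Tuple

Edge = Tuple[str, str]


def rewired_fraction(edges_a: Iterable[Edge], edges_b: Iterable[Edge]) -> float:
    """Fraction of undirected edges found in exactly one of the two networks."""
    a = {frozenset(edge) for edge in edges_a}
    b = {frozenset(edge) for edge in edges_b}
    union = a | b
    if not union:
        return 0.0
    # Symmetric difference counts edges gained or lost between the networks.
    return len(a ^ b) / len(union)


species_1 = [("a", "b"), ("b", "c")]
species_2 = [("b", "a"), ("c", "d")]
print(rewired_fraction(species_1, species_2))  # 2 of 3 edges differ
```

Dividing such a fraction by an estimate of divergence time would give a crude rate. That is where an analogy to sequence substitution models, including a correction for saturation, would come in.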


Subject(s)
Biological Evolution; Genomics; Proteomics
16.
Nucleic Acids Res ; 38(20): 6997-7007, 2010 Nov.
Article in English | MEDLINE | ID: mdl-20615899

ABSTRACT

Duplicated pseudogenes in the human genome are disabled copies of functioning parent genes. They result from block duplication events occurring throughout evolutionary history. Relatively recent duplications (with sequence similarity ≥90% and length ≥1 kb) are termed segmental duplications (SDs); here, we analyze the interrelationship of SDs and pseudogenes. We present a decision-tree approach to classify pseudogenes based on their (and their parents') characteristics in relation to SDs. The classification identifies 140 novel pseudogenes and makes possible improved annotation for the 3172 pseudogenes located in SDs. In particular, it reveals that many pseudogenes in SDs likely did not arise directly from parent genes, but are the result of a multi-step process. In these cases, the initial duplication or retrotransposition of a parent gene gives rise to a 'parent pseudogene', followed by further duplication creating duplicated-duplicated or duplicated-processed pseudogenes, respectively. Moreover, we can precisely identify these parent pseudogenes by overlap with ancestral SD loci. Finally, a comparison of nucleotide substitutions per site in a pseudogene with its surrounding SD region allows us to estimate the time difference between duplication and disablement events, and this suggests that most duplicated pseudogenes in SDs were likely disabled around the time of the original duplication.
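The two thresholds that define a segmental duplication in the abstract translate directly into a predicate. The sketch below encodes only that definition. The function name and the input representation, identity as a fraction and length in base pairs, are assumptions.

```python
MIN_IDENTITY = 0.90    # sequence similarity of at least 90%
MIN_LENGTH_BP = 1000   # length of at least 1 kb


def is_segmental_duplication(identity: float, length_bp: int) -> bool:
    """Apply the similarity and length thresholds quoted in the abstract."""
    return identity >= MIN_IDENTITY and length_bp >= MIN_LENGTH_BP


print(is_segmental_duplication(0.95, 2500))  # True
print(is_segmental_duplication(0.95, 800))   # False, shorter than 1 kb
print(is_segmental_duplication(0.85, 2500))  # False, below 90% similarity
```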


Subject(s)
Genome, Human; Pseudogenes; Segmental Duplications, Genomic; Evolution, Molecular; Gene Duplication; Genetic Loci; Humans
17.
Nucleic Acids Res ; 37(Database issue): D738-43, 2009 Jan.
Article in English | MEDLINE | ID: mdl-18957444

ABSTRACT

Pseudofam (http://pseudofam.pseudogene.org) is a database of pseudogene families based on the protein families from the Pfam database. It provides resources for analyzing the family structure of pseudogenes including query tools, statistical summaries and sequence alignments. The current version of Pseudofam contains more than 125,000 pseudogenes identified from 10 eukaryotic genomes and aligned within nearly 3000 families (approximately one-third of the total families in PfamA). Pseudofam uses a large-scale parallelized homology search algorithm (implemented as an extension of the PseudoPipe pipeline) to identify pseudogenes. Each identified pseudogene is assigned to its parent protein family and subsequently aligned to each other by transferring the parent domain alignments from the Pfam family. Pseudogenes are also given additional annotation based on an ontology, reflecting their mode of creation and subsequent history. In particular, our annotation highlights the association of pseudogene families with genomic features, such as segmental duplications. In addition, pseudogene families are associated with key statistics, which identify outlier families with an unusual degree of pseudogenization. The statistics also show how the number of genes and pseudogenes in families correlates across different species. Overall, they highlight the fact that housekeeping families tend to be enriched with a large number of pseudogenes.


Subject(s)
Databases, Genetic; Pseudogenes; Animals; Data Interpretation, Statistical; Genomics; Humans; Internet; Proteins/classification; Proteins/genetics; Sequence Alignment
18.
Nat Biotechnol ; 39(9): 1151-1160, 2021 09.
Article in English | MEDLINE | ID: mdl-34504347

ABSTRACT

The lack of samples for generating standardized DNA datasets for setting up a sequencing pipeline or benchmarking the performance of different algorithms limits the implementation and uptake of cancer genomics. Here, we describe reference call sets obtained from paired tumor-normal genomic DNA (gDNA) samples derived from a breast cancer cell line (which is highly heterogeneous, with an aneuploid genome, and enriched in somatic alterations) and a matched lymphoblastoid cell line. We partially validated both somatic mutations and germline variants in these call sets via whole-exome sequencing (WES) with different sequencing platforms and targeted sequencing with >2,000-fold coverage, spanning 82% of genomic regions with high confidence. Although the gDNA reference samples are not representative of primary cancer cells from a clinical sample, when setting up a sequencing pipeline, they not only minimize potential biases from technologies, assays and informatics but also provide a unique resource for benchmarking 'tumor-only' or 'matched tumor-normal' analyses.


Subject(s)
Benchmarking; Breast Neoplasms/genetics; DNA Mutational Analysis/standards; High-Throughput Nucleotide Sequencing/standards; Whole Genome Sequencing/standards; Cell Line, Tumor; Datasets as Topic; Germ Cells; Humans; Mutation; Reference Standards; Reproducibility of Results
19.
BMC Bioinformatics ; 11: 243, 2010 May 11.
Article in English | MEDLINE | ID: mdl-20459839

ABSTRACT

BACKGROUND: Many protein interactions, especially those involved in signaling, involve short linear motifs consisting of 5-10 amino acid residues that interact with modular protein domains such as the SH3 binding domains and the kinase catalytic domains. One straightforward way of identifying these interactions is by scanning for matches to the motif against all the sequences in a target proteome. However, predicting domain targets by motif sequence alone without considering other genomic and structural information has been shown to be lacking in accuracy. RESULTS: We developed an efficient search algorithm to scan the target proteome for potential domain targets and to increase the accuracy of each hit by integrating a variety of pre-computed features, such as conservation, surface propensity, and disorder. The integration is performed using naïve Bayes and a training set of validated experiments. CONCLUSIONS: By integrating a variety of biologically relevant features to predict domain targets, we demonstrated a notably improved prediction of modular protein domain targets. Combined with emerging high-resolution data of domain specificities, we believe that our approach can assist in the reconstruction of many signaling pathways.
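The two steps described above, scanning sequences for motif matches and then weighting each hit by independent features with naïve Bayes, can be sketched as follows. Everything concrete here is an assumption for illustration: the short PxxP pattern, the feature likelihoods and the prior are hypothetical, and the published algorithm and its trained parameters are not reproduced.

```python
import math
import re
from typing import Dict, List, Tuple

# Hypothetical proline-rich pattern; the lookahead also finds overlapping hits.
MOTIF = re.compile(r"(?=(P..P))")

# Hypothetical P(feature present | true target), P(feature present | false hit).
LIKELIHOODS: Dict[str, Tuple[float, float]] = {
    "conserved": (0.8, 0.3),
    "surface": (0.9, 0.5),
    "disordered": (0.7, 0.4),
}
PRIOR_TRUE = 0.1  # hypothetical prior probability that a motif hit is real


def scan(sequence: str) -> List[Tuple[int, str]]:
    """Return (start position, matched text) for every motif occurrence."""
    return [(m.start(), m.group(1)) for m in MOTIF.finditer(sequence)]


def posterior(features: Dict[str, bool]) -> float:
    """Naive Bayes posterior that a hit is a true target, given its features."""
    log_odds = math.log(PRIOR_TRUE / (1 - PRIOR_TRUE))
    for name, (p_true, p_false) in LIKELIHOODS.items():
        if features[name]:
            log_odds += math.log(p_true / p_false)
        else:
            log_odds += math.log((1 - p_true) / (1 - p_false))
    return 1 / (1 + math.exp(-log_odds))


print(scan("AAPAAPAA"))  # [(2, 'PAAP')]
strong = posterior({"conserved": True, "surface": True, "disordered": True})
weak = posterior({"conserved": False, "surface": False, "disordered": False})
print(strong > weak)  # True
```

In practice the likelihoods would be estimated from a training set of validated interactions rather than fixed by hand.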


Subject(s)
Protein Structure, Tertiary; Proteins/chemistry; Proteomics/methods; Software; Algorithms; Amino Acid Motifs; Binding Sites; Models, Molecular; Protein Conformation; Proteins/metabolism; Proteome/chemistry; Proteome/metabolism
20.
Sci Rep ; 10(1): 4983, 2020 03 18.
Article in English | MEDLINE | ID: mdl-32188929

ABSTRACT

Tumor Mutational Burden (TMB) is a measure of the abundance of somatic mutations in a tumor, which has been shown to be an emerging biomarker for both anti-PD-(L)1 treatment and prognosis; however, multiple challenges still hinder the adoption of TMB as a biomarker. The key challenges are the inconsistency of tumor mutational burden measurement among assays and the lack of a meaningful threshold for TMB classification. Here we describe a new method, ecTMB (Estimation and Classification of TMB), which uses an explicit background mutation model to predict TMB robustly and to classify samples into biologically meaningful subtypes defined by tumor mutational burden.
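For context, the conventional counting estimate of TMB divides the number of somatic mutations by the size of the sequenced region in megabases. The sketch below shows only that baseline calculation. It is not the ecTMB background mutation model, and the function name and example numbers are hypothetical.

```python
def tmb_per_mb(n_somatic_mutations: int, covered_bases: int) -> float:
    """Counting-based TMB: somatic mutations per megabase of covered sequence."""
    if covered_bases <= 0:
        raise ValueError("covered_bases must be positive")
    return n_somatic_mutations / (covered_bases / 1e6)


# Hypothetical exome-scale example: 300 mutations over 30 Mb of covered sequence.
print(tmb_per_mb(300, 30_000_000))  # 10.0
```

Because the denominator depends on each assay's target region, the same tumor can yield different values across panels, which is the inconsistency the abstract highlights.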


Subject(s)
Biomarkers, Tumor/genetics; DNA, Neoplasm/genetics; Genome, Human; Mutation; Neoplasms/classification; Neoplasms/genetics; Tumor Burden; DNA Mutational Analysis; DNA, Neoplasm/analysis; Exome; Humans; Immunotherapy/methods; Models, Statistical; Neoplasms/drug therapy; Neoplasms/pathology; Prognosis; Treatment Outcome