Results 1 - 20 of 22
1.
Hum Mol Genet ; 31(R1): R62-R72, 2022 10 20.
Article in English | MEDLINE | ID: mdl-35943817

ABSTRACT

Non-coding genetic variants outside of protein-coding genome regions play an important role in genetic and epigenetic regulation. It has become increasingly important to understand their roles, as non-coding variants often make up the majority of top findings of genome-wide association studies (GWAS). In addition, the growing popularity of disease-specific whole-genome sequencing (WGS) efforts expands the library of common and rare non-coding variants and offers unique opportunities for investigating them, since such variants are typically not detected in more limited GWAS approaches. However, the sheer size and breadth of WGS data introduce additional challenges to predicting functional impacts in terms of data analysis and interpretation. This review focuses on the recent approaches developed for efficient, at-scale annotation and prioritization of non-coding variants uncovered in WGS analyses. In particular, we review the latest scalable annotation tools, databases and functional genomic resources for interpreting the variant findings from WGS based on both experimental data and in silico predictive annotations. We also review machine learning-based predictive models for variant scoring and prioritization. We conclude with a discussion of future research directions which will enhance the data and tools necessary for the effective functional analyses of variants identified by WGS to improve our understanding of disease etiology.


Subjects
Epigenesis, Genetic; Genome-Wide Association Study; Whole Genome Sequencing; Genomics
2.
Bioinformatics ; 39(11), 2023 11 01.
Article in English | MEDLINE | ID: mdl-37947320

ABSTRACT

SUMMARY: Preparing functional genomic (FG) data with diverse assay types and file formats for integration into analysis workflows that interpret genome-wide association and other studies is a significant and time-consuming challenge. Here we introduce hipFG (Harmonization and Integration Pipeline for Functional Genomics), an automatically customized pipeline for efficient and scalable normalization of heterogeneous FG data collections into standardized, indexed, rapidly searchable analysis-ready datasets while accounting for FG datatypes (e.g. chromatin interactions, genomic intervals, quantitative trait loci). AVAILABILITY AND IMPLEMENTATION: hipFG is freely available at https://bitbucket.org/wanglab-upenn/hipFG. A Docker container is available at https://hub.docker.com/r/wanglab/hipfg.


Subjects
Genome-Wide Association Study; Software; Genomics; Chromatin; Quantitative Trait Loci
3.
Alzheimers Dement ; 20(2): 1123-1136, 2024 Feb.
Article in English | MEDLINE | ID: mdl-37881831

ABSTRACT

INTRODUCTION: The National Institute on Aging Genetics of Alzheimer's Disease Data Storage Site Alzheimer's Genomics Database (GenomicsDB) is a public knowledge base of Alzheimer's disease (AD) genetic datasets and genomic annotations. METHODS: GenomicsDB uses a custom systems architecture to adopt and enforce rigorous standards that facilitate harmonization of AD-relevant genome-wide association study summary statistics datasets with functional annotations, including over 230 million annotated variants from the AD Sequencing Project. RESULTS: GenomicsDB generates interactive reports compiled from the harmonized datasets and annotations. These reports contextualize AD-risk associations in a broader functional genomic setting and summarize them in the context of functionally annotated genes and variants. DISCUSSION: Created to make AD-genetics knowledge more accessible to AD researchers, the GenomicsDB is designed to guide users unfamiliar with genetic data in not only exploring but also interpreting this ever-growing volume of data. Scalable and interoperable with other genomics resources using data technology standards, the GenomicsDB can serve as a central hub for research and data analysis on AD and related dementias. HIGHLIGHTS: The National Institute on Aging Genetics of Alzheimer's Disease Data Storage Site (NIAGADS) offers to the public a unique, disease-centric collection of AD-relevant GWAS summary statistics datasets. Interpreting these data is challenging and requires significant bioinformatics expertise to standardize datasets and harmonize them with functional annotations on genome-wide scales. The NIAGADS Alzheimer's GenomicsDB helps overcome these challenges by providing a user-friendly public knowledge base for AD-relevant genetics that shares harmonized, annotated summary statistics datasets from the NIAGADS repository in an interpretable, easily searchable format.


Subjects
Alzheimer Disease; United States; Humans; Alzheimer Disease/genetics; Genome-Wide Association Study; National Institute on Aging (U.S.); Genomics; Databases, Factual; Genetic Predisposition to Disease/genetics
4.
Bioinformatics ; 36(12): 3879-3881, 2020 06 01.
Article in English | MEDLINE | ID: mdl-32330239

ABSTRACT

SUMMARY: We report Spark-based INFERence of the molecular mechanisms of NOn-coding genetic variants (SparkINFERNO), a scalable bioinformatics pipeline characterizing non-coding genome-wide association study (GWAS) association findings. SparkINFERNO prioritizes causal variants underlying GWAS association signals and reports relevant regulatory elements, tissue contexts and plausible target genes they affect. To achieve this, the SparkINFERNO algorithm integrates GWAS summary statistics with a large-scale collection of functional genomics datasets spanning enhancer activity, transcription factor binding, expression quantitative trait loci and other functional datasets across more than 400 tissues and cell types. Scalability is achieved by an underlying API implemented using Apache Spark and Giggle-based genomic indexing. We evaluated SparkINFERNO on large GWASs and show that SparkINFERNO is more than 60 times more efficient and scales with data size and the amount of computational resources. AVAILABILITY AND IMPLEMENTATION: SparkINFERNO runs on clusters or a single server with an Apache Spark environment, and is available at https://bitbucket.org/wanglab-upenn/SparkINFERNO or https://hub.docker.com/r/wanglab/spark-inferno. CONTACT: lswang@pennmedicine.upenn.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Genome-Wide Association Study; Quantitative Trait Loci; Algorithms; Genomics; Software
5.
Bioinformatics ; 35(6): 1033-1039, 2019 03 15.
Article in English | MEDLINE | ID: mdl-30668832

ABSTRACT

MOTIVATION: Small non-coding RNAs (sncRNAs, <100 nts) are highly abundant RNAs that regulate diverse and often tissue-specific cellular processes by associating with transcription factor complexes or binding to mRNAs. While thousands of sncRNA genes exist in the human genome, no single resource provides searchable, unified annotation, expression and processing information for full sncRNA transcripts and mature RNA products derived from these larger RNAs. RESULTS: Our goal is to establish a complete catalog of annotation, expression, processing, conservation, tissue-specificity and other biological features for all human sncRNA genes and mature products derived from all major RNA classes. The DASHR (Database of small human non-coding RNAs) v2.0 database is the first to integrate human sncRNA gene and mature product profiles obtained from multiple RNA-seq protocols. Altogether, 185 tissues/cell types and sncRNA annotations and >800 curated experiments from ENCODE and GEO/SRA across multiple RNA-seq protocols for both GRCh38/hg38 and GRCh37/hg19 assemblies are integrated in DASHR. Moreover, DASHR is the first to contain both known and novel, previously un-annotated sncRNA loci identified by unsupervised segmentation (13 times more loci with 1 678 800 total). Additionally, DASHR v2.0 adds >3 200 000 annotations for non-small RNA genes and other genomic features (long-noncoding RNAs, mRNAs, promoters, repeats). Furthermore, DASHR v2.0 introduces an enhanced user interface, an interactive experiment-by-locus table view, and sncRNA locus sorting and filtering by biological features. All annotation and expression information is directly downloadable and accessible as UCSC genome browser tracks. AVAILABILITY AND IMPLEMENTATION: DASHR v2.0 is freely available at https://lisanwanglab.org/DASHRv2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
RNA, Small Untranslated/supply & distribution; Databases, Nucleic Acid; Genomics; Humans; RNA, Long Noncoding; Sequence Analysis, RNA; Software
6.
Nucleic Acids Res ; 46(W1): W36-W42, 2018 07 02.
Article in English | MEDLINE | ID: mdl-29733404

ABSTRACT

The introduction of new high-throughput small RNA sequencing protocols that generate large-scale genomics datasets along with increasing evidence of the significant regulatory roles of small non-coding RNAs (sncRNAs) have highlighted the urgent need for tools to analyze and interpret large amounts of small RNA sequencing data. However, it remains challenging to systematically and comprehensively discover and characterize sncRNA genes and specifically-processed sncRNA products from these datasets. To fill this gap, we present Small RNA-seq Portal for Analysis of sequencing expeRiments (SPAR), a user-friendly web server for interactive processing, analysis, annotation and visualization of small RNA sequencing data. SPAR supports sequencing data generated from various experimental protocols, including smRNA-seq, short total RNA sequencing, microRNA-seq, and single-cell small RNA-seq. Additionally, SPAR includes publicly available reference sncRNA datasets from our DASHR database and from ENCODE across 185 human tissues and cell types to produce highly informative small RNA annotations across all major small RNA types and other features such as co-localization with various genomic features, precursor transcript cleavage patterns, and conservation. SPAR allows the user to compare the input experiment against reference ENCODE/DASHR datasets. SPAR currently supports analyses of human (hg19, hg38) and mouse (mm10) sequencing data. SPAR is freely available at https://www.lisanwanglab.org/SPAR.


Subjects
Computational Biology/trends; RNA, Small Untranslated/genetics; RNA/genetics; Software; Animals; Genomics; High-Throughput Nucleotide Sequencing/instrumentation; Humans; Internet; Mice; Molecular Sequence Annotation; Sequence Analysis, RNA/instrumentation; Transcriptome/genetics
7.
Nucleic Acids Res ; 46(17): 8740-8753, 2018 09 28.
Article in English | MEDLINE | ID: mdl-30113658

ABSTRACT

The majority of variants identified by genome-wide association studies (GWAS) reside in the noncoding genome, affecting regulatory elements including transcriptional enhancers. However, characterizing their effects requires the integration of GWAS results with context-specific regulatory activity and linkage disequilibrium annotations to identify causal variants underlying noncoding association signals and the regulatory elements, tissue contexts, and target genes they affect. We propose INFERNO, a novel method which integrates hundreds of functional genomics datasets spanning enhancer activity, transcription factor binding sites, and expression quantitative trait loci with GWAS summary statistics. INFERNO includes novel statistical methods to quantify empirical enrichments of tissue-specific enhancer overlap and to identify co-regulatory networks of dysregulated long noncoding RNAs (lncRNAs). We applied INFERNO to two large GWAS studies. For schizophrenia (36,989 cases, 113,075 controls), INFERNO identified putatively causal variants affecting brain enhancers for known schizophrenia-related genes. For inflammatory bowel disease (IBD) (12,882 cases, 21,770 controls), INFERNO found enrichments of immune and digestive enhancers and lncRNAs involved in regulation of the adaptive immune response. In summary, INFERNO comprehensively infers the molecular mechanisms of causal noncoding variants, providing a sensitive hypothesis generation method for post-GWAS analysis. The software is available as an open source pipeline and a web server.


Subjects
Enhancer Elements, Genetic; Genome, Human; Inflammatory Bowel Diseases/genetics; RNA, Long Noncoding/genetics; Schizophrenia/genetics; Software; Adaptive Immunity; Case-Control Studies; Female; Genetic Markers; Genetic Predisposition to Disease; Genome-Wide Association Study; Humans; Inflammatory Bowel Diseases/immunology; Inflammatory Bowel Diseases/physiopathology; Internet; Linkage Disequilibrium; Male; Phenotype; Polymorphism, Single Nucleotide; Quantitative Trait Loci; RNA, Long Noncoding/immunology; Schizophrenia/immunology; Schizophrenia/physiopathology
8.
Plant Cell ; 27(11): 3024-37, 2015 Nov.
Article in English | MEDLINE | ID: mdl-26561561

ABSTRACT

Posttranscriptional chemical modification of RNA bases is a widespread and physiologically relevant regulator of RNA maturation, stability, and function. While modifications are best characterized in short, noncoding RNAs such as tRNAs, growing evidence indicates that mRNAs and long noncoding RNAs (lncRNAs) are likewise modified. Here, we apply our high-throughput annotation of modified ribonucleotides (HAMR) pipeline to identify and classify modifications that affect Watson-Crick base pairing at three different levels of the Arabidopsis thaliana transcriptome (polyadenylated, small, and degrading RNAs). We find these modifications primarily within uncapped, degrading mRNAs and lncRNAs, suggesting they are the cause or consequence of RNA turnover. Additionally, modifications within stable mRNAs tend to occur in alternatively spliced introns, suggesting they regulate splicing. Furthermore, these modifications target mRNAs with coherent functions, including stress responses. Thus, our comprehensive analysis across multiple RNA classes yields insights into the functions of covalent RNA modifications in plant transcriptomes.


Subjects
Alternative Splicing/genetics; Arabidopsis/genetics; RNA Caps/metabolism; Arabidopsis/metabolism; Base Pairing/genetics; HEK293 Cells; HeLa Cells; Humans; Molecular Sequence Annotation; RNA, Messenger/genetics; RNA, Messenger/metabolism; Reproducibility of Results; Ribonucleotides/metabolism; Stress, Physiological/genetics; Transcriptome/genetics
9.
Nucleic Acids Res ; 44(D1): D216-22, 2016 Jan 04.
Article in English | MEDLINE | ID: mdl-26553799

ABSTRACT

Small non-coding RNAs (sncRNAs) are highly abundant RNAs, typically <100 nucleotides long, that act as key regulators of diverse cellular processes. Although thousands of sncRNA genes are known to exist in the human genome, no single database provides searchable, unified annotation and expression information for full sncRNA transcripts and mature RNA products derived from these larger RNAs. Here, we present the Database of small human noncoding RNAs (DASHR). DASHR contains the most comprehensive information to date on human sncRNA genes and mature sncRNA products. DASHR provides a simple user interface for researchers to view sequence and secondary structure, compare expression levels, and examine evidence of specific processing across all sncRNA genes and mature sncRNA products in various human tissues. DASHR annotation and expression data cover all major classes of sncRNAs including microRNAs (miRNAs), Piwi-interacting (piRNAs), small nuclear, nucleolar, cytoplasmic (sn-, sno-, scRNAs, respectively), transfer (tRNAs), and ribosomal RNAs (rRNAs). Currently, DASHR (v1.0) integrates 187 smRNA high-throughput sequencing (smRNA-seq) datasets with over 2.5 billion reads and annotation data from multiple public sources. DASHR contains annotations for ~48,000 human sncRNA genes and mature sncRNA products, 82% of which are expressed in one or more of the curated tissues. DASHR is available at http://lisanwanglab.org/DASHR.


Subjects
Databases, Nucleic Acid; RNA, Small Untranslated/metabolism; Humans; Molecular Sequence Annotation; RNA Processing, Post-Transcriptional; RNA, Small Untranslated/chemistry; RNA, Small Untranslated/genetics
10.
Bioinformatics ; 31(22): 3600-7, 2015 Nov 15.
Article in English | MEDLINE | ID: mdl-26206306

ABSTRACT

MOTIVATION: Effective computational methods for peptide-protein binding prediction can greatly help clinical peptide vaccine search and design. However, previous computational methods fail to capture key nonlinear high-order dependencies between different amino acid positions. As a result, they often produce low-quality rankings of strong binding peptides. To solve this problem, we propose nonlinear high-order machine learning methods including high-order neural networks (HONNs) with possible deep extensions and high-order kernel support vector machines to predict major histocompatibility complex-peptide binding. RESULTS: The proposed high-order methods improve the quality of binding predictions over other prediction methods. With the proposed methods, a significant gain of up to 25-40% is observed on the benchmark and reference peptide datasets and tasks. In addition, for the first time, our experiments show that pre-training with high-order semi-restricted Boltzmann machines significantly improves the performance of feed-forward HONNs. Moreover, our experiments show that the proposed shallow HONN outperforms the popular pre-trained deep neural network on most tasks, which demonstrates the effectiveness of modelling high-order feature interactions for predicting major histocompatibility complex-peptide binding. AVAILABILITY AND IMPLEMENTATION: There is no associated distributable software. CONTACT: renqiang@nec-labs.com or mark.gerstein@yale.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms; Major Histocompatibility Complex; Neural Networks, Computer; Peptides/metabolism; Amino Acid Sequence; Area Under Curve; Databases, Protein; Epitopes/chemistry; Humans; Molecular Sequence Data; Peptides/chemistry; Protein Binding; ROC Curve; Support Vector Machine
11.
Bioinformatics ; 31(8): 1290-2, 2015 Apr 15.
Article in English | MEDLINE | ID: mdl-25480377

ABSTRACT

We implemented a high-throughput identification pipeline for promoter-interacting enhancer elements that streamlines the workflow from mapping raw Hi-C reads, through identifying DNA-DNA interacting fragments with high confidence and quality control and detecting histone modifications and DNase hypersensitive enrichments in putative enhancer elements, to ultimately extracting possible intra- and inter-chromosomal enhancer-target gene relationships. AVAILABILITY AND IMPLEMENTATION: This software package is designed to run on high-performance computing clusters with Oracle Grid Engine. The source code is freely available under the MIT license for academic and nonprofit use. The source code and instructions are available at the Wang lab website (http://wanglab.pcbi.upenn.edu/hippie/). It is also provided as an Amazon Machine Image to be used directly on Amazon Cloud with minimal installation. CONTACT: lswang@mail.med.upenn.edu or bdgregor@sas.upenn.edu SUPPLEMENTARY INFORMATION: Supplementary Material is available at Bioinformatics online.


Subjects
DNA/genetics; DNA/metabolism; Enhancer Elements, Genetic/genetics; Promoter Regions, Genetic/genetics; Sequence Analysis, DNA/methods; Humans; Programming Languages
12.
bioRxiv ; 2023 Apr 25.
Article in English | MEDLINE | ID: mdl-37162864

ABSTRACT

Preparing functional genomic (FG) data with diverse assay types and file formats for integration into analysis workflows that interpret genome-wide association and other studies is a significant and time-consuming challenge. Here we introduce hipFG, an automatically customized pipeline for efficient and scalable normalization of heterogeneous FG data collections into standardized, indexed, rapidly searchable analysis-ready datasets while accounting for FG datatypes (e.g., chromatin interactions, genomic intervals, quantitative trait loci).

13.
medRxiv ; 2023 Jul 08.
Article in English | MEDLINE | ID: mdl-37461624

ABSTRACT

Limited ancestral diversity has impaired our ability to detect risk variants more prevalent in non-European ancestry groups in genome-wide association studies (GWAS). We constructed and analyzed a multi-ancestry GWAS dataset in the Alzheimer's Disease (AD) Genetics Consortium (ADGC) to test for novel shared and ancestry-specific AD susceptibility loci and evaluate underlying genetic architecture in 37,382 non-Hispanic White (NHW), 6,728 African American, 8,899 Hispanic (HIS), and 3,232 East Asian individuals, performing within-ancestry fixed-effects meta-analysis followed by a cross-ancestry random-effects meta-analysis. We identified 13 loci with cross-ancestry associations, including known loci at/near CR1, BIN1, TREM2, CD2AP, PTK2B, CLU, SHARPIN, MS4A6A, PICALM, ABCA7, and APOE, and two novel loci not previously reported at 11p12 (LRRC4C) and 12q24.13 (LHX5-AS1). Reflecting the power of diverse ancestry in GWAS, we observed the SHARPIN locus using 7.1% of the sample size of the original discovering single-ancestry GWAS (n = 788,989). We additionally identified three genome-wide significant (GWS) ancestry-specific loci at/near PTPRK (P = 2.4×10⁻⁸) and GRB14 (P = 1.7×10⁻⁸) in HIS, and KIAA0825 (P = 2.9×10⁻⁸) in NHW. Pathway analysis implicated multiple amyloid regulation pathways (strongest with adjusted P = 1.6×10⁻⁴) and the classical complement pathway (adjusted P = 1.3×10⁻³). Genes at/near our novel loci have known roles in neuronal development (LRRC4C, LHX5-AS1, and PTPRK) and insulin receptor activity regulation (GRB14). These findings provide compelling support for using traditionally underrepresented populations for gene discovery, even with smaller sample sizes.
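
The sample-size comparison in this abstract can be checked with a few lines of arithmetic. The sketch below assumes that the 7.1% figure is the combined size of the four ancestry groups divided by the sample size of the original GWAS; the abstract does not spell out the calculation, so that reading is an assumption, and the variable names are illustrative.

```python
# Cohort sizes as reported in the abstract.
cohort_sizes = {
    "non-Hispanic White": 37_382,
    "African American": 6_728,
    "Hispanic": 8_899,
    "East Asian": 3_232,
}
# Sample size of the original single-ancestry GWAS, as reported in the abstract.
original_gwas_n = 788_989

# Assumption: the 7.1% compares the combined multi-ancestry total to that size.
total_n = sum(cohort_sizes.values())
fraction = total_n / original_gwas_n

print(f"multi-ancestry total: {total_n:,}")
print(f"share of original sample size: {fraction:.1%}")
```

Under that assumption the combined total is 56,241 individuals, which is consistent with the reported 7.1%.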

14.
NAR Genom Bioinform ; 4(1): lqab123, 2022 Mar.
Article in English | MEDLINE | ID: mdl-35047815

ABSTRACT

Querying massive functional genomic and annotation data collections, linking and summarizing the query results across data sources/data types are important steps in high-throughput genomic and genetic analytical workflows. However, these steps are made difficult by the heterogeneity and breadth of data sources, experimental assays, biological conditions/tissues/cell types and file formats. FILER (FunctIonaL gEnomics Repository) is a framework for querying large-scale genomics knowledge with a large, curated integrated catalog of harmonized functional genomic and annotation data coupled with a scalable genomic search and querying interface. FILER uniquely provides: (i) streamlined access to >50 000 harmonized, annotated genomic datasets across >20 integrated data sources, >1100 tissues/cell types and >20 experimental assays; (ii) a scalable genomic querying interface; and (iii) ability to analyze and annotate user's experimental data. This rich resource spans >17 billion GRCh37/hg19 and GRCh38/hg38 genomic records. Our benchmark querying 7 × 10⁹ hg19 FILER records shows FILER is highly scalable, with a sub-linear 32-fold increase in querying time when increasing the number of queries 1000-fold from 1000 to 1 000 000 intervals. Together, these features facilitate reproducible research and streamline integrating/querying large-scale genomic data within analyses/workflows. FILER can be deployed on cloud or local servers (https://bitbucket.org/wanglab-upenn/FILER) for integration with custom pipelines and is freely available (https://lisanwanglab.org/FILER).
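
To put the "sub-linear" benchmark claim into numbers, the sketch below derives the scaling exponent implied by the two figures in the abstract. It assumes a simple power-law model, time proportional to n**k, which is an illustrative simplification and not a model stated by the authors; the variable names are likewise illustrative.

```python
import math

# Figures from the abstract: queries grow from 1000 to 1 000 000 intervals,
# while querying time grows 32-fold.
query_growth = 1_000_000 / 1_000
time_growth = 32

# Assumed model (not from the source): time ~ c * n**k.
# Then time_growth = query_growth**k, so k = log(time_growth) / log(query_growth).
exponent = math.log(time_growth) / math.log(query_growth)

print(f"implied scaling exponent k = {exponent:.2f}")
```

An exponent near 0.5 means that, under this model, querying time grows roughly with the square root of the number of query intervals, whereas linear scaling would give k = 1.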

15.
J Alzheimers Dis ; 86(1): 461-477, 2022.
Article in English | MEDLINE | ID: mdl-35068457

ABSTRACT

BACKGROUND: Recent Alzheimer's disease (AD) genetics findings from genome-wide association studies (GWAS) span progressively larger and more diverse populations and outcomes. Currently, there is no up-to-date resource providing harmonized and searchable information on all AD genetic associations found by GWAS, nor linking the reported genetic variants and genes with functional and genomic annotations. OBJECTIVE: To create an integrated, harmonized, and literature-derived collection of population-specific AD genetic associations. METHODS: We developed the Alzheimer's Disease Variant Portal (ADVP), an extensive collection of associations curated from >200 GWAS publications from the Alzheimer's Disease Genetics Consortium and other consortia. Genetic associations were systematically extracted, harmonized, and annotated from both the genome-wide significant and suggestive loci reported in these publications. To ensure consistent representation of AD genetic findings, all the extracted genetic association information was harmonized across specifically designed publication, variant, and association categories. RESULTS: ADVP V1.0 (February 2021) catalogs 6,990 associations related to disease-risk, expression quantitative traits, endophenotypes, or neuropathology. This extensive harmonization effort led to a catalog containing >900 loci, >1,800 variants, >80 cohorts, and 8 populations. In addition, ADVP provides investigators with a seamless integration of genomic and publicly available functional annotations across multiple databases per harmonized variant and gene records, thus facilitating further understanding and analyses of these genetics findings. CONCLUSION: ADVP is a valuable resource for investigators to quickly and systematically explore high-confidence AD genetic findings and provides insights into population-specific AD genetic architecture. ADVP is continually maintained and enhanced by NIAGADS and is freely accessible at https://advp.niagads.org.


Subjects
Alzheimer Disease; Genome-Wide Association Study; Alzheimer Disease/genetics; Endophenotypes; Genetic Predisposition to Disease/genetics; Humans; Polymorphism, Single Nucleotide
16.
Methods Mol Biol ; 2254: 73-91, 2021.
Article in English | MEDLINE | ID: mdl-33326071

ABSTRACT

The INFERNO method provides an integrative computational framework for characterizing the causal variants, tissue contexts, affected regulatory mechanisms, and target genes underlying noncoding genetic variants associated with any phenotype or disease of interest. Here we describe the computational steps required to run the full INFERNO pipeline on any dataset of interest.


Subjects
Genetic Predisposition to Disease; Genome-Wide Association Study/methods; RNA, Long Noncoding/genetics; Software; Humans; Molecular Sequence Annotation
17.
BMC Bioinformatics ; 11 Suppl 8: S1, 2010 Oct 26.
Article in English | MEDLINE | ID: mdl-21034426

ABSTRACT

BACKGROUND: We consider the problem of identifying motifs, recurring or conserved patterns, in biological sequence data sets. To solve this task, we present a new deterministic algorithm for finding patterns that are embedded as exact or inexact instances in all or most of the input strings. RESULTS: The proposed algorithm (1) improves search efficiency compared to existing algorithms, and (2) scales well with the size of the alphabet. On a synthetic planted DNA motif finding problem our algorithm is over 10× more efficient than MITRA, PMSPrune, and RISOTTO for long motifs. Improvements are orders of magnitude higher in the same setting with large alphabets. On benchmark TF-binding site problems (FNP, CRP, LexA) we observed a reduction in running time of over 12×, with high detection accuracy. The algorithm was also successful in rapidly identifying protein motifs in Lipocalin, Zinc metallopeptidase, and supersecondary structure motifs for Cadherin and Immunoglobulin families. CONCLUSIONS: Our algorithm reduces the computational complexity of current motif finding algorithms and demonstrates strong running time improvements over existing exact algorithms, especially in important and difficult cases of large-alphabet sequences.


Subjects
Algorithms; Binding Sites; Computational Biology/methods; Pattern Recognition, Automated/methods; Sequence Analysis, DNA/methods; Sequence Analysis, Protein/methods; Amino Acid Motifs; Artificial Intelligence; DNA/chemistry; Databases, Genetic; Nucleic Acid Conformation; Protein Conformation; Software; Transcription Factors/chemistry
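
The planted-motif task in the abstract for entry 17 (patterns embedded as exact or inexact instances in the input strings) can be illustrated with a naive exhaustive baseline. The sketch below is not the authors' algorithm: it simply enumerates every candidate pattern of length k and keeps those that occur in every sequence within a Hamming-distance budget. The function names and the toy sequences are hypothetical.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(motif: str, sequence: str, max_mismatches: int) -> bool:
    """True if some window of the sequence is within max_mismatches of the motif."""
    k = len(motif)
    return any(
        hamming(motif, sequence[i:i + k]) <= max_mismatches
        for i in range(len(sequence) - k + 1)
    )


def find_motifs(sequences, k, max_mismatches, alphabet="ACGT"):
    """Brute force: test all len(alphabet)**k candidates against every sequence."""
    candidates = ("".join(chars) for chars in product(alphabet, repeat=k))
    return [
        motif
        for motif in candidates
        if all(occurs_within(motif, seq, max_mismatches) for seq in sequences)
    ]


# Toy input: ACGT is planted exactly once and with one mismatch twice.
toy_sequences = ["TTACGTTT", "GGACGAGG", "CCTCGTCC"]
motifs = find_motifs(toy_sequences, k=4, max_mismatches=1)
print("ACGT" in motifs)
```

The candidate space grows as the alphabet size raised to the motif length, so this baseline quickly becomes impractical for long motifs or protein alphabets, which is the regime the abstract highlights.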
18.
NAR Genom Bioinform ; 2(2): lqaa022, 2020 Jun.
Article in English | MEDLINE | ID: mdl-32270138

ABSTRACT

Most regulatory chromatin interactions are mediated by various transcription factors (TFs) and involve physically interacting elements such as enhancers, insulators or promoters. To map these elements and interactions at a fine scale, we developed HIPPIE2 that analyzes raw reads from high-throughput chromosome conformation (Hi-C) experiments to identify precise loci of DNA physically interacting regions (PIRs). Unlike standard genome binning approaches (e.g. 10-kb to 1-Mb bins), HIPPIE2 dynamically infers the physical locations of PIRs using the distribution of restriction sites to increase analysis precision and resolution. We applied HIPPIE2 to in situ Hi-C datasets across six human cell lines (GM12878, IMR90, K562, HMEC, HUVEC, NHEK) with matched ENCODE/Roadmap functional genomic data. HIPPIE2 detected 1042 738 distinct PIRs, with high resolution (average PIR length of 1006 bp) and high reproducibility (92.3% in GM12878). PIRs are enriched for epigenetic marks (H3K27ac, H3K4me1) and open chromatin, suggesting active regulatory roles. HIPPIE2 identified 2.8 million significant PIR-PIR interactions, 27.2% of which were enriched for TF binding sites. 50 608 interactions were enhancer-promoter interactions and were enriched for 33 TFs, including known DNA looping/long-range mediators. These findings demonstrate that the novel dynamic approach of HIPPIE2 (https://bitbucket.com/wanglab-upenn/HIPPIE2) enables the characterization of chromatin and regulatory interactions with high resolution and reproducibility.
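
The contrast this abstract draws between fixed genome bins and restriction-site-based localization can be sketched on a toy sequence. The example below only shows how fragment boundaries derived from restriction sites differ from fixed-size bins; it is not the HIPPIE2 method. The recognition site GATC, the cut position at the start of the site, and the function names are assumptions made for illustration.

```python
import re


def restriction_fragments(sequence: str, site: str = "GATC"):
    """Split a sequence into (start, end) fragments at each occurrence of the site.

    Assumption: the cut falls immediately before the recognition site.
    """
    # Lookahead so that overlapping occurrences of the site are also found.
    cuts = [m.start() for m in re.finditer(f"(?={re.escape(site)})", sequence)]
    bounds = [0] + cuts + [len(sequence)]
    # Drop empty fragments, e.g. when the site sits at position 0.
    return [(start, end) for start, end in zip(bounds, bounds[1:]) if end > start]


def fixed_bins(length: int, bin_size: int):
    """Split a region of the given length into consecutive fixed-size bins."""
    return [(start, min(start + bin_size, length)) for start in range(0, length, bin_size)]


toy_sequence = "AAGATCTTTTGATCAA"
fragments = restriction_fragments(toy_sequence)
bins = fixed_bins(len(toy_sequence), bin_size=5)
print(fragments)
print(bins)
```

Fragment boundaries follow the sequence itself, so their lengths vary with the spacing of restriction sites, whereas fixed bins ignore where the enzyme can actually cut.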

19.
Comput Struct Biotechnol J ; 18: 1539-1547, 2020.
Article in English | MEDLINE | ID: mdl-32637050

ABSTRACT

Recent high-throughput structure-sensitive genome-wide sequencing-based assays have enabled large-scale studies of RNA structure, and robust transcriptome-wide computational prediction of individual RNA structures across RNA classes from these assays has potential to further improve the prediction accuracy. Here, we describe HiPR, a novel method for RNA structure prediction at single-nucleotide resolution that combines high-throughput structure probing data (DMS-seq, DMS-MaPseq) with a novel probabilistic folding algorithm. On validation data spanning a variety of RNA classes, HiPR often increases accuracy for predicting RNA structures, giving researchers new tools to study RNA structure.

20.
Methods Mol Biol ; 1562: 211-229, 2017.
Article in English | MEDLINE | ID: mdl-28349463

ABSTRACT

RNA molecules are often altered post-transcriptionally by the covalent modification of their nucleotides. These modifications are known to modulate the structure, function, and activity of RNAs. When reverse transcribed into cDNA during RNA sequencing library preparation, atypical (modified) ribonucleotides that affect Watson-Crick base pairing will interfere with reverse transcriptase (RT), resulting in cDNA products with mis-incorporated bases or prematurely terminated RNA products. These interactions with RT can therefore be inferred from mismatch patterns in the sequencing reads, and are distinguishable from simple base-calling errors, single-nucleotide polymorphisms (SNPs), or RNA editing sites. Here, we describe a computational protocol for the in silico identification of modified ribonucleotides from RT-based RNA-seq read-out using the High-throughput Analysis of Modified Ribonucleotides (HAMR) software. HAMR can identify these modifications transcriptome-wide with single nucleotide resolution, and also differentiate between different types of modifications to predict modification identity. Researchers can use HAMR to identify and characterize RNA modifications using RNA-seq data from a variety of common RT-based sequencing protocols such as Poly(A), total RNA-seq, and small RNA-seq.


Subjects
Computational Biology/methods; High-Throughput Nucleotide Sequencing; RNA Processing, Post-Transcriptional; RNA/genetics; Software; Computer Simulation; Databases, Nucleic Acid; Genome; Genomics/methods; Humans; Web Browser