Results 1 - 11 of 11
1.
Mol Biol Evol ; 41(5), 2024 May 03.
Article in English | MEDLINE | ID: mdl-38758089

ABSTRACT

Polyploidy is a prominent mechanism of plant speciation and adaptation, yet a mechanistic understanding of duplicated gene regulation remains elusive. Chromatin structure dynamics are suggested to govern gene regulatory control. Here, we characterized genome-wide nucleosome organization and chromatin accessibility in allotetraploid cotton, Gossypium hirsutum (AADD, 2n = 4x = 52), relative to its two diploid parents (AA or DD genome) and their synthetic diploid hybrid (AD), using DNS-seq. The larger A-genome exhibited wider average nucleosome spacing in diploids, and this intergenomic difference diminished in the allopolyploid but not in the hybrid. Allopolyploidization was also accompanied by increased accessibility at promoters genome-wide and by synchronized cis-regulatory motifs between subgenomes. A prominent cis-acting control was inferred for chromatin dynamics and demonstrated by transposable element removal from promoters. Linking accessibility to gene expression patterns, we found distinct regulatory effects for hybridization and later allopolyploid stages, including nuanced establishment of homoeolog expression bias and expression level dominance. Histone gene expression and nucleosome organization are coordinated through chromatin accessibility. Our study demonstrates the capability to track high-resolution chromatin structure dynamics and reveals their role in the evolution of cis-regulatory landscapes and duplicate gene expression in polyploids, illuminating regulatory ties to subgenomic asymmetry and dominance.


Subject(s)
Chromatin, Diploidy, Molecular Evolution, Gossypium, Polyploidy, Gossypium/genetics, Chromatin/genetics, Plant Gene Expression Regulation, Plant Genome, Nucleosomes/genetics, Duplicate Genes, Genetic Promoter Regions
2.
PLoS Genet ; 17(8): e1009689, 2021 08.
Article in English | MEDLINE | ID: mdl-34383745

ABSTRACT

Elucidating the transcriptional regulatory networks that underlie growth and development requires robust ways to define the complete set of transcription factor (TF) binding sites. Although TF-binding sites are known to be generally located within accessible chromatin regions (ACRs), pinpointing these DNA regulatory elements globally remains challenging. Current approaches primarily identify binding sites for a single TF (e.g., ChIP-seq), or globally detect ACRs but lack the resolution to consistently define TF-binding sites (e.g., DNase-seq, ATAC-seq). To address this challenge, we developed MNase-defined cistrome-Occupancy Analysis (MOA-seq), a high-resolution (< 30 bp), high-throughput, and genome-wide strategy to globally identify putative TF-binding sites within ACRs. As a proof of concept, we applied MOA-seq to developing maize ears and defined a cistrome of 145,000 MOA footprints (MFs). While a substantial majority (76%) of the known ATAC-seq ACRs intersected with the MFs, only a minority of MFs overlapped with the ATAC peaks, indicating that the majority of MFs were novel and not detected by ATAC-seq. MFs were associated with promoters and significantly enriched for TF-binding and long-range chromatin interaction sites, including for the well-characterized FASCIATED EAR4, KNOTTED1, and TEOSINTE BRANCHED1. Importantly, the MOA-seq strategy improved the spatial resolution of TF-binding prediction and allowed us to identify 215 motif families collectively distributed over more than 100,000 non-overlapping, putatively occupied binding sites across the genome. Our study presents a simple, efficient, and high-resolution approach to identify putative TF footprints and binding motifs genome-wide, to ultimately define a native cistrome atlas.
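
The asymmetric overlap reported above (most ACRs contain a footprint, while only a minority of footprints fall inside an ACR) can be illustrated with a small interval comparison. This is a minimal sketch, not the authors' pipeline: the function names and all coordinates are hypothetical, intervals are assumed to be half-open and on a single chromosome, and the scan is deliberately naive.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] < b[1] and b[0] < a[1]


def fraction_overlapping(queries, targets):
    """Fraction of query intervals that overlap any target (simple O(n*m) scan)."""
    if not queries:
        return 0.0
    hits = sum(1 for q in queries if any(overlaps(q, t) for t in targets))
    return hits / len(queries)


# Made-up coordinates: a few broad accessible regions and many short footprints.
acrs = [(100, 400), (1000, 1300), (5000, 5200)]
footprints = [(150, 175), (300, 320), (1100, 1125), (2000, 2020),
              (3000, 3030), (4000, 4025), (6000, 6020)]

acr_fraction = fraction_overlapping(acrs, footprints)        # ACRs hit by a footprint
footprint_fraction = fraction_overlapping(footprints, acrs)  # footprints inside an ACR
```

In this toy data two of three regions contain a footprint, while only three of seven footprints lie in a region, so the two fractions differ depending on which set is used as the query. For genome-scale data a sorted sweep or an interval tree would replace the nested scan.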


Subject(s)
DNA Footprinting/methods, Genetic Promoter Regions, Transcription Factors/metabolism, Zea mays/genetics, Binding Sites, Chromatin Immunoprecipitation Sequencing, High-Throughput Nucleotide Sequencing, Plant Proteins/genetics, Plant Proteins/metabolism, Transcriptional Regulatory Elements, Whole Genome Sequencing
3.
PLoS Comput Biol ; 16(11): e1007450, 2020 11.
Article in English | MEDLINE | ID: mdl-33156882

ABSTRACT

Reusability is part of the FAIR data principles, which aim to make data Findable, Accessible, Interoperable, and Reusable. One of the current efforts to increase the reusability of public genomics data has been to focus on the inclusion of quality metadata associated with the data. When necessary metadata are missing, most researchers will consider the data useless. In this study, we developed a framework to predict the missing metadata of gene expression datasets to maximize their reusability. We found that when using predicted data to conduct other analyses, it is not optimal to use all the predicted data. Instead, one should use only the subset of data that can be predicted accurately. We proposed a new metric called Proportion of Cases Accurately Predicted (PCAP), which is optimized in our specifically designed machine learning pipeline. The new approach performed better than pipelines using commonly used metrics such as F1-score in terms of maximizing the reusability of data with missing values. We also found that different variables might need to be predicted using different machine learning methods and/or different data processing protocols. Using differential gene expression analysis as an example, we showed that when missing variables are accurately predicted, the corresponding gene expression data can be reliably used in downstream analyses.


Subject(s)
Data Curation, Gene Expression, Metadata, Computational Biology
4.
BMC Genomics ; 21(1): 773, 2020 Nov 10.
Article in English | MEDLINE | ID: mdl-33167858

ABSTRACT

BACKGROUND: Information on protein-protein interactions affected by mutations is very useful for understanding the biological effect of mutations and for developing treatments targeting the interactions. In this study, we developed a natural language processing (NLP) based machine learning approach for extracting such information from literature. Our aim is to identify journal abstracts or paragraphs in full-text articles that contain at least one occurrence of a protein-protein interaction (PPI) affected by a mutation. RESULTS: Our system makes use of the latest NLP methods with a large number of engineered features, including some based on pre-trained word embeddings. Our final model achieved satisfactory performance in the Document Triage Task of the BioCreative VI Precision Medicine Track, with the highest recall and a comparable F1-score. CONCLUSIONS: The performance of our method indicates that it is ideally suited for being combined with manual annotations. Our machine learning framework and engineered features will also be very helpful for other researchers to further improve this and other related biological text mining tasks using either traditional machine learning or deep-learning-based methods.


Subject(s)
Data Mining, Natural Language Processing, Protein Interaction Mapping, Machine Learning, Mutation
5.
BMC Bioinformatics ; 19(1): 131, 2018 04 11.
Article in English | MEDLINE | ID: mdl-29642840

ABSTRACT

BACKGROUND: Identification of functional elements of a genome often requires dividing a sequence of measurements along a genome into segments where adjacent segments have different properties, such as different mean values. Despite dozens of algorithms developed to address this problem in genomics research, methods with improved accuracy and speed are still needed to effectively tackle both existing and emerging genomic and epigenomic segmentation problems. RESULTS: We designed an efficient algorithm, called iSeg, for segmentation of genomic and epigenomic profiles. iSeg first utilizes dynamic programming to identify candidate segments and test for significance. It then uses a novel data structure based on two coupled balanced binary trees to detect overlapping significant segments and update them simultaneously during searching and refinement stages. Refinement and merging of significant segments are performed at the end to generate the final set of segments. By using an objective function based on the p-values of the segments, the algorithm can serve as a general computational framework to be combined with different assumptions on the distributions of the data. As a general segmentation method, it can segment different types of genomic and epigenomic data, such as DNA copy number variation, nucleosome occupancy, nuclease sensitivity, and differential nuclease sensitivity data. Using simple t-tests to compute p-values across multiple datasets of different types, we evaluate iSeg using both simulated and experimental datasets and show that it performs satisfactorily when compared with some other popular methods, which often employ more sophisticated statistical models. Implemented in C++, iSeg is also very computationally efficient, well suited for large numbers of input profiles and data with very long sequences. 
CONCLUSIONS: We have developed an efficient general-purpose segmentation tool and showed that it had comparable or more accurate results than many of the most popular segment-calling algorithms used in contemporary genomic data analysis. iSeg is capable of analyzing datasets that have both positive and negative values. Tunable parameters allow users to readily adjust the statistical stringency to best match the biological nature of individual datasets, including widely or sparsely mapped genomic datasets or those with non-normal distributions.
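
To make the idea of scoring candidate segments with a simple t-test concrete, the following sketch scans fixed-width windows of a signed profile and keeps those whose one-sample t statistic is large in magnitude. It is an illustration under stated assumptions and not the iSeg algorithm: iSeg is described as searching variable-length segments with dynamic programming and balanced trees and as working from p-values, whereas this hypothetical helper uses one window size, a baseline of zero, and a raw threshold on |t|.

```python
import math
import statistics


def t_statistic(values, baseline=0.0):
    """One-sample t statistic of `values` against a baseline mean."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    if sd == 0.0:
        # Constant window: no evidence if it equals the baseline, unbounded otherwise.
        return 0.0 if mean == baseline else math.copysign(math.inf, mean - baseline)
    return (mean - baseline) / (sd / math.sqrt(n))


def candidate_segments(profile, window, threshold):
    """Return (start, end, t) for non-overlapping windows with |t| above threshold.

    Hypothetical helper for illustration only; it checks a single window size.
    """
    hits = []
    for start in range(0, len(profile) - window + 1, window):
        t = t_statistic(profile[start:start + window])
        if abs(t) > threshold:
            hits.append((start, start + window, t))
    return hits


# Toy profile with a flat stretch, a positive stretch, and a negative stretch.
profile = [0.1, -0.2, 0.0, 0.1, 2.1, 1.9, 2.0, 2.2, -1.8, -2.1, -2.0, -1.9]
segments = candidate_segments(profile, window=4, threshold=5.0)
```

The sign of `t` distinguishes positive from negative segments, which matters for profiles that take both positive and negative values, such as difference profiles.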


Subject(s)
Algorithms, Genetic Databases, Epigenomics, Computer Simulation, DNA Copy Number Variations/genetics, Deoxyribonucleases/metabolism, Genome, Humans, Statistical Models, Neoplasms/genetics, Zea mays/genetics
6.
BMC Med Inform Decis Mak ; 18(Suppl 2): 42, 2018 07 23.
Article in English | MEDLINE | ID: mdl-30066644

ABSTRACT

BACKGROUND: Relationships between bio-entities (genes, proteins, diseases, etc.) constitute a significant part of our knowledge. Most of this information is documented as unstructured text in different forms, such as books, articles and on-line pages. Automatic extraction of such information and storing it in structured form could help researchers more easily access such information and also make it possible to incorporate it in advanced integrative analysis. In this study, we developed a novel approach to extract bio-entity relationship information using Natural Language Processing (NLP) and a graph-theoretic algorithm. METHODS: Our method, called GRGT (Grammatical Relationship Graph for Triplets), not only extracts the pairs of terms that have certain relationships, but also extracts the type of relationship (the word describing the relationships). In addition, the directionality of the relationship can also be extracted. Our method is based on the assumption that a triplet exists for a pair of interactions. A triplet is defined as two terms (entities) and an interaction word describing the relationship of the two terms in a sentence. We first use a sentence parsing tool to obtain the sentence structure represented as a dependency graph where words are nodes and edges are typed dependencies. The shortest paths among the pairs of words in the triplet are then extracted, which form the basis for our information extraction method. A flexible pattern-matching scheme was then used to match a triplet graph with an unknown relationship to the labeled (True or False) triplet graphs in the database. RESULTS: We applied the method to three benchmark datasets to extract protein-protein interactions (PPIs), and obtained better precision than the top performing methods in the literature. CONCLUSIONS: We have developed a method to extract protein-protein interactions from biomedical literature. PPIs extracted by our method have higher precision than those extracted by other methods, suggesting that our method can be used to effectively extract PPIs and deposit them into databases. Beyond extracting PPIs, our method could be easily extended to extracting relationship information between other bio-entities.
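
The shortest-path step described in the methods can be sketched with a breadth-first search over a dependency graph. This is an illustrative sketch only: the function name, the toy sentence, and its edge labels are assumptions, the graph is treated as undirected, and no real parser output or GRGT code is reproduced here.

```python
from collections import deque


def shortest_path(edges, source, target):
    """Breadth-first search over (head, dependent, label) edges, ignoring direction."""
    neighbors = {}
    for head, dependent, _label in edges:
        neighbors.setdefault(head, []).append(dependent)
        neighbors.setdefault(dependent, []).append(head)
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in neighbors.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # the two words are not connected


# Toy parse of "ProteinA strongly binds ProteinB in yeast" (labels are illustrative).
edges = [
    ("binds", "ProteinA", "nsubj"),
    ("binds", "ProteinB", "obj"),
    ("binds", "strongly", "advmod"),
    ("binds", "yeast", "obl"),
    ("yeast", "in", "case"),
]
path = shortest_path(edges, "ProteinA", "ProteinB")
```

Here the path between the two entities passes through the interaction word "binds", so the three words of the triplet are recovered together. A fuller version would keep the edge labels along the path, since typed dependencies are what a pattern-matching step would compare.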


Subject(s)
Algorithms, Information Storage and Retrieval/methods, Natural Language Processing, Proteins/metabolism, Factual Databases
7.
Genome Biol ; 21(1): 165, 2020 07 06.
Article in English | MEDLINE | ID: mdl-32631399

ABSTRACT

BACKGROUND: The functional genome of agronomically important plant species remains largely unexplored, yet presents a virtually untapped resource for targeted crop improvement. Functional elements of regulatory DNA revealed through profiles of chromatin accessibility can be harnessed for fine-tuning gene expression to optimal phenotypes in specific environments. RESULTS: Here, we investigate the non-coding regulatory space in the maize (Zea mays) genome during early reproductive development of pollen- and grain-bearing inflorescences. Using an assay for differential sensitivity of chromatin to micrococcal nuclease (MNase) digestion, we profile accessible chromatin and nucleosome occupancy in these largely undifferentiated tissues and classify at least 1.6% of the genome as accessible, with the majority of MNase hypersensitive sites marking proximal promoters, but also 3' ends of maize genes. This approach maps regulatory elements to footprint-level resolution. Integration of complementary transcriptome profiles and transcription factor occupancy data is used to annotate regulatory factors, such as combinatorial transcription factor binding motifs and long non-coding RNAs, that potentially contribute to organogenesis, including tissue-specific regulation between male and female inflorescence structures. Finally, genome-wide association studies for inflorescence architecture traits based solely on functional regions delineated by MNase hypersensitivity reveal new SNP-trait associations in known regulators of inflorescence development as well as new candidates. CONCLUSIONS: These analyses provide a comprehensive look into the cis-regulatory landscape during inflorescence differentiation in a major cereal crop, which ultimately shapes architecture and influences yield potential.


Subject(s)
Chromatin Assembly and Disassembly, Developmental Gene Expression Regulation, Plant Gene Expression Regulation, Inflorescence/growth & development, Zea mays/growth & development, Plant Genome, Genome-Wide Association Study, Inflorescence/metabolism, Micrococcal Nuclease, Genetic Promoter Regions, Long Noncoding RNA/metabolism, Zea mays/metabolism
8.
Database (Oxford) ; 2019, 2019 01 01.
Article in English | MEDLINE | ID: mdl-30624652

ABSTRACT

Information about the interactions between chemical compounds and proteins is indispensable for understanding the regulation of biological processes and the development of therapeutic drugs. Manually extracting such information from biomedical literature is very time- and resource-intensive. In this study, we propose a computational method to automatically extract chemical-protein interactions (CPIs) from a given text. Our method extracts CPI pairs and CPI triplets from sentences, where a CPI pair consists of a chemical compound and a protein name, and a CPI triplet consists of a CPI pair along with an interaction word describing their relationship. We extracted a diverse set of features from sentences that were used to build multiple machine learning models. Our models contain both simple features, which can be directly computed from sentences, and more sophisticated features derived using sentence structure analysis techniques. For example, one set of features was extracted based on the shortest paths between the CPI pairs or among the CPI triplets in the dependency graphs obtained from sentence parsing. We designed a three-stage approach to predict the multiple categories of CPIs. Our method performed the best among systems that use non-deep learning methods and outperformed several deep-learning-based systems in track 5 of the BioCreative VI challenge. The features we designed in this study are informative and can be applied to other machine learning methods including deep learning.


Subject(s)
Computational Biology/methods, Data Mining/methods, Chemical Databases, Protein Databases, Machine Learning, Humans, Pharmaceutical Preparations/chemistry, Pharmaceutical Preparations/metabolism, Proteins/chemistry, Proteins/metabolism, Semantics, Software
9.
Sci Rep ; 8(1): 16335, 2018 11 05.
Article in English | MEDLINE | ID: mdl-30397274

ABSTRACT

Molecular mechanisms underlying the health disparity of prostate cancer (PCa) have not been fully determined. In this study, we applied a bioinformatic approach to identify and validate dysregulated genes associated with tumor aggressiveness in African American (AA) compared to Caucasian American (CA) men with PCa. We retrieved and analyzed microarray data from 619 PCa patients, 412 AA and 207 CA, and we validated these genes in tumor tissues and cell lines by Real-Time PCR, Western blot, immunocytochemistry (ICC) and immunohistochemistry (IHC) analyses. We identified 362 genes that are differentially expressed in AA men and involved in regulating signaling pathways associated with tumor aggressiveness. In PCa tissues and cells, NKX3.1, APPL2, TPD52, LTC4S, ALDH1A3 and AMD1 transcripts were significantly upregulated (p < 0.05) compared to normal cells. IHC confirmed the overexpression of TPD52 (p = 0.0098) and LTC4S (p < 0.0005) in AA compared to CA men. ICC and Western blot analyses additionally corroborated this observation in PCa cells. These findings suggest that dysregulation of transcripts in PCa may drive the disparity of PCa outcomes and provide new insights into the development of new therapeutic agents against aggressive tumors. More studies are warranted to investigate the clinical significance of these dysregulated genes in promoting the oncogenic pathways in AA men.


Subject(s)
Black or African American/genetics, Neoplastic Gene Expression Regulation, Prostatic Neoplasms/ethnology, Prostatic Neoplasms/genetics, Adult, Black or African American/statistics & numerical data, Tumor Cell Line, Humans, Male, Middle Aged, Oligonucleotide Array Sequence Analysis, Prognosis, Prostatic Neoplasms/diagnosis, Prostatic Neoplasms/pathology, Signal Transduction/genetics, White People/genetics, White People/statistics & numerical data
10.
Data Brief ; 20: 358-363, 2018 Oct.
Article in English | MEDLINE | ID: mdl-30175199

ABSTRACT

Presented here are data from Next-Generation Sequencing of differential micrococcal nuclease digestions of formaldehyde-crosslinked chromatin in selected tissues of maize (Zea mays) inbred line B73. Supplemental materials include a wet-bench protocol for making DNS-seq libraries and the DNS-seq data processing pipeline for producing genome browser tracks. This report also includes the peak-calling pipeline using the iSeg algorithm to segment positive and negative peaks from the DNS-seq difference profiles. The data repository for the sequence data is the NCBI SRA, BioProject Accession PRJNA445708.

11.
Sci Rep ; 7: 43294, 2017 03 03.
Article in English | MEDLINE | ID: mdl-28256629

ABSTRACT

Choosing the optimal chemotherapy regimen is still an unmet medical need for breast cancer patients. In this study, we reanalyzed data from seven independent data sets totaling 1079 breast cancer patients. The patients were treated with three different types of commonly used neoadjuvant chemotherapies: anthracycline alone, anthracycline plus paclitaxel, and anthracycline plus docetaxel. We developed random forest models with variable selection using both genetic and clinical variables to predict the response of a patient using pCR (pathological complete response) as the measure of response. The models were then used to reassign an optimal regimen to each patient to maximize the chance of pCR. An independent validation was performed where each independent study was left out during model building and later used for validation. The expected pCR rates of our method are significantly higher than the rates of the best treatments for all the seven independent studies. A validation study on 21 breast cancer cell lines showed that our prediction agrees with their drug-sensitivity profiles. In conclusion, the new strategy, called PRES (Personalized REgimen Selection), may significantly increase response rates for breast cancer patients, especially those with HER2- and ER-negative tumors, who will receive one of the widely accepted chemotherapy regimens.
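
Two steps in this abstract, reassigning each patient to the regimen with the highest predicted chance of pCR and validating by leaving out one study at a time, can be sketched as follows. This is a hedged illustration, not the PRES implementation: the function names, the probabilities, and the study labels are invented for the example, and the random forest models that would produce the probabilities are omitted.

```python
def best_regimen(predicted_pcr):
    """Pick the regimen with the highest predicted pCR probability."""
    return max(predicted_pcr, key=predicted_pcr.get)


def leave_one_study_out(study_ids):
    """Yield (held_out_study, train_indices, test_indices) for each distinct study."""
    for held_out in sorted(set(study_ids)):
        train = [i for i, s in enumerate(study_ids) if s != held_out]
        test = [i for i, s in enumerate(study_ids) if s == held_out]
        yield held_out, train, test


# Hypothetical model outputs for one patient (values are made up).
patient = {
    "anthracycline": 0.12,
    "anthracycline+paclitaxel": 0.31,
    "anthracycline+docetaxel": 0.24,
}
choice = best_regimen(patient)

# Hypothetical study label for each of five patients.
study_ids = ["S1", "S1", "S2", "S3", "S2"]
splits = list(leave_one_study_out(study_ids))
```

Splitting by study rather than by patient keeps all patients from the held-out study out of model building, which is what makes each validation fold independent.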


Subject(s)
Antineoplastic Agents/administration & dosage, Breast Neoplasms/drug therapy, Breast Neoplasms/pathology, Drug Therapy/methods, Precision Medicine/methods, Transcriptome, Anthracyclines/administration & dosage, Tumor Cell Line, Docetaxel, Female, Humans, Male, Biological Models, Paclitaxel/administration & dosage, Taxoids/administration & dosage