Search | BVS Regional Portal

1.

Evolutionary Dynamics of Chromatin Structure and Duplicate Gene Expression in Diploid and Allopolyploid Cotton.

Hu, Guanjing; Grover, Corrinne E; Vera, Daniel L; Lung, Pei-Yau; Girimurugan, Senthil B; Miller, Emma R; Conover, Justin L; Ou, Shujun; Xiong, Xianpeng; Zhu, De; Li, Dongming; Gallagher, Joseph P; Udall, Joshua A; Sui, Xin; Zhang, Jinfeng; Bass, Hank W; Wendel, Jonathan F.

Mol Biol Evol ; 41(5), 2024 May 03.

Article in English | MEDLINE | ID: mdl-38758089

ABSTRACT

Polyploidy is a prominent mechanism of plant speciation and adaptation, yet the mechanistic understandings of duplicated gene regulation remain elusive. Chromatin structure dynamics are suggested to govern gene regulatory control. Here, we characterized genome-wide nucleosome organization and chromatin accessibility in allotetraploid cotton, Gossypium hirsutum (AADD, 2n = 4X = 52), relative to its two diploid parents (AA or DD genome) and their synthetic diploid hybrid (AD), using DNS-seq. The larger A-genome exhibited wider average nucleosome spacing in diploids, and this intergenomic difference diminished in the allopolyploid but not hybrid. Allopolyploidization also exhibited increased accessibility at promoters genome-wide and synchronized cis-regulatory motifs between subgenomes. A prominent cis-acting control was inferred for chromatin dynamics and demonstrated by transposable element removal from promoters. Linking accessibility to gene expression patterns, we found distinct regulatory effects for hybridization and later allopolyploid stages, including nuanced establishment of homoeolog expression bias and expression level dominance. Histone gene expression and nucleosome organization are coordinated through chromatin accessibility. Our study demonstrates the capability to track high-resolution chromatin structure dynamics and reveals their role in the evolution of cis-regulatory landscapes and duplicate gene expression in polyploids, illuminating regulatory ties to subgenomic asymmetry and dominance.

Subject(s)

Chromatin , Diploidy , Molecular Evolution , Gossypium , Polyploidy , Gossypium/genetics , Chromatin/genetics , Plant Gene Expression Regulation , Plant Genome , Nucleosomes/genetics , Duplicate Genes , Promoter Regions (Genetic)

2.

The native cistrome and sequence motif families of the maize ear.

Savadel, Savannah D; Hartwig, Thomas; Turpin, Zachary M; Vera, Daniel L; Lung, Pei-Yau; Sui, Xin; Blank, Max; Frommer, Wolf B; Dennis, Jonathan H; Zhang, Jinfeng; Bass, Hank W.

PLoS Genet ; 17(8): e1009689, 2021 08.

Article in English | MEDLINE | ID: mdl-34383745

ABSTRACT

Elucidating the transcriptional regulatory networks that underlie growth and development requires robust ways to define the complete set of transcription factor (TF) binding sites. Although TF-binding sites are known to be generally located within accessible chromatin regions (ACRs), pinpointing these DNA regulatory elements globally remains challenging. Current approaches primarily identify binding sites for a single TF (e.g. ChIP-seq), or globally detect ACRs but lack the resolution to consistently define TF-binding sites (e.g. DNase-seq, ATAC-seq). To address this challenge, we developed MNase-defined cistrome-Occupancy Analysis (MOA-seq), a high-resolution (< 30 bp), high-throughput, and genome-wide strategy to globally identify putative TF-binding sites within ACRs. We used MOA-seq on developing maize ears as a proof of concept and were able to define a cistrome of 145,000 MOA footprints (MFs). While a substantial majority (76%) of the known ATAC-seq ACRs intersected with the MFs, only a minority of MFs overlapped with the ATAC peaks, indicating that the majority of MFs were novel and not detected by ATAC-seq. MFs were associated with promoters and significantly enriched for TF-binding and long-range chromatin interaction sites, including for the well-characterized FASCIATED EAR4, KNOTTED1, and TEOSINTE BRANCHED1. Importantly, the MOA-seq strategy improved the spatial resolution of TF-binding prediction and allowed us to identify 215 motif families collectively distributed over more than 100,000 non-overlapping, putatively-occupied binding sites across the genome. Our study presents a simple, efficient, and high-resolution approach to identify putative TF footprints and binding motifs genome-wide, to ultimately define a native cistrome atlas.

Subject(s)

DNA Footprinting/methods , Promoter Regions (Genetic) , Transcription Factors/metabolism , Zea mays/genetics , Binding Sites , Chromatin Immunoprecipitation Sequencing , High-Throughput Nucleotide Sequencing , Plant Proteins/genetics , Plant Proteins/metabolism , Transcriptional Regulatory Elements , Whole Genome Sequencing

3.

Triage of documents containing protein interactions affected by mutations using an NLP based machine learning approach.

Qu, Jinchan; Steppi, Albert; Zhong, Dongrui; Hao, Jie; Wang, Jian; Lung, Pei-Yau; Zhao, Tingting; He, Zhe; Zhang, Jinfeng.

BMC Genomics ; 21(1): 773, 2020 Nov 10.

Article in English | MEDLINE | ID: mdl-33167858

ABSTRACT

BACKGROUND: Information on protein-protein interactions affected by mutations is very useful for understanding the biological effect of mutations and for developing treatments targeting the interactions. In this study, we developed a natural language processing (NLP) based machine learning approach for extracting such information from literature. Our aim is to identify journal abstracts or paragraphs in full-text articles that contain at least one occurrence of a protein-protein interaction (PPI) affected by a mutation. RESULTS: Our system makes use of the latest NLP methods with a large number of engineered features, including some based on pre-trained word embeddings. Our final model achieved satisfactory performance in the Document Triage Task of the BioCreative VI Precision Medicine Track, with the highest recall and a comparable F1-score. CONCLUSIONS: The performance of our method indicates that it is ideally suited for being combined with manual annotations. Our machine learning framework and engineered features will also be very helpful for other researchers to further improve this and other related biological text mining tasks using either traditional machine learning or deep learning based methods.

Subject(s)

Data Mining , Natural Language Processing , Protein Interaction Mapping , Machine Learning , Mutation

4.

Maximizing the reusability of gene expression data by predicting missing metadata.

Lung, Pei-Yau; Zhong, Dongrui; Pang, Xiaodong; Li, Yan; Zhang, Jinfeng.

PLoS Comput Biol ; 16(11): e1007450, 2020 11.

Article in English | MEDLINE | ID: mdl-33156882

ABSTRACT

Reusability is part of the FAIR data principles, which aim to make data Findable, Accessible, Interoperable, and Reusable. One of the current efforts to increase the reusability of public genomics data has been to focus on the inclusion of quality metadata associated with the data. When necessary metadata are missing, most researchers will consider the data useless. In this study, we developed a framework to predict the missing metadata of gene expression datasets to maximize their reusability. We found that when using predicted data to conduct other analyses, it is not optimal to use all the predicted data. Instead, one should use only the subset of data that can be predicted accurately. We proposed a new metric called Proportion of Cases Accurately Predicted (PCAP), which is optimized in our specifically-designed machine learning pipeline. The new approach performed better than pipelines using commonly used metrics such as F1-score in terms of maximizing the reusability of data with missing values. We also found that different variables might need to be predicted using different machine learning methods and/or different data processing protocols. Using differential gene expression analysis as an example, we showed that when missing variables are accurately predicted, the corresponding gene expression data can be reliably used in downstream analyses.

Subject(s)

Data Curation , Gene Expression , Metadata , Computational Biology

5.

The regulatory landscape of early maize inflorescence development.

Parvathaneni, Rajiv K; Bertolini, Edoardo; Shamimuzzaman, Md; Vera, Daniel L; Lung, Pei-Yau; Rice, Brian R; Zhang, Jinfeng; Brown, Patrick J; Lipka, Alexander E; Bass, Hank W; Eveland, Andrea L.

Genome Biol ; 21(1): 165, 2020 07 06.

Article in English | MEDLINE | ID: mdl-32631399

ABSTRACT

BACKGROUND: The functional genome of agronomically important plant species remains largely unexplored, yet presents a virtually untapped resource for targeted crop improvement. Functional elements of regulatory DNA revealed through profiles of chromatin accessibility can be harnessed for fine-tuning gene expression to optimal phenotypes in specific environments. RESULTS: Here, we investigate the non-coding regulatory space in the maize (Zea mays) genome during early reproductive development of pollen- and grain-bearing inflorescences. Using an assay for differential sensitivity of chromatin to micrococcal nuclease (MNase) digestion, we profile accessible chromatin and nucleosome occupancy in these largely undifferentiated tissues and classify at least 1.6% of the genome as accessible, with the majority of MNase hypersensitive sites marking proximal promoters, but also 3' ends of maize genes. This approach maps regulatory elements to footprint-level resolution. Integration of complementary transcriptome profiles and transcription factor occupancy data is used to annotate regulatory factors, such as combinatorial transcription factor binding motifs and long non-coding RNAs, that potentially contribute to organogenesis, including tissue-specific regulation between male and female inflorescence structures. Finally, genome-wide association studies for inflorescence architecture traits based solely on functional regions delineated by MNase hypersensitivity reveal new SNP-trait associations in known regulators of inflorescence development as well as new candidates. CONCLUSIONS: These analyses provide a comprehensive look into the cis-regulatory landscape during inflorescence differentiation in a major cereal crop, which ultimately shapes architecture and influences yield potential.

Subject(s)

Chromatin Assembly and Disassembly , Developmental Gene Expression Regulation , Plant Gene Expression Regulation , Inflorescence/growth & development , Zea mays/growth & development , Plant Genome , Genome-Wide Association Study , Inflorescence/metabolism , Micrococcal Nuclease , Promoter Regions (Genetic) , Long Noncoding RNA/metabolism , Zea mays/metabolism

6.

Extracting chemical-protein interactions from literature using sentence structure analysis and feature engineering.

Lung, Pei-Yau; He, Zhe; Zhao, Tingting; Yu, Disa; Zhang, Jinfeng.

Database (Oxford) ; 2019, 2019 01 01.

Article in English | MEDLINE | ID: mdl-30624652

ABSTRACT

Information about the interactions between chemical compounds and proteins is indispensable for understanding the regulation of biological processes and the development of therapeutic drugs. Manually extracting such information from biomedical literature is very time- and resource-consuming. In this study, we propose a computational method to automatically extract chemical-protein interactions (CPIs) from a given text. Our method extracts CPI pairs and CPI triplets from sentences, where a CPI pair consists of a chemical compound and a protein name, and a CPI triplet consists of a CPI pair along with an interaction word describing their relationship. We extracted a diverse set of features from sentences that were used to build multiple machine learning models. Our models contain both simple features, which can be directly computed from sentences, and more sophisticated features derived using sentence structure analysis techniques. For example, one set of features was extracted based on the shortest paths between the CPI pairs or among the CPI triplets in the dependency graphs obtained from sentence parsing. We designed a three-stage approach to predict the multiple categories of CPIs. Our method performed the best among systems that use non-deep learning methods and outperformed several deep-learning-based systems in track 5 of the BioCreative VI challenge. The features we designed in this study are informative and can be applied to other machine learning methods including deep learning.

Subject(s)

Computational Biology/methods , Data Mining/methods , Chemical Databases , Protein Databases , Machine Learning , Humans , Pharmaceutical Preparations/chemistry , Pharmaceutical Preparations/metabolism , Proteins/chemistry , Proteins/metabolism , Semantics , Software

7.

Dysregulated gene expression predicts tumor aggressiveness in African-American prostate cancer patients.

Ali, Hamdy E A; Lung, Pei-Yau; Sholl, Andrew B; Gad, Shaimaa A; Bustamante, Juan J; Ali, Hamed I; Rhim, Johng S; Deep, Gagan; Zhang, Jinfeng; Abd Elmageed, Zakaria Y.

Sci Rep ; 8(1): 16335, 2018 11 05.

Article in English | MEDLINE | ID: mdl-30397274

ABSTRACT

Molecular mechanisms underlying the health disparity of prostate cancer (PCa) have not been fully determined. In this study, we applied a bioinformatic approach to identify and validate dysregulated genes associated with tumor aggressiveness in African American (AA) compared to Caucasian American (CA) men with PCa. We retrieved and analyzed microarray data from 619 PCa patients, 412 AA and 207 CA, and we validated these genes in tumor tissues and cell lines by Real-Time PCR, Western blot, immunocytochemistry (ICC) and immunohistochemistry (IHC) analyses. We identified 362 differentially expressed genes in AA men that are involved in regulating signaling pathways associated with tumor aggressiveness. In PCa tissues and cells, NKX3.1, APPL2, TPD52, LTC4S, ALDH1A3 and AMD1 transcripts were significantly upregulated (p < 0.05) compared to normal cells. IHC confirmed the overexpression of TPD52 (p = 0.0098) and LTC4S (p < 0.0005) in AA compared to CA men. ICC and Western blot analyses additionally corroborated this observation in PCa cells. These findings suggest that dysregulation of transcripts in PCa may drive the disparity of PCa outcomes and provide new insights into the development of new therapeutic agents against aggressive tumors. More studies are warranted to investigate the clinical significance of these dysregulated genes in promoting the oncogenic pathways in AA men.

Subject(s)

/genetics , Neoplastic Gene Expression Regulation , Prostatic Neoplasms/ethnology , Prostatic Neoplasms/genetics , Adult , /statistics & numerical data , Tumor Cell Line , Humans , Male , Middle Aged , Oligonucleotide Array Sequence Analysis , Prognosis , Prostatic Neoplasms/diagnosis , Prostatic Neoplasms/pathology , Signal Transduction/genetics , /genetics , /statistics & numerical data

8.

Chromatin structure profile data from DNS-seq: Differential nuclease sensitivity mapping of four reference tissues of B73 maize (Zea mays L).

Turpin, Zachary M; Vera, Daniel L; Savadel, Savannah D; Lung, Pei-Yau; Wear, Emily E; Mickelson-Young, Leigh; Thompson, William F; Hanley-Bowdoin, Linda; Dennis, Jonathan H; Zhang, Jinfeng; Bass, Hank W.

Data Brief ; 20: 358-363, 2018 Oct.

Article in English | MEDLINE | ID: mdl-30175199

ABSTRACT

Presented here are data from Next-Generation Sequencing of differential micrococcal nuclease digestions of formaldehyde-crosslinked chromatin in selected tissues of maize (Zea mays) inbred line B73. Supplemental materials include a wet-bench protocol for making DNS-seq libraries and the DNS-seq data processing pipeline for producing genome browser tracks. This report also includes the peak-calling pipeline using the iSeg algorithm to segment positive and negative peaks from the DNS-seq difference profiles. The data repository for the sequence data is the NCBI SRA, BioProject Accession PRJNA445708.

9.

Automatic extraction of protein-protein interactions using grammatical relationship graph.

Yu, Kaixian; Lung, Pei-Yau; Zhao, Tingting; Zhao, Peixiang; Tseng, Yan-Yuan; Zhang, Jinfeng.

BMC Med Inform Decis Mak ; 18(Suppl 2): 42, 2018 07 23.

Article in English | MEDLINE | ID: mdl-30066644

ABSTRACT

BACKGROUND: Relationships between bio-entities (genes, proteins, diseases, etc.) constitute a significant part of our knowledge. Most of this information is documented as unstructured text in different forms, such as books, articles and on-line pages. Automatic extraction of such information and storing it in structured form could help researchers more easily access such information and also make it possible to incorporate it in advanced integrative analysis. In this study, we developed a novel approach to extract bio-entity relationship information using Natural Language Processing (NLP) and a graph-theoretic algorithm. METHODS: Our method, called GRGT (Grammatical Relationship Graph for Triplets), not only extracts the pairs of terms that have certain relationships, but also extracts the type of relationship (the word describing the relationships). In addition, the directionality of the relationship can also be extracted. Our method is based on the assumption that a triplet exists for a pair of interactions. A triplet is defined as two terms (entities) and an interaction word describing the relationship of the two terms in a sentence. We first use a sentence parsing tool to obtain the sentence structure represented as a dependency graph where words are nodes and edges are typed dependencies. The shortest paths among the pairs of words in the triplet are then extracted, which form the basis for our information extraction method. A flexible pattern matching scheme was then used to match a triplet graph with unknown relationship to those triplet graphs with labels (True or False) in the database. RESULTS: We applied the method on three benchmark datasets to extract the protein-protein interactions (PPIs), and obtained better precision than the top performing methods in the literature. CONCLUSIONS: We have developed a method to extract the protein-protein interactions from biomedical literature. PPIs extracted by our method have higher precision than those from other methods, suggesting that our method can be used to effectively extract PPIs and deposit them into databases. Beyond extracting PPIs, our method could be easily extended to extracting relationship information between other bio-entities.

Subject(s)

Algorithms , Information Storage and Retrieval/methods , Natural Language Processing , Proteins/metabolism , Factual Databases

10.

iSeg: an efficient algorithm for segmentation of genomic and epigenomic data.

Girimurugan, Senthil B; Liu, Yuhang; Lung, Pei-Yau; Vera, Daniel L; Dennis, Jonathan H; Bass, Hank W; Zhang, Jinfeng.

BMC Bioinformatics ; 19(1): 131, 2018 04 11.

Article in English | MEDLINE | ID: mdl-29642840

ABSTRACT

BACKGROUND: Identification of functional elements of a genome often requires dividing a sequence of measurements along a genome into segments where adjacent segments have different properties, such as different mean values. Despite dozens of algorithms developed to address this problem in genomics research, methods with improved accuracy and speed are still needed to effectively tackle both existing and emerging genomic and epigenomic segmentation problems. RESULTS: We designed an efficient algorithm, called iSeg, for segmentation of genomic and epigenomic profiles. iSeg first utilizes dynamic programming to identify candidate segments and test for significance. It then uses a novel data structure based on two coupled balanced binary trees to detect overlapping significant segments and update them simultaneously during searching and refinement stages. Refinement and merging of significant segments are performed at the end to generate the final set of segments. By using an objective function based on the p-values of the segments, the algorithm can serve as a general computational framework to be combined with different assumptions on the distributions of the data. As a general segmentation method, it can segment different types of genomic and epigenomic data, such as DNA copy number variation, nucleosome occupancy, nuclease sensitivity, and differential nuclease sensitivity data. Using simple t-tests to compute p-values across multiple datasets of different types, we evaluate iSeg using both simulated and experimental datasets and show that it performs satisfactorily when compared with some other popular methods, which often employ more sophisticated statistical models. Implemented in C++, iSeg is also very computationally efficient, well suited for large numbers of input profiles and data with very long sequences. CONCLUSIONS: We have developed an efficient general-purpose segmentation tool and showed that it had comparable or more accurate results than many of the most popular segment-calling algorithms used in contemporary genomic data analysis. iSeg is capable of analyzing datasets that have both positive and negative values. Tunable parameters allow users to readily adjust the statistical stringency to best match the biological nature of individual datasets, including widely or sparsely mapped genomic datasets or those with non-normal distributions.

Subject(s)

Algorithms , Genetic Databases , Epigenomics , Computer Simulation , DNA Copy Number Variations/genetics , Deoxyribonucleases/metabolism , Genome , Humans , Statistical Models , Neoplasms/genetics , Zea mays/genetics

11.

Personalized chemotherapy selection for breast cancer using gene expression profiles.

Yu, Kaixian; Sang, Qing-Xiang Amy; Lung, Pei-Yau; Tan, Winston; Lively, Ty; Sheffield, Cedric; Bou-Dargham, Mayassa J; Liu, Jun S; Zhang, Jinfeng.

Sci Rep ; 7: 43294, 2017 03 03.

Article in English | MEDLINE | ID: mdl-28256629

ABSTRACT

Choosing the optimal chemotherapy regimen is still an unmet medical need for breast cancer patients. In this study, we reanalyzed data from seven independent data sets with a total of 1079 breast cancer patients. The patients were treated with three different types of commonly used neoadjuvant chemotherapies: anthracycline alone, anthracycline plus paclitaxel, and anthracycline plus docetaxel. We developed random forest models with variable selection using both genetic and clinical variables to predict the response of a patient using pCR (pathological complete response) as the measure of response. The models were then used to reassign an optimal regimen to each patient to maximize the chance of pCR. An independent validation was performed where each independent study was left out during model building and later used for validation. The expected pCR rates of our method are significantly higher than the rates of the best treatments for all seven independent studies. A validation study on 21 breast cancer cell lines showed that our prediction agrees with their drug-sensitivity profiles. In conclusion, the new strategy, called PRES (Personalized REgimen Selection), may significantly increase response rates for breast cancer patients, especially those with HER2 and ER negative tumors, who will receive one of the widely-accepted chemotherapy regimens.

Subject(s)

Antineoplastic Agents/administration & dosage , Breast Neoplasms/drug therapy , Breast Neoplasms/pathology , Drug Therapy/methods , Precision Medicine/methods , Transcriptome , Anthracyclines/administration & dosage , Tumor Cell Line , Docetaxel , Female , Humans , Male , Biological Models , Paclitaxel/administration & dosage , Taxoids/administration & dosage
