Search | VHL Search Portal

1.

DeepFold: enhancing protein structure prediction through optimized loss functions, improved template features, and re-optimized energy function.

Lee, Jae-Won; Won, Jong-Hyun; Jeon, Seonggwang; Choo, Yujin; Yeon, Yubin; Oh, Jin-Seon; Kim, Minsoo; Kim, SeonHwa; Joung, InSuk; Jang, Cheongjae; Lee, Sung Jong; Kim, Tae Hyun; Jin, Kyong Hwan; Song, Giltae; Kim, Eun-Sol; Yoo, Jejoong; Paek, Eunok; Noh, Yung-Kyun; Joo, Keehyoung.

Bioinformatics ; 39(12), 2023 12 01.

Article in English | MEDLINE | ID: mdl-37995286

ABSTRACT

MOTIVATION: Predicting protein structures with high accuracy is a critical challenge for the broad community of life sciences and industry. Despite progress made by deep neural networks like AlphaFold2, there is a need for further improvements in the quality of detailed structures, such as side-chains, along with protein backbone structures. RESULTS: Building upon the successes of AlphaFold2, the modifications we made include changing the losses of side-chain torsion angles and frame aligned point error, adding loss functions for side chain confidence and secondary structure prediction, and replacing template feature generation with a new alignment method based on conditional random fields. We also performed re-optimization by conformational space annealing using a molecular mechanics energy function which integrates the potential energies obtained from distogram and side-chain prediction. In the CASP15 blind test for single protein and domain modeling (109 domains), DeepFold ranked fourth among 132 groups with improvements in the details of the structure in terms of backbone, side-chain, and Molprobity. In terms of protein backbone accuracy, DeepFold achieved a median GDT-TS score of 88.64 compared with 85.88 of AlphaFold2. For TBM-easy/hard targets, DeepFold ranked at the top based on Z-scores for GDT-TS. This shows its practical value to the structural biology community, which demands highly accurate structures. In addition, a thorough analysis of 55 domains from 39 targets with publicly available structures indicates that DeepFold shows superior side-chain accuracy and Molprobity scores among the top-performing groups. AVAILABILITY AND IMPLEMENTATION: DeepFold tools are open-source software available at https://github.com/newtonjoo/deepfold.
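The GDT-TS figures quoted above can be illustrated with a minimal sketch. It assumes the per-residue Cα distances (in Å) between model and reference have already been computed from one fixed superposition. The official score instead searches over many superpositions to maximize each cutoff's fraction, so this simplified version may underestimate it. The function name `gdt_ts` and its parameters are illustrative and are not taken from DeepFold.

```python
from typing import Sequence


def gdt_ts(distances: Sequence[float],
           cutoffs: Sequence[float] = (1.0, 2.0, 4.0, 8.0)) -> float:
    """Simplified GDT-TS from per-residue distances (one fixed superposition).

    For each cutoff, take the fraction of residues whose model-to-reference
    distance is within the cutoff, then average the fractions and scale to 0-100.
    """
    if not distances:
        raise ValueError("at least one residue distance is required")
    n = len(distances)
    fractions = [sum(d <= c for d in distances) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


if __name__ == "__main__":
    # Four residues: fractions are 1/4, 2/4, 3/4 and 3/4 at 1, 2, 4 and 8 Å.
    print(gdt_ts([0.5, 1.5, 3.0, 9.0]))
```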

Subject(s)

Proteins , Software , Protein Conformation , Proteins/chemistry , Protein Structure, Secondary , Protein Folding

2.

AptaTrans: a deep neural network for predicting aptamer-protein interaction using pretrained encoders.

Shin, Incheol; Kang, Keumseok; Kim, Juseong; Sel, Sanghun; Choi, Jeonghoon; Lee, Jae-Wook; Kang, Ho Young; Song, Giltae.

BMC Bioinformatics ; 24(1): 447, 2023 Nov 27.

Article in English | MEDLINE | ID: mdl-38012571

ABSTRACT

BACKGROUND: Aptamers, which are biomaterials comprised of single-stranded DNA/RNA that form tertiary structures, have significant potential as next-generation materials, particularly for drug discovery. The systematic evolution of ligands by exponential enrichment (SELEX) method is a critical in vitro technique employed to identify aptamers that bind specifically to target proteins. While advanced SELEX-based methods such as Cell- and HT-SELEX are available, they often encounter issues such as extended time consumption and suboptimal accuracy. Several in silico aptamer discovery methods have been proposed to address these challenges. These methods are specifically designed to predict aptamer-protein interaction (API) using benchmark datasets. However, these methods often fail to consider the physicochemical interactions between aptamers and proteins within tertiary structures. RESULTS: In this study, we propose AptaTrans, a pipeline for predicting API using deep learning techniques. AptaTrans uses transformer-based encoders to handle aptamer and protein sequences at the monomer level. Furthermore, pretrained encoders are utilized for the structural representation. After validation with a benchmark dataset, AptaTrans has been integrated into a comprehensive toolset. This pipeline synergistically combines with Apta-MCTS, a generative algorithm for recommending aptamer candidates. CONCLUSION: The results show that AptaTrans outperforms existing models for predicting API, and the efficacy of the AptaTrans pipeline has been confirmed through various experimental tools. We expect AptaTrans will enhance the cost-effectiveness and efficiency of SELEX in drug discovery. The source code and benchmark dataset for AptaTrans are available at https://github.com/pnumlb/AptaTrans .

Subject(s)

Aptamers, Nucleotide , Aptamers, Nucleotide/chemistry , SELEX Aptamer Technique/methods , Software , Neural Networks, Computer , Algorithms , Ligands

3.

FastqCLS: a FASTQ compressor for long-read sequencing via read reordering using a novel scoring model.

Lee, Dohyeon; Song, Giltae.

Bioinformatics ; 38(2): 351-356, 2022 01 03.

Article in English | MEDLINE | ID: mdl-34623374

ABSTRACT

MOTIVATION: Over the past decades, vast amounts of genome sequencing data have been produced, requiring an enormous level of storage capacity. The time and resources needed to store and transfer such data cause bottlenecks in genome sequencing analysis. To resolve this issue, various compression techniques have been proposed to reduce the size of original FASTQ raw sequencing data, but these remain suboptimal. Long-read sequencing has become dominant in genomics, whereas most existing compression methods focus on short-read sequencing only. RESULTS: We designed a compression algorithm based on read reordering using a novel scoring model for reducing FASTQ file size with no information loss. We integrated all data processing steps into a software package called FastqCLS and provided it as a Docker image for ease of installation and execution. We compared our method with existing major FASTQ compression tools using benchmark datasets. We also included new long-read sequencing data in this validation. As a result, FastqCLS outperformed the existing tools in terms of compression ratios for storing long-read sequencing data. AVAILABILITY AND IMPLEMENTATION: FastqCLS can be downloaded from https://github.com/krlucete/FastqCLS. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
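The idea of lossless compression through read reordering can be sketched as follows. This is not the FastqCLS algorithm: the abstract does not describe its scoring model, so `hypothetical_score` (GC fraction) is only a placeholder, and the general-purpose `zlib` compressor stands in for whatever backend the tool uses. The sketch shows only that reads can be sorted by a score before compression and restored to their original order afterwards by storing the permutation.

```python
import zlib
from typing import List, Tuple


def hypothetical_score(read: str) -> float:
    # Placeholder score (GC fraction); NOT the scoring model used by FastqCLS.
    return (read.count("G") + read.count("C")) / max(len(read), 1)


def compress_reordered(reads: List[str]) -> Tuple[bytes, List[int]]:
    """Sort reads by score, compress them, and return the blob plus the permutation."""
    order = sorted(range(len(reads)), key=lambda i: hypothetical_score(reads[i]))
    payload = "\n".join(reads[i] for i in order).encode("ascii")
    return zlib.compress(payload, 9), order


def decompress_restore(blob: bytes, order: List[int]) -> List[str]:
    """Decompress and put every read back at its original position."""
    text = zlib.decompress(blob).decode("ascii")
    sorted_reads = text.split("\n") if order else []
    restored = [""] * len(order)
    for position, original_index in enumerate(order):
        restored[original_index] = sorted_reads[position]
    return restored


if __name__ == "__main__":
    reads = ["ACGT", "GGCC", "ATAT", "ACGT"]
    blob, order = compress_reordered(reads)
    print(decompress_restore(blob, order) == reads)
```

In a real tool the permutation itself must be stored compactly, and the score would be chosen so that similar reads end up adjacent. Neither aspect is modeled here.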

Subject(s)

Data Compression , Software , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods , Algorithms , Data Compression/methods

4.

Comprehensive, integrated, and phased whole-genome analysis of the primary ENCODE cell line K562.

Zhou, Bo; Ho, Steve S; Greer, Stephanie U; Zhu, Xiaowei; Bell, John M; Arthur, Joseph G; Spies, Noah; Zhang, Xianglong; Byeon, Seunggyu; Pattni, Reenal; Ben-Efraim, Noa; Haney, Michael S; Haraksingh, Rajini R; Song, Giltae; Ji, Hanlee P; Perrin, Dimitri; Wong, Wing H; Abyzov, Alexej; Urban, Alexander E.

Genome Res ; 29(3): 472-484, 2019 03.

Article in English | MEDLINE | ID: mdl-30737237

ABSTRACT

K562 is widely used in biomedical research. It is one of three tier-one cell lines of ENCODE and also most commonly used for large-scale CRISPR/Cas9 screens. Although its functional genomic and epigenomic characteristics have been extensively studied, its genome sequence and genomic structural features have never been comprehensively analyzed. Such information is essential for the correct interpretation and understanding of the vast troves of existing functional genomics and epigenomics data for K562. We performed and integrated deep-coverage whole-genome (short-insert), mate-pair, and linked-read sequencing as well as karyotyping and array CGH analysis to identify a wide spectrum of genome characteristics in K562: copy numbers (CN) of aneuploid chromosome segments at high-resolution, SNVs and indels (both corrected for CN in aneuploid regions), loss of heterozygosity, megabase-scale phased haplotypes often spanning entire chromosome arms, structural variants (SVs), including small and large-scale complex SVs and nonreference retrotransposon insertions. Many SVs were phased, assembled, and experimentally validated. We identified multiple allele-specific deletions and duplications within the tumor suppressor gene FHIT. Taking aneuploidy into account, we reanalyzed K562 RNA-seq and whole-genome bisulfite sequencing data for allele-specific expression and allele-specific DNA methylation. We also show examples of how deeper insights into regulatory complexity are gained by integrating genomic variant information and structural context with functional genomics and epigenomics data. Furthermore, using K562 haplotype information, we produced an allele-specific CRISPR targeting map. This comprehensive whole-genome analysis serves as a resource for future studies that utilize K562 as well as a framework for the analysis of other cancer genomes.

Subject(s)

Genome, Human , Humans , K562 Cells , Karyotype , Polymorphism, Genetic , Whole Genome Sequencing

5.

Haplotype-resolved and integrated genome analysis of the cancer cell line HepG2.

Zhou, Bo; Ho, Steve S; Greer, Stephanie U; Spies, Noah; Bell, John M; Zhang, Xianglong; Zhu, Xiaowei; Arthur, Joseph G; Byeon, Seunggyu; Pattni, Reenal; Saha, Ishan; Huang, Yiling; Song, Giltae; Perrin, Dimitri; Wong, Wing H; Ji, Hanlee P; Abyzov, Alexej; Urban, Alexander E.

Nucleic Acids Res ; 47(8): 3846-3861, 2019 05 07.

Article in English | MEDLINE | ID: mdl-30864654

ABSTRACT

HepG2 is one of the most widely used human cancer cell lines in biomedical research and one of the main cell lines of ENCODE. Although the functional genomic and epigenomic characteristics of HepG2 are extensively studied, its genome sequence has never been comprehensively analyzed and higher order genomic structural features are largely unknown. The high degree of aneuploidy in HepG2 renders traditional genome variant analysis methods challenging and partially ineffective. Correct and complete interpretation of the extensive functional genomics data from HepG2 requires an understanding of the cell line's genome sequence and genome structure. Using a variety of sequencing and analysis methods, we identified a wide spectrum of genome characteristics in HepG2: copy numbers of chromosomal segments at high resolution, SNVs and Indels (corrected for aneuploidy), regions with loss of heterozygosity, phased haplotypes extending to entire chromosome arms, retrotransposon insertions and structural variants (SVs) including complex and somatic genomic rearrangements. A large number of SVs were phased, sequence assembled and experimentally validated. We re-analyzed published HepG2 datasets for allele-specific expression and DNA methylation and assembled an allele-specific CRISPR/Cas9 targeting map. We demonstrate how deeper insights into genomic regulatory complexity are gained by adopting a genome-integrated framework.

Subject(s)

Chromosome Mapping/methods , Genome, Human , Genomics/methods , Haplotypes , Sequence Analysis, DNA/statistics & numerical data , Alleles , Aneuploidy , DNA Methylation , Genomic Structural Variation , Hep G2 Cells , High-Throughput Nucleotide Sequencing , Humans , INDEL Mutation , Karyotyping , Loss of Heterozygosity , Polymorphism, Single Nucleotide , Retroelements

6.

Authentication of differential gene expression in oral squamous cell carcinoma using machine learning applications.

Pratama, Rian; Hwang, Jae Joon; Lee, Ji Hye; Song, Giltae; Park, Hae Ryoun.

BMC Oral Health ; 21(1): 281, 2021 05 29.

Article in English | MEDLINE | ID: mdl-34051764

ABSTRACT

BACKGROUND: Recently, the possibility of tumour classification based on genetic data has been investigated. However, genetic datasets are difficult to handle because of their massive size and complexity of manipulation. In the present study, we examined the diagnostic performance of machine learning applications using imaging-based classifications of oral squamous cell carcinoma (OSCC) gene sets. METHODS: RNA sequencing data from SCC tissues from various sites, including oral, non-oral head and neck, oesophageal, and cervical regions, were downloaded from The Cancer Genome Atlas (TCGA). The feature genes were extracted through a convolutional neural network (CNN) and machine learning, and the performance of each analysis was compared. RESULTS: The ability of the machine learning analysis to classify OSCC tumours was excellent. However, the tool exhibited poorer performance in discriminating histopathologically dissimilar cancers derived from the same type of tissue than in differentiating cancers of the same histopathologic type with different tissue origins, revealing that the differential gene expression pattern is a more important factor than the histopathologic features for differentiating cancer types. CONCLUSION: The CNN-based diagnostic model and the visualisation methods using RNA sequencing data were useful for correctly categorising OSCC. The analysis showed differentially expressed genes in multiwise comparisons of various types of SCCs, such as KCNA10, FOSL2, and PRDM16, and extracted leader genes from pairwise comparisons were FGF20, DLC1, and ZNF705D.

Subject(s)

Carcinoma, Squamous Cell , Head and Neck Neoplasms , Mouth Neoplasms , Carcinoma, Squamous Cell/genetics , Gene Expression , Gene Expression Regulation, Neoplastic , Humans , Machine Learning , Mouth Neoplasms/genetics , Squamous Cell Carcinoma of Head and Neck

7.

The Saccharomyces Genome Database Variant Viewer.

Sheppard, Travis K; Hitz, Benjamin C; Engel, Stacia R; Song, Giltae; Balakrishnan, Rama; Binkley, Gail; Costanzo, Maria C; Dalusag, Kyla S; Demeter, Janos; Hellerstedt, Sage T; Karra, Kalpana; Nash, Robert S; Paskov, Kelley M; Skrzypek, Marek S; Weng, Shuai; Wong, Edith D; Cherry, J Michael.

Nucleic Acids Res ; 44(D1): D698-702, 2016 Jan 04.

Article in English | MEDLINE | ID: mdl-26578556

ABSTRACT

The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org) is the authoritative community resource for the Saccharomyces cerevisiae reference genome sequence and its annotation. In recent years, we have moved toward increased representation of sequence variation and allelic differences within S. cerevisiae. The publication of numerous additional genomes has motivated the creation of new tools for their annotation and analysis. Here we present the Variant Viewer: a dynamic open-source web application for the visualization of genomic and proteomic differences. Multiple sequence alignments have been constructed across high quality genome sequences from 11 different S. cerevisiae strains and stored in the SGD. The alignments and summaries are encoded in JSON and used to create a two-tiered dynamic view of the budding yeast pan-genome, available at http://www.yeastgenome.org/variant-viewer.

Subject(s)

Databases, Genetic , Genetic Variation , Genome, Fungal , Saccharomyces cerevisiae/genetics , Molecular Sequence Annotation , Sequence Alignment , Sequence Analysis, DNA , Sequence Analysis, Protein , User-Computer Interface

8.

DISPAQ: Distributed Profitable-Area Query from Big Taxi Trip Data.

Putri, Fadhilah Kurnia; Song, Giltae; Kwon, Joonho; Rao, Praveen.

Sensors (Basel) ; 17(10), 2017 Sep 25.

Article in English | MEDLINE | ID: mdl-28946679

ABSTRACT

One of the crucial problems for taxi drivers is to efficiently locate passengers in order to increase profits. The rapid advancement and ubiquitous penetration of Internet of Things (IoT) technology into transportation industries enables us to provide taxi drivers with locations that have more potential passengers (more profitable areas) by analyzing and querying taxi trip data. In this paper, we propose a query processing system, called Distributed Profitable-Area Query (DISPAQ) which efficiently identifies profitable areas by exploiting the Apache Software Foundation's Spark framework and a MongoDB database. DISPAQ first maintains a profitable-area query index (PQ-index) by extracting area summaries and route summaries from raw taxi trip data. It then identifies candidate profitable areas by searching the PQ-index during query processing. Then, it exploits a Z-Skyline algorithm, which is an extension of skyline processing with a Z-order space filling curve, to quickly refine the candidate profitable areas. To improve the performance of distributed query processing, we also propose local Z-Skyline optimization, which reduces the number of dominant tests by distributing killer profitable areas to each cluster node. Through extensive evaluation with real datasets, we demonstrate that our DISPAQ system provides a scalable and efficient solution for processing profitable-area queries from huge amounts of big taxi trip data.
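The combination of skyline processing with a Z-order curve can be sketched as below. This is a single-machine illustration of the general technique, not the DISPAQ implementation. It assumes two non-negative integer attributes per area where smaller is better (an attribute to be maximized, such as profit, would first be negated or inverted). The names `morton2d`, `dominates` and `z_skyline` are illustrative. The property used is that if point a dominates point b, then a comes before b in Z-order, so one pass over the sorted points suffices and no skyline member is ever evicted.

```python
from typing import Iterable, List, Tuple

Point = Tuple[int, int]


def morton2d(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y to get the Z-order (Morton) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def dominates(a: Point, b: Point) -> bool:
    """a dominates b if a is no worse in every attribute and better in at least one."""
    return all(p <= q for p, q in zip(a, b)) and any(p < q for p, q in zip(a, b))


def z_skyline(points: Iterable[Point]) -> List[Point]:
    """Return the non-dominated points, scanning candidates in Z-order."""
    skyline: List[Point] = []
    for p in sorted(set(points), key=lambda pt: morton2d(*pt)):
        # Any dominator of p was visited earlier, so checking the skyline is enough.
        if not any(dominates(s, p) for s in skyline):
            skyline.append(p)
    return skyline


if __name__ == "__main__":
    areas = [(1, 5), (2, 2), (3, 3), (5, 1), (4, 4), (1, 5)]
    print(z_skyline(areas))
```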

9.

Acute myocardial infarction prognosis prediction with reliable and interpretable artificial intelligence system.

Kim, Minwook; Kang, Donggil; Kim, Min Sun; Choe, Jeong Cheon; Lee, Sun-Hack; Ahn, Jin Hee; Oh, Jun-Hyok; Choi, Jung Hyun; Lee, Han Cheol; Cha, Kwang Soo; Jang, Kyungtae; Bong, WooR I; Song, Giltae; Lee, Hyewon.

J Am Med Inform Assoc ; 31(7): 1540-1550, 2024 Jun 20.

Article in English | MEDLINE | ID: mdl-38804963

ABSTRACT

OBJECTIVE: Predicting mortality after acute myocardial infarction (AMI) is crucial for timely prescription and treatment of AMI patients, but there are no appropriate AI systems for clinicians. Our primary goal is to develop a reliable and interpretable AI system and provide some valuable insights regarding short- and long-term mortality. MATERIALS AND METHODS: We propose the RIAS framework, an end-to-end framework that is designed with reliability and interpretability at its core and automatically optimizes the given model. Using RIAS, clinicians get accurate and reliable predictions which can be used as likelihood, with global and local explanations, and "what if" scenarios to achieve desired outcomes as well. RESULTS: We apply RIAS to AMI prognosis prediction data which comes from the Korean Acute Myocardial Infarction Registry. We compared FT-Transformer with XGBoost and MLP and found that FT-Transformer has superiority in sensitivity and comparable performance in AUROC and F1 score to XGBoost. Furthermore, RIAS reveals the significance of statin-based medications, beta-blockers, and age on mortality regardless of time period. Lastly, we showcase reliable and interpretable results of RIAS with local explanations and counterfactual examples for several realistic scenarios. DISCUSSION: RIAS addresses the "black-box" issue in AI by providing both global and local explanations based on SHAP values and reliable predictions, interpretable as actual likelihoods. The system's "what if" counterfactual explanations enable clinicians to simulate patient-specific scenarios under various conditions, enhancing its practical utility. CONCLUSION: The proposed framework provides reliable and interpretable predictions along with counterfactual examples.

Subject(s)

Artificial Intelligence , Myocardial Infarction , Humans , Myocardial Infarction/mortality , Myocardial Infarction/diagnosis , Prognosis , Male , Registries , Female , Republic of Korea , Reproducibility of Results , Aged , Middle Aged

10.

Evaluation of methods for detecting conversion events in gene clusters.

Song, Giltae; Hsu, Chih-Hao; Riemer, Cathy; Miller, Webb.

BMC Bioinformatics ; 12 Suppl 1: S45, 2011 Feb 15.

Article in English | MEDLINE | ID: mdl-21342577

ABSTRACT

BACKGROUND: Gene clusters are genetically important, but their analysis poses significant computational challenges. One of the major reasons for these difficulties is gene conversion among the duplicated regions of the cluster, which can obscure their true relationships. Many computational methods for detecting gene conversion events have been released, but their performance has not been assessed for wide deployment in evolutionary history studies due to a lack of accurate evaluation methods. RESULTS: We designed a new method that simulates gene cluster evolution, including large-scale events of duplication, deletion, and conversion as well as small mutations. We used this simulation data to evaluate several different programs for detecting gene conversion events. CONCLUSIONS: Our evaluation identifies strengths and weaknesses of several methods for detecting gene conversion, which can contribute to more accurate analysis of gene cluster evolution.

Subject(s)

Computational Biology/methods , Gene Conversion , Multigene Family , Animals , Biological Evolution , Computer Simulation , Humans , Primates/genetics , Sequence Alignment

11.

Conversion events in gene clusters.

Song, Giltae; Hsu, Chih-Hao; Riemer, Cathy; Zhang, Yu; Kim, Hie Lim; Hoffmann, Federico; Zhang, Louxin; Hardison, Ross C; Green, Eric D; Miller, Webb.

BMC Evol Biol ; 11: 226, 2011 Jul 28.

Article in English | MEDLINE | ID: mdl-21798034

ABSTRACT

BACKGROUND: Gene clusters containing multiple similar genomic regions in close proximity are of great interest for biomedical studies because of their associations with inherited diseases. However, such regions are difficult to analyze due to their structural complexity and their complicated evolutionary histories, reflecting a variety of large-scale mutational events. In particular, conversion events can mislead inferences about the relationships among these regions, as traced by traditional methods such as construction of phylogenetic trees or multi-species alignments. RESULTS: To correct the distorted information generated by such methods, we have developed an automated pipeline called CHAP (Cluster History Analysis Package) for detecting conversion events. We used this pipeline to analyze the conversion events that affected two well-studied gene clusters (α-globin and β-globin) and three gene clusters for which comparative sequence data were generated from seven primate species: CCL (chemokine ligand), IFN (interferon), and CYP2abf (part of cytochrome P450 family 2). CHAP is freely available at http://www.bx.psu.edu/miller_lab. CONCLUSIONS: These studies reveal the value of characterizing conversion events in the context of studying gene clusters in complex genomes.

Subject(s)

Gene Conversion , Multigene Family , Primates/genetics , alpha-Globins/genetics , beta-Globins/genetics , Animals , Evolution, Molecular , Genome , Humans , Molecular Sequence Data , Phylogeny , Primates/classification , Software

12.

DNN-Boost: Somatic mutation identification of tumor-only whole-exome sequencing data using deep neural network and XGBoost.

Maruf, Firda Aminy; Pratama, Rian; Song, Giltae.

J Bioinform Comput Biol ; 19(6): 2140017, 2021 12.

Article in English | MEDLINE | ID: mdl-34895111

ABSTRACT

Detection of somatic mutation in whole-exome sequencing data can help elucidate the mechanism of tumor progression. Most computational approaches require exome sequencing for both tumor and normal samples. However, it is more common to sequence exomes for tumor samples only without the paired normal samples. To include these types of data for extensive studies on the process of tumorigenesis, it is necessary to develop an approach for identifying somatic mutations using tumor exome sequencing data only. In this study, we designed a machine learning approach using Deep Neural Network (DNN) and XGBoost to identify somatic mutations in tumor-only exome sequencing data and we integrated this into a pipeline called DNN-Boost. The XGBoost algorithm is used to extract the features from the results of variant callers and these features are then fed into the DNN model as input. The XGBoost algorithm resolves issues of missing values and overfitting. We evaluated our proposed model and compared its performance with other existing benchmark methods. We noted that the DNN-Boost classification model outperformed the benchmark method in classifying somatic mutations from paired tumor-normal exome data and tumor-only exome data.

Subject(s)

Neoplasms , Neural Networks, Computer , Exome , High-Throughput Nucleotide Sequencing , Humans , Mutation , Neoplasms/genetics , Exome Sequencing

13.

Predicting aptamer sequences that interact with target proteins using an aptamer-protein interaction classifier and a Monte Carlo tree search approach.

Lee, Gwangho; Jang, Gun Hyuk; Kang, Ho Young; Song, Giltae.

PLoS One ; 16(6): e0253760, 2021.

Article in English | MEDLINE | ID: mdl-34170922

ABSTRACT

Oligonucleotide-based aptamers, which have a three-dimensional structure with a single-stranded fragment, feature various characteristics with respect to size, toxicity, and permeability. Accordingly, aptamers are advantageous in terms of diagnosis and treatment and are materials that can be produced through relatively simple experiments. Systematic evolution of ligands by exponential enrichment (SELEX) is one of the most widely used experimental methods for generating aptamers; however, it is highly expensive and time-consuming. To reduce the related costs, recent studies have used in silico approaches, such as aptamer-protein interaction (API) classifiers that use sequence patterns to determine the binding affinity between RNA aptamers and proteins. Some of these methods generate candidate RNA aptamer sequences that bind to a target protein, but they are limited to producing candidates of a specific size. In this study, we present a machine learning approach for selecting candidate sequences of various sizes that have a high binding affinity for a specific sequence of a target protein. We applied the Monte Carlo tree search (MCTS) algorithm for generating the candidate sequences using a score function based on an API classifier. The tree structure that we designed with MCTS enables nucleotide sequence sampling, and the obtained sequences are potential aptamer candidates. We performed a quality assessment using the scores of docking simulations. Our validation datasets revealed that our model showed similar or better docking scores in ZDOCK docking simulations than the known aptamers. We expect that our method, which is size-independent and easy to use, can provide insights into searching for an appropriate aptamer sequence for a target protein during the simulation step of SELEX.

Subject(s)

Aptamers, Nucleotide , Computer Simulation , Machine Learning , Models, Chemical , Proteins/chemistry , Sequence Analysis, RNA , Aptamers, Nucleotide/chemistry , Aptamers, Nucleotide/genetics , Molecular Docking Simulation , Monte Carlo Method

14.

CNN-Peaks: ChIP-Seq peak detection pipeline using convolutional neural networks that imitate human visual inspection.

Oh, Dongpin; Strattan, J Seth; Hur, Junho K; Bento, José; Urban, Alexander Eckehart; Song, Giltae; Cherry, J Michael.

Sci Rep ; 10(1): 7933, 2020 05 13.

Article in English | MEDLINE | ID: mdl-32404971

ABSTRACT

ChIP-seq is one of the core experimental resources available to understand genome-wide epigenetic interactions and identify the functional elements associated with diseases. The analysis of ChIP-seq data is important but poses a difficult computational challenge, due to the presence of irregular noise and bias on various levels. Although many peak-calling methods have been developed, the current computational tools still require, in some cases, human manual inspection using data visualization. However, the huge volumes of ChIP-seq data make it almost impossible for human researchers to manually uncover all the peaks. Recently developed convolutional neural networks (CNN), which are capable of achieving human-like classification accuracy, can be applied to this challenging problem. In this study, we design a novel supervised learning approach for identifying ChIP-seq peaks using CNNs, and integrate it into a software pipeline called CNN-Peaks. We use data labeled by human researchers who annotate the presence or absence of peaks in some genomic segments, as training data for our model. The trained model is then applied to predict peaks in previously unseen genomic segments from multiple ChIP-seq datasets including benchmark datasets commonly used for validation of peak calling methods. We observe a performance superior to that of previous methods.

Subject(s)

Chromatin Immunoprecipitation Sequencing , Computational Biology/methods , Neural Networks, Computer , Software , Algorithms , Binding Sites , Chromatin Immunoprecipitation Sequencing/methods , Databases, Nucleic Acid , Epigenesis, Genetic , Epigenomics/methods , Histones/metabolism , Humans , Nucleotide Motifs , Protein Binding , Transcription Initiation Site

15.

Chromosome-level de novo assembly of the pig-tailed macaque genome using linked-read sequencing and HiC proximity scaffolding.

Roodgar, Morteza; Babveyh, Afshin; Nguyen, Lan H; Zhou, Wenyu; Sinha, Rahul; Lee, Hayan; Hanks, John B; Avula, Mohan; Jiang, Lihua; Jian, Ruiqi; Lee, Hoyong; Song, Giltae; Chaib, Hassan; Weissman, Irv L; Batzoglou, Serafim; Holmes, Susan; Smith, David G; Mankowski, Joseph L; Prost, Stefan; Snyder, Michael P.

Gigascience ; 9(7), 2020 07 01.

Article in English | MEDLINE | ID: mdl-32649757

ABSTRACT

BACKGROUND: Macaque species share >93% genome homology with humans and develop many disease phenotypes similar to those of humans, making them valuable animal models for the study of human diseases (e.g., HIV and neurodegenerative diseases). However, the quality of genome assembly and annotation for several macaque species lags behind the human genome effort. RESULTS: To close this gap and enhance functional genomics approaches, we used a combination of de novo linked-read assembly and scaffolding using proximity ligation assay (HiC) to assemble the pig-tailed macaque (Macaca nemestrina) genome. This combinatorial method yielded large scaffolds at chromosome level with a scaffold N50 of 127.5 Mb; the 23 largest scaffolds covered 90% of the entire genome. This assembly revealed large-scale rearrangements between pig-tailed macaque chromosomes 7, 12, and 13 and human chromosomes 2, 14, and 15. We subsequently annotated the genome using transcriptome and proteomics data from personalized induced pluripotent stem cells derived from the same animal. Reconstruction of the evolutionary tree using whole-genome annotation and orthologous comparisons among 3 macaque species, human, and mouse genomes revealed extensive homology between human and pig-tailed macaques with regards to both pluripotent stem cell genes and innate immune gene pathways. Our results confirm that rhesus and cynomolgus macaques exhibit a closer evolutionary distance to each other than either species exhibits to humans or pig-tailed macaques. CONCLUSIONS: These findings demonstrate that pig-tailed macaques can serve as an excellent animal model for the study of many human diseases particularly with regards to pluripotency and innate immune pathways.
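The scaffold N50 statistic mentioned above can be computed with a short sketch. It uses the usual definition: the length L such that scaffolds of length at least L together account for at least half of the total assembly size. The function name `n50` and the toy lengths are illustrative and are not data from the macaque assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one scaffold length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first scaffold where the cumulative sum reaches half the total.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable: the cumulative sum always reaches the total")


if __name__ == "__main__":
    # Total is 54; 10 + 9 + 8 = 27 reaches half, so the N50 is 8.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```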

Subject(s)

Chromosomes , Genome , Genomics , Macaca nemestrina/genetics , Animals , Computational Biology/methods , Genomics/methods , Humans , Karyotyping/methods , Male , Molecular Sequence Annotation , Proteomics/methods , Repetitive Sequences, Nucleic Acid

16.

Integrative Meta-Assembly Pipeline (IMAP): Chromosome-level genome assembler combining multiple de novo assemblies.

Song, Giltae; Lee, Jongin; Kim, Juyeon; Kang, Seokwoo; Lee, Hoyong; Kwon, Daehong; Lee, Daehwan; Lang, Gregory I; Cherry, J Michael; Kim, Jaebum.

PLoS One ; 14(8): e0221858, 2019.

Article in English | MEDLINE | ID: mdl-31454399

ABSTRACT

BACKGROUND: Genomic data have become major resources to understand complex mechanisms at fine-scale temporal and spatial resolution in functional and evolutionary genetic studies, including human diseases, such as cancers. Recently, a large number of whole genomes of evolving populations of yeast (Saccharomyces cerevisiae W303 strain) were sequenced in a time-dependent manner to identify temporal evolutionary patterns. For this type of study, a chromosome-level sequence assembly of the strain or population at time zero is required to compare with the genomes derived later. However, there is no fully automated computational approach in experimental evolution studies to establish the chromosome-level genome assembly using unique features of sequencing data. METHODS AND RESULTS: In this study, we developed a new software pipeline, the integrative meta-assembly pipeline (IMAP), to build chromosome-level genome sequence assemblies by generating and combining multiple initial assemblies using three de novo assemblers from short-read sequencing data. We significantly improved the continuity and accuracy of the genome assembly using a large collection of sequencing data and hybrid assembly approaches. We validated our pipeline by generating chromosome-level assemblies of yeast strains W303 and SK1, and compared our results with assemblies built using long-read sequencing and various assembly evaluation metrics. We also constructed chromosome-level sequence assemblies of S. cerevisiae strain Sigma1278b, and three commonly used fungal strains: Aspergillus nidulans A713, Neurospora crassa 73, and Thielavia terrestris CBS 492.74, for which long-read sequencing data are not yet available. Finally, we examined the effect of IMAP parameters, such as reference and resolution, on the quality of the final assembly of the yeast strains W303 and SK1. CONCLUSIONS: We developed a cost-effective pipeline to generate chromosome-level sequence assemblies using only short-read sequencing data. Our pipeline combines the strengths of reference-guided and meta-assembly approaches. Our pipeline is available online at http://github.com/jkimlab/IMAP including a Docker image, as well as a Perl script, to help users install the IMAP package, including several prerequisite programs. Users can use IMAP to easily build the chromosome-level assembly for the genome of their interest.

Subject(s)

Sequence Analysis, DNA , Software , Chromosomes, Fungal , Genome, Fungal , Molecular Sequence Annotation , Synteny/genetics

17.

From one to many: expanding the Saccharomyces cerevisiae reference genome panel.

Engel, Stacia R; Weng, Shuai; Binkley, Gail; Paskov, Kelley; Song, Giltae; Cherry, J Michael.

Database (Oxford) ; 2016, 2016.

Article in English | MEDLINE | ID: mdl-26989152

ABSTRACT

In recent years, thousands of Saccharomyces cerevisiae genomes have been sequenced to varying degrees of completion. The Saccharomyces Genome Database (SGD) has long been the keeper of the original eukaryotic reference genome sequence, which was derived primarily from S. cerevisiae strain S288C. Because new technologies are pushing S. cerevisiae annotation past the limits of any system based exclusively on a single reference sequence, SGD is actively working to expand the original S. cerevisiae systematic reference sequence from a single genome to a multi-genome reference panel. We first commissioned the sequencing of additional genomes and their automated analysis using the AGAPE pipeline. Here we describe our curation strategy to produce manually reviewed high-quality genome annotations in order to elevate 11 of these additional genomes to Reference status. Database URL: http://www.yeastgenome.org/.

Subject(s)

Genome, Fungal , Saccharomyces cerevisiae/genetics , Automation , Base Sequence , Data Mining , Databases, Genetic , Open Reading Frames/genetics , Reference Standards

18.

Integration of new alternative reference strain genome sequences into the Saccharomyces genome database.

Song, Giltae; Balakrishnan, Rama; Binkley, Gail; Costanzo, Maria C; Dalusag, Kyla; Demeter, Janos; Engel, Stacia; Hellerstedt, Sage T; Karra, Kalpana; Hitz, Benjamin C; Nash, Robert S; Paskov, Kelley; Sheppard, Travis; Skrzypek, Marek; Weng, Shuai; Wong, Edith; Cherry, J Michael.

Database (Oxford) ; 2016, 2016.

Article in English | MEDLINE | ID: mdl-27252399

ABSTRACT

The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) is the authoritative community resource for the Saccharomyces cerevisiae reference genome sequence and its annotation. To provide a wider scope of genetic and phenotypic variation in yeast, the genome sequences and their corresponding annotations from 11 alternative S. cerevisiae reference strains have been integrated into SGD. Genomic and protein sequence information for genes from these strains is now available on the Sequence and Protein tab of the corresponding Locus Summary pages. We illustrate how these genome sequences can be utilized to aid our understanding of strain-specific functional and phenotypic differences. Database URL: www.yeastgenome.org.

Subject(s)

Databases, Genetic , Genome, Fungal/genetics , Genomics/methods , Saccharomyces/genetics , Molecular Sequence Annotation , Reproducibility of Results , Saccharomyces cerevisiae/genetics , User-Computer Interface

19.

AGAPE (Automated Genome Analysis PipelinE) for pan-genome analysis of Saccharomyces cerevisiae.

Song, Giltae; Dickins, Benjamin J A; Demeter, Janos; Engel, Stacia; Gallagher, Jennifer; Choe, Kisurb; Dunn, Barbara; Snyder, Michael; Cherry, J Michael.

PLoS One ; 10(3): e0120671, 2015.

Article in English | MEDLINE | ID: mdl-25781462

ABSTRACT

The characterization and public release of genome sequences from thousands of organisms is expanding the scope for genetic variation studies. However, understanding the phenotypic consequences of genetic variation remains a challenge in eukaryotes due to the complexity of the genotype-phenotype map. One approach to this is the intensive study of model systems for which diverse sources of information can be accumulated and integrated. Saccharomyces cerevisiae is an extensively studied model organism, with well-known protein functions and thoroughly curated phenotype data. To develop and expand the available resources linking genomic variation with function in yeast, we aim to model the pan-genome of S. cerevisiae. To initiate the yeast pan-genome, we newly sequenced or re-sequenced the genomes of 25 strains that are commonly used in the yeast research community using advanced sequencing technology at high quality. We also developed a pipeline for automated pan-genome analysis, which integrates the steps of assembly, annotation, and variation calling. To assign strain-specific functional annotations, we identified genes that were not present in the reference genome. We classified these according to their presence or absence across strains and characterized each group of genes with known functional and phenotypic features. The functional roles of novel genes not found in the reference genome and associated with strains or groups of strains appear to be consistent with anticipated adaptations in specific lineages. As more S. cerevisiae strain genomes are released, our analysis can be used to collate genome data and relate it to lineage-specific patterns of genome evolution. Our new tool set will enhance our understanding of genomic and functional evolution in S. cerevisiae, and will be available to the yeast genetics and molecular biology community.

Subject(s)

Contig Mapping/methods , Genome, Fungal , Saccharomyces cerevisiae/genetics , Sequence Analysis, DNA/methods , Software

20.

Revealing mammalian evolutionary relationships by comparative analysis of gene clusters.

Song, Giltae; Riemer, Cathy; Dickins, Benjamin; Kim, Hie Lim; Zhang, Louxin; Zhang, Yu; Hsu, Chih-Hao; Hardison, Ross C; Green, Eric D; Miller, Webb.

Genome Biol Evol ; 4(4): 586-601, 2012.

Article in English | MEDLINE | ID: mdl-22454131

ABSTRACT

Many software tools for comparative analysis of genomic sequence data have been released in recent decades. Despite this, it remains challenging to determine evolutionary relationships in gene clusters due to their complex histories involving duplications, deletions, inversions, and conversions. One concept describing these relationships is orthology. Orthologs derive from a common ancestor by speciation, in contrast to paralogs, which derive from duplication. Discriminating orthologs from paralogs is a necessary step in most multispecies sequence analyses, but doing so accurately is impeded by the occurrence of gene conversion events. We propose a refined method of orthology assignment based on two paradigms for interpreting its definition: by genomic context or by sequence content. X-orthology (based on context) traces orthology resulting from speciation and duplication only, while N-orthology (based on content) includes the influence of conversion events. We developed a computational method for automatically mapping both types of orthology on a per-nucleotide basis in gene cluster regions studied by comparative sequencing, and we make this mapping accessible by visualizing the output. All of these steps are incorporated into our newly extended CHAP 2 package. We evaluate our method using both simulated data and real gene clusters (including the well-characterized α-globin and β-globin clusters). We also illustrate use of CHAP 2 by analyzing four more loci: CCL (chemokine ligand), IFN (interferon), CYP2abf (part of cytochrome P450 family 2), and KIR (killer cell immunoglobulin-like receptors). These new methods facilitate and extend our understanding of evolution at these and other loci by adding automated accurate evolutionary inference to the biologist's toolkit. The CHAP 2 package is freely available from http://www.bx.psu.edu/miller_lab.

Subject(s)

Evolution, Molecular , Mammals/genetics , Multigene Family , Proteins/genetics , Animals , Gene Conversion , Gene Duplication , Genome , Humans , Mammals/classification , Phylogeny
