1.
Nat Commun ; 13(1): 5412, 2022 09 15.
Article in English | MEDLINE | ID: mdl-36109518

ABSTRACT

Pangenomic studies may improve the completeness of the human reference genome (GRCh38) and promote precision medicine. Here, we use an automated pipeline for human pangenomic analysis to build a gastric cancer pan-genome from 185 paired deep-sequencing datasets (370 samples), and characterize gene presence-absence variations (PAVs) at the whole-genome level. The genes ACOT1, GSTM1, SIGLEC14 and UGT2B17 are identified as highly absent in the gastric cancer population. A set of genes is predicted from sequences that do not align to GRCh38. We successfully locate one of the predicted genes, GC0643, on chromosome 9q34.2. Overexpression of GC0643 significantly inhibits cell growth, cell migration and invasion, and cell cycle progression, and induces apoptosis in cancer cells. These tumor suppressor functions can be reversed by shGC0643 knockdown. GC0643 is approved by the NCBI database (GenBank: MW194843.1). Collectively, the robust pan-genome strategy provides a deeper understanding of gene PAVs in the human cancer genome.


Subject(s)
Stomach Neoplasms , Asian People/genetics , China , Genome, Human , Humans , Lectins/genetics , Receptors, Cell Surface/genetics , Stomach Neoplasms/genetics
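As a rough illustration of what a gene presence-absence variation (PAV) summary looks like, the sketch below computes per-gene absence frequencies from a small table of presence/absence calls. It is not the pipeline used in the study: the sample IDs, gene names, values and the 0.5 cut-off for "highly absent" are all invented for this example.

```python
from collections import Counter

# Toy presence/absence calls (True = gene present in that sample).
# Sample IDs, gene names and values are invented; they are not study data.
pav = {
    "sample_1": {"geneA": True, "geneB": False, "geneC": True},
    "sample_2": {"geneA": True, "geneB": False, "geneC": False},
    "sample_3": {"geneA": True, "geneB": True, "geneC": True},
    "sample_4": {"geneA": True, "geneB": False, "geneC": True},
}


def absence_frequency(pav_calls):
    """Return {gene: fraction of samples in which the gene is absent}."""
    absent = Counter()
    genes = set()
    for calls in pav_calls.values():
        for gene, present in calls.items():
            genes.add(gene)
            if not present:
                absent[gene] += 1
    n_samples = len(pav_calls)
    return {gene: absent[gene] / n_samples for gene in sorted(genes)}


def highly_absent(pav_calls, threshold=0.5):
    """List genes absent in at least `threshold` of samples.

    The threshold is an assumption made for this sketch only.
    """
    freq = absence_frequency(pav_calls)
    return [gene for gene, f in freq.items() if f >= threshold]


freq = absence_frequency(pav)
print(freq)
print(highly_absent(pav))
```

In practice the presence/absence calls would come from read coverage over each gene, which this sketch does not attempt to model.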
2.
G3 (Bethesda) ; 10(8): 2801-2809, 2020 08 05.
Article in English | MEDLINE | ID: mdl-32532800

ABSTRACT

Despite continuous updates of the human reference genome, there are still hundreds of unresolved gaps which account for about 5% of the total sequence length. Given the availability of whole genome de novo assemblies, especially those derived from long-read sequencing data, gap-closing sequences can be determined. By comparing 17 de novo long-read sequencing assemblies with the human reference genome, we identified a total of 1,125 gap-closing sequences for 132 (16.9% of 783) gaps and added up to 2.2 Mb novel sequences to the human reference genome. More than 90% of the non-redundant sequences could be verified by unmapped reads from the Simons Genome Diversity Project dataset. In addition, 15.6% of the non-reference sequences were found in at least one of four non-human primate genomes. We further demonstrated that the non-redundant sequences had high content of simple repeats and satellite sequences. Moreover, 43 (32.6%) of the 132 closed gaps were shown to be polymorphic; such sequences may play an important biological role and can be useful in the investigation of human genetic diversity.


Subject(s)
Genome, Human , High-Throughput Nucleotide Sequencing , Humans , Sequence Analysis, DNA
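The percentages quoted in the gap-closing abstract above follow directly from the reported counts. A short check, using only the numbers stated in the abstract (the helper name `percent` is ours):

```python
def percent(part, whole, digits=1):
    """Return part/whole as a percentage rounded to `digits` decimals."""
    return round(100 * part / whole, digits)


total_gaps = 783        # gaps considered in the reference genome
closed_gaps = 132       # gaps with at least one gap-closing sequence
polymorphic_gaps = 43   # closed gaps reported as polymorphic

closed_pct = percent(closed_gaps, total_gaps)
polymorphic_pct = percent(polymorphic_gaps, closed_gaps)
print(closed_pct, polymorphic_pct)
```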
3.
Genome Biol ; 20(1): 149, 2019 07 31.
Article in English | MEDLINE | ID: mdl-31366358

ABSTRACT

The human reference genome is still incomplete, especially for population-specific or individual-specific regions, which may have important functions. Here, we developed a HUman Pan-genome ANalysis (HUPAN) system to build the human pan-genome. We applied it to 185 deep-sequenced and 90 assembled Han Chinese genomes and detected 29.5 Mb of novel genomic sequences and at least 188 novel protein-coding genes missing from the human reference genome (GRCh38). It can be an important resource for human genome-related biomedical studies, such as cancer genome analysis. HUPAN is freely available at http://cgm.sjtu.edu.cn/hupan/ and https://github.com/SJTU-CGM/HUPAN .


Subject(s)
Genome, Human , Software , Asian People/genetics , Black People/genetics , High-Throughput Nucleotide Sequencing , Humans , Proteins/genetics , Sequence Analysis, DNA
4.
BMC Genomics ; 20(1): 595, 2019 Jul 19.
Article in English | MEDLINE | ID: mdl-31324156

ABSTRACT

BACKGROUND: Diversity-generating retroelements (DGRs) are a unique family of retroelements that generate DNA sequence diversity, benefiting their hosts by introducing variations and accelerating the evolution of target proteins. They are widespread in bacteria, archaea, phages and plasmids. However, our understanding of DGRs in natural environments is still very limited. RESULTS: We developed an efficient computational algorithm to identify DGRs and applied it to characterize DGRs in more than 80,000 sequenced bacterial genomes as well as more than 4,000 human metagenome datasets. In total, we identified 948 non-redundant DGRs, which expanded the number of known DGRs in bacterial genomes and human microbiomes by about 55% and provided a much more comprehensive reference for the study of DGRs. Phylogenetic analysis was performed on the identified DGRs. The putative target genes of the DGRs were searched for, and the functions of these target genes were investigated through a comprehensive alignment against the nr database. CONCLUSIONS: The DGR system is a powerful and universal mechanism for generating diversity. DGR evolution is closely associated with the living environment and with their cassette structures. Furthermore, the DGR system may impact a wide range of functional processes in addition to receptor binding. These results significantly improve our understanding of DGRs.


Subject(s)
Evolution, Molecular , Genetic Variation , Genomics , Metagenome/genetics , Retroelements/genetics , Algorithms , Bacteria/genetics , Humans , Microbiota/genetics
5.
Comput Toxicol ; 5: 38-51, 2018 Feb.
Article in English | MEDLINE | ID: mdl-30221212

ABSTRACT

Cigarette smoking entails chronic exposure to a mixture of harmful chemicals that trigger molecular changes over time, and is known to increase the risk of developing diseases. Risk assessment in the context of 21st century toxicology relies on the elucidation of mechanisms of toxicity and the identification of exposure response markers, usually from high-throughput data, using advanced computational methodologies. The sbv IMPROVER Systems Toxicology computational challenge (Fall 2015-Spring 2016) aimed to evaluate whether robust and sparse (≤40 genes) human (sub-challenge 1, SC1) and species-independent (sub-challenge 2, SC2) exposure response markers (so-called gene signatures) could be extracted from human and mouse blood transcriptomics data of current (S), former (FS) and never (NS) smoke-exposed subjects as predictors of smoking and cessation status. Best-performing computational methods were identified by scoring anonymized participants' predictions. Worldwide participation resulted in 12 (SC1) and six (SC2) final submissions qualified for scoring. The results showed that blood gene expression data were informative for predicting smoking exposure status (i.e. discriminating smokers from never or former smokers) in humans and across species with a high level of accuracy. By contrast, the prediction of cessation status (i.e. distinguishing FS from NS) remained challenging, as reflected by lower classification performances. Participants successfully developed inductive predictive models and extracted human and species-independent gene signatures, including genes with high consensus across teams. Post-challenge analyses highlighted "feature selection" as a key step in the process of building a classifier and confirmed the importance of testing a gene signature in independent cohorts to ensure the generalized applicability of a predictive model at a population-based level. In conclusion, the Systems Toxicology challenge demonstrated the feasibility of extracting a consistent blood-based smoke exposure response gene signature and further stressed the importance of independent and unbiased data and method evaluations to provide confidence in systems toxicology-based scientific conclusions.

6.
BMC Bioinformatics ; 19(Suppl 4): 162, 2018 05 08.
Article in English | MEDLINE | ID: mdl-29745853

ABSTRACT

BACKGROUND: Although rapidly developing sequencing technologies make it possible to use genotype data in clinical diagnosis, it is still challenging for clinicians to understand sequencing results and make correct judgements based on them. Previously, diagnosis based on clinical features held a leading position. With the establishment of the Human Phenotype Ontology (HPO) and the enrichment of phenotype-disease annotations, much more attention has been paid to improving phenotype-based diagnosis. RESULTS: In this study, we present a novel method called RelativeBestPair, which measures the similarity between query terms and hereditary diseases based on HPO and then ranks the candidate diseases. To evaluate its performance, we simulated a set of patients based on 44 complex diseases. In addition, by adding noise, imprecision, or both, we generated cases closer to real clinical conditions. Four simulated datasets were thus used to compare RelativeBestPair with seven existing semantic similarity measures. RelativeBestPair ranked the underlying disease first for 93.73% of the simulated dataset without noise or imprecision, 93.64% of the simulated dataset with noise and without imprecision, 39.82% of the simulated dataset without noise and with imprecision, and 33.64% of the simulated dataset with both noise and imprecision. CONCLUSION: Compared with the seven existing semantic similarity measures, RelativeBestPair showed similar performance on the two datasets without imprecision. RelativeBestPair performed on par with Resnik and better than the other six methods on the simulated dataset without noise and with imprecision, and it significantly outperformed all seven other methods on the simulated dataset with both noise and imprecision. These results indicate that RelativeBestPair might be of great help in clinical settings.


Subject(s)
Disease , Semantics , Computer Simulation , Databases as Topic , Humans , Phenotype
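The abstract above does not give the RelativeBestPair formula, so the sketch below does not attempt to reproduce it. Instead it illustrates the general idea of ontology-based disease ranking using Resnik similarity, one of the baseline measures the abstract names, combined with a one-sided best-match average over the query terms. The toy ontology, term names and disease annotations are hypothetical and are not real HPO content.

```python
import math

# Hypothetical toy ontology (child -> parents); not real HPO terms.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "A2": ["A"],
    "B1": ["B"],
}

# Hypothetical disease annotations (disease -> annotated terms).
DISEASES = {
    "disease_X": {"A1", "B1"},
    "disease_Y": {"A2"},
    "disease_Z": {"B1"},
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """IC(t) = log(N / n_t), where n_t counts diseases annotated with t
    or any descendant of t, and N is the number of diseases."""
    counts = {}
    for terms in DISEASES.values():
        covered = set().union(*(ancestors(t) for t in terms))
        for t in covered:
            counts[t] = counts.get(t, 0) + 1
    n = len(DISEASES)
    return {t: math.log(n / c) for t, c in counts.items()}


IC = information_content()


def resnik(t1, t2):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return max(IC.get(t, 0.0) for t in common)


def rank(query_terms):
    """Score each disease by the mean, over query terms, of the best
    Resnik match among the disease's terms; return sorted (name, score)."""
    scores = {}
    for disease, terms in DISEASES.items():
        best = [max(resnik(q, d) for d in terms) for q in query_terms]
        scores[disease] = sum(best) / len(best)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank(["A2", "B1"])
for disease, score in ranking:
    print(disease, round(score, 3))
```

With this toy data, the query terms `A2` and `B1` rank `disease_Y` first because `A2` is an exact, highly specific match for it. A real evaluation would use the HPO graph and curated phenotype-disease annotations.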
7.
Comput Toxicol ; 5: 31-37, 2018 Feb.
Article in English | MEDLINE | ID: mdl-29556588

ABSTRACT

Crowdsourcing has emerged as a framework to address methodological challenges in omics data analysis and assess the extent to which omics data are predictive of phenotypes of interest. The sbv IMPROVER Systems Toxicology Challenge was designed to leverage crowdsourcing to determine whether human blood gene expression levels are informative of current and past smoking. Participating teams were invited to use a training gene expression dataset to derive parsimonious models (up to 40 genes) that can accurately classify subjects into exposure groups: smokers, former smokers who quit for at least one year, and never-smokers. Teams were ranked based on two classification performance metrics evaluated on a blinded test dataset. The analytical approaches of the first- and third-ranked teams, which are presented in detail in this article, involved feature selection by moderated t-test or LASSO regression and linear discriminant analysis (LDA) and logistic regression classifiers, respectively. While the 12-gene signature of the top team allowed the classification of current smokers with 100% sensitivity at 93% specificity, discriminating former smokers from never-smokers was much more challenging (65% sensitivity at 57% specificity). Gene ontology molecular functions and KEGG pathways associated with current smoking included G protein-coupled receptor activity, signaling receptor activity, calcium ion binding, and the Neuroactive ligand-receptor interaction pathway. Selecting marker genes by either a moderated t-test or multivariate LASSO regression, followed by LDA or logistic regression, is a robust approach to classification with omics data, confirming in part the findings of previous sbv IMPROVER challenges. While current smoking is accurately identified based on blood mRNA levels, smoking cessation for more than one year is accompanied by a "normalization" of the expression of certain mRNAs, making it difficult to distinguish former smokers from never-smokers.
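The sensitivity and specificity figures quoted above are standard confusion-matrix ratios. A minimal sketch of how they are computed for a two-class problem follows. The labels and predictions are invented for illustration and are not challenge data, and the two performance metrics actually used to rank teams are not specified in the abstract.

```python
def sensitivity_specificity(y_true, y_pred, positive):
    """Return (sensitivity, specificity) treating `positive` as the
    positive class and every other label as negative."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    return tp / (tp + fn), tn / (tn + fp)


# Invented example: "S" = current smoker, "NS" = never-smoker.
y_true = ["S", "S", "S", "NS", "NS", "NS", "NS"]
y_pred = ["S", "S", "S", "NS", "NS", "NS", "S"]

sens, spec = sensitivity_specificity(y_true, y_pred, positive="S")
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Here every true smoker is detected (sensitivity 1.00), while one of four never-smokers is misclassified as a smoker (specificity 0.75).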
