1.
Nucleic Acids Res ; 52(D1): D622-D632, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37930845

ABSTRACT

Modern medicine is increasingly focused on personalized medicine, and multi-omics data is crucial in understanding biological phenomena and disease mechanisms. Each ethnic group has its unique genetic background with specific genomic variations influencing disease risk and drug response. Therefore, multi-omics data from specific ethnic populations are essential for the effective implementation of personalized medicine. Various prospective cohort studies, such as the UK Biobank, All of Us and Lifelines, have been conducted worldwide. The Tohoku Medical Megabank project was initiated after the Great East Japan Earthquake in 2011. It collects biological specimens and conducts genome and omics analyses to build a basis for personalized medicine. Summary statistical data from these analyses are available in the jMorp web database (https://jmorp.megabank.tohoku.ac.jp), which provides a multidimensional approach to the diversity of the Japanese population. jMorp was launched in 2015 as a public database for plasma metabolome and proteome analyses and has been continuously updated. The current update will significantly expand the scale of the data (metabolome, genome, transcriptome, and metagenome). In addition, the user interface and backend server implementations were rewritten to improve the connectivity between the items stored in jMorp. This paper provides an overview of the new version of the jMorp.


Subjects
Genetic Databases , Multiomics , Population , Precision Medicine , Humans , Genomics/methods , Japan , Prospective Studies , Population/genetics
2.
J Hum Genet ; 2024 Jun 25.
Article in English | MEDLINE | ID: mdl-38918526

ABSTRACT

Widely used genotype imputation methods are based on the Li and Stephens model, which assumes that new haplotypes can be represented by modifying existing haplotypes in a reference panel through mutations and recombinations. These methods use genotypes from SNP arrays as inputs to estimate haplotypes that align with the input genotypes by analyzing recombination patterns within a reference panel, and then infer unobserved variants. While these methods require reference panels in an identifiable form, their public use is limited due to privacy and consent concerns. One strategy to overcome these limitations is to use de-identified haplotype information, such as summary statistics or model parameters. Advances in deep learning (DL) offer the potential to develop imputation methods that use haplotype information in a reference-free manner by handling it as model parameters, while maintaining comparable imputation accuracy to methods based on the Li and Stephens model. Here, we provide a brief introduction to DL-based reference-free genotype imputation methods, including RNN-IMP, developed by our research group. We then evaluate the performance of RNN-IMP against widely used Li and Stephens model-based imputation methods in terms of accuracy (R²), using the 1000 Genomes Project Phase 3 dataset and corresponding simulated Omni2.5 SNP genotype data. Although RNN-IMP is sensitive to missing values in input genotypes, we propose a two-stage imputation strategy: missing genotypes are first imputed using denoising autoencoders; RNN-IMP then processes these imputed genotypes. This approach restores the imputation accuracy that is degraded by missing values, enhancing the practical use of RNN-IMP.
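
Note: the accuracy measure named in this abstract (R²) is commonly computed per variant as the squared Pearson correlation between true and imputed allele dosages. The sketch below is a minimal illustration under that assumption; the function name and the toy dosages are hypothetical and are not taken from the paper.

```python
def imputation_r2(true_dosages, imputed_dosages):
    """Squared Pearson correlation between true and imputed dosages for one variant."""
    n = len(true_dosages)
    if n != len(imputed_dosages) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")
    mean_t = sum(true_dosages) / n
    mean_i = sum(imputed_dosages) / n
    cov = sum((t - mean_t) * (i - mean_i)
              for t, i in zip(true_dosages, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_dosages)
    var_i = sum((i - mean_i) ** 2 for i in imputed_dosages)
    if var_t == 0 or var_i == 0:
        return float("nan")  # undefined when either vector has no variation
    return cov * cov / (var_t * var_i)


# Toy example: true genotypes (0/1/2 copies of the alternate allele)
# against fractional imputed dosages for the same four samples.
r2 = imputation_r2([0, 1, 2, 1], [0.1, 0.9, 1.8, 1.2])
```

Evaluations of this kind usually average the per-variant value within minor allele frequency bins, because rare variants tend to be imputed less accurately.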

3.
PLoS Comput Biol ; 16(10): e1008207, 2020 10.
Article in English | MEDLINE | ID: mdl-33001993

ABSTRACT

Genotype imputation estimates the genotypes of unobserved variants using the genotype data of other observed variants based on a collection of haplotypes for thousands of individuals, which is known as a haplotype reference panel. In general, more accurate imputation results were obtained using a larger size of haplotype reference panel. Most of the existing genotype imputation methods explicitly require the haplotype reference panel in precise form, but the accessibility of haplotype data is often limited, due to the requirement of agreements from the donors. Since de-identified information such as summary statistics or model parameters can be used publicly, imputation methods using de-identified haplotype reference information might be useful to enhance the quality of imputation results under the condition where the access of the haplotype data is limited. In this study, we proposed a novel imputation method that handles the reference panel as its model parameters by using bidirectional recurrent neural network (RNN). The model parameters are presented in the form of de-identified information from which the restoration of the genotype data at the individual-level is almost impossible. We demonstrated that the proposed method provides comparable imputation accuracy when compared with the existing imputation methods using haplotype datasets from the 1000 Genomes Project (1KGP) and the Haplotype Reference Consortium. We also considered a scenario where a subset of haplotypes is made available only in de-identified form for the haplotype reference panel. In the evaluation using the 1KGP dataset under the scenario, the imputation accuracy of the proposed method is much higher than that of the existing imputation methods. We therefore conclude that our RNN-based method is quite promising to further promote the data-sharing of sensitive genome data under the recent movement for the protection of individuals' privacy.


Subjects
Genotype , Haplotypes/genetics , Neural Networks (Computer) , Single Nucleotide Polymorphism/genetics , Genetic Databases , Genomics , Genetic Models
4.
Hum Mol Genet ; 26(3): 650-659, 2017 02 01.
Article in English | MEDLINE | ID: mdl-28062665

ABSTRACT

A previous genome-wide association study (GWAS) performed in 963 Japanese individuals (487 primary biliary cholangitis [PBC] cases and 476 healthy controls) identified TNFSF15 (rs4979462) and POU2AF1 (rs4938534) as strong susceptibility loci for PBC. In this study, we performed GWAS in an additional 1,923 Japanese individuals (894 PBC cases and 1,029 healthy controls), and combined the results with the previous data. This GWAS, together with a subsequent replication study in an independent set of 7,024 Japanese individuals (512 PBC cases and 6,512 healthy controls), identified PRKCB (rs7404928) as a novel susceptibility locus for PBC (odds ratio [OR] = 1.26, P = 4.13 × 10⁻⁹). Furthermore, a primary functional variant of PRKCB (rs35015313) was identified by genotype imputation using a phased panel of 1,070 Japanese individuals from a prospective, general population cohort study and subsequent in vitro functional analyses. These results may lead to improved understanding of the disease pathways involved in PBC, forming a basis for prevention of PBC and development of novel therapeutics.


Subjects
Genetic Predisposition to Disease , Genome-Wide Association Study , Biliary Liver Cirrhosis/genetics , Protein Kinase C beta/genetics , Asian People , Female , Genotype , Humans , Japan , Biliary Liver Cirrhosis/pathology , Male , Single Nucleotide Polymorphism
5.
Hum Genet ; 138(4): 389-409, 2019 Apr.
Article in English | MEDLINE | ID: mdl-30887117

ABSTRACT

Incidence rates of Mendelian diseases vary among ethnic groups, and frequencies of variant types of causative genes also vary among human populations. In this study, we examined to what extent we can predict population frequencies of recessive disorders from genomic data, and explored better strategies for variant interpretation and classification. We used a whole-genome reference panel from 3552 general Japanese individuals constructed by the Tohoku Medical Megabank Organization (ToMMo). Focusing on 32 genes for 17 congenital metabolic disorders included in newborn screening (NBS) in Japan, we identified reported and predicted pathogenic variants through variant annotation, interpretation, and multiple ways of classifications. The estimated carrier frequencies were compared with those from the Japanese NBS data based on 1,949,987 newborns from a previous study. The estimated carrier frequency based on genomic data with a recent guideline of variant interpretation for the PAH gene, in which defects cause hyperphenylalaninemia (HPA) and phenylketonuria (PKU), provided a closer estimate to that by the observed incidence than the other methods. In contrast, the estimated carrier frequencies for SLC25A13, which causes citrin deficiency, were much higher compared with the incidence rate. The results varied greatly among the 11 NBS diseases with single responsible genes; the possible reasons for departures from the carrier frequencies by reported incidence rates were discussed. Of note, (1) the number of pathogenic variants increases by including additional lines of evidence, (2) common variants with mild effects also contribute to the actual frequency of patients, and (3) penetrance of each variant remains unclear.


Subjects
Inborn Genetic Diseases/diagnosis , Inborn Genetic Diseases/genetics , Newborn Infant Diseases/diagnosis , Newborn Infant Diseases/genetics , Neonatal Screening/methods , Asian People/genetics , Asian People/statistics & numerical data , Cohort Studies , Female , Gene Frequency , Inborn Genetic Diseases/epidemiology , Genome-Wide Association Study/standards , Heterozygote , Humans , Incidence , Newborn Infant , Newborn Infant Diseases/epidemiology , Japan/epidemiology , Male , Reference Standards
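
Note: comparing carrier frequencies from genomic data with incidence from newborn screening, as entry 5 does, relies on a conversion between the two. A textbook way to do this, assumed here and not necessarily the paper's exact procedure, is Hardy-Weinberg equilibrium for a fully penetrant autosomal recessive disorder: incidence = q² and carrier frequency = 2q(1 − q), where q is the summed frequency of pathogenic alleles. The function names and the example incidence are hypothetical.

```python
from math import sqrt


def carrier_frequency_from_incidence(incidence):
    """Carrier frequency 2q(1-q), with q = sqrt(incidence) under Hardy-Weinberg."""
    if not 0 <= incidence <= 1:
        raise ValueError("incidence must be a proportion between 0 and 1")
    q = sqrt(incidence)          # pathogenic allele frequency
    return 2 * q * (1 - q)


def incidence_from_allele_frequency(q):
    """Expected incidence q**2 of affected (homozygous or compound heterozygous) births."""
    if not 0 <= q <= 1:
        raise ValueError("allele frequency must be between 0 and 1")
    return q * q


# Hypothetical disorder seen in 1 of 10,000 newborns:
# q = 0.01, so roughly 1 person in 50 is expected to be a carrier.
carriers = carrier_frequency_from_incidence(1 / 10_000)
```

The three caveats listed at the end of the abstract (additional evidence changing the variant set, mild common variants, and unclear penetrance) are exactly the points at which this simple model can depart from observed incidence.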
6.
J Am Soc Nephrol ; 29(8): 2189-2199, 2018 08.
Article in English | MEDLINE | ID: mdl-30012571

ABSTRACT

Background: Nephrotic syndrome is the most common cause of chronic glomerular disease in children. Most of these patients develop steroid-sensitive nephrotic syndrome (SSNS), but the loci conferring susceptibility to childhood SSNS are mainly unknown. Methods: We conducted a genome-wide association study (GWAS) in the Japanese population; 224 patients with childhood SSNS and 419 adult healthy controls were genotyped using the Affymetrix Japonica Array in the discovery stage. Imputation for six HLA genes (HLA-A, -C, -B, -DRB1, -DQB1, and -DPB1) was conducted on the basis of Japanese-specific references. We performed genotyping for HLA-DRB1/-DQB1 using a sequence-specific oligonucleotide-probing method on a Luminex platform. Whole-genome imputation was conducted using a phased reference panel of 2049 healthy Japanese individuals. Replication was performed in an independent Japanese sample set including 216 patients and 719 healthy controls. We genotyped candidate single-nucleotide polymorphisms using the DigiTag2 assay. Results: The most significant association was detected in the HLA-DR/DQ region and replicated (rs4642516 [minor allele G], combined P_allelic = 7.84 × 10⁻²³; odds ratio [OR], 0.33; 95% confidence interval [95% CI], 0.26 to 0.41; rs3134996 [minor allele A], combined P_allelic = 1.72 × 10⁻²⁵; OR, 0.29; 95% CI, 0.23 to 0.37). HLA-DRB1*08:02 (Pc = 1.82 × 10⁻⁹; OR, 2.62; 95% CI, 1.94 to 3.54) and HLA-DQB1*06:04 (Pc = 2.09 × 10⁻¹²; OR, 0.10; 95% CI, 0.05 to 0.21) were considered primary HLA alleles associated with childhood SSNS. HLA-DRB1*08:02-DQB1*03:02 (Pc = 7.01 × 10⁻¹¹; OR, 3.60; 95% CI, 2.46 to 5.29) was identified as the most significant genetic susceptibility factor. Conclusions: The most significant association with childhood SSNS was detected in the HLA-DR/DQ region. Further HLA allele/haplotype analyses should enhance our understanding of molecular mechanisms underlying SSNS.


Subjects
Genetic Predisposition to Disease , HLA-DQ Antigens/genetics , HLA-DQ beta-Chains/genetics , HLA-DRB1 Chains/genetics , Nephrotic Syndrome/genetics , Adult , Case-Control Studies , Child , Female , Genome-Wide Association Study , HLA-DQ beta-Chains/immunology , Haplotypes , Humans , Japan , Male , Nephrotic Syndrome/drug therapy , Nephrotic Syndrome/immunology , Single Nucleotide Polymorphism , Reference Values , Steroids/therapeutic use
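
Note: the odds ratios and 95% confidence intervals quoted in entry 6 are allelic association statistics. As a minimal sketch, an odds ratio and a Wald-type interval on the log scale can be obtained from a 2×2 table of allele counts as follows. The counts in the example are invented for illustration, and the paper may have used a different interval method.

```python
from math import exp, log, sqrt


def odds_ratio_ci(case_minor, case_major, ctrl_minor, ctrl_major, z=1.96):
    """Return (OR, lower, upper) using the log-OR standard error (Woolf method)."""
    counts = (case_minor, case_major, ctrl_minor, ctrl_major)
    if min(counts) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (case_minor * ctrl_major) / (case_major * ctrl_minor)
    # Standard error of log(OR): square root of the summed reciprocal counts.
    se = sqrt(sum(1 / c for c in counts))
    lower = exp(log(odds_ratio) - z * se)
    upper = exp(log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Invented allele counts: the minor allele is rarer in cases than in controls,
# which gives a protective OR below 1.
or_value, ci_low, ci_high = odds_ratio_ci(40, 400, 200, 600)
```

An OR below 1 with an interval that excludes 1, as reported for rs4642516 and rs3134996, indicates that the minor allele is associated with reduced risk.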
7.
Genes Chromosomes Cancer ; 57(2): 51-60, 2018 02.
Article in English | MEDLINE | ID: mdl-29044863

ABSTRACT

Ovarian clear cell carcinoma (OCCC) is the most refractory subtype of ovarian cancer and more prevalent in Japanese than Caucasians (25% and 5% of all ovarian cancer, respectively). The aim of this study is to discover the genomic alterations that may cause OCCC and effective molecular targets for chemotherapy. Paired genomic DNAs of 48 OCCC tissues and corresponding noncancerous tissues were extracted from formalin-fixed, paraffin-embedded specimens collected between 2007 and 2015 at Tohoku University Hospital. All specimens underwent exome sequencing and the somatic genetic alterations were identified. We divided the cases into three clusters based on the mutation spectra. Clinical characteristics such as age of onset and endometriosis are similar among the clusters, but one cluster shows mutations related to APOBEC activation, indicating its contribution to a subset of OCCC cases. There are three hypermutated cases (showing 12-fold or higher somatic mutations than the other 45 cases) and they have germline and somatic mismatch repair gene alterations. The frequently mutated genes are ARID1A (66.7%), PIK3CA (50%), PPP2R1A (18.8%), and KRAS (16.7%). Somatic mutations important for selection of chemotherapeutic agents, such as BRAF, ERBB2, PDGFRB, PGR, and KRAS, are found in 27.1% of OCCC cases, indicating the clinical importance of exome analysis for OCCC. Our study suggests that the genetic instability caused by either mismatch repair defect or activation of APOBEC plays a critical role in OCCC carcinogenesis.


Subjects
Clear Cell Adenocarcinoma/genetics , Ovarian Neoplasms/genetics , Adult , Class I Phosphatidylinositol 3-Kinases/genetics , DNA Mismatch Repair , DNA-Binding Proteins , Exome , Female , High-Throughput Nucleotide Sequencing/methods , Humans , Middle Aged , Mutation , Nuclear Proteins/genetics , Protein Phosphatase 2/genetics , Proto-Oncogene Proteins p21(ras)/genetics , Transcription Factors/genetics
8.
BMC Genomics ; 19(1): 551, 2018 Jul 24.
Article in English | MEDLINE | ID: mdl-30041597

ABSTRACT

BACKGROUND: Genotype imputation from single-nucleotide polymorphism (SNP) genotype data using a haplotype reference panel consisting of thousands of unrelated individuals from populations of interest can help to identify strongly associated variants in genome-wide association studies. The Tohoku Medical Megabank (TMM) project was established to support the development of precision medicine, together with the whole-genome sequencing of 1070 human genomes from individuals in the Miyagi region (Northeast Japan) and the construction of the 1070 Japanese genome reference panel (1KJPN). Here, we investigated the performance of 1KJPN for genotype imputation of Japanese samples not included in the TMM project and compared it with other population reference panels. RESULTS: We found that the 1KJPN population was more similar to other Japanese populations, Nagahama (south-central Japan) and Aki (Shikoku Island), than to East Asian populations in the 1000 Genomes Project other than JPT, suggesting that the large-scale collection (more than 1000) of Japanese genomes from the Miyagi region covered many of the genetic variations of Japanese in mainland Japan. Moreover, 1KJPN outperformed the phase 3 reference panel of the 1000 Genomes Project (1KGPp3) for Japanese samples, and 1KJPN showed similar imputation rates for the TMM and other Japanese samples for SNPs with minor allele frequencies (MAFs) higher than 1%. CONCLUSIONS: 1KJPN covered most of the variants found in the samples from areas of the Japanese mainland outside the Miyagi region, implying 1KJPN is representative of the Japanese population's genomes. 1KJPN and successive reference panels are useful genome reference panels for the mainland Japanese population. Importantly, the addition of whole genome sequences not included in the 1KJPN panel improved imputation efficiencies for SNPs with MAFs under 1% for samples from most regions of the Japanese archipelago.


Subjects
Asian People/genetics , Human Genome , Single Nucleotide Polymorphism , Genotype , Humans , Japan
9.
Gastroenterology ; 152(6): 1383-1394, 2017 05.
Article in English | MEDLINE | ID: mdl-28163062

ABSTRACT

BACKGROUND & AIMS: There is still a risk for hepatocellular carcinoma (HCC) development after eradication of hepatitis C virus (HCV) infection with antiviral agents. We investigated genetic factors associated with the development of HCC in patients with a sustained virologic response (SVR) to treatment for chronic HCV infection. METHODS: We obtained genomic DNA from 457 patients in Japan with a SVR to interferon-based treatment for chronic HCV infection from 2007 through 2015. We conducted a genome-wide association study (GWAS), followed by a replication analysis of 79 candidate single nucleotide polymorphisms (SNPs) in an independent set of 486 patients in Japan. The study end point was HCC diagnosis or confirmation of lack of HCC (at follow-up examinations until December 2014 in the GWAS cohort, and until January 2016 in the replication cohort). We collected clinical and laboratory data from all patients. We analyzed expression levels of candidate gene variants in human hepatic stellate cells, rats with steatohepatitis caused by a choline-deficient L-amino acid-defined diet, and a mouse model of liver injury caused by administration of carbon tetrachloride. We also analyzed expression levels in liver tissues of patients with chronic HCV infection with different stages of fibrosis or tumors vs patients without HCV infection (controls). RESULTS: We found a strong association between the SNP rs17047200, located within the intron of the tolloid like 1 gene (TLL1) on chromosome 4, and development of HCC; there was a genome-wide level of significance when the results of the GWAS and replication study were combined (odds ratio, 2.37; P = 2.66 × 10⁻⁸). Multivariate analysis showed rs17047200 AT/TT to be an independent risk factor for HCC (hazard ratio, 1.78; P = .008), along with male sex, older age, lower level of albumin, advanced stage of hepatic fibrosis, presence of diabetes, and higher post-treatment level of α-fetoprotein.
Combining the rs17047200 genotype with other factors, we developed prediction models for HCC development in patients with mild or advanced hepatic fibrosis. Levels of TLL1 messenger RNA (mRNA) in human hepatic stellate cells increased with activation. Levels of Tll1 mRNA increased in liver tissues of rodents with hepatic fibrogenesis compared with controls. Levels of TLL1 mRNA increased in liver tissues of patients with progression of fibrosis. Gene expression levels of TLL1 short variants, including isoform 2, were higher in patients with rs17047200 AT/TT. CONCLUSIONS: In a GWAS, we identified the association between the SNP rs17047200, within the intron of TLL1, and development of HCC in patients who achieved an SVR to treatment for chronic HCV infection. We found levels of Tll1/TLL1 mRNA to be increased in rodent models of liver injury and liver tissues of patients with fibrosis, compared with controls. We propose that this SNP might affect splicing of TLL1 mRNA, yielding short variants with high catalytic activity that accelerates hepatic fibrogenesis and carcinogenesis. Further studies are needed to determine how rs17047200 affects TLL1 mRNA levels, splicing, and translation, as well as the prevalence of this variant among other patients with HCC. Tests for the TLL1 SNP might be used to identify patients at risk for HCC after an SVR to treatment of HCV infection.


Subjects
Hepatocellular Carcinoma/genetics , Fatty Liver/genetics , Chronic Hepatitis C/genetics , Liver Neoplasms/genetics , Messenger RNA/metabolism , Tolloid-Like Metalloproteinases/genetics , Age Factors , Aged , Animals , Antiviral Agents/therapeutic use , Carbon Tetrachloride , Hepatocellular Carcinoma/metabolism , Hepatocellular Carcinoma/virology , Case-Control Studies , Choline/administration & dosage , Diabetes Complications/complications , Fatty Liver/etiology , Female , Genome-Wide Association Study , Hepatic Stellate Cells/metabolism , Chronic Hepatitis C/drug therapy , Chronic Hepatitis C/metabolism , Humans , Introns , Liver Cirrhosis/genetics , Liver Cirrhosis/metabolism , Liver Neoplasms/metabolism , Liver Neoplasms/virology , Male , Mice , Middle Aged , Single Nucleotide Polymorphism , Rats , Risk Factors , Serum Albumin/metabolism , Sex Factors , Sustained Virologic Response , alpha-Fetoproteins/metabolism
10.
J Hum Genet ; 63(2): 213-230, 2018 Feb.
Article in English | MEDLINE | ID: mdl-29192238

ABSTRACT

Clarifying allele frequencies of disease-related genetic variants in a population is important in genomic medicine; however, such data is not yet available for the Japanese population. To estimate frequencies of actionable pathogenic variants in the Japanese population, we examined the reported pathological variants in genes recommended by the American College of Medical Genetics and Genomics (ACMG) in our reference panel of genomic variations, 2KJPN, which was created by whole-genome sequencing of 2049 individuals of the resident cohort of the Tohoku Medical Megabank Project. We searched for pathogenic variants in 2KJPN for 57 autosomal ACMG-recommended genes responsible for 26 diseases and then examined their frequencies. By referring to public databases of pathogenic variations, we identified 143 reported pathogenic variants in 2KJPN for the 57 ACMG recommended genes based on a classification system. At the individual level, 21% of the individuals were found to have at least one reported pathogenic allele. We then conducted a literature survey to review the variants and to check for evidence of pathogenicity. Our results suggest that a substantial number of people have reported pathogenic alleles for the ACMG genes, and reviewing variants is indispensable for constructing the information infrastructure of genomic medicine for the Japanese population.


Subjects
Alleles , Nucleic Acid Databases , Gene Frequency , Genome-Wide Association Study , Mutation , Asian People , Female , Humans , Japan , Male , Prospective Studies
11.
J Hum Genet ; 62(4): 485-489, 2017 Apr.
Article in English | MEDLINE | ID: mdl-28100913

ABSTRACT

A genome-wide association study (GWAS) for cold medicine-related Stevens-Johnson syndrome (CM-SJS) with severe ocular complications (SOC) was performed in a Japanese population. A recently developed ethnicity-specific array with genome-wide imputation that was based on the whole-genome sequences of 1070 unrelated Japanese individuals was used. Validation analysis with additional samples from Japanese individuals and replication analysis using samples from Korean individuals identified two new susceptibility loci on chromosomes 15 and 16. This study might suggest the usefulness of GWAS using the ethnicity-specific array and genome-wide imputation based on large-scale whole-genome sequences. Our findings contribute to the understanding of genetic predisposition to CM-SJS with SOC.


Subjects
Eye Diseases/genetics , HLA-A2 Antigen/genetics , Recombinases/genetics , Stevens-Johnson Syndrome/genetics , Toll-Like Receptor 3/genetics , Adolescent , Adult , Aged , Aged 80 and over , Asian People , Cell Cycle Proteins , Child , Ethnicity , Eye Diseases/chemically induced , Eye Diseases/pathology , Female , Genetic Predisposition to Disease , Genome-Wide Association Study , Genotype , Humans , Japan , Male , Middle Aged , Multi-Ingredient Cold, Flu, and Allergy Medications/adverse effects , Single Nucleotide Polymorphism , Stevens-Johnson Syndrome/complications , Stevens-Johnson Syndrome/pathology
12.
BMC Genomics ; 17 Suppl 1: 2, 2016 Jan 11.
Article in English | MEDLINE | ID: mdl-26818838

ABSTRACT

BACKGROUND: RNA-sequencing (RNA-Seq) has become a popular tool for transcriptome profiling in mammals. However, accurate estimation of allele-specific expression (ASE) based on alignments of reads to the reference genome is challenging, because it contains only one allele on a mosaic haploid genome. Even with the information of diploid genome sequences, precise alignment of reads to the correct allele is difficult because of the high-similarity between the corresponding allele sequences. RESULTS: We propose a Bayesian approach to estimate ASE from RNA-Seq data with diploid genome sequences. In the statistical framework, the haploid choice is modeled as a hidden variable and estimated simultaneously with isoform expression levels by variational Bayesian inference. Through the simulation data analysis, we demonstrate the effectiveness of the proposed approach in terms of identifying ASE compared to the existing approach. We also show that our approach enables better quantification of isoform expression levels compared to the existing methods, TIGAR2, RSEM and Cufflinks. In the real data analysis of the human reference lymphoblastoid cell line GM12878, some autosomal genes were identified as ASE genes, and skewed paternal X-chromosome inactivation in GM12878 was identified. CONCLUSIONS: The proposed method, called ASE-TIGAR, enables accurate estimation of gene expression from RNA-Seq data in an allele-specific manner. Our results show the effectiveness of utilizing personal genomic information for accurate estimation of ASE. An implementation of our method is available at http://nagasakilab.csml.org/ase-tigar .


Subjects
Gene Expression Regulation , Human Genome , RNA/metabolism , Algorithms , Alleles , Bayes Theorem , Tumor Cell Line , Diploidy , Humans , Protein Isoforms/genetics , Protein Isoforms/metabolism , Proteins/genetics , Proteins/metabolism , RNA/chemistry , RNA/genetics , RNA Sequence Analysis
13.
BMC Genomics ; 17(1): 991, 2016 12 03.
Article in English | MEDLINE | ID: mdl-27912743

ABSTRACT

BACKGROUND: In the estimation of repeat numbers in a short tandem repeat (STR) region from high-throughput sequencing data, two types of strategies are mainly taken: a strategy based on counting repeat patterns included in sequence reads spanning the region and a strategy based on estimating the difference between the actual insert size and the insert size inferred from paired-end reads. The quality of sequence alignment is crucial, especially in the former approaches although usual alignment methods have difficulty in STR regions due to insertions and deletions caused by the variations of repeat numbers. RESULTS: We proposed a new dynamic programming based realignment method named STR-realigner that considers repeat patterns in STR regions as prior knowledge. By allowing the size change of repeat patterns with low penalty in STR regions, accurate realignment is expected. For the performance evaluation, publicly available STR variant calling tools were applied to three types of aligned reads: synthetically generated sequencing reads aligned with BWA-MEM, those realigned with STR-realigner, those realigned with ReviSTER, and those realigned with GATK IndelRealigner. From the comparison of root mean squared errors between estimated and true STR region size, the results for the dataset realigned with STR-realigner are better than those for other cases. For real data analysis, we used a real sequencing dataset from Illumina HiSeq 2000 for a parent-offspring trio. RepeatSeq and lobSTR were applied to the sequence reads for these individuals aligned with BWA-MEM, those realigned with STR-realigner, ReviSTER, and GATK IndelRealigner. STR-realigner shows the best performance in terms of consistency of the size of estimated STR regions in Mendelian inheritance. 
Root mean squared error values were also calculated from the comparison of these estimated results with STR region sizes obtained from high coverage PacBio sequencing data, and the results from the realigned sequencing data with STR-realigner showed the least (the best) root mean squared error value. CONCLUSIONS: The effectiveness of the proposed realignment method for STR regions was verified from the comparison with an existing method on both simulation datasets and real whole genome sequencing dataset.


Subjects
Microsatellite Repeats , Sequence Alignment/methods , Software , Algorithms , Computational Biology/methods , Genomics/methods , High-Throughput Nucleotide Sequencing , Reproducibility of Results , DNA Sequence Analysis/methods
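
Note: the first strategy described in entry 13, counting repeat patterns inside reads that span the STR region, can be illustrated with a very small sketch. It finds the longest uninterrupted run of a given motif in a single read. This is only a toy version of the counting idea: it assumes the motif is known and ignores sequencing errors and alignment, which is precisely the part that STR-realigner addresses. The function name and the example read are hypothetical.

```python
import re


def longest_tandem_run(read, motif):
    """Largest number of back-to-back copies of `motif` found anywhere in `read`."""
    if not motif:
        raise ValueError("motif must be a non-empty string")
    # The lookahead makes matches start at every position, so a long run is
    # not hidden by an earlier, shorter match that overlaps its first bases.
    pattern = re.compile(r"(?=((?:%s)+))" % re.escape(motif))
    best = 0
    for match in pattern.finditer(read):
        best = max(best, len(match.group(1)) // len(motif))
    return best


# Hypothetical read with three consecutive CAG units flanked by unique sequence.
copies = longest_tandem_run("ACGTCAGCAGCAGTTCAG", "CAG")
```

A count taken this way is only trustworthy when the read covers the whole repeat plus some flanking sequence on both sides, which is why this strategy is limited by read length.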
14.
BMC Genomics ; 17(1): 745, 2016 Sep 21.
Article in English | MEDLINE | ID: mdl-27654840

ABSTRACT

BACKGROUND: Genome-wide association studies have revealed associations between single-nucleotide polymorphisms (SNPs) and phenotypes such as disease symptoms and drug tolerance. To address the small sample size for rare variants, association studies tend to group gene or pathway level variants and evaluate the effect on the set of variants. One such strategy, known as the sequence kernel association test (SKAT), is a widely used collapsing method. However, the reported p-values from SKAT tend to be biased because the asymptotic property of the statistic is used to calculate the p-value. Although this bias can be corrected by applying permutation procedures for the test statistics, the computational cost of obtaining p-values with high resolution is prohibitive. RESULTS: To address this problem, we devise an adaptive SKAT procedure termed AP-SKAT that efficiently classifies significant SNP sets and ranks them according to the permuted p-values. Our procedure adaptively stops the permutation test when the significance level is outside some confidence interval of the estimated p-value for a binomial distribution. To evaluate the performance, we first compare the power and sample size calculation and the type I error rates estimate of SKAT, SKAT-O, and the proposed procedure using genotype data in the SKAT R package and from the 1000 Genomes Project. Through computational experiments using whole genome sequencing and SNP array data, we show that our proposed procedure is highly efficient and has comparable accuracy to the standard procedure. CONCLUSIONS: For several types of genetic data, the developed procedure could achieve competitive power and sample size under small and large sample size conditions with controlling considerable type I error rates, and estimate p-values of significant SNP sets that are consistent with those estimated by the standard permutation test within a realistic time.
This demonstrates that the procedure is sufficiently powerful for recent whole genome sequencing and SNP array data with increasing numbers of phenotypes. Additionally, this procedure can be used in other association tests by employing alternative methods to calculate the statistics.
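
Note: the stopping rule described in entry 14 (stop permuting once the significance level falls outside a confidence interval for the binomially distributed p-value estimate) can be sketched generically. The code below is not AP-SKAT: it swaps in a simple difference-of-means statistic instead of the SKAT statistic, and the Wilson interval, batch size, and z value are assumptions chosen for illustration.

```python
import random
from math import sqrt


def wilson_interval(k, n, z=2.576):
    """Wilson score interval for a binomial proportion k/n (z=2.576 is about 99%)."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def adaptive_permutation_test(values, labels, alpha=0.05, batch=100,
                              max_perm=10_000, seed=0):
    """Permutation p-value for a 0/1 group difference, with early stopping."""
    rng = random.Random(seed)

    def stat(lbls):
        g1 = [v for v, l in zip(values, lbls) if l == 1]
        g0 = [v for v, l in zip(values, lbls) if l == 0]
        return abs(sum(g1) / len(g1) - sum(g0) / len(g0))

    observed = stat(labels)
    shuffled = list(labels)
    exceed = n = 0
    while n < max_perm:
        for _ in range(batch):
            rng.shuffle(shuffled)          # permute the group labels
            if stat(shuffled) >= observed:
                exceed += 1
        n += batch
        low, high = wilson_interval(exceed, n)
        if alpha < low or alpha > high:    # decision is already clear: stop
            break
    # Add-one correction keeps the estimate away from an impossible p of 0.
    return (exceed + 1) / (n + 1), n
```

With clearly separated groups the loop typically ends after a few hundred permutations, whereas a borderline result keeps running until `max_perm`, which is the source of the computational saving.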

15.
BMC Genomics ; 17 Suppl 5: 494, 2016 08 31.
Article in English | MEDLINE | ID: mdl-27586631

ABSTRACT

BACKGROUND: Two types of approaches are mainly considered for the repeat number estimation in short tandem repeat (STR) regions from high-throughput sequencing data: approaches directly counting repeat patterns included in sequence reads spanning the region and approaches based on detecting the difference between the insert size inferred from aligned paired-end reads and the actual insert size. Although the accuracy of repeat numbers estimated with the former approaches is high, the size of target STR regions is limited to the length of sequence reads. On the other hand, the latter approaches can handle STR regions longer than the length of sequence reads. However, repeat numbers estimated with the latter approaches are less accurate than those with the former approaches. RESULTS: We proposed a new statistical model named coalescentSTR that estimates repeat numbers from paired-end read distances for multiple individuals simultaneously by connecting the read generative model for each individual with their genealogy. In the model, the genealogy is represented by handling coalescent trees as hidden variables, and the summation of the hidden variables is taken on coalescent trees sampled based on phased genotypes located around a target STR region with Markov chain Monte Carlo. In the sampled coalescent trees, repeat number information from insert size data is propagated, and more accurate estimation of repeat numbers is expected for STR regions longer than the length of sequence reads. For finding the repeat numbers maximizing the likelihood of the model on the estimation of repeat numbers, we proposed a state-of-the-art belief propagation algorithm on sampled coalescent trees. CONCLUSIONS: We verified the effectiveness of the proposed approach from the comparison with existing methods by using simulation datasets and real whole genome and whole exome data for HapMap individuals analyzed in the 1000 Genomes Project.


Subjects
Microsatellite Repeats , Algorithms , Computer Simulation , Human Genome , Humans , Statistical Models , DNA Sequence Analysis
16.
BMC Bioinformatics ; 16 Suppl 1: S4, 2015.
Article in English | MEDLINE | ID: mdl-25707811

ABSTRACT

BACKGROUND: With the recent development of microarray and high-throughput sequencing (HTS) technologies, a number of studies have revealed catalogs of copy number variants (CNVs) and their association with phenotypes and complex traits. In parallel, a number of approaches to predict CNV regions and genotypes have been proposed for both microarray and HTS data. However, only a few approaches focus on haplotyping of CNV loci. RESULTS: We propose a novel approach to infer copy unit alleles and their numbers in each sample simultaneously from population-scale HTS data by variational Bayesian inference on a generative probabilistic model inspired by latent Dirichlet allocation, which is a well-studied model for document classification problems. In simulation studies, we evaluated concordance between inferred and true copy unit alleles for lower-, middle-, and higher-copy number datasets, in which precision and recall were ≥ 0.9 for data with mean coverage ≥ 10× per copy unit. We also applied the approach to HTS data of 1123 samples at the highly variable salivary amylase gene locus and a pseudogene locus, and confirmed consistency of the estimated alleles within samples belonging to a trio of CEPH/Utah pedigree 1463 with 11 offspring. CONCLUSIONS: Our proposed approach enables detailed analysis of copy number variations, such as association studies between copy unit alleles and phenotypes or biological features including human diseases.


Subjects
Alleles , Computational Biology/methods , DNA Copy Number Variations/genetics , High-Throughput Nucleotide Sequencing , Amylases/genetics , Bayes Theorem , Female , Population Genetics , Haplotypes , Humans , Male , Statistical Models , Pedigree , Phenotype , Saliva/enzymology , Utah
17.
BMC Genomics ; 16 Suppl 2: S7, 2015.
Article in English | MEDLINE | ID: mdl-25708870

ABSTRACT

BACKGROUND: Human leucocyte antigen (HLA) genes play an important role in determining the outcome of organ transplantation and are linked to many human diseases. Because of the diversity and polymorphism of HLA loci, HLA typing at high resolution is challenging even with whole-genome sequencing data. RESULTS: We have developed a computational tool, HLA-VBSeq, to estimate the most probable HLA alleles at full (8-digit) resolution from whole-genome sequence data. HLA-VBSeq simultaneously optimizes read alignments to HLA allele sequences and the abundance of reads on HLA alleles by variational Bayesian inference. We show the effectiveness of the proposed method over other methods by predicting HLA types for HLA class I (HLA-A, -B and -C) and class II (HLA-DQA1, -DQB1 and -DRB1) loci from simulation data at various depths of coverage and from real sequencing data of human trio samples. CONCLUSIONS: HLA-VBSeq is an efficient and accurate HLA typing method that uses high-throughput sequencing data without the need for primer design for HLA loci. Moreover, it does not assume any prior knowledge about HLA allele frequencies, and hence HLA-VBSeq is broadly applicable to human samples obtained from a genetically diverse population.


Subjects
Computational Biology/methods , Human Genome , HLA Antigens/genetics , High-Throughput Nucleotide Sequencing/statistics & numerical data , Histocompatibility Testing/statistics & numerical data , Algorithms , Alleles , Bayes Theorem , Gene Frequency , Genotype , High-Throughput Nucleotide Sequencing/methods , Histocompatibility Testing/methods , Humans , Internet , Genetic Polymorphism , Reproducibility of Results
18.
J Hum Genet ; 60(10): 581-7, 2015 Oct.
Article in English | MEDLINE | ID: mdl-26108142

ABSTRACT

The Tohoku Medical Megabank Organization constructed a reference panel (referred to as the 1KJPN panel), which contains >20 million single nucleotide polymorphisms (SNPs), from whole-genome sequence data of 1070 Japanese individuals. The 1KJPN panel contains the largest number of haplotypes of Japanese ancestry to date. Here, from the 1KJPN panel, we designed a novel custom-made SNP array, named the Japonica array, which is suitable for whole-genome imputation of Japanese individuals. The array contains 659,253 SNPs, including tag SNPs for imputation, SNPs on the Y chromosome and in the mitochondrial genome, and SNPs related to previously reported genome-wide association studies and pharmacogenomics. The Japonica array provides better imputation performance for Japanese individuals than existing commercially available SNP arrays with both the 1KJPN panel and the International 1000 Genomes Project panel. For common SNPs (minor allele frequency (MAF) > 5%), the genomic coverage of the Japonica array (r² > 0.8) was 96.9%; that is, almost all common SNPs were covered by this array. Nonetheless, the coverage of low-frequency SNPs (0.5%

Subjects
Genotype , Genotyping Techniques/methods , Haplotypes , Oligonucleotide Array Sequence Analysis , Single Nucleotide Polymorphism , Asian People , Human Y Chromosomes/genetics , Mitochondrial DNA/genetics , Female , Genome-Wide Association Study , Humans , Japan , Male
19.
BMC Genomics ; 15: 664, 2014 Aug 08.
Article in English | MEDLINE | ID: mdl-25103311

ABSTRACT

BACKGROUND: Next-generation sequencers (NGSs) have become one of the main tools of current biology. To obtain useful insights from NGS data, it is essential to control low-quality portions of the data affected by technical errors such as air bubbles in the sequencing fluidics. RESULTS: We developed software named SUGAR (subtile-based GUI-assisted refiner), which can handle ultra-high-throughput data with a user-friendly graphical user interface (GUI) and interactive analysis capability. SUGAR generates high-resolution quality heatmaps of the flowcell, enabling users to find possible signals of technical errors during sequencing. The sequencing data generated from the error-affected regions of a flowcell can be selectively removed by automated analysis or by GUI-assisted operations implemented in SUGAR. The automated data-cleaning function, based on sequence read quality (Phred) scores, was applied to a public human whole-genome sequencing dataset, and we showed that the overall mapping quality improved. CONCLUSION: The detailed data evaluation and cleaning enabled by SUGAR would reduce technical problems in sequence read mapping, improving subsequent variant analyses that require high-quality sequence data and mapping results. Therefore, the software will be especially useful for controlling the quality of variant calls for low-population cells, e.g., cancer cells, in a sample affected by technical errors in the sequencing procedure.
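To make the idea of region-based quality cleaning concrete, here is a minimal sketch of flagging flowcell regions by mean Phred score. It is not SUGAR's algorithm or interface: the function names, the tile identifiers, and the threshold of 20 are assumptions made for illustration. The only fact relied on is the standard Phred definition, Q = -10 log10(p), where p is the estimated base-call error probability.

```python
from collections import defaultdict


def phred_to_error_prob(q):
    """Convert a Phred quality score to the base-call error probability."""
    return 10 ** (-q / 10)


def flag_low_quality_tiles(reads, min_mean_q=20.0):
    """Return the sorted ids of tiles whose mean base quality is below min_mean_q.

    reads -- iterable of (tile_id, list_of_phred_scores) pairs (hypothetical
             input layout; real data would be parsed from FASTQ headers)
    """
    totals = defaultdict(lambda: [0, 0])  # tile_id -> [sum of scores, base count]
    for tile_id, quals in reads:
        totals[tile_id][0] += sum(quals)
        totals[tile_id][1] += len(quals)
    return sorted(
        tile_id
        for tile_id, (score_sum, n_bases) in totals.items()
        if n_bases and score_sum / n_bases < min_mean_q
    )


# Made-up reads from two tiles; the second tile has uniformly poor qualities.
example_reads = [
    ("1101", [30, 32, 31]),
    ("1101", [29, 33, 30]),
    ("1102", [12, 8, 15]),
    ("1102", [10, 14, 9]),
]
print(flag_low_quality_tiles(example_reads))
```

Reads from flagged tiles could then be dropped before mapping. A finer spatial grid than whole tiles, as the "subtile" in the tool's name suggests, would localize defects such as air bubbles more precisely.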


Subjects
Computational Biology/methods , Computer Graphics , High-Throughput Nucleotide Sequencing , DNA Sequence Analysis , Software , Statistics as Topic/methods , User-Computer Interface , Humans
20.
BMC Genomics ; 15 Suppl 10: S5, 2014.
Article in English | MEDLINE | ID: mdl-25560536

ABSTRACT

BACKGROUND: High-throughput RNA sequencing (RNA-Seq) enables quantification and identification of transcripts at single-base resolution. Recently, longer sequence reads have become available thanks to the development of new types of sequencing technologies as well as improvements in chemical reagents for next-generation sequencers. Although several computational methods have been proposed for quantifying gene expression levels from RNA-Seq data, they are not sufficiently optimized for longer reads (e.g. >250 bp). RESULTS: We propose TIGAR2, a statistical method for quantifying transcript isoforms from fixed- and variable-length RNA-Seq data. Our method models substitution, deletion, and insertion errors of sequencers based on gapped alignments of reads to the reference cDNA sequences, so that sensitive read aligners such as Bowtie2 and BWA-MEM are effectively incorporated in our pipeline. In addition, a heuristic algorithm is implemented in the variational Bayesian inference for faster computation. We apply TIGAR2 to both simulation data and real data from human samples and evaluate its transcript quantification performance in comparison with existing methods. CONCLUSIONS: TIGAR2 is a sensitive and accurate tool for quantifying transcript isoform abundances from RNA-Seq data. Our method performs better than existing methods for fixed-length reads (100 bp, 250 bp, 500 bp, and 1000 bp, both single-end and paired-end) and variable-length reads, especially for reads longer than 250 bp.


Subjects
Computational Biology/methods , RNA Isoforms/genetics , Messenger RNA/genetics , RNA Sequence Analysis/methods , Algorithms , Bayes Theorem , Gene Expression Profiling , Genetic Variation , HeLa Cells , Humans , Software