ABSTRACT
Insertions and deletions (indels) constitute the second most important source of natural genomic variation. They account for up to 25% of genomic variants in humans and are involved in complex evolutionary processes including genomic rearrangements, adaptation, and speciation. Recent advances in long-read sequencing technologies allow detailed inference of indel variation in species and populations. Yet, despite their importance, evolutionary studies have traditionally ignored or mishandled indels due to a lack of comprehensive methodologies and statistical models of indel dynamics. Here, we discuss methods for describing indel variation and modeling indels over evolutionary time. We provide practical advice for tackling indels in genomic sequences and illustrate our discussion with examples of indel-induced effects in human and other natural populations and their contribution to evolutionary processes. We outline promising directions for future developments in statistical methodologies that would allow researchers to analyze indel variation and its effects in large genomic data sets and to incorporate indels in evolutionary inference.
Subjects
Molecular Evolution, INDEL Mutation, Humans, Genetic Models, Computational Biology/methods, Animals, Genomics/methods
ABSTRACT
BACKGROUND: Conventional clinicopathological characteristics insufficiently predict prognosis in oral squamous cell carcinoma (OSCC). We aimed to assess the added predictive value of tumor microenvironment immune cell composition (TMICC) in addition to conventional clinicopathological characteristics. METHODS: Primary tumor samples of 290 OSCC patients were immunohistochemically stained for CD4, CD8, CD20, CD68, CD163, CD57, FoxP3 and Programmed cell Death Ligand 1. Additionally, clinicopathological characteristics were obtained from patients' medical files. Predictive models were trained and validated by conducting Least Absolute Shrinkage and Selection Operator (LASSO) regression analyses with cross-validation. To quantify the added predictive power of TMICC within models, receiver operating characteristic (ROC) analyses were used. RESULTS: Recurrence occurred in 74 patients (25.5%). Conventional clinicopathological characteristics (tumor localization, pathological T-stage, pathological N-stage, extracapsular spread, resection margin, differentiation grade, perineural invasion, lymphovascular invasion) and treatment modality were used to build a LASSO logistic regression-based predictive model. Addition of TMICC to the model resulted in comparable AUCs of 0.79 (±0.01) and 0.76 (±0.1) in the training and test sets, respectively. The model indicated that high numbers of CD4+ T cells protected against recurrence. Lymph node metastasis, extracapsular spread, perineural invasion, positive surgical margins and receipt of adjuvant treatment were associated with increased risk of recurrence. CONCLUSIONS: The TMICC, specifically the number of CD4+ T cells, is an independent predictor; however, its addition to conventional clinicopathological characteristics does not improve the performance of a predictive model for recurrence in OSCC.
Subjects
Squamous Cell Carcinoma, Mouth Neoplasms, Tumor Microenvironment, Humans, Tumor Microenvironment/immunology, Mouth Neoplasms/immunology, Mouth Neoplasms/pathology, Male, Female, Middle Aged, Squamous Cell Carcinoma/immunology, Squamous Cell Carcinoma/pathology, Aged, Adult, Prognosis, Local Neoplasm Recurrence/pathology, Aged 80 and over
ABSTRACT
INTRODUCTION: Patients in the intensive care unit (ICU) are highly heterogeneous in characteristics, their clinical course, and outcomes. Genetic variability may partly explain the variability and similarity in disease courses observed among critically ill patients and may identify clusters of subgroups. The aim of this study is to conduct a systematic review of all genetic association studies of critically ill patients with their outcomes. METHODS AND ANALYSIS: This systematic review will be conducted and reported according to the HuGE Review Handbook V1.0. We will search PubMed, Embase, and the Cochrane Library for relevant studies. All types of genetic association studies that included acutely admitted medical and surgical adult ICU patients will be considered for this review. All studies will be selected according to predefined selection criteria, evaluated and assessed for risk of bias independently by two reviewers. Risk of bias will be assessed according to the HuGE Review Handbook V1.0 with some modifications reflecting recent insights. We will provide an overview of all included studies by reporting the characteristics of the study designs, the patients included in the studies, the genetic variables, and the outcomes evaluated. ETHICS AND DISSEMINATION: We will use data from peer-reviewed published articles, and hence, there is no requirement for ethics approval. The results of this systematic review will be disseminated through publication in a peer-reviewed scientific journal. SYSTEMATIC REVIEW REGISTRATION: PROSPERO CRD42021209744.
Subjects
Critical Illness, Intensive Care Units, Adult, Humans, Systematic Reviews as Topic, Hospitalization, Research Design, Genetic Association Studies, Review Literature as Topic
ABSTRACT
BACKGROUND: Whole genome sequencing (WGS) is increasingly being used for the diagnosis of patients with rare diseases. However, the diagnostic yields of many studies, particularly those conducted in a healthcare setting, are often disappointingly low, at 25-30%. This is in part because although entire genomes are sequenced, analysis is often confined to in silico gene panels or coding regions of the genome. METHODS: We undertook WGS on a cohort of 122 unrelated rare disease patients and their relatives (300 genomes) who had been pre-screened by gene panels or arrays. Patients were recruited from a broad spectrum of clinical specialties. We applied a bioinformatics pipeline that would allow comprehensive analysis of all variant types. We combined established bioinformatics tools for phenotypic and genomic analysis with our novel algorithms (SVRare, ALTSPLICE and GREEN-DB) to detect and annotate structural, splice site and non-coding variants. RESULTS: Our diagnostic yield was 43/122 cases (35%), although 47/122 cases (39%) were considered solved when taking novel candidate genes with supporting functional data into account. Structural, splice site and deep intronic variants contributed to 20/47 (43%) of our solved cases. Five genes that are novel, or were novel at the time of discovery, were identified, whilst a further three genes are putative novel disease genes with evidence of causality. We identified variants of uncertain significance in a further fourteen candidate genes. The phenotypic spectrum associated with RMND1 was expanded to include polymicrogyria. Two patients with secondary findings in FBN1 and KCNQ1 were confirmed to have previously unidentified Marfan and long QT syndromes, respectively, and were referred for further clinical interventions. Clinical diagnoses were changed in six patients and treatment adjustments made for eight individuals, which for five patients was considered life-saving.
CONCLUSIONS: Genome sequencing is increasingly being considered as a first-line genetic test in routine clinical settings and can make a substantial contribution to rapidly identifying a causal aetiology for many patients, shortening their diagnostic odyssey. We have demonstrated that structural, splice site and intronic variants make a significant contribution to diagnostic yield and that comprehensive analysis of the entire genome is essential to maximise the value of clinical genome sequencing.
Subjects
Genetic Variation, Rare Diseases, Humans, Rare Diseases/diagnosis, Rare Diseases/genetics, Whole Genome Sequencing, Genetic Testing, Mutation, Cell Cycle Proteins
ABSTRACT
We review popular unsupervised learning methods for the analysis of high-dimensional data encountered in, for example, genomics, medical imaging, cohort studies, and biobanks. We show that four commonly used methods, principal component analysis, K-means clustering, nonnegative matrix factorization, and latent Dirichlet allocation, can be written as probabilistic models underpinned by a low-rank matrix factorization. In addition to highlighting their similarities, this formulation clarifies the various assumptions and restrictions of each approach, which eases identifying the appropriate method for specific applications for applied medical researchers. We also touch upon the most important aspects of inference and model selection for the application of these methods to health data.
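To make the low-rank factorization view concrete, here is a minimal sketch for one of the four methods, K-means, written as X ≈ Z·C with a one-hot assignment matrix Z and a centroid matrix C. The toy data, the function names `matmul` and `kmeans_factorization`, and the hand-picked initial centroids are illustrative assumptions, not taken from the reviewed article, and plain Lloyd iterations stand in for whatever inference scheme the authors discuss.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def kmeans_factorization(x, init_centroids, n_iter=10):
    """Run Lloyd's algorithm and return (Z, C, labels) with X ~ Z @ C.

    Z is an n-by-k one-hot assignment matrix and C is a k-by-d centroid
    matrix, so the reconstruction Z @ C has rank at most k.
    """
    centroids = [list(c) for c in init_centroids]
    k = len(centroids)
    labels = []
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda j: sum((xi - cj) ** 2
                                  for xi, cj in zip(row, centroids[j])))
            for row in x
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [row for row, lab in zip(x, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    z = [[1.0 if lab == j else 0.0 for j in range(k)] for lab in labels]
    return z, centroids, labels


# Toy data: two well-separated groups of three points each (illustrative).
x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
     [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
z, c, labels = kmeans_factorization(x, init_centroids=[x[0], x[3]])
x_hat = matmul(z, c)  # each row of X replaced by its cluster centroid
```

In this reading, the restriction that each row of Z has exactly one non-zero entry is what distinguishes K-means from the other factorizations; relaxing it to non-negative or real-valued rows leads toward the NMF- and PCA-style models named in the abstract.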
Subjects
Algorithms, Unsupervised Machine Learning, Humans, Statistical Models, Genomics, Cluster Analysis
ABSTRACT
Many diseases show patterns of co-occurrence, possibly driven by systemic dysregulation of underlying processes affecting multiple traits. We have developed a method (treeLFA) for identifying such multimorbidities from routine health-care data, which combines topic modeling with an informative prior derived from medical ontology. We apply treeLFA to UK Biobank data and identify a variety of topics representing multimorbidity clusters, including a healthy topic. We find that loci identified using topic weights as traits in a genome-wide association study (GWAS) analysis, which we validated with a range of approaches, only partially overlap with loci from GWASs on constituent single diseases. We also show that treeLFA improves upon existing methods like latent Dirichlet allocation in various ways. Overall, our findings indicate that topic models can characterize multimorbidity patterns and that genetic analysis of these patterns can provide insight into the etiology of complex traits that cannot be determined from the analysis of constituent traits alone.
ABSTRACT
BACKGROUND: Multimorbidity is associated with poor quality of life, polypharmacy, health care costs and mortality, with those affected potentially benefitting from a healthy lifestyle. We assessed a comprehensive set of lifestyle factors in relation to multimorbidity with major chronic diseases. METHODS: This cross-sectional study utilised baseline data for adults from the prospective Lifelines Cohort in the north of the Netherlands (N = 79,345). We defined multimorbidity as the co-existence of two or more chronic diseases (i.e. cardiovascular disease, cancer, respiratory disease, type 2 diabetes) and evaluated factors in six lifestyle domains (nutrition, physical (in)activity, substance abuse, sleep, stress, relationships) among groups by the number of chronic diseases (≥2, 1, 0). Multinomial logistic regression models were created, adjusted for appropriate confounders, and odds ratios (OR) with 95% confidence intervals (95%CI) were reported. RESULTS: 3,712 participants had multimorbidity (4.7%, age 53.5 ± 12.5 years), and this group tended to have less healthy lifestyles. Compared to those without chronic diseases, those with multimorbidity reported physical inactivity more often (OR, 1.15; 95%CI, 1.06-1.25; not significant for one condition), chronic stress (OR, 2.14; 95%CI, 1.92-2.38) and inadequate sleep (OR, 1.70; 95%CI, 1.41-2.06); as expected, they more often watched television (OR, 1.70; 95%CI, 1.42-2.04) and currently smoked (OR, 1.91; 95%CI, 1.73-2.11), but they also had lower alcohol intakes (OR, 0.66; 95%CI, 0.59-0.74). CONCLUSIONS: Chronic stress and poor sleep, in addition to physical inactivity and smoking, are lifestyle factors of great concern in patients with multimorbidity.
Subjects
Life Style, Multimorbidity, Chronic Disease/epidemiology, Cross-Sectional Studies, Humans, Prospective Studies, Male, Female, Adolescent, Young Adult, Adult, Middle Aged, Aged, Prevalence
ABSTRACT
BACKGROUND: Timely referral of Parkinson's disease (PD) patients to specialized centers for treatment with device-aided therapies (DAT) is suboptimal. OBJECTIVE: To develop a screening tool for timely referral for DAT in PD and to compare the tool with the published 5-2-1 criteria. METHODS: A cross-sectional, observational study was performed in 8 hospitals in the catchment area of a specialized movement disorder center in the Northern part of the Netherlands. The target population comprised PD patients not yet on DAT visiting the outpatient clinic of participating hospitals. The primary outcome was apparent eligibility for referral for DAT based on consensus by a panel of 5 experts in the field of DAT. Multivariable logistic regression modelling was used to develop a screening tool for eligibility for referral for DAT. Potential predictors were patient and disease characteristics as observed by attending neurologists. RESULTS: In total, 259 consecutive PD patients were included, of whom 17 were deemed eligible for referral for DAT (point prevalence: 6.6%). Presence of response fluctuations and troublesome dyskinesias were the strongest independent predictors of being considered eligible. Both variables were included in the final model, as well as levodopa equivalent daily dose. Decision curve analysis revealed the new model outperforms the 5-2-1 criteria. A simple chart was constructed to provide guidance for referral. Discrimination of this simplified scoring system proved excellent (AUC after bootstrapping: 0.97). CONCLUSIONS: Awaiting external validation, the developed screening tool already appears promising for timely referral and subsequent treatment with DAT in patients with PD.
Subjects
Dyskinesias, Parkinson Disease, Humans, Parkinson Disease/therapy, Parkinson Disease/drug therapy, Cross-Sectional Studies, Levodopa/therapeutic use, Dyskinesias/drug therapy, Referral and Consultation, Antiparkinson Agents/therapeutic use
ABSTRACT
BACKGROUND: Coronavirus disease 2019 (COVID-19) social distancing measures led to a dramatic decline in non-COVID-19 respiratory virus infections, providing a unique opportunity to study their impact on annual forced expiratory volume in 1 s (FEV1) decline, episodes of temporary drop in lung function (TDLF) suggestive of infection and chronic lung allograft dysfunction (CLAD) in lung transplant recipients (LTRs). METHODS: All FEV1 values of LTRs transplanted between 2009 and April 2020 at the University Medical Center Groningen (Groningen, The Netherlands) were included. Annual FEV1 change was estimated with separate estimates for pre-social distancing (2009-2020) and the year with social distancing measures (2020-2021). Patients were grouped by individual TDLF frequency (frequent/infrequent). Respiratory virus circulation was derived from weekly hospital-wide respiratory virus infection rates. Effect modification by TDLF frequency and respiratory virus circulation was assessed. CLAD and TDLF rates were analysed over time. RESULTS: 479 LTRs (12 775 FEV1 values) were included. Pre-social distancing annual change in FEV1 was -114 (95% CI -133 to -94) mL, while during social distancing FEV1 did not decline: 5 (95% CI -38 to 48) mL (difference pre-social distancing versus during social distancing: p<0.001). The frequent TDLF subgroup showed faster annual FEV1 decline compared with the infrequent TDLF subgroup (-150 (95% CI -181 to -120) versus -90 (95% CI -115 to -65) mL; p=0.003). During social distancing, we found significantly lower odds for any TDLF (OR 0.53, 95% CI 0.33-0.85; p=0.008) and severe TDLF (OR 0.34, 95% CI 0.16-0.71; p=0.005) as well as lower CLAD incidence (OR 0.53, 95% CI 0.27-1.02; p=0.060). Effect modification by respiratory virus circulation indicated a significant association between TDLF/CLAD and respiratory viruses.
CONCLUSIONS: During COVID-19 social distancing the strong reduction in respiratory virus circulation coincided with markedly less FEV1 decline, fewer episodes of TDLF and possibly less CLAD. Effect modification by respiratory virus circulation suggests an important role for respiratory viruses in lung function decline in LTRs.
Subjects
COVID-19, Lung Transplantation, Viruses, Humans, Transplant Recipients, Physical Distancing, Follow-Up Studies, Lung
ABSTRACT
BACKGROUND: Respiratory syncytial virus (RSV), parainfluenza virus (PIV), and human metapneumovirus (hMPV) are increasingly associated with chronic lung allograft dysfunction (CLAD) in lung transplant recipients (LTR). This systematic review primarily aimed to assess outcomes of RSV/PIV/hMPV infections in LTR and secondarily to assess evidence regarding the efficacy of ribavirin. METHODS: Relevant databases were queried and study outcomes extracted using a standardized method and summarized. RESULTS: Nineteen retrospective and 12 prospective studies were included (total 1060 cases). Pooled 30-day mortality was low (0-3%), but CLAD progression 180-360 days postinfection was substantial (pooled incidences 19-24%) and probably associated with severe infection. Ribavirin trended toward effectiveness for CLAD prevention in exploratory meta-analysis (odds ratio [OR] 0.61, [0.27-1.18]), although results were highly variable between studies. CONCLUSIONS: RSV/PIV/hMPV infection was followed by a high CLAD incidence. Treatment options, including ribavirin, are limited. There is an urgent need for high-quality studies to provide better treatment options for these infections.
Subjects
Metapneumovirus, Paramyxoviridae Infections, Respiratory Syncytial Virus Infections, Human Respiratory Syncytial Virus, Respiratory Tract Infections, Humans, Lung, Human Parainfluenza Virus 1, Human Parainfluenza Virus 2, Paramyxoviridae Infections/drug therapy, Paramyxoviridae Infections/epidemiology, Prospective Studies, Respiratory Syncytial Virus Infections/drug therapy, Respiratory Syncytial Virus Infections/epidemiology, Respiratory Tract Infections/drug therapy, Respiratory Tract Infections/epidemiology, Retrospective Studies, Ribavirin/therapeutic use, Transplant Recipients
ABSTRACT
Genotyping from sequencing is the basis of emerging strategies in the molecular breeding of polyploid plants. However, compared with the situation for diploids, in which genotyping accuracies are confidently determined with comprehensive benchmarks, polyploids have been neglected; there are no benchmarks measuring genotyping error rates for small variants using real sequencing reads. We previously introduced a variant calling method, Octopus, that accurately calls germline variants in diploids and somatic mutations in tumors. Here, we evaluate Octopus and other popular tools on whole-genome tetraploid and hexaploid data sets created using in silico mixtures of diploid Genome in a Bottle (GIAB) samples. We find that genotyping errors are abundant for typical sequencing depths but that Octopus makes 25% fewer errors than other methods on average. We supplement our benchmarks with concordance analysis in real autotriploid banana data sets.
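The idea of building a polyploid truth set from diploid samples can be sketched at the genotype level: the expected tetraploid genotype at a site is the pooled set of alleles from two diploid genotypes. This is a simplified illustration under stated assumptions, not the authors' pipeline (the abstract describes mixing sequencing data; only truth genotypes are combined here). The function name `mix_genotypes`, the dictionary layout, and the rule that a site missing from one sample is homozygous reference are all hypothetical choices.

```python
def mix_genotypes(samples, ploidy_each=2):
    """Pool per-site genotypes from several samples into one genotype.

    Each sample maps (chromosome, position) -> tuple of allele indices
    (0 = reference, 1 = alternate). A site absent from a sample is
    assumed to be homozygous reference in that sample.
    """
    sites = set()
    for calls in samples:
        sites.update(calls)
    mixed = {}
    for site in sorted(sites):
        alleles = []
        for calls in samples:
            alleles.extend(calls.get(site, (0,) * ploidy_each))
        mixed[site] = tuple(sorted(alleles))
    return mixed


# Two illustrative diploid call sets (made-up sites and genotypes).
sample_a = {("chr1", 100): (0, 1), ("chr1", 250): (1, 1)}
sample_b = {("chr1", 100): (1, 1), ("chr1", 400): (0, 1)}

tetraploid = mix_genotypes([sample_a, sample_b])
```

Passing three diploid call sets to the same function would give a hexaploid expectation. Allele dosage, the number of alternate copies out of four or six, is what a polyploid genotyper has to recover, and is where diploid-oriented callers are most likely to err.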
Subjects
Benchmarking, Polyploidy, Genotype, High-Throughput Nucleotide Sequencing, Humans
ABSTRACT
BACKGROUND: Laparoscopic hysterectomy is accepted worldwide as the standard treatment option for early-stage endometrial cancer. However, there are limited data on long-term survival, particularly when no lymphadenectomy is performed. We compared the survival outcomes of total laparoscopic hysterectomy (TLH) and total abdominal hysterectomy (TAH), both without lymphadenectomy, for early-stage endometrial cancer up to 5 years postoperatively. METHODS: Follow-up of a multi-centre, randomised controlled trial comparing TLH and TAH, without routine lymphadenectomy, for women with stage I endometrial cancer. Enrolment was between 2007 and 2009 by 2:1 randomisation to TLH or TAH. Outcomes were disease-free survival (DFS), overall survival (OS), disease-specific survival (DSS), and primary site of recurrence. Multivariable Cox regression analyses were adjusted for age, stage, grade, and radiotherapy with adjusted hazard ratios (aHR) and 95% confidence intervals (95%CI) reported. To test for significance, non-inferiority margins were defined. RESULTS: In total, 279 women underwent a surgical procedure, of whom 263 (94%) had follow-up data. For the TLH (n = 175) and TAH (n = 88) groups, DFS (90.3% vs 84.1%; aHR[recurrence], 0.69; 95%CI, 0.31-1.52), OS (89.2% vs 82.8%; aHR[death], 0.60; 95%CI, 0.30-1.19), and DSS (95.0% vs 89.8%; aHR[death], 0.62; 95%CI, 0.23-1.70) were reported at 5 years. At a 10% significance level, and with a non-inferiority margin of 0.20, the null hypothesis of inferiority was rejected for all three outcomes. There were no port-site or wound metastases, and local recurrence rates were comparable. CONCLUSION: Disease recurrence and 5-year survival rates were comparable between the TLH and TAH groups and comparable to studies with lymphadenectomy, supporting the widespread use of TLH without lymphadenectomy as the primary treatment for early-stage, low-grade endometrial cancer.
Subjects
Endometrioid Carcinoma/surgery, Endometrial Neoplasms/surgery, Hysterectomy/methods, Local Neoplasm Recurrence/epidemiology, Adult, Aged, Aged 80 and over, Endometrioid Carcinoma/mortality, Endometrioid Carcinoma/pathology, Disease-Free Survival, Endometrial Neoplasms/mortality, Endometrial Neoplasms/pathology, Female, Humans, Laparoscopy/methods, Laparotomy/methods, Lymph Node Excision, Middle Aged, Neoplasm Grading, Neoplasm Staging, Adjuvant Radiotherapy
ABSTRACT
Endometriosis is a common chronic inflammatory condition causing pelvic pain and infertility in women, with limited treatment options and 50% heritability. We leveraged genetic analyses in two species with spontaneous endometriosis, humans and the rhesus macaque, to uncover treatment targets. We sequenced DNA from 32 human families contributing to a genetic linkage signal on chromosome 7p13-15 and observed significant overrepresentation of predicted deleterious low-frequency coding variants in NPSR1, the gene encoding neuropeptide S receptor 1, in cases (predominantly stage III/IV) versus controls (P = 7.8 × 10⁻⁴). Significant linkage to the region orthologous to human 7p13-15 was replicated in a pedigree of 849 rhesus macaques (P = 0.0095). Targeted association analyses in 3194 surgically confirmed, unrelated cases and 7060 controls revealed that a common insertion/deletion variant, rs142885915, was significantly associated with stage III/IV endometriosis (P = 5.2 × 10⁻⁵; odds ratio, 1.23; 95% CI, 1.09 to 1.39). Immunohistochemistry, qRT-PCR, and flow cytometry experiments demonstrated that NPSR1 was expressed in glandular epithelium from eutopic and ectopic endometrium, and on monocytes in peritoneal fluid. The NPSR1 inhibitor SHA 68R blocked NPSR1-mediated signaling, proinflammatory TNF-α release, and monocyte chemotaxis in vitro (P < 0.01), and led to a significant reduction of inflammatory cell infiltrate and abdominal pain (P < 0.05) in a mouse model of peritoneal inflammation as well as in a mouse model of endometriosis. We conclude that the NPSR1/NPS system is a genetically validated, nonhormonal target for the treatment of endometriosis with likely increased relevance to stage III/IV disease.
Subjects
Endometriosis, G-Protein-Coupled Receptors/genetics, Animals, Endometriosis/drug therapy, Endometriosis/genetics, Endometrium, Female, Humans, Macaca mulatta, Mice, Tumor Necrosis Factor-alpha
ABSTRACT
Tracking and understanding data quality, analysis and reproducibility are critical concerns in the biological sciences. This is especially true in genomics where next generation sequencing (NGS) based technologies such as ChIP-seq, RNA-seq and ATAC-seq are generating a flood of genome-scale data. However, such data are usually processed with automated tools and pipelines, generating tabular outputs and static visualisations. Interpretation is normally made at a high level without the ability to visualise the underlying data in detail. Conventional genome browsers are limited to browsing single locations and do not allow for interactions with the dataset as a whole. Multi Locus View (MLV), a web-based tool, has been developed to allow users to fluidly interact with genomics datasets at multiple scales. The user is able to browse the raw data, cluster it, combine it with other analyses and annotate it. User datasets can then be shared with other users or made public for quick assessment by the academic community. MLV is publicly available at https://mlv.molbiol.ox.ac.uk.
Subjects
DNA Sequence Analysis/methods, Chromatin Immunoprecipitation Sequencing/methods, Computational Biology/methods, Genomics/methods, High-Throughput Nucleotide Sequencing/methods, Internet, Computer-Assisted Numerical Analysis, RNA-Seq/methods, Reproducibility of Results, RNA Sequence Analysis/methods, Software
ABSTRACT
Recent advances in throughput and accuracy mean that the Oxford Nanopore Technologies PromethION platform is now a viable solution for genome sequencing. Much of the validation of bioinformatic tools for this long-read data has focussed on calling germline variants (including structural variants). Somatic variants are outnumbered many-fold by germline variants and their detection is further complicated by the effects of tumour purity/subclonality. Here, we evaluate the extent to which Nanopore sequencing enables detection and analysis of somatic variation. We do this through sequencing tumour and germline genomes for a patient with diffuse B-cell lymphoma and comparing results with 150 bp short-read sequencing of the same samples. Calling germline single nucleotide variants (SNVs) from specific chromosomes of the long-read data achieved good specificity and sensitivity. However, results of somatic SNV calling highlight the need for the development of specialised joint calling algorithms. We find the comparative genome-wide performance of different tools varies significantly between structural variant types, and suggest long reads are especially advantageous for calling large somatic deletions and duplications. Finally, we highlight the utility of long reads for phasing clinically relevant variants, confirming that a somatic 1.6 Mb deletion and a p.(Arg249Met) mutation involving TP53 are oriented in trans.
Subjects
Human Genome, Germ Cells, Diffuse Large B-Cell Lymphoma/genetics, Single Nucleotide Polymorphism, Whole Genome Sequencing/methods, Algorithms, Base Sequence, Chromosome Mapping/methods, Human Chromosomes/genetics, Computational Biology/methods, DNA Copy Number Variations, p53 Genes, Genomics/methods, High-Throughput Nucleotide Sequencing/methods, Humans, Mutation, Nanopore Sequencing/methods, Sensitivity and Specificity, Tumor Suppressor Protein p53/genetics
ABSTRACT
Demographic events shape a population's genetic diversity, a process described by the coalescent-with-recombination model that relates demography and genetics by an unobserved sequence of genealogies along the genome. As the space of genealogies over genomes is large and complex, inference under this model is challenging. Formulating the coalescent-with-recombination model as a continuous-time and -space Markov jump process, we develop a particle filter for such processes, and use waypoints that under appropriate conditions allow the problem to be reduced to the discrete-time case. To improve inference, we generalise the Auxiliary Particle Filter for discrete-time models, and use Variational Bayes to model the uncertainty in parameter estimates for rare events, avoiding biases seen with Expectation Maximization. Using real and simulated genomes, we show that past population sizes can be accurately inferred over a larger range of epochs than was previously possible, opening the possibility of jointly analyzing multiple genomes under complex demographic models. Code is available at https://github.com/luntergroup/smcsmc.
Subjects
Algorithms, Demography/history, Population Genetics, Human Genome, Markov Chains, Genetic Models, Asian People, Bayes Theorem, Computer Simulation, Genetic Variation, 21st Century History, Ancient History, Medieval History, Humans, Pedigree, Population Density, White People
ABSTRACT
Almost all haplotype-based variant callers were designed specifically for detecting common germline variation in diploid populations, and give suboptimal results in other scenarios. Here we present Octopus, a variant caller that uses a polymorphic Bayesian genotyping model capable of modeling sequencing data from a range of experimental designs within a unified haplotype-aware framework. Octopus combines sequencing reads and prior information to phase called genotypes of arbitrary ploidy, including those with somatic mutations. We show that Octopus accurately calls germline variants in individuals, including single nucleotide variants, indels and small complex replacements such as microinversions. Using a synthetic tumor data set derived from clean sequencing data from a sample with known germline haplotypes and observed mutations in a large cohort of tumor samples, we show that Octopus is more sensitive to low-frequency somatic variation, yet calls considerably fewer false positives than other methods. Octopus also outputs realigned evidence BAM files to aid validation and interpretation.
Subjects
Bayes Theorem, Genetic Variation, Genotype, Haplotypes, Genetic Polymorphism, Software, Algorithms, Animals, Computational Biology, Human Genome, High-Throughput Nucleotide Sequencing, Humans, Genetic Models
ABSTRACT
Predicting the impact of noncoding genetic variation requires interpreting it in the context of three-dimensional genome architecture. We have developed deepC, a transfer-learning-based deep neural network that accurately predicts genome folding from megabase-scale DNA sequence. DeepC predicts domain boundaries at high resolution, learns the sequence determinants of genome folding and predicts the impact of both large-scale structural and single base-pair variations.
Subjects
Human Genome/genetics, Genomics/methods, Genetic Models, Neural Networks (Computer), Base Sequence, CCCTC-Binding Factor/genetics, Chromatin/genetics, Computer Simulation, Genomic Structural Variation, Humans
ABSTRACT
Expectation maximization (EM) is a technique for estimating maximum-likelihood parameters of a latent variable model given observed data by alternating between taking expectations of sufficient statistics and maximizing the expected log likelihood. For situations where sufficient statistics are intractable, stochastic approximation EM (SAEM) is often used, which uses Monte Carlo techniques to approximate the expected log likelihood. Two common implementations of SAEM, batch EM (BEM) and online EM (OEM), are parameterized by a "learning rate", and their efficiency depends strongly on this parameter. We propose an extension to the OEM algorithm, termed Introspective Online Expectation Maximization (IOEM), which removes the need for specifying this parameter by adapting the learning rate to trends in the parameter updates. We show that our algorithm matches the efficiency of the optimal BEM and OEM algorithms in multiple models, and that the efficiency of IOEM can exceed that of BEM/OEM methods with optimal learning rates when the model has many parameters. Finally, we use IOEM to fit two models to a financial time series. A Python implementation is available at https://github.com/luntergroup/IOEM.git.
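To illustrate what a learning rate does in online EM, here is a minimal sketch of a plain online EM update with a fixed, user-chosen schedule. It is not IOEM and does not use the linked repository. The model (a two-component, unit-variance Gaussian mixture with an exact E-step rather than a Monte Carlo one), the schedule gamma_t = (t + t0)^(-alpha), and every name in the code are assumptions made for illustration.

```python
import math
import random


def online_em(samples, mu_init, alpha=0.6, t0=10):
    """Online EM for a 1-D Gaussian mixture with unit variances.

    Running sufficient statistics are blended with each new
    observation's expected statistics using a decaying learning rate
    gamma_t = (t + t0) ** (-alpha); alpha and t0 must be chosen by hand.
    """
    k = len(mu_init)
    weights = [1.0 / k] * k
    mu = list(mu_init)
    s0 = list(weights)                          # running E[z_j]
    s1 = [w * m for w, m in zip(weights, mu)]   # running E[z_j * y]
    for t, y in enumerate(samples, start=1):
        # E-step for one observation: posterior responsibilities.
        dens = [w * math.exp(-0.5 * (y - m) ** 2)
                for w, m in zip(weights, mu)]
        total = sum(dens)
        resp = [d / total for d in dens]
        # Stochastic-approximation update of the sufficient statistics.
        gamma = (t + t0) ** (-alpha)
        for j in range(k):
            s0[j] = (1 - gamma) * s0[j] + gamma * resp[j]
            s1[j] = (1 - gamma) * s1[j] + gamma * resp[j] * y
        # M-step: parameters as a function of the running statistics.
        weights = list(s0)
        mu = [s1[j] / s0[j] for j in range(k)]
    return weights, mu


# Simulated data: equal mixture of N(0, 1) and N(5, 1), fixed seed.
rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) if rng.random() < 0.5 else rng.gauss(5.0, 1.0)
           for _ in range(5000)]
weights, mu = online_em(samples, mu_init=[1.0, 4.0])
```

A larger `alpha` makes the estimates settle faster but adapt more slowly, and a smaller one does the reverse. This hand-tuned trade-off is the parameter the abstract says IOEM removes by adapting the rate automatically.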