1.
BMC Med Inform Decis Mak ; 24(1): 167, 2024 Jun 14.
Article in English | MEDLINE | ID: mdl-38877563

ABSTRACT

BACKGROUND: Consider a setting where multiple parties holding sensitive data aim to collaboratively learn population level statistics, but pooling the sensitive data sets is not possible due to privacy concerns and parties are unable to engage in centrally coordinated joint computation. We study the feasibility of combining privacy preserving synthetic data sets in place of the original data for collaborative learning on real-world health data from the UK Biobank. METHODS: We perform an empirical evaluation based on an existing prospective cohort study from the literature. Multiple parties were simulated by splitting the UK Biobank cohort along assessment centers, for which we generate synthetic data using differentially private generative modelling techniques. We then apply the original study's Poisson regression analysis on the combined synthetic data sets and evaluate the effects of 1) the size of local data set, 2) the number of participating parties, and 3) local shifts in distributions, on the obtained likelihood scores. RESULTS: We discover that parties engaging in the collaborative learning via shared synthetic data obtain more accurate estimates of the regression parameters compared to using only their local data. This finding extends to the difficult case of small heterogeneous data sets. Furthermore, the more parties participate, the larger and more consistent the improvements become up to a certain limit. Finally, we find that data sharing can especially help parties whose data contain underrepresented groups to perform better-adjusted analysis for said groups. CONCLUSIONS: Based on our results we conclude that sharing of synthetic data is a viable method for enabling learning from sensitive data without violating privacy constraints even if individual data sets are small or do not represent the overall population well. 
Lack of access to distributed sensitive data is often a bottleneck in biomedical research, which our study shows can be alleviated with privacy-preserving collaborative learning methods.


Subjects
Information Dissemination , Humans , United Kingdom , Cooperative Behavior , Confidentiality/standards , Privacy , Biological Specimen Banks , Prospective Studies
2.
Bioinformatics ; 35(14): i218-i224, 2019 Jul 15.
Article in English | MEDLINE | ID: mdl-31510659

ABSTRACT

MOTIVATION: Human genomic datasets often contain sensitive information that limits use and sharing of the data. In particular, simple anonymization strategies fail to provide sufficient level of protection for genomic data, because the data are inherently identifiable. Differentially private machine learning can help by guaranteeing that the published results do not leak too much information about any individual data point. Recent research has reached promising results on differentially private drug sensitivity prediction using gene expression data. Differentially private learning with genomic data is challenging because it is more difficult to guarantee privacy in high dimensions. Dimensionality reduction can help, but if the dimension reduction mapping is learned from the data, then it needs to be differentially private too, which can carry a significant privacy cost. Furthermore, the selection of any hyperparameters (such as the target dimensionality) needs to also avoid leaking private information. RESULTS: We study an approach that uses a large public dataset of similar type to learn a compact representation for differentially private learning. We compare three representation learning methods: variational autoencoders, principal component analysis and random projection. We solve two machine learning tasks on gene expression of cancer cell lines: cancer type classification, and drug sensitivity prediction. The experiments demonstrate significant benefit from all representation learning methods with variational autoencoders providing the most accurate predictions most often. Our results significantly improve over previous state-of-the-art in accuracy of differentially private drug sensitivity prediction. AVAILABILITY AND IMPLEMENTATION: Code used in the experiments is available at https://github.com/DPBayes/dp-representation-transfer.


Subjects
Machine Learning , Humans , Neoplasms
3.
BMC Bioinformatics ; 19(1): 367, 2018 Oct 04.
Article in English | MEDLINE | ID: mdl-30286713

ABSTRACT

BACKGROUND: Genome-wide high-throughput sequencing (HTS) time series experiments are a powerful tool for monitoring various genomic elements over time. They can be used to monitor, for example, gene or transcript expression with RNA sequencing (RNA-seq), DNA methylation levels with bisulfite sequencing (BS-seq), or abundances of genetic variants in populations with pooled sequencing (Pool-seq). However, because of high experimental costs, the time series data sets often consist of a very limited number of time points with very few or no biological replicates, posing challenges in the data analysis. RESULTS: Here we present the GPrank R package for modelling genome-wide time series by incorporating variance information obtained during pre-processing of the HTS data using probabilistic quantification methods or from a beta-binomial model using sequencing depth. GPrank is well-suited for analysing both short and irregularly sampled time series. It is based on modelling each time series by two Gaussian process (GP) models, namely, time-dependent and time-independent GP models, and comparing the evidence provided by data under two models by computing their Bayes factor (BF). Genomic elements are then ranked by their BFs, and temporally most dynamic elements can be identified. CONCLUSIONS: Incorporating the variance information helps GPrank avoid false positives without compromising computational efficiency. Fitted models can be easily further explored in a browser. Detection and visualisation of temporally most active dynamic elements in the genome can provide a good starting point for further downstream analyses for increasing our understanding of the studied processes.
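The ranking idea in this abstract (compare a time-dependent and a time-independent Gaussian process model by their Bayes factor) can be illustrated with a minimal Python sketch. This is not the GPrank R package or its API. It assumes a squared-exponential kernel, fixed hyperparameters (`signal_var`, `length_scale`) instead of optimised ones, a zero-mean (centred) series, and per-point noise variances supplied by the caller; all function names are hypothetical.

```python
import math


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def log_marginal_likelihood(y, cov):
    """Log density of y under a zero-mean Gaussian with covariance cov."""
    n = len(y)
    low = cholesky(cov)
    # Forward substitution: solve low * z = y, so that z.z = y^T cov^{-1} y.
    z = []
    for i in range(n):
        z.append((y[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    quad = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    return -0.5 * (quad + log_det + n * math.log(2.0 * math.pi))


def log_bayes_factor(times, y, noise_var, signal_var=1.0, length_scale=1.0):
    """Log Bayes factor of a time-dependent over a time-independent model.

    Time-independent model: observation noise only (diagonal covariance).
    Time-dependent model: squared-exponential kernel plus the same noise.
    The kernel choice and fixed hyperparameters are assumptions of this sketch.
    """
    n = len(times)
    noise = [[noise_var[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    dependent = [
        [
            signal_var * math.exp(-0.5 * ((times[i] - times[j]) / length_scale) ** 2)
            + noise[i][j]
            for j in range(n)
        ]
        for i in range(n)
    ]
    return log_marginal_likelihood(y, dependent) - log_marginal_likelihood(y, noise)
```

Under these assumptions, a series with a clear trend relative to its noise variances gets a positive log Bayes factor, and a series that stays within its noise gets a negative one, so sorting elements by this score puts the temporally most dynamic ones first.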


Subjects
Genetic Variation/genetics , Genome/genetics , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Software
4.
Proc Natl Acad Sci U S A ; 112(42): 13115-20, 2015 Oct 20.
Article in English | MEDLINE | ID: mdl-26438844

ABSTRACT

Genes with similar transcriptional activation kinetics can display very different temporal mRNA profiles because of differences in transcription time, degradation rate, and RNA-processing kinetics. Recent studies have shown that a splicing-associated RNA production delay can be significant. To investigate this issue more generally, it is useful to develop methods applicable to genome-wide datasets. We introduce a joint model of transcriptional activation and mRNA accumulation that can be used for inference of transcription rate, RNA production delay, and degradation rate given data from high-throughput sequencing time course experiments. We combine a mechanistic differential equation model with a nonparametric statistical modeling approach allowing us to capture a broad range of activation kinetics, and we use Bayesian parameter estimation to quantify the uncertainty in estimates of the kinetic parameters. We apply the model to data from estrogen receptor α activation in the MCF-7 breast cancer cell line. We use RNA polymerase II ChIP-Seq time course data to characterize transcriptional activation and mRNA-Seq time course data to quantify mature transcripts. We find that 11% of genes with a good signal in the data display a delay of more than 20 min between completing transcription and mature mRNA production. The genes displaying these long delays are significantly more likely to be short. We also find a statistical association between high delay and late intron retention in pre-mRNA data, indicating significant splicing-associated production delays in many genes.


Subjects
Genome, Human , Models, Genetic , RNA/biosynthesis , Transcription, Genetic , Estrogen Receptor alpha/metabolism , Humans , Kinetics , MCF-7 Cells , RNA/genetics , Signal Transduction
5.
Bioinformatics ; 32(12): i147-i155, 2016 Jun 15.
Article in English | MEDLINE | ID: mdl-27307611

ABSTRACT

MOTIVATION: Alternative splicing is an important mechanism in which the regions of pre-mRNAs are differentially joined in order to form different transcript isoforms. Alternative splicing is involved in the regulation of normal physiological functions but also linked to the development of diseases such as cancer. We analyse differential expression and splicing using RNA-sequencing time series in three different settings: overall gene expression levels, absolute transcript expression levels and relative transcript expression levels. RESULTS: Using estrogen receptor α signaling response as a model system, our Gaussian process-based test identifies genes with differential splicing and/or differentially expressed transcripts. We discover genes with consistent changes in alternative splicing independent of changes in absolute expression and genes where some transcripts change whereas others stay constant in absolute level. The results suggest classes of genes with different modes of alternative splicing regulation during the experiment. AVAILABILITY AND IMPLEMENTATION: R and Matlab codes implementing the method are available at https://github.com/PROBIC/diffsplicing. An interactive browser for viewing all model fits is available at http://users.ics.aalto.fi/hande/splicingGP/. CONTACT: hande.topa@helsinki.fi or antti.honkela@helsinki.fi. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Alternative Splicing , Gene Expression Profiling , Humans , Protein Isoforms , RNA Precursors , Sequence Analysis, RNA
6.
BMC Bioinformatics ; 17(Suppl 16): 448, 2016 Dec 13.
Article in English | MEDLINE | ID: mdl-28105909

ABSTRACT

BACKGROUND: Various ℓ1-penalised estimation methods such as graphical lasso and CLIME are widely used for sparse precision matrix estimation and learning of undirected network structure from data. Many of these methods have been shown to be consistent under various quantitative assumptions about the underlying true covariance matrix. Intuitively, these conditions are related to situations where the penalty term will dominate the optimisation. RESULTS: We explore the consistency of ℓ1-based methods for a class of bipartite graphs motivated by the structure of models commonly used for gene regulatory networks. We show that all ℓ1-based methods fail dramatically for models with nearly linear dependencies between the variables. We also study the consistency on models derived from real gene expression data and note that the assumptions needed for consistency never hold even for modest sized gene networks and ℓ1-based methods also become unreliable in practice for larger networks. CONCLUSIONS: Our results demonstrate that ℓ1-penalised undirected network structure learning methods are unable to reliably learn many sparse bipartite graph structures, which arise often in gene expression data. Users of such methods should be aware of the consistency criteria of the methods and check if they are likely to be met in their application of interest.


Subjects
Computational Biology/methods , Gene Regulatory Networks , Machine Learning , Transcriptome , Animals , Humans , Models, Statistical
7.
Bioinformatics ; 31(24): 3881-9, 2015 Dec 15.
Article in English | MEDLINE | ID: mdl-26315907

ABSTRACT

MOTIVATION: Assigning RNA-seq reads to their transcript of origin is a fundamental task in transcript expression estimation. Where ambiguities in assignments exist due to transcripts sharing sequence, e.g. alternative isoforms or alleles, the problem can be solved through probabilistic inference. Bayesian methods have been shown to provide accurate transcript abundance estimates compared with competing methods. However, exact Bayesian inference is intractable and approximate methods such as Markov chain Monte Carlo and Variational Bayes (VB) are typically used. While providing a high degree of accuracy and modelling flexibility, standard implementations can be prohibitively slow for large datasets and complex transcriptome annotations. RESULTS: We propose a novel approximate inference scheme based on VB and apply it to an existing model of transcript expression inference from RNA-seq data. Recent advances in VB algorithmics are used to improve the convergence of the algorithm beyond the standard Variational Bayes Expectation Maximization algorithm. We apply our algorithm to simulated and biological datasets, demonstrating a significant increase in speed with only very small loss in accuracy of expression level estimation. We carry out a comparative study against seven popular alternative methods and demonstrate that our new algorithm provides excellent accuracy and inter-replicate consistency while remaining competitive in computation time. AVAILABILITY AND IMPLEMENTATION: The methods were implemented in R and C++, and are available as part of the BitSeq project at github.com/BitSeq. The method is also available through the BitSeq Bioconductor package. The source code to reproduce all simulation results can be accessed via github.com/BitSeq/BitSeqVB_benchmarking.


Subjects
Algorithms , Gene Expression Profiling/methods , Sequence Analysis, RNA/methods , Bayes Theorem , Humans , Markov Chains , Monte Carlo Method
8.
Bioinformatics ; 31(11): 1762-70, 2015 Jun 01.
Article in English | MEDLINE | ID: mdl-25614471

ABSTRACT

MOTIVATION: Recent advances in high-throughput sequencing (HTS) have made it possible to monitor genomes in great detail. New experiments not only use HTS to measure genomic features at one time point but also monitor them changing over time with the aim of identifying significant changes in their abundance. In population genetics, for example, allele frequencies are monitored over time to detect significant frequency changes that indicate selection pressures. Previous attempts at analyzing data from HTS experiments have been limited as they could not simultaneously include data at intermediate time points, replicate experiments and sources of uncertainty specific to HTS such as sequencing depth. RESULTS: We present the beta-binomial Gaussian process model for ranking features with significant non-random variation in abundance over time. The features are assumed to represent proportions, such as proportion of an alternative allele in a population. We use the beta-binomial model to capture the uncertainty arising from finite sequencing depth and combine it with a Gaussian process model over the time series. In simulations that mimic the features of experimental evolution data, the proposed method clearly outperforms classical testing in average precision of finding selected alleles. We also present simulations exploring different experimental design choices and results on real data from Drosophila experimental evolution experiment in temperature adaptation. AVAILABILITY AND IMPLEMENTATION: R software implementing the test is available at https://github.com/handetopa/BBGP.
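How finite sequencing depth translates into uncertainty about a proportion can be illustrated with a short Python sketch. It is not the BBGP R software. It assumes a Beta prior on the allele frequency (uniform by default, an assumption of this sketch) combined with a binomial read-count likelihood, which yields a Beta posterior whose variance shrinks as depth grows; the function name is hypothetical.

```python
def allele_frequency_posterior(alt_reads, depth, prior_a=1.0, prior_b=1.0):
    """Posterior mean and variance of an allele frequency.

    With a Beta(prior_a, prior_b) prior and a binomial likelihood for
    alt_reads out of depth, the posterior is
    Beta(prior_a + alt_reads, prior_b + depth - alt_reads).
    """
    if not 0 <= alt_reads <= depth:
        raise ValueError("alt_reads must lie between 0 and depth")
    a = prior_a + alt_reads
    b = prior_b + depth - alt_reads
    mean = a / (a + b)
    variance = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, variance
```

For the same observed proportion, a site covered by 100 reads gets a smaller posterior variance than a site covered by 10 reads. Per-time-point variances of this kind are what a Gaussian process model over the time series could take as observation noise.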


Subjects
Evolution, Molecular , Gene Frequency , High-Throughput Nucleotide Sequencing/methods , Alleles , Animals , Drosophila/genetics , Genomics/methods , Models, Statistical , Normal Distribution , Polymorphism, Single Nucleotide , Software
9.
Bioinformatics ; 30(17): 2471-9, 2014 Sep 01.
Article in English | MEDLINE | ID: mdl-24845653

ABSTRACT

MOTIVATION: Over the recent years, the field of whole-metagenome shotgun sequencing has witnessed significant growth owing to the high-throughput sequencing technologies that allow sequencing genomic samples cheaper, faster and with better coverage than before. This technical advancement has initiated the trend of sequencing multiple samples in different conditions or environments to explore the similarities and dissimilarities of the microbial communities. Examples include the human microbiome project and various studies of the human intestinal tract. With the availability of ever larger databases of such measurements, finding samples similar to a given query sample is becoming a central operation. RESULTS: In this article, we develop a content-based exploration and retrieval method for whole-metagenome sequencing samples. We apply a distributed string mining framework to efficiently extract all informative sequence k-mers from a pool of metagenomic samples and use them to measure the dissimilarity between two samples. We evaluate the performance of the proposed approach on two human gut metagenome datasets as well as human microbiome project metagenomic samples. We observe significant enrichment for diseased gut samples in results of queries with another diseased sample and high accuracy in discriminating between different body sites even though the method is unsupervised. AVAILABILITY AND IMPLEMENTATION: A software implementation of the DSM framework is available at https://github.com/HIITMetagenomics/dsm-framework. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
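The general idea of measuring dissimilarity between two samples from their k-mer content can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not the distributed string mining (DSM) framework: it uses a single fixed k, keeps every k-mer instead of selecting informative ones, and uses Jaccard distance as the dissimilarity, all of which are assumptions of this sketch. The function names are hypothetical.

```python
def kmers(sequence, k):
    """Set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def kmer_dissimilarity(sample_a, sample_b, k=3):
    """Jaccard distance between the k-mer sets of two samples.

    Each sample is a list of reads (strings). Returns 0.0 for identical
    k-mer content and 1.0 when no k-mer is shared.
    """
    set_a = set().union(*(kmers(read, k) for read in sample_a))
    set_b = set().union(*(kmers(read, k) for read in sample_b))
    union = set_a | set_b
    if not union:
        return 0.0  # no k-mers at all: treat the samples as indistinguishable
    return 1.0 - len(set_a & set_b) / len(union)
```

Retrieval then amounts to ranking the database samples by their dissimilarity to the query sample, smallest first.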


Subjects
Metagenomics/methods , Algorithms , Data Mining , Diabetes Mellitus, Type 2/microbiology , High-Throughput Nucleotide Sequencing , Humans , Inflammatory Bowel Diseases/microbiology , Microbiota , Sequence Analysis, DNA
10.
PLoS Comput Biol ; 10(5): e1003598, 2014 May.
Article in English | MEDLINE | ID: mdl-24830797

ABSTRACT

Gene transcription mediated by RNA polymerase II (pol-II) is a key step in gene expression. The dynamics of pol-II moving along the transcribed region influence the rate and timing of gene expression. In this work, we present a probabilistic model of transcription dynamics which is fitted to pol-II occupancy time course data measured using ChIP-Seq. The model can be used to estimate transcription speed and to infer the temporal pol-II activity profile at the gene promoter. Model parameters are estimated using either maximum likelihood estimation or via Bayesian inference using Markov chain Monte Carlo sampling. The Bayesian approach provides confidence intervals for parameter estimates and allows the use of priors that capture domain knowledge, e.g. the expected range of transcription speeds, based on previous experiments. The model describes the movement of pol-II down the gene body and can be used to identify the time of induction for transcriptionally engaged genes. By clustering the inferred promoter activity time profiles, we are able to determine which genes respond quickly to stimuli and group genes that share activity profiles and may therefore be co-regulated. We apply our methodology to biological data obtained using ChIP-seq to measure pol-II occupancy genome-wide when MCF-7 human breast cancer cells are treated with estradiol (E2). The transcription speeds we obtain agree with those obtained previously for smaller numbers of genes with the advantage that our approach can be applied genome-wide. We validate the biological significance of the pol-II promoter activity clusters by investigating cluster-specific transcription factor binding patterns and determining canonical pathway enrichment. We find that rapidly induced genes are enriched for both estrogen receptor alpha (ERα) and FOXA1 binding in their proximal promoter regions.


Subjects
Chromatin Immunoprecipitation/methods , DNA-Directed RNA Polymerases/genetics , Models, Genetic , Models, Statistical , Promoter Regions, Genetic/genetics , Transcription, Genetic/genetics , Transcriptional Activation/genetics , Animals , Computer Simulation , Humans , Protein Binding
11.
Bioinformatics ; 28(13): 1721-8, 2012 Jul 01.
Article in English | MEDLINE | ID: mdl-22563066

ABSTRACT

MOTIVATION: High-throughput sequencing enables expression analysis at the level of individual transcripts. The analysis of transcriptome expression levels and differential expression (DE) estimation requires a probabilistic approach to properly account for ambiguity caused by shared exons and finite read sampling as well as the intrinsic biological variance of transcript expression. RESULTS: We present Bayesian inference of transcripts from sequencing data (BitSeq), a Bayesian approach for estimation of transcript expression level from RNA-seq experiments. Inferred relative expression is represented by Markov chain Monte Carlo samples from the posterior probability distribution of a generative model of the read data. We propose a novel method for DE analysis across replicates which propagates uncertainty from the sample-level model while modelling biological variance using an expression-level-dependent prior. We demonstrate the advantages of our method using simulated data as well as an RNA-seq dataset with technical and biological replication for both studied conditions. AVAILABILITY: The implementation of the transcriptome expression estimation and differential expression analysis, BitSeq, has been written in C++ and Python. The software is available online from http://code.google.com/p/bitseq/, version 0.4 was used for generating results presented in this article.


Subjects
Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, RNA/methods , Bayes Theorem , Genetic Variation , Models, Statistical , Sequence Alignment , Software , Transcriptome
12.
Proc Natl Acad Sci U S A ; 107(17): 7793-8, 2010 Apr 27.
Article in English | MEDLINE | ID: mdl-20385836

ABSTRACT

We present a computational method for identifying potential targets of a transcription factor (TF) using wild-type gene expression time series data. For each putative target gene we fit a simple differential equation model of transcriptional regulation, and the model likelihood serves as a score to rank targets. The expression profile of the TF is modeled as a sample from a Gaussian process prior distribution that is integrated out using a nonparametric Bayesian procedure. This results in a parsimonious model with relatively few parameters that can be applied to short time series datasets without noticeable overfitting. We assess our method using genome-wide chromatin immunoprecipitation (ChIP-chip) and loss-of-function mutant expression data for two TFs, Twist, and Mef2, controlling mesoderm development in Drosophila. Lists of top-ranked genes identified by our method are significantly enriched for genes close to bound regions identified in the ChIP-chip data and for genes that are differentially expressed in loss-of-function mutants. Targets of Twist display diverse expression profiles, and in this case a model-based approach performs significantly better than scoring based on correlation with TF expression. Our approach is found to be comparable or superior to ranking based on mutant differential expression scores. Also, we show how integrating complementary wild-type spatial expression data can further improve target ranking performance.


Subjects
Drosophila Proteins/metabolism , Gene Expression Regulation/genetics , Gene Regulatory Networks/genetics , Models, Genetic , Myogenic Regulatory Factors/metabolism , Systems Biology/methods , Twist-Related Protein 1/metabolism , Bayes Theorem , Chromatin Immunoprecipitation , Gene Expression Regulation/physiology , Likelihood Functions , Mutation/genetics
13.
BMJ Glob Health ; 8(2), 2023 Feb.
Article in English | MEDLINE | ID: mdl-36792230

ABSTRACT

The COVID-19 pandemic highlighted the need to prioritise mature digital health and data governance at both national and supranational levels to guarantee future health security. The Riyadh Declaration on Digital Health was a call to action to create the infrastructure needed to share effective digital health evidence-based practices and high-quality, real-time data locally and globally to provide actionable information to more health systems and countries. The declaration proposed nine key recommendations for data and digital health that need to be adopted by the global health community to address future pandemics and health threats. Here, we expand on each recommendation and provide an evidence-based roadmap for their implementation. This policy document serves as a resource and toolkit that all stakeholders in digital health and disaster preparedness can follow to develop digital infrastructure and protocols in readiness for future health threats through robust digital public health leadership.


Subjects
COVID-19 , Public Health , Humans , Leadership , Pandemics/prevention & control , Global Health
14.
Bioinformatics ; 27(7): 1026-7, 2011 Apr 01.
Article in English | MEDLINE | ID: mdl-21300702

ABSTRACT

tigre is an R/Bioconductor package for inference of transcription factor activity and ranking candidate target genes from gene expression time series. The underlying methodology is based on Gaussian process inference on a differential equation model that allows the use of short, unevenly sampled time series. The method has been designed with efficient parallel implementation in mind, and the package supports parallel operation even without additional software. AVAILABILITY: The tigre package has been included in Bioconductor since release 2.6 for R 2.11. The package and a user's guide are available at http://www.bioconductor.org.


Subjects
Software , Transcription Factors/metabolism , Gene Expression Profiling , Gene Expression Regulation , Oligonucleotide Array Sequence Analysis
15.
Nat Commun ; 13(1): 7417, 2022 Dec 01.
Article in English | MEDLINE | ID: mdl-36456554

ABSTRACT

Opportunistic bacterial pathogen species and their strains that colonise the human gut are generally understood to compete against both each other and the commensal species colonising this ecosystem. Currently we are lacking a population-wide quantification of strain-level colonisation dynamics and the relationship of colonisation potential to prevalence in disease, and how ecological factors might be modulating these. Here, using a combination of latest high-resolution metagenomics and strain-level genomic epidemiology methods we performed a characterisation of the competition and colonisation dynamics for a longitudinal cohort of neonatal gut microbiomes. We found strong inter- and intra-species competition dynamics in the gut colonisation process, but also a number of synergistic relationships among several species belonging to genus Klebsiella, which includes the prominent human pathogen Klebsiella pneumoniae. No evidence of preferential colonisation by hospital-adapted pathogen lineages in either vaginal or caesarean section birth groups was detected. Our analysis further enabled unbiased assessment of strain-level colonisation potential of extra-intestinal pathogenic Escherichia coli (ExPEC) in comparison with their propensity to cause bloodstream infections. Our study highlights the importance of systematic surveillance of bacterial gut pathogens, not only from disease but also from carriage state, to better inform therapies and preventive medicine in the future.


Subjects
Cesarean Section , Ecosystem , Female , Pregnancy , Infant, Newborn , Humans , Klebsiella , Metagenomics , Parturition , Escherichia coli/genetics
16.
Patterns (N Y) ; 2(7): 100271, 2021 Jul 09.
Article in English | MEDLINE | ID: mdl-34286296

ABSTRACT

Differential privacy allows quantifying privacy loss resulting from accession of sensitive personal data. Repeated accesses to underlying data incur increasing loss. Releasing data as privacy-preserving synthetic data would avoid this limitation but would leave open the problem of designing what kind of synthetic data. We propose formulating the problem of private data release through probabilistic modeling. This approach transforms the problem of designing the synthetic data into choosing a model for the data, allowing also the inclusion of prior knowledge, which improves the quality of the synthetic data. We demonstrate empirically, in an epidemiological study, that statistical discoveries can be reliably reproduced from the synthetic data. We expect the method to have broad use in creating high-quality anonymized data twins of key datasets for research.
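The statement that repeated accesses incur increasing privacy loss can be made concrete with a small Python sketch. It is not the paper's method (which releases synthetic data from a probabilistic model). It assumes the textbook Laplace mechanism for a bounded mean and basic sequential composition, under which the epsilons of successive queries add up; the function names and the clipping bounds are hypothetical choices for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) draw, as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng):
    """Epsilon-differentially private mean of values clipped to [lower, upper].

    Replacing one record changes the clipped mean by at most
    (upper - lower) / n, so Laplace noise with scale sensitivity / epsilon
    suffices (the data set size n is treated as public).
    """
    clipped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / len(clipped)
    true_mean = sum(clipped) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


def total_privacy_loss(epsilons):
    """Basic sequential composition: per-query epsilons add up."""
    return sum(epsilons)
```

Answering ten such queries at epsilon 0.1 each spends a total budget of about 1.0 under basic composition, whereas a synthetic data set released once at a fixed epsilon can be analysed repeatedly without further privacy loss, which is the limitation the abstract refers to.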

17.
Microb Genom ; 7(11), 2021 Nov.
Article in English | MEDLINE | ID: mdl-34779765

ABSTRACT

Genomic epidemiology is a tool for tracing transmission of pathogens based on whole-genome sequencing. We introduce the mGEMS pipeline for genomic epidemiology with plate sweeps representing mixed samples of a target pathogen, opening the possibility to sequence all colonies on selective plates with a single DNA extraction and sequencing step. The pipeline includes the novel mGEMS read binner for probabilistic assignments of sequencing reads, and the scalable pseudoaligner Themisto. We demonstrate the effectiveness of our approach using closely related samples in a nosocomial setting, obtaining results that are comparable to those based on single-colony picks. Our results lend firm support to more widespread consideration of genomic epidemiology with mixed infection samples.


Subjects
Genome, Bacterial , Genomics , Sequence Analysis , Whole Genome Sequencing
18.
Microbiol Resour Announc ; 10(22): e0136420, 2021 Jun 03.
Article in English | MEDLINE | ID: mdl-34080898

ABSTRACT

Clostridium botulinum group III is the anaerobic Gram-positive bacterium producing the deadly neurotoxin responsible for animal botulism. Here, we used long-read sequencing to produce four complete genomes from Clostridium botulinum group III neurotoxin types C, D, C/D, and D/C. The protocol for obtaining high-molecular-weight DNA from C. botulinum group III is described.

19.
Wellcome Open Res ; 5: 14, 2020.
Article in English | MEDLINE | ID: mdl-34746439

ABSTRACT

Determining the composition of bacterial communities beyond the level of a genus or species is challenging because of the considerable overlap between genomes representing close relatives. Here, we present the mSWEEP pipeline for identifying and estimating the relative sequence abundances of bacterial lineages from plate sweeps of enrichment cultures. mSWEEP leverages biologically grouped sequence assembly databases, applying probabilistic modelling, and provides controls for false positive results. Using sequencing data from major pathogens, we demonstrate significant improvements in lineage quantification and detection accuracy. Our pipeline facilitates investigating cultures comprising mixtures of bacteria, and opens up a new field of plate sweep metagenomics.

20.
Bioinformatics ; 24(16): i70-5, 2008 Aug 15.
Article in English | MEDLINE | ID: mdl-18689843

ABSTRACT

MOTIVATION: Inference of latent chemical species in biochemical interaction networks is a key problem in estimation of the structure and parameters of the genetic, metabolic and protein interaction networks that underpin all biological processes. We present a framework for Bayesian marginalization of these latent chemical species through Gaussian process priors. RESULTS: We demonstrate our general approach on three different biological examples of single input motifs, including both activation and repression of transcription. We focus in particular on the problem of inferring transcription factor activity when the concentration of active protein cannot easily be measured. We show how the uncertainty in the inferred transcription factor activity can be integrated out in order to derive a likelihood function that can be used for the estimation of regulatory model parameters. An advantage of our approach is that we avoid the use of a coarse-grained discretization of continuous time functions, which would lead to a large number of additional parameters to be estimated. We develop exact (for linear regulation) and approximate (for non-linear regulation) inference schemes, which are much more efficient than competing sampling-based schemes and therefore provide us with a practical toolkit for model-based inference. AVAILABILITY: The software and data for recreating all the experiments in this paper are available in MATLAB from http://www.cs.man.ac.uk/~neill/gpsim.


Subjects
Models, Chemical , Models, Genetic , RNA, Messenger/chemistry , RNA, Messenger/genetics , Transcription Factors/chemistry , Transcription Factors/genetics , Transcriptional Activation/genetics , Computer Simulation , Models, Statistical , Normal Distribution