1.
Brief Bioinform ; 22(1): 308-314, 2021 01 18.
Article in English | MEDLINE | ID: mdl-32008042

ABSTRACT

The use of machine learning (ML) has become prevalent in the genome engineering space, with applications ranging from predicting target site efficiency to forecasting the outcome of repair events. However, jargon and ML-specific accuracy measures have made it hard to assess the validity of individual approaches, potentially leading to misinterpretation of ML results. This review aims to close the gap by discussing ML approaches and pitfalls in the context of CRISPR gene-editing applications. Specifically, we address common considerations, such as algorithm choice, as well as problems, such as overestimating accuracy and data interoperability, by providing tangible examples from the genome-engineering domain. Equipping researchers with the knowledge to effectively use ML to better design gene-editing experiments and predict experimental outcomes will help advance the field more rapidly.
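One way accuracy can be overestimated is class imbalance. The abstract does not say which pitfalls the review covers in detail, so the following is only a minimal illustrative sketch: the 90/10 split of inactive and active target sites is a made-up assumption, and the helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical data: 90 inactive (0) and 10 active (1) target sites.
labels = [0] * 90 + [1] * 10

# A "model" that ignores its input and always predicts inactive.
always_inactive = [0] * 100

plain = accuracy(labels, always_inactive)
balanced = balanced_accuracy(labels, always_inactive)
```

Here the plain accuracy looks high even though the predictor never identifies an active site, while the balanced measure exposes that it performs no better than chance.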


Subjects
CRISPR-Cas Systems , Gene Editing/methods , Machine Learning , Animals , Gene Editing/standards , Genomics/methods , Genomics/standards , Humans
2.
BMC Genomics ; 16: 1052, 2015 Dec 10.
Article in English | MEDLINE | ID: mdl-26651996

ABSTRACT

BACKGROUND: Genomic information is increasingly used in medical practice, giving rise to the need for efficient analysis methodology able to cope with thousands of individuals and millions of variants. The widely used Hadoop MapReduce architecture and associated machine learning library, Mahout, provide the means for tackling computationally challenging tasks. However, many genomic analyses do not fit the MapReduce paradigm. We therefore utilise the recently developed SPARK engine, along with its associated machine learning library, MLlib, which offers more flexibility in the parallelisation of population-scale bioinformatics tasks. The resulting tool, VARIANTSPARK, provides an interface from MLlib to the standard variant format (VCF), offers seamless genome-wide sampling of variants and provides a pipeline for visualising results. RESULTS: To demonstrate the capabilities of VARIANTSPARK, we clustered more than 3,000 individuals with 80 million variants each to determine the population structure in the dataset. VARIANTSPARK is 80% faster than the SPARK-based genome clustering approach, ADAM, the comparable implementation using Hadoop/Mahout, as well as ADMIXTURE, a commonly used tool for determining individual ancestries. It is over 90% faster than traditional implementations using R and Python. CONCLUSION: The benefits of speed, resource consumption and scalability enable VARIANTSPARK to open up the usage of advanced, efficient machine learning algorithms to genomic data.


Subjects
Computational Biology/methods , Genotype , Algorithms , Cluster Analysis , Humans , Polymorphism, Single Nucleotide , Software
3.
Gigascience ; 9(8)2020 08 01.
Article in English | MEDLINE | ID: mdl-32761098

ABSTRACT

BACKGROUND: Many traits and diseases are thought to be driven by >1 gene (polygenic). Polygenic risk scores (PRS) hence expand on genome-wide association studies by taking multiple genes into account when risk models are built. However, PRS only considers the additive effect of individual genes but not epistatic interactions or the combination of individual and interacting drivers. While evidence of epistatic interactions is found in small datasets, large datasets have not been processed yet owing to the high computational complexity of the search for epistatic interactions. FINDINGS: We have developed VariantSpark, a distributed machine learning framework able to perform association analysis for complex phenotypes that are polygenic and potentially involve a large number of epistatic interactions. Efficient multi-layer parallelization allows VariantSpark to scale to the whole genome of population-scale datasets with 100,000,000 genomic variants and 100,000 samples. CONCLUSIONS: Compared with traditional monogenic genome-wide association studies, VariantSpark better identifies genomic variants associated with complex phenotypes. VariantSpark is 3.6 times faster than ReForeSt and the only method able to scale to ultra-high-dimensional genomic data in a manageable time.
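The contrast between an additive score and an epistatic interaction can be illustrated with a toy example. This is a sketch under stated assumptions, not the method used by VariantSpark: the two-variant genotypes, the unit weights, and the XOR-style phenotype rule are all hypothetical and chosen only to show why a purely additive score can miss an interaction.

```python
def polygenic_risk_score(genotype, weights):
    """Additive score: weighted sum of per-variant allele indicators."""
    return sum(w * g for w, g in zip(weights, genotype))


def toy_epistatic_phenotype(genotype):
    """Hypothetical interaction: affected only if exactly one of the
    first two variants carries the alternate allele (XOR-like rule)."""
    return int((genotype[0] > 0) != (genotype[1] > 0))


# All four combinations of two binary-coded variants (assumed toy data).
genotypes = [(0, 0), (1, 0), (0, 1), (1, 1)]
weights = [1.0, 1.0]  # hypothetical per-variant effect sizes

scores = [polygenic_risk_score(g, weights) for g in genotypes]
phenotypes = [toy_epistatic_phenotype(g) for g in genotypes]

# The double carrier (1, 1) gets the highest additive score
# although it is unaffected under the toy interaction rule.
```

With these toy values the additive score ranks the double carrier as the highest-risk genotype, yet the interaction rule marks it as unaffected, so no single threshold on the score separates the affected from the unaffected genotypes.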


Subjects
Cloud Computing , Genome-Wide Association Study , Genomics , Machine Learning , Phenotype
4.
Sci Rep ; 9(1): 2788, 2019 02 26.
Article in English | MEDLINE | ID: mdl-30808944

ABSTRACT

Editing individual nucleotides is a crucial component for validating genomic disease association. It is currently hampered by CRISPR-Cas-mediated "base editing" being limited to certain nucleotide changes, and only achievable within a small window around CRISPR-Cas target sites. The more versatile alternative, HDR (homology directed repair), has a 3-fold lower efficiency with known optimization factors being largely immutable in experiments. Here, we investigated the variable efficiency-governing factors on a novel mouse dataset using machine learning. We found the sequence composition of the single-stranded oligodeoxynucleotide (ssODN), i.e. the repair template, to be a governing factor. Furthermore, different regions of the ssODN have variable influence, which reflects the underlying mechanism of the repair process. Our model improves HDR efficiency by 83% compared to traditionally chosen targets. Using our findings, we developed CUNE (Computational Universal Nucleotide Editor), which enables users to identify and design the optimal targeting strategy using traditional base editing or, for the first time, HDR-mediated nucleotide changes.


Subjects
DNA Repair , Gene Editing , Machine Learning , Animals , CRISPR-Cas Systems/genetics , DNA Breaks, Double-Stranded , Mice , Mice, Inbred C57BL , Mutation , Oligodeoxyribonucleotides/metabolism , RNA, Guide, Kinetoplastida/metabolism
5.
Genome Biol ; 20(1): 171, 2019 08 26.
Article in English | MEDLINE | ID: mdl-31446895

ABSTRACT

BACKGROUND: CRISPR-Cas9 gene-editing technology has facilitated the generation of knockout mice, providing an alternative to cumbersome and time-consuming traditional embryonic stem cell-based methods. An earlier study reported up to 16% efficiency in generating conditional knockout (cKO or floxed) alleles by microinjection of 2 single guide RNAs (sgRNA) and 2 single-stranded oligonucleotides as donors (referred to herein as the "two-donor floxing" method). RESULTS: We re-evaluate the two-donor method from a consortium of 20 laboratories across the world. The dataset constitutes 56 genetic loci, 17,887 zygotes, and 1718 live-born mice, of which only 15 (0.87%) mice contain cKO alleles. We subject the dataset to statistical analyses and a machine learning algorithm, which reveals that none of the factors analyzed was predictive for the success of this method. We test some of the newer methods that use one-donor DNA on 18 loci for which the two-donor approach failed to produce cKO alleles. We find that the one-donor methods are 10- to 20-fold more efficient than the two-donor approach. CONCLUSION: We propose that the two-donor method lacks efficiency because it relies on two simultaneous recombination events in cis, an outcome that is dwarfed by pervasive accompanying undesired editing events. The methods that use one-donor DNA are fairly efficient as they rely on only one recombination event, and the probability of correct insertion of the donor cassette without unanticipated mutational events is much higher. Therefore, one-donor methods offer higher efficiencies for the routine generation of cKO animal models.
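The reported efficiency can be checked directly from the counts in the abstract. The second half of the sketch below is only an assumption-laden illustration of why requiring two events can lower success: it treats recombination events as independent with a made-up per-event probability, which the abstract does not state, and the value used is not an estimate from the study.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Counts taken from the abstract.
zygotes = 17887
live_born = 1718
cko_mice = 15

rate_per_live_born = percent(cko_mice, live_born)  # the reported figure
rate_per_zygote = percent(cko_mice, zygotes)


def joint_success(p_single, n_events):
    """Probability that n independent events all succeed.
    Independence is an assumption made only for this illustration."""
    return p_single ** n_events


p_single = 0.2  # hypothetical per-event success probability
one_event = joint_success(p_single, 1)
two_events = joint_success(p_single, 2)
```

Rounded to two decimals, `rate_per_live_born` reproduces the 0.87% given in the abstract. Under the independence assumption, requiring two successful events multiplies the success probability by `p_single` again, so the joint probability is always lower than for a single event.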


Subjects
Alleles , CRISPR-Associated Protein 9/metabolism , CRISPR-Cas Systems/genetics , Animals , Blastocyst/metabolism , Factor Analysis, Statistical , Female , Male , Methyl-CpG-Binding Protein 2/genetics , Methyl-CpG-Binding Protein 2/metabolism , Mice, Knockout , Microinjections , Regression Analysis , Reproducibility of Results
6.
Front Pharmacol ; 9: 749, 2018.
Article in English | MEDLINE | ID: mdl-30050439

ABSTRACT

Recent years have seen the development of computational tools to assist researchers in performing CRISPR-Cas9 experiments optimally. More specifically, these tools aim to maximize on-target activity (guide efficiency) while also minimizing potential off-target effects (guide specificity) by analyzing the features of the target site. Nonetheless, currently available tools cannot robustly predict experimental success, as prediction accuracy depends on the approximations of the underlying model and how closely the experimental setup matches the data the model was trained on. Here, we present an overview of the available computational tools, their current limitations and future considerations. We discuss new trends around personalized health by taking genomic variants into account when predicting target sites, as well as other governing factors that can improve prediction accuracy.

7.
CRISPR J ; 1: 182-190, 2018 04.
Article in English | MEDLINE | ID: mdl-31021206

ABSTRACT

The activity of CRISPR-Cas9 target sites can be measured experimentally through phenotypic assays or mutation rate and used to build computational models to predict activity of novel target sites. However, currently published models have been reported to perform poorly in situations other than their training conditions. In this study, we hence investigate how different sources of data influence predictive power and identify the best data set for the most robust predictive model. We use the activity of 28,606 target sites and a machine learning approach to train a predictive model of CRISPR-Cas9 activity, outperforming other published methods by an average increase in accuracy of 80% for prediction of the degree of activity and 13% for classification into active and inactive categories. We find that using data sets that measure CRISPR-Cas9 activity through sequencing provides more accurate predictions of activity. Our model, dubbed TUSCAN, is highly scalable, predicting the activity of 5000 target sites in under 7 s, making it suitable for genome-wide screens. We conclude that sophisticated machine learning methods can classify binary CRISPR-Cas9 activity; however, predicting fine-scale activity scores will require larger data sets directly measuring Indel insertion rate.
