Results 1 - 20 of 33
1.
iScience ; 27(5): 109570, 2024 May 17.
Article in English | MEDLINE | ID: mdl-38646172

ABSTRACT

The three-dimensional organization of genomes plays a crucial role in essential biological processes. The segregation of chromatin into A and B compartments highlights regions of activity and inactivity, providing a window into the genomic activities specific to each cell type. Yet, the steep costs associated with acquiring Hi-C data, necessary for studying this compartmentalization across various cell types, pose a significant barrier to studying cell-type-specific genome organization. To address this, we present a prediction tool called compartment prediction using recurrent neural networks (CoRNN), which predicts compartmentalization of the 3D genome using histone modification enrichment. CoRNN demonstrates robust cross-cell-type prediction of A/B compartments with an average AuROC of 90.9%. Cell-type-specific predictions align well with known functional elements, with H3K27ac and H3K36me3 identified as highly predictive histone marks. We further investigate our mispredictions and find that they are located in regions with ambiguous compartmental status. Furthermore, our model's generalizability is validated by predicting compartments in independent tissue samples, which underscores its broad applicability.

2.
JMIR Med Educ ; 10: e51391, 2024 Feb 13.
Article in English | MEDLINE | ID: mdl-38349725

ABSTRACT

BACKGROUND: Patients with rare and complex diseases often experience delayed diagnoses and misdiagnoses because comprehensive knowledge about these diseases is limited to only a few medical experts. In this context, large language models (LLMs) have emerged as powerful knowledge aggregation tools with applications in clinical decision support and education domains. OBJECTIVE: This study aims to explore the potential of 3 popular LLMs, namely Bard (Google LLC), ChatGPT-3.5 (OpenAI), and GPT-4 (OpenAI), in medical education to enhance the diagnosis of rare and complex diseases while investigating the impact of prompt engineering on their performance. METHODS: We conducted experiments on publicly available complex and rare cases to achieve these objectives. We implemented various prompt strategies to evaluate the performance of these models using both open-ended and multiple-choice prompts. In addition, we used a majority voting strategy to leverage diverse reasoning paths within language models, aiming to enhance their reliability. Furthermore, we compared their performance with the performance of human respondents and MedAlpaca, a generative LLM specifically designed for medical tasks. RESULTS: Notably, all LLMs outperformed the average human consensus and MedAlpaca, with a minimum margin of 5% and 13%, respectively, across all 30 cases from the diagnostic case challenge collection. On the frequently misdiagnosed cases category, Bard tied with MedAlpaca but surpassed the human average consensus by 14%, whereas GPT-4 and ChatGPT-3.5 outperformed MedAlpaca and the human respondents on the moderately often misdiagnosed cases category with minimum accuracy scores of 28% and 11%, respectively. The majority voting strategy, particularly with GPT-4, demonstrated the highest overall score across all cases from the diagnostic complex case collection, surpassing that of other LLMs. 
On the Medical Information Mart for Intensive Care-III data sets, Bard and GPT-4 achieved the highest diagnostic accuracy scores, with multiple-choice prompts scoring 93%, whereas ChatGPT-3.5 and MedAlpaca scored 73% and 47%, respectively. Furthermore, our results demonstrate that there is no one-size-fits-all prompting approach for improving the performance of LLMs and that a single strategy does not universally apply to all LLMs. CONCLUSIONS: Our findings shed light on the diagnostic capabilities of LLMs and the challenges associated with identifying an optimal prompting strategy that aligns with each language model's characteristics and specific task requirements. The significance of prompt engineering is highlighted, providing valuable insights for researchers and practitioners who use these language models for medical training. Furthermore, this study represents a crucial step toward understanding how LLMs can enhance diagnostic reasoning in rare and complex medical cases, paving the way for developing effective educational tools and accurate diagnostic aids to improve patient care and outcomes.


Subjects
Learning, Problem Solving, Humans, Reproducibility of Results, Educational Status, Language
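
The majority voting strategy mentioned in the abstract of entry 2 can be illustrated with a minimal sketch. The abstract does not say how answers were normalized or how ties were broken, so the lowercasing, the first-seen tie-break, the function name `majority_vote`, and the example diagnoses are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among several sampled model responses.

    Answers are compared after stripping whitespace and lowercasing
    (an assumption). Ties go to the answer that appeared first.
    """
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    best = max(counts.values())
    # Walk the answers in their original order so ties resolve to the first seen.
    for answer in normalized:
        if counts[answer] == best:
            return answer


# Hypothetical responses from five independent samples of the same prompt.
sampled = ["Sarcoidosis", "sarcoidosis ", "Tuberculosis", "Lymphoma", "Sarcoidosis"]
print(majority_vote(sampled))
```

Sampling the same prompt several times and keeping the modal answer is one simple way to combine diverse reasoning paths; the study's exact procedure may differ.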
3.
Cell Rep ; 42(12): 113500, 2023 12 26.
Article in English | MEDLINE | ID: mdl-38032797

ABSTRACT

Aging is a major risk factor for many diseases. Accurate methods for predicting age in specific cell types are essential to understand the heterogeneity of aging and to assess rejuvenation strategies. However, classifying organismal age at single-cell resolution using transcriptomics is challenging due to sparsity and noise. Here, we developed CellBiAge, a robust and easy-to-implement machine learning pipeline, to classify the age of single cells in the mouse brain using single-cell transcriptomics. We show that binarization of gene expression values for the top highly variable genes significantly improved test performance across different models, techniques, sexes, and brain regions, with potential age-related genes identified for model prediction. Additionally, we demonstrate CellBiAge's ability to capture exercise-induced rejuvenation in neural stem cells. This study provides a broadly applicable approach for robust classification of organismal age of single cells in the mouse brain, which may aid in understanding the aging process and evaluating rejuvenation methods.


Subjects
Gene Expression Profiling, Single-Cell Analysis, Animals, Mice, Single-Cell Analysis/methods, Machine Learning, Cellular Senescence, Aging
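
The preprocessing idea in entry 3, binarizing expression values for the most variable genes, can be sketched as follows. This is not the CellBiAge implementation: plain population variance stands in for a proper highly-variable-gene selection, and the "expressed if greater than zero" threshold and the function name are assumptions.

```python
from statistics import pvariance


def binarize_top_variable_genes(counts, n_top):
    """Select the n_top most variable genes and binarize their expression.

    counts: list of cells, each a list of per-gene expression values.
    Returns (sorted indices of the selected genes, binarized matrix).
    """
    n_genes = len(counts[0])
    variances = [pvariance([cell[g] for cell in counts]) for g in range(n_genes)]
    # Rank genes by variance, highest first; ties keep the lower gene index.
    top = sorted(range(n_genes), key=lambda g: -variances[g])[:n_top]
    top.sort()
    # A gene counts as "on" in a cell when any expression is detected (assumed threshold).
    binary = [[1 if cell[g] > 0 else 0 for g in top] for cell in counts]
    return top, binary


# Toy matrix: 3 cells x 4 genes (hypothetical values).
toy_counts = [[0, 5, 1, 0], [0, 0, 1, 3], [0, 9, 1, 0]]
genes, matrix = binarize_top_variable_genes(toy_counts, n_top=2)
print(genes, matrix)
```

The binarized matrix can then be passed to any standard classifier; binarization discards magnitude, which is one way to dampen the sparsity and noise the abstract refers to.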
4.
Microscopy (Oxf) ; 2023 Oct 21.
Article in English | MEDLINE | ID: mdl-37864808

ABSTRACT

We present a graph neural network (GNN)-based framework applied to large-scale microscopy image segmentation tasks. While deep learning models, like convolutional neural networks (CNNs), have become common for automating image segmentation tasks, they are limited by the image size that can fit in the memory of computational hardware. In a GNN framework, large-scale images are converted into graphs using superpixels (regions of pixels with similar color/intensity values), allowing us to input information from the entire image into the model. By converting images with hundreds of millions of pixels to graphs with thousands of nodes, we can segment large images using memory-limited computational resources. We compare the performance of GNN- and CNN-based segmentation in terms of accuracy, training time and required graphics processing unit memory. Based on our experiments with microscopy images of biological cells and cell colonies, GNN-based segmentation used one to three orders of magnitude fewer computational resources with only a change in accuracy of -2% to +0.3%. Furthermore, errors due to superpixel generation can be reduced by either using better superpixel generation algorithms or increasing the number of superpixels, thereby allowing for improvement in the GNN framework's accuracy. This trade-off between accuracy and computational cost over CNN models makes the GNN framework attractive for many large-scale microscopy image segmentation tasks in biology.

5.
Cancer Res ; 83(12): 1984-1999, 2023 06 15.
Article in English | MEDLINE | ID: mdl-37101376

ABSTRACT

Chitinase 3-like 1 (Chi3l1) is a secreted protein that is highly expressed in glioblastoma. Here, we show that Chi3l1 alters the state of glioma stem cells (GSC) to support tumor growth. Exposure of patient-derived GSCs to Chi3l1 reduced the frequency of CD133+SOX2+ cells and increased the CD44+Chi3l1+ cells. Chi3l1 bound to CD44 and induced phosphorylation and nuclear translocation of β-catenin, Akt, and STAT3. Single-cell RNA sequencing and RNA velocity following incubation of GSCs with Chi3l1 showed significant changes in GSC state dynamics driving GSCs towards a mesenchymal expression profile and reducing transition probabilities towards terminal cellular states. ATAC-seq revealed that Chi3l1 increases accessibility of promoters containing a Myc-associated zinc finger protein (MAZ) transcription factor footprint. Inhibition of MAZ downregulated a set of genes with high expression in cellular clusters that exhibit significant cell state transitions after treatment with Chi3l1, and MAZ deficiency rescued the Chi3l1-induced increase of GSC self-renewal. Finally, targeting Chi3l1 in vivo with a blocking antibody inhibited tumor growth and increased the probability of survival. Overall, this work suggests that Chi3l1 interacts with CD44 on the surface of GSCs to induce Akt/β-catenin signaling and MAZ transcriptional activity, which in turn upregulates CD44 expression in a pro-mesenchymal feed-forward loop. The role of Chi3l1 in regulating cellular plasticity confers a targetable vulnerability to glioblastoma. SIGNIFICANCE: Chi3l1 is a modulator of glioma stem cell states that can be targeted to promote differentiation and suppress growth of glioblastoma.


Subjects
Brain Neoplasms, Glioblastoma, Glioma, Humans, Glioblastoma/pathology, beta Catenin/metabolism, Proto-Oncogene Proteins c-akt/metabolism, Neoplastic Stem Cells/pathology, Glioma/metabolism, Brain Neoplasms/pathology, Tumor Cell Line, Cell Proliferation
6.
Med Phys ; 50(8): 4943-4959, 2023 Aug.
Article in English | MEDLINE | ID: mdl-36847185

ABSTRACT

PURPOSE: State-of-the-art automated segmentation methods achieve exceptionally high performance on the Brain Tumor Segmentation (BraTS) challenge, a dataset of uniformly processed and standardized magnetic resonance generated images (MRIs) of gliomas. However, a reasonable concern is that these models may not fare well on clinical MRIs that do not belong to the specially curated BraTS dataset. Research using the previous generation of deep learning models indicates significant performance loss on cross-institutional predictions. Here, we evaluate the cross-institutional applicability and generalizability of state-of-the-art deep learning models on new clinical data. METHODS: We train a state-of-the-art 3D U-Net model on the conventional BraTS dataset comprising low- and high-grade gliomas. We then evaluate the performance of this model for automatic tumor segmentation of brain tumors on in-house clinical data. This dataset contains MRIs of different tumor types, resolutions, and standardization than those found in the BraTS dataset. Ground truth segmentations to validate the automated segmentation for in-house clinical data were obtained from expert radiation oncologists. RESULTS: We report average Dice scores of 0.764, 0.648, and 0.61 for the whole tumor, tumor core, and enhancing tumor, respectively, in the clinical MRIs. These means are higher than numbers reported previously on same-institution and cross-institution datasets of different origin using different methods. There is no statistically significant difference when comparing the Dice scores to the inter-annotation variability between two expert clinical radiation oncologists. Although performance on the clinical data is lower than on the BraTS data, these numbers indicate that models trained on the BraTS dataset have impressive segmentation performance on previously unseen images obtained at a separate clinical institution.
These images differ in the imaging resolutions, standardization pipelines, and tumor types from the BraTS data. CONCLUSIONS: State-of-the-art deep learning models demonstrate promising performance on cross-institutional predictions. They considerably improve on previous models and can transfer knowledge to new types of brain tumors without additional modeling.


Subjects
Brain Neoplasms, Glioma, Humans, Brain Neoplasms/diagnostic imaging, Glioma/diagnostic imaging, Health Facilities
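
The Dice scores reported in entry 6 measure the overlap between a predicted and a reference segmentation mask. A minimal sketch of the standard definition, Dice = 2|A∩B| / (|A| + |B|), is given below; representing masks as flat 0/1 sequences and returning 1.0 when both masks are empty are conventions chosen here, not details taken from the paper.

```python
def dice_score(pred, truth):
    """Dice coefficient between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labelled as tumor in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumed convention)
    return 2.0 * intersection / total


# Hypothetical five-voxel masks.
predicted = [1, 1, 0, 0, 1]
reference = [1, 0, 0, 1, 1]
print(round(dice_score(predicted, reference), 3))
```

In a BraTS-style evaluation the score would be computed separately for each region (whole tumor, tumor core, enhancing tumor) and averaged across patients.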
7.
bioRxiv ; 2023 Jan 25.
Article in English | MEDLINE | ID: mdl-36747724

ABSTRACT

With the rapid advance of single-cell RNA sequencing (scRNA-seq) technology, understanding biological processes at a more refined single-cell level is becoming possible. Gene co-expression estimation is an essential step in this direction. It can annotate functionalities of unknown genes or construct the basis of gene regulatory network inference. This study thoroughly tests the existing gene co-expression estimation methods on simulation datasets with known ground truth co-expression networks. We generate these novel datasets using two simulation processes that use the parameters learned from the experimental data. We demonstrate that these simulations better capture the underlying properties of the real-world single-cell datasets than previously tested simulations for the task. Our performance results on tens of simulated and eight experimental datasets show that all methods produce estimations with a high false discovery rate potentially caused by high-sparsity levels in the data. Finally, we find that commonly used pre-processing approaches, such as normalization and imputation, do not improve the co-expression estimation. Overall, our benchmark setup contributes to the co-expression estimator development, and our study provides valuable insights for the community of single-cell data analyses.

8.
Genes (Basel) ; 15(1), 2023 Dec 29.
Article in English | MEDLINE | ID: mdl-38254945

ABSTRACT

Hi-C is a widely used technique to study the 3D organization of the genome. Due to its high sequencing cost, most of the generated datasets are of a coarse resolution, which makes it impractical to study finer chromatin features such as Topologically Associating Domains (TADs) and chromatin loops. Multiple deep learning-based methods have recently been proposed to increase the resolution of these datasets by imputing Hi-C reads (typically called upscaling). However, the existing works evaluate these methods on either synthetically downsampled datasets or a small subset of experimentally generated sparse Hi-C datasets, making it hard to establish their generalizability in the real-world use case. We present our framework, Hi-CY, which compares existing Hi-C resolution upscaling methods on seven experimentally generated low-resolution Hi-C datasets belonging to various levels of read sparsities originating from three cell lines on a comprehensive set of evaluation metrics. Hi-CY also includes four downstream analysis tasks, such as TAD and chromatin loop recall, to provide a thorough report on the generalizability of these methods. We observe that existing deep learning methods fail to generalize to experimentally generated sparse Hi-C datasets, showing a performance reduction of up to 57%. As a potential solution, we find that retraining deep learning-based methods with experimentally generated Hi-C datasets improves performance by up to 31%. More importantly, Hi-CY shows that even with retraining, the existing deep learning-based methods struggle to recover biological features such as chromatin loops and TADs when provided with sparse Hi-C datasets. Our study, through the Hi-CY framework, highlights the need for rigorous evaluation in the future. We identify specific avenues for improvements in the current deep learning-based Hi-C upscaling methods, including but not limited to using experimentally generated datasets for training.


Subjects
Deep Learning, Benchmarking, Cell Line, Chromatin/genetics
10.
J Comput Biol ; 29(11): 1213-1228, 2022 Nov.
Article in English | MEDLINE | ID: mdl-36251763

ABSTRACT

Multiomic single-cell data allow us to perform integrated analysis to understand genomic regulation of biological processes. However, most single-cell sequencing assays are performed on separately sampled cell populations, as applying them to the same single cell is challenging. Existing unsupervised single-cell alignment algorithms have been primarily benchmarked on coassay experiments. Our investigation revealed that these methods do not perform well for non-coassay single-cell experiments when there is disproportionate cell-type representation across measurement domains. Therefore, we extend our previous work, Single Cell alignment using Optimal Transport (SCOT), by using unbalanced Gromov-Wasserstein optimal transport to handle disproportionate cell-type representation and differing sample sizes across single-cell measurements. Our method, SCOTv2, gives state-of-the-art alignment performance across five non-coassay data sets (simulated and real world). It can also integrate multiple (M≥2) single-cell measurements while preserving the self-tuning capabilities and computational tractability of its original version.


Subjects
Algorithms, Genomics
11.
J Am Med Inform Assoc ; 29(12): 2014-2022, 2022 11 14.
Article in English | MEDLINE | ID: mdl-36149257

ABSTRACT

OBJECTIVE: Alzheimer's disease (AD) is the most common neurodegenerative disorder with one of the most complex pathogeneses, making effective and clinically actionable decision support difficult. The objective of this study was to develop a novel multimodal deep learning framework to aid medical professionals in AD diagnosis. MATERIALS AND METHODS: We present a Multimodal Alzheimer's Disease Diagnosis framework (MADDi) to accurately detect the presence of AD and mild cognitive impairment (MCI) from imaging, genetic, and clinical data. MADDi is novel in that we use cross-modal attention, which captures interactions between modalities, a method not previously explored in this domain. We perform multi-class classification, a challenging task considering the strong similarities between MCI and AD. We compare with previous state-of-the-art models, evaluate the importance of attention, and examine the contribution of each modality to the model's performance. RESULTS: MADDi classifies MCI, AD, and controls with 96.88% accuracy on a held-out test set. When examining the contribution of different attention schemes, we found that the combination of cross-modal attention with self-attention performed the best, and no attention layers in the model performed the worst, with a 7.9% difference in F1-scores. DISCUSSION: Our experiments underlined the importance of structured clinical data to help machine learning models contextualize and interpret the remaining modalities. Extensive ablation studies showed that any multimodal mixture of input features without access to structured clinical information suffered marked performance losses. CONCLUSION: This study demonstrates the merit of combining multiple input modalities via cross-modal attention to deliver highly accurate AD diagnostic decision support.


Subjects
Alzheimer Disease, Cognitive Dysfunction, Deep Learning, Humans, Alzheimer Disease/diagnosis, Magnetic Resonance Imaging/methods, Cognitive Dysfunction/diagnosis, Machine Learning
12.
J Comput Biol ; 29(5): 409-424, 2022 05.
Article in English | MEDLINE | ID: mdl-35325548

ABSTRACT

Long-range regulatory interactions among genomic regions are critical for controlling gene expression, and their disruption has been associated with a host of diseases. However, when modeling the effects of regulatory factors, most deep learning models either neglect long-range interactions or fail to capture the inherent 3D structure of the underlying genomic organization. To address these limitations, we present a Graph Convolutional Model for Epigenetic Regulation of Gene Expression (GC-MERGE). Using a graph-based framework, the model incorporates important information about long-range interactions via a natural encoding of genomic spatial interactions into the graph representation. It integrates measurements of both the global genomic organization and the local regulatory factors, specifically histone modifications, to not only predict the expression of a given gene of interest but also quantify the importance of its regulatory factors. We apply GC-MERGE to data sets for three cell lines, namely GM12878 (lymphoblastoid), K562 (myelogenous leukemia), and HUVEC (human umbilical vein endothelial), and demonstrate its state-of-the-art predictive performance. Crucially, we show that our model is interpretable in terms of the observed biological regulatory factors, highlighting both the histone modifications and the interacting genomic regions contributing to a gene's predicted expression. We provide model explanations for multiple exemplar genes and validate them with evidence from the literature. Our model presents a novel setup for predicting gene expression by integrating multimodal data sets in a graph convolutional framework. More importantly, it enables interpretation of the biological mechanisms driving the model's predictions.


Subjects
Genetic Epigenesis, Computer Neural Networks, Gene Expression, Genome, Genomics, Humans
13.
J Comput Biol ; 29(1): 3-18, 2022 01.
Article in English | MEDLINE | ID: mdl-35050714

ABSTRACT

Recent advances in sequencing technologies have allowed us to capture various aspects of the genome at single-cell resolution. However, with the exception of a few co-assaying technologies, it is not possible to simultaneously apply different sequencing assays on the same single cell. In this scenario, computational integration of multi-omic measurements is crucial to enable joint analyses. This integration task is particularly challenging due to the lack of sample-wise or feature-wise correspondences. We present single-cell alignment with optimal transport (SCOT), an unsupervised algorithm that uses the Gromov-Wasserstein optimal transport to align single-cell multi-omics data sets. SCOT performs on par with the current state-of-the-art unsupervised alignment methods, is faster, and requires tuning of fewer hyperparameters. More importantly, SCOT uses a self-tuning heuristic to guide hyperparameter selection based on the Gromov-Wasserstein distance. Thus, in the fully unsupervised setting, SCOT aligns single-cell data sets better than the existing methods without requiring any orthogonal correspondence information.


Subjects
Algorithms, Genomics/statistics & numerical data, Sequence Alignment/statistics & numerical data, Single-Cell Analysis/statistics & numerical data, Computational Biology, Computer Simulation, Genetic Databases/statistics & numerical data, Humans, Statistical Models, Unsupervised Machine Learning
14.
J Comput Biol ; 29(1): 19-22, 2022 01.
Article in English | MEDLINE | ID: mdl-34985990

ABSTRACT

Although the availability of various sequencing technologies allows us to capture different genome properties at single-cell resolution, with the exception of a few co-assaying technologies, applying different sequencing assays on the same single cell is impossible. Single-cell alignment using optimal transport (SCOT) is an unsupervised algorithm that addresses this limitation by using optimal transport to align single-cell multiomics data. First, it preserves the local geometry by constructing a k-nearest neighbor (k-NN) graph for each data set (or domain) to capture the intra-domain distances. SCOT then finds a probabilistic coupling matrix that minimizes the discrepancy between the intra-domain distance matrices. Finally, it uses the coupling matrix to project one single-cell data set onto another through barycentric projection, thus aligning them. SCOT requires tuning only two hyperparameters and is robust to the choice of one. Furthermore, the Gromov-Wasserstein distance in the algorithm can guide SCOT's hyperparameter tuning in a fully unsupervised setting when no orthogonal alignment information is available. Thus, SCOT is a fast and accurate alignment method that provides a heuristic for hyperparameter selection in a real-world unsupervised single-cell data alignment scenario. We provide a tutorial for SCOT and make its source code publicly available on GitHub.


Subjects
Algorithms, Sequence Alignment/statistics & numerical data, Single-Cell Analysis/statistics & numerical data, Computational Biology, Genetic Databases/statistics & numerical data, Genomics/statistics & numerical data, Heuristics, Humans, Computer Neural Networks, Sequence Analysis/statistics & numerical data, Software, Unsupervised Machine Learning
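
The final step described in entry 14, projecting one data set onto another through barycentric projection, can be sketched in a few lines. This is not the SCOT source code: the coupling matrix is taken as given (in SCOT it would come from a Gromov-Wasserstein optimal transport solver, which is not reproduced here), and the function name and toy values are assumptions.

```python
def barycentric_projection(coupling, target):
    """Map each source cell to the coupling-weighted average of target cells.

    coupling: n_source x n_target matrix of non-negative transport weights.
    target:   n_target x d matrix of target-domain coordinates.
    Returns an n_source x d matrix of projected source cells.
    """
    dims = len(target[0])
    projected = []
    for row in coupling:
        mass = sum(row)
        if mass == 0:
            raise ValueError("a source cell has no transported mass")
        # Weighted mean of the target coordinates, normalized by the row mass.
        projected.append(
            [sum(w * y[d] for w, y in zip(row, target)) / mass for d in range(dims)]
        )
    return projected


# Hypothetical coupling between 2 source cells and 2 target cells in 2 dimensions.
toy_coupling = [[0.5, 0.0], [0.25, 0.25]]
toy_target = [[0.0, 0.0], [2.0, 4.0]]
print(barycentric_projection(toy_coupling, toy_target))
```

After projection both data sets live in the target domain's feature space, so downstream steps such as nearest-neighbor label transfer can treat them as one data set.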
15.
NPJ Digit Med ; 5(1): 5, 2022 Jan 14.
Article in English | MEDLINE | ID: mdl-35031687

ABSTRACT

While COVID-19 diagnosis and prognosis artificial intelligence models exist, very few can be implemented for practical use given their high risk of bias. We aimed to develop a diagnosis model that addresses notable shortcomings of prior studies, integrating it into a fully automated triage pipeline that examines chest radiographs for the presence, severity, and progression of COVID-19 pneumonia. Scans were collected using the DICOM Image Analysis and Archive, a system that communicates with a hospital's image repository. The authors collected over 6,500 non-public chest X-rays comprising diverse COVID-19 severities, along with radiology reports and RT-PCR data. The authors provisioned one internally held-out and two external test sets to assess model generalizability and compare performance to traditional radiologist interpretation. The pipeline was evaluated on a prospective cohort of 80 radiographs, reporting a 95% diagnostic accuracy. The study mitigates bias in AI model development and demonstrates the value of an end-to-end COVID-19 triage platform.

16.
Genome Biol ; 22(1): 279, 2021 09 27.
Article in English | MEDLINE | ID: mdl-34579774

ABSTRACT

BACKGROUND: Mammalian development is associated with extensive changes in gene expression, chromatin accessibility, and nuclear structure. Here, we follow such changes associated with mouse embryonic stem cell differentiation and X inactivation by integrating, for the first time, allele-specific data from these three modalities obtained by high-throughput single-cell RNA-seq, ATAC-seq, and Hi-C. RESULTS: Allele-specific contact decay profiles obtained by single-cell Hi-C clearly show that the inactive X chromosome has a unique profile in differentiated cells that have undergone X inactivation. Loss of this inactive X-specific structure at mitosis is followed by its reappearance during the cell cycle, suggesting a "bookmark" mechanism. Differentiation of embryonic stem cells to follow the onset of X inactivation is associated with changes in contact decay profiles that occur in parallel on both the X chromosomes and autosomes. Single-cell RNA-seq and ATAC-seq show evidence of a delay in female versus male cells, due to the presence of two active X chromosomes at early stages of differentiation. The onset of the inactive X-specific structure in single cells occurs later than gene silencing, consistent with the idea that chromatin compaction is a late event of X inactivation. Single-cell Hi-C highlights evidence of discrete changes in nuclear structure characterized by the acquisition of very long-range contacts throughout the nucleus. Novel computational approaches allow for the effective alignment of single-cell gene expression, chromatin accessibility, and 3D chromosome structure. CONCLUSIONS: Based on trajectory analyses, three distinct nuclear structure states are detected reflecting discrete and profound simultaneous changes not only to the structure of the X chromosomes, but also to that of autosomes during differentiation. 
Our study reveals that long-range structural changes to chromosomes appear as discrete events, unlike progressive changes in gene expression and chromatin accessibility.


Subjects
Cell Differentiation/genetics, Gene Expression, Mouse Embryonic Stem Cells/metabolism, X Chromosome Inactivation, Alleles, Animals, Cell Cycle, Cell Line, Cell Nucleus/genetics, Female, Genome, Male, Mice, RNA-Seq, Single-Cell Analysis, X Chromosome/chemistry
17.
Nucleic Acids Res ; 49(W1): W641-W653, 2021 07 02.
Article in English | MEDLINE | ID: mdl-34125906

ABSTRACT

Uncovering how transcription factors regulate their targets at DNA, RNA and protein levels over time is critical to define gene regulatory networks (GRNs) and assign mechanisms in normal and diseased states. RNA-seq is a standard method measuring gene regulation using an established set of analysis stages. However, none of the currently available pipeline methods for interpreting ordered genomic data (in time or space) use time-series models to assign cause and effect relationships within GRNs, are adaptive to diverse experimental designs, or enable user interpretation through a web-based platform. Furthermore, methods integrating ordered RNA-seq data with protein-DNA binding data to distinguish direct from indirect interactions are urgently needed. We present TIMEOR (Trajectory Inference and Mechanism Exploration with Omics data in R), the first web-based and adaptive time-series multi-omics pipeline method which infers the relationship between gene regulatory events across time. TIMEOR addresses the critical need for methods to determine causal regulatory mechanism networks by leveraging time-series RNA-seq, motif analysis, protein-DNA binding data, and protein-protein interaction networks. TIMEOR's user-catered approach helps non-coders generate new hypotheses and validate known mechanisms. We used TIMEOR to identify a novel link between insulin stimulation and the circadian rhythm cycle. TIMEOR is available at https://github.com/ashleymaeconard/TIMEOR.git and http://timeor.brown.edu.


Subjects
Gene Expression Regulation, Gene Regulatory Networks, RNA-Seq, Software, Circadian Rhythm/genetics, Genomics, Humans, Insulin/physiology, Internet, Protein Interaction Mapping, Transcription Factors/metabolism
18.
Curr Opin Chem Biol ; 65: 35-41, 2021 12.
Article in English | MEDLINE | ID: mdl-34107341

ABSTRACT

A recent deluge of publicly available multi-omics data has fueled the development of machine learning methods aimed at investigating important questions in genomics. Although the motivations for these methods vary, a task that is commonly adopted is that of profile prediction, where predictions are made for one or more forms of biochemical activity along the genome, for example, histone modification, chromatin accessibility, or protein binding. In this review, we give an overview of the research works performing profile prediction, define two broad categories of profile prediction tasks, and discuss the types of scientific questions that can be answered in each.


Subjects
Genomics, Machine Learning, Chromatin/genetics, Genome, Protein Binding
19.
Bioinformatics ; 36(Suppl_2): i857-i865, 2020 12 30.
Article in English | MEDLINE | ID: mdl-33381828

ABSTRACT

MOTIVATION: Gapped k-mer kernels with support vector machines (gkm-SVMs) have achieved strong predictive performance on regulatory DNA sequences on modestly sized training sets. However, existing gkm-SVM algorithms suffer from slow kernel computation time, as they depend exponentially on the sub-sequence feature length, number of mismatch positions, and the task's alphabet size. RESULTS: In this work, we introduce a fast and scalable algorithm for calculating gapped k-mer string kernels. Our method, named FastSK, uses a simplified kernel formulation that decomposes the kernel calculation into a set of independent counting operations over the possible mismatch positions. This simplified decomposition allows us to devise a fast Monte Carlo approximation that rapidly converges. FastSK can scale to much greater feature lengths, allows us to consider more mismatches, and is performant on a variety of sequence analysis tasks. On multiple DNA transcription factor binding site prediction datasets, FastSK consistently matches or outperforms the state-of-the-art gkmSVM-2.0 algorithms in area under the ROC curve, while achieving average speedups in kernel computation of ∼100× and speedups of ∼800× for large feature lengths. We further show that FastSK outperforms character-level recurrent and convolutional neural networks while achieving low variance. We then extend FastSK to 7 English-language medical named entity recognition datasets and 10 protein remote homology detection datasets. FastSK consistently matches or outperforms these baselines. AVAILABILITY AND IMPLEMENTATION: Our algorithm is available as a Python package and as C++ source code at https://github.com/QData/FastSK. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Protein Sequence Analysis, Support Vector Machine, Algorithms, Proteins, Software
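
To make the quantity in entry 19 concrete, the sketch below computes a gapped k-mer kernel value by brute force. It is not the FastSK algorithm, which the abstract describes as a decomposition into counting operations with a Monte Carlo approximation. It relies on the standard observation that two length-g substrings that agree at s positions share C(s, k) gapped k-mers; the function name and example sequences are illustrative assumptions.

```python
from math import comb


def gapped_kmer_kernel(x, y, g, k):
    """Naive gapped k-mer kernel between sequences x and y.

    For every pair of length-g substrings, count the ways to choose k
    positions at which the two substrings agree, and sum over all pairs.
    Runs in O(len(x) * len(y) * g) time, so it only suits short inputs.
    """
    def gmers(s):
        return [s[i:i + g] for i in range(len(s) - g + 1)]

    total = 0
    for a in gmers(x):
        for b in gmers(y):
            matches = sum(1 for ca, cb in zip(a, b) if ca == cb)
            # comb returns 0 when fewer than k positions match.
            total += comb(matches, k)
    return total


# Hypothetical DNA fragments with g = 3 and k = 2.
print(gapped_kmer_kernel("ACGT", "ACCT", g=3, k=2))
```

The quadratic pairwise loop is exactly the cost that motivates faster algorithms; the kernel matrix built from such values is what a support vector machine would consume.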
20.
Genome Biol ; 21(1): 282, 2020 11 19.
Article in English | MEDLINE | ID: mdl-33213499

ABSTRACT

Machine learning models that predict genomic activity are most useful when they make accurate predictions across cell types. Here, we show that when the training and test sets contain the same genomic loci, the resulting model may falsely appear to perform well by effectively memorizing the average activity associated with each locus across the training cell types. We demonstrate this phenomenon in the context of predicting gene expression and chromatin domain boundaries, and we suggest methods to diagnose and avoid the pitfall. We anticipate that, as more data becomes available, future projects will increasingly risk suffering from this issue.


Subjects
Epigenomics, Genomics/methods, Machine Learning, Chromatin, Gene Expression, Humans
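
The pitfall described in entry 20 suggests a simple check, sketched here under stated assumptions: build a baseline that predicts, for each locus, the average activity across the training cell types, and compare any model against it on a held-out cell type. The abstract does not spell out its diagnostics, so this baseline comparison, the function names, and the toy values are illustrative rather than the authors' procedure.

```python
def locus_average_baseline(train):
    """Per-locus mean activity across training cell types.

    train: dict mapping cell type -> list of activity values,
           one per locus, with the same locus order in every list.
    """
    cell_types = list(train)
    n_loci = len(train[cell_types[0]])
    return [
        sum(train[c][i] for c in cell_types) / len(cell_types)
        for i in range(n_loci)
    ]


def mean_squared_error(pred, truth):
    """Average squared difference between predictions and observations."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)


# Hypothetical activity at three loci in two training cell types.
training = {"cell_type_A": [1.0, 0.0, 4.0], "cell_type_B": [3.0, 0.0, 2.0]}
held_out = [2.0, 1.0, 3.0]  # same loci measured in an unseen cell type

baseline = locus_average_baseline(training)
print(baseline, mean_squared_error(baseline, held_out))
```

If a trained model does not clearly beat this baseline on the held-out cell type, its apparent accuracy may reflect memorized per-locus averages rather than cell-type-specific signal.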