ABSTRACT
Background: Substance use disorder (SUD) is a heterogeneous disorder. Adapting machine learning algorithms to parse intrapersonal and interpersonal heterogeneity in meaningful ways may accelerate the discovery and implementation of clinically actionable interventions in SUD research. Objectives: Inspired by a study of heavy drinkers that collected daily drinking and substance use data (ABQ DrinQ), we develop tools to estimate subject-specific risk trajectories of heavy drinking; estimate and perform inference on patient characteristics and time-varying covariates; and present results in easy-to-use Jupyter notebooks. Methods: We recast support vector machines (SVMs) as a Bayesian model and extend it to handle mixed effects. We then apply these methods to ABQ DrinQ to model alcohol use patterns. ABQ DrinQ consists of 190 heavy drinkers (44% female) with 109,580 daily observations. Results: We identified male gender (point estimate; 95% credible interval: -0.25; -0.29, -0.21), older age (-0.03; -0.03, -0.03), and time-varying use of nicotine (1.68; 1.62, 1.73), cannabis (0.05; 0.03, 0.07), and other drugs (1.16; 1.01, 1.35) as statistically significant factors in heavy drinking behavior. By adopting random effects to capture subject-specific longitudinal trajectories, the algorithm outperforms traditional SVM (classifying 84% of heavy drinking days correctly versus 73%). Conclusions: We developed a mixed effects variant of SVM and compared it to the traditional formulation, with an eye toward elucidating the importance of incorporating random effects to account for underlying heterogeneity in SUD data. These tools and examples are packaged into a repository for researchers to explore. Understanding patterns and risk of substance use could inform the development of individualized interventions.
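The traditional SVM baseline that the abstract's Bayesian mixed-effects variant is compared against minimizes a regularized hinge loss; the Bayesian recasting treats that hinge term as a negative log pseudo-likelihood. A minimal pure-Python sketch of the classical objective follows, with toy illustrative data (not from ABQ DrinQ; the weight vector and features are hypothetical):

```python
# Classical soft-margin SVM objective: 0.5*||w||^2 + C * sum of hinge losses.
# The Bayesian variant described above interprets the hinge term as a
# negative log pseudo-likelihood; numbers here are purely illustrative.

def hinge_loss(w, b, X, y):
    """Mean hinge loss max(0, 1 - y*(w.x + b)) over the sample (y in {-1,+1})."""
    total = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        total += max(0.0, 1.0 - margin)
    return total / len(X)

def svm_objective(w, b, X, y, C=1.0):
    """Regularized objective: 0.5*||w||^2 + C * n * mean hinge loss."""
    reg = 0.5 * sum(wj * wj for wj in w)
    return reg + C * len(X) * hinge_loss(w, b, X, y)

# Toy data: two hypothetical features per day, label +1 = heavy drinking day.
X = [(1.0, 0.0), (2.0, 1.0), (-1.0, 0.0), (-2.0, 1.0)]
y = [1, 1, -1, -1]
print(svm_objective([1.0, 0.0], 0.0, X, y))
```

A mixed-effects extension would add a subject-specific random intercept to the margin term, which is what lets the model capture per-drinker trajectories.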
Subjects
Substance-Related Disorders , Support Vector Machine , Bayes Theorem , Female , Humans , Male , Substance-Related Disorders/epidemiology
ABSTRACT
The growing attention toward the benefits of single-cell RNA sequencing (scRNA-seq) is leading to a myriad of computational packages for the analysis of different aspects of scRNA-seq data. For researchers without advanced programming skills, it is very challenging to combine several packages in order to perform the desired analysis in a simple and reproducible way. Here we present DIscBIO, an open-source, multi-algorithmic pipeline for easy, efficient and reproducible analysis of cellular sub-populations at the transcriptomic level. The pipeline integrates multiple scRNA-seq packages and allows biomarker discovery with decision trees and gene enrichment analysis in a network context, using single-cell sequencing read counts through clustering and differential analysis. DIscBIO is freely available as an R package. It can be run either in command-line mode or through a user-friendly computational pipeline using Jupyter notebooks. We showcase all pipeline features using two scRNA-seq datasets. The first dataset consists of circulating tumor cells from patients with breast cancer. The second is a cell cycle regulation dataset in myxoid liposarcoma. All analyses are available as notebooks that integrate, in a sequential narrative, R code with explanatory text, output data, and images. R users can use the notebooks to understand the different steps of the pipeline, and the notebooks will guide them in exploring their own scRNA-seq data. We also provide a cloud version using Binder that allows execution of the pipeline without the need to download R, Jupyter, or any of the packages used by the pipeline. The cloud version can serve as a tutorial for training purposes, especially for those who are not R users or have limited programming skills. However, in order to do meaningful scRNA-seq analyses, all users will need to understand the implemented methods and their possible options and limitations.
Subjects
Biomarkers/analysis , Computational Biology/methods , RNA-Seq/methods , Single-Cell Analysis/methods , Animals , Breast Neoplasms/blood , Breast Neoplasms/diagnosis , Breast Neoplasms/genetics , Cell Cycle/genetics , Datasets as Topic , Female , Gene Regulatory Networks , High-Throughput Nucleotide Sequencing , Humans , Liposarcoma, Myxoid/diagnosis , Liposarcoma, Myxoid/genetics , Mice , Neoplastic Cells, Circulating/pathology , Software , Zebrafish
ABSTRACT
Jupyter Notebooks have been widely adopted by many different communities, both in science and industry. They support the creation of literate programming documents that combine code, text, and execution results with visualizations and other rich media. The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of notebooks. At the same time, there has been growing criticism that the way in which notebooks are being used leads to unexpected behavior, encourages poor coding practices, and makes it hard to reproduce their results. To better understand good and bad practices used in the development of real notebooks, in prior work we studied 1.4 million notebooks from GitHub. We presented a detailed analysis of the characteristics that impact reproducibility, proposed best practices that can improve it, and discussed open challenges that require further research and development. In this paper, we extend the analysis in four different ways to validate the hypotheses uncovered in our original study. First, we separated a group of popular notebooks to check whether notebooks that get more attention exhibit higher quality and better reproducibility. Second, we sampled notebooks from the full dataset for an in-depth qualitative analysis of what constitutes the dataset and which features the notebooks have. Third, we conducted a more detailed analysis by isolating library dependencies and testing different execution orders, and we report how these factors impact reproducibility rates. Finally, we mined association rules from the notebooks and discuss the patterns we discovered, which provide additional insights into notebook reproducibility. Based on our findings and the best practices we proposed, we designed Julynter, a JupyterLab extension that identifies potential issues in notebooks and suggests modifications that improve their reproducibility. We evaluated Julynter in a remote user experiment aimed at assessing its recommendations and usability.
ABSTRACT
Knowledge regarding cancer stem cell (CSC) morphology is limited, and more extensive studies are therefore required. Image recognition technologies using artificial intelligence (AI) require no previous expertise in image annotation. Herein, we describe the construction of AI models that recognize CSC morphology in cultures and tumor tissues. Visualizing the AI deep learning process provides insight into otherwise unrecognized structures in an image.
Subjects
Deep Learning , Neoplasms , Humans , Artificial Intelligence , Neoplastic Stem Cells , Technology
ABSTRACT
The COVID-19 pandemic has forced the Bioinformatics course to switch from on-site teaching to remote learning. This shift has prompted a change in teaching methods and laboratory activities. Students need to have a basic understanding of DNA sequences and how to analyze them using custom scripts. To facilitate learning, we have modified the course to use Jupyter Notebook, which offers an alternative approach to writing custom scripts for basic DNA sequence analysis. This approach allows students to acquire the necessary skills while working remotely. It is a versatile and user-friendly platform that can be used to combine explanations, code, and results in a single document. This feature enables students to interact with the code and results, making the learning process more engaging and effective. Jupyter Notebook provides a hybrid approach to learning basic Python scripting and genomics that is effective for remote teaching and learning during the COVID-19 pandemic.
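The kind of basic DNA sequence analysis script that students write in such a notebook can be sketched as follows; the functions and the example sequence are illustrative, not taken from the course materials:

```python
def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A<->T, G<->C)."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[base] for base in reversed(seq.upper()))

seq = "ATGCGC"
print(gc_content(seq))           # fraction of G/C bases
print(reverse_complement(seq))
```

In a notebook, each function and its output sit next to the explanatory text, which is exactly the interaction between code and results that the abstract describes.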
ABSTRACT
The use of digital tools in pharmaceutical manufacturing has gained traction over the past two decades. Whether supporting regulatory filings or attempting to modernize manufacturing processes to adopt new and quickly evolving Industry 4.0 standards, engineers entering the workforce must exhibit proficiency in modeling, simulation, optimization, data processing, and other digital analysis techniques. In this work, a course that addresses digital tools in pharmaceutical manufacturing for chemical engineers was adjusted to utilize a new tool, PharmaPy, instead of traditional chemical engineering simulation tools. Jupyter Notebook was utilized as an instructional and interactive environment to teach students to use PharmaPy, a new, open-source pharmaceutical manufacturing process simulator. Students were then surveyed to see if PharmaPy was able to meet the learning objectives of the course. During the semester, PharmaPy's model library was used to simulate both individual unit operations as well as multiunit pharmaceutical processes. Through the initial survey results, students indicated that: (i) through Jupyter Notebook, learning Python and PharmaPy was approachable from varied coding experience backgrounds and (ii) PharmaPy strengthened their understanding of pharmaceutical manufacturing through active pharmaceutical ingredient process design and development.
ABSTRACT
Complex signaling and transcriptional programs control the development and physiology of specialized cell types. Genetic perturbations in these programs cause human cancers to arise from a diverse set of specialized cell types and developmental states. Understanding these complex systems and their potential to drive cancer is critical for the development of immunotherapies and druggable targets. Pioneering single-cell multi-omics technologies now couple the analysis of transcriptional states with the expression of cell-surface receptors. This chapter describes SPaRTAN (Single-cell Proteomic and RNA-based Transcription factor Activity Network), a computational framework that links transcription factors with cell-surface protein expression. SPaRTAN uses CITE-seq (cellular indexing of transcriptomes and epitopes by sequencing) data and cis-regulatory sites to model the effect of interactions between transcription factors and cell-surface receptors on gene expression. We demonstrate the SPaRTAN pipeline using CITE-seq data from peripheral blood mononuclear cells.
Subjects
Proteome , Transcriptome , Humans , Transcription Factors/genetics , Leukocytes, Mononuclear , Proteomics , Single-Cell Analysis
ABSTRACT
Machine learning (ML) models require an extensive, user-driven selection of molecular descriptors in order to learn from chemical structures to predict actives and inactives with high reliability. In addition, privacy concerns often restrict access to sufficient data, leading to models with a narrow chemical space. We therefore propose a framework of re-trainable models that can be transferred from one local instance to another and that allow a less extensive descriptor selection. The models are shared via a Jupyter Notebook, allowing a broader chemical space to be evaluated and implemented while keeping most of the tunable parameters pre-defined. This enables the models to be updated in a decentralized, facile, and fast manner. Herein, the method was evaluated with six transporter datasets (BCRP, BSEP, OATP1B1, OATP1B3, MRP3, P-gp), which revealed the general applicability of this approach.
ABSTRACT
A programming workshop has been developed for biochemists and molecular biologists to introduce them to the power and flexibility of solving problems with Python. The workshop is designed to move users beyond a "plug-and-play" approach that is based on spreadsheets and web applications in their teaching and research to writing scripts to parse large collections of data and to perform dynamic calculations. The live-coding workshop is designed to introduce specific coding skills, as well as provide insight into the broader array of open-access resources and libraries that are available for scientific computation.
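The move from spreadsheet-style "plug-and-play" work to scripted, dynamic calculation that the workshop targets can be sketched with the standard library alone; the plate-reader data and column names below are hypothetical:

```python
import csv
from io import StringIO

# Illustrative absorbance readings, as they might be exported from a
# plate reader; in practice this would be read from a file with open().
raw = """sample,absorbance
blank,0.05
s1,0.45
s2,0.85
"""

rows = list(csv.DictReader(StringIO(raw)))

# Pull out the blank reading, then blank-correct every sample in one pass --
# the kind of dynamic calculation a spreadsheet handles cell by cell.
blank = next(float(r["absorbance"]) for r in rows if r["sample"] == "blank")
corrected = {r["sample"]: float(r["absorbance"]) - blank
             for r in rows if r["sample"] != "blank"}
print(corrected)
```

The same script scales unchanged from three rows to thousands, which is the point the workshop makes about parsing large collections of data.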
Subjects
Molecular Biology , Software
ABSTRACT
Bragg coherent X-ray diffraction is a nondestructive method for probing material structure in three dimensions at the nanoscale, with unprecedented resolution in displacement and strain fields. This work presents Gwaihir, a user-friendly and open-source tool to process and analyze Bragg coherent X-ray diffraction data. It integrates the functionalities of the existing packages bcdi and PyNX in the same toolbox, creating a natural workflow and promoting data reproducibility. Its graphical interface, based on Jupyter Notebook widgets, combines an interactive approach for data analysis with a powerful environment designed to link large-scale facilities and scientists.
ABSTRACT
Objective: Diabetes is a chronic, potentially fatal disease that has affected millions of people all over the globe. Type 2 Diabetes Mellitus (T2DM) accounts for 90% of the affected population among all types of diabetes. Millions of T2DM patients remain undiagnosed due to lack of awareness and under-resourced healthcare systems. There is thus a dire need for a diagnostic and prognostic tool to help healthcare providers, clinicians, and practitioners with early prediction, so that they can recommend the lifestyle changes required to stop the progression of diabetes. The main objective of this research is to develop a framework based on machine learning techniques that uses only lifestyle indicators for prediction of T2DM. Moreover, the prediction model can be used without visits to clinical laboratories or hospital readmissions. Method: The proposed framework is presented and implemented based on machine learning paradigms using lifestyle indicators for better prediction of T2DM. The research involved experts such as diabetologists, endocrinologists, dieticians, and nutritionists in selecting 1,552 instances and 11 lifestyle attributes relevant to promoting health and managing complications of T2DM. The dataset was collected through surveys and Google Forms from different geographical regions. Results: Seven machine learning classifiers were employed: K-Nearest Neighbour (KNN), Linear Regression (LR), Support Vector Machine (SVM), Naive Bayes (NB), Decision Tree (DT), Random Forest (RF), and Gradient Boosting (GB). The Gradient Boosting classifier performed best, with an accuracy of 97.24% on training data and 96.90% on test data, followed by RF, DT, NB, SVM, LR, and KNN at 95.36%, 92.52%, 90.72%, 90.20%, 90.20%, and 77.06%, respectively. In terms of precision, RF achieved the highest score (0.980) and KNN the lowest (0.793). For recall, GB achieved the highest rate (0.975) and KNN the lowest (0.774). GB also performed best in terms of F1-score. According to the ROC curves, GB and NB had a larger area under the curve than the other classifiers. Conclusion: This research developed a realistic health management system for T2DM based on machine learning techniques that uses only lifestyle data for prediction. To extend the current study, these models should be applied to different, larger, and real-time datasets that share common features with T2DM data, to establish the efficacy of the proposed system.
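The accuracy, precision, recall, and F1 figures reported above all follow from a confusion matrix; a minimal sketch of those definitions, using illustrative counts rather than numbers from the study:

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion-matrix counts:
    tp/fp/fn/tn = true positives, false positives, false negatives, true negatives."""
    precision = tp / (tp + fp)                 # of predicted positives, how many were right
    recall = tp / (tp + fn)                    # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Hypothetical counts for a binary T2DM screen on 100 test subjects.
print(classification_metrics(tp=80, fp=2, fn=2, tn=16))
```

Note that precision and recall are fractions in [0, 1], not percentages, which is why the study's values such as 0.980 and 0.975 carry no percent sign.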
ABSTRACT
Metabolic engineering relies on modifying gene expression to regulate protein concentrations and reaction activities. Gene expression is controlled by the promoter sequence, and sequence libraries are used to scan expression activities and to identify correlations between sequence and activity. We introduce a computational workflow called Exp2Ipynb to analyze promoter libraries, maximizing information retrieval and enabling the design of promoters with a desired activity. We applied Exp2Ipynb to seven prokaryotic expression libraries to identify optimal experimental design principles. The workflow is open source, available as Jupyter Notebooks, and covers the steps to 1) generate a statistical overview of sequence and activity, 2) train machine-learning algorithms, such as random forests, gradient boosting trees, and support vector machines, for prediction and extraction of feature importance, 3) evaluate the performance of the estimator, and 4) design new sequences with a desired activity using numerical optimization. The workflow can perform regression or classification on multiple promoter libraries, across species or reporter proteins. The most accurate predictions in the sample libraries were achieved when the promoters in the library were recognized by a single sigma factor and a unique reporter system. The prediction confidence mostly depends on sample size and sequence diversity, and we present a relationship to estimate their respective effects. The workflow can be adapted to process sequence libraries from other expression-related problems and increase insight into the growing application of high-throughput experiments, providing support for efficient strain engineering.
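Before promoter sequences can be fed to estimators such as random forests, they must be turned into numeric features; a common choice, and plausibly what a workflow like this uses internally, is per-position one-hot encoding. A minimal sketch (the encoding scheme and example sequences are illustrative assumptions, not taken from Exp2Ipynb):

```python
def one_hot(seq, alphabet="ACGT"):
    """Flatten a promoter sequence into a binary feature vector:
    one indicator per (position, base) pair, so len(seq)*4 features."""
    vec = []
    for base in seq.upper():
        vec.extend(1 if base == a else 0 for a in alphabet)
    return vec

# Illustrative library: the canonical -10 and -35 hexamer elements.
library = ["TATAAT", "TTGACA"]
X = [one_hot(s) for s in library]
print(len(X[0]))   # 6 positions x 4 bases = 24 features per sequence
```

The resulting matrix `X`, paired with measured activities, is what a regressor or classifier would be trained on; feature importances then map back to (position, base) pairs.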
ABSTRACT
"Omics" technologies (genomics, transcriptomics, proteomics, metabolomics, etc.) have significantly improved our understanding of biological systems. They have become standard tools in biological research, for example, for identifying and unraveling transcriptional networks, building predictive models, and discovering candidate biomarkers. The rapid increase of omics data presents both a challenge and great potential when it comes to providing valuable insights into the underlying patterns of the investigated biological processes. The challenge is extracting, processing, integrating, and interpreting the corresponding datasets. The potential, on the other hand, arises from the generation of verifiable hypotheses to understand molecular mechanisms behind biological processes, for example, gene expression patterns. Exploratory data analysis techniques are used to get a first impression of the important characteristics of a dataset and to reveal its underlying structure. However, investigators are often faced with the difficulties of managing the high-dimensional nature of the data. In order to efficiently analyze biological data and to gain a deeper understanding of underlying biological mechanisms, it is essential to have robust and interactive data visualization tools.
Subjects
Genomics , Metabolomics , Data Visualization , Gene Regulatory Networks , Proteomics
ABSTRACT
Many software solutions are available for proteomics and glycomics studies, but none are ideal for the structural analysis of peptidoglycan (PG), the essential and major component of bacterial cell envelopes. PG comprises glycan chains and peptide stems, both containing unusual amino acids and sugars. This has forced the field to rely on manual analysis approaches, which are time-consuming, labour-intensive, and prone to error. The lack of automated tools has hampered the ability to perform high-throughput analyses and prevented the adoption of a standard methodology. Here, we describe a novel tool called PGFinder for the analysis of PG structure and demonstrate that it represents a powerful tool to quantify PG fragments and discover novel structural features. Our analysis workflow, which relies on open-access tools, is a breakthrough towards consistent and reproducible analysis of bacterial PG. It represents a significant advance towards peptidoglycomics as a full-fledged discipline.
Subjects
Bacteria/chemistry , Peptidoglycan/chemistry , Carbohydrate Conformation , Datasets as Topic , Glycomics , Mass Spectrometry/methods , Peptidoglycan/biosynthesis , Reproducibility of Results , Software
ABSTRACT
Data sharing and reuse are crucial to enhance scientific progress and maximize the return on investment in science. Although attitudes are increasingly favorable, data reuse remains difficult due to a lack of infrastructures, standards, and policies. The FAIR (findable, accessible, interoperable, reusable) principles provide recommendations to increase data reuse. Because of the broad interpretation of the FAIR principles, maturity indicators are necessary to determine the FAIRness of a dataset. In this work, we propose a reproducible computational workflow to assess data FAIRness in the life sciences. Our implementation follows the principles and guidelines recommended by the maturity indicator authoring group and integrates concepts from the literature. In addition, we propose a FAIR balloon plot to summarize and compare dataset FAIRness. We evaluated the feasibility of our method on three real use cases in which researchers looked for six datasets to answer their scientific questions. We retrieved information from repositories (ArrayExpress, Gene Expression Omnibus, eNanoMapper, caNanoLab, NanoCommons and ChEMBL), a registry of repositories, and a searchable resource (Google Dataset Search) via application programming interfaces (APIs) wherever possible. With our analysis, we found that the six datasets met the majority of the criteria defined by the maturity indicators, and we showed areas where improvements can easily be reached. We suggest that the use of standard schemas for metadata and the presence of specific attributes in registries of repositories could increase the FAIRness of datasets.
ABSTRACT
Computational analysis of biological data is becoming increasingly important, especially in this era of big data. It allows biological insights to be derived efficiently from given data, sometimes even counterintuitive ones that challenge existing knowledge. Among experimental researchers without prior exposure to computer programming, computational analysis of biological data has often been considered a task reserved for computational biologists. However, thanks to the increasing availability of user-friendly computational resources, experimental researchers can now easily access a scientific computing environment and the packages necessary for data analysis. Here, we describe the process of accessing Jupyter Notebook, the most popular Python coding environment, to conduct computational biology; Python is currently a mainstream programming language for biology and biotechnology. In particular, Anaconda and Google Colaboratory are introduced as two representative options to easily launch Jupyter Notebook. Finally, the Python package COBRApy is demonstrated as an example to simulate 1) the specific growth rate of Escherichia coli, as well as the compounds consumed or generated in a minimal medium with glucose as the sole carbon source, and 2) the theoretical production yield of succinic acid, an industrially important chemical, using E. coli. This protocol should serve as a guide for further extended computational analyses of biological data by experimental researchers without a computational background.
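The growth-rate and yield simulations described above are flux balance analysis (FBA) problems, which COBRApy solves as linear programs: maximize an objective flux subject to steady-state mass balance (S·v = 0) and flux bounds. A minimal sketch of that underlying linear program with SciPy, using a hypothetical two-reaction toy network rather than the genome-scale E. coli model:

```python
import numpy as np
from scipy.optimize import linprog

# Toy network with one metabolite A and two reactions:
#   v1: (uptake)  -> A        v2: A -> (growth/biomass)
# Steady state requires S @ v = 0, with S = [[+1, -1]] for A.
S = np.array([[1.0, -1.0]])
bounds = [(0.0, 10.0),   # uptake capped at an assumed 10 units
          (0.0, None)]   # growth flux unbounded above

# FBA maximizes the growth flux v2; linprog minimizes, so negate it.
res = linprog(c=[0.0, -1.0], A_eq=S, b_eq=[0.0], bounds=bounds)
print(res.x)   # optimal fluxes [v1, v2]
```

In COBRApy the same structure is built from `Model`, `Reaction`, and `Metabolite` objects and solved with `model.optimize()`; here the optimum is forced by the uptake cap, just as a glucose uptake bound caps the predicted growth rate in the E. coli simulation.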
Subjects
Metabolomics , Models, Biological , Programming Languages , Escherichia coli/metabolism
ABSTRACT
BioJupies is a web application that enables the automated creation, storage, and deployment of Jupyter Notebooks containing RNA-seq data analyses. Through an intuitive interface, novice users can rapidly generate tailored reports to analyze and visualize their own raw sequencing files, gene expression tables, or fetch data from >9,000 published studies containing >300,000 preprocessed RNA-seq samples. Generated notebooks have the executable code of the entire pipeline, rich narrative text, interactive data visualizations, differential expression, and enrichment analyses. The notebooks are permanently stored in the cloud and made available online through a persistent URL. The notebooks are downloadable, customizable, and can run within a Docker container. By providing an intuitive user interface for notebook generation for RNA-seq data analysis, starting from the raw reads all the way to a complete interactive and reproducible report, BioJupies is a useful resource for experimental and computational biologists. BioJupies is freely available as a web-based application from http://biojupies.cloud.
Subjects
Computational Biology/methods , Sequence Analysis, RNA/methods , Software , Animals , High-Throughput Nucleotide Sequencing/methods , Humans , Mice
ABSTRACT
Single-cell RNA sequencing (scRNA-seq) has emerged as a popular method to profile gene expression at the resolution of individual cells. While methods and software have been developed specifically to analyze scRNA-seq data, they are most accessible to users who program. We have created a scRNA-seq clustering analysis GenePattern Notebook that provides an interactive, easy-to-use interface for analysis and exploration of scRNA-seq data, without the need to write or view any code. The notebook provides a standard scRNA-seq analysis workflow for pre-processing data, identification of sub-populations of cells by clustering, and exploration of biomarkers to characterize heterogeneous cell populations and delineate cell types.
Subjects
Computational Biology/methods , Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , Software , Transcriptome , Cluster Analysis , Humans
ABSTRACT
Illumina Infinium DNA methylation arrays are a cost-effective technology to measure DNA methylation at CpG sites genome-wide and across cohorts of normal and cancer samples. While copy number alterations are commonly inferred from array-CGH, SNP arrays, or whole-genome DNA sequencing, Illumina Infinium DNA methylation arrays have been shown to detect copy number alterations at comparable sensitivity. Here we present an accessible, interactive GenePattern notebook for the analysis of copy number variation using Illumina Infinium DNA methylation arrays. The notebook provides a graphical user interface to a workflow using the R/Bioconductor packages minfi and conumee. The environment allows analysis to be performed without the installation of the R software environment, the packages and dependencies, and without the need to write or manipulate code.
Subjects
DNA Copy Number Variations , DNA Methylation , Software , CpG Islands , Humans , Neoplasms/genetics , Oligonucleotide Array Sequence Analysis
ABSTRACT
Recent advances in big data analytics provide more flexible, efficient, and open tools for researchers to gain insight from healthcare data. However, many tools require researchers to develop programs in languages such as Python or R, a skill set that many researchers in the healthcare data analytics area have not acquired. To make data science more approachable, we explored existing tools and developed a practice that helps data scientists convert existing analytics pipelines into user-friendly analytics apps with rich interactions and real-time analysis features. With this practice, data scientists can develop customized analytics pipelines as apps in Jupyter Notebook and disseminate them to other researchers easily, and researchers can benefit from the shared notebooks to perform analysis tasks or reproduce research results much more easily.