Results 1 - 20 of 51

1.
Brief Bioinform ; 25(Supplement_1)2024 Jul 23.
Article in English | MEDLINE | ID: mdl-39041916

ABSTRACT

This manuscript describes the development of a resource module that is part of a learning platform named 'NIGMS Sandbox for Cloud-based Learning' (https://github.com/NIGMS/NIGMS-Sandbox). The module delivers learning materials on Cloud-based Consensus Pathway Analysis in an interactive format that uses appropriate cloud resources for data access and analyses. Pathway analysis is important because it allows us to gain insights into biological mechanisms underlying conditions. However, the availability of many pathway analysis methods, the requirement of coding skills, and the focus of current tools on only a few species all make it very difficult for biomedical researchers to self-learn and perform pathway analysis efficiently. Furthermore, there is a lack of tools that allow researchers to compare analysis results obtained from different experiments and different analysis methods to find consensus results. To address these challenges, we have designed a cloud-based, self-learning module that provides consensus results among established, state-of-the-art pathway analysis techniques, giving students and researchers the necessary training and example materials. The training module consists of five Jupyter Notebooks that provide complete tutorials for the following tasks: (i) process expression data; (ii) perform differential analysis, then visualize and compare the results obtained from four differential analysis methods (limma, t-test, edgeR, DESeq2); (iii) process three pathway databases (GO, KEGG and Reactome); (iv) perform pathway analysis using eight methods (ORA, CAMERA, KS test, Wilcoxon test, FGSEA, GSA, SAFE and PADOG); and (v) combine the results of multiple analyses. We also provide examples, source code, explanations and instructional videos for trainees to complete each Jupyter Notebook. The module supports analysis for many model (e.g. human, mouse, fruit fly, zebrafish) and non-model species.
The module is publicly available at https://github.com/NIGMS/Consensus-Pathway-Analysis-in-the-Cloud. The overall genesis of the Sandbox is described in the editorial NIGMS Sandbox [1] at the beginning of this Supplement.
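As a dependency-free illustration of one of the eight methods named above, over-representation analysis (ORA) reduces to a hypergeometric test on the overlap between a pathway's gene set and a list of differentially expressed genes. The counts below are made up for illustration and are not taken from the module:

```python
from math import comb

def ora_pvalue(n_universe, n_pathway, n_deg, n_overlap):
    """Hypergeometric upper-tail p-value for over-representation analysis (ORA).

    n_universe: genes measured, n_pathway: genes in the pathway,
    n_deg: differentially expressed genes (DEGs), n_overlap: DEGs in the pathway.
    """
    p = 0.0
    # Sum P(X = k) for all overlaps at least as extreme as the observed one.
    for k in range(n_overlap, min(n_pathway, n_deg) + 1):
        p += comb(n_pathway, k) * comb(n_universe - n_pathway, n_deg - k) / comb(n_universe, n_deg)
    return p

# A pathway of 40 genes, 100 DEGs out of 10,000 measured genes, 8 overlapping.
p = ora_pvalue(10_000, 40, 100, 8)
```

An expected overlap by chance would be only 40 * 100 / 10,000 = 0.4 genes, so observing 8 yields a very small p-value; consensus approaches such as the module's combine p-values like this one across methods.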


Subject(s)
Cloud Computing, Software, Humans, Computational Biology/methods, Computational Biology/education, Animals, Gene Ontology
2.
J Synchrotron Radiat ; 31(Pt 4): 979-986, 2024 Jul 01.
Article in English | MEDLINE | ID: mdl-38920267

ABSTRACT

The management and processing of synchrotron and neutron computed tomography data can be a complex, labor-intensive and unstructured process. Users devote substantial time both to manually processing their data (i.e. organizing data/metadata, applying image filters etc.) and to waiting for iterative alignment and reconstruction algorithms to finish. In this work, we present a solution to these problems: TomoPyUI, a user interface for the well-known tomography data processing package TomoPy. This highly visual Python software package guides the user through the tomography processing pipeline, from data import through preprocessing and alignment to final 3D volume reconstruction. TomoPyUI's systematic storage of intermediate data and metadata improves organization, and its inspection and manipulation tools (built within the application) help to avoid interrupted workflows. Notably, TomoPyUI operates entirely within a Jupyter environment. Herein, we provide a summary of these key features of TomoPyUI, along with an overview of the tomography processing pipeline, a discussion of the landscape of existing tomography processing software and the purpose of TomoPyUI, and a demonstration of its capabilities for real tomography data collected at SSRL beamline 6-2c.

3.
Empir Softw Eng ; 28(1): 7, 2023.
Article in English | MEDLINE | ID: mdl-36420321

ABSTRACT

Despite the ubiquity of data science, we are far from rigorously understanding how coding in data science is performed. Even though the scientific literature has hinted at the iterative and explorative nature of data science coding, we need further empirical evidence to understand this practice and its workflows in detail. Such understanding is critical to recognise the needs of data scientists and, for instance, to inform tooling support. To obtain a deeper understanding of the iterative and explorative nature of data science coding, we analysed 470 Jupyter notebooks publicly available in GitHub repositories. We focused on the extent to which data scientists transition between different types of data science activities, or steps (such as data preprocessing and modelling), as well as the frequency and co-occurrence of such transitions. For our analysis, we developed a dataset with the help of five data science experts, who manually annotated the data science steps for each code cell within the aforementioned 470 notebooks. Using a first-order Markov chain model, we extracted the transitions and analysed the transition probabilities between the different steps. In addition to providing deeper insights into the implementation practices of data science coding, our results provide evidence that the steps in a data science workflow are indeed iterative and reveal specific patterns. We also evaluated the use of the annotated dataset to train machine-learning classifiers to predict the data science step(s) of a given code cell. We investigated the representativeness of the classification by comparing the workflow analysis applied to (a) the predicted data set and (b) the data set labelled by experts, finding an F1-score of about 71% for the 10-class data science step prediction problem.
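The transition analysis described above can be sketched in a few lines: given per-notebook sequences of step annotations, a first-order Markov model is just normalized bigram counts. The step labels and sequences below are hypothetical, not from the annotated dataset:

```python
from collections import Counter

def transition_probabilities(sequences):
    """Estimate first-order Markov transition probabilities between
    data-science steps from per-notebook sequences of cell annotations."""
    counts = Counter()   # (from_step, to_step) -> occurrences
    totals = Counter()   # from_step -> outgoing transitions
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
            totals[a] += 1
    return {pair: c / totals[pair[0]] for pair, c in counts.items()}

# Hypothetical annotated notebooks (labels are illustrative only).
notebooks = [
    ["load", "preprocess", "model", "evaluate"],
    ["load", "preprocess", "preprocess", "model"],
    ["load", "explore", "preprocess", "model"],
]
probs = transition_probabilities(notebooks)
```

In this toy corpus, for instance, two of the three transitions out of "load" go to "preprocess", so that transition gets probability 2/3; iterative loops such as "preprocess" back to "preprocess" show up directly as self-transitions.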

4.
Empir Softw Eng ; 28(3): 58, 2023.
Article in English | MEDLINE | ID: mdl-36968214

ABSTRACT

Data science is an exploratory and iterative process that often leads to complex and unstructured code. This code is usually poorly documented and, consequently, hard for a third party to understand. In this paper, we first collect empirical evidence for the non-linearity of data science code from real-world Jupyter notebooks, confirming the need for new approaches that aid in data science code interaction and comprehension. Second, we propose a visualisation method that elucidates implicit workflow information in data science code and assists data scientists in navigating the so-called garden of forking paths in non-linear code. Based on cell annotations, the visualisation also provides information such as the rationale behind each cell and the data science pipeline step it implements. We conducted a user experiment with data scientists to evaluate the proposed method, assessing the influence of (i) different workflow visualisations and (ii) cell annotations on code comprehension. Our results show that visualising the exploration helps users obtain an overview of the notebook, significantly improving code comprehension. Furthermore, our qualitative analysis provides more insights into the difficulties faced during data science code comprehension.

5.
Am J Drug Alcohol Abuse ; 48(4): 413-421, 2022 07 04.
Article in English | MEDLINE | ID: mdl-35196194

ABSTRACT

Background: Substance use disorder (SUD) is a heterogeneous disorder. Adapting machine learning algorithms to allow for the parsing of intrapersonal and interpersonal heterogeneity in meaningful ways may accelerate the discovery and implementation of clinically actionable interventions in SUD research. Objectives: Inspired by a study of heavy drinkers that collected daily drinking and substance use data (ABQ DrinQ), we develop tools to estimate subject-specific risk trajectories of heavy drinking; estimate and perform inference on patient characteristics and time-varying covariates; and present results in easy-to-use Jupyter notebooks. Methods: We recast support vector machines (SVMs) into a Bayesian model extended to handle mixed effects. We then apply these methods to ABQ DrinQ to model alcohol use patterns. ABQ DrinQ consists of 190 heavy drinkers (44% female) with 109,580 daily observations. Results: We identified male gender (point estimate; 95% credible interval: -0.25; -0.29, -0.21), older age (-0.03; -0.03, -0.03), and time-varying usage of nicotine (1.68; 1.62, 1.73), cannabis (0.05; 0.03, 0.07), and other drugs (1.16; 1.01, 1.35) as statistically significant factors in heavy drinking behavior. By adopting random effects to capture the subject-specific longitudinal trajectories, the algorithm outperforms traditional SVM (classifying 84% of heavy drinking days correctly versus 73%). Conclusions: We developed a mixed effects variant of SVM and compared it to the traditional formulation, with an eye toward elucidating the importance of incorporating random effects to account for underlying heterogeneity in SUD data. These tools and examples are packaged into a repository for researchers to explore. Understanding patterns and risk of substance use could be used for developing individualized interventions.


Subject(s)
Substance-Related Disorders, Support Vector Machine, Bayes Theorem, Female, Humans, Male, Substance-Related Disorders/epidemiology
6.
Int J Mol Sci ; 22(3)2021 Jan 30.
Article in English | MEDLINE | ID: mdl-33573289

ABSTRACT

The growing attention toward the benefits of single-cell RNA sequencing (scRNA-seq) is leading to a myriad of computational packages for the analysis of different aspects of scRNA-seq data. For researchers without advanced programming skills, it is very challenging to combine several packages in order to perform the desired analysis in a simple and reproducible way. Here we present DIscBIO, an open-source, multi-algorithmic pipeline for easy, efficient and reproducible analysis of cellular sub-populations at the transcriptomic level. The pipeline integrates multiple scRNA-seq packages and allows biomarker discovery with decision trees and gene enrichment analysis in a network context, using single-cell sequencing read counts through clustering and differential analysis. DIscBIO is freely available as an R package. It can be run either in command-line mode or through a user-friendly computational pipeline using Jupyter notebooks. We showcase all pipeline features using two scRNA-seq datasets. The first dataset consists of circulating tumor cells from patients with breast cancer. The second one is a cell cycle regulation dataset in myxoid liposarcoma. All analyses are available as notebooks that integrate, in a sequential narrative, R code with explanatory text, output data and images. R users can use the notebooks to understand the different steps of the pipeline, which will guide them in exploring their own scRNA-seq data. We also provide a cloud version using Binder that allows the execution of the pipeline without the need to download R, Jupyter or any of the packages used by the pipeline. The cloud version can serve as a tutorial for training purposes, especially for those who are not R users or have limited programming skills. However, in order to do meaningful scRNA-seq analyses, all users will need to understand the implemented methods and their possible options and limitations.


Subject(s)
Biomarkers/analysis, Computational Biology/methods, RNA-Seq/methods, Single-Cell Analysis/methods, Animals, Breast Neoplasms/blood, Breast Neoplasms/diagnosis, Breast Neoplasms/genetics, Cell Cycle/genetics, Datasets as Topic, Female, Gene Regulatory Networks, High-Throughput Nucleotide Sequencing, Humans, Liposarcoma, Myxoid/diagnosis, Liposarcoma, Myxoid/genetics, Mice, Neoplastic Cells, Circulating/pathology, Software, Zebrafish
7.
J Undergrad Neurosci Educ ; 20(1): A100-A110, 2021.
Article in English | MEDLINE | ID: mdl-35540944

ABSTRACT

We designed a final semester research project that allowed students to apply the electrophysiological concepts they learned in a lab course to propose and answer experimental questions without access to laboratory equipment. We created the activity based on lesson plans from Ashley Juavinett and the Allen Institute for Brain Science (AIBS) Allen SDK online examples. An interactive graphic interface was added for students to explore and easily quantify subtle neuronal voltage changes. Before starting the final project, students had experience with conventional extracellular and intracellular recording techniques to record and analyze extracellular action potential firing patterns and intracellular resting, action, and synaptic potentials. They demonstrated their understanding of neural signal transmission in required lab reports using data they gathered before the pandemic shutdown. After students left campus, they continued to analyze data and write lab reports focused on neuronal excitability in snail and fly neurons with data supplied by the instructors. For their final project, students were challenged to answer questions addressing neuronal excitability at both the single neuron and neuronal population level by analyzing and interpreting the open-access, patch clamp recording data from the Allen Cell Types Database using code we provided (Python/Jupyter Notebook). This virtual final semester project allowed students to ask real-world medical and scientific questions from "start to end". Through this project, students developed skills to navigate an extensive online database and gained experience with coding-based data analysis. They chose neuronal populations from human and mouse brains to compare passive properties and neuronal excitability between and within brain areas and across different species and disease states. 
Additionally, students learned to do simple manipulations of Python code, work remotely in teams, and polish their written scientific presentation skills. This activity could complement other remote learning options such as neuronal simulations. Few online sources offer such a wealth of neuroscience data that students can use for class assignments, and even for research and capstone projects. The activity extends the traditional material often taught in upper-level neuroscience courses, with or without a laboratory section, providing a deeper understanding of the range of excitability properties that neurons express.

8.
Empir Softw Eng ; 26(4): 65, 2021.
Article in English | MEDLINE | ID: mdl-33994841

ABSTRACT

Jupyter Notebooks have been widely adopted by many different communities, both in science and industry. They support the creation of literate programming documents that combine code, text, and execution results with visualizations and other rich media. The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of notebooks. At the same time, there has been growing criticism that the way in which notebooks are being used leads to unexpected behavior, encourages poor coding practices, and makes it hard to reproduce their results. To better understand good and bad practices used in the development of real notebooks, in prior work we studied 1.4 million notebooks from GitHub. We presented a detailed analysis of their characteristics that impact reproducibility, proposed best practices that can improve reproducibility, and discussed open challenges that require further research and development. In this paper, we extended the analysis in four different ways to validate the hypotheses uncovered in our original study. First, we separated a group of popular notebooks to check whether notebooks that get more attention have higher quality and better reproducibility. Second, we sampled notebooks from the full dataset for an in-depth qualitative analysis of what constitutes the dataset and which features the notebooks have. Third, we conducted a more detailed analysis by isolating library dependencies and testing different execution orders, and we report how these factors impact the reproducibility rates. Finally, we mined association rules from the notebooks and discuss the patterns we discovered, which provide additional insights into notebook reproducibility. Based on our findings and the best practices we proposed, we designed Julynter, a Jupyter Lab extension that identifies potential issues in notebooks and suggests modifications that improve their reproducibility. We evaluated Julynter in a remote user experiment with the goal of assessing its recommendations and usability.

9.
Metabolomics ; 16(2): 17, 2020 01 21.
Article in English | MEDLINE | ID: mdl-31965332

ABSTRACT

INTRODUCTION: Metabolomics data is commonly modelled multivariately using partial least squares discriminant analysis (PLS-DA). Its success is primarily due to ease of interpretation, through projection to latent structures, and transparent assessment of feature importance using regression coefficients and Variable Importance in Projection scores. In recent years several non-linear machine learning (ML) methods have grown in popularity, but with limited uptake, essentially due to convoluted optimisation and interpretation. Artificial neural networks (ANNs) are a non-linear projection-based ML method that shares a structural equivalence with PLS, and as such should be amenable to equivalent optimisation and interpretation methods. OBJECTIVES: We hypothesise that the standardised optimisation, visualisation, evaluation and statistical inference techniques commonly used by metabolomics researchers for PLS-DA can be migrated to a non-linear, single hidden layer, ANN. METHODS: We compared a standardised PLS workflow of optimisation, visualisation, evaluation and statistical inference with the proposed ANN workflow. Both workflows were implemented in the Python programming language. All code and results have been made publicly available as Jupyter notebooks on GitHub. RESULTS: The migration of the PLS workflow to a non-linear, single hidden layer, ANN was successful. There was a similarity in the significant metabolites determined using PLS model coefficients and the ANN Connection Weight Approach. CONCLUSION: We have shown that it is possible to migrate the standardised PLS-DA workflow to simple non-linear ANNs. This result opens the door to more widespread use and to the investigation of transparent interpretation of more complex ANN architectures.
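The Connection Weight Approach mentioned in the results can be sketched compactly: for a single-hidden-layer ANN, an input's importance is the sum over hidden units of the product of its input-to-hidden weight and the hidden-to-output weight. The toy weights below are invented for illustration and are not from the published workflow:

```python
def connection_weight_importance(w_ih, w_ho):
    """Connection Weight Approach: the importance of input i is the sum over
    hidden units j of w_ih[i][j] * w_ho[j], preserving the sign of the
    input's overall contribution to the output."""
    return [sum(w_ih[i][j] * w_ho[j] for j in range(len(w_ho)))
            for i in range(len(w_ih))]

# Toy single-hidden-layer network: 3 inputs (metabolites), 2 hidden units.
w_ih = [[0.5, -1.0],   # metabolite 1
        [2.0,  0.3],   # metabolite 2
        [0.0,  0.1]]   # metabolite 3
w_ho = [1.0, -0.5]
imp = connection_weight_importance(w_ih, w_ho)
```

Here metabolite 2 dominates, much as a large PLS regression coefficient would flag it; ranking inputs by these sums is what makes the ANN interpretable in a PLS-like way.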


Subject(s)
Discriminant Analysis, Least-Squares Analysis, Metabolomics, Neural Networks, Computer, Software
10.
J Undergrad Neurosci Educ ; 19(1): A94-A104, 2020.
Article in English | MEDLINE | ID: mdl-33880096

ABSTRACT

Conducting neuroscience research increasingly requires proficiency in coding and the ability to manipulate and analyze large datasets. However, these skills are often not included in typical neurobiology courses, partially because they are seen as accessory rather than central, and partially because of the barriers to entry. Therefore, this lesson plan aims to provide an introduction to coding in Python, a free and user-friendly programming language, for instructors and students alike. In this lesson, students edit Python code in the Jupyter Notebook coding environment to interact with cutting-edge electrophysiology data from the Allen Institute for Brain Science. Students can run their own experiments with these data to compare cell types in mice and humans. Along the way, they gain exposure to Python coding and the role of coding in the field of neuroscience.

11.
BMC Bioinformatics ; 20(1): 234, 2019 May 09.
Article in English | MEDLINE | ID: mdl-31072312

ABSTRACT

BACKGROUND: The Oxford Nanopore Technologies (ONT) MinION portable sequencer makes it possible to use cutting-edge genomic technologies in the field and the academic classroom. RESULTS: We present NanoDJ, a Jupyter notebook integration of tools for simplified manipulation and assembly of DNA sequences produced by ONT devices. It integrates basecalling, read trimming and quality control, simulation and plotting routines with a variety of widely used aligners and assemblers, including procedures for hybrid assembly. CONCLUSIONS: Through Jupyter-facilitated access to self-explanatory application content and interactive visualization of results, as well as distribution as a Docker software container, NanoDJ aims to simplify ONT DNA sequence analysis and make it more reproducible. The NanoDJ package code, documentation and installation instructions are freely available at https://github.com/genomicsITER/NanoDJ.
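As a rough, hypothetical sketch of the kind of read trimming and quality control step NanoDJ wraps (not NanoDJ's actual code, and the thresholds are illustrative rather than its defaults), a mean-Phred length/quality filter might look like:

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred quality of a read from its ASCII-encoded quality string."""
    return sum(ord(c) - offset for c in quality_string) / len(quality_string)

def filter_reads(records, min_len=200, min_q=7):
    """Keep reads meeting minimum length and mean-quality thresholds.

    records: iterable of (sequence, quality_string) pairs, as parsed
    from a FASTQ file. Thresholds are illustrative for noisy long reads.
    """
    return [(seq, q) for seq, q in records
            if len(seq) >= min_len and mean_phred(q) >= min_q]

# A long high-quality read passes; a 2 bp, quality-0 read is discarded.
reads = [("ACGT" * 60, "I" * 240), ("AC", "!!")]
kept = filter_reads(reads)
```

Real pipelines delegate this to dedicated tools, but the logic, length gate plus mean-quality gate per read, is the same.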


Subject(s)
Genomics/methods, High-Throughput Nucleotide Sequencing/methods, Nanopores, Sequence Analysis, DNA/methods
12.
J Proteome Res ; 18(4): 1751-1759, 2019 04 05.
Article in English | MEDLINE | ID: mdl-30855969

ABSTRACT

Reproducibility has become a major concern in biomedical research. In proteomics, bioinformatic workflows can quickly come to involve multiple software tools, each with its own set of parameters. Their usage involves the definition of often hundreds of parameters as well as data operations to ensure tool interoperability. Hence, a manuscript's methods section is often insufficient to completely describe and reproduce a data analysis workflow. Here we present IsoProt, a complete and reproducible bioinformatic workflow deployed on a portable container environment to analyze data from isobarically labeled, quantitative proteomics experiments. The workflow uses only open source tools and provides a user-friendly and interactive browser interface to configure and execute the different operations. Once the workflow is executed, the results, including the R code to perform statistical analyses, can be downloaded as an HTML document providing a complete record of the performed analyses. IsoProt therefore represents a reproducible bioinformatics workflow that will yield identical results on any computer platform.


Subject(s)
Isotope Labeling, Proteome/analysis, Proteomics/methods, Software, Tandem Mass Spectrometry, Animals, Databases, Factual, Malaria, Cerebral/metabolism, Mice, Proteome/chemistry, Proteome/metabolism, Reproducibility of Results
13.
Metabolomics ; 15(10): 125, 2019 09 14.
Article in English | MEDLINE | ID: mdl-31522294

ABSTRACT

BACKGROUND: A lack of transparency and reporting standards in the scientific community has led to increasing and widespread concerns relating to reproduction and integrity of results. As an omics science, which generates vast amounts of data and relies heavily on data science for deriving biological meaning, metabolomics is highly vulnerable to irreproducibility. The metabolomics community has made substantial efforts to align with FAIR data standards by promoting open data formats, data repositories, online spectral libraries, and metabolite databases. Open data analysis platforms also exist; however, they tend to be inflexible and rely on the user to adequately report their methods and results. To enable FAIR data science in metabolomics, methods and results need to be transparently disseminated in a manner that is rapid, reusable, and fully integrated with the published work. To ensure broad use within the community such a framework also needs to be inclusive and intuitive for both computational novices and experts alike. AIM OF REVIEW: To encourage metabolomics researchers from all backgrounds to take control of their own data science, mould it to their personal requirements, and enthusiastically share resources through open science. KEY SCIENTIFIC CONCEPTS OF REVIEW: This tutorial introduces the concept of interactive web-based computational laboratory notebooks. The reader is guided through a set of experiential tutorials specifically targeted at metabolomics researchers, based around the Jupyter Notebook web application, GitHub data repository, and Binder cloud computing platform.


Subject(s)
Cloud Computing, Data Science, Metabolomics, Software, Animals, Humans
14.
Metabolomics ; 15(12): 150, 2019 11 15.
Article in English | MEDLINE | ID: mdl-31728648

ABSTRACT

INTRODUCTION: Metabolomics is increasingly being used in the clinical setting for disease diagnosis, prognosis and risk prediction. Machine learning algorithms are particularly important in the construction of multivariate metabolite prediction models. Historically, partial least squares (PLS) regression has been the gold standard for binary classification. Non-linear machine learning methods such as random forests (RF), kernel support vector machines (SVM) and artificial neural networks (ANN) may be better suited to modelling possible non-linear metabolite covariance, and thus provide better predictive models. OBJECTIVES: We hypothesise that for binary classification using metabolomics data, non-linear machine learning methods will provide superior generalised predictive ability when compared to linear alternatives, in particular when compared with the current gold standard, PLS discriminant analysis. METHODS: We compared the general predictive performance of eight archetypal machine learning algorithms across ten publicly available clinical metabolomics data sets. The algorithms were implemented in the Python programming language. All code and results have been made publicly available as Jupyter notebooks. RESULTS: There was only marginal improvement in predictive ability for SVM and ANN over PLS across all data sets. RF performance was comparatively poor. The use of out-of-bag bootstrap confidence intervals provided a measure of uncertainty in model prediction, such that the quality of the metabolomics data was observed to be a bigger influence on generalised performance than model choice. CONCLUSION: The size of the data set, and the choice of performance metric, had a greater influence on generalised predictive performance than the choice of machine learning algorithm.
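The bootstrap confidence intervals mentioned in the results can be approximated with a plain percentile bootstrap over resampled predictions; this is a simplified stand-in for, not a reproduction of, the authors' out-of-bag procedure, and the labels below are synthetic:

```python
import random

def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a classification metric.

    Resamples (truth, prediction) pairs with replacement and takes the
    empirical alpha/2 and 1 - alpha/2 quantiles of the metric.
    """
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx],
                            [y_pred[i] for i in idx]))
    stats.sort()
    return stats[int((alpha / 2) * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

accuracy = lambda t, p: sum(a == b for a, b in zip(t, p)) / len(t)
y_true = [0, 1] * 20
y_pred = [0, 1] * 15 + [1, 0] * 5   # classifier is 75% accurate
lo, hi = bootstrap_ci(y_true, y_pred, accuracy)
```

The width of (lo, hi) conveys exactly the point the abstract makes: with small clinical data sets, the interval around a 75% accuracy is wide enough to swamp the marginal differences between algorithms.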


Subject(s)
Discriminant Analysis, Metabolomics/classification, Metabolomics/methods, Algorithms, Humans, Least-Squares Analysis, Machine Learning, Neural Networks, Computer, Prognosis, Support Vector Machine
15.
J Neurophysiol ; 116(2): 252-62, 2016 08 01.
Article in English | MEDLINE | ID: mdl-27098025

ABSTRACT

Neurophysiology requires an extensive workflow of information analysis routines, which often includes incompatible proprietary software, introducing limitations based on financial costs, transfer of data between platforms, and the ability to share. An ecosystem of free open-source software exists to fill these gaps, including thousands of analysis and plotting packages written in Python and R, which can be implemented in a sharable and reproducible format, such as the Jupyter electronic notebook. This tool chain can largely replace current routines by importing data, producing analyses, and generating publication-quality graphics. An electronic notebook like Jupyter allows these analyses, along with documentation of procedures, to display locally or remotely in an internet browser, and they can be saved as HTML, PDF, or other file formats for sharing with team members and the scientific community. The present report illustrates these methods using data from electrophysiological recordings of the musk shrew vagus, a model system to investigate gut-brain communication, for example, in cancer chemotherapy-induced emesis. We show methods for spike sorting (including statistical validation), spike train analysis, and analysis of compound action potentials in notebooks. Raw data and code are available from notebooks in data supplements or from an executable online version, which replicates all analyses without installing software, an implementation of reproducible research. This demonstrates the promise of combining disparate analyses into one platform, along with the ease of sharing this work. In an age of diverse, high-throughput computational workflows, this methodology can increase efficiency, transparency, and the collaborative potential of neurophysiological research.
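As a minimal, illustrative sketch of the first stage of spike sorting (event detection by threshold crossing, not the statistically validated pipeline used in the paper), with an entirely synthetic trace:

```python
def detect_spikes(trace, threshold, refractory=5):
    """Return sample indices where the signal crosses the threshold upward,
    enforcing a refractory window (in samples) between detected events."""
    spikes, last = [], -refractory
    for i in range(1, len(trace)):
        if trace[i - 1] < threshold <= trace[i] and i - last >= refractory:
            spikes.append(i)
            last = i
    return spikes

# Synthetic trace: flat baseline with two brief depolarizations.
trace = [0.0] * 50
for t in (10, 30):
    trace[t], trace[t + 1] = 1.0, 0.5
spikes = detect_spikes(trace, threshold=0.8)
```

Detected event times feed directly into spike train analyses such as inter-spike interval histograms; a full sorter would additionally extract and cluster waveform shapes around each index.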


Subject(s)
Afferent Pathways/physiology, Brain/physiology, Information Dissemination/methods, Neurophysiology, Software, Stomach/innervation, Animals, Blood Pressure/physiology, Cooperative Behavior, Electric Stimulation, Male, Shrews, Vagus Nerve/physiology, Workflow
16.
Arch Biochem Biophys ; 589: 18-26, 2016 Jan 01.
Article in English | MEDLINE | ID: mdl-26365033

ABSTRACT

Mass spectrometry imaging (MSI) is used in an increasing number of biological applications. Typical MSI datasets contain unique, high-resolution mass spectra from tens of thousands of spatial locations, resulting in raw data sizes of tens of gigabytes per sample. In this paper, we review technical progress that is enabling new biological applications and that is driving an increase in the complexity and size of MSI data. Handling such data often requires specialized computational infrastructure, software, and expertise. OpenMSI, our recently described platform, makes it easy to explore and share MSI datasets via the web, even when larger than 50 GB. Here we describe the integration of OpenMSI with IPython notebooks for transparent, sharable, and replicable MSI research. An advantage of this approach is that users do not have to share raw data along with analyses; instead, data is retrieved via OpenMSI's web API. The IPython notebook interface provides a low-barrier entry point for data manipulation that is accessible for scientists without extensive computational training. Via these notebooks, analyses can be easily shared without requiring any data movement. We provide example notebooks for several common MSI analysis types, including data normalization, plotting, clustering and classification, and image registration.
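One of the example analysis types listed, data normalization, is commonly done per pixel by total ion current (TIC). The sketch below is a generic, dependency-free illustration of that idea, independent of the OpenMSI API, with made-up intensities:

```python
def tic_normalize(spectra):
    """Total-ion-current normalization: scale each spectrum (a list of
    intensities at fixed m/z bins) so its intensities sum to 1, a common
    first step before clustering or classifying MSI pixels."""
    normalized = []
    for spectrum in spectra:
        tic = sum(spectrum)
        normalized.append([x / tic for x in spectrum] if tic else spectrum)
    return normalized

# Two pixels with different overall ion yields become directly comparable.
pixels = [[2.0, 6.0, 2.0], [1.0, 1.0, 2.0]]
norm = tic_normalize(pixels)
```

After normalization, differences between pixels reflect relative peak composition rather than overall signal intensity, which varies with matrix and instrument effects.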


Subject(s)
Information Dissemination/methods, Mass Spectrometry, Molecular Imaging, Animals, Cooperative Behavior, Humans, Software
17.
Gigascience ; 13, 2024 01 02.
Article in English | MEDLINE | ID: mdl-38206590

ABSTRACT

BACKGROUND: Jupyter notebooks bundle executable code with its documentation and output in one interactive environment, and they are a popular mechanism for documenting and sharing computational workflows, including for research publications. The reproducibility of computational aspects of research is a key component of scientific reproducibility but has not yet been assessed at scale for Jupyter notebooks associated with biomedical publications. APPROACH: We address computational reproducibility at two levels: (i) using fully automated workflows, we analyzed the computational reproducibility of Jupyter notebooks associated with publications indexed in the biomedical literature repository PubMed Central. We identified such notebooks by mining each article's full text, trying to locate them on GitHub, and attempting to rerun them in an environment as close to the original as possible. We documented reproduction successes and exceptions and explored relationships between notebook reproducibility and variables related to the notebooks or publications. (ii) This study represents a reproducibility attempt in and of itself, applying essentially the same methodology twice to PubMed Central over the course of two years, during which the corpus of Jupyter notebooks from articles indexed in PubMed Central grew in a highly dynamic fashion. RESULTS: Out of 27,271 Jupyter notebooks from 2,660 GitHub repositories associated with 3,467 publications, 22,578 notebooks were written in Python, including 15,817 that declared their dependencies in standard requirements files and that we attempted to rerun automatically. For 10,388 of these, all declared dependencies could be installed successfully, and we reran them to assess reproducibility. Of these, 1,203 notebooks ran through without any errors, including 879 that produced results identical to those reported in the original notebook and 324 for which our results differed from the originally reported ones. Running the remaining notebooks resulted in exceptions. CONCLUSIONS: We zoom in on common problems and practices, highlight trends, and discuss potential improvements to Jupyter-related workflows associated with biomedical publications.
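The comparison step behind the "identical vs. differing results" counts above can be sketched in plain Python. This is a minimal illustration under our own assumptions about the notebook JSON layout (standard `cells`/`outputs` fields), not the study's actual pipeline, and the function names are ours:

```python
import json

def extract_outputs(nb):
    """Collect the visible text output of each code cell in a notebook dict."""
    outputs = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        texts = []
        for out in cell.get("outputs", []):
            if "text" in out:  # stream output (stdout/stderr)
                texts.append("".join(out["text"]))
            elif "text/plain" in out.get("data", {}):  # execute_result / display_data
                texts.append("".join(out["data"]["text/plain"]))
        outputs.append("\n".join(texts))
    return outputs

def compare_runs(original, rerun):
    """Classify a rerun as 'identical' if every code cell's text output
    matches the original notebook, otherwise 'different'."""
    if extract_outputs(original) == extract_outputs(rerun):
        return "identical"
    return "different"

# Usage (hypothetical file names):
# verdict = compare_runs(json.load(open("original.ipynb")),
#                        json.load(open("rerun.ipynb")))
```

A real pipeline would also normalize volatile output such as timestamps and memory addresses before comparing, which this sketch deliberately omits.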


Subject(s)
Documentation, Records, Reproducibility of Results, Reproduction, Workflow
18.
Methods Mol Biol ; 2777: 231-256, 2024.
Article in English | MEDLINE | ID: mdl-38478348

ABSTRACT

Knowledge of cancer stem cell (CSC) morphology is limited, and more extensive studies are therefore required. Image recognition technologies based on artificial intelligence (AI) require no prior expertise in image annotation. Herein, we describe the construction of AI models that recognize CSC morphology in cultures and tumor tissues. Visualizing the AI's deep learning process yields insight into structures in an image that would otherwise go unrecognized.


Subject(s)
Deep Learning, Neoplasms, Humans, Artificial Intelligence, Neoplastic Stem Cells, Technology
19.
Biochem Mol Biol Educ ; 52(2): 165-178, 2024.
Article in English | MEDLINE | ID: mdl-37937712

ABSTRACT

Dimensionality reduction techniques are essential in analyzing large 'omics' datasets in biochemistry and molecular biology. Principal component analysis, t-distributed stochastic neighbor embedding, and uniform manifold approximation and projection are commonly used for data visualization. However, these methods can be challenging for students without a strong mathematical background. In this study, intuitive examples were created using COVID-19 data to help students understand the core concepts behind these techniques. In a 4-h practical session, we used these examples to demonstrate dimensionality reduction techniques to 15 postgraduate students from biomedical backgrounds. Using Python and Jupyter notebooks, our goal was to demystify these methods, typically treated as "black boxes", and empower students to generate and interpret their own results. To assess the impact of our approach, we conducted an anonymous survey. The majority of the students agreed that using computers enriched their learning experience (67%) and that Jupyter notebooks were a valuable part of the class (66%). Additionally, 60% of the students reported increased interest in Python, and 40% gained both interest and a better understanding of dimensionality reduction methods. Despite the short duration of the course, 40% of the students reported acquiring research skills necessary in the field. While further analysis of the learning impacts of this approach is needed, we believe that sharing the examples we generated can provide valuable resources for others to use in interactive teaching environments. These examples highlight advantages and limitations of the major dimensionality reduction methods used in modern bioinformatics analysis in an easy-to-understand way.
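The principal component analysis step that the abstract mentions can be demystified with a few lines of NumPy; the sketch below is our own minimal illustration (the function name, toy data, and seed are assumptions, not the authors' teaching material):

```python
import numpy as np

def pca(X, n_components=2):
    """Project X onto its top principal components via SVD.

    Centering each feature first makes the SVD of the data matrix
    equivalent to an eigendecomposition of the covariance matrix.
    """
    Xc = X - X.mean(axis=0)                      # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T              # scores in the reduced space

# Toy data: 100 samples, 5 features, reduced to 2 dimensions for plotting.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z = pca(X, n_components=2)
```

Because `np.linalg.svd` returns singular values in decreasing order, the first column of `Z` always captures at least as much variance as the second, which is the ordering property students can verify for themselves in a notebook.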


Subject(s)
Biological Science Disciplines, Students, Humans, Learning, Biochemistry, Motivation
20.
PeerJ Comput Sci ; 9: e1230, 2023.
Article in English | MEDLINE | ID: mdl-37346615

ABSTRACT

Program code is an increasingly popular data source among data scientists, used for purposes ranging from the semantic classification of code to the automatic generation of programs. However, applying machine learning models to code is of limited use without annotated code snippets. To address the lack of annotated datasets, we present the Code4ML corpus. It contains code snippets, task summaries, competitions, and dataset descriptions publicly available from Kaggle, the leading platform for hosting data science competitions. The corpus consists of ~2.5 million snippets of ML code collected from ~100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose. The Code4ML dataset can help address a number of software engineering and data science challenges through a data-driven approach; for example, it can support semantic code classification, code auto-completion, and code generation for an ML task specified in natural language.
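A downstream task such as semantic code classification could start from a naive keyword baseline like the sketch below; the stage names and keywords here are hypothetical illustrations of the idea, not the Code4ML corpus's actual annotation scheme:

```python
# Hypothetical pipeline stages and trigger keywords (our own assumptions).
STAGE_KEYWORDS = {
    "data_loading": ("read_csv", "load_dataset", "open("),
    "training": ("fit(", "train(", "optimizer"),
    "evaluation": ("score(", "accuracy", "predict("),
}

def classify_snippet(code):
    """Return the first stage whose keyword appears in the snippet,
    or 'other' if none match."""
    for stage, keywords in STAGE_KEYWORDS.items():
        if any(kw in code for kw in keywords):
            return stage
    return "other"
```

An annotated corpus like Code4ML is what lets such a crude heuristic be replaced by a learned classifier evaluated against human labels.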
