ABSTRACT
Data analysis workflows in many scientific domains have become increasingly complex and flexible. Here we assess the effect of this flexibility on the results of functional magnetic resonance imaging by asking 70 independent teams to analyse the same dataset, testing the same nine ex ante hypotheses[1]. The flexibility of analytical approaches is exemplified by the fact that no two teams chose identical workflows to analyse the data. This flexibility resulted in sizeable variation in the results of hypothesis tests, even for teams whose statistical maps were highly correlated at intermediate stages of the analysis pipeline. Variation in reported results was related to several aspects of analysis methodology. Notably, a meta-analytical approach that aggregated information across teams yielded a significant consensus in activated regions. Furthermore, prediction markets of researchers in the field revealed an overestimation of the likelihood of significant findings, even by researchers with direct knowledge of the dataset[2-5]. Our findings show that analytical flexibility can have substantial effects on scientific conclusions, and identify factors that may be related to variability in the analysis of functional magnetic resonance imaging. The results emphasize the importance of validating and sharing complex analysis workflows, and demonstrate the need for performing and reporting multiple analyses of the same data. Potential approaches that could be used to mitigate issues related to analytical variability are discussed.
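The cross-team consensus analysis can be illustrated with a simple image-based meta-analysis. The sketch below is not the authors' exact pipeline; it shows Stouffer's method for combining per-team z-statistic maps, with the team count, array shapes, signal layout, and threshold chosen purely for illustration:

```python
import numpy as np

def stouffer_consensus(z_maps):
    """Combine per-team z-statistic maps into one consensus map.

    z_maps: array of shape (n_teams, n_voxels); each row is one team's
    unthresholded z map for the same hypothesis test. Stouffer's method
    assumes the maps are independent, which is violated when all teams
    analyse the same data, so a real analysis would correct for the
    inter-team correlation.
    """
    z = np.asarray(z_maps, dtype=float)
    k = z.shape[0]
    return z.sum(axis=0) / np.sqrt(k)

# Toy usage: 70 teams, 1000 voxels, weak shared signal plus noise.
rng = np.random.default_rng(0)
signal = np.where(np.arange(1000) < 100, 0.5, 0.0)  # "activated" region
maps = signal + rng.standard_normal((70, 1000))
consensus = stouffer_consensus(maps)
print((consensus > 3.1).mean())  # fraction of voxels above a z threshold
```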
Subject(s)
Data Analysis , Data Science/methods , Data Science/standards , Datasets as Topic , Functional Neuroimaging , Magnetic Resonance Imaging , Research Personnel/organization & administration , Brain/diagnostic imaging , Brain/physiology , Datasets as Topic/statistics & numerical data , Female , Humans , Logistic Models , Male , Meta-Analysis as Topic , Models, Neurological , Reproducibility of Results , Research Personnel/standards , Software
ABSTRACT
Neuroscience research has evolved to generate increasingly large and complex experimental datasets, and advanced data science tools are taking on central roles in neuroscience research. Neurodata Without Borders (NWB), a standard language for neurophysiology data, has recently emerged as a powerful solution for data management, analysis, and sharing. Here we discuss our labs' efforts to implement NWB data science pipelines. We describe general principles and specific use cases that illustrate successes, challenges, and non-trivial decisions in software engineering. We hope that our experience can provide guidance for the neuroscience community and help bridge the gap between experimental neuroscience and data science. Key takeaways from this article are that (1) standardization with NWB requires non-trivial design choices; (2) the general practice of standardization in the lab promotes data awareness and literacy, and improves transparency, rigor, and reproducibility in our science; and (3) we offer several feature suggestions to ease the extensibility, publishing and sharing, and usability of the NWB standard for users of NWB data.
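As a concrete illustration of the kind of standardization pipeline discussed, the following minimal sketch writes a session to an NWB file with the public pynwb API; the session metadata, signal, sampling rate, and file name are placeholders rather than details from the article:

```python
from datetime import datetime, timezone

import numpy as np
from pynwb import NWBFile, NWBHDF5IO, TimeSeries

# Minimal NWB file: the required session metadata here are placeholders.
nwbfile = NWBFile(
    session_description="example recording session",
    identifier="session-001",  # hypothetical identifier
    session_start_time=datetime(2023, 1, 1, tzinfo=timezone.utc),
)

# Attach a raw acquisition trace; unit and sampling rate are illustrative.
trace = TimeSeries(
    name="raw_voltage",
    data=np.random.randn(30_000),
    unit="volts",
    rate=30_000.0,  # Hz
)
nwbfile.add_acquisition(trace)

# Standardized write via the HDF5 backend.
with NWBHDF5IO("session-001.nwb", mode="w") as io:
    io.write(nwbfile)
```

Even in this toy case, non-trivial design choices appear (which metadata to require, how to name acquisitions, which container types to use), echoing the article's first takeaway.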
Subject(s)
Neurosciences , Animals , Humans , Data Science/methods , Data Science/standards , Information Dissemination/methods , Neurosciences/standards , Neurosciences/methods , Software/standards
Subject(s)
COVID-19/epidemiology , Communication , Data Science , Data Visualization , Pandemics , Public Health , Big Data , COVID-19/mortality , Climate Change/statistics & numerical data , Data Science/education , Data Science/standards , Demography/standards , Ethnicity/statistics & numerical data , Geographic Mapping , Health Policy , Humans , Maryland , Policy Making , Public Opinion , Racial Groups/statistics & numerical data , Universities
ABSTRACT
Data, including information generated from them by processing and analysis, are an asset with measurable value. The assets that biological research funding produces are the data generated, the information derived from these data, and, ultimately, the discoveries and knowledge these lead to. From the time when Henry Oldenburg published the first scientific journal in 1665 (Philosophical Transactions of the Royal Society) to the founding of the United States National Library of Medicine in 1879 to the present, there has been a sustained drive to improve how researchers can record and discover what is known. Researchers' experimental work builds upon years and (collectively) billions of dollars' worth of earlier work. Today, researchers are generating data at ever-faster rates because of advances in instrumentation and technology, coupled with decreases in production costs. Unfortunately, the ability of researchers to manage and disseminate their results has not kept pace, so their work cannot achieve its maximal impact. Strides have recently been made, but more awareness is needed of the essential role that biological data resources, including biocuration, play in maintaining and linking this ever-growing flood of data and information. The aim of this paper is to describe the nature of data as an asset, the role biocurators play in increasing its value, and consistent, practical means to measure effectiveness that can guide planning and justify costs in the development and management of biological research information resources.
Subject(s)
Data Science/methods , Information Management/methods , Data Aggregation , Data Science/standards , Humans , Information Dissemination/methods , Information Management/standards , Knowledge
ABSTRACT
In this paper we argue that 'informed' consent in Big Data genomic biobanking is frequently less than optimally informative. This is due to the particular features of genomic biobanking research which render it ethically problematic. We discuss these features together with details of consent models aimed to address them. Using insights from consent theory, we provide a detailed analysis of the essential components of informed consent, which includes recommendations to improve consent performance. In addition, using insights from the philosophy of mind and language, and from psycholinguistics, we support our analyses by identifying the nature and function of concepts (ideas) operational in human cognition and language, together with an implicit coding/decoding model of human communication. We identify this model as the source of patients'/participants' poor understanding. We suggest an alternative, explicit model of human communication, namely, that of relevance-theoretic inference, which obviates the limitations of the code model. We suggest practical strategies to assist health service professionals to ensure that the specific information they provide concerning the proposed treatment or research is used to inform participants' decision to consent. We do not prescribe a standard, formal approach to decision-making where boxes are ticked; rather, we aim to focus attention towards the sorts of considerations and questions that might usefully be borne in mind in any consent situation. We hope that our theorising will be of real practical benefit to nurses and midwives working on the clinical and research front-line of genomic science.
Subject(s)
Data Science/methods , Genomics/ethics , Informed Consent/ethics , Data Science/standards , Genomics/trends , Humans , Informed Consent/standards , Patient Participation/psychology
ABSTRACT
Increasingly, users of health and biomedical libraries need assistance with challenges they face in working with their own and others' data. Librarians have a unique opportunity to provide valuable support and assistance in data science and open science but may need to add to their expertise and skill set to have the most impact. This article describes the rationale for and development of the Medical Library Association Data Services Competency, which outlines a set of five key skills for data services and provides a course of study for gaining these skills.
Subject(s)
Data Science/standards , Libraries, Medical/standards , Library Associations/standards , Library Services/standards , Professional Competence/standards , Humans , Information Literacy , Practice Guidelines as Topic
ABSTRACT
CONTEXT: National data on the epidemiology of firearm injuries and circumstances of firearm deaths are difficult to obtain and often unreliable. Since firearm injury and death rates and causes can vary substantially between states, it is critical to consider state-specific data sources. OBJECTIVE: In this study, we illustrate how states can systematically examine demographic characteristics, firearm information, type of wound, toxicology tests, precipitating circumstances, and costs to provide a comprehensive picture of firearm injuries and deaths using data sets from a single state with relatively low rates of firearm injury and death. DESIGN: Cross-sectional study. SETTING: Firearm-related injury data for the period 2005-2014 were obtained from the Rhode Island emergency department and hospital discharge data sets; death data for the same period were obtained from the Rhode Island Violent Death Reporting System. MAIN OUTCOME MEASURE: Descriptive statistics were used. Healthcare Cost and Utilization Project cost-to-charge ratios were used to convert total hospital charges to costs. RESULTS: Most firearm-related emergency department visits (55.8%) and hospital discharges (79.2%) in Rhode Island were from assaults; however, most firearm-related deaths were suicides (60.1%). The annual cost of firearm-related hospitalizations was more than $830,000. Most firearm decedents tested positive for illicit substances. Nearly a quarter (23.5%) of firearm-related homicides were due to a conflict between the decedent and suspect. More than half (59%) of firearm suicide decedents were reported to have had current mental or physical problems prior to death. CONCLUSIONS: Understanding the state-specific magnitude and patterns (who, where, contributing factors, etc.) of firearm injury and death may help inform local injury prevention efforts. States with similar data sets may want to adopt our analyses. Surveillance of firearm-related injury and death is essential. Dissemination of surveillance findings to key stakeholders is critical in improving firearm injury prevention. States that are not part of the National Violent Death Reporting System (NVDRS) could work with their other data sources to obtain a better picture of violent injuries and deaths to make the best use of resources.
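The cost-to-charge conversion named in the outcome measure is a per-hospital multiplication: estimated cost = billed charge x hospital cost-to-charge ratio. A minimal sketch, with hypothetical column names and ratio values rather than the actual Rhode Island data dictionary:

```python
import pandas as pd

# Hypothetical discharge records; columns are illustrative only.
discharges = pd.DataFrame({
    "hospital_id": [1, 1, 2],
    "total_charge": [12_000.0, 8_500.0, 20_000.0],  # billed charges ($)
})

# HCUP-style hospital-level cost-to-charge ratios (illustrative values).
ccr = {1: 0.42, 2: 0.55}

# Estimated cost = billed charge x hospital cost-to-charge ratio.
discharges["estimated_cost"] = (
    discharges["total_charge"] * discharges["hospital_id"].map(ccr)
)
print(discharges["estimated_cost"].sum())  # aggregate cost estimate
```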
Subject(s)
Data Science/standards , Firearms/statistics & numerical data , Registries/standards , Cross-Sectional Studies , Data Science/methods , Emergency Service, Hospital , Humans , Patient Discharge/statistics & numerical data , Population Surveillance/methods , Registries/statistics & numerical data , Research Design , Rhode Island/epidemiology , Violence/statistics & numerical data
Subject(s)
Data Science/organization & administration , Databases, Factual/standards , Quality of Health Care/organization & administration , Terminology as Topic , Data Science/standards , Europe , Health Information Exchange/standards , Humans , Intensive Care Units , Quality Improvement/organization & administration , Quality of Health Care/standards
ABSTRACT
BACKGROUND: The WHO announced the epidemic of SARS-CoV-2 as a public health emergency of international concern on 30th January 2020. To date, it has spread to more than 200 countries and has been declared a global pandemic. For appropriate preparedness, containment, and mitigation response, the stakeholders and policymakers require prior guidance on the propagation of SARS-CoV-2. METHODOLOGY: This study aims to provide such guidance by forecasting the cumulative COVID-19 cases up to 4 weeks ahead for 187 countries, using four data-driven methodologies: autoregressive integrated moving average (ARIMA), exponential smoothing (ETS), and random walk forecasts (RWF) with and without drift. For these forecasts, we evaluate the accuracy and systematic errors using the Mean Absolute Percentage Error (MAPE) and Mean Absolute Error (MAE), respectively. FINDINGS: The results show that the ARIMA and ETS methods outperform the other two forecasting methods. Additionally, using these forecasts, we generate heat maps to provide a pictorial representation of the countries at risk of an increase in cases in the coming 4 weeks of February 2021. CONCLUSION: Due to limited data availability during the ongoing pandemic, less data-hungry short-term forecasting models, like ARIMA and ETS, can help anticipate future outbreaks of SARS-CoV-2.
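A minimal sketch of the four forecasting methods named above, using statsmodels; the input series, the ARIMA order, and the smoothing configuration are illustrative assumptions, not the study's fitted models:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Illustrative cumulative case series; real inputs were per-country counts.
y = pd.Series(np.cumsum(np.random.poisson(200, size=120)).astype(float))
h = 28  # 4-week-ahead horizon, in days

# ARIMA: the (1, 2, 1) order is illustrative, not the study's chosen order.
arima_fc = ARIMA(y, order=(1, 2, 1)).fit().forecast(steps=h)

# ETS with an additive trend, suited to a monotone cumulative series.
ets_fc = ExponentialSmoothing(y, trend="add").fit().forecast(h)

# Random walk forecasts: flat without drift, linear with drift.
drift = (y.iloc[-1] - y.iloc[0]) / (len(y) - 1)
rwf = np.repeat(y.iloc[-1], h)
rwf_drift = y.iloc[-1] + drift * np.arange(1, h + 1)

def mape(actual, pred):
    """Mean absolute percentage error (accuracy criterion)."""
    actual, pred = np.asarray(actual), np.asarray(pred)
    return np.mean(np.abs((actual - pred) / actual)) * 100

def mae(actual, pred):
    """Mean absolute error (systematic-error criterion)."""
    return np.mean(np.abs(np.asarray(actual) - np.asarray(pred)))
```

Comparing held-out observations against each method's forecasts with these two criteria reproduces the study's evaluation setup in miniature.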
Subject(s)
COVID-19/epidemiology , Data Science/methods , Models, Statistical , Data Science/standards , Humans , Practice Guidelines as Topic , Software/standards
ABSTRACT
AIM: To demonstrate a custom spine disease analytical evaluation module that can instantly, accurately, and effectively analyze the widely accepted outcome measurements. MATERIAL AND METHODS: The time required for data input and analysis was measured to compare the traditional paper-based manual data entry method with the spine surgery database (SSD) platform data entry method. The data of 116 patients in the cervical degenerative patient group were analyzed using the upgraded version of the SSD. The subjects were analyzed with respect to the SF-36 Quality of Life Index, Nurick classification, and Japanese Orthopedic Association score. The manual and computerized patient information analyses were then compared. RESULTS: The developed analysis module enables instantaneous access to and analysis of patient data. More importantly, the paperless evaluation module shortens the post-surgery patient status evaluation time by 45.64%. For 116 patients, a physician gains up to 401 min of evaluation time in each preoperative and follow-up period without being subjected to the human errors encountered in paper records. CONCLUSION: It is apparent that customized software solutions are necessary in patient registration and follow-up processes. The experimental results showed that using the proposed module, patient follow-up and progress analyses can be performed in a fast, effective, and accurate manner.
Subject(s)
Cervical Vertebrae/surgery , Data Science/standards , Databases, Factual/standards , Postoperative Care/standards , Preoperative Care/standards , Spinal Diseases/surgery , Adult , Aged , Data Science/methods , Female , Follow-Up Studies , Health Surveys/methods , Health Surveys/standards , Humans , Male , Middle Aged , Outcome Assessment, Health Care/methods , Outcome Assessment, Health Care/standards , Postoperative Care/methods , Preoperative Care/methods , Quality of Life , Spinal Diseases/diagnosis , Treatment Outcome , Young Adult
ABSTRACT
The value of training for a data science professional is in the eye of the beholder, and depends on the scope and breadth of the training as well as its cost and time frame. Value for the employee may differ from value for the employer: the lens is different, and value may depend on which lens you look through. Training can be online or on-site, short term with a specific focus or longer term with greater breadth and less depth. Career goals should also be considered when determining value. Certification in Spark is not valuable if you do not want to work with Spark; a PhD in management psychology is not as valuable if you do not want to manage people. That training (both certification and degree programs) is valuable is not debatable. Maximizing that value for both employee and employer is always the preferable option. But is it realistic?
Subject(s)
Data Science/education , Certification , Data Science/standards , Education, Graduate , Humans
ABSTRACT
With the increasing importance of big data in biomedicine, skills in data science are a foundation for individual career development and for the progress of science. This chapter is a practical guide to working with high-throughput biomedical data. It covers how to understand and set up the computing environment, how to start a research project with proper and effective data management, and how to perform common bioinformatics tasks such as data wrangling, quality control, statistical analysis, and visualization, with examples using metabolomics data. Concepts and tools related to coding and scripting are discussed. Version control, knitr, and Jupyter notebooks are important to project management, collaboration, and research reproducibility. Overall, this chapter describes a core set of skills for working in bioinformatics and can serve as a reference text for a graduate-level course at the interface of bioinformatics and data science.
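To give a flavor of the data wrangling and quality control steps the chapter covers, here is a small pandas sketch on a synthetic metabolomics-style feature table; the missingness threshold and half-minimum imputation rule are common conventions chosen for illustration, not prescriptions from the chapter:

```python
import numpy as np
import pandas as pd

# Hypothetical feature table: rows = samples, columns = metabolites.
rng = np.random.default_rng(1)
data = pd.DataFrame(
    rng.lognormal(mean=2.0, sigma=1.0, size=(24, 200)),
    columns=[f"met_{i}" for i in range(200)],
)
data[data < 1.0] = np.nan  # simulate below-detection-limit values

# Quality control: drop features missing in more than 20% of samples.
keep = data.isna().mean() <= 0.20
filtered = data.loc[:, keep]

# Common wrangling: impute remaining gaps, log-transform, standardize.
imputed = filtered.fillna(filtered.min() / 2)   # half-minimum imputation
logged = np.log2(imputed)
scaled = (logged - logged.mean()) / logged.std()

print(f"kept {keep.sum()} of {data.shape[1]} features")
```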
Subject(s)
Computational Biology/methods , Data Science , Metabolomics , Software , Cloud Computing , Computational Biology/standards , Data Management , Data Science/methods , Data Science/standards , Database Management Systems , Databases, Factual , Humans , Metabolomics/standards , Metabolomics/statistics & numerical data
ABSTRACT
Long-read technologies are overcoming early limitations in accuracy and throughput, broadening their application domains in genomics. Dedicated analysis tools that take into account the characteristics of long-read data are thus required, but the fast pace of development of such tools can be overwhelming. To assist in the design and analysis of long-read sequencing projects, we review the current landscape of available tools and present an online interactive database, long-read-tools.org, to facilitate their browsing. We further focus on the principles of error correction, base modification detection, and long-read transcriptomics analysis and highlight the challenges that remain.
Subject(s)
Genomics/methods , Nanopore Sequencing/methods , Whole Genome Sequencing/methods , Animals , Data Science/methods , Data Science/standards , Genomics/standards , Humans , Nanopore Sequencing/standards , Whole Genome Sequencing/standards
ABSTRACT
Recently, there has been burgeoning interest in developing more effective and robust clinical decision support systems (CDSSs) for oncology. This has been primarily driven by the demands for more personalized and precise medical practice in oncology in the era of so-called Big Data (BD), an era that promises to harness the power of large-scale data flow to revolutionize cancer treatment. This interest in BD analytics has created new opportunities as well as new unmet challenges. These include: routine aggregation and standardization of clinical data; patient privacy; transformation of current analytical approaches to handle such noisy and heterogeneous data; and expanded use of advanced statistical learning methods based on the confluence of modern statistical methods and machine learning algorithms. In this review, we present the current status of CDSSs in oncology, the prospects and current challenges of BD analytics, and the promising role of integrated modern statistics and machine learning algorithms in predicting complex clinical endpoints, individualizing treatment rules, and optimizing dynamic personalized treatment regimens. We discuss issues pertaining to these topics and present application examples from an aggregate of experiences. We also discuss the role of human factors in improving the utilization and acceptance of such enhanced CDSSs and how to mitigate possible sources of human error to achieve optimal performance and wider acceptance.
Subject(s)
Big Data , Data Science/standards , Decision Support Systems, Clinical/standards , Medical Oncology/standards , Data Mining , Humans , Machine Learning , Medical Oncology/trends , Outcome and Process Assessment, Health Care , Precision Medicine , User-Computer Interface
ABSTRACT
Artificial intelligence (AI), especially deep learning, has the potential to fundamentally alter clinical radiology. AI algorithms, which excel at quantifying complex patterns in data, have shown remarkable progress in applications ranging from self-driving cars to speech recognition. The AI application within radiology, known as radiomics, can provide detailed quantifications of the radiographic characteristics of underlying tissues. This information can be used throughout the clinical care path to improve diagnosis and treatment planning, as well as to assess treatment response. This tremendous potential for clinical translation has led to a vast increase in the number of research studies being conducted in the field, a number that is expected to rise sharply in the future. Many studies have reported robust and meaningful findings; however, a growing number also suffer from flawed experimental or analytic designs. Such errors could not only result in invalid discoveries but may also lead others to perpetuate similar flaws in their own work. This perspective article aims to increase awareness of the issue, identify potential reasons why this is happening, and provide a path forward. Clin Cancer Res; 24(3); 532-4. ©2017 AACR.