ABSTRACT
Precision medicine promises improved health by accounting for individual variability in genes, environment, and lifestyle. Precision medicine will continue to transform healthcare in the coming decade as it expands in key areas: huge cohorts, artificial intelligence (AI), routine clinical genomics, phenomics and environment, and returning value across diverse populations.
Subjects
Delivery of Health Care, Precision Medicine, Artificial Intelligence, Big Data, Biomedical Research, Cultural Diversity, Electronic Health Records, Humans, Phenomics
ABSTRACT
This commentary introduces a new clinical trial construct, the Master Observational Trial (MOT), which hybridizes the power of molecularly based master interventional protocols with the breadth of real-world data. The MOT provides a clinical venue to allow molecular medicine to rapidly advance, answers questions that traditional interventional trials generally do not address, and seamlessly integrates with interventional trials in both diagnostic and therapeutic arenas. The result is a more comprehensive data collection ecosystem in precision medicine.
Subjects
Observational Studies as Topic/methods, Precision Medicine/methods, Research Design/standards, Big Data, Clinical Trial Protocols as Topic, Humans, Molecular Targeted Therapy/methods, Molecular Targeted Therapy/trends, Observational Studies as Topic/standards
ABSTRACT
The body-wide human microbiome plays a role in health, but its full diversity remains uncharacterized, particularly outside of the gut and in international populations. We leveraged 9,428 metagenomes to reconstruct 154,723 microbial genomes (45% of high quality) spanning body sites, ages, countries, and lifestyles. We recapitulated 4,930 species-level genome bins (SGBs), 77% without genomes in public repositories (unknown SGBs [uSGBs]). uSGBs are prevalent (in 93% of well-assembled samples), expand underrepresented phyla, and are enriched in non-Westernized populations (40% of the total SGBs). We annotated 2.85 M genes in SGBs, many associated with conditions including infant development (94,000) or Westernization (106,000). SGBs and uSGBs permit deeper microbiome analyses and increase the average mappability of metagenomic reads from 67.76% to 87.51% in the gut (median 94.26%) and 65.14% to 82.34% in the mouth. We thus identify thousands of microbial genomes from yet-to-be-named species, expand the pangenomes of human-associated microbes, and allow better exploitation of metagenomic technologies.
Subjects
Metagenome/genetics, Metagenomics/methods, Microbiota/genetics, Big Data, Genetic Variation/genetics, Geography, Humans, Life Style, Phylogeny, Sequence Analysis, DNA/methods
ABSTRACT
As acquiring bigger data becomes easier in experimental brain science, computational and statistical brain science must achieve similar advances to fully capitalize on these data. Tackling these problems will benefit from a more explicit and concerted effort to work together. Specifically, brain science can be further democratized by harnessing the power of community-driven tools, which are both built by and benefit from many people with different backgrounds and expertise. This perspective can be applied across modalities and scales and enables collaborations across previously siloed communities.
Subjects
Big Data, Brain/physiology, Computational Biology, Nerve Net/physiology, Animals, Computational Biology/methods, Databases, Genetic, Gene Expression/physiology, Humans
ABSTRACT
Over the last decade, biology has begun utilizing 'big data' approaches, resulting in large, comprehensive atlases in modalities ranging from transcriptomics to neural connectomics. However, these approaches must be complemented and integrated with 'small data' approaches to efficiently utilize data from individual labs. Integration of smaller datasets with major reference atlases is critical to provide context to individual experiments, and approaches toward integration of large and small data have been a major focus in many fields in recent years. Here we discuss progress in integration of small data with consortium-sized atlases across multiple modalities, and its potential applications. We then examine promising future directions for utilizing the power of small data to maximize the information garnered from small-scale experiments. We envision that, in the near future, international consortia comprising many laboratories will work together to collaboratively build reference atlases and foundation models using small data methods.
Subjects
Genomics, Humans, Genomics/methods, Big Data, Animals, Connectome/methods, Computational Biology/methods
ABSTRACT
Surveys are a crucial tool for understanding public opinion and behaviour, and their accuracy depends on maintaining statistical representativeness of their target populations by minimizing biases from all sources. Increasing data size shrinks confidence intervals but magnifies the effect of survey bias: an instance of the Big Data Paradox [1]. Here we demonstrate this paradox in estimates of first-dose COVID-19 vaccine uptake in US adults from 9 January to 19 May 2021 from two large surveys: Delphi-Facebook [2,3] (about 250,000 responses per week) and Census Household Pulse [4] (about 75,000 responses every two weeks). In May 2021, Delphi-Facebook overestimated uptake by 17 percentage points (14-20 percentage points with 5% benchmark imprecision) and Census Household Pulse by 14 percentage points (11-17 percentage points with 5% benchmark imprecision), compared to a retroactively updated benchmark the Centers for Disease Control and Prevention published on 26 May 2021. Moreover, their large sample sizes led to minuscule margins of error on the incorrect estimates. By contrast, an Axios-Ipsos online panel [5] with about 1,000 responses per week following survey research best practices [6] provided reliable estimates and uncertainty quantification. We decompose observed error using a recent analytic framework [1] to explain the inaccuracy in the three surveys. We then analyse the implications for vaccine hesitancy and willingness. We show how a survey of 250,000 respondents can produce an estimate of the population mean that is no more accurate than an estimate from a simple random sample of size 10. Our central message is that data quality matters more than data quantity, and that compensating for the former with the latter is a mathematically provable losing proposition.
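For readers who want the arithmetic behind the closing claim, here is a minimal sketch of the error decomposition in the cited framework [1]; the population size and data defect correlation in the numeric example below are illustrative assumptions, not figures taken from the paper.

\[
\bar{Y}_n - \bar{Y}_N \;=\; \hat{\rho}_{R,Y}\,\sqrt{\frac{1-f}{f}}\;\sigma_Y, \qquad f = n/N,
\]

where \(\hat{\rho}_{R,Y}\) is the data defect correlation between the recording indicator \(R\) and the outcome \(Y\), \(f\) is the sampling fraction, and \(\sigma_Y\) is the population standard deviation. Matching the implied mean squared error to that of a simple random sample gives an effective sample size

\[
n_{\mathrm{eff}} \;\approx\; \frac{f}{1-f}\cdot\frac{1}{\hat{\rho}_{R,Y}^{\,2}}.
\]

Taking \(n = 250{,}000\) responses, \(N \approx 2.5\times10^{8}\) US adults (so \(f \approx 10^{-3}\)), and a seemingly negligible \(\hat{\rho}_{R,Y} \approx 0.01\) yields \(n_{\mathrm{eff}} \approx 10\), the order of magnitude quoted above.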
Subjects
COVID-19 Vaccines/administration & dosage, Health Care Surveys, Vaccination/statistics & numerical data, Benchmarking, Bias, Big Data, COVID-19/epidemiology, COVID-19/prevention & control, Centers for Disease Control and Prevention, U.S., Datasets as Topic/standards, Female, Health Care Surveys/standards, Humans, Male, Research Design, Sample Size, Social Media, United States/epidemiology, Vaccination Hesitancy/statistics & numerical data
ABSTRACT
New data sources and AI methods for extracting information are increasingly abundant and relevant to decision-making across societal applications. A notable example is street view imagery, available in over 100 countries and purported to inform built environment interventions (e.g., adding sidewalks) for community health outcomes. However, biases can arise when decision-making does not account for data robustness or relies on spurious correlations. To investigate this risk, we analyzed 2.02 million Google Street View (GSV) images alongside health, demographic, and socioeconomic data from New York City. The findings demonstrate robustness challenges: built environment characteristics inferred from GSV labels at the intracity level often do not align with ground truth. Moreover, because average individual-level physical inactivity significantly mediates the impact of built environment features by census tract, the effect of intervening on features measured by GSV would be misestimated without proper model specification and consideration of this mediation mechanism. Using a causal framework accounting for these mediators, we determined that intervening by improving 10% of samples in the two lowest tertiles of physical inactivity would lead to a 4.17 (95% CI 3.84-4.55) or 17.2 (95% CI 14.4-21.3) times greater decrease in the prevalence of obesity or diabetes, respectively, compared to the same proportional intervention on the number of crosswalks by census tract. This study highlights critical issues of robustness and model specification in using emergent data sources, showing that the data may not measure what is intended and that ignoring mediators can result in biased intervention effect estimates.
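To make the mediation argument explicit, here is a schematic using the standard causal mediation decomposition; the notation is ours and the linear form is an illustrative simplification, not the authors' exact specification. For a built environment feature \(X\), mediator \(M\) (physical inactivity), and health outcome \(Y\),

\[
\underbrace{E[Y(1, M(1))] - E[Y(0, M(0))]}_{\text{total effect}}
= \underbrace{E[Y(1, M(0))] - E[Y(0, M(0))]}_{\text{direct}}
+ \underbrace{E[Y(1, M(1))] - E[Y(1, M(0))]}_{\text{indirect}}.
\]

Under a linear sketch \(Y = \beta X + \gamma M + \varepsilon\) with \(M = \alpha X + \eta\), the indirect effect is \(\alpha\gamma\) and the total effect is \(\beta + \alpha\gamma\): intervening on \(X\) moves \(Y\) by \(\beta + \alpha\gamma\) per unit, whereas intervening directly on \(M\) moves \(Y\) by \(\gamma\) per unit. When the mediated pathway dominates (\(\alpha\gamma \gg \beta\)), targeting \(M\) can far outperform targeting \(X\), which is consistent with the obesity and diabetes estimates above.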
Subjects
Big Data, Decision Making, Public Health, Humans, New York City, Built Environment, Male, Female
ABSTRACT
Biomedical data are growing exponentially in both volume and complexity, owing to the rapid advancement of technologies and research methodologies. Analyzing these large datasets, referred to collectively as "big data," has become an integral component of research: it guides experimentation-driven discovery and serves as a new engine of discovery itself, uncovering previously unknown connections through mining of existing data. To fully realize the potential of big data, biomedical researchers need access to high-performance computing (HPC) resources. However, supporting on-premises infrastructure that keeps up with these consistently expanding research needs presents persistent financial and staffing challenges, even for well-resourced institutions. For other institutions, including primarily undergraduate institutions and minority-serving institutions, which educate a large portion of the future workforce in the USA, this challenge presents an insurmountable barrier. Therefore, new approaches are needed to provide broad and equitable access to HPC resources for the biomedical researchers and students who will advance biomedical research in the future.
Subjects
Biomedical Research, Cloud Computing, Humans, Big Data, Computational Biology/methods, Computational Biology/education, Software, United States
ABSTRACT
Across many scientific disciplines, the development of computational models and algorithms for generating artificial or synthetic data is gaining momentum. In biology, there is a great opportunity to explore this further as ever more multi-omics big data are generated. In this opinion, we discuss the latest trends in biological applications from both process-driven and data-driven perspectives. Moving ahead, we believe these methodologies can help shape novel multi-omics-scale cellular inferences.
Subjects
Algorithms, Computational Biology, Computational Biology/methods, Genomics/methods, Humans, Big Data, Proteomics/methods, Multiomics
ABSTRACT
Over the past 20 years, neuroscience has been propelled forward by theory-driven experimentation. We consider the future outlook for the field in the age of big neural data and powerful artificial intelligence models.
Subjects
Artificial Intelligence, Neurosciences, Big Data, Empirical Research, Research Design
ABSTRACT
Since its inception, synthetic biology has overcome many technical barriers but is at a crossroads for high-precision biological design. Devising ways to fully utilize big biological data may be the key to achieving greater heights in synthetic biology.
Subjects
Big Data, Synthetic Biology
ABSTRACT
The National Genomics Data Center (NGDC), part of the China National Center for Bioinformation (CNCB), provides a family of database resources to support the global academic and industrial communities. With multi-omics data accumulating at an unprecedented pace, CNCB-NGDC continuously expands and updates its core database resources through big data archiving, integrative analysis, and value-added curation. Importantly, NGDC collaborates closely with major international databases and initiatives to ensure seamless data exchange and interoperability. Over the past year, significant efforts have been dedicated to integrating diverse omics data, synthesizing expanding knowledge, developing new resources, and upgrading major existing resources. In particular, several database resources have been newly developed for the biodiversity of protists (P10K) and bacteria (NTM-DB, MPA), as well as for plants (PPGR, SoyOmics, PlantPan) and disease/trait associations (CROST, HervD Atlas, HALL, MACdb, BioKA, RePoS, PGG.SV, NAFLDkb). All the resources and services are publicly accessible at https://ngdc.cncb.ac.cn.
Subjects
Computational Biology, Databases, Genetic, Genomics, Big Data, China, Databases, Genetic/trends, Eukaryota, Internet
ABSTRACT
MOTIVATION: Given the widespread use of the variant call format (VCF/BCF) and the continuous surge in big data, there is persistent demand for fast and flexible methods to manipulate these comprehensive formats across various programming languages. RESULTS: This work presents vcfpp, a C++ API for HTSlib contained in a single file, providing an intuitive interface to manipulate VCF/BCF files rapidly and safely, in addition to being portable. Moreover, this work introduces the vcfppR package to demonstrate the development of a high-performance R package with vcfpp, allowing rapid and straightforward variant analyses. AVAILABILITY AND IMPLEMENTATION: vcfpp is available from https://github.com/Zilong-Li/vcfpp under the MIT license. vcfppR is available from https://cran.r-project.org/web/packages/vcfppR.
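To make the API concrete, below is a minimal sketch of a vcfpp read loop, patterned on the examples in the project README; the class and method names (BcfReader, BcfRecord, getNextVariant, getGenotypes) and the input path are assumptions to be verified against the current release.

// Sketch: stream a VCF/BCF with vcfpp and count heterozygous diploid genotypes.
#include "vcfpp.h"   // single-file C++ API over HTSlib
#include <iostream>
#include <vector>

int main() {
    vcfpp::BcfReader vcf("test.vcf.gz");   // hypothetical input file
    vcfpp::BcfRecord var(vcf.header);      // record bound to this file's header
    std::vector<int> gt;                   // genotype buffer, reused per record
    long nvar = 0, nhet = 0;
    while (vcf.getNextVariant(var)) {      // iterate variants in order
        ++nvar;
        var.getGenotypes(gt);              // decode the GT field into gt
        // assume diploid calls: gt holds two allele codes per sample
        for (std::size_t i = 0; i + 1 < gt.size(); i += 2)
            if (gt[i] != gt[i + 1]) ++nhet;
    }
    std::cout << nvar << " variants, " << nhet << " heterozygous genotypes\n";
    return 0;
}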
Subjects
Programming Languages, Software, Big Data
ABSTRACT
Despite the remarkable advances in cancer diagnosis, treatment, and management over the past decade, malignant tumors remain a major public health problem. Further progress in combating cancer may be enabled by personalizing the delivery of therapies according to the predicted response for each individual patient. The design of personalized therapies requires the integration of patient-specific information with an appropriate mathematical model of tumor response. A fundamental barrier to realizing this paradigm is the current lack of a rigorous yet practical mathematical theory of tumor initiation, development, invasion, and response to therapy. We begin this review with an overview of different approaches to modeling tumor growth and treatment, including mechanistic models as well as data-driven models based on big data and artificial intelligence. We then present illustrative examples of mathematical models that demonstrate their utility and discuss the limitations of stand-alone mechanistic and data-driven models. Next, we discuss the potential of mechanistic models for not only predicting but also optimizing response to therapy on a patient-specific basis, and we describe current efforts and future possibilities to integrate mechanistic and data-driven models. We conclude by proposing five fundamental challenges that must be addressed to fully realize personalized care for cancer patients driven by computational models.
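As one concrete instance of the mechanistic model class such reviews discuss, consider the classic logistic growth law with a treatment term; this is a textbook illustration chosen by us, not a specific model from the review.

\[
\frac{dN}{dt} = r\,N\left(1 - \frac{N}{K}\right) - \delta(t)\,N,
\]

where \(N(t)\) is tumor burden, \(r\) the net proliferation rate, \(K\) the carrying capacity, and \(\delta(t)\) a therapy-dependent kill rate. Calibrating \((r, K, \delta)\) to patient-specific imaging or biomarker time series is what turns such a model into a candidate engine for predicting, and ultimately optimizing, an individual's response to therapy.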
Subjects
Artificial Intelligence, Big Data, Neoplasms, Precision Medicine, Humans, Neoplasms/therapy, Precision Medicine/methods, Computer Simulation, Models, Biological, Patient-Specific Modeling
Subjects
Academia, Artificial Intelligence, Big Data, Information Technology, Academia/economics, Academia/methods, Academia/trends, Artificial Intelligence/economics, Artificial Intelligence/supply & distribution, Artificial Intelligence/trends, Information Technology/economics, Information Technology/supply & distribution, Information Technology/trends, Big Data/economics, Big Data/supply & distribution
Subjects
Big Data, Citizen Science, Citizen Science/methods, Citizen Science/organization & administration, Citizen Science/trends, Reproducibility of Results, Drosophila melanogaster, Neurosciences/methods, Neurosciences/organization & administration, Neurosciences/trends, Crowdsourcing/methods, Crowdsourcing/trends, Animals
ABSTRACT
Machine learning approaches are increasingly used to extract patterns and insights from the ever-increasing stream of geospatial data, but current approaches may not be optimal when system behaviour is dominated by spatial or temporal context. Here, rather than amending classical machine learning, we argue that these contextual cues should be used as part of deep learning (an approach that is able to extract spatio-temporal features automatically) to gain further process understanding of Earth system science problems, improving the predictive ability of seasonal forecasting and modelling of long-range spatial connections across multiple timescales, for example. The next step will be a hybrid modelling approach, coupling physical process models with the versatility of data-driven machine learning.
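As a schematic of the hybrid modelling idea in the final sentence (our notation, purely illustrative): let the physical process model carry the known dynamics and let a learned component absorb what the physics misses,

\[
\hat{y}(t) = f_{\mathrm{phys}}\big(x(t); \theta\big) + g_{\mathrm{ML}}\big(x(t), c(t); w\big),
\]

where \(f_{\mathrm{phys}}\) encodes process equations or conservation laws with parameters \(\theta\), \(g_{\mathrm{ML}}\) is a data-driven term (for example, a deep network) fitted to the residuals, and \(c(t)\) is the spatial or temporal context emphasized in the text. Coupling the two keeps predictions physically plausible while letting the learned term capture unresolved processes.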
Subjects
Big Data, Computer Simulation, Deep Learning, Earth Sciences/methods, Forecasting/methods, Pattern Recognition, Automated/methods, Facial Recognition, Female, Geographic Mapping, Humans, Knowledge, Regression (Psychology), Reproducibility of Results, Seasons, Spatio-Temporal Analysis, Time Factors, Translating, Uncertainty, Weather
ABSTRACT
Many examples of the use of real-world data in pharmacoepidemiology involve "big data," such as insurance claims, medical records, or hospital discharge databases. However, "big" is not always better, particularly when studying outcomes with narrow windows of etiologic relevance. Birth defects are such an outcome, for which specificity of exposure timing is critical. Studies with primary data collection can be designed to query details about the timing of medication use, as well as its type, dose, frequency, duration, and indication, thereby better characterizing the "real world." Because birth defects are rare, etiologic studies are typically case-control in design, like the National Birth Defects Prevention Study, Birth Defects Study to Evaluate Pregnancy Exposures, and Slone Birth Defects Study. Recall bias can be a concern, but the ability to collect detailed information about both prescription and over-the-counter medication use, and about other exposures such as diet, family history, and sociodemographic factors, is a distinct advantage over claims and medical record data sources. Case-control studies with primary data collection are essential to advancing the pharmacoepidemiology of birth defects. This article is part of a Special Collection on Pharmacoepidemiology.
Subjects
Congenital Abnormalities, Pharmacoepidemiology, Humans, Pregnancy, Female, Pharmacoepidemiology/methods, Congenital Abnormalities/epidemiology, Big Data, Abnormalities, Drug-Induced/epidemiology, Data Collection/methods, Case-Control Studies
ABSTRACT
PURPOSE: Immune checkpoint inhibitors (ICIs) have significantly improved the survival of patients with cancer and provided long-term durable benefit. However, ICI-treated patients develop a range of toxicities known as immune-related adverse events (irAEs), which can compromise the clinical benefits of these treatments. As the incidence and spectrum of irAEs differ across cancer types and ICI agents, it is imperative to characterize them in a pan-cancer cohort to aid clinical management. DESIGN: We queried >400,000 trials registered at ClinicalTrials.gov and retrieved a comprehensive pan-cancer database of 71,087 ICI-treated participants spanning 19 cancer types and 7 ICI agents. We harmonized and cleaned these trial results into 293 adverse event categories using the Medical Dictionary for Regulatory Activities. RESULTS: We developed irAExplorer (https://irae.tanlab.org), an interactive database focused on adverse events in patients administered ICIs, built by mining these big data. irAExplorer encompasses 71,087 distinct clinical trial participants from 343 clinical trials across 19 cancer types, with well-annotated ICI treatment regimens and harmonized adverse event categories. We demonstrated a few of the irAE analyses through irAExplorer and highlighted several treatment- and cancer-specific irAE associations. CONCLUSION: irAExplorer is a user-friendly resource for exploration, validation, and discovery of treatment- or cancer-specific irAEs across pan-cancer cohorts. We envision that irAExplorer can serve as a valuable resource for cross-validating users' internal datasets to increase the robustness of their findings.