Results 1 - 20 of 3,658
1.
BMC Bioinformatics ; 21(Suppl 5): 250, 2020 Oct 26.
Article in English | MEDLINE | ID: mdl-33106154

ABSTRACT

Biological contextual information helps understand various phenomena occurring in the biological systems consisting of complex molecular relations. The construction of context-specific relational resources vastly relies on laborious manual extraction from unstructured literature. In this paper, we propose COMMODAR, a machine learning-based literature mining framework for context-specific molecular relations using multimodal representations. The main idea of COMMODAR is the feature augmentation by the cooperation of multimodal representations for relation extraction. We leveraged biomedical domain knowledge as well as canonical linguistic information for more comprehensive representations of textual sources. The models based on multiple modalities outperformed those solely based on the linguistic modality. We applied COMMODAR to the 14 million PubMed abstracts and extracted 9214 context-specific molecular relations. All corpora, extracted data, evaluation results, and the implementation code are downloadable at https://github.com/jae-hyun-lee/commodar . CCS CONCEPTS: • Computing methodologies~Information extraction • Computing methodologies~Neural networks • Applied computing~Biological networks.


Subjects
Data Mining/methods; Machine Learning; PubMed; Publications
2.
PLoS One ; 15(10): e0237658, 2020.
Article in English | MEDLINE | ID: mdl-33057328

ABSTRACT

Breast cancer is the most common invasive cancer and the second leading cause of cancer death in women, and regrettably, this rate is increasing every year. One of the aspects of all cancers, including breast cancer, is the recurrence of the disease, which causes painful consequences to the patients. Moreover, the practical application of data mining in the field of breast cancer can help to provide some necessary information and knowledge required by physicians for accurate prediction of breast cancer recurrence and better decision-making. The main objective of this study is to compare different data mining algorithms to select the most accurate model for predicting breast cancer recurrence. This cross-sectional study gathered data from June 2018 to June 2019 from the official statistics of the Ministry of Health and Medical Education and the Iran Cancer Research Center for patients with breast cancer who had been followed for a minimum of 5 years from February 2014 to April 2019, comprising 5471 independent records. After initial pre-processing of the dataset and variables, seven new and conventional data mining algorithms were applied, each representing one kind of data mining approach. Results show that the C5.0 algorithm could possibly be a helpful tool for the prediction of breast cancer recurrence at the stage of distant recurrence and nonrecurrence, especially in the first to third years. Also, LN involvement rate, HER2 value, tumor size, and free or closed tumor margin were found to be the most important features in our dataset to predict breast cancer recurrence.


Subjects
Algorithms; Breast Neoplasms/pathology; Data Mining/methods; Neoplasm Recurrence, Local/pathology; Breast Neoplasms/metabolism; Cross-Sectional Studies; Databases, Factual; Decision Trees; Female; Humans; Interatrial Block; Iran; Lymphatic Metastasis/pathology; Models, Biological; Neoplasm Recurrence, Local/etiology; Neoplasm Recurrence, Local/metabolism; Neural Networks, Computer; Receptor, ErbB-2/metabolism; Support Vector Machine
3.
PLoS One ; 15(10): e0241167, 2020.
Article in English | MEDLINE | ID: mdl-33095814

ABSTRACT

Understanding the influence of COVID-19 on China's agricultural economy and the Chinese government's emergency measures to ease the economic impacts of viral spread can offer urgently needed lessons while this virus continues to spread across the globe. Thus, this study collected over 750,000 words on the topic of COVID-19 and agriculture from the two largest media channels in China, WeChat and Sina Weibo, and employed web crawler technology and text mining methods to explore the influence of COVID-19 on the agricultural economy and the mitigation measures in China. The results show that: (1) the impact of COVID-19 on China's agricultural economy in the very first phase is mainly reflected in eight aspects: crop production, agricultural products supply, livestock production, farmers' income and employment, economic crop development, agricultural products sales model, leisure agriculture development, and agricultural products trade. (2) The government's immediate countermeasures include resuming agricultural production and farmers' work, providing financial support, stabilizing agricultural production and products supply, promoting agricultural products sales, providing subsidies, providing agricultural technology guidance and field management, and providing assistance to poor farmers to reduce poverty. (3) The order of the government's immediate countermeasures is not fully in line with the order of the impact aspects, which indicates that more tailored policies should be implemented to mitigate the impacts of COVID-19 on China's agricultural economy in the future.


Subjects
Betacoronavirus; Coronavirus Infections/economics; Coronavirus Infections/epidemiology; Crop Production/economics; Data Mining/methods; Farms/economics; Government Regulation; Pandemics/economics; Pneumonia, Viral/economics; Pneumonia, Viral/epidemiology; Animals; China/epidemiology; Coronavirus Infections/virology; Crop Production/legislation & jurisprudence; Economic Development/legislation & jurisprudence; Employment/legislation & jurisprudence; Farmers/legislation & jurisprudence; Farms/legislation & jurisprudence; Financial Support; Humans; Livestock; Pneumonia, Viral/virology; Social Media
4.
Article in English | MEDLINE | ID: mdl-32872350

ABSTRACT

Emergency room processes are often exposed to the risk of unexpected factors, and process management based on performance measurements is required because of its connection to the quality of care. Accordingly, there have been several attempts to propose methods for analyzing emergency room processes. This paper proposes a framework for process performance indicators utilized in emergency rooms. Based on the devil's quadrangle, i.e., time, cost, quality, and flexibility, the paper suggests multiple process performance indicators that can be analyzed using clinical event logs and verifies them through a thorough discussion with clinical experts in the emergency department. A case study is conducted with real-life clinical data collected from a tertiary hospital in Korea to validate the proposed method. The case study demonstrated that the proposed indicators can be readily applied using the clinical data, and that the framework is capable of characterizing the performance of emergency room processes.


Subjects
Data Mining/methods; Emergency Service, Hospital; Process Assessment, Health Care/methods; Quality Indicators, Health Care; Quality of Health Care; Hospital Information Systems; Humans; Models, Organizational; Republic of Korea; Workflow
5.
J Stroke Cerebrovasc Dis ; 29(10): 105151, 2020 Oct.
Article in English | MEDLINE | ID: mdl-32912531

ABSTRACT

BACKGROUND: Understanding and improving EMS stroke care requires linking data from both the prehospital and hospital settings. In the US, such data is collected in separate de-identified registries that cannot be directly linked due to lack of a common, unique patient identifier. In the absence of unique patient identifiers two common approaches to linking databases are deterministic matching, which uses combinations of non-unique matching variables to define matches, and probabilistic matching, which generates estimates of match probability based on the degree of similarity between records. This analysis seeks to compare these two approaches for matching EMS and stroke registry data. METHODS: Stroke cases transported by EMS to Michigan hospitals participating in the Michigan Coverdell Acute Stroke Registry were linked to records from Michigan's EMS Information System (MI-EMSIS) between January 2018 and June 2019. Destination hospital, date-of-service, patient age, date-of-birth, and sex were used to perform deterministic and probabilistic linkages. Match rates and representativeness of the matched samples were compared between the two matching strategies. Multivariable logistic regression was used to identify characteristics associated with successful matching. RESULTS: During the 18-month study period there were 8,828 EMS transported confirmed stroke cases in the registry and 620,907 EMS transports to 38 Coverdell registry-participating hospitals. The probabilistic match linked 5985 (67.7%) strokes to EMS records; the deterministic match linked 4012 (45.5%). Within each strategy the characteristics of matched and unmatched cases were similar, with the exception that deterministically matched cases were less likely to be older than 89 (adjusted odds ratio [aOR]=0.3), white (aOR=0.8), and more likely to have subarachnoid hemorrhage (aOR=1.4) than unmatched cases. 
CONCLUSION: Probabilistic matching resulted in higher match rates and a more representative sample of EMS transported strokes, suggesting it may be superior in assessing EMS stroke care compared to a deterministic approach.
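
The difference between the two linkage strategies described above can be sketched in a few lines of Python. This is only an illustration, not the study's implementation: the field names mirror the matching variables listed in the abstract, but the agreement probabilities, the threshold, the example records, and all function names are assumptions made for this example. The probabilistic score follows the general Fellegi-Sunter idea of summing log-likelihood ratios per field.

```python
import math

# Matching variables named in the abstract; everything else is hypothetical.
FIELDS = ("hospital", "service_date", "age", "birth_date", "sex")

# Assumed m-probabilities (field agrees given a true match) and
# u-probabilities (field agrees by chance given a non-match).
M_PROB = {"hospital": 0.98, "service_date": 0.95, "age": 0.90,
          "birth_date": 0.90, "sex": 0.97}
U_PROB = {"hospital": 0.03, "service_date": 0.002, "age": 0.02,
          "birth_date": 0.0001, "sex": 0.5}


def deterministic_match(a, b):
    """Declare a match only if every matching variable is present and equal."""
    return all(a[f] is not None and a[f] == b[f] for f in FIELDS)


def match_weight(a, b):
    """Sum Fellegi-Sunter style log2 likelihood ratios over the fields.

    Missing values contribute nothing instead of counting as disagreement.
    """
    total = 0.0
    for f in FIELDS:
        if a[f] is None or b[f] is None:
            continue
        if a[f] == b[f]:
            total += math.log2(M_PROB[f] / U_PROB[f])
        else:
            total += math.log2((1 - M_PROB[f]) / (1 - U_PROB[f]))
    return total


def probabilistic_match(a, b, threshold=10.0):
    """Declare a match when the summed weight reaches an assumed threshold."""
    return match_weight(a, b) >= threshold


registry = {"hospital": "H01", "service_date": "2018-03-04", "age": 71,
            "birth_date": "1947-01-15", "sex": "F"}
# Same hypothetical patient, but the EMS record lacks the date of birth.
ems = dict(registry, birth_date=None)

print(deterministic_match(registry, ems))   # False
print(probabilistic_match(registry, ems))   # True
```

In this toy case a single missing field defeats the exact-agreement rule, while the weighted score still clears the threshold, which is one plausible reason a probabilistic approach can link more records.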


Subjects
Data Mining/methods; Emergency Medical Services/standards; Emergency Service, Hospital/standards; Medical Record Linkage; Quality Improvement/standards; Quality Indicators, Health Care/standards; Stroke/therapy; Aged; Aged, 80 and over; Ambulances/standards; Female; Humans; Male; Michigan; Middle Aged; Probability; Registries; Stroke/diagnosis; Treatment Outcome
6.
Arch Cardiovasc Dis ; 113(10): 630-641, 2020 Oct.
Article in English | MEDLINE | ID: mdl-32888873

ABSTRACT

BACKGROUND: Pulmonary hypertension (PH) is a heterogeneous, severe and progressive disease with an impact on quality of life and life expectancy despite specific therapies. AIMS: (i) to compare the prognostic significance of each PH subgroup in a cohort from a referral center, (ii) to identify phenotypically distinct high-risk PH patients using machine learning. METHODS: Patients with PH were included from 2002 to 2019 and routinely followed up. We collected clinical, laboratory, imaging and hemodynamic variables. The four-year survival rate of each subgroup was then compared. Next, phenotypic domains were imputed with 5 eigenvectors for missing values and filtered if the Pearson correlation coefficient was >0.6. Thereafter, agglomerative hierarchical clustering was used for grouping phenotypic variables and patients: a heat map was generated and participants were separated using Penalized Model-Based Clustering. P<0.05 was considered significant. RESULTS: 328 patients were prospectively included (mean age 63±18 years, 46% male). PH secondary to left heart disease (PH-LHD) and lung disease (PH-LD) had a significantly increased mortality compared to pulmonary arterial hypertension (PAH) patients: HR=2.43, 95%CI=(1.24-4.73) and 2.95, 95%CI=(1.43-6.07) respectively. 25 phenotypic domains were pinpointed and 3 phenogroups identified. Phenogroup 3 had a significantly increased mortality (log-rank P=0.046) compared to the others and was remarkable for predominant pulmonary disease in older males, accumulating cardiovascular risk factors, and three simultaneous major comorbidities: coronary artery disease, chronic kidney disease and interstitial lung disease. CONCLUSION: PH-LHD and PH-LD have a 2-fold and 3-fold increase in mortality, respectively, compared with PAH. PH patients with simultaneous kidney-cardiac-pulmonary comorbidities were identified as having a high risk of mortality. Specific targeted therapy in this phenogroup should be prospectively evaluated.
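
The correlation-based filtering step mentioned in the methods can be illustrated with a small pure-Python sketch. The abstract does not say how one of two correlated variables was chosen for removal, nor whether the absolute value of the coefficient was used; the greedy keep-first rule, the use of |r|, the variable names, and the toy values below are all assumptions for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def filter_correlated(columns, threshold=0.6):
    """Greedily keep variables whose |r| with every kept variable is <= threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy data: variable_b is nearly a rescaled copy of variable_a.
data = {
    "variable_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "variable_b": [2.1, 3.9, 6.2, 8.0, 9.8],
    "variable_c": [3.0, 1.0, 4.0, 1.0, 5.0],
}
print(filter_correlated(data))  # ['variable_a', 'variable_c']
```

The surviving variables would then be passed on to the clustering step; that part is not sketched here.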


Subjects
Data Mining/methods; Heart Diseases/epidemiology; Kidney Diseases/epidemiology; Lung Diseases, Interstitial/epidemiology; Machine Learning; Pulmonary Arterial Hypertension/epidemiology; Aged; Aged, 80 and over; Cluster Analysis; Comorbidity; Female; France/epidemiology; Health Status; Heart Diseases/diagnosis; Heart Diseases/mortality; Heart Diseases/physiopathology; Humans; Kidney Diseases/diagnosis; Kidney Diseases/mortality; Kidney Diseases/physiopathology; Lung Diseases, Interstitial/diagnosis; Lung Diseases, Interstitial/mortality; Lung Diseases, Interstitial/physiopathology; Male; Middle Aged; Phenotype; Prognosis; Prospective Studies; Pulmonary Arterial Hypertension/diagnosis; Pulmonary Arterial Hypertension/mortality; Pulmonary Arterial Hypertension/physiopathology; Registries; Risk Assessment; Risk Factors
7.
BMC Bioinformatics ; 21(Suppl 11): 228, 2020 Sep 14.
Article in English | MEDLINE | ID: mdl-32921303

ABSTRACT

BACKGROUND: The rapid growth of scientific literature has rendered the task of finding relevant information one of the critical problems in almost any research. Search engines, like Google Scholar, Web of Knowledge, PubMed, Scopus, and others, are highly effective in document search; however, they do not allow knowledge extraction. In contrast to the search engines, text-mining systems provide extraction of knowledge with representations in the form of semantic networks. Of particular interest are tools performing a full cycle of knowledge management and engineering, including automated retrieval, integration, and representation of knowledge in the form of semantic networks, their visualization, and analysis. STRING, Pathway Studio, MetaCore, and others are well-known examples of such products. Previously, we developed the Associative Network Discovery System (ANDSystem), which also implements such a cycle. However, the drawback of these systems is dependence on the employed ontologies describing the subject area, which limits their functionality in searching information based on user-specified queries. RESULTS: The ANDDigest system is a new web-based module of the ANDSystem tool, permitting searching within PubMed by using dictionaries from the ANDSystem tool and sets of user-defined keywords. ANDDigest allows performing the search based on complex queries simultaneously, taking into account many types of objects from the ANDSystem's ontology. The system has a user-friendly interface, providing sorting, visualization, and filtering of the found information, including mapping of mentioned objects in text, linking to external databases, sorting of data by publication date, citations number, journal H-indices, etc. The system provides data on trends for identified entities based on dynamics of interest according to the frequency of their mentions in PubMed by years. 
CONCLUSIONS: The main feature of ANDDigest is its functionality, serving as a specialized search for information about multiple associative relationships of objects from the ANDSystem's ontology vocabularies, taking into account user-specified keywords. The tool can be applied to the interpretation of experimental genetics data, the search for associations between molecular genetics objects, and the preparation of scientific and analytical reviews. It is presently available at https://anddigest.sysbio.ru/ .


Subjects
Data Mining/methods; Internet; Software; Databases, Factual; PubMed
8.
Toxicol Appl Pharmacol ; 406: 115237, 2020 Nov 01.
Article in English | MEDLINE | ID: mdl-32920000

ABSTRACT

Improvement in the clinical condition of COVID-19 patients was seen in studies where a combination of the antiretroviral drugs lopinavir and ritonavir, as well as the immunomodulatory antimalarial chloroquine/hydroxychloroquine together with the macrolide-type antibiotic azithromycin, were used for treatment. Although these drugs are "old", their pharmacological and toxicological profiles in SARS-CoV-2-infected patients are still unknown. Thus, by using an in silico toxicogenomic data-mining approach, we aimed to assess both the risks and benefits of COVID-19 treatment with the most promising candidate drug combinations: lopinavir/ritonavir and chloroquine/hydroxychloroquine + azithromycin. The Comparative Toxicogenomics Database (CTD; http://CTD.mdibl.org), Cytoscape software (https://cytoscape.org) and ToppGene Suite portal (https://toppgene.cchmc.org) served as a foundation in our research. Our results have demonstrated that lopinavir/ritonavir increased the expression of the genes involved in immune response and lipid metabolism (IL6, ICAM1, CCL2, TNF, APOA1, etc.). Chloroquine/hydroxychloroquine + azithromycin interacted with 6 genes (CCL2, CTSB, CXCL8, IL1B, IL6 and TNF), whereas chloroquine and azithromycin affected two additional genes (BCL2L1 and CYP3A4), which might be a reason behind a greater number of consequential diseases. In contrast to lopinavir/ritonavir, chloroquine/hydroxychloroquine + azithromycin downregulated the expression of TNF and IL6. As expected, inflammation, cardiotoxicity, and dyslipidaemias were revealed as the main risks of lopinavir/ritonavir treatment, while chloroquine/hydroxychloroquine + azithromycin therapy was additionally linked to gastrointestinal and skin diseases. According to our results, these drug combinations should be administered with caution to patients suffering from cardiovascular problems, autoimmune diseases, or acquired and hereditary lipid disorders.


Subjects
Betacoronavirus; Computer Simulation; Data Mining/methods; Toxicogenetics/methods; Antiviral Agents/administration & dosage; Antiviral Agents/adverse effects; Azithromycin/administration & dosage; Azithromycin/adverse effects; Chloroquine/administration & dosage; Chloroquine/adverse effects; Coronavirus Infections/drug therapy; Coronavirus Infections/genetics; Databases, Genetic; Drug Therapy, Combination; Gene Regulatory Networks/drug effects; Gene Regulatory Networks/genetics; Humans; Hydroxychloroquine/administration & dosage; Hydroxychloroquine/adverse effects; Lopinavir/administration & dosage; Lopinavir/adverse effects; Pandemics; Pneumonia, Viral/drug therapy; Pneumonia, Viral/genetics; Ritonavir/administration & dosage; Ritonavir/adverse effects
9.
Medicine (Baltimore) ; 99(35): e21989, 2020 Aug 28.
Article in English | MEDLINE | ID: mdl-32871949

ABSTRACT

BACKGROUND: Bipolar disorder (BD), a common kind of mood disorder with frequent recurrence, high rates of additional comorbid conditions and poor compliance, has an unclear pathogenesis. The Gene Expression Omnibus (GEO) database is a gene expression database created and maintained by the National Center for Biotechnology Information. Researchers can download expression data online for bioinformatics analysis, especially for cancer research. However, there is little research on the use of such bioinformatics analysis methodologies for mental illness by downloading differential expression data from the GEO database. METHODS: Publicly available data were downloaded from the GEO database (GSE12649, GSE5388 and GSE5389), and differentially expressed genes (DEGs) were extracted by using the online tool GEO2R. A Venn diagram was used to screen out common DEGs between postmortem brain tissues and normal tissues. Functional annotation and pathway enrichment analysis of DEGs were performed by using Gene Ontology and Kyoto Encyclopedia of Genes and Genomes analyses, respectively. Furthermore, a protein-protein interaction network was constructed to identify hub genes. RESULTS: A total of 289 DEGs were found, among which 5 of 10 hub genes [HSP90AA1, HSP90AB1, UBE2N, UBE3A, and CUL1] were identified as susceptibility genes whose expression was downregulated. Gene Ontology and Kyoto Encyclopedia of Genes and Genomes analyses showed that variations in these 5 hub genes were clearly enriched in protein folding, protein polyubiquitination, apoptotic process, protein binding, the ubiquitin-mediated proteolysis pathway, and the protein processing in the endoplasmic reticulum pathway. These findings strongly suggested that HSP90AA1, UBE3A, and CUL1, which had large areas under the curve in receiver operating characteristic curves (P < .05), were potential diagnostic markers for BD.
CONCLUSION: Although there are 3 hub genes [HSP90AA1, UBE3A, and CUL1] that are tightly correlated with the occurrence of BD, mainly based on routine bioinformatics methods for cancer-related disease, the feasibility of applying this single GEO bioinformatics approach for mental illness is questionable, given the significant differences between mental illness and cancer-related diseases.
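
The Venn-diagram screening step in the methods amounts to intersecting the DEG lists obtained from each series. A minimal sketch follows; the series identifiers come from the abstract, but the gene symbols are invented placeholders and do not reflect the genes actually reported for these datasets.

```python
# Placeholder DEG lists: the gene symbols below are invented for illustration
# and are NOT the genes reported for these GEO series.
deg_lists = {
    "GSE12649": {"GENE_A", "GENE_B", "GENE_C", "GENE_D"},
    "GSE5388": {"GENE_B", "GENE_C", "GENE_D", "GENE_E"},
    "GSE5389": {"GENE_C", "GENE_D", "GENE_F"},
}

# The central region of a three-set Venn diagram is the intersection of all sets.
common_degs = set.intersection(*deg_lists.values())
print(sorted(common_degs))  # ['GENE_C', 'GENE_D']
```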


Subjects
Bipolar Disorder/genetics; Cullin Proteins/genetics; Data Mining/methods; HSP90 Heat-Shock Proteins/genetics; Ubiquitin-Protein Ligases/genetics; Adolescent; Adult; Aged; Aged, 80 and over; Child; Child, Preschool; Female; Genetic Predisposition to Disease; Humans; Infant; Male; Middle Aged; Neoplasms/genetics; Protein Interaction Maps; Young Adult
10.
Medicine (Baltimore) ; 99(39): e22172, 2020 Sep 25.
Article in English | MEDLINE | ID: mdl-32991410

ABSTRACT

Osteoporosis is a severe chronic skeletal disorder that increases the risks of disability and mortality; however, the mechanism of this disease and the protein markers for prognosis of osteoporosis have not been well characterized. This study aims to characterize the imbalanced serum proteostasis, the disturbed pathways, and potential serum markers in osteoporosis by using a set of bioinformatic analyses. In the present study, the large-scale proteomics datasets (PXD006464) were obtained from the ProteomeXchange database and processed with MaxQuant. The differentially expressed serum proteins were identified. The biological processes and molecular functions were analyzed. The protein-protein interactions and subnetwork modules were constructed. The signaling pathways were enriched. We identified 209 upregulated and 230 downregulated serum proteins. The bioinformatic analyses revealed highly overlapping functional protein classifications and gene ontology terms between the upregulated and downregulated protein groups. Protein-protein interaction and pathway analyses showed a high enrichment in protein synthesis, inflammation, and immune response in the upregulated proteins, and in cell adhesion and cytoskeleton regulation in the downregulated proteins. Our findings greatly expand the current view of the roles of serum proteins in osteoporosis and shed light on the understanding of its underlying mechanisms and the discovery of serum proteins as potential markers for the prognosis of osteoporosis.


Subjects
Data Mining/methods; Osteoporosis/blood; Proteome/physiology; Biomarkers; Cell Adhesion/physiology; Computational Biology; Cytoskeleton/metabolism; Down-Regulation; Humans; Inflammation Mediators/metabolism; Protein Interaction Maps/physiology; Proteomics; Up-Regulation
11.
BMC Bioinformatics ; 21(1): 414, 2020 Sep 22.
Article in English | MEDLINE | ID: mdl-32962627

ABSTRACT

BACKGROUND: Gene selection refers to finding a small subset of discriminant genes from gene expression profiles. How to effectively select genes that affect specific phenotypic traits is an important research problem in biology. Neural networks have better fitting ability when dealing with nonlinear data, and they can capture features automatically and flexibly. In this work, we propose an embedded gene selection method using a neural network. The important genes can be obtained by calculating the weight coefficients after training is completed. To address the black-box problem of neural networks and make the training results more interpretable, we use the idea of knockoffs to construct knockoff feature genes of the original feature genes. This method not only makes the feature genes compete with each other, but also makes each feature gene compete with its knockoff feature gene. This approach can help to select the key genes that affect the decision-making of neural networks. RESULTS: We use maize carotenoid, tocopherol methyltransferase, raffinose family oligosaccharide and human breast cancer datasets for verification and analysis. CONCLUSIONS: The experimental results demonstrate that the knockoffs-optimized neural network method has a better detection effect than other existing algorithms, especially when processing nonlinear gene expression and phenotype data.


Subjects
Data Mining/methods; Neural Networks, Computer; Transcriptome; Breast Neoplasms/genetics; Computational Biology/methods; Female; Gene Expression Regulation; Humans; Zea mays/enzymology; Zea mays/genetics; Zea mays/metabolism
12.
PLoS One ; 15(9): e0232644, 2020.
Article in English | MEDLINE | ID: mdl-32877430

ABSTRACT

When trying to identify new potential therapeutic protein targets, access to data and knowledge is increasingly important. In a field where new resources and data sources become available every day, it is crucial to be able to take a step back and look at the wider picture in order to identify potential drug targets. While this task is routinely performed by bespoke literature searches, it is often time-consuming and lacks uniformity when comparing multiple targets at one time. To address this challenge, we developed TargetDB, a tool that aggregates public information available on given target(s) (links to disease, safety, 3D structures, ligandability, novelty, etc.) and assembles it in an easy-to-read output ready for the researcher to analyze. In addition, we developed a target scoring system based on the desirable attributes of good therapeutic targets and a machine learning classification system to categorize novel targets as having promising or challenging tractability. In this manuscript, we present the methodology used to develop TargetDB as well as test cases.


Subjects
Data Mining/methods; Databases as Topic; Algorithms; Animals; Disease; Drug Development; Humans; Machine Learning; Mice; Models, Chemical; Proteins; Software
13.
J Stroke Cerebrovasc Dis ; 29(9): 105042, 2020 Sep.
Article in English | MEDLINE | ID: mdl-32807454

ABSTRACT

BACKGROUND: Text mining with automatic extraction of key features is gaining increasing importance in science and particularly medicine due to the rapidly increasing number of publications. OBJECTIVES: Here we evaluate the current potential of sentiment analysis and machine learning to extract the importance of the reported results and conclusions of randomized trials on stroke. METHODS: PubMed abstracts of 200 recent reports of randomized trials were reviewed and manually classified according to the estimated importance of the studies. Importance of the papers was classified as "game changer", "suggestive", "maybe", or "negative result". Algorithmic sentiment analysis was subsequently used on both the "Results" and the "Conclusions" paragraphs, resulting in a numerical output for polarity and subjectivity. The result of the human assessment was then compared to polarity and subjectivity. In addition, a neural network using the Keras platform built on Tensorflow and Python was trained to map the "Results" and "Conclusions" to the dichotomized human assessment (1: "game changer" or "suggestive"; 0: "maybe" or "negative", or no results reported). 120 abstracts were used as the training set and 80 as the test set. RESULTS: 9 out of the 200 reports were classified manually as "game changer", 40 as "suggestive", 73 as "maybe" and 32 as "negative"; 46 abstracts did not contain any results. Polarity was generally higher for the "Conclusions" than for the "Results". Polarity was highest for the "Conclusions" classified as "suggestive". Subjectivity was also higher in the classes "suggestive" and "maybe" than in the classes "game changer" and "negative". The trained neural network provided a correct dichotomized output with an accuracy of 71% based on the "Results" and 73% based on the "Conclusions". CONCLUSIONS: Current statistical approaches to text analysis can grasp the impact of scientific medical abstracts to a certain degree.
Sentiment analysis showed that mediocre results are apparently written in more enthusiastic words than clearly positive or negative results.
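
The dichotomization described in the methods, and the accuracy metric used to report the network's performance, can be made concrete with a short sketch. The class counts are taken from the abstract; the helper names and the four example predictions are hypothetical and only demonstrate how the metric is computed, not how the authors' model behaved.

```python
# Manual classification counts as reported in the abstract.
counts = {
    "game changer": 9,
    "suggestive": 40,
    "maybe": 73,
    "negative": 32,
    "no results": 46,
}


def dichotomize(label):
    """Map a manual class to the binary target described in the abstract."""
    return 1 if label in ("game changer", "suggestive") else 0


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the reference labels."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


total = sum(counts.values())
positives = sum(n for label, n in counts.items() if dichotomize(label) == 1)
print(total, positives, total - positives)  # 200 49 151

# Hypothetical predictions for four abstracts, purely to show the metric.
y_true = [dichotomize(c) for c in ("game changer", "maybe", "negative", "suggestive")]
y_pred = [1, 0, 1, 1]
print(accuracy(y_true, y_pred))  # 0.75
```

The counts sum to the 200 abstracts reviewed, with 49 falling in the positive class after dichotomization.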


Subjects
Abstracting and Indexing/methods; Data Mining/methods; Machine Learning; Pattern Recognition, Automated; PubMed; Randomized Controlled Trials as Topic; Deep Learning; Humans
14.
PLoS One ; 15(8): e0228520, 2020.
Article in English | MEDLINE | ID: mdl-32857775

ABSTRACT

Health advances are contingent on continuous development of new methods and approaches to foster data-driven discovery in the biomedical and clinical sciences. Open-science and team-based scientific discovery offer hope for tackling some of the difficult challenges associated with managing, modeling, and interpreting large, complex, and multisource data. Translating raw observations into useful information and actionable knowledge depends on effective domain-independent reproducibility, area-specific replicability, data curation, analysis protocols, organization, management and sharing of health-related digital objects. This study expands the functionality and utility of an ensemble semi-supervised machine learning technique called Compressive Big Data Analytics (CBDA). Applied to high-dimensional data, CBDA (1) identifies salient features and key biomarkers enabling reliable and reproducible forecasting of binary, multinomial and continuous outcomes (i.e., feature mining); and (2) suggests the most accurate algorithms/models for predictive analytics of the observed data (i.e., model mining). The method relies on iterative subsampling, combines function optimization and statistical inference, and generates ensemble predictions for observed univariate outcomes. The novelty of this study is highlighted by a new and expanded set of CBDA features including (1) efficiently handling extremely large datasets (>100,000 cases and >1,000 features); (2) generalizing the internal and external validation steps; (3) expanding the set of base-learners for joint ensemble prediction; (4) introducing an automated selection of CBDA specifications; and (5) providing mechanisms to assess CBDA convergence, evaluate the prediction accuracy, and measure result consistency. To ground the mathematical model and the corresponding computational algorithm, CBDA 2.0 validation utilizes synthetic datasets as well as a population-wide census-like study.
Specifically, an empirical validation of the CBDA technique is based on translational health research using a large-scale clinical study (UK Biobank), which includes imaging, cognitive, and clinical assessment data. The UK Biobank archive presents several difficult challenges related to the aggregation, harmonization, modeling, and interrogation of the information. These problems are related to the complex longitudinal structure, variable heterogeneity, feature multicollinearity, incongruency, and missingness, as well as violations of classical parametric assumptions. Our results show the scalability, efficiency, and usability of CBDA to distill complex data into structural information leading to derived knowledge and translational action. Applying CBDA 2.0 to the UK Biobank case-study allows predicting various outcomes of interest, e.g., mood disorders and irritability, and suggests new and exciting avenues of evidence-based research in the context of identifying, tracking, and treating mental health and aging-related diseases. Following open-science principles, we share the entire end-to-end protocol, source-code, and results. This facilitates independent validation, result reproducibility, and team-based collaborative discovery.


Subjects
Data Mining/methods , Data Science/methods , Algorithms , Big Data , Data Compression , Humans , Machine Learning , Meta-Analysis as Topic , Models, Theoretical , Physical Phenomena , Prognosis , Reproducibility of Results , Software
15.
PLoS Comput Biol ; 16(8): e1008049, 2020 08.
Article in English | MEDLINE | ID: mdl-32822341

ABSTRACT

Tissue morphogenesis relies on repeated use of dynamic behaviors at the levels of intracellular structures, individual cells, and cell groups. Rapidly accumulating live imaging datasets make it increasingly important to formalize and automate the task of mapping recurrent dynamic behaviors (motifs), as is done in speech recognition and other data mining applications. Here, we present a "template-based search" approach for accurate mapping of sub- to multi-cellular morphogenetic motifs using a time series data mining framework. We formulated the task of motif mapping as a subsequence matching problem and solved it using dynamic time warping, while relying on high-throughput graph-theoretic algorithms for efficient exploration of the search space. This formulation allows our algorithm to accurately identify the complete duration of each instance and automatically label different stages throughout its progress, such as cell cycle phases during cell division. To illustrate our approach, we mapped cell intercalations during germband extension in the early Drosophila embryo. Our framework enabled statistical analysis of intercalary cell behaviors in wild-type and mutant embryos, comparison of temporal dynamics in contracting and growing junctions in different genotypes, and the identification of a novel mode of iterative cell intercalation. Our formulation of tissue morphogenesis using time series opens new avenues for systematic decomposition of tissue morphogenesis.
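
The subsequence-matching formulation with dynamic time warping can be illustrated with a minimal sketch. This is not the authors' implementation: it omits the graph-theoretic search, works on a single one-dimensional series, and the function name `subsequence_dtw` and the toy data are assumptions made only for illustration. The first row of the cost table is set to zero so that a match may start anywhere in the series, and the best end point is the minimum of the last row.

```python
import math


def subsequence_dtw(template, series):
    """Find the stretch of `series` that best matches `template` under DTW.

    Returns (cost, start, end) with 0-based, inclusive indices into `series`.
    Illustrative sketch only; names and data are hypothetical.
    """
    n, m = len(template), len(series)
    # acc[i][j]: minimal accumulated cost of aligning template[:i] with a
    # subsequence of `series` that ends at series[j - 1].
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0] = [0.0] * (m + 1)  # free start: the match may begin anywhere
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(template[i - 1] - series[j - 1])
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])

    # Free end: pick the cheapest cell in the last row.
    end = min(range(1, m + 1), key=lambda j: acc[n][j])
    cost = acc[n][end]

    # Backtrack to the first template row to recover where the match starts.
    i, j = n, end
    while i > 1:
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
    return cost, j - 1, end - 1


# Toy example: the template occurs exactly at positions 2..4 of the series.
result = subsequence_dtw([1, 2, 3], [0, 0, 1, 2, 3, 0, 0])
```

Because warping allows repeated samples, a slowed-down instance such as `[1, 1, 2, 2, 3]` embedded in a longer series is also matched at zero cost, which is the property that lets a single template cover instances of different durations.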


Subjects
Computational Biology/methods , Image Processing, Computer-Assisted/methods , Morphogenesis/physiology , Algorithms , Animals , Cell Division/physiology , Data Mining/methods , Drosophila/cytology , Drosophila/embryology , Embryo, Nonmammalian/cytology , Embryo, Nonmammalian/embryology , Female , Male , Microscopy, Confocal , Time Factors
16.
PLoS One ; 15(8): e0236331, 2020.
Article in English | MEDLINE | ID: mdl-32756613

ABSTRACT

This paper investigates event extraction and early event classification in contiguous spatio-temporal data streams, where events need to be classified using partial information, i.e. while the event is ongoing. The framework incorporates an event extraction algorithm and an early event classification algorithm. We apply this framework to synthetic and real problems and demonstrate its reliability and broad applicability. The algorithms and data are available in the R package eventstream, and other code in the supplementary material.


Subjects
Algorithms , Data Mining/methods , Big Data , Conservation of Natural Resources , Environmental Monitoring/methods , Fiber Optic Technology/methods , Nitrogen Dioxide/analysis
17.
Int J Behav Nutr Phys Act ; 17(1): 94, 2020 07 23.
Article in English | MEDLINE | ID: mdl-32703217

ABSTRACT

PURPOSE: A data mining approach was applied to establish a multilevel hierarchy predicting physical activity (PA) behavior, and to methodologically identify the correlates of PA behavior. METHODS: Cross-sectional data from the population-based Northern Finland Birth Cohort 1966 study, collected in the most recent follow-up at age 46, were used to create a hierarchy using the chi-square automatic interaction detection (CHAID) decision tree technique for predicting PA behavior. PA behavior is defined as active or inactive based on machine-learned activity profiles, which were previously created through a multidimensional (clustering) approach on continuous accelerometer-measured activity intensities in one week. The input variables (predictors) used for decision tree fitting consisted of individual, demographical, psychological, behavioral, environmental, and physical factors. Using generalized linear mixed models, we also analyzed how factors emerging from the model were associated with three PA metrics, including daily time (minutes per day) in sedentary (SED), light PA (LPA), and moderate-to-vigorous PA (MVPA), to assure the relative importance of methodologically identified factors. RESULTS: Of the 4582 participants with valid accelerometer data at the latest follow-up, 2701 and 1881 had active and inactive profiles, respectively. We used a total of 168 factors as input variables to classify these two PA behaviors. Out of these 168 factors, the decision tree selected 36 factors of different domains from which 54 subgroups of participants were formed. 
The emerging factors from the model explained minutes per day in SED, LPA, and/or MVPA, including body fat percentage (SED: B = 26.5, LPA: B = -16.1, and MVPA: B = -11.7), normalized heart rate recovery 60 s after exercise (SED: B = -16.1, LPA: B = 9.9, and MVPA: B = 9.6), average weekday total sitting time (SED: B = 34.1, LPA: B = -25.3, and MVPA: B = -5.8), and extravagance score (SED: B = 6.3 and LPA: B = -3.7). CONCLUSIONS: Using data mining, we established a data-driven model composed of 36 different factors of relative importance from empirical data. This model may be used to identify subgroups for multilevel intervention allocation and design. Additionally, this study methodologically discovered an extensive set of factors that can be a basis for additional hypothesis testing in PA correlates research.
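
The core idea behind a CHAID-style split, choosing the predictor whose categories are most strongly associated with the outcome according to a chi-square test, can be sketched as follows. This is a simplified illustration, not the procedure used in the study: full CHAID also merges similar categories and compares adjusted p-values rather than raw statistics, and the variable names (`sitting`, `noise`, `profile`) and the four toy records are hypothetical.

```python
from collections import Counter


def chi_square(xs, ys):
    """Pearson chi-square statistic for the contingency table of xs versus ys."""
    n = len(xs)
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    observed = Counter(zip(xs, ys))
    stat = 0.0
    for r, n_r in row_totals.items():
        for c, n_c in col_totals.items():
            expected = n_r * n_c / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat


def best_split(rows, outcome, predictors):
    """Return the predictor with the largest chi-square statistic.

    Simplification: comparing raw statistics is only fair when predictors
    have the same number of categories; CHAID proper compares p-values.
    """
    ys = [row[outcome] for row in rows]
    return max(predictors, key=lambda p: chi_square([row[p] for row in rows], ys))


# Hypothetical toy records: `sitting` separates the profiles, `noise` does not.
rows = [
    {"sitting": "high", "noise": "a", "profile": "inactive"},
    {"sitting": "high", "noise": "b", "profile": "inactive"},
    {"sitting": "low", "noise": "a", "profile": "active"},
    {"sitting": "low", "noise": "b", "profile": "active"},
]
chosen = best_split(rows, "profile", ["sitting", "noise"])
```

Applying the same selection recursively within each resulting subgroup yields a tree whose leaves correspond to participant subgroups.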


Subjects
Data Mining/methods , Decision Trees , Exercise , Sedentary Behavior , Accelerometry , Adipose Tissue/physiology , Algorithms , Cross-Sectional Studies , Female , Finland/epidemiology , Follow-Up Studies , Heart Rate , Humans , Male , Middle Aged , Sitting Position , Surveys and Questionnaires
18.
PLoS One ; 15(7): e0235147, 2020.
Article in English | MEDLINE | ID: mdl-32609749

ABSTRACT

Digital datasets in several health care facilities, such as hospitals and prehospital services, have accumulated data from thousands of patients for more than a decade. In general, there is no local team with enough experts, with the different skills required, capable of analyzing them in their entirety. The integration of those abilities usually demands a relatively long period and is costly. Considering that scenario, this paper proposes a new Feature Sensitivity technique that can automatically deal with a large dataset. It uses a criterion-based sampling strategy from the Optimization based on Phylogram Analysis. Called FS-opa, the new approach seems suitable for dealing with any type of raw data from health centers and for handling their entire datasets. Besides, FS-opa can find the principal features for the construction of inference models without depending on expert knowledge of the problem domain. The selected features can be combined with usual statistical or machine learning methods to perform predictions. The new method can mine entire datasets from scratch. FS-opa was evaluated using a relatively large dataset from electronic health records of mental disorder prehospital services in Brazil. Cox's approach was integrated into FS-opa to generate survival analysis models related to the length of stay (LOS) in hospitals, assuming that it is a relevant aspect that can benefit estimates of the efficiency of hospitals and the quality of patient treatments. Since FS-opa can work with raw datasets, no knowledge from the problem domain was used to obtain the preliminary prediction models found. Results show that FS-opa succeeded in performing a feature sensitivity analysis using only the raw data available. In this way, FS-opa can find the principal features without the bias of an inference model, since the proposed method does not use one. Moreover, the experiments show that FS-opa can provide models with a useful trade-off between representativeness and parsimony. 
It can benefit further analyses by experts since they can focus on aspects that benefit problem modeling.


Subjects
Data Mining , Electronic Health Records , Mental Disorders/diagnosis , Adult , Algorithms , Brazil/epidemiology , Data Mining/methods , Datasets as Topic , Humans , Mental Disorders/epidemiology , Mental Disorders/therapy , Proportional Hazards Models
19.
Diabetes Metab Syndr ; 14(5): 1121-1132, 2020.
Article in English | MEDLINE | ID: mdl-32659695

ABSTRACT

BACKGROUND AND AIMS: Covid-19 is a global pandemic that requires a global and integrated response of all national medical and healthcare systems. Covid-19 exposed the need for timely response and data sharing on fast-spreading global pandemics. In this study, we investigate the scientific research response from the early stages of the pandemic, and we review key findings on how the early warning systems developed in previous epidemics responded to contain the virus. METHODS: We conducted data mining of scientific literature records from the Web of Science Core Collection, using the topics Covid-19, mortality, immunity, and vaccine. The individual records are analysed in isolation, and the analysis is compared with records on all Covid-19 research topics combined. The data records are analysed with computable statistical methods, including R Studio's Bibliometrix package, and the Web of Science data mining tool. RESULTS: From historical analysis of scientific data records on viruses, pandemics and mortality, we identified that Chinese universities have not been leading on these topics historically. However, during the early stages of the Covid-19 pandemic, the Chinese universities are strongly dominating the research on these topics. Despite the current political and trade disputes, we found strong collaboration in Covid-19 research between the US and China. From the analysis on Covid-19 and immunity, we wanted to identify the relationship between different risk factors discussed in the news media. We identified a few clusters, containing references to exercise, inflammation, smoking, obesity and many additional factors. From the analysis on Covid-19 and vaccine, we discovered that although the USA is leading in volume of scientific research on Covid-19 vaccine, the three leading research institutions (Fudan, Melbourne, Oxford) are not based in the USA. Hence, it is difficult to predict which country would be first to produce a Covid-19 vaccine. 
CONCLUSIONS: We analysed the conceptual structure maps with factorial analysis and multiple correspondence analysis (MCA), and identified multiple relationships between keywords, synonyms and concepts, related to Covid-19 mortality, immunity, and vaccine development. We present integrated and correlated knowledge from 276 records on Covid-19 and mortality, 71 records on Covid-19 and immunity, and 189 records on Covid-19 vaccine.


Subjects
Betacoronavirus/isolation & purification , Biomedical Research , Coronavirus Infections/mortality , Coronavirus Infections/prevention & control , Data Mining/methods , Pandemics/prevention & control , Pneumonia, Viral/mortality , Pneumonia, Viral/prevention & control , Viral Vaccines/therapeutic use , Coronavirus Infections/immunology , Coronavirus Infections/virology , Humans , Pneumonia, Viral/immunology , Pneumonia, Viral/virology
20.
PLoS One ; 15(7): e0234618, 2020.
Article in English | MEDLINE | ID: mdl-32649675

ABSTRACT

How would an inventor, entrepreneur, investor, or patent examiner quantify the extent to which the inventive claims listed in a patent document align with the patent specification? Since a specification that is poorly aligned with the inventive claims can render an invention unpatentable and can invalidate an already issued patent, an effective measure of alignment is necessary. We define a novel measure of drafting alignment using Latent Dirichlet Allocation (LDA). The measure is defined for each patent document by first identifying the latent topics underlying the claims and the specification, and then using the Hellinger distance to find the proximity between the topical coverages. We demonstrate the use of the novel measure on data-processing patent documents related to cybersecurity. The properties of the proposed measure are further investigated using exploratory data analysis, and it is shown that alignment is generally positively associated with prior patenting efforts as well as with the tendency to include figures in a document.
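
The second step of the measure, comparing two topic distributions with the Hellinger distance, can be sketched as below. The topic proportions are made-up numbers standing in for LDA output (no topic model is fitted here), and the function name `hellinger` is an assumption for illustration. For discrete distributions p and q, the distance is sqrt(0.5 * sum((sqrt(p_i) - sqrt(q_i))^2)), which is 0 for identical distributions and 1 for distributions with no shared topics.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same topics")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(0.5 * squared)


# Hypothetical topic proportions for one patent (not taken from the paper).
claims_topics = [0.70, 0.20, 0.10]
spec_topics = [0.60, 0.25, 0.15]

distance = hellinger(claims_topics, spec_topics)  # small value: close coverage
```

A smaller distance indicates that the claims and the specification emphasize the same topics in similar proportions; how the paper converts this distance into its alignment score is not stated in the abstract.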


Subjects
Data Mining/methods , Inventions/trends , Humans , Patents as Topic