Results 1 - 20 of 128
1.
Bioinformatics ; 39(5)2023 05 04.
Article in English | MEDLINE | ID: mdl-37195463

ABSTRACT

MOTIVATION: Identifying organellar DNA, such as mitochondrial or plastid sequences, inside a whole genome assembly remains challenging and requires biological background knowledge. To address this, we developed ODNA, which is based on genome annotation and machine learning. RESULTS: ODNA is software that classifies organellar DNA sequences within a genome assembly by machine learning based on a predefined genome annotation workflow. We trained our model with 829 769 DNA sequences from 405 genome assemblies and achieved high predictive performance (e.g. Matthews correlation coefficient of 0.61 for mitochondria and 0.73 for chloroplasts) on independent validation data, thus outperforming existing approaches significantly. AVAILABILITY AND IMPLEMENTATION: Our software ODNA is freely accessible as a web service at https://odna.mathematik.uni-marburg.de and can also be run in a docker container. The source code can be found at https://gitlab.com/mosga/odna and the processed data at Zenodo (DOI: 10.5281/zenodo.7506483).


Subjects
Mitochondria , Organelles , DNA Sequence Analysis , Mitochondria/genetics , Software , Machine Learning , DNA
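
Note on entry 1: the abstract reports performance as a Matthews correlation coefficient (MCC). As a reminder of what that metric measures, here is a minimal sketch of the standard MCC formula for a binary confusion matrix. The function name and the convention of returning 0.0 for a zero denominator are illustrative assumptions; nothing here is taken from the ODNA source code.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns a value in [-1, 1]; 1 is perfect agreement, 0 is chance level.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: report 0.0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

Unlike plain accuracy, MCC uses all four cells of the confusion matrix, which makes it a common choice when classes are imbalanced, as is likely when few sequences in an assembly are organellar.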
2.
Bioinformatics ; 39(11)2023 11 01.
Article in English | MEDLINE | ID: mdl-37988152

ABSTRACT

SUMMARY: Federated learning enables collaboration in medicine, where data is scattered across multiple centers, without the need to aggregate the data in a central cloud. While, in general, machine learning models can be applied to a wide range of data types, graph neural networks (GNNs) are developed specifically for graphs, which are very common in the biomedical domain. For instance, a patient can be represented by a protein-protein interaction (PPI) network where the nodes contain the patient-specific omics features. Here, we present our Ensemble-GNN software package, which can be used to deploy federated, ensemble-based GNNs in Python. Ensemble-GNN allows users to quickly build predictive models utilizing PPI networks consisting of various node features such as gene expression and/or DNA methylation. As an example, we show the results from a public dataset of 981 patients and 8469 genes from the Cancer Genome Atlas (TCGA). AVAILABILITY AND IMPLEMENTATION: The source code is available at https://github.com/pievos101/Ensemble-GNN, and the data at Zenodo (DOI: 10.5281/zenodo.8305122).


Subjects
DNA Methylation , Machine Learning , Humans , Computer Neural Networks , Protein Interaction Maps , Software
3.
Nucleic Acids Res ; 50(5): e30, 2022 03 21.
Article in English | MEDLINE | ID: mdl-34908135

ABSTRACT

The use of complex biological molecules to solve computational problems is an emerging field at the interface between biology and computer science. There are two main categories in which biological molecules, especially DNA, are investigated as alternatives to silicon-based computer technologies. One is to use DNA as a storage medium, and the other is to use DNA for computing. Both strategies come with certain constraints. In the current study, we present a novel approach derived from chaos game representation for DNA to generate DNA code words that fulfill user-defined constraints, namely GC content, homopolymers, and undesired motifs, and thus, can be used to build codes for reliable DNA storage systems.


Subjects
Computational Biology/methods , DNA , Fractals
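
Note on entry 3: the abstract names three user-defined constraints on DNA code words (GC content, homopolymers and undesired motifs). The sketch below shows how a candidate code word could be checked against such constraints. It is only a post-hoc filter, not the chaos-game-based generation method the paper describes; the function names, default thresholds and example motifs are assumptions made for illustration.

```python
def gc_content(sequence):
    """Fraction of bases in the sequence that are G or C."""
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def longest_homopolymer(sequence):
    """Length of the longest run of one repeated base (sequence must be non-empty)."""
    longest = run = 1
    for previous, current in zip(sequence, sequence[1:]):
        run = run + 1 if current == previous else 1
        longest = max(longest, run)
    return longest


def satisfies_constraints(sequence, gc_min=0.4, gc_max=0.6,
                          max_homopolymer=3, forbidden_motifs=()):
    """Check a DNA code word against the three constraint types.

    All thresholds are hypothetical defaults, not values from the paper.
    """
    if not sequence:
        return False
    if not gc_min <= gc_content(sequence) <= gc_max:
        return False
    if longest_homopolymer(sequence) > max_homopolymer:
        return False
    return not any(motif in sequence for motif in forbidden_motifs)
```

For example, `satisfies_constraints("ACGTACGT")` passes all three checks, while `"AAAACGCG"` fails on the homopolymer limit despite having a GC content of 0.5.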
4.
Gut ; 72(4): 612-623, 2023 04.
Article in English | MEDLINE | ID: mdl-35882562

ABSTRACT

OBJECTIVE: Oesophageal cancer (EC) is the sixth leading cause of cancer-related deaths. Oesophageal adenocarcinoma (EA), with Barrett's oesophagus (BE) as a precursor lesion, is the most prevalent EC subtype in the Western world. This study aims to contribute to a better understanding of the genetic causes of BE/EA by leveraging genome-wide association studies (GWAS), genetic correlation analyses and polygenic risk modelling. DESIGN: We combined data from previous GWAS with new cohorts, increasing the sample size to 16 790 BE/EA cases and 32 476 controls. We also carried out a transcriptome-wide association study (TWAS) using expression data from disease-relevant tissues to identify BE/EA candidate genes. To investigate the relationship with reported BE/EA risk factors, a linkage disequilibrium score regression (LDSR) analysis was performed. BE/EA risk models were developed combining clinical/lifestyle risk factors with polygenic risk scores (PRS) derived from the GWAS meta-analysis. RESULTS: The GWAS meta-analysis identified 27 BE and/or EA risk loci, 11 of which were novel. The TWAS identified promising BE/EA candidate genes at seven GWAS loci and at five additional risk loci. The LDSR analysis led to the identification of novel genetic correlations and pointed to differences in BE and EA aetiology. Gastro-oesophageal reflux disease appeared to contribute more strongly to the metaplastic BE transformation than to EA development. Finally, combining PRS with BE/EA risk factors improved the performance of the risk models. CONCLUSION: Our findings provide further insights into BE/EA aetiology and its relationship to risk factors. The results lay the foundation for future follow-up studies to identify underlying disease mechanisms and improve risk prediction.


Subjects
Adenocarcinoma , Barrett Esophagus , Esophageal Neoplasms , Humans , Barrett Esophagus/pathology , Genome-Wide Association Study , Esophageal Neoplasms/pathology , Adenocarcinoma/pathology
5.
Brief Bioinform ; 22(2): 642-663, 2021 03 22.
Article in English | MEDLINE | ID: mdl-33147627

ABSTRACT

SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) is a novel virus of the family Coronaviridae. The virus causes the infectious disease COVID-19. The biology of coronaviruses has been studied for many years. However, bioinformatics tools designed explicitly for SARS-CoV-2 have only recently been developed as a rapid reaction to the need for fast detection, understanding and treatment of COVID-19. To control the ongoing COVID-19 pandemic, it is of utmost importance to gain insight into the evolution and pathogenesis of the virus. In this review, we cover bioinformatics workflows and tools for the routine detection of SARS-CoV-2 infection, the reliable analysis of sequencing data, the tracking of the COVID-19 pandemic and evaluation of containment measures, the study of coronavirus evolution, the discovery of potential drug targets and the development of therapeutic strategies. For each tool, we briefly describe its use case and how it advances research specifically for SARS-CoV-2. All tools are free to use and available online, either through web applications or public code repositories. Contact: evbc@unj-jena.de.


Subjects
COVID-19/prevention & control , Computational Biology , SARS-CoV-2/isolation & purification , Biomedical Research , COVID-19/epidemiology , COVID-19/virology , Viral Genome , Humans , Pandemics , SARS-CoV-2/genetics
6.
Bioinformatics ; 38(8): 2278-2286, 2022 04 12.
Article in English | MEDLINE | ID: mdl-35139148

ABSTRACT

MOTIVATION: Limited data access has hindered the field of precision medicine from exploring its full potential, e.g. concerning machine learning and privacy and data protection rules. Our study evaluates the efficacy of federated Random Forests (FRF) models, focusing particularly on the heterogeneity within and between datasets. We addressed three common challenges: (i) number of parties, (ii) sizes of datasets and (iii) imbalanced phenotypes, evaluated on five biomedical datasets. RESULTS: The FRF outperformed the average local models and performed comparably to the data-centralized models trained on the entire data. With an increasing number of models and decreasing dataset size, the performance of local models decreases drastically. The performance of the FRF, however, does not decrease significantly. When combining datasets of different sizes, the FRF vastly improve compared to the average local models. By analyzing different class imbalances, we demonstrate that the FRF remain more robust and outperform the local models. Our results support that FRF overcome boundaries of clinical research and enable collaborations across institutes without violating privacy or legal regulations. Clinicians benefit from a vast collection of unbiased data aggregated from different geographic locations, demographics and other varying factors. They can build more generalizable models to make better clinical decisions, which will have relevance, especially for patients in rural areas and rare or geographically uncommon diseases, enabling personalized treatment. In combination with secure multi-party computation, federated learning has the power to revolutionize clinical practice by increasing the accuracy and robustness of healthcare AI and thus paving the way for precision medicine. AVAILABILITY AND IMPLEMENTATION: The implementation of the federated random forests can be found at https://featurecloud.ai/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Privacy , Random Forest , Machine Learning , Precision Medicine , Delivery of Health Care
7.
Bioinformatics ; 38(2): 325-334, 2022 01 03.
Article in English | MEDLINE | ID: mdl-34613360

ABSTRACT

MOTIVATION: Antimicrobial resistance (AMR) is one of the biggest global problems threatening human and animal health. Rapid and accurate AMR diagnostic methods are thus urgently needed. However, traditional antimicrobial susceptibility testing (AST) is time-consuming, low throughput and viable only for cultivable bacteria. Machine learning methods may pave the way for automated AMR prediction based on genomic data of the bacteria. However, a comparison of different machine learning methods for the prediction of AMR based on different encodings and whole-genome sequencing data without prior knowledge remains to be done. RESULTS: In this study, we evaluated logistic regression (LR), support vector machine (SVM), random forest (RF) and convolutional neural network (CNN) for the prediction of AMR for the antibiotics ciprofloxacin, cefotaxime, ceftazidime and gentamicin. We demonstrated that these models can effectively predict AMR with label encoding, one-hot encoding and frequency matrix chaos game representation (FCGR encoding) on whole-genome sequencing data. We trained these models on a large AMR dataset and evaluated them on an independent public dataset. Generally, RFs and CNNs perform better than LR and SVM with AUCs up to 0.96. Furthermore, we were able to identify mutations that are associated with AMR for each antibiotic. AVAILABILITY AND IMPLEMENTATION: Source code for data preparation and model training is provided on GitHub (https://github.com/YunxiaoRen/ML-iAMR). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Anti-Bacterial Agents , Bacterial Drug Resistance , Animals , Humans , Anti-Bacterial Agents/pharmacology , Bacterial Drug Resistance/genetics , Ciprofloxacin , Machine Learning , Genomics , Bacteria/genetics
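
Note on entry 7: one of the encodings named in the abstract is the frequency matrix chaos game representation (FCGR), which turns a DNA sequence into a 2^k x 2^k matrix of k-mer counts that can be treated as an image. The following is a minimal sketch of that idea, not the ML-iAMR implementation. The corner assignment (A, C, G, T), the row/column orientation and the decision to skip k-mers with ambiguous bases are all assumptions made for illustration.

```python
# Assumed corner layout of the unit square: base -> (x_bit, y_bit).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(sequence, k):
    """Count every k-mer of the sequence into a 2**k x 2**k grid.

    Each k-mer maps to exactly one cell. Later bases in the k-mer select
    coarser quadrants, mirroring how the chaos game moves halfway toward
    the corner of the most recent base.
    """
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    sequence = sequence.upper()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if any(base not in CORNERS for base in kmer):
            continue  # skip k-mers containing N or other ambiguous symbols
        col = row = 0
        for position, base in enumerate(kmer):
            x_bit, y_bit = CORNERS[base]
            col |= x_bit << position
            row |= y_bit << position
        matrix[row][col] += 1
    return matrix
```

Because the output has a fixed shape regardless of sequence length, matrices of this kind can be fed to image-oriented models such as CNNs; in practice the counts are often normalized first.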
8.
Infection ; 51(6): 1809-1818, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37828369

ABSTRACT

PURPOSE AND METHODS: The emergence of coronavirus disease 2019 (COVID-19) has once again affirmed the significant threat of respiratory infections to global public health and the utmost importance of prompt diagnosis in managing and mitigating any pandemic. The nucleic acid amplification test (NAAT) is the primary detection method for most pathogens. Loop-mediated isothermal amplification (LAMP) is a rapid, simple, sensitive, and specific epitome of isothermal NAAT performed using a set of four to six primers. Primer design is a fundamental step in LAMP assays, with several complexities and experimental screening requirements. To address this challenge, an online database is presented here. Its workflow comprises three steps: literature aggregation, data curation, and database and website implementation. RESULTS: LAMPPrimerBank (https://lampprimerbank.mathematik.uni-marburg.de) is a manually curated database dedicated to experimentally validated LAMP primers, the peculiarities of their assays, and accompanying literature, with a primary emphasis on respiratory pathogens. LAMPPrimerBank, with its user-friendly web interface and an open application programming interface, enables the accelerated and facile exploration, comparison, and exportation of LAMP primer sequences and their respective information from the massively scattered literature. LAMPPrimerBank currently comprises LAMP primers for diagnosing viral, bacterial, and fungal respiratory pathogens. Additionally, to address the challenge of false-positive results generated by nonspecific amplifications, LAMPPrimerBank computationally predicts and visualizes the sizes of LAMP products for recorded primer sets in the database. CONCLUSION: LAMPPrimerBank, as a pioneering database in the rapidly expanding field of isothermal NAAT, endeavors to confront the two challenges of LAMP: primer design and discrimination of false-positive results.


Subjects
COVID-19 , Molecular Diagnostic Techniques , Humans , Sensitivity and Specificity , Molecular Diagnostic Techniques/methods , COVID-19/diagnosis , Nucleic Acid Amplification Techniques/methods
9.
Infection ; 51(5): 1491-1501, 2023 Oct.
Article in English | MEDLINE | ID: mdl-36961624

ABSTRACT

PURPOSE: Malaria is a life-threatening mosquito-borne disease caused by Plasmodium parasites, mainly in tropical and subtropical countries. Plasmodium falciparum (P. falciparum) is the most prevalent cause on the African continent and is responsible for most malaria-related deaths globally. Biomarkers for disease severity or disease outcome are an important medical need. A potential source of easily accessible biomarkers is blood-borne small extracellular vesicles (sEVs). METHODS: We performed an EV Array to find proteins on plasma sEVs that are differentially expressed in malaria patients. Plasma samples from 21 healthy subjects and 15 malaria patients were analyzed. The EV array contained 40 antibodies to capture sEVs, which were then visualized with a cocktail of biotin-conjugated CD9, CD63, and CD81 antibodies. RESULTS: We detected significant differences in the protein decoration of sEVs between healthy subjects and malaria patients. We found CD106 to be the best discrimination marker based on receiver operating characteristic (ROC) analysis with an area under the curve of > 0.974. Additional ensemble feature selection revealed CD106, Osteopontin, CD81, major histocompatibility complex class II DR (HLA-DR), and heparin binding EGF like growth factor (HBEGF) together with thrombocytes to be a feature panel for discrimination between healthy subjects and malaria patients. TNF-R-II correlated with HLA-A/B/C as well as CD9 with CD81, whereas Osteopontin negatively correlated with CD81 and CD9. Pathway analysis linked the herein identified proteins to IFN-γ signaling. CONCLUSION: sEV-associated proteins can discriminate between healthy individuals and malaria patients and are candidates for future predictive biomarkers. TRIAL REGISTRATION: The trial was registered in the Deutsches Register Klinischer Studien (DRKS-ID: DRKS00012518).


Subjects
Extracellular Vesicles , Falciparum Malaria , Malaria , Animals , Humans , Proteome/metabolism , Osteopontin/metabolism , Malaria/diagnosis , Biomarkers , Falciparum Malaria/diagnosis , Extracellular Vesicles/metabolism
10.
J Med Internet Res ; 25: e47540, 2023 08 29.
Article in English | MEDLINE | ID: mdl-37642995

ABSTRACT

Artificial intelligence (AI) and data sharing go hand in hand. In order to develop powerful AI models for medical and health applications, data need to be collected and brought together over multiple centers. However, due to various reasons, including data privacy, not all data can be made publicly available or shared with other parties. Federated and swarm learning can help in these scenarios. However, in the private sector, such as between companies, the incentive is limited, as the resulting AI models would be available for all partners irrespective of their individual contribution, including the amount of data provided by each party. Here, we explore a potential solution to this challenge as a viewpoint, aiming to establish a fairer approach that encourages companies to engage in collaborative data analysis and AI modeling. Within the proposed approach, each individual participant could gain a model commensurate with their respective data contribution, ultimately leading to better diagnostic tools for all participants in a fair manner.


Subjects
Artificial Intelligence , Data Analysis , Information Dissemination
11.
J Med Internet Res ; 25: e42621, 2023 07 12.
Article in English | MEDLINE | ID: mdl-37436815

ABSTRACT

BACKGROUND: Machine learning and artificial intelligence have shown promising results in many areas and are driven by the increasing amount of available data. However, these data are often distributed across different institutions and cannot be easily shared owing to strict privacy regulations. Federated learning (FL) allows the training of distributed machine learning models without sharing sensitive data. However, its implementation is time-consuming and requires advanced programming skills and complex technical infrastructures. OBJECTIVE: Various tools and frameworks have been developed to simplify the development of FL algorithms and provide the necessary technical infrastructure. Although there are many high-quality frameworks, most focus only on a single application case or method. To our knowledge, there are no generic frameworks, meaning that the existing solutions are restricted to a particular type of algorithm or application field. Furthermore, most of these frameworks provide an application programming interface that requires programming knowledge. There is no collection of ready-to-use FL algorithms that are extendable and allow users (eg, researchers) without programming knowledge to apply FL. A central FL platform for both FL algorithm developers and users does not exist. This study aimed to address this gap and make FL available to everyone by developing FeatureCloud, an all-in-one platform for FL in biomedicine and beyond. METHODS: The FeatureCloud platform consists of 3 main components: a global frontend, a global backend, and a local controller. Our platform uses Docker to separate the locally acting components of the platform from the sensitive data systems. We evaluated our platform using 4 different algorithms on 5 data sets for both accuracy and runtime. RESULTS: FeatureCloud removes the complexity of distributed systems for developers and end users by providing a comprehensive platform for executing multi-institutional FL analyses and implementing FL algorithms. Through its integrated artificial intelligence store, federated algorithms can easily be published and reused by the community. To secure sensitive raw data, FeatureCloud supports privacy-enhancing technologies to secure the shared local models and assures high standards in data privacy to comply with the strict General Data Protection Regulation. Our evaluation shows that applications developed in FeatureCloud can produce highly similar results compared with centralized approaches and scale well for an increasing number of participating sites. CONCLUSIONS: FeatureCloud provides a ready-to-use platform that integrates the development and execution of FL algorithms while reducing the complexity to a minimum and removing the hurdles of federated infrastructure. Thus, we believe that it has the potential to greatly increase the accessibility of privacy-preserving and distributed data analyses in biomedicine and beyond.


Subjects
Algorithms , Artificial Intelligence , Humans , Health Occupations , Software , Computer Communication Networks , Privacy
12.
Bioinformatics ; 36(22-23): 5514-5515, 2021 04 01.
Article in English | MEDLINE | ID: mdl-33258916

ABSTRACT

MOTIVATION: The generation of high-quality assemblies, even for large eukaryotic genomes, has become a routine task for many biologists thanks to recent advances in sequencing technologies. However, the annotation of these assemblies, a crucial step toward unlocking the biology of the organism of interest, has remained a complex challenge that often requires advanced bioinformatics expertise. RESULTS: Here, we present MOSGA (Modular Open-Source Genome Annotator), a genome annotation framework for eukaryotic genomes with a user-friendly web-interface that generates and integrates annotations from various tools. The aggregated results can be analyzed with a fully integrated genome browser and are provided in a format ready for submission to NCBI. MOSGA is built on a portable, customizable and easily extendible Snakemake backend, and thus can be tailored to a wide range of users and projects. AVAILABILITY AND IMPLEMENTATION: We provide MOSGA as a web service at https://mosga.mathematik.uni-marburg.de and as a docker container at registry.gitlab.com/mosga/mosga:latest. Source code can be found at https://gitlab.com/mosga/mosga. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Genome , Software , Eukaryota
13.
PLoS Comput Biol ; 17(4): e1008901, 2021 Apr.
Article in English | MEDLINE | ID: mdl-33822781

ABSTRACT

[This corrects the article DOI: 10.1371/journal.pcbi.1008259.].

14.
Dig Dis ; 40(5): 644-653, 2022.
Article in English | MEDLINE | ID: mdl-34469884

ABSTRACT

BACKGROUND: In current general practice, elevated serum concentrations of liver enzymes are still regarded as an indicator of non-alcoholic fatty liver disease (NAFLD) or non-alcoholic steatohepatitis (NASH). In this study, we analyzed whether an adjustment of the upper limit of normal (ULN) for serum liver enzymes can improve their diagnostic accuracy. METHODS: Data from 363 morbidly obese patients (42.5 ± 10.3 years old; mean BMI: 52 ± 8.5 kg/m²), who underwent bariatric surgery, were retrospectively analyzed. NAFL and NASH were defined histologically according to non-alcoholic fatty liver activity score (NAS) and according to steatosis activity fibrosis (SAF) score for 2 separate analyses, respectively. RESULTS: In 121 women (45%) and 45 men (46%), elevated values for at least one serum parameter (ALT, AST, γGT) were present. The serum concentrations of ALT (p < 0.0001), AST (p < 0.0001) and γGT (p = 0.0023) differed significantly between NAFL and NASH, irrespective of the applied histological classification method. Concentrations of all 3 serum parameters correlated significantly positively with the NAS and the SAF score, with correlation coefficients between 0.33 (ALT/NAS) and 0.40 (γGT/SAF). The area under the curve for separating NAFL and NASH by liver enzymes reached a maximum of 0.70 (ALT applied to NAS-based classification). For 95% specificity, the ULN for ALT would be 47.5 U/L; for 95% sensitivity, the ULN for ALT would be 17.5 U/L, resulting in 62% uncategorized patients. CONCLUSION: ALT, AST, and γGT are unsuitable for non-invasive screening or diagnosis of NAFL or NASH. Utilizing liver enzymes as an indicator for NAFLD or NASH should generally be questioned.


Subjects
Non-alcoholic Fatty Liver Disease , Morbid Obesity , Adult , Alanine Transaminase , Algorithms , Female , Humans , Liver/pathology , Male , Middle Aged , Non-alcoholic Fatty Liver Disease/diagnosis , Morbid Obesity/complications , Morbid Obesity/surgery , Retrospective Studies , gamma-Glutamyltransferase
15.
Bioinformatics ; 36(11): 3322-3326, 2020 06 01.
Article in English | MEDLINE | ID: mdl-32129840

ABSTRACT

SUMMARY: The development of de novo DNA synthesis, polymerase chain reaction (PCR), DNA sequencing and molecular cloning gave researchers unprecedented control over DNA and DNA-mediated processes. To reduce the error probabilities of these techniques, DNA composition has to adhere to method-dependent restrictions. To comply with such restrictions, a synthetic DNA fragment is often adjusted manually or by using custom-made scripts. In this article, we present MESA (Mosla Error Simulator), a web application for the assessment of DNA fragments based on limitations of DNA synthesis, amplification, cloning, sequencing methods and biological restrictions of host organisms. Furthermore, MESA can be used to simulate errors during synthesis, PCR, storage and sequencing processes. AVAILABILITY AND IMPLEMENTATION: MESA is available at mesa.mosla.de, with the source code available at github.com/umr-ds/mesa_dna_sim. CONTACT: dominik.heider@uni-marburg.de. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
DNA , Software , DNA/genetics , High-Throughput Nucleotide Sequencing , Polymerase Chain Reaction , DNA Sequence Analysis
16.
Bioinformatics ; 36(1): 272-279, 2020 01 01.
Article in English | MEDLINE | ID: mdl-31225868

ABSTRACT

MOTIVATION: Classification of protein sequences is a major task in bioinformatics and has many applications. Different machine learning methods exist and are applied to these problems, such as support vector machines (SVM), random forests (RF) and neural networks (NN). All of these methods have in common that protein sequences have to be made machine-readable and comparable in the first step, for which different encodings exist. These encodings are typically based on physical or chemical properties of the sequence. However, due to the outstanding performance of deep neural networks (DNN) on image recognition, we used frequency matrix chaos game representation (FCGR) for encoding of protein sequences into images. In this study, we compare the performance of SVMs, RFs and DNNs trained on FCGR-encoded protein sequences. While the original chaos game representation (CGR) has been used mainly for genome sequence encoding and classification, we modified it to work also for protein sequences, resulting in n-flakes representation, an image with several icosagons. RESULTS: We show that all applied machine learning techniques (RF, SVM and DNN) achieve promising results compared to the state-of-the-art methods on our benchmark datasets, with DNNs outperforming the other methods, and that FCGR is a promising new encoding method for protein sequences. AVAILABILITY AND IMPLEMENTATION: https://cran.r-project.org/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Deep Learning , Computer Neural Networks , Proteins , Protein Sequence Analysis , Amino Acid Sequence , Game Theory , Proteins/chemistry , Protein Sequence Analysis/methods , Support Vector Machine
17.
Mol Ecol ; 30(9): 2131-2144, 2021 05.
Article in English | MEDLINE | ID: mdl-33682183

ABSTRACT

It is known that microorganisms are essential for the functioning of ecosystems, but the extent to which microorganisms respond to different environmental variables in their natural habitats is not clear. In the current study, we present a methodological framework to quantify the covariation of the microbial community of a habitat and environmental variables of this habitat. It is built on theoretical considerations of systems ecology, makes use of state-of-the-art machine learning techniques and can be used to identify bioindicators. We apply the framework to a data set containing operational taxonomic units (OTUs) as well as more than twenty physicochemical and geographic variables measured in a large-scale survey of European lakes. While a large part of variation (up to 61%) in many environmental variables can be explained by microbial community composition, some variables do not show significant covariation with the microbial lake community. Moreover, we have identified OTUs that act as "multitask" bioindicators, i.e., that are indicative for multiple environmental variables, and thus could be candidates for lake water monitoring schemes. Our results represent, for the first time, a quantification of the covariation of the lake microbiome and a wide array of environmental variables for lake ecosystems. Building on the results and methodology presented here, it will be possible to identify microbial taxa and processes that are essential for functioning and stability of lake ecosystems.


Subjects
Lakes , Microbiota , Ecology , Machine Learning , Microbiota/genetics
18.
BMC Med Inform Decis Mak ; 21(1): 294, 2021 10 26.
Article in English | MEDLINE | ID: mdl-34702225

ABSTRACT

Machine learning and artificial intelligence have entered biomedical decision-making for diagnostics, prognostics, or therapy recommendations. However, these methods need to be interpreted with care because of the severe consequences for patients. In contrast to human decision-making, computational models typically make a decision even with low confidence. Machine learning with abstention better reflects human decision-making by introducing a reject option for samples with low confidence. The abstention intervals are typically symmetric intervals around the decision boundary. In the current study, we use asymmetric abstention intervals, which we demonstrate to be better suited for biomedical data that is typically highly imbalanced. We evaluate symmetric and asymmetric abstention on three real-world biomedical datasets and show that both approaches can significantly improve classification performance. However, asymmetric abstention rejects as many or fewer samples compared to symmetric abstention and thus should be used for imbalanced data.


Subjects
Artificial Intelligence , Machine Learning , Humans
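
Note on entry 18: the abstract contrasts symmetric and asymmetric abstention intervals around the decision boundary. A minimal sketch of the reject option, assuming a binary classifier that outputs a probability for the positive class, might look like this. The function name and the example interval bounds are hypothetical, and the paper's procedure for choosing the bounds is not reproduced here.

```python
def predict_with_abstention(prob_positive, lower, upper):
    """Return 0, 1, or None (abstain) for one predicted probability.

    Samples whose probability falls inside [lower, upper] are rejected.
    A symmetric interval is centered on the decision boundary, e.g.
    (0.4, 0.6) around 0.5; an asymmetric one is not, e.g. (0.45, 0.7).
    """
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("expected 0 <= lower <= upper <= 1")
    if prob_positive < lower:
        return 0
    if prob_positive > upper:
        return 1
    return None  # low confidence: defer the decision
```

With these example bounds, a sample scored 0.65 is classified as positive under the symmetric interval (0.4, 0.6) but rejected under the asymmetric interval (0.45, 0.7), which shows how the two bounds can be tuned independently for each class.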
19.
BMC Bioinformatics ; 21(1): 526, 2020 Nov 16.
Article in English | MEDLINE | ID: mdl-33198651

ABSTRACT

BACKGROUND: Sequencing of marker genes amplified from environmental samples, known as amplicon sequencing, allows us to resolve some of the hidden diversity and elucidate evolutionary relationships and ecological processes among complex microbial communities. The analysis of large numbers of samples at high sequencing depths generated by high throughput sequencing technologies requires efficient, flexible, and reproducible bioinformatics pipelines. Only a few existing workflows can be run in a user-friendly, scalable, and reproducible manner on different computing devices using an efficient workflow management system. RESULTS: We present Natrix, an open-source bioinformatics workflow for preprocessing raw amplicon sequencing data. The workflow contains all analysis steps from quality assessment, read assembly, dereplication, chimera detection, split-sample merging, sequence representative assignment (OTUs or ASVs) to the taxonomic assignment of sequence representatives. The workflow is written using Snakemake, a workflow management engine for developing data analysis workflows. In addition, Conda is used for version control. Thus, Snakemake ensures reproducibility and Conda offers version control of the utilized programs. The encapsulation of rules and their dependencies supports hassle-free sharing of rules between workflows and easy adaptation and extension of existing workflows. Natrix is freely available on GitHub (https://github.com/MW55/Natrix) or as a Docker container on DockerHub (https://hub.docker.com/r/mw55/natrix). CONCLUSION: Natrix is a user-friendly and highly extensible workflow for processing Illumina amplicon data.


Subjects
High-Throughput Nucleotide Sequencing , Software , Workflow , Cluster Analysis , Environmental DNA/genetics , Environmental DNA/isolation & purification , Data Analysis , Genetic Databases , Floods , Microbiota/genetics , Reproducibility of Results
20.
Clin Gastroenterol Hepatol ; 18(3): 728-735.e4, 2020 03.
Article in English | MEDLINE | ID: mdl-31712073

ABSTRACT

BACKGROUND & AIMS: The prevalence of nonalcoholic steatohepatitis (NASH) associated hepatocellular carcinoma (HCC) is increasing. However, strategies for detection of early-stage HCC in patients with NASH have limitations. We assessed the ability of the GALAD score, which determines risk of HCC based on patient sex; age; and serum levels of α-fetoprotein (AFP), AFP isoform L3 (AFP-L3), and des-gamma-carboxy prothrombin (DCP), to detect HCC in patients with NASH. METHODS: We performed a case-control study of 125 patients with HCC (20% within Milan Criteria) and 231 patients without HCC (NASH controls) from 8 centers in Germany. We compared the performance of serum AFP, AFP-L3, or DCP vs GALAD score to identify patients with HCC using receiver operating characteristic curves and corresponding area under the curve (AUC) analyses. We also analyzed data from 389 patients with NASH under surveillance for HCC in Japan, followed for a median of 167 months. During the 5-year screening period, 26 patients developed HCC. To compensate for irregular intervals of data points, we performed locally weighted scatterplot smoothing, linear regression, and a non-linear curve fit to assess development of GALAD before HCC development. RESULTS: The GALAD score identified patients with any stage HCC with an AUC of 0.96, significantly greater than values for serum levels of AFP (AUC, 0.88), AFP-L3 (AUC, 0.86) or DCP (AUC, 0.87). AUC values for the GALAD score were consistent in patients with cirrhosis (AUC, 0.93) and without cirrhosis (AUC, 0.98). For detection of HCC within Milan Criteria, the GALAD score achieved an AUC of 0.91, with a sensitivity of 68% and specificity of 95% at a cutoff of -0.63. In a pilot Japanese cohort study, the mean GALAD score was higher in patients with NASH who developed HCC than in those who did not develop HCC as early as 1.5 years before HCC diagnosis. GALAD scores were above -0.63 approximately 200 days before the diagnosis of HCC. CONCLUSIONS: In a case-control study performed in Germany and a pilot cohort study in Japan, we found the GALAD score may detect HCC with high levels of accuracy in patients with NASH, with and without cirrhosis. The GALAD score can detect patients with early-stage HCC, and might facilitate surveillance of patients with NASH, who are often obese, which limits the sensitivity of detection of liver cancer by ultrasound.


Subjects
Hepatocellular Carcinoma , Liver Neoplasms , Non-alcoholic Fatty Liver Disease , Biomarkers , Tumor Biomarkers , Hepatocellular Carcinoma/diagnosis , Hepatocellular Carcinoma/epidemiology , Case-Control Studies , Cohort Studies , Humans , Liver Neoplasms/diagnosis , Liver Neoplasms/epidemiology , Non-alcoholic Fatty Liver Disease/complications , Non-alcoholic Fatty Liver Disease/diagnosis , Pilot Projects , Protein Precursors , Prothrombin , ROC Curve , Sensitivity and Specificity , alpha-Fetoproteins
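
Note on entry 20: the abstract compares markers by area under the ROC curve (AUC) and reports sensitivity and specificity at a score cutoff of -0.63. The sketch below shows how those three quantities are computed from a list of scores for cases and a list for controls. The function names and the rule "score above cutoff means test-positive" are assumptions for illustration; the GALAD formula itself is not reproduced here.

```python
def roc_auc(case_scores, control_scores):
    """AUC as the probability that a random case scores higher than a
    random control, counting ties as one half (pairwise comparison)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def sensitivity_specificity(case_scores, control_scores, cutoff):
    """Sensitivity and specificity when scores above the cutoff are
    called positive (assumed decision rule)."""
    sensitivity = sum(s > cutoff for s in case_scores) / len(case_scores)
    specificity = sum(s <= cutoff for s in control_scores) / len(control_scores)
    return sensitivity, specificity
```

The pairwise form is quadratic in the number of samples, which is fine for cohorts of a few hundred patients; rank-based implementations give the same value more efficiently on large datasets.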