1.
Bioinformatics ; 39(5), 2023 05 04.
Article in English | MEDLINE | ID: mdl-37195463

ABSTRACT

MOTIVATION: Identifying organellar DNA, such as mitochondrial or plastid sequences, inside a whole genome assembly remains challenging and requires biological background knowledge. To address this, we developed ODNA, which is based on genome annotation and machine learning. RESULTS: ODNA is software that classifies organellar DNA sequences within a genome assembly by machine learning based on a predefined genome annotation workflow. We trained our model with 829 769 DNA sequences from 405 genome assemblies and achieved high predictive performance (e.g. Matthews correlation coefficient of 0.61 for mitochondria and 0.73 for chloroplasts) on independent validation data, thus outperforming existing approaches significantly. AVAILABILITY AND IMPLEMENTATION: Our software ODNA is freely accessible as a web service at https://odna.mathematik.uni-marburg.de and can also be run in a docker container. The source code can be found at https://gitlab.com/mosga/odna and the processed data at Zenodo (DOI: 10.5281/zenodo.7506483).


Subject(s)
Mitochondria , Organelles , Sequence Analysis, DNA , Mitochondria/genetics , Software , Machine Learning , DNA
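The Matthews correlation coefficient reported in the abstract above can be computed from the four confusion-matrix counts. The following is a minimal sketch, not code from ODNA; the function name and the convention of returning 0.0 for an undefined denominator are assumptions made for illustration.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns a value in [-1, 1]; 1 is perfect prediction, 0 is no better
    than chance, -1 is total disagreement.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: the coefficient is undefined when any
        # marginal sum is zero, so report 0.0 in that case.
        return 0.0
    return (tp * tn - fp * fn) / denom
```

Unlike accuracy, the coefficient stays informative when one class (for example, organellar sequences) is much rarer than the other, which is presumably why it is used for this kind of task.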
2.
Bioinformatics ; 39(11), 2023 11 01.
Article in English | MEDLINE | ID: mdl-37988152

ABSTRACT

SUMMARY: Federated learning enables collaboration in medicine, where data is scattered across multiple centers without the need to aggregate the data in a central cloud. While, in general, machine learning models can be applied to a wide range of data types, graph neural networks (GNNs) are particularly developed for graphs, which are very common in the biomedical domain. For instance, a patient can be represented by a protein-protein interaction (PPI) network where the nodes contain the patient-specific omics features. Here, we present our Ensemble-GNN software package, which can be used to deploy federated, ensemble-based GNNs in Python. Ensemble-GNN allows users to quickly build predictive models utilizing PPI networks consisting of various node features such as gene expression and/or DNA methylation. As an example, we show the results from a public dataset of 981 patients and 8469 genes from the Cancer Genome Atlas (TCGA). AVAILABILITY AND IMPLEMENTATION: The source code is available at https://github.com/pievos101/Ensemble-GNN, and the data at Zenodo (DOI: 10.5281/zenodo.8305122).


Subject(s)
DNA Methylation , Machine Learning , Humans , Neural Networks, Computer , Protein Interaction Maps , Software
3.
Respirology ; 2024 Oct 24.
Article in English | MEDLINE | ID: mdl-39448064

ABSTRACT

BACKGROUND AND OBJECTIVE: Chronic obstructive pulmonary disease (COPD) exhibits diverse patterns of disease progression, due to underlying disease activity. We hypothesized that changes in static hyperinflation or KCO % predicted would reveal subgroups with disease progression unidentified by preestablished markers (FEV1, SGRQ, exacerbation history) and associated with unique baseline biomarker profiles. We explored 18-month measures of disease progression associated with 18-54-month mortality, including changes in hyperinflation parameters and transfer factor, in a large German COPD cohort. METHODS: Analysing data of 1364 patients from the German observational COSYCONET-cohort, disease progression and improvement patterns were assessed for their impact on mortality via Cox hazard regression models. Association of biomarkers and COPD Assessment test items with phenotypes of disease progression or improvement were evaluated using logistic regression and random forest models. RESULTS: Increased risk of 18-54-month mortality was linked to decrease in KCO % predicted (7.5% increments) and FEV1 (20 mL increments), increase in RV/TLC (2% increments) and SGRQ (≥6 points), and an exacerbation grade of 2 at 18 months. Decrease in KCO % predicted ≥7.5% and an increase of RV/TLC ≥2% were the most frequent measures of 18-month disease progression occurring in ~52% and ~46% of patients, respectively. IL-6 and CRP thresholds exhibited significant associations with medium- and long-term disease measures. CONCLUSION: In a multicentric cohort of COPD, new markers of current disease activity predicted mid-term mortality and could not be anticipated by baseline biomarkers.

4.
Nucleic Acids Res ; 50(5): e30, 2022 03 21.
Article in English | MEDLINE | ID: mdl-34908135

ABSTRACT

The use of complex biological molecules to solve computational problems is an emerging field at the interface between biology and computer science. There are two main categories in which biological molecules, especially DNA, are investigated as alternatives to silicon-based computer technologies. One is to use DNA as a storage medium, and the other is to use DNA for computing. Both strategies come with certain constraints. In the current study, we present a novel approach derived from chaos game representation for DNA to generate DNA code words that fulfill user-defined constraints, namely GC content, homopolymers, and undesired motifs, and thus, can be used to build codes for reliable DNA storage systems.


Subject(s)
Computational Biology/methods , DNA , Fractals
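The three constraints named in the abstract above (GC content, homopolymers and undesired motifs) can be checked for a candidate code word as follows. This is only a validation sketch: it filters sequences and does not reproduce the chaos-game-based generation approach of the study. All function names and default limits are assumptions chosen for illustration.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a non-empty DNA sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of identical bases in a non-empty sequence."""
    longest = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest


def satisfies_constraints(
    seq: str,
    gc_min: float = 0.4,        # assumed lower GC bound
    gc_max: float = 0.6,        # assumed upper GC bound
    max_homopolymer: int = 3,   # assumed longest allowed run
    forbidden_motifs: tuple = (),
) -> bool:
    """Return True if the code word meets all user-defined constraints."""
    if not seq:
        return False
    if not gc_min <= gc_content(seq) <= gc_max:
        return False
    if longest_homopolymer(seq) > max_homopolymer:
        return False
    return not any(motif in seq for motif in forbidden_motifs)
```

For example, `satisfies_constraints("ACGTACGT")` passes under the assumed defaults, while the same word is rejected once `"GTAC"` is listed as a forbidden motif.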
5.
Gut ; 72(4): 612-623, 2023 04.
Article in English | MEDLINE | ID: mdl-35882562

ABSTRACT

OBJECTIVE: Oesophageal cancer (EC) is the sixth leading cause of cancer-related deaths. Oesophageal adenocarcinoma (EA), with Barrett's oesophagus (BE) as a precursor lesion, is the most prevalent EC subtype in the Western world. This study aims to improve understanding of the genetic causes of BE/EA by leveraging genome-wide association studies (GWAS), genetic correlation analyses and polygenic risk modelling. DESIGN: We combined data from previous GWAS with new cohorts, increasing the sample size to 16 790 BE/EA cases and 32 476 controls. We also carried out a transcriptome-wide association study (TWAS) using expression data from disease-relevant tissues to identify BE/EA candidate genes. To investigate the relationship with reported BE/EA risk factors, a linkage disequilibrium score regression (LDSR) analysis was performed. BE/EA risk models were developed combining clinical/lifestyle risk factors with polygenic risk scores (PRS) derived from the GWAS meta-analysis. RESULTS: The GWAS meta-analysis identified 27 BE and/or EA risk loci, 11 of which were novel. The TWAS identified promising BE/EA candidate genes at seven GWAS loci and at five additional risk loci. The LDSR analysis led to the identification of novel genetic correlations and pointed to differences in BE and EA aetiology. Gastro-oesophageal reflux disease appeared to contribute more strongly to the metaplastic BE transformation than to EA development. Finally, combining PRS with BE/EA risk factors improved the performance of the risk models. CONCLUSION: Our findings provide further insights into BE/EA aetiology and its relationship to risk factors. The results lay the foundation for future follow-up studies to identify underlying disease mechanisms and improve risk prediction.


Subject(s)
Adenocarcinoma , Barrett Esophagus , Esophageal Neoplasms , Humans , Barrett Esophagus/pathology , Genome-Wide Association Study , Esophageal Neoplasms/pathology , Adenocarcinoma/pathology
6.
Brief Bioinform ; 22(2): 642-663, 2021 03 22.
Article in English | MEDLINE | ID: mdl-33147627

ABSTRACT

SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) is a novel virus of the family Coronaviridae. The virus causes the infectious disease COVID-19. The biology of coronaviruses has been studied for many years. However, bioinformatics tools designed explicitly for SARS-CoV-2 have only recently been developed as a rapid reaction to the need for fast detection, understanding and treatment of COVID-19. To control the ongoing COVID-19 pandemic, it is of utmost importance to gain insight into the evolution and pathogenesis of the virus. In this review, we cover bioinformatics workflows and tools for the routine detection of SARS-CoV-2 infection, the reliable analysis of sequencing data, the tracking of the COVID-19 pandemic and evaluation of containment measures, the study of coronavirus evolution, the discovery of potential drug targets and development of therapeutic strategies. For each tool, we briefly describe its use case and how it advances research specifically for SARS-CoV-2. All tools are free to use and available online, either through web applications or public code repositories. Contact: evbc@unj-jena.de.


Subject(s)
COVID-19/prevention & control , Computational Biology , SARS-CoV-2/isolation & purification , Biomedical Research , COVID-19/epidemiology , COVID-19/virology , Genome, Viral , Humans , Pandemics , SARS-CoV-2/genetics
7.
Bioinformatics ; 38(8): 2278-2286, 2022 04 12.
Article in English | MEDLINE | ID: mdl-35139148

ABSTRACT

MOTIVATION: Limited data access has hindered the field of precision medicine from exploring its full potential, e.g. concerning machine learning and privacy and data protection rules. Our study evaluates the efficacy of federated Random Forest (FRF) models, focusing particularly on the heterogeneity within and between datasets. We addressed three common challenges: (i) number of parties, (ii) sizes of datasets and (iii) imbalanced phenotypes, evaluated on five biomedical datasets. RESULTS: The FRF outperformed the average local models and performed comparably to the data-centralized models trained on the entire data. With an increasing number of models and decreasing dataset size, the performance of local models decreases drastically. The performance of the FRF, however, does not decrease significantly. When combining datasets of different sizes, the FRF vastly improve compared to the average local models. By analyzing different class imbalances, we demonstrate that the FRF remain more robust and outperform the local models. Our results support that FRF overcome boundaries of clinical research and enable collaborations across institutes without violating privacy or legal regulations. Clinicians benefit from a vast collection of unbiased data aggregated from different geographic locations, demographics and other varying factors. They can build more generalizable models to make better clinical decisions, which will have relevance, especially for patients in rural areas and rare or geographically uncommon diseases, enabling personalized treatment. In combination with secure multi-party computation, federated learning has the power to revolutionize clinical practice by increasing the accuracy and robustness of healthcare AI and thus paving the way for precision medicine. AVAILABILITY AND IMPLEMENTATION: The implementation of the federated random forests can be found at https://featurecloud.ai/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Privacy , Random Forest , Machine Learning , Precision Medicine , Delivery of Health Care
8.
Bioinformatics ; 38(2): 325-334, 2022 01 03.
Article in English | MEDLINE | ID: mdl-34613360

ABSTRACT

MOTIVATION: Antimicrobial resistance (AMR) is one of the biggest global problems threatening human and animal health. Rapid and accurate AMR diagnostic methods are thus urgently needed. However, traditional antimicrobial susceptibility testing (AST) is time-consuming, low throughput and viable only for cultivable bacteria. Machine learning methods may pave the way for automated AMR prediction based on genomic data of the bacteria. However, a comparison of different machine learning methods for the prediction of AMR based on different encodings and whole-genome sequencing data without prior knowledge remains to be done. RESULTS: In this study, we evaluated logistic regression (LR), support vector machine (SVM), random forest (RF) and convolutional neural network (CNN) for the prediction of AMR for the antibiotics ciprofloxacin, cefotaxime, ceftazidime and gentamicin. We demonstrate that these models can effectively predict AMR with label encoding, one-hot encoding and frequency matrix chaos game representation (FCGR encoding) on whole-genome sequencing data. We trained these models on a large AMR dataset and evaluated them on an independent public dataset. Generally, RFs and CNNs perform better than LR and SVM with AUCs up to 0.96. Furthermore, we were able to identify mutations that are associated with AMR for each antibiotic. AVAILABILITY AND IMPLEMENTATION: Source code for data preparation and model training is provided on GitHub (https://github.com/YunxiaoRen/ML-iAMR). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Anti-Bacterial Agents , Drug Resistance, Bacterial , Animals , Humans , Anti-Bacterial Agents/pharmacology , Drug Resistance, Bacterial/genetics , Ciprofloxacin , Machine Learning , Genomics , Bacteria/genetics
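The FCGR encoding mentioned in the abstract above turns a DNA sequence into a 2^k by 2^k matrix of k-mer counts that can be fed to an image-style model such as a CNN. The sketch below is a generic illustration and not the code from the cited repository; the corner assignment for A, C, G and T and the decision to skip k-mers containing other symbols are assumptions, and published implementations differ in these conventions.

```python
# Assumed corner layout (x, y) for the four nucleotides.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(seq: str, k: int) -> list:
    """Frequency matrix chaos game representation of a DNA sequence.

    Each k-mer is mapped to one cell of a 2^k x 2^k grid; the cell value
    is the number of times that k-mer occurs in the sequence.
    """
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in CORNERS for base in kmer):
            continue  # skip k-mers with ambiguous symbols such as N
        x = y = 0
        for pos, base in enumerate(kmer):
            cx, cy = CORNERS[base]
            # The last base of the k-mer sets the most significant bit,
            # i.e. it selects the coarsest quadrant of the grid.
            x += cx << pos
            y += cy << pos
        matrix[y][x] += 1
    return matrix
```

With `k = 1` the matrix simply holds the four nucleotide counts; larger `k` gives a finer grid at the cost of sparser counts.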
9.
Infection ; 51(6): 1809-1818, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37828369

ABSTRACT

PURPOSE AND METHODS: The emergence of coronavirus disease 2019 (COVID-19) has once again affirmed the significant threat of respiratory infections to global public health and the utmost importance of prompt diagnosis in managing and mitigating any pandemic. The nucleic acid amplification test (NAAT) is the primary detection method for most pathogens. Loop-mediated isothermal amplification (LAMP) is a rapid, simple, sensitive, and specific epitome of isothermal NAAT performed using a set of four to six primers. Primer design is a fundamental step in LAMP assays, with several complexities and experimental screening requirements. To address this challenge, an online database is presented here. Its workflow comprises three steps: literature aggregation, data curation, and database and website implementation. RESULTS: LAMPPrimerBank ( https://lampprimerbank.mathematik.uni-marburg.de ) is a manually curated database dedicated to experimentally validated LAMP primers, their peculiarities of assays, and accompanying literature, with a primary emphasis on respiratory pathogens. LAMPPrimerBank, with its user-friendly web interface and an open application programming interface, enables the accelerated and facile exploration, comparison, and exportation of LAMP primer sequences and their respective information from the massively scattered literature. LAMPPrimerBank currently comprises LAMP primers for diagnosing viral, bacterial, and fungal respiratory pathogens. Additionally, to address the challenge of false-positive results generated by nonspecific amplifications, LAMPPrimerBank computationally predicted and visualized the sizes of LAMP products for recorded primer sets in the database. CONCLUSION: LAMPPrimerBank, as a pioneering database in the rapidly expanding field of isothermal NAAT, endeavors to confront the two challenges of the LAMP: primer design and discrimination of false-positive results.


Subject(s)
COVID-19 , Molecular Diagnostic Techniques , Humans , Sensitivity and Specificity , Molecular Diagnostic Techniques/methods , COVID-19/diagnosis , Nucleic Acid Amplification Techniques/methods
10.
Infection ; 51(5): 1491-1501, 2023 Oct.
Article in English | MEDLINE | ID: mdl-36961624

ABSTRACT

PURPOSE: Malaria is a life-threatening mosquito-borne disease caused by Plasmodium parasites, mainly in tropical and subtropical countries. Plasmodium falciparum (P. falciparum) is the most prevalent cause on the African continent and responsible for most malaria-related deaths globally. Important medical needs are biomarkers for disease severity or disease outcome. A potential source of easily accessible biomarkers are blood-borne small extracellular vesicles (sEVs). METHODS: We performed an EV Array to find proteins on plasma sEVs that are differentially expressed in malaria patients. Plasma samples from 21 healthy subjects and 15 malaria patients were analyzed. The EV array contained 40 antibodies to capture sEVs, which were then visualized with a cocktail of biotin-conjugated CD9, CD63, and CD81 antibodies. RESULTS: We detected significant differences in the protein decoration of sEVs between healthy subjects and malaria patients. We found CD106 to be the best discrimination marker based on receiver operating characteristic (ROC) analysis with an area under the curve of > 0.974. Additional ensemble feature selection revealed CD106, Osteopontin, CD81, major histocompatibility complex class II DR (HLA-DR), and heparin binding EGF like growth factor (HBEGF) together with thrombocytes to be a feature panel for discrimination between healthy and malaria. TNF-R-II correlated with HLA-A/B/C as well as CD9 with CD81, whereas Osteopontin negatively correlated with CD81 and CD9. Pathway analysis linked the herein identified proteins to IFN-γ signaling. CONCLUSION: sEV-associated proteins can discriminate between healthy individuals and malaria patients and are candidates for future predictive biomarkers. TRIAL REGISTRATION: The trial was registered in the Deutsches Register Klinischer Studien (DRKS-ID: DRKS00012518).


Subject(s)
Extracellular Vesicles , Malaria, Falciparum , Malaria , Animals , Humans , Proteome/metabolism , Osteopontin/metabolism , Malaria/diagnosis , Biomarkers , Malaria, Falciparum/diagnosis , Extracellular Vesicles/metabolism
11.
J Med Internet Res ; 25: e47540, 2023 08 29.
Article in English | MEDLINE | ID: mdl-37642995

ABSTRACT

Artificial intelligence (AI) and data sharing go hand in hand. In order to develop powerful AI models for medical and health applications, data need to be collected and brought together over multiple centers. However, due to various reasons, including data privacy, not all data can be made publicly available or shared with other parties. Federated and swarm learning can help in these scenarios. However, in the private sector, such as between companies, the incentive is limited, as the resulting AI models would be available for all partners irrespective of their individual contribution, including the amount of data provided by each party. Here, we explore a potential solution to this challenge as a viewpoint, aiming to establish a fairer approach that encourages companies to engage in collaborative data analysis and AI modeling. Within the proposed approach, each individual participant could gain a model commensurate with their respective data contribution, ultimately leading to better diagnostic tools for all participants in a fair manner.


Subject(s)
Artificial Intelligence , Data Analysis , Information Dissemination
12.
J Med Internet Res ; 25: e42621, 2023 07 12.
Article in English | MEDLINE | ID: mdl-37436815

ABSTRACT

BACKGROUND: Machine learning and artificial intelligence have shown promising results in many areas and are driven by the increasing amount of available data. However, these data are often distributed across different institutions and cannot be easily shared owing to strict privacy regulations. Federated learning (FL) allows the training of distributed machine learning models without sharing sensitive data. However, its implementation is time-consuming and requires advanced programming skills and complex technical infrastructures. OBJECTIVE: Various tools and frameworks have been developed to simplify the development of FL algorithms and provide the necessary technical infrastructure. Although there are many high-quality frameworks, most focus only on a single application case or method. To our knowledge, there are no generic frameworks, meaning that the existing solutions are restricted to a particular type of algorithm or application field. Furthermore, most of these frameworks provide an application programming interface that needs programming knowledge. There is no collection of ready-to-use FL algorithms that are extendable and allow users (eg, researchers) without programming knowledge to apply FL. A central FL platform for both FL algorithm developers and users does not exist. This study aimed to address this gap and make FL available to everyone by developing FeatureCloud, an all-in-one platform for FL in biomedicine and beyond. METHODS: The FeatureCloud platform consists of 3 main components: a global frontend, a global backend, and a local controller. Our platform uses Docker to separate the local acting components of the platform from the sensitive data systems. We evaluated our platform using 4 different algorithms on 5 data sets for both accuracy and runtime. RESULTS: FeatureCloud removes the complexity of distributed systems for developers and end users by providing a comprehensive platform for executing multi-institutional FL analyses and implementing FL algorithms. Through its integrated artificial intelligence store, federated algorithms can easily be published and reused by the community. To secure sensitive raw data, FeatureCloud supports privacy-enhancing technologies to secure the shared local models and assures high standards in data privacy to comply with the strict General Data Protection Regulation. Our evaluation shows that applications developed in FeatureCloud can produce highly similar results compared with centralized approaches and scale well for an increasing number of participating sites. CONCLUSIONS: FeatureCloud provides a ready-to-use platform that integrates the development and execution of FL algorithms while reducing the complexity to a minimum and removing the hurdles of federated infrastructure. Thus, we believe that it has the potential to greatly increase the accessibility of privacy-preserving and distributed data analyses in biomedicine and beyond.


Subject(s)
Algorithms , Artificial Intelligence , Humans , Health Occupations , Software , Computer Communication Networks , Privacy
13.
Bioinformatics ; 36(22-23): 5514-5515, 2021 04 01.
Article in English | MEDLINE | ID: mdl-33258916

ABSTRACT

MOTIVATION: The generation of high-quality assemblies, even for large eukaryotic genomes, has become a routine task for many biologists thanks to recent advances in sequencing technologies. However, the annotation of these assemblies, a crucial step toward unlocking the biology of the organism of interest, has remained a complex challenge that often requires advanced bioinformatics expertise. RESULTS: Here, we present MOSGA (Modular Open-Source Genome Annotator), a genome annotation framework for eukaryotic genomes with a user-friendly web-interface that generates and integrates annotations from various tools. The aggregated results can be analyzed with a fully integrated genome browser and are provided in a format ready for submission to NCBI. MOSGA is built on a portable, customizable and easily extendible Snakemake backend, and thus, can be tailored to a wide range of users and projects. AVAILABILITY AND IMPLEMENTATION: We provide MOSGA as a web service at https://mosga.mathematik.uni-marburg.de and as a docker container at registry.gitlab.com/mosga/mosga:latest. Source code can be found at https://gitlab.com/mosga/mosga. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genome , Software , Eukaryota
14.
PLoS Comput Biol ; 17(4): e1008901, 2021 Apr.
Article in English | MEDLINE | ID: mdl-33822781

ABSTRACT

[This corrects the article DOI: 10.1371/journal.pcbi.1008259.].

15.
Dig Dis ; 40(5): 644-653, 2022.
Article in English | MEDLINE | ID: mdl-34469884

ABSTRACT

BACKGROUND: In current general practice, elevated serum concentrations of liver enzymes are still regarded as an indicator of non-alcoholic fatty liver disease (NAFLD) or non-alcoholic steatohepatitis (NASH). In this study, we analyzed if an adjustment of the upper limit of normal (ULN) for serum liver enzymes can improve their diagnostic accuracy. METHODS: Data from 363 morbidly obese patients (42.5 ± 10.3 years old; mean BMI: 52 ± 8.5 kg/m2), who underwent bariatric surgery, was retrospectively analyzed. NAFL and NASH were defined histologically according to non-alcoholic fatty liver activity score (NAS) and according to steatosis activity fibrosis (SAF) score for 2 separate analyses, respectively. RESULTS: In 121 women (45%) and 45 men (46%), elevated values for at least one serum parameter (ALT, AST, γGT) were present. The serum concentrations of ALT (p < 0.0001), AST (p < 0.0001) and γGT (p = 0.0023) differed significantly between NAFL and NASH, irrespective of the applied histological classification method. Concentrations of all 3 serum parameters correlated significantly positively with the NAS and the SAF score, with correlation coefficients between 0.33 (ALT/NAS) and 0.40 (γGT/SAF). The area under the curves to separate NAFL and NASH by liver enzymes achieved a maximum of 0.70 (ALT applied to NAS-based classification). For 95% specificity, the ULN for ALT would be 47.5 U/L; for 95% sensitivity, the ULN for ALT would be 17.5 U/L, resulting in 62% uncategorized patients. CONCLUSION: ALT, AST, and γGT are unsuitable for non-invasive screening or diagnosis of NAFL or NASH. Utilizing liver enzymes as an indicator for NAFLD or NASH should generally be questioned.


Subject(s)
Non-alcoholic Fatty Liver Disease , Obesity, Morbid , Adult , Alanine Transaminase , Algorithms , Female , Humans , Liver/pathology , Male , Middle Aged , Non-alcoholic Fatty Liver Disease/diagnosis , Obesity, Morbid/complications , Obesity, Morbid/surgery , Retrospective Studies , gamma-Glutamyltransferase
16.
Bioinformatics ; 36(1): 272-279, 2020 01 01.
Article in English | MEDLINE | ID: mdl-31225868

ABSTRACT

MOTIVATION: Classification of protein sequences is a major task in bioinformatics and has many applications. Different machine learning methods exist and are applied to these problems, such as support vector machines (SVM), random forests (RF) and neural networks (NN). All of these methods have in common that protein sequences have to be made machine-readable and comparable in the first step, for which different encodings exist. These encodings are typically based on physical or chemical properties of the sequence. However, due to the outstanding performance of deep neural networks (DNN) on image recognition, we used frequency matrix chaos game representation (FCGR) for encoding of protein sequences into images. In this study, we compare the performance of SVMs, RFs and DNNs, trained on FCGR encoded protein sequences. While the original chaos game representation (CGR) has been used mainly for genome sequence encoding and classification, we modified it to work also for protein sequences, resulting in n-flakes representation, an image with several icosagons. RESULTS: We show that all applied machine learning techniques (RF, SVM and DNN) give promising results compared to the state-of-the-art methods on our benchmark datasets, with DNNs outperforming the other methods, and that FCGR is a promising new encoding method for protein sequences. AVAILABILITY AND IMPLEMENTATION: https://cran.r-project.org/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Deep Learning , Neural Networks, Computer , Proteins , Sequence Analysis, Protein , Amino Acid Sequence , Game Theory , Proteins/chemistry , Sequence Analysis, Protein/methods , Support Vector Machine
17.
Bioinformatics ; 36(11): 3322-3326, 2020 06 01.
Article in English | MEDLINE | ID: mdl-32129840

ABSTRACT

SUMMARY: The development of de novo DNA synthesis, polymerase chain reaction (PCR), DNA sequencing and molecular cloning gave researchers unprecedented control over DNA and DNA-mediated processes. To reduce the error probabilities of these techniques, DNA composition has to adhere to method-dependent restrictions. To comply with such restrictions, a synthetic DNA fragment is often adjusted manually or by using custom-made scripts. In this article, we present MESA (Mosla Error Simulator), a web application for the assessment of DNA fragments based on limitations of DNA synthesis, amplification, cloning, sequencing methods and biological restrictions of host organisms. Furthermore, MESA can be used to simulate errors during synthesis, PCR, storage and sequencing processes. AVAILABILITY AND IMPLEMENTATION: MESA is available at mesa.mosla.de, with the source code available at github.com/umr-ds/mesa_dna_sim. CONTACT: dominik.heider@uni-marburg.de. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
DNA , Software , DNA/genetics , High-Throughput Nucleotide Sequencing , Polymerase Chain Reaction , Sequence Analysis, DNA
18.
Mol Ecol ; 30(9): 2131-2144, 2021 05.
Article in English | MEDLINE | ID: mdl-33682183

ABSTRACT

It is known that microorganisms are essential for the functioning of ecosystems, but the extent to which microorganisms respond to different environmental variables in their natural habitats is not clear. In the current study, we present a methodological framework to quantify the covariation of the microbial community of a habitat and environmental variables of this habitat. It is built on theoretical considerations of systems ecology, makes use of state-of-the-art machine learning techniques and can be used to identify bioindicators. We apply the framework to a data set containing operational taxonomic units (OTUs) as well as more than twenty physicochemical and geographic variables measured in a large-scale survey of European lakes. While a large part of variation (up to 61%) in many environmental variables can be explained by microbial community composition, some variables do not show significant covariation with the microbial lake community. Moreover, we have identified OTUs that act as "multitask" bioindicators, i.e., that are indicative for multiple environmental variables, and thus could be candidates for lake water monitoring schemes. Our results represent, for the first time, a quantification of the covariation of the lake microbiome and a wide array of environmental variables for lake ecosystems. Building on the results and methodology presented here, it will be possible to identify microbial taxa and processes that are essential for functioning and stability of lake ecosystems.


Subject(s)
Lakes , Microbiota , Ecology , Machine Learning , Microbiota/genetics
19.
BMC Med Inform Decis Mak ; 21(1): 294, 2021 10 26.
Article in English | MEDLINE | ID: mdl-34702225

ABSTRACT

Machine learning and artificial intelligence have entered biomedical decision-making for diagnostics, prognostics, or therapy recommendations. However, these methods need to be interpreted with care because of the severe consequences for patients. In contrast to human decision-making, computational models typically make a decision also with low confidence. Machine learning with abstention better reflects human decision-making by introducing a reject option for samples with low confidence. The abstention intervals are typically symmetric intervals around the decision boundary. In the current study, we use asymmetric abstention intervals, which we demonstrate to be better suited for biomedical data that is typically highly imbalanced. We evaluate symmetric and asymmetric abstention on three real-world biomedical datasets and show that both approaches can significantly improve classification performance. However, asymmetric abstention rejects as many or fewer samples compared to symmetric abstention and thus, should be used in imbalanced data.


Subject(s)
Artificial Intelligence , Machine Learning , Humans
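The difference between symmetric and asymmetric abstention intervals described in the abstract above can be sketched for a binary classifier that outputs a probability for the positive class. This is an illustrative sketch, not the study's implementation; the function name, the decision threshold of 0.5 and the interval widths are all assumptions.

```python
from typing import Optional


def classify_with_abstention(
    prob: float,
    threshold: float = 0.5,     # assumed decision boundary
    lower_width: float = 0.1,   # assumed reject margin below the boundary
    upper_width: float = 0.1,   # assumed reject margin above the boundary
) -> Optional[int]:
    """Return 1 or 0, or None when the prediction falls in the reject interval.

    Equal widths give a symmetric abstention interval around the decision
    boundary; unequal widths give an asymmetric one.
    """
    if threshold - lower_width < prob < threshold + upper_width:
        return None  # abstain: confidence too low
    return 1 if prob >= threshold else 0
```

With the symmetric defaults a score of 0.55 is rejected, whereas a narrower upper margin such as `upper_width=0.02` accepts it as positive while still rejecting scores just below the boundary. Choosing the two widths separately is what lets the reject region follow an imbalanced score distribution.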
20.
BMC Bioinformatics ; 21(1): 526, 2020 Nov 16.
Article in English | MEDLINE | ID: mdl-33198651

ABSTRACT

BACKGROUND: Sequencing of marker genes amplified from environmental samples, known as amplicon sequencing, allows us to resolve some of the hidden diversity and elucidate evolutionary relationships and ecological processes among complex microbial communities. The analysis of large numbers of samples at high sequencing depths generated by high throughput sequencing technologies requires efficient, flexible, and reproducible bioinformatics pipelines. Only a few existing workflows can be run in a user-friendly, scalable, and reproducible manner on different computing devices using an efficient workflow management system. RESULTS: We present Natrix, an open-source bioinformatics workflow for preprocessing raw amplicon sequencing data. The workflow contains all analysis steps from quality assessment, read assembly, dereplication, chimera detection, split-sample merging, sequence representative assignment (OTUs or ASVs) to the taxonomic assignment of sequence representatives. The workflow is written using Snakemake, a workflow management engine for developing data analysis workflows. In addition, Conda is used for version control. Thus, Snakemake ensures reproducibility and Conda offers version control of the utilized programs. The encapsulation of rules and their dependencies support hassle-free sharing of rules between workflows and easy adaptation and extension of existing workflows. Natrix is freely available on GitHub ( https://github.com/MW55/Natrix ) or as a Docker container on DockerHub ( https://hub.docker.com/r/mw55/natrix ). CONCLUSION: Natrix is a user-friendly and highly extensible workflow for processing Illumina amplicon data.


Subject(s)
High-Throughput Nucleotide Sequencing , Software , Workflow , Cluster Analysis , DNA, Environmental/genetics , DNA, Environmental/isolation & purification , Data Analysis , Databases, Genetic , Floods , Microbiota/genetics , Reproducibility of Results