Results 1 - 20 of 39
1.
Bioinformatics ; 40(3)2024 Mar 04.
Article in English | MEDLINE | ID: mdl-38441258

ABSTRACT

MOTIVATION: Automatic cell type annotation methods assign cell type labels to new datasets by extracting relationships from a reference RNA-seq dataset. However, due to the limited resolution of gene expression features, there is always uncertainty present in the label assignment. To enhance the reliability and robustness of annotation, most machine learning methods address this uncertainty by providing a full reject option, i.e. when the predicted confidence score of a cell type label falls below a user-defined threshold, no label is assigned and no prediction is made. As a better alternative, some methods deploy hierarchical models and consider a so-called partial rejection by returning internal nodes of the hierarchy as label assignment. However, because a detailed experimental analysis of various rejection approaches is missing in the literature, there is currently no consensus on best practices. RESULTS: We evaluate three annotation approaches (i) full rejection, (ii) partial rejection, and (iii) no rejection for both flat and hierarchical probabilistic classifiers. Our findings indicate that hierarchical classifiers are superior when rejection is applied, with partial rejection being the preferred rejection approach, as it preserves a significant amount of label information. For optimal rejection implementation, the rejection threshold should be determined through careful examination of a method's rejection behavior. Without rejection, flat and hierarchical annotation perform equally well, as long as the cell type hierarchy accurately captures transcriptomic relationships. AVAILABILITY AND IMPLEMENTATION: Code is freely available at https://github.com/Latheuni/Hierarchical_reject and https://doi.org/10.5281/zenodo.10697468.


Subjects
Gene Expression Profiling , Transcriptome , Reproducibility of Results , Uncertainty , Machine Learning , Single-Cell Analysis , RNA Sequence Analysis
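
The difference between full and partial rejection described in this abstract can be illustrated with a minimal sketch. It assumes that the probability of an internal node is the sum of the probabilities of its descendant leaves; the toy hierarchy, the function names and the threshold are hypothetical and are not taken from the authors' repository.

```python
from typing import Dict, List, Optional

# Hypothetical toy hierarchy: each internal node maps to its children.
HIERARCHY: Dict[str, List[str]] = {
    "root": ["lymphoid", "myeloid"],
    "lymphoid": ["T cell", "B cell"],
    "myeloid": ["monocyte", "dendritic cell"],
}


def node_probability(node: str, leaf_probs: Dict[str, float]) -> float:
    """Assumption: a node's probability is the sum over its descendant leaves."""
    if node not in HIERARCHY:  # the node is a leaf
        return leaf_probs.get(node, 0.0)
    return sum(node_probability(child, leaf_probs) for child in HIERARCHY[node])


def full_rejection(leaf_probs: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the most probable leaf, or None (no label) if it is below the threshold."""
    best = max(leaf_probs, key=lambda leaf: leaf_probs[leaf])
    return best if leaf_probs[best] >= threshold else None


def partial_rejection(leaf_probs: Dict[str, float], threshold: float) -> str:
    """Descend from the root for as long as the best child clears the threshold."""
    node = "root"
    while node in HIERARCHY:
        best_child = max(HIERARCHY[node], key=lambda c: node_probability(c, leaf_probs))
        if node_probability(best_child, leaf_probs) < threshold:
            break  # stop at the current (internal) node
        node = best_child
    return node


probs = {"T cell": 0.45, "B cell": 0.40, "monocyte": 0.10, "dendritic cell": 0.05}
print(full_rejection(probs, 0.7))     # None
print(partial_rejection(probs, 0.7))  # lymphoid
```

In this toy case full rejection discards the cell entirely, whereas partial rejection still reports the coarser label, which is the sense in which partial rejection preserves label information. A return value of "root" corresponds to rejecting at every level.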
2.
Nat Plants ; 10(3): 390-401, 2024 03.
Article in English | MEDLINE | ID: mdl-38467801

ABSTRACT

Scientific testing including stable isotope ratio analysis (SIRA) and trace element analysis (TEA) is critical for establishing plant origin, tackling deforestation and enforcing economic sanctions. Yet methods combining SIRA and TEA into robust models for origin verification and determination are lacking. Here we report (1) a large Eastern European timber reference database (Betula, Fagus, Pinus, Quercus) tailored to sanctioned products following the Ukraine invasion; (2) a statistical test to verify samples against a claimed origin; and (3) a probabilistic model of SIRA, TEA and genus distribution data, using Gaussian processes, to determine timber harvest location. Our verification method rejects 40-60% of simulated false claims, depending on the spatial scale of the claim, and maintains a low probability of rejecting correct origin claims. Our determination method predicts harvest location within 180 to 230 km of true location. Our results showcase the power of combining data types with probabilistic modelling to identify and scrutinize timber harvest location claims.


Subjects
Fagus , Pinus , Ukraine , Betula , Plant Genes
3.
BMC Bioinformatics ; 25(1): 59, 2024 Feb 06.
Article in English | MEDLINE | ID: mdl-38321386

ABSTRACT

The prediction of interactions between novel drugs and biological targets is a vital step in the early stage of the drug discovery pipeline. Many deep learning approaches have been proposed over the last decade, with a substantial fraction of them sharing the same underlying two-branch architecture. Their distinction is limited to the use of different types of feature representations and branches (multi-layer perceptrons, convolutional neural networks, graph neural networks and transformers). In contrast, the strategy used to combine the outputs (embeddings) of the branches has remained mostly the same. The same general architecture has also been used extensively in the area of recommender systems, where the choice of an aggregation strategy is still an open question. In this work, we investigate the effectiveness of three different embedding aggregation strategies in the area of drug-target interaction (DTI) prediction. We formally define these strategies and prove their universal approximator capabilities. We then present experiments that compare the different strategies on benchmark datasets from the area of DTI prediction, showcasing conditions under which specific strategies could be the obvious choice.


Subjects
Benchmarking , Drug Discovery , Electric Power Supplies , Computer Neural Networks
4.
Methods Mol Biol ; 2516: 51-59, 2022.
Article in English | MEDLINE | ID: mdl-35922621

ABSTRACT

A major goal in synthetic biology is the engineering of synthetic gene circuits with a predictable, controlled and designed outcome. This creates a need for building blocks that can modulate gene expression without interference with the native cell system. A tool allowing forward engineering of promoters with predictable transcription initiation frequency is still lacking. Promoter libraries specific for σ70 to ensure the orthogonality of gene expression were built in Escherichia coli and labeled using fluorescence-activated cell sorting to obtain high-throughput DNA sequencing data to train a convolutional neural network. We were able to confirm in vivo that the model is able to predict the promoter transcription initiation frequency (TIF) of new promoter sequences. Here, we provide an online tool for promoter design (ProD) in E. coli, which can be used to tailor output sequences of desired promoter TIF or predict the TIF of a custom sequence.


Subjects
Escherichia coli Proteins , Escherichia coli , Escherichia coli/genetics , Escherichia coli/metabolism , Escherichia coli Proteins/metabolism , High-Throughput Nucleotide Sequencing , Genetic Promoter Regions , Synthetic Biology
5.
Bioinformatics ; 38(3): 597-603, 2022 01 12.
Article in English | MEDLINE | ID: mdl-34718418

ABSTRACT

MOTIVATION: The adoption of current single-cell DNA methylation sequencing protocols is hindered by incomplete coverage, outlining the need for effective imputation techniques. The task of imputing single-cell (methylation) data requires models to build an understanding of underlying biological processes. RESULTS: We adapt the transformer neural network architecture to operate on methylation matrices through combining axial attention with sliding window self-attention. The obtained CpG Transformer displays state-of-the-art performances on a wide range of scBS-seq and scRRBS-seq datasets. Furthermore, we demonstrate the interpretability of CpG Transformer and illustrate its rapid transfer learning properties, allowing practitioners to train models on new datasets with a limited computational and time budget. AVAILABILITY AND IMPLEMENTATION: CpG Transformer is freely available at https://github.com/gdewael/cpg-transformer. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
DNA Methylation , Epigenome , Base Sequence , DNA Sequence Analysis/methods , Computer Neural Networks
6.
Article in English | MEDLINE | ID: mdl-33125335

ABSTRACT

In genomics, a wide range of machine learning methodologies have been investigated to annotate biological sequences for positions of interest such as transcription start sites, translation initiation sites, methylation sites, splice sites and promoter start sites. In recent years, this area has been dominated by convolutional neural networks, which typically outperform previously-designed methods as a result of automated scanning for influential sequence motifs. However, those architectures do not allow for the efficient processing of the full genomic sequence. As an improvement, we introduce transformer architectures for whole genome sequence labeling tasks. We show that these architectures, recently introduced for natural language processing, are better suited for processing and annotating long DNA sequences. We apply existing networks and introduce an optimized method for the calculation of attention from input nucleotides. To demonstrate this, we evaluate our architecture on several sequence labeling tasks, and find it to achieve state-of-the-art performances when comparing it to specialized models for the annotation of transcription start sites, translation initiation sites and 4mC methylation in E. coli.


Subjects
Escherichia coli , Genomics , Base Sequence , Machine Learning , Computer Neural Networks
7.
Intensive Crit Care Nurs ; 68: 103117, 2022 Feb.
Article in English | MEDLINE | ID: mdl-34393009

ABSTRACT

OBJECTIVE: To determine risk factors for pressure injury in distinct intensive care subpopulations according to admission type (Medical; Surgical elective; Surgical emergency; Trauma/Burns). METHODOLOGY/DESIGN: Predictive modelling using generalised linear mixed models with backward elimination on prospectively gathered data of 13 044 adult intensive care patients. SETTINGS: 1110 intensive care units, 89 countries worldwide. MAIN OUTCOME MEASURES: Pressure injury risk factors. RESULTS: A generalised linear mixed model including admission type outperformed a model without admission type (p = 0.004). Admission type Trauma/Burns was not retained in the model and was excluded from further analyses. For the other three admission types (Medical, Surgical elective, and Surgical emergency), backward elimination resulted in distinct prediction models with 23, 17, and 16 predictors, respectively, and only five common predictors. The Area Under the Receiver Operating Characteristic Curve was 0.79 for Medical admissions and 0.88 for both the Surgical elective and Surgical emergency models. CONCLUSIONS: Risk factors for pressure injury differ according to whether intensive care patients have been admitted for medical reasons, or elective or emergency surgery. Prediction models for pressure injury should target distinct subpopulations with differing pressure injury risk profiles. Type of intensive care admission is a simple and easily retrievable parameter to distinguish between such subgroups.


Subjects
Critical Care , Intensive Care Units , Pressure Injury , Adult , Humans , Hospital Mortality , Hospitalization , Retrospective Studies , Risk Factors , ROC Curve
8.
Comput Struct Biotechnol J ; 19: 6157-6168, 2021.
Article in English | MEDLINE | ID: mdl-34938408

ABSTRACT

Today, machine learning methods are commonly deployed for bacterial species identification using MALDI-TOF mass spectrometry data. However, most of the studies reported in the literature only consider very traditional machine learning methods on small datasets that contain a limited number of species. In this paper we present benchmarking results on an unprecedented scale for a wide range of machine learning methods, using datasets that contain almost 100,000 spectra and more than 1000 different species. The size and the diversity of the data allow us to compare three important identification scenarios that are often not distinguished in the literature, i.e., identification for novel biological replicates, novel strains and novel species that are not present in the training data. The results demonstrate that in all three scenarios acceptable identification rates are obtained, but the numbers are typically lower than those reported in studies with a more limited analysis. Using hierarchical classification methods, we also demonstrate that taxonomic information is in general not well preserved in MALDI-TOF mass spectrometry data. For the novel species scenario, we apply for the first time neural networks with Monte Carlo dropout, which have been shown to be successful in other domains, such as computer vision, for the detection of novel species.
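
One common way to turn Monte Carlo dropout into a novelty score is sketched below. The abstract does not state which uncertainty measure the authors use, so the predictive entropy, the threshold and the function names here are assumptions chosen for illustration. The input is a list of class-probability vectors, one per stochastic forward pass with dropout kept active.

```python
import math
from typing import Sequence


def mc_dropout_entropy(passes: Sequence[Sequence[float]]) -> float:
    """Entropy (in nats) of the class probabilities averaged over all passes."""
    n_passes = len(passes)
    n_classes = len(passes[0])
    mean = [sum(p[k] for p in passes) / n_passes for k in range(n_classes)]
    return -sum(q * math.log(q) for q in mean if q > 0.0)


def is_novel(passes: Sequence[Sequence[float]], threshold: float) -> bool:
    """Flag a spectrum as a possible novel species when the entropy is high."""
    return mc_dropout_entropy(passes) > threshold


# Passes that agree on one species give a low entropy ...
known = [[0.98, 0.01, 0.01]] * 3
# ... while passes that disagree give an entropy close to log(3).
unknown = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]

print(is_novel(known, 0.5))    # False
print(is_novel(unknown, 0.5))  # True
```

The threshold of 0.5 nats is arbitrary; in practice it would have to be tuned on held-out data.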

9.
Biotechnol Adv ; 53: 107858, 2021 12.
Article in English | MEDLINE | ID: mdl-34695560

ABSTRACT

Machine learning is becoming an integral part of the Design-Build-Test-Learn cycle in biotechnology. Machine learning models learn from collected datasets such as omics data and predict a defined outcome, which has led to both production improvements and predictive tools in the field. Robust prediction of the behavior of microbial cell factories and production processes not only greatly increases our understanding of the function of such systems, but also provides significant savings of development time. However, many pitfalls when modeling biological data - bad fit, noisy data, model instability, low data quantity and imbalances in the data - cause models to suffer in their performance. Here we provide an accessible, in-depth analysis on the problems created by these pitfalls, as well as means of their detection and mediation, with a focus on supervised learning. Assessing the state of the art, we show that, currently, in-depth analyses of model performance are often absent and must be improved. This review provides a toolbox for the analysis of model robustness and performance, and simultaneously proposes a standard for the community to facilitate future work. It is further accompanied by an interactive online tutorial on the discussed issues.


Subjects
Biotechnology , Machine Learning
10.
mSystems ; 6(5): e0055121, 2021 Oct 26.
Article in English | MEDLINE | ID: mdl-34546074

ABSTRACT

Microbiome management research and applications rely on temporally resolved measurements of community composition. Current technologies to assess community composition make use of either cultivation or sequencing of genomic material, which can become time-consuming and/or laborious in case high-throughput measurements are required. Here, using data from a shrimp hatchery as an economically relevant case study, we combined 16S rRNA gene amplicon sequencing and flow cytometry data to develop a computational workflow that allows the prediction of taxon abundances based on flow cytometry measurements. The first stage of our pipeline consists of a classifier to predict the presence or absence of the taxon of interest, which yielded an average accuracy of 88.13% ± 4.78% across the top 50 operational taxonomic units (OTUs) of our data set. In the second stage, this classifier was combined with a regression model to predict the relative abundances of the taxon of interest, which yielded an average R² of 0.35 ± 0.24 across the top 50 OTUs of our data set. Application of the models to flow cytometry time series data showed that the generated models can predict the temporal dynamics of a large fraction of the investigated taxa. Using cell sorting, we validated that the model correctly associates taxa to regions in the cytometric fingerprint, where they are detected using 16S rRNA gene amplicon sequencing. Finally, we applied the approach of our pipeline to two other data sets of microbial ecosystems. This pipeline represents an addition to the expanding toolbox for flow cytometry-based monitoring of bacterial communities and complements the current plating- and marker gene-based methods. IMPORTANCE: Monitoring of microbial community composition is crucial for both microbiome management research and applications. Existing technologies, such as plating and amplicon sequencing, can become laborious and expensive when high-throughput measurements are required. In recent years, flow cytometry-based measurements of community diversity have been shown to correlate well with those derived from 16S rRNA gene amplicon sequencing in several aquatic ecosystems, suggesting that there is a link between the taxonomic community composition and phenotypic properties as derived through flow cytometry. Here, we further integrated 16S rRNA gene amplicon sequencing and flow cytometry survey data in order to construct models that enable the prediction of both the presence and the abundances of individual bacterial taxa in mixed communities using flow cytometric fingerprinting. The developed pipeline holds great potential to be integrated into routine monitoring schemes and early warning systems for biotechnological applications.
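
The two-stage design described in this abstract (a presence/absence classifier followed by an abundance regressor) can be sketched as follows. The abstract does not specify how the two models are combined, so the hurdle-style rule below (predict zero when the taxon is classified as absent) is an assumption, and the two stand-in models are hypothetical placeholders rather than the authors' fitted models.

```python
from typing import Callable, Sequence

Features = Sequence[float]


def two_stage_predict(
    features: Features,
    presence_model: Callable[[Features], bool],
    abundance_model: Callable[[Features], float],
) -> float:
    """Stage 1 decides presence; stage 2 predicts relative abundance if present."""
    if not presence_model(features):
        return 0.0  # taxon classified as absent
    # Clip the regression output so it stays a valid relative abundance.
    return min(1.0, max(0.0, abundance_model(features)))


# Hypothetical stand-ins for trained models on a cytometric fingerprint.
def toy_presence(features: Features) -> bool:
    return features[0] > 0.2


def toy_abundance(features: Features) -> float:
    return 0.5 * features[0] + 0.1


print(two_stage_predict([0.1, 0.3], toy_presence, toy_abundance))  # 0.0
print(round(two_stage_predict([0.6, 0.3], toy_presence, toy_abundance), 3))
```

Separating the two stages means that the regressor only has to model abundance for samples in which the taxon is actually expected to occur.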

11.
Brief Bioinform ; 22(5)2021 09 02.
Article in English | MEDLINE | ID: mdl-33834200

ABSTRACT

The effectiveness of deep learning methods can be largely attributed to the automated extraction of relevant features from raw data. In the field of functional genomics, this generally concerns the automatic selection of relevant nucleotide motifs from DNA sequences. To benefit from automated learning methods, new strategies are required that unveil the decision-making process of trained models. In this paper, we present a new approach that has been successful in gathering insights on the transcription process in Escherichia coli. This work builds upon a transformer-based neural network framework designed for prokaryotic genome annotation purposes. We find that the majority of subunits (attention heads) of the model are specialized towards identifying transcription factors and are able to successfully characterize both their binding sites and consensus sequences, uncovering both well-known and potentially novel elements involved in the initiation of the transcription process. With the specialization of the attention heads occurring automatically, we believe transformer models to be of high interest towards the creation of explainable neural networks in this field.


Subjects
Deep Learning , Escherichia coli/genetics , Bacterial Genome , Genomics/methods , Transcription Initiation Site , Base Sequence , Binding Sites , Bacterial DNA/genetics , Bacterial DNA/metabolism , Escherichia coli/metabolism , Genetic Promoter Regions/genetics , Transcription Factors/genetics , Transcription Factors/metabolism
13.
mSphere ; 6(1)2021 02 03.
Article in English | MEDLINE | ID: mdl-33536320

ABSTRACT

Microbial flow cytometry can rapidly characterize the status of microbial communities. Upon measurement, large amounts of quantitative single-cell data are generated, which need to be analyzed appropriately. Cytometric fingerprinting approaches are often used for this purpose. Traditional approaches either require a manual annotation of regions of interest, do not fully consider the multivariate characteristics of the data, or result in many community-describing variables. To address these shortcomings, we propose an automated model-based fingerprinting approach based on Gaussian mixture models, which we call PhenoGMM. The method successfully quantifies changes in microbial community structure based on flow cytometry data, which can be expressed in terms of cytometric diversity. We evaluate the performance of PhenoGMM using data sets from both synthetic and natural ecosystems and compare the method with a generic binning fingerprinting approach. PhenoGMM supports the rapid and quantitative screening of microbial community structure and dynamics. IMPORTANCE: Microorganisms are vital components in various ecosystems on Earth. In order to investigate the microbial diversity, researchers have largely relied on the analysis of 16S rRNA gene sequences from DNA. Flow cytometry has been proposed as an alternative technology to characterize microbial community diversity and dynamics. The technology enables a fast measurement of optical properties of individual cells. So-called fingerprinting techniques are needed in order to describe microbial community diversity and dynamics based on flow cytometry data. In this work, we propose a more advanced fingerprinting strategy based on Gaussian mixture models. We evaluated our workflow on data sets from both synthetic and natural ecosystems, illustrating its general applicability for the analysis of microbial flow cytometry data. PhenoGMM supports a rapid and quantitative analysis of microbial community structure using flow cytometry.


Subjects
Flow Cytometry/methods , Microbiota , Normal Distribution , Biodiversity
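
To make the idea of a cytometric diversity concrete, the sketch below computes Hill numbers from the number of cells assigned to each mixture component in one sample. Treating the per-component cell counts as the fingerprint and using Hill numbers as the diversity measure are assumptions made for illustration; the abstract itself only states that the fingerprint can be expressed in terms of cytometric diversity.

```python
import math
from typing import Sequence


def hill_number(counts: Sequence[int], q: float) -> float:
    """Hill diversity of order q for cell counts per mixture component."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    if q == 1:
        # Order 1 is the limit case: exponential of the Shannon entropy.
        return math.exp(-sum(p * math.log(p) for p in props))
    return sum(p ** q for p in props) ** (1.0 / (1.0 - q))


# Hypothetical fingerprints: cells per Gaussian mixture component.
even_sample = [25, 25, 25, 25]
uneven_sample = [97, 1, 1, 1]

print(round(hill_number(even_sample, 1), 3))    # 4.0
print(round(hill_number(uneven_sample, 2), 3))  # 1.062
```

Both samples occupy four components, but the uneven sample has an order-2 diversity close to 1, reflecting that almost all cells fall in a single component.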
14.
ISME J ; 15(1): 354-358, 2021 01.
Article in English | MEDLINE | ID: mdl-32879459

ABSTRACT

Variations in the gut microbiome have been associated with changes in health state such as Crohn's disease (CD). Most surveys characterize the microbiome through analysis of the 16S rRNA gene. An alternative technology that can be used is flow cytometry. In this report, we reanalyzed a disease cohort that has been characterized by both technologies. Changes in microbial community structure are reflected in both types of data. We demonstrate that cytometric fingerprints can be used as a diagnostic tool in order to classify samples according to CD state. These results highlight the potential of flow cytometry to perform rapid diagnostics of microbiome-associated diseases.


Subjects
Crohn Disease , Gastrointestinal Microbiome , Microbiota , Crohn Disease/diagnosis , Feces , Humans , 16S Ribosomal RNA/genetics
15.
Nat Commun ; 11(1): 5822, 2020 11 16.
Article in English | MEDLINE | ID: mdl-33199691

ABSTRACT

To engineer synthetic gene circuits, molecular building blocks are developed which can modulate gene expression without interference, mutually or with the host's cell machinery. As the complexity of gene circuits increases, automated design tools and tailored building blocks to ensure perfect tuning of all components in the network are required. Despite the efforts to develop prediction tools that allow forward engineering of promoter transcription initiation frequency (TIF), such a tool is still lacking. Here, we use promoter libraries of E. coli sigma factor 70 (σ70)- and B. subtilis σB-, σF- and σW-dependent promoters to construct prediction models, capable of both predicting promoter TIF and orthogonality of the σ-specific promoters. This is achieved by training a convolutional neural network with high-throughput DNA sequencing data from fluorescence-activated cell sorted promoter libraries. This model functions as the base of the online promoter design tool (ProD), providing tailored promoters for tailored genetic systems.


Subjects
Bacillus subtilis/genetics , Escherichia coli/genetics , Genetic Promoter Regions , Sigma Factor/metabolism , Base Sequence , Fluorescence , Gene Library , Genotype , Genetic Models , Reproducibility of Results , Genetic Transcription Initiation
16.
PLoS One ; 15(6): e0235117, 2020.
Article in English | MEDLINE | ID: mdl-32584872

ABSTRACT

Early prediction of in-hospital mortality can improve patient outcome. Current prediction models for in-hospital mortality focus mainly on specific pathologies. Structured pathology data are readily available hospital-wide and are primarily used for, e.g., financing purposes. We aim to build a predictive model at admission using the International Classification of Diseases (ICD) codes as predictors and investigate the effect of the self-evident DNR ("Do Not Resuscitate") diagnosis codes and palliative care codes. We compare the models using ICD-10-CM codes with Risk of Mortality (RoM) and Charlson Comorbidity Index (CCI) as predictors using the Random Forests modeling approach. We use the Present on Admission flag to distinguish which diagnoses are present on admission. The study is performed in a single center (Ghent University Hospital) with the inclusion of 36 368 patients, all discharged in 2017. Our model at admission using ICD-10-CM codes (AUCROC = 0.9477) outperforms the models using RoM (AUCROC = 0.8797) and CCI (AUCROC = 0.7435). We confirmed that DNR and palliative care codes have a strong impact on the model, resulting in a decrease of 7% for the ICD model (AUCROC = 0.8791) at admission. We therefore conclude that a model with sufficient predictive performance can be derived from structured pathology data and, if available in real time, can serve as a prerequisite to develop a practical clinical decision support system for physicians.


Subjects
Factual Databases , Hospital Mortality , Hospitalization , Biological Models , Palliative Care , Clinical Pathology , Adult , Aged , Aged 80 and over , Female , Humans , Male , Middle Aged , Predictive Value of Tests
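
The reported "decrease of 7%" can be checked against the two AUCROC values given in the abstract (0.9477 and 0.8791). The abstract does not say whether the figure is an absolute or a relative change, so the small sketch below computes both; the helper name is hypothetical.

```python
from typing import Tuple


def auc_drop(before: float, after: float) -> Tuple[float, float]:
    """Return the (absolute, relative) decrease between two AUCROC values."""
    absolute = before - after
    return absolute, absolute / before


absolute, relative = auc_drop(0.9477, 0.8791)
print(f"absolute drop: {absolute:.4f}")  # absolute drop: 0.0686
print(f"relative drop: {relative:.1%}")  # relative drop: 7.2%
```

Both readings round to 7%, so the reported figure is consistent with the AUCROC values under either interpretation.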
17.
Anal Chem ; 92(11): 7523-7531, 2020 06 02.
Article in English | MEDLINE | ID: mdl-32330016

ABSTRACT

In diagnostics of infectious diseases, matrix-assisted laser desorption/ionization-time-of-flight mass spectrometry (MALDI-TOF MS) can be applied for the identification of pathogenic microorganisms. However, to achieve a trustworthy identification from MALDI-TOF MS data, a significant amount of biomass should be considered. The bacterial load that potentially occurs in a sample is therefore routinely amplified by culturing, which is a time-consuming procedure. In this paper, we show that culturing can be avoided by conducting MALDI-TOF MS on individual bacterial cells. This results in a more rapid identification of species with an acceptable accuracy. We propose a deep learning architecture to analyze the data and compare its performance with traditional supervised machine learning algorithms. We illustrate our workflow on a large data set that contains bacterial species related to urinary tract infections. Overall we obtain accuracies up to 85% in discriminating five different species.


Subjects
Deep Learning , Gram-Negative Bacteria/cytology , Gram-Negative Bacteria/pathogenicity , Gram-Positive Bacteria/cytology , Gram-Positive Bacteria/pathogenicity , Single-Cell Analysis , Aerosols/chemistry , Gram-Negative Bacteria/isolation & purification , Gram-Positive Bacteria/isolation & purification , Matrix-Assisted Laser Desorption-Ionization Mass Spectrometry
18.
Cytometry A ; 97(7): 713-726, 2020 07.
Article in English | MEDLINE | ID: mdl-31889414

ABSTRACT

Investigating phenotypic heterogeneity can help to better understand and manage microbial communities. However, characterizing phenotypic heterogeneity remains a challenge, as there is no standardized analysis framework. Several optical tools are available, such as flow cytometry and Raman spectroscopy, which describe optical properties of the individual cell. In this work, we compare Raman spectroscopy and flow cytometry to study phenotypic heterogeneity in bacterial populations. The growth stages of three replicate Escherichia coli populations were characterized using both technologies. Our findings show that flow cytometry detects and quantifies shifts in phenotypic heterogeneity at the population level due to its high-throughput nature. Raman spectroscopy, on the other hand, offers a much higher resolution at the single-cell level (i.e., more biochemical information is recorded). Therefore, it can identify distinct phenotypic populations when coupled with analyses tailored toward single-cell data. In addition, it provides information about biomolecules that are present, which can be linked to cell functionality. We propose a computational workflow to distinguish between bacterial phenotypic populations using Raman spectroscopy and validated this approach with an external data set. We recommend using flow cytometry to quantify phenotypic heterogeneity at the population level, and Raman spectroscopy to perform a more in-depth analysis of heterogeneity at the single-cell level. © 2019 International Society for Advancement of Cytometry.


Subjects
Bacteria , Raman Spectrum Analysis , Escherichia coli/genetics , Flow Cytometry , Phenotype , Single-Cell Analysis
19.
Brief Bioinform ; 21(1): 262-271, 2020 Jan 17.
Article in English | MEDLINE | ID: mdl-30329015

ABSTRACT

Supervised machine learning techniques have traditionally been very successful at reconstructing biological networks, such as protein-ligand interaction, protein-protein interaction and gene regulatory networks. Many supervised techniques for network prediction use linear models on a possibly nonlinear pairwise feature representation of edges. Recently, much emphasis has been placed on the correct evaluation of such supervised models. It is vital to distinguish between using a model to either predict new interactions in a given network or to predict interactions for a new vertex not present in the original network. This distinction matters because (i) the performance might dramatically differ between the prediction settings and (ii) tuning the model hyperparameters to obtain the best possible model depends on the setting of interest. Specific cross-validation schemes need to be used to assess the performance in such different prediction settings. In this work we discuss a state-of-the-art kernel-based network inference technique called two-step kernel ridge regression. We show that this regression model can be trained efficiently, with a time complexity scaling with the number of vertices rather than the number of edges. Furthermore, this framework leads to a series of cross-validation shortcuts that allow one to rapidly estimate the model performance for any relevant network prediction setting. This allows computational biologists to fully assess the capabilities of their models. The machine learning techniques with the algebraic shortcuts are implemented in the RLScore software package: https://github.com/aatapa/RLScore.
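
As a hedged illustration of why the training cost of two-step kernel ridge regression scales with the number of vertices, the model is commonly written as below. The notation is assumed for this sketch and is not given in the abstract: $Y \in \mathbb{R}^{n \times m}$ is the matrix of observed interactions between $n$ vertices of one type and $m$ of the other, $K$ and $G$ are the corresponding kernel matrices, and $\lambda_u, \lambda_v > 0$ are the two regularization parameters.

```latex
% Fitted values on the training network: one ridge regression per vertex set.
\hat{Y} = K \,(K + \lambda_u I)^{-1} \, Y \, (G + \lambda_v I)^{-1} \, G

% Prediction for a new pair (u, v), with kernel vectors k(u) \in \mathbb{R}^{n}
% and g(v) \in \mathbb{R}^{m} evaluated against the training vertices:
f(u, v) = k(u)^{\top} (K + \lambda_u I)^{-1} \, Y \, (G + \lambda_v I)^{-1} \, g(v)
```

Under this formulation only the $n \times n$ and $m \times m$ systems have to be factorized, at a cost of roughly $O(n^3 + m^3)$, instead of a single system over all $nm$ vertex pairs.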

20.
mSystems ; 4(5)2019 Sep 10.
Article in English | MEDLINE | ID: mdl-31506260

ABSTRACT

High-nucleic-acid (HNA) and low-nucleic-acid (LNA) bacteria are two operational groups identified by flow cytometry (FCM) in aquatic systems. A number of reports have shown that HNA cell density correlates strongly with heterotrophic production, while LNA cell density does not. However, which taxa are specifically associated with these groups, and by extension, productivity has remained elusive. Here, we addressed this knowledge gap by using a machine learning-based variable selection approach that integrated FCM and 16S rRNA gene sequencing data collected from 14 freshwater lakes spanning a broad range in physicochemical conditions. There was a strong association between bacterial heterotrophic production and HNA absolute cell abundances (R² = 0.65), but not with the more abundant LNA cells. This solidifies findings, mainly from marine systems, that HNA and LNA bacteria could be considered separate functional groups, the former contributing a disproportionately large share of carbon cycling. Taxa selected by the models could predict HNA and LNA absolute cell abundances at all taxonomic levels. Selected operational taxonomic units (OTUs) ranged from low to high relative abundance and were mostly lake system specific (89.5% to 99.2%). A subset of selected OTUs was associated with both LNA and HNA groups (12.5% to 33.3%), suggesting either phenotypic plasticity or within-OTU genetic and physiological heterogeneity. These findings may lead to the identification of system-specific putative ecological indicators for heterotrophic productivity. Generally, our approach allows for the association of OTUs with specific functional groups in diverse ecosystems in order to improve our understanding of (microbial) biodiversity-ecosystem functioning relationships. IMPORTANCE: A major goal in microbial ecology is to understand how microbial community structure influences ecosystem functioning. Various methods to directly associate bacterial taxa to functional groups in the environment are being developed. In this study, we applied machine learning methods to relate taxonomic data obtained from marker gene surveys to functional groups identified by flow cytometry. This allowed us to identify the taxa that are associated with heterotrophic productivity in freshwater lakes and indicated that the key contributors were highly system specific, regularly rare members of the community, and that some could possibly switch between being low and high contributors. Our approach provides a promising framework to identify taxa that contribute to ecosystem functioning and can be further developed to explore microbial contributions beyond heterotrophic production.
