Results 1 - 20 of 5,537
1.
Proc Natl Acad Sci U S A ; 121(17): e2318362121, 2024 Apr 23.
Article in English | MEDLINE | ID: mdl-38630718

ABSTRACT

Design of hardware based on biological principles of neuronal computation and plasticity in the brain is a leading approach to realizing energy- and sample-efficient AI and learning machines. An important factor in selection of the hardware building blocks is the identification of candidate materials with physical properties suitable to emulate the large dynamic ranges and varied timescales of neuronal signaling. Previous work has shown that the all-or-none spiking behavior of neurons can be mimicked by threshold switches utilizing material phase transitions. Here, we demonstrate that devices based on a prototypical metal-insulator-transition material, vanadium dioxide (VO2), can be dynamically controlled to access a continuum of intermediate resistance states. Furthermore, the timescale of their intrinsic relaxation can be configured to match a range of biologically relevant timescales from milliseconds to seconds. We exploit these device properties to emulate three aspects of neuronal analog computation: fast (~1 ms) spiking in a neuronal soma compartment, slow (~100 ms) spiking in a dendritic compartment, and ultraslow (~1 s) biochemical signaling involved in temporal credit assignment for a recently discovered biological mechanism of one-shot learning. Simulations show that an artificial neural network using properties of VO2 devices to control an agent navigating a spatial environment can learn an efficient path to a reward in up to fourfold fewer trials than standard methods. The phase relaxations described in our study may be engineered in a variety of materials and can be controlled by thermal, electrical, or optical stimuli, suggesting further opportunities to emulate biological learning in neuromorphic hardware.
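To make these timescales concrete, here is a minimal sketch (not the authors' device model) that treats each compartment as a leaky state variable relaxing on the quoted timescale; the dynamics, parameters, and stimulus are all illustrative assumptions.

```python
# Minimal sketch: three leaky state variables relaxing on the biologically
# inspired timescales named in the abstract. Not a physical VO2 model;
# all parameters are illustrative.
import numpy as np

dt = 1e-4                                                   # 0.1 ms step
taus = {"soma": 1e-3, "dendrite": 0.1, "eligibility": 1.0}  # seconds
state = {k: 0.0 for k in taus}
times = np.arange(0.0, 3.0, dt)
trace = {k: np.empty(len(times)) for k in taus}

for i, t in enumerate(times):
    pulse = 1.0 if i == 0 else 0.0          # single stimulus at t = 0
    for k, tau in taus.items():
        state[k] += dt * (-state[k] / tau) + pulse   # dx/dt = -x/tau + input
        trace[k][i] = state[k]

for k in taus:
    peak = trace[k].max()
    t_decay = times[np.argmax(trace[k] < peak / np.e)]   # 1/e decay time
    print(f"{k}: decays to 1/e of peak after ~{t_decay * 1e3:.1f} ms")
```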


Subject(s)
Learning , Neural Networks, Computer , Computers , Brain/physiology , Neurons/physiology
2.
Proc Natl Acad Sci U S A ; 121(8): e2314228121, 2024 Feb 20.
Article in English | MEDLINE | ID: mdl-38363866

ABSTRACT

In problems such as variable selection and graph estimation, models are characterized by Boolean logical structure such as the presence or absence of a variable or an edge. Consequently, false-positive error or false-negative error can be specified as the number of variables/edges that are incorrectly included or excluded in an estimated model. However, there are several other problems such as ranking, clustering, and causal inference in which the associated model classes do not admit transparent notions of false-positive and false-negative errors due to the lack of an underlying Boolean logical structure. In this paper, we present a generic approach to endow a collection of models with partial order structure, which leads to a hierarchical organization of model classes as well as natural analogs of false-positive and false-negative errors. We describe model selection procedures that provide false-positive error control in our general setting, and we illustrate their utility with numerical experiments.
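As a concrete anchor for the Boolean special case described above, the sketch below counts false-positive and false-negative errors in variable selection, where a model is a set of selected variables and the partial order is set inclusion; the variable names are invented for illustration.

```python
# Boolean special case: a model is a set of selected variables, and FP/FN
# errors count variables incorrectly included or excluded. Illustrative sets.
true_model = {"x1", "x3", "x4"}
estimated = {"x1", "x2", "x4", "x5"}

false_positives = estimated - true_model   # included, but not truly present
false_negatives = true_model - estimated   # truly present, but excluded

print("FP:", sorted(false_positives))      # ['x2', 'x5']
print("FN:", sorted(false_negatives))      # ['x3']
```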

3.
Proc Natl Acad Sci U S A ; 121(34): e2402267121, 2024 Aug 20.
Article in English | MEDLINE | ID: mdl-39136986

ABSTRACT

Despite ethical and historical arguments for removing race from clinical algorithms, the consequences of removal remain unclear. Here, we highlight a largely undiscussed consideration in this debate: varying data quality of input features across race groups. For example, family history of cancer is an essential predictor in cancer risk prediction algorithms but is less reliably documented for Black participants and may therefore be less predictive of cancer outcomes. Using data from the Southern Community Cohort Study, we assessed whether race adjustments could allow risk prediction models to capture varying data quality by race, focusing on colorectal cancer risk prediction. We analyzed 77,836 adults with no history of colorectal cancer at baseline. The predictive value of self-reported family history was greater for White participants than for Black participants. We compared two cancer risk prediction algorithms: a race-blind algorithm, which included standard colorectal cancer risk factors but not race, and a race-adjusted algorithm, which additionally included race. Relative to the race-blind algorithm, the race-adjusted algorithm improved predictive performance, as measured by goodness of fit in a likelihood ratio test (P < 0.001) and by the area under the receiver operating characteristic curve among Black participants (P = 0.006). Because the race-blind algorithm underpredicted risk for Black participants, the race-adjusted algorithm increased the fraction of Black participants in the predicted high-risk group, potentially increasing access to screening. More broadly, this study shows that race adjustments may be beneficial when the data quality of key predictors in clinical algorithms differs by race group.
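A hedged sketch of the core comparison, on synthetic data rather than the Southern Community Cohort Study: the race-blind logistic model is nested inside the race-adjusted one, so the gain in goodness of fit can be assessed with a likelihood ratio test. All variable names and effect sizes below are invented for illustration.

```python
# Compare a race-blind logistic model to one that adds a group indicator,
# via a likelihood ratio test. Synthetic data; effect sizes are assumptions.
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 5000
family_history = rng.binomial(1, 0.3, n)
group = rng.binomial(1, 0.5, n)        # illustrative binary group indicator
logit = -2.0 + 1.0 * family_history + 0.7 * group  # group raises baseline risk
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_blind = sm.add_constant(family_history.reshape(-1, 1).astype(float))
X_adj = sm.add_constant(np.column_stack([family_history, group]).astype(float))

fit_blind = sm.Logit(y, X_blind).fit(disp=0)
fit_adj = sm.Logit(y, X_adj).fit(disp=0)

# LR statistic: 2 * (log-lik of full model - log-lik of nested model) ~ chi2(1)
lr_stat = 2 * (fit_adj.llf - fit_blind.llf)
print(f"LR = {lr_stat:.1f}, p = {chi2.sf(lr_stat, df=1):.2e}")
```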


Subject(s)
Algorithms , Colorectal Neoplasms , Humans , Colorectal Neoplasms/diagnosis , Colorectal Neoplasms/ethnology , Colorectal Neoplasms/epidemiology , Male , Female , Middle Aged , Data Accuracy , White People/statistics & numerical data , Black or African American/statistics & numerical data , Risk Factors , Aged , Adult , Cohort Studies , Racial Groups/statistics & numerical data , Risk Assessment/methods
4.
Proc Natl Acad Sci U S A ; 121(23): e2317772121, 2024 Jun 04.
Article in English | MEDLINE | ID: mdl-38820000

ABSTRACT

Stopping power is the rate at which a material absorbs the kinetic energy of a charged particle passing through it, one of many properties needed over a wide range of thermodynamic conditions to model inertial fusion implosions. First-principles stopping calculations are classically challenging because they involve the dynamics of large electronic systems far from equilibrium, with accuracies that are particularly difficult to constrain and assess in the warm-dense conditions preceding ignition. Here, we describe a protocol for using a fault-tolerant quantum computer to calculate stopping power from a first-quantized representation of the electrons and projectile. Our approach builds upon the electronic structure block encodings of Su et al. [PRX Quantum 2, 040332 (2021)], adapting and optimizing those algorithms to estimate observables of interest from the non-Born-Oppenheimer dynamics of multiple particle species at finite temperature. We also work out the constant factors associated with an implementation of a high-order Trotter approach to simulating a grid representation of these systems. Ultimately, we report logical qubit requirements and leading-order Toffoli costs for computing the stopping power of various projectile/target combinations relevant to interpreting and designing inertial fusion experiments. We estimate that scientifically interesting and classically intractable stopping power calculations can be quantum simulated with roughly the same number of logical qubits, and about one hundred times more Toffoli gates, than are required for state-of-the-art quantum simulations of industrially relevant molecules such as FeMoco or P450.

5.
Proc Natl Acad Sci U S A ; 121(14): e2316616121, 2024 Apr 02.
Article in English | MEDLINE | ID: mdl-38551839

ABSTRACT

Motivated by the implementation of a SARS-CoV-2 sewer surveillance system in Chile during the COVID-19 pandemic, we propose a set of mathematical and algorithmic tools to identify the location of an outbreak under uncertainty in the network structure. Given an upper bound on the number of samples that can be taken on any given day, our framework detects an unknown infected node by adaptively sampling different network nodes on different days. Crucially, despite the uncertainty in the network, the method identifies the infected node unambiguously, albeit at an extra cost in time. The framework relies on a well-chosen strategy that selects new nodes to test sequentially, with a heuristic that balances the granularity of the information obtained from the samples. We tested our model extensively on real and synthetic networks, showing that uncertainty in the underlying graph incurs only a limited increase in the number of iterations, indicating that the methodology is applicable in practice.
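The sketch below is a simplification of this adaptive idea, not the paper's algorithm: on a sewer-like directed graph whose edges follow the flow, a sample at node v is positive exactly when the infected node drains through v, so each day's test can roughly halve the candidate set. The toy network and the halving heuristic are assumptions for illustration.

```python
# Adaptive search for one infected node on a sewer-like DAG: a test at v is
# positive iff the infected node lies upstream of v. Illustrative sketch only.
import networkx as nx

def upstream(G, v):
    """All nodes whose flow passes through v, including v itself."""
    return nx.ancestors(G, v) | {v}

def locate(G, infected):
    candidates = set(G.nodes)
    days = 0
    while len(candidates) > 1:
        days += 1
        # heuristic: test the node whose upstream set best halves the candidates
        v = min(candidates,
                key=lambda u: abs(len(upstream(G, u) & candidates)
                                  - len(candidates) / 2))
        hit = upstream(G, v) & candidates
        candidates = hit if infected in upstream(G, v) else candidates - hit
    return candidates.pop(), days

# toy network: edges point in the direction of flow toward node 6
G = nx.DiGraph([(0, 2), (1, 2), (2, 5), (3, 4), (4, 5), (5, 6)])
node, days = locate(G, infected=3)
print(f"located node {node} after {days} days of sampling")
```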


Subject(s)
COVID-19 , Pandemics , Humans , Uncertainty , COVID-19/epidemiology , Disease Outbreaks , SARS-CoV-2
6.
Proc Natl Acad Sci U S A ; 121(1): e2313269120, 2024 Jan 02.
Article in English | MEDLINE | ID: mdl-38147549

ABSTRACT

Quantum computers have been proposed to solve a number of important problems, such as discovering new drugs, designing new catalysts for fertilizer production, breaking encryption protocols, optimizing financial portfolios, and implementing new artificial intelligence applications. Yet, to date, a task as simple as multiplying 3 by 5 is beyond existing quantum hardware. This article examines the difficulties that would need to be solved for quantum computers to live up to their promises. I discuss the whole stack of technologies that has been envisioned to build a quantum computer, from the top layers (the actual algorithms and associated applications) down to the very bottom ones (the quantum hardware, its control electronics, cryogenics, etc.), while not forgetting the crucial intermediate layer of quantum error correction.

7.
Brief Bioinform ; 25(5)2024 Jul 25.
Article in English | MEDLINE | ID: mdl-39302340

ABSTRACT

The Hardy-Weinberg equilibrium (HWE) assumption is essential to many population genetics models, and multiple tests have been developed to assess whether it holds in observed genotypes. Current methods divide into exact tests, applicable only to small populations and few alleles, and approximate goodness-of-fit tests. Existing tests cannot handle ambiguous typing in multi-allelic loci. Here we present a novel exact test, the Unambiguous Multi Allelic Test (UMAT), which is not limited by the number of alleles or the population size and is based on a perturbative approach around the current observations. We show its accuracy in detecting deviations from HWE. We then propose an additional model to handle ambiguous typing, using either sampling into UMAT or a goodness-of-fit test with a variance estimate that takes ambiguity into account, named the Asymptotic Statistical Test with Ambiguity (ASTA). We show the accuracy of ASTA and the possibility of detecting the source of deviation from HWE. Applying these tests to the HLA loci, we reproduce multiple previously reported deviations from HWE and identify a large number of new ones.
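For context, the classical baseline that these exact and asymptotic tests improve upon is the chi-square goodness-of-fit test for HWE at a single biallelic locus, sketched below with illustrative genotype counts.

```python
# Standard chi-square goodness-of-fit test for HWE at one biallelic locus.
# Genotype counts are illustrative.
from scipy.stats import chi2

obs = {"AA": 298, "Aa": 489, "aa": 213}
n = sum(obs.values())
p = (2 * obs["AA"] + obs["Aa"]) / (2 * n)      # frequency of allele A
q = 1 - p

expected = {"AA": n * p**2, "Aa": 2 * n * p * q, "aa": n * q**2}
stat = sum((obs[g] - expected[g]) ** 2 / expected[g] for g in obs)
print(f"chi2 = {stat:.3f}, p = {chi2.sf(stat, df=1):.3f}")  # df = 3 - 1 - 1
```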


Subject(s)
Genetics, Population , Humans , Polymorphism, Genetic , Models, Genetic , Alleles , Gene Frequency , Genotype , Genetic Loci
8.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-39007596

ABSTRACT

Biclustering, the simultaneous clustering of rows and columns of a data matrix, has proved effective in bioinformatics owing to its capacity to produce local rather than global models. It has evolved from a key technique in gene expression data analysis into one of the most widely used approaches for pattern discovery and identification of biological modules, in both descriptive and predictive learning tasks. This survey presents a comprehensive overview of biclustering. We propose an updated taxonomy for its fundamental components (bicluster, biclustering solution, biclustering algorithms, and evaluation measures) and applications. We unify scattered concepts in the literature with new definitions to accommodate the diversity of data types (such as tabular, network, and time series data) and the specificities of biological and biomedical data domains. We further propose a pipeline for biclustering data analysis and discuss practical aspects of incorporating biclustering into real-world applications. We highlight prominent application domains, particularly in bioinformatics, and identify typical biclusters to illustrate the analysis output. Moreover, we discuss important aspects to consider when choosing, applying, and evaluating a biclustering algorithm, and we relate biclustering to other data mining tasks (clustering, pattern mining, classification, triclustering, N-way clustering, and graph mining). The survey thus provides theoretical and practical guidance on biclustering data analysis, demonstrating its potential to uncover actionable insights from complex datasets.
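As a minimal, runnable illustration of one algorithm family the survey covers (not a method proposed in it), the sketch below recovers planted biclusters with spectral co-clustering.

```python
# Spectral co-clustering on a matrix with planted row/column blocks.
from sklearn.cluster import SpectralCoclustering
from sklearn.datasets import make_biclusters

data, rows, cols = make_biclusters((60, 40), n_clusters=3, noise=5,
                                   random_state=0)
model = SpectralCoclustering(n_clusters=3, random_state=0).fit(data)

# each bicluster is a (row subset, column subset) pair: a local pattern
for k in range(3):
    r, c = model.get_indices(k)
    print(f"bicluster {k}: {len(r)} rows x {len(c)} columns")
```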


Subject(s)
Algorithms , Computational Biology , Cluster Analysis , Computational Biology/methods , Gene Expression Profiling/methods , Gene Expression Profiling/statistics & numerical data , Humans
9.
Brief Bioinform ; 25(5)2024 Jul 25.
Article in English | MEDLINE | ID: mdl-39129360

ABSTRACT

The genetic blueprint for the essential functions of life is encoded in DNA, which is translated into proteins, the engines driving most of our metabolic processes. Recent advancements in genome sequencing have unveiled a vast diversity of protein families, but compared with the massive search space of all possible amino acid sequences, the set of known functional families is minimal. One could say nature has a limited protein "vocabulary." A major question for computational biologists, therefore, is whether this vocabulary can be expanded to include useful proteins that went extinct long ago or have never evolved (yet). By merging evolutionary algorithms, machine learning, and bioinformatics, we can develop highly customized "designer proteins." We dub this new subfield of computational evolution, which employs evolutionary algorithms with DNA string representations, biologically accurate molecular evolution, and bioinformatics-informed fitness functions, Evolutionary Algorithms Simulating Molecular Evolution.
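A toy sketch of the ingredients named above, an evolutionary algorithm whose individuals are DNA strings: the per-base mutation operator is biologically motivated, while the fitness function here (similarity to a fixed target sequence) is an illustrative stand-in for a bioinformatics-informed score.

```python
# Toy evolutionary algorithm over DNA strings: point mutations plus
# truncation selection. Target-matching fitness is a stand-in assumption.
import random

random.seed(0)
BASES = "ACGT"
TARGET = "ATGGCCAAAGGTTGA"                 # illustrative target sequence

def fitness(seq):
    return sum(a == b for a, b in zip(seq, TARGET))

def mutate(seq, rate=0.02):
    return "".join(random.choice(BASES) if random.random() < rate else b
                   for b in seq)

pop = ["".join(random.choice(BASES) for _ in TARGET) for _ in range(200)]
for gen in range(300):
    pop.sort(key=fitness, reverse=True)
    if fitness(pop[0]) == len(TARGET):
        break
    # keep the top 50 (elitism) and refill with their mutated offspring
    pop = pop[:50] + [mutate(random.choice(pop[:50])) for _ in range(150)]

best = max(pop, key=fitness)
print(f"generation {gen}: best fitness {fitness(best)}/{len(TARGET)}")
```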


Subject(s)
Algorithms , Computational Biology , Evolution, Molecular , Computational Biology/methods , Proteins/genetics , Proteins/chemistry , Proteins/metabolism , Computer Simulation
10.
Proc Natl Acad Sci U S A ; 120(6): e2207959120, 2023 02 07.
Article in English | MEDLINE | ID: mdl-36716366

ABSTRACT

Colonies of the arboreal turtle ant create networks of trails that link nests and food sources on the graph formed by branches and vines in the canopy of the tropical forest. Ants put down a volatile pheromone on the edges as they traverse them. At each vertex, the next edge to traverse is chosen using a decision rule based on the current pheromone level. There is a bidirectional flow of ants around the network. In a previous field study, it was observed that the trail networks approximately minimize the number of vertices, thus solving a variant of the popular shortest path problem without any central control and with minimal computational resources. We propose a biologically plausible model, based on a variant of the reinforced random walk on a graph, which explains this observation and suggests surprising algorithms for the shortest path problem and its variants. Through simulations and analysis, we show that when the rate of flow of ants does not change, the dynamics converges to the path with the minimum number of vertices, as observed in the field. The dynamics converges to the shortest path when the rate of flow increases with time, so the colony can solve the shortest path problem merely by increasing the flow rate. We also show that to guarantee convergence to the shortest path, bidirectional flow and a decision rule dividing the flow in proportion to the pheromone level are necessary, but convergence to approximately short paths is possible with other decision rules.
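The sketch below is an illustrative simplification of such a model, not the paper's exact dynamics: ants flow in both directions between nest and food, choose the next edge in proportion to pheromone, deposit pheromone on traversed edges, and pheromone evaporates each round. On this toy graph the two-edge path typically ends up with the strongest trail; all rates are assumptions.

```python
# Pheromone-reinforced bidirectional random walk on a toy two-path graph.
# Illustrative parameters; not the paper's exact model.
import random
import networkx as nx

random.seed(1)
G = nx.Graph([("nest", "a"), ("a", "food"),                 # 2-edge path
              ("nest", "b"), ("b", "c"), ("c", "food")])    # 3-edge path
tau = {frozenset(e): 1.0 for e in G.edges}                  # pheromone levels

def walk(src, dst):
    """One ant walks src -> dst, picking edges in proportion to pheromone."""
    path, node = [], src
    while node != dst:
        nbrs = list(G.neighbors(node))
        weights = [tau[frozenset((node, n))] for n in nbrs]
        node_next = random.choices(nbrs, weights=weights)[0]
        path.append(frozenset((node, node_next)))
        node = node_next
    return path

for _ in range(2000):
    for e in walk("nest", "food") + walk("food", "nest"):   # bidirectional flow
        tau[e] += 1.0                                       # deposit
    for e in tau:
        tau[e] = max(0.1, tau[e] * 0.99)                    # evaporation

strongest = sorted(tau, key=tau.get, reverse=True)[:2]
print("strongest edges:", [tuple(e) for e in strongest])    # expect the 2-edge path
```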


Subject(s)
Ants , Animals , Trees , Algorithms , Pheromones , Forests
11.
Proc Natl Acad Sci U S A ; 120(41): e2301842120, 2023 10 10.
Article in English | MEDLINE | ID: mdl-37782786

ABSTRACT

One of the most troubling trends in criminal investigations is the growing use of "black box" technology, in which law enforcement relies on artificial intelligence (AI) models or algorithms that are either too complex for people to understand or that simply conceal how they function. In criminal cases, black box systems have proliferated in forensic areas such as DNA mixture interpretation, facial recognition, and recidivism risk assessments. Champions and critics of AI alike argue, mistakenly, that we face a catch-22: black box AI is not understandable by people, but, they assume, it produces more accurate forensic evidence. In this Article, we question this assertion, which has so powerfully affected judges, policymakers, and academics. We describe a mature body of computer science research showing how "glass box" AI, designed to be interpretable, can be more accurate than black box alternatives. Indeed, black box AI performs predictably worse in settings like the criminal system. Debunking the black box performance myth has implications for forensic evidence, constitutional criminal procedure rights, and legislative policy. Absent some compelling, or even credible, government interest in keeping AI as a black box, and given the constitutional rights and public safety interests at stake, we argue that a substantial burden rests on the government to justify black box AI in criminal cases. We conclude by calling for judicial rulings and legislation to safeguard a right to interpretable forensic AI.


Subject(s)
Artificial Intelligence , Criminals , Humans , Forensic Medicine , Law Enforcement , Algorithms
12.
Proc Natl Acad Sci U S A ; 120(50): e2213020120, 2023 Dec 12.
Article in English | MEDLINE | ID: mdl-38051772

ABSTRACT

Algorithms of social media platforms are often criticized for recommending ideologically congenial and radical content to their users. Despite these concerns, evidence on such filter bubbles and rabbit holes of radicalization is inconclusive. We conduct an audit of YouTube using 100,000 sock puppet accounts that allow us to systematically, and at scale, isolate the influence of the algorithm on recommendations. We test 1) whether recommended videos are congenial with regard to users' ideology, especially deeper in the watch trail, and 2) whether recommendations deeper in the trail become progressively more extreme and come from problematic channels. We find that YouTube's algorithm recommends congenial content to its partisan users, although some moderate and cross-cutting exposure is possible, and that congenial recommendations increase deeper in the trail for right-leaning users. We do not find meaningful increases in the ideological extremity of recommendations deeper in the trail, yet we show that a growing proportion of recommendations comes from channels categorized as problematic (e.g., "IDW," "Alt-right," "Conspiracy," and "QAnon"), with this increase being most pronounced among the very-right users. Although the proportion of these problematic recommendations is low (at most 2.5%), they are still encountered by over 36.1% of users, and by up to 40% of very-right users.

13.
Proc Natl Acad Sci U S A ; 120(46): e2314092120, 2023 Nov 14.
Article in English | MEDLINE | ID: mdl-37931095

ABSTRACT

Recently, graph neural network (GNN)-based algorithms were proposed to solve a variety of combinatorial optimization problems [M. J. Schuetz, J. K. Brubaker, H. G. Katzgraber, Nat. Mach. Intell. 4, 367-377 (2022)]. GNNs were tested in particular on randomly generated instances of these problems. That publication stirred a debate about whether the GNN-based method was adequately benchmarked against the best prior methods. In particular, critical commentaries [M. C. Angelini, F. Ricci-Tersenghi, Nat. Mach. Intell. 5, 29-31 (2023)] and [S. Boettcher, Nat. Mach. Intell. 5, 24-25 (2023)] point out that a simple greedy algorithm performs better than the GNN. We do not intend to discuss the merits of the arguments and counterarguments in these papers. Rather, in this note, we establish a fundamental limitation on running GNNs on the random instances considered in these references, for a broad range of choices of GNN architecture. Specifically, these barriers hold when the depth of the GNN does not scale with the graph size (we note that depth 2 was used in the experiments of Schuetz et al.), and, importantly, they hold regardless of any other parameters of the GNN architecture. These limitations arise from the presence of the overlap gap property (OGP) phase transition, which is a barrier for many algorithms, including, importantly, local algorithms, of which GNNs are an example. At the same time, some algorithms known prior to the introduction of GNNs provide the best results for these problems up to the OGP phase transition. This leaves very little space for GNNs to outperform the known algorithms, and on this basis we side with the conclusions of the two critical commentaries cited above.
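For context, the simple greedy baseline invoked by the critical commentaries is of the following flavor (my illustration): minimum-degree greedy for a maximum independent set on a random regular graph, the problem class used in these benchmarks. The graph size and degree are arbitrary choices.

```python
# Minimum-degree greedy for a (maximal) independent set on a random
# regular graph. Illustrative baseline, not the papers' exact code.
import networkx as nx

def greedy_independent_set(G):
    H = G.copy()
    chosen = []
    while H.number_of_nodes():
        v = min(H.nodes, key=H.degree)     # take a lowest-degree node
        chosen.append(v)
        H.remove_nodes_from(list(H.neighbors(v)) + [v])
    return chosen

G = nx.random_regular_graph(d=3, n=1000, seed=0)
mis = greedy_independent_set(G)
assert not any(G.has_edge(u, v) for u in mis for v in mis)  # independence check
print(f"independent set of size {len(mis)} on {G.number_of_nodes()} nodes")
```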

14.
Proc Natl Acad Sci U S A ; 120(30): e2219925120, 2023 07 25.
Article in English | MEDLINE | ID: mdl-37459509

ABSTRACT

Infertility is a heterogeneous condition, with genetic causes thought to underlie a substantial fraction of cases. Genome sequencing is becoming increasingly important for the genetic diagnosis of diseases, including idiopathic infertility; however, most rare or minor alleles identified in patients are variants of uncertain significance (VUS). Interpreting the functional impact of VUS is challenging but profoundly important for clinical management and genetic counseling. To determine the consequences of these variants in key fertility genes, we functionally evaluated 11 missense variants in the genes ANKRD31, BRDT, DMC1, EXO1, FKBP6, MCM9, M1AP, MEI1, MSH4, and SEPT12 by generating genome-edited mouse models. Nine variants were classified as deleterious by most functional prediction algorithms, and two disrupted a protein-protein interaction (PPI) in the yeast two-hybrid (Y2H) assay. Though these genes are essential for normal meiosis or spermiogenesis in mice, only one variant, observed in the MCM9 gene of a male infertility patient, compromised fertility or gametogenesis in the mouse models. To explore the disconnect between predictions and outcomes, we compared pathogenicity calls of missense variants made by ten widely used algorithms to 1) those annotated in ClinVar and 2) those evaluated in mice. All of the algorithms performed poorly at predicting the effects of human missense variants modeled in mice. These studies urge caution in genetic diagnoses of infertile patients based primarily on pathogenicity prediction algorithms and emphasize the need for alternative and efficient in vitro or in vivo functional validation models for more effective and accurate assignment of VUS to either pathogenic or benign categories.


Subject(s)
Infertility, Male , Mutation, Missense , Humans , Male , Mice , Animals , Reproduction , Alleles , Infertility, Male/genetics , Disease Models, Animal , Septins/genetics
15.
Proc Natl Acad Sci U S A ; 120(49): e2311014120, 2023 Dec 05.
Article in English | MEDLINE | ID: mdl-38039273

ABSTRACT

For quantum computing (QC) to emerge as a practically indispensable computational tool, there is a need for quantum protocols with end-to-end practical applications, in this instance fluid dynamics. We debut here a high-performance quantum simulator, which we term QFlowS (Quantum Flow Simulator), designed for fluid flow simulations using QC. Solving nonlinear flows by QC generally proceeds by solving an equivalent infinite-dimensional linear system that results from linear embedding. Thus, we first simulate two well-known flows using QFlowS and demonstrate a previously unseen, full gate-level implementation of a hybrid, high-precision quantum linear systems algorithm (QLSA) for simulating such flows at low Reynolds numbers. The utility of this simulator is demonstrated by extracting error estimates and a power law scaling that relates the key Hamiltonian simulation parameter to the condition number $\kappa$ of the simulation matrix, allowing the prediction of an optimal scaling parameter for accurate eigenvalue estimation. Further, we include two speedup-preserving algorithms for a) functional-form or sparse quantum state preparation and b) an in situ quantum postprocessing tool for computing nonlinear functions of the velocity field. We choose the viscous dissipation rate as an example, for which the end-to-end complexity is shown to be $\mathcal{O}(\mathrm{poly}(\log N, 1/\epsilon, 1/\epsilon_{\mathrm{QPP}}))$, where $N$ is the size of the linear system of equations, $\epsilon$ is the solution error, and $\epsilon_{\mathrm{QPP}}$ is the error in postprocessing. This work suggests a path toward quantum simulation of fluid flows and highlights the special considerations needed at the gate-level implementation of QC.

16.
Mol Microbiol ; 122(3): 294-303, 2024 09.
Article in English | MEDLINE | ID: mdl-38372207

ABSTRACT

Microorganisms play a central role in biotechnology, and it is key that we develop strategies to engineer and optimize their functionality. To this end, most efforts have focused on introducing genetic manipulations in microorganisms, which are then grown either in monoculture or in mixed-species consortia. An alternative strategy for optimizing microbial processes is to rationally engineer the environment in which microbes grow. The microbial environment is multidimensional, including factors such as temperature, pH, salinity, and nutrient composition. These environmental factors all influence the growth and phenotypes of microorganisms, and they generally "interact" with one another, combining their effects in complex, non-additive ways. In this piece, we review the origins and consequences of these "interactions" between environmental factors and discuss how they have been built into statistical, bottom-up predictive models of microbial function to identify optimal environmental conditions for monocultures and microbial consortia. We also review alternative "top-down" approaches, such as genetic algorithms, for finding optimal combinations of environmental factors. By providing a brief summary of the state of this field, we hope to stimulate further work on the rational manipulation and optimization of the microbial environment.


Subject(s)
Temperature , Microbial Consortia/physiology , Hydrogen-Ion Concentration , Biotechnology/methods , Bacteria/genetics , Bacteria/metabolism , Environment , Salinity
17.
Annu Rev Med ; 74: 385-400, 2023 01 27.
Article in English | MEDLINE | ID: mdl-36706748

ABSTRACT

In 2020, the nephrology community formally interrogated long-standing race-based clinical algorithms used in the field, including the kidney function estimation equations. A comprehensive understanding of the history of kidney function estimation and racial essentialism is necessary to understand the underpinnings of the incorporation of a Black race coefficient into prior equations. We provide a review of this history, as well as the considerations used to develop race-free equations, which serve as a guidepost for a more equity-oriented, scientifically rigorous future for kidney function estimation and for other clinical algorithms and processes in which race may be embedded as a variable.


Subject(s)
Kidney , Racial Groups , Humans , Kidney/physiology , Black People
18.
Annu Rev Med ; 74: 401-412, 2023 01 27.
Article in English | MEDLINE | ID: mdl-35901314

ABSTRACT

Understanding how biases originate in medical technologies, and developing safeguards to identify, mitigate, and remove their harms, are essential to ensuring equal performance in all individuals. Drawing upon examples from pulmonary medicine, this article describes how bias can be introduced in the physical aspects of a technology's design, via unrepresentative data, or by conflation of biological with social determinants of health. Such bias can then be perpetuated by inadequate evaluation and regulatory standards. Research demonstrates that pulse oximeters perform differently depending on patient race and ethnicity. Pulmonary function testing and algorithms used to predict healthcare needs are two additional examples of medical technologies with racial and ethnic biases that may perpetuate health disparities.


Subject(s)
Ethnicity , Healthcare Disparities , Humans , Bias
19.
Brief Bioinform ; 25(1)2023 11 22.
Article in English | MEDLINE | ID: mdl-38189539

ABSTRACT

Sequence motif discovery algorithms enhance the identification of novel deoxyribonucleic acid sequences with pivotal biological significance, especially transcription factor (TF)-binding motifs. The advent of assay for transposase-accessible chromatin using sequencing (ATAC-seq) has broadened the toolkit for motif characterization. Nonetheless, prevailing computational approaches have focused on delineating TF-binding footprints, with motif discovery receiving less attention. Herein, we present Cis rEgulatory Motif Influence using de Bruijn Graph (CEMIG), an algorithm leveraging de Bruijn and Hamming distance graph paradigms to predict and map motif sites. Assessment on 129 ATAC-seq datasets from the Cistrome Data Browser demonstrates CEMIG's exceptional performance, surpassing three established methodologies on four evaluative metrics. CEMIG accurately identifies both cell-type-specific and common TF motifs within GM12878 and K562 cell lines, demonstrating its comparative genomic capabilities in the identification of evolutionary conservation and cell-type specificity. In-depth transcriptional and functional genomic studies have validated the functional relevance of CEMIG-identified motifs across various cell types. CEMIG is available at https://github.com/OSU-BMBL/CEMIG, developed in C++ to ensure cross-platform compatibility with Linux, macOS and Windows operating systems.
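To make the graph paradigm concrete, here is a minimal sketch (my illustration, not CEMIG's implementation) of the weighted de Bruijn graph construction underlying this family of methods: each k-mer becomes an edge between its two (k-1)-mer halves, weighted by occurrence count.

```python
# Weighted de Bruijn graph from sequence k-mers. Illustrative reads and k.
from collections import Counter

def de_bruijn_edges(seqs, k=4):
    edges = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1   # prefix (k-1)-mer -> suffix
    return edges

reads = ["ATGCGTATGC", "GCGTATG", "TATGCGT"]
for (u, v), w in de_bruijn_edges(reads).most_common(5):
    print(f"{u} -> {v}  weight {w}")
```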


Subject(s)
Algorithms , Chromatin Immunoprecipitation Sequencing , Benchmarking , Biological Evolution , Cell Line
20.
Brief Bioinform ; 24(1)2023 01 19.
Article in English | MEDLINE | ID: mdl-36585784

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) clustering and labelling methods are used to determine the precise cellular composition of tissue samples. Automated labelling methods rely on either unsupervised, cluster-based approaches or supervised, cell-based approaches to identify cell types. The high complexity of cancer poses a unique challenge, as tumor microenvironments are often composed of diverse cell subpopulations with unique functional effects that may lead to disease progression, metastasis, and treatment resistance. Here, we assess 17 cell-based and 9 cluster-based scRNA-seq labelling algorithms on 8 cancer datasets, providing a comprehensive large-scale assessment of such methods in a cancer-specific context. Using several performance metrics, we show that cell-based methods generally achieved higher performance and were faster than cluster-based methods. Cluster-based methods more successfully labelled non-malignant cell types, likely because of a lack of gene signatures for relevant malignant cell subpopulations. Larger cell numbers for some cell types in the training data positively impacted prediction scores for cell-based methods. Finally, we examined which methods performed favorably when trained and tested on separate patient cohorts, in scenarios similar to clinical applications, and which were able to accurately label particularly small or under-represented cell populations in the given datasets. We conclude that scPred and SVM show the best overall performance with cancer-specific data, and we provide further suggestions for algorithm selection. Our analysis pipeline for assessing the performance of cell type labelling algorithms is available at https://github.com/shooshtarilab/scRNAseq-Automated-Cell-Type-Labelling.
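As a hedged sketch of the cell-based setup that fared best in this benchmark, the code below trains a linear SVM on synthetic stand-ins for log-normalized expression profiles; the marker-gene structure and all sizes are invented for illustration.

```python
# Cell-based labelling sketch: linear SVM on synthetic expression profiles.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n_cells, n_genes, n_types = 600, 50, 3
labels = rng.integers(0, n_types, n_cells)
X = rng.normal(size=(n_cells, n_genes))
for t in range(n_types):
    X[labels == t, t * 10:(t + 1) * 10] += 2.0   # each type's marker block

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3,
                                          random_state=0, stratify=labels)
clf = LinearSVC().fit(X_tr, y_tr)      # train on annotated reference cells
print(f"macro F1 = {f1_score(y_te, clf.predict(X_te), average='macro'):.2f}")
```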


Subject(s)
Neoplasms , Single-Cell Gene Expression Analysis , Humans , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , Algorithms , Neoplasms/genetics , Cluster Analysis , Gene Expression Profiling/methods , Tumor Microenvironment