Search | BVS Bolivia

1.

Secure discovery of genetic relatives across large-scale and distributed genomic datasets.

Hong, Matthew Man-Hou; Froelicher, David; Magner, Ricky; Popic, Victoria; Berger, Bonnie; Cho, Hyunghoon.

Genome Res ; 2024 Aug 07.

Article in English | MEDLINE | ID: mdl-39111815

ABSTRACT

Finding relatives within a study cohort is a necessary step in many genomic studies. However, when the cohort is distributed across multiple entities subject to data-sharing restrictions, performing this step often becomes infeasible. Developing a privacy-preserving solution for this task is challenging due to the burden of estimating kinship between all pairs of individuals across datasets. We introduce SF-Relate, a practical and secure federated algorithm for identifying genetic relatives across data silos. SF-Relate vastly reduces the number of individual pairs to compare while maintaining accurate detection through a novel locality-sensitive hashing (LSH) approach. We assign individuals who are likely to be related together into buckets and then test relationships only between individuals in matching buckets across parties. To this end, we construct an effective hash function that captures identity-by-descent (IBD) segments in genetic sequences, which, along with a new bucketing strategy, enable accurate and practical private relative detection. To guarantee privacy, we introduce an efficient algorithm based on multiparty homomorphic encryption (MHE) to allow data holders to cooperatively compute the relatedness coefficients between individuals, and to further classify their degrees of relatedness, all without sharing any private data. We demonstrate the accuracy and practical runtimes of SF-Relate on the UK Biobank and All of Us datasets. On a dataset of 200K individuals split between two parties, SF-Relate detects 97% of third-degree or closer relatives within 15 hours of runtime. Our work enables secure identification of relatives across large-scale genomic datasets.
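
The bucketing idea described in this abstract can be illustrated with a toy sketch. This is not SF-Relate's hash function and it omits the encrypted comparison step entirely; it only assumes hypothetical haplotype strings, hashes fixed windows of each one, and keeps the cross-party pairs that land in a shared bucket. The sample IDs, window size, and function names are all illustrative assumptions.

```python
from collections import defaultdict

def bucket_keys(haplotype, window=4):
    """Yield one key per non-overlapping window: (window start, window content)."""
    for start in range(0, len(haplotype) - window + 1, window):
        yield (start, haplotype[start:start + window])

def build_buckets(cohort, window=4):
    """Map each bucket key to the set of sample IDs that hash into it."""
    buckets = defaultdict(set)
    for sample_id, haplotype in cohort.items():
        for key in bucket_keys(haplotype, window):
            buckets[key].add(sample_id)
    return buckets

def candidate_pairs(party_a, party_b, window=4):
    """Return the cross-party pairs that share at least one bucket."""
    buckets_a = build_buckets(party_a, window)
    buckets_b = build_buckets(party_b, window)
    pairs = set()
    for key in buckets_a.keys() & buckets_b.keys():
        for a in buckets_a[key]:
            for b in buckets_b[key]:
                pairs.add((a, b))
    return pairs

# Hypothetical toy data: two parties with two samples each.
party_a = {"a1": "ACGTACGTTTGA", "a2": "GGGGCCCCAAAA"}
party_b = {"b1": "ACGTTTTTCCCC", "b2": "TTTTAAAAGGGG"}
pairs = candidate_pairs(party_a, party_b)
print(pairs)
```

In this toy case only one of the four possible cross-party pairs shares a window at the same position, so only that pair would go on to the (much more expensive) kinship computation.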

2.

Secure Discovery of Genetic Relatives across Large-Scale and Distributed Genomic Datasets.

Hong, Matthew M; Froelicher, David; Magner, Ricky; Popic, Victoria; Berger, Bonnie; Cho, Hyunghoon.

Res Comput Mol Biol ; 14758: 308-313, 2024.

Article in English | MEDLINE | ID: mdl-39027313

ABSTRACT

Finding relatives within a study cohort is a necessary step in many genomic studies. However, when the cohort is distributed across multiple entities subject to data-sharing restrictions, performing this step often becomes infeasible. Developing a privacy-preserving solution for this task is challenging due to the significant burden of estimating kinship between all pairs of individuals across datasets. We introduce SF-Relate, a practical and secure federated algorithm for identifying genetic relatives across data silos. SF-Relate vastly reduces the number of individual pairs to compare while maintaining accurate detection through a novel locality-sensitive hashing approach. We assign individuals who are likely to be related together into buckets and then test relationships only between individuals in matching buckets across parties. To this end, we construct an effective hash function that captures identity-by-descent (IBD) segments in genetic sequences, which, along with a new bucketing strategy, enable accurate and practical private relative detection. To guarantee privacy, we introduce an efficient algorithm based on multiparty homomorphic encryption (MHE) to allow data holders to cooperatively compute the relatedness coefficients between individuals, and to further classify their degrees of relatedness, all without sharing any private data. We demonstrate the accuracy and practical runtimes of SF-Relate on the UK Biobank and All of Us datasets. On a dataset of 200K individuals split between two parties, SF-Relate detects 94.9% of third-degree relatives, and 99.9% of second-degree or closer relatives, within 15 hours of runtime. Our work enables secure identification of relatives across large-scale genomic datasets.

3.

Democratizing protein language models with parameter-efficient fine-tuning.

Sledzieski, Samuel; Kshirsagar, Meghana; Baek, Minkyung; Dodhia, Rahul; Lavista Ferres, Juan; Berger, Bonnie.

Proc Natl Acad Sci U S A ; 121(26): e2405840121, 2024 Jun 25.

Article in English | MEDLINE | ID: mdl-38900798

ABSTRACT

Proteomics has been revolutionized by large protein language models (PLMs), which learn unsupervised representations from large corpora of sequences. These models are typically fine-tuned in a supervised setting to adapt the model to specific downstream tasks. However, the computational and memory footprint of fine-tuning (FT) large PLMs presents a barrier for many research groups with limited computational resources. Natural language processing has seen a similar explosion in the size of models, where these challenges have been addressed by methods for parameter-efficient fine-tuning (PEFT). In this work, we introduce this paradigm to proteomics through leveraging the parameter-efficient method LoRA and training new models for two important tasks: predicting protein-protein interactions (PPIs) and predicting the symmetry of homooligomer quaternary structures. We show that these approaches are competitive with traditional FT while requiring reduced memory and substantially fewer parameters. We additionally show that for the PPI prediction task, training only the classification head also remains competitive with full FT, using five orders of magnitude fewer parameters, and that each of these methods outperform state-of-the-art PPI prediction methods with substantially reduced compute. We further perform a comprehensive evaluation of the hyperparameter space, demonstrate that PEFT of PLMs is robust to variations in these hyperparameters, and elucidate where best practices for PEFT in proteomics differ from those in natural language processing. All our model adaptation and evaluation code is available open-source at https://github.com/microsoft/peft_proteomics. Thus, we provide a blueprint to democratize the power of PLM adaptation to groups with limited computational resources.

Subject(s)

Proteomics , Proteomics/methods , Proteins/chemistry , Proteins/metabolism , Natural Language Processing , Protein Interaction Mapping/methods , Computational Biology/methods , Humans , Algorithms
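
To make the parameter savings of LoRA concrete, the sketch below computes the low-rank update W + (alpha / r) * B A for a single weight matrix and compares trainable parameter counts. It is a minimal illustration in plain Python, not the code from the linked repository; the layer size 1280 and rank 8 are assumed values chosen only for the arithmetic.

```python
def lora_param_counts(d_in, d_out, rank):
    """Trainable parameters for full fine-tuning vs. a rank-r LoRA adapter."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora

def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where A is r x d_in and B is d_out x r."""
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Assumed layer shape, for illustration only.
full, lora = lora_param_counts(d_in=1280, d_out=1280, rank=8)
print(full, lora)

# Frozen 2 x 2 weight with a rank-1 adapter; B starts at zero, so the
# adapted layer initially behaves exactly like the frozen one.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0]]      # r x d_in
b = [[0.0], [0.0]]    # d_out x r
w_eff = lora_effective_weight(w, a, b, alpha=1.0)
```

Only A and B are trained while W stays frozen, which is where the reduction in trainable parameters comes from.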

4.

Scanorama: integrating large and diverse single-cell transcriptomic datasets.

Hie, Brian L; Kim, Soochi; Rando, Thomas A; Bryson, Bryan; Berger, Bonnie.

Nat Protoc ; 19(8): 2283-2297, 2024 Aug.

Article in English | MEDLINE | ID: mdl-38844552

ABSTRACT

Merging diverse single-cell RNA sequencing (scRNA-seq) data from numerous experiments, laboratories and technologies can uncover important biological insights. Nonetheless, integrating scRNA-seq data encounters special challenges when the datasets are composed of diverse cell type compositions. Scanorama offers a robust solution for improving the quality and interpretation of heterogeneous scRNA-seq data by effectively merging information from diverse sources. Scanorama is designed to address the technical variation introduced by differences in sample preparation, sequencing depth and experimental batches that can confound the analysis of multiple scRNA-seq datasets. Here we provide a detailed protocol for using Scanorama within a Scanpy-based single-cell analysis workflow coupled with Google Colaboratory, a cloud-based free Jupyter notebook environment service. The protocol involves Scanorama integration, a process that typically spans 0.5-3 h. Scanorama integration requires a basic understanding of cellular biology, transcriptomic technologies and bioinformatics. Our protocol and new Scanorama-Colaboratory resource should make scRNA-seq integration more widely accessible to researchers.

Subject(s)

Single-Cell Analysis , Transcriptome , Single-Cell Analysis/methods , Sequence Analysis, RNA/methods , Software , Computational Biology/methods , Gene Expression Profiling/methods , Humans , RNA-Seq/methods

5.

Dirichlet Flow Matching with Applications to DNA Sequence Design.

Stark, Hannes; Jing, Bowen; Wang, Chenyu; Corso, Gabriele; Berger, Bonnie; Barzilay, Regina; Jaakkola, Tommi.

ArXiv ; 2024 May 30.

Article in English | MEDLINE | ID: mdl-38855543

ABSTRACT

Discrete diffusion or flow models could enable faster and more controllable sequence generation than autoregressive models. We show that naïve linear flow matching on the simplex is insufficient toward this goal since it suffers from discontinuities in the training target and further pathologies. To overcome this, we develop Dirichlet flow matching on the simplex based on mixtures of Dirichlet distributions as probability paths. In this framework, we derive a connection between the mixtures' scores and the flow's vector field that allows for classifier and classifier-free guidance. Further, we provide distilled Dirichlet flow matching, which enables one-step sequence generation with minimal performance hits, resulting in $O(L)$ speedups compared to autoregressive models. On complex DNA sequence generation tasks, we demonstrate superior performance compared to all baselines in distributional metrics and in achieving desired design targets for generated sequences. Finally, we show that our classifier-free guidance approach improves unconditional generation and is effective for generating DNA that satisfies design targets. Code is available at https://github.com/HannesStark/dirichlet-flow-matching.

6.

How has the AI boom impacted algorithmic biology?

Singh, Mona; Sahinalp, Cenk; Zeng, Jianyang; Li, Wei Vivian; Kingsford, Carl; Zhang, Qiangfeng; Przytycka, Teresa; Welch, Joshua; Ma, Jian; Berger, Bonnie.

Cell Syst ; 15(6): 483-487, 2024 Jun 19.

Article in English | MEDLINE | ID: mdl-38901402

ABSTRACT

This Voices piece will highlight the impact of artificial intelligence on algorithm development among computational biologists. How has worldwide focus on AI changed the path of research in computational biology? What is the impact on the algorithmic biology research community?

Subject(s)

Algorithms , Artificial Intelligence , Computational Biology , Artificial Intelligence/trends , Computational Biology/methods , Humans

7.

Causal gene regulatory analysis with RNA velocity reveals an interplay between slow and fast transcription factors.

Singh, Rohit; Wu, Alexander P; Mudide, Anish; Berger, Bonnie.

Cell Syst ; 15(5): 462-474.e5, 2024 May 15.

Article in English | MEDLINE | ID: mdl-38754366

ABSTRACT

Single-cell expression dynamics, from differentiation trajectories or RNA velocity, have the potential to reveal causal links between transcription factors (TFs) and their target genes in gene regulatory networks (GRNs). However, existing methods either overlook these expression dynamics or necessitate that cells be ordered along a linear pseudotemporal axis, which is incompatible with branching trajectories. We introduce Velorama, an approach to causal GRN inference that represents single-cell differentiation dynamics as a directed acyclic graph of cells, constructed from pseudotime or RNA velocity measurements. Additionally, Velorama enables the estimation of the speed at which TFs influence target genes. Applying Velorama, we uncover evidence that the speed of a TF's interactions is tied to its regulatory function. For human corticogenesis, we find that slow TFs are linked to gliomas, while fast TFs are associated with neuropsychiatric diseases. We expect Velorama to become a critical part of the RNA velocity toolkit for investigating the causal drivers of differentiation and disease.

Subject(s)

Cell Differentiation , Gene Regulatory Networks , RNA , Transcription Factors , Humans , Transcription Factors/genetics , Transcription Factors/metabolism , Gene Regulatory Networks/genetics , Cell Differentiation/genetics , RNA/genetics , RNA/metabolism , Single-Cell Analysis/methods , Gene Expression Regulation/genetics

8.

Rapid and accurate prediction of protein homo-oligomer symmetry with Seq2Symm.

Kshirsagar, Meghana; Meller, Artur; Humphreys, Ian; Sledzieski, Samuel; Xu, Yixi; Dodhia, Rahul; Horvitz, Eric; Berger, Bonnie; Bowman, Gregory; Ferres, Juan Lavista; Baker, David; Baek, Minkyung.

Res Sq ; 2024 Apr 26.

Article in English | MEDLINE | ID: mdl-38746169

ABSTRACT

The majority of proteins must form higher-order assemblies to perform their biological functions. Despite the importance of protein quaternary structure, there are few machine learning models that can accurately and rapidly predict the symmetry of assemblies involving multiple copies of the same protein chain. Here, we address this gap by training several classes of protein foundation models, including ESM-MSA, ESM2, and RoseTTAFold2, to predict homo-oligomer symmetry. Our best model, named Seq2Symm, which utilizes ESM2, outperforms existing template-based and deep learning methods. It achieves an average PR-AUC of 0.48 and 0.44 across homo-oligomer symmetries on two different held-out test sets, compared to 0.32 and 0.23 for the template-based method. Because Seq2Symm can rapidly predict homo-oligomer symmetries using a single sequence as input (~80,000 proteins/hour), we have applied it to 5 entire proteomes and ~3.5 million unlabeled protein sequences to identify patterns in protein assembly complexity across biological kingdoms and species.

9.

Lipophorin receptors genetically modulate neurodegeneration caused by reduction of Psn expression in the aging Drosophila brain.

Kang, Jongkyun; Zhang, Chen; Wang, Yuhao; Peng, Jian; Berger, Bonnie; Perrimon, Norbert; Shen, Jie.

Genetics ; 226(1), 2024 Jan 03.

Article in English | MEDLINE | ID: mdl-37996068

ABSTRACT

Mutations in the Presenilin (PSEN) genes are the most common cause of early-onset familial Alzheimer's disease (FAD). Studies in cell culture, in vitro biochemical systems, and knockin mice showed that PSEN mutations are loss-of-function mutations, impairing γ-secretase activity. Mouse genetic analysis highlighted the importance of Presenilin (PS) in learning and memory, synaptic plasticity and neurotransmitter release, and neuronal survival, and Drosophila studies further demonstrated an evolutionarily conserved role of PS in neuronal survival during aging. However, molecular pathways that interact with PS in neuronal survival remain unclear. To identify genetic modifiers that modulate PS-dependent neuronal survival, we developed a new Drosophila Psn model that exhibits age-dependent neurodegeneration and increases of apoptosis. Following a bioinformatic analysis, we tested top ranked candidate genes by selective knockdown (KD) of each gene in neurons using two independent RNAi lines in Psn KD models. Interestingly, 4 of the 9 genes enhancing neurodegeneration in Psn KD flies are involved in lipid transport and metabolism. Specifically, neuron-specific KD of lipophorin receptors, lpr1 and lpr2, dramatically worsens neurodegeneration in Psn KD flies, and overexpression of lpr1 or lpr2 does not alleviate Psn KD-induced neurodegeneration. Furthermore, lpr1 or lpr2 KD alone also leads to neurodegeneration, increased apoptosis, climbing defects, and shortened lifespan. Lastly, heterozygotic deletions of lpr1 and lpr2 or homozygotic deletions of lpr1 or lpr2 similarly lead to age-dependent neurodegeneration and further exacerbate neurodegeneration in Psn KD flies. These findings show that LpRs modulate Psn-dependent neuronal survival and are critically important for neuronal integrity in the aging brain.

Subject(s)

Alzheimer Disease , Drosophila , Animals , Mice , Drosophila/genetics , Drosophila/metabolism , Presenilins/genetics , Presenilins/metabolism , Brain/metabolism , Alzheimer Disease/genetics , Aging/genetics

10.

Equivariant Scalar Fields for Molecular Docking with Fast Fourier Transforms.

Jing, Bowen; Jaakkola, Tommi; Berger, Bonnie.

ArXiv ; 2023 Dec 07.

Article in English | MEDLINE | ID: mdl-38106455

ABSTRACT

Molecular docking is critical to structure-based virtual screening, yet the throughput of such workflows is limited by the expensive optimization of scoring functions involved in most docking algorithms. We explore how machine learning can accelerate this process by learning a scoring function with a functional form that allows for more rapid optimization. Specifically, we define the scoring function to be the cross-correlation of multi-channel ligand and protein scalar fields parameterized by equivariant graph neural networks, enabling rapid optimization over rigid-body degrees of freedom with fast Fourier transforms. The runtime of our approach can be amortized at several levels of abstraction, and is particularly favorable for virtual screening settings with a common binding pocket. We benchmark our scoring functions on two simplified docking-related tasks: decoy pose scoring and rigid conformer docking. Our method attains similar but faster performance on crystal structures compared to the widely-used Vina and Gnina scoring functions, and is more robust on computationally predicted structures. Code is available at https://github.com/bjing2016/scalar-fields.
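
The speed-up from a cross-correlation scoring function rests on the correlation theorem: scores for every translation can be read off one inverse transform of a product of Fourier coefficients. The one-dimensional sketch below is only an illustration under strong assumptions: the "fields" are hand-written lists rather than outputs of equivariant networks, only translations are scored, and a naive O(n^2) DFT written with `cmath` stands in for a real FFT library.

```python
import cmath

def dft(values, inverse=False):
    """Naive discrete Fourier transform (stand-in for an FFT routine)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(v * cmath.exp(sign * 2j * cmath.pi * k * x / n)
                    for x, v in enumerate(values))
        out.append(total / n if inverse else total)
    return out

def cross_correlation_direct(protein, ligand):
    """score[t] = sum_x protein[x] * ligand[x - t], with circular shifts."""
    n = len(protein)
    return [sum(protein[x] * ligand[(x - t) % n] for x in range(n))
            for t in range(n)]

def cross_correlation_fourier(protein, ligand):
    """Same scores via the correlation theorem: IDFT(conj(L_hat) * P_hat)."""
    p_hat = dft(protein)
    l_hat = dft(ligand)
    product = [l.conjugate() * p for l, p in zip(l_hat, p_hat)]
    return [c.real for c in dft(product, inverse=True)]

# Hypothetical 1D scalar fields: a "pocket" centred at index 3 and a
# ligand profile that starts at index 0.
protein = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0]
ligand = [1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

scores = cross_correlation_fourier(protein, ligand)
best_shift = max(range(len(scores)), key=scores.__getitem__)
print(best_shift)
```

With a true FFT, the same computation costs O(n log n) for all n translations at once, rather than O(n^2) for the direct sum.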

11.

Democratizing Protein Language Models with Parameter-Efficient Fine-Tuning.

Sledzieski, Samuel; Kshirsagar, Meghana; Baek, Minkyung; Berger, Bonnie; Dodhia, Rahul; Ferres, Juan Lavista.

bioRxiv ; 2023 Nov 10.

Article in English | MEDLINE | ID: mdl-37986761

ABSTRACT

Proteomics has been revolutionized by large pre-trained protein language models, which learn unsupervised representations from large corpora of sequences. The parameters of these models are then fine-tuned in a supervised setting to tailor the model to a specific downstream task. However, as model size increases, the computational and memory footprint of fine-tuning becomes a barrier for many research groups. In the field of natural language processing, which has seen a similar explosion in the size of models, these challenges have been addressed by methods for parameter-efficient fine-tuning (PEFT). In this work, we newly bring parameter-efficient fine-tuning methods to proteomics. Using the parameter-efficient method LoRA, we train new models for two important proteomic tasks: predicting protein-protein interactions (PPI) and predicting the symmetry of homooligomers. We show that for homooligomer symmetry prediction, these approaches achieve performance competitive with traditional fine-tuning while requiring reduced memory and using three orders of magnitude fewer parameters. On the PPI prediction task, we surprisingly find that PEFT models actually outperform traditional fine-tuning while using two orders of magnitude fewer parameters. Here, we go even further to show that freezing the parameters of the language model and training only a classification head also outperforms fine-tuning, using five orders of magnitude fewer parameters, and that both of these models outperform state-of-the-art PPI prediction methods with substantially reduced compute. We also demonstrate that PEFT is robust to variations in training hyper-parameters, and elucidate where best practices for PEFT in proteomics differ from in natural language processing. Thus, we provide a blueprint to democratize the power of protein language model tuning to groups which have limited computational resources.

12.

TT3D: Leveraging precomputed protein 3D sequence models to predict protein-protein interactions.

Sledzieski, Samuel; Devkota, Kapil; Singh, Rohit; Cowen, Lenore; Berger, Bonnie.

Bioinformatics ; 39(11), 2023 Nov 01.

Article in English | MEDLINE | ID: mdl-37897686

ABSTRACT

MOTIVATION: High-quality computational structural models are now precomputed and available for nearly every protein in UniProt. However, the best way to leverage these models to predict which pairs of proteins interact in a high-throughput manner is not immediately clear. The recent Foldseek method of van Kempen et al. encodes the structural information of distances and angles along the protein backbone into a linear string of the same length as the protein string, using tokens from a 21-letter discretized structural alphabet (3Di). RESULTS: We show that using both the amino acid sequence and the 3Di sequence generated by Foldseek as inputs to our recent deep-learning method, Topsy-Turvy, substantially improves the performance of predicting protein-protein interactions cross-species. Thus TT3D (Topsy-Turvy 3D) presents a way to reuse all the computational effort going into producing high-quality structural models from sequence, while being sufficiently lightweight so that high-quality binary protein-protein interaction predictions across all protein pairs can be made genome-wide. AVAILABILITY AND IMPLEMENTATION: TT3D is available at https://github.com/samsledje/D-SCRIPT. An archived version of the code at time of submission can be found at https://zenodo.org/records/10037674.

Subject(s)

Proteins , Software , Amino Acid Sequence , Proteins/chemistry

13.

CD8+ lymphocytes are critical for early control of tuberculosis in macaques.

Winchell, Caylin G; Nyquist, Sarah K; Chao, Michael C; Maiello, Pauline; Myers, Amy J; Hopkins, Forrest; Chase, Michael; Gideon, Hannah P; Patel, Kush V; Bromley, Joshua D; Simonson, Andrew W; Floyd-O'Sullivan, Roisin; Wadsworth, Marc; Rosenberg, Jacob M; Uddin, Rockib; Hughes, Travis; Kelly, Ryan J; Griffo, Josephine; Tomko, Jaime; Klein, Edwin; Berger, Bonnie; Scanga, Charles A; Mattila, Joshua; Fortune, Sarah M; Shalek, Alex K; Lin, Philana Ling; Flynn, JoAnne L.

J Exp Med ; 220(12), 2023 Dec 04.

Article in English | MEDLINE | ID: mdl-37843832

ABSTRACT

The functional role of CD8+ lymphocytes in tuberculosis remains poorly understood. We depleted innate and/or adaptive CD8+ lymphocytes in macaques and showed that loss of all CD8α+ cells (using anti-CD8α antibody) significantly impaired early control of Mycobacterium tuberculosis (Mtb) infection, leading to increased granulomas, lung inflammation, and bacterial burden. Analysis of barcoded Mtb from infected macaques demonstrated that depletion of all CD8+ lymphocytes allowed increased establishment of Mtb in lungs and dissemination within lungs and to lymph nodes, while depletion of only adaptive CD8+ T cells (with anti-CD8β antibody) worsened bacterial control in lymph nodes. Flow cytometry and single-cell RNA sequencing revealed polyfunctional cytotoxic CD8+ lymphocytes in control granulomas, while CD8-depleted animals were unexpectedly enriched in CD4 and γδ T cells adopting incomplete cytotoxic signatures. Ligand-receptor analyses identified IL-15 signaling in granulomas as a driver of cytotoxic T cells. These data support that CD8+ lymphocytes are required for early protection against Mtb and suggest polyfunctional cytotoxic responses as a vaccine target.

Subject(s)

Mycobacterium tuberculosis , Tuberculosis , Animals , Macaca , Tuberculosis/microbiology , CD8-Positive T-Lymphocytes , Granuloma , CD4-Positive T-Lymphocytes

14.

Assessing transcriptomic reidentification risks using discriminative sequence models.

Sadhuka, Shuvom; Fridman, Daniel; Berger, Bonnie; Cho, Hyunghoon.

Genome Res ; 33(7): 1101-1112, 2023 Jul.

Article in English | MEDLINE | ID: mdl-37541758

ABSTRACT

Gene expression data provide molecular insights into the functional impact of genetic variation, for example, through expression quantitative trait loci (eQTLs). With an improving understanding of the association between genotypes and gene expression comes a greater concern that gene expression profiles could be matched to genotype profiles of the same individuals in another data set, known as a linking attack. Prior works showing such a risk could analyze only a fraction of eQTLs that are independent, owing to restrictive model assumptions, leaving the full extent of this risk incompletely understood. To address this challenge, we introduce the discriminative sequence model (DSM), a novel probabilistic framework for predicting a sequence of genotypes based on gene expression data. By modeling the joint distribution over all known eQTLs in a genomic region, DSM improves the power of linking attacks with necessary calibration for linkage disequilibrium and redundant predictive signals. We show greater linking accuracy of DSM compared with existing approaches across a range of attack scenarios and data sets including up to 22,288 individuals, suggesting that DSM helps uncover a substantial additional risk overlooked by previous studies. Our work provides a unified framework for assessing the privacy risks of sharing diverse omics data sets beyond transcriptomics.

Subject(s)

Genome-Wide Association Study , Transcriptome , Humans , Gene Expression Profiling , Genotype , Quantitative Trait Loci , Polymorphism, Single Nucleotide

15.

Efficient minimizer orders for large values of k using minimum decycling sets.

Pellow, David; Pu, Lianrong; Ekim, Baris; Kotlar, Lior; Berger, Bonnie; Shamir, Ron; Orenstein, Yaron.

Genome Res ; 33(7): 1154-1161, 2023 Jul.

Article in English | MEDLINE | ID: mdl-37558282

ABSTRACT

Minimizers are ubiquitously used in data structures and algorithms for efficient searching, mapping, and indexing of high-throughput DNA sequencing data. Minimizer schemes select a minimum k-mer in every L-long subsequence of the target sequence, where minimality is with respect to a predefined k-mer order. Commonly used minimizer orders select more k-mers than necessary and therefore provide limited improvement in runtime and memory usage of downstream analysis tasks. The recently introduced universal k-mer hitting sets produce minimizer orders with fewer selected k-mers. Generating compact universal k-mer hitting sets is currently infeasible for k > 13, and thus, they cannot help in the many applications that require minimizer orders for larger k. Here, we close the gap of efficient minimizer orders for large values of k by introducing decycling-set-based minimizer orders: new minimizer orders based on minimum decycling sets. We show that in practice these new minimizer orders select a number of k-mers comparable to that of minimizer orders based on universal k-mer hitting sets and can also scale to a larger k. Furthermore, we developed a method that computes the minimizers in a sequence on the fly without keeping the k-mers of a decycling set in memory. This enables the use of these minimizer orders for any value of k. We expect the new orders to improve the runtime and memory usage of algorithms and data structures in high-throughput DNA sequencing analysis.

Subject(s)

Algorithms , Software , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods
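
The basic minimizer scheme defined in the abstract (pick the minimum k-mer in every L-long window under some k-mer order) can be sketched as follows. The default order here is plain lexicographic order, chosen only for illustration; it is not the decycling-set-based order the paper introduces, and the function and parameter names are assumptions.

```python
def minimizers(sequence, k, window_len, order=None):
    """Return sorted (position, k-mer) pairs selected across all windows.

    Each window of length `window_len` contributes its smallest k-mer under
    `order`; ties are broken by leftmost position.
    """
    if order is None:
        order = lambda kmer: kmer  # lexicographic order, for illustration
    selected = set()
    for start in range(len(sequence) - window_len + 1):
        positions = range(start, start + window_len - k + 1)
        best = min(positions, key=lambda p: (order(sequence[p:p + k]), p))
        selected.add((best, sequence[best:best + k]))
    return sorted(selected)

picked = minimizers("ACGTTGCA", k=3, window_len=5)
print(picked)
```

Passing a different `order` callable changes which k-mers are selected and therefore how many are kept; a better order selects fewer k-mers for the same window guarantee.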

16.

SCA: recovering single-cell heterogeneity through information-based dimensionality reduction.

DeMeo, Benjamin; Berger, Bonnie.

Genome Biol ; 24(1): 195, 2023 Aug 25.

Article in English | MEDLINE | ID: mdl-37626411

ABSTRACT

Dimensionality reduction summarizes the complex transcriptomic landscape of single-cell datasets for downstream analyses. Current approaches favor large cellular populations defined by many genes, at the expense of smaller and more subtly defined populations. Here, we present surprisal component analysis (SCA), a technique that newly leverages the information-theoretic notion of surprisal for dimensionality reduction to promote more meaningful signal extraction. For example, SCA uncovers clinically important cytotoxic T-cell subpopulations that are indistinguishable using existing pipelines. We also demonstrate that SCA substantially improves downstream imputation. SCA's efficient information-theoretic paradigm has broad applications to the study of complex biological tissues in health and disease.

Subject(s)

Gene Expression Profiling , Transcriptome

17.

Efficient mapping of accurate long reads in minimizer space with mapquik.

Ekim, Baris; Sahlin, Kristoffer; Medvedev, Paul; Berger, Bonnie; Chikhi, Rayan.

Genome Res ; 33(7): 1188-1197, 2023 Jul.

Article in English | MEDLINE | ID: mdl-37399256

ABSTRACT

DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. We focus on the critical problem of mapping, or aligning, low-divergence sequences from long reads (e.g., Pacific Biosciences [PacBio] HiFi) to a reference genome, which poses challenges in terms of accuracy and computational resources when using cutting-edge read mapping approaches that are designed for all types of alignments. A natural idea would be to optimize efficiency with longer seeds to reduce the probability of extraneous matches; however, contiguous exact seeds quickly reach a sensitivity limit. We introduce mapquik, a novel strategy that creates accurate longer seeds by anchoring alignments through matches of k consecutively sampled minimizers (k-min-mers) and only indexing k-min-mers that occur once in the reference genome, thereby unlocking ultrafast mapping while retaining high sensitivity. We show that mapquik significantly accelerates the seeding and chaining steps (fundamental bottlenecks to read mapping) for both the human and maize genomes with [Formula: see text] sensitivity and near-perfect specificity. On the human genome, for both real and simulated reads, mapquik achieves a [Formula: see text] speedup over the state-of-the-art tool minimap2, and on the maize genome, mapquik achieves a [Formula: see text] speedup over minimap2, making mapquik the fastest mapper to date. These accelerations are enabled from not only minimizer-space seeding but also a novel heuristic [Formula: see text] pseudochaining algorithm, which improves upon the long-standing [Formula: see text] bound. Minimizer-space computation builds the foundation for achieving real-time analysis of long-read sequencing data.

Subject(s)

High-Throughput Nucleotide Sequencing , Software , Humans , Algorithms , Sequence Analysis, DNA , Genome, Human
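
The seeding idea in this abstract (k consecutive minimizers form one seed, and only seeds that occur once in the reference are indexed) can be sketched on toy data. This is an illustrative assumption-laden sketch, not mapquik's implementation: minimizers are represented by placeholder labels, positions are indices into the minimizer list rather than genome coordinates, and chaining is omitted.

```python
from collections import Counter

def k_min_mers(minimizer_list, k):
    """All runs of k consecutive minimizers, in order of occurrence."""
    return [tuple(minimizer_list[i:i + k])
            for i in range(len(minimizer_list) - k + 1)]

def unique_index(reference_minimizers, k):
    """Index only the k-min-mers that occur exactly once in the reference."""
    kmms = k_min_mers(reference_minimizers, k)
    counts = Counter(kmms)
    return {kmm: pos for pos, kmm in enumerate(kmms) if counts[kmm] == 1}

def seed_hits(read_minimizers, index, k):
    """Return (read position, reference position) for each indexed seed."""
    return [(read_pos, index[kmm])
            for read_pos, kmm in enumerate(k_min_mers(read_minimizers, k))
            if kmm in index]

# Hypothetical minimizer labels; ("m1", "m2") repeats and is not indexed.
reference = ["m1", "m2", "m3", "m1", "m2", "m4", "m5"]
read = ["m1", "m2", "m4", "m5"]

index = unique_index(reference, k=2)
hits = seed_hits(read, index, k=2)
print(hits)
```

Because the repeated seed is skipped, every remaining hit points to a single reference location, which is what keeps the downstream chaining step cheap.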

18.

Contrastive learning in protein language space predicts interactions between drugs and protein targets.

Singh, Rohit; Sledzieski, Samuel; Bryson, Bryan; Cowen, Lenore; Berger, Bonnie.

Proc Natl Acad Sci U S A ; 120(24): e2220778120, 2023 Jun 13.

Article in English | MEDLINE | ID: mdl-37289807

ABSTRACT

Sequence-based prediction of drug-target interactions has the potential to accelerate drug discovery by complementing experimental screens. Such computational prediction needs to be generalizable and scalable while remaining sensitive to subtle variations in the inputs. However, current computational techniques fail to simultaneously meet these goals, often sacrificing performance of one to achieve the others. We develop a deep learning model, ConPLex, successfully leveraging the advances in pretrained protein language models ("PLex") and employing a protein-anchored contrastive coembedding ("Con") to outperform state-of-the-art approaches. ConPLex achieves high accuracy, broad adaptivity to unseen data, and specificity against decoy compounds. It makes predictions of binding based on the distance between learned representations, enabling predictions at the scale of massive compound libraries and the human proteome. Experimental testing of 19 kinase-drug interaction predictions validated 12 interactions, including four with subnanomolar affinity, plus a strongly binding EPHB1 inhibitor (KD = 1.3 nM). Furthermore, ConPLex embeddings are interpretable, which enables us to visualize the drug-target embedding space and use embeddings to characterize the function of human cell-surface proteins. We anticipate that ConPLex will facilitate efficient drug discovery by making highly sensitive in silico drug screening feasible at the genome scale. ConPLex is available open source at https://ConPLex.csail.mit.edu.

Subject(s)

Drug Discovery, Proteins, Humans, Proteins/chemistry, Drug Discovery/methods, Preclinical Drug Evaluation, Language

19.

split-intein Gal4 provides intersectional genetic labeling that is repressible by Gal80.

Ewen-Campen, Ben; Luan, Haojiang; Xu, Jun; Singh, Rohit; Joshi, Neha; Thakkar, Tanuj; Berger, Bonnie; White, Benjamin H; Perrimon, Norbert.

Proc Natl Acad Sci U S A ; 120(24): e2304730120, 2023 Jun 13.

Article in English | MEDLINE | ID: mdl-37276389

ABSTRACT

The split-Gal4 system allows for intersectional genetic labeling of highly specific cell types and tissues in Drosophila. However, the existing split-Gal4 system, unlike the standard Gal4 system, cannot be repressed by Gal80, and therefore cannot be controlled temporally. This lack of temporal control precludes split-Gal4 experiments in which a genetic manipulation must be restricted to specific timepoints. Here, we describe a split-Gal4 system based on a self-excising split-intein, which drives transgene expression as strongly as the current split-Gal4 system and Gal4 reagents, yet is repressible by Gal80. We demonstrate the potent inducibility of "split-intein Gal4" in vivo using both fluorescent reporters and reversible tumor induction in the gut. Further, we show that our split-intein Gal4 can be extended to the drug-inducible GeneSwitch system, providing an independent method for intersectional labeling with inducible control. We also show that the split-intein Gal4 system can be used to generate highly cell type-specific genetic drivers based on in silico predictions generated by single-cell RNAseq (scRNAseq) datasets, and we describe an algorithm ("Two Against Background" or TAB) to predict cluster-specific gene pairs across multiple tissue-specific scRNA datasets. We provide a plasmid toolkit to efficiently create split-intein Gal4 drivers based on either CRISPR knock-ins to target genes or enhancer fragments. Altogether, the split-intein Gal4 system allows for the creation of highly specific intersectional genetic drivers that are inducible/repressible.

Subject(s)

Drosophila Proteins, Transcription Factors, Animals, Transcription Factors/metabolism, Inteins, Drosophila/genetics, Drosophila/metabolism, Protein Splicing, Transgenes, Drosophila Proteins/genetics, Drosophila Proteins/metabolism

20.

sfkit: a web-based toolkit for secure and federated genomic analysis.

Mendelsohn, Simon; Froelicher, David; Loginov, Denis; Bernick, David; Berger, Bonnie; Cho, Hyunghoon.

Nucleic Acids Res ; 51(W1): W535-W541, 2023 Jul 05.

Article in English | MEDLINE | ID: mdl-37246709

ABSTRACT

Advances in genomics increasingly depend on the ability to analyze large and diverse genomic data collections, which are often difficult to amass due to privacy concerns. Recent works have shown that it is possible to jointly analyze datasets held by multiple parties, while provably preserving the privacy of each party's dataset using cryptographic techniques. However, these tools have been challenging to use in practice due to the complexities of the required setup and coordination among the parties. We present sfkit, a secure and federated toolkit for collaborative genomic studies, to allow groups of collaborators to easily perform joint analyses of their datasets without compromising privacy. sfkit consists of a web server and a command-line interface, which together support a range of use cases including both auto-configured and user-supplied computational environments. sfkit provides collaborative workflows for the essential tasks of genome-wide association study (GWAS) and principal component analysis (PCA). We envision sfkit becoming a one-stop server for secure collaborative tools for a broad range of genomic analyses. sfkit is open-source and available at: https://sfkit.org.

Subject(s)

Genome-Wide Association Study, Genomics, Software, Genome-Wide Association Study/methods, Genomics/methods, Internet, Privacy, Workflow
