Results 1 - 20 of 187
2.
Nat Commun ; 12(1): 5910, 2021 10 11.
Article in English | MEDLINE | ID: mdl-34635645

ABSTRACT

Using real-world evidence in biomedical research, an indispensable complement to clinical trials, requires access to large quantities of patient data that are typically held separately by multiple healthcare institutions. We propose FAMHE, a novel federated analytics system that, based on multiparty homomorphic encryption (MHE), enables privacy-preserving analyses of distributed datasets by yielding highly accurate results without revealing any intermediate data. We demonstrate the applicability of FAMHE to essential biomedical analysis tasks, including Kaplan-Meier survival analysis in oncology and genome-wide association studies in medical genetics. Using our system, we accurately and efficiently reproduce two published centralized studies in a federated setting, enabling biomedical insights that are not possible from individual institutions alone. Our work represents a necessary key step towards overcoming the privacy hurdle in enabling multi-centric scientific collaborations.
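For readers unfamiliar with the Kaplan-Meier analysis mentioned above, the following is a minimal sketch of the plain, centralized estimator. It is not FAMHE and involves no encryption; the function name `kaplan_meier` and the toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    times  -- follow-up time of each subject
    events -- 1 if the event was observed at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    exits = Counter(times)          # everyone leaving the risk set at time t
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]         # remove events and censored subjects
    return curve


# Hypothetical toy cohort: five subjects, two of them censored.
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

The estimator depends only on the number of events and the number at risk at each time point, and those counts add up across sites, which is one reason this kind of analysis lends itself to a distributed setting.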


Subjects
Precision Medicine , Privacy , Algorithms , Computer Security , Delivery of Health Care , Genome-Wide Association Study , Humans , Kaplan-Meier Estimate , Survival Analysis
3.
Cell Syst ; 12(10): 969-982.e6, 2021 Oct 20.
Article in English | MEDLINE | ID: mdl-34536380

ABSTRACT

We combine advances in neural language modeling and structurally motivated design to develop D-SCRIPT, an interpretable and generalizable deep-learning model, which predicts interaction between two proteins using only their sequence and maintains high accuracy with limited training data and across species. We show that a D-SCRIPT model trained on 38,345 human PPIs enables significantly improved functional characterization of fly proteins compared with the state-of-the-art approach. Evaluating the same D-SCRIPT model on protein complexes with known 3D structure, we find that the inter-protein contact map output by D-SCRIPT has significant overlap with the ground truth. We apply D-SCRIPT to screen for PPIs in cow (Bos taurus) at a genome-wide scale and, focusing on rumen physiology, identify functional gene modules related to metabolism and immune response. The predicted interactions can then be leveraged for function prediction at scale, addressing the genome-to-phenome challenge, especially in species where little data are available.

4.
Cell Syst ; 12(10): 958-968.e6, 2021 Oct 20.
Article in English | MEDLINE | ID: mdl-34525345

ABSTRACT

DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. Here, we define an algorithmic approach, mdBG, that makes use of minimizer-space de Bruijn graphs to enable long-read genome assembly. mdBG achieves orders-of-magnitude improvement in both speed and memory usage over existing methods without compromising accuracy. A human genome is assembled in under 10 min using 8 cores and 10 GB RAM, and 60 Gbp of metagenome reads are assembled in 4 min using 1 GB RAM. In addition, we constructed a minimizer-space de Bruijn graph-based representation of 661,405 bacterial genomes, comprising 16 million nodes and 45 million edges, and successfully searched it for anti-microbial resistance (AMR) genes in 12 min. We expect our advances to be essential to sequence analysis, given the rise of long-read sequencing in genomics, metagenomics, and pangenomics. Code for constructing mdBGs is freely available for download at https://github.com/ekimb/rust-mdbg/.

5.
Bioinformatics ; 37(Suppl_1): i349-i357, 2021 07 12.
Article in English | MEDLINE | ID: mdl-34252956

ABSTRACT

MOTIVATION: Recent advances in single-cell RNA-sequencing (scRNA-seq) technologies promise to enable the study of gene regulatory associations at unprecedented resolution in diverse cellular contexts. However, identifying unique regulatory associations observed only in specific cell types or conditions remains a key challenge; this is particularly so for rare transcriptional states whose sample sizes are too small for existing gene regulatory network inference methods to be effective. RESULTS: We present ShareNet, a Bayesian framework for boosting the accuracy of cell type-specific gene regulatory networks by propagating information across related cell types via an information sharing structure that is adaptively optimized for a given single-cell dataset. The techniques we introduce can be used with a range of general network inference algorithms to enhance the output for each cell type. We demonstrate the enhanced accuracy of our approach on three benchmark scRNA-seq datasets. We find that our inferred cell type-specific networks also uncover key changes in gene associations that underpin the complex rewiring of regulatory networks across cell types, tissues and dynamic biological processes. Our work presents a path toward extracting deeper insights about cell type-specific gene regulation in the rapidly growing compendium of scRNA-seq datasets. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. AVAILABILITY AND IMPLEMENTATION: The code for ShareNet is available at http://sharenet.csail.mit.edu and https://github.com/alexw16/sharenet.


Subjects
Gene Expression Profiling , Single-Cell Analysis , Bayes Theorem , Information Dissemination , Sequence Analysis, RNA , Software
7.
Science ; 373(6550)2021 07 02.
Article in English | MEDLINE | ID: mdl-34210851

ABSTRACT

Synthetic biological networks comprising fast, reversible reactions could enable engineering of new cellular behaviors that are not possible with slower regulation. Here, we created a bistable toggle switch in Saccharomyces cerevisiae using a cross-repression topology comprising 11 protein-protein phosphorylation elements. The toggle is ultrasensitive, can be induced to switch states in seconds, and exhibits long-term bistability. Motivated by our toggle's architecture and size, we developed a computational framework to search endogenous protein pathways for other large and similar bistable networks. Our framework helped us to identify and experimentally verify five formerly unreported endogenous networks that exhibit bistability. Building synthetic protein-protein networks will enable bioengineers to design fast sensing and processing systems, allow sophisticated regulation of cellular processes, and aid discovery of endogenous networks with particular functions.


Subjects
Bioengineering , Protein Interaction Maps , Saccharomyces cerevisiae Proteins/metabolism , Saccharomyces cerevisiae/metabolism , Mitogen-Activated Protein Kinases/genetics , Mitogen-Activated Protein Kinases/metabolism , Phosphorylation , Saccharomyces cerevisiae Proteins/genetics
8.
JCI Insight ; 6(16)2021 08 23.
Article in English | MEDLINE | ID: mdl-34252054

ABSTRACT

SARS-CoV-2 infects epithelial cells of the human gastrointestinal (GI) tract and causes related symptoms. HIV infection impairs gut homeostasis and is associated with an increased risk of COVID-19 fatality. To investigate the potential link between these observations, we analyzed single-cell transcriptional profiles and SARS-CoV-2 entry receptor expression across lymphoid and mucosal human tissue from chronically HIV-infected individuals and uninfected controls. Absorptive gut enterocytes displayed the highest coexpression of SARS-CoV-2 receptors ACE2, TMPRSS2, and TMPRSS4, of which ACE2 expression was associated with canonical interferon response and antiviral genes. Chronic treated HIV infection was associated with a clear antiviral response in gut enterocytes and, unexpectedly, with a substantial reduction of ACE2 and TMPRSS2 target cells. Gut tissue from SARS-CoV-2-infected individuals, however, showed abundant SARS-CoV-2 nucleocapsid protein in both the large and small intestine, including an HIV-coinfected individual. Thus, upregulation of antiviral response genes and downregulation of ACE2 and TMPRSS2 in the GI tract of HIV-infected individuals does not prevent SARS-CoV-2 infection in this compartment. The impact of these HIV-associated intestinal mucosal changes on SARS-CoV-2 infection dynamics, disease severity, and vaccine responses remains unclear and requires further investigation.


Subjects
Angiotensin-Converting Enzyme 2/analysis , HIV Infections/virology , Intestinal Mucosa/virology , SARS-CoV-2/isolation & purification , Serine Endopeptidases/analysis , Adult , Chronic Disease , Female , Humans , Intestinal Mucosa/chemistry , Male , Middle Aged
9.
IEEE Trans Inf Theory ; 67(6): 3287-3294, 2021 Jun.
Article in English | MEDLINE | ID: mdl-34257466

ABSTRACT

Levenshtein edit distance has played a central role, both past and present, in sequence alignment in particular and biological database similarity search in general. We start our review with a history of dynamic programming algorithms for computing Levenshtein distance and sequence alignments. Next, we describe how those algorithms led to heuristics employed in the most widely used software in bioinformatics, BLAST, a program to search DNA and protein databases for evolutionarily relevant similarities. More recently, the advent of modern genomic sequencing and the volume of data it generates has resulted in a return to the problem of local alignment. We conclude with how the mathematical formulation of Levenshtein distance as a metric made possible additional optimizations to similarity search in biological contexts. These modern optimizations are built around the low metric entropy and fractional dimensionality of biological databases, enabling orders of magnitude acceleration of biological similarity search.
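The dynamic programming computation of Levenshtein distance that this review starts from can be sketched as below. This is the standard textbook recurrence with unit costs, written with two rows to keep memory linear; it is not code from the article, and the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a                      # keep the shorter string as the row
    prev = list(range(len(b) + 1))       # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                       # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,             # delete ca
                curr[j - 1] + 1,         # insert cb
                prev[j - 1] + cost,      # match or substitute
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Because the result is non-negative, symmetric, zero only for identical strings, and obeys the triangle inequality, it is a metric, which is the property the abstract refers to when it mentions metric-based optimizations of similarity search.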

10.
Cell Syst ; 12(6): 654-669.e3, 2021 06 16.
Article in English | MEDLINE | ID: mdl-34139171

ABSTRACT

Language models have recently emerged as a powerful machine-learning approach for distilling information from massive protein sequence databases. From readily available sequence data alone, these models discover evolutionary, structural, and functional organization across protein space. Using language models, we can encode amino-acid sequences into distributed vector representations that capture their structural and functional properties, as well as evaluate the evolutionary fitness of sequence variants. We discuss recent advances in protein language modeling and their applications to downstream protein property prediction problems. We then consider how these models can be enriched with prior biological knowledge and introduce an approach for encoding protein structural knowledge into the learned representations. The knowledge distilled by these models allows us to improve downstream function prediction through transfer learning. Deep protein language models are revolutionizing protein biology. They suggest new ways to approach protein and therapeutic design. However, further developments are needed to encode strong biological priors into protein language models and to increase their accessibility to the broader community.

11.
Genome Biol ; 22(1): 131, 2021 05 03.
Article in English | MEDLINE | ID: mdl-33941239

ABSTRACT

A complete understanding of biological processes requires synthesizing information across heterogeneous modalities, such as age, disease status, or gene expression. Technological advances in single-cell profiling have enabled researchers to assay multiple modalities simultaneously. We present Schema, which uses a principled metric learning strategy that identifies informative features in a modality to synthesize disparate modalities into a single coherent interpretation. We use Schema to infer cell types by integrating gene expression and chromatin accessibility data; demonstrate informative data visualizations that synthesize multiple modalities; perform differential gene expression analysis in the context of spatial variability; and estimate evolutionary pressure on peptide sequences.

12.
Nat Methods ; 18(2): 176-185, 2021 02.
Article in English | MEDLINE | ID: mdl-33542510

ABSTRACT

Cryo-electron microscopy (cryo-EM) single-particle analysis has proven powerful in determining the structures of rigid macromolecules. However, many imaged protein complexes exhibit conformational and compositional heterogeneity that poses a major challenge to existing three-dimensional reconstruction methods. Here, we present cryoDRGN, an algorithm that leverages the representation power of deep neural networks to directly reconstruct continuous distributions of 3D density maps and map per-particle heterogeneity of single-particle cryo-EM datasets. Using cryoDRGN, we uncovered residual heterogeneity in high-resolution datasets of the 80S ribosome and the RAG complex, revealed a new structural state of the assembling 50S ribosome, and visualized large-scale continuous motions of a spliceosome complex. CryoDRGN contains interactive tools to visualize a dataset's distribution of per-particle variability, generate density maps for exploratory analysis, extract particle subsets for use with other tools and generate trajectories to visualize molecular motions. CryoDRGN is open-source software freely available at http://cryodrgn.csail.mit.edu .


Subjects
Cryoelectron Microscopy/methods , Macromolecular Substances/chemistry , Neural Networks, Computer , Molecular Structure
13.
Sci Immunol ; 6(56)2021 Feb 26.
Article in English | MEDLINE | ID: mdl-33637594

ABSTRACT

Mast cells (MCs) play a pathobiologic role in type 2 (T2) allergic inflammatory diseases of the airway, including asthma and chronic rhinosinusitis with nasal polyposis (CRSwNP). Distinct MC subsets infiltrate the airway mucosa in T2 disease, including subepithelial MCs expressing the proteases tryptase and chymase (MCTC) and epithelial MCs expressing tryptase without chymase (MCT). However, mechanisms underlying MC expansion and the transcriptional programs underlying their heterogeneity are poorly understood. Here, we use flow cytometry and single-cell RNA-sequencing (scRNA-seq) to conduct a comprehensive analysis of human MC hyperplasia in CRSwNP, a T2 cytokine-mediated inflammatory disease. We link discrete cell surface phenotypes to the distinct transcriptomes of CRSwNP MCT and MCTC, which represent polarized ends of a transcriptional gradient of nasal polyp MCs. We find a subepithelial population of CD38highCD117high MCs that is markedly expanded during T2 inflammation. These CD38highCD117high MCs exhibit an intermediate phenotype relative to the expanded MCT and MCTC subsets. CD38highCD117high MCs are distinct from circulating MC progenitors and are enriched for proliferation, which is markedly increased in CRSwNP patients with aspirin-exacerbated respiratory disease, a severe disease subset characterized by increased MC burden and elevated MC activation. We observe that MCs expressing a polyp MCT-like effector program are also found within the lung during fibrotic diseases and asthma, and further identify marked differences between MCTC in nasal polyps and skin. These results indicate that MCs display distinct inflammation-associated effector programs and suggest that in situ MC proliferation is a major component of MC hyperplasia in human T2 inflammation.

14.
Science ; 371(6526): 284-288, 2021 01 15.
Article in English | MEDLINE | ID: mdl-33446556

ABSTRACT

The ability for viruses to mutate and evade the human immune system and cause infection, called viral escape, remains an obstacle to antiviral and vaccine development. Understanding the complex rules that govern escape could inform therapeutic design. We modeled viral escape with machine learning algorithms originally developed for human natural language. We identified escape mutations as those that preserve viral infectivity but cause a virus to look different to the immune system, akin to word changes that preserve a sentence's grammaticality but change its meaning. With this approach, language models of influenza hemagglutinin, HIV-1 envelope glycoprotein (HIV Env), and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) Spike viral proteins can accurately predict structural escape patterns using sequence data alone. Our study represents a promising conceptual bridge between natural language and viral evolution.


Subjects
Acquired Immunodeficiency Syndrome/immunology , COVID-19/immunology , HIV-1/genetics , Influenza A virus/genetics , Influenza, Human/immunology , SARS-CoV-2/genetics , Acquired Immunodeficiency Syndrome/virology , Binding Sites , COVID-19/virology , Evolution, Molecular , Hemagglutinin Glycoproteins, Influenza Virus/chemistry , Hemagglutinin Glycoproteins, Influenza Virus/genetics , Humans , Influenza, Human/virology , Mutation , Protein Domains , Spike Glycoprotein, Coronavirus/chemistry , Spike Glycoprotein, Coronavirus/genetics , env Gene Products, Human Immunodeficiency Virus/chemistry , env Gene Products, Human Immunodeficiency Virus/genetics
15.
Nat Biotechnol ; 39(6): 765-774, 2021 06.
Article in English | MEDLINE | ID: mdl-33462509

ABSTRACT

Nonlinear data visualization methods, such as t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP), summarize the complex transcriptomic landscape of single cells in two dimensions or three dimensions, but they neglect the local density of data points in the original space, often resulting in misleading visualizations where densely populated subsets of cells are given more visual space than warranted by their transcriptional diversity in the dataset. Here we present den-SNE and densMAP, which are density-preserving visualization tools based on t-SNE and UMAP, respectively, and demonstrate their ability to accurately incorporate information about transcriptomic variability into the visual interpretation of single-cell RNA sequencing data. Applied to recently published datasets, our methods reveal significant changes in transcriptomic variability in a range of biological processes, including heterogeneity in transcriptomic variability of immune cells in blood and tumor, human immune cell specialization and the developmental trajectory of Caenorhabditis elegans. Our methods are readily applicable to visualizing high-dimensional data in other scientific domains.


Subjects
Data Visualization , Single-Cell Analysis , Transcriptome , Algorithms , Gene Expression Profiling/methods , Humans , Principal Component Analysis
16.
Nat Neurosci ; 24(2): 197-203, 2021 02.
Article in English | MEDLINE | ID: mdl-33432194

ABSTRACT

Although germline de novo copy number variants (CNVs) are known causes of autism spectrum disorder (ASD), the contribution of mosaic (early-developmental) copy number variants (mCNVs) has not been explored. In this study, we assessed the contribution of mCNVs to ASD by ascertaining mCNVs in genotype array intensity data from 12,077 probands with ASD and 5,500 unaffected siblings. We detected 46 mCNVs in probands and 19 mCNVs in siblings, affecting 2.8-73.8% of cells. Probands carried a significant burden of large (>4-Mb) mCNVs, which were detected in 25 probands but only one sibling (odds ratio = 11.4, 95% confidence interval = 1.5-84.2, P = 7.4 × 10⁻⁴). Event size positively correlated with severity of ASD symptoms (P = 0.016). Surprisingly, we did not observe mosaic analogues of the short de novo CNVs recurrently observed in ASD (e.g., 16p11.2). We further experimentally validated two mCNVs in postmortem brain tissue from 59 additional probands. These results indicate that mCNVs contribute a previously unexplained component of ASD risk.
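The odds ratio quoted in this abstract can be checked with a short calculation. The sketch below assumes the denominators are the full cohorts named in the abstract (12,077 probands, 5,500 siblings) and uses a Wald interval on the log odds ratio; the abstract does not say which method the authors used, and the function name `odds_ratio_wald` is hypothetical. The reported P value is not reproduced here.

```python
import math


def odds_ratio_wald(cases_a, total_a, cases_b, total_b, z=1.96):
    """Odds ratio of group A versus group B with a Wald confidence interval.

    z = 1.96 corresponds to an approximate 95% interval.
    """
    a, b = cases_a, total_a - cases_a    # carriers / non-carriers in group A
    c, d = cases_b, total_b - cases_b    # carriers / non-carriers in group B
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Large (>4-Mb) mCNVs: 25 of 12,077 probands versus 1 of 5,500 siblings.
odds_ratio, lower, upper = odds_ratio_wald(25, 12077, 1, 5500)
print(f"OR = {odds_ratio:.1f}, 95% CI = {lower:.1f}-{upper:.1f}")
```

Under these assumptions the calculation gives an odds ratio of 11.4 with an interval of 1.5 to 84.2, in line with the figures in the abstract. The interval is wide because only a single sibling carried a large mCNV.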


Subjects
Autism Spectrum Disorder/genetics , DNA Copy Number Variations , Mosaicism , Adult , Autism Spectrum Disorder/epidemiology , Autopsy , Brain Chemistry/genetics , Child , Child Development Disorders, Pervasive/genetics , Cohort Studies , Genetic Predisposition to Disease , Genotype , Germ-Line Mutation , Humans , Risk Assessment , Tissue Banks
17.
Nat Struct Mol Biol ; 28(1): 29-37, 2021 01.
Article in English | MEDLINE | ID: mdl-33318703

ABSTRACT

In motile cilia, a mechanoregulatory network is responsible for converting the action of thousands of dynein motors bound to doublet microtubules into a single propulsive waveform. Here, we use two complementary cryo-EM strategies to determine structures of the major mechanoregulators that bind ciliary doublet microtubules in Chlamydomonas reinhardtii. We determine structures of isolated radial spoke RS1 and the microtubule-bound RS1, RS2 and the nexin-dynein regulatory complex (N-DRC). From these structures, we identify and build atomic models for 30 proteins, including 23 radial-spoke subunits. We reveal how mechanoregulatory complexes dock to doublet microtubules with regular 96-nm periodicity and communicate with one another. Additionally, we observe a direct and dynamically coupled association between RS2 and the dynein motor inner dynein arm subform c (IDAc), providing a molecular basis for the control of motor activity by mechanical signals. These structures advance our understanding of the role of mechanoregulation in defining the ciliary waveform.


Subjects
Chlamydomonas reinhardtii/anatomy & histology , Cilia/metabolism , Locomotion/physiology , Plant Proteins/metabolism , Axoneme/metabolism , Biomechanical Phenomena/physiology , Cryoelectron Microscopy , Cytoskeletal Proteins/metabolism , Dyneins/metabolism , Flagella/metabolism , Microtubules/metabolism , Models, Molecular , Protein Structure, Tertiary , Signal Transduction/physiology , Sorting Nexins/metabolism
18.
Cell Syst ; 11(5): 461-477.e9, 2020 11 18.
Article in English | MEDLINE | ID: mdl-33065027

ABSTRACT

Machine learning that generates biological hypotheses has transformative potential, but most learning algorithms are susceptible to pathological failure when exploring regimes beyond the training data distribution. A solution to address this issue is to quantify prediction uncertainty so that algorithms can gracefully handle novel phenomena that confound standard methods. Here, we demonstrate the broad utility of robust uncertainty prediction in biological discovery. By leveraging Gaussian process-based uncertainty prediction on modern pre-trained features, we train a model on just 72 compounds to make predictions over a 10,833-compound library, identifying and experimentally validating compounds with nanomolar affinity for diverse kinases and whole-cell growth inhibition of Mycobacterium tuberculosis. Uncertainty facilitates a tight iterative loop between computation and experimentation and generalizes across biological domains as diverse as protein engineering and single-cell transcriptomics. More broadly, our work demonstrates that uncertainty should play a key role in the increasing adoption of machine learning algorithms into the experimental lifecycle.


Subjects
Computational Biology/methods , Forecasting/methods , Uncertainty , Algorithms , Machine Learning/trends , Normal Distribution
19.
Nat Commun ; 11(1): 5208, 2020 10 15.
Article in English | MEDLINE | ID: mdl-33060581

ABSTRACT

Cryo-electron microscopy (cryoEM) is becoming the preferred method for resolving protein structures. Low signal-to-noise ratio (SNR) in cryoEM images reduces the confidence and throughput of structure determination during several steps of data processing, resulting in impediments such as missing particle orientations. Denoising cryoEM images can not only improve downstream analysis but also accelerate the time-consuming data collection process by allowing lower electron dose micrographs to be used for analysis. Here, we present Topaz-Denoise, a deep learning method for reliably and rapidly increasing the SNR of cryoEM images and cryoET tomograms. By training on a dataset composed of thousands of micrographs collected across a wide range of imaging conditions, we are able to learn models capturing the complexity of the cryoEM image formation process. The general model we present is able to denoise new datasets without additional training. Denoising with this model improves micrograph interpretability and allows us to solve 3D single particle structures of clustered protocadherin, an elongated particle with previously elusive views. We then show that low dose collection, enabled by Topaz-Denoise, improves downstream analysis in addition to reducing data collection time. We also present a general 3D denoising model for cryoET. Topaz-Denoise and pre-trained general models are now included in Topaz. We expect that Topaz-Denoise will be of broad utility to the cryoEM community for improving micrograph and tomogram interpretability and accelerating analysis.


Subjects
Cryoelectron Microscopy/methods , Machine Learning , Resin Cements , Cadherins , Data Collection , Particle Size , Signal-To-Noise Ratio
20.
Nat Commun ; 11(1): 4662, 2020 09 16.
Article in English | MEDLINE | ID: mdl-32938926

ABSTRACT

Haplotype reconstruction of distant genetic variants remains an unsolved problem due to the short-read length of common sequencing data. Here, we introduce HapTree-X, a probabilistic framework that utilizes latent long-range information to reconstruct unspecified haplotypes in diploid and polyploid organisms. It introduces the observation that differential allele-specific expression can link genetic variants from the same physical chromosome, thus enabling the use of reads that cover only individual variants. We demonstrate HapTree-X's feasibility on in-house sequenced Genome in a Bottle RNA-seq and various whole exome, genome, and 10X Genomics datasets. HapTree-X produces more complete phases (up to 25%), even in clinically important genes, and phases more variants than other methods while maintaining similar or higher accuracy and being up to 10× faster than other tools. The advantage of HapTree-X's ability to use multiple lines of evidence, as well as to phase polyploid genomes in a single integrative framework, substantially grows as the amount of diverse data increases.


Subjects
Allelic Imbalance , Haplotypes , Sequence Analysis, RNA , Algorithms , Databases, Genetic , Diploidy , Humans , K562 Cells , Models, Genetic , Models, Statistical , Polymorphism, Single Nucleotide , Polyploidy , RNA-Seq , Sequence Analysis, RNA/methods , Sequence Analysis, RNA/statistics & numerical data