Results 1 - 20 of 228
1.
Res Sq ; 2024 Apr 26.
Article in English | MEDLINE | ID: mdl-38746169

ABSTRACT

The majority of proteins must form higher-order assemblies to perform their biological functions. Despite the importance of protein quaternary structure, there are few machine learning models that can accurately and rapidly predict the symmetry of assemblies involving multiple copies of the same protein chain. Here, we address this gap by training several classes of protein foundation models, including ESM-MSA, ESM2, and RoseTTAFold2, to predict homo-oligomer symmetry. Our best model named Seq2Symm, which utilizes ESM2, outperforms existing template-based and deep learning methods. It achieves an average PR-AUC of 0.48 and 0.44 across homo-oligomer symmetries on two different held-out test sets compared to 0.32 and 0.23 for the template-based method. Because Seq2Symm can rapidly predict homo-oligomer symmetries using a single sequence as input (~ 80,000 proteins/hour), we have applied it to 5 entire proteomes and ~ 3.5 million unlabeled protein sequences to identify patterns in protein assembly complexity across biological kingdoms and species.

2.
Genetics ; 226(1)2024 Jan 03.
Article in English | MEDLINE | ID: mdl-37996068

ABSTRACT

Mutations in the Presenilin (PSEN) genes are the most common cause of early-onset familial Alzheimer's disease (FAD). Studies in cell culture, in vitro biochemical systems, and knockin mice showed that PSEN mutations are loss-of-function mutations, impairing γ-secretase activity. Mouse genetic analysis highlighted the importance of Presenilin (PS) in learning and memory, synaptic plasticity and neurotransmitter release, and neuronal survival, and Drosophila studies further demonstrated an evolutionarily conserved role of PS in neuronal survival during aging. However, molecular pathways that interact with PS in neuronal survival remain unclear. To identify genetic modifiers that modulate PS-dependent neuronal survival, we developed a new Drosophila Psn model that exhibits age-dependent neurodegeneration and increases of apoptosis. Following a bioinformatic analysis, we tested top ranked candidate genes by selective knockdown (KD) of each gene in neurons using two independent RNAi lines in Psn KD models. Interestingly, 4 of the 9 genes enhancing neurodegeneration in Psn KD flies are involved in lipid transport and metabolism. Specifically, neuron-specific KD of lipophorin receptors, lpr1 and lpr2, dramatically worsens neurodegeneration in Psn KD flies, and overexpression of lpr1 or lpr2 does not alleviate Psn KD-induced neurodegeneration. Furthermore, lpr1 or lpr2 KD alone also leads to neurodegeneration, increased apoptosis, climbing defects, and shortened lifespan. Lastly, heterozygotic deletions of lpr1 and lpr2 or homozygotic deletions of lpr1 or lpr2 similarly lead to age-dependent neurodegeneration and further exacerbate neurodegeneration in Psn KD flies. These findings show that LpRs modulate Psn-dependent neuronal survival and are critically important for neuronal integrity in the aging brain.


Subjects
Alzheimer Disease , Drosophila , Animals , Mice , Drosophila/genetics , Drosophila/metabolism , Presenilins/genetics , Presenilins/metabolism , Brain/metabolism , Alzheimer Disease/genetics , Aging/genetics
3.
ArXiv ; 2023 Dec 07.
Article in English | MEDLINE | ID: mdl-38106455

ABSTRACT

Molecular docking is critical to structure-based virtual screening, yet the throughput of such workflows is limited by the expensive optimization of scoring functions involved in most docking algorithms. We explore how machine learning can accelerate this process by learning a scoring function with a functional form that allows for more rapid optimization. Specifically, we define the scoring function to be the cross-correlation of multi-channel ligand and protein scalar fields parameterized by equivariant graph neural networks, enabling rapid optimization over rigid-body degrees of freedom with fast Fourier transforms. The runtime of our approach can be amortized at several levels of abstraction, and is particularly favorable for virtual screening settings with a common binding pocket. We benchmark our scoring functions on two simplified docking-related tasks: decoy pose scoring and rigid conformer docking. Our method attains similar performance with faster runtime on crystal structures compared to the widely used Vina and Gnina scoring functions, and is more robust on computationally predicted structures. Code is available at https://github.com/bjing2016/scalar-fields.
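
The central computational idea in this abstract, scoring all translations at once by cross-correlating two fields with fast Fourier transforms, can be illustrated with a minimal sketch. The example below is not the authors' implementation: it assumes a single channel, a 1-D grid, circular shifts, and hand-written toy fields, whereas the paper describes multi-channel fields produced by neural networks and also handles rotations. The function names are hypothetical.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def cross_correlate_fft(protein, ligand):
    """Score every circular shift t as sum_x protein[x + t] * ligand[x].

    By the correlation theorem this equals the inverse transform of
    FFT(protein) * conj(FFT(ligand)) for real-valued inputs.
    """
    n = len(protein)
    product = [p * l.conjugate() for p, l in zip(fft(protein), fft(ligand))]
    return [value.real / n for value in fft(product, inverse=True)]


# Toy fields (assumed values): the ligand profile matches the protein
# profile when shifted by two grid points.
protein_field = [0, 0, 1, 3, 1, 0, 0, 0]
ligand_field = [1, 3, 1, 0, 0, 0, 0, 0]

scores = cross_correlate_fft(protein_field, ligand_field)
best_shift = max(range(len(scores)), key=lambda t: scores[t])
print(best_shift)  # 2
```

A brute-force loop over shifts costs O(n^2) on a grid of n points, while the FFT route costs O(n log n), which is the source of the speedup the abstract refers to.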

4.
bioRxiv ; 2023 Nov 10.
Article in English | MEDLINE | ID: mdl-37986761

ABSTRACT

Proteomics has been revolutionized by large pre-trained protein language models, which learn unsupervised representations from large corpora of sequences. The parameters of these models are then fine-tuned in a supervised setting to tailor the model to a specific downstream task. However, as model size increases, the computational and memory footprint of fine-tuning becomes a barrier for many research groups. In the field of natural language processing, which has seen a similar explosion in the size of models, these challenges have been addressed by methods for parameter-efficient fine-tuning (PEFT). In this work, we bring parameter-efficient fine-tuning methods to proteomics for the first time. Using the parameter-efficient method LoRA, we train new models for two important proteomic tasks: predicting protein-protein interactions (PPI) and predicting the symmetry of homooligomers. We show that for homooligomer symmetry prediction, these approaches achieve performance competitive with traditional fine-tuning while requiring reduced memory and using three orders of magnitude fewer parameters. On the PPI prediction task, we surprisingly find that PEFT models actually outperform traditional fine-tuning while using two orders of magnitude fewer parameters. Here, we go even further to show that freezing the parameters of the language model and training only a classification head also outperforms fine-tuning, using five orders of magnitude fewer parameters, and that both of these models outperform state-of-the-art PPI prediction methods with substantially reduced compute. We also demonstrate that PEFT is robust to variations in training hyper-parameters, and elucidate where best practices for PEFT in proteomics differ from those in natural language processing. Thus, we provide a blueprint to democratize the power of protein language model tuning to groups that have limited computational resources.
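
To make the parameter savings of LoRA concrete, here is a minimal sketch of a single low-rank-adapted linear layer in plain Python. It illustrates the general LoRA idea (a frozen weight matrix plus a trainable rank-r update B·A scaled by alpha/r) and is not the authors' training code. The matrix sizes and the helper names are assumptions chosen for illustration, and the per-matrix ratio printed below is not the whole-model figure reported in the abstract.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(frozen_w, lora_a, lora_b, x, alpha):
    """Compute W x + (alpha / r) * B (A x), where only A and B are trained.

    frozen_w: d_out x d_in, lora_a: r x d_in, lora_b: d_out x r.
    """
    rank = len(lora_a)
    base = matvec(frozen_w, x)
    update = matvec(lora_b, matvec(lora_a, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def parameter_counts(d_in, d_out, rank):
    """Return (full fine-tuning, LoRA) trainable parameters for one matrix."""
    return d_in * d_out, rank * (d_in + d_out)


frozen_w = [[1, 2], [3, 4]]
lora_a = [[1, 1]]      # rank 1
lora_b = [[0], [0]]    # B starts at zero, so the layer is initially unchanged
print(lora_forward(frozen_w, lora_a, lora_b, [1, 1], alpha=1))  # [3.0, 7.0]

# Illustrative (assumed) sizes: one 1280 x 1280 projection with rank 8.
print(parameter_counts(1280, 1280, 8))  # (1638400, 20480)
```

Because B is initialized to zero, training starts from the pre-trained model's behavior, and only the two small matrices receive gradients.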

5.
Bioinformatics ; 39(11)2023 11 01.
Article in English | MEDLINE | ID: mdl-37897686

ABSTRACT

MOTIVATION: High-quality computational structural models are now precomputed and available for nearly every protein in UniProt. However, the best way to leverage these models to predict which pairs of proteins interact in a high-throughput manner is not immediately clear. The recent Foldseek method of van Kempen et al. encodes the structural information of distances and angles along the protein backbone into a linear string of the same length as the protein string, using tokens from a 21-letter discretized structural alphabet (3Di). RESULTS: We show that using both the amino acid sequence and the 3Di sequence generated by Foldseek as inputs to our recent deep-learning method, Topsy-Turvy, substantially improves the performance of predicting protein-protein interactions cross-species. Thus TT3D (Topsy-Turvy 3D) presents a way to reuse all the computational effort going into producing high-quality structural models from sequence, while being sufficiently lightweight so that high-quality binary protein-protein interaction predictions across all protein pairs can be made genome-wide. AVAILABILITY AND IMPLEMENTATION: TT3D is available at https://github.com/samsledje/D-SCRIPT. An archived version of the code at time of submission can be found at https://zenodo.org/records/10037674.


Subjects
Proteins , Software , Amino Acid Sequence , Proteins/chemistry
6.
J Exp Med ; 220(12)2023 12 04.
Article in English | MEDLINE | ID: mdl-37843832

ABSTRACT

The functional role of CD8+ lymphocytes in tuberculosis remains poorly understood. We depleted innate and/or adaptive CD8+ lymphocytes in macaques and showed that loss of all CD8α+ cells (using anti-CD8α antibody) significantly impaired early control of Mycobacterium tuberculosis (Mtb) infection, leading to increased granulomas, lung inflammation, and bacterial burden. Analysis of barcoded Mtb from infected macaques demonstrated that depletion of all CD8+ lymphocytes allowed increased establishment of Mtb in lungs and dissemination within lungs and to lymph nodes, while depletion of only adaptive CD8+ T cells (with anti-CD8β antibody) worsened bacterial control in lymph nodes. Flow cytometry and single-cell RNA sequencing revealed polyfunctional cytotoxic CD8+ lymphocytes in control granulomas, while CD8-depleted animals were unexpectedly enriched in CD4 and γδ T cells adopting incomplete cytotoxic signatures. Ligand-receptor analyses identified IL-15 signaling in granulomas as a driver of cytotoxic T cells. These data support that CD8+ lymphocytes are required for early protection against Mtb and suggest polyfunctional cytotoxic responses as a vaccine target.


Subjects
Mycobacterium tuberculosis , Tuberculosis , Animals , Macaca , Tuberculosis/microbiology , CD8-Positive T-Lymphocytes , Granuloma , CD4-Positive T-Lymphocytes
7.
Genome Biol ; 24(1): 195, 2023 08 25.
Article in English | MEDLINE | ID: mdl-37626411

ABSTRACT

Dimensionality reduction summarizes the complex transcriptomic landscape of single-cell datasets for downstream analyses. Current approaches favor large cellular populations defined by many genes, at the expense of smaller and more subtly defined populations. Here, we present surprisal component analysis (SCA), a technique that newly leverages the information-theoretic notion of surprisal for dimensionality reduction to promote more meaningful signal extraction. For example, SCA uncovers clinically important cytotoxic T-cell subpopulations that are indistinguishable using existing pipelines. We also demonstrate that SCA substantially improves downstream imputation. SCA's efficient information-theoretic paradigm has broad applications to the study of complex biological tissues in health and disease.


Subjects
Gene Expression Profiling , Transcriptome
8.
Genome Res ; 33(7): 1101-1112, 2023 07.
Article in English | MEDLINE | ID: mdl-37541758

ABSTRACT

Gene expression data provide molecular insights into the functional impact of genetic variation, for example, through expression quantitative trait loci (eQTLs). With an improving understanding of the association between genotypes and gene expression comes a greater concern that gene expression profiles could be matched to genotype profiles of the same individuals in another data set, known as a linking attack. Prior works showing such a risk could analyze only a fraction of eQTLs that is independent, owing to restrictive model assumptions, leaving the full extent of this risk incompletely understood. To address this challenge, we introduce the discriminative sequence model (DSM), a novel probabilistic framework for predicting a sequence of genotypes based on gene expression data. By modeling the joint distribution over all known eQTLs in a genomic region, DSM improves the power of linking attacks with necessary calibration for linkage disequilibrium and redundant predictive signals. We show greater linking accuracy of DSM compared with existing approaches across a range of attack scenarios and data sets including up to 22,288 individuals, suggesting that DSM helps uncover a substantial additional risk overlooked by previous studies. Our work provides a unified framework for assessing the privacy risks of sharing diverse omics data sets beyond transcriptomics.


Subjects
Genome-Wide Association Study , Transcriptome , Humans , Gene Expression Profiling , Genotype , Quantitative Trait Loci , Polymorphism, Single Nucleotide
9.
Genome Res ; 33(7): 1154-1161, 2023 07.
Article in English | MEDLINE | ID: mdl-37558282

ABSTRACT

Minimizers are ubiquitously used in data structures and algorithms for efficient searching, mapping, and indexing of high-throughput DNA sequencing data. Minimizer schemes select a minimum k-mer in every L-long subsequence of the target sequence, where minimality is with respect to a predefined k-mer order. Commonly used minimizer orders select more k-mers than necessary and therefore provide limited improvement in runtime and memory usage of downstream analysis tasks. The recently introduced universal k-mer hitting sets produce minimizer orders with fewer selected k-mers. Generating compact universal k-mer hitting sets is currently infeasible for k > 13, and thus, they cannot help in the many applications that require minimizer orders for larger k. Here, we close the gap of efficient minimizer orders for large values of k by introducing decycling-set-based minimizer orders: new minimizer orders based on minimum decycling sets. We show that in practice these new minimizer orders select a number of k-mers comparable to that of minimizer orders based on universal k-mer hitting sets and can also scale to a larger k. Furthermore, we developed a method that computes the minimizers in a sequence on the fly without keeping the k-mers of a decycling set in memory. This enables the use of these minimizer orders for any value of k. We expect the new orders to improve the runtime and memory usage of algorithms and data structures in high-throughput DNA sequencing analysis.
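
The minimizer scheme defined in the second sentence of this abstract can be sketched in a few lines. This is an illustrative example only: it uses plain lexicographic order as the default k-mer order, which is an assumption made for simplicity and not the decycling-set-based order the paper introduces. Any other order can be passed in through the hypothetical `order` parameter.

```python
def lexicographic(kmer):
    """Default k-mer order (an assumption): compare k-mers as strings."""
    return kmer


def minimizers(sequence, k, window_len, order=lexicographic):
    """Select the minimum k-mer in every window_len-long subsequence.

    Returns sorted (position, k-mer) pairs. Ties within a window are
    broken in favor of the leftmost k-mer.
    """
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    kmers_per_window = window_len - k + 1
    selected = set()
    for start in range(len(kmers) - kmers_per_window + 1):
        window = range(start, start + kmers_per_window)
        selected.add(min(window, key=lambda i: (order(kmers[i]), i)))
    return [(i, kmers[i]) for i in sorted(selected)]


print(minimizers("ACGTACGT", k=3, window_len=5))
# [(0, 'ACG'), (1, 'CGT'), (4, 'ACG')]
```

Here 3 of the 6 k-mers are selected. An order that selects fewer positions for the same window length yields a smaller index, which is the quantity the abstract's new orders aim to reduce.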


Subjects
Algorithms , Software , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods
10.
Genome Res ; 33(7): 1188-1197, 2023 07.
Article in English | MEDLINE | ID: mdl-37399256

ABSTRACT

DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. We focus on the critical problem of mapping, or aligning, low-divergence sequences from long reads (e.g., Pacific Biosciences [PacBio] HiFi) to a reference genome, which poses challenges in terms of accuracy and computational resources when using cutting-edge read mapping approaches that are designed for all types of alignments. A natural idea would be to optimize efficiency with longer seeds to reduce the probability of extraneous matches; however, contiguous exact seeds quickly reach a sensitivity limit. We introduce mapquik, a novel strategy that creates accurate longer seeds by anchoring alignments through matches of k consecutively sampled minimizers (k-min-mers) and only indexing k-min-mers that occur once in the reference genome, thereby unlocking ultrafast mapping while retaining high sensitivity. We show that mapquik significantly accelerates the seeding and chaining steps (fundamental bottlenecks to read mapping) for both the human and maize genomes with [Formula: see text] sensitivity and near-perfect specificity. On the human genome, for both real and simulated reads, mapquik achieves a [Formula: see text] speedup over the state-of-the-art tool minimap2, and on the maize genome, mapquik achieves a [Formula: see text] speedup over minimap2, making mapquik the fastest mapper to date. These accelerations are enabled not only by minimizer-space seeding but also by a novel heuristic [Formula: see text] pseudochaining algorithm, which improves upon the long-standing [Formula: see text] bound. Minimizer-space computation builds the foundation for achieving real-time analysis of long-read sequencing data.
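
The seeding idea described here, matching runs of k consecutive minimizers and indexing only the runs that occur once in the reference, can be sketched as follows. This is a simplified illustration and not mapquik's implementation: it assumes the minimizers of the reference and of the read have already been computed and are given as ordered lists, positions are indices into those lists and not base coordinates, and all function names are hypothetical.

```python
from collections import Counter


def k_min_mers(minimizer_list, k):
    """All runs of k consecutive minimizers, with the index of the first one."""
    return [(i, tuple(minimizer_list[i:i + k]))
            for i in range(len(minimizer_list) - k + 1)]


def build_unique_index(reference_minimizers, k):
    """Index only the k-min-mers that occur exactly once in the reference."""
    runs = k_min_mers(reference_minimizers, k)
    counts = Counter(run for _, run in runs)
    return {run: pos for pos, run in runs if counts[run] == 1}


def seed_matches(read_minimizers, index, k):
    """Return (read position, reference position) pairs for matching seeds."""
    return [(read_pos, index[run])
            for read_pos, run in k_min_mers(read_minimizers, k)
            if run in index]


# Toy minimizer lists (assumed values).
reference = ["AAC", "CGT", "AAC", "CGT", "GGA", "TTC"]
read = ["AAC", "CGT", "GGA", "TTC"]

index = build_unique_index(reference, k=2)
print(seed_matches(read, index, k=2))  # [(1, 3), (2, 4)]
```

The repeated run ("AAC", "CGT") is left out of the index, so the read's first seed produces no ambiguous hit, while the two unique seeds both place the read at a consistent offset in the reference.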


Subjects
High-Throughput Nucleotide Sequencing , Software , Humans , Algorithms , Sequence Analysis, DNA , Genome, Human
11.
Proc Natl Acad Sci U S A ; 120(24): e2220778120, 2023 Jun 13.
Article in English | MEDLINE | ID: mdl-37289807

ABSTRACT

Sequence-based prediction of drug-target interactions has the potential to accelerate drug discovery by complementing experimental screens. Such computational prediction needs to be generalizable and scalable while remaining sensitive to subtle variations in the inputs. However, current computational techniques fail to simultaneously meet these goals, often sacrificing performance of one to achieve the others. We develop a deep learning model, ConPLex, successfully leveraging the advances in pretrained protein language models ("PLex") and employing a protein-anchored contrastive coembedding ("Con") to outperform state-of-the-art approaches. ConPLex achieves high accuracy, broad adaptivity to unseen data, and specificity against decoy compounds. It makes predictions of binding based on the distance between learned representations, enabling predictions at the scale of massive compound libraries and the human proteome. Experimental testing of 19 kinase-drug interaction predictions validated 12 interactions, including four with subnanomolar affinity, plus a strongly binding EPHB1 inhibitor (KD = 1.3 nM). Furthermore, ConPLex embeddings are interpretable, which enables us to visualize the drug-target embedding space and use embeddings to characterize the function of human cell-surface proteins. We anticipate that ConPLex will facilitate efficient drug discovery by making highly sensitive in silico drug screening feasible at the genome scale. ConPLex is available open source at https://ConPLex.csail.mit.edu.
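
The abstract states that binding predictions are based on the distance between learned representations. A minimal sketch of that scoring step is shown below. It assumes that target and compound embeddings already live in a shared space and uses cosine similarity as the closeness measure. Both the measure and the toy vectors are assumptions for illustration, and the function names are not part of the ConPLex software.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def rank_compounds(target_embedding, compound_embeddings):
    """Sort compound names by similarity to the target, most similar first."""
    scores = {name: cosine_similarity(target_embedding, embedding)
              for name, embedding in compound_embeddings.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy 2-D embeddings (assumed values).
target = [1.0, 0.0]
compounds = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}

ranking = rank_compounds(target, compounds)
print([name for name, _ in ranking])  # ['a', 'c', 'b']
```

Because each compound and each target is embedded once and then compared with a cheap vector operation, scoring a large library against many targets scales with the number of pairs and does not require a model forward pass per pair.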


Subjects
Drug Discovery , Proteins , Humans , Proteins/chemistry , Drug Discovery/methods , Drug Evaluation, Preclinical , Language
12.
Proc Natl Acad Sci U S A ; 120(24): e2304730120, 2023 06 13.
Article in English | MEDLINE | ID: mdl-37276389

ABSTRACT

The split-Gal4 system allows for intersectional genetic labeling of highly specific cell types and tissues in Drosophila. However, the existing split-Gal4 system, unlike the standard Gal4 system, cannot be repressed by Gal80, and therefore cannot be controlled temporally. This lack of temporal control precludes split-Gal4 experiments in which a genetic manipulation must be restricted to specific timepoints. Here, we describe a split-Gal4 system based on a self-excising split-intein, which drives transgene expression as strongly as the current split-Gal4 system and Gal4 reagents, yet which is repressible by Gal80. We demonstrate the potent inducibility of "split-intein Gal4" in vivo using both fluorescent reporters and via reversible tumor induction in the gut. Further, we show that our split-intein Gal4 can be extended to the drug-inducible GeneSwitch system, providing an independent method for intersectional labeling with inducible control. We also show that the split-intein Gal4 system can be used to generate highly cell type-specific genetic drivers based on in silico predictions generated by single-cell RNAseq (scRNAseq) datasets, and we describe an algorithm ("Two Against Background" or TAB) to predict cluster-specific gene pairs across multiple tissue-specific scRNA datasets. We provide a plasmid toolkit to efficiently create split-intein Gal4 drivers based on either CRISPR knock-ins to target genes or using enhancer fragments. Altogether, the split-intein Gal4 system allows for the creation of highly specific intersectional genetic drivers that are inducible/repressible.


Subjects
Drosophila Proteins , Transcription Factors , Animals , Transcription Factors/metabolism , Inteins , Drosophila/genetics , Drosophila/metabolism , Protein Splicing , Transgenes , Drosophila Proteins/genetics , Drosophila Proteins/metabolism
13.
Nucleic Acids Res ; 51(W1): W535-W541, 2023 07 05.
Article in English | MEDLINE | ID: mdl-37246709

ABSTRACT

Advances in genomics increasingly depend upon the ability to analyze large and diverse genomic data collections, which are often difficult to amass due to privacy concerns. Recent works have shown that it is possible to jointly analyze datasets held by multiple parties, while provably preserving the privacy of each party's dataset using cryptographic techniques. However, these tools have been challenging to use in practice due to the complexities of the required setup and coordination among the parties. We present sfkit, a secure and federated toolkit for collaborative genomic studies, to allow groups of collaborators to easily perform joint analyses of their datasets without compromising privacy. sfkit consists of a web server and a command-line interface, which together support a range of use cases including both auto-configured and user-supplied computational environments. sfkit provides collaborative workflows for the essential tasks of genome-wide association study (GWAS) and principal component analysis (PCA). We envision sfkit becoming a one-stop server for secure collaborative tools for a broad range of genomic analyses. sfkit is open-source and available at: https://sfkit.org.


Subjects
Genome-Wide Association Study , Genomics , Software , Genome-Wide Association Study/methods , Genomics/methods , Internet , Privacy , Workflow
14.
ArXiv ; 2023 Apr 05.
Article in English | MEDLINE | ID: mdl-37064532

ABSTRACT

Protein structure prediction has reached revolutionary levels of accuracy on single structures, yet distributional modeling paradigms are needed to capture the conformational ensembles and flexibility that underlie biological function. Towards this goal, we develop EigenFold, a diffusion generative modeling framework for sampling a distribution of structures from a given protein sequence. We define a diffusion process that models the structure as a system of harmonic oscillators and which naturally induces a cascading-resolution generative process along the eigenmodes of the system. On recent CAMEO targets, EigenFold achieves a median TMScore of 0.84, while providing a more comprehensive picture of model uncertainty via the ensemble of sampled structures relative to existing methods. We then assess EigenFold's ability to model and predict conformational heterogeneity for fold-switching proteins and ligand-induced conformational change. Code is available at https://github.com/bjing2016/EigenFold.

15.
Mol Cell ; 83(8): 1350-1367.e7, 2023 04 20.
Article in English | MEDLINE | ID: mdl-37028419

ABSTRACT

The mammalian SWI/SNF (mSWI/SNF or BAF) family of chromatin remodeling complexes plays critical roles in regulating DNA accessibility and gene expression. The three final-form subcomplexes (cBAF, PBAF, and ncBAF) are distinct in biochemical componentry, chromatin targeting, and roles in disease; however, the contributions of their constituent subunits to gene expression remain incompletely defined. Here, we performed Perturb-seq-based CRISPR-Cas9 knockout screens targeting mSWI/SNF subunits individually and in select combinations, followed by single-cell RNA-seq and SHARE-seq. We uncovered complex-, module-, and subunit-specific contributions to distinct regulatory networks and defined paralog subunit relationships and shifted subcomplex functions upon perturbations. Synergistic, intra-complex genetic interactions between subunits reveal functional redundancy and modularity. Importantly, single-cell subunit perturbation signatures mapped across bulk primary human tumor expression profiles both mirror and predict cBAF loss-of-function status in cancer. Our findings highlight the utility of Perturb-seq to dissect disease-relevant gene regulatory impacts of heterogeneous, multi-component master regulatory complexes.


Subjects
Chromatin Assembly and Disassembly , Neoplasms , Animals , Humans , Chromosomal Proteins, Non-Histone/genetics , Chromosomal Proteins, Non-Histone/metabolism , Transcription Factors/genetics , Transcription Factors/metabolism , Chromatin/genetics , Mammals/metabolism
17.
bioRxiv ; 2023 Mar 24.
Article in English | MEDLINE | ID: mdl-36993523

ABSTRACT

The split-Gal4 system allows for intersectional genetic labeling of highly specific cell-types and tissues in Drosophila. However, the existing split-Gal4 system, unlike the standard Gal4 system, cannot be repressed by Gal80, and therefore cannot be controlled temporally. This lack of temporal control precludes split-Gal4 experiments in which a genetic manipulation must be restricted to specific timepoints. Here, we describe a new split-Gal4 system based on a self-excising split-intein, which drives transgene expression as strongly as the current split-Gal4 system and Gal4 reagents, yet which is fully repressible by Gal80. We demonstrate the potent inducibility of "split-intein Gal4" in vivo using both fluorescent reporters and via reversible tumor induction in the gut. Further, we show that our split-intein Gal4 can be extended to the drug-inducible GeneSwitch system, providing an independent method for intersectional labeling with inducible control. We also show that the split-intein Gal4 system can be used to generate highly cell-type specific genetic drivers based on in silico predictions generated by single cell RNAseq (scRNAseq) datasets, and we describe a new algorithm ("Two Against Background" or TAB) to predict cluster-specific gene pairs across multiple tissue-specific scRNA datasets. We provide a plasmid toolkit to efficiently create split-intein Gal4 drivers based on either CRISPR knock-ins to target genes or using enhancer fragments. Altogether, the split-intein Gal4 system allows for the creation of highly specific intersectional genetic drivers that are inducible/repressible. Significance statement: The split-Gal4 system allows Drosophila researchers to drive transgene expression with extraordinary cell type specificity. However, the existing split-Gal4 system cannot be controlled temporally, and therefore cannot be applied to many important areas of research. Here, we present a new split-Gal4 system based on a self-excising split-intein, which is fully controllable by Gal80, as well as a related drug-inducible split GeneSwitch system. This approach can both leverage and inform single-cell RNAseq datasets, and we introduce an algorithm to identify pairs of genes that precisely and narrowly mark a desired cell cluster. Our split-intein Gal4 system will be of value to the Drosophila research community, and allow for the creation of highly specific genetic drivers that are also inducible/repressible.

18.
bioRxiv ; 2023 Jan 26.
Article in English | MEDLINE | ID: mdl-36747646

ABSTRACT

The ability to detect and quantify microbiota over time has a plethora of clinical, basic science, and public health applications. One of the primary means of tracking microbiota is through sequencing technologies. When the microorganism of interest is well characterized or known a priori, targeted sequencing is often used. In many applications, however, untargeted bulk (shotgun) sequencing is more appropriate; for instance, the tracking of infection transmission events and nucleotide variants across multiple genomic loci, or studying the role of multiple genes in a particular phenotype. Given these applications, and the observation that pathogens (e.g. Clostridioides difficile, Escherichia coli, Salmonella enterica) and other taxa of interest can reside at low relative abundance in the gastrointestinal tract, there is a critical need for algorithms that accurately track low-abundance taxa with strain level resolution. Here we present a sequence quality- and time-aware model, ChronoStrain, that introduces uncertainty quantification to gauge low-abundance species and significantly outperforms the current state-of-the-art on both real and synthetic data. ChronoStrain leverages sequences' quality scores and the samples' temporal information to produce a probability distribution over abundance trajectories for each strain tracked in the model. We demonstrate ChronoStrain's improved performance in capturing post-antibiotic E. coli strain blooms among women with recurrent urinary tract infections (UTIs) from the UTI Microbiome (UMB) Project. Other strain tracking models on the same data either show inconsistent temporal colonization or can only track consistently using very coarse groupings. In contrast, our probabilistic outputs can reveal the relationship between low-confidence strains present in the sample that cannot be reliably assigned a single reference label (either due to poor coverage or novelty) while simultaneously calling high-confidence strains that can be unambiguously assigned a label. We also include and analyze newly sequenced cultured samples from the UMB Project.

19.
PLoS One ; 18(2): e0270965, 2023.
Article in English | MEDLINE | ID: mdl-36735673

ABSTRACT

With the ease of gene sequencing and the technology available to study and manipulate non-model organisms, the extension of the methodological toolbox required to translate our understanding of model organisms to non-model organisms has become an urgent problem. For example, mining of large coral and their symbiont sequence data is a challenge, but also provides an opportunity for understanding functionality and evolution of these and other non-model organisms. Much more information than for any other eukaryotic species is available for humans, especially related to signal transduction and diseases. However, the coral cnidarian host and humans diverged over 700 million years ago and homologies between proteins in the two species are therefore often in the gray zone, or at least often undetectable with traditional BLAST searches. We introduce a two-stage approach to identifying putative coral homologues of human proteins. First, through remote homology detection using Hidden Markov Models, we identify candidate human homologues in the cnidarian genome. However, for many proteins, the human genome alone contains multiple family members with similar or even more divergence in sequence. In the second stage, therefore, we filter the remote homology results based on the functional and structural plausibility of each coral candidate, shortlisting the coral proteins likely to have conserved some of the functions of the human proteins. We demonstrate our approach with a pipeline for mapping membrane receptors in humans to membrane receptors in corals, with specific focus on the stony coral, P. damicornis. More than 1000 human membrane receptors mapped to 335 coral receptors, including 151 G protein coupled receptors (GPCRs). To validate specific sub-families, we chose opsin proteins, representative GPCRs that confer light sensitivity, and Toll-like receptors, representative non-GPCRs, which function in the immune response, and their ability to communicate with microorganisms. Through detailed structure-function analysis of their ligand-binding pockets and downstream signaling cascades, we selected those candidate remote homologues likely to carry out related functions in the corals. This pipeline may prove generally useful for other non-model organisms, such as to support the growing field of synthetic biology.


Subjects
Anthozoa , Receptors, G-Protein-Coupled , Signal Transduction , Animals , Humans , Anthozoa/genetics , Anthozoa/physiology , Genome , Receptors, G-Protein-Coupled/genetics , Receptors, G-Protein-Coupled/metabolism , Models, Animal
20.
Genome Biol ; 24(1): 5, 2023 01 11.
Article in English | MEDLINE | ID: mdl-36631897

ABSTRACT

Secure multiparty computation (MPC) is a cryptographic tool that allows computation on top of sensitive biomedical data without revealing private information to the involved entities. Here, we introduce Sequre, an easy-to-use, high-performance framework for developing performant MPC applications. Sequre offers a set of automatic compile-time optimizations that significantly improve the performance of MPC applications and incorporates the syntax of Python programming language to facilitate rapid application development. We demonstrate its usability and performance on various bioinformatics tasks showing up to 3-4 times increased speed over the existing pipelines with 7-fold reductions in codebase sizes.


Subjects
Computational Biology , Information Dissemination