Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 11 de 11
Filtrar
Mais filtros

Base de dados
Tipo de documento
Intervalo de ano de publicação
1.
Genome Res ; 34(9): 1312-1323, 2024 Oct 11.
Artigo em Inglês | MEDLINE | ID: mdl-39111815

RESUMO

Finding relatives within a study cohort is a necessary step in many genomic studies. However, when the cohort is distributed across multiple entities subject to data-sharing restrictions, performing this step often becomes infeasible. Developing a privacy-preserving solution for this task is challenging owing to the burden of estimating kinship between all the pairs of individuals across data sets. We introduce SF-Relate, a practical and secure federated algorithm for identifying genetic relatives across data silos. SF-Relate vastly reduces the number of individual pairs to compare while maintaining accurate detection through a novel locality-sensitive hashing (LSH) approach. We assign individuals who are likely to be related together into buckets and then test relationships only between individuals in matching buckets across parties. To this end, we construct an effective hash function that captures identity-by-descent (IBD) segments in genetic sequences, which, along with a new bucketing strategy, enable accurate and practical private relative detection. To guarantee privacy, we introduce an efficient algorithm based on multiparty homomorphic encryption (MHE) to allow data holders to cooperatively compute the relatedness coefficients between individuals and to further classify their degrees of relatedness, all without sharing any private data. We demonstrate the accuracy and practical runtimes of SF-Relate on the UK Biobank and All of Us data sets. On a data set of 200,000 individuals split between two parties, SF-Relate detects 97% of third-degree or closer relatives within 15 h of runtime. Our work enables secure identification of relatives across large-scale genomic data sets.


Assuntos
Algoritmos , Genômica , Humanos , Genômica/métodos , Linhagem , Segurança Computacional
2.
Nat Methods ; 20(4): 559-568, 2023 04.
Artigo em Inglês | MEDLINE | ID: mdl-36959322

RESUMO

Structural variants (SVs) are a major driver of genetic diversity and disease in the human genome and their discovery is imperative to advances in precision medicine. Existing SV callers rely on hand-engineered features and heuristics to model SVs, which cannot scale to the vast diversity of SVs nor fully harness the information available in sequencing datasets. Here we propose an extensible deep-learning framework, Cue, to call and genotype SVs that can learn complex SV abstractions directly from the data. At a high level, Cue converts alignments to images that encode SV-informative signals and uses a stacked hourglass convolutional neural network to predict the type, genotype and genomic locus of the SVs captured in each image. We show that Cue outperforms the state of the art in the detection of several classes of SVs on synthetic and real short-read data and that it can be easily extended to other sequencing platforms, while achieving competitive performance.


Assuntos
Aprendizado Profundo , Software , Humanos , Genótipo , Sinais (Psicologia) , Variação Estrutural do Genoma , Genoma Humano
3.
Bioinformatics ; 36(4): 1082-1090, 2020 02 15.
Artigo em Inglês | MEDLINE | ID: mdl-31584621

RESUMO

MOTIVATION: We propose Meltos, a novel computational framework to address the challenging problem of building tumor phylogeny trees using somatic structural variants (SVs) among multiple samples. Meltos leverages the tumor phylogeny tree built on somatic single nucleotide variants (SNVs) to identify high confidence SVs and produce a comprehensive tumor lineage tree, using a novel optimization formulation. While we do not assume the evolutionary progression of SVs is necessarily the same as SNVs, we show that a tumor phylogeny tree using high-quality somatic SNVs can act as a guide for calling and assigning somatic SVs on a tree. Meltos utilizes multiple genomic read signals for potential SV breakpoints in whole genome sequencing data and proposes a probabilistic formulation for estimating variant allele fractions (VAFs) of SV events. RESULTS: In order to assess the ability of Meltos to correctly refine SNV trees with SV information, we tested Meltos on two simulated datasets with five genomes in both. We also assessed Meltos on two real cancer datasets. We tested Meltos on multiple samples from a liposarcoma tumor and on a multi-sample breast cancer data (Yates et al., 2015), where the authors provide validated structural variation events together with deep, targeted sequencing for a collection of somatic SNVs. We show Meltos has the ability to place high confidence validated SV calls on a refined tumor phylogeny tree. We also showed the flexibility of Meltos to either estimate VAFs directly from genomic data or to use copy number corrected estimates. AVAILABILITY AND IMPLEMENTATION: Meltos is available at https://github.com/ih-lab/Meltos. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Neoplasias , Genoma , Variação Estrutural do Genoma , Genômica , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Neoplasias/genética , Filogenia , Análise de Sequência , Software
4.
Bioinformatics ; 29(13): i361-70, 2013 Jul 01.
Artigo em Inglês | MEDLINE | ID: mdl-23813006

RESUMO

SUMMARY: The increasing availability of high-throughput sequencing technologies has led to thousands of human genomes having been sequenced in the past years. Efforts such as the 1000 Genomes Project further add to the availability of human genome variation data. However, to date, there is no method that can map reads of a newly sequenced human genome to a large collection of genomes. Instead, methods rely on aligning reads to a single reference genome. This leads to inherent biases and lower accuracy. To tackle this problem, a new alignment tool BWBBLE is introduced in this article. We (i) introduce a new compressed representation of a collection of genomes, which explicitly tackles the genomic variation observed at every position, and (ii) design a new alignment algorithm based on the Burrows-Wheeler transform that maps short reads from a newly sequenced genome to an arbitrary collection of two or more (up to millions of) genomes with high accuracy and no inherent bias to one specific genome. AVAILABILITY: http://viq854.github.com/bwbble.


Assuntos
Genoma Humano , Alinhamento de Sequência/métodos , Análise de Sequência de DNA/métodos , Software , Algoritmos , Variação Genética , Genômica/métodos , Sequenciamento de Nucleotídeos em Larga Escala , Humanos
5.
Res Comput Mol Biol ; 14758: 308-313, 2024.
Artigo em Inglês | MEDLINE | ID: mdl-39027313

RESUMO

Finding relatives within a study cohort is a necessary step in many genomic studies. However, when the cohort is distributed across multiple entities subject to data-sharing restrictions, performing this step often becomes infeasible. Developing a privacy-preserving solution for this task is challenging due to the significant burden of estimating kinship between all pairs of individuals across datasets. We introduce SF-Relate, a practical and secure federated algorithm for identifying genetic relatives across data silos. SF-Relate vastly reduces the number of individual pairs to compare while maintaining accurate detection through a novel locality-sensitive hashing approach. We assign individuals who are likely to be related together into buckets and then test relationships only between individuals in matching buckets across parties. To this end, we construct an effective hash function that captures identity-by-descent (IBD) segments in genetic sequences, which, along with a new bucketing strategy, enable accurate and practical private relative detection. To guarantee privacy, we introduce an efficient algorithm based on multiparty homomorphic encryption (MHE) to allow data holders to cooperatively compute the relatedness coefficients between individuals, and to further classify their degrees of relatedness, all without sharing any private data. We demonstrate the accuracy and practical runtimes of SF-Relate on the UK Biobank and All of Us datasets. On a dataset of 200K individuals split between two parties, SF-Relate detects 94.9% of third-degree relatives, and 99.9% of second-degree or closer relatives, within 15 hours of runtime. Our work enables secure identification of relatives across large-scale genomic datasets.

6.
bioRxiv ; 2024 Feb 28.
Artigo em Inglês | MEDLINE | ID: mdl-38464114

RESUMO

Gene fusions are found as cancer drivers in diverse adult and pediatric cancers. Accurate detection of fusion transcripts is essential in cancer clinical diagnostics, prognostics, and for guiding therapeutic development. Most currently available methods for fusion transcript detection are compatible with Illumina RNA-seq involving highly accurate short read sequences. Recent advances in long read isoform sequencing enable the detection of fusion transcripts at unprecedented resolution in bulk and single cell samples. Here we developed a new computational tool CTAT-LR-fusion to detect fusion transcripts from long read RNA-seq with or without companion short reads, with applications to bulk or single cell transcriptomes. We demonstrate that CTAT-LR-fusion exceeds fusion detection accuracy of alternative methods as benchmarked with simulated and real long read RNA-seq. Using short and long read RNA-seq, we further apply CTAT-LR-fusion to bulk transcriptomes of nine tumor cell lines, and to tumor single cells derived from a melanoma sample and three metastatic high grade serous ovarian carcinoma samples. In both bulk and in single cell RNA-seq, long isoform reads yielded higher sensitivity for fusion detection than short reads with notable exceptions. By combining short and long reads in CTAT-LR-fusion, we are able to further maximize detection of fusion splicing isoforms and fusion-expressing tumor cells. CTAT-LR-fusion is available at https://github.com/TrinityCTAT/CTAT-LR-fusion/wiki.

7.
Nat Biotechnol ; 42(4): 582-586, 2024 Apr.
Artigo em Inglês | MEDLINE | ID: mdl-37291427

RESUMO

Full-length RNA-sequencing methods using long-read technologies can capture complete transcript isoforms, but their throughput is limited. We introduce multiplexed arrays isoform sequencing (MAS-ISO-seq), a technique for programmably concatenating complementary DNAs (cDNAs) into molecules optimal for long-read sequencing, increasing the throughput >15-fold to nearly 40 million cDNA reads per run on the Sequel IIe sequencer. When applied to single-cell RNA sequencing of tumor-infiltrating T cells, MAS-ISO-seq demonstrated a 12- to 32-fold increase in the discovery of differentially spliced genes.


Assuntos
Sequenciamento de Nucleotídeos em Larga Escala , Isoformas de RNA , DNA Complementar/genética , Isoformas de RNA/genética , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Isoformas de Proteínas/genética , Análise de Sequência de RNA/métodos , Transcriptoma , Perfilação da Expressão Gênica/métodos , RNA/genética
8.
J Comput Biol ; 25(7): 677-688, 2018 07.
Artigo em Inglês | MEDLINE | ID: mdl-29658784

RESUMO

We introduce GATTACA, a framework for fast unsupervised binning of metagenomic contigs. Similar to recent approaches, GATTACA clusters contigs based on their coverage profiles across a large cohort of metagenomic samples; however, unlike previous methods that rely on read mapping, GATTACA quickly estimates these profiles from kmer counts stored in a compact index. This approach can result in over an order of magnitude speedup, while matching the accuracy of earlier methods on synthetic and real data benchmarks. It also provides a way to index metagenomic samples (e.g., from public repositories such as the Human Microbiome Project) offline once and reuse them across experiments; furthermore, the small size of the sample indices allows them to be easily transferred and stored. Leveraging the MinHash technique, GATTACA also provides an efficient way to identify publicly available metagenomic data that can be incorporated into the set of reference metagenomes to further improve binning accuracy. Thus, enabling easy indexing and reuse of publicly available metagenomic data sets, GATTACA makes accurate metagenomic analyses accessible to a much wider range of researchers.


Assuntos
Teorema de Bayes , Biologia Computacional/estatística & dados numéricos , Metagenômica/estatística & dados numéricos , Microbiota/genética , Análise por Conglomerados , Humanos , Metagenoma/genética
9.
Curr Protoc Bioinformatics ; 62(1): e49, 2018 06.
Artigo em Inglês | MEDLINE | ID: mdl-29927069

RESUMO

The reconstruction of cancer phylogeny trees and quantifying the evolution of the disease is a challenging task. LICHeE and BAMSE are two computational tools designed and implemented recently for this purpose. They both utilize estimated variant allele fraction of somatic mutations across multiple samples to infer the most likely cancer phylogenies. This unit provides extensive guidelines for installing and running both LICHeE and BAMSE. © 2018 by John Wiley & Sons, Inc.


Assuntos
Algoritmos , Biologia Computacional/métodos , Neoplasias/genética , Filogenia , Humanos
10.
Nat Commun ; 8: 15311, 2017 05 16.
Artigo em Inglês | MEDLINE | ID: mdl-28508884

RESUMO

Low-cost clouds can alleviate the compute and storage burden of the genome sequencing data explosion. However, moving personal genome data analysis to the cloud can raise serious privacy concerns. Here, we devise a method named Balaur, a privacy preserving read mapper for hybrid clouds based on locality sensitive hashing and kmer voting. Balaur can securely outsource a substantial fraction of the computation to the public cloud, while being highly competitive in accuracy and speed with non-private state-of-the-art read aligners on short read data. We also show that the method is significantly faster than the state of the art in long read mapping. Therefore, Balaur can enable institutions handling massive genomic data sets to shift part of their analysis to the cloud without sacrificing accuracy or exposing sensitive information to an untrusted third party.


Assuntos
Algoritmos , Computação em Nuvem , Biologia Computacional/métodos , Segurança Computacional , Privacidade , Genômica/estatística & dados numéricos , Sequenciamento de Nucleotídeos em Larga Escala/estatística & dados numéricos , Humanos , Reprodutibilidade dos Testes
11.
Genome Biol ; 16: 91, 2015 May 06.
Artigo em Inglês | MEDLINE | ID: mdl-25944252

RESUMO

Somatic variants can be used as lineage markers for the phylogenetic reconstruction of cancer evolution. Since somatic phylogenetics is complicated by sample heterogeneity, novel specialized tree-building methods are required for cancer phylogeny reconstruction. We present LICHeE (Lineage Inference for Cancer Heterogeneity and Evolution), a novel method that automates the phylogenetic inference of cancer progression from multiple somatic samples. LICHeE uses variant allele frequencies of somatic single nucleotide variants obtained by deep sequencing to reconstruct multi-sample cell lineage trees and infer the subclonal composition of the samples. LICHeE is open source and available at http://viq854.github.io/lichee .


Assuntos
Linhagem da Célula/genética , Variação Genética , Neoplasias/genética , Algoritmos , Carcinoma de Células Renais/genética , Biologia Computacional/métodos , Simulação por Computador , Progressão da Doença , Feminino , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Neoplasias Renais/genética , Neoplasias Ovarianas/genética , Filogenia , Software , Ensaios Antitumorais Modelo de Xenoenxerto
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA