Results 1 - 20 of 42
1.
Funct Integr Genomics ; 24(5): 139, 2024 Aug 19.
Article in English | MEDLINE | ID: mdl-39158621

ABSTRACT

Recent advancements in biomedical technologies and the proliferation of high-dimensional Next Generation Sequencing (NGS) datasets have led to significant growth in the bulk and density of data. High-dimensional NGS data, characterized by a large number of genomics, transcriptomics, proteomics, and metagenomics features relative to the number of biological samples, pose significant challenges for data analysis, including increased computational burden, potential overfitting, and difficulty in interpreting results. Feature selection and feature extraction are two pivotal techniques employed to address these challenges by reducing the dimensionality of the data, thereby enhancing model performance, interpretability, and computational efficiency. Both can be categorized into statistical and machine learning methods. The present study conducts a comprehensive and comparative review of various statistical, machine learning, and deep learning-based feature selection and extraction techniques specifically tailored for the interpretation of human NGS and microarray data. A thorough literature search was performed to gather information on these techniques, focusing on array-based and NGS data analysis. Various techniques, including deep learning architectures, machine learning algorithms, and statistical methods, that have been explored for microarray, bulk RNA-Seq, and single-cell RNA-Seq (scRNA-Seq) datasets are surveyed here. The study provides an overview of these techniques, highlighting their applications, advantages, and limitations in the context of high-dimensional NGS data. This review provides better insights for readers to apply feature selection and feature extraction techniques to enhance the performance of predictive models, uncover underlying biological patterns, and gain deeper insights into massive and complex NGS and microarray data.
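The difference between feature selection and feature extraction described in the abstract above can be illustrated with a minimal sketch. It is not taken from the review: the variance filter, the power-iteration PCA, the function names and the toy expression matrix are all assumptions chosen only to show that selection keeps a subset of the original columns, while extraction builds new features as combinations of all columns.

```python
from statistics import pvariance


def select_top_variance(matrix, k):
    """Feature selection: keep the k original columns with the highest variance."""
    n_features = len(matrix[0])
    variances = [pvariance([row[j] for row in matrix]) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    keep = sorted(ranked[:k])
    return keep, [[row[j] for j in keep] for row in matrix]


def first_principal_component(matrix, iterations=200):
    """Feature extraction: build one new feature as a weighted mix of all columns."""
    n, p = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in matrix]
    axis = [1.0] * p
    for _ in range(iterations):
        # One power-iteration step on X^T X without forming the matrix.
        scores = [sum(c * a for c, a in zip(row, axis)) for row in centered]
        new_axis = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(p)]
        norm = sum(a * a for a in new_axis) ** 0.5
        if norm == 0.0:
            break
        axis = [a / norm for a in new_axis]
    scores = [sum(c * a for c, a in zip(row, axis)) for row in centered]
    return axis, scores


# Hypothetical toy matrix: 4 samples (rows) by 3 features (columns).
expression = [
    [5.0, 1.0, 0.1],
    [7.0, 1.1, 0.1],
    [9.0, 0.9, 0.1],
    [11.0, 1.0, 0.1],
]
keep, reduced = select_top_variance(expression, k=2)
axis, scores = first_principal_component(expression)
```

In practice, with thousands of features and few samples, library implementations and cross-validated selection would be preferable to this hand-rolled version.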


Subjects
High-Throughput Nucleotide Sequencing; Machine Learning; Humans; High-Throughput Nucleotide Sequencing/methods; Deep Learning
2.
Brief Bioinform ; 22(1): 55-65, 2021 01 18.
Article in English | MEDLINE | ID: mdl-32249310

ABSTRACT

Precision medicine promises to revolutionize treatment, shifting therapeutic approaches from the classical one-size-fits-all to those more tailored to the patient's individual genomic profile, lifestyle and environmental exposures. Yet, to advance precision medicine's main objective (ensuring the optimum diagnosis, treatment and prognosis for each individual), investigators need access to large-scale clinical and genomic data repositories. Despite the vast proliferation of these datasets, locating and obtaining access to many remains a challenge. We sought to provide an overview of available patient-level datasets that contain both genotypic data, obtained by next-generation sequencing, and phenotypic data, and to create a dynamic, online catalog for consultation, contribution and revision by the research community. Datasets included in this review conform to six specific inclusion parameters: (i) contain data from more than 500 human subjects; (ii) contain both genotypic and phenotypic data from the same subjects; (iii) include whole genome sequencing or whole exome sequencing data; (iv) include at least 100 recorded phenotypic variables per subject; (v) are accessible through a website or collaboration with investigators; and (vi) make access information available in English. Using these criteria, we identified 30 datasets, reviewed them and provided results in the release version of a catalog, which is publicly available through a dynamic Web application and on GitHub. Users can review as well as contribute new datasets for inclusion (Web: https://avillachlab.shinyapps.io/genophenocatalog/; GitHub: https://github.com/hms-dbmi/GenoPheno-CatalogShiny).
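The six inclusion parameters listed in the abstract above can be written as a single predicate. This is a minimal sketch: the `Dataset` record, its field names and the two example entries are hypothetical, and nothing here reflects how the published catalog is actually implemented.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    # All field names are hypothetical; they only mirror the six criteria.
    name: str
    n_subjects: int
    has_genotypes: bool
    has_phenotypes: bool
    sequencing: str            # e.g. "WGS", "WES", "array"
    n_phenotype_vars: int
    access_route: str          # "website", "collaboration" or "none"
    access_info_in_english: bool


def meets_inclusion_criteria(d):
    """Return True only if all six criteria from the abstract hold."""
    return (
        d.n_subjects > 500                                    # (i)
        and d.has_genotypes and d.has_phenotypes              # (ii)
        and d.sequencing in {"WGS", "WES"}                    # (iii)
        and d.n_phenotype_vars >= 100                         # (iv)
        and d.access_route in {"website", "collaboration"}    # (v)
        and d.access_info_in_english                          # (vi)
    )


# Hypothetical candidates, not real datasets.
candidates = [
    Dataset("cohort_a", 12000, True, True, "WGS", 350, "website", True),
    Dataset("cohort_b", 400, True, True, "WES", 800, "collaboration", True),
]
included = [d.name for d in candidates if meets_inclusion_criteria(d)]
```

Here `cohort_b` is excluded because it has 500 or fewer subjects, even though it satisfies the other five criteria.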


Subjects
Databases, Genetic; High-Throughput Nucleotide Sequencing/methods; Phenotype; Precision Medicine/methods; Genetic Predisposition to Disease; Humans; Whole Genome Sequencing/methods
3.
Genet Epidemiol ; 45(1): 36-45, 2021 02.
Article in English | MEDLINE | ID: mdl-32864779

ABSTRACT

The breakthroughs in next generation sequencing have allowed us to access data consisting of both common and rare variants, and in particular to investigate the impact of rare genetic variation on complex diseases. Although rare genetic variants are thought to be important components in explaining genetic mechanisms of many diseases, discovering these variants remains challenging, and most studies are restricted to population-based designs. Further, despite the shift in the field of genome-wide association studies (GWAS) towards studying rare variants due to the "missing heritability" phenomenon, little is known about rare X-linked variants associated with complex diseases. For instance, there is evidence that X-linked genes are highly involved in brain development and cognition when compared with autosomal genes; however, like most GWAS for other complex traits, previous GWAS for mental diseases have provided poor resources to deal with identification of rare variant associations on the X chromosome. In this paper, we address the two issues described above by proposing a method that can be used to test X-linked variants using sequencing data on families. Our method is much more general than existing methods, as it can be applied to detect both common and rare variants, and is applicable to autosomes as well. Our simulation study shows that the method is efficient, and exhibits good operational characteristics. An application to the University of Miami Study on Genetics of Autism and Related Disorders also yielded encouraging results.


Subjects
Genes, X-Linked; Genome-Wide Association Study; Genetic Variation; High-Throughput Nucleotide Sequencing; Humans; Models, Genetic; Multifactorial Inheritance
4.
Clin Chem ; 68(2): 313-321, 2022 02 01.
Article in English | MEDLINE | ID: mdl-34871369

ABSTRACT

BACKGROUND: To date, the usage of Galaxy, an open-source bioinformatics platform, has been reported primarily in research. We report 5 years' experience (2015 to 2020) with Galaxy in our hospital, as part of the "Assistance Publique-Hôpitaux de Paris" (AP-HP), to demonstrate its suitability for high-throughput sequencing (HTS) data analysis in a clinical laboratory setting. METHODS: Our Galaxy instance has been running since July 2015 and is used daily to study inherited diseases, cancer, and microbiology. For the molecular diagnosis of hereditary diseases, 6970 patients were analyzed with Galaxy (corresponding to a total of 7029 analyses). RESULTS: Using Galaxy, the time to process a batch of 23 samples (equivalent to a targeted DNA sequencing MiSeq run) from raw data to an annotated variant call file was generally less than 2 h for panels between 1 and 500 kb. Over 5 years, we only restarted the server twice for hardware maintenance and did not experience any significant problems, demonstrating the robustness of our Galaxy installation in conjunction with HTCondor as a job scheduler and a PostgreSQL database. The quality of our targeted exome sequencing method was externally evaluated annually by the European Molecular Genetics Quality Network (EMQN). Mean (SD) sensitivity was 99% (2%) for single nucleotide variants and 93% (9%) for small insertion-deletions. CONCLUSION: Our experience with Galaxy demonstrates it to be a suitable platform for HTS data analysis with vast potential to benefit patient care in a clinical laboratory setting.


Subjects
Computational Biology; Laboratories, Clinical; Computational Biology/methods; High-Throughput Nucleotide Sequencing/methods; Humans; Sequence Analysis, DNA; Software
5.
Curr Issues Mol Biol ; 43(3): 1937-1949, 2021 Nov 06.
Article in English | MEDLINE | ID: mdl-34889894

ABSTRACT

The worldwide emergence and spread of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) since 2019 has highlighted the importance of rapid and reliable diagnostic testing to prevent and control the viral transmission. However, inaccurate results may occur due to false negatives (FN) caused by polymorphisms or point mutations related to the virus evolution and compromise the accuracy of the diagnostic tests. Therefore, PCR-based SARS-CoV-2 diagnostics should be evaluated and evolve together with the rapidly increasing number of new variants appearing around the world. However, even by using a large collection of samples, laboratories are not able to test a representative collection of samples that deals with the same level of diversity that is continuously evolving worldwide. In the present study, we proposed a methodology based on an in silico and in vitro analysis. First, we used all information offered by available whole-genome sequencing data for SARS-CoV-2 for the selection of the two PCR assays targeting two different regions in the genome, and to monitor the possible impact of virus evolution on the specificity of the primers and probes of the PCR assays during and after the development of the assays. Besides this first essential in silico evaluation, a minimal set of testing was proposed to generate experimental evidence on the method performance, such as specificity, sensitivity and applicability. Therefore, a duplex reverse-transcription droplet digital PCR (RT-ddPCR) method was evaluated in silico by using 154 489 whole-genome sequences of SARS-CoV-2 strains that were representative for the circulating strains around the world. The RT-ddPCR platform was selected as it presented several advantages to detect and quantify SARS-CoV-2 RNA in clinical samples and wastewater. Next, the assays were successfully experimentally evaluated for their sensitivity and specificity. A preliminary evaluation of the applicability of the developed method was performed using both clinical and wastewater samples.


Subjects
COVID-19 Nucleic Acid Testing/methods; COVID-19/virology; Diagnostic Tests, Routine/methods; Evolution, Molecular; RNA, Viral/genetics; SARS-CoV-2/genetics; COVID-19/diagnosis; Humans; ROC Curve; SARS-CoV-2/isolation & purification
6.
Methods ; 173: 61-68, 2020 02 15.
Article in English | MEDLINE | ID: mdl-31271880

ABSTRACT

Structural variants (SVs) are a class of genomic variation shared by members of the same species. Though relatively rare, they represent an increasingly important class of variation, as SVs have been associated with diseases and susceptibility to some types of cancer. Common approaches to SV detection require the sequencing and mapping of fragments from a test genome to a high-quality reference genome. Candidate SVs correspond to fragments with discordant mapped configurations. However, because errors in the sequencing and mapping will also create discordant arrangements, many of these predictions will be spurious. When sequencing coverage is low, distinguishing true SVs from errors is even more challenging. In recent work, we have developed SV detection methods that exploit genome information of closely related individuals, namely parents and children. Our previous approaches were based on the assumption that any SV present in a child's genome must have come from one of their parents. However, using this strict restriction may have resulted in failing to predict any rare but novel variants present only in the child. In this work, we generalize our previous approaches to allow the child to carry novel variants. We consider a constrained optimization approach where variants in the child are of two types: either inherited, and therefore necessarily present in a parent, or novel. For simplicity, we consider only a single parent and a single child, each of which has a haploid genome. However, even in this restricted case, our approach has the power to improve variant prediction. We present results on simulated candidate variant regions, parent-child trios from the 1000 Genomes Project, and a subset of the 17 Platinum Genomes.


Subjects
Genome, Human/genetics; Genomics/methods; High-Throughput Nucleotide Sequencing/methods; Genomic Structural Variation/genetics; Humans
7.
BMC Genomics ; 21(Suppl 6): 405, 2020 Dec 21.
Article in English | MEDLINE | ID: mdl-33349236

ABSTRACT

BACKGROUND: Analysis of heterogeneous populations such as viral quasispecies is one of the most challenging bioinformatics problems. Although machine learning models are becoming widely employed for analysis of sequence data from such populations, their straightforward application is impeded by multiple challenges associated with technological limitations and biases, the difficulty of selecting relevant features and the need to compare genomic datasets of different sizes and structures. RESULTS: We propose a novel preprocessing approach to transform irregular genomic data into normalized image data. Such a representation makes it possible to restate the problems of classification and comparison of heterogeneous populations as image classification problems, which can be solved using a variety of available machine learning tools. We then apply the proposed approach to two important problems in molecular epidemiology: inference of viral infection stage and detection of viral transmission clusters using next-generation sequencing data. The infection staging method has been applied to HCV HVR1 samples collected from 108 recently and 257 chronically infected individuals. The SVM-based image classification approach achieved more than 95% accuracy for both recently and chronically HCV-infected individuals. Clustering has been performed on the data collected from 33 epidemiologically curated outbreaks, yielding more than 97% accuracy. CONCLUSIONS: The sequence image normalization method allows for a robust conversion of genomic data into numerical data and overcomes several issues associated with applying machine learning methods to viral populations. Image data also help in the visualization of genomic data. Experimental results demonstrate that the proposed method can be successfully applied to different problems in molecular epidemiology and surveillance of viral diseases. Simple binary classifiers and clustering techniques applied to the image data are equally or more accurate than other models.


Subjects
Genomics; Machine Learning; Algorithms; Cluster Analysis; Computational Biology; Humans; Quasispecies
8.
Stat Appl Genet Mol Biol ; 18(4)2019 05 30.
Article in English | MEDLINE | ID: mdl-31145697

ABSTRACT

Modeling high-throughput next generation sequencing (NGS) data from experiments that profile tumor and control samples for the study of DNA copy number variants (CNVs) remains a challenge in various ways. In this applied work, we provide an efficient method for detecting multiple CNVs using NGS reads ratio data. This method is based on a multiple statistical change-points model with the penalized regression approach, 1d fused LASSO, that is designed for ordered data in a one-dimensional structure. In addition, since the path algorithm traces the solution as a function of a tuning parameter, the number and locations of potential CNV region boundaries can be estimated simultaneously in an efficient way. For tuning parameter selection, we then propose a new modified Bayesian information criterion, called JMIC, and compare the proposed JMIC with three different Bayes information criteria used in the literature. Simulation results have shown the better performance of JMIC for tuning parameter selection, in comparison with the other three criteria. We applied our approach to the sequencing data of reads ratio between the breast tumor cell line HCC1954 and its matched normal cell line BL 1954, and the results are in line with those discovered in the literature.
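For orientation, the 1d fused LASSO mentioned in the abstract above is usually written as the following penalized least-squares problem. This is the textbook form, given here as an assumption: the abstract does not state the authors' exact objective (which may, for example, include an additional sparsity term), and the symbols $y_i$ (the reads ratio at ordered position $i$), $\beta_i$ (the fitted level) and $\lambda$ (the tuning parameter) are introduced only for illustration.

```latex
\hat{\beta}(\lambda)
  = \arg\min_{\beta \in \mathbb{R}^{n}}
    \; \frac{1}{2} \sum_{i=1}^{n} \left( y_i - \beta_i \right)^2
    + \lambda \sum_{i=1}^{n-1} \left| \beta_{i+1} - \beta_i \right|
```

Under this form, candidate CNV boundaries are the positions $i$ where $\hat{\beta}_{i+1} \neq \hat{\beta}_i$, and a larger $\lambda$ yields fewer such change points, which is why the choice of tuning parameter (JMIC in the abstract) determines the number of reported regions.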


Subjects
DNA Copy Number Variations; Genomics/methods; Sequence Analysis, DNA/methods; Algorithms; Bayes Theorem; Cell Line, Tumor; Computer Simulation; Genome, Human; High-Throughput Nucleotide Sequencing; Humans; Regression Analysis; Software
9.
BMC Genomics ; 20(Suppl 12): 1001, 2019 Dec 30.
Article in English | MEDLINE | ID: mdl-31888490

ABSTRACT

BACKGROUND: Inadvertent sample swaps are a real threat to data quality in any medium to large scale omics studies. While matches between samples from the same individual can in principle be identified from a few well characterized single nucleotide polymorphisms (SNPs), omics data types often only provide low to moderate coverage, thus requiring integration of evidence from a large number of SNPs to determine if two samples derive from the same individual or not. METHODS: We select about six thousand SNPs in the human genome and develop a Bayesian framework that is able to robustly identify sample matches between next generation sequencing data sets. RESULTS: We validate our approach on a variety of data sets. Most importantly, we show that our approach can establish identity between different omics data types such as Exome, RNA-Seq, and MethylCap-Seq. We demonstrate how identity detection degrades with sample quality and read coverage, but show that twenty million reads of a fairly low quality RNA-Seq sample are still sufficient for reliable sample identification. CONCLUSION: Our tool, SMASH, is able to identify sample mismatches in next generation sequencing data sets between different sequencing modalities and for low quality sequencing data.


Subjects
Genomics/methods; Polymorphism, Single Nucleotide/genetics; Software; Bayes Theorem; Genome, Human/genetics; High-Throughput Nucleotide Sequencing; Humans; Reproducibility of Results; Sequence Analysis, DNA
10.
BMC Bioinformatics ; 19(Suppl 4): 79, 2018 05 08.
Article in English | MEDLINE | ID: mdl-29745849

ABSTRACT

BACKGROUND: As one possible solution to the "missing heritability" problem, many methods have been proposed that apply pathway-based analyses, using rare variants that are detected by next generation sequencing technology. However, while a number of methods for pathway-based rare-variant analysis of multiple phenotypes have been proposed, no method considers a unified model that incorporates multiple pathways. RESULTS: Simulation studies successfully demonstrated advantages of multivariate analysis, compared to univariate analysis, and comparison studies showed the proposed approach to outperform existing methods. Moreover, real data analysis of six type 2 diabetes-related traits, using large-scale whole exome sequencing data, identified significant pathways that were not found by univariate analysis. Furthermore, strong relationships between the identified pathways and their associated metabolic disorder risk factors were found via literature search, and one of the identified pathways was successfully replicated by an analysis with an independent dataset. CONCLUSIONS: Herein, we present a powerful, pathway-based approach to investigate associations between multiple pathways and multiple phenotypes. By reflecting the natural hierarchy of biological behavior, and considering correlation between pathways and phenotypes, the proposed method is capable of analyzing multiple phenotypes and multiple pathways simultaneously.


Subjects
Genetic Variation; Signal Transduction/genetics; Algorithms; Computer Simulation; Databases, Genetic; Diabetes Mellitus, Type 2/genetics; Exome/genetics; Humans; Models, Genetic; Multivariate Analysis; Phenotype
11.
Genet Epidemiol ; 41(4): 363-371, 2017 05.
Article in English | MEDLINE | ID: mdl-28300291

ABSTRACT

Recent advances in genotyping with high-density markers allow researchers access to genomic variants including rare ones. Linkage disequilibrium (LD) is widely used to provide insight into evolutionary history. It is also the basis for association mapping in humans and other species. Better understanding of the genomic LD structure may lead to better-informed statistical tests that can improve the power of association studies. Although rare variant associations with common diseases (RVCD) have been extensively studied recently, there is very limited understanding, and even controversy, regarding LD structures among rare variants and between rare and common variants. In fact, many popular RVCD tests make the assumption that rare variants are independent. In this report, we show that two commonly used LD measures are not capable of detecting LD when rare variants are involved. We present this argument from two perspectives: the LD measures themselves and the computational issues associated with them. To address these issues, we propose an alternative LD measure, the polychoric correlation, that was originally designed for detecting associations among categorical variables. Using simulated as well as the 1000 Genomes data, we explore the performances of LD measures in detail and discuss their implications in association studies.
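The abstract above does not name the two LD measures it examines. As an illustration only, the sketch below assumes the two most widely used ones, r² and D', and computes them from phased haplotype counts. The function name and the counts are hypothetical, and the example is not a reproduction of the paper's argument; it only shows how the two measures can disagree when a rare allele is paired with a common one.

```python
def ld_measures(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r2, D_prime) from counts of the four two-locus haplotypes."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_A = (n_AB + n_Ab) / n        # frequency of allele A at locus 1
    p_B = (n_AB + n_aB) / n        # frequency of allele B at locus 2
    p_AB = n_AB / n
    if p_A in (0.0, 1.0) or p_B in (0.0, 1.0):
        raise ValueError("both loci must be polymorphic")
    d = p_AB - p_A * p_B
    r2 = d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = d / d_max if d_max > 0 else 0.0
    return d, r2, d_prime


# Hypothetical counts: rare allele A (1%) always occurs with common allele B (50%).
d, r2, d_prime = ld_measures(n_AB=10, n_Ab=0, n_aB=490, n_ab=500)
```

With these counts D' is 1 while r² is roughly 0.01, so r² alone would suggest near independence even though the rare allele never appears without B.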


Subjects
Genetic Variation; Genome-Wide Association Study; Chromosomes, Human, Pair 21/genetics; Computer Simulation; Gene Frequency/genetics; Genotype; Humans; Linkage Disequilibrium/genetics; Polymorphism, Single Nucleotide/genetics
12.
BMC Bioinformatics ; 18(1): 426, 2017 Sep 26.
Article in English | MEDLINE | ID: mdl-28950836

ABSTRACT

BACKGROUND: Constructing alignments and phylogenies for a given locus from large genome sequencing studies with relevant outgroups allows novel evolutionary and anthropological insights. However, no user-friendly tool has been developed to integrate thousands of recently available and anthropologically relevant genome sequences to construct complete sequence alignments and phylogenies. RESULTS: Here, we provide VCFtoTree, a user-friendly tool with a graphical user interface that directly accesses online databases to download, parse and analyze genome variation data for regions of interest. Our pipeline combines popular sequence datasets and tree building algorithms with custom data parsing to generate accurate alignments and phylogenies using all the individuals from the 1000 Genomes Project, Neanderthal and Denisovan genomes, as well as reference genomes of Chimpanzee and Rhesus Macaque. It can also be applied to other phased human genomes, as well as genomes from other species. The output of our pipeline includes an alignment in FASTA format and a tree file in Newick format. CONCLUSION: VCFtoTree fulfills the increasing demand for constructing alignments and phylogenies for a given locus from thousands of available genomes. Our software provides a user-friendly interface for a wider audience without prerequisite knowledge in programming. VCFtoTree can be accessed from https://github.com/duoduoo/VCFtoTree_3.0.0.


Subjects
Genetic Loci; Genome, Human; Phylogeny; Sequence Alignment/methods; Software; Algorithms; Animals; Base Sequence; Humans; INDEL Mutation/genetics; Primates; Sequence Analysis, DNA; User-Computer Interface
13.
Curr Genomics ; 18(4): 360-365, 2017 Aug.
Article in English | MEDLINE | ID: mdl-29081691

ABSTRACT

BACKGROUND: Recently, identification and functional studies of circular RNAs, a type of non-coding RNA arising from the ligation of the 3' and 5' ends of a linear RNA molecule, were conducted in mammalian cells with the development of RNA-seq technology. METHOD: Because studies on circular RNAs in plants are less thorough than those in animals, a genome-wide identification of circular RNA candidates in Arabidopsis was conducted by applying our own bioinformatics tool to several existing RNA-seq datasets specifically for non-coding RNAs. RESULTS: A total of 164 circular RNA candidates were identified from RNA-seq data, and 4 circular RNA transcripts, including both exonic and intronic circular RNAs, were experimentally validated. Interestingly, our results show that circular RNA transcripts are enriched in the photosynthesis system for the leaf tissue and correlated to the higher expression levels of their parent genes. Sixteen out of all 40 genes that have circular RNA candidates are related to the photosynthesis system, and out of the total 146 exonic circular RNA candidates, 63 are found in the chloroplast.

14.
Ann Hum Genet ; 79(3): 199-208, 2015 May.
Article in English | MEDLINE | ID: mdl-25875492

ABSTRACT

Because next generation sequencing technology can rapidly genotype most genetic variations in the genome, there is considerable interest in investigating the effects of rare variants on complex diseases. In this paper, we propose four Kullback-Leibler distance-based Tests (KLTs) for detecting genotypic differences between cases and controls. There are several features that set the proposed tests apart from existing ones. First, by explicitly considering and comparing the distributions of genotypes, existence of variants with opposite directional effects does not compromise the power of KLTs. Second, it is not necessary to set a threshold for rare variants as the KL definition makes it reasonable to consider rare and common variants together without worrying about the contribution from one type overshadowing the other. Third, KLTs are robust to null variants thanks to a built-in noise fighting mechanism. Finally, correlation among variants is taken into account implicitly so the KLTs work well regardless of the underlying LD structure. Through extensive simulations, we demonstrated good performance of KLTs compared to the sum of squared score test (SSU) and optimal sequence kernel association test (SKAT-O). Moreover, application to the Dallas Heart Study data illustrates the feasibility and performance of KLTs in a realistic setting.
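The idea of comparing genotype distributions between cases and controls with a Kullback-Leibler distance can be sketched as follows. This is not one of the four KLTs proposed in the paper, whose definitions are not given in the abstract: the symmetrized divergence, the pseudocount smoothing, the function names and the counts are all assumptions made for illustration.

```python
import math


def genotype_distribution(counts, pseudocount=0.5):
    """Turn genotype counts (0, 1, 2 minor alleles) into smoothed probabilities."""
    total = sum(counts) + pseudocount * len(counts)
    return [(c + pseudocount) / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def symmetric_kl(case_counts, control_counts, pseudocount=0.5):
    """Symmetrized KL distance between case and control genotype distributions."""
    p = genotype_distribution(case_counts, pseudocount)
    q = genotype_distribution(control_counts, pseudocount)
    return kl_divergence(p, q) + kl_divergence(q, p)


# Hypothetical genotype counts at one variant: [hom. reference, het., hom. alternate].
cases = [470, 28, 2]
controls = [492, 8, 0]
distance = symmetric_kl(cases, controls)
```

Because the statistic compares whole distributions, a variant enriched in cases and another enriched in controls both contribute positively, which matches the abstract's point about opposite directional effects. Significance would still need to be assessed, for example by permuting case and control labels.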


Subjects
Genome-Wide Association Study/methods; High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, DNA/methods; Computer Simulation; Genetic Variation; Genotype; Humans; Models, Genetic
15.
Biometrics ; 70(2): 430-40, 2014 Jun.
Article in English | MEDLINE | ID: mdl-24571656

ABSTRACT

The photoactivatable ribonucleoside enhanced cross-linking immunoprecipitation (PAR-CLIP) has been increasingly used for the global mapping of RNA-protein interaction sites. There are two key features of the PAR-CLIP experiments: The sequence read tags are likely to form an enriched peak around each RNA-protein interaction site; and the cross-linking procedure is likely to introduce a specific mutation in each sequence read tag at the interaction site. Several ad hoc methods have been developed to identify the RNA-protein interaction sites using either sequence read counts or mutation counts alone; however, rigorous statistical methods for analyzing PAR-CLIP are still lacking. In this article, we propose an integrative model to establish a joint distribution of observed read and mutation counts. To pinpoint the interaction sites at single base-pair resolution, we developed a novel modeling approach that adopts non-homogeneous hidden Markov models to incorporate the nucleotide sequence at each genomic location. Both simulation studies and data application showed that our method outperforms the ad hoc methods, and provides reliable inferences for the RNA-protein binding sites from PAR-CLIP data.


Subjects
Models, Statistical; Proteins/chemistry; RNA/chemistry; Bayes Theorem; Binding Sites; Biometry/methods; Computer Simulation; Cross-Linking Reagents; Fragile X Mental Retardation Protein/chemistry; Fragile X Mental Retardation Protein/metabolism; High-Throughput Nucleotide Sequencing; Humans; Immunoprecipitation; Markov Chains; Nucleic Acid Conformation; Protein Structure, Secondary; Proteins/metabolism; RNA/genetics; RNA/metabolism; RNA-Binding Protein FUS/chemistry; RNA-Binding Protein FUS/metabolism; Sequence Analysis, RNA
16.
Stud Health Technol Inform ; 305: 194-197, 2023 Jun 29.
Article in English | MEDLINE | ID: mdl-37386994

ABSTRACT

The paper presents the current state of the FHIR Genomics resource, an assessment of FAIR data usage, and possible future directions. FHIR Genomics forges a path towards data interoperability. By integrating both the FAIR principles and the FHIR resources, we can achieve a higher standardization across healthcare data collection and a smoother data exchange. Using the FHIR Genomics resource as an example, we want to pave the way towards the integration of genomic data into an Obstetrics-Gynecology Information system, as a future direction, to be able to identify possible disease predisposition in the fetus.


Subjects
Gynecology; Obstetrics; Female; Pregnancy; Humans; Genomics; Data Collection; Fetus
17.
Front Immunol ; 14: 1146826, 2023.
Article in English | MEDLINE | ID: mdl-37180102

ABSTRACT

The human leukocyte antigen (HLA) locus plays a central role in adaptive immune function and has significant clinical implications for tissue transplant compatibility and allelic disease associations. Studies using bulk-cell RNA sequencing have demonstrated that HLA transcription may be regulated in an allele-specific manner and single-cell RNA sequencing (scRNA-seq) has the potential to better characterize these expression patterns. However, quantification of allele-specific expression (ASE) for HLA loci requires sample-specific reference genotyping due to extensive polymorphism. While genotype prediction from bulk RNA sequencing is well described, the feasibility of predicting HLA genotypes directly from single-cell data is unknown. Here we evaluate and expand upon several computational HLA genotyping tools by comparing predictions from human single-cell data to gold-standard molecular genotyping. The highest 2-field accuracy averaged across all loci was 76% by arcasHLA and increased to 86% using a composite model of multiple genotyping tools. We also developed a highly accurate model (AUC 0.93) for predicting HLA-DRB345 copy number in order to improve genotyping accuracy of the HLA-DRB locus. Genotyping accuracy improved with read depth and was reproducible at repeat sampling. Using a meta-analytic approach, we also show that HLA genotypes from PHLAT and OptiType can generate ASE ratios that are highly correlated (R2 = 0.8 and 0.94, respectively) with those derived from gold-standard genotyping.
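The "2-field accuracy" reported in the abstract above compares allele calls after truncating HLA names to their first two colon-separated fields (for example, `A*02:01:01:01` becomes `A*02:01`). The sketch below shows one plausible way to score it, counting matched alleles per locus regardless of order. The scoring rule, the function names and the example calls are assumptions; the paper's exact definition is not given in the abstract.

```python
from collections import Counter


def two_field(allele):
    """Truncate an HLA allele name such as 'A*02:01:01:01' to 'A*02:01'."""
    gene, _, fields = allele.partition("*")
    return gene + "*" + ":".join(fields.split(":")[:2])


def allele_matches(predicted, truth):
    """Count predicted alleles that match the truth at 2-field resolution."""
    pred = Counter(two_field(a) for a in predicted)
    true = Counter(two_field(a) for a in truth)
    return sum((pred & true).values())   # multiset intersection ignores order


def two_field_accuracy(calls):
    """calls: list of (predicted_alleles, gold_standard_alleles) pairs, one per locus."""
    matched = sum(allele_matches(p, t) for p, t in calls)
    total = sum(len(t) for _, t in calls)
    return matched / total


# Hypothetical calls for two loci of one sample.
calls = [
    (["A*02:01:01", "A*03:01"], ["A*03:01:01:01", "A*02:01"]),
    (["B*07:02", "B*08:01"], ["B*07:02", "B*44:02"]),
]
accuracy = two_field_accuracy(calls)
```

Here three of the four gold-standard alleles are recovered, giving an accuracy of 0.75.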


Subjects
HLA Antigens; Transcriptome; Humans; Sequence Analysis, DNA; HLA Antigens/genetics; Histocompatibility Antigens Class I/genetics; Genotype; Histocompatibility Antigens Class II/genetics
18.
Comput Struct Biotechnol J ; 21: 5382-5393, 2023.
Article in English | MEDLINE | ID: mdl-38022693

ABSTRACT

Analysis and interpretation of high-throughput transcriptional and chromatin accessibility data at single-cell (sc) resolution are still open challenges in the biomedical field. The existence of countless bioinformatics tools, for the different analytical steps, increases the complexity of data interpretation and the difficulty to derive biological insights. In this article, we present SCALA, a bioinformatics tool for analysis and visualization of single-cell RNA sequencing (scRNA-seq) and Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) datasets, enabling either independent or integrative analysis of the two modalities. SCALA combines standard types of analysis by integrating multiple software packages varying from quality control to the identification of distinct cell populations and cell states. Additional analysis options enable functional enrichment, cellular trajectory inference, ligand-receptor analysis, and regulatory network reconstruction. SCALA is fully parameterizable, presenting data in tabular format and producing publication-ready visualizations. The different available analysis modules can aid biomedical researchers in exploring, analyzing, and visualizing their data without any prior experience in coding. We demonstrate the functionality of SCALA through two use-cases related to TNF-driven arthritic mice, handling both scRNA-seq and scATAC-seq datasets. SCALA is developed in R, Shiny and JavaScript and is mainly available as a standalone version, while an online service of more limited capacity can be found at http://scala.pavlopouloslab.info or https://scala.fleming.gr.

19.
Mol Ecol Resour ; 2022 Dec 02.
Article in English | MEDLINE | ID: mdl-36458971

ABSTRACT

Polyploids are cells or organisms with a genome consisting of more than two sets of homologous chromosomes. Polyploid plants have important traits that facilitate speciation and are thus often model systems for evolutionary, molecular ecology and agricultural studies. However, due to their unusual mode of inheritance and double-reduction, diploid models of population genetic analysis cannot properly be applied to autopolyploids. To overcome this problem, we developed a software package entitled vcfpop to perform a variety of population genetic analyses for autopolyploids, such as parentage analysis, analysis of molecular variance, principal coordinates analysis, hierarchical clustering analysis and Bayesian clustering. We used three data sets to evaluate the capability of vcfpop to analyse large data sets on a desktop computer. This software is freely available at http://github.com/huangkang1987/vcfpop.

20.
Front Genet ; 13: 1084974, 2022.
Article in English | MEDLINE | ID: mdl-36733945

ABSTRACT

Copy number variation (CNV) is one of the main structural variations in the human genome and accounts for a considerable proportion of variations. As CNVs can directly or indirectly cause cancer, mental illness, and genetic disease in humans, their effective detection is of great interest in the fields of oncogene discovery, clinical decision-making, bioinformatics, and drug discovery. The advent of next-generation sequencing data makes CNV detection possible, and a large number of CNV detection tools are based on next-generation sequencing data. Due to the complexity (e.g., bias, noise, alignment errors) of next-generation sequencing data and CNV structures, the accuracy of existing methods in detecting CNVs remains low. In this work, we design a new CNV detection approach, called shortest path-based Copy number variation (SPCNV), to improve the detection accuracy of CNVs. SPCNV calculates the k nearest neighbors of each read depth and defines the shortest path, shortest path relation, and shortest path cost sets, from which it then calculates the mean shortest path cost of each read depth and its k nearest neighbors. We utilize the ratio between the mean shortest path cost for each read depth and the mean of the mean shortest path cost of its k nearest neighbors to construct a relative shortest path score formula that is able to determine a score for each read depth. Based on the score profile, a boxplot is then applied to predict CNVs. The performance of the proposed method is verified by simulation data experiments and compared against several popular methods of the same type. Experimental results show that the proposed method achieves the best balance between recall and precision in each set of simulated samples. To further verify the performance of the proposed method in real application scenarios, we then select real sample data from the 1000 Genomes Project to conduct experiments. The proposed method achieves the best F1-scores in almost all samples. Therefore, the proposed method can be used as a more reliable tool for the routine detection of CNVs.
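
The final step of the abstract above, applying a boxplot to a score profile, can be sketched as follows. This is only an illustration under stated assumptions: it uses the conventional Tukey fences (1.5 times the interquartile range beyond the quartiles), which the abstract does not specify, and it takes the per-bin scores as given rather than reproducing the shortest-path scoring. The function name, the `whisker` parameter, and the sample scores are hypothetical.

```python
import statistics


def boxplot_outliers(scores, whisker=1.5):
    """Return indices of scores outside the boxplot fences.

    Assumption: fences are Q1 - whisker * IQR and Q3 + whisker * IQR,
    the conventional boxplot rule; SPCNV's exact thresholds may differ.
    """
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - whisker * iqr
    upper = q3 + whisker * iqr
    return {
        "low": [i for i, s in enumerate(scores) if s < lower],
        "high": [i for i, s in enumerate(scores) if s > upper],
    }


# Made-up score profile: most bins score near 1, two bins deviate strongly.
scores = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 3.0, 1.0, 0.2, 1.0]
candidates = boxplot_outliers(scores)
```

Here the bins at indices 6 and 8 fall outside the fences and would be the candidate CNV positions; how such outliers map to copy number gains or losses is not stated in the abstract and is left open in this sketch.
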
