Results 1 - 20 of 57
1.
Eur Heart J Digit Health ; 5(3): 363-370, 2024 May.
Article in English | MEDLINE | ID: mdl-38774379

ABSTRACT

Aims: Cardiovascular disease (CVD) is a leading cause of mortality, especially in developing countries. This study aimed to develop and validate a CVD risk prediction model, Personalized CARdiovascular DIsease risk Assessment for Chinese (P-CARDIAC), for recurrent cardiovascular events using machine learning techniques. Methods and results: Three cohorts of Chinese patients with established CVD were included if they had used any of the public healthcare services provided by the Hong Kong Hospital Authority (HA) since 2004, and were categorized by geographical location. The 10-year CVD outcome was a composite of diagnostic or procedure codes from the International Classification of Diseases, Ninth Revision, Clinical Modification. Multivariate imputation with chained equations and XGBoost were applied for model development. The comparison with the Thrombolysis in Myocardial Infarction Risk Score for Secondary Prevention (TRS-2°P) and Secondary Manifestations of ARTerial disease (SMART2) used the validation cohorts with 1000 bootstrap replicates. A total of 48 799, 119 672 and 140 533 patients were included in the derivation and validation cohorts, respectively. A list of 125 risk variables was used to predict CVD risk, of which 8 classes of CVD-related drugs were considered interactive covariates. Model performance in the derivation cohort showed satisfactory discrimination and calibration, with a C statistic of 0.69. Internal validation showed good discrimination and calibration performance, with C statistics over 0.6. P-CARDIAC also performed better than TRS-2°P and SMART2. Conclusion: Compared with other risk scores, P-CARDIAC enables the identification of unique patterns in Chinese patients with established CVD. We anticipate that P-CARDIAC can be applied in various settings to prevent recurrent CVD events, thus reducing the related healthcare burden.
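
As a rough illustration of how a C statistic and a bootstrap interval around it can be computed, the sketch below treats the outcome as binary and resamples patients with replacement. It is not the P-CARDIAC code: the function names, the percentile interval, and the binary simplification are all assumptions made for illustration, and a 10-year outcome may in practice call for a time-to-event concordance measure instead.

```python
import random

def c_statistic(scores, labels):
    """Share of (event, non-event) pairs in which the event has the higher
    predicted score; tied scores count as half a concordant pair."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))

def bootstrap_interval(scores, labels, replicates=1000, seed=0):
    """Percentile 95% interval of the C statistic (hypothetical helper)."""
    n = len(scores)
    if not 0 < sum(labels) < n:
        raise ValueError("need at least one event and one non-event")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    stats = []
    while len(stats) < replicates:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lost one outcome class
            stats.append(c_statistic([scores[i] for i in idx], ys))
    stats.sort()
    return stats[int(0.025 * replicates)], stats[int(0.975 * replicates) - 1]
```

Calling `bootstrap_interval(scores, labels)` with the default 1000 replicates mirrors the replicate count quoted in the abstract; everything else here is a simplification.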

2.
Bioinform Adv ; 4(1): vbae006, 2024.
Article in English | MEDLINE | ID: mdl-38282975

ABSTRACT

Summary: Third-generation long-read sequencing is an increasingly utilized technique for profiling human immunodeficiency virus (HIV) quasispecies and detecting drug resistance mutations due to its ability to cover the entire viral genome in individual reads. Recently, the ClusterV tool has demonstrated accurate detection of HIV quasispecies from Nanopore long-read sequencing data. However, the need for scripting skills and a computational environment may act as a barrier for many potential users. To address this issue, we have introduced ClusterV-Web, a user-friendly web-based application that enables easy configuration and execution of ClusterV, both remotely and locally. Our tool provides interactive tables and data visualizations to aid in the interpretation of results. This development is expected to democratize access to long-read sequencing data analysis, enabling a wider range of researchers and clinicians to efficiently profile HIV quasispecies and detect drug resistance mutations. Availability and implementation: ClusterV-Web is freely available and open source, with detailed documentation accessible at http://www.bio8.cs.hku.hk/ClusterVW/. The standalone Docker image and source code are also available at https://github.com/HKU-BAL/ClusterV-Web.

3.
Front Cell Dev Biol ; 11: 1224069, 2023.
Article in English | MEDLINE | ID: mdl-37655157

ABSTRACT

Background: An increasing number of patients are being diagnosed with lung adenocarcinoma, but there remains limited progress in enhancing prognostic outcomes and improving survival rates for these patients. Genome instability is considered a contributing factor, as it enables other hallmarks of cancer to acquire functional capabilities, thus allowing cancer cells to survive, proliferate, and disseminate. Despite the importance of genome instability in cancer development, few studies have explored the prognostic signature associated with genome instability for lung adenocarcinoma. Methods: In this study, we randomly divided 397 lung adenocarcinoma patients from The Cancer Genome Atlas database into a training group (n = 199) and a testing group (n = 198). By calculating the cumulative counts of genomic alterations for each patient in the training group, we distinguished the top 25% and bottom 25% of patients. We then compared their gene expressions to identify genome instability-related genes. Next, we used univariate and multivariate Cox regression analyses to identify the prognostic signature. We also performed the Kaplan-Meier survival analysis and the log-rank test to evaluate the performance of the identified prognostic signature. The performance of the signature was further validated in the testing group, in The Cancer Genome Atlas dataset, and in external datasets. We also conducted a time-dependent receiver operating characteristic analysis to compare our signature with established prognostic signatures to demonstrate its potential clinical value. Results: We identified GULPsig, which includes IGF2BP1, IGF2BP3, SMC1B, CLDN6, and LY6K, as a prognostic signature for lung adenocarcinoma patients from 42 genome instability-related genes. Based on the risk score of the risk model with GULPsig, we successfully stratified the patients into high- and low-risk groups according to the results of the Kaplan-Meier survival analysis and the log-rank test. We further validated the performance of GULPsig as an independent prognostic signature and observed that it outperformed established prognostic signatures. Conclusion: We provided new insights to explore the clinical application of genome instability and identified GULPsig as a potential prognostic signature for lung adenocarcinoma patients.
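
The top-25% versus bottom-25% split described in the Methods can be sketched as follows. This is only an illustration under stated assumptions: the function name, the dictionary input, and the handling of ties and rounding are hypothetical, since the abstract does not say how the authors resolved them.

```python
def split_by_alteration_count(counts, fraction=0.25):
    """counts maps a patient ID to that patient's cumulative number of
    genomic alterations. Returns (most_altered, least_altered) ID lists,
    each holding `fraction` of the patients, rounded down."""
    # Sort by count, breaking ties by patient ID so the result is stable.
    ranked = sorted(counts, key=lambda pid: (counts[pid], pid))
    k = int(len(ranked) * fraction)
    least_altered = ranked[:k]
    most_altered = ranked[len(ranked) - k:]
    return most_altered, least_altered

# Toy data: eight hypothetical patients with 1..8 alterations.
toy_counts = {f"p{i}": i for i in range(1, 9)}
high_group, low_group = split_by_alteration_count(toy_counts)
```

With a training group of 199 patients, this rounding would put 49 patients in each group; the gene-expression comparison between the two groups is a separate step not shown here.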

4.
Stem Cell Res Ther ; 14(1): 247, 2023 09 13.
Article in English | MEDLINE | ID: mdl-37705079

ABSTRACT

AIMS: Dissecting complex interactions among transcription factors (TFs), microRNAs (miRNAs) and long noncoding RNAs (lncRNAs) is central to understanding heart development and function. Although computational approaches and platforms have been described to infer relationships among regulatory factors and genes, current approaches do not adequately account for how highly diverse, interacting regulators that include noncoding RNAs (ncRNAs) control cardiac gene expression dynamics over time. METHODS: To overcome this limitation, we devised an integrated framework, cardiac gene regulatory modeling (CGRM), that integrates the LogicTRN and regulatory component analysis bioinformatics modeling platforms to infer complex regulatory mechanisms. We then used CGRM to identify and compare the TF-ncRNA gene regulatory networks that govern early- and late-stage cardiomyocytes (CMs) generated by in vitro differentiation of human pluripotent stem cells (hPSC) and ventricular and atrial CMs isolated during in vivo human cardiac development. RESULTS: Comparisons of in vitro versus in vivo derived CMs revealed conserved regulatory networks among TFs and ncRNAs in early cells that significantly diverged in late-stage cells. We report that cardiac genes ("heart targets") expressed in early-stage hPSC-CMs are primarily regulated by MESP1, miR-1, miR-23, and the lncRNAs NEAT1 and MALAT1, while GATA6, HAND2, miR-200c, NEAT1 and MALAT1 are critical for late hPSC-CMs. The inferred TF-miRNA-lncRNA networks regulating heart development and contraction were similar among early-stage CMs, among individual hPSC-CM datasets and between in vitro and in vivo samples. However, genes related to apoptosis, cell cycle and proliferation, and transmembrane transport showed a high degree of divergence between in vitro and in vivo derived late-stage CMs. Overall, late-stage, but not early-stage, CMs diverged greatly in the expression of "heart target" transcripts and their regulatory mechanisms.
CONCLUSIONS: We find that hPSC-CMs are regulated in a cell-autonomous manner during early development, and that this regulation diverges significantly over time when compared to in vivo derived CMs. These findings demonstrate the feasibility of using CGRM to reveal dynamic and complex transcriptional and posttranscriptional regulatory interactions that underlie cell-directed versus environment-dependent CM development. These results with in vitro versus in vivo derived CMs thus establish this approach for detailed analyses of heart disease and for the analysis of cell regulatory systems in other biomedical fields.


Subjects
MicroRNAs, Long Noncoding RNA, Humans, Long Noncoding RNA/genetics, Transcription Factors/genetics, MicroRNAs/genetics, Cardiac Myocytes, Heart Ventricles
5.
BMC Bioinformatics ; 24(1): 308, 2023 Aug 03.
Article in English | MEDLINE | ID: mdl-37537536

ABSTRACT

BACKGROUND: With the continuous advances in third-generation sequencing technology and the increasing affordability of next-generation sequencing technology, sequencing data from different sequencing technology platforms is becoming more common. While numerous benchmarking studies have been conducted to compare variant-calling performance across different platforms and approaches, little attention has been paid to the potential of leveraging the strengths of different platforms to optimize overall performance, especially integrating Oxford Nanopore and Illumina sequencing data. RESULTS: We investigated the impact of multi-platform data on the performance of variant calling through carefully designed experiments with a deep learning-based variant caller named Clair3-MP (Multi-Platform). Through our research, we not only demonstrated the capability of ONT-Illumina data for improved variant calling, but also identified the optimal scenarios for utilizing ONT-Illumina data. In addition, we revealed that the improvement in variant calling using ONT-Illumina data comes from an improvement in difficult genomic regions, such as the large low-complexity regions and segmental and collapse duplication regions. Moreover, Clair3-MP can incorporate reference genome stratification information to achieve a small but measurable improvement in variant calling. Clair3-MP is accessible as an open-source project at: https://github.com/HKU-BAL/Clair3-MP . CONCLUSIONS: These insights have important implications for researchers and practitioners alike, providing valuable guidance for improving the reliability and efficiency of genomic analysis in diverse applications.


Subjects
Genome, Genomics, Reproducibility of Results, High-Throughput Nucleotide Sequencing
6.
Sci Rep ; 13(1): 5237, 2023 03 31.
Article in English | MEDLINE | ID: mdl-37002338

ABSTRACT

Sensitive detection of Mycobacterium tuberculosis (TB) present at small percentages in metagenomic samples is essential for microbial classification and drug resistance prediction. However, traditional methods, such as bacterial culture and microscopy, are time-consuming and sometimes have limited TB detection sensitivity. Oxford Nanopore Technologies (ONT) MinION sequencing allows rapid and simple sample preparation for sequencing. Its recently developed adaptive sequencing selects reads from targets while allowing real-time base-calling to achieve sequence enrichment or depletion during sequencing. Another common enrichment method is PCR amplification of the target TB genes. In this study, we compared both methods using ONT MinION sequencing for TB detection and variant calling in metagenomic samples, using simulation runs as well as runs with synthetic and patient samples. We found that both methods effectively enrich TB reads from a high percentage of human (95%) and other microbial DNA. Adaptive sequencing with readfish and UNCALLED achieved a 3.9-fold and 2.2-fold enrichment, respectively, compared to the control run. We provide a simple automatic analysis framework to support the detection of TB for clinical use, openly available at https://github.com/HKU-BAL/ONT-TB-NF . Depending on the patient's medical condition and sample type, we recommend users evaluate and optimize their workflow for different clinical specimens to improve the detection limit.


Subjects
Mycobacterium tuberculosis, Nanopores, Humans, Mycobacterium tuberculosis/genetics, High-Throughput Nucleotide Sequencing/methods, Metagenomics/methods, Metagenome, Computer Simulation, DNA Sequence Analysis
7.
Genome Med ; 15(1): 10, 2023 02 14.
Article in English | MEDLINE | ID: mdl-36788602

ABSTRACT

BACKGROUND: Very low-coverage (0.1 to 1×) whole genome sequencing (WGS) has become a promising and affordable approach to discover genomic variants of human populations for genome-wide association study (GWAS). To support genetic screening using preimplantation genetic testing (PGT) in a large population, the sequencing coverage goes below 0.1× to an ultra-low level. However, the feasibility and effectiveness of ultra-low-coverage WGS (ulcWGS) for GWAS remain undetermined. METHODS: We built a pipeline to carry out analysis of ulcWGS data for GWAS. To examine its effectiveness, we benchmarked the accuracy of genotype imputation at combinations of different coverages below 0.1× and sample sizes from 2000 to 16,000, using 17,844 embryo PGT samples with approximately 0.04× average coverage and the standard Chinese sample HG005 with known genotypes. We then applied the imputed genotypes of 1744 transferred embryos with gestational ages and complete follow-up records to GWAS. RESULTS: The accuracy of genotype imputation under ultra-low coverage can be improved by increasing the sample size and applying a set of filters. From 1744 born embryos, we identified 11 genomic risk loci associated with gestational ages and 166 genes mapped to these loci according to positional, expression quantitative trait locus, and chromatin interaction strategies. Among these mapped genes, CRHBP, ICAM1, and OXTR were more frequently reported as preterm birth related. By joint analysis of gene expression data from previous studies, we constructed interrelationships of mainly CRHBP, ICAM1, PLAGL1, DNMT1, CNTLN, DKK1, and EGR2 with preterm birth, infant disease, and breast cancer. CONCLUSIONS: This study not only demonstrates that ulcWGS can achieve relatively high genotype imputation accuracy and is capable of supporting GWAS, but also provides insights into the associations between gestational age and genetic variations of the fetal embryos from the Chinese population.


Subjects
Genome-Wide Association Study, Premature Birth, Newborn Infant, Female, Humans, Gestational Age, Single Nucleotide Polymorphism, Genetic Testing, Genotype, Quantitative Trait Loci
8.
Article in English | MEDLINE | ID: mdl-35120007

ABSTRACT

In this paper, we explore using the data-centric approach to tackle the Multiple Sequence Alignment (MSA) construction problem. Unlike the algorithm-centric approach, which reduces the construction problem to a combinatorial optimization problem based on an abstract mathematical model, the data-centric approach explores using classification models trained from existing benchmark data to guide the construction. We identified two simple classifications to help us choose a better alignment tool and determine whether and how much to carry out realignment. We show that shallow machine-learning algorithms suffice to train sensitive models for these classifications. Based on these models, we implemented a new multiple sequence alignment pipeline, called MLProbs. Compared with 10 other popular alignment tools over four benchmark databases (namely, BAliBASE, OXBench, OXBench-X and SABMark), MLProbs consistently gives the highest TC score. More importantly, MLProbs shows non-trivial improvement for protein families with low similarity; in particular, when evaluated against the 1,356 protein families with similarity ≤ 50%, MLProbs achieves a TC score of 56.93, while the next best three tools are in the range of [55.41, 55.91] (increased by more than 1.8%). We also compared the performance of MLProbs and other MSA tools in two real-life applications - Phylogenetic Tree Construction Analysis and Protein Secondary Structure Prediction - and MLProbs also had the best performance. In our study, we used only shallow machine-learning algorithms to train our models. It would be interesting to study whether deep-learning methods can help make further improvements, so we suggest some possible research directions in the conclusion section.
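
The "more than 1.8%" figure can be checked with a few lines of arithmetic. The sketch below assumes the improvement is a relative gain over the best competing TC score (the upper end of the quoted range); the function name is made up for illustration.

```python
def relative_gain_percent(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100

mlprobs_tc = 56.93
best_competitor_tc = 55.91   # upper end of the reported [55.41, 55.91] range
worst_of_top3_tc = 55.41     # lower end of the same range

gain_vs_best = relative_gain_percent(mlprobs_tc, best_competitor_tc)
gain_vs_third = relative_gain_percent(mlprobs_tc, worst_of_top3_tc)
print(f"{gain_vs_best:.2f}% to {gain_vs_third:.2f}%")
```

Under that reading, the gain over the three next-best tools ranges from about 1.8% to about 2.7%, which agrees with the abstract's wording.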


Subjects
Algorithms, Computational Biology, Sequence Alignment, Phylogeny, Computational Biology/methods, Proteins/genetics, Software
9.
BMC Bioinformatics ; 23(1): 465, 2022 Nov 07.
Article in English | MEDLINE | ID: mdl-36344913

ABSTRACT

BACKGROUND: Whole genome sequencing using the long-read Oxford Nanopore Technologies (ONT) MinION sequencer provides a cost-effective option for structural variant (SV) detection in clinical applications. Despite the advantage of using long reads, however, accurate SV calling and phasing are still challenging. RESULTS: We introduce Duet, an SV detection tool optimized for SV calling and phasing using ONT data. The tool uses novel features integrated from both SV signatures and single-nucleotide polymorphism signatures, which can accurately distinguish SV haplotype from a false signal. Duet was benchmarked against state-of-the-art tools on multiple ONT sequencing datasets of sequencing coverage ranging from 8× to 40×. At low sequencing coverage of 8×, Duet performs better than all other tools in SV calling, SV genotyping and SV phasing. When the sequencing coverage is higher (20× to 40×), the F1-score for SV phasing is further improved in comparison to the performance of other tools, while its performance of SV genotyping and SV calling remains higher than other tools. CONCLUSION: Duet can perform accurate SV calling, SV genotyping and SV phasing using low-coverage ONT data, making it very useful for low-coverage genomes. It has great performance when scaled to high-coverage genomes, which is adaptable to various clinical applications. Duet is open source and is available at https://github.com/yekaizhou/duet .


Subjects
Nanopore Sequencing, Single Nucleotide Polymorphism, DNA Sequence Analysis, High-Throughput Nucleotide Sequencing, Whole Genome Sequencing
10.
DNA Res ; 29(6)2022 Dec 01.
Article in English | MEDLINE | ID: mdl-36308393

ABSTRACT

DNA sequences that are absent in the human reference genome are classified as novel sequences. The discovery of these missed sequences is crucial for exploring the genomic diversity of populations and understanding the genetic basis of human diseases. However, the varying read lengths generated by different sequencing technologies can significantly affect the results of novel sequence discovery. In this work, we designed an assembly-free novel sequence (AF-NS) approach to identify novel sequences from Oxford Nanopore Technology long reads. Among the sequences newly detected using AF-NS, more than 95% were missed by long-read assemblers and 85% were not present in Illumina short reads. We identified the common novel sequences among all the samples and revealed their association with the binding motifs of transcription factors. Regarding the placements of the novel sequences, we found about 70% enriched in repeat regions and generated 430 for one specific subpopulation that might be related to their evolution. Our study demonstrates the advantage of the assembly-free approach in capturing more novel sequences than assembler-based methods. Combining long-read data with powerful analytical methods can be a robust way to improve the completeness of novel sequences.


Subjects
High-Throughput Nucleotide Sequencing, Nanopores, Humans, DNA Sequence Analysis/methods, Base Sequence, High-Throughput Nucleotide Sequencing/methods, Genomics
11.
Brief Bioinform ; 23(5)2022 09 20.
Article in English | MEDLINE | ID: mdl-35849103

ABSTRACT

Accurate identification of genetic variants from family child-mother-father trio sequencing data is important in genomics. However, state-of-the-art approaches treat variant calling from trios as three independent tasks, which limits their calling accuracy for Nanopore long-read sequencing data. For better trio variant calling, we introduce Clair3-Trio, the first variant caller tailored for family trio data from Nanopore long-reads. Clair3-Trio employs a Trio-to-Trio deep neural network model, which allows it to input the trio sequencing information and output all of the trio's predicted variants within a single model to improve variant calling. We also present MCVLoss, a novel loss function tailor-made for variant calling in trios, leveraging the explicit encoding of the Mendelian inheritance. Clair3-Trio showed comprehensive improvement in experiments. It predicted far fewer Mendelian inheritance violation variations than current state-of-the-art methods. We also demonstrated that our Trio-to-Trio model is more accurate than competing architectures. Clair3-Trio is accessible as a free, open-source project at https://github.com/HKU-BAL/Clair3-Trio.


Subjects
Nanopores, Genomics/methods, High-Throughput Nucleotide Sequencing/methods, Humans, Computer Neural Networks, DNA Sequence Analysis, Software
12.
BMC Med Genomics ; 15(1): 43, 2022 03 04.
Article in English | MEDLINE | ID: mdl-35246132

ABSTRACT

BACKGROUND: The application of long-read sequencing using the Oxford Nanopore Technologies (ONT) MinION sequencer is becoming more diverse in the medical field. However, the high sequencing error rate of ONT and the limited throughput of a single MinION flowcell limit its applicability for accurate variant detection. Medical exome sequencing (MES) targets clinically significant exon regions, allowing rapid and comprehensive screening of pathogenic variants. By applying MES with MinION sequencing, the technology can achieve a more uniform capture of the target regions, shorter turnaround time, and lower sequencing cost per sample. METHOD: We introduced a cost-effective optimized workflow, ECNano, comprising a wet-lab protocol and bioinformatics analysis, for accurate variant detection at 4800 clinically important genes and regions using a single MinION flowcell. The ECNano wet-lab protocol was optimized to perform long-read target enrichment and ONT library preparation to stably generate high-quality MES data with adequate coverage. The subsequent variant-calling workflow, Clair-ensemble, adopted a fast RNN-based variant caller, Clair, and was optimized for target enrichment data. To evaluate its performance and practicality, ECNano was tested on both reference DNA samples and patient samples. RESULTS: ECNano achieved deep on-target depth of coverage (DoC) at average > 100× and > 98% uniformity using one MinION flowcell. For accurate ONT variant calling, the generated reads sufficiently covered 98.9% of pathogenic positions listed in ClinVar, with 98.96% having at least 30× DoC. ECNano obtained an average read length of 1000 bp. The long reads of ECNano also covered the adjacent splice sites well, with 98.5% of positions having ≥ 30× DoC. Clair-ensemble achieved > 99% recall and accuracy for SNV calling. The whole workflow from wet-lab protocol to variant detection was completed within three days.
CONCLUSION: We presented ECNano, an out-of-the-box workflow comprising (1) a wet-lab protocol for ONT target enrichment sequencing and (2) a downstream variant detection workflow, Clair-ensemble. The workflow is cost-effective, with a short turnaround time for high accuracy variant calling in 4800 clinically significant genes and regions using a single MinION flowcell. The long-read exon captured data has potential for further development, promoting the application of long-read sequencing in personalized disease treatment and risk prediction.


Subjects
High-Throughput Nucleotide Sequencing, Nanopores, Cost-Benefit Analysis, High-Throughput Nucleotide Sequencing/methods, Humans, DNA Sequence Analysis/methods, Workflow
13.
Sci Rep ; 12(1): 4519, 2022 03 16.
Article in English | MEDLINE | ID: mdl-35296758

ABSTRACT

Structural variation (SV) is a major cause of genetic disorders. In this paper, we show that low-depth (specifically, 4×) whole-genome sequencing using a single Oxford Nanopore MinION flow cell suffices to support sensitive detection of SV, particularly pathogenic SV for supporting clinical diagnosis. When using 4× ONT WGS data, existing SV calling software often fails to detect pathogenic SV, especially in the form of long deletion, terminal deletion, duplication, and unbalanced translocation. Our new SV calling software SENSV can achieve high sensitivity for all types of SV and a breakpoint precision typically ± 100 bp; both features are important for clinical concerns. The improvement achieved by SENSV stems from several new algorithms. We evaluated SENSV and other software using both real and simulated data. The former was based on 24 patient samples, each diagnosed with a genetic disorder. SENSV found the pathogenic SV in 22 out of 24 cases (all heterozygous, size from hundreds of kbp to a few Mbp), reporting breakpoints within 100 bp of the true answers. On the other hand, no existing software can detect the pathogenic SV in more than 10 out of 24 cases, even when the breakpoint requirement is relaxed to ± 2000 bp.


Subjects
Nanopores, Algorithms, High-Throughput Nucleotide Sequencing, Humans, DNA Sequence Analysis, Software, Genetic Translocation, Whole Genome Sequencing
14.
NAR Genom Bioinform ; 4(1): lqac005, 2022 Mar.
Article in English | MEDLINE | ID: mdl-35156024

ABSTRACT

HKG is the first fully accessible variant database for Hong Kong Cantonese, constructed from 205 novel whole-exome sequencing datasets. There has long been a research gap in the understanding of the genetic architecture of southern Chinese subgroups, including Hong Kong Cantonese. HKG detected 196 325 high-quality variants, 5.93% of which are novel, and 25 472 variants were found to be unique to HKG compared to three Chinese populations sampled from 1000 Genomes (CHN). PCA illustrates the uniqueness of HKG within CHN, and the admixture study estimated the ancestral composition of HKG and CHN, with a gradient change from north to south, consistent with their geographical distribution. ClinVar, CIViC and PharmGKB annotated 599 clinically significant variants and 360 putative loss-of-function variants, substantiating our understanding of population characteristics for future medical development. Among the novel variants, 96.57% were singletons and 6.85% were of high impact. With a good representation of Hong Kong Cantonese, we demonstrated better variant imputation using a reference panel with the addition of HKG data, thus successfully filling the data gap in southern Chinese to facilitate the regional and global development of population genetics.

15.
Nat Comput Sci ; 2(12): 797-803, 2022 Dec.
Article in English | MEDLINE | ID: mdl-38177392

ABSTRACT

Deep learning-based variant callers are becoming the standard and have achieved superior single nucleotide polymorphisms calling performance using long reads. Here we present Clair3, which leverages two major method categories: pileup calling handles most variant candidates with speed, and full-alignment tackles complicated candidates to maximize precision and recall. Clair3 runs faster than any of the other state-of-the-art variant callers and demonstrates improved performance, especially at lower coverage.


Subjects
Deep Learning, High-Throughput Nucleotide Sequencing/methods, Single Nucleotide Polymorphism/genetics
16.
Commun Biol ; 4(1): 1016, 2021 08 30.
Article in English | MEDLINE | ID: mdl-34462542

ABSTRACT

Pan-genome sequence analysis of human population ancestry is critical for expanding and better defining human genome sequence diversity. However, the amount of genetic variation missing from current human reference sequences is still unknown. Here, we used 486 deep-sequenced Han Chinese genomes to identify 276 Mbp of DNA sequences that, to our knowledge, are absent in the current human reference. We classified these sequences into individual-specific and common sequences, and propose that the common sequence size is uncapped with a growing population. The 46.646 Mbp of common sequences obtained from the 486 individuals improved the accuracy of variant calling and the mapping rate when added to the reference genome. We also analyzed the genomic positions of these common sequences and found that they came from genomic regions characterized by high mutation rate and low pathogenicity. Our study authenticates the Chinese pan-genome as representative of DNA sequences specific to the Han Chinese population missing from the GRCh38 reference genome and establishes the newly defined common sequences as candidates to supplement the current human reference.


Subjects
Computational Biology, Human Genome, China, Humans
17.
NAR Genom Bioinform ; 3(3): lqab062, 2021 Sep.
Article in English | MEDLINE | ID: mdl-34235433

ABSTRACT

Relation extraction (RE) is a fundamental task for extracting gene-disease associations from biomedical text. Many state-of-the-art tools have limited capacity, as they can extract gene-disease associations only from single sentences or abstract texts. A few studies have explored extracting gene-disease associations from full-text articles, but there is considerable room for improvement. In this work, we propose RENET2, a deep learning-based RE method, which implements Section Filtering and ambiguous relations modeling to extract gene-disease associations from full-text articles. We designed a novel iterative training data expansion strategy to build an annotated full-text dataset to resolve the scarcity of labels on full-text articles. In our experiments, RENET2 achieved an F1-score of 72.13% for extracting gene-disease associations from an annotated full-text dataset, which was 27.22, 30.30, 29.24 and 23.87% higher than BeFree, DTMiner, BioBERT and RENET, respectively. We applied RENET2 to (i) ∼1.89M full-text articles from PubMed Central and found ∼3.72M gene-disease associations; and (ii) the LitCovid articles and ranked the top 15 proteins associated with COVID-19, supported by recent articles. RENET2 is an efficient and accurate method for full-text gene-disease association extraction. The source code, manually curated abstract/full-text training data, and results of RENET2 are available at GitHub.

18.
BMC Genomics ; 21(Suppl 6): 500, 2020 Dec 21.
Article in English | MEDLINE | ID: mdl-33349238

ABSTRACT

BACKGROUND: Next-generation sequencing (NGS) enables unbiased detection of pathogens by mapping the sequencing reads of a patient sample to the known reference sequences of bacteria and viruses. However, for a new pathogen without a reference sequence of a close relative, or with a high load of mutations compared to its predecessors, read mapping fails due to a low similarity between the pathogen and reference sequence, which in turn leads to insensitive and inaccurate pathogen detection outcomes. RESULTS: We developed MegaPath, which runs fast and provides high sensitivity in detecting new pathogens. In MegaPath, we have implemented and tested a combination of polishing techniques to remove non-informative human reads and spurious alignments. MegaPath applies a global optimization to the read alignments and reassigns the reads incorrectly aligned to multiple species to a unique species. The reassignment not only significantly increased the number of reads aligned to distant pathogens, but also significantly reduced incorrect alignments. MegaPath implements an enhanced maximum-exact-match prefix seeding strategy and a SIMD-accelerated Smith-Waterman algorithm to run fast. CONCLUSIONS: In our benchmarks, MegaPath demonstrated superior sensitivity by detecting eight times more reads from a low-similarity pathogen than other tools. Meanwhile, MegaPath ran much faster than the other state-of-the-art alignment-based pathogen detection tools (and was comparable with the less sensitive profile-based pathogen detection tools). The running time of MegaPath is about 20 min on a typical 1 Gb dataset.
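
For readers unfamiliar with the Smith-Waterman algorithm named above, here is a plain scalar sketch of its scoring recurrence. It is not MegaPath's implementation: MegaPath uses a SIMD-accelerated version, and the scoring values (match +2, mismatch -1, linear gap -1) and the function name are assumptions chosen only to keep the example small.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between strings a and b, using a linear
    gap penalty and keeping only two rows of the scoring matrix."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the diagonal with a match or a mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

score = smith_waterman_score("ACACACTA", "AGCACACA")
```

SIMD versions compute many cells of this matrix in parallel, but the recurrence they evaluate is the same.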


Subjects
Metagenomics, Software, Algorithms, High-Throughput Nucleotide Sequencing, Humans, Metagenome, Sequence Alignment, DNA Sequence Analysis
19.
Front Genet ; 11: 1008, 2020.
Article in English | MEDLINE | ID: mdl-33088282

ABSTRACT

The pathogenesis of diabetic nephropathy (DN) is accompanied by alterations in biological function and signaling pathways regulated through complex molecular mechanisms. A number of regulatory factors, including transcription factors (TFs) and non-coding RNAs (ncRNAs, including lncRNAs and miRNAs), have been implicated in DN; however, it is unclear how the interactions among these regulatory factors contribute to the development of DN pathogenesis. In this study, we developed a network-based analysis to decipher interplays between TFs and ncRNAs regulating progression of DN by combining omics data with regulatory factor-target information. To accomplish this, we identified differential expression programs of mRNAs and miRNAs during early DN (EDN) and established DN. We then uncovered putative interactive connections among miRNA-mRNA, lncRNA-miRNA, and lncRNA-mRNA implicated in transcriptional control. This led to the identification of two lncRNAs (MALAT1 and NEAT1) and the three TFs (NF-κB, NFE2L2, and PPARG) that likely cooperate with a set of miRNAs to modulate EDN and DN target genes. The results highlight how crosstalk among TFs, lncRNAs, and miRNAs regulate the expression of genes both transcriptionally and post-transcriptionally, and our findings provide new insights into the molecular basis and pathogenesis of progressive DN.

20.
BMC Res Notes ; 13(1): 444, 2020 Sep 18.
Article in English | MEDLINE | ID: mdl-32948225

ABSTRACT

OBJECTIVE: We designed and tested a Nanopore sequencing panel for direct tuberculosis drug resistance profiling. The panel targeted 10 resistance-associated loci. We assessed the feasibility of amplifying and sequencing these loci from 23 clinical specimens with low bacillary burden. RESULTS: At least 8 loci were successfully amplified from the majority for predicting first- and second-line drug resistance (14/23, 60.87%), and the 12 specimens yielding all 10 targets were sequenced with Nanopore MinION and Illumina MiSeq. MinION sequencing data was corrected by Nanopolish and recurrent variants were filtered. A total of 67,082 bases across all consensus sequences were analyzed, with 67,019 bases called by both MinION and MiSeq as wildtype. For the 41 single nucleotide variants (SNVs) called by MiSeq with 100% variant allelic frequency (VAF), 39 (95.1%) were called by MinION. For the 22 mixed bases called by MiSeq, a SNV with the highest VAF (70%) was called by MinION. With short assay time, reasonable reagent cost as well as continuously improving sequencing chemistry and signal correction pipelines, this Nanopore method can be a viable option for direct tuberculosis drug resistance profiling in the near future.
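
The percentages quoted in this abstract follow directly from the reported counts, as the short check below shows. The helper name is made up for illustration; the numbers are taken from the abstract itself.

```python
def percent(part, whole, digits=2):
    """Share of `part` in `whole`, as a percentage rounded to `digits`."""
    return round(100 * part / whole, digits)

amplified_share = percent(14, 23)          # specimens with at least 8 loci
snv_concordance = percent(39, 41, 1)       # MiSeq SNVs also called by MinION
wildtype_share = percent(67019, 67082)     # bases called wildtype by both

# Bases not called wildtype by both platforms.
non_wildtype_bases = 67082 - 67019
```

The 63 bases that were not wildtype on both platforms equal the 41 SNVs plus the 22 mixed bases, so the reported counts are internally consistent.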


Subjects
Mycobacterium tuberculosis, Nanopores, Tuberculosis, Drug Resistance, High-Throughput Nucleotide Sequencing, Humans, Mycobacterium tuberculosis/genetics, Tuberculosis/drug therapy