Results 1 - 20 of 379
1.
Comput Math Methods Med ; 2022: 7191684, 2022.
Article in English | MEDLINE | ID: mdl-35242211

ABSTRACT

Protein-protein interactions (PPIs) play a crucial role in understanding disease pathogenesis and genetic mechanisms, in guiding drug design, and in other biochemical processes; the identification of PPIs is therefore of great importance. With the rapid development of high-throughput sequencing technology, a large amount of PPI sequence data has accumulated. Researchers have designed many experimental methods to detect PPIs using these sequence data, and the prediction of PPIs has become a research hotspot in proteomics. However, because traditional experimental methods are both time-consuming and costly, it is difficult to analyze and predict the massive amount of PPI data quickly and accurately. To address these issues, many computational systems employing machine learning have been applied to PPI prediction, thereby improving the overall recognition rate. In this paper, a novel and efficient computational technology is presented to implement a protein interaction prediction system using only protein sequence information. First, the Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST) was employed to generate a position-specific scoring matrix (PSSM) containing protein evolutionary information from the initial protein sequence. Second, we used a novel feature representation scheme, MatFLDA, to extract the essential information of the PSSM for protein sequences and obtained five training and five testing datasets by adopting a five-fold cross-validation method. Finally, the random ferns (RFs) classifier was employed to infer the interactions among proteins, and a model called MatFLDA_RFs was developed. The proposed MatFLDA_RFs model achieved good prediction performance, with 95.03% average accuracy on the Yeast dataset and 85.35% average accuracy on the H. pylori dataset, outperforming other existing computational methods. 
The experimental results indicate that the proposed method is capable of yielding better prediction results of PPIs, which provides an effective tool for the detection of new PPIs and the in-depth study of proteomics. Finally, we also developed a web server for the proposed model to predict protein-protein interactions, which is freely accessible online at http://120.77.11.78:5001/webserver/MatFLDA_RFs.
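
The five-fold cross-validation step described in the abstract above can be illustrated with a minimal Python sketch. It only shows how sample indices might be partitioned into five training/testing pairs. The function name `five_fold_splits`, the sample count, and the seed are assumptions made for illustration and are not taken from the MatFLDA_RFs implementation.

```python
import random


def five_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(j for f in folds[:i] + folds[i + 1:] for j in f)
        yield train_idx, test_idx


# Hypothetical example: 23 protein pairs split into five train/test pairs.
splits = list(five_fold_splits(23))
```

Each sample appears in exactly one testing set, so averaging accuracy over the five folds (as the abstract reports) uses every sample once for evaluation.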


Subject(s)
Protein Interaction Mapping/methods , Protein Interaction Maps/genetics , Amino Acid Sequence , Bacterial Proteins/genetics , Computational Biology , Databases, Protein/statistics & numerical data , Discriminant Analysis , Evolution, Molecular , Helicobacter pylori/genetics , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Machine Learning , Position-Specific Scoring Matrices , Protein Interaction Mapping/statistics & numerical data , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae Proteins/genetics , Sequence Alignment/methods , Sequence Alignment/statistics & numerical data , Support Vector Machine
2.
Pathol Int ; 72(3): 187-192, 2022 Mar.
Article in English | MEDLINE | ID: mdl-35102630

ABSTRACT

NTRK fusions represent a new biomarker-defined population that can be treated with TRK inhibitors. Although rare, NTRK fusions are detected across a wide range of solid tumors. Previous reports suggest that NTRK fusions are limited to the secretory subtype of breast cancer. Here we examined NTRK fusions in a large real-world next-generation sequencing (NGS) dataset and confirmed secretory versus non-secretory status using H&E images. Of 23 NTRK fusion-positive cases, 11 were classified as secretory, 11 as non-secretory, and one as mixed status. The secretory subtype trended younger, was predominantly estrogen receptor (ER)-negative, had lower tumor mutational burden, and exhibited lower levels of genomic loss of heterozygosity. The non-secretory subtype was enriched for TP53 mutations. The secretory subtype was enriched for ETV6-NTRK3 fusions in 7 of 11 cases, and the non-secretory subtype had NTRK1 fusions in 7 of 11 cases, each with a different fusion partner. Our data suggest that NTRK fusions are present in both secretory and non-secretory subtypes, and that comprehensive genomic profiling should be considered across all clinically advanced breast cancers to identify patients who could benefit from TRK inhibitors.


Subject(s)
Breast Neoplasms/genetics , Carcinoma/diagnosis , Receptor, trkA/genetics , Aged , Breast Neoplasms/diagnosis , Carcinoma/genetics , Female , Gene Fusion/drug effects , Gene Fusion/genetics , High-Throughput Nucleotide Sequencing/methods , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Immunohistochemistry/methods , Immunohistochemistry/statistics & numerical data , Middle Aged , Receptor, trkA/adverse effects , Receptor, trkC/genetics
3.
J Comput Biol ; 29(2): 169-187, 2022 02.
Article in English | MEDLINE | ID: mdl-35041495

ABSTRACT

Recently, Gagie et al. proposed a version of the FM-index, called the r-index, that can store thousands of human genomes on a commodity computer. Then Kuhnle et al. showed how to build the r-index efficiently via a technique called prefix-free parsing (PFP) and demonstrated its effectiveness for exact pattern matching. Exact pattern matching can be leveraged to support approximate pattern matching, but the r-index itself cannot efficiently support popular and important queries such as finding maximal exact matches (MEMs). To address this shortcoming, Bannai et al. introduced the concept of thresholds and showed that storing them together with the r-index enables efficient MEM finding, but they did not say how to find those thresholds. We present a novel algorithm that applies PFP to build the r-index and find the thresholds simultaneously, in linear time and space with respect to the size of the prefix-free parse. Our implementation, called MONI, can rapidly find MEMs between reads and large collections of highly repetitive sequences. Compared with other read aligners (PuffAligner, Bowtie2, BWA-MEM, and CHIC), MONI used 2-11 times less memory and was 2-32 times faster for index construction. Moreover, MONI was less than one thousandth the size of competing indexes for large collections of human chromosomes. Thus, MONI represents a major advance in our ability to perform MEM finding against very large collections of related references.
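
To make the notion of a maximal exact match concrete, the following is a naive quadratic-time Python sketch. It is not the r-index/threshold algorithm used by MONI; it only illustrates what a MEM is: a match between a read and a reference that cannot be extended to the left or to the right. The function name and the toy sequences are assumptions made for illustration.

```python
def maximal_exact_matches(pattern, text, min_len=1):
    """Return (pattern_start, text_start, length) for every MEM of at least min_len."""
    mems = []
    for i in range(len(pattern)):
        for j in range(len(text)):
            if pattern[i] != text[j]:
                continue
            # Skip positions where the match could be extended to the left.
            if i > 0 and j > 0 and pattern[i - 1] == text[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            length = 0
            while (i + length < len(pattern) and j + length < len(text)
                   and pattern[i + length] == text[j + length]):
                length += 1
            if length >= min_len:
                mems.append((i, j, length))
    return mems


# Toy example: the whole read "ACGT" occurs at offset 2 of the reference.
mems = maximal_exact_matches("ACGT", "TTACGTAA", min_len=3)
```

The nested loops compare every read position with every reference position, which is impractical for large collections of genomes; that cost is the motivation for index-based approaches such as the one the abstract describes.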


Subject(s)
Algorithms , Genomics/statistics & numerical data , Sequence Alignment/statistics & numerical data , Software , Computational Biology , Databases, Genetic/statistics & numerical data , Genome, Bacterial , Genome, Human , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Salmonella/genetics , Sequence Analysis, DNA/statistics & numerical data , Wavelet Analysis
4.
J Comput Biol ; 29(2): 106-120, 2022 02.
Article in English | MEDLINE | ID: mdl-35020412

ABSTRACT

High-throughput chromosome conformation capture (Hi-C) has recently been applied to natural microbial communities and has shown great potential for studying multiple genomes simultaneously. Several extraneous factors may influence chromosomal contacts, rendering the normalization of Hi-C contact maps essential for downstream analyses. However, the current paucity of metagenomic Hi-C normalization methods and the neglect of spurious interspecies contacts weaken the interpretability of the data. Here, we report on two types of biases in metagenomic Hi-C experiments, explicit biases and implicit biases, and introduce HiCzin, a parametric model to correct both types of biases and remove spurious interspecies contacts. We demonstrate that metagenomic Hi-C contact maps normalized by HiCzin show lower biases, a higher capability to detect spurious contacts, and better performance in metagenomic contig clustering.


Subject(s)
Metagenomics/statistics & numerical data , Algorithms , Bias , Chromosomes/genetics , Computational Biology , High-Throughput Nucleotide Sequencing/statistics & numerical data , Linear Models , Logistic Models , Metagenome , Microbiota/genetics , Regression Analysis , Software , Yeasts/genetics
5.
Nurs Res ; 71(1): 43-53, 2022.
Article in English | MEDLINE | ID: mdl-34985847

ABSTRACT

BACKGROUND: Nurse researchers are well poised to study the connection of the microbiome to health and disease. Evaluating published microbiome results can assist with study design and hypothesis generation. OBJECTIVES: This article aims to present and define important analysis considerations in microbiome study planning and to identify genera shared across studies despite methodological differences. This methods article will highlight a workflow that the nurse scientist can use to combine and evaluate taxonomy tables for microbiome study or research proposal planning. METHODS: We compiled taxonomy tables from 13 published gut microbiome studies that had used Ion Torrent sequencing technology. We searched for studies that had amplified multiple hypervariable (V) regions of the 16S rRNA gene when sequencing the bacteria from healthy gut samples. RESULTS: We obtained 15 taxonomy tables from the 13 studies, comprised of samples from four continents and eight V regions. Methodology among studies was highly variable, including differences in V regions amplified, geographic location, and population demographics. Nevertheless, of the 354 total genera identified from the 15 data sets, 25 were shared in all V regions and the four continents. When relative abundance differences across the V regions were compared, Dorea and Roseburia were statistically different. Taxonomy tables from Asian subjects had increased average abundances of Prevotella and lowered abundances of Bacteroides compared with the European, North American, and South American study subjects. DISCUSSION: Evaluating taxonomy tables from previously published literature is essential for study planning. The genera found from different V regions and continents highlight geography and V region as important variables to consider in microbiome study design. The 25 shared genera across the various studies may represent genera commonly found in healthy gut microbiomes. 
Understanding the factors that may affect the results from a variety of microbiome studies will allow nurse scientists to plan research proposals in an informed manner. This work presents a valuable framework for future cross-study comparisons conducted across the globe.
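
The idea of identifying genera shared across taxonomy tables can be sketched in Python as a set intersection. The table names and relative abundances below are invented toy values, not data from the 13 studies, and `shared_genera` is a hypothetical helper rather than part of the authors' workflow.

```python
def shared_genera(tables):
    """Return the genera present in every taxonomy table.

    `tables` maps a table name to a dict of genus -> relative abundance.
    """
    genus_sets = [set(table) for table in tables.values()]
    return set.intersection(*genus_sets) if genus_sets else set()


# Toy taxonomy tables (hypothetical abundances), one per study and V region.
tables = {
    "study_A_V3": {"Bacteroides": 0.30, "Prevotella": 0.05, "Dorea": 0.02},
    "study_B_V4": {"Bacteroides": 0.25, "Roseburia": 0.04, "Dorea": 0.01},
    "study_C_V6": {"Bacteroides": 0.10, "Prevotella": 0.40,
                   "Dorea": 0.03, "Roseburia": 0.02},
}
shared = shared_genera(tables)
```

In this toy input only Bacteroides and Dorea occur in all three tables; applied to real taxonomy tables, the same intersection would list the genera common to every data set.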


Subject(s)
Classification/methods , Gastrointestinal Microbiome/physiology , Gastrointestinal Microbiome/immunology , Global Health/statistics & numerical data , High-Throughput Nucleotide Sequencing/methods , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/statistics & numerical data
6.
Ultrasound Obstet Gynecol ; 59(1): 26-32, 2022 Jan.
Article in English | MEDLINE | ID: mdl-34309942

ABSTRACT

OBJECTIVE: To determine the diagnostic yield of exome or genome sequencing (ES/GS) over chromosomal microarray analysis (CMA) in fetuses with increased nuchal translucency (NT) and no concomitant anomalies. METHODS: This systematic review was conducted in accordance with Preferred Reporting Items for Systematic Reviews and Meta-Analyses criteria. PubMed, Scopus and Web of Science were searched for studies describing ES/GS in fetuses with isolated increased NT. Inclusion criteria were: (1) study written in English; (2) more than two fetuses with increased NT > 99th percentile and no concomitant anomalies; and (3) a negative CMA result considered as the reference standard. Only positive variants identified on ES/GS that were classified as likely pathogenic or pathogenic and determined to be causative of the fetal phenotype were considered. Risk was assessed as the pooled effect size by single-proportion analysis using random-effects modeling (weighted by inverse of variance). RESULTS: Eleven studies reporting on the diagnostic yield of ES/GS in fetuses with isolated increased NT > 99th percentile were identified and included 309 cases. All studies were high quality according to Standards for Reporting of Diagnostic Accuracy. Overall, a pathogenic or likely pathogenic variant was identified on ES/GS in 15 fetuses, resulting in a pooled incremental yield of 4% (95% CI, 2-6%). Six (40%) of these fetuses had NT of 5 mm or more. The observed inheritance pattern was autosomal dominant in 12 cases, including four fetuses with Noonan syndrome, autosomal recessive in two cases and X-linked in one case. CONCLUSIONS: There is a 4% incremental diagnostic yield of ES/GS over CMA in fetuses with increased NT > 99th percentile without a concomitant anomaly. It is unclear whether a NT cut-off higher than 3.5 mm may be more useful in case selection for ES/GS. © 2021 International Society of Ultrasound in Obstetrics and Gynecology.
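
The pooling of single proportions by random-effects modeling with inverse-variance weights can be sketched as follows. The abstract does not name the estimator, so the DerSimonian-Laird method on the raw proportion scale is an assumption here, and the per-study counts are hypothetical (the review's study-level data are not given in the abstract).

```python
import math


def pool_proportions(studies, z=1.96):
    """Pool (events, total) pairs with a DerSimonian-Laird random-effects model.

    Returns (pooled_proportion, ci_lower, ci_upper) on the raw proportion scale.
    """
    props, variances = [], []
    for events, total in studies:
        if not 0 < events < total:
            # Raw-scale variance is zero at 0% or 100%; a transform would be needed.
            raise ValueError("each study needs 0 < events < total")
        p = events / total
        props.append(p)
        variances.append(p * (1 - p) / total)

    # Fixed-effect (inverse-variance) estimate, used to measure heterogeneity.
    w = [1 / v for v in variances]
    fixed = sum(wi * pi for wi, pi in zip(w, props)) / sum(w)
    q = sum(wi * (pi - fixed) ** 2 for wi, pi in zip(w, props))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(studies) - 1)) / c) if c > 0 else 0.0

    # Random-effects weights add the between-study variance tau2.
    w_re = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * pi for wi, pi in zip(w_re, props)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    return pooled, pooled - z * se, pooled + z * se


# Hypothetical (events, total) counts, not the 11 studies in the review.
example = pool_proportions([(2, 50), (4, 100), (1, 25)])
```

Because studies are weighted by the inverse of their variance rather than by size alone, a pooled estimate can differ from the crude proportion; in the abstract, 15 of 309 cases is about 4.9%, while the reported pooled yield is 4%.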


Subject(s)
Fetus/diagnostic imaging , High-Throughput Nucleotide Sequencing/statistics & numerical data , Microarray Analysis/statistics & numerical data , Nuchal Translucency Measurement , Prenatal Diagnosis/statistics & numerical data , Female , Fetus/embryology , Humans , Predictive Value of Tests , Pregnancy , Prenatal Diagnosis/methods , Reference Values
7.
J Endocrinol Invest ; 45(4): 773-786, 2022 Apr.
Article in English | MEDLINE | ID: mdl-34780050

ABSTRACT

PURPOSE: To date, many genes have been associated with congenital hypothyroidism (CH). Our aim was to identify the mutational spectrum of 23 causative genes in Turkish patients with permanent CH, including thyroid dysgenesis (TD) and dyshormonogenesis (TDH) cases. METHODS: A total of 134 patients with permanent CH (130 primary, 4 central) were included. To identify the genetic etiology, we screened 23 candidate genes associated with CH by next-generation sequencing. For confirmation and to detect the status of the specific familial variant in relatives, Sanger sequencing was also performed. RESULTS: Possible pathogenic variants were found in 5.2% of patients with TD and in 64.0% of the patients with normal-sized thyroid or goiter. In all patients, variants were most frequently found in TSHR, followed by TPO and TG. The same homozygous TSHB variant (c.162+5G>A) was identified in four patients with central CH. In addition, we detected novel variants in TSHR, TG, SLC26A7, FOXE1, and DUOX2. CONCLUSION: Genetic causes were determined in the majority of CH patients with TDH; however, despite advances in genetics, we were unable to identify the genetic etiology of most CH patients with TD, suggesting the effect of unknown genes or environmental factors. Previous studies and our findings suggest that TSHR and TPO mutations are the main genetic defects of CH in the Turkish population.


Subject(s)
Congenital Hypothyroidism/genetics , Genetic Variation/genetics , Antiporters/analysis , Antiporters/blood , Antiporters/genetics , Child , Child, Preschool , Dual Oxidases/analysis , Dual Oxidases/blood , Dual Oxidases/genetics , Female , Forkhead Transcription Factors/analysis , Forkhead Transcription Factors/blood , Forkhead Transcription Factors/genetics , High-Throughput Nucleotide Sequencing/methods , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Infant , Infant, Newborn , Male , Receptors, Thyrotropin/analysis , Receptors, Thyrotropin/blood , Receptors, Thyrotropin/genetics , Sulfate Transporters/analysis , Sulfate Transporters/blood , Sulfate Transporters/genetics , Thyroglobulin/analysis , Thyroglobulin/blood , Thyroglobulin/genetics
8.
Pediatr Infect Dis J ; 41(2): 166-171, 2022 02 01.
Article in English | MEDLINE | ID: mdl-34845152

ABSTRACT

BACKGROUND: Plasma metagenomic next-generation sequencing (mNGS) has the potential to detect thousands of different organisms with a single test. There are limited data on the real-world impact of mNGS and even less guidance on the types of patients and clinical scenarios in which mNGS testing is beneficial. METHODS: A retrospective review of patients who had mNGS testing as part of routine clinical care at Texas Children's Hospital from June 2018-August 2019 was performed. Medical records were reviewed for pertinent data. An expert panel of infectious disease physicians adjudicated each unique organism identified by mNGS for clinical impact. RESULTS: There were 169 patients with at least one mNGS test. mNGS identified a definitive, probable or possible infection in 49.7% of patients. mNGS led to no clinical impact in 139 patients (82.2%), a positive impact in 21 patients (12.4%), and a negative impact in 9 patients (5.3%). mNGS identified a plausible cause for infection more often in immunocompromised patients than in immunocompetent patients (55.8% vs. 30.0%, P = 0.006). Positive clinical impact was highest in patients with multiple indications for testing (37.5%, P = 0.006) with deep-seated infections, overall, being most often associated with a positive impact. CONCLUSION: mNGS testing has a limited real-world clinical impact when ordered indiscriminately. Immunocompromised patients with well-defined deep-seated infections are likely to benefit most from testing. Further studies are needed to evaluate the full spectrum of clinical scenarios for which mNGS testing is impactful.


Subject(s)
High-Throughput Nucleotide Sequencing/statistics & numerical data , Metagenomics/statistics & numerical data , Adolescent , Anti-Infective Agents/therapeutic use , Child , Child, Preschool , Female , Humans , Immunocompromised Host , Infant , Male , Retrospective Studies , Sepsis/blood , Sepsis/diagnosis , Sepsis/microbiology , Sepsis/virology
9.
Clin Epigenetics ; 13(1): 216, 2021 12 09.
Article in English | MEDLINE | ID: mdl-34886879

ABSTRACT

BACKGROUND: Illumina DNA methylation arrays are high-throughput platforms for cost-effective genome-wide profiling of individual CpGs. Experimental and technical factors introduce appreciable measurement variation, some of which can be mitigated by careful "preprocessing" of raw data. METHODS: Here we describe the ENmix preprocessing pipeline and compare it to a set of seven published alternative pipelines (ChAMP, Illumina, SWAN, Funnorm, Noob, wateRmelon, and RnBeads). We use two large sets of duplicate sample measurements with 450 K and EPIC arrays, along with mixtures of isogenic methylated and unmethylated cell line DNA to compare raw data and that preprocessed via different pipelines. RESULTS: Our evaluations show that the ENmix pipeline performs the best with significantly higher correlation and lower absolute difference between duplicate pairs, higher intraclass correlation coefficients (ICC) and smaller deviations from expected methylation level in mixture experiments. In addition to the pipeline function, ENmix software provides an integrated set of functions for reading in raw data files from mouse and human arrays, quality control, data preprocessing, visualization, detection of differentially methylated regions (DMRs), estimation of cell type proportions, and calculation of methylation age clocks. ENmix is computationally efficient, flexible and allows parallel computing. To facilitate further evaluations, we make all datasets and evaluation code publicly available. CONCLUSION: Careful selection of robust data preprocessing methods is critical for DNA methylation array studies. ENmix outperformed other pipelines in our evaluations to minimize experimental variation and to improve data quality and study power.


Subject(s)
DNA Methylation/genetics , Genetic Testing/standards , HCT116 Cells/pathology , Genetic Testing/instrumentation , Genetic Testing/statistics & numerical data , High-Throughput Nucleotide Sequencing/instrumentation , High-Throughput Nucleotide Sequencing/methods , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans
10.
JAMA Netw Open ; 4(12): e2138219, 2021 12 01.
Article in English | MEDLINE | ID: mdl-34882180

ABSTRACT

Importance: In March 2018, Medicare issued a national coverage determination (NCD) for next-generation sequencing (NGS) to facilitate access to NGS testing among Medicare beneficiaries. It is unknown whether the NCD affected health equity issues for Medicare beneficiaries and the overall population. Objective: To examine the association between the Medicare NCD and NGS use by insurance types and race and ethnicity. Design, Setting, and Participants: A retrospective cohort analysis was conducted using electronic health record data derived from a real-world database. Data originated from approximately 280 cancer clinics (approximately 800 sites of care) in the US. Patients with advanced non-small cell lung cancer (aNSCLC), metastatic colorectal cancer (mCRC), metastatic breast cancer (mBC), or advanced melanoma diagnosed from January 1, 2011, through March 31, 2020, were included. Exposure: Pre- vs post-NCD period. Main Outcomes and Measures: Patients were classified by insurance type and race and ethnicity to examine patterns in NGS testing less than or equal to 60 days after diagnosis. Difference-in-differences models examined changes in average NGS testing in the pre- and post-NCD periods by race and ethnicity, and interrupted time-series analysis examined whether trends over time varied by insurance type and race and ethnicity. Results: Among 92 687 patients with aNSCLC, mCRC, mBC, or advanced melanoma, mean (SD) age was 66.6 (11.2) years, 51 582 (55.7%) were women, and 63 864 (68.9%) were Medicare beneficiaries. The largest racial and ethnic categories according to the database used and further classification were Black or African American (8605 [9.3%]) and non-Hispanic White (59 806 [64.5%]). Compared with Medicare beneficiaries, changes in pre- to post-NCD NGS testing trends were similar in commercially insured patients (odds ratio [OR], 1.03; 95% CI, 0.98-1.08; P = .25). 
Pre- to post-NCD NGS testing trends increased at a slower rate among patients in assistance programs (OR, 0.93; 95% CI, 0.87-0.99; P = .03) compared with Medicare beneficiaries. The rate of increase for patients receiving Medicaid was not statistically significantly different compared with those receiving Medicare (OR, 0.92; 95% CI, 0.84-1.01; P = .07). The NCD was not associated with statistically significant changes in NGS use trends by racial and ethnic groups within Medicare beneficiaries alone or across all insurance types. Compared with non-Hispanic White individuals, increases in average NGS use from the pre-NCD to post-NCD period were 14% lower (OR, 0.86; 95% CI, 0.74-0.99; P = .04) among African American and 23% lower (OR, 0.77; 95% CI, 0.62-0.96; P = .02) among Hispanic/Latino individuals; increases among Asian individuals and those with other races and ethnicities were similar. Conclusions and Relevance: The findings of this study suggest that expansion of Medicare-covered benefits may not occur equally across insurance types, thereby further widening or maintaining disparities in NGS testing. Additional efforts beyond coverage policies are needed to ensure equitable access to the benefits of precision medicine.
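
The odds ratios reported above can be read with two small checks, sketched here in Python: converting an odds ratio to a percentage change in odds, and testing whether a 95% CI excludes the null value of 1. Both helper names are hypothetical; the numbers passed to them are the ones quoted in the abstract.

```python
def percent_change_from_or(odds_ratio):
    """Express an odds ratio as a rounded percentage change in odds."""
    return round((odds_ratio - 1.0) * 100)


def ci_excludes_null(lower, upper, null=1.0):
    """Return True when the confidence interval does not contain the null value."""
    return not (lower <= null <= upper)


# Values quoted in the abstract.
african_american_change = percent_change_from_or(0.86)   # "14% lower"
hispanic_latino_change = percent_change_from_or(0.77)    # "23% lower"
assistance_significant = ci_excludes_null(0.87, 0.99)    # P = .03
medicaid_significant = ci_excludes_null(0.84, 1.01)      # P = .07
```

An interval that contains 1, such as the Medicaid comparison (0.84-1.01), is consistent with the abstract's statement that the difference was not statistically significant.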


Subject(s)
Genetic Predisposition to Disease , Genetic Testing/economics , High-Throughput Nucleotide Sequencing/economics , High-Throughput Nucleotide Sequencing/trends , Medicare/economics , Medicare/trends , Neoplasms/genetics , Adolescent , Adult , Aged , Aged, 80 and over , Female , Forecasting , Genetic Testing/statistics & numerical data , Genetic Testing/trends , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Insurance Coverage/standards , Insurance Coverage/statistics & numerical data , Insurance Coverage/trends , Male , Medicare/statistics & numerical data , Middle Aged , Retrospective Studies , United States , Young Adult
11.
Comput Math Methods Med ; 2021: 7238495, 2021.
Article in English | MEDLINE | ID: mdl-34790254

ABSTRACT

OBJECTIVE: To assess the value of metagenomic next-generation sequencing (mNGS) for detecting pathogens in bronchoalveolar lavage fluid (BALF) and sputum samples. METHODS: In total, 32 patients with pulmonary infection were included. Pathogens in BALF and sputum samples were tested simultaneously by routine microbial culture and mNGS. The main infecting pathogens (bacteria, fungi, and viruses) and their distribution in BALF and sputum samples were analyzed. Moreover, the diagnostic performance of mNGS in paired BALF and sputum samples was assessed. RESULTS: The pathogen culture results were positive in 9 patients and negative in 13 patients. No statistically significant differences were found in the sensitivity (78.94% vs. 63.15%, p = 0.283) or specificity (62.50% vs. 75.00%, p = 0.375) of mNGS diagnosis of bacteria and fungi between the two types of samples. By mNGS detection, both samples were positive in 10 patients, both samples were negative in 13 patients, 7 patients were positive only in BALF samples, and 2 patients were positive only in sputum samples. The main viruses detected by mNGS were EB virus, human adenovirus 5, herpes simplex virus type 1, and human cytomegalovirus. Kappa consistency analysis indicated that mNGS showed significant consistency in detecting pathogens between the two sample types for bacteria (p < 0.001), fungi (p = 0.026), and viruses (p = 0.008). CONCLUSION: mNGS showed no statistically significant differences in sensitivity and specificity of pathogen detection between BALF and sputum samples. Under certain conditions, sputum samples might be more suitable for pathogen detection because collecting BALF samples is invasive.
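
Cohen's kappa for paired BALF and sputum results can be sketched in Python from a 2x2 agreement table. The example uses the overall paired counts reported in the abstract (10 both positive, 13 both negative, 7 positive in one sample type only, 2 positive in the other only). This is an illustration of the statistic on those overall counts, not a reproduction of the per-pathogen kappa analyses, whose underlying tables are not given.

```python
def cohen_kappa(both_pos, only_a, only_b, both_neg):
    """Cohen's kappa for two binary ratings summarized as a 2x2 table."""
    n = both_pos + only_a + only_b + both_neg
    observed = (both_pos + both_neg) / n
    a_pos = (both_pos + only_a) / n   # positive rate for rating A (e.g. BALF)
    b_pos = (both_pos + only_b) / n   # positive rate for rating B (e.g. sputum)
    # Agreement expected by chance if the two ratings were independent.
    expected = a_pos * b_pos + (1 - a_pos) * (1 - b_pos)
    return (observed - expected) / (1 - expected)


# Overall paired mNGS counts from the abstract: 10 / 7 / 2 / 13 of 32 patients.
kappa = cohen_kappa(both_pos=10, only_a=7, only_b=2, both_neg=13)
```

Kappa discounts the agreement expected by chance, so it is lower than the raw agreement of 23 of 32 patients (about 72%).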


Subject(s)
Bronchoalveolar Lavage Fluid/microbiology , Bronchoalveolar Lavage Fluid/virology , High-Throughput Nucleotide Sequencing/methods , Metagenomics/methods , Pneumonia/microbiology , Pneumonia/virology , Sputum/microbiology , Sputum/virology , Adult , Computational Biology , Female , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Male , Metagenomics/statistics & numerical data , Microbiological Techniques , Middle Aged , Pneumonia/diagnosis , Retrospective Studies , Sensitivity and Specificity , Sequence Analysis, DNA
12.
Clin Epigenetics ; 13(1): 204, 2021 11 13.
Article in English | MEDLINE | ID: mdl-34774111

ABSTRACT

BACKGROUND: GGC repeat expansions in NOTCH2NLC are associated with neuronal intranuclear inclusion disease. Very recently, asymptomatic carriers with NOTCH2NLC repeat expansions were reported. In these asymptomatic individuals, the CpG island in NOTCH2NLC is hypermethylated, suggesting that two factors, repeat length and DNA methylation status, should be considered to evaluate pathogenicity. Long-read sequencing can be used to simultaneously profile genomic and epigenomic alterations. We analyzed four sporadic cases with NOTCH2NLC repeat expansion and their phenotypically normal parents. The native genomic DNA that retains base modification was sequenced on a per-trio basis using both PacBio and Oxford Nanopore long-read sequencing technologies. A custom workflow was developed to evaluate DNA modifications. With these two technologies combined, long-range DNA methylation information was integrated with complete repeat DNA sequences to investigate the genetic origins of expanded GGC repeats in these sporadic cases. RESULTS: In all four families, asymptomatic fathers had longer expansions (median: 522, 390, 528 and 650 repeats) compared with their affected offspring (median: 93, 117, 162 and 140 repeats, respectively). These expansions are much longer than the disease-causing range previously reported (in general, 41-300 repeats). Repeat lengths were extremely variable in the fathers, suggesting somatic mosaicism. Instability is more frequent in alleles with uninterrupted pure GGCs. Single-molecule epigenetic analysis revealed complex DNA methylation patterns and epigenetic heterogeneity. We identified an aberrant gain-of-methylation region (2.2 kb in size beyond the CpG island and GGC repeats) in asymptomatic fathers. This methylated region was unmethylated in the normal allele with bilateral transitional zones with both methylated and unmethylated CpG dinucleotides, which may be protected from methylation to ensure NOTCH2NLC expression. 
CONCLUSIONS: We clearly demonstrate that the four sporadic NOTCH2NLC-related cases are derived from the paternal GGC repeat contraction associated with demethylation. The entire genetic and epigenetic landscape of the NOTCH2NLC region was uncovered using the custom workflow of long-read sequence data, demonstrating the utility of this method for revealing epigenetic/mutational changes in repetitive elements, which are difficult to characterize by conventional short-read/bisulfite sequencing methods. Our approach should be useful for biomedical research, aiding the discovery of DNA methylation abnormalities through the entire genome.


Subject(s)
Father-Child Relations , Genetic Background , Intercellular Signaling Peptides and Proteins/genetics , Nerve Tissue Proteins/genetics , DNA Methylation/genetics , DNA Methylation/physiology , Epigenesis, Genetic/genetics , Epigenesis, Genetic/physiology , High-Throughput Nucleotide Sequencing/methods , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Intercellular Signaling Peptides and Proteins/analysis , Nerve Tissue Proteins/analysis
13.
Sci Rep ; 11(1): 21820, 2021 11 08.
Article in English | MEDLINE | ID: mdl-34750410

ABSTRACT

Since 2017, we have used IonTorrent NGS platform in our hospital to diagnose and treat cancer. Analyzing variants at each run requires considerable time, and we are still struggling with some variants that appear correct on the metrics at first, but are found to be negative upon further investigation. Can any machine learning algorithm (ML) help us classify NGS variants? This has led us to investigate which ML can fit our NGS data and to develop a tool that can be routinely implemented to help biologists. Currently, one of the greatest challenges in medicine is processing a significant quantity of data. This is particularly true in molecular biology with the advantage of next-generation sequencing (NGS) for profiling and identifying molecular tumors and their treatment. In addition to bioinformatics pipelines, artificial intelligence (AI) can be valuable in helping to analyze mutation variants. Generating sequencing data from patient DNA samples has become easy to perform in clinical trials. However, analyzing the massive quantities of genomic or transcriptomic data and extracting the key biomarkers associated with a clinical response to a specific therapy requires a formidable combination of scientific expertise, biomolecular skills and a panel of bioinformatic and biostatistic tools, in which artificial intelligence is now successful in developing future routine diagnostics. However, cancer genome complexity and technical artifacts make identifying real variants challenging. We present a machine learning method for classifying pathogenic single nucleotide variants (SNVs), single nucleotide polymorphisms (SNPs), multiple nucleotide variants (MNVs), insertions, and deletions detected by NGS from different types of tumor specimens, such as: colorectal, melanoma, lung and glioma cancer. 
We compared our NGS data to different machine learning algorithms using the k-fold cross-validation method and to neural networks (deep learning) to measure the performance of the different ML algorithms and determine which one is a valid model for confirming NGS variant calls in cancer diagnosis. We trained our machine learning with 70% of our data samples, extracted from our local database (our data structure had 7 parameters: chromosome, position, exon, variant allele frequency, minor allele frequency, coverage and protein description) and validated it with the 30% remaining data. The model offering the best accuracy was chosen and implemented in the NGS analysis routine. Artificial intelligence was developed with the R script language version 3.6.0. We trained our model on 70% of 102,011 variants. Our best error rate (0.22%) was found with random forest machine learning (ntree = 500 and mtry = 4), with an AUC of 0.99. Neural networks achieved some good scores. The final trained model with the neural network achieved an accuracy of 98% and an ROC-AUC of 0.99 with validation data. We tested our RF model to interpret more than 2000 variants from our NGS database: 20 variants were misclassified (error rate < 1%). The errors were nomenclature problems and false positives. After adding false positives to our training database and implementing our RF model routinely, our error rate was always < 0.5%. The RF model shows excellent results for oncosomatic NGS interpretation and can easily be implemented in other molecular biology laboratories. AI is becoming increasingly important in molecular biomedical analysis and can be very helpful in processing medical data. Neural networks show a good capacity in variant classification, and in the future, they may be useful in predicting more complex variants.
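
The 70%/30% training and validation split and the error-rate metric described above can be sketched in Python using only the standard library. The study itself used R (version 3.6.0) with a random forest (ntree = 500, mtry = 4); the helper names, the seed, and the toy labels below are assumptions for illustration, and no classifier is trained here.

```python
import random


def split_70_30(records, seed=0):
    """Shuffle records and return (training, validation) lists in a 70/30 ratio."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def error_rate(true_labels, predicted_labels):
    """Fraction of predictions that disagree with the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return wrong / len(true_labels)


# Toy example with ten hypothetical variant records and four hypothetical calls.
train, validation = split_70_30(range(10))
toy_error = error_rate(["pos", "neg", "pos", "neg"],
                       ["pos", "neg", "neg", "neg"])
```

With this definition, 20 misclassified variants among more than 2000 tested corresponds to an error rate below 1%, as the abstract states.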


Subject(s)
Genetic Variation , High-Throughput Nucleotide Sequencing/statistics & numerical data , Machine Learning , Neoplasms/genetics , Oncogenes , Algorithms , Biomarkers, Tumor/genetics , Computational Biology , Databases, Genetic/statistics & numerical data , Deep Learning , Humans , INDEL Mutation , Models, Statistical , Neural Networks, Computer , Polymorphism, Single Nucleotide , ROC Curve
14.
Clin Epigenetics ; 13(1): 200, 2021 10 29.
Article in English | MEDLINE | ID: mdl-34715912

ABSTRACT

BACKGROUND: Depression is a common, complex, and debilitating mental disorder estimated to be under-diagnosed and insufficiently treated in society. Liability to depression is influenced by both genetic and environmental risk factors, which are both capable of impacting DNA methylation (DNAm). Accordingly, numerous studies have searched for DNAm signatures of this disorder. Recently, an epigenome-wide association study of monozygotic twins identified an association between DNAm status in the KLK8 (neuropsin) promoter region and severity of depression symptomatology. METHODS: In this study, we aimed to investigate: (i) if blood DNAm levels, quantified by pyrosequencing, at two CpG sites in the KLK8 promoter are associated with depression symptomatology and depression diagnosis in an independent clinical cohort and (ii) if KLK8 DNAm levels are associated with depression, postpartum depression, and depression symptomatology in four independent methylomic cohorts, with blood and brain DNAm quantified by either MBD-seq or 450 k methylation array. RESULTS: DNAm levels in KLK8 were not significantly different between depression cases and controls, and were not significantly associated with any of the depression symptomatology scores after correction for multiple testing (minimum p value for KLK8 CpG1 = 0.12 for 'Depressed mood,' and for CpG2 = 0.03 for 'Loss of self-confidence with other people'). However, investigation of the link between KLK8 promoter DNAm levels and depression-related phenotypes collected from four methylomic cohorts identified a significant association (p value < 0.05) between severity of depression symptomatology and blood DNAm levels at seven CpG sites. CONCLUSIONS: Our findings suggest that variance in blood DNAm levels in the KLK8 promoter region is associated with severity of depression symptoms, but not with depression diagnosis.
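
To illustrate why a nominal p value of 0.03 can stop being significant after correction for multiple testing, here is a minimal Python sketch of the Bonferroni adjustment. The abstract does not say which correction method or how many tests were used, so both the choice of Bonferroni and the two-test family in the usage note are assumptions; the function names are hypothetical.

```python
def bonferroni_adjust(p_values):
    """Bonferroni adjustment: multiply each p value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def significant_after_correction(p_values, alpha=0.05):
    """Flag which tests remain significant at level `alpha` after adjustment."""
    return [p_adj < alpha for p_adj in bonferroni_adjust(p_values)]
```

Even if the two minimum p values quoted above (0.12 and 0.03) were the only tests in the family, the adjusted value for 0.03 would be about 0.06, which is above 0.05. A larger family of symptom scores would push the adjusted values higher still.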


Subject(s)
DNA Methylation/genetics , Depression/diagnosis , Kallikreins/analysis , Kallikreins/genetics , Aged , Depression/psychology , Female , High-Throughput Nucleotide Sequencing/methods , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Male , Middle Aged
15.
Genes (Basel) ; 12(9)2021 08 24.
Article in English | MEDLINE | ID: mdl-34573280

ABSTRACT

Inborn errors of immunity (IEI) include a large group of inherited diseases sharing either poor, dysregulated, or absent and/or acquired function in one or more components of the immune system. Next-generation sequencing (NGS) has driven a rapid increase in the recognition of such defects, though the wide heterogeneity of genetically diverse but phenotypically overlapping diseases has often prevented the molecular characterization of the most complex patients. Two hundred and seventy-two patients were tested with three successive NGS-based gene panels composed of 58, 146, and 312 genes. Along with pathogenic and likely pathogenic causative gene variants accounting for the corresponding disorders (37/272 patients, 13.6%), we also detected a number of rare (probably) damaging variants in genes unrelated to the patients' phenotype, variants of unknown significance (VUS) in genes consistent with their clinical presentation, and apparently inconsistent benign, likely benign, or VUS variants. Finally, a remarkable number of previously unreported variants of unknown significance were also found, often recurring in our dataset. The NGS approach achieved the expected IEI diagnostic rate. However, defining the appropriate list of genes for these panels may not be straightforward, and the application of unbiased approaches should be taken into consideration, especially when patients show atypical clinical pictures.


Subject(s)
Gene Frequency , Immune System Diseases/genetics , Metabolism, Inborn Errors/genetics , Adolescent , Female , Gene-Environment Interaction , Genetic Testing/statistics & numerical data , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Immune System Diseases/diagnosis , Male , Metabolism, Inborn Errors/diagnosis , Mutation , Sequence Analysis, DNA/statistics & numerical data
16.
PLoS Comput Biol ; 17(8): e1009254, 2021 08.
Article in English | MEDLINE | ID: mdl-34343164

ABSTRACT

Driven by the necessity to survive environmental pathogens, the human immune system has evolved exceptional diversity and plasticity, to which several factors contribute including inheritable structural polymorphism of the underlying genes. Characterizing this variation is challenging due to the complexity of these loci, which contain extensive regions of paralogy, segmental duplication and high copy-number repeats, but recent progress in long-read sequencing and optical mapping techniques suggests this problem may now be tractable. Here we assess this by using long-read sequencing platforms from PacBio and Oxford Nanopore, supplemented with short-read sequencing and Bionano optical mapping, to sequence DNA extracted from CD14+ monocytes and peripheral blood mononuclear cells from a single European individual identified as HV31. We use this data to build a de novo assembly of eight genomic regions encoding four key components of the immune system, namely the human leukocyte antigen, immunoglobulins, T cell receptors, and killer-cell immunoglobulin-like receptors. Validation of our assembly using k-mer based and alignment approaches suggests that it has high accuracy, with estimated base-level error rates below 1 in 10 kb, although we identify a small number of remaining structural errors. We use the assembly to identify heterozygous and homozygous structural variation in comparison to GRCh38. Despite analyzing only a single individual, we find multiple large structural variants affecting core genes at all three immunoglobulin regions and at two of the three T cell receptor regions. Several of these variants are not accurately callable using current algorithms, implying that further methodological improvements are needed. Our results demonstrate that assessing haplotype variation in these regions is possible given sufficiently accurate long-read and associated data. 
Continued reductions in the cost of these technologies will enable application of these methods to larger samples and provide a broader catalogue of germline structural variation at these loci, an important step toward making these regions accessible to large-scale genetic association studies.
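
The base-level accuracy quoted in this abstract (error rates below 1 in 10 kb) is often restated as a Phred-scaled quality value, QV = -10 log10(error rate). The Python sketch below shows that conversion. It is an illustration only; the abstract does not report QV values, and the function name `phred_quality` is hypothetical.

```python
import math


def phred_quality(errors, bases):
    """Phred-scaled quality value for `errors` erroneous bases out of `bases` bases."""
    if bases <= 0 or errors < 0 or errors > bases:
        raise ValueError("need 0 <= errors <= bases and bases > 0")
    if errors == 0:
        return math.inf  # no observed errors: quality is unbounded
    return -10.0 * math.log10(errors / bases)
```

An error rate of 1 in 10,000 bases corresponds to a QV of about 40, so an error rate below 1 in 10 kb corresponds to a QV above roughly 40.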


Subject(s)
Genetic Variation , Genome, Human/immunology , Immune System , Algorithms , Computational Biology , DNA Copy Number Variations , Genomics/methods , Genomics/statistics & numerical data , HLA Antigens/genetics , Haplotypes , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Immunogenetic Phenomena , Immunoglobulins/genetics , Receptors, Antigen, T-Cell/genetics , Receptors, KIR/genetics , Sequence Analysis, DNA/statistics & numerical data
17.
PLoS Comput Biol ; 17(8): e1008904, 2021 08.
Article in English | MEDLINE | ID: mdl-34339413

ABSTRACT

The killer-cell immunoglobulin-like receptor (KIR) complex on chromosome 19 encodes receptors that modulate the activity of natural killer cells, and variation in these genes has been linked to infectious and autoimmune disease, as well as having bearing on pregnancy and transplant outcomes. The medical relevance and high variability of KIR genes make short-read sequencing an attractive technology for interrogating the region, providing a high-throughput, high-fidelity sequencing method that is cost-effective. However, because this gene complex is characterized by extensive nucleotide polymorphism, structural variation including gene fusions and deletions, and a high level of homology between genes, its interrogation at high resolution has been thwarted by bioinformatic challenges, with most studies limited to examining presence or absence of specific genes. Here, we present the PING (Pushing Immunogenetics to the Next Generation) pipeline, which incorporates empirical data, novel alignment strategies and a custom alignment processing workflow to enable high-throughput KIR sequence analysis from short-read data. PING provides KIR gene copy number classification functionality for all KIR genes through use of a comprehensive alignment reference. The gene copy number determined per individual enables an innovative genotype determination workflow using genotype-matched references. Together, these methods address the challenges imposed by the structural complexity and overall homology of the KIR complex. To assess the accuracy of copy number and genotype determination, we applied PING to European and African validation cohorts and a synthetic dataset. PING demonstrated exceptional copy number determination performance across all datasets and robust genotype determination performance.
Finally, an investigation into discordant genotypes for the synthetic dataset provides insight into misaligned reads, advancing our understanding of how to interpret short-read sequencing data in complex genomic regions. PING promises to support a new era of studies of KIR polymorphism, delivering highly accurate, high-resolution KIR genotypes and enabling high-quality, high-throughput KIR genotyping for disease and population studies.


Subject(s)
Immunogenetics/statistics & numerical data , Receptors, KIR/genetics , Africa, Southern , Alleles , Computational Biology , Computer Simulation , Databases, Nucleic Acid/statistics & numerical data , Europe , Gene Dosage , Genetics, Population/statistics & numerical data , Genotype , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Polymorphism, Genetic , Receptors, KIR/classification , Sequence Alignment/statistics & numerical data , Software Design
18.
mBio ; 12(4): e0163821, 2021 08 31.
Article in English | MEDLINE | ID: mdl-34399612

ABSTRACT

RNA viruses cause numerous emerging diseases, mostly due to transmission from mammalian and avian reservoirs. Large-scale surveillance of RNA viral infections in these animals is a fundamental step for controlling viral infectious diseases. Metagenomic analysis is a powerful method for virus identification with low bias and has contributed substantially to the discovery of novel viruses. Deep-sequencing data have been collected from diverse animals and accumulated in public databases, which can be valuable resources for identifying unknown viral sequences. Here, we screened for infections of 33 RNA viral families in publicly available mammalian and avian sequencing data and found approximately 900 hidden viral infections. We also discovered six nearly complete viral genomes in livestock, wild, and experimental animals: hepatovirus in a goat, hepeviruses in blind mole-rats and a galago, astrovirus in macaque monkeys, parechovirus in a cow, and pegivirus in tree shrews. Some of these viruses were phylogenetically close to human-pathogenic viruses, suggesting the potential risk of causing disease in humans upon infection. Furthermore, infections of five novel viruses were identified in several different individuals, indicating that their infections may have already spread in the natural host population. Our findings demonstrate the reusability of public sequencing data for surveying viral infections and identifying novel viral sequences, presenting a warning about a new threat of viral infectious disease to public health. IMPORTANCE Monitoring the spread of viral infections and identifying novel viruses capable of infecting humans through animal reservoirs are necessary to control emerging viral diseases. Massive amounts of sequencing data collected from various animals are publicly available, and these data may contain sequences originating from a wide variety of viruses. 
Here, we analyzed more than 46,000 public sequencing datasets and identified approximately 900 hidden RNA viral infections in mammalian and avian samples. Some viruses discovered in this study were genetically similar to pathogens that cause hepatitis, diarrhea, or encephalitis in humans, suggesting the presence of new threats to public health. Our study demonstrates the effectiveness of reusing public sequencing data to identify known and unknown viral infections, indicating that continuous monitoring of public sequencing data by metagenomic analyses would help prepare for and mitigate future viral pandemics.


Subject(s)
Communicable Diseases, Emerging/virology , Metagenomics , RNA Virus Infections/prevention & control , RNA Viruses/genetics , RNA Viruses/pathogenicity , Sequence Analysis, DNA/statistics & numerical data , Animals , Birds/virology , Cattle , Data Analysis , Genome, Viral , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , RNA Virus Infections/virology , RNA Viruses/classification , Sequence Analysis, DNA/methods
19.
Nucleic Acids Res ; 49(19): e114, 2021 11 08.
Article in English | MEDLINE | ID: mdl-34403470

ABSTRACT

Haplotype phasing plays an important role in understanding the genetic data of diploid eukaryotic organisms. Different sequencing technologies (such as next-generation sequencing or third-generation sequencing) produce various genetic data that require haplotype assembly. Although multiple diploid haplotype phasing algorithms exist, only a few work equally well across all sequencing technologies. In this work, we propose SpecHap, a novel haplotype assembly tool that leverages spectral graph theory. On both in silico and whole-genome sequencing datasets, SpecHap consumed less memory and required less CPU time, yet achieved accuracy comparable with state-of-the-art methods across all the test instances, which comprise sequencing data from next-generation sequencing, linked-reads, high-throughput chromosome conformation capture, PacBio single-molecule real-time, and Oxford Nanopore long-reads. Furthermore, SpecHap successfully phased an individual Ambystoma mexicanum, a species with a gigantic diploid genome, within 6 CPU hours and 945 MB peak memory usage, while other tools failed to yield results due to either memory overflow (40 GB) or exceeding the time limit (5 days). Our results demonstrate that SpecHap is scalable, efficient, and accurate for diploid phasing across many sequencing platforms.


Subject(s)
Algorithms , Ambystoma mexicanum/genetics , Genome , High-Throughput Nucleotide Sequencing/statistics & numerical data , Sequence Analysis, DNA/methods , Whole Genome Sequencing/statistics & numerical data , Animals , Benchmarking , Datasets as Topic , Diploidy , Haplotypes , Humans , Nanopores , Time Factors
20.
Hum Immunol ; 82(11): 838-849, 2021 Nov.
Article in English | MEDLINE | ID: mdl-34404545

ABSTRACT

BACKGROUND AND PURPOSE: Currently there are no widely accepted guidelines for chimerism analysis testing in hematopoietic cell transplantation (HCT) patients. The objective of this review is to provide a practical guide to address key aspects of performing and utilizing chimerism testing results. In developing this guide, we conducted a survey of testing practices among laboratories that are accredited for performing engraftment monitoring/chimerism analysis by the American Society for Histocompatibility & Immunogenetics (ASHI) and/or the European Federation of Immunogenetics (EFI). We interpreted the survey results in the light of pertinent literature as well as the experience in the laboratories of the authors. RECENT DEVELOPMENTS: In recent years there have been significant advances in high-throughput molecular methods such as next-generation sequencing (NGS), as well as growing access to these technologies in histocompatibility and immunogenetics laboratories. These methods have the potential to improve the performance of chimerism testing in terms of sensitivity, availability of informative genetic markers that distinguish donors from recipients, and cost. SUMMARY: The results of the survey revealed a great deal of heterogeneity in chimerism testing practices among participating laboratories. The most consistent response indicated monitoring of engraftment within the first 30 days. These responses are reflective of published literature. Additional clinical indications included early detection of impending relapse as well as identification of cases of HLA-loss relapse.


Subject(s)
Hematopoietic Stem Cell Transplantation , High-Throughput Nucleotide Sequencing/statistics & numerical data , Histocompatibility Testing/statistics & numerical data , Laboratories, Clinical/statistics & numerical data , Practice Patterns, Physicians'/statistics & numerical data , Chimerism , High-Throughput Nucleotide Sequencing/standards , Histocompatibility Testing/methods , Histocompatibility Testing/standards , Humans , Laboratories, Clinical/standards , Practice Guidelines as Topic , Practice Patterns, Physicians'/standards , Surveys and Questionnaires/statistics & numerical data , Transplantation Chimera/genetics , Transplantation Chimera/immunology , Transplantation, Homologous