ABSTRACT
BACKGROUND: Direct RNA sequencing (dRNA-seq) on Oxford Nanopore Technologies (ONT) platforms can produce reads covering up to full-length gene transcripts, while containing decipherable information about RNA base modifications and poly-A tail lengths. Although many published studies have been expanding the potential of dRNA-seq, its sequencing accuracy and error patterns remain understudied. RESULTS: We present the first comprehensive evaluation of sequencing accuracy and characterisation of systematic errors in dRNA-seq data from diverse organisms and synthetic in vitro transcribed RNAs. We found that for sequencing kits SQK-RNA001 and SQK-RNA002, the median read accuracy ranged from 87% to 92% across species, and deletions significantly outnumbered mismatches and insertions. Due to their high abundance in the transcriptome, heteropolymers and short homopolymers were the major contributors to the overall sequencing errors. We also observed systematic biases across all species at the levels of single nucleotides and motifs. In general, cytosine/uracil-rich regions were more error-prone than guanines and adenines. By examining raw signal data, we identified the underlying signal-level features potentially associated with the error patterns and their dependency on sequence contexts. While read quality scores can be used to approximate error rates at base and read levels, failure to detect DNA adapters may be a source of errors and data loss. By comparing distinct basecallers, we reason that some sequencing errors are attributable to signal insufficiency rather than algorithmic (basecalling) artefacts. Lastly, we generated dRNA-seq data using the latest SQK-RNA004 sequencing kit, released at the end of 2023, and found that although the overall read accuracy increased, the systematic errors remained largely the same as with the previous kits.
CONCLUSIONS: As the first systematic investigation of dRNA-seq errors, this study offers a comprehensive overview of reproducible error patterns across diverse datasets, identifies potential signal-level insufficiency, and lays the foundation for error correction methods.
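The abstract reports median read accuracy without stating how accuracy is defined. The following is a minimal sketch under one common assumption: per-read accuracy is taken as matches divided by all aligned columns (matches + mismatches + insertions + deletions). The function name and the toy counts are hypothetical and are not taken from the study.

```python
from statistics import median


def read_accuracy(matches, mismatches, insertions, deletions):
    """Alignment-based accuracy of one read (assumed definition, not the authors')."""
    aligned_columns = matches + mismatches + insertions + deletions
    if aligned_columns == 0:
        raise ValueError("read has no aligned columns")
    return matches / aligned_columns


# Hypothetical per-read counts: (matches, mismatches, insertions, deletions).
# Deletions are made the largest error class only to mirror the reported trend.
toy_reads = [
    (900, 30, 20, 50),
    (880, 40, 20, 60),
    (920, 20, 10, 50),
]

accuracies = [read_accuracy(*counts) for counts in toy_reads]
median_accuracy = median(accuracies)
print(f"median read accuracy: {median_accuracy:.2%}")
```

In practice the four counts would be derived from the alignment of each read to its reference, for example from the CIGAR string and mismatch tags; that step is omitted here.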
Subject(s)
Nanopore Sequencing; Sequence Analysis, RNA; Sequence Analysis, RNA/methods; Nanopore Sequencing/methods; Nanopores; Humans; Animals; RNA/genetics; High-Throughput Nucleotide Sequencing/methods
ABSTRACT
MOTIVATION: Infectious diseases caused by novel viruses have become a major public health concern. Rapid identification of virus-host interactions can reveal mechanistic insights into infectious diseases and shed light on potential treatments. Current computational prediction methods for novel viruses are based mainly on protein sequences. However, it is not clear to what extent other important features, such as the symptoms caused by the viruses, could contribute to a predictor. Disease phenotypes (i.e. signs and symptoms) are readily accessible from clinical diagnosis, and we hypothesize that they may act as a proxy and an additional source of information for the underlying molecular interactions between pathogens and hosts. RESULTS: We developed DeepViral, a deep learning-based method that predicts protein-protein interactions (PPI) between humans and viruses. Motivated by the potential utility of infectious disease phenotypes, we first embedded human proteins and viruses in a shared space using their associated phenotypes and functions, supported by formalized background knowledge from biomedical ontologies. By jointly learning from protein sequences and phenotype features, DeepViral significantly improves over existing sequence-based methods for intra- and inter-species PPI prediction. AVAILABILITY AND IMPLEMENTATION: Code and datasets for reproduction and customization are available at https://github.com/bio-ontology-research-group/DeepViral. Prediction results for 14 virus families are available at https://doi.org/10.5281/zenodo.4429824. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
ABSTRACT
AIM: The slow coronary flow (SCF) phenomenon is characterized by delayed perfusion of the epicardial arteries with no obvious coronary artery lesion on coronary angiography. The prognosis of patients with SCF is poor. However, a rapid, simple, and accurate method for diagnosing SCF is lacking. This study aimed to explore the utility of plasma choline as a diagnostic biomarker for SCF. METHODS: Patients with coronary artery stenosis <40% on coronary angiography were recruited and grouped into normal coronary flow (NCF) and SCF by Thrombolysis in Myocardial Infarction (TIMI) frame count (TFC). Plasma choline concentrations of patients with NCF and SCF were quantified by ultra-performance liquid chromatography-tandem mass spectrometry. Correlation analysis was performed between plasma choline concentration and TFC. Receiver operating characteristic (ROC) curve analysis, with or without adjustment for confounding factors, was applied to assess the diagnostic power of plasma choline in SCF. RESULTS: Forty-four patients with SCF and 21 patients with NCF were included in this study. TFC in the LAD, LCX, and RCA and mean TFC were significantly higher in patients with SCF than in patients with NCF (32.67 ± 8.37 vs. 20.66 ± 3.41, P < 0.01). Plasma choline level was significantly higher in patients with SCF than in patients with NCF (754.65 ± 238.18 vs. 635.79 ± 108.25, P = 0.007). Plasma choline level was significantly and positively correlated with mean TFC (r = 0.364, P = 0.002). ROC analysis showed that choline with or without confounding factor adjustment had an AUC of 0.65 and 0.77, respectively. CONCLUSIONS: TFC was closely related to plasma choline level, and plasma choline can be a suitable and stable diagnostic biomarker for SCF.
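As an illustration of what the unadjusted ROC analysis measures, the sketch below computes an AUC as the probability that a randomly chosen SCF patient has a higher marker value than a randomly chosen NCF patient (the Mann-Whitney formulation, with ties counted as one half). The function name and all values are hypothetical toy numbers, not patient data from the study, and the confounder-adjusted analysis is not reproduced.

```python
def auc_from_groups(positive_values, negative_values):
    """Unadjusted AUC via pairwise comparison (Mann-Whitney formulation)."""
    if not positive_values or not negative_values:
        raise ValueError("both groups must be non-empty")
    score = 0.0
    for pos in positive_values:
        for neg in negative_values:
            if pos > neg:
                score += 1.0   # marker ranks the positive case higher
            elif pos == neg:
                score += 0.5   # ties contribute one half
    return score / (len(positive_values) * len(negative_values))


# Hypothetical marker values for illustration only.
toy_scf = [700.0, 800.0, 650.0, 900.0]   # assumed "SCF" group
toy_ncf = [600.0, 720.0, 640.0]          # assumed "NCF" group

toy_auc = auc_from_groups(toy_scf, toy_ncf)
print(f"toy AUC: {toy_auc:.3f}")
```

An AUC of 0.5 corresponds to a marker that does not separate the groups, and 1.0 to perfect separation.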