ABSTRACT
The Standard Practices for Infrared Multivariate Quantitative Analysis (ASTM E1655) provide a guide for determining physicochemical properties of materials using multivariate calibration techniques applied to chemical sources that have high multicollinearity and correlated information. Partial least squares (PLS) is the most widely used multivariate regression method due to its excellent prediction capabilities and easy optimization. Initially applied to chromatographic data, PLS has also shown great results in near-infrared (NIR) and mid-infrared (MIR) spectroscopies. However, complex chemical matrices with low correlation may not be efficiently modeled using PLS or other multivariate analyses limited by grouping similar information (such as latent variables or principal components). Therefore, this study aims to evaluate the multicollinearity of different analytical techniques, such as high-temperature gas chromatography (HTGC), NIR, MIR, hydrogen nuclear magnetic resonance (1H NMR), carbon-13 nuclear magnetic resonance (13C NMR), and Fourier transform ion cyclotron resonance mass spectrometry coupled to the electrospray source in positive and negative ionization modes (ESI(±)FT-ICR). Descriptive statistics (coefficient of determination, R2) and principal component analysis (PCA) were used to identify the distribution of correlated information. Results showed that NIR and MIR spectroscopies exhibited a higher percentage of correlated variables, while 13C NMR and ESI(±)FT-ICR MS had more discrete profiles. Therefore, PLS development may be more effectively applied to NIR, MIR, and 1H NMR data, while 13C NMR and mass spectra may require other algorithms or variable selection methods in combination with PLS.
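The "percentage of correlated variables" idea can be illustrated with a minimal sketch. It assumes each analytical variable (a wavenumber, chemical shift, or m/z channel) is a column of intensities across samples and that a pair counts as correlated when its R2 reaches a cutoff. The function names, the 0.9 cutoff, and the toy data are assumptions for illustration, not values taken from the study.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable carries no correlation information
    return sxy / sqrt(sxx * syy)


def correlated_pair_fraction(columns, r2_threshold=0.9):
    """Fraction of variable pairs whose R2 is at or above the threshold.

    `columns` is a list of variables, each a list of values over samples.
    The threshold is a hypothetical cutoff chosen for this sketch.
    """
    pairs = list(combinations(range(len(columns)), 2))
    hits = sum(
        1 for i, j in pairs
        if pearson_r(columns[i], columns[j]) ** 2 >= r2_threshold
    )
    return hits / len(pairs)


# Toy data: broad, overlapping bands (spectroscopy-like) vs. isolated peaks
# (mass-spectrum-like). Values are invented for illustration only.
smooth_profile = [[1, 2, 3, 4], [2, 4, 6, 8], [1.1, 2.0, 3.2, 3.9]]
discrete_profile = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]

print(correlated_pair_fraction(smooth_profile))
print(correlated_pair_fraction(discrete_profile))
```

A high fraction suggests that a few latent variables can summarize the data, which is the situation in which PLS tends to work well; a low fraction points toward variable selection before regression.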
ABSTRACT
Rapid identification of respiratory viruses in biological samples is of utmost importance in strategies to combat pandemics. Feeding MALDI FT-ICR MS (matrix-assisted laser desorption/ionization Fourier-transform ion cyclotron resonance mass spectrometry) data into machine learning algorithms could hold promise for classifying samples positive for SARS-CoV-2. This study aimed to develop a fast and effective methodology to perform saliva-based screening of patients with suspected COVID-19, using the MALDI FT-ICR MS technique with a support vector machine (SVM). In the method optimization, the best sample preparation was obtained with the digestion of saliva in 10 µL of trypsin for 2 h and the MALDI analysis, which presented a satisfactory resolution for the analysis with 1 M. SVM models were created with data from the analysis of 97 samples designated as SARS-CoV-2 positive and 52 designated as negative, all confirmed by RT-PCR tests. The SVM1 and SVM2 models showed the best results. The calibration group reached 100% accuracy, and the test group 95.6% (SVM1) and 86.7% (SVM2). SVM1 selected 780 variables and had a false negative rate (FNR) of 0%, while SVM2 selected only two variables with an FNR of 3%. The proposed methodology is a promising tool to aid screening for COVID-19.
Subject(s)
COVID-19 , COVID-19/diagnosis , COVID-19 Testing , Fourier Analysis , Humans , Machine Learning , SARS-CoV-2 , Saliva , Spectrometry, Mass, Matrix-Assisted Laser Desorption-Ionization/methods
ABSTRACT
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused the worst global health crisis in living memory. Reverse transcription quantitative polymerase chain reaction (RT-qPCR) is considered the gold standard diagnostic method, but it exhibits limitations in the face of enormous demands. We evaluated a mid-infrared (MIR) data set of 237 saliva samples obtained from symptomatic patients (138 COVID-19 infections diagnosed via RT-qPCR). MIR spectra were evaluated via unsupervised random forest (URF) and classification models. The classification models were linear discriminant analysis (LDA) applied after variable selection by genetic algorithm (GA-LDA) or successive projections algorithm (SPA-LDA), partial least squares discriminant analysis (PLS-DA), and a combination of dimension reduction and variable selection by particle swarm optimization (PSO-PLS-DA). Additionally, a consensus class was used. URF models can identify structures even in highly complex data. Individual models performed well, but the consensus class improved the validation performance to 85% accuracy, 93% sensitivity, 83% specificity, and a Matthews correlation coefficient value of 0.69, with information at different spectral regions. Therefore, through this unsupervised and supervised framework methodology, it is possible to better highlight the spectral regions associated with positive samples, including lipid (~1700 cm-1), protein (~1400 cm-1), and nucleic acid (~1200-950 cm-1) regions. This methodology presents an important tool for a fast, noninvasive diagnostic technique, reducing costs and allowing for risk reduction strategies.
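The figures of merit quoted above (accuracy, sensitivity, specificity, Matthews correlation coefficient) all derive from a confusion matrix, and a consensus class can be formed by combining the outputs of several classifiers. The sketch below assumes a simple majority vote over binary labels (1 = positive, 0 = negative); the abstract does not state the actual consensus rule, and the function names and toy labels are hypothetical.

```python
from math import sqrt


def consensus_class(model_predictions):
    """Majority vote across models (assumed rule, not the paper's)."""
    n_models = len(model_predictions)
    return [
        1 if 2 * sum(votes) > n_models else 0
        for votes in zip(*model_predictions)
    ]


def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and MCC from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Invented labels for six samples and three hypothetical models.
y_true = [1, 1, 1, 0, 0, 0]
model_a = [1, 1, 0, 0, 0, 1]
model_b = [1, 0, 1, 0, 1, 0]
model_c = [1, 1, 1, 0, 0, 0]

voted = consensus_class([model_a, model_b, model_c])
print(binary_metrics(y_true, model_a))
print(binary_metrics(y_true, voted))
```

In this toy case each model errs on different samples, so the vote corrects the individual mistakes; that is the general motivation for a consensus class, although real gains depend on how correlated the models' errors are.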
Subject(s)
COVID-19 , Saliva , Discriminant Analysis , Humans , Least-Squares Analysis , Multivariate Analysis , SARS-CoV-2 , Spectroscopy, Fourier Transform Infrared
ABSTRACT
We have built an online tool with a user-friendly, browser-based interface to facilitate the processing of high-resolution, high-precision petroleum mass spectrometry data. DropMS does not require software installation. Mass spectra are sent to and processed by the server using various algorithms reported in the literature, such as S/N ratio filters, recalibration, chemical formula assignment, and data visualization using the plots commonly used in mass spectrometry: Van Krevelen diagrams, Kendrick diagrams, and DBE vs C#. To validate the algorithms used and the processing results, the same mass spectrum of a typical Brazilian oil sample, acquired by ESI(+)-FT-ICR/MS, was processed using DropMS and Composer (Sierra Analytics), yielding good agreement between the heteroatomic classes found and the number of compounds assigned. A mass spectrum has chemical information spread over the entire spectrum. PLS multivariate regression aims to condense the most important information into latent variables in order to quantify the properties of interest. Finally, 12 processed petroleum FT-ICR MS spectra were used for a partial least-squares regression with seven latent variables, giving R2 = 0.971 and an RMSEC of 0.997 for API gravity, with a reference value range of 21-42.
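Two of the quantities behind the diagrams mentioned above can be computed directly from a peak's mass or assigned formula. The sketch below uses the commonly cited textbook definitions: the double bond equivalent of a neutral CcHhNnOoSs formula, DBE = c - h/2 + n/2 + 1, and the CH2-based Kendrick mass defect. It is not DropMS code; the function names are hypothetical, and sign and rounding conventions for the Kendrick mass defect vary between tools.

```python
CH2_EXACT_MASS = 14.01565  # IUPAC mass of the CH2 repeat unit, in Da


def dbe(c, h, n=0):
    """Double bond equivalent of a neutral CcHhNnOoSs formula.

    Oxygen and sulfur are divalent and do not change the DBE.
    """
    return c - h / 2 + n / 2 + 1


def kendrick_mass(iupac_mass):
    """Rescale the mass so that CH2 weighs exactly 14."""
    return iupac_mass * 14.0 / CH2_EXACT_MASS


def kendrick_mass_defect(iupac_mass):
    """Nominal Kendrick mass minus Kendrick mass (one common convention)."""
    km = kendrick_mass(iupac_mass)
    return round(km) - km


# Members of a homologous series differ only by CH2 units, so they share
# the same Kendrick mass defect. The mass below is an arbitrary example.
base = 200.1000
print(kendrick_mass_defect(base))
print(kendrick_mass_defect(base + CH2_EXACT_MASS))
print(dbe(6, 6))       # benzene, C6H6
print(dbe(5, 5, n=1))  # pyridine, C5H5N
```

Plotting Kendrick mass defect against nominal Kendrick mass lines up each homologous series horizontally, which is what makes the Kendrick diagram useful for checking formula assignments.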
ABSTRACT
RATIONALE: Electrospray ionization Fourier transform ion cyclotron resonance mass spectrometry (ESI FT-ICR MS) is an important analytical technique used for the elucidation of crude oil polar compounds at the molecular level, providing thousands of heteroatom compounds in a single analysis. Due to the high resolution, the complexity of data produced, and steps involved in spectra acquisition and processing, it is necessary to estimate its intermediate precision. METHODS: Intermediate precision was estimated for positive- and negative-ion ionization modes (ESI(±)) using Composer® software for two Brazilian crude oil samples. The analytical parameters evaluated were the class distribution histogram, the double bond equivalent (DBE) distribution, and the DBE versus carbon number. The statistical parameters used to study the intermediate precision were calculated from the average, standard deviation, confidence interval (significance level at 5%), coefficient of variation (CV), intermediate precision limit (ISO 5725), and principal component analysis (PCA). RESULTS: Two crude oil samples (A and B) were analyzed, in triplicate, for seven consecutive days by ESI(±) FT-ICR MS. The assigned class limit by ESI(+) for crude oil A was 0.42% (O2 S[H] class) and for crude oil B was 0.04% (N2 O2 S[H] class). The assigned DBE intensity limits for the two crude oils were 0.04% for ESI(+) and 0.013% for ESI(-). The PCA for ESI(-) and ESI(+) modes presented better precision for crude oils B and A, respectively. CONCLUSIONS: The most abundant classes and DBE of the majority class (i.e., with the highest intensity) are the parameters produced from the Composer® software that had the highest precision and can be used to estimate crude oil properties. The DBE values presented lower intermediate precision limit values (0.04%) than the assigned class values (0.4%). According to CV and PCA, ESI(+) was more precise for crude oil A (83% precision) and ESI(-) for crude oil B (84% precision).
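The precision statistics named in the METHODS can be sketched for a single parameter, for example the relative abundance of one heteroatom class measured on different days. The sketch assumes the usual ISO 5725 style precision limit of about 2.8 times the standard deviation (1.96 x sqrt(2) at the 95% level); the abstract does not give the exact formula the authors used, and the function names and replicate values are invented for illustration.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV in percent: sample standard deviation relative to the mean."""
    return 100.0 * stdev(values) / mean(values)


def precision_limit(values, factor=2.8):
    """Precision limit as factor * standard deviation.

    The default factor of 2.8 (about 1.96 * sqrt(2)) is the value commonly
    associated with ISO 5725 at the 95% level; treat it as an assumption.
    """
    return factor * stdev(values)


# Hypothetical relative abundances (%) of one class over three runs.
replicates = [10.0, 12.0, 14.0]

print(coefficient_of_variation(replicates))
print(precision_limit(replicates))
```

Under this definition, two results obtained under intermediate precision conditions would be expected to differ by more than the limit only about 5% of the time.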