Results 1 - 20 of 74
1.
J Chem Inf Model ; 62(14): 3415-3425, 2022 07 25.
Article in English | MEDLINE | ID: mdl-35834424

ABSTRACT

Molecular dynamics (MD) is a core methodology of molecular modeling and computational design for the study of the dynamics and temporal evolution of molecular systems. MD simulations have particularly benefited from the rapid increase of computational power that has characterized the past decades of computational chemical research, being the first method to be successfully migrated to the GPU infrastructure. While new-generation MD software is capable of delivering simulations on an ever-increasing scale, relatively less effort is invested in developing postprocessing methods that can keep up with the quickly expanding volumes of data that are being generated. Here, we introduce a new idea for sampling frames from large MD trajectories, based on the recently introduced framework of extended similarity indices. Our approach presents a new, linearly scaling alternative to the traditional approach of applying a clustering algorithm that usually scales as a quadratic function of the number of frames. When showcasing its usage on case studies with different system sizes and simulation lengths, we have registered speedups of up to 2 orders of magnitude, as compared to traditional clustering algorithms. The conformational diversity of the selected frames is also noticeably higher, which is a further advantage for certain applications, such as the selection of structural ensembles for ligand docking. The method is available open-source at https://github.com/ramirandaq/MultipleComparisons.


Subject(s)
Molecular Dynamics Simulation , Proteins , Algorithms , Cluster Analysis , Proteins/chemistry , Software
2.
J Comput Aided Mol Des ; 36(3): 157-173, 2022 03.
Article in English | MEDLINE | ID: mdl-35288838

ABSTRACT

Extended (or n-ary) similarity indices have been recently proposed to extend the comparative analysis of binary strings. Going beyond the traditional notion of pairwise comparisons, these novel indices allow comparing any number of objects at the same time. This results in a remarkable efficiency gain with respect to other approaches, since now we can compare N molecules in O(N) instead of the common quadratic O(N²) timescale. This favorable scaling has motivated the application of these indices to diversity selection, clustering, phylogenetic analysis, chemical space visualization, and post-processing of molecular dynamics simulations. However, the current formulation of the n-ary indices is limited to vectors with binary or categorical inputs. Here, we present the further generalization of this formalism so it can be applied to numerical data, i.e. to vectors with continuous components. We discuss several ways to achieve this extension and present their analytical properties. As a practical example, we apply this formalism to the problem of feature selection in QSAR and prove that the extended continuous similarity indices provide a convenient way to discern between several sets of descriptors.


Subject(s)
Drug Design , Quantitative Structure-Activity Relationship , Phylogeny
3.
Chem Rev ; 119(6): 3674-3729, 2019 03 27.
Article in English | MEDLINE | ID: mdl-30604951

ABSTRACT

Reversed-phase high-performance liquid chromatography (RP-HPLC) is the most popular chromatographic mode, accounting for more than 90% of all separations. HPLC itself owes its immense popularity to it being relatively simple and inexpensive, with the equipment being reliable and easy to operate. Due to extensive automation, it can be run virtually unattended with multiple samples at various separation conditions, even by relatively low-skilled personnel. Currently, there are >600 RP-HPLC columns available to end users for purchase, some of which exhibit very large differences in selectivity and production quality. Often, two similar RP-HPLC columns are not equally suitable for the requisite separation, and to date, there is no universal RP-HPLC column covering a variety of analytes. This forces analytical laboratories to keep a multitude of diverse columns. Therefore, column selection is a crucial segment of RP-HPLC method development, especially since sample complexity is constantly increasing. Rationally choosing an appropriate column is complicated. In addition to the differences in the primary intermolecular interactions with analytes of the dispersive (London) type, individual columns can also exhibit a unique character owing to specific polar, hydrogen bond, and electron pair donor-acceptor interactions. They can also vary depending on the type of packing, amount and type of residual silanols, "end-capping", bonding density of ligands, and pore size, among others. Consequently, the chromatographic performance of RP-HPLC systems is often considerably altered depending on the selected column. Although a wide spectrum of knowledge is available on this important subject, there is still a lack of a comprehensive review for an objective comparison and/or selection of chromatographic columns. 
We aim for this review to be a comprehensive, authoritative, critical, and easily readable monograph of the most relevant publications regarding column selection and characterization in RP-HPLC covering the past four decades. Future perspectives, which involve the integration of state-of-the-art molecular simulations (molecular dynamics or Monte Carlo) with minimal experiments, aimed at nearly "experiment-free" column selection methodology, are proposed.


Subject(s)
Chemistry Techniques, Analytical/methods , Chromatography, High Pressure Liquid/methods , Chromatography, Reverse-Phase/methods , Adsorption , Buffers , Chromatography, High Pressure Liquid/instrumentation , Chromatography, Reverse-Phase/instrumentation , Hydrophobic and Hydrophilic Interactions , Quantitative Structure-Activity Relationship
4.
Mol Divers ; 25(3): 1409-1424, 2021 Aug.
Article in English | MEDLINE | ID: mdl-34110577

ABSTRACT

In this review, we outline the current trends in the field of machine learning-driven classification studies related to ADME (absorption, distribution, metabolism and excretion) and toxicity endpoints from the past six years (2015-2021). The study focuses only on classification models with large datasets (i.e. more than a thousand compounds). A comprehensive literature search and meta-analysis was carried out for nine different targets: hERG-mediated cardiotoxicity, blood-brain barrier penetration, permeability glycoprotein (P-gp) substrate/inhibitor, cytochrome P450 enzyme family, acute oral toxicity, mutagenicity, carcinogenicity, respiratory toxicity and irritation/corrosion. The comparison of the best classification models was targeted to reveal the differences between machine learning algorithms and modeling types, endpoint-specific performances, dataset sizes and the different validation protocols. Based on the evaluation of the data, we can say that tree-based algorithms are (still) dominating the field, with consensus modeling being an increasing trend in drug safety predictions. Although one can already find classification models with strong performance for hERG-mediated cardiotoxicity and the isoenzymes of the cytochrome P450 enzyme family, these targets are still central to ADMET-related research efforts.


Subject(s)
Drug Design , Machine Learning , Models, Molecular , Quantitative Structure-Activity Relationship , Algorithms , Drug-Related Side Effects and Adverse Reactions , ERG1 Potassium Channel/chemistry , ERG1 Potassium Channel/genetics , Humans , Neural Networks, Computer , Pharmacokinetics , Support Vector Machine , Tissue Distribution
5.
Molecules ; 26(4)2021 Feb 19.
Article in English | MEDLINE | ID: mdl-33669834

ABSTRACT

Applied datasets can vary from a few hundred to thousands of samples in typical quantitative structure-activity/property (QSAR/QSPR) relationships and classification. However, the size of the datasets and the train/test split ratios can greatly affect the outcome of the models, and thus the classification performance itself. We compared several combinations of dataset sizes and split ratios with five different machine learning algorithms to find the differences or similarities and to select the best parameter settings in nonbinary (multiclass) classification. It is also known that the models are ranked differently according to the performance merit(s) used. Here, 25 performance parameters were calculated for each model, then factorial ANOVA was applied to compare the results. The results clearly show the differences not just between the applied machine learning algorithms but also between the dataset sizes and to a lesser extent the train/test split ratios. The XGBoost algorithm could outperform the others, even in multiclass modeling. The performance parameters reacted differently to the change of the sample set size; some of them were much more sensitive to this factor than the others. Moreover, significant differences could be detected between train/test split ratios as well, exerting a great effect on the test validation of our models.


Subject(s)
Algorithms , Databases as Topic , Quantitative Structure-Activity Relationship , Confidence Intervals , Machine Learning
6.
Anal Bioanal Chem ; 412(19): 4619-4628, 2020 Jul.
Article in English | MEDLINE | ID: mdl-32472144

ABSTRACT

Extracellular vesicles (EVs) are lipid bilayer-bounded particles that are actively synthesized and released by cells. The main components of EVs are lipids, proteins, and nucleic acids; their composition is characteristic of their type and origin and reveals the physiological and pathological conditions of the parent cells. The concentration and protein composition of EVs closely relate to their functions; therefore, total protein determination can assist in EV-based diagnostics and disease prognosis. Here, we present a simple, reagent-free method based on attenuated total reflection Fourier transform infrared (ATR-FTIR) spectroscopy to quantify the protein content of EV samples without any further sample preparation. After calibration with bovine serum albumin, the protein concentration of red blood cell-derived EVs (REVs) was investigated by ATR-FTIR spectroscopy. The integrated area of the amide I band was calculated from the IR spectra of REVs, which was proportional to the protein quantity in the sample, regardless of its secondary structure. A spike test and a dilution test were performed to assess the suitability of ATR-FTIR spectroscopy for protein quantification in EV samples, which resulted in linearity with R² values as high as 0.992 over the concentration range of 0.08 to 1 mg/mL. Additionally, multivariate calibration with the partial least squares (PLS) regression method was carried out on the bovine serum albumin and EV spectra. R² values were 0.94 for the calibration and 0.91 for the validation set. The results indicate that ATR-FTIR measurements provide a reliable method for reagent-free protein quantification of EVs.


Subject(s)
Erythrocytes/chemistry , Extracellular Vesicles/chemistry , Proteins/analysis , Spectroscopy, Fourier Transform Infrared/methods , Animals , Cattle , Humans , Indicators and Reagents , Least-Squares Analysis , Serum Albumin, Bovine/analysis
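
The univariate calibration described in this abstract (integrated amide I band area against known protein concentration) can be sketched as below. This is only an illustration under stated assumptions, not the authors' procedure: the function names, the 1600-1700 cm⁻¹ integration window, the absence of baseline correction, and all numbers in the usage lines are hypothetical.

```python
import numpy as np


def band_area(wavenumbers, absorbance, low=1600.0, high=1700.0):
    """Trapezoidal area of a spectrum between two wavenumbers.

    The default window is an assumed range for the amide I band;
    no baseline correction is applied in this sketch.
    """
    w = np.asarray(wavenumbers, dtype=float)
    a = np.asarray(absorbance, dtype=float)
    order = np.argsort(w)                 # spectra are often stored high-to-low
    w, a = w[order], a[order]
    mask = (w >= low) & (w <= high)
    w, a = w[mask], a[mask]
    # Explicit trapezoid rule, written out to avoid version-specific NumPy helpers.
    return float(np.sum(0.5 * (a[1:] + a[:-1]) * np.diff(w)))


def fit_calibration(areas, concentrations):
    """Least-squares line: concentration = slope * area + intercept."""
    slope, intercept = np.polyfit(areas, concentrations, 1)
    return float(slope), float(intercept)


def predict_concentration(area, slope, intercept):
    """Apply the fitted calibration line to a new band area."""
    return slope * area + intercept


# Hypothetical usage with made-up calibration points (areas vs. mg/mL).
slope, intercept = fit_calibration([1.0, 2.0, 3.0], [0.2, 0.4, 0.6])
estimate = predict_concentration(2.5, slope, intercept)
```

A multivariate (PLS) calibration, also mentioned in the abstract, would use the full spectrum rather than a single band area and is not shown here.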
7.
Molecules ; 24(15)2019 Jul 24.
Article in English | MEDLINE | ID: mdl-31344902

ABSTRACT

Ensemble docking is a widely applied concept in structure-based virtual screening (to at least partly account for protein flexibility), usually granting a significant performance gain at a modest cost of speed. From the individual, single-structure docking scores, a consensus score needs to be produced by data fusion: this is usually done by taking the best docking score from the available pool (in most cases, and in this study as well, this is the minimum score). Nonetheless, there are a number of other fusion rules that can be applied. We report here the results of a detailed statistical comparison of seven fusion rules for ensemble docking, on five case studies of current drug targets, based on four performance metrics. Sevenfold cross-validation and variance analysis (ANOVA) allowed us to highlight the best fusion rules. The results are presented in bubble plots, to unite the four performance metrics into a single, comprehensive image. Notably, we suggest the use of the geometric and harmonic means as better alternatives to the generally applied minimum fusion rule.


Subject(s)
Drug Design , Molecular Docking Simulation , Molecular Dynamics Simulation , Algorithms , Binding Sites , Ligands , Protein Binding , Protein Kinase Inhibitors/chemistry , Protein Kinase Inhibitors/pharmacology , Quantitative Structure-Activity Relationship , ROC Curve , Reproducibility of Results , Workflow
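
The data-fusion step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `fuse_scores` and the example scores are hypothetical, and only the three rules named in the abstract (minimum, geometric mean, harmonic mean) are shown out of the seven compared. Because docking scores are typically negative (lower is better), the sketch assumes the means are taken over score magnitudes and the sign is restored afterwards; the study may handle signs differently.

```python
import numpy as np


def fuse_scores(scores):
    """Fuse per-structure docking scores into one consensus score per ligand.

    scores: array of shape (n_ligands, n_structures); all values are
    assumed to be strictly negative (more negative = better).
    """
    scores = np.asarray(scores, dtype=float)
    magnitude = -scores                      # positive binding "strength"
    if np.any(magnitude <= 0):
        raise ValueError("this sketch assumes strictly negative docking scores")
    n_structures = scores.shape[1]
    return {
        # Conventional rule: keep the best (lowest) score of the ensemble.
        "minimum": scores.min(axis=1),
        # Geometric mean of magnitudes, computed in log space, sign restored.
        "geometric_mean": -np.exp(np.log(magnitude).mean(axis=1)),
        # Harmonic mean of magnitudes, sign restored.
        "harmonic_mean": -(n_structures / (1.0 / magnitude).sum(axis=1)),
    }


# Hypothetical scores for two ligands docked into two protein structures.
fused = fuse_scores([[-8.0, -2.0], [-4.0, -4.0]])
```

In this toy input the minimum rule prefers the first ligand, while the harmonic mean prefers the second, which scores consistently across both structures.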
8.
Molecules ; 24(15)2019 Aug 01.
Article in English | MEDLINE | ID: mdl-31374986

ABSTRACT

Machine learning classification algorithms are widely used for the prediction and classification of the different properties of molecules such as toxicity or biological activity. The prediction of toxic vs. non-toxic molecules is important because testing on living animals has both ethical and cost drawbacks. The quality of classification models can be determined with several performance parameters, which often give conflicting results. In this study, we performed a multi-level comparison with the use of different performance metrics and machine learning classification methods. Well-established and standardized protocols for the machine learning tasks were used in each case. The comparison was applied to three datasets (acute and aquatic toxicities) and the robust, yet sensitive, sum of ranking differences (SRD) and analysis of variance (ANOVA) were applied for evaluation. The effect of dataset composition (balanced vs. imbalanced) and 2-class vs. multiclass classification scenarios was also studied. Most of the performance metrics are sensitive to dataset composition, especially in 2-class classification problems. The optimal machine learning algorithm also depends significantly on the composition of the dataset.


Subject(s)
Algorithms , Benchmarking , Machine Learning
10.
Anal Bioanal Chem ; 408(23): 6403-11, 2016 Sep.
Article in English | MEDLINE | ID: mdl-27531031

ABSTRACT

Almost a hundred commercially available energy drink samples from Hungary, Slovakia, and Greece were collected for the quantitative determination of their caffeine and sugar content with FT-NIR spectroscopy and high-performance liquid chromatography (HPLC). Calibration models were built with partial least-squares regression (PLSR). An HPLC-UV method was used to measure the reference values for caffeine content, while sugar contents were measured with the Schoorl method. Both the nominal sugar content (as indicated on the cans) and the measured sugar concentration were used as references. Although the Schoorl method has larger error and bias, appropriate models could be developed using both references. The validation of the models was based on sevenfold cross-validation and external validation. FT-NIR analysis is a good candidate to replace the HPLC-UV method, because it is much cheaper than any chromatographic method, while it is also more time-efficient. The combination of FT-NIR with multidimensional chemometric techniques like PLSR can be a good option for the detection of low caffeine concentrations in energy drinks. Moreover, three types of energy drinks that contain (i) taurine, (ii) arginine, and (iii) none of these two components were classified correctly using principal component analysis and linear discriminant analysis. Such classifications are important for the detection of adulterated samples and for quality control as well. In this case, more than a hundred samples were used for the evaluation. The classification was validated with cross-validation and several randomization tests (X-scrambling). Graphical abstract: the way of energy drinks from cans to appropriate chemometric models.


Subject(s)
Energy Drinks/analysis , Spectroscopy, Fourier Transform Infrared/methods , Arginine/analysis , Caffeine/analysis , Calibration , Chromatography, High Pressure Liquid/methods , Discriminant Analysis , Least-Squares Analysis , Principal Component Analysis , Sugars/analysis , Taurine/analysis
11.
Anal Bioanal Chem ; 407(10): 2887-98, 2015 Apr.
Article in English | MEDLINE | ID: mdl-25662936

ABSTRACT

A novel, time- and cost-saving method has been developed and validated for the quantitative determination of coenzyme Q10 (CoQ10) from several dietary supplements. FT-NIR spectroscopy was applied for the examination, and a calibration model was built by partial least-squares regression (PLS-R) using 50 dietary supplements. The combination of FT-NIRS and multivariate calibration methods is a very fast and simple way to replace the commonly used HPLC-UV method because, in contrast with the traditional techniques, sample pretreatment and reagents are not required and no waste is produced. The calibration models could be improved by different variable selection techniques (for instance interval PLS, interval selectivity ratio, genetic algorithm), which are very fast and user-friendly. The R² (goodness of calibration) and Q² (goodness of validation) of the variable-selected models are considerably increased, the R² values being over 0.90 and the Q² values being over 0.86 in every case. Fivefold cross-validation and external validation were applied. The developed method(s) could be used by quality assurance laboratories for routine measurement of coenzyme Q10 products.


Subject(s)
Data Interpretation, Statistical , Dietary Supplements/analysis , Spectroscopy, Near-Infrared/methods , Ubiquinone/analogs & derivatives , Algorithms , Calibration , Chromatography, High Pressure Liquid , Fourier Analysis , Least-Squares Analysis , Models, Statistical , Reproducibility of Results , Ubiquinone/analysis
12.
J Comput Aided Mol Des ; 27(10): 837-44, 2013 Oct.
Article in English | MEDLINE | ID: mdl-24141986

ABSTRACT

Coefficient of determination (R²) and its leave-one-out cross-validated analogue (denoted by Q² or R²cv) are the most frequently published values to characterize the predictive performance of models. In this article we use R² and Q² in a reversed aspect to determine uncommon points, i.e. influential points in any data sets. The term (1 - Q²)/(1 - R²) corresponds to the ratio of predictive residual sum of squares and the residual sum of squares. The ratio correlates to the number of influential points in experimental and random data sets. We propose an (approximate) F test on the (1 - Q²)/(1 - R²) term to quickly pre-estimate the presence of influential points in training sets of models. The test is founded upon the routinely calculated Q² and R² values and warns the model builders to verify the training set, to perform influence analysis or even to change to robust modeling.


Subject(s)
Models, Molecular , Models, Theoretical , Quantitative Structure-Activity Relationship , Computer Simulation , Data Interpretation, Statistical , Regression Analysis
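
The ratio discussed in this abstract can be computed for an ordinary least-squares model as sketched below. The sketch assumes a linear model with intercept and uses the standard hat-matrix identity for leave-one-out residuals; the function name and the example data are hypothetical, and the (approximate) F test itself is not reproduced because the abstract does not give its degrees of freedom.

```python
import numpy as np


def r2_q2_ratio(X, y):
    """Return (R2, Q2, (1 - Q2) / (1 - R2)) for an OLS fit with intercept.

    Q2 is the leave-one-out value, obtained without refitting by using
    e_loo[i] = e[i] / (1 - h[i, i]), where h is the hat matrix.
    """
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    hat = A @ np.linalg.pinv(A)              # projection onto the model space
    residuals = y - hat @ y
    loo_residuals = residuals / (1.0 - np.diag(hat))
    tss = np.sum((y - y.mean()) ** 2)        # total sum of squares
    rss = np.sum(residuals ** 2)             # residual sum of squares
    press = np.sum(loo_residuals ** 2)       # predictive residual sum of squares
    r2 = 1.0 - rss / tss
    q2 = 1.0 - press / tss
    return r2, q2, (1.0 - q2) / (1.0 - r2)   # the ratio equals PRESS / RSS


# Made-up data: the last point departs strongly from the linear trend.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.1, 0.9, 2.2, 2.8, 4.1, 9.0]
r2, q2, ratio = r2_q2_ratio(x, y)
```

The ratio is never below 1 for such a fit, and it grows when a few high-leverage points dominate the model, which is the behaviour the abstract exploits.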
13.
Anal Bioanal Chem ; 405(25): 8363-75, 2013 Oct.
Article in English | MEDLINE | ID: mdl-23912826

ABSTRACT

Sum of ranking differences (SRD) was applied for comparing multianalyte results obtained by several analytical methods used in one or in different laboratories, i.e., for ranking the overall performances of the methods (or laboratories) in simultaneous determination of the same set of analytes. The data sets for testing of the SRD applicability contained the results reported during one of the proficiency tests (PTs) organized by EU Reference Laboratory for Polycyclic Aromatic Hydrocarbons (EU-RL-PAH). In this way, the SRD was also tested as a discriminant method alternative to existing average performance scores used to compare multianalyte PT results. SRD should be used along with the z scores, the most commonly used PT performance statistics. SRD was further developed to handle the same rankings (ties) among laboratories. Two benchmark concentration series were selected as reference: (a) the assigned PAH concentrations (determined precisely beforehand by the EU-RL-PAH) and (b) the averages of all individual PAH concentrations determined by each laboratory. Ranking relative to the assigned values and also to the average (or median) values pointed to the laboratories with the most extreme results, as well as revealed groups of laboratories with similar overall performances. SRD reveals differences between methods or laboratories even if classical test(s) cannot. The ranking was validated using comparison of ranks by random numbers (a randomization test) and using sevenfold cross-validation, which highlighted the similarities among the (methods used in) laboratories. Principal component analysis and hierarchical cluster analysis justified the findings based on SRD ranking/grouping. If the PAH concentrations are row-scaled (i.e., z scores are analyzed as input for ranking), SRD can still be used for checking the normality of errors. Moreover, cross-validation of SRD on z scores groups the laboratories similarly. 
The SRD technique is general in nature, i.e., it can be applied to any experimental problem in which multianalyte results obtained either by several analytical procedures, analysts, instruments, or laboratories need to be compared.


Subject(s)
Environmental Pollutants/analysis , Polycyclic Aromatic Hydrocarbons/analysis , Chromatography, Liquid/methods , Cluster Analysis , Environmental Monitoring/methods , Mass Spectrometry/methods , Principal Component Analysis
14.
PLoS One ; 18(4): e0284078, 2023.
Article in English | MEDLINE | ID: mdl-37053261

ABSTRACT

Non-negative matrix factorization (NMF) efficiently reduces high dimensionality for many-objective ranking problems. In multi-objective optimization, as long as only three or four conflicting viewpoints are present, an optimal solution can be determined by finding the Pareto front. When the number of the objectives increases, the multi-objective problem evolves into a many-objective optimization task, where the Pareto front becomes oversaturated. The key idea is that NMF aggregates the objectives so that the Pareto front can be applied, while the Sum of Ranking Differences (SRD) method selects the objectives that have a detrimental effect on the aggregation, and validates the findings. The applicability of the method is illustrated by the ranking of 1176 universities based on 46 variables of the CWTS Leiden Ranking 2020 database. The performance of NMF is compared to principal component analysis (PCA) and sparse non-negative matrix factorization-based solutions. The results illustrate that PCA incorporates negatively correlated objectives into the same principal component. On the contrary, NMF only allows non-negative correlations, which enable the proper use of the Pareto front. With the combination of NMF and SRD, a non-biased ranking of the universities based on 46 criteria is established, where Harvard, Rockefeller and Stanford Universities are determined as the first three. To evaluate the ranking capabilities of the methods, measures based on Relative Entropy (RE) and Hypervolume (HV) are proposed. The results confirm that the sparse NMF method provides the most informative ranking. The results highlight that academic excellence can be improved by decreasing the proportion of unknown open-access publications and short-distance collaborations. The proportion of gender indicators barely correlates with scientific impact. 
More authors, long-distance collaborations, publications that have more scientific impact and citations on average highly influence the university ranking in a positive direction.


Subject(s)
Algorithms , Humans , Universities , Principal Component Analysis
15.
PLoS One ; 17(2): e0264277, 2022.
Article in English | MEDLINE | ID: mdl-35213620

ABSTRACT

The Promethee-GAIA method is a multicriteria decision support technique that defines the aggregated ranks of multiple criteria and visualizes them based on Principal Component Analysis (PCA). In the case of numerous criteria, the PCA biplot-based visualization does not reveal how a criterion influences the decision problem. The central question is how the Promethee-GAIA-based decision-making process can be improved to gain more interpretable results that reveal more characteristic inner relationships between the criteria. To improve the Promethee-GAIA method, we suggest three techniques that eliminate redundant criteria, clearly outline which criterion belongs to which factor, and explore the similarities between criteria. These methods are the following: A) Principal factoring with rotation and communality analysis (P-PFA), B) the integration of Sparse PCA into the Promethee II method (P-sPCA), and C) the Sum of Ranking Differences method (P-SRD). The suggested methods are presented through an I4.0+ dataset that measures the Industry 4.0 readiness of NUTS 2-classified regions. The proposed methods are useful tools for handling multicriteria ranking problems when the number of criteria is large.


Subject(s)
Decision Support Techniques , Models, Theoretical , Factor Analysis, Statistical , Industry
16.
Front Chem ; 10: 852893, 2022.
Article in English | MEDLINE | ID: mdl-35755260

ABSTRACT

The screening of compounds for ADME-Tox targets plays an important role in drug design. QSPR models can increase the speed of these specific tasks, although the performance of the models highly depends on several factors, such as the applied molecular descriptors. In this study, a detailed comparison of the most popular descriptor groups has been carried out for six main ADME-Tox classification targets: Ames mutagenicity, P-glycoprotein inhibition, hERG inhibition, hepatotoxicity, blood-brain-barrier permeability, and cytochrome P450 2C9 inhibition. The literature-based, medium-sized binary classification datasets (all above 1,000 molecules) were used for the model building by two common algorithms, XGBoost and the RPropMLP neural network. Five molecular representation sets were compared along with their joint applications: Morgan, Atompairs, and MACCS fingerprints, and the traditional 1D and 2D molecular descriptors, as well as 3D molecular descriptors, separately. The statistical evaluation of the model performances was based on 18 different performance parameters. Although all the developed models were close to the usual performance of QSPR models for each specific ADME-Tox target, the results clearly showed the superiority of the traditional 1D, 2D, and 3D descriptors in the case of the XGBoost algorithm. It is worth trying the classical tools in single model building because the use of 2D descriptors can produce even better models for almost every dataset than the combination of all the examined descriptor sets.

17.
Food Res Int ; 143: 110309, 2021 05.
Article in English | MEDLINE | ID: mdl-33992329

ABSTRACT

In recent decades, eye-movement detection technology has improved significantly, and eye-trackers are available not only as standalone research tools but also as computer peripherals. This rapid spread gives further opportunities to measure the eye-movements of participants. The current paper provides classification models for the prediction of food choice and selects the best one. Four choice sets were presented to 112 volunteer participants, each choice set consisting of four different choice tasks, resulting in altogether sixteen choice tasks. The choice sets followed the 2-, 4-, 6- and 8-alternative forced-choice paradigm. Tobii X2-60 eye-tracker and Tobii Studio software were used to capture and export gazing data, respectively. After variable filtering, thirteen classification models were elaborated and tested; moreover, eight performance parameters were computed. The models were compared based on the performance parameters using the sum of ranking differences algorithm. The algorithm ranks and groups the models by comparing the ranks of their performance metrics to a predefined gold standard. Techniques based on decision trees were superior in all cases, regardless of the choice tasks and food product categories. Among the classifiers, Quinlan's C4.5 and cost-sensitive decision trees proved to be the best-performing ones. Future studies should focus on the fine-tuning of these models as well as their applications with mobile eye-trackers.


Subject(s)
Algorithms , Eye Movements , Humans , Software
18.
Food Chem ; 344: 128617, 2021 May 15.
Article in English | MEDLINE | ID: mdl-33221108

ABSTRACT

Finding optimal solutions usually requires multicriteria optimization. The sum of ranking differences (SRD) algorithm can efficiently solve such problems. Its principles and earlier applications will be discussed here, along with meta-analyses of papers published in various subfields of food science, such as analytics in food chemistry, food engineering, food technology, food microbiology, quality control, and sensory analysis. Carefully selected real case studies give an overview of the wide range of applications for multicriteria optimizations, using a free, easy-to-use and validated method. Results are presented and discussed in a way that helps scientists and practitioners, who are less familiar with multicriteria optimization, to integrate the method into their research projects. The utility of SRD, optionally coupled with other statistical methods such as ANOVA, is demonstrated on altogether twelve case studies, covering diverse method comparison and data evaluation scenarios from various subfields of food science.


Subject(s)
Decision Making , Food Technology , Algorithms , Analysis of Variance , Food Analysis/methods , Food Microbiology , Food Quality , Plant Proteins/chemistry , Plant Proteins/metabolism , Volatile Organic Compounds/analysis , Wine/analysis
19.
Comput Struct Biotechnol J ; 19: 3628-3639, 2021.
Article in English | MEDLINE | ID: mdl-34257841

ABSTRACT

Quantification of similarities between protein sequences or DNA/RNA strands is a (sub-)task that is ubiquitously present in bioinformatics workflows, and is usually accomplished by pairwise comparisons of sequences, utilizing simple (e.g. percent identity) or more intricate concepts (e.g. substitution scoring matrices). Complex tasks (such as clustering) rely on a large number of pairwise comparisons under the hood, instead of a direct quantification of set similarities. Based on our recently introduced framework that enables multiple comparisons of binary molecular fingerprints (i.e., direct calculation of the similarity of fingerprint sets), here we introduce novel symmetric similarity indices for analogous calculations on sets of character sequences with more than two (t) possible items (e.g. DNA/RNA sequences with t = 4, or protein sequences with t = 20). The features of these new indices are studied in detail with analysis of variance (ANOVA), and demonstrated with three case studies of protein/DNA sequences with varying degrees of similarity (or evolutionary proximity). The Python code for the extended many-item similarity indices is publicly available at: https://github.com/ramirandaq/tn_Comparisons.

20.
J Cheminform ; 13(1): 33, 2021 Apr 23.
Article in English | MEDLINE | ID: mdl-33892799

ABSTRACT

Despite being a central concept in cheminformatics, molecular similarity has so far been limited to the simultaneous comparison of only two molecules at a time and using one index, generally the Tanimoto coefficient. In a recent contribution we have not only introduced a complete mathematical framework for extended similarity calculations (i.e. comparisons of more than two molecules at a time) but defined a series of novel indices. Part 1 is a detailed analysis of the effects of various parameters on the similarity values calculated by the extended formulas. Their features were revealed by sum of ranking differences and ANOVA. Here, in addition to characterizing several important aspects of the newly introduced similarity metrics, we will highlight their applicability and utility in real-life scenarios using datasets with popular molecular fingerprints. Remarkably, for large datasets, the use of extended similarity measures provides an unprecedented speed-up over "traditional" pairwise similarity matrix calculations. We also provide illustrative examples of a more direct algorithm based on the extended Tanimoto similarity to select diverse compound sets, resulting in much higher levels of diversity than traditional approaches. We discuss the inner and outer consistency of our indices, which are key in practical applications, showing whether the n-ary and binary indices rank the data in the same way. We demonstrate the use of the new n-ary similarity metrics on t-distributed stochastic neighbor embedding (t-SNE) plots of datasets of varying diversity, or corresponding to ligands of different pharmaceutical targets, which show that our indices provide a better measure of set compactness than standard binary measures. We also present a conceptual example of the applicability of our indices in agglomerative hierarchical algorithms. The Python code for calculating the extended similarity metrics is freely available at: https://github.com/ramirandaq/MultipleComparisons.
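
The linear scaling mentioned in this abstract comes from comparing all fingerprints at once through their column sums, instead of building a pairwise similarity matrix. The sketch below illustrates that idea with a simplified, unweighted Tanimoto-like index. It is not the code from the repository above: the function name, the counter definitions, and the default coincidence threshold (`n % 2`) are assumptions made for illustration, and the published indices may define counters and weights differently.

```python
import numpy as np


def extended_tanimoto(fingerprints, threshold=None):
    """Simplified n-ary Tanimoto-like similarity of a set of binary fingerprints.

    fingerprints: (n_molecules, n_bits) array of 0/1 values.
    threshold: how far a column must lean towards 1s or 0s to count as
    "similar"; assumed to default to n_molecules % 2 in this sketch.
    """
    fps = np.asarray(fingerprints, dtype=int)
    n_molecules, n_bits = fps.shape
    if threshold is None:
        threshold = n_molecules % 2
    # One pass over the data: this is the O(N) step.
    column_sums = fps.sum(axis=0)
    ones_excess = 2 * column_sums - n_molecules   # (#ones - #zeros) per bit
    one_similar = np.count_nonzero(ones_excess > threshold)
    zero_similar = np.count_nonzero(-ones_excess > threshold)
    dissimilar = n_bits - one_similar - zero_similar
    denominator = one_similar + dissimilar
    # Like the pairwise Tanimoto coefficient, shared zeros are ignored.
    return one_similar / denominator if denominator else 0.0


# Hypothetical fingerprints: three molecules, four bits.
value = extended_tanimoto([[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 1, 0]])
```

Adding a molecule only updates the column sums, so the cost grows linearly with the number of molecules, whereas a pairwise matrix needs N(N - 1)/2 comparisons.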
