ABSTRACT
Although it is one of the first and most elegant tools for dimensionality reduction, Fisher linear discriminant analysis (FLDA) is not currently considered among the top methods for feature extraction or classification. In this paper, we will review two recent approaches to FLDA, namely, least squares Fisher discriminant analysis (LSFDA) and regularized kernel FDA (RKFDA), and propose deep FDA (DFDA), a straightforward nonlinear extension of LSFDA that takes advantage of recent advances in deep neural networks. We will compare the performance of RKFDA and DFDA on a large number of two-class and multiclass problems, many of them involving class-imbalanced data sets and some having quite large sample sizes; for this, we will use the areas under the receiver operating characteristic (ROC) curves of the classifiers considered. As we shall see, the classification performance of both methods is often very similar and particularly good on imbalanced problems, but building DFDA models is considerably faster than building RKFDA ones, particularly on problems with quite large sample sizes.
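The least squares route to FDA that LSFDA builds on can be sketched in a few lines. For two classes, ordinary least squares regression on the textbook coded targets n/n1 and -n/n2 recovers the Fisher direction Sw^{-1}(m1 - m2) up to scale; the data below are synthetic and this is the classical equivalence, not necessarily the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic two-class data in 2-D
X1 = rng.normal([0.0, 0.0], 1.0, size=(60, 2))
X2 = rng.normal([2.0, 1.0], 1.0, size=(40, 2))
X = np.vstack([X1, X2])
n, n1, n2 = len(X), len(X1), len(X2)

# Classical FLDA direction: w proportional to Sw^{-1} (m1 - m2)
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
w_flda = np.linalg.solve(Sw, m1 - m2)

# Least squares formulation: regress the coded targets n/n1 and -n/n2
t = np.concatenate([np.full(n1, n / n1), np.full(n2, -n / n2)])
A = np.hstack([X, np.ones((n, 1))])          # bias column
w_ls = np.linalg.lstsq(A, t, rcond=None)[0][:2]

# The two directions coincide up to scale
cos = w_flda @ w_ls / (np.linalg.norm(w_flda) * np.linalg.norm(w_ls))
print(round(abs(cos), 6))                    # → 1.0
```

DFDA, as described in the abstract, would replace the linear map of LSFDA with a deep network trained on the same least squares criterion.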
ABSTRACT
High-content screening (HCS) allows the exploration of complex cellular phenotypes by automated microscopy and is increasingly being adopted for small interfering RNA genomic screening and phenotypic drug discovery. We introduce a series of cell-based evaluation metrics that have been implemented and validated in a mono-parametric HCS for regulators of the membrane trafficking protein caveolin 1 (CAV1) and have also proved useful for the development of a multiparametric phenotypic HCS for regulators of cytoskeletal reorganization. Imaging metrics evaluate imaging quality such as staining and focus, whereas cell biology metrics are fuzzy logic-based evaluators describing complex biological parameters such as sparseness, confluency, and spreading. The evaluation metrics were implemented in a data-mining pipeline, which first filters out cells that do not pass a quality criterion based on imaging metrics and then uses cell biology metrics to stratify cell samples to allow further analysis of homogeneous cell populations. Use of these metrics significantly improved the robustness of the monoparametric assay tested, as revealed by an increase in Z' factor, Kolmogorov-Smirnov distance, and strict standard mean difference. Cell biology evaluation metrics were also implemented in a novel supervised learning classification method that combines them with phenotypic features in a statistical model that exceeded conventional classification methods, thus improving multiparametric phenotypic assay sensitivity.
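A fuzzy logic evaluator of the kind described can be sketched with trapezoidal membership functions over a simple image-level measurement. The "confluency" fuzzy sets and breakpoints below are hypothetical illustrations, not the values of the published assay.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 0 below a, ramps up on [a, b],
    equals 1 on [b, c], ramps back down to 0 on [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

# Illustrative fuzzy sets for a "confluency" metric (fraction of the
# field covered by cells); the breakpoints are hypothetical.
def confluency_memberships(coverage):
    return {
        "sparse":    trapezoid(coverage, -0.01, 0.0, 0.15, 0.35),
        "normal":    trapezoid(coverage, 0.15, 0.35, 0.65, 0.85),
        "confluent": trapezoid(coverage, 0.65, 0.85, 1.0, 1.01),
    }

m = confluency_memberships(0.2)
print(max(m, key=m.get))   # → sparse (dominant fuzzy label at 20% coverage)
```

Stratifying samples then amounts to grouping images by their dominant fuzzy label, so downstream analysis runs on homogeneous cell populations.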
Subjects
Drug Evaluation, Preclinical/methods , High-Throughput Screening Assays/methods , Cell Line, Tumor , Fuzzy Logic , Humans , Microscopy, Confocal , Microscopy, Fluorescence , ROC Curve , Reproducibility of Results
ABSTRACT
In this brief, we give a new proof of the asymptotic convergence of the sequential minimum optimization (SMO) algorithm for both the most-violating-pair and second-order rules to select the pair of coefficients to be updated. The proof is more self-contained, shorter, and simpler than previous ones and has a different flavor, partially building upon Gilbert's original convergence proof of his algorithm to solve the minimum norm problem for convex hulls. It is valid for both support vector classification (SVC) and support vector regression, which are formulated under a general problem that encompasses them. Moreover, this general problem can be further extended to also cover other support vector machine (SVM)-related problems such as ν-SVC or one-class SVMs, while the convergence proof of the slight variant of SMO needed for them remains basically unchanged.
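The most-violating-pair rule can be illustrated with a minimal SMO loop for the standard C-SVC dual. This is a textbook sketch on a synthetic toy problem with a linear kernel, not the general problem formulated in the paper: at each step it picks the pair with the largest KKT gap, takes the clipped optimal step along the feasible direction, and stops when the gap falls below a tolerance.

```python
import numpy as np

def smo_svc(X, y, C=1.0, tol=1e-3, max_iter=10_000):
    """Minimal SMO for the C-SVC dual with most-violating-pair selection.
    Minimizes f(a) = 0.5 a'Qa - e'a  s.t.  0 <= a <= C,  y'a = 0,
    with Q_ij = y_i y_j K(x_i, x_j) and a linear kernel K."""
    n = len(y)
    K = X @ X.T
    Q = (y[:, None] * y[None, :]) * K
    a = np.zeros(n)
    grad = -np.ones(n)                        # grad f(a) = Q a - e
    for _ in range(max_iter):
        up = ((y > 0) & (a < C)) | ((y < 0) & (a > 0))
        low = ((y < 0) & (a < C)) | ((y > 0) & (a > 0))
        yg = -y * grad
        i = np.flatnonzero(up)[np.argmax(yg[up])]    # most violating "up" index
        j = np.flatnonzero(low)[np.argmin(yg[low])]  # most violating "low" index
        if yg[i] - yg[j] <= tol:              # KKT gap closed: stop
            break
        eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
        t = (yg[i] - yg[j]) / max(eta, 1e-12)
        # clip the step so both multipliers stay inside [0, C]
        t = min(t,
                C - a[i] if y[i] > 0 else a[i],
                a[j] if y[j] > 0 else C - a[j])
        a[i] += y[i] * t
        a[j] -= y[j] * t
        grad += Q[:, i] * (y[i] * t) - Q[:, j] * (y[j] * t)
    w = (a * y) @ X                           # primal weights (linear kernel)
    free = (a > 1e-8) & (a < C - 1e-8)
    b = float(np.mean(y[free] - X[free] @ w)) if free.any() else 0.0
    return a, w, b

# Tiny separable toy problem
X = np.array([[2., 2.], [3., 3.], [2., 3.], [-2., -2.], [-3., -1.], [-2., -3.]])
y = np.array([1., 1., 1., -1., -1., -1.])
a, w, b = smo_svc(X, y)
print(np.all(np.sign(X @ w + b) == y))
```

The equality constraint y'a = 0 is preserved exactly at every step, which is the invariant the convergence analysis works with.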
ABSTRACT
Implicit Wiener series are a powerful tool to build Volterra representations of time series with any degree of non-linearity. A natural question is then whether higher order representations yield more useful models. In this work we shall study this question for ECoG data channel relationships in epileptic seizure recordings, considering whether quadratic representations yield more accurate classifiers than linear ones. To do so, we first show how to derive statistical information on the Volterra coefficient distribution and how to construct seizure classification patterns from that information. As our results illustrate, a quadratic model seems to provide no advantages over a linear one. Nevertheless, we shall also show that the interpretability of the implicit Wiener series provides insights into the inter-channel relationships of the recordings.
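The linear-versus-quadratic question can be made concrete on a synthetic example: a second-order Volterra expansion fitted by least squares captures a product term that a first-order (linear) model cannot. The system, coefficients, and memory length below are hypothetical illustrations, not derived from the ECoG data.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
M = 3                                   # memory length of the expansion

# Hypothetical quadratic system: y_t = x_t - 0.5 x_{t-1} + 0.3 x_t x_{t-2}
y = np.zeros_like(x)
y[2:] = x[2:] - 0.5 * x[1:-1] + 0.3 * x[2:] * x[:-2]

# Lagged design matrices: first order (linear) and second order (quadratic)
lags = np.column_stack([np.roll(x, k) for k in range(M)])[M:]
target = y[M:]
lin = np.column_stack([np.ones(len(lags)), lags])
quad = np.column_stack([lin] + [lags[:, i:i + 1] * lags[:, j:j + 1]
                                for i in range(M) for j in range(i, M)])

def fit_mse(design, t):
    coef, *_ = np.linalg.lstsq(design, t, rcond=None)
    return float(np.mean((design @ coef - t) ** 2))

print(fit_mse(lin, target) > 1e-3)      # → True: linear misses the product term
print(fit_mse(quad, target) < 1e-10)    # → True: quadratic fits it exactly
```

The fitted second-order coefficients play the role of the Volterra kernel entries whose distribution the classification patterns are built from.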
Subjects
Electroencephalography , Epilepsy/physiopathology , Nonlinear Dynamics , Algorithms , Electrodes , Humans , Models, Biological
ABSTRACT
This paper reports on the evaluation of different machine learning techniques for the automated classification of coding gene sequences obtained from several organisms according to their functional role as adhesins. Diverse, biologically meaningful, sequence-based features were extracted from the sequences and used as inputs to the in silico prediction models. Another contribution of this work is the generation of potentially novel and testable predictions about the surface protein DGF-1 family in Trypanosoma cruzi. Finally, these techniques are potentially useful for the automated annotation of known adhesin-like proteins from the trans-sialidase surface protein family in T. cruzi, the etiological agent of Chagas disease.
Subjects
Computational Biology/methods , Membrane Proteins/chemistry , Trypanosoma cruzi/metabolism , Animals , Artificial Intelligence , Chagas Disease/parasitology , Databases, Protein , Glycoproteins/chemistry , Humans , Models, Statistical , Multigene Family , Neuraminidase/chemistry , Proteomics/methods , Protozoan Proteins/chemistry , Trypanosoma cruzi/classification , Trypanosoma cruzi/genetics
ABSTRACT
Non Linear Discriminant Analysis (NLDA) networks combine a standard Multilayer Perceptron (MLP) transfer function with the minimization of a Fisher analysis criterion. In this work we will define natural-like gradients for NLDA network training. Instead of a more principled approach, which would require the definition of an appropriate Riemannian structure on the NLDA weight space, we will follow a simpler procedure, based on the observation that the gradient of the NLDA criterion function J can be written as the expectation ∇J(W) = E[Z(X,W)] of a certain random vector Z, and then defining I = E[Z(X,W) Z(X,W)^T] as the Fisher information matrix in this case. This definition of I formally coincides with that of the information matrix for the MLP or other square error functions; the NLDA criterion J, however, does not have this structure. Although very simple, the proposed approach shows much faster convergence than standard gradient descent, even when its higher per-iteration cost is taken into account. While the faster convergence of natural MLP batch training can also be explained in terms of its relationship with the Gauss-Newton minimization method, this is not the case for NLDA training, as we will see analytically and numerically that the Hessian and information matrices are different.
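The construction ∇J(W) = E[Z(X,W)], I = E[Z(X,W) Z(X,W)^T] can be sketched generically from per-sample gradients. The stand-in criterion below is logistic loss on synthetic data, NOT the NLDA Fisher criterion of the paper, and the damping term eps is an added assumption for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic two-class data; logistic loss is a stand-in criterion J here
X = np.vstack([rng.normal(-1.5, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
labels = np.concatenate([np.zeros(50), np.ones(50)])
A = np.hstack([X, np.ones((100, 1))])        # append a bias column

def per_sample_grads(w):
    """Rows are Z(x_k, w), so that grad J(w) = E[Z] is their mean."""
    p = 1.0 / (1.0 + np.exp(-A @ w))
    return (p - labels)[:, None] * A

w = np.zeros(3)
eps = 1e-3                                    # damping, added for stability
for _ in range(20):
    Z = per_sample_grads(w)
    g = Z.mean(axis=0)                        # grad J(w) = E[Z(X, w)]
    I = Z.T @ Z / len(Z)                      # empirical  I = E[Z Z^T]
    w -= np.linalg.solve(I + eps * np.eye(3), g)   # natural-gradient step

p = 1.0 / (1.0 + np.exp(-A @ w))
acc = float(np.mean((p > 0.5) == (labels == 1)))
print(acc > 0.9)
```

The loop is the natural-gradient iteration the work proposes: preconditioning the batch gradient E[Z] by the empirical information matrix E[Z Z^T] instead of using the raw gradient.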