1.
Bioinformatics ; 38(14): 3621-3628, 2022 07 11.
Article in English | MEDLINE | ID: mdl-35640976

ABSTRACT

MOTIVATION: Medical images can provide rich information about diseases and their biology. However, investigating their association with genetic variation requires non-standard methods. We propose transferGWAS, a novel approach to perform genome-wide association studies directly on full medical images. First, we learn semantically meaningful representations of the images based on a transfer learning task, during which a deep neural network is trained on independent but similar data. Then, we perform genetic association tests with these representations. RESULTS: We validate the type I error rates and power of transferGWAS in simulation studies of synthetic images. Then we apply transferGWAS in a genome-wide association study of retinal fundus images from the UK Biobank. This first-of-a-kind GWAS of full imaging data yielded 60 genomic regions associated with retinal fundus images, of which 7 are novel candidate loci for eye-related traits and diseases. AVAILABILITY AND IMPLEMENTATION: Our method is implemented in Python and available at https://github.com/mkirchler/transferGWAS/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Genome-Wide Association Study ; Neural Networks, Computer ; Genome-Wide Association Study/methods ; Phenotype ; Genome ; Machine Learning
2.
Neuroimage ; 120: 225-53, 2015 Oct 15.
Article in English | MEDLINE | ID: mdl-26067346

ABSTRACT

Neuroscientific data is typically analyzed based on the behavioral response of the participant. However, the errors made may or may not be in line with the neural processing. In particular in experiments with time pressure or studies where the threshold of perception is measured, the error distribution deviates from uniformity due to the structure in the underlying experimental set-up. When we base our analysis on the behavioral labels as usually done, then we ignore this problem of systematic and structured (non-uniform) label noise and are likely to arrive at wrong conclusions in our data analysis. This paper contributes a remedy to this important scenario: we present a novel approach for a) measuring label noise and b) removing structured label noise. We demonstrate its usefulness for EEG data analysis using a standard d2 test for visual attention (N=20 participants).


Subjects
Attention/physiology ; Brain/physiology ; Cognitive Neuroscience/methods ; Electroencephalography/methods ; Evoked Potentials/physiology ; Unsupervised Machine Learning ; Adult ; Female ; Humans ; Male ; Pattern Recognition, Visual ; Young Adult
3.
IEEE Trans Neural Netw Learn Syst ; 34(5): 2259-2270, 2023 May.
Article in English | MEDLINE | ID: mdl-34473630

ABSTRACT

We propose orthogonal inductive matrix completion (OMIC), an interpretable approach to matrix completion based on a sum of multiple orthonormal side information terms, together with nuclear-norm regularization. The approach allows us to inject prior knowledge about the singular vectors of the ground-truth matrix. We optimize the approach by a provably converging algorithm, which optimizes all components of the model simultaneously. We study the generalization capabilities of our method in both the distribution-free setting and in the case where the sampling distribution admits uniform marginals, yielding learning guarantees that improve with the quality of the injected knowledge in both cases. As particular cases of our framework, we present models that can incorporate user and item biases or community information in a joint and additive fashion. We analyze the performance of OMIC on several synthetic and real datasets. On synthetic datasets with a sliding scale of user bias relevance, we show that OMIC better adapts to different regimes than other methods. On real-life datasets containing user/items recommendations and relevant side information, we find that OMIC surpasses the state of the art, with the added benefit of greater interpretability.

4.
Article in English | MEDLINE | ID: mdl-37432811

ABSTRACT

In a recommender systems (RSs) dataset, observed ratings are subject to unequal amounts of noise. Some users might be consistently more conscientious in choosing the ratings they provide for the content they consume. Some items may be very divisive and elicit highly noisy reviews. In this article, we propose a nuclear-norm-based matrix factorization method which relies on side information in the form of an estimate of the uncertainty of each rating. A rating with a higher uncertainty is considered more likely to be erroneous or subject to large amounts of noise, and therefore more likely to mislead the model. Our uncertainty estimate is used as a weighting factor in the loss we optimize. To maintain the favorable scaling and theoretical guarantees coming with nuclear norm regularization even in this weighted context, we introduce an adjusted version of the trace norm regularizer which takes the weights into account. This regularization strategy is inspired by the weighted trace norm, which was introduced to tackle nonuniform sampling regimes in matrix completion. Our method exhibits state-of-the-art performance on both synthetic and real-life datasets in terms of various performance measures, confirming that we have successfully used the auxiliary information extracted.
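
As a rough illustration of how a per-rating uncertainty estimate can act as a weighting factor in a loss, consider the minimal Python sketch below. The function name `weighted_squared_loss` and the weighting rule `1 / (1 + uncertainty)` are assumptions made for this example only; the abstract does not state the exact weighting, and the adjusted trace norm regularizer is not shown.

```python
def weighted_squared_loss(ratings, predictions, uncertainties):
    """Squared error over observed ratings, down-weighted by uncertainty.

    All three arguments are dicts keyed by (user, item) pairs.
    The weight 1 / (1 + uncertainty) is a hypothetical choice for illustration.
    """
    total = 0.0
    for key, rating in ratings.items():
        weight = 1.0 / (1.0 + uncertainties[key])
        total += weight * (rating - predictions[key]) ** 2
    return total


# Two observed ratings: the second has a larger error but also a larger uncertainty.
ratings = {(0, 0): 5.0, (0, 1): 3.0}
predictions = {(0, 0): 4.0, (0, 1): 1.0}
uncertainties = {(0, 0): 0.0, (0, 1): 1.0}
loss = weighted_squared_loss(ratings, predictions, uncertainties)
```

In this toy case the uncertain rating contributes half of its squared error, so a noisy rating pulls the fit less than a trusted one.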

5.
IEEE Trans Pattern Anal Mach Intell ; 45(3): 2952-2969, 2023 Mar.
Article in English | MEDLINE | ID: mdl-35793301

ABSTRACT

Existing unsupervised outlier detection (OD) solutions face a grave challenge with surging visual data like images. Although deep neural networks (DNNs) prove successful for visual data, deep OD remains difficult due to OD's unsupervised nature. This paper proposes a novel framework named E3Outlier that can perform effective and end-to-end deep outlier removal. Its core idea is to introduce self-supervision into deep OD. Specifically, our major solution is to adopt a discriminative learning paradigm that creates multiple pseudo classes from given unlabeled data by various data operations, which enables us to apply prevalent discriminative DNNs (e.g., ResNet) to the unsupervised OD problem. Then, with theoretical and empirical demonstration, we argue that inlier priority, a property that encourages DNN to prioritize inliers during self-supervised learning, makes it possible to perform end-to-end OD. Meanwhile, unlike frequently-used outlierness measures (e.g., density, proximity) in previous OD methods, we explore network uncertainty and validate it as a highly effective outlierness measure, while two practical score refinement strategies are also designed to improve OD performance. Finally, in addition to the discriminative learning paradigm above, we also explore the solutions that exploit other learning paradigms (i.e., generative learning and contrastive learning) to introduce self-supervision for E3Outlier. Such extendibility not only brings further performance gain on relatively difficult datasets, but also enables E3Outlier to be applied to other OD applications like video abnormal event detection. Extensive experiments demonstrate that E3Outlier can considerably outperform state-of-the-art counterparts by 10%-30% AUROC. Demo codes are available at https://github.com/demonzyj56/E3Outlier.

6.
Procedia CIRP ; 115: 83-88, 2022.
Article in English | MEDLINE | ID: mdl-36373025

ABSTRACT

The COVID-19 pandemic and crises like the Ukraine-Russia war have led to numerous restrictions for industrial manufacturing due to interrupted supply chains, staff absences due to illness or quarantine measures, and order situations that changed significantly at short notice. These influences have exposed that it is crucial to address the issue of manufacturing resilience in the context of current disruptions. This can be plausibly guaranteed by subjecting the ML model of a manufacturing system to attacks deliberately designed to fool its prediction. Such attacks can provide useful insights into properties that can increase resilience of manufacturing systems.

7.
Chem Sci ; 13(17): 4854-4862, 2022 May 04.
Article in English | MEDLINE | ID: mdl-35655876

ABSTRACT

Predictive models of thermodynamic properties of mixtures are paramount in chemical engineering and chemistry. Classical thermodynamic models are successful in generalizing over (continuous) conditions like temperature and concentration. On the other hand, matrix completion methods (MCMs) from machine learning successfully generalize over (discrete) binary systems; these MCMs can make predictions without any data for a given binary system by implicitly learning commonalities across systems. In the present work, we combine the strengths from both worlds in a hybrid approach. The underlying idea is to predict the pair-interaction energies, as they are used in basically all physical models of liquid mixtures, by an MCM. As an example, we embed an MCM into UNIQUAC, a widely-used physical model for the Gibbs excess energy. We train the resulting hybrid model in a Bayesian machine-learning framework on experimental data for activity coefficients in binary systems of 1146 components from the Dortmund Data Bank. We thereby obtain, for the first time, a complete set of UNIQUAC parameters for all binary systems of these components, which allows us to predict, in principle, activity coefficients at arbitrary temperature and composition for any combination of these components, not only for binary but also for multicomponent systems. The hybrid model even outperforms the best available physical model for predicting activity coefficients, the modified UNIFAC (Dortmund) model.

8.
IEEE Trans Neural Netw Learn Syst ; 33(10): 5177-5189, 2022 Oct.
Article in English | MEDLINE | ID: mdl-33835924

ABSTRACT

Taking the assumption that data samples can be reconstructed with the dictionary formed by themselves, recent multiview subspace clustering (MSC) algorithms aim to find a consensus reconstruction matrix by exploring complementary information across multiple views. Most of them directly operate on the original data observations without preprocessing, while others operate on the corresponding kernel matrices. However, both ignore that the collected features may be designed arbitrarily and can hardly be guaranteed to be independent and nonoverlapping. As a result, original data observations and kernel matrices would contain a large number of redundant details. To address this issue, we propose an MSC algorithm that groups samples and removes data redundancy concurrently. Specifically, eigendecomposition is employed to obtain a robust, low-redundancy data representation for later clustering. By integrating the two processes into a unified model, clustering results guide eigendecomposition to generate a more discriminative data representation, which, as feedback, helps obtain better clustering results. In addition, an alternate and convergent algorithm is designed to solve the optimization problem. Extensive experiments are conducted on eight benchmarks, and the proposed algorithm outperforms comparative ones in recent literature by a large margin, verifying its superiority. At the same time, its effectiveness, computational efficiency, and robustness to noise are validated experimentally.

9.
IEEE Trans Pattern Anal Mach Intell ; 43(8): 2634-2646, 2021 Aug.
Article in English | MEDLINE | ID: mdl-32086196

ABSTRACT

Incomplete multi-view clustering (IMVC) optimally combines multiple pre-specified incomplete views to improve clustering performance. Among various excellent solutions, the recently proposed multiple kernel k-means with incomplete kernels (MKKM-IK) forms a benchmark, which redefines IMVC as a joint optimization problem where the clustering and kernel matrix imputation tasks are alternately performed until convergence. Although MKKM-IK demonstrates promising performance in various applications, we observe that its manner of kernel matrix imputation incurs intensive computational and storage complexities, over-complicated optimization, and only limited improvement in clustering performance. In this paper, we first propose an Efficient and Effective Incomplete Multi-view Clustering (EE-IMVC) algorithm to address these issues. Instead of completing the incomplete kernel matrices, EE-IMVC proposes to impute each incomplete base matrix generated by incomplete views with a learned consensus clustering matrix. Moreover, we further improve this algorithm by incorporating prior knowledge to regularize the learned consensus clustering matrix. Two three-step iterative algorithms are carefully developed to solve the resultant optimization problems with linear computational complexity, and their convergence is theoretically proven. After that, we theoretically study the generalization bound of the proposed algorithms. Furthermore, we conduct comprehensive experiments to study the proposed algorithms in terms of clustering accuracy, evolution of the learned consensus clustering matrix and the convergence. The results show that our algorithms significantly and consistently outperform several state-of-the-art ones.

10.
J Phys Chem Lett ; 11(3): 981-985, 2020 Feb 06.
Article in English | MEDLINE | ID: mdl-31964142

ABSTRACT

Activity coefficients, which are a measure of the nonideality of liquid mixtures, are a key property in chemical engineering with relevance to modeling chemical and phase equilibria as well as transport processes. Although experimental data on thousands of binary mixtures are available, prediction methods are needed to calculate the activity coefficients in many relevant mixtures that have not been explored to date. In this report, we propose a probabilistic matrix factorization model for predicting the activity coefficients in arbitrary binary mixtures. Although no physical descriptors for the considered components were used, our method outperforms the state-of-the-art method that has been refined over three decades while requiring much less training effort. This opens perspectives to novel methods for predicting physicochemical properties of binary mixtures with the potential to revolutionize modeling and simulation in chemical engineering.

11.
IEEE Trans Pattern Anal Mach Intell ; 42(5): 1191-1204, 2020 May.
Article in English | MEDLINE | ID: mdl-30640600

ABSTRACT

Multiple kernel clustering (MKC) algorithms optimally combine a group of pre-specified base kernel matrices to improve clustering performance. However, existing MKC algorithms cannot efficiently address the situation where some rows and columns of base kernel matrices are absent. This paper proposes two simple yet effective algorithms to address this issue. Different from existing approaches where incomplete kernel matrices are first imputed and a standard MKC algorithm is applied to the imputed kernel matrices, our first algorithm integrates imputation and clustering into a unified learning procedure. Specifically, we perform multiple kernel clustering directly with the presence of incomplete kernel matrices, which are treated as auxiliary variables to be jointly optimized. Our algorithm does not require that there be at least one complete base kernel matrix over all the samples. Also, it adaptively imputes incomplete kernel matrices and combines them to best serve clustering. Moreover, we further improve this algorithm by encouraging these incomplete kernel matrices to mutually complete each other. The three-step iterative algorithm is designed to solve the resultant optimization problems. After that, we theoretically study the generalization bound of the proposed algorithms. Extensive experiments are conducted on 13 benchmark data sets to compare the proposed algorithms with existing imputation-based methods. Our algorithms consistently achieve superior performance and the improvement becomes more significant with increasing missing ratio, verifying the effectiveness and advantages of the proposed joint imputation and clustering.

12.
IEEE Trans Neural Netw Learn Syst ; 29(9): 3994-4006, 2018 09.
Article in English | MEDLINE | ID: mdl-28961127

ABSTRACT

We present ClusterSVDD, a methodology that unifies support vector data descriptions (SVDDs) and $k$-means clustering into a single formulation. This allows both methods to benefit from one another, i.e., by adding flexibility using multiple spheres for SVDDs and increasing anomaly resistance and flexibility through kernels to $k$-means. In particular, our approach leads to a new interpretation of $k$-means as a regularized mode seeking algorithm. The unifying formulation further allows for deriving new algorithms by transferring knowledge from one-class learning settings to clustering settings and vice versa. As a showcase, we derive a clustering method for structured data based on a one-class learning scenario. Additionally, our formulation can be solved via a particularly simple optimization scheme. We evaluate our approach empirically to highlight some of the proposed benefits on artificially generated data, as well as on real-world problems, and provide a Python software package comprising various implementations of primal and dual SVDD as well as our proposed ClusterSVDD.

13.
PLoS One ; 12(6): e0178161, 2017.
Article in English | MEDLINE | ID: mdl-28570703

ABSTRACT

Training of one-vs.-rest SVMs can be parallelized over the number of classes in a straightforward way. Given enough computational resources, one-vs.-rest SVMs can thus be trained on data involving a large number of classes. The same cannot be stated, however, for the so-called all-in-one SVMs, which require solving a quadratic program whose size is quadratic in the number of classes. We develop distributed algorithms for two all-in-one SVM formulations (Lee et al. and Weston and Watkins) that parallelize the computation evenly over the number of classes. This allows us to compare these models to one-vs.-rest SVMs on an unprecedented scale. The results indicate superior accuracy on text classification data.
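
The per-class parallelism of one-vs.-rest training can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `train_binary` is a hypothetical stand-in (plain subgradient descent on the hinge loss) for a real SVM solver, all function names are invented for this example, and a thread pool is used for brevity where a process pool or a cluster would be used for real CPU-bound training. The distributed all-in-one algorithms of the paper are not reproduced here.

```python
from concurrent.futures import ThreadPoolExecutor


def train_binary(X, y, epochs=20, lr=0.1, lam=0.01):
    """Fit a linear hinge-loss classifier by subgradient descent.

    Hypothetical stand-in for a real SVM solver; labels y are +1 or -1.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # Shrink the weights (L2 regularization step).
            w = [wj * (1.0 - lr * lam) for wj in w]
            if margin < 1.0:
                # Margin violated: update to increase this sample's margin.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def train_one_vs_rest(X, labels, max_workers=4):
    """Train one independent binary problem per class, in parallel."""
    classes = sorted(set(labels))

    def fit(c):
        # Relabel: class c against all other classes.
        y = [1 if label == c else -1 for label in labels]
        return c, train_binary(X, y)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(fit, classes))


def predict(models, x):
    """Pick the class whose binary classifier gives the highest score."""
    def score(c):
        w, b = models[c]
        return sum(wj * xj for wj, xj in zip(w, x)) + b
    return max(models, key=score)
```

Because each call to `fit` touches only its own relabeled copy of the targets, the binary problems share no state. An all-in-one formulation couples all classes in a single quadratic program, so it cannot be split this simply.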


Subjects
Support Vector Machine ; Algorithms ; Models, Theoretical
14.
PLoS One ; 12(3): e0174392, 2017.
Article in English | MEDLINE | ID: mdl-28346487

ABSTRACT

High prediction accuracies are not the only objective to consider when solving problems using machine learning. Instead, particular scientific applications require some explanation of the learned prediction function. For computational biology, positional oligomer importance matrices (POIMs) have been successfully applied to explain the decision of support vector machines (SVMs) using weighted-degree (WD) kernels. To extract relevant biological motifs from POIMs, the motifPOIM method has been devised and showed promising results on real-world data. Our contribution in this paper is twofold: as an extension to POIMs, we propose gPOIM, a general measure of feature importance for arbitrary learning machines and feature sets (including, but not limited to, SVMs and CNNs) and devise a sampling strategy for efficient computation. As a second contribution, we derive a convex formulation of motifPOIMs that leads to more reliable motif extraction from gPOIMs. Empirical evaluations confirm the usefulness of our approach on artificially generated data as well as on real-world datasets.


Subjects
Computational Biology/methods ; Machine Learning ; Support Vector Machine ; Algorithms
15.
Sci Rep ; 6: 36671, 2016 11 28.
Article in English | MEDLINE | ID: mdl-27892471

ABSTRACT

The standard approach to the analysis of genome-wide association studies (GWAS) is based on testing each position in the genome individually for statistical significance of its association with the phenotype under investigation. To improve the analysis of GWAS, we propose a combination of machine learning and statistical testing that takes correlation structures within the set of SNPs under investigation in a mathematically well-controlled manner into account. The novel two-step algorithm, COMBI, first trains a support vector machine to determine a subset of candidate SNPs and then performs hypothesis tests for these SNPs together with an adequate threshold correction. Applying COMBI to data from a WTCCC study (2007) and measuring performance as replication by independent GWAS published within the 2008-2015 period, we show that our method outperforms ordinary raw p-value thresholding as well as other state-of-the-art methods. COMBI presents higher power and precision than the examined alternatives while yielding fewer false (i.e. non-replicated) and more true (i.e. replicated) discoveries when its results are validated on later GWAS studies. More than 80% of the discoveries made by COMBI upon WTCCC data have been validated by independent studies. Implementations of the COMBI method are available as a part of the GWASpi toolbox 2.0.
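
The two-step structure described above (screen candidate SNPs with a trained model, then test only those candidates against a threshold) can be sketched as follows. This is a simplified illustration, not the COMBI implementation: the function name `screen_then_test` is hypothetical, the SVM weights and p-values are assumed to be computed elsewhere, and the threshold correction is taken as a given input because the abstract does not describe how it is chosen.

```python
def screen_then_test(svm_weights, p_values, top_k, threshold):
    """Return sorted indices of SNPs that pass both steps.

    Step 1 (screening): keep the top_k SNPs by absolute SVM weight.
    Step 2 (testing): among those, keep SNPs whose p-value is below threshold.
    """
    ranked = sorted(
        range(len(svm_weights)),
        key=lambda j: abs(svm_weights[j]),
        reverse=True,
    )
    candidates = set(ranked[:top_k])
    return sorted(j for j in candidates if p_values[j] < threshold)


# Hypothetical inputs for five SNPs.
weights = [0.1, -2.0, 0.5, 1.5, -0.05]
p_values = [1e-9, 1e-6, 0.2, 1e-4, 1e-8]
selected = screen_then_test(weights, p_values, top_k=3, threshold=1e-3)
```

In this toy input, SNP 0 has the smallest p-value but is dropped at the screening step because its weight is small, which shows how the first step restricts the set of hypotheses that are tested at all.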

16.
PLoS One ; 10(12): e0144782, 2015.
Article in English | MEDLINE | ID: mdl-26690911

ABSTRACT

Identifying discriminative motifs underlying the functionality and evolution of organisms is a major challenge in computational biology. Machine learning approaches such as support vector machines (SVMs) achieve state-of-the-art performances in genomic discrimination tasks, but--due to their black-box character--motifs underlying their decision functions are largely unknown. As a remedy, positional oligomer importance matrices (POIMs) allow us to visualize the significance of position-specific subsequences. Although they are a major step towards the explanation of trained SVM models, they suffer from the fact that their size grows exponentially in the length of the motif, which renders their manual inspection feasible only for comparably small motif sizes, typically k ≤ 5. In this work, we extend the work on positional oligomer importance matrices by presenting a new machine-learning methodology, entitled motifPOIM, to extract the truly relevant motifs--regardless of their length and complexity--underlying the predictions of a trained SVM model. Our framework thereby considers the motifs as free parameters in a probabilistic model, a task which can be phrased as a non-convex optimization problem. The exponential dependence of the POIM size on the oligomer length poses a major numerical challenge, which we address by an efficient optimization framework that allows us to find possibly overlapping motifs consisting of up to hundreds of nucleotides. We demonstrate the efficacy of our approach on a synthetic data set as well as a real-world human splice site data set.


Subjects
Machine Learning ; Models, Genetic ; Nucleotide Motifs ; Sequence Analysis, DNA/methods ; Humans
17.
IEEE Trans Neural Netw Learn Syst ; 25(5): 870-81, 2014 May.
Article in English | MEDLINE | ID: mdl-24808034

ABSTRACT

The task of structured output prediction deals with learning general functional dependencies between arbitrary input and output spaces. In this context, two loss-sensitive formulations for maximum-margin training have been proposed in the literature, which are referred to as margin and slack rescaling, respectively. The latter is believed to be more accurate and easier to handle. Nevertheless, it is not popular due to the lack of known efficient inference algorithms; therefore, margin rescaling--which requires a similar type of inference as normal structured prediction--is the most often used approach. Focusing on the task of label sequence learning, we here define a general framework that can handle a large class of inference problems based on Hamming-like loss functions and the concept of decomposability for the underlying joint feature map. In particular, we present an efficient generic algorithm that can handle both rescaling approaches and is guaranteed to find an optimal solution in polynomial time.
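
To make the difference between margin and slack rescaling concrete, the sketch below evaluates both losses by brute force over a tiny label-sequence space with a Hamming loss. It follows the standard textbook definitions of the two rescaling schemes rather than the paper's algorithm; the function names and the score table are hypothetical, and exhaustive enumeration is exponential in the sequence length, which is exactly the cost an efficient inference algorithm avoids.

```python
from itertools import product


def hamming(y_true, y):
    """Number of positions where the two label sequences differ."""
    return sum(a != b for a, b in zip(y_true, y))


def margin_rescaled_loss(score, y_true, candidates):
    """max over y of: loss(y) + score(y) - score(y_true), clipped at 0."""
    worst = max(hamming(y_true, y) + score(y) - score(y_true) for y in candidates)
    return max(0.0, worst)


def slack_rescaled_loss(score, y_true, candidates):
    """max over y of: loss(y) * (1 + score(y) - score(y_true)), clipped at 0."""
    worst = max(
        hamming(y_true, y) * (1.0 + score(y) - score(y_true)) for y in candidates
    )
    return max(0.0, worst)


# Hypothetical model scores for all binary label sequences of length 2.
table = {(0, 0): 2.0, (0, 1): 1.5, (1, 0): 0.0, (1, 1): 1.0}
candidates = list(product([0, 1], repeat=2))
y_true = (0, 0)

margin_loss = margin_rescaled_loss(table.get, y_true, candidates)
slack_loss = slack_rescaled_loss(table.get, y_true, candidates)
```

In margin rescaling the Hamming loss is added to the score difference, so the maximization decomposes like ordinary prediction. In slack rescaling the loss multiplies the margin violation, which is what makes its inference problem harder.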

18.
PLoS One ; 7(10): e42947, 2012.
Article in English | MEDLINE | ID: mdl-23118845

ABSTRACT

We provide a novel interpretation of the dual of support vector machines (SVMs) in terms of scatter with respect to class prototypes and their mean. As a key contribution, we extend this framework to multiple classes, providing a new joint Scatter SVM algorithm, at the level of its binary counterpart in the number of optimization variables. This enables us to implement computationally efficient solvers based on sequential minimal and chunking optimization. As a further contribution, the primal problem formulation is developed in terms of regularized risk minimization and the hinge loss, revealing the score function to be used in the actual classification of test patterns. We investigate Scatter SVM properties related to generalization ability, computational efficiency, sparsity and sensitivity maps, and report promising results.


Subjects
Algorithms ; Models, Theoretical ; Support Vector Machine ; Artificial Intelligence ; Data Interpretation, Statistical ; Humans ; Pattern Recognition, Automated
19.
PLoS One ; 7(8): e38897, 2012.
Article in English | MEDLINE | ID: mdl-22936970

ABSTRACT

Combining information from various image features has become a standard technique in concept recognition tasks. However, the optimal way of fusing the resulting kernel functions is usually unknown in practical applications. Multiple kernel learning (MKL) techniques make it possible to determine an optimal linear combination of such similarity matrices. Classical approaches to MKL promote sparse mixtures. Unfortunately, 1-norm regularized MKL variants are often observed to be outperformed by an unweighted sum kernel. The main contributions of this paper are the following: we apply a recently developed non-sparse MKL variant to state-of-the-art concept recognition tasks from the application domain of computer vision. We provide insights on benefits and limits of non-sparse MKL and compare it against its direct competitors, the sum-kernel SVM and sparse MKL. We report empirical results for the PASCAL VOC 2009 Classification and ImageCLEF2010 Photo Annotation challenge data sets. Data sets (kernel matrices) as well as further information are available at http://doc.ml.tu-berlin.de/image_mkl/ (Accessed 2012 Jun 25).
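
A linear combination of kernel matrices, with the unweighted sum kernel as a special case, can be sketched in plain Python as below. The function names are hypothetical, and the rescaling of the weights to unit p-norm is an assumption chosen to illustrate a non-sparse mixture; the learning step that would actually determine the weights is not shown.

```python
def normalize_p(weights, p):
    """Rescale non-negative kernel weights to unit p-norm (assumed constraint)."""
    norm = sum(abs(w) ** p for w in weights) ** (1.0 / p)
    return [w / norm for w in weights]


def combine_kernels(kernels, weights):
    """Weighted sum of square kernel matrices given as nested lists."""
    n = len(kernels[0])
    return [
        [sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]


# Two toy 2x2 kernel matrices (hypothetical values).
K1 = [[1.0, 0.0], [0.0, 1.0]]
K2 = [[1.0, 1.0], [1.0, 1.0]]

# Unweighted sum kernel: every kernel gets weight 1.
sum_kernel = combine_kernels([K1, K2], [1.0, 1.0])

# Non-sparse mixture: both weights stay non-zero after 2-norm normalization.
mixed_weights = normalize_p([3.0, 4.0], p=2)
mixed_kernel = combine_kernels([K1, K2], mixed_weights)
```

A sparse mixture would drive some weights to exactly zero and discard the corresponding kernels, whereas the sum kernel and the mixture above keep every kernel in play.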


Subjects
Algorithms ; Software ; Models, Theoretical ; Pattern Recognition, Automated