Search | VHL Regional Portal

Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge.

Vinyals, Oriol; Toshev, Alexander; Bengio, Samy; Erhan, Dumitru.

IEEE Trans Pattern Anal Mach Intell ; 39(4): 652-663, 2017 Apr.

Article in English | MEDLINE | ID: mdl-28055847

ABSTRACT

Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. Finally, given the recent surge of interest in this task, a competition was organized in 2015 using the newly released COCO dataset. We describe and analyze the various improvements we applied to our own baseline and show the resulting performance in the competition, which we won ex-aequo with a team from Microsoft Research.
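
The training objective mentioned in the abstract, maximizing the likelihood of the description given the image, is commonly written as a sum of per-word log-probabilities. The following is a minimal sketch under that assumption; the function name and the probability values are hypothetical, and in practice the probabilities would come from the recurrent decoder.

```python
import math

def sentence_log_likelihood(step_probs):
    """Sum of log p(word_t | image, previous words) over one caption.

    step_probs: probabilities the model assigned to each correct word
    (hypothetical values here; a real model would produce them).
    """
    if any(p <= 0.0 or p > 1.0 for p in step_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return sum(math.log(p) for p in step_probs)

# Training would minimize the negative log-likelihood of the reference caption.
probs = [0.5, 0.25, 0.5]
nll = -sentence_log_likelihood(probs)
```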

PRIME: probabilistic initial 3D model generation for single-particle cryo-electron microscopy.

Elmlund, Hans; Elmlund, Dominika; Bengio, Samy.

Structure ; 21(8): 1299-306, 2013 Aug 06.

Article in English | MEDLINE | ID: mdl-23931142

ABSTRACT

Low-dose electron microscopy of cryo-preserved individual biomolecules (single-particle cryo-EM) is a powerful tool for obtaining information about the structure and dynamics of large macromolecular assemblies. Acquiring images with low dose reduces radiation damage, preserves atomic structural details, but results in low signal-to-noise ratio of the individual images. The projection directions of the two-dimensional images are random and unknown. The grand challenge is to achieve the precise three-dimensional (3D) alignment of many (tens of thousands to millions) noisy projection images, which may then be combined to obtain a faithful 3D map. An accurate initial 3D model is critical for obtaining the precise 3D alignment required for high-resolution (<10 Å) map reconstruction. We report a method (PRIME) that, in a single step and without prior structural knowledge, can generate an accurate initial 3D map directly from the noisy images.

Subjects

Cryoelectron Microscopy/methods; Macromolecular Substances/ultrastructure; Imaging, Three-Dimensional/methods; Models, Molecular; Models, Statistical; Ribosomes/ultrastructure; Software

Sound retrieval and ranking using sparse auditory representations.

Lyon, Richard F; Rehn, Martin; Bengio, Samy; Walters, Thomas C; Chechik, Gal.

Neural Comput ; 22(9): 2390-416, 2010 Sep 01.

Article in English | MEDLINE | ID: mdl-20569181

ABSTRACT

To create systems that understand the sounds that humans are exposed to in everyday life, we need to represent sounds with features that can discriminate among many different sound classes. Here, we use a sound-ranking framework to quantitatively evaluate such representations in a large-scale task. We have adapted a machine-vision method, the passive-aggressive model for image retrieval (PAMIR), which efficiently learns a linear mapping from a very large sparse feature space to a large query-term space. Using this approach, we compare different auditory front ends and different ways of extracting sparse features from high-dimensional auditory images. We tested auditory models that use an adaptive pole-zero filter cascade (PZFC) auditory filter bank and sparse-code feature extraction from stabilized auditory images with multiple vector quantizers. In addition to auditory image models, we compare a family of more conventional mel-frequency cepstral coefficient (MFCC) front ends. The experimental results show a significant advantage for the auditory models over vector-quantized MFCCs. When thousands of sound files with a query vocabulary of thousands of words were ranked, the best precision at top-1 was 73% and the average precision was 35%, reflecting an 18% improvement over the best competing MFCC front end.

Subjects

Auditory Perception/physiology; Models, Neurological; Humans; Sound
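
The abstract above describes learning a linear mapping between a feature space and a query-term space with a passive-aggressive method. The following is a minimal sketch of one generic passive-aggressive ranking step on a (query, relevant item, irrelevant item) triplet. It is an illustration only, not necessarily the paper's exact procedure; the function names, the dense list-of-lists matrix, the unit margin, and the aggressiveness cap `C` are all assumptions.

```python
def score(W, q, x):
    """Bilinear relevance score q^T W x."""
    return sum(q[i] * sum(W[i][j] * x[j] for j in range(len(x)))
               for i in range(len(q)))

def pa_rank_update(W, q, x_pos, x_neg, C=1.0):
    """One passive-aggressive step; returns (new W, hinge loss before the step)."""
    loss = max(0.0, 1.0 - score(W, q, x_pos) + score(W, q, x_neg))
    if loss == 0.0:
        return W, loss  # passive: the margin is already satisfied
    d = [p - n for p, n in zip(x_pos, x_neg)]
    # Squared Frobenius norm of the outer product q d^T.
    norm_sq = sum(v * v for v in q) * sum(v * v for v in d)
    if norm_sq == 0.0:
        return W, loss  # no direction to move in
    tau = min(C, loss / norm_sq)  # aggressive step size, capped by C
    W_new = [[W[i][j] + tau * q[i] * d[j] for j in range(len(d))]
             for i in range(len(q))]
    return W_new, loss

# Hypothetical toy triplet: 2 query terms, 3 item features.
W0 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
q, x_pos, x_neg = [1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
W1, loss1 = pa_rank_update(W0, q, x_pos, x_neg)
W2, loss2 = pa_rank_update(W1, q, x_pos, x_neg)
```

After the first step the relevant item outscores the irrelevant one by the full margin, so the second call leaves the matrix unchanged.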

A discriminative kernel-based approach to rank images from text queries.

Grangier, David; Bengio, Samy.

IEEE Trans Pattern Anal Mach Intell ; 30(8): 1371-84, 2008 Aug.

Article in English | MEDLINE | ID: mdl-18566492

ABSTRACT

This paper introduces a discriminative model for the retrieval of images from text queries. Our approach formalizes the retrieval task as a ranking problem, and introduces a learning procedure optimizing a criterion related to the ranking performance. The proposed model hence addresses the retrieval problem directly and does not rely on an intermediate image annotation task, which contrasts with previous research. Moreover, our learning procedure builds upon recent work on the online learning of kernel-based classifiers. This yields an efficient, scalable algorithm, which can benefit from recent kernels developed for image comparison. The experiments performed over stock photography data show the advantage of our discriminative ranking approach over state-of-the-art alternatives (e.g. our model yields 26.3% average precision over the Corel dataset, which should be compared to 22.0%, for the best alternative model evaluated). Further analysis of the results shows that our model is especially advantageous over difficult queries such as queries with few relevant pictures or multiple-word queries.

Subjects

Algorithms; Artificial Intelligence; Image Interpretation, Computer-Assisted/methods; Information Storage and Retrieval/methods; Natural Language Processing; Pattern Recognition, Automated/methods; Discriminant Analysis; Image Enhancement/methods; Vocabulary, Controlled
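
The abstract above reports retrieval quality as average precision. As a minimal sketch of how that measure is usually computed for a single query, assuming the ranking covers the whole collection so that every relevant item appears in it (the function name is hypothetical):

```python
def average_precision(ranked_relevance):
    """Average precision for one query.

    ranked_relevance: list of booleans, best-ranked item first,
    True where the item is relevant to the query.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant item
    return precision_sum / hits if hits else 0.0

# Relevant items at ranks 1 and 3: (1/1 + 2/3) / 2
ap = average_precision([True, False, True])
```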

Performance generalization in biometric authentication using joint user-specific and sample bootstraps.

Poh, Norman; Martin, Alvin; Bengio, Samy.

IEEE Trans Pattern Anal Mach Intell ; 29(3): 492-8, 2007 Mar.

Article in English | MEDLINE | ID: mdl-17224618

ABSTRACT

Biometric authentication performance is often depicted by a detection error trade-off (DET) curve. We show that this curve is dependent on the choice of samples available, the demographic composition and the number of users specific to a database. We propose a two-step bootstrap procedure to take into account the three mentioned sources of variability. This is an extension of Bolle et al.'s bootstrap subset technique. Preliminary experiments on the NIST2005 and XM2VTS benchmark databases are encouraging, e.g., the average result across all 24 systems evaluated on NIST2005 indicates that one can predict, with more than 75 percent of DET coverage, an unseen DET curve with eight times more users. Furthermore, our findings suggest that with more data available, the confidence intervals become smaller and, hence, more useful.

Subjects

Algorithms; Artificial Intelligence; Biometry/methods; Face/anatomy & histology; Image Interpretation, Computer-Assisted/methods; Pattern Recognition, Automated/methods; Speech Recognition Software; Computer Simulation; Humans; Models, Statistical; Reproducibility of Results; Sensitivity and Specificity
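
The two-step bootstrap described in the abstract above can be pictured as resampling at two levels. The following is a minimal sketch of a generic two-level bootstrap, assuming that users are drawn first and each drawn user's scores are then resampled; the authors' exact procedure may differ, and the function name and toy data are hypothetical.

```python
import random

def two_step_bootstrap(scores_by_user, rng):
    """Draw one bootstrap replicate.

    Step 1: resample users with replacement.
    Step 2: for each drawn user, resample that user's scores with replacement.
    scores_by_user: dict mapping a user id to a non-empty list of scores.
    """
    users = list(scores_by_user)
    drawn_users = [rng.choice(users) for _ in users]
    replicate = []
    for user in drawn_users:
        samples = scores_by_user[user]
        replicate.append((user, [rng.choice(samples) for _ in samples]))
    return replicate

# Hypothetical match scores for three users.
data = {"u1": [0.9, 0.8, 0.7], "u2": [0.2, 0.4], "u3": [0.6]}
replicate = two_step_bootstrap(data, random.Random(0))
```

Repeating the draw many times and recomputing the error curve on each replicate would give an empirical spread from which confidence intervals can be read.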

Automatic analysis of multimodal group actions in meetings.

McCowan, Iain; Gatica-Perez, Daniel; Bengio, Samy; Lathoud, Guillaume; Barnard, Mark; Zhang, Dong.

IEEE Trans Pattern Anal Mach Intell ; 27(3): 305-17, 2005 Mar.

Article in English | MEDLINE | ID: mdl-15747787

ABSTRACT

This paper investigates the recognition of group actions in meetings. A framework is employed in which group actions result from the interactions of the individual participants. The group actions are modeled using different HMM-based approaches, where the observations are provided by a set of audiovisual features monitoring the actions of individuals. Experiments demonstrate the importance of taking interactions into account in modeling the group actions. It is also shown that the visual modality contains useful information, even for predominantly audio-based events, motivating a multimodal approach to meeting analysis.

Subjects

Algorithms; Artificial Intelligence; Behavioral Sciences/methods; Group Processes; Information Storage and Retrieval/methods; Pattern Recognition, Automated/methods; Social Behavior; Cluster Analysis; Computer Simulation; Humans; Models, Biological; Models, Statistical; Reproducibility of Results; Sensitivity and Specificity

Evaluation of formant-like features on an automatic vowel classification task.

de Wet, Febe; Weber, Katrin; Boves, Louis; Cranen, Bert; Bengio, Samy; Bourlard, Hervé.

J Acoust Soc Am ; 116(3): 1781-92, 2004 Sep.

Article in English | MEDLINE | ID: mdl-15478445

ABSTRACT

Numerous attempts have been made to find low-dimensional, formant-related representations of speech signals that are suitable for automatic speech recognition. However, it is often not known how these features behave in comparison with true formants. The purpose of this study was to compare two sets of automatically extracted formant-like features, i.e., robust formants and HMM2 features, to hand-labeled formants. The robust formant features were derived by means of the split Levinson algorithm while the HMM2 features correspond to the frequency segmentation of speech signals obtained by two-dimensional hidden Markov models. Mel-frequency cepstral coefficients (MFCCs) were also included in the investigation as an example of state-of-the-art automatic speech recognition features. The feature sets were compared in terms of their performance on a vowel classification task. The speech data and hand-labeled formants that were used in this study are a subset of the American English vowels database presented in Hillenbrand et al. [J. Acoust. Soc. Am. 97, 3099-3111 (1995)]. Classification performance was measured on the original, clean data and in noisy acoustic conditions. When using clean data, the classification performance of the formant-like features compared very well to the performance of the hand-labeled formants in a gender-dependent experiment, but was inferior to the hand-labeled formants in a gender-independent experiment. The results that were obtained in noisy acoustic conditions indicated that the formant-like features used in this study are not inherently noise robust. For clean and noisy data as well as for the gender-dependent and gender-independent experiments the MFCCs achieved the same or superior results as the formant features, but at the price of a much higher feature dimensionality.

Subjects

Phonetics; Speech Acoustics; Algorithms; Databases, Factual; Discriminant Analysis; Female; Humans; Male; Markov Chains; Models, Biological; Noise; Sex Factors

Offline recognition of unconstrained handwritten texts using HMMs and statistical language models.

Vinciarelli, Alessandro; Bengio, Samy; Bunke, Horst.

IEEE Trans Pattern Anal Mach Intell ; 26(6): 709-20, 2004 Jun.

Article in English | MEDLINE | ID: mdl-18579932

ABSTRACT

This paper presents a system for the offline recognition of large vocabulary unconstrained handwritten texts. The only assumption made about the data is that it is written in English. This allows the application of Statistical Language Models in order to improve the performance of our system. Several experiments have been performed using both single and multiple writer data. Lexica of variable size (from 10,000 to 50,000 words) have been used. The use of language models is shown to improve the accuracy of the system (when the lexicon contains 50,000 words, the error rate is reduced by approximately 50 percent for single writer data and by approximately 25 percent for multiple writer data). Our approach is described in detail and compared with other methods presented in the literature to deal with the same problem. An experimental setup to correctly deal with unconstrained text recognition is proposed.

Subjects

Artificial Intelligence; Biometry/methods; Electronic Data Processing/methods; Handwriting; Image Interpretation, Computer-Assisted/methods; Information Storage and Retrieval/methods; Pattern Recognition, Automated/methods; Algorithms; Computer Graphics; Documentation; Image Enhancement/methods; Markov Chains; Models, Statistical; Numerical Analysis, Computer-Assisted; Reproducibility of Results; Sensitivity and Specificity; Signal Processing, Computer-Assisted; Subtraction Technique; User-Computer Interface

A parallel mixture of SVMs for very large scale problems.

Collobert, Ronan; Bengio, Samy; Bengio, Yoshua.

Neural Comput ; 14(5): 1105-14, 2002 May.

Article in English | MEDLINE | ID: mdl-11972909

ABSTRACT

Support vector machines (SVMs) are the state-of-the-art models for many classification problems, but they suffer from the complexity of their training algorithm, which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve real-life problems having more than a few hundred thousand examples with SVMs. This article proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole data set. Experiments on a large benchmark data set (Forest) yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples). In addition, and surprisingly, a significant improvement in generalization was observed.

Subjects

Algorithms; Artificial Intelligence; Software
