Results 1 - 15 of 15
2.
Sci Robot ; 9(89): eadi9579, 2024 Apr 17.
Article in English | MEDLINE | ID: mdl-38630806

ABSTRACT

Humanoid robots that can autonomously operate in diverse environments have the potential to help address labor shortages in factories, assist the elderly at home, and colonize new planets. Although classical controllers for humanoid robots have shown impressive results in a number of settings, they are challenging to generalize and adapt to new environments. Here, we present a fully learning-based approach for real-world humanoid locomotion. Our controller is a causal transformer that takes the history of proprioceptive observations and actions as input and predicts the next action. We hypothesized that the observation-action history contains useful information about the world that a powerful transformer model can use to adapt its behavior in context, without updating its weights. We trained our model with large-scale model-free reinforcement learning on an ensemble of randomized environments in simulation and deployed it to the real world zero-shot. Our controller could walk over various outdoor terrains, was robust to external disturbances, and could adapt in context.
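The controller above is, at its core, a decoder attending causally over the observation-action history. A minimal NumPy sketch of one causal self-attention pass, with no learned projections (queries, keys, and values are the raw features) and invented shapes, not the paper's actual architecture:

```python
import numpy as np

def causal_attention(x):
    """One causal self-attention pass: each timestep attends only to itself
    and earlier timesteps, so the policy can adapt using its history."""
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)             # (T, T) pairwise similarities
    mask = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(mask, scores, -np.inf)  # block attention to the future
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ x                              # context-mixed features

rng = np.random.default_rng(0)
history = rng.normal(size=(8, 16))  # 8 steps of proprioception+action features
out = causal_attention(history)
# the first timestep can only attend to itself, so its output is unchanged
assert np.allclose(out[0], history[0])
```

Because of the causal mask, earlier timesteps never see later ones, which is what makes the same pass usable online at every control step.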


Subjects
Robotics; Humans; Aged; Robotics/methods; Locomotion; Walking; Learning; Reinforcement, Psychology
3.
IEEE Trans Pattern Anal Mach Intell ; 45(12): 15380-15393, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37540611

ABSTRACT

Similarity learning has been recognized as a crucial step for object tracking. However, existing multiple object tracking methods only use sparse ground truth matching as the training objective, while ignoring the majority of the informative regions in images. In this paper, we present Quasi-Dense Similarity Learning, which densely samples hundreds of object regions on a pair of images for contrastive learning. We combine this similarity learning with multiple existing object detectors to build Quasi-Dense Tracking (QDTrack), which does not require displacement regression or motion priors. We find that the resulting distinctive feature space admits a simple nearest neighbor search at inference time for object association. In addition, we show that our similarity learning scheme is not limited to video data, but can learn effective instance similarity even from static input, enabling a competitive tracking performance without training on videos or using tracking supervision. We conduct extensive experiments on a wide variety of popular MOT benchmarks. We find that, despite its simplicity, QDTrack rivals the performance of state-of-the-art tracking methods on all benchmarks and sets a new state-of-the-art on the large-scale BDD100K MOT benchmark, while introducing negligible computational overhead to the detector.
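The association step the abstract describes can be sketched as nearest-neighbor search in the learned embedding space. The cosine metric and the mutual-best-match filter below are illustrative assumptions, not QDTrack's exact inference rule:

```python
import numpy as np

def associate(feats_a, feats_b):
    """Associate object embeddings between two frames by nearest-neighbor
    search in a similarity space (here: cosine similarity)."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    sim = a @ b.T                      # (Na, Nb) cosine similarities
    match = sim.argmax(axis=1)         # best frame-B candidate per frame-A object
    # keep only mutual best matches for robustness
    mutual = sim.argmax(axis=0)[match] == np.arange(len(a))
    return [(i, int(match[i])) for i in range(len(a)) if mutual[i]]

# three near-orthogonal embeddings in shuffled order should match up exactly
x = np.eye(3) + 0.01
pairs = associate(x, x[[2, 0, 1]])
assert pairs == [(0, 1), (1, 2), (2, 0)]
```

No displacement regression or motion prior appears anywhere: if the embedding space is distinctive enough, appearance alone resolves the assignment.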

4.
IEEE Trans Pattern Anal Mach Intell ; 45(2): 1992-2008, 2023 Feb.
Article in English | MEDLINE | ID: mdl-35439131

ABSTRACT

A reliable and accurate 3D tracking framework is essential for predicting future locations of surrounding objects and planning the observer's actions in numerous applications such as autonomous driving. We propose a framework that can effectively associate moving objects over time and estimate their full 3D bounding box information from a sequence of 2D images captured on a moving platform. The object association leverages quasi-dense similarity learning to identify objects in various poses and viewpoints with appearance cues only. After initial 2D association, we further utilize 3D bounding-box depth-ordering heuristics for robust instance association and motion-based 3D trajectory prediction for re-identification of occluded vehicles. Finally, an LSTM-based object velocity learning module aggregates the long-term trajectory information for more accurate motion extrapolation. Experiments on our proposed simulation data and real-world benchmarks, including the KITTI, nuScenes, and Waymo datasets, show that our tracking framework offers robust object association and tracking in urban-driving scenarios. On the Waymo Open benchmark, we establish the first camera-only baseline in the 3D tracking and 3D detection challenges. Our quasi-dense 3D tracking pipeline achieves impressive improvements on the nuScenes 3D tracking benchmark with nearly five times the tracking accuracy of the best vision-only submission among all published methods.
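The motion-based re-identification idea can be sketched with a constant-velocity stand-in for the paper's LSTM velocity module; the gating radius and shapes below are invented for illustration:

```python
import numpy as np

def predict_center(track, dt=1.0):
    """Constant-velocity extrapolation of a track's 3D center, used to
    re-identify an object after an occlusion gap (a simplified stand-in
    for the paper's learned velocity model)."""
    track = np.asarray(track, dtype=float)
    v = track[-1] - track[-2]          # last observed per-frame velocity
    return track[-1] + v * dt

def reid_by_motion(track, detections, gap=2.0, radius=1.0):
    """Match a lost track to the new detection closest to its predicted
    position, if any lies within `radius` of the prediction."""
    pred = predict_center(track, dt=gap)
    d = np.linalg.norm(np.asarray(detections) - pred, axis=1)
    j = int(d.argmin())
    return j if d[j] < radius else None

track = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]   # moving +1 unit/frame in x
dets = [[3.1, 0.0, 0.0], [10.0, 5.0, 0.0]]   # candidates after a 2-frame gap
assert reid_by_motion(track, dets) == 0
```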

5.
IEEE Trans Pattern Anal Mach Intell ; 45(3): 3032-3046, 2023 Mar.
Article in English | MEDLINE | ID: mdl-35704542

ABSTRACT

Recent progress in image recognition has stimulated the deployment of vision systems at an unprecedented scale. As a result, visual data are now often consumed not only by humans but also by machines. Existing image processing methods only optimize for better human perception, yet the resulting images may not be accurately recognized by machines. This can be undesirable, e.g., the images can be improperly handled by search engines or recommendation systems. In this work, we examine simple approaches to improve machine recognition of processed images: optimizing the recognition loss directly on the image processing network or through an intermediate input transformation model. Interestingly, the processing model's ability to enhance recognition quality can transfer when evaluated on models of different architectures, recognized categories, tasks, and training datasets. This makes the methods applicable even when we have no knowledge of future recognition models, e.g., when uploading processed images to the Internet. We conduct experiments on multiple image processing tasks paired with ImageNet classification and PASCAL VOC detection as recognition tasks. With these simple yet effective methods, substantial accuracy gains can be achieved with strong transferability and minimal image quality loss. Through a user study, we further show that the accuracy gain can transfer to a black-box cloud model. Finally, we try to explain this transferability phenomenon by demonstrating the similarities of different models' decision boundaries. Code is available at https://github.com/liuzhuang13/Transferable_RA.
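The first approach named above, optimizing a recognition loss on the processing network itself, can be sketched with a one-parameter "processing model" and a fixed toy recognizer. Everything here (the linear recognizer, the single gain parameter, the loss weight) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=8)          # degraded input image (flattened, toy scale)
target = 2.0 * x                # "clean" image the processor should recover
W = rng.normal(size=(3, 8))     # fixed, pretrained recognizer (toy linear)
label = 0

def recog_loss(img):
    """Cross-entropy of the fixed recognizer on the processed image."""
    z = W @ img
    z -= z.max()
    return -z[label] + np.log(np.exp(z).sum())

def total_loss(g, lam=0.1):
    """Restoration loss plus recognition loss: the processing model
    (here a single gain g) is trained to serve both objectives."""
    restored = g * x
    return np.mean((restored - target) ** 2) + lam * recog_loss(restored)

# crude 1-D gradient descent on the processing parameter
g, lr = 0.0, 0.05
for _ in range(200):
    eps = 1e-5
    grad = (total_loss(g + eps) - total_loss(g - eps)) / (2 * eps)
    g -= lr * grad

assert total_loss(g) < total_loss(0.0)  # training reduced the joint objective
```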

6.
IEEE Trans Pattern Anal Mach Intell ; 42(3): 749-763, 2020 03.
Article in English | MEDLINE | ID: mdl-30575529

ABSTRACT

Fine-grained classification describes the automated recognition of visually similar object categories such as bird species. Previous works were usually based on explicit pose normalization, i.e., the detection and description of object parts. However, recent models based on a final global average or bilinear pooling have achieved a comparable accuracy without this concept. In this paper, we analyze the advantages of these approaches over generic CNNs and explicit pose normalization approaches. We also show how they can achieve an implicit normalization of the object pose. A novel visualization technique called activation flow is introduced to investigate limitations in pose handling in traditional CNNs like AlexNet and VGG. Afterward, we present and compare the explicit pose normalization approach neural activation constellations and a generalized framework for the final global average and bilinear pooling called α-pooling. We observe that the latter often achieves a higher accuracy, improving common CNN models by up to 22.9 percent, but lacks the interpretability of the explicit approaches. We present a visualization approach for understanding and analyzing predictions of the model to address this issue. Furthermore, we show that our approaches for fine-grained recognition are beneficial for other fields like action recognition.
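A sketch of the generalized pooling idea, assuming nonnegative (post-ReLU) features so the power is well defined; the paper's exact signed formulation differs slightly. Setting α = 1 reduces to averaging (in outer-product form), while α = 2 gives bilinear pooling:

```python
import numpy as np

def alpha_pool(feats, alpha):
    """Generalized orderless pooling over spatial positions: average of
    the outer product between each local feature and its (alpha-1) power.
    feats is (positions, channels) and assumed nonnegative."""
    powed = feats ** (alpha - 1)
    return np.mean(powed[:, :, None] * feats[:, None, :], axis=0)

rng = np.random.default_rng(2)
f = np.abs(rng.normal(size=(49, 8)))   # 7x7 grid of 8-d post-ReLU features
avg_like = alpha_pool(f, 1.0)          # every row equals the average vector
bilinear = alpha_pool(f, 2.0)          # full second-order statistics
assert np.allclose(avg_like, np.tile(f.mean(axis=0)[None, :], (8, 1)))
assert np.allclose(bilinear, f.T @ f / len(f))
```

The single parameter α thus interpolates between first-order and second-order pooling, which is what lets one framework cover both baselines.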


Subjects
Image Processing, Computer-Assisted; Neural Networks, Computer; Pattern Recognition, Automated; Algorithms; Animals; Machine Learning
7.
IEEE Trans Pattern Anal Mach Intell ; 39(4): 677-691, 2017 04.
Article in English | MEDLINE | ID: mdl-27608449

ABSTRACT

Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent are effective for tasks involving sequences, visual and otherwise. We describe a class of recurrent convolutional architectures which is end-to-end trainable and suitable for large-scale visual understanding tasks, and demonstrate the value of these models for activity recognition, image captioning, and video description. In contrast to previous models which assume a fixed visual representation or perform simple temporal averaging for sequential processing, recurrent convolutional models are "doubly deep" in that they learn compositional representations in space and time. Learning long-term dependencies is possible when nonlinearities are incorporated into the network state updates. Differentiable recurrent models are appealing in that they can directly map variable-length inputs (e.g., videos) to variable-length outputs (e.g., natural language text) and can model complex temporal dynamics; yet they can be optimized with backpropagation. Our recurrent sequence models are directly connected to modern visual convolutional network models and can be jointly trained to learn temporal dynamics and convolutional perceptual representations. Our results show that such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined or optimized.
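The "doubly deep" composition above can be sketched as per-frame visual features fed through a recurrent state update. For brevity this uses a plain tanh RNN cell rather than the paper's LSTM, and the feature extractor is assumed to have run already:

```python
import numpy as np

def rnn_over_features(frame_feats, Wx, Wh, b):
    """Recurrent pass over per-frame CNN features: spatial representations
    (already extracted) composed with a learned temporal state."""
    h = np.zeros(Wh.shape[0])
    for x in frame_feats:
        h = np.tanh(Wx @ x + Wh @ h + b)   # state update mixes input + history
    return h                               # final state summarizes the clip

rng = np.random.default_rng(3)
T, d, hdim = 5, 16, 8
feats = rng.normal(size=(T, d))            # one CNN feature vector per frame
Wx = rng.normal(size=(hdim, d))
Wh = rng.normal(size=(hdim, hdim))
h = rnn_over_features(feats, Wx, Wh, np.zeros(hdim))
assert h.shape == (hdim,) and np.all(np.abs(h) <= 1.0)
```

Because the loop consumes one frame at a time, the same model handles variable-length inputs, which is the property the abstract highlights for video-to-text mapping.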

8.
IEEE Trans Pattern Anal Mach Intell ; 39(4): 640-651, 2017 04.
Article in English | MEDLINE | ID: mdl-27244717

ABSTRACT

Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, improve on the previous best result in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional networks achieve improved segmentation of PASCAL VOC (30% relative improvement to 67.2% mean IU on 2012), NYUDv2, SIFT Flow, and PASCAL-Context, while inference takes one tenth of a second for a typical image.
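The skip architecture can be sketched as upsampling deep, coarse class scores and summing them with shallow, fine-resolution scores. Nearest-neighbor upsampling stands in for the paper's learned deconvolution, and the toy maps are single-class for brevity:

```python
import numpy as np

def upsample(x, factor):
    """Nearest-neighbor upsampling of an (H, W) score map."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def skip_fuse(coarse, fine):
    """FCN-style skip connection: combine deep semantic scores with
    shallow appearance-resolution scores by elementwise sum."""
    factor = fine.shape[0] // coarse.shape[0]
    return upsample(coarse, factor) + fine

coarse = np.array([[1.0, 2.0], [3.0, 4.0]])   # deep layer, 2x2
fine = np.zeros((4, 4))                       # shallow layer, 4x4
fused = skip_fuse(coarse, fine)
assert fused.shape == (4, 4)
assert fused[0, 0] == 1.0 and fused[3, 3] == 4.0
```

The fusion keeps the output at the fine layer's resolution, which is why the combined prediction is both semantically informed and spatially detailed.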

9.
IEEE Trans Pattern Anal Mach Intell ; 38(1): 142-58, 2016 Jan.
Article in English | MEDLINE | ID: mdl-26656583

ABSTRACT

Object detection performance, as measured on the canonical PASCAL VOC Challenge datasets, plateaued in the final years of the competition. The best-performing methods were complex ensemble systems that typically combined multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 50 percent relative to the previous best result on VOC 2012, achieving a mAP of 62.4 percent. Our approach combines two ideas: (1) one can apply high-capacity convolutional networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data are scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, boosts performance significantly. Since we combine region proposals with CNNs, we call the resulting model an R-CNN or Region-based Convolutional Network. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.
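The pipeline can be sketched in miniature: warp each bottom-up proposal to a fixed size, extract features, and score per class. The featurizer and classifier below are toy stand-ins (the real system uses a CNN and per-class SVMs), and the warping uses nearest-neighbor sampling:

```python
import numpy as np

def crop_and_warp(image, box, size=4):
    """Warp a region proposal to a fixed input size by nearest-neighbor
    sampling, as each proposal is warped before the network."""
    x0, y0, x1, y1 = box
    ys = np.linspace(y0, y1 - 1, size).round().astype(int)
    xs = np.linspace(x0, x1 - 1, size).round().astype(int)
    return image[np.ix_(ys, xs)]

def score_proposals(image, boxes, featurize, classifier):
    """R-CNN in miniature: warp each proposal, featurize, and score."""
    return [classifier(featurize(crop_and_warp(image, b))) for b in boxes]

img = np.zeros((16, 16)); img[4:8, 4:8] = 1.0       # bright "object"
boxes = [(4, 4, 8, 8), (10, 10, 14, 14)]            # one on-object, one off
featurize = lambda patch: patch.ravel()             # stand-in for CNN features
classifier = lambda f: float(f.mean())              # stand-in for SVM score
scores = score_proposals(img, boxes, featurize, classifier)
assert scores[0] > scores[1]
```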

10.
IEEE Trans Pattern Anal Mach Intell ; 37(5): 1001-12, 2015 May.
Article in English | MEDLINE | ID: mdl-26353324

ABSTRACT

Real-time multiclass object recognition is a problem of great practical importance. In this paper, we describe a framework that simultaneously utilizes shared representation, reconstruction sparsity, and parallelism to enable real-time multiclass object detection with deformable part models at 5 Hz on a laptop computer with almost no decrease in task performance. Our framework is trained in the standard structured output prediction formulation and is generically applicable for speeding up object recognition systems where the computational bottleneck is in multiclass, multi-convolutional inference. We experimentally demonstrate the efficiency and task performance of our method on the PASCAL VOC, Caltech101, and Caltech256 datasets and a subset of ImageNet.
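One way to read "shared representation plus reconstruction sparsity": approximate many class filters as sparse combinations of a few shared basis filters, so only the basis filters must be convolved with the image. The 1-D signals and one-nonzero-per-row codes below are illustrative assumptions; the identity they exploit is just linearity of convolution:

```python
import numpy as np

rng = np.random.default_rng(7)
basis = rng.normal(size=(3, 5))             # small shared dictionary of parts
codes = np.zeros((20, 3))                   # sparse reconstruction coefficients
codes[np.arange(20), rng.integers(0, 3, 20)] = rng.normal(size=20)
filters = codes @ basis                     # 20 class filters, sparse combos
signal = rng.normal(size=100)

def conv_all(fs, x):
    return np.stack([np.convolve(x, f, mode='valid') for f in fs])

naive = conv_all(filters, signal)           # one convolution per class filter
# shared: convolve with the 3 basis filters once, then mix the responses
shared = codes @ conv_all(basis, signal)
assert np.allclose(naive, shared)
```

With 20 filters reconstructed from 3 bases, the convolution count drops from 20 to 3 plus a cheap matrix mix, which is where the speedup comes from.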

11.
IEEE Trans Pattern Anal Mach Intell ; 36(11): 2185-98, 2014 Nov.
Article in English | MEDLINE | ID: mdl-26353060

ABSTRACT

To produce images that are suitable for display, tone-mapping is widely used in digital cameras to map linear color measurements into narrow gamuts with limited dynamic range. This introduces non-linear distortion that must be undone, through a radiometric calibration process, before computer vision systems can analyze such photographs radiometrically. This paper considers the inherent uncertainty of undoing the effects of tone-mapping. We observe that this uncertainty varies substantially across color space, making some pixels more reliable than others. We introduce a model for this uncertainty and a method for fitting it to a given camera or imaging pipeline. Once fit, the model provides for each pixel in a tone-mapped digital photograph a probability distribution over linear scene colors that could have induced it. We demonstrate how these distributions can be useful for visual inference by incorporating them into estimation algorithms for a representative set of vision tasks.
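The color-dependent uncertainty of undoing a tone curve can be illustrated with a simple gamma mapping: a fixed amount of quantization noise in the tone-mapped value corresponds to a wide interval of linear values where the curve is flat and a narrow one where it is steep. The gamma curve and noise level are assumptions for illustration, not the paper's fitted model:

```python
import numpy as np

def invert_gamma(y, gamma=2.2, noise=1e-3):
    """Invert a gamma tone curve y = x**(1/gamma) and report the local
    uncertainty of the recovered linear value via the inverse's slope."""
    x = y ** gamma                           # point estimate of the inverse
    dxdy = gamma * y ** (gamma - 1.0)        # local slope of the inverse map
    return x, dxdy * noise                   # (estimate, ~std of estimate)

x_dark, u_dark = invert_gamma(0.05)
x_bright, u_bright = invert_gamma(0.95)
# the same pixel-value noise yields far more linear-color uncertainty
# for bright pixels than for dark ones under a gamma curve
assert u_bright > u_dark
```

This is the sense in which some pixels are more radiometrically reliable than others, and why a per-pixel distribution is more informative than a single inverted value.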

12.
Acad Emerg Med ; 19(11): 1227-34, 2012 Nov.
Article in English | MEDLINE | ID: mdl-23167852

ABSTRACT

OBJECTIVES: Reuniting children with their families after a disaster poses unique challenges. The objective was to pilot test the ability of a novel image-based tool to assist a parent in identifying a picture of his or her children. METHODS: A previously developed image-based indexing and retrieval tool that employs two advanced vision search algorithms was used. One algorithm, Feature-Attribute-Matching, extracts facial features (skin color, eye color, and age) of a photograph and then matches according to parental input. The other algorithm, User-Feedback, allows parents to choose children on the screen that appear similar to theirs and then reprioritizes the images in the database. This was piloted in a convenience sample of parent-child pairs in a pediatric tertiary care hospital. A photograph of each participating child was added to a preexisting image database. A double-blind randomized crossover trial was performed to measure the percentage of database reviewed and time using the Feature-Attribute-Matching-plus-User-Feedback strategy or User-Feedback strategy only. Search results were compared to a theoretical random search. Afterward, parents completed a survey evaluating satisfaction. RESULTS: Fifty-one parent-child pairs completed the study. The Feature-Attribute-Matching-plus-User-Feedback strategy was superior to the User-Feedback strategy in decreasing the percentage of database reviewed (mean ± SD = 24.1 ± 20.1% vs. 35.6 ± 27.2%; mean difference = -11.5%; 95% confidence interval [CI] = -21.5% to -1.4%; p = 0.03). Both were superior to the random search (p < 0.001). Time for both searches was similar despite fewer images reviewed in the Feature-Attribute-Matching-plus-User-Feedback strategy. Sixty-eight percent of parents were satisfied with the search and 87% felt that this tool would be very or extremely helpful in a disaster. 
CONCLUSIONS: This novel image-based reunification system reduced the number of images reviewed before parents identified their children. This technology could be further developed to assist future family reunifications in a disaster.
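The two search strategies can be sketched as ranking stages over a toy attribute database. The attribute encoding, distance, and reranking rule below are invented for illustration; the study's actual algorithms operate on photographs:

```python
import numpy as np

# toy database: each child photo reduced to (skin_tone, eye_color, age_band)
db = np.array([[0, 1, 2], [1, 0, 1], [1, 1, 2], [2, 2, 0]])

def attribute_rank(db, query):
    """Feature-attribute matching: order the database by how many of the
    parent-reported attributes each photo matches."""
    matches = (db == np.asarray(query)).sum(axis=1)
    return np.argsort(-matches, kind='stable')

def feedback_rerank(db, order, liked):
    """User-feedback step: move records similar to a parent-selected photo
    (small attribute distance) toward the front of the ranking."""
    d = np.abs(db - db[liked]).sum(axis=1)
    return sorted(order, key=lambda i: d[i])

order = attribute_rank(db, (1, 1, 2))
assert order[0] == 2                       # exact attribute match ranks first
order = feedback_rerank(db, order, liked=2)
assert order[0] == 2
```

Both stages only reorder the database; the measured benefit in the study is that a parent reviews a smaller fraction of it before finding their child.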


Subjects
Algorithms; Disasters; Family; Image Processing, Computer-Assisted; Patient Identification Systems/methods; Photography; Adult; Child; Child, Preschool; Confidence Intervals; Cross-Over Studies; Double-Blind Method; Emergency Responders; Feedback; Female; Humans; Male; Pilot Projects
13.
IEEE Trans Pattern Anal Mach Intell ; 31(9): 1700-7, 2009 Sep.
Article in English | MEDLINE | ID: mdl-19574628

ABSTRACT

We study the problem of automatic visual speech recognition (VSR) using dynamic Bayesian network (DBN)-based models consisting of multiple sequences of hidden states, each corresponding to an articulatory feature (AF) such as lip opening (LO) or lip rounding (LR). A bank of discriminative articulatory feature classifiers provides input to the DBN, in the form of either virtual evidence (VE) (scaled likelihoods) or raw classifier margin outputs. We present experiments on two tasks, a medium-vocabulary word-ranking task and a small-vocabulary phrase recognition task. We show that articulatory feature-based models outperform baseline models, and we study several aspects of the models, such as the effects of allowing articulatory asynchrony, of using dictionary-based versus whole-word models, and of incorporating classifier outputs via virtual evidence versus alternative observation models.
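The virtual-evidence idea, classifier outputs entering inference as scaled likelihoods rather than through a generative observation model, can be sketched with a single-chain forward pass (a simplification of the multi-stream DBN; the two-state lip-opening setup and all numbers are invented):

```python
import numpy as np

def forward_with_virtual_evidence(trans, init, ve):
    """Forward pass over hidden states where each frame's emission term is
    'virtual evidence': scaled likelihoods from a discriminative
    articulatory-feature classifier."""
    alpha = init * ve[0]
    for t in range(1, len(ve)):
        alpha = (alpha @ trans) * ve[t]
        alpha /= alpha.sum()               # normalize to avoid underflow
    return alpha                           # posterior over states, last frame

# two lip-opening states (closed/open); classifier favors 'open' late
trans = np.array([[0.8, 0.2], [0.2, 0.8]])
init = np.array([0.9, 0.1])
ve = np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9], [0.1, 0.9]])
post = forward_with_virtual_evidence(trans, init, ve)
assert post[1] > post[0]                   # inference ends in the 'open' state
```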


Subjects
Image Interpretation, Computer-Assisted/methods; Lip/anatomy & histology; Lip/physiology; Lipreading; Models, Biological; Speech Production Measurement/methods; Speech Recognition Software; Algorithms; Computer Simulation; Humans; Image Enhancement/methods; Models, Anatomic; Pattern Recognition, Automated/methods
14.
IEEE Trans Pattern Anal Mach Intell ; 29(10): 1759-75, 2007 Oct.
Article in English | MEDLINE | ID: mdl-17699921

ABSTRACT

We describe a semi-supervised regression algorithm that learns to transform one time series into another time series given examples of the transformation. This algorithm is applied to tracking, where a time series of observations from sensors is transformed to a time series describing the pose of a target. Instead of defining and implementing such transformations for each tracking task separately, our algorithm learns a memoryless transformation of time series from a few example input-output mappings. The algorithm searches for a smooth function that fits the training examples and, when applied to the input time series, produces a time series that evolves according to assumed dynamics. The learning procedure is fast and lends itself to a closed-form solution. It is closely related to nonlinear system identification and manifold learning techniques. We demonstrate our algorithm on the tasks of tracking RFID tags from signal strength measurements and recovering the pose of rigid objects, deformable bodies, and articulated bodies from video sequences. For these tasks, this algorithm requires significantly fewer examples compared to fully supervised regression algorithms or semi-supervised learning algorithms that do not take the dynamics of the output time series into account.
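The closed-form flavor of the method can be sketched with a linear map fit from only two labeled steps, regularized so the output sequence obeys assumed first-order dynamics. This is a toy rendition of the idea (linear instead of smooth nonlinear, scalar dynamics), not the paper's exact formulation:

```python
import numpy as np

def fit_with_dynamics(X_seq, labeled_idx, y_lab, a=0.9, lam=10.0):
    """Fit a linear readout w (output = X_seq @ w) from a few labeled steps,
    regularized so the output sequence follows y[t+1] ~= a*y[t].
    Closed-form least squares, as the abstract's learning procedure admits."""
    A = X_seq[labeled_idx]                   # rows with known outputs
    D = X_seq[1:] - a * X_seq[:-1]           # dynamics-residual operator
    # minimize ||A w - y||^2 + lam * ||D w||^2
    return np.linalg.solve(A.T @ A + lam * D.T @ D, A.T @ y_lab)

# sensor channel 0 follows the assumed dynamics, channel 1 is a distractor
T = 30
t = np.arange(T)
y_true = 0.9 ** t
X = np.stack([y_true, 0.5 ** t], axis=1)
w = fit_with_dynamics(X, [0, 1], y_true[[0, 1]])   # only 2 labeled steps
assert np.allclose(X @ w, y_true)                  # full sequence recovered
```

The dynamics penalty is what supplies the missing supervision: two labeled pairs alone would not pin down which channel carries the pose.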


Subjects
Algorithms; Artificial Intelligence; Information Storage and Retrieval/methods; Models, Theoretical; Pattern Recognition, Automated/methods; Signal Processing, Computer-Assisted; Computer Simulation; Reproducibility of Results; Sensitivity and Specificity; Time Factors
15.
IEEE Trans Pattern Anal Mach Intell ; 29(10): 1848-53, 2007 Oct.
Article in English | MEDLINE | ID: mdl-17699927

ABSTRACT

We present a discriminative latent variable model for classification problems in structured domains where inputs can be represented by a graph of local observations. A hidden-state Conditional Random Field framework learns a set of latent variables conditioned on local features. Observations need not be independent and may overlap in space and time.
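The core computation, scoring each class by marginalizing over latent variables, can be sketched for the degenerate case of a graph with no edges between hidden states (the full model also learns pairwise potentials). Shapes and parameters below are illustrative assumptions:

```python
import numpy as np

def hcrf_class_scores(feats, theta):
    """Hidden-state CRF score per class: log-sum-exp over each node's
    latent state, summed over nodes. theta[c, h, :] weights the feature
    vector for class c under hidden state h; latent-state edges omitted."""
    # pot[c, h, n] = theta[c, h] . feats[n]
    pot = np.einsum('chd,nd->chn', theta, feats)
    m = pot.max(axis=1, keepdims=True)
    node_scores = m[:, 0, :] + np.log(np.exp(pot - m).sum(axis=1))
    return node_scores.sum(axis=1)          # sum over graph nodes, per class

rng = np.random.default_rng(6)
feats = rng.normal(size=(5, 4))             # 5 local observations, 4-d each
theta = rng.normal(size=(3, 2, 4))          # 3 classes, 2 latent states
scores = hcrf_class_scores(feats, theta)
probs = np.exp(scores - scores.max())
probs /= probs.sum()
assert scores.shape == (3,) and np.isclose(probs.sum(), 1.0)
```

Because the latent states are summed out rather than fixed, overlapping and dependent observations can all contribute evidence to every class score.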


Subjects
Algorithms; Artificial Intelligence; Image Enhancement/methods; Image Interpretation, Computer-Assisted/methods; Pattern Recognition, Automated/methods; Computer Simulation; Data Interpretation, Statistical; Markov Chains; Models, Statistical; Reproducibility of Results; Sensitivity and Specificity