Results 1 - 17 of 17
1.
Nat Biomed Eng ; 2024 Oct 01.
Article in English | MEDLINE | ID: mdl-39354052

ABSTRACT

The application of machine learning to tasks involving volumetric biomedical imaging is constrained by the limited availability of annotated datasets of three-dimensional (3D) scans for model training. Here we report a deep-learning model pre-trained on 2D scans (for which annotated data are relatively abundant) that accurately predicts disease-risk factors from 3D medical-scan modalities. The model, which we named SLIViT (for 'slice integration by vision transformer'), preprocesses a given volumetric scan into 2D images, extracts their feature map and integrates it into a single prediction. We evaluated the model in eight different learning tasks, including classification and regression for six datasets involving four volumetric imaging modalities (computed tomography, magnetic resonance imaging, optical coherence tomography and ultrasound). SLIViT consistently outperformed domain-specific state-of-the-art models and was typically as accurate as clinical specialists who had spent considerable time manually annotating the analysed scans. Automating diagnosis tasks involving volumetric scans may save valuable clinician hours, reduce data acquisition costs and duration, and help expedite medical research and clinical applications.
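A minimal sketch of the slice-integration idea the abstract describes, not the authors' released SLIViT code: a 2D backbone (assumed pre-trained on 2D images) embeds each slice of a volume, and a small transformer integrates the slice tokens into one prediction. Module names and dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SliceIntegrator(nn.Module):
    def __init__(self, backbone_2d: nn.Module, feat_dim: int = 256, n_out: int = 1):
        super().__init__()
        self.backbone = backbone_2d  # assumed to map (N, C, H, W) -> (N, feat_dim)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.integrator = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(feat_dim, n_out)

    def forward(self, volume: torch.Tensor) -> torch.Tensor:
        # volume: (batch, n_slices, channels, height, width)
        b, s, c, h, w = volume.shape
        feats = self.backbone(volume.reshape(b * s, c, h, w))  # embed each 2D slice
        feats = feats.reshape(b, s, -1)                        # one token per slice
        fused = self.integrator(feats).mean(dim=1)             # integrate across slices
        return self.head(fused)                                # single prediction
```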

2.
Commun Eng ; 3(1): 125, 2024 Sep 03.
Article in English | MEDLINE | ID: mdl-39227676

ABSTRACT

Understanding a person's behavior from their 3D motion sequence is a fundamental problem in computer vision with many applications. An important component of this problem is 3D action localization, which involves recognizing what actions a person is performing and when each action occurs in the sequence. To advance progress on 3D action localization, we introduce a new, challenging, and more complex benchmark dataset, BABEL-TAL (BT). Important baselines and evaluation metrics, as well as human evaluations, are carefully established on this benchmark. We also propose a strong baseline model, Localizing Actions with Transformers (LocATe), that jointly localizes and recognizes actions in a 3D sequence. LocATe shows superior performance on BABEL-TAL as well as on the large-scale PKU-MMD dataset, achieving state-of-the-art performance using only 10% of the labeled training data. Our research could advance the development of more accurate and efficient systems for human behavior analysis, with potential applications in areas such as human-computer interaction and healthcare.
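A hedged sketch of what joint localization-and-recognition over a 3D pose sequence can look like (LocATe's actual architecture differs in detail; the joint count, model size, and per-frame labeling scheme below are assumptions): each pose frame becomes a token, and per-frame action logits yield both the action class and its temporal extent.

```python
import torch
import torch.nn as nn

class PoseActionLocalizer(nn.Module):
    def __init__(self, n_joints: int = 25, d_model: int = 128, n_classes: int = 20):
        super().__init__()
        self.embed = nn.Linear(n_joints * 3, d_model)  # flatten (x, y, z) per joint
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)
        self.classifier = nn.Linear(d_model, n_classes + 1)  # +1 for "background"

    def forward(self, poses: torch.Tensor) -> torch.Tensor:
        # poses: (batch, n_frames, n_joints * 3) -> per-frame action logits;
        # contiguous runs of one predicted class form localized action segments
        return self.classifier(self.encoder(self.embed(poses)))
```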

3.
Article in English | MEDLINE | ID: mdl-38959151

ABSTRACT

Generative models have made huge progress in photorealistic image synthesis in recent years. To enable humans to steer the image generation process and customize the output, many works explore the interpretable dimensions of the latent space in GANs. Existing methods edit attributes of the output image, such as orientation or color scheme, by varying the latent code along certain directions. However, these methods usually require additional human annotations for each pretrained model, and they mostly focus on editing global attributes. In this work, we propose a self-supervised approach to improve the spatial steerability of GANs without searching for steerable directions in the latent space or requiring extra annotations. Specifically, we design randomly sampled Gaussian heatmaps that are encoded into the intermediate layers of generative models as spatial inductive bias. While training the GAN model from scratch, these heatmaps are aligned with the emerging attention of the GAN's discriminator in a self-supervised manner. During inference, users can interact with the spatial heatmaps intuitively, editing the output image by adjusting the scene layout and moving or removing objects. Moreover, we incorporate DragGAN into our framework, which facilitates fine-grained manipulation within a reasonable time and supports a coarse-to-fine editing process. Extensive experiments show that the proposed method not only enables spatial editing over human faces, animal faces, outdoor scenes, and complicated multi-object indoor scenes but also improves synthesis quality. Code, models, and a demo video are available at https://genforce.github.io/SpatialGAN/.
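A minimal sketch of the randomly sampled Gaussian heatmaps mentioned above (map size, blob count, and bandwidth are illustrative assumptions); in the paper's scheme such a map would be injected into intermediate generator layers as a spatial inductive bias.

```python
import torch

def sample_gaussian_heatmap(size: int = 32, n_blobs: int = 3, sigma: float = 4.0):
    ys, xs = torch.meshgrid(torch.arange(size), torch.arange(size), indexing="ij")
    heatmap = torch.zeros(size, size)
    for _ in range(n_blobs):
        cy, cx = torch.rand(2) * size  # random blob center
        heatmap += torch.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
    return heatmap / heatmap.max()     # normalize to [0, 1]
```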

4.
Nature ; 630(8016): 353-359, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38867127

ABSTRACT

Exoskeletons have enormous potential to improve human locomotive performance [1-3]. However, their development and broad dissemination are limited by the requirement for lengthy human tests and handcrafted control laws [2]. Here we show an experiment-free method to learn a versatile control policy in simulation. Our learning-in-simulation framework leverages dynamics-aware musculoskeletal and exoskeleton models and data-driven reinforcement learning to bridge the gap between simulation and reality without human experiments. The learned controller is deployed on a custom hip exoskeleton that automatically generates assistance across different activities, reducing metabolic rates by 24.3%, 13.1% and 15.4% for walking, running and stair climbing, respectively. Our framework may offer a generalizable and scalable strategy for the rapid development and widespread adoption of a variety of assistive robots for both able-bodied and mobility-impaired individuals.


Subjects
Computer Simulation , Exoskeleton Device , Hip , Robotics , Humans , Exoskeleton Device/supply & distribution , Exoskeleton Device/trends , Learning , Robotics/instrumentation , Robotics/methods , Running , Walking , Disabled Persons , Self-Help Devices/supply & distribution , Self-Help Devices/trends
5.
IEEE Trans Pattern Anal Mach Intell ; 46(5): 2607-2621, 2024 May.
Article in English | MEDLINE | ID: mdl-38300785

ABSTRACT

Generative Adversarial Networks (GANs) have significantly advanced image synthesis by mapping randomly sampled latent codes to high-fidelity synthesized images. However, applying well-trained GANs to real image editing remains challenging. A common solution is to find an approximate latent code that can adequately recover the input image to be edited, a process known as GAN inversion. To invert a GAN model, prior works typically focus on reconstructing the target image at the pixel level, yet few studies examine whether the inverted result can support manipulation at the semantic level. This work fills that gap by proposing in-domain GAN inversion, which consists of a domain-guided encoder and a domain-regularized optimizer, to regularize the inverted code in the native latent space of the pre-trained GAN model. In this way, we sufficiently reuse the knowledge learned by GANs for image reconstruction, facilitating a wide range of editing applications without any retraining. We further comprehensively analyze the effects of the encoder structure, the starting inversion point, and the inversion parameter space, and observe a trade-off between reconstruction quality and editability. This trade-off sheds light on how a GAN model represents an image with various semantics encoded in the learned latent distribution.
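A hedged sketch of the two-stage recipe the abstract describes, under assumed loss weights and step counts (not the paper's exact training objective): the domain-guided encoder provides the starting point, and the optimizer refines the code with a pixel reconstruction term plus a regularizer keeping the code where the encoder would map the reconstruction.

```python
import torch

def invert(generator, encoder, image, steps: int = 200, lam: float = 2.0):
    # Initialize from the domain-guided encoder, then optimize the code.
    z = encoder(image).detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=0.01)
    for _ in range(steps):
        recon = generator(z)
        loss = ((recon - image) ** 2).mean()                    # pixel reconstruction
        loss = loss + lam * ((encoder(recon) - z) ** 2).mean()  # domain regularizer
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()
```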

6.
Res Sq ; 2023 Nov 21.
Article in English | MEDLINE | ID: mdl-38045283

ABSTRACT

We present SLIViT, a deep-learning framework that accurately measures disease-related risk factors in volumetric biomedical imaging, such as magnetic resonance imaging (MRI) scans, optical coherence tomography (OCT) scans, and ultrasound videos. To evaluate SLIViT, we applied it to five datasets spanning these three data modalities and seven learning tasks (including both classification and regression) and found that it consistently and significantly outperforms domain-specific state-of-the-art models, typically improving performance (ROC AUC or correlation) by 0.1-0.4. Notably, compared to existing approaches, SLIViT can be applied even when only a small number of annotated training samples is available, which is often a constraint in medical applications. When trained on fewer than 700 annotated volumes, SLIViT obtained accuracy comparable to trained clinical specialists while reducing annotation time by a factor of 5,000, demonstrating its utility for automating and expediting ongoing research and other practical clinical scenarios.

7.
IEEE Trans Pattern Anal Mach Intell ; 45(3): 3121-3138, 2023 Mar.
Article in English | MEDLINE | ID: mdl-37022469

ABSTRACT

GAN inversion aims to invert a given image back into the latent space of a pretrained GAN model so that the image can be faithfully reconstructed from the inverted code by the generator. As an emerging technique bridging the real and fake image domains, GAN inversion plays an essential role in enabling pretrained GAN models, such as StyleGAN and BigGAN, to be used for real image editing. GAN inversion also helps interpret the GAN's latent space and examine how realistic images can be generated. In this paper, we provide a survey of GAN inversion with a focus on its representative algorithms and its applications in image restoration and image manipulation. We further discuss the trends and challenges for future research. A curated list of GAN inversion methods, datasets, and other related information can be found at https://github.com/weihaox/awesome-gan-inversion.

8.
IEEE Trans Pattern Anal Mach Intell ; 45(6): 7395-7411, 2023 Jun.
Article in English | MEDLINE | ID: mdl-36455092

ABSTRACT

Recent years have witnessed the tremendous success of generative adversarial networks (GANs) in synthesizing photo-realistic images. The GAN generator learns to compose realistic images and reproduce the real data distribution, and in doing so it spontaneously develops hierarchical visual features with multi-level semantics. In this work, we show that such generative features learned from image synthesis exhibit great potential for solving a wide range of computer vision tasks, including both generative and, more importantly, discriminative ones. We first train an encoder by treating the pre-trained StyleGAN generator as a learned loss function. The visual features produced by our encoder, termed Generative Hierarchical Features (GH-Feat), align closely with the layer-wise GAN representations and hence describe the input image adequately from the reconstruction perspective. Extensive experiments support the versatile transferability of GH-Feat across a range of applications, such as image editing, image processing, image harmonization, face verification, landmark detection, layout prediction, and image retrieval. We further show that, through a proper spatial expansion, GH-Feat can also facilitate fine-grained semantic segmentation using only a few annotations. Both qualitative and quantitative results demonstrate the appealing performance of GH-Feat. Code and models are available at https://genforce.github.io/ghfeat/.
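A minimal sketch of using a frozen, pre-trained generator as a learned loss, as the abstract describes (a hedged reading, not the released GH-Feat code; the plain L2 objective below is an assumption):

```python
import torch

def train_step(encoder, frozen_generator, images, optimizer):
    feats = encoder(images)                # hierarchical features for each image
    recon = frozen_generator(feats)        # generator kept fixed (not optimized)
    loss = ((recon - images) ** 2).mean()  # generator acts as the learned loss
    optimizer.zero_grad()                  # optimizer holds only encoder params
    loss.backward()
    optimizer.step()
    return loss.item()
```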

9.
IEEE Trans Pattern Anal Mach Intell ; 45(3): 3461-3475, 2023 Mar.
Article in English | MEDLINE | ID: mdl-35830412

ABSTRACT

Driving safely requires multiple capabilities from human and intelligent agents, such as generalizability to unseen environments, safety awareness of the surrounding traffic, and decision-making in complex multi-agent settings. Despite the great success of Reinforcement Learning (RL), most RL research investigates each capability separately, owing to the lack of integrated environments. In this work, we develop a new driving simulation platform called MetaDrive to support research on generalizable reinforcement learning algorithms for machine autonomy. MetaDrive is highly compositional and can generate an infinite number of diverse driving scenarios from both procedural generation and real data import. Based on MetaDrive, we construct a variety of RL tasks and baselines in both single-agent and multi-agent settings, including benchmarking generalizability across unseen scenes, safe exploration, and learning multi-agent traffic. Generalization experiments on both procedurally generated and real-world scenarios show that increasing the diversity and size of the training set improves the RL agent's generalizability. We further evaluate various safe reinforcement learning and multi-agent reinforcement learning algorithms in MetaDrive environments and provide the benchmarks. Source code, documentation, and a demo video are available at https://metadriverse.github.io/metadrive.
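A hedged usage sketch of MetaDrive's gym-style interface. Configuration keys and the exact reset/step signatures vary across releases (older versions follow the 4-tuple gym API); treat the specifics below as assumptions and check the project documentation.

```python
from metadrive import MetaDriveEnv

# Assumed config key for the number of procedurally generated scenes.
env = MetaDriveEnv({"num_scenarios": 100})
obs, info = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # replace with a trained policy
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()
```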

10.
IEEE Trans Pattern Anal Mach Intell ; 44(4): 2004-2018, 2022 04.
Article in English | MEDLINE | ID: mdl-33108282

ABSTRACT

Although generative adversarial networks (GANs) have made significant progress in face synthesis, it remains poorly understood what GANs learn in the latent representation that maps a random code to a photo-realistic image. In this work, we propose a framework called InterFaceGAN to interpret the disentangled face representation learned by state-of-the-art GAN models and study the properties of the facial semantics encoded in the latent space. We first find that GANs learn various semantics in linear subspaces of the latent space. After identifying these subspaces, we can realistically manipulate the corresponding facial attributes without retraining the model. We then conduct a detailed study of the correlation between different semantics and manage to better disentangle them via subspace projection, resulting in more precise control of attribute manipulation. Besides manipulating gender, age, expression, and the presence of eyeglasses, we can even alter the face pose and correct artifacts accidentally produced by GANs. Furthermore, we perform an in-depth face identity analysis and a layer-wise analysis to quantitatively evaluate the editing results. Finally, we apply our approach to real face editing by employing GAN inversion approaches and explicitly training feed-forward models on synthetic data established by InterFaceGAN. Extensive experimental results suggest that learning to synthesize faces spontaneously brings about a disentangled and controllable face representation.
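The linear-subspace finding above admits a compact illustration. A minimal, hedged sketch (not the authors' released code) of editing along a semantic direction and of the subspace-projection step used to disentangle two correlated semantics:

```python
import numpy as np

def edit(z: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    # Move the latent code along a learned semantic direction.
    return z + alpha * direction

def disentangle(n1: np.ndarray, n2: np.ndarray) -> np.ndarray:
    # Project n2's component out of n1 so editing along the result
    # changes the first attribute while leaving the second fixed.
    n1 = n1 / np.linalg.norm(n1)
    n2 = n2 / np.linalg.norm(n2)
    primal = n1 - (n1 @ n2) * n2
    return primal / np.linalg.norm(primal)
```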


Subjects
Algorithms , Image Processing, Computer-Assisted , Head , Image Processing, Computer-Assisted/methods , Semantics
11.
IEEE Trans Image Process ; 30: 9112-9124, 2021.
Article in English | MEDLINE | ID: mdl-34723802

ABSTRACT

Patch-based methods and deep networks have been employed to tackle the image inpainting problem, each with its own strengths and weaknesses. Patch-based methods can restore a missing region with high-quality texture by searching for nearest-neighbor patches in the unmasked regions. However, these methods produce problematic content when recovering large missing regions. Deep networks, on the other hand, show promising results in completing large regions, but the results often lack faithful, sharp details that resemble the surrounding area. Bringing together the best of both paradigms, we propose a new deep inpainting framework in which texture generation is guided by a texture memory of patch samples extracted from unmasked regions. The framework has a novel design that allows texture memory retrieval to be trained end-to-end with the deep inpainting network. In addition, we introduce a patch distribution loss to encourage high-quality patch synthesis. The proposed method shows superior performance both qualitatively and quantitatively on three challenging image benchmarks: the Places, CelebA-HQ, and Paris Street-View datasets. Code will be made publicly available at https://github.com/open-mmlab/mmediting.
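A hedged sketch of the retrieval component described above; in the paper the retrieval is trained end-to-end with the inpainting network, whereas this illustrative version uses plain cosine nearest-neighbor matching on patch features:

```python
import torch
import torch.nn.functional as F

def retrieve_patches(query: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
    # query:  (n_query, d) features of patches around the hole
    # memory: (n_mem, d) features of patches from unmasked regions
    sim = F.normalize(query, dim=1) @ F.normalize(memory, dim=1).T
    return memory[sim.argmax(dim=1)]  # best-matching texture patch per query
```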

12.
Proc Natl Acad Sci U S A ; 117(48): 30071-30078, 2020 12 01.
Article in English | MEDLINE | ID: mdl-32873639

ABSTRACT

Deep neural networks excel at finding hierarchical representations that solve complex tasks over large datasets. How can we humans understand these learned representations? In this work, we present network dissection, an analytic framework to systematically identify the semantics of individual hidden units within image classification and image generation networks. First, we analyze a convolutional neural network (CNN) trained on scene classification and discover units that match a diverse set of object concepts. We find evidence that the network has learned many object classes that play crucial roles in classifying scene classes. Second, we use a similar analytic method to analyze a generative adversarial network (GAN) model trained to generate scenes. By analyzing changes made when small sets of units are activated or deactivated, we find that objects can be added and removed from the output scenes while adapting to the context. Finally, we apply our analytic framework to understanding adversarial attacks and to semantic image editing.
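A minimal sketch of the unit-intervention probe described above (hedged; the layer choice and values are assumptions): small sets of channels at an intermediate generator layer are zeroed or forced on, and regenerating the output reveals whether those units control an object concept in the scene.

```python
import torch

def ablate_units(features: torch.Tensor, units: list, value: float = 0.0):
    # features: (batch, channels, h, w) activations at some generator layer
    edited = features.clone()
    edited[:, units] = value  # 0.0 deactivates the units; a large value forces them on
    return edited             # feed back through the remaining generator layers
```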

13.
IEEE Trans Pattern Anal Mach Intell ; 42(2): 502-508, 2020 02.
Article in English | MEDLINE | ID: mdl-30802849

ABSTRACT

We present the Moments in Time Dataset, a large-scale human-annotated collection of one million short videos corresponding to dynamic events unfolding within three seconds. Modeling the spatial-audio-temporal dynamics even of actions occurring in 3-second videos poses many challenges: meaningful events include not only people but also objects, animals, and natural phenomena; visual and auditory events can be symmetrical in time ("opening" is "closing" in reverse), and either transient or sustained. We describe the annotation process of our dataset (each video is tagged with one action or activity label among 339 different classes), analyze its scale and diversity in comparison to other large-scale video datasets for action recognition, and report results of several baseline models addressing three modalities, spatial, temporal and auditory, both separately and jointly. The Moments in Time dataset, designed to have large coverage and diversity of events in both visual and auditory modalities, can serve as a new challenge for developing models that scale to the level of complexity and abstract reasoning that a human processes daily.


Subjects
Databases, Factual , Video Recording , Animals , Human Activities/classification , Humans , Image Processing, Computer-Assisted , Pattern Recognition, Automated
14.
R Soc Open Sci ; 6(3): 181375, 2019 Mar.
Article in English | MEDLINE | ID: mdl-31032000

ABSTRACT

Understanding the visual discrepancy and heterogeneity of different places is of great interest to architectural design, urban design and tourism planning. However, previous studies have been limited by the lack of adequate data and efficient methods to quantify the visual aspects of a place. This work proposes a data-driven framework to explore place-informative scenes and objects by employing a deep convolutional neural network to learn and measure the visual knowledge of place appearance automatically from a massive dataset of photos and imagery. Based on the proposed framework, we compare the visual similarity and visual distinctiveness of 18 cities worldwide using millions of geo-tagged photos obtained from social media. As a result, we identify the visual cues that distinguish each city from others: beyond landmarks, historical architecture, religious sites, unique urban scenes, and some unusual natural landscapes emerge as the most place-informative elements. As for city-informative objects, taking vehicles as an example, we find that taxis, police cars and ambulances are the most place-informative. The results of this work are inspiring for various fields, providing insights into what large-scale geo-tagged data can achieve in understanding place formalization and urban design.

15.
IEEE Trans Pattern Anal Mach Intell ; 41(9): 2131-2145, 2019 09.
Article in English | MEDLINE | ID: mdl-30040625

ABSTRACT

The success of recent deep convolutional neural networks (CNNs) depends on learning hidden representations that can summarize the important factors of variation behind the data. In this work, we describe Network Dissection, a method that interprets networks by assigning meaningful labels to their individual units. The proposed method quantifies the interpretability of CNN representations by evaluating the alignment between individual hidden units and visual semantic concepts. By identifying the best alignments, units are given interpretable labels ranging from colors, materials and textures to parts, objects and scenes. The method reveals that deep representations are more transparent and interpretable than they would be under a random equivalently powerful basis. We apply our approach to interpret and compare the latent representations of several network architectures trained to solve a wide range of supervised and self-supervised tasks. We then examine factors affecting network interpretability, such as the number of training iterations, regularization, initialization parameters, and network depth and width. Finally, we show that the interpreted units can be used to provide explicit explanations of a given CNN prediction for an image. Our results highlight that interpretability is an important property of deep neural networks, providing new insights into what their hierarchical structures can learn.
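A hedged sketch of the alignment measurement the method relies on: the IoU between a unit's thresholded activation map (assumed already upsampled to the mask resolution) and a ground-truth concept mask. The top-activation quantile below is an assumption.

```python
import numpy as np

def unit_concept_iou(activation: np.ndarray, concept_mask: np.ndarray,
                     quantile: float = 0.995) -> float:
    # activation: (h, w) unit activations; concept_mask: (h, w) boolean mask
    threshold = np.quantile(activation, quantile)  # keep only top activations
    unit_mask = activation > threshold
    inter = np.logical_and(unit_mask, concept_mask).sum()
    union = np.logical_or(unit_mask, concept_mask).sum()
    return float(inter / union) if union > 0 else 0.0
```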

16.
IEEE Trans Pattern Anal Mach Intell ; 40(6): 1452-1464, 2018 06.
Article in English | MEDLINE | ID: mdl-28692961

ABSTRACT

The rise of multi-million-item dataset initiatives has enabled data-hungry machine learning algorithms to reach near-human semantic classification performance at tasks such as visual object and scene recognition. Here we describe the Places Database, a repository of 10 million scene photographs labeled with scene semantic categories, comprising a large and diverse list of the types of environments encountered in the world. Using state-of-the-art convolutional neural networks, we provide scene classification CNNs (Places-CNNs) as baselines that significantly outperform previous approaches. Visualization of the CNNs trained on Places shows that object detectors emerge as an intermediate representation of scene classification. With its high coverage and high diversity of exemplars, the Places Database, together with the Places-CNNs, offers a novel resource to guide future progress on scene recognition problems.

17.
IEEE Trans Pattern Anal Mach Intell ; 36(8): 1586-99, 2014 Aug.
Article in English | MEDLINE | ID: mdl-26353340

ABSTRACT

Collective motions of crowds are common in nature and have attracted a great deal of attention in a variety of multidisciplinary fields. Collectiveness, which indicates the degree to which individuals act as a union, is a fundamental and universal measurement for various crowd systems. By quantifying the topological structure of a crowd's collective manifold, this paper proposes a descriptor of collectiveness, and its efficient computation, for a crowd and its constituent individuals. The Collective Merging algorithm is then proposed to detect collective motions among random motions. We validate the effectiveness and robustness of the proposed collectiveness on a system of self-driven particles as well as on real crowd systems such as pedestrian crowds and bacterial colonies. We compare the collectiveness descriptor with human perception of collective motion and show their high consistency. As a universal descriptor, the proposed crowd collectiveness can be used to compare different crowd systems. It has a wide range of applications, such as detecting collective motions in crowd clutter, monitoring crowd dynamics, and generating collectiveness maps for crowded scenes. A new Collective Motion Database, consisting of 413 video clips from 62 crowded scenes, is released to the public.
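A hedged reconstruction of a path-based collectiveness computation in the spirit of the descriptor above; the regularized path-integral form and the constants here are assumptions drawn from the general formulation, not a verbatim transcription of the paper.

```python
import numpy as np

def collectiveness(A: np.ndarray, kappa: float = 0.5) -> float:
    # A: (n, n) adjacency of behavior similarities between individuals.
    n = A.shape[0]
    # Keep the path series convergent: kappa * spectral_radius(A) < 1.
    radius = max(np.abs(np.linalg.eigvals(A)).max(), 1e-9)
    kappa = min(kappa, 0.99 / radius)
    # Sum of similarity-weighted paths of all lengths: (I - kA)^-1 - I.
    Z = np.linalg.inv(np.eye(n) - kappa * A) - np.eye(n)
    phi = Z.mean(axis=1)        # per-individual collectiveness score
    return float(phi.mean())    # crowd-level collectiveness
```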
