Pesquisa | Biblioteca Virtual em Saúde

1.

Automatically Predicting Material Properties with Microscopic Images: Polymer Miscibility as an Example.

Liang, Zhilong; Tan, Zhenzhi; Hong, Ruixin; Ouyang, Wanli; Yuan, Jinying; Zhang, Changshui.

J Chem Inf Model ; 63(19): 5971-5980, 2023 10 09.

Artigo em Inglês | MEDLINE | ID: mdl-37589216

RESUMO

Many material properties are manifested in the morphological appearance and characterized using microscopic images, such as scanning electron microscopy (SEM). Polymer miscibility is a key physical quantity of polymer materials and is commonly and intuitively judged using SEM images. However, human observation and judgment of the images is time-consuming, labor-intensive, and hard to be quantified. Computer image recognition with machine learning methods can make up for the defects of artificial judging, giving accurate and quantitative judgment. We achieve automatic miscibility recognition utilizing a convolutional neural network and transfer learning methods, and the model obtains up to 94% accuracy. We also put forward a quantitative criterion for polymer miscibility with this model. The proposed method can be widely applied to the quantitative characterization of the microstructure and properties of various materials.

Assuntos

Redes Neurais de Computação , Polímeros , Humanos , Aprendizado de Máquina

2.

3D Object Detection From Images for Autonomous Driving: A Survey.

Ma, Xinzhu; Ouyang, Wanli; Simonelli, Andrea; Ricci, Elisa.

IEEE Trans Pattern Anal Mach Intell ; 46(5): 3537-3556, 2024 May.

Artigo em Inglês | MEDLINE | ID: mdl-38145536

RESUMO

3D object detection from images, one of the fundamental and challenging problems in autonomous driving, has received increasing attention from both industry and academia in recent years. Benefiting from the rapid development of deep learning technologies, image-based 3D detection has achieved remarkable progress. Particularly, more than 200 works have studied this problem from 2015 to 2021, encompassing a broad spectrum of theories, algorithms, and applications. However, to date no recent survey exists to collect and organize this knowledge. In this paper, we fill this gap in the literature and provide the first comprehensive survey of this novel and continuously growing research field, summarizing the most commonly used pipelines for image-based 3D detection and deeply analyzing each of their components. Additionally, we also propose two new taxonomies to organize the state-of-the-art methods into different categories, with the intent of providing a more systematic review of existing methods and facilitating fair comparisons with future works. In retrospect of what has been achieved so far, we also analyze the current challenges in the field and discuss future directions for image-based 3D detection research.

3.

Cap4Video++: Enhancing Video Understanding with Auxiliary Captions.

Wu, Wenhao; Wang, Xiaohan; Luo, Haipeng; Wang, Jingdong; Yang, Yi; Ouyang, Wanli.

IEEE Trans Pattern Anal Mach Intell ; PP2024 Sep 09.

Artigo em Inglês | MEDLINE | ID: mdl-39250359

RESUMO

Understanding videos, especially aligning them with textual data, presents a significant challenge in computer vision. The advent of vision-language models (VLMs) like CLIP has sparked interest in leveraging their capabilities for enhanced video understanding, showing marked advancements in both performance and efficiency. However, current methods often neglect vital user-generated metadata such as video titles. In this paper, we present Cap4Video++, a universal framework that leverages auxiliary captions to enrich video understanding. More recently, we witness the flourishing of large language models (LLMs) like ChatGPT. Cap4Video++ harnesses the synergy of vision-language models (VLMs) and large language models (LLMs) to generate video captions, utilized in three key phases: (i) Input stage employs Semantic Pair Sampling to extract beneficial samples from captions, aiding contrastive learning. (ii) Intermediate stage sees Video-Caption Cross-modal Interaction and Adaptive Caption Selection work together to bolster video and caption representations. (iii) Output stage introduces a Complementary Caption-Text Matching branch, enhancing the primary video branch by improving similarity calculations. Our comprehensive experiments on text-video retrieval and video action recognition across nine benchmarks clearly demonstrate Cap4Video++'s superiority over existing models, highlighting its effectiveness in utilizing automatically generated captions to advance video understanding.

4.

Adaptive pessimism via target Q-value for offline reinforcement learning.

Liu, Jie; Zhang, Yinmin; Li, Chuming; Yang, Yaodong; Liu, Yu; Ouyang, Wanli.

Neural Netw ; 180: 106588, 2024 Aug 05.

Artigo em Inglês | MEDLINE | ID: mdl-39180907

RESUMO

Offline reinforcement learning (RL) methods learn from datasets without further environment interaction, facing errors due to out-of-distribution (OOD) actions. Although effective methods have been proposed to conservatively estimate the Q-values of those OOD actions to mitigate this problem, insufficient or excessive pessimism under constant constraints often harms the policy learning process. Moreover, since the distribution of each task on the dataset varies among different environments and behavior policies, it is desirable to learn an adaptive weight for balancing constraints on the conservative estimation of Q-value and the standard RL objectives depending on each task. To achieve this, in this paper, we point out that the quantile of the Q-value is an effective metric to refer to the Q-value distribution of the fixed data set. Based on this observation, we design Adaptive Pessimism via a Target Q-value (APTQ) algorithm that balances between the pessimism constraint and the RL objective; this leads the expectation of Q-value to stably converge to a given target Q-value from a reasonable quantile of the Q-value distribution of the dataset. Experiments show that our method remarkably improves the performance of the state-of-the-art method CQL by 6.20% on the D4RL-v0 and 1.89% on the D4RL-v2.

5.

MCTformer+: Multi-Class Token Transformer for Weakly Supervised Semantic Segmentation.

Xu, Lian; Bennamoun, Mohammed; Boussaid, Farid; Laga, Hamid; Ouyang, Wanli; Xu, Dan.

IEEE Trans Pattern Anal Mach Intell ; PP2024 May 23.

Artigo em Inglês | MEDLINE | ID: mdl-38781059

RESUMO

This paper proposes a novel transformer-based framework to generate accurate class-specific object localization maps for weakly supervised semantic segmentation (WSSS). Leveraging the insight that the attended regions of the one-class token in the standard vision transformer can generate class-agnostic localization maps, we investigate the transformer's capacity to capture class-specific attention for class-discriminative object localization by learning multiple class tokens. We present the Multi-Class Token transformer, which incorporates multiple class tokens to enable class-aware interactions with patch tokens. This is facilitated by a class-aware training strategy that establishes a one-to-one correspondence between output class tokens and ground-truth class labels. We also introduce a Contrastive-Class-Token (CCT) module to enhance the learning of discriminative class tokens, enabling the model to better capture the unique characteristics of each class. Consequently, the proposed framework effectively generates class-discriminative object localization maps from the class-to-patch attentions associated with different class tokens. To refine these localization maps, we propose the utilization of patch-level pairwise affinity derived from the patch-to-patch transformer attention. Furthermore, the proposed framework seamlessly complements the Class Activation Mapping (CAM) method, yielding significant improvements in WSSS performance on PASCAL VOC 2012 and MS COCO 2014. These results underline the importance of the class token for WSSS. The codes and models are publicly available here.

6.

Content-Aware Rectified Activation for Zero-Shot Fine-Grained Image Retrieval.

Wang, Shijie; Chang, Jianlong; Wang, Zhihui; Li, Haojie; Ouyang, Wanli; Tian, Qi.

IEEE Trans Pattern Anal Mach Intell ; 46(6): 4366-4380, 2024 Jun.

Artigo em Inglês | MEDLINE | ID: mdl-38236683

RESUMO

Fine-grained image retrieval mainly focuses on learning salient features from the seen subcategories as discriminative embedding while neglecting the problems behind zero-shot settings. We argue that retrieving fine-grained objects from unseen subcategories may rely on more diverse clues, which are easily restrained by the salient features learnt from seen subcategories. To address this issue, we propose a novel Content-aware Rectified Activation model, which enables this model to suppress the activation on salient regions while preserving their discrimination, and spread activation to adjacent non-salient regions, thus mining more diverse discriminative features for retrieving unseen subcategories. Specifically, we construct a content-aware rectified prototype (CARP) by perceiving semantics of salient regions. CARP acts as a channel-wise non-destructive activation upper bound and can be selectively used to suppress salient regions for obtaining the rectified features. Moreover, two regularizations are proposed: 1) a semantic coherency constraint that imposes a restriction on semantic coherency of CARP and salient regions, aiming at propagating the discriminative ability of salient regions to CARP, 2) a feature-navigated constraint to further guide the model to adaptively balance the discrimination power of rectified features and the suppression power of salient features. Experimental results on fine-grained and product retrieval benchmarks demonstrate that our method consistently outperforms the state-of-the-art methods.

7.

Auxiliary Tasks Enhanced Dual-Affinity Learning for Weakly Supervised Semantic Segmentation.

Xu, Lian; Bennamoun, Mohammed; Boussaid, Farid; Ouyang, Wanli; Sohel, Ferdous; Xu, Dan.

IEEE Trans Neural Netw Learn Syst ; PP2024 Mar 13.

Artigo em Inglês | MEDLINE | ID: mdl-38478447

RESUMO

Most existing weakly supervised semantic segmentation (WSSS) methods rely on class activation mapping (CAM) to extract coarse class-specific localization maps using image-level labels. Prior works have commonly used an off-line heuristic thresholding process that combines the CAM maps with off-the-shelf saliency maps produced by a general pretrained saliency model to produce more accurate pseudo-segmentation labels. We propose AuxSegNet + , a weakly supervised auxiliary learning framework to explore the rich information from these saliency maps and the significant intertask correlation between saliency detection and semantic segmentation. In the proposed AuxSegNet + , saliency detection and multilabel image classification are used as auxiliary tasks to improve the primary task of semantic segmentation with only image-level ground-truth labels. We also propose a cross-task affinity learning mechanism to learn pixel-level affinities from the saliency and segmentation feature maps. In particular, we propose a cross-task dual-affinity learning module to learn both pairwise and unary affinities, which are used to enhance the task-specific features and predictions by aggregating both query-dependent and query-independent global context for both saliency detection and semantic segmentation. The learned cross-task pairwise affinity can also be used to refine and propagate CAM maps to provide better pseudo labels for both tasks. Iterative improvement of segmentation performance is enabled by cross-task affinity learning and pseudo-label updating. Extensive experiments demonstrate the effectiveness of the proposed approach with new state-of-the-art WSSS results on the challenging PASCAL VOC and MS COCO benchmarks.

8.

TCFormer: Visual Recognition via Token Clustering Transformer.

Zeng, Wang; Jin, Sheng; Xu, Lumin; Liu, Wentao; Qian, Chen; Ouyang, Wanli; Luo, Ping; Wang, Xiaogang.

IEEE Trans Pattern Anal Mach Intell ; PP2024 Jul 11.

Artigo em Inglês | MEDLINE | ID: mdl-38990751

RESUMO

Transformers are widely used in computer vision areas and have achieved remarkable success. Most state-of-the-art approaches split images into regular grids and represent each grid region with a vision token. However, fixed token distribution disregards the semantic meaning of different image regions, resulting in sub-optimal performance. To address this issue, we propose the Token Clustering Transformer (TCFormer), which generates dynamic vision tokens based on semantic meaning. Our dynamic tokens possess two crucial characteristics: (1) Representing image regions with similar semantic meanings using the same vision token, even if those regions are not adjacent, and (2) concentrating on regions with valuable details and represent them using fine tokens. Through extensive experimentation across various applications, including image classification, human pose estimation, semantic segmentation, and object detection, we demonstrate the effectiveness of our TCFormer. The code and models for this work are available at https://github.com/zengwang430521/TCFormer.

9.

Fast-BEV: A Fast and Strong Bird's-Eye View Perception Baseline.

Li, Yangguang; Huang, Bin; Chen, Zeren; Cui, Yufeng; Liang, Feng; Shen, Mingzhu; Liu, Fenggang; Xie, Enze; Sheng, Lu; Ouyang, Wanli; Shao, Jing.

IEEE Trans Pattern Anal Mach Intell ; PP2024 Jun 14.

Artigo em Inglês | MEDLINE | ID: mdl-38875097

RESUMO

Recently, perception task based on Bird's-Eye View (BEV) representation has drawn more and more attention, and BEV representation is promising as the foundation for next-generation Autonomous Vehicle (AV) perception. However, most existing BEV solutions either require considerable resources to execute on-vehicle inference or suffer from modest performance. This paper proposes a simple yet effective framework, termed Fast-BEV, which is capable of performing faster BEV perception on the on-vehicle chips. Towards this goal, we first empirically find that the BEV representation can be sufficiently powerful without expensive transformer based transformation nor depth representation. Our Fast-BEV consists of five parts, We innovatively propose (1) a lightweight deploymentfriendly view transformation which fast transfers 2D image feature to 3D voxel space, (2) an multi-scale image encoder which leverages multi-scale information for better performance, (3) an efficient BEV encoder which is particularly designed to speed up on-vehicle inference. We further introduce (4) a strong data augmentation strategy for both image and BEV space to avoid over-fitting, (5) a multiframe feature fusion mechanism to leverage the temporal information. Among them, (1) and (3) enable Fast-BEV to be fast inference and deployment friendly on the on-vehicle chips, (2), (4) and (5) ensure that Fast-BEV has competitive performance. All these make Fast-BEV a solution with high performance, fast inference speed, and deployment-friendly on the on-vehicle chips of autonomous driving. Through experiments, on 2080Ti platform, our R50 model can run 52.6 FPS with 47.3% NDS on the nuScenes validation set, exceeding the 41.3 FPS and 47.5% NDS of the BEVDepth-R50 model [1] and 30.2 FPS and 45.7% NDS of the BEVDet4D-R50 model [2]. Our largest model (R101@900x1600) establishes a competitive 53.5% NDS on the nuScenes validation set. We further develop a benchmark with considerable accuracy and efficiency on current popular on-vehicle chips. The code is released at: https://github.com/Sense-GVT/FastBEV.

10.

Improving multiple sclerosis lesion segmentation across clinical sites: A federated learning approach with noise-resilient training.

Bai, Lei; Wang, Dongang; Wang, Hengrui; Barnett, Michael; Cabezas, Mariano; Cai, Weidong; Calamante, Fernando; Kyle, Kain; Liu, Dongnan; Ly, Linda; Nguyen, Aria; Shieh, Chun-Chien; Sullivan, Ryan; Zhan, Geng; Ouyang, Wanli; Wang, Chenyu.

Artif Intell Med ; 152: 102872, 2024 06.

Artigo em Inglês | MEDLINE | ID: mdl-38701636

RESUMO

Accurately measuring the evolution of Multiple Sclerosis (MS) with magnetic resonance imaging (MRI) critically informs understanding of disease progression and helps to direct therapeutic strategy. Deep learning models have shown promise for automatically segmenting MS lesions, but the scarcity of accurately annotated data hinders progress in this area. Obtaining sufficient data from a single clinical site is challenging and does not address the heterogeneous need for model robustness. Conversely, the collection of data from multiple sites introduces data privacy concerns and potential label noise due to varying annotation standards. To address this dilemma, we explore the use of the federated learning framework while considering label noise. Our approach enables collaboration among multiple clinical sites without compromising data privacy under a federated learning paradigm that incorporates a noise-robust training strategy based on label correction. Specifically, we introduce a Decoupled Hard Label Correction (DHLC) strategy that considers the imbalanced distribution and fuzzy boundaries of MS lesions, enabling the correction of false annotations based on prediction confidence. We also introduce a Centrally Enhanced Label Correction (CELC) strategy, which leverages the aggregated central model as a correction teacher for all sites, enhancing the reliability of the correction process. Extensive experiments conducted on two multi-site datasets demonstrate the effectiveness and robustness of our proposed methods, indicating their potential for clinical applications in multi-site collaborations to train better deep learning models with lower cost in data collection and annotation.

Assuntos

Aprendizado Profundo , Imageamento por Ressonância Magnética , Esclerose Múltipla , Esclerose Múltipla/diagnóstico por imagem , Humanos , Imageamento por Ressonância Magnética/métodos , Interpretação de Imagem Assistida por Computador/métodos , Processamento de Imagem Assistida por Computador/métodos

11.

Integration of cognitive tasks into artificial general intelligence test for large models.

Qu, Youzhi; Wei, Chen; Du, Penghui; Che, Wenxin; Zhang, Chi; Ouyang, Wanli; Bian, Yatao; Xu, Feiyang; Hu, Bin; Du, Kai; Wu, Haiyan; Liu, Jia; Liu, Quanying.

iScience ; 27(4): 109550, 2024 Apr 19.

Artigo em Inglês | MEDLINE | ID: mdl-38595796

RESUMO

During the evolution of large models, performance evaluation is necessary for assessing their capabilities. However, current model evaluations mainly rely on specific tasks and datasets, lacking a united framework for assessing the multidimensional intelligence of large models. In this perspective, we advocate for a comprehensive framework of cognitive science-inspired artificial general intelligence (AGI) tests, including crystallized, fluid, social, and embodied intelligence. The AGI tests consist of well-designed cognitive tests adopted from human intelligence tests, and then naturally encapsulates into an immersive virtual community. We propose increasing the complexity of AGI testing tasks commensurate with advancements in large models and emphasizing the necessity for the interpretation of test results to avoid false negatives and false positives. We believe that cognitive science-inspired AGI tests will effectively guide the targeted improvement of large models in specific dimensions of intelligence and accelerate the integration of large models into human society.

12.

Multidimensional Pruning and Its Extension: A Unified Framework for Model Compression.

Guo, Jinyang; Xu, Dong; Ouyang, Wanli.

IEEE Trans Neural Netw Learn Syst ; PP2023 May 23.

Artigo em Inglês | MEDLINE | ID: mdl-37220047

RESUMO

Observing that the existing model compression approaches only focus on reducing the redundancies in convolutional neural networks (CNNs) along one particular dimension (e.g., the channel or spatial or temporal dimension), in this work, we propose our multidimensional pruning (MDP) framework, which can compress both 2-D CNNs and 3-D CNNs along multiple dimensions in an end-to-end fashion. Specifically, MDP indicates the simultaneous reduction of channels and more redundancy on other additional dimensions. The redundancy of additional dimensions depends on the input data, i.e., spatial dimension for 2-D CNNs when using images as the input data, and spatial and temporal dimensions for 3-D CNNs when using videos as the input data. We further extend our MDP framework to the MDP-Point approach for compressing point cloud neural networks (PCNNs) whose inputs are irregular point clouds (e.g., PointNet). In this case, the redundancy along the additional dimension indicates the point dimension (i.e., the number of points). Comprehensive experiments on six benchmark datasets demonstrate the effectiveness of our MDP framework and its extended version MDP-Point for compressing CNNs and PCNNs, respectively.

13.

Weakly Supervised Semantic Segmentation via Box-Driven Masking and Filling Rate Shifting.

Song, Chunfeng; Ouyang, Wanli; Zhang, Zhaoxiang.

IEEE Trans Pattern Anal Mach Intell ; 45(12): 15996-16012, 2023 Dec.

Artigo em Inglês | MEDLINE | ID: mdl-37531304

RESUMO

Semantic segmentation has achieved huge progress via adopting deep Fully Convolutional Networks (FCN). However, the performance of FCN-based models severely rely on the amounts of pixel-level annotations which are expensive and time-consuming. Considering that bounding boxes also contain abundant semantic and objective information, an intuitive solution is to learn the segmentation with weak supervisions from the bounding boxes. How to make full use of the class-level and region-level supervisions from bounding boxes to estimate the uncertain regions is the critical challenge for the weakly supervised learning task. In this paper, we propose a mixture model to address this problem. First, we introduce a box-driven class-wise masking model (BCM) to remove irrelevant regions of each class. Moreover, based on the pixel-level segment proposal generated from the bounding box supervision, we calculate the mean filling rates of each class to serve as an important prior cue to guide the model ignoring the wrongly labeled pixels in proposals. To realize the more fine-grained supervision at instance-level, we further propose the anchor-based filling rate shifting module. Unlike previous methods that directly train models with the generated noisy proposals, our method can adjust the model learning dynamically with the adaptive segmentation loss. Thus it can help reduce the negative impacts from wrongly labeled proposals. Besides, based on the learned high-quality proposals with above pipeline, we explore to further boost the performance through two-stage learning. The proposed method is evaluated on the challenging PASCAL VOC 2012 benchmark and achieves 74.9 % and 76.4 % mean IoU accuracy under weakly and semi-supervised modes, respectively. Extensive experimental results show that the proposed method is effective and is on par with, or even better than current state-of-the-art methods.

14.

Identifying featured indels associated with SARS-CoV-2 fitness.

Li, Xiang; Yan, Hongliang; Wong, Gary; Ouyang, Wanli; Cui, Jie.

Microbiol Spectr ; : e0226923, 2023 Sep 12.

Artigo em Inglês | MEDLINE | ID: mdl-37698427

RESUMO

As an RNA virus, severe acute respiratory coronavirus 2 (SARS-CoV-2) is known for frequent substitution mutations, and substitutions in important genome regions are often associated with viral fitness. However, whether indel mutations are related to viral fitness is generally ignored. Here we developed a computational methodology to investigate indels linked to fitness occurring in over 9 million SARS-CoV-2 genomes. Remarkably, by analyzing 31,642,404 deletion records and 1,981,308 insertion records, our pipeline identified 26,765 deletion types and 21,054 insertion types and discovered 65 indel types with a significant association with Pango lineages. We proposed the concept of featured indels representing the population of specific Pango lineages and variants as substitution mutations and termed these 65 indels as featured indels. The selective pressure of all indel types is assessed using the Bayesian model to explore the importance of indels. Our results exhibited higher selective pressure of indels like substitution mutations, which are important for assessing viral fitness and consistent with previous studies in vitro. Evaluation of the growth rate of each viral lineage indicated that indels play key roles in SARS-CoV-2 evolution and deserve more attention as substitution mutations. IMPORTANCE The fitness of indels in pathogen genome evolution has rarely been studied. We developed a computational methodology to investigate the severe acute respiratory coronavirus 2 genomes and analyze over 33 million records of indels systematically, ultimately proposing the concept of featured indels that can represent specific Pango lineages and identifying 65 featured indels. Machine learning model based on Bayesian inference and viral lineage growth rate evaluation suggests that these featured indels exhibit selection pressure comparable to replacement mutations. In conclusion, indels are not negligible for evaluating viral fitness.

15.

OTP-NMS: Toward Optimal Threshold Prediction of NMS for Crowded Pedestrian Detection.

Tang, Yi; Liu, Min; Li, Baopu; Wang, Yaonan; Ouyang, Wanli.

IEEE Trans Image Process ; 32: 3176-3187, 2023.

Artigo em Inglês | MEDLINE | ID: mdl-37204946

RESUMO

Pedestrian detection is still a challenging task for computer vision, especially in crowded scenes where the overlaps between pedestrians tend to be large. The non-maximum suppression (NMS) plays an important role in removing the redundant false positive detection proposals while retaining the true positive detection proposals. However, the highly overlapped results may be suppressed if the threshold of NMS is lower. Meanwhile, a higher threshold of NMS will introduce a larger number of false positive results. To solve this problem, we propose an optimal threshold prediction (OTP) based NMS method that predicts a suitable threshold of NMS for each human instance. First, a visibility estimation module is designed to obtain the visibility ratio. Then, we propose a threshold prediction subnet to determine the optimal threshold of NMS automatically according to the visibility ratio and classification score. Finally, we re-formulate the objective function of the subnet and utilize the reward-guided gradient estimation algorithm to update the subnet. Comprehensive experiments on CrowdHuman and CityPersons show the superior performance of the proposed method in pedestrian detection, especially in crowded scenes.

16.

TransVG++: End-to-End Visual Grounding With Language Conditioned Vision Transformer.

Deng, Jiajun; Yang, Zhengyuan; Liu, Daqing; Chen, Tianlang; Zhou, Wengang; Zhang, Yanyong; Li, Houqiang; Ouyang, Wanli.

IEEE Trans Pattern Anal Mach Intell ; 45(11): 13636-13652, 2023 Nov.

Artigo em Inglês | MEDLINE | ID: mdl-37467085

RESUMO

In this work, we explore neat yet effective Transformer-based frameworks for visual grounding. The previous methods generally address the core problem of visual grounding, i.e., multi-modal fusion and reasoning, with manually-designed mechanisms. Such heuristic designs are not only complicated but also make models easily overfit specific data distributions. To avoid this, we first propose TransVG, which establishes multi-modal correspondences by Transformers and localizes referred regions by directly regressing box coordinates. We empirically show that complicated fusion modules can be replaced by a simple stack of Transformer encoder layers with higher performance. However, the core fusion Transformer in TransVG is stand-alone against uni-modal encoders, and thus should be trained from scratch on limited visual grounding data, which makes it hard to be optimized and leads to sub-optimal performance. To this end, we further introduce TransVG++ to make two-fold improvements. For one thing, we upgrade our framework to a purely Transformer-based one by leveraging Vision Transformer (ViT) for vision feature encoding. For another, we devise Language Conditioned Vision Transformer that removes external fusion modules and reuses the uni-modal ViT for vision-language fusion at the intermediate layers. We conduct extensive experiments on five prevalent datasets, and report a series of state-of-the-art records.

17.

The Equalization Losses: Gradient-Driven Training for Long-tailed Object Recognition.

Tan, Jingru; Li, Bo; Lu, Xin; Yao, Yongqiang; Yu, Fengwei; He, Tong; Ouyang, Wanli.

IEEE Trans Pattern Anal Mach Intell ; 45(11): 13876-13892, 2023 Nov.

Artigo em Inglês | MEDLINE | ID: mdl-37486845

RESUMO

Long-tail distribution is widely spread in real-world applications. Due to the extremely small ratio of instances, tail categories often show inferior accuracy. In this paper, we find such performance bottleneck is mainly caused by the imbalanced gradients, which can be categorized into two parts: (1) positive part, deriving from the samples of the same category, and (2) negative part, contributed by other categories. Based on comprehensive experiments, it is also observed that the gradient ratio of accumulated positives to negatives is a good indicator to measure how balanced a category is trained. Inspired by this, we come up with a gradient-driven training mechanism to tackle the long-tail problem: re-balancing the positive/negative gradients dynamically according to current accumulative gradients, with a unified goal of achieving balance gradient ratios. Taking advantage of the simple and flexible gradient mechanism, we introduce a new family of gradient-driven loss functions, namely equalization losses. We conduct extensive experiments on a wide spectrum of visual tasks, including two-stage/single-stage long-tailed object detection (LVIS), long-tailed image classification (ImageNet-LT, Places-LT, iNaturalist), and long-tailed semantic segmentation (ADE20 K). Our method consistently outperforms the baseline models, demonstrating the effectiveness and generalization ability of the proposed equalization losses.

18.

Towards Trajectory Forecasting From Detection.

Zhang, Pu; Bai, Lei; Wang, Yuning; Fang, Jianwu; Xue, Jianru; Zheng, Nanning; Ouyang, Wanli.

IEEE Trans Pattern Anal Mach Intell ; 45(10): 12550-12561, 2023 Oct.

Artigo em Inglês | MEDLINE | ID: mdl-37159310

RESUMO

Trajectory forecasting for traffic participants (e.g., vehicles) is critical for autonomous platforms to make safe plans. Currently, most trajectory forecasting methods assume that object trajectories have been extracted and directly develop trajectory predictors based on the ground truth trajectories. However, this assumption does not hold in practical situations. Trajectories obtained from object detection and tracking are inevitably noisy, which could cause serious forecasting errors to predictors built on ground truth trajectories. In this paper, we propose to predict trajectories directly based on detection results without relying on explicitly formed trajectories. Different from traditional methods which encode the motion cues of an agent based on its clearly defined trajectory, we extract the motion information only based on the affinity cues among detection results, in which an affinity-aware state update mechanism is designed to manage the state information. In addition, considering that there could be multiple plausible matching candidates, we aggregate the states of them. These designs take the uncertainty of association into account which relax the undesirable effect of noisy trajectory obtained from data association and improve the robustness of the predictor. Extensive experiments validate the effectiveness of our method and its generalization ability to different detectors or forecasting schemes.

19.

ZoomNAS: Searching for Whole-Body Human Pose Estimation in the Wild.

Xu, Lumin; Jin, Sheng; Liu, Wentao; Qian, Chen; Ouyang, Wanli; Luo, Ping; Wang, Xiaogang.

IEEE Trans Pattern Anal Mach Intell ; 45(4): 5296-5313, 2023 04.

Artigo em Inglês | MEDLINE | ID: mdl-35939471

RESUMO

This paper investigates the task of 2D whole-body human pose estimation, which aims to localize dense landmarks on the entire human body including body, feet, face, and hands. We propose a single-network approach, termed ZoomNet, to take into account the hierarchical structure of the full human body and solve the scale variation of different body parts. We further propose a neural architecture search framework, termed ZoomNAS, to promote both the accuracy and efficiency of whole-body pose estimation. ZoomNAS jointly searches the model architecture and the connections between different sub-modules, and automatically allocates computational complexity for searched sub-modules. To train and evaluate ZoomNAS, we introduce the first large-scale 2D human whole-body dataset, namely COCO-WholeBody V1.0, which annotates 133 keypoints for in-the-wild images. Extensive experiments demonstrate the effectiveness of ZoomNAS and the significance of COCO-WholeBody V1.0.

Assuntos

Algoritmos , Corpo Humano , Humanos

20.

Learning from pseudo-labels: deep networks improve consistency in longitudinal brain volume estimation.

Zhan, Geng; Wang, Dongang; Cabezas, Mariano; Bai, Lei; Kyle, Kain; Ouyang, Wanli; Barnett, Michael; Wang, Chenyu.

Front Neurosci ; 17: 1196087, 2023.

Artigo em Inglês | MEDLINE | ID: mdl-37483345

RESUMO

Introduction: Brain atrophy is a critical biomarker of disease progression and treatment response in neurodegenerative diseases such as multiple sclerosis (MS). Confounding factors such as inconsistent imaging acquisitions hamper the accurate measurement of brain atrophy in the clinic. This study aims to develop and validate a robust deep learning model to overcome these challenges; and to evaluate its impact on the measurement of disease progression. Methods: Voxel-wise pseudo-atrophy labels were generated using SIENA, a widely adopted tool for the measurement of brain atrophy in MS. Deformation maps were produced for 195 pairs of longitudinal 3D T1 scans from patients with MS. A 3D U-Net, namely DeepBVC, was specifically developed overcome common variances in resolution, signal-to-noise ratio and contrast ratio between baseline and follow up scans. The performance of DeepBVC was compared against SIENA using McLaren test-retest dataset and 233 in-house MS subjects with MRI from multiple time points. Clinical evaluation included disability assessment with the Expanded Disability Status Scale (EDSS) and traditional imaging metrics such as lesion burden. Results: For 3 subjects in test-retest experiments, the median percent brain volume change (PBVC) for DeepBVC and SIENA was 0.105 vs. 0.198% (subject 1), 0.061 vs. 0.084% (subject 2), 0.104 vs. 0.408% (subject 3). For testing consistency across multiple time points in individual MS subjects, the mean (± standard deviation) PBVC difference of DeepBVC and SIENA were 0.028% (± 0.145%) and 0.031% (±0.154%), respectively. The linear correlation with baseline T2 lesion volume were r = -0.288 (p < 0.05) and r = -0.249 (p < 0.05) for DeepBVC and SIENA, respectively. There was no significant correlation of disability progression with PBVC as estimated by either method (p = 0.86, p = 0.84). Discussion: DeepBVC is a deep learning powered brain volume change estimation method for assessing brain atrophy used T1-weighted images. Compared to SIENA, DeepBVC demonstrates superior performance in reproducibility and in the context of common clinical scan variances such as imaging contrast, voxel resolution, random bias field, and signal-to-noise ratio. Enhanced measurement robustness, automation, and processing speed of DeepBVC indicate its potential for utilisation in both research and clinical environments for monitoring disease progression and, potentially, evaluating treatment effectiveness.

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA