Results 1 - 20 of 52
1.
Neural Netw ; 177: 106382, 2024 May 09.
Article in English | MEDLINE | ID: mdl-38761416

ABSTRACT

Occluded person re-identification (Re-ID) is a challenging task, as pedestrians are often obstructed by various occlusions, such as non-pedestrian objects or non-target pedestrians. Previous methods have relied heavily on auxiliary models, such as human pose estimation, to obtain information from unoccluded regions. However, these auxiliary models do not account for pedestrian occlusions themselves, leading to potential misrepresentations. In addition, some previous works learned feature representations from single images, ignoring the potential relations among samples. To address these issues, this paper introduces a Multi-Level Relation-Aware Transformer (MLRAT) model for occluded person Re-ID. The model encompasses two novel modules: Patch-Level Relation-Aware (PLRA) and Sample-Level Relation-Aware (SLRA). PLRA learns fine-grained local features by modeling the structural relations between key patches, bypassing the dependency on auxiliary models. It adopts a model-free method to select key patches that have high semantic correlation with the final pedestrian representation. In particular, to alleviate the interference of occlusion, PLRA captures the structural relations among key patches via a two-layer Graph Convolution Network (GCN), effectively guiding local feature fusion and learning. SLRA is designed to help the model learn discriminative features by modeling the relations among samples. Specifically, to mitigate the noisy relations of irrelevant samples, we present a Relation-Aware Transformer (RAT) block to capture the relations among neighbors. Furthermore, to bridge the gap between the training and testing phases, a self-distillation method is employed to transfer the sample-level relations captured by SLRA to the backbone. Extensive experiments are conducted on four occluded datasets, two partial datasets and two holistic datasets. The results show that the proposed MLRAT model significantly outperforms existing baselines on the four occluded datasets, while maintaining top performance on the two partial and two holistic datasets.
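As a rough illustration of the PLRA idea, a two-layer GCN over selected key patches might look like the sketch below. This is a hypothetical minimal version, not the authors' code; in particular, building the adjacency from cosine similarity between patch tokens is our assumption:

```python
import torch
import torch.nn as nn

class TwoLayerGCN(nn.Module):
    """Minimal two-layer GCN over selected key patches (hypothetical sketch)."""
    def __init__(self, dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, dim)
        self.w2 = nn.Linear(dim, dim)

    def forward(self, patches: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # patches: (N, D) key-patch features; adj: (N, N) row-normalized affinity.
        h = torch.relu(self.w1(adj @ patches))   # first propagation layer
        return self.w2(adj @ h)                  # second layer fuses neighborhood context

# Toy usage: affinity from cosine similarity between patch tokens (our assumption).
patches = torch.randn(6, 256)                    # 6 key patches selected upstream
sim = torch.cosine_similarity(patches.unsqueeze(1), patches.unsqueeze(0), dim=-1)
adj = torch.softmax(sim, dim=-1)                 # row-normalize so propagation averages
fused = TwoLayerGCN(256)(patches, adj)           # occlusion-robust local features
```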

2.
Int J Biol Macromol ; 268(Pt 1): 131729, 2024 May.
Article in English | MEDLINE | ID: mdl-38653429

ABSTRACT

In this work, various characterization techniques were employed to probe the dissociation mechanism of cellulose in the N,N-dimethylacetamide/lithium chloride (DMAc/LiCl) system. The results indicate that coordination of DMAc ligands to the Li+-Cl- ion pair results in the formation of a series of Lix(DMAc)yClz (x = 1, 2; y = 1, 2, 3, 4; z = 1, 2) complexes. Analysis of the interaction between the DMAc ligand and the Li center indicates that the Li bond plays a major role in the formation of these Lix(DMAc)yClz complexes, and that the saturation and directionality of the Li bond give these complexes a tetrahedral structure. The hydrogen bonds between two cellulose chains can be broken at the non-reducing end of the cellulose molecule via the combined effects of the basicity of the Cl- ion and the steric hindrance of the [Li(DMAc)4]+ unit. The unique features of the Li bond in the Lix(DMAc)yClz complexes are thus a key factor in determining the dissociation mechanism.


Subject(s)
Acetamides , Cellulose , Lithium Chloride , Cellulose/chemistry , Acetamides/chemistry , Lithium Chloride/chemistry , Lithium/chemistry , Hydrogen Bonding
3.
Article in English | MEDLINE | ID: mdl-38683711

ABSTRACT

Person Re-identification (ReID) has been extensively developed for a decade in order to learn the association of images of the same person across non-overlapping camera views. To overcome significant variations between images across camera views, numerous variants of ReID models have been developed to solve specific challenges, such as resolution change, clothing change, occlusion, and modality change. Despite the impressive performance of many ReID variants, they typically function distinctly and cannot be applied to other challenges. To the best of our knowledge, there is no versatile ReID model that can handle various ReID challenges at the same time. This work makes the first attempt at learning such a versatile ReID model. Our main idea is to form a two-stage prompt-based twin modeling framework called VersReID. VersReID first leverages the scene label to train a ReID Bank that contains abundant knowledge for handling various scenes, where several groups of scene-specific prompts are used to encode different scene-specific knowledge. In the second stage, we distill a V-Branch model with versatile prompts from the ReID Bank for adaptively solving the ReID of different scenes, eliminating the need for scene labels during the inference stage. To facilitate training VersReID, we further introduce multi-scene properties into self-supervised learning of ReID via a multi-scene prioris data augmentation (MPDA) strategy. Through extensive experiments, we demonstrate the success of learning an effective and versatile ReID model for handling ReID tasks under multi-scene conditions without manual assignment of scene labels in the inference stage, including general, low-resolution, clothing change, occlusion, and cross-modality scenes. Codes and models will be made publicly available.
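To make the prompt mechanism concrete, here is a minimal sketch of how scene-specific prompt groups could be prepended to a ViT token sequence in the first stage. All shapes and names are illustrative assumptions, not VersReID's actual implementation:

```python
import torch
import torch.nn as nn

class ScenePromptBank(nn.Module):
    """Hypothetical sketch: groups of learnable prompts, one group per scene."""
    def __init__(self, num_scenes=5, prompts_per_scene=4, dim=768):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_scenes, prompts_per_scene, dim) * 0.02)

    def forward(self, tokens: torch.Tensor, scene_id: int) -> torch.Tensor:
        # tokens: (B, L, D) patch tokens; prepend the selected scene's prompt group.
        group = self.prompts[scene_id].unsqueeze(0).expand(tokens.size(0), -1, -1)
        return torch.cat([group, tokens], dim=1)  # (B, P+L, D), fed to the shared backbone

bank = ScenePromptBank()
tokens = torch.randn(2, 196, 768)                 # e.g., ViT patch tokens
out = bank(tokens, scene_id=3)                    # scene label selects the prompt group
```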

4.
IEEE Trans Image Process ; 33: 1600-1613, 2024.
Article in English | MEDLINE | ID: mdl-38373124

ABSTRACT

Action quality assessment (AQA) aims to assess how well an action is performed. Previous works model only visual information, ignoring audio. We argue that although AQA is highly dependent on visual information, audio is useful complementary information for improving score regression accuracy, especially for sports with background music, such as figure skating and rhythmic gymnastics. To leverage multimodal information for AQA, i.e., RGB, optical flow and audio information, we propose a Progressive Adaptive Multimodal Fusion Network (PAMFN) that separately models modality-specific information and mixed-modality information. Our model consists of three modality-specific branches that independently explore modality-specific information and a mixed-modality branch that progressively aggregates the modality-specific information from the modality-specific branches. To build the bridge between the modality-specific branches and the mixed-modality branch, three novel modules are proposed. First, a Modality-specific Feature Decoder module is designed to selectively transfer modality-specific information to the mixed-modality branch. Second, when exploring the interaction between modality-specific information, we argue that using an invariant multimodal fusion policy may lead to suboptimal results, since it fails to account for the potential diversity in different parts of an action. Therefore, an Adaptive Fusion Module is proposed to learn adaptive multimodal fusion policies in different parts of an action. This module consists of several FusionNets for exploring different multimodal fusion strategies and a PolicyNet for deciding which FusionNets are enabled. Third, a module called Cross-modal Feature Decoder is designed to transfer the cross-modal features generated by the Adaptive Fusion Module to the mixed-modality branch. Our extensive experiments validate the efficacy of the proposed method, and our method achieves state-of-the-art performance on two public datasets. Code is available at https://github.com/qinghuannn/PAMFN.
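The Adaptive Fusion Module can be pictured as several candidate fusion networks gated by a policy. The sketch below is a hypothetical simplification that relaxes the PolicyNet's decision to a soft softmax gate; the module names and dimensions are assumptions:

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Hypothetical sketch of policy-gated fusion over several candidate FusionNets."""
    def __init__(self, dim=512, num_fusions=3):
        super().__init__()
        # Each FusionNet mixes the three modality features differently.
        self.fusions = nn.ModuleList(
            nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(num_fusions)
        )
        self.policy = nn.Linear(3 * dim, num_fusions)   # decides which FusionNets to enable

    def forward(self, rgb, flow, audio):
        x = torch.cat([rgb, flow, audio], dim=-1)       # (B, 3*D)
        gate = torch.softmax(self.policy(x), dim=-1)    # soft relaxation of a hard policy
        outs = torch.stack([f(x) for f in self.fusions], dim=1)  # (B, K, D)
        return (gate.unsqueeze(-1) * outs).sum(dim=1)   # policy-weighted fusion

fusion = AdaptiveFusion()
y = fusion(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512))
```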


Subject(s)
Image Interpretation, Computer-Assisted , Machine Learning
5.
IEEE Trans Pattern Anal Mach Intell ; 46(6): 4188-4205, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38227419

ABSTRACT

Existing studies on knowledge distillation typically focus on teacher-centered methods, in which the teacher network is trained according to its own standards before transferring the learned knowledge to the student. However, due to differences in network structure between the teacher and the student, the knowledge learned by the former may not be what the latter needs. Inspired by human educational wisdom, this paper proposes a Student-Centered Distillation (SCD) method that enables the teacher network to adjust its knowledge transfer according to the student network's needs. We implement SCD based on various forms of human educational wisdom; for example, the teacher network identifies and learns the knowledge desired by the student network on the validation set and then transfers it to the latter through the training set. To address the problems of knowledge deficiency, hard-sample learning and knowledge forgetting that a student network faces during learning, we introduce and adapt Proportional-Integral-Derivative (PID) controllers from the field of automation to identify the knowledge currently required by the student network. Furthermore, we propose a curriculum-learning-based fuzzy strategy and apply it to the proposed PID control algorithm, so that the student network in SCD can actively focus on learning challenging samples once it has acquired sufficient knowledge. The overall performance of SCD is verified on multiple tasks by comparing it with state-of-the-art methods. Experimental results show that our student-centered distillation method outperforms existing teacher-centered ones.
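As a sketch of how a PID controller might drive sample weighting during distillation, consider the following. This is our hypothetical rendering, not the paper's algorithm; note it assumes the same samples appear in the same order across calls, which in practice would be handled by tracking sample indices:

```python
import torch

class PIDWeighter:
    """Hypothetical sketch: PID control over per-sample distillation error."""
    def __init__(self, kp=1.0, ki=0.1, kd=0.5):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = None
        self.prev_err = None

    def weights(self, student_loss: torch.Tensor) -> torch.Tensor:
        # student_loss: (B,) per-sample loss, used as the "error" signal.
        err = student_loss.detach()
        if self.integral is None:
            self.integral = torch.zeros_like(err)
            self.prev_err = torch.zeros_like(err)
        self.integral += err                       # I-term: accumulated deficiency
        deriv = err - self.prev_err                # D-term: anticipates forgetting trends
        self.prev_err = err
        raw = self.kp * err + self.ki * self.integral + self.kd * deriv
        return torch.softmax(raw, dim=0)           # normalized attention over samples

pid = PIDWeighter()
w = pid.weights(torch.rand(8))                     # larger weight -> teacher focuses on it
```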


Subject(s)
Algorithms , Students , Humans , Machine Learning , Fuzzy Logic , Knowledge
6.
IEEE Trans Pattern Anal Mach Intell ; 46(5): 2692-2708, 2024 May.
Article in English | MEDLINE | ID: mdl-37922161

ABSTRACT

Person re-identification (Re-ID) is a fundamental task in visual surveillance. Given a query image of the target person, conventional Re-ID focuses on the pairwise similarities between the candidate images and the query. However, conventional Re-ID does not evaluate whether the retrieval results are consistent, i.e., whether the top-ranked images under every camera view contain the same person. This is risky in some applications: missing a place that a patient passed through, for example, will hinder an epidemiological investigation. In this work, we investigate a more challenging task: consistently and successfully retrieving the target person in all camera views. We define this task as continuous person Re-ID and propose a corresponding evaluation metric termed overall Rank-K accuracy. Different from conventional Re-ID, any incorrect retrieval under an individual camera view that raises an inconsistency will fail the continuous Re-ID. Consequently, defective cameras, whose images are hard to associate automatically with images from other views, strongly degrade the performance of continuous person Re-ID. Since camera deployment is crucial for continuous tracking across camera views, we rethink person Re-ID from the perspective of camera deployment and assess the quality of a camera network by performing continuous Re-ID. Moreover, we propose to automatically detect the defective cameras that greatly hamper continuous Re-ID. Because brute-force search is costly when the camera network becomes complicated, we explicitly model the visual relations as well as the spatial relations among cameras and develop a relational deep Q-network to select the properly deployed cameras; the un-selected cameras are regarded as defective. Since most existing datasets do not provide topology information about the camera network, they are unsuitable for investigating the importance of spatial relations in camera selection. Thus, we collect a new dataset including 20 cameras with topology information. Compared with randomly removing cameras, the experimental results show that our method can effectively detect the defective cameras, so that practitioners can take further action on these cameras (https://www.isee-ai.cn/∼yixing/MCCPD.html).
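The overall Rank-K metric can be stated compactly: a query counts as a success only if the target identity is retrieved within the top K under every camera view. A minimal sketch of this idea (our own rendering, with hypothetical data):

```python
def overall_rank_k(per_view_rankings, target_id, k=5):
    """Hedged sketch of overall Rank-K: the query succeeds only if the target
    identity appears in the top-k of *every* camera view.

    per_view_rankings: dict mapping camera id -> list of gallery identity ids,
    sorted by similarity to the query (most similar first).
    """
    return all(target_id in ranking[:k] for ranking in per_view_rankings.values())

# Toy example with three camera views:
rankings = {
    "cam_a": [7, 3, 9, 1, 4],
    "cam_b": [2, 7, 5, 8, 6],
    "cam_c": [7, 1, 0, 2, 3],
}
print(overall_rank_k(rankings, target_id=7, k=3))  # True: in top-3 of every view
```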

7.
IEEE Trans Pattern Anal Mach Intell ; 45(12): 15512-15529, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37410652

ABSTRACT

Semi-supervised person re-identification (Re-ID) is an important approach for alleviating annotation costs when learning to match person images across camera views. Most existing works assume that training data contains abundant identities crossing camera views. However, this assumption does not hold in many real-world applications, especially when images are captured in nonadjacent scenes for Re-ID over wider areas, where identities rarely cross camera views. In this work, we operate semi-supervised Re-ID under the relaxed assumption that identities rarely cross camera views, which is still largely ignored in existing methods. Since identities rarely cross camera views, the underlying sample relations across camera views become much more uncertain, which exacerbates the noise accumulation problem in many advanced Re-ID methods that apply pseudo-labeling to associate visually similar samples. To quantify such uncertainty, we parameterize the probabilistic relations between samples in a relation discovery objective for pseudo-label training. We then introduce a reward, quantified by identification performance on a small amount of labeled data, to guide the learning of dynamic relations between samples and reduce uncertainty. We call this strategy Rewarded Relation Discovery (R2D); its rewarded learning paradigm is under-explored in existing pseudo-labeling methods. To further reduce the uncertainty in sample relations, we learn multiple relation discovery objectives that discover probabilistic relations based on different prior knowledge of intra-camera affinity and cross-camera style variation, and fuse the complementary knowledge of these probabilistic relations by similarity distillation. To better evaluate semi-supervised Re-ID on identities that rarely cross camera views, we collect a new real-world dataset called REID-CBD and perform simulation on benchmark datasets. Experimental results show that our method outperforms a wide range of semi-supervised and unsupervised learning methods.
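As an illustration of the similarity-distillation step, the sketch below fuses two probabilistic relation estimates and distills them into the backbone's own sample-similarity distribution via a KL divergence. The averaging fusion and the temperature value are assumptions on our part:

```python
import torch
import torch.nn.functional as F

def similarity_distillation(student_feats, teacher_probs_a, teacher_probs_b, tau=0.1):
    """Hypothetical sketch: fuse two probabilistic relation estimates (e.g., from
    intra-camera affinity and cross-camera style priors) and distill them into
    the backbone's sample-similarity distribution."""
    f = F.normalize(student_feats, dim=1)
    sim = f @ f.t()                                     # (N, N) pairwise similarities
    student_log = F.log_softmax(sim / tau, dim=1)       # student's relation distribution
    fused = 0.5 * (teacher_probs_a + teacher_probs_b)   # complementary knowledge fusion
    return F.kl_div(student_log, fused, reduction="batchmean")

feats = torch.randn(16, 128, requires_grad=True)
pa = torch.softmax(torch.randn(16, 16), dim=1)
pb = torch.softmax(torch.randn(16, 16), dim=1)
similarity_distillation(feats, pa, pb).backward()
```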

8.
IEEE Trans Image Process ; 32: 3806-3820, 2023.
Article in English | MEDLINE | ID: mdl-37418403

ABSTRACT

We are concerned with retrieving a query person from multiple videos captured by a non-overlapping camera network. Existing methods often rely on purely visual matching or consider temporal constraints but ignore the spatial information of the camera network. To address this issue, we propose a pedestrian retrieval framework based on cross-camera trajectory generation that integrates both temporal and spatial information. To obtain pedestrian trajectories, we propose a novel cross-camera spatio-temporal model that integrates pedestrians' walking habits and the path layout between cameras to form a joint probability distribution. Such a cross-camera spatio-temporal model can be specified using sparsely sampled pedestrian data. Based on the spatio-temporal model, cross-camera trajectories can be extracted by the conditional random field model and further optimised by restricted non-negative matrix factorization. Finally, a trajectory re-ranking technique is proposed to improve the pedestrian retrieval results. To verify the effectiveness of our method, we construct the first cross-camera pedestrian trajectory dataset, the Person Trajectory Dataset, in real surveillance scenarios. Extensive experiments verify the effectiveness and robustness of the proposed method.
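To convey the spatio-temporal idea, the toy function below scores how plausible a cross-camera transition is, combining the path length between cameras with a Gaussian prior on walking speed. The functional form and all parameter values are assumptions for illustration, not the paper's fitted model:

```python
import math

def transition_probability(t1, t2, path_length_m, speed_mean=1.4, speed_std=0.3):
    """Hedged sketch of a cross-camera spatio-temporal score: how plausible is it
    that the same pedestrian produced two observations, given the path layout
    between the cameras and a Gaussian prior on walking speed?"""
    dt = t2 - t1
    if dt <= 0:
        return 0.0
    implied_speed = path_length_m / dt               # speed required to make the transit
    z = (implied_speed - speed_mean) / speed_std
    return math.exp(-0.5 * z * z)                    # unnormalized Gaussian plausibility

# A 70 m path walked in 50 s implies 1.4 m/s -- highly plausible:
print(transition_probability(0.0, 50.0, 70.0))       # ~1.0
print(transition_probability(0.0, 10.0, 70.0))       # implausible 7 m/s sprint
```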

9.
IEEE Trans Pattern Anal Mach Intell ; 45(9): 11120-11135, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37027255

ABSTRACT

Vision Transformer (ViT) has shown great potential for various visual tasks due to its ability to model long-range dependency. However, ViT requires a large amount of computing resources to compute the global self-attention. In this work, we propose a ladder self-attention block with multiple branches and a progressive shift mechanism to develop a light-weight transformer backbone that requires fewer computing resources (e.g., a relatively small number of parameters and FLOPs), termed Progressive Shift Ladder Transformer (PSLT). First, the ladder self-attention block reduces the computational cost by modelling local self-attention in each branch. Meanwhile, the progressive shift mechanism enlarges the receptive field in the ladder self-attention block by modelling diverse local self-attention in each branch and enabling interaction among these branches. Second, the input feature of the ladder self-attention block is split equally along the channel dimension across branches, which considerably reduces the computational cost of the block (with nearly [Formula: see text] the amount of parameters and FLOPs), and the outputs of these branches are then combined by a pixel-adaptive fusion. Therefore, the ladder self-attention block, with a relatively small number of parameters and FLOPs, is capable of modelling long-range interactions. Based on the ladder self-attention block, PSLT performs well on several vision tasks, including image classification, object detection and person re-identification. On the ImageNet-1k dataset, PSLT achieves a top-1 accuracy of 79.9% with 9.2M parameters and 1.9G FLOPs, which is comparable to several existing models with more than 20M parameters and 4G FLOPs. Code is available at https://isee-ai.cn/wugaojie/PSLT.html.
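A heavily simplified sketch of the ladder idea follows: channels are split equally across branches, each branch is progressively shifted so its local operation covers a different region, and a pixel-wise layer fuses the branches. Note the real block uses windowed self-attention per branch; a depthwise convolution stands in here purely to keep the sketch short:

```python
import torch
import torch.nn as nn

class LadderBlockSketch(nn.Module):
    """Hypothetical simplification of the ladder self-attention block."""
    def __init__(self, dim=64, branches=4, shift=1):
        super().__init__()
        assert dim % branches == 0
        self.branches, self.shift = branches, shift
        self.local = nn.ModuleList(
            nn.Conv2d(dim // branches, dim // branches, 3, padding=1, groups=dim // branches)
            for _ in range(branches)                  # stand-in for local self-attention
        )
        self.fuse = nn.Conv2d(dim, dim, 1)            # pixel-wise fusion across branches

    def forward(self, x):                             # x: (B, C, H, W)
        chunks = x.chunk(self.branches, dim=1)        # equal channel split -> cheap branches
        outs = []
        for i, (chunk, op) in enumerate(zip(chunks, self.local)):
            shifted = torch.roll(chunk, shifts=i * self.shift, dims=-1)  # progressive shift
            outs.append(op(shifted))
        return self.fuse(torch.cat(outs, dim=1))      # interaction among branches

y = LadderBlockSketch()(torch.randn(2, 64, 14, 14))
```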

10.
Article in English | MEDLINE | ID: mdl-37030682

ABSTRACT

In this work, we investigate online multi-view learning according to the multi-view complementarity and consistency principles, so that online multi-view data can be processed with memory when fused across views. Diverse online features, produced by different deep feature extractors under different views, are used as input to an online learning method that optimizes each view privately and with memory, in order to discover and memorize view-specific information. More specifically, according to the multi-view complementarity principle, a softmax-weighted reducible (SWR) loss is proposed to selectively retain credible views and neglect unreliable ones in the online model's cross-view complementary fusion. According to the multi-view consistency principle, we design a cross-view embedding consistency (CVEC) loss and a cross-view Kullback-Leibler (CVKL) divergence loss to maintain the cross-view consistency of the online model. Since the online multi-view learning setup must avoid repeatedly accessing online data, to handle knowledge forgetting in each view we propose a knowledge registration unit (KRU) based on dictionary learning, which incrementally registers the newly arrived view-specific knowledge of online unlabeled data to a learnable and adjustable dictionary. Finally, combining the above strategies, we propose an online multi-view KRU approach and evaluate it with comprehensive experiments, demonstrating its superiority in online multi-view learning.
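As one plausible reading of the SWR loss, the sketch below treats views with lower loss as more credible, drops the least credible views, and fuses the rest with softmax weights. The top-k retention rule is our assumption:

```python
import torch

def swr_loss(view_losses: torch.Tensor, keep_ratio=0.75):
    """Hypothetical sketch of a softmax-weighted reducible loss: views with lower
    loss are treated as more credible and weighted up; the least credible views
    are dropped ("reduced") before fusion."""
    cred = torch.softmax(-view_losses.detach(), dim=0)   # low loss -> high credibility
    k = max(1, int(keep_ratio * view_losses.numel()))
    keep = torch.topk(cred, k).indices                   # retain credible views only
    w = cred[keep] / cred[keep].sum()                    # renormalize over kept views
    return (w * view_losses[keep]).sum()

loss = swr_loss(torch.tensor([0.2, 0.4, 2.5, 0.3], requires_grad=True))
loss.backward()                                          # the 2.5 outlier view is dropped
```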

11.
Article in English | MEDLINE | ID: mdl-37022228

ABSTRACT

Imbalanced training data in medical image diagnosis is a significant challenge for diagnosing rare diseases. To address it, we propose a novel two-stage Progressive Class-Center Triplet (PCCT) framework to overcome the class imbalance issue. In the first stage, PCCT designs a class-balanced triplet loss to coarsely separate the distributions of different classes. Triplets are sampled equally for each class at each training iteration, which alleviates the imbalanced-data issue and lays a solid foundation for the successive stage. In the second stage, PCCT further designs a class-center-involved triplet strategy to enable a more compact distribution for each class. The positive and negative samples in each triplet are replaced by their corresponding class centers, which prompts compact class representations and benefits training stability. The idea of the class-center-involved loss can be extended to the pair-wise ranking loss and the quadruplet loss, which demonstrates the generality of the proposed framework. Extensive experiments show that the PCCT framework works effectively for medical image classification with imbalanced training images. On four challenging class-imbalanced datasets (two skin datasets, Skin7 and Skin198, one chest X-ray dataset, ChestXray-COVID, and one eye dataset, Kaggle EyePACs), the proposed approach obtains mean F1 scores of 86.20, 65.20, 91.32, and 87.18 over all classes and 81.40, 63.87, 82.62, and 79.09 for rare classes, respectively, achieving state-of-the-art performance and outperforming widely used methods for the class imbalance issue.
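The second-stage loss is easy to sketch: each sample is pulled toward its own class center and pushed from the nearest other center. A minimal hypothetical version (the names and margin value are ours, not the paper's code):

```python
import torch
import torch.nn.functional as F

def class_center_triplet(feat, label, centers, margin=0.3):
    """Hedged sketch of the class-center-involved triplet idea: the positive and
    negative are replaced by class centers, giving one stable triplet per sample."""
    d = torch.cdist(feat, centers)                    # (B, C) distances to all centers
    pos = d.gather(1, label.view(-1, 1)).squeeze(1)   # distance to own class center
    d_neg = d.scatter(1, label.view(-1, 1), float("inf"))
    neg = d_neg.min(dim=1).values                     # distance to nearest other center
    return F.relu(pos - neg + margin).mean()

centers = F.normalize(torch.randn(10, 64), dim=1)     # e.g., running class means
feat = torch.randn(32, 64, requires_grad=True)
label = torch.randint(0, 10, (32,))
class_center_triplet(feat, label, centers).backward()
```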

12.
Biomed Pharmacother ; 159: 114099, 2023 Mar.
Article in English | MEDLINE | ID: mdl-36641923

ABSTRACT

Intervertebral disc degeneration (IVDD), a common cartilage-degenerative disease, is considered the main cause of low back pain (LBP). Owing to its complex aetiology and pathophysiology, the molecular mechanisms of IVDD remain unclear and definitive treatments are lacking. As an evolutionarily and functionally conserved signalling pathway, Hippo-YAP/TAZ signalling plays a crucial role in IVDD progression. In this review, we discuss the regulation of Hippo-YAP/TAZ signalling and summarise recent research progress on its role in cartilage homeostasis and IVDD. We also discuss current applications and future prospects of IVDD treatments based on Hippo-YAP/TAZ signalling.


Subject(s)
Intervertebral Disc Degeneration , Intervertebral Disc , Humans , Hippo Signaling Pathway , Signal Transduction , Transcriptional Coactivator with PDZ-Binding Motif Proteins
13.
IEEE Trans Pattern Anal Mach Intell ; 45(6): 7001-7018, 2023 Jun.
Article in English | MEDLINE | ID: mdl-33079658

ABSTRACT

Learning to re-identify or retrieve a group of people across non-overlapping camera systems has important applications in video surveillance. However, most existing methods focus on (single) person re-identification (re-id), ignoring the fact that people often walk in groups in real scenarios. In this work, we take a step further and consider employing context information for identifying groups of people, i.e., group re-id. On the one hand, group re-id is more challenging than single person re-id, since it requires both robust modeling of local individual person appearance (with different illumination conditions, pose/viewpoint variations, and occlusions) and full awareness of global group structures (with group layout and group member variations). On the other hand, we believe that person re-id can be greatly enhanced by incorporating additional visual context from neighboring group members, a task which we formulate as group-aware (single) person re-id. In this paper, we propose a novel unified framework based on graph neural networks to simultaneously address these two group-based re-id tasks, i.e., group re-id and group-aware person re-id. Specifically, we construct a context graph with group members as its nodes to exploit dependencies among different people. A multi-level attention mechanism is developed to formulate both intra-group and inter-group context, with an additional self-attention module producing robust graph-level representations by attentively aggregating node-level features. The proposed model can be directly generalized to tackle group-aware person re-id using node-level representations. Meanwhile, to facilitate the deployment of deep learning models on these tasks, we build a new group re-id dataset that contains more than 3.8K images with 1.5K annotated groups, an order of magnitude larger than existing group re-id datasets. Extensive experiments on the new dataset as well as three existing datasets clearly demonstrate the effectiveness of the proposed framework for both group-based re-id tasks.
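The graph-level aggregation step can be sketched as a learned attentive readout over member (node) features, so unreliable members are down-weighted in the group descriptor. A minimal hypothetical version, not the authors' module:

```python
import torch
import torch.nn as nn

class AttentiveReadout(nn.Module):
    """Hypothetical sketch: aggregate member (node) features into a graph-level
    group representation with learned attention."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, nodes):                         # nodes: (N, D) group members
        a = torch.softmax(self.score(torch.tanh(self.proj(nodes))), dim=0)  # (N, 1)
        return (a * nodes).sum(dim=0)                 # (D,) group descriptor

group_feat = AttentiveReadout()(torch.randn(5, 256))  # a 5-member group
```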

14.
IEEE Trans Pattern Anal Mach Intell ; 45(1): 489-507, 2023 01.
Article in English | MEDLINE | ID: mdl-35130146

ABSTRACT

Egocentric videos, which record the daily activities of individuals from a first-person point of view, have attracted increasing attention in recent years because of their growing use in many popular applications, including life logging, health monitoring and virtual reality. As a fundamental problem in egocentric vision, egocentric action recognition aims to recognize the actions of the camera wearer from egocentric videos. In egocentric action recognition, relation modeling is important, because the interactions between the camera wearer and the recorded persons or objects form complex relations in egocentric videos. However, only a few existing methods model the relations between the camera wearer and the interacting persons for egocentric action recognition, and moreover they require prior knowledge or auxiliary data to localize the interacting persons. In this work, we consider modeling these relations in a weakly supervised manner, i.e., without using annotations or prior knowledge about the interacting persons or objects, for egocentric action recognition. We form a weakly supervised framework by unifying automatic interactor localization and explicit relation modeling. First, we learn to automatically localize the interactors, i.e., the body parts of the camera wearer and the persons or objects that the camera wearer interacts with, by learning a series of keypoints directly from video data to localize the action-relevant regions, using only action labels and some constraints on these keypoints. Second, and more importantly, to explicitly model the relations between the interactors, we develop an ego-relational LSTM (long short-term memory) network with several candidate connections to model the complex relations in egocentric videos, such as the temporal, interactive, and contextual relations. In particular, to reduce the human effort and manual intervention needed to construct an optimal ego-relational LSTM structure, we search for the optimal connections by employing a differentiable network architecture search mechanism, which automatically constructs the ego-relational LSTM network to explicitly model different relations for egocentric action recognition. We conduct extensive experiments on egocentric video datasets to illustrate the effectiveness of our method.
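The differentiable connection search can be sketched in the DARTS style: candidate connections are mixed with softmax weights over architecture parameters, trained jointly, and later discretized to the strongest candidate. A minimal illustration under those assumptions:

```python
import torch
import torch.nn as nn

class SearchableConnection(nn.Module):
    """Hedged sketch of differentiable connection search over candidate relations."""
    def __init__(self, candidates):
        super().__init__()
        self.candidates = nn.ModuleList(candidates)
        self.alpha = nn.Parameter(torch.zeros(len(candidates)))  # architecture params

    def forward(self, x):
        w = torch.softmax(self.alpha, dim=0)          # soft mixture of candidates
        return sum(wi * op(x) for wi, op in zip(w, self.candidates))

# Toy candidate "connections" between two feature streams:
conn = SearchableConnection([
    nn.Identity(),
    nn.Linear(128, 128),
    nn.Sequential(nn.Linear(128, 128), nn.Tanh()),
])
y = conn(torch.randn(4, 128))
picked = conn.alpha.argmax()   # after training, keep only the strongest connection
```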


Subject(s)
Algorithms , Virtual Reality , Humans , Learning
15.
Nat Med ; 28(9): 1883-1892, 2022 09.
Article in English | MEDLINE | ID: mdl-36109638

ABSTRACT

The storage of facial images in medical records poses privacy risks due to the sensitive nature of the personal biometric information that can be extracted from such images. To minimize these risks, we developed a new technology, called the digital mask (DM), which is based on three-dimensional reconstruction and deep-learning algorithms to irreversibly erase identifiable features while retaining the disease-relevant features needed for diagnosis. In a prospective clinical study evaluating the technology for the diagnosis of ocular conditions, we found very high diagnostic consistency between the use of original and reconstructed facial videos (κ ≥ 0.845 for strabismus, ptosis and nystagmus, and κ = 0.801 for thyroid-associated orbitopathy) and comparable diagnostic accuracy (P ≥ 0.131 for all ocular conditions tested). Identity removal validation using multiple-choice questions showed that, compared to image cropping, the DM removes identity attributes from facial images far more effectively. We further confirmed the ability of the DM to evade recognition systems using artificial intelligence-powered re-identification algorithms. Moreover, use of the DM increased the willingness of patients with ocular conditions to provide their facial images as health information during medical treatment. These results indicate the potential of the DM algorithm to protect the privacy of patients' facial images in an era of rapid adoption of digital health technologies.


Subject(s)
Artificial Intelligence , Privacy , Algorithms , Confidentiality , Face , Humans , Prospective Studies
16.
IEEE Trans Image Process ; 31: 6188-6199, 2022.
Article in English | MEDLINE | ID: mdl-36126030

ABSTRACT

Although person re-identification (ReID) has made impressive progress, difficult cases such as occlusion, viewpoint change, and similar clothing still pose great challenges. Tackling these challenges requires extracting discriminative feature representations. Most existing methods extract ReID features from individual images separately. However, when matching two images, we propose that the ReID features of a query image should be dynamically adjusted based on the contextual information from the gallery image it matches. We call this type of ReID feature a conditional feature embedding. In this paper, we propose a novel ReID framework that extracts conditional feature embeddings based on the aligned visual clues between image pairs, called Clue Alignment based Conditional Embedding (CACE-Net). CACE-Net applies an attention module to build a detailed correspondence graph between crucial visual clues in image pairs and uses a discrepancy-based GCN to embed the obtained correspondence information into the conditional features. Experiments show that CACE-Net achieves state-of-the-art performance on three public datasets.
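To illustrate what a conditional feature embedding looks like, the sketch below adjusts query part features with cross-attention over the gallery image's features. Note this substitutes plain cross-attention for CACE-Net's correspondence graph and discrepancy-based GCN, so it is an analogy rather than the paper's module:

```python
import torch
import torch.nn as nn

class ConditionalEmbedding(nn.Module):
    """Hypothetical sketch: condition the query's part features on the gallery
    image it is matched against, via cross-attention over aligned clues."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_parts, gallery_parts):    # (B, P, D) part features each
        ctx, _ = self.attn(query_parts, gallery_parts, gallery_parts)
        return self.norm(query_parts + ctx)           # gallery-conditioned query features

ce = ConditionalEmbedding()
q = ce(torch.randn(2, 6, 256), torch.randn(2, 6, 256))
```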


Subject(s)
Biometric Identification , Algorithms , Biometric Identification/methods , Humans
17.
Nat Commun ; 13(1): 2790, 2022 05 19.
Article in English | MEDLINE | ID: mdl-35589792

ABSTRACT

Epstein-Barr virus-associated gastric cancer (EBVaGC) shows a robust response to immune checkpoint inhibitors. Therefore, a cost-efficient and accessible tool is needed for discriminating EBV status in patients with gastric cancer. Here we introduce a deep convolutional neural network called EBVNet, and its fusion with pathologists, for predicting EBVaGC from histopathology. EBVNet yields an averaged area under the receiver operating characteristic curve (AUROC) of 0.969 in internal cross-validation, an AUROC of 0.941 on an external dataset from multiple institutes, and an AUROC of 0.895 on The Cancer Genome Atlas dataset. The human-machine fusion significantly improves the diagnostic performance of both EBVNet and the pathologist. These findings suggest that EBVNet could provide an innovative approach for the identification of EBVaGC and may help effectively select patients with gastric cancer for immunotherapy.


Subject(s)
Deep Learning , Epstein-Barr Virus Infections , Stomach Neoplasms , Herpesvirus 4, Human/genetics , Humans , Immune Checkpoint Inhibitors , Stomach Neoplasms/genetics , Stomach Neoplasms/pathology
18.
IEEE Trans Image Process ; 31: 3081-3094, 2022.
Article in English | MEDLINE | ID: mdl-35389866

ABSTRACT

Humans have an innate ability to understand action intention, but training a machine to localize unintentional actions in videos is an enormous challenge due to the lack of reliable annotations for stable training. Annotations of unintentional action are unreliable because different annotators are affected by subjective appraisals and intrinsic ambiguity, which severely complicates training. To address this issue, we propose a probabilistic framework for unintentional action localization that models the uncertainty of annotations. Our framework consists of two main components: Temporal Label Aggregation (TLA) and Dense Probabilistic Localization (DPL). We first formulate each annotated failure moment as a temporal label distribution. We then propose a TLA component to aggregate the temporal label distributions of different failure moments in an online manner and generate dense probabilistic supervision. Based on TLA, we further develop a DPL component to jointly train three heads (i.e., probabilistic dense classification, probabilistic temporal detection, and probabilistic regression) with different supervision granularities and make them highly collaborative. We evaluate our approach on OOPS, the largest unintentional action dataset, and demonstrate that it achieves significant improvements over the baseline and state-of-the-art methods.
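The TLA idea is easy to sketch: each annotated failure moment becomes a Gaussian over time, and the annotators' distributions are averaged into a dense per-frame supervision signal. A minimal version with an assumed sigma (the paper's aggregation is online; this offline average is a simplification):

```python
import numpy as np

def dense_supervision(failure_times, num_frames, fps=25.0, sigma=0.5):
    """Hedged sketch of temporal label aggregation: each annotated failure moment
    becomes a Gaussian over time; annotators' distributions are averaged into a
    dense per-frame label distribution. (sigma in seconds, chosen arbitrarily.)"""
    t = np.arange(num_frames) / fps                    # timestamp of each frame
    dists = [np.exp(-0.5 * ((t - ft) / sigma) ** 2) for ft in failure_times]
    agg = np.mean(dists, axis=0)                       # aggregate over annotators
    return agg / agg.sum()                             # normalized label distribution

# Three annotators marked the failure at slightly different moments:
y = dense_supervision([3.1, 3.4, 2.9], num_frames=125)
print(y.argmax() / 25.0)                               # consensus failure time in seconds
```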


Subject(s)
Models, Statistical , Humans
19.
IEEE Trans Pattern Anal Mach Intell ; 44(10): 6074-6093, 2022 10.
Article in English | MEDLINE | ID: mdl-34048336

ABSTRACT

In conventional person re-identification (re-id), the images used for model training in the training probe set and training gallery set are all assumed to be instance-level samples that are manually labeled from raw surveillance video (likely with the assistance of detection) in a frame-by-frame manner. This labeling across multiple non-overlapping camera views from raw surveillance video is expensive and time-consuming. To overcome these issues, we consider weakly supervised person re-id modeling, which aims to find the raw video clips in which a given target person appears. In our weakly supervised setting, during training, given a sample of a person captured in one camera view, our approach aims to train a re-id model without further instance-level labeling for this person in another camera view. The weak setting refers to matching a target person with an untrimmed gallery video where we only know that the identity appears in the video, without requiring the identity to be annotated in any frame during training. Weakly supervised person re-id is challenging because it not only suffers from the difficulties of conventional person re-id (e.g., visual ambiguity and appearance variations caused by occlusions, pose variations, background clutter, etc.), but, more importantly, is also challenged by the weak supervision itself, because instance-level labels and ground-truth locations for person instances (i.e., ground-truth bounding boxes) are absent. To solve the weakly supervised person re-id problem, we develop deep graph metric learning (DGML). On the one hand, DGML measures the consistency between intra-video spatial graphs of consecutive frames, where the spatial graph captures the neighborhood relationships among the detected person instances in each frame. On the other hand, DGML distinguishes the inter-video spatial graphs captured from different camera views at different sites simultaneously. To further explicitly embed weak supervision into DGML, we introduce weakly supervised regularization (WSR), which utilizes multiple weak video-level labels to learn discriminative features by means of a weak identity loss and a cross-video alignment loss. We conduct extensive experiments to demonstrate the feasibility of the weakly supervised person re-id approach and its special cases (e.g., its bag-to-bag extension) and show that the proposed DGML is effective.
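The weak identity loss can be pictured as multiple-instance learning: only video-level identity labels exist, so per-detection scores are pooled over the clip and supervised with multi-label BCE. The log-sum-exp pooling below is our assumption of one reasonable choice, not necessarily the paper's:

```python
import torch
import torch.nn.functional as F

def weak_identity_loss(instance_logits, video_labels):
    """Hypothetical sketch of a weak identity loss: per-instance identity scores
    are pooled over all detected persons in the clip (log-sum-exp as a soft max)
    and supervised with multi-label BCE at the video level."""
    # instance_logits: (N, C) scores for N person detections over C identities
    # video_labels:    (C,)  multi-hot vector of identities known to appear
    video_logits = torch.logsumexp(instance_logits, dim=0)   # pool over instances
    return F.binary_cross_entropy_with_logits(video_logits, video_labels)

logits = torch.randn(12, 50, requires_grad=True)       # 12 detections, 50 identities
labels = torch.zeros(50)
labels[[3, 17]] = 1.0                                  # clip is known to contain ids 3 and 17
weak_identity_loss(logits, labels).backward()
```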


Subject(s)
Biometric Identification , Algorithms , Biometric Identification/methods , Humans
20.
IEEE Trans Pattern Anal Mach Intell ; 44(7): 3386-3403, 2022 07.
Article in English | MEDLINE | ID: mdl-33571087

ABSTRACT

Despite the remarkable progress achieved in conventional instance segmentation, predicting instance segmentation results for unobserved future frames remains challenging because future data cannot be observed. Existing methods mainly address this challenge by forecasting the features of future frames. However, these methods always treat the features of multiple levels (e.g., coarse-to-fine pyramid features) independently and do not exploit them collaboratively, which results in inaccurate predictions for future frames; moreover, this weakness partially hinders the self-adaptation of a future segmentation prediction model to different input samples. To solve this problem, we propose an adaptive aggregation approach called the Auto-Path Aggregation Network (APANet), in which the spatio-temporal contextual information contained in the features of each individual level is selectively aggregated using the developed "auto-path". The "auto-path" connects each pair of features extracted at different pyramid levels for task-specific hierarchical contextual information aggregation, enabling selective and adaptive aggregation of pyramid features in accordance with different videos/frames. Our APANet can be further optimized jointly with a Mask R-CNN head as the feature decoder and a Feature Pyramid Network (FPN) feature encoder, forming a joint learning system for future instance segmentation prediction. We experimentally show that the proposed method achieves state-of-the-art performance on three video-based instance segmentation benchmarks for future instance segmentation prediction.
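The auto-path idea can be sketched as a learnable gate on every pair of pyramid levels, letting each level adaptively aggregate resized context from the others. The scalar sigmoid gates below are a simplification of the paper's learned auto-paths:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoPathAggregation(nn.Module):
    """Hedged sketch: a learnable gate connects each pair of pyramid levels,
    so every level aggregates context from all others."""
    def __init__(self, num_levels=4, dim=256):
        super().__init__()
        self.gates = nn.Parameter(torch.zeros(num_levels, num_levels))
        self.mix = nn.ModuleList(nn.Conv2d(dim, dim, 1) for _ in range(num_levels))

    def forward(self, feats):                         # list of (B, C, Hl, Wl), coarse to fine
        outs = []
        for l, f in enumerate(feats):
            agg = f
            for m, g in enumerate(feats):
                if m == l:
                    continue
                resized = F.interpolate(g, size=f.shape[-2:], mode="bilinear",
                                        align_corners=False)
                agg = agg + torch.sigmoid(self.gates[l, m]) * resized  # gated auto-path
            outs.append(self.mix[l](agg))
        return outs

feats = [torch.randn(1, 256, s, s) for s in (8, 16, 32, 64)]
outs = AutoPathAggregation()(feats)
```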


Subject(s)
Image Processing, Computer-Assisted , Neural Networks, Computer , Algorithms , Image Processing, Computer-Assisted/methods , Learning