Results 1 - 20 of 96
1.
Heliyon ; 10(7): e28552, 2024 Apr 15.
Article in English | MEDLINE | ID: mdl-38560176

ABSTRACT

Introduction: Simultaneous involvement of the peripheral nervous system (PNS) and central nervous system (CNS) in diffuse large B-cell lymphoma (DLBCL) is rarely documented. In this case, the diagnosis of DLBCL was pathologically confirmed, with invasion into the basal ganglia, diencephalon, and several peripheral nerves; the initial clinical manifestations were dyspnoea and hyperventilation. Case presentation: The patient presented to the hospital with fatigue, dyspnoea, and limb pain lasting over 7 months, accompanied by progressive breathlessness and unconsciousness in the preceding 6 days. Initial treatment with glucocorticoids for suspected Guillain-Barré syndrome (GBS) failed to control the severe shortness of breath and hyperventilation, necessitating ventilator-assisted ventilation. 18-Fluorodeoxyglucose positron emission tomography/computed tomography (18F-FDG PET/CT) showed that the basal ganglia, brainstem, and multiple peripheral nerves were thickened and metabolically active. Atypical cells were present in the cerebrospinal fluid, and the pathology indicated invasive B-cell lymphoma with a propensity toward DLBCL. After receiving chemotherapy, the patient regained consciousness and was successfully weaned off ventilator assistance but died of severe pneumonia. Discussion: The early clinical manifestations of DLBCL lack specificity, and multifocal DLBCL complicates the diagnostic process. When a single primary disease cannot explain multiple symptoms, the possibility of DLBCL should be considered, and nervous system invasion should be suspected when neurological symptoms are present. Once nervous system involvement occurs in DLBCL, whether central or peripheral, it indicates a poor prognosis.

2.
Article in English | MEDLINE | ID: mdl-38416607

ABSTRACT

How to effectively explore the colors of exemplars and propagate them to colorize each frame is vital for exemplar-based video colorization. In this paper, we present BiSTNet, which explores the colors of exemplars and exploits them for video colorization via bidirectional temporal feature fusion guided by a semantic image prior. We first establish semantic correspondence between each frame and the exemplars in deep feature space to explore color information from the exemplars. We then develop a simple yet effective bidirectional temporal feature fusion module to propagate the colors of the exemplars to each frame while avoiding inaccurate alignment. Noting that color-bleeding artifacts usually appear around the boundaries of important objects in videos, we develop a mixed expert block that extracts semantic information to model the object boundaries of frames, so that the semantic image prior can better guide the colorization process. In addition, we develop a multi-scale refinement block to progressively colorize frames in a coarse-to-fine manner. Extensive experimental results demonstrate that the proposed BiSTNet performs favorably against state-of-the-art methods on benchmark datasets and real-world scenes. Moreover, BiSTNet won the NTIRE 2023 video colorization challenge [1].
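The bidirectional temporal fusion described above can be sketched minimally. This is a hypothetical simplification: the paper's fusion module is learned, whereas here a fixed blend weight `w` stands in for it, and per-frame features are plain vectors.

```python
def bidirectional_fuse(fwd, bwd, w=0.5):
    """Fuse per-frame color features propagated forward and backward
    in time. A learned module would predict the blend per location;
    a fixed weight `w` is a stand-in for illustration.

    fwd, bwd: lists of per-frame feature vectors (lists of floats).
    """
    return [[w * f + (1.0 - w) * b for f, b in zip(ff, bb)]
            for ff, bb in zip(fwd, bwd)]

# One frame, two feature channels: blend forward [1, 2] with backward [3, 4].
fused = bidirectional_fuse([[1.0, 2.0]], [[3.0, 4.0]])
```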

3.
IEEE Trans Image Process ; 33: 1136-1148, 2024.
Article in English | MEDLINE | ID: mdl-38300774

ABSTRACT

The image-level label has prevailed in weakly supervised semantic segmentation tasks due to its easy availability. Since image-level labels can only indicate the existence or absence of specific categories of objects, visualization-based techniques have been widely adopted to provide object location clues. Because class activation maps (CAMs) can only locate the most discriminative part of objects, recent approaches usually adopt an expansion strategy to enlarge the activation area for more integral object localization. However, without proper constraints, the expanded activation easily intrudes into the background region. In this paper, we propose spatial structure constraints (SSC) for weakly supervised semantic segmentation to alleviate the unwanted object over-activation caused by attention expansion. Specifically, we propose a CAM-driven reconstruction module that directly reconstructs the input image from deep CAM features, which constrains the diffusion of last-layer object attention by preserving the coarse spatial structure of the image content. Moreover, we propose an activation self-modulation module to refine CAMs with finer spatial structure details by enhancing regional consistency. Without external saliency models to provide background clues, our approach achieves 72.7% and 47.0% mIoU on the PASCAL VOC 2012 and COCO datasets, respectively, demonstrating its superiority. The source codes and models have been made available at https://github.com/NUST-Machine-Intelligence-Laboratory/SSC.
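Since the approach builds on class activation maps, a minimal sketch of how a CAM is formed may help: the standard weighted sum of the final feature maps by the target class's classifier weights (toy shapes, not the paper's module).

```python
def class_activation_map(features, weights):
    """Compute a CAM as the classifier-weight-weighted sum of feature maps.

    features: list of C feature maps, each an H x W nested list.
    weights: the C classifier weights for the target class.
    High CAM values mark the regions most discriminative for that class.
    """
    h, w = len(features[0]), len(features[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for fmap, wt in zip(features, weights):
        for i in range(h):
            for j in range(w):
                cam[i][j] += wt * fmap[i][j]
    return cam

# Two 2x2 feature maps, class weights [2, 3]: activation lands top-left/top-right.
cam = class_activation_map(
    [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 1.0], [0.0, 0.0]]],
    [2.0, 3.0],
)
```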

4.
IEEE Trans Pattern Anal Mach Intell ; 46(4): 2461-2474, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38015702

ABSTRACT

Stereo matching is a fundamental building block for many vision and robotics applications. An informative and concise cost volume representation is vital for stereo matching of high accuracy and efficiency. In this article, we present a novel cost volume construction method, named attention concatenation volume (ACV), which generates attention weights from correlation clues to suppress redundant information and enhance matching-related information in the concatenation volume. The ACV can be seamlessly embedded into most stereo matching networks; the resulting networks can use a more lightweight aggregation network while achieving higher accuracy. We further design a fast version of ACV, named Fast-ACV, to enable real-time performance; it generates high-likelihood disparity hypotheses and the corresponding attention weights from low-resolution correlation clues, significantly reducing computational and memory costs while maintaining satisfactory accuracy. Furthermore, we design a highly accurate network, ACVNet, and a real-time network, Fast-ACVNet, based on ACV and Fast-ACV respectively, which achieve state-of-the-art performance on several benchmarks.
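The core idea of weighting a concatenation volume with attention derived from correlation clues can be sketched for a single pixel. This is a hedged simplification: the real attention is produced by a network, whereas here a plain softmax over disparities plays its role.

```python
import math

def attention_filter(corr, concat_vol):
    """Weight a concatenation cost volume by attention from correlation clues.

    corr: list of D correlation scores for one pixel (one per disparity).
    concat_vol: list of D concatenated feature vectors for the same pixel.
    A softmax over disparities yields attention that suppresses unlikely
    disparity hypotheses and emphasizes matching-related ones.
    """
    exps = [math.exp(c) for c in corr]
    z = sum(exps)
    attn = [e / z for e in exps]
    return [[a * f for f in feats] for a, feats in zip(attn, concat_vol)]

# Two disparity hypotheses with equal correlation: each gets attention 0.5.
filtered = attention_filter([0.0, 0.0], [[2.0], [4.0]])
```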

5.
IEEE Trans Image Process ; 33: 297-309, 2024.
Article in English | MEDLINE | ID: mdl-38100340

ABSTRACT

Recognizing actions performed on unseen objects, known as Compositional Action Recognition (CAR), has attracted increasing attention in recent years. The main challenge is to overcome the distribution shift of "action-objects" pairs between the training and testing sets. Previous works for CAR usually introduce extra information (e.g. bounding boxes) to enhance the dynamic cues of video features. However, these approaches do not fundamentally eliminate the inherent inductive bias in the video, which can be regarded as the stumbling block for model generalization, because video features are usually extracted from visually cluttered areas in which many objects cannot be removed or masked explicitly. To this end, this work attempts to implicitly accomplish semantic-level decoupling of "object-action" in the high-level feature space. Specifically, we propose a novel Semantic-Decoupling Transformer framework, dubbed DeFormer, which contains two insightful sub-modules: Objects-Motion Decoupler (OMD) and Semantic-Decoupling Constrainer (SDC). In OMD, we initialize several learnable tokens incorporating annotation priors to learn an instance-level representation and then decouple it into the appearance feature and motion feature in high-level visual space. In SDC, we use textual information in the high-level language space to construct a dual-contrastive association to constrain the decoupled appearance feature and motion feature obtained in OMD. Extensive experiments verify the generalization ability of DeFormer. Specifically, compared to the baseline method, DeFormer achieves absolute improvements of 3%, 3.3%, and 5.4% under three different settings on STH-ELSE, while the corresponding improvements on EPIC-KITCHENS-55 are 4.7%, 9.2%, and 4.4%. Besides, DeFormer achieves state-of-the-art results on both ground-truth and detected annotations.

6.
Article in English | MEDLINE | ID: mdl-38051621

ABSTRACT

Visual grounding (VG) aims to locate a specific target in an image based on a given language query. The discriminative information from context is important for distinguishing the target from other objects, particularly for targets that share a category with other objects. However, most previous methods underestimate such information. Moreover, they are usually designed for the standard scene (without any novel object), which limits their generalization to the open-vocabulary scene. In this paper, we propose a novel framework with context disentangling and prototype inheriting for robust visual grounding to handle both scenes. Specifically, the context disentangling separates the referent and context features, which achieves better discrimination between them. The prototype inheriting inherits the prototypes discovered from the disentangled visual features by a prototype bank to fully utilize the seen data, especially for the open-vocabulary scene. The fused features, obtained by applying the Hadamard product to the disentangled linguistic and visual features of prototypes to avoid sharply adjusting the relative importance of the two types of features, are then attached with a special token and fed to a vision Transformer encoder for bounding box regression. Extensive experiments are conducted on both standard and open-vocabulary scenes. The performance comparisons indicate that our method outperforms the state-of-the-art methods in both scenarios. The code is available at https://github.com/WayneTomas/TransCP.
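The Hadamard-product fusion mentioned above is simply an element-wise product of the two feature vectors; a minimal sketch with toy vectors (not the paper's actual features):

```python
def hadamard_fuse(linguistic, visual):
    """Element-wise (Hadamard) product of linguistic and visual features.

    Unlike weighted addition, the product does not require choosing a
    sharp importance trade-off between the two modalities.
    """
    assert len(linguistic) == len(visual), "feature dimensions must match"
    return [l * v for l, v in zip(linguistic, visual)]

fused = hadamard_fuse([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```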

7.
IEEE Trans Image Process ; 32: 6032-6046, 2023.
Article in English | MEDLINE | ID: mdl-37910422

ABSTRACT

Text-Image Person Re-identification (TIReID) aims to retrieve the image corresponding to the given text query from a pool of candidate images. Existing methods employ prior knowledge from single-modality pre-training to facilitate learning, but lack multi-modal correspondence information. Vision-Language Pre-training, such as CLIP (Contrastive Language-Image Pretraining), can address the limitation. However, CLIP falls short in capturing fine-grained information, thereby not fully leveraging its powerful capacity in TIReID. Besides, the popular explicit local matching paradigm for mining fine-grained information heavily relies on the quality of local parts and cross-modal inter-part interaction/guidance, leading to intra-modal information distortion and ambiguity problems. Accordingly, in this paper, we propose a CLIP-driven Fine-grained information excavation framework (CFine) to fully utilize the powerful knowledge of CLIP for TIReID. To transfer the multi-modal knowledge effectively, we conduct fine-grained information excavation to mine modality-shared discriminative details for global alignment. Specifically, we propose a multi-level global feature learning (MGF) module that fully mines the discriminative local information within each modality, thereby emphasizing identity-related discriminative clues through enhanced interaction between global image (text) and informative local patches (words). MGF generates a set of enhanced global features for later inference. Furthermore, we design cross-grained feature refinement (CFR) and fine-grained correspondence discovery (FCD) modules to establish cross-modal correspondence at both coarse and fine-grained levels (image-word, sentence-patch, word-patch), ensuring the reliability of informative local patches/words. CFR and FCD are removed during inference to optimize computational efficiency. Extensive experiments on multiple benchmarks demonstrate the superior performance of our method in TIReID.

8.
Materials (Basel) ; 16(22)2023 Nov 20.
Article in English | MEDLINE | ID: mdl-38005173

ABSTRACT

Alite dissolution plays a crucial role in cement hydration. However, quantitative investigations into alite powder dissolution are limited, especially regarding the influence of chemical admixtures. This study investigates the impact of particle size, temperature, saturation level, and mixing speed on alite powder dissolution rate, considering the real-time evolution of specific surface area during the alite powder dissolution process. Furthermore, the study delves into the influence of two organic toughening agents, chitosan oligosaccharide (COS) and anionic/non-ionic polyester-based polyurethane (PU), on the kinetics of alite powder dissolution. The results demonstrate a specific-surface-area change formula during alite powder dissolution: S/S0 = 0.348·e^((1 − m/m0)/0.085) + 0.651. Notably, the temperature and saturation level significantly affect dissolution rates, whereas the effect of particle size is more complicated. COS shows dosage-dependent effects on alite dissolution, acting through both its acidic nature and surface coverage. On the other hand, PU inhibits alite dissolution by blocking the active sites of alite through electrostatic adsorption, which is particularly evident at high temperatures.
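Reading the reported relation as S/S0 = 0.348·e^((1 − m/m0)/0.085) + 0.651, it can be evaluated directly. This is a sketch: the grouping of terms is inferred from the paper's formula as printed here.

```python
import math

def relative_surface_area(m_ratio):
    """S/S0 as a function of the remaining mass fraction m/m0 during
    alite powder dissolution: S/S0 = 0.348 * exp((1 - m/m0) / 0.085) + 0.651.

    m_ratio: m/m0, the current mass divided by the initial mass.
    """
    return 0.348 * math.exp((1.0 - m_ratio) / 0.085) + 0.651
```

As a sanity check on this reading, at the start of dissolution (m/m0 = 1) the exponent vanishes and the formula gives 0.348 + 0.651 = 0.999 ≈ 1, i.e. the surface area is essentially unchanged, which is physically consistent.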

9.
Article in English | MEDLINE | ID: mdl-37995167

ABSTRACT

This article proposes a new hashing framework named relational consistency induced self-supervised hashing (RCSH) for large-scale image retrieval. To capture the potential semantic structure of data, RCSH explores the relational consistency between data samples in different spaces, which learns reliable data relationships in the latent feature space and then preserves the learned relationships in the Hamming space. The data relationships are uncovered by learning a set of prototypes that group similar data samples in the latent feature space. By uncovering the semantic structure of the data, meaningful data-to-prototype and data-to-data relationships are jointly constructed. The data-to-prototype relationships are captured by constraining the prototype assignments generated from different augmented views of an image to be the same. Meanwhile, these data-to-prototype relationships are preserved to learn informative compact hash codes by matching them with these reliable prototypes. To accomplish this, a novel dual prototype contrastive loss is proposed to maximize the agreement of prototype assignments in the latent feature space and Hamming space. The data-to-data relationships are captured by enforcing the distribution of pairwise similarities in the latent feature space and Hamming space to be consistent, which makes the learned hash codes preserve meaningful similarity relationships. Extensive experimental results on four widely used image retrieval datasets demonstrate that the proposed method significantly outperforms the state-of-the-art methods. Besides, the proposed method achieves promising performance in out-of-domain retrieval tasks, which shows its good generalization ability. The source code and models are available at https://github.com/IMAG-LuJin/RCSH.
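Grouping samples by prototypes, as RCSH does in the latent feature space, reduces at its simplest to nearest-prototype assignment; a minimal sketch (plain Euclidean distance, not the paper's learned assignment or contrastive loss):

```python
def prototype_assignment(feature, prototypes):
    """Assign a feature to its nearest prototype (squared Euclidean),
    grouping similar samples around shared semantic anchors.

    feature: a feature vector (list of floats).
    prototypes: list of prototype vectors of the same dimension.
    Returns the index of the closest prototype.
    """
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(prototypes)), key=lambda k: d2(feature, prototypes[k]))

protos = [[0.0, 0.0], [10.0, 10.0]]
# Two augmented views of one image should land on the same prototype.
same = prototype_assignment([1.0, 1.0], protos) == prototype_assignment([0.5, 1.5], protos)
```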

10.
Phys Chem Chem Phys ; 25(35): 24097-24109, 2023 Sep 13.
Article in English | MEDLINE | ID: mdl-37655461

ABSTRACT

Polymers are known to effectively improve the toughness of inorganic matrices; however, the mechanism at the molecular level is still unclear. In this study, we used molecular dynamics simulations to unravel the effects and mechanisms of different molecular chain lengths of polyacrylic acid (PAA) on toughening calcium silicate hydrate (CSH), which is the basic building block of cement-based materials. Our simulation results indicate that an optimal molecular chain length of the polymer yields the largest toughening effect on the matrix, leading to an increase of up to 60.98% in fracture energy. During uniaxial tensile tests along the x-axis and z-axis directions, the configurational evolution of the PAA molecule determines the toughening effect. As the polymer unfolds and its size matches the defects of CSH, the stress distribution of the system becomes more homogeneous, which favors an increase in toughness. Furthermore, based on our simulation results and a mathematical model, we propose a "strain rate/optimal chain length" theory. This theory suggests that the optimal toughening effect is achieved when the molecular chain length of the organic component is 1.3-1.5 times the largest defect size of the inorganic matrix. This work provides molecular-scale insights into the toughening mechanisms of an organic/inorganic system and may have practical implications for improving the toughness of cement-based materials.
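The proposed 1.3-1.5× rule can be expressed as a one-line check (a sketch of the stated range only; units are whatever length unit both quantities share):

```python
def in_optimal_chain_range(chain_length, largest_defect_size):
    """Per the proposed 'strain rate/optimal chain length' theory, the
    optimal toughening occurs when the polymer chain length is
    1.3-1.5 times the largest defect size of the inorganic matrix.
    """
    return 1.3 * largest_defect_size <= chain_length <= 1.5 * largest_defect_size

# A 14 nm chain against a 10 nm largest defect falls in the optimal window.
ok = in_optimal_chain_range(14.0, 10.0)
```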

11.
Article in English | MEDLINE | ID: mdl-37713222

ABSTRACT

Text-based person search (TBPS) is a challenging task that aims to search pedestrian images with the same identity from an image gallery given a query text. In recent years, TBPS has made remarkable progress, and state-of-the-art (SOTA) methods achieve superior performance by learning local fine-grained correspondence between images and texts. However, most existing methods rely on explicitly generated local parts to model fine-grained correspondence between modalities, which is unreliable due to the lack of contextual information or the potential introduction of noise. Moreover, the existing methods seldom consider the information inequality problem between modalities caused by image-specific information. To address these limitations, we propose an efficient joint multilevel alignment network (MANet) for TBPS, which can learn aligned image/text feature representations between modalities at multiple levels, and realize fast and effective person search. Specifically, we first design an image-specific information suppression (ISS) module, which suppresses image background and environmental factors by relation-guided localization (RGL) and channel attention filtration (CAF), respectively. This module effectively alleviates the information inequality problem and realizes the alignment of information volume between images and texts. Second, we propose an implicit local alignment (ILA) module to adaptively aggregate all pixel/word features of image/text to a set of modality-shared semantic topic centers and implicitly learn the local fine-grained correspondence between modalities without additional supervision and cross-modal interactions. Also, a global alignment (GA) is introduced as a supplement to the local perspective. The cooperation of global and local alignment modules enables better semantic alignment between modalities. Extensive experiments on multiple databases demonstrate the effectiveness and superiority of our MANet.

12.
IEEE Trans Image Process ; 32: 4341-4354, 2023.
Article in English | MEDLINE | ID: mdl-37490376

ABSTRACT

The visual feature pyramid has shown its superiority in both effectiveness and efficiency in a variety of applications. However, current methods overly focus on inter-layer feature interactions while disregarding the importance of intra-layer feature regulation. Despite some attempts to learn a compact intra-layer feature representation with attention mechanisms or vision transformers, they overlook the crucial corner regions that are essential for dense prediction tasks. To address this problem, we propose a Centralized Feature Pyramid (CFP) network for object detection, which is based on a globally explicit centralized feature regulation. Specifically, we first propose a spatially explicit visual center scheme, where a lightweight MLP is used to capture the globally long-range dependencies, and a parallel learnable visual center mechanism is used to capture the local corner regions of the input images. Based on this, we then propose a globally centralized regulation for the commonly used feature pyramid in a top-down fashion, where the explicit visual center information obtained from the deepest intra-layer feature is used to regulate the frontal shallow features. Compared to existing feature pyramids, CFP not only captures global long-range dependencies but also efficiently obtains an all-round yet discriminative feature representation. Experimental results on the challenging MS-COCO dataset validate that our proposed CFP achieves consistent performance gains on the state-of-the-art YOLOv5 and YOLOX object detection baselines.
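The top-down centralized regulation can be caricatured as broadcasting the deepest level's visual-center feature to the shallower levels; a hedged sketch in which a fixed scale `alpha` stands in for the learned regulation:

```python
def centralized_regulation(pyramid, alpha=0.5):
    """Top-down regulation sketch: the explicit visual-center feature from
    the deepest pyramid level is added (scaled by `alpha`) to every
    shallower level. The real CFP learns this regulation; `alpha` is a
    hypothetical stand-in.

    pyramid: list of per-level global feature vectors, deepest level last.
    """
    center = pyramid[-1]
    regulated = [[f + alpha * c for f, c in zip(level, center)]
                 for level in pyramid[:-1]]
    return regulated + [list(center)]

# Shallow level [1, 1] regulated by deepest-level center [0, 2].
out = centralized_regulation([[1.0, 1.0], [0.0, 2.0]])
```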

13.
IEEE Trans Image Process ; 32: 2960-2971, 2023.
Article in English | MEDLINE | ID: mdl-37195845

ABSTRACT

Weakly supervised semantic segmentation (WSSS) models relying on class activation maps (CAMs) have achieved desirable performance compared to non-CAM-based counterparts. However, to make the WSSS task feasible, pseudo labels must be generated by expanding the seeds from CAMs, which is complex and time-consuming and thus hinders the design of efficient end-to-end (single-stage) WSSS approaches. To tackle this dilemma, we resort to off-the-shelf, readily accessible saliency maps to directly obtain pseudo labels given the image-level class labels. Nevertheless, the salient regions may contain noisy labels and cannot seamlessly fit the target objects, and saliency maps can only be approximated as pseudo labels for simple images containing single-class objects. As such, a segmentation model trained on these simple images cannot generalize well to complex images containing multi-class objects. To this end, we propose an end-to-end multi-granularity denoising and bidirectional alignment (MDBA) model to alleviate the noisy-label and multi-class generalization issues. Specifically, we propose online noise filtering and progressive noise detection modules to tackle image-level and pixel-level noise, respectively. Moreover, a bidirectional alignment mechanism is proposed to reduce the data distribution gap in both input and output space, with simple-to-complex image synthesis and complex-to-simple adversarial learning. MDBA reaches 69.5% and 70.2% mIoU on the validation and test sets of the PASCAL VOC 2012 dataset. The source codes and models have been made available at https://github.com/NUST-Machine-Intelligence-Laboratory/MDBA.
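The mIoU figures quoted in these segmentation abstracts come from a per-class confusion matrix; a minimal reference implementation of the standard metric:

```python
def mean_iou(confusion):
    """Mean intersection-over-union from a confusion matrix.

    confusion: N x N nested list; rows are ground-truth classes,
    columns are predicted classes. IoU per class c is
    TP / (TP + FP + FN); classes absent from both GT and prediction
    are skipped.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        inter = confusion[c][c]
        union = sum(confusion[c]) + sum(row[c] for row in confusion) - inter
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Perfect 2-class prediction yields mIoU of 1.0.
perfect = mean_iou([[3, 0], [0, 5]])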
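The mIoU figures quoted in these segmentation abstracts come from a per-class confusion matrix; a minimal reference implementation of the standard metric:

```python
def mean_iou(confusion):
    """Mean intersection-over-union from a confusion matrix.

    confusion: N x N nested list; rows are ground-truth classes,
    columns are predicted classes. IoU per class c is
    TP / (TP + FP + FN); classes absent from both GT and prediction
    are skipped.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        inter = confusion[c][c]
        union = sum(confusion[c]) + sum(row[c] for row in confusion) - inter
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Perfect 2-class prediction yields mIoU of 1.0.
perfect = mean_iou([[3, 0], [0, 5]])
```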

14.
Article in English | MEDLINE | ID: mdl-37220053

ABSTRACT

Thanks to their annotation friendliness and satisfactory performance, weakly supervised semantic segmentation (WSSS) approaches have been extensively studied. Recently, single-stage WSSS (SS-WSSS) has emerged to alleviate the expensive computational costs and complicated training procedures of multistage WSSS. However, the results of such an immature model suffer from background incompleteness and object incompleteness. We empirically find that these are caused by the insufficiency of global object context and the lack of local regional contents, respectively. Under these observations, we propose an SS-WSSS model supervised only by image-level class labels, termed weakly supervised feature coupling network (WS-FCN), which can capture the multiscale context formed from adjacent feature grids and encode fine-grained spatial information from low-level features into high-level ones. Specifically, a flexible context aggregation (FCA) module is proposed to capture the global object context in different granular spaces. Besides, a semantically consistent feature fusion (SF2) module is proposed in a bottom-up, parameter-learnable fashion to aggregate the fine-grained local contents. Based on these two modules, WS-FCN is trained in a self-supervised, end-to-end fashion. Extensive experimental results on the challenging PASCAL VOC 2012 and MS COCO 2014 datasets demonstrate the effectiveness and efficiency of WS-FCN, which achieves state-of-the-art results of 65.02% and 64.22% mIoU on the PASCAL VOC 2012 val and test sets, and 34.12% mIoU on the MS COCO 2014 val set. The code and weights have been released at: WS-FCN.

15.
Article in English | MEDLINE | ID: mdl-37022403

ABSTRACT

Deep learning-based models have been shown to outperform human beings in many computer vision tasks when massive labeled training data are available. However, humans have an amazing ability to easily recognize images of novel categories by browsing only a few examples of these categories. Few-shot learning has thus emerged to enable machines to learn from extremely limited labeled examples. One possible reason why human beings can learn novel concepts quickly and efficiently is that they have sufficient visual and semantic prior knowledge. Toward this end, this work proposes a novel knowledge-guided semantic transfer network (KSTNet) for few-shot image recognition from a supplementary perspective by introducing auxiliary prior knowledge. The proposed network jointly incorporates vision inferring, knowledge transferring, and classifier learning into one unified framework for optimal compatibility. A category-guided visual learning module is developed in which a visual classifier is learned based on the feature extractor along with cosine similarity and contrastive loss optimization. To fully explore prior knowledge of category correlations, a knowledge transfer network is then developed to propagate knowledge information among all categories to learn the semantic-visual mapping, thus inferring a knowledge-based classifier for novel categories from base categories. Finally, we design an adaptive fusion scheme to infer the desired classifiers by effectively integrating the above knowledge and visual information. Extensive experiments are conducted on the widely used Mini-ImageNet and Tiered-ImageNet benchmarks to validate the effectiveness of KSTNet. Compared with the state of the art, the results show that the proposed method achieves favorable performance with minimal bells and whistles, especially in the case of one-shot learning.
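The cosine-similarity classification step mentioned above can be sketched with toy vectors (no contrastive loss, no learned extractor; just the scoring rule):

```python
import math

def cosine_classify(feature, class_weights):
    """Score classes by cosine similarity between a feature vector and
    each class weight vector; return the index of the best class.
    """
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = (math.sqrt(sum(x * x for x in a))
               * math.sqrt(sum(y * y for y in b)))
        return num / den
    scores = [cos(feature, w) for w in class_weights]
    return max(range(len(scores)), key=scores.__getitem__)

# Two orthogonal class prototypes; the feature leans toward class 0.
pred = cosine_classify([0.9, 0.1], [[1.0, 0.0], [0.0, 1.0]])
```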

16.
IEEE Trans Pattern Anal Mach Intell ; 45(8): 10317-10330, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37030795

ABSTRACT

In order to enable the model to generalize to unseen "action-objects" (compositional action), previous methods encode multiple pieces of information (i.e., the appearance, position, and identity of visual instances) independently and concatenate them for classification. However, these methods ignore the potential supervisory role of instance information (i.e., position and identity) in the process of visual perception. To this end, we present a novel framework, namely Progressive Instance-aware Feature Learning (PIFL), to progressively extract, reason, and predict dynamic cues of moving instances from videos for compositional action recognition. Specifically, this framework extracts features from foreground instances that are likely to be relevant to human actions (Position-aware Appearance Feature Extraction in Section III-B1), performs identity-aware reasoning among instance-centric features with semantic-specific interactions (Identity-aware Feature Interaction in Section III-B2), and finally predicts instances' position from observed states to force the model into perceiving their movement (Semantic-aware Position Prediction in Section III-B3). We evaluate our approach on two compositional action recognition benchmarks, namely, Something-Else and IKEA-Assembly. Our approach achieves consistent accuracy gain beyond off-the-shelf action recognition algorithms in terms of both ground truth and detected position of instances.


Subjects
Algorithms, Visual Perception, Humans, Learning
17.
Int J Biol Macromol ; 242(Pt 2): 124661, 2023 Jul 01.
Article in English | MEDLINE | ID: mdl-37119898

ABSTRACT

Nanofibrous composite membranes consisting of polyvinyl alcohol (PVA), sodium alginate (SA), chitosan-nano zinc oxide nanoparticles (CS-Nano-ZnO), and curcumin (Cur) were prepared by ultrasonic processing and electrospinning. When the ultrasonic power was set to 100 W, the prepared CS-Nano-ZnO had a minimum size (404.67 ± 42.35 nm) and a generally uniform particle size distribution (PDI = 0.32 ± 0.10). The composite fiber membrane with a Cur:CS-Nano-ZnO mass ratio of 5:5 exhibited the best water vapor permeability, strain, and stress. Furthermore, the inhibitory rates against Escherichia coli and Staphylococcus aureus were 91.93 ± 2.07% and 93.00 ± 0.83%, respectively. The Kyoho grape fresh-keeping trial revealed that grape berries wrapped with the composite fiber membrane still maintained good quality and a higher rate of good fruit (60.25 ± 1.46%) after 12 days of storage. The shelf life of the grapes was extended by at least 4 days. Thus, nanofibrous composite membranes based on CS-Nano-ZnO and Cur are expected to serve as an active material for food packaging.


Subjects
Chitosan, Curcumin, Nanofibers, Vitis, Zinc Oxide, Zinc Oxide/pharmacology, Anti-Bacterial Agents/pharmacology, Chitosan/pharmacology, Curcumin/pharmacology
18.
IEEE Trans Pattern Anal Mach Intell ; 45(8): 9411-9425, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37022839

ABSTRACT

We present compact and effective deep convolutional neural networks (CNNs) by exploring properties of videos for video deblurring. Motivated by the non-uniform blur property that not all pixels of the frames are blurry, we develop a CNN that integrates a temporal sharpness prior (TSP) for removing blur in videos. The TSP exploits sharp pixels from adjacent frames to facilitate better frame restoration by the CNN. Observing that the motion field is related to latent frames rather than blurry ones in the image formation model, we develop an effective cascaded training approach to solve the proposed CNN in an end-to-end manner. As videos usually contain similar contents within and across frames, we propose a non-local similarity mining approach based on a self-attention method with the propagation of global features to constrain CNNs for frame restoration. We show that exploring the domain knowledge of videos can make CNNs more compact and efficient: the CNN with non-local spatial-temporal similarity is 3× smaller than state-of-the-art methods in terms of model parameters while achieving PSNR gains of at least 1 dB. Extensive experimental results show that our method performs favorably against state-of-the-art approaches on benchmarks and real-world videos.
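Read loosely, the temporal sharpness prior amounts to preferring, per pixel, the value from whichever of the center or adjacent frames is sharpest at that position. A sketch under that reading, with precomputed per-pixel sharpness scores standing in for the paper's actual measure:

```python
def temporal_sharpness_prior(center, neighbors, sharpness):
    """For each pixel, keep the value from whichever frame (center or an
    adjacent one) has the highest sharpness score at that position.

    center: H x W frame (nested lists); neighbors: list of such frames.
    sharpness: per-frame H x W score maps, aligned with [center] + neighbors.
    A hypothetical simplification of how sharp pixels guide restoration.
    """
    frames = [center] + neighbors
    h, w = len(center), len(center[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            best = max(range(len(frames)), key=lambda k: sharpness[k][i][j])
            out[i][j] = frames[best][i][j]
    return out

# The neighbor is sharper at pixel 0; the center is sharper at pixel 1.
restored = temporal_sharpness_prior(
    [[1.0, 2.0]], [[[5.0, 6.0]]],
    [[[0.1, 0.9]], [[0.8, 0.2]]],
)
```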


Subjects
Algorithms, Neural Networks, Computer
19.
IEEE Trans Neural Netw Learn Syst ; 34(4): 1838-1851, 2023 Apr.
Article in English | MEDLINE | ID: mdl-32502968

ABSTRACT

Hashing has been widely applied to multimodal retrieval on large-scale multimedia data due to its efficiency in computation and storage. In this article, we propose a novel deep semantic multimodal hashing network (DSMHN) for scalable image-text and video-text retrieval. The proposed deep hashing framework leverages a 2-D convolutional neural network (CNN) as the backbone to capture spatial information for image-text retrieval, and a 3-D CNN as the backbone to capture spatial and temporal information for video-text retrieval. In the DSMHN, two sets of modality-specific hash functions are jointly learned by explicitly preserving both intermodality similarities and intramodality semantic labels. Specifically, under the assumption that the learned hash codes should be optimal for the classification task, two stream networks are jointly trained to learn the hash functions by embedding the semantic labels in the resultant hash codes. Moreover, a unified deep multimodal hashing framework is proposed to learn compact, high-quality hash codes by simultaneously exploiting feature representation learning, intermodality similarity-preserving learning, semantic label-preserving learning, and hash function learning with different types of loss functions. The proposed DSMHN is a generic and scalable deep hashing framework for both image-text and video-text retrieval, and it can be flexibly integrated with different types of loss functions. We conduct extensive experiments on both single-modal and cross-modal retrieval tasks on four widely used multimodal retrieval datasets. Experimental results on both image-text and video-text retrieval tasks demonstrate that the DSMHN significantly outperforms the state-of-the-art methods.
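The final step of any such deep hashing pipeline is binarizing continuous network outputs into hash codes and comparing them by Hamming distance; a minimal sketch of that standard step (not DSMHN's loss functions):

```python
def binarize(features):
    """Quantize continuous network outputs into a binary hash code by
    sign: non-negative values map to 1, negative values to 0."""
    return [1 if x >= 0 else 0 for x in features]

def hamming(a, b):
    """Hamming distance between two hash codes; cheap to compute at
    scale, which is what makes hashing-based retrieval efficient."""
    return sum(x != y for x, y in zip(a, b))

code_a = binarize([0.3, -1.2, 0.0])
code_b = binarize([0.5, 0.8, 0.1])
dist = hamming(code_a, code_b)
```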

20.
IEEE Trans Pattern Anal Mach Intell ; 45(6): 6955-6968, 2023 Jun.
Article in English | MEDLINE | ID: mdl-33108281

ABSTRACT

Group activity recognition (GAR) is a challenging task aimed at recognizing the behavior of a group of people. It is a complex inference process in which visual cues collected from individuals are integrated into the final prediction, being aware of the interaction between them. This paper goes one step further beyond the existing approaches by designing a Hierarchical Graph-based Cross Inference Network (HiGCIN), in which three levels of information, i.e., the body-region level, person level, and group-activity level, are constructed, learned, and inferred in an end-to-end manner. Primarily, we present a generic Cross Inference Block (CIB), which is able to concurrently capture the latent spatiotemporal dependencies among body regions and persons. Based on the CIB, two modules are designed to extract and refine features for group activities at each level. Experiments on two popular benchmarks verify the effectiveness of our approach, particularly in the ability to infer with multilevel visual cues. In addition, training our approach does not require individual action labels to be provided, which greatly reduces the amount of labor required in data annotation.
