Results 1 - 20 of 93
1.
IEEE Trans Image Process ; 33: 2880-2894, 2024.
Article in English | MEDLINE | ID: mdl-38607703

ABSTRACT

Color transfer aims to change the color information of a target image according to a reference image. Many studies propose color transfer methods based on analysis of color distribution and semantic relevance, but these do not take the perceptual characteristics underlying visual quality into consideration. In this study, we propose a novel color transfer method based on saliency information with brightness optimization. First, a saliency detection module is designed to separate an image's foreground regions from its background regions. Then a dual-branch module is introduced to perform color transfer on the two kinds of regions. Finally, a brightness optimization operation is applied during the fusion of the foreground and background regions. Experimental results show that the proposed method performs color transfer while preserving color consistency well. Compared with existing studies, the proposed method obtains significant performance improvements. The source code and pre-trained models are available at https://github.com/PlanktonQAQ/SCTNet.
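
The abstract gives no implementation details; as a rough illustration of the final fusion stage only, here is a minimal sketch (function name, array conventions, and the brightness heuristic are all hypothetical) of blending the two branch outputs with a saliency map and matching brightness to the reference:

```python
import numpy as np

def fuse_with_brightness_match(fg_transfer, bg_transfer, saliency, reference):
    """Blend the two color-transfer branch outputs with a saliency map, then
    scale brightness toward the reference mean. fg_transfer, bg_transfer,
    reference: H x W x 3 floats in [0, 1]; saliency: H x W in [0, 1]."""
    s = saliency[..., None]
    fused = s * fg_transfer + (1.0 - s) * bg_transfer
    # Crude brightness optimization: match the global mean intensity.
    scale = reference.mean() / max(float(fused.mean()), 1e-6)
    return np.clip(fused * scale, 0.0, 1.0)
```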

2.
IEEE Trans Image Process ; 33: 2404-2418, 2024.
Article in English | MEDLINE | ID: mdl-38517711

ABSTRACT

Image Quality Assessment (IQA) is a fundamental task in computer vision that has witnessed remarkable progress with deep neural networks. Inspired by the characteristics of the human visual system, existing methods typically use a combination of global and local representations (i.e., multi-scale features) to achieve superior performance. However, most of them adopt simple linear fusion of multi-scale features and neglect their possibly complex relationships and interactions. In contrast, humans typically first form a global impression to locate important regions and then focus on local details in those regions. We therefore propose a top-down approach, named TOPIQ, that uses high-level semantics to guide the IQA network to focus on semantically important local distortion regions. Our approach involves the design of a heuristic coarse-to-fine network (CFANet) that leverages multi-scale features and progressively propagates multi-level semantic information to low-level representations in a top-down manner. A key component of our approach is the proposed cross-scale attention mechanism, which calculates attention maps for lower-level features guided by higher-level features. This mechanism emphasizes active semantic regions for low-level distortions, thereby improving performance. TOPIQ can be used for both Full-Reference (FR) and No-Reference (NR) IQA. We use ResNet50 as its backbone and demonstrate that TOPIQ achieves better or competitive performance on most public FR and NR benchmarks compared with state-of-the-art methods based on vision transformers, while being much more efficient (requiring only ∼13% of the FLOPS of the current best FR method). Code is released at https://github.com/chaofengc/IQA-PyTorch.
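
The cross-scale attention mechanism is described only at a high level; below is a minimal PyTorch sketch of one plausible reading (module names and dimensions are hypothetical), where upsampled high-level features supply the queries that attend over low-level features:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleAttention(nn.Module):
    """High-level semantics provide the queries that select semantically
    important regions in the low-level feature map. A real model would run
    this at coarse resolution to keep the HW x HW attention affordable."""
    def __init__(self, c_low, c_high, dim=128):
        super().__init__()
        self.q = nn.Conv2d(c_high, dim, 1)
        self.k = nn.Conv2d(c_low, dim, 1)
        self.v = nn.Conv2d(c_low, dim, 1)

    def forward(self, f_low, f_high):
        f_high = F.interpolate(f_high, size=f_low.shape[-2:], mode='bilinear',
                               align_corners=False)
        b, _, h, w = f_low.shape
        q = self.q(f_high).flatten(2).transpose(1, 2)    # B x HW x D
        k = self.k(f_low).flatten(2)                     # B x D x HW
        v = self.v(f_low).flatten(2).transpose(1, 2)     # B x HW x D
        attn = torch.softmax(q @ k / k.shape[1] ** 0.5, dim=-1)
        return (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
```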

3.
Article in English | MEDLINE | ID: mdl-38335084

ABSTRACT

Multiview clustering (MVC) has gained significant attention as it enables the partitioning of samples into their respective categories through unsupervised learning. However, several issues remain: 1) many existing deep clustering methods use the same latent features to pursue two conflicting objectives, namely reconstruction and view consistency. The reconstruction objective aims to preserve view-specific features for each individual view, while the view-consistency objective strives to obtain features common to all views; 2) some deep embedded clustering (DEC) approaches adopt view-wise fusion to obtain a consensus feature representation, but overlook the correlation between samples, making it challenging to derive discriminative consensus representations; and 3) many methods use contrastive learning (CL) to align the views' representations, yet do not take cluster information into account when constructing sample pairs, which can lead to false negative pairs. To address these issues, we propose a novel anchor-sharing and clusterwise contrastive learning (CwCL) network for multiview representation learning. First, we separate view-specific learning and view-common learning into different network branches, which resolves the conflict between reconstruction and consistency. Second, we design an anchor-sharing feature aggregation (ASFA) module, which learns shared anchors from different batches of data samples, establishes a bipartite relationship between anchors and samples, and leverages it to improve the sample representations. This module enhances the discriminative power of the common representation across samples. Third, we design the CwCL module, which incorporates the learned transition probability into CL, allowing us to focus on minimizing the similarity between representations of negative pairs with a low transition probability. This alleviates the conflict in previous sample-level contrastive alignment. Experimental results demonstrate that our method outperforms the state of the art.
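
The abstract only outlines how transition probabilities enter the contrastive objective; one plausible hedged formulation is a weighted InfoNCE variant (all tensor names and the weighting scheme are assumptions), where negatives likely to share a cluster are down-weighted:

```python
import torch
import torch.nn.functional as F

def clusterwise_contrastive_loss(z1, z2, trans_prob, tau=0.5):
    """Weighted InfoNCE sketch: cross-view negatives that likely share a
    cluster (high transition probability) are down-weighted to avoid false
    negatives. z1, z2: B x D paired view embeddings; trans_prob: B x B
    estimated same-cluster probabilities."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    sim = z1 @ z2.t() / tau                          # B x B similarities
    pos = sim.diag()                                 # matched (positive) pairs
    neg_w = (1.0 - trans_prob).fill_diagonal_(0.0)   # suppress likely positives
    neg = torch.logsumexp(sim + torch.log(neg_w + 1e-8), dim=1)
    return (neg - pos).mean()
```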

4.
IEEE Trans Image Process ; 32: 6020-6031, 2023.
Article in English | MEDLINE | ID: mdl-37910424

ABSTRACT

In this paper, we present the first attempt at determining the achievable rate-distortion (R-D) performance bound of versatile video coding (VVC) intra coding when the mutual dependency in the rate-distortion optimization (RDO) process is taken into account. In particular, the abundant search space of encoding parameters in VVC intra coding is practically explored with a beam search-based joint rate-distortion optimization (BSJRDO) scheme. As such, the partitioning, prediction, and transform decisions are jointly optimized across different coding units (CUs) with a customized search subset instead of the full space. To make the beam search process implementation-friendly for VVC, the dependencies among the CUs are truncated at different depths. To facilitate finer computational scalability, the beam size is flexibly adjusted based on the characteristics of the CUs, so that operational points satisfying the complexity demands of diverse applications can be practically obtained. The proposed BSJRDO approach, which fully conforms to the VVC decoding syntax, can serve both as a path toward the optimal RDO bound and as a practical performance-boosting solution. BSJRDO is further implemented on a VVC coding platform (VVC Test Model (VTM) 12.0), and extensive experiments show that it achieves 1.30% and 3.22% bit rate savings over the VTM anchor under the common test condition and low-bit-rate coding scenarios, respectively. Moreover, the performance gain can be flexibly customized with different computational overheads.
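
A generic beam-search skeleton over coding decisions may help fix ideas; the `expand` and `rd_cost` callables below are hypothetical stand-ins for VVC's candidate generation and D + λR evaluation, which the paper performs with the actual encoder:

```python
def beam_search_rdo(root_state, expand, rd_cost, beam_size=4, depth=3):
    """Generic beam search over coding decisions. `expand(state)` yields the
    candidate partitioning/prediction/transform choices for the next CU;
    `rd_cost(state)` returns the accumulated D + lambda * R cost."""
    beam = [root_state]
    for _ in range(depth):                     # truncate dependencies in depth
        candidates = [nxt for s in beam for nxt in expand(s)]
        if not candidates:
            break
        # Keep only the beam_size candidates with the lowest R-D cost.
        beam = sorted(candidates, key=rd_cost)[:beam_size]
    return min(beam, key=rd_cost)
```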

6.
Article in English | MEDLINE | ID: mdl-37647188

ABSTRACT

Deep learning approaches to Image Aesthetics Assessment (IAA) have shown promising results in recent years, but the internal mechanisms of these models remain unclear. Previous studies have demonstrated that image aesthetics can be predicted using semantic features, such as pre-trained object classification features. However, these semantic features are learned implicitly, so previous works have not elucidated what the semantic features represent. In this work, we aim to create a more transparent deep learning framework for IAA by introducing explainable semantic features. To achieve this, we propose Tag-based Content Descriptors (TCDs), where each value in a TCD describes the relevance of an image to a human-readable tag referring to a specific type of image content. This allows us to build IAA models from explicit descriptions of image content. We first propose an explicit matching process to produce TCDs that use predefined tags to describe image content. We show that a simple MLP-based IAA model with TCDs based only on predefined tags can achieve an SRCC of 0.767, which is comparable to most state-of-the-art methods. However, predefined tags may not be sufficient to describe all possible image content that the model may encounter. Therefore, we further propose an implicit matching process to describe image content that cannot be covered by predefined tags. By integrating components obtained from the implicit matching process into TCDs, the IAA model further achieves an SRCC of 0.817, which significantly outperforms existing IAA methods. Both the explicit and implicit matching processes are realized by the proposed TCD generator. To evaluate the performance of the proposed TCD generator in matching images with predefined tags, we also labeled 5101 images with photography-related tags to form a validation set. Experimental results show that the proposed TCD generator can meaningfully assign photography-related tags to images.
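
The abstract does not state how tag relevance is computed; one plausible sketch of the explicit matching step assumes a joint image-text embedding space (the encoder, tag vocabulary size, and head dimensions are all assumptions, not the paper's stated generator):

```python
import torch.nn as nn
import torch.nn.functional as F

def tag_content_descriptor(image_emb, tag_embs):
    """Explicit-matching sketch: each TCD entry is the cosine relevance of
    the image to one predefined, human-readable tag.
    image_emb: D; tag_embs: T x D (e.g., from a joint image-text encoder)."""
    return F.cosine_similarity(image_emb.unsqueeze(0), tag_embs, dim=1)

# A simple MLP regressor from TCDs to an aesthetic score, as in the abstract.
n_tags = 128   # hypothetical size of the predefined tag vocabulary
iaa_head = nn.Sequential(nn.Linear(n_tags, 64), nn.ReLU(), nn.Linear(64, 1))
```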

7.
IEEE Trans Pattern Anal Mach Intell ; 45(7): 8003-8019, 2023 Jul.
Article in English | MEDLINE | ID: mdl-37276121

ABSTRACT

Compared with current RGB or RGB-D saliency detection datasets, those for light field saliency detection often suffer from many defects, e.g., insufficient data amount and diversity, incomplete data formats, and rough annotations, thus impeding the prosperity of this field. To address these issues, we elaborately build a large-scale light field dataset, dubbed PKU-LF, comprising 5,000 light fields and covering diverse indoor and outdoor scenes. Our PKU-LF provides all-inclusive representation formats of light fields and offers a unified platform for comparing algorithms that use different input formats. To spark new vitality in saliency detection tasks, we present many unexplored scenarios (such as underwater and high-resolution scenes) and the richest annotations to date (such as scribble annotations, bounding boxes, object-/instance-level annotations, and edge annotations), on which many potential attention modeling tasks can be investigated. To facilitate the development of saliency detection, we systematically evaluate and analyze 16 representative 2D, 3D, and 4D methods on four existing datasets and the proposed dataset, furnishing a thorough benchmark. Furthermore, tailored to the distinct structural characteristics of light fields, a novel symmetric two-stream architecture (STSA) network is proposed to predict the saliency of light fields more accurately. Specifically, our STSA incorporates a focalness interweavement module (FIM) and three partial decoder modules (PDMs). The former is designed to efficiently establish long-range dependencies across focal slices, while the latter aim to effectively aggregate the extracted coadjutant features in a mutually enhancing way. Extensive experiments demonstrate that our method significantly outperforms its competitors.

8.
Article in English | MEDLINE | ID: mdl-37028049

ABSTRACT

Point cloud registration is a popular topic that has been widely used in 3D model reconstruction, localization, and retrieval. In this paper, we propose a new registration method, KSS-ICP, to address the rigid registration task in Kendall shape space (KSS) with Iterative Closest Point (ICP). The KSS is a quotient space that removes the influence of translation, scale, and rotation for shape feature-based analysis. These influences amount to similarity transformations, which do not change the shape feature, so the point cloud representation in KSS is invariant to similarity transformations. We exploit this property in designing KSS-ICP for point cloud registration. To sidestep the difficulty of computing the KSS representation in general, the proposed KSS-ICP formulates a practical solution that requires no complex feature analysis, training data, or optimization. With a simple implementation, KSS-ICP achieves more accurate registration of point clouds. It is robust to similarity transformations, non-uniform density, noise, and defective parts. Experiments show that KSS-ICP outperforms the state of the art. Code and executable files are publicly available.
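
As a hedged sketch of the overall idea (not the authors' implementation): remove translation and scale up front, a rough stand-in for working in Kendall shape space, and let the iterative alignment handle rotation:

```python
import numpy as np
from scipy.spatial import cKDTree

def normalize_shape(P):
    """Remove translation and scale -- a crude surrogate for mapping an
    N x 3 point cloud into Kendall shape space."""
    P = P - P.mean(axis=0)
    return P / np.linalg.norm(P)

def icp_step(src, dst):
    """One ICP iteration: nearest-neighbor pairing, then the best-fitting
    rotation via the Kabsch/SVD solution."""
    B = dst[cKDTree(dst).query(src)[1]]
    U, _, Vt = np.linalg.svd(src.T @ B)
    d = np.sign(np.linalg.det(U @ Vt))
    R = U @ np.diag([1.0, 1.0, d]) @ Vt     # reflection-safe rotation
    return src @ R
```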

9.
IEEE Trans Pattern Anal Mach Intell ; 45(3): 3274-3291, 2023 Mar.
Article in English | MEDLINE | ID: mdl-35737618

ABSTRACT

With the rapid development of 3D scanning technology, research and applications based on 3D point clouds are becoming increasingly popular. However, major difficulties still exist that affect the utility of point clouds: the lack of local adjacency information, non-uniform point density, and limited control over the number of points. In this paper, we propose a two-step intrinsic and isotropic (I&I) resampling framework to address these three difficulties. The efficient intrinsic control provides geodesic measurements on a point cloud to improve local region detection while avoiding redundant geodesic computation. The geometrically optimized resampling then uses a geometric update process to optimize the point cloud into an isotropic or adaptively isotropic one. The point density can be made globally uniform (isotropic) or locally uniform with geometric features preserved (adaptively isotropic). The number of points can be controlled according to application requirements or user specification. Experiments show that our resampling framework achieves outstanding performance in different applications: point cloud simplification, mesh reconstruction, and shape registration. We provide the implementation code of our resampling method at https://github.com/vvvwo/II-resampling.
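
The geometric update process is not specified in the abstract; the toy relaxation below conveys only the isotropic idea, under strong simplifications (Euclidean rather than geodesic neighborhoods, and no projection back onto the underlying surface, which the real framework would require):

```python
import numpy as np
from scipy.spatial import cKDTree

def isotropic_relax(points, k=8, step=0.2, iters=10):
    """Push each point away from its k nearest neighbors until spacing
    becomes locally uniform (hypothetical defaults)."""
    P = points.copy()
    for _ in range(iters):
        d, idx = cKDTree(P).query(P, k=k + 1)    # column 0 is the point itself
        diff = P[:, None, :] - P[idx[:, 1:]]     # vectors away from neighbors
        w = 1.0 / (d[:, 1:, None] ** 2 + 1e-12)  # nearer neighbors push harder
        disp = (w * diff).sum(axis=1)
        disp /= np.linalg.norm(disp, axis=1, keepdims=True) + 1e-12
        P = P + step * d[:, 1:].mean(axis=1, keepdims=True) * disp
    return P
```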

10.
IEEE Trans Cybern ; 53(1): 526-538, 2023 Jan.
Article in English | MEDLINE | ID: mdl-35417367

ABSTRACT

Salient object detection (SOD) in optical remote sensing images (RSIs), or RSI-SOD, is an emerging topic in understanding optical RSIs. However, due to the differences between optical RSIs and natural scene images (NSIs), directly applying NSI-SOD methods to optical RSIs fails to achieve satisfactory results. In this article, we propose a novel adjacent context coordination network (ACCoNet) to explore the coordination of adjacent features in an encoder-decoder architecture for RSI-SOD. Specifically, ACCoNet consists of three parts: 1) an encoder; 2) adjacent context coordination modules (ACCoMs); and 3) a decoder. As the key component of ACCoNet, ACCoM activates the salient regions of the encoder's output features and transmits them to the decoder. ACCoM contains a local branch and two adjacent branches that coordinate multilevel features simultaneously. The local branch highlights salient regions adaptively, while the adjacent branches introduce global information from adjacent levels to enhance salient regions. In addition, to extend the capabilities of the classic decoder block (i.e., several cascaded convolutional layers), we augment it with two bifurcations and propose a bifurcation-aggregation block (BAB) to capture contextual information in the decoder. Extensive experiments on two benchmark datasets demonstrate that the proposed ACCoNet outperforms 22 state-of-the-art methods under nine evaluation metrics, and runs at up to 81 fps on a single NVIDIA Titan X GPU. The code and results of our method are available at https://github.com/MathLee/ACCoNet.

11.
IEEE Trans Image Process ; 31: 4937-4951, 2022.
Article in English | MEDLINE | ID: mdl-35853054

ABSTRACT

Due to the rapid increase in video traffic and relatively limited delivery infrastructure, end users often experience dynamically varying quality over time when viewing streaming videos. The user quality-of-experience (QoE) must be continuously monitored to deliver an optimized service. However, modern approaches to continuous-time video QoE estimation require dense annotation of continuous-time QoE labels, which is labor-intensive and time-consuming. To cope with these limitations, we propose a novel weakly supervised domain adaptation approach for continuous-time QoE evaluation that makes use of a small amount of continuously labeled data in the source domain and abundant weakly labeled data (containing only retrospective QoE labels) in the target domain. Specifically, given a pair of videos from the source and target domains, an effective spatiotemporal segment-level feature representation is first learned by a combination of 2D and 3D convolutional networks. Then, a multi-task prediction framework is developed to simultaneously achieve continuous-time and retrospective QoE predictions, in which a quality-attentive adaptation approach is investigated to effectively alleviate the domain discrepancy without hampering prediction performance. This approach works by explicitly attending to video-level discrimination and segment-level transferability in terms of the domain discrepancy. Experiments on benchmark databases demonstrate that the proposed method significantly improves prediction performance under the cross-domain setting.

12.
IEEE Trans Image Process ; 31: 3697-3712, 2022.
Article in English | MEDLINE | ID: mdl-35594233

ABSTRACT

The just noticeable difference (JND) of natural images refers to the maximum magnitude of pixel intensity change that the typical human visual system (HVS) cannot perceive. Existing efforts on JND estimation are mainly dedicated to modeling diverse masking effects in the spatial and/or frequency domains and then fusing them into an overall JND estimate. In this work, we address the problem in a dramatically different way, with a top-down design philosophy. Instead of explicitly formulating and fusing different masking effects in a bottom-up manner, the proposed JND estimation model first predicts a critical perceptually lossless (CPL) counterpart of the original image and then calculates the difference map between the original image and the predicted CPL image as the JND map. We conducted subjective experiments to determine the critical points of 500 images and found that the distribution of cumulative normalized KLT coefficient energy values at these critical points over all 500 images can be well characterized by a Weibull distribution. Given a test image, its corresponding critical point is determined by a simple weighted average scheme in which the weights are given by the fitted Weibull distribution function. The performance of the proposed JND model is evaluated explicitly with direct JND prediction and implicitly with two applications: JND-guided noise injection and JND-guided image compression. Experimental results demonstrate that our JND model achieves better performance than several recent JND models. In addition, we compare the proposed JND model with existing visual difference predictor (VDP) metrics in terms of the capability to detect and discriminate distortion; the results indicate that our JND model also performs well in this task. The code for this work is available at https://github.com/Zhentao-Liu/KLT-JND.
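
As a hedged sketch of the top-down pipeline (a fixed energy threshold stands in for the Weibull-derived critical point, and patch-PCA stands in for the full KLT procedure described above):

```python
import numpy as np

def jnd_via_patch_klt(img, patch=8, energy_target=0.98):
    """Toy version of the CPL idea: keep the leading patch-KLT components up
    to a cumulative-energy point, reconstruct a pseudo-CPL image, and take
    the residual as the JND map. `img` is a 2-D grayscale float array."""
    img = img.astype(float)
    h, w = img.shape
    H, W = h - h % patch, w - w % patch
    X = (img[:H, :W]
         .reshape(H // patch, patch, W // patch, patch)
         .transpose(0, 2, 1, 3)
         .reshape(-1, patch * patch))
    mu = X.mean(axis=0)
    _, S, Vt = np.linalg.svd(X - mu, full_matrices=False)  # rows of Vt: KLT basis
    energy = np.cumsum(S ** 2) / np.sum(S ** 2)
    m = int(np.searchsorted(energy, energy_target)) + 1    # "critical point"
    cpl = (X - mu) @ Vt[:m].T @ Vt[:m] + mu                # pseudo-CPL patches
    jnd = np.abs(X - cpl)
    return (jnd.reshape(H // patch, W // patch, patch, patch)
               .transpose(0, 2, 1, 3).reshape(H, W))
```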


Subjects
Algorithms; Data Compression; Data Compression/methods; Differential Threshold; Humans
13.
IEEE Trans Image Process ; 31: 3066-3080, 2022.
Article in English | MEDLINE | ID: mdl-35394908

ABSTRACT

With stereoscopic images now pervasive, how to assess the visual quality of 3D images has attracted increasing attention in the field of Stereoscopic Image Quality Assessment (SIQA). Compared with 2D-IQA, SIQA is more challenging because complicated behaviors of the Human Visual System (HVS), such as binocular interaction and binocular fusion, must be considered. In this paper, considering both the binocular interaction and fusion mechanisms of the HVS, a hierarchical no-reference stereoscopic image quality assessment network (StereoIF-Net) is proposed to simulate the entire quality perception of 3D visual signals in the human cortex, with two key module types: BIMs and a BFM. In particular, Binocular Interaction Modules (BIMs) are constructed to simulate binocular interaction in the V2-V5 visual cortex regions, in which a novel cross convolution is designed to explore the interaction details in each region. In the BIMs, different numbers of output channels are used to imitate the various receptive fields of V2-V5. Furthermore, a Binocular Fusion Module (BFM) with automatically learned weights is proposed to model binocular fusion of the HVS in higher cortical layers. Verification experiments are conducted on the LIVE 3D, IVC, and Waterloo-IVC SIQA databases, and three indices, PLCC, SROCC, and RMSE, are employed to evaluate the assessment consistency between StereoIF-Net and the HVS. The proposed StereoIF-Net achieves nearly the best results compared with advanced SIQA methods: its metric values are the best on LIVE 3D, IVC, and WIVC-I, and second-best on WIVC-II.


Subjects
Depth Perception; Imaging, Three-Dimensional; Databases, Factual; Humans; Imaging, Three-Dimensional/methods
14.
IEEE Trans Image Process ; 31: 2279-2294, 2022.
Article in English | MEDLINE | ID: mdl-35239481

ABSTRACT

Numerous single image super-resolution (SISR) algorithms have been proposed in recent years to reconstruct a high-resolution (HR) image from its low-resolution (LR) observation. However, fairly comparing the performance of different SISR algorithms/results remains a challenging problem. So far, the lack of comprehensive human subjective studies on large-scale real-world SISR datasets, and of accurate objective SISR quality assessment metrics, has made it unreliable to truly understand the performance of different SISR algorithms. In this paper, we make efforts to tackle these two issues. First, we construct a real-world SISR quality dataset (i.e., RealSRQ) and conduct human subjective studies to compare the performance of representative SISR algorithms. Second, we propose a new objective metric, KLTSRQA, based on the Karhunen-Loève Transform (KLT), to evaluate the quality of SISR images in a no-reference (NR) manner. Experiments on our constructed RealSRQ and the latest synthetic SISR quality dataset (i.e., QADS) demonstrate the superiority of our KLTSRQA metric, which achieves higher consistency with human subjective scores than relevant existing NR image quality assessment (NR-IQA) metrics. The dataset and the code will be made available at https://github.com/Zhentao-Liu/RealSRQ-KLTSRQA.


Subjects
Algorithms; Neural Networks, Computer; Benchmarking; Humans
15.
IEEE Trans Image Process ; 31: 2027-2039, 2022.
Article in English | MEDLINE | ID: mdl-35167450

ABSTRACT

Quality assessment of 3D-synthesized images has traditionally been based on detecting specific categories of distortion, such as stretching, black holes, and blurring. However, such approaches have limited ability to detect all the distortions present in 3D synthesized images, which hurts their performance. This work proposes an algorithm to efficiently detect distortions and subsequently evaluate the perceptual quality of 3D synthesized images. The generation process of 3D synthesized images produces a shift of a few pixels between the reference and the 3D synthesized image, so the two are not properly aligned. To address this, we propose applying a morphological opening operation to the residual image to remove perceptually unimportant differences between the reference and the distorted 3D synthesized image. The resulting residual image suppresses perceptually unimportant information and highlights the geometric distortions that significantly affect the overall quality of 3D synthesized images. We use the information in this residual image to quantify perceptual quality, and name this the Perceptually Unimportant Information Reduction (PU-IR) algorithm. At the same time, the residual image cannot capture minor structural and geometric distortions, because of the erosion involved in the opening operation. To address this, we extract perceptually important deep features from a pre-trained VGG-16 architecture on a Laplacian pyramid. The distortions in 3D synthesized images appear in patches, and the human visual system perceives even small levels of these distortions. With this in mind, we compare these deep features between the reference and distorted images using cosine similarity, and name this the Deep Feature extraction and comparison using Cosine Similarity (DF-CS) algorithm. Cosine similarity measures the similarity of the deep features rather than the magnitude of their difference. Finally, the objective quality score is obtained by simply multiplying the PU-IR and DF-CS scores. Our source code is available online: https://github.com/sadbhawnathakur/3D-Image-Quality-Assessment.
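
A minimal sketch of the two measures as described (using `scipy.ndimage.grey_opening`; the score mappings, defaults, and feature extraction are assumptions):

```python
import numpy as np
from scipy.ndimage import grey_opening

def pu_ir_score(ref, dist, size=3):
    """PU-IR sketch: opening the |reference - distorted| residual suppresses
    small misalignment artifacts and keeps larger geometric distortions;
    lower residual energy maps to a higher quality score. ref, dist: 2-D
    grayscale float arrays."""
    residual = grey_opening(np.abs(ref - dist), size=(size, size))
    return 1.0 / (1.0 + residual.mean())

def df_cs_score(feat_ref, feat_dist):
    """DF-CS sketch: cosine similarity between deep feature tensors (e.g.,
    VGG-16 activations on a Laplacian pyramid level), flattened to vectors."""
    a, b = feat_ref.ravel(), feat_dist.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Final score, per the abstract: simple multiplication of the two measures.
# quality = pu_ir_score(ref, dist) * df_cs_score(f_ref, f_dist)
```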


Subjects
Algorithms; Imaging, Three-Dimensional; Humans
16.
Article in English | MEDLINE | ID: mdl-37015489

ABSTRACT

With the development of 3D digital geometry technology, 3D triangular meshes are becoming more useful and valuable in industrial manufacturing and digital entertainment. A high-quality triangular mesh can represent a real-world object with its geometric and physical characteristics. While anisotropic meshes have the advantage of representing shapes with sharp features (such as trimmed surfaces) more efficiently and accurately, isotropic meshes allow more numerically stable computations. When there is no requirement for an anisotropic mesh, isotropic triangles are always a good choice. In this paper, we propose a remeshing method that converts an input mesh into an adaptively isotropic one based on a curvature smoothed field (CSF). With the help of the CSF, adaptively isotropic remeshing retains curvature sensitivity, which preserves more geometric features and avoids obtuse triangles in the remeshed model as much as possible. The remeshed triangles, being locally isotropic, benefit various geometric processes such as neighbor-based feature extraction and analysis. Experimental results show that our method achieves a better balance between geometric feature preservation and mesh quality improvement than its peers. We provide the implementation code of our remeshing method at https://github.com/vvvwo/Adaptively-Isotropic-Remeshing.
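
The CSF construction is not spelled out in the abstract; a plausible minimal sketch is Laplacian-style smoothing of per-vertex curvature, whose output could then set per-vertex target edge lengths for the remesher (all names and defaults hypothetical):

```python
import numpy as np

def smooth_curvature_field(curvature, neighbors, iters=10, lam=0.5):
    """Repeatedly blend each vertex curvature with the mean over its 1-ring
    neighbors. `neighbors[i]` is the list of vertex indices adjacent to
    vertex i; `lam` controls the smoothing strength."""
    c = np.asarray(curvature, dtype=float).copy()
    for _ in range(iters):
        ring = np.array([c[n].mean() if len(n) else c[i]
                         for i, n in enumerate(neighbors)])
        c = (1.0 - lam) * c + lam * ring
    return c
```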

17.
IEEE Trans Image Process ; 30: 8426-8438, 2021.
Article in English | MEDLINE | ID: mdl-34606454

ABSTRACT

We present a simple yet effective progressive self-guided loss function to facilitate deep learning-based salient object detection (SOD) in images. The saliency maps produced by even the most relevant works still suffer from incomplete predictions due to the internal complexity of salient objects. Our proposed progressive self-guided loss simulates a morphological closing operation on the model predictions to progressively create auxiliary training supervision that guides the training process step by step. We demonstrate that this new loss function can guide the SOD model to highlight more complete salient objects step by step and meanwhile help to uncover the spatial dependencies of salient object pixels in a region-growing manner. Moreover, a new feature aggregation module is proposed to capture multi-scale features and aggregate them adaptively via a branch-wise attention mechanism. Benefiting from this module, our SOD framework uses adaptively aggregated multi-scale features to locate and detect salient objects effectively. Experimental results on several benchmark datasets show that our loss function not only advances the performance of existing SOD models without architecture modification but also helps our proposed framework achieve state-of-the-art performance.
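
A hedged PyTorch sketch of the core idea (the kernel size, loss mix, and auxiliary-target construction are guesses; `pred` is assumed to be a sigmoid saliency map of shape B x 1 x H x W):

```python
import torch.nn.functional as F

def closing(x, k=5):
    """Morphological closing on a probability map via max-pooling:
    dilation, then erosion (erosion = negated dilation of the negation)."""
    pad = k // 2
    return -F.max_pool2d(-F.max_pool2d(x, k, 1, pad), k, 1, pad)

def self_guided_loss(pred, gt, k=5):
    """An auxiliary target built from the (detached) closed prediction,
    restricted here to the ground-truth foreground, supervises the model
    alongside the usual BCE term."""
    aux = (closing(pred.detach(), k) * gt).clamp(0, 1)
    return F.binary_cross_entropy(pred, gt) + F.binary_cross_entropy(pred, aux)
```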

18.
IEEE Trans Image Process ; 30: 7241-7255, 2021.
Article in English | MEDLINE | ID: mdl-34403339

ABSTRACT

As an information-intensive 3D representation, a point cloud usually requires a large amount of transmission, storage, and computing resources, which seriously hinders its use in many emerging fields. In this paper, we propose a novel point cloud simplification method, Approximate Intrinsic Voxel Structure (AIVS), to meet the diverse demands of real-world application scenarios. The method comprises point cloud pre-processing (denoising and down-sampling), an AIVS-based realization of isotropic simplification, and flexible simplification with intrinsic control of point distance. To demonstrate the effectiveness of the proposed AIVS-based method, we conducted extensive experiments comparing it with several relevant point cloud simplification methods on three public datasets: Stanford, SHREC, and RGB-D scene models. The experimental results indicate that AIVS has great advantages over its peers in terms of moving least squares (MLS) surface approximation quality, curvature-sensitive sampling, sharp-feature preservation, and processing speed. The source code of the proposed method is publicly available (https://github.com/vvvwo/AIVS-project).
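
The AIVS structure itself is more elaborate than this, but a plain voxel-grid simplifier illustrates the baseline idea of structure-controlled down-sampling (voxel size is a hypothetical default; AIVS adds intrinsic, geodesic-aware control):

```python
import numpy as np

def voxel_simplify(points, voxel=0.05):
    """Snap N x 3 points to a voxel grid and keep one centroid per occupied
    voxel, giving a roughly uniform, density-controlled subset."""
    keys = np.floor(points / voxel).astype(np.int64)
    _, inv = np.unique(keys, axis=0, return_inverse=True)
    sums = np.zeros((inv.max() + 1, points.shape[1]))
    np.add.at(sums, inv, points)          # accumulate points per voxel
    return sums / np.bincount(inv)[:, None]
```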

19.
IEEE Trans Neural Netw Learn Syst ; 32(10): 4278-4290, 2021 10.
Article in English | MEDLINE | ID: mdl-34460393

ABSTRACT

This article devises a photograph-based monitoring model to estimate real-time PM2.5 concentrations, overcoming the shortcomings of currently popular electrochemical-sensor-based PM2.5 monitoring methods, such as low-density spatial coverage and time delay. Combined with the proposed monitoring model, photographs taken by various camera devices (e.g., surveillance cameras, automobile data recorders, and mobile phones) can monitor PM2.5 concentrations widely across megacities. This offers helpful decision-making information for atmospheric forecasting and control, thereby helping to curb the COVID-19 epidemic. Specifically, the proposed model, dubbed IAWD, fuses information abundance measurement with wide and deep learning for PM2.5 monitoring. First, our model extracts two categories of features in a newly proposed DS transform space to measure the information abundance (IA) of a given photograph, since growing PM2.5 concentrations decrease IA. Second, to simultaneously possess the advantages of memorization and generalization, a new wide and deep neural network is devised to learn a nonlinear mapping between the extracted features and the ground-truth PM2.5 concentration. Experiments on two recently established datasets, together comprising more than 100,000 photographs, demonstrate the effectiveness of our extracted features and the superiority of the proposed IAWD model over state-of-the-art relevant computing techniques.
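
The wide-and-deep component follows a well-known pattern; a minimal PyTorch sketch (layer sizes are hypothetical, and the IA features are assumed to be computed upstream):

```python
import torch.nn as nn

class WideAndDeep(nn.Module):
    """Wide-and-deep sketch for PM2.5 regression: the wide (linear) path
    memorizes direct feature-to-concentration correlations, while the deep
    path generalizes through nonlinear feature combinations."""
    def __init__(self, n_feat):
        super().__init__()
        self.wide = nn.Linear(n_feat, 1)
        self.deep = nn.Sequential(
            nn.Linear(n_feat, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1))

    def forward(self, x):                   # x: B x n_feat IA features
        return self.wide(x) + self.deep(x)  # predicted concentration
```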


Subjects
Deep Learning; Environmental Monitoring/methods; Particle Size; Algorithms; COVID-19/prevention & control; Databases, Factual; Humans; Nonlinear Dynamics; Particulate Matter; Photography; SARS-CoV-2
20.
IEEE Trans Image Process ; 30: 3528-3542, 2021.
Article in English | MEDLINE | ID: mdl-33667161

ABSTRACT

Existing RGB-D Salient Object Detection (SOD) methods take advantage of depth cues to improve detection accuracy, while paying insufficient attention to the quality of the depth information. In practice, a depth map often has uneven quality and sometimes suffers from distractors, due to various factors in the acquisition procedure. In this article, to mitigate distractors in depth maps and highlight salient objects in RGB images, we propose a Hierarchical Alternate Interaction Network (HAINet) for RGB-D SOD. Specifically, HAINet consists of three key stages: feature encoding, cross-modal alternate interaction, and saliency reasoning. The main innovation in HAINet is the Hierarchical Alternate Interaction Module (HAIM), which plays the key role in the second stage of cross-modal feature interaction. HAIM first uses RGB features to filter distractors out of the depth features, and the purified depth features are then exploited to enhance the RGB features in turn. This alternate RGB-depth-RGB interaction proceeds in a hierarchical manner, progressively integrating local and global contexts within a single feature scale. In addition, we adopt a hybrid loss function to facilitate the training of HAINet. Extensive experiments on seven datasets demonstrate that HAINet not only achieves competitive performance compared with 19 relevant state-of-the-art methods, but also reaches a real-time processing speed of 43 fps on a single NVIDIA Titan X GPU. The code and results of our method are available at https://github.com/MathLee/HAINet.
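
A hedged sketch of one alternation step of HAIM as described, using simple learned gates (the gating form and channel handling are assumptions, not the authors' exact design):

```python
import torch.nn as nn

class AlternateInteraction(nn.Module):
    """One RGB-depth-RGB alternation: RGB features gate distractors out of
    the depth features, then the purified depth features gate the RGB
    features in turn."""
    def __init__(self, c):
        super().__init__()
        self.rgb_gate = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.Sigmoid())
        self.dep_gate = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.Sigmoid())

    def forward(self, f_rgb, f_dep):
        f_dep = f_dep * self.rgb_gate(f_rgb)   # filter depth distractors
        f_rgb = f_rgb * self.dep_gate(f_dep)   # enhance RGB with purified depth
        return f_rgb, f_dep
```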
