Results 1 - 20 of 28
1.
Opt Express ; 32(11): 18527-18538, 2024 May 20.
Article in English | MEDLINE | ID: mdl-38859006

ABSTRACT

Dynamic range (DR) is a pivotal characteristic of imaging systems. Current frame-based cameras struggle to achieve high dynamic range imaging due to the conflict between globally uniform exposure and spatially variant scene illumination. In this paper, we propose AsynHDR, a pixel-asynchronous HDR imaging system, based on key insights into the challenges of HDR imaging and the unique event-generating mechanism of dynamic vision sensors (DVS). Our AsynHDR system integrates the DVS with a set of LCD panels. The LCD panels modulate the irradiance incident upon the DVS by altering their transparency, thereby triggering pixel-independent event streams. The HDR image is subsequently decoded from the event streams by our temporal-weighted algorithm. Experiments on a standard test platform and in several challenging scenes verify the feasibility of the system for HDR imaging tasks.
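
As a rough illustration of the sensing principle described above, the toy simulation below models each DVS pixel as firing an event whenever its log intensity rises by a fixed contrast threshold while the LCD transparency ramps up, then decodes irradiance from the timestamp of the first event (brighter pixels cross the threshold earlier). The ramp profile, threshold value, and inverse-time decoding rule are illustrative assumptions, not the paper's actual temporal-weighted algorithm.

import numpy as np

rng = np.random.default_rng(0)
H, W = 64, 64
irradiance = rng.uniform(1.0, 200.0, size=(H, W))   # ground-truth HDR scene

C = 0.3            # assumed DVS contrast threshold (log-intensity step per event)
T_ramp = 1.0       # duration of the LCD transparency ramp (0 -> 1)
n_steps = 1000
t = np.linspace(1e-3, T_ramp, n_steps)

# Accumulated log intensity seen by each pixel as the transparency t / T_ramp rises.
log_I = np.log1p(irradiance[None, :, :] * (t[:, None, None] / T_ramp))

# An event fires each time the log intensity crosses another multiple of C.
event_counts = np.floor(log_I / C).astype(int)

# Timestamp of the first event per pixel: brighter pixels cross C earlier.
first_idx = np.argmax(event_counts > 0, axis=0)
first_time = t[first_idx]

# Assumed inverse-time decode: log(1 + E * t1 / T_ramp) = C  =>  E = expm1(C) * T_ramp / t1.
decoded = np.expm1(C) * T_ramp / first_time

corr = np.corrcoef(np.log(decoded).ravel(), np.log(irradiance).ravel())[0, 1]
print("log-domain correlation between decoded and true irradiance:", round(corr, 3))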

2.
IEEE Trans Pattern Anal Mach Intell ; 46(7): 4926-4943, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38349824

ABSTRACT

Change captioning aims to describe the semantic change between two similar images. In this process, viewpoint change is the most typical distractor: it leads to pseudo changes in the appearance and position of objects, thereby overwhelming the real change. Besides, since the visual signal of change appears in a local region with weak features, it is difficult for the model to directly translate the learned change features into a sentence. In this paper, we propose a syntax-calibrated multi-aspect relation transformer to learn effective change features under different scenes, and build reliable cross-modal alignment between the change features and linguistic words during caption generation. Specifically, a multi-aspect relation learning network is designed to 1) explore the fine-grained changes under irrelevant distractors (e.g., viewpoint change) by embedding the relations of semantics and relative position into the features of each image; 2) learn two view-invariant image representations by strengthening their global contrastive alignment relation, so as to help capture a stable difference representation; and 3) provide the model with prior knowledge about whether and where the semantic change happened by measuring the relation between the representation of the captured difference and the image pair. In this manner, the model can learn effective change features for caption generation. Further, we introduce the syntax knowledge of Part-of-Speech (POS) and devise a POS-based visual switch to calibrate the transformer decoder. The POS-based visual switch dynamically utilizes visual information during the generation of different words based on their POS. This enables the decoder to build reliable cross-modal alignment, so as to generate a high-level linguistic sentence about the change. Extensive experiments show that the proposed method achieves state-of-the-art performance on three public datasets.
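
A minimal sketch of the gating idea behind a POS-based visual switch follows: the amount of visual context fused into the decoder at each step is scaled by a gate that depends on the part of speech of the word being generated, so that visual words (nouns, adjectives) can learn to draw more heavily on the image. All parameter names and shapes below are illustrative assumptions, not the paper's implementation.

import numpy as np

rng = np.random.default_rng(1)
d = 8                                                   # feature dimension
pos_tags = ["NOUN", "VERB", "ADJ", "DET", "OTHER"]
pos_gate_logit = {p: rng.normal() for p in pos_tags}    # would be learnable in practice

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def visual_switch(visual_ctx, lang_hidden, pos):
    """Scale the visual context by a POS-dependent gate before fusing it with the language state."""
    g = sigmoid(pos_gate_logit[pos])    # visual words would learn large gates during training
    return g * visual_ctx + lang_hidden

visual_ctx = rng.normal(size=d)
lang_hidden = rng.normal(size=d)
for pos in ["NOUN", "DET"]:
    fused = visual_switch(visual_ctx, lang_hidden, pos)
    print(pos, "gate:", round(sigmoid(pos_gate_logit[pos]), 3), "fused shape:", fused.shape)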

3.
IEEE Trans Image Process ; 33: 1938-1951, 2024.
Article in English | MEDLINE | ID: mdl-38224517

ABSTRACT

Generalized Zero-Shot Learning (GZSL) aims at recognizing images from both seen and unseen classes by constructing correspondences between visual images and semantic embeddings. However, existing methods suffer from a strong bias problem, where unseen images in the target domain tend to be recognized as seen classes in the source domain. To address this issue, we propose a Prototype-augmented Self-supervised Generative Network that integrates self-supervised learning and prototype learning into a feature-generating model for GZSL. The proposed model enjoys several advantages. First, we propose a Self-supervised Learning Module to exploit inter-domain relationships, where we introduce anchors as a bridge between seen and unseen categories. In the shared space, we pull the distribution of the target domain away from the source domain and obtain domain-aware features. To the best of our knowledge, this is the first work to introduce self-supervised learning into GZSL as learning guidance. Second, a Prototype Enhancing Module is proposed to utilize class prototypes to model a reliable target-domain distribution at a finer granularity. In this module, a Prototype Alignment mechanism and a Prototype Dispersion mechanism are combined to guide the generation of better target-class features with intra-class compactness and inter-class separability. Extensive experimental results on five standard benchmarks demonstrate that our model performs favorably against state-of-the-art GZSL methods.
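
To make the roles of the two prototype mechanisms concrete, the sketch below shows one plausible form of the two terms: an alignment loss that pulls generated class features toward their class prototype (intra-class compactness) and a dispersion loss that pushes different prototypes apart (inter-class separability). These are generic formulations written for illustration, not the paper's exact losses.

import numpy as np

rng = np.random.default_rng(0)
n_classes, n_per_class, d = 5, 20, 16
prototypes = rng.normal(size=(n_classes, d))
# Generated features for each class (e.g., from a conditional feature generator).
feats = prototypes[:, None, :] + 0.5 * rng.normal(size=(n_classes, n_per_class, d))

def prototype_alignment(feats, prototypes):
    """Mean squared distance between generated features and their class prototype."""
    return np.mean(np.sum((feats - prototypes[:, None, :]) ** 2, axis=-1))

def prototype_dispersion(prototypes, margin=4.0):
    """Hinge penalty on pairs of prototypes that lie closer than a margin."""
    diff = prototypes[:, None, :] - prototypes[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=-1) + 1e-12)
    mask = ~np.eye(len(prototypes), dtype=bool)
    return np.mean(np.maximum(0.0, margin - dist[mask]))

print("alignment loss :", round(prototype_alignment(feats, prototypes), 3))
print("dispersion loss:", round(prototype_dispersion(prototypes), 3))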

4.
IEEE Trans Image Process ; 33: 625-638, 2024.
Article in English | MEDLINE | ID: mdl-38198242

ABSTRACT

How to model the effect of reflection is crucial for the single image reflection removal (SIRR) task. Modern SIRR methods usually simplify the reflection formulation by assuming a linear combination of a transmission layer and a reflection layer. However, the large variations in image content and real-world picture-taking conditions often result in far more complex reflections. In this paper, we introduce a new screen-blur combination based on two important factors, namely the intensity and the blurriness of reflection, to better characterize the reflection formulation in SIRR. Specifically, we present Screen-blur Reflection Networks (SRNet), which executes the screen-blur formulation in its network design and adapts to the complex reflections in real scenes. Technically, SRNet consists of three components: a blended image generator, a reflection estimator, and a reflection removal module. The image generator exploits the screen-blur combination to synthesize the training blended images. The reflection estimator learns the reflection layer and a blur degree that measures the level of blurriness of the reflection. The reflection removal module further uses the blended image, blur degree, and reflection layer to filter out the transmission layer in a cascaded manner. Superior results are reported for three different SIRR methods when the training data are generated according to the screen-blur combination. Moreover, extensive experiments on six datasets quantitatively and qualitatively demonstrate the efficacy of SRNet over the state-of-the-art methods.
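
The sketch below shows one plausible reading of a screen-blur blending model: the blended image is a "screen" blend of the transmission layer with a blurred, intensity-scaled reflection layer, so that both the intensity and the blurriness factors appear explicitly. The intensity value, blur kernel, and blend form are illustrative assumptions rather than the paper's exact generator.

import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
T = rng.uniform(0.0, 1.0, size=(128, 128, 3))   # transmission layer in [0, 1]
R = rng.uniform(0.0, 1.0, size=(128, 128, 3))   # reflection layer in [0, 1]

def screen_blur_blend(T, R, intensity=0.6, blur_sigma=3.0):
    """Screen blend B = 1 - (1 - T) * (1 - R'), where R' is a blurred, intensity-scaled reflection."""
    R_blur = gaussian_filter(R, sigma=(blur_sigma, blur_sigma, 0))   # no blur across channels
    R_eff = np.clip(intensity * R_blur, 0.0, 1.0)
    return 1.0 - (1.0 - T) * (1.0 - R_eff)

B = screen_blur_blend(T, R)
print(B.shape, float(B.min()), float(B.max()))   # the blend stays inside [0, 1]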

5.
IEEE Trans Pattern Anal Mach Intell ; 45(11): 12978-12995, 2023 Nov.
Article in English | MEDLINE | ID: mdl-35709118

ABSTRACT

Existing deep-learning-based de-raining approaches have resorted to convolutional architectures. However, the intrinsic limitations of convolution, including local receptive fields and independence of input content, hinder the model's ability to capture long-range and complicated rainy artifacts. To overcome these limitations, we propose an effective and efficient transformer-based architecture for image de-raining. First, we introduce general priors of vision tasks, i.e., locality and hierarchy, into the network architecture so that our model can achieve excellent de-raining performance without costly pre-training. Second, since the geometric appearance of rainy artifacts is complicated and varies significantly across space, it is essential for de-raining models to extract both local and non-local features. Therefore, we design complementary window-based and spatial transformers to enhance locality while capturing long-range dependencies. In addition, to compensate for the positional blindness of self-attention, we establish a separate representative space for modeling positional relationships, and design a new relative position enhanced multi-head self-attention. In this way, our model enjoys powerful abilities to capture dependencies from both content and position, so as to achieve better image content recovery while removing rainy artifacts. Experiments substantiate that our approach attains more appealing results than state-of-the-art methods, both quantitatively and qualitatively.
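
As a minimal, single-head illustration of adding positional information to self-attention, the sketch below adds a learned bias indexed by the relative offset between tokens to the content-based attention logits. This is a common way to encode relative position and is offered here as an assumed simplification of the paper's relative position enhanced multi-head self-attention.

import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 8                      # sequence length (e.g., tokens in a window) and feature dim
x = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# One learnable bias per relative offset in [-(n-1), n-1].
rel_bias = 0.1 * rng.normal(size=(2 * n - 1,))
offsets = np.arange(n)[:, None] - np.arange(n)[None, :]      # offset i - j for each pair
B = rel_bias[offsets + (n - 1)]                               # (n, n) relative-position bias

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

Q, K, V = x @ Wq, x @ Wk, x @ Wv
logits = Q @ K.T / np.sqrt(d) + B     # content term plus position term
out = softmax(logits) @ V
print(out.shape)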

6.
Article in English | MEDLINE | ID: mdl-37467094

ABSTRACT

Audiovisual event localization aims to localize the event that is both visible and audible in a video. Previous works focus on segment-level audio and visual feature sequence encoding and neglect event proposals and boundaries, which are crucial for this task. The event proposal features provide event-internal consistency among the consecutive segments constituting one proposal, while the event boundary features offer event boundary consistency that makes segments located at boundaries aware of the event occurrence. In this article, we explore proposal-level feature encoding and propose a novel context-aware proposal-boundary (CAPB) network to address audiovisual event localization. In particular, we design a local-global context encoder (LGCE) to aggregate local-global temporal context information for the visual sequence, the audio sequence, the event proposals, and the event boundaries, respectively. The local context from temporally adjacent segments or proposals contributes to event discrimination, while the global context from the entire video provides semantic guidance on temporal relationships. Furthermore, we enhance the structural consistency between segments by exploiting the above-encoded proposal and boundary representations. CAPB leverages the context information and structural consistency to obtain context-aware, event-consistent cross-modal representations for accurate event localization. Extensive experiments conducted on the audiovisual event (AVE) dataset show that our approach outperforms the state-of-the-art methods by clear margins in both supervised event localization and cross-modality localization.

7.
IEEE Trans Pattern Anal Mach Intell ; 45(6): 7711-7725, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37015417

ABSTRACT

We study the problem of localizing audio-visual events that are both audible and visible in a video. Existing works focus on encoding and aligning audio and visual features at the segment level while neglecting informative correlation between segments of the two modalities and between multi-scale event proposals. We propose a novel Semantic and Relation Modulation Network (SRMN) to learn the above correlation and leverage it to modulate the related auditory, visual, and fused features. In particular, for semantic modulation, we propose intra-modal normalization and cross-modal normalization. The former modulates features of a single modality with the event-relevant semantic guidance of the same modality. The latter modulates features of two modalities by establishing and exploiting the cross-modal relationship. For relation modulation, we propose a multi-scale proposal modulating module and a multi-alignment segment modulating module to introduce multi-scale event proposals and enable dense matching between cross-modal segments, which strengthen correlations between successive segments within one proposal and between all segments. With the features modulated by the correlation information regarding audio-visual events, SRMN performs accurate event localization. Extensive experiments conducted on the public AVE dataset demonstrate that our method outperforms the state-of-the-art methods in both supervised event localization and cross-modality localization tasks.

8.
Article in English | MEDLINE | ID: mdl-37943649

ABSTRACT

With high temporal resolution, high dynamic range, and low latency, event cameras have enabled great progress in numerous low-level vision tasks. To help restore low-quality (LQ) video sequences, most existing event-based methods employ convolutional neural networks (CNNs) to extract sparse event features without considering the spatially sparse distribution of events or the temporal relations among neighboring events. This results in insufficient use of the spatial and temporal information in events. To address this problem, we propose a new spiking-convolutional network (SC-Net) architecture to facilitate event-driven video restoration. Specifically, to properly extract the rich temporal information contained in the event data, we utilize a spiking neural network (SNN) to suit the sparse characteristics of events and capture temporal correlation in neighboring regions; to make full use of the spatial consistency between events and frames, we adopt CNNs to transform sparse events into an extra brightness prior that is aware of detailed textures in video sequences. In this way, both the temporal correlation in neighboring events and the mutual spatial information between the two types of features are fully explored and exploited to accurately restore detailed textures and sharp edges. The effectiveness of the proposed network is validated on three representative video restoration tasks: deblurring, super-resolution, and deraining. Extensive experiments on synthetic and real-world benchmarks demonstrate that our method performs better than existing competing methods.
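
To illustrate why a spiking neuron model suits temporally sparse events, the toy sketch below runs a leaky integrate-and-fire (LIF) layer over a sequence of sparse event frames: the membrane potential integrates incoming events, leaks over time, and only emits a spike when a threshold is crossed. The decay, threshold, and reset rule are generic assumptions, not SC-Net's specific design.

import numpy as np

rng = np.random.default_rng(0)
T, H, W = 10, 32, 32
# Sparse binary event frames (about 2% of pixels active per time step).
events = (rng.random((T, H, W)) < 0.02).astype(np.float32)

decay, v_th = 0.8, 1.0
v = np.zeros((H, W), dtype=np.float32)       # membrane potential
spikes = np.zeros_like(events)

for t in range(T):
    v = decay * v + events[t]                # leaky integration of incoming events
    fired = v >= v_th
    spikes[t] = fired
    v = np.where(fired, 0.0, v)              # hard reset after a spike

print("input events:", int(events.sum()), " output spikes:", int(spikes.sum()))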

9.
IEEE Trans Pattern Anal Mach Intell ; 45(3): 3003-3018, 2023 Mar.
Article in English | MEDLINE | ID: mdl-35759595

ABSTRACT

Weakly supervised Referring Expression Grounding (REG) aims to ground a particular target in an image described by a language expression while lacking correspondence annotations between targets and expressions. Two main problems exist in weakly supervised REG. First, the lack of region-level annotations introduces ambiguities between proposals and queries. Second, most previous weakly supervised REG methods ignore the discriminative location and context of the referent, causing difficulties in distinguishing the target from other same-category objects. To address these challenges, we design an entity-enhanced adaptive reconstruction network (EARN). Specifically, EARN includes three modules: entity enhancement, adaptive grounding, and collaborative reconstruction. In entity enhancement, we calculate semantic similarity as supervision to select candidate proposals. Adaptive grounding calculates the ranking scores of candidate proposals based on subject, location, and context with hierarchical attention. Collaborative reconstruction measures the ranking result from three perspectives: adaptive reconstruction, language reconstruction, and attribute classification. The adaptive mechanism helps to alleviate the variance of different referring expressions. Experiments on five datasets show that EARN outperforms existing state-of-the-art methods. Qualitative results demonstrate that EARN can better handle situations where multiple objects of a particular category are situated together.

10.
Article in English | MEDLINE | ID: mdl-37220051

ABSTRACT

Reflection from glass is ubiquitous in daily life, but it is usually undesirable in photographs. To remove these unwanted artifacts, existing methods utilize either correlative auxiliary information or handcrafted priors to constrain this ill-posed problem. However, due to their limited capability to describe the properties of reflections, these methods are unable to handle strong and complex reflection scenes. In this article, we propose a hue guidance network (HGNet) with two branches for single image reflection removal (SIRR), integrating image information and corresponding hue information. The complementarity between image information and hue information has not been explored before. The key insight is our finding that hue information describes reflections well and can thus serve as a superior constraint for the specific SIRR task. Accordingly, the first branch extracts salient reflection features by directly estimating the hue map. The second branch leverages these effective features, which help locate salient reflection regions, to obtain a high-quality restored image. Furthermore, we design a new cyclic hue loss to provide a more accurate optimization direction for network training. Experiments substantiate the superiority of our network, especially its excellent generalization ability to various reflection scenes, compared with state-of-the-art methods, both qualitatively and quantitatively. Source code is available at https://github.com/zhuyr97/HGRR.
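
Since hue is an angular quantity, a cyclic loss needs to measure errors on the circle so that hues near 0 and near 1 are treated as close. The sketch below shows one simple wrap-around formulation for illustration; the paper's exact cyclic hue loss may be parameterized differently.

import numpy as np

def cyclic_hue_loss(h_pred, h_true):
    """Hue in [0, 1); the distance wraps around, so 0.95 vs 0.05 costs 0.1, not 0.9."""
    d = np.abs(h_pred - h_true)
    return np.mean(np.minimum(d, 1.0 - d))

rng = np.random.default_rng(0)
h_true = rng.uniform(0.0, 1.0, size=(64, 64))
h_pred = (h_true + 0.05 * rng.normal(size=h_true.shape)) % 1.0
print("cyclic loss:", round(cyclic_hue_loss(h_pred, h_true), 4))
print("naive L1   :", round(float(np.mean(np.abs(h_pred - h_true))), 4))   # overpenalizes wrap-around errors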

11.
IEEE Trans Pattern Anal Mach Intell ; 45(8): 9534-9551, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37022385

ABSTRACT

Image deraining is a challenging task, since rain streaks have spatially long structures and exhibit complex diversity. Existing deep learning-based methods mainly construct deraining networks by stacking vanilla convolutional layers with local relations, and can only handle a single dataset due to catastrophic forgetting, resulting in limited performance and insufficient adaptability. To address these issues, we propose a new image deraining framework to effectively explore nonlocal similarity and to continuously learn on multiple datasets. Specifically, we first design a patchwise hypergraph convolutional module, which aims to better extract nonlocal properties with higher-order constraints on the data, to construct a new backbone and improve deraining performance. Then, to achieve better generalizability and adaptability in real-world scenarios, we propose a biologically brain-inspired continual learning algorithm. By imitating the plasticity mechanism of brain synapses during the learning and memory process, our continual learning process allows the network to achieve a subtle stability-plasticity tradeoff. As a result, it can effectively alleviate catastrophic forgetting and enables a single network to handle multiple datasets. Compared with the competitors, our new deraining network with unified parameters attains state-of-the-art performance on seen synthetic datasets and shows significantly improved generalizability on unseen real rainy images.
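
For intuition about how a hypergraph layer propagates nonlocal information among image patches, the sketch below applies one generic hypergraph convolution step: patches (nodes) grouped under a shared hyperedge, for example by feature similarity, aggregate each other's features through the incidence matrix. The normalization follows the common node-degree / edge-degree form; the paper's patchwise module may differ in its details.

import numpy as np

rng = np.random.default_rng(0)
n_patches, n_edges, d_in, d_out = 12, 4, 16, 16
X = rng.normal(size=(n_patches, d_in))            # patch features
# Incidence matrix: H[v, e] = 1 if patch v belongs to hyperedge e.
H = (rng.random((n_patches, n_edges)) < 0.4).astype(float)
H[:, 0] = 1.0                                     # ensure every patch has at least one edge
H[0, :] = 1.0                                     # ensure every edge has at least one patch
Theta = rng.normal(size=(d_in, d_out)) / np.sqrt(d_in)

D_v = np.diag(1.0 / H.sum(axis=1))                # node-degree normalization
D_e = np.diag(1.0 / H.sum(axis=0))                # hyperedge-degree normalization

X_out = np.maximum(0.0, D_v @ H @ D_e @ H.T @ X @ Theta)   # aggregate, transform, ReLU
print(X_out.shape)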


Subject(s)
Algorithms , Brain , Memory
12.
IEEE Trans Pattern Anal Mach Intell ; 44(2): 710-722, 2022 02.
Article in English | MEDLINE | ID: mdl-30969916

ABSTRACT

With the maturity of visual detection techniques, we are more ambitious in describing visual content with open-vocabulary, fine-grained, and free-form language, i.e., the task of image captioning. In particular, we are interested in generating longer, richer, and more fine-grained sentences and paragraphs as image descriptions. Image captioning can be translated into the task of sequential language prediction given visual content, where the output sequence forms a natural language description with plausible grammar. However, existing image captioning methods focus only on the language policy and not the visual policy, and thus fail to capture the visual context that is crucial for compositional reasoning, such as object relationships (e.g., "man riding horse") and visual comparisons (e.g., "small(er) cat"). This issue is especially severe when generating longer sequences such as a paragraph. To fill the gap, we propose a Context-Aware Visual Policy network (CAVP) for fine-grained image-to-language generation: image sentence captioning and image paragraph captioning. During captioning, CAVP explicitly considers the previous visual attentions as context, and decides whether the context is used for the current word/sentence generation given the current visual attention. Compared against the traditional visual attention mechanism that only fixes a single visual region at each step, CAVP can attend to complex visual compositions over time. The whole image captioning model, comprising CAVP and its subsequent language policy network, can be efficiently optimized end-to-end using an actor-critic policy gradient method. We have demonstrated the effectiveness of CAVP with state-of-the-art performance on the MS-COCO and Stanford captioning datasets, using various metrics and sensible visualizations of qualitative visual context.
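
A toy sketch of the actor-critic idea behind sequence-level training follows: sample a short token sequence from a softmax policy, score it with a sequence-level reward, and update the policy with the advantage (reward minus a learned baseline). The tiny tabular policy, the toy reward, and all hyperparameters are assumptions for illustration only, not CAVP's networks.

import numpy as np

rng = np.random.default_rng(0)
vocab, seq_len = 5, 4
logits = np.zeros((seq_len, vocab))      # "actor": per-step token logits
baseline = 0.0                           # "critic": a scalar value estimate
lr_actor, lr_critic = 0.5, 0.1
target = np.array([2, 0, 3, 1])          # pretend reference caption

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for it in range(400):
    probs = np.array([softmax(logits[t]) for t in range(seq_len)])
    sample = np.array([rng.choice(vocab, p=probs[t]) for t in range(seq_len)])
    reward = float(np.mean(sample == target))        # sequence-level reward (toy stand-in for CIDEr)
    advantage = reward - baseline
    for t in range(seq_len):                         # REINFORCE-style policy-gradient step
        grad = -probs[t]
        grad[sample[t]] += 1.0
        logits[t] += lr_actor * advantage * grad
    baseline += lr_critic * (reward - baseline)      # critic tracks the expected reward

greedy = logits.argmax(axis=1)
print("greedy decode:", greedy, " target:", target)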


Subject(s)
Algorithms , Policy , Animals , Horses , Humans
13.
IEEE Trans Neural Netw Learn Syst ; 33(11): 6802-6816, 2022 Nov.
Article in English | MEDLINE | ID: mdl-34081590

ABSTRACT

Deep learning-based methods have achieved notable progress in removing the blocking artifacts caused by lossy JPEG compression. However, most deep learning-based methods handle this task by designing black-box network architectures to directly learn the relationships between compressed images and their clean versions. These network architectures always lack sufficient interpretability, which limits their further improvement in deblocking performance. To address this issue, in this article we propose a model-driven deep unfolding method for JPEG artifact removal with interpretable network structures. First, we build a maximum a posteriori (MAP) model for deblocking using convolutional dictionary learning and design an iterative optimization algorithm using proximal operators. Second, we unfold this iterative algorithm into a learnable deep network structure, where each module corresponds to a specific operation of the iterative algorithm. In this way, our network inherits the benefits of both the powerful modeling ability of data-driven deep learning methods and the interpretability of traditional model-driven methods. By training the proposed network in an end-to-end manner, all learnable modules can be automatically explored to well characterize the representations of both JPEG artifacts and image content. Experiments on synthetic and real-world datasets show that our method is able to generate competitive or even better deblocking results compared with state-of-the-art methods, both quantitatively and qualitatively.
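
To make "unfolding an iterative algorithm" concrete, the generic sketch below runs a few ISTA iterations for a sparse-coding model y = D x + n, with a soft-thresholding proximal operator; in an unfolded network, each iteration would become one stage whose step size and threshold are learnable. This is a simplified stand-in, not the paper's convolutional dictionary model.

import numpy as np

rng = np.random.default_rng(0)
m, n, k = 40, 100, 5
D = rng.normal(size=(m, n)) / np.sqrt(m)          # dictionary
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.normal(size=k)
y = D @ x_true + 0.01 * rng.normal(size=m)        # observed (degraded) signal

def soft_threshold(z, tau):
    """Proximal operator of the L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

L = np.linalg.norm(D, 2) ** 2                     # Lipschitz constant of the data term
step, lam = 1.0 / L, 0.02
n_stages = 15                                     # each iteration becomes one "layer" when unfolded
x = np.zeros(n)
for _ in range(n_stages):
    x = soft_threshold(x - step * D.T @ (D @ x - y), step * lam)

print("estimated support:", np.nonzero(np.abs(x) > 1e-3)[0])
print("true support     :", np.nonzero(x_true)[0])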

14.
Article in English | MEDLINE | ID: mdl-36121960

ABSTRACT

The analysis of neuronal morphological data is essential to investigate the neuronal properties and brain mechanisms. The complex morphologies, absence of annotations, and sheer volume of these data pose significant challenges in neuronal morphological analysis, such as identifying neuron types and large-scale neuron retrieval, all of which require accurate measuring and efficient matching algorithms. Recently, many studies have been conducted to describe neuronal morphologies quantitatively using predefined measurements. However, hand-crafted features are usually inadequate for distinguishing fine-grained differences among massive neurons. In this article, we propose a novel morphology-aware contrastive graph neural network (MACGNN) for unsupervised neuronal morphological representation learning. To improve the retrieval efficiency in large-scale neuronal morphological datasets, we further propose Hash-MACGNN by introducing an improved deep hash algorithm to train the network end-to-end to learn binary hash representations of neurons. We conduct extensive experiments on the largest dataset, NeuroMorpho, which contains more than 100 000 neurons. The experimental results demonstrate the effectiveness and superiority of our MACGNN and Hash-MACGNN for large-scale neuronal morphological analysis.
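
To show why binary hash codes help with large-scale retrieval, the sketch below binarizes real-valued embeddings (a stand-in for a learned graph encoder's output) with a sign threshold and ranks database items by Hamming distance to a query. The embeddings here are random placeholders; only the retrieval mechanics are illustrated, not Hash-MACGNN itself.

import numpy as np

rng = np.random.default_rng(0)
n_db, n_bits = 10000, 64
db_embed = rng.normal(size=(n_db, n_bits))        # stand-in for learned neuron embeddings
db_codes = (db_embed > 0).astype(np.uint8)        # binary hash codes

query_embed = db_embed[42] + 0.1 * rng.normal(size=n_bits)   # a slightly perturbed database item
query_code = (query_embed > 0).astype(np.uint8)

hamming = np.count_nonzero(db_codes != query_code, axis=1)   # cheap to compute at scale
top5 = np.argsort(hamming)[:5]
print("top-5 by Hamming distance:", top5, " distances:", hamming[top5])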

15.
IEEE Trans Image Process ; 31: 2726-2738, 2022.
Article in English | MEDLINE | ID: mdl-35324439

ABSTRACT

Video captioning aims to generate a natural language sentence that describes the main content of a video. Since there are multiple objects in videos, fully exploring the spatial and temporal relationships among them is crucial for this task. Previous methods wrap the detected objects into input sequences and leverage vanilla self-attention or graph neural networks to reason about visual relations. This cannot make full use of the spatial and temporal nature of a video, and suffers from the problems of redundant connections, over-smoothing, and relation ambiguity. To address the above problems, in this paper we construct a long short-term graph (LSTG) that simultaneously captures short-term spatial semantic relations and long-term transformation dependencies. Further, to perform relational reasoning over the LSTG, we design a global gated graph reasoning module (G3RM), which introduces a global gate based on global context to control information propagation between objects and alleviate relation ambiguity. Finally, by introducing G3RM into the Transformer instead of self-attention, we propose the long short-term relation transformer (LSRT) to fully mine objects' relations for caption generation. Experiments on the MSVD and MSR-VTT datasets show that LSRT achieves superior performance compared with state-of-the-art methods. The visualization results indicate that our method alleviates the problem of over-smoothing and strengthens the ability of relational reasoning.

16.
IEEE Trans Image Process ; 31: 3565-3577, 2022.
Article in English | MEDLINE | ID: mdl-35312620

ABSTRACT

TV show captioning aims to generate a linguistic sentence based on a video and its associated subtitle. Compared to purely video-based captioning, the subtitle can provide the captioning model with useful semantic clues such as actors' sentiments and intentions. However, effectively using the subtitle is also very challenging, because it consists of pieces of scrappy information and has a semantic gap with the visual modality. To organize this scrappy information and yield a powerful omni-representation for all the modalities, an efficient captioning model requires understanding the video contents, the subtitle semantics, and the relations in between. In this paper, we propose an Intra- and Inter-relation Embedding Transformer (I2Transformer), consisting of an Intra-relation Embedding Block (IAE) and an Inter-relation Embedding Block (IEE) under the framework of a Transformer. First, the IAE captures the intra-relation within each modality via constructing learnable graphs. Then, the IEE learns cross-attention gates and selects useful information from each modality based on their inter-relations, so as to derive the omni-representation as the input to the Transformer. Experimental results on the public dataset show that I2Transformer achieves state-of-the-art performance. We also evaluate the effectiveness of the IAE and IEE on two other relevant tasks of video with text inputs, i.e., TV show retrieval and video-guided machine translation. The encouraging performance further validates that the IAE and IEE blocks have good generalization ability. The code is available at https://github.com/tuyunbin/I2Transformer.


Subject(s)
Intention , Semantics
17.
IEEE Trans Neural Netw Learn Syst ; 32(11): 5034-5046, 2021 11.
Article in English | MEDLINE | ID: mdl-33290230

ABSTRACT

Many computer vision tasks, such as monocular depth estimation and height estimation from a satellite orthophoto, share a common underlying goal: the regression of dense continuous values for the pixels of a single image. We define them as dense continuous-value regression (DCR) tasks. Recent approaches based on deep convolutional neural networks significantly improve the performance of DCR tasks, particularly in pixelwise regression accuracy. However, it remains challenging to simultaneously preserve the global structure and fine object details in complex scenes. In this article, we take advantage of the efficiency of the Laplacian pyramid in representing multiscale content to reconstruct high-quality signals for complex scenes. We design a Laplacian pyramid neural network (LAPNet), which consists of a Laplacian pyramid decoder (LPD) for signal reconstruction and an adaptive dense feature fusion (ADFF) module to fuse features from the input image. More specifically, we build an LPD to effectively express both global and local scene structures. In our LPD, the upper and lower levels respectively represent scene layouts and shape details. We introduce a residual refinement module to progressively complement high-frequency details for signal prediction at each level. To recover the signals at each individual level in the pyramid, an ADFF module is proposed to adaptively fuse multiscale image features for accurate prediction. We conduct comprehensive experiments to evaluate a number of variants of our model on three important DCR tasks, i.e., monocular depth estimation, single-image height estimation, and density map estimation for crowd counting. Experiments demonstrate that our method achieves new state-of-the-art performance in both qualitative and quantitative evaluation on NYU-D V2 and KITTI for monocular depth estimation, on the challenging Urban Semantic 3D (US3D) dataset for satellite height estimation, and on four challenging benchmarks for crowd counting. These results demonstrate that the proposed LAPNet is a universal and effective architecture for DCR problems.
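
For readers unfamiliar with the decomposition the LPD builds on, the minimal sketch below constructs a Laplacian pyramid with simple 2x nearest-neighbor resampling: each level stores the high-frequency residual between a signal and an upsampled coarse version of it, and the original is recovered exactly by upsampling and adding the residuals back. The resampling scheme here is a simplification chosen for clarity, not the paper's decoder.

import numpy as np

def downsample(x):
    return x[::2, ::2]

def upsample(x):
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def build_laplacian_pyramid(img, levels=3):
    pyramid, current = [], img
    for _ in range(levels):
        coarse = downsample(current)
        pyramid.append(current - upsample(coarse))   # high-frequency residual at this scale
        current = coarse
    pyramid.append(current)                          # coarsest low-frequency level (scene layout)
    return pyramid

def reconstruct(pyramid):
    img = pyramid[-1]
    for residual in reversed(pyramid[:-1]):
        img = upsample(img) + residual               # add back the details level by level
    return img

rng = np.random.default_rng(0)
img = rng.random((64, 64))
pyr = build_laplacian_pyramid(img)
print("max reconstruction error:", float(np.abs(reconstruct(pyr) - img).max()))  # exactly 0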


Subject(s)
Deep Learning/trends , Image Processing, Computer-Assisted/trends , Neural Networks, Computer , Pattern Recognition, Automated/trends , Humans , Image Processing, Computer-Assisted/methods , Pattern Recognition, Automated/methods
18.
IEEE Trans Neural Netw Learn Syst ; 32(2): 722-735, 2021 02.
Article in English | MEDLINE | ID: mdl-32275611

ABSTRACT

Person re-identification (re-ID) favors discriminative representations over unseen shots to recognize identities in disjoint camera views. Effective methods have been developed via pair-wise similarity learning to detect a fixed set of region features, which can be mapped to compute a similarity value. However, relevant parts of each image are detected independently, without referring to the correlation with the other image. In addition, region-based methods spatially position local features to compute their aligned similarities. In this article, we introduce the deep coattention-based comparator (DCC) to fuse codependent representations of paired images so as to correlate the best relevant parts and produce their relative representations accordingly. The proposed approach mimics human foveation to detect distinct regions concurrently across images and alternately attends to them to fuse them into the similarity learning. Our comparator is capable of learning representations relative to a test shot and is well suited to re-identifying pedestrians in surveillance. We perform extensive experiments to provide insights and demonstrate the state-of-the-art results achieved by our method on benchmark datasets: gains of 1.2 and 2.5 points in mean average precision (mAP) on DukeMTMC-reID and Market-1501, respectively.
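
As a generic illustration of coattention between two images, the sketch below computes an affinity matrix that couples the region features of the two images, then lets each region attend to the most relevant regions of the other image. This shows the general mechanism only and is not the DCC architecture itself; all shapes are illustrative.

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n1, n2, d = 6, 8, 16
X1 = rng.normal(size=(n1, d))        # region features of image 1
X2 = rng.normal(size=(n2, d))        # region features of image 2
W = rng.normal(size=(d, d)) / np.sqrt(d)

A = X1 @ W @ X2.T                    # (n1, n2) affinity between every pair of regions
X1_attended = softmax(A, axis=1) @ X2        # image-2 context for each image-1 region
X2_attended = softmax(A, axis=0).T @ X1      # image-1 context for each image-2 region
print(X1_attended.shape, X2_attended.shape)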


Subject(s)
Attention , Automated Facial Recognition , Deep Learning , Image Processing, Computer-Assisted/methods , Algorithms , Artificial Intelligence , Benchmarking , Biometric Identification , Databases, Factual , Humans , Neural Networks, Computer , Reproducibility of Results , Software
19.
IEEE Trans Med Imaging ; 40(11): 3205-3216, 2021 11.
Article in English | MEDLINE | ID: mdl-33999814

ABSTRACT

Manually labeling neurons from high-resolution but noisy and low-contrast optical microscopy (OM) images is tedious. As a result, the lack of annotated data poses a key challenge when applying deep learning techniques for reconstructing neurons from noisy and low-contrast OM images. While traditional tracing methods provide a possible way to efficiently generate labels for supervised network training, the generated pseudo-labels contain many noisy and incorrect labels, which lead to severe performance degradation. On the other hand, the publicly available dataset, BigNeuron, provides a large number of single 3D neurons that are reconstructed using various imaging paradigms and tracing methods. Though the raw OM images are not fully available for these neurons, they convey essential morphological priors for complex 3D neuron structures. In this paper, we propose a new approach to exploit morphological priors from neurons that have been reconstructed for training a deep neural network to extract neuron signals from OM images. We integrate a deep segmentation network in a generative adversarial network (GAN), expecting the segmentation network to be weakly supervised by pseudo-labels at the pixel level while utilizing the supervision of previously reconstructed neurons at the morphology level. In our morphological-prior-guided neuron reconstruction GAN, named MP-NRGAN, the segmentation network extracts neuron signals from raw images, and the discriminator network encourages the extracted neurons to follow the morphology distribution of reconstructed neurons. Comprehensive experiments on the public VISoR-40 dataset and BigNeuron dataset demonstrate that our proposed MP-NRGAN outperforms state-of-the-art approaches with less training effort.


Subject(s)
Image Processing, Computer-Assisted , Microscopy , Neural Networks, Computer , Neurons
20.
IEEE Trans Neural Netw Learn Syst ; 31(7): 2398-2408, 2020 Jul.
Article in English | MEDLINE | ID: mdl-32071003

ABSTRACT

High-level semantic knowledge, in addition to low-level visual cues, is crucial for co-saliency detection. This article proposes a novel end-to-end deep learning approach for robust co-saliency detection by simultaneously learning a high-level groupwise semantic representation as well as deep visual features of a given image group. The inter-image interaction at the semantic level and the complementarity between the group semantics and visual features are exploited to boost the inference of co-salient regions. Specifically, the proposed approach consists of a co-category learning branch and a co-saliency detection branch. While the former is proposed to learn a groupwise semantic vector using the co-category association of an image group as supervision, the latter infers precise co-saliency maps based on the ensemble of group-semantic knowledge and deep visual cues. The group-semantic vector is used to augment visual features at multiple scales and acts as top-down semantic guidance for boosting the bottom-up inference of co-saliency. Moreover, we develop a pyramidal attention (PA) module that endows the network with the capability of concentrating on important image patches and suppressing distractions. The co-category learning and co-saliency detection branches are jointly optimized in a multitask learning manner, further improving the robustness of the approach. We construct a new large-scale co-saliency dataset, COCO-SEG, to facilitate research on co-saliency detection. Extensive experimental results on COCO-SEG and the widely used benchmark Cosal2015 have demonstrated the superiority of the proposed approach compared with state-of-the-art methods.
