Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 28
Filtrar
Más filtros










Base de datos
Intervalo de año de publicación
1.
Opt Express ; 32(11): 18527-18538, 2024 May 20.
Artículo en Inglés | MEDLINE | ID: mdl-38859006

RESUMEN

Dynamic range (DR) is a pivotal characteristic of imaging systems. Current frame-based cameras struggle to achieve high dynamic range imaging due to the conflict between globally uniform exposure and spatially variant scene illumination. In this paper, we propose AsynHDR, a pixel-asynchronous HDR imaging system, based on key insights into the challenges in HDR imaging and the unique event-generating mechanism of dynamic vision sensors (DVS). Our proposed AsynHDR system integrates the DVS with a set of LCD panels. The LCD panels modulate the irradiance incident upon the DVS by altering their transparency, thereby triggering the pixel-independent event streams. The HDR image is subsequently decoded from the event streams through our temporal-weighted algorithm. Experiments under the standard test platform and several challenging scenes have verified the feasibility of the system in HDR imaging tasks.

2.
IEEE Trans Pattern Anal Mach Intell ; 46(7): 4926-4943, 2024 Jul.
Artículo en Inglés | MEDLINE | ID: mdl-38349824

RESUMEN

Change captioning aims to describe the semantic change between two similar images. In this process, as the most typical distractor, viewpoint change leads to the pseudo changes about appearance and position of objects, thereby overwhelming the real change. Besides, since the visual signal of change appears in a local region with weak feature, it is difficult for the model to directly translate the learned change features into the sentence. In this paper, we propose a syntax-calibrated multi-aspect relation transformer to learn effective change features under different scenes, and build reliable cross-modal alignment between the change features and linguistic words during caption generation. Specifically, a multi-aspect relation learning network is designed to 1) explore the fine-grained changes under irrelevant distractors (e.g., viewpoint change) by embedding the relations of semantics and relative position into the features of each image; 2) learn two view-invariant image representations by strengthening their global contrastive alignment relation, so as to help capture a stable difference representation; 3) provide the model with the prior knowledge about whether and where the semantic change happened by measuring the relation between the representations of captured difference and the image pair. Through the above manner, the model can learn effective change features for caption generation. Further, we introduce the syntax knowledge of Part-of-Speech (POS) and devise a POS-based visual switch to calibrate the transformer decoder. The POS-based visual switch dynamically utilizes visual information during different word generation based on the POS of words. This enables the decoder to build reliable cross-modal alignment, so as to generate a high-level linguistic sentence about change. Extensive experiments show that the proposed method achieves the state-of-the-art performance on the three public datasets.

3.
IEEE Trans Image Process ; 33: 625-638, 2024.
Artículo en Inglés | MEDLINE | ID: mdl-38198242

RESUMEN

How to model the effect of reflection is crucial for single image reflection removal (SIRR) task. Modern SIRR methods usually simplify the reflection formulation with the assumption of linear combination of a transmission layer and a reflection layer. However, the large variations in image content and the real-world picture-taking conditions often result in far more complex reflection. In this paper, we introduce a new screen-blur combination based on two important factors, namely the intensity and the blurriness of reflection, to better characterize the reflection formulation in SIRR. Specifically, we present Screen-blur Reflection Networks (SRNet), which executes the screen-blur formulation in its network design and adapts to the complex reflection on real scenes. Technically, SRNet consists of three components: a blended image generator, a reflection estimator and a reflection removal module. The image generator exploits the screen-blur combination to synthesize the training blended images. The reflection estimator learns the reflection layer and a blur degree that measures the level of blurriness for reflection. The reflection removal module further uses the blended image, blur degree and reflection layer to filter out the transmission layer in a cascaded manner. Superior results on three different SIRR methods are reported when generating the training data on the principle of the screen-blur combination. Moreover, extensive experiments on six datasets quantitatively and qualitatively demonstrate the efficacy of SRNet over the state-of-the-art methods.

4.
IEEE Trans Image Process ; 33: 1938-1951, 2024.
Artículo en Inglés | MEDLINE | ID: mdl-38224517

RESUMEN

Generalized Zero-Shot Learning (GZSL) aims at recognizing images from both seen and unseen classes by constructing correspondences between visual images and semantic embedding. However, existing methods suffer from a strong bias problem, where unseen images in the target domain tend to be recognized as seen classes in the source domain. To address this issue, we propose a Prototype-augmented Self-supervised Generative Network by integrating self-supervised learning and prototype learning into a feature generating model for GZSL. The proposed model enjoys several advantages. First, we propose a Self-supervised Learning Module to exploit inter-domain relationships, where we introduce anchors as a bridge between seen and unseen categories. In the shared space, we pull the distribution of the target domain away from the source domain and obtain domain-aware features. To our best knowledge, this is the first work to introduce self-supervised learning into GZSL as learning guidance. Second, a Prototype Enhancing Module is proposed to utilize class prototypes to model reliable target domain distribution in finer granularity. In this module, a Prototype Alignment mechanism and a Prototype Dispersion mechanism are combined to guide the generation of better target class features with intra-class compactness and inter-class separability. Extensive experimental results on five standard benchmarks demonstrate that our model performs favorably against state-of-the-art GZSL methods.

5.
Artículo en Inglés | MEDLINE | ID: mdl-37943649

RESUMEN

With high temporal resolution, high dynamic range, and low latency, event cameras have made great progress in numerous low-level vision tasks. To help restore low-quality (LQ) video sequences, most existing event-based methods usually employ convolutional neural networks (CNNs) to extract sparse event features without considering the spatial sparse distribution or the temporal relation in neighboring events. It brings about insufficient use of spatial and temporal information from events. To address this problem, we propose a new spiking-convolutional network (SC-Net) architecture to facilitate event-driven video restoration. Specifically, to properly extract the rich temporal information contained in the event data, we utilize a spiking neural network (SNN) to suit the sparse characteristics of events and capture temporal correlation in neighboring regions; to make full use of spatial consistency between events and frames, we adopt CNNs to transform sparse events as an extra brightness prior to being aware of detailed textures in video sequences. In this way, both the temporal correlation in neighboring events and the mutual spatial information between the two types of features are fully explored and exploited to accurately restore detailed textures and sharp edges. The effectiveness of the proposed network is validated in three representative video restoration tasks: deblurring, super-resolution, and deraining. Extensive experiments on synthetic and real-world benchmarks have illuminated that our method performs better than existing competing methods.

6.
Artículo en Inglés | MEDLINE | ID: mdl-37467094

RESUMEN

Audiovisual event localization aims to localize the event that is both visible and audible in a video. Previous works focus on segment-level audio and visual feature sequence encoding and neglect the event proposals and boundaries, which are crucial for this task. The event proposal features provide event internal consistency between several consecutive segments constructing one proposal, while the event boundary features offer event boundary consistency to make segments located at boundaries be aware of the event occurrence. In this article, we explore the proposal-level feature encoding and propose a novel context-aware proposal-boundary (CAPB) network to address audiovisual event localization. In particular, we design a local-global context encoder (LGCE) to aggregate local-global temporal context information for visual sequence, audio sequence, event proposals, and event boundaries, respectively. The local context from temporally adjacent segments or proposals contributes to event discrimination, while the global context from the entire video provides semantic guidance of temporal relationship. Furthermore, we enhance the structural consistency between segments by exploiting the above-encoded proposal and boundary representations. CAPB leverages the context information and structural consistency to obtain context-aware event-consistent cross-modal representation for accurate event localization. Extensive experiments conducted on the audiovisual event (AVE) dataset show that our approach outperforms the state-of-the-art methods by clear margins in both supervised event localization and cross-modality localization.

7.
Artículo en Inglés | MEDLINE | ID: mdl-37220051

RESUMEN

Reflection from glasses is ubiquitous in daily life, but it is usually undesirable in photographs. To remove these unwanted noises, existing methods utilize either correlative auxiliary information or handcrafted priors to constrain this ill-posed problem. However, due to their limited capability to describe the properties of reflections, these methods are unable to handle strong and complex reflection scenes. In this article, we propose a hue guidance network (HGNet) with two branches for single image reflection removal (SIRR) by integrating image information and corresponding hue information. The complementarity between image information and hue information has not been noticed. The key to this idea is that we found that hue information can describe reflections well and thus can be used as a superior constraint for the specific SIRR task. Accordingly, the first branch extracts the salient reflection features by directly estimating the hue map. The second branch leverages these effective features, which can help locate salient reflection regions to obtain a high-quality restored image. Furthermore, we design a new cyclic hue loss to provide a more accurate optimization direction for the network training. Experiments substantiate the superiority of our network, especially its excellent generalization ability to various reflection scenes, as compared with state-of-the-arts both qualitatively and quantitatively. Source codes are available at https://github.com/zhuyr97/HGRR.

8.
IEEE Trans Pattern Anal Mach Intell ; 45(6): 7711-7725, 2023 Jun.
Artículo en Inglés | MEDLINE | ID: mdl-37015417

RESUMEN

We study the problem of localizing audio-visual events that are both audible and visible in a video. Existing works focus on encoding and aligning audio and visual features at the segment level while neglecting informative correlation between segments of the two modalities and between multi-scale event proposals. We propose a novel Semantic and Relation Modulation Network (SRMN) to learn the above correlation and leverage it to modulate the related auditory, visual, and fused features. In particular, for semantic modulation, we propose intra-modal normalization and cross-modal normalization. The former modulates features of a single modality with the event-relevant semantic guidance of the same modality. The latter modulates features of two modalities by establishing and exploiting the cross-modal relationship. For relation modulation, we propose a multi-scale proposal modulating module and a multi-alignment segment modulating module to introduce multi-scale event proposals and enable dense matching between cross-modal segments, which strengthen correlations between successive segments within one proposal and between all segments. With the features modulated by the correlation information regarding audio-visual events, SRMN performs accurate event localization. Extensive experiments conducted on the public AVE dataset demonstrate that our method outperforms the state-of-the-art methods in both supervised event localization and cross-modality localization tasks.

9.
IEEE Trans Pattern Anal Mach Intell ; 45(8): 9534-9551, 2023 Aug.
Artículo en Inglés | MEDLINE | ID: mdl-37022385

RESUMEN

Image deraining is a challenging task since rain streaks have the characteristics of a spatially long structure and have a complex diversity. Existing deep learning-based methods mainly construct the deraining networks by stacking vanilla convolutional layers with local relations, and can only handle a single dataset due to catastrophic forgetting, resulting in a limited performance and insufficient adaptability. To address these issues, we propose a new image deraining framework to effectively explore nonlocal similarity, and to continuously learn on multiple datasets. Specifically, we first design a patchwise hypergraph convolutional module, which aims to better extract the nonlocal properties with higher-order constraints on the data, to construct a new backbone and to improve the deraining performance. Then, to achieve better generalizability and adaptability in real-world scenarios, we propose a biological brain-inspired continual learning algorithm. By imitating the plasticity mechanism of brain synapses during the learning and memory process, our continual learning process allows the network to achieve a subtle stability-plasticity tradeoff. This it can effectively alleviate catastrophic forgetting and enables a single network to handle multiple datasets. Compared with the competitors, our new deraining network with unified parameters attains a state-of-the-art performance on seen synthetic datasets and has a significantly improved generalizability on unseen real rainy images.


Asunto(s)
Algoritmos , Encéfalo , Memoria
10.
IEEE Trans Pattern Anal Mach Intell ; 45(11): 12978-12995, 2023 Nov.
Artículo en Inglés | MEDLINE | ID: mdl-35709118

RESUMEN

Existing deep learning based de-raining approaches have resorted to the convolutional architectures. However, the intrinsic limitations of convolution, including local receptive fields and independence of input content, hinder the model's ability to capture long-range and complicated rainy artifacts. To overcome these limitations, we propose an effective and efficient transformer-based architecture for the image de-raining. First, we introduce general priors of vision tasks, i.e., locality and hierarchy, into the network architecture so that our model can achieve excellent de-raining performance without costly pre-training. Second, since the geometric appearance of rainy artifacts is complicated and of significant variance in space, it is essential for de-raining models to extract both local and non-local features. Therefore, we design the complementary window-based transformer and spatial transformer to enhance locality while capturing long-range dependencies. Besides, to compensate for the positional blindness of self-attention, we establish a separate representative space for modeling positional relationship, and design a new relative position enhanced multi-head self-attention. In this way, our model enjoys powerful abilities to capture dependencies from both content and position, so as to achieve better image content recovery while removing rainy artifacts. Experiments substantiate that our approach attains more appealing results than state-of-the-art methods quantitatively and qualitatively.

11.
IEEE Trans Pattern Anal Mach Intell ; 45(3): 3003-3018, 2023 Mar.
Artículo en Inglés | MEDLINE | ID: mdl-35759595

RESUMEN

Weakly supervised Referring Expression Grounding (REG) aims to ground a particular target in an image described by a language expression while lacking the correspondence between target and expression. Two main problems exist in weakly supervised REG. First, the lack of region-level annotations introduces ambiguities between proposals and queries. Second, most previous weakly supervised REG methods ignore the discriminative location and context of the referent, causing difficulties in distinguishing the target from other same-category objects. To address the above challenges, we design an entity-enhanced adaptive reconstruction network (EARN). Specifically, EARN includes three modules: entity enhancement, adaptive grounding, and collaborative reconstruction. In entity enhancement, we calculate semantic similarity as supervision to select the candidate proposals. Adaptive grounding calculates the ranking score of candidate proposals upon subject, location and context with hierarchical attention. Collaborative reconstruction measures the ranking result from three perspectives: adaptive reconstruction, language reconstruction and attribute classification. The adaptive mechanism helps to alleviate the variance of different referring expressions. Experiments on five datasets show EARN outperforms existing state-of-the-art methods. Qualitative results demonstrate that the proposed EARN can better handle the situation where multiple objects of a particular category are situated together.

12.
Artículo en Inglés | MEDLINE | ID: mdl-36121960

RESUMEN

The analysis of neuronal morphological data is essential to investigate the neuronal properties and brain mechanisms. The complex morphologies, absence of annotations, and sheer volume of these data pose significant challenges in neuronal morphological analysis, such as identifying neuron types and large-scale neuron retrieval, all of which require accurate measuring and efficient matching algorithms. Recently, many studies have been conducted to describe neuronal morphologies quantitatively using predefined measurements. However, hand-crafted features are usually inadequate for distinguishing fine-grained differences among massive neurons. In this article, we propose a novel morphology-aware contrastive graph neural network (MACGNN) for unsupervised neuronal morphological representation learning. To improve the retrieval efficiency in large-scale neuronal morphological datasets, we further propose Hash-MACGNN by introducing an improved deep hash algorithm to train the network end-to-end to learn binary hash representations of neurons. We conduct extensive experiments on the largest dataset, NeuroMorpho, which contains more than 100 000 neurons. The experimental results demonstrate the effectiveness and superiority of our MACGNN and Hash-MACGNN for large-scale neuronal morphological analysis.

13.
IEEE Trans Image Process ; 31: 2726-2738, 2022.
Artículo en Inglés | MEDLINE | ID: mdl-35324439

RESUMEN

Video captioning aims to generate a natural language sentence to describe the main content of a video. Since there are multiple objects in videos, taking full exploration of the spatial and temporal relationships among them is crucial for this task. The previous methods wrap the detected objects as input sequences, and leverage vanilla self-attention or graph neural network to reason about visual relations. This cannot make full use of the spatial and temporal nature of a video, and suffers from the problems of redundant connections, over-smoothing, and relation ambiguity. In order to address the above problems, in this paper we construct a long short-term graph (LSTG) that simultaneously captures short-term spatial semantic relations and long-term transformation dependencies. Further, to perform relational reasoning over the LSTG, we design a global gated graph reasoning module (G3RM), which introduces a global gating based on global context to control information propagation between objects and alleviate relation ambiguity. Finally, by introducing G3RM into Transformer instead of self-attention, we propose the long short-term relation transformer (LSRT) to fully mine objects' relations for caption generation. Experiments on MSVD and MSR-VTT datasets show that the LSRT achieves superior performance compared with state-of-the-art methods. The visualization results indicate that our method alleviates problem of over-smoothing and strengthens the ability of relational reasoning.

14.
IEEE Trans Image Process ; 31: 3565-3577, 2022.
Artículo en Inglés | MEDLINE | ID: mdl-35312620

RESUMEN

TV show captioning aims to generate a linguistic sentence based on the video and its associated subtitle. Compared to purely video-based captioning, the subtitle can provide the captioning model with useful semantic clues such as actors' sentiments and intentions. However, the effective use of subtitle is also very challenging, because it is the pieces of scrappy information and has semantic gap with visual modality. To organize the scrappy information together and yield a powerful omni-representation for all the modalities, an efficient captioning model requires understanding video contents, subtitle semantics, and the relations in between. In this paper, we propose an Intra- and Inter-relation Embedding Transformer (I2Transformer), consisting of an Intra-relation Embedding Block (IAE) and an Inter-relation Embedding Block (IEE) under the framework of a Transformer. First, the IAE captures the intra-relation in each modality via constructing the learnable graphs. Then, IEE learns the cross attention gates, and selects useful information from each modality based on their inter-relations, so as to derive the omni-representation as the input to the Transformer. Experimental results on the public dataset show that the I2Transformer achieves the state-of-the-art performance. We also evaluate the effectiveness of the IAE and IEE on two other relevant tasks of video with text inputs, i.e., TV show retrieval and video-guided machine translation. The encouraging performance further validates that the IAE and IEE blocks have a good generalization ability. The code is available at https://github.com/tuyunbin/I2Transformer.


Asunto(s)
Intención , Semántica
15.
IEEE Trans Pattern Anal Mach Intell ; 44(2): 710-722, 2022 02.
Artículo en Inglés | MEDLINE | ID: mdl-30969916

RESUMEN

With the maturity of visual detection techniques, we are more ambitious in describing visual content with open-vocabulary, fine-grained and free-form language, i.e., the task of image captioning. In particular, we are interested in generating longer, richer and more fine-grained sentences and paragraphs as image descriptions. Image captioning can be translated to the task of sequential language prediction given visual content, where the output sequence forms natural language description with plausible grammar. However, existing image captioning methods focus only on language policy while not visual policy, and thus fail to capture visual context that are crucial for compositional reasoning such as object relationships (e.g., "man riding horse") and visual comparisons (e.g., "small(er) cat"). This issue is especially severe when generating longer sequences such as a paragraph. To fill the gap, we propose a Context-Aware Visual Policy network (CAVP) for fine-grained image-to-language generation: image sentence captioning and image paragraph captioning. During captioning, CAVP explicitly considers the previous visual attentions as context, and decides whether the context is used for the current word/sentence generation given the current visual attention. Compared against traditional visual attention mechanism that only fixes a single visual region at each step, CAVP can attend to complex visual compositions over time. The whole image captioning model-CAVP and its subsequent language policy network-can be efficiently optimized end-to-end by using an actor-critic policy gradient method. We have demonstrated the effectiveness of CAVP by state-of-the-art performances on MS-COCO and Stanford captioning datasets, using various metrics and sensible visualizations of qualitative visual context.


Asunto(s)
Algoritmos , Políticas , Animales , Caballos , Humanos
16.
IEEE Trans Neural Netw Learn Syst ; 33(11): 6802-6816, 2022 Nov.
Artículo en Inglés | MEDLINE | ID: mdl-34081590

RESUMEN

Deep learning-based methods have achieved notable progress in removing blocking artifacts caused by lossy JPEG compression on images. However, most deep learning-based methods handle this task by designing black-box network architectures to directly learn the relationships between the compressed images and their clean versions. These network architectures are always lack of sufficient interpretability, which limits their further improvements in deblocking performance. To address this issue, in this article, we propose a model-driven deep unfolding method for JPEG artifacts removal, with interpretable network structures. First, we build a maximum posterior (MAP) model for deblocking using convolutional dictionary learning and design an iterative optimization algorithm using proximal operators. Second, we unfold this iterative algorithm into a learnable deep network structure, where each module corresponds to a specific operation of the iterative algorithm. In this way, our network inherits the benefits of both the powerful model ability of data-driven deep learning method and the interpretability of traditional model-driven method. By training the proposed network in an end-to-end manner, all learnable modules can be automatically explored to well characterize the representations of both JPEG artifacts and image content. Experiments on synthetic and real-world datasets show that our method is able to generate competitive or even better deblocking results, compared with state-of-the-art methods both quantitatively and qualitatively.

17.
IEEE Trans Med Imaging ; 40(11): 3205-3216, 2021 11.
Artículo en Inglés | MEDLINE | ID: mdl-33999814

RESUMEN

Manually labeling neurons from high-resolution but noisy and low-contrast optical microscopy (OM) images is tedious. As a result, the lack of annotated data poses a key challenge when applying deep learning techniques for reconstructing neurons from noisy and low-contrast OM images. While traditional tracing methods provide a possible way to efficiently generate labels for supervised network training, the generated pseudo-labels contain many noisy and incorrect labels, which lead to severe performance degradation. On the other hand, the publicly available dataset, BigNeuron, provides a large number of single 3D neurons that are reconstructed using various imaging paradigms and tracing methods. Though the raw OM images are not fully available for these neurons, they convey essential morphological priors for complex 3D neuron structures. In this paper, we propose a new approach to exploit morphological priors from neurons that have been reconstructed for training a deep neural network to extract neuron signals from OM images. We integrate a deep segmentation network in a generative adversarial network (GAN), expecting the segmentation network to be weakly supervised by pseudo-labels at the pixel level while utilizing the supervision of previously reconstructed neurons at the morphology level. In our morphological-prior-guided neuron reconstruction GAN, named MP-NRGAN, the segmentation network extracts neuron signals from raw images, and the discriminator network encourages the extracted neurons to follow the morphology distribution of reconstructed neurons. Comprehensive experiments on the public VISoR-40 dataset and BigNeuron dataset demonstrate that our proposed MP-NRGAN outperforms state-of-the-art approaches with less training effort.


Asunto(s)
Procesamiento de Imagen Asistido por Computador , Microscopía , Redes Neurales de la Computación , Neuronas
18.
IEEE Trans Neural Netw Learn Syst ; 32(2): 722-735, 2021 02.
Artículo en Inglés | MEDLINE | ID: mdl-32275611

RESUMEN

Person re-identification (re-ID) favors discriminative representations over unseen shots to recognize identities in disjoint camera views. Effective methods are developed via pair-wise similarity learning to detect a fixed set of region features, which can be mapped to compute the similarity value. However, relevant parts of each image are detected independently without referring to the correlation on the other image. Also, region-based methods spatially position local features for their aligned similarities. In this article, we introduce the deep coattention-based comparator (DCC) to fuse codependent representations of paired images so as to correlate the best relevant parts and produce their relative representations accordingly. The proposed approach mimics the human foveation to detect the distinct regions concurrently across images and alternatively attends to fuse them into the similarity learning. Our comparator is capable of learning representations relative to a test shot and well-suited to reidentifying pedestrians in surveillance. We perform extensive experiments to provide the insights and demonstrate the state of the arts achieved by our method in benchmark data sets: 1.2 and 2.5 points gain in mean average precision (mAP) on DukeMTMC-reID and Market-1501, respectively.


Asunto(s)
Atención , Reconocimiento Facial Automatizado , Aprendizaje Profundo , Procesamiento de Imagen Asistido por Computador/métodos , Algoritmos , Inteligencia Artificial , Benchmarking , Identificación Biométrica , Bases de Datos Factuales , Humanos , Redes Neurales de la Computación , Reproducibilidad de los Resultados , Programas Informáticos
19.
IEEE Trans Neural Netw Learn Syst ; 32(11): 5034-5046, 2021 11.
Artículo en Inglés | MEDLINE | ID: mdl-33290230

RESUMEN

Many computer vision tasks, such as monocular depth estimation and height estimation from a satellite orthophoto, have a common underlying goal, which is regression of dense continuous values for the pixels given a single image. We define them as dense continuous-value regression (DCR) tasks. Recent approaches based on deep convolutional neural networks significantly improve the performance of DCR tasks, particularly on pixelwise regression accuracy. However, it still remains challenging to simultaneously preserve the global structure and fine object details in complex scenes. In this article, we take advantage of the efficiency of Laplacian pyramid on representing multiscale contents to reconstruct high-quality signals for complex scenes. We design a Laplacian pyramid neural network (LAPNet), which consists of a Laplacian pyramid decoder (LPD) for signal reconstruction and an adaptive dense feature fusion (ADFF) module to fuse features from the input image. More specifically, we build an LPD to effectively express both global and local scene structures. In our LPD, the upper and lower levels, respectively, represent scene layouts and shape details. We introduce a residual refinement module to progressively complement high-frequency details for signal prediction at each level. To recover the signals at each individual level in the pyramid, an ADFF module is proposed to adaptively fuse multiscale image features for accurate prediction. We conduct comprehensive experiments to evaluate a number of variants of our model on three important DCR tasks, i.e., monocular depth estimation, single-image height estimation, and density map estimation for crowd counting. Experiments demonstrate that our method achieves new state-of-the-art performance in both qualitative and quantitative evaluation on the NYU-D V2 and KITTI for monocular depth estimation, the challenging Urban Semantic 3D (US3D) for satellite height estimation, and four challenging benchmarks for crowd counting. These results demonstrate that the proposed LAPNet is a universal and effective architecture for DCR problems.


Asunto(s)
Aprendizaje Profundo/tendencias , Procesamiento de Imagen Asistido por Computador/tendencias , Redes Neurales de la Computación , Reconocimiento de Normas Patrones Automatizadas/tendencias , Humanos , Procesamiento de Imagen Asistido por Computador/métodos , Reconocimiento de Normas Patrones Automatizadas/métodos
20.
IEEE Trans Med Imaging ; 39(12): 4034-4046, 2020 12.
Artículo en Inglés | MEDLINE | ID: mdl-32746145

RESUMEN

Reconstruction of neuronal populations from ultra-scale optical microscopy (OM) images is essential to investigate neuronal circuits and brain mechanisms. The noises, low contrast, huge memory requirement, and high computational cost pose significant challenges in the neuronal population reconstruction. Recently, many studies have been conducted to extract neuron signals using deep neural networks (DNNs). However, training such DNNs usually relies on a huge amount of voxel-wise annotations in OM images, which are expensive in terms of both finance and labor. In this paper, we propose a novel framework for dense neuronal population reconstruction from ultra-scale images. To solve the problem of high cost in obtaining manual annotations for training DNNs, we propose a progressive learning scheme for neuronal population reconstruction (PLNPR) which does not require any manual annotations. Our PLNPR scheme consists of a traditional neuron tracing module and a deep segmentation network that mutually complement and progressively promote each other. To reconstruct dense neuronal populations from a terabyte-sized ultra-scale image, we introduce an automatic framework which adaptively traces neurons block by block and fuses fragmented neurites in overlapped regions continuously and smoothly. We build a dataset "VISoR-40" which consists of 40 large-scale OM image blocks from cortical regions of a mouse. Extensive experimental results on our VISoR-40 dataset and the public BigNeuron dataset demonstrate the effectiveness and superiority of our method on neuronal population reconstruction and single neuron reconstruction. Furthermore, we successfully apply our method to reconstruct dense neuronal populations from an ultra-scale mouse brain slice. The proposed adaptive block propagation and fusion strategies greatly improve the completeness of neurites in dense neuronal population reconstruction.


Asunto(s)
Algoritmos , Microscopía , Neuronas , Animales , Ratones , Redes Neurales de la Computación , Neuritas
SELECCIÓN DE REFERENCIAS
DETALLE DE LA BÚSQUEDA
...