ABSTRACT
BACKGROUND: Syndrome differentiation aims to divide patients into several types according to their clinical symptoms and signs, which is essential for traditional Chinese medicine (TCM). Several previous works employed classical algorithms to classify syndromes and achieved promising results. However, the presence of ambiguous symptoms substantially degrades the performance of syndrome differentiation; this disturbance largely stems from the diversity and complexity of patients' symptoms. METHODS: To alleviate this issue, we propose an algorithm based on a multilayer perceptron model with an attention mechanism (ATT-MLP). In particular, we first introduce an attention mechanism to assign different weights to different symptoms among the symptomatic features. In this manner, symptoms of major significance are highlighted while ambiguous symptoms are restrained. The weighted features are then fed into an MLP to predict the syndrome type of AIDS. RESULTS: Experimental results on a real-world AIDS dataset show that our framework achieves significant and consistent improvements over other methods. Moreover, our model can also capture the key symptoms corresponding to each type of syndrome. CONCLUSION: Our proposed method can learn the intrinsic correlations between symptoms and syndrome types. It is able to learn the core cluster of symptoms for each syndrome type from limited data, helping medical doctors diagnose patients efficiently.
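As an illustration of the attention-then-MLP idea described above, here is a minimal PyTorch sketch; the layer sizes, scoring form, and module names are assumptions for illustration, not the authors' exact ATT-MLP.

```python
# Minimal sketch: attention weights over symptom features, then an MLP classifier.
# The scoring form and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class AttMLP(nn.Module):
    def __init__(self, n_symptoms, n_syndromes, hidden=64):
        super().__init__()
        self.score = nn.Linear(n_symptoms, n_symptoms)  # hypothetical per-symptom scoring
        self.mlp = nn.Sequential(
            nn.Linear(n_symptoms, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_syndromes),
        )

    def forward(self, x):                                # x: (batch, n_symptoms)
        weights = torch.softmax(self.score(x), dim=-1)   # highlight informative symptoms
        weighted = weights * x                           # restrain ambiguous symptoms
        return self.mlp(weighted)                        # logits over syndrome types

model = AttMLP(n_symptoms=120, n_syndromes=5)
logits = model(torch.rand(8, 120))                       # toy batch of 8 patients
print(logits.shape)                                      # torch.Size([8, 5])
```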
Subjects
Acquired Immunodeficiency Syndrome/diagnosis, Computer-Assisted Diagnosis/methods, Traditional Chinese Medicine/methods, Neural Networks (Computer), Algorithms, Attention, Humans
ABSTRACT
Existing human parsing frameworks commonly employ joint learning of semantic edge detection and human parsing to facilitate localization around boundary regions. Nevertheless, the parsing prediction within the interior of a part contour may still exhibit inconsistencies due to the inherent ambiguity of fine-grained semantics. In contrast, binary edge detection does not suffer from such fine-grained semantic ambiguity, leading to a typical failure case in which misclassification occurs inside the part contour even though the semantic edge is accurately detected. To address these challenges, we develop a novel diffusion scheme that incorporates guidance from the detected semantic edge and mitigates this problem by propagating correctly classified semantics into the misclassified regions. Building upon this diffusion scheme, we present an Edge Guided Diffusion Network (EGDNet) for human parsing, which progressively refines the parsing predictions to enhance the accuracy and coherence of human parsing results. Moreover, we design a horizontal-vertical aggregation that exploits inherent correlations among body parts along both the horizontal and vertical axes, aiming to enhance the initial parsing results. Extensive experimental evaluations on various challenging datasets demonstrate the effectiveness of the proposed EGDNet. Remarkably, EGDNet shows impressive performance on six benchmark datasets, including four human body parsing datasets (LIP, CIHP, ATR, and PASCAL-Person-Part) and two human face parsing datasets (CelebAMask-HQ and LaPa).
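A minimal NumPy sketch of one plausible edge-gated diffusion step in the spirit of the scheme described above; the propagation rule is an assumption, not EGDNet's exact formulation.

```python
# One diffusion step: average class logits over 4-neighbors, gated so that
# semantics do not leak across detected edges. Illustrative only.
import numpy as np

def diffuse_step(logits, edge, alpha=0.5):
    """logits: (C, H, W) class scores; edge: (H, W) edge probability in [0, 1]."""
    pad = np.pad(logits, ((0, 0), (1, 1), (1, 1)), mode="edge")
    neighbors = (pad[:, :-2, 1:-1] + pad[:, 2:, 1:-1] +
                 pad[:, 1:-1, :-2] + pad[:, 1:-1, 2:]) / 4.0
    gate = alpha * (1.0 - edge)        # suppress propagation where an edge is detected
    return (1.0 - gate) * logits + gate * neighbors

logits = np.random.rand(20, 64, 64)    # 20 part classes, toy resolution
edge = np.random.rand(64, 64)
for _ in range(10):                    # progressive refinement
    logits = diffuse_step(logits, edge)
```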
Subjects
Benchmarking, Learning, Humans, Semantics
ABSTRACT
The composed query image retrieval task aims to retrieve the target image in a database given a query that composes two different modalities: a reference image and a sentence declaring that some details of the reference image should be modified or replaced by new elements. Tackling this task requires learning a multimodal embedding space that places semantically similar queries and targets close together while pushing dissimilar ones as far apart as possible. Most existing methods start from the perspective of model structure and design clever interaction modules to promote better fusion and embedding of the different modalities. However, their learning objectives use conventional query-level examples as negatives while neglecting the composed query's multimodal characteristics, leading to inadequate utilization of the training data and suboptimal construction of the metric space. To this end, in this paper, we propose to improve the learning objective by constructing and mining hard negative examples from the perspective of multimodal fusion. Specifically, we compose the reference image with logically unpaired sentences rather than paired ones to create component-level negative examples, which make better use of the data and enhance the optimization of the metric space. In addition, we propose a new sentence augmentation method that generates more indistinguishable multimodal negative examples at the element level and helps the model learn a better metric space. Extensive comparison experiments on four real-world datasets confirm the effectiveness of the proposed method.
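A minimal PyTorch sketch of component-level hard negatives, assuming a placeholder fusion function and an InfoNCE-style loss; both are illustrative assumptions, not the paper's exact objective.

```python
# Within a batch, compose each reference image with every *other* (logically
# unpaired) sentence; the mismatched compositions serve as extra negatives.
import torch
import torch.nn.functional as F

def fuse(img_feat, txt_feat):
    return F.normalize(img_feat + txt_feat, dim=-1)   # placeholder composition

def loss_with_component_negatives(img, txt, tgt, tau=0.07):
    """img, txt, tgt: (B, D) features of reference images, sentences, target images."""
    query = fuse(img, txt)                             # (B, D) paired compositions
    tgt = F.normalize(tgt, dim=-1)
    pos = (query * tgt).sum(-1, keepdim=True)          # (B, 1) positive similarity
    neg_q = fuse(img.unsqueeze(1), txt.unsqueeze(0))   # (B, B, D) mismatched compositions
    neg = (neg_q * tgt.unsqueeze(1)).sum(-1)           # (B, B) similarities to own target
    neg.fill_diagonal_(float("-inf"))                  # drop the paired composition
    logits = torch.cat([pos, neg], dim=1) / tau
    return F.cross_entropy(logits, torch.zeros(img.size(0), dtype=torch.long))

print(loss_with_component_negatives(torch.randn(4, 8), torch.randn(4, 8), torch.randn(4, 8)))
```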
ABSTRACT
Discovering novel associations between biomedical entities is of great significance and can facilitate not only the identification of network biomarkers of disease but also the search for putative drug targets. Graph representation learning (GRL) has great potential to efficiently predict interactions from biomedical networks by modeling a robust representation for each node. However, current GRL-based methods learn node representations by aggregating the features of their neighbors with equal weights. Furthermore, they fail to identify which features of higher-order neighbors are integrated into the representation of the central node. In this work, we propose a novel graph representation learning framework: a multi-order graph neural network based on reconstructed specific subgraphs (MGRS) for biomedical interaction prediction. In MGRS, we apply a multi-order graph aggregation module (MOGA) to learn a wide-view representation by integrating multi-hop neighbor features. Besides, we propose a subgraph selection module (SGSM) to reconstruct a specific subgraph with adaptive edge weights for each node. SGSM can explicitly explore the dependency of the node representation on the neighbor features and learn a subgraph-based representation from the reconstructed weighted subgraphs. Extensive experimental results on four public biomedical networks demonstrate that MGRS performs better and is more robust than the latest baselines.
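A minimal PyTorch sketch of multi-order (multi-hop) neighbor aggregation in the spirit of MOGA; the mixing rule and shapes are illustrative assumptions.

```python
# Features from 1..K-hop neighborhoods via repeated multiplication by a
# normalized adjacency matrix, combined by concatenation.
import torch

def multi_order_aggregate(adj, x, weights):
    """adj: (N, N) normalized adjacency; x: (N, D); weights: list of K (D, D') tensors."""
    outs, h = [], x
    for w in weights:
        h = adj @ h                       # move one hop further out
        outs.append(torch.relu(h @ w))    # order-specific transformation
    return torch.cat(outs, dim=-1)        # wide-view representation

n, d = 50, 16
a = torch.rand(n, n); a = (a + a.t()) / 2
a = a / a.sum(dim=1, keepdim=True)        # row-normalize a toy adjacency
x = torch.randn(n, d)
reps = multi_order_aggregate(a, x, [torch.randn(d, 8) for _ in range(3)])
print(reps.shape)                         # torch.Size([50, 24])
```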
Subjects
Algorithms, Computational Biology, Neural Networks (Computer), Computational Biology/methods, Humans, Machine Learning
ABSTRACT
This article explores how to harvest precise object segmentation masks while minimizing the human interaction cost. To achieve this, we propose a simple yet effective interaction scheme, named Inside-Outside Guidance (IOG). Concretely, we leverage an inside point that is clicked near the object center and two outside points at the symmetrical corner locations (top-left and bottom-right or top-right and bottom-left) of an almost-tight bounding box that encloses the target object. This interaction results in a total of one foreground click and four background clicks for segmentation. The advantages of our IOG are four-fold: 1) the two outside points can help remove distractions from other objects or the background; 2) the inside point can help eliminate unrelated regions inside the bounding box; 3) the inside and outside points are easily identified, reducing the confusion raised by the state-of-the-art DEXTR (Maninis et al. 2018) in labeling some extreme samples; and 4) it naturally supports additional click annotations for further correction. Despite its simplicity, our IOG not only achieves state-of-the-art performance on several popular benchmarks such as GrabCut (Rother et al. 2004), PASCAL (Everingham et al. 2010), and MS COCO (Russakovsky et al. 2015), but also demonstrates strong generalization capability across different domains such as street scenes (Cityscapes, Cordts et al. 2016), aerial imagery (Rooftop, Sun et al. 2014, and Agriculture-Vision, Chiu et al. 2020), and medical images (ssTEM, Gerhard et al. 2013). Code is available at https://github.com/shiyinzhang/Inside-Outside-Guidance.
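A minimal NumPy sketch of how such clicks might be encoded as Gaussian heatmaps and stacked with the image as network input; the sigma and channel layout are assumptions, not IOG's exact encoding.

```python
# Encode clicks as Gaussian heatmaps; stack with RGB to form a 5-channel input.
import numpy as np

def click_map(h, w, points, sigma=10.0):
    ys, xs = np.mgrid[0:h, 0:w]
    heat = np.zeros((h, w), dtype=np.float32)
    for (py, px) in points:
        heat = np.maximum(heat, np.exp(-((ys - py) ** 2 + (xs - px) ** 2) / (2 * sigma ** 2)))
    return heat

h, w = 256, 256
inside = click_map(h, w, [(128, 130)])             # foreground click near object center
outside = click_map(h, w, [(4, 6), (250, 248)])    # corner background clicks
rgb = np.random.rand(h, w, 3).astype(np.float32)   # stand-in for a loaded image crop
net_input = np.dstack([rgb, inside, outside])      # (H, W, 5) network input
print(net_input.shape)
```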
ABSTRACT
Composed image retrieval aims to retrieve the desired images given a reference image and a text piece. To handle this task, two important subprocesses should be modeled reasonably. One is to erase the details of the reference image that are irrelevant to the text piece, and the other is to replenish the image with the desired details specified by the text piece. Existing methods neglect to distinguish between the two subprocesses and implicitly lump them together to solve the composed image retrieval task. To explicitly model the two subprocesses of the task in order, we propose a novel composed image retrieval method that contains three key components, i.e., a Multi-semantic Dynamic Suppression module (MDS), a Text-semantic Complementary Selection module (TCS), and Semantic Space Alignment constraints (SSA). Concretely, MDS erases irrelevant details of the reference image by suppressing its semantic features. TCS selects and enhances the semantic features of the text piece and then replenishes the reference image with them. Finally, to facilitate the erasure and replenishment subprocesses, SSA aligns the semantics of the two modality features in the final space. Extensive experiments on three benchmark datasets (Shoes, FashionIQ, and Fashion200K) show the superior performance of our approach against state-of-the-art methods.
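A minimal PyTorch sketch of the erase-then-replenish idea with a text-conditioned gate; the gating and selection forms here are assumptions, not the exact MDS/TCS modules.

```python
# Erase: a sigmoid gate suppresses reference-image semantics conditioned on the text.
# Replenish: selected text semantics are added back to the kept image features.
import torch
import torch.nn as nn

class EraseReplenish(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.erase_gate = nn.Linear(2 * dim, dim)   # decides what to suppress
        self.select = nn.Linear(dim, dim)           # selects text details to add

    def forward(self, img, txt):                    # img, txt: (B, dim)
        gate = torch.sigmoid(self.erase_gate(torch.cat([img, txt], dim=-1)))
        kept = gate * img                           # erase irrelevant image details
        return kept + self.select(txt)              # replenish desired text details

m = EraseReplenish(dim=32)
print(m(torch.randn(2, 32), torch.randn(2, 32)).shape)   # torch.Size([2, 32])
```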
ABSTRACT
Garment transfer aims to transfer the desired garment from a model image to a target person, and has attracted a great deal of attention due to its wide potential applications. However, since the model and target persons are often given at different views, body shapes, and poses, realistic garment transfer faces the following challenges that have not been well addressed: 1) deforming the garment; 2) inferring unobserved appearance; and 3) preserving fine texture details. To tackle these challenges, we propose a novel SPatial-Aware Texture Transformer (SPATT) model. Different from existing models, SPATT establishes correspondence and infers unobserved clothing appearance by leveraging the spatial prior information of a UV space. Specifically, the source image is transformed into a partial UV texture map guided by the extracted dense pose. To better infer the unseen appearance from the seen region, we first propose a novel coordinate-prior map that defines the spatial relationship between coordinates in the UV texture map, and design an algorithm to compute it. Based on the proposed coordinate-prior map, we present a novel spatial-aware texture generation network to complete the partial UV texture. In the second stage, we transform the completed UV texture to fit the target person. To polish the details and improve realism, we introduce a refinement generative network conditioned on the warped image and the source input. As shown experimentally, the proposed framework generates more realistic images with better-preserved texture details than existing frameworks. Furthermore, difficult cases in which the two persons have large pose and view differences are also handled well by SPATT.
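A minimal NumPy sketch of the first-stage idea of scattering visible source pixels into a partial UV texture map using dense-pose (u, v) predictions; the atlas size and inputs are illustrative assumptions.

```python
# Each visible source pixel is written into the texture atlas at its predicted
# (u, v) location; the unfilled holes are what the completion network must infer.
import numpy as np

def image_to_uv_texture(image, u, v, visible, tex_size=256):
    """image: (H, W, 3); u, v: (H, W) in [0, 1]; visible: (H, W) bool body mask."""
    texture = np.zeros((tex_size, tex_size, 3), dtype=image.dtype)
    tu = np.clip((u[visible] * (tex_size - 1)).astype(int), 0, tex_size - 1)
    tv = np.clip((v[visible] * (tex_size - 1)).astype(int), 0, tex_size - 1)
    texture[tv, tu] = image[visible]   # partial texture; holes remain to be completed
    return texture

h, w = 128, 96
img = np.random.rand(h, w, 3)
u, v = np.random.rand(h, w), np.random.rand(h, w)   # stand-ins for dense-pose output
mask = np.random.rand(h, w) > 0.5
print(image_to_uv_texture(img, u, v, mask).shape)    # (256, 256, 3)
```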
ABSTRACT
The convolutional neural network (CNN) is the primary technique that has greatly promoted the development of computer vision. However, there is little research on how to allocate parameters across convolution layers when designing CNNs. Our research mainly focuses on revealing the relationship between the CNN parameter distribution, i.e., the allocation of parameters across convolution layers, and the discriminative performance of the CNN. Unlike previous works, we do not append more elements to the network, such as more convolution layers or denser short connections. Instead, we focus on enhancing the discriminative performance of a CNN by varying its parameter distribution under a strict size constraint. We propose an energy function to represent the CNN parameter distribution, which establishes the connection between the allocation of parameters and the discriminative performance of the CNN. Extensive experiments with shallow CNNs on three public image classification datasets demonstrate that a CNN parameter distribution with a higher energy value promotes better model performance. Motivated by this observation, the problem of finding the optimal parameter distribution can be transformed into the optimization problem of finding the largest energy value. We present a simple yet effective guideline that uses a balanced parameter distribution to design CNNs. Extensive experiments on ImageNet with three popular backbones, i.e., AlexNet, ResNet34, and ResNet101, demonstrate that the proposed guideline yields consistent improvements over different baselines under a strict size constraint.
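A minimal PyTorch sketch of measuring the quantity under study, the per-layer parameter distribution of a CNN; the energy function itself is the paper's own and is not reproduced here.

```python
# Count learnable parameters per convolution layer and report each layer's share
# of the total: the "parameter distribution" the abstract refers to.
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
)
counts = {name: sum(p.numel() for p in m.parameters())
          for name, m in net.named_modules() if isinstance(m, nn.Conv2d)}
total = sum(counts.values())
for name, c in counts.items():
    print(f"conv {name}: {c} params ({c / total:.1%} of total)")
```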
ABSTRACT
To efficiently browse long surveillance videos, the video synopsis technique is often used to rearrange tubes (i.e., tracks of moving objects) along the temporal axis to form a much shorter video. In this process, two key issues need to be addressed: minimizing spatial tube collision and maximizing temporal video condensation. In addition, when a surveillance video arrives as a stream, an online algorithm with the capability of dynamically rearranging tubes is also required. Toward this end, this paper proposes a novel graph-based tube rearrangement approach for online video synopsis. The relationships among tubes are modeled with a dynamic graph, whose nodes (i.e., object masks of tubes) and edges (i.e., relationships) can be progressively inserted and updated. Based on this graph, we propose a dynamic graph coloring algorithm that efficiently rearranges all tubes by determining when they should appear. Extensive experimental results show that our approach can condense online surveillance video streams in real time with less tube collision and a high condensation ratio.
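A minimal Python sketch of greedy graph coloring applied to streaming tubes, where a "color" corresponds to a start slot; the collision test and slot granularity are assumptions, not the paper's exact algorithm.

```python
# Tubes that would collide spatially share an edge and must receive different
# "colors" (start slots); each arriving tube greedily takes the earliest free slot.
def assign_start_slots(tubes, collides):
    """tubes: tube ids in arrival order; collides(a, b) -> True if a and b
    would overlap spatially when given the same start slot."""
    slot = {}
    for t in tubes:                                   # tubes arrive as a stream
        taken = {slot[o] for o in slot if collides(t, o)}
        s = 0
        while s in taken:                             # smallest collision-free slot
            s += 1
        slot[t] = s                                   # earliest appearance time
    return slot

# Toy example: tubes 0 and 1 collide; tube 2 collides with tube 1 only.
pairs = {(0, 1), (1, 0), (1, 2), (2, 1)}
print(assign_start_slots([0, 1, 2], lambda a, b: (a, b) in pairs))  # {0: 0, 1: 1, 2: 0}
```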
ABSTRACT
This paper presents an intelligent system named Magic-wall, which automatically visualizes the effect of room decoration. Concretely, given an image of an indoor scene and a preferred color, Magic-wall can automatically locate the wall regions in the image and smoothly replace the existing wall color with the required one. The key idea of the proposed Magic-wall is to leverage visual semantics to guide the entire process of color substitution, including wall segmentation and replacement. To strengthen the realism of the visualization, we make the following contributions. First, we propose an edge-aware fully convolutional neural network (Edge-aware-FCN) for indoor semantic scene parsing, in which a novel edge-prior branch is introduced to better identify the boundaries between different semantic regions. To further polish the details between the wall and other semantic regions, we leverage the output of the Edge-aware-FCN as prior knowledge, concatenating it with the image to form a new input for an Enhanced-Net. In this way, the Enhanced-Net is able to capture more semantic-aware information from the input and polish ambiguous regions. Finally, to naturally replace the color of the original walls, a simple yet effective color space conversion method with brightness preserved is proposed for the replacement. We build a new indoor scene dataset upon ADE20K for training and testing, which includes six semantic labels. Extensive experimental evaluations and visualizations demonstrate that the proposed Magic-wall is effective and can automatically generate a set of visually pleasing results.
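One plausible realization of brightness-preserving color substitution is to replace only the chroma channels in the Lab color space, as in this minimal OpenCV sketch; it is not necessarily Magic-wall's exact conversion.

```python
# Replace the a/b (chroma) channels inside the wall mask with those of the target
# color while keeping the L (lightness) channel, so shading and texture survive.
import cv2
import numpy as np

def recolor_wall(image_bgr, wall_mask, target_bgr):
    """image_bgr: (H, W, 3) uint8; wall_mask: (H, W) bool; target_bgr: (3,) uint8."""
    lab = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2LAB)
    target = np.uint8([[target_bgr]])                  # 1x1 image of the new color
    target_lab = cv2.cvtColor(target, cv2.COLOR_BGR2LAB)[0, 0]
    lab[wall_mask, 1] = target_lab[1]                  # replace a (green-red) channel
    lab[wall_mask, 2] = target_lab[2]                  # replace b (blue-yellow) channel
    return cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)        # lightness is preserved

img = np.full((120, 160, 3), 180, dtype=np.uint8)      # toy gray scene
mask = np.zeros((120, 160), dtype=bool); mask[:, :80] = True
out = recolor_wall(img, mask, np.uint8([30, 80, 200])) # repaint the left half
```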
ABSTRACT
In content-based image retrieval (CBIR), one of the most challenging and ambiguous tasks is to correctly understand the human query intention and measure its semantic relevance to the images in the database. Given the impressive capability of visual saliency to predict human visual attention, which is closely related to the query intention, this paper attempts to explicitly discover the essential effect of visual saliency on CBIR via qualitative and quantitative experiments. Toward this end, we first generate fixation density maps of images from a widely used CBIR dataset using an eye-tracking apparatus. These ground-truth saliency maps are then used to measure the influence of visual saliency on the CBIR task by exploring several probable ways of incorporating such saliency cues into the retrieval process. We find that visual saliency is indeed beneficial to the CBIR task, and that the best saliency-involving scheme may differ across image retrieval models. Inspired by these findings, this paper presents two-stream attentive CNNs with saliency embedded inside for CBIR. The proposed network has two streams that simultaneously handle two tasks. The main stream focuses on extracting discriminative visual features that are tightly related to semantic attributes. Meanwhile, the auxiliary stream aims to facilitate the main stream by redirecting the feature extraction toward the salient image content that humans may pay attention to. By fusing these two streams into the Main and Auxiliary CNNs (MAC), image similarity can be computed as a human would, by preserving conspicuous content and suppressing irrelevant regions. Extensive experiments show that the proposed model achieves impressive image retrieval performance on four public datasets.
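A minimal PyTorch sketch of the auxiliary-stream intuition, weighting CNN features by a saliency map before pooling so that conspicuous content dominates the retrieval descriptor; the fusion rule is an illustrative assumption.

```python
# Saliency-weighted global pooling of a CNN feature map into a retrieval descriptor.
import torch
import torch.nn.functional as F

def saliency_weighted_descriptor(features, saliency):
    """features: (B, C, H, W) CNN feature maps; saliency: (B, 1, h, w) in [0, 1]."""
    sal = F.interpolate(saliency, size=features.shape[-2:], mode="bilinear",
                        align_corners=False)
    weighted = features * sal                          # suppress irrelevant regions
    desc = weighted.flatten(2).sum(-1) / (sal.flatten(2).sum(-1) + 1e-6)
    return F.normalize(desc, dim=-1)                   # (B, C) retrieval descriptor

feats = torch.randn(2, 512, 14, 14)
sal = torch.rand(2, 1, 56, 56)
print(saliency_weighted_descriptor(feats, sal).shape)  # torch.Size([2, 512])
```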
ABSTRACT
Recently, convolutional neural network (CNN) visual features have demonstrated their power as a universal representation for various recognition tasks. In this paper, cross-modal retrieval with CNN visual features is implemented with several classic methods. Specifically, off-the-shelf CNN visual features are extracted from a CNN model pretrained on ImageNet, with more than one million images from 1000 object categories, as a generic image representation for cross-modal retrieval. To further enhance the representational ability of the CNN visual features, starting from the pretrained ImageNet CNN model, a fine-tuning step is performed for each target dataset using the open-source Caffe CNN library. In addition, we propose a deep semantic matching method to address the cross-modal retrieval problem for samples annotated with one or multiple labels. Extensive experiments on five popular publicly available datasets demonstrate the superiority of CNN visual features for cross-modal retrieval.
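A minimal torchvision sketch of extracting off-the-shelf features from an ImageNet-pretrained CNN; a modern ResNet is used here purely for illustration, whereas the paper's experiments used Caffe-era models.

```python
# Drop the classifier head of a pretrained network and use the pooled activations
# as a generic image descriptor.
import torch
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.IMAGENET1K_V2
model = resnet50(weights=weights).eval()
extractor = torch.nn.Sequential(*list(model.children())[:-1])   # remove the classifier

preprocess = weights.transforms()               # the matching normalization pipeline
image = torch.rand(3, 224, 224)                 # stand-in for a loaded image
with torch.no_grad():
    feat = extractor(preprocess(image).unsqueeze(0)).flatten(1)
print(feat.shape)                               # torch.Size([1, 2048]) descriptor
```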
ABSTRACT
Independent component analysis with a soft reconstruction cost (RICA) has recently been proposed to linearly learn sparse representations with an overcomplete basis, and this technique exhibits promising performance even on unwhitened data. However, linear RICA may not be effective for much real-world data, because nonlinearly separable structures pervade the original data space. Meanwhile, RICA is essentially an unsupervised method and does not employ class information. Motivated by the success of the kernel trick, which maps a nonlinearly separable data structure into a linearly separable case in a high-dimensional feature space, we propose a kernel RICA (kRICA) model to nonlinearly capture sparse representations in feature space. Furthermore, we extend the unsupervised kRICA to a supervised one by introducing a class-driven discrimination constraint, such that data samples from the same class are well represented by the corresponding subset of basis vectors. This discrimination constraint simultaneously minimizes inhomogeneous representation energy and maximizes homogeneous representation energy, which is essentially equivalent to implicitly maximizing between-class scatter while minimizing within-class scatter. Experimental results demonstrate that the proposed algorithm is more effective than other state-of-the-art methods on several datasets.
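For reference, a commonly cited form of the linear RICA objective and its kernelized analogue; this is a hedged reading, and the paper's notation and exact formulation may differ.

```latex
% Linear RICA over samples x_1, ..., x_m with basis W: a soft reconstruction
% cost plus an L1 sparsity penalty (common form; notation may differ).
\min_{W} \; \frac{\lambda}{m} \sum_{i=1}^{m} \left\| W^{\top} W x_i - x_i \right\|_2^2
        \;+\; \sum_{i=1}^{m} \left\| W x_i \right\|_1
% Kernel RICA replaces each x_i with a feature-space mapping \phi(x_i), with all
% inner products evaluated via a kernel k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle.
```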
ABSTRACT
Nonnegative matrix factorization (NMF) is a useful technique for exploring a parts-based representation by decomposing the original data matrix into a few parts-based basis vectors and encodings under nonnegativity constraints. It has been widely used in image processing and pattern recognition tasks due to its psychological and physiological interpretation of natural data, whose representation in the human brain may be parts-based. However, the nonnegativity constraint on matrix factorization is generally not sufficient to produce representations that are robust to local transformations. To overcome this problem, in this paper, we propose a topographic NMF (TNMF), which imposes a topographic constraint on the encoding factor as a regularizer during matrix factorization. In essence, the topographic constraint is a two-layered network, with a square nonlinearity in the first layer and a square-root nonlinearity in the second layer. By pooling together the structure-correlated features belonging to the same hidden topic, TNMF forces the encodings to be organized in a topographic map, thereby promoting feature invariance. Experiments carried out on three standard datasets validate the effectiveness of our method in comparison to state-of-the-art approaches.
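One common form of such a square/square-root topographic regularizer, given here as a hedged reading of the abstract rather than the paper's exact equation:

```latex
% Encodings h_j are squared (first layer), pooled over topographic neighborhoods
% G_k, and passed through a square root (second layer); \epsilon > 0 for smoothness.
R(H) \;=\; \sum_{k} \sqrt{\,\epsilon + \sum_{j \in G_k} h_j^{2}\,}
% TNMF would then minimize the NMF reconstruction error plus \lambda R(H) subject
% to nonnegativity (an assumed overall objective, not quoted from the paper).
```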
ABSTRACT
The bag-of-words (BoW) model is known as an effective method for large-scale image search and indexing. Recent work shows that the performance of the model can be further improved with embedding methods. While different variants of the BoW model and embedding methods have been developed, less effort has been made to uncover their underlying working mechanisms. In this paper, we systematically investigate how image search performance varies with several factors of the BoW model, and study how to employ embedding methods to further improve image search performance. We then summarize several observations based on experiments on descriptor matching. To validate these observations in a real image search, we propose an effective and efficient image search scheme in which the BoW model and embedding method are jointly optimized for effectiveness and efficiency by following these observations. Our comprehensive experiments demonstrate that it is beneficial to employ these observations to develop an image search algorithm, and the proposed image search scheme outperforms state-of-the-art methods in both effectiveness and efficiency.
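A minimal scikit-learn sketch of the basic BoW pipeline the paper analyzes, with a toy vocabulary size and random stand-ins for local descriptors:

```python
# Cluster local descriptors into a visual vocabulary, then represent an image as
# an L2-normalized histogram of visual-word assignments.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
train_descriptors = rng.random((2000, 128))          # e.g., SIFT-like descriptors
vocab = KMeans(n_clusters=64, n_init=3, random_state=0).fit(train_descriptors)

image_descriptors = rng.random((300, 128))           # descriptors from one image
words = vocab.predict(image_descriptors)             # quantize to visual words
bow, _ = np.histogram(words, bins=np.arange(65))
bow = bow / np.linalg.norm(bow)                      # L2-normalized BoW vector
print(bow.shape)                                     # (64,)
```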