Results 1 - 20 of 22
1.
Article in English | MEDLINE | ID: mdl-38995710

ABSTRACT

Contrastive unsupervised representation learning (CURL) is a technique that seeks to learn feature sets from unlabeled data. It has found widespread and successful application in unsupervised feature learning, where the design of positive and negative pairs determines what kinds of data samples the model learns from. While CURL has seen empirical successes in recent years, the pair-generation process still leaves room for improvement, including how samples are combined and re-filtered and how transformations are applied within positive/negative pairs. We refer to this as the sample selection process. In this article, we introduce an optimized pair-data sample selection method for CURL. The method efficiently ensures that the two samples forming a dissimilar pair do not belong to the same class (and that the samples of a similar pair do). We provide a theoretical analysis of why the proposed method enhances learning performance by analyzing its error probability. Furthermore, we extend the proof to a PAC-Bayes generalization analysis to show how our method tightens the bounds reported in previous literature. Numerical experiments on text and image datasets show that our method achieves competitive accuracy with good generalization bounds.
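
As an illustration of the sample selection idea, the minimal sketch below filters pairs by pseudo-labels (e.g., from clustering on current features) so that a dissimilar pair never shares a class. The function names and the pseudo-label source are illustrative assumptions, not the paper's actual procedure.

```python
# Class-collision-aware pair sampling for contrastive learning (sketch).
# Pseudo-labels stand in for the unknown true classes; all names are
# illustrative, not the paper's API.
import numpy as np

rng = np.random.default_rng(0)

def sample_pairs(features, pseudo_labels, n_pairs=8):
    """Return (anchor, positive, negative) index triplets such that the
    negative never shares the anchor's pseudo-class."""
    triplets = []
    n = len(features)
    for _ in range(n_pairs):
        a = rng.integers(n)
        same = np.flatnonzero(pseudo_labels == pseudo_labels[a])
        diff = np.flatnonzero(pseudo_labels != pseudo_labels[a])
        if len(same) < 2 or len(diff) == 0:
            continue  # cannot form a clean pair for this anchor
        p = rng.choice(same[same != a])   # similar pair: same pseudo-class
        neg = rng.choice(diff)            # dissimilar pair: different class
        triplets.append((a, p, neg))
    return triplets

X = rng.normal(size=(100, 16))
labels = rng.integers(0, 5, size=100)     # pseudo-labels, e.g., from k-means
print(sample_pairs(X, labels, n_pairs=3))
```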

2.
Article in English | MEDLINE | ID: mdl-38861430

ABSTRACT

In this paper, we formally address universal object detection, which aims to detect every category in every scene. The dependence on human annotations, the limited visual information, and the novel categories of the open world severely restrict the universality of detectors. We propose UniDetector, a universal object detector that recognizes enormous numbers of categories in the open world. The critical points of UniDetector are: 1) it leverages images from multiple sources with heterogeneous label spaces during training through image-text alignment, which guarantees sufficient information for universal representations; 2) it involves heterogeneous supervision training, which alleviates the dependence on limited fully-labeled images; 3) it generalizes to the open world easily while keeping a balance between seen and unseen classes; 4) it further improves generalization to novel categories through our proposed decoupled training manner and probability calibration. These contributions allow UniDetector to detect over 7k categories, the largest measurable size so far, with only about 500 classes participating in training. UniDetector exhibits strong zero-shot ability on large-vocabulary datasets: it surpasses supervised baselines by more than 5% without seeing any corresponding images. On 13 detection datasets covering various scenes, UniDetector also achieves state-of-the-art performance with only 3% of the training data.
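
To make the probability calibration point concrete, here is a hedged sketch of one common form of prior-based calibration: each category score is divided by an estimated class prior raised to a power. The exact formula and prior estimate used by UniDetector may differ; gamma and the prior here are assumptions for illustration.

```python
# Prior-based probability calibration for open-vocabulary detection (sketch):
# down-weight categories the model is biased toward. gamma and the prior
# estimate are illustrative assumptions, not the paper's exact recipe.
import numpy as np

def calibrate(probs, prior, gamma=0.6, eps=1e-8):
    """probs: (n_boxes, n_classes) raw scores; prior: (n_classes,) estimated
    per-class frequency, e.g., mean score over an unlabeled image pool."""
    cal = probs / np.power(prior + eps, gamma)
    return cal / cal.sum(axis=1, keepdims=True)  # renormalize per box

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(7), size=4)            # fake detector outputs
prior = p.mean(axis=0)                           # crude prior estimate
print(calibrate(p, prior).round(3))
```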

5.
Forensic Sci Int ; 354: 111888, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38048699

ABSTRACT

Multi-model score fusion is considered a bottleneck problem in forensic face identification: because the score distributions of different face models vary greatly, existing score processing methods cannot achieve accurate alignment. This paper proposes a score fusion framework named the fine alignment and flexible fusion framework (FAFF). In FAFF, we take score-based likelihood ratios (LLRs) as the reference values to align the similarity scores generated by different face models. First, we set up a unified calibration test workflow based on the forensic likelihood ratio test. Then, three LLR anchor-based methods (LLRBA1, LLRBA2, and LLRBA3) and an LLR curve-based method (LLRBC) are proposed. Finally, we conduct fusion experiments on four face models (VGGface, Facenet, Arcface, and SFace). The experimental results show that on the CelebA dataset, compared with the existing MOEBA and PAN methods, LLRBC increases the TPR at an FPR of 10⁻⁷ by 175.4% and 162.9%, respectively, and LLRBA by 55.6% and 48.5%.
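
The core alignment idea can be sketched as follows: fit the mated and non-mated score distributions of each model, map every raw score to a log-likelihood ratio, and fuse on that common scale. This toy version assumes Gaussian score distributions, a simplification; the paper's anchor- and curve-based methods are more refined.

```python
# Score-to-LLR alignment for multi-model fusion (sketch), assuming Gaussian
# mated/non-mated score distributions per face model.
import numpy as np
from scipy.stats import norm

def fit_llr(mated, nonmated):
    m = norm(mated.mean(), mated.std())
    n = norm(nonmated.mean(), nonmated.std())
    return lambda s: m.logpdf(s) - n.logpdf(s)   # log-likelihood ratio

rng = np.random.default_rng(2)
# Two models with very different score scales:
llr_a = fit_llr(rng.normal(0.8, 0.05, 1000), rng.normal(0.3, 0.10, 1000))
llr_b = fit_llr(rng.normal(55, 6.0, 1000), rng.normal(20, 8.0, 1000))

# After alignment, scores live on a common LLR scale and can simply be summed.
fused = llr_a(0.75) + llr_b(60.0)
print(f"fused LLR: {fused:.2f}")  # > 0 favors the 'same person' hypothesis
```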

6.
Article in English | MEDLINE | ID: mdl-38090851

ABSTRACT

Video object removal aims at erasing a target object from an entire video and filling the resulting holes with plausible content, given an object mask in the first frame as input. Existing solutions mostly break the task down into (supervised) mask tracking and (self-supervised) video completion, and then tackle the two separately with tailored designs. In this paper, we introduce a new setup, coined unified video object removal, in which mask tracking and completion are addressed within a single framework. Despite introducing more challenges, the setup is promising for practical usage. We embrace the observation that these two sub-tasks have strong inherent connections in terms of pixel-level temporal correspondence, and making full use of these connections is beneficial given the complexity of both the algorithm and its deployment. We propose a single network linking the two sub-tasks by inferring temporal correspondences across multiple frames: correspondences between valid-valid (V-V) pixel pairs for mask tracking and between valid-hole (V-H) pixel pairs for video completion. Thanks to the unified setup, the network can be learned end-to-end in a fully unsupervised fashion without any annotations. We demonstrate that our method generates visually pleasing results and performs favorably against existing separate solutions in realistic test cases.
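
A toy sketch of the shared-correspondence idea: a single softmax affinity between two frames' features drives both mask propagation (valid-to-valid) and hole filling (valid-to-hole). Real frame features would replace the random tensors; everything here is illustrative.

```python
# One affinity, two uses: V-V propagates the mask, V-H borrows pixels (sketch).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
HW, C = 64, 8                              # flattened spatial dims, channels
feat_ref = rng.normal(size=(HW, C))        # reference frame features
feat_tgt = rng.normal(size=(HW, C))        # target frame features
mask_ref = (rng.random(HW) < 0.2)          # object mask in reference frame
color_ref = rng.random((HW, 3))            # reference frame pixels

aff = softmax(feat_tgt @ feat_ref.T / np.sqrt(C))  # (HW_tgt, HW_ref)

mask_tgt = aff @ mask_ref.astype(float)    # V-V: propagate the mask
fill_tgt = aff @ color_ref                 # V-H: borrow pixels for the hole
print(mask_tgt.shape, fill_tgt.shape)
```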

7.
Diagnostics (Basel) ; 13(6)2023 Mar 13.
Article in English | MEDLINE | ID: mdl-36980394

ABSTRACT

(1) Background: Three-dimensional (3D) facial anatomical landmarks are the premise and foundation of facial morphology analysis, yet no ideal method currently exists for determining them automatically. This research aims to realize the automatic determination of 3D facial anatomical landmarks based on the non-rigid registration algorithm developed by our research team and to evaluate its landmark localization accuracy. (2) Methods: A 3D facial scanner, Face Scan, was used to collect 3D facial data from 20 adult males without significant facial deformities. Using the radial-basis-function-optimized non-rigid registration algorithm TH-OCR developed by our research team (experimental group: TH group) and the non-rigid registration algorithm MeshMonk (control group: MM group), a 3D face template constructed in our previous research was deformed and registered to each participant's data. The automatic determination of 3D facial anatomical landmarks was realized according to the indices of 32 facial anatomical landmarks defined on the 3D face template. Taking the 32 facial anatomical landmarks manually selected by experts on the 3D facial data as the gold standard, the distance between each automatically determined landmark and the corresponding manually selected one was calculated as the "landmark localization error" to evaluate the accuracy and feasibility of the automatic method (template method). (3) Results: The mean landmark localization errors over all facial anatomical landmarks in the TH and MM groups were 2.34 ± 1.76 mm and 2.16 ± 1.97 mm, respectively. In both groups, anatomical landmarks in the middle face were determined more accurately than those in the upper and lower face, and landmarks in the center of the face more accurately than those at the margins. (4) Conclusions: In this study, the automatic determination of 3D facial anatomical landmarks was realized using non-rigid registration algorithms. There is no significant difference in landmark localization accuracy between the TH-OCR and MeshMonk algorithms, and both can meet the needs of oral clinical applications to a certain extent.
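
The "landmark localization error" metric is straightforward to compute; a minimal sketch with synthetic values follows (the noise level is illustrative, not the study's data).

```python
# Landmark localization error: Euclidean distance between automatically
# placed and expert-placed 3D landmarks (sketch with synthetic data).
import numpy as np

def landmark_error(auto_pts, manual_pts):
    """auto_pts, manual_pts: (32, 3) arrays of corresponding landmarks (mm)."""
    return np.linalg.norm(auto_pts - manual_pts, axis=1)

rng = np.random.default_rng(4)
manual = rng.random((32, 3)) * 100            # expert gold standard (mm)
auto = manual + rng.normal(0, 2.0, (32, 3))   # automatic result, ~2 mm noise
err = landmark_error(auto, manual)
print(f"mean ± sd: {err.mean():.2f} ± {err.std():.2f} mm")
```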

8.
Biomed Eng Online ; 21(1): 87, 2022 Dec 17.
Article in English | MEDLINE | ID: mdl-36528597

ABSTRACT

BACKGROUND: The evaluation of refraction is indispensable in ophthalmic clinics, generally requiring a refractor or retinoscopy under cycloplegia. Retinal fundus photographs (RFPs) supply a wealth of information about the human eye and might provide a more convenient and objective approach. Here, we aimed to develop and validate a fusion model-based deep learning system (FMDLS) to identify ocular refraction from RFPs and to compare it with cycloplegic refraction. In this population-based comparative study, we retrospectively collected 11,973 RFPs from May 1, 2020 to November 20, 2021. The performance of the regression models for sphere and cylinder was evaluated using the mean absolute error (MAE); the accuracy, sensitivity, specificity, area under the receiver operating characteristic curve, and F1-score were used to evaluate the classification model for the cylinder axis. RESULTS: Overall, 7873 RFPs were retained for analysis. For sphere and cylinder, the MAE values between the FMDLS and cycloplegic refraction were 0.50 D and 0.31 D, improvements of 29.41% and 26.67%, respectively, over the single models. The correlation coefficients (r) were 0.949 and 0.807, respectively. For the axis analysis, the accuracy, specificity, sensitivity, and area-under-the-curve value of the classification model were 0.89, 0.941, 0.882, and 0.814, respectively, and the F1-score was 0.88. CONCLUSIONS: The FMDLS successfully identified ocular refraction in sphere, cylinder, and axis, and showed good agreement with cycloplegic refraction. RFPs can provide not only comprehensive fundus information but also the refractive state of the eye, highlighting their potential clinical value.
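
The regression evaluation reported here (MAE and correlation against cycloplegic refraction) is easy to reproduce in outline; the values below are synthetic stand-ins, not the study's data.

```python
# MAE and correlation against a reference measurement (sketch).
import numpy as np

def mae(pred, truth):
    return np.mean(np.abs(pred - truth))

rng = np.random.default_rng(5)
truth_sph = rng.normal(-2.0, 2.5, 500)            # cycloplegic sphere (D)
pred_sph = truth_sph + rng.normal(0, 0.6, 500)    # model prediction
print(f"sphere MAE: {mae(pred_sph, truth_sph):.2f} D")
print(f"r: {np.corrcoef(pred_sph, truth_sph)[0, 1]:.3f}")
```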


Subject(s)
Deep Learning , Retinoscopy , Humans , Retinoscopy/methods , Refraction, Ocular , Mydriatics , Retrospective Studies , Algorithms
9.
IEEE Trans Image Process ; 31: 2620-2632, 2022.
Article in English | MEDLINE | ID: mdl-35286259

ABSTRACT

In recent years, the object detection community has witnessed remarkable progress with the development of deep neural networks, but detection performance still suffers from the dilemma between complex networks and single-vector predictions. In this paper, we propose a novel approach that boosts object detection performance by aggregating predictions. First, we propose a unified module with an adjustable hyper-structure to generate multiple predictions from a single detection network. Second, we formulate additive learning for aggregating predictions, which reduces the classification and regression losses by progressively adding prediction values. Following the gradient Boosting strategy, the optimization of the additional predictions is further modeled as weighted regression problems that fit the Newton-descent directions. By aggregating multiple predictions from a single network, we arrive at BooDet, an approach that Bootstraps the classification and bounding box regression for high-performance object Detection. In particular, we plug BooDet into Cascade R-CNN. Extensive experiments show that the proposed approach is effective at improving object detection: we obtain a 1.3%~2.0% improvement over the strong Cascade R-CNN baseline on the COCO val set and achieve 56.5% AP on the COCO test-dev set with only bounding box annotations.
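
A toy sketch of additive prediction aggregation in a boosting flavor: each extra prediction head fits the residual of the running score, and the aggregate is the progressive sum. The Newton weighting and detection-specific details are simplified away; this is an assumption-laden illustration, not BooDet itself.

```python
# Additive aggregation of multiple prediction heads, boosting-style (sketch).
import numpy as np

rng = np.random.default_rng(6)
n = 200
target = rng.integers(0, 2, n).astype(float)           # binary class target
heads = [rng.normal(0, 1, n) + (target - 0.5) for _ in range(3)]  # weak preds

agg = np.zeros(n)
for h in heads:
    residual = target - 1 / (1 + np.exp(-agg))   # gradient of logistic loss
    w = np.dot(h, residual) / np.dot(h, h)       # least-squares stage weight
    agg += w * h                                 # progressively add predictions

acc = np.mean((agg > 0) == (target > 0.5))
print(f"aggregated accuracy: {acc:.2f}")
```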


Subject(s)
Neural Networks, Computer
10.
IEEE Trans Image Process ; 31: 612-622, 2022.
Article in English | MEDLINE | ID: mdl-34890326

ABSTRACT

Data association in multi-target multi-camera tracking (MTMCT) usually estimates affinity directly from re-identification (re-ID) feature distances. However, we argue that this might not be the best choice given the difference in matching scope between the re-ID and MTMCT problems. Re-ID systems focus on global matching, retrieving targets from all cameras at all times; in contrast, data association in tracking is a local matching problem, since its candidates come only from neighboring locations and time frames. In this paper, we design experiments to verify this misfit between global re-ID feature distances and local matching in tracking, and we propose a simple yet effective approach to adapt affinity estimation to the corresponding matching scope in MTMCT. Instead of trying to deal with all appearance changes, we tailor the affinity metric to specialize in those that might emerge during data association. To this end, we introduce a new data sampling scheme with the temporal windows originally used for data association in tracking. By minimizing the mismatch, the adaptive affinity module brings significant improvements over the global re-ID distance and produces competitive performance on the CityFlow and DukeMTMC datasets.
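
The temporal-window sampling scheme can be sketched in a few lines: training pairs for the affinity metric are drawn only from nearby frames, mimicking the local matching scope of data association. Names and the window size are illustrative assumptions.

```python
# Temporal-window pair sampling for local affinity learning (sketch).
import numpy as np

rng = np.random.default_rng(7)

def sample_local_pair(frame_ids, window=50):
    """Pick two detections whose frames fall inside one temporal window."""
    a = rng.integers(len(frame_ids))
    near = np.flatnonzero(np.abs(frame_ids - frame_ids[a]) <= window)
    near = near[near != a]
    return (a, rng.choice(near)) if len(near) else None

frames = rng.integers(0, 10_000, size=500)   # frame index of each detection
print(sample_local_pair(frames, window=50))
```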


Subject(s)
Algorithms , Image Processing, Computer-Assisted
11.
IEEE Trans Pattern Anal Mach Intell ; 44(11): 7912-7927, 2022 Nov.
Article in English | MEDLINE | ID: mdl-34591757

ABSTRACT

The recent success of supervised multi-view stereopsis (MVS) relies on onerously collected real-world 3D data. While the latest differentiable rendering techniques enable unsupervised MVS, they are restricted to discretized (e.g., point cloud) or implicit geometric representations, suffering from either low integrity in textureless regions or limited geometric detail in complex scenes. In this paper, we propose SurRF, an unsupervised MVS pipeline that learns a Surface Radiance Field, i.e., a radiance field defined on a continuous and explicit 2D surface. Our key insight is that, in a local region, the explicit surface can be gradually deformed from a continuous initialization along view-dependent camera rays by differentiable rendering. This enables us to define the radiance field only on a 2D deformable surface rather than in a dense volume of 3D space, leading to a compact representation that maintains complete shape and realistic texture for large-scale complex scenes. We experimentally demonstrate that SurRF produces results competitive with the state of the art on various challenging real-world scenes, without any 3D supervision. Moreover, SurRF shows great potential for combining the advantages of meshes (scene manipulation), continuous surfaces (high geometric resolution), and radiance fields (realistic rendering).
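
A very reduced sketch of the representation: the surface is an explicit depth map over a 2D grid that gets deformed, and radiance is queried only at surface points instead of throughout a dense 3D volume. The deformation and radiance function below are stand-ins, not the paper's learned components.

```python
# "Surface radiance field" in miniature: one radiance sample per ray, taken
# at an explicit, deformable surface (sketch with stand-in functions).
import numpy as np

rng = np.random.default_rng(8)
H, W = 32, 32
depth = np.ones((H, W))                      # continuous surface init
offset = 0.1 * rng.standard_normal((H, W))   # learned deformation (stand-in)
surface_depth = depth + offset               # deformed explicit surface

def radiance(points, viewdir):
    """Stand-in radiance field; a small MLP would live here."""
    return 0.5 + 0.5 * np.sin(points @ np.array([1.0, 2.0, 3.0]) + viewdir)

u, v = np.meshgrid(np.linspace(0, 1, W), np.linspace(0, 1, H))
pts = np.stack([u, v, surface_depth], axis=-1).reshape(-1, 3)
img = radiance(pts, viewdir=0.3).reshape(H, W)   # render: one sample per ray
print(img.shape, img.min().round(2), img.max().round(2))
```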

12.
IEEE Trans Pattern Anal Mach Intell ; 43(3): 902-917, 2021 03.
Article in English | MEDLINE | ID: mdl-31502963

ABSTRACT

Part-level features offer fine granularity for pedestrian image description. In this article, we aim to learn discriminative part-informed features for person re-identification. Our contribution is two-fold. First, we introduce a general part-level feature learning method, the Part-based Convolutional Baseline (PCB). Given an input image, it outputs a convolutional descriptor consisting of several part-level features. PCB is general in that it accommodates several part-partitioning strategies, including pose estimation, human parsing, and uniform partitioning. In experiments, we show that the learned descriptor has significantly higher discriminative ability than the global descriptor. Second, based on PCB, we propose refined part pooling (RPP), which allows the parts to be located more precisely. Our idea is that pixels within a well-located part should be similar to each other while being dissimilar to pixels from other parts; we call this within-part consistency. When a pixel-wise feature vector in a part is more similar to some other part, it is an outlier, indicating inappropriate partitioning. RPP re-assigns these outliers to the parts they are closest to, resulting in refined parts with enhanced within-part consistency. RPP requires no part labels and is trained in a weakly supervised manner. Experiments confirm that RPP allows PCB to gain a further performance boost. For instance, on the Market-1501 dataset, we achieve (77.4+4.2) percent mAP and (92.3+1.5) percent rank-1 accuracy, competitive with the state of the art.
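
The reassignment step of RPP can be sketched as an iterative nearest-part update: each pixel-wise feature is assigned to the part whose mean descriptor it is most similar to. Uniform horizontal stripes give the initial parts; this is a simplified illustration (the actual RPP learns the assignment), with illustrative sizes.

```python
# Refined part pooling in miniature: reassign pixels to their most similar
# part to improve within-part consistency (sketch).
import numpy as np

rng = np.random.default_rng(9)
H, W, C, P = 24, 8, 16, 6
feats = rng.normal(size=(H, W, C))
flat = feats.reshape(-1, C)

stripe = np.repeat(np.arange(P), H // P)         # uniform horizontal stripes
pix_part = np.repeat(stripe, W)                  # initial pixel -> part map
centers = np.stack([flat[pix_part == p].mean(0) for p in range(P)])

for _ in range(3):                               # refinement rounds
    pix_part = (flat @ centers.T).argmax(1)      # reassign by similarity
    for p in range(P):
        members = flat[pix_part == p]
        if len(members):                         # skip empty parts
            centers[p] = members.mean(0)

print(centers.shape)                             # P refined part descriptors
```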

13.
IEEE Trans Pattern Anal Mach Intell ; 42(6): 1424-1438, 2020 Jun.
Article in English | MEDLINE | ID: mdl-30794167

ABSTRACT

We address the problem of weakly supervised object localization, where only image-level annotations are available for training object detectors. Numerous methods have been proposed to tackle this problem by mining object proposals. However, the substantial noise in object proposals causes ambiguity when learning discriminative object models; such approaches are sensitive to model initialization and often converge to undesirable local minima. In this paper, we propose to overcome these drawbacks through progressive representation adaptation with two main steps: 1) classification adaptation and 2) detection adaptation. In classification adaptation, we transfer a pre-trained network to a multi-label classification task for recognizing the presence of a certain object in an image. Through this step, the network learns discriminative representations specific to the object categories of interest. In detection adaptation, we mine class-specific object proposals by exploiting two scoring strategies based on the adapted classification network. Class-specific proposal mining helps remove substantial noise from background clutter and potential confusion from similar objects. We further refine these proposals using multiple instance learning and segmentation cues. Using the refined object bounding boxes, we fine-tune all the layers of the classification network and obtain a fully adapted detection network. We present detailed experimental validation on the PASCAL VOC and ILSVRC datasets. Experimental results demonstrate that our progressive representation adaptation algorithm performs favorably against state-of-the-art methods.
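
The proposal mining step reduces, in outline, to scoring proposals with the adapted classifier and keeping the top candidates as pseudo ground truth. The scorer below is a stub standing in for the fine-tuned network; names and sizes are illustrative.

```python
# Class-specific proposal mining (sketch): score proposals, keep the top-k.
import numpy as np

rng = np.random.default_rng(10)

def score_proposals(proposal_feats):
    """Stand-in for the adapted classification network's per-class confidence;
    a real implementation would run the fine-tuned classifier here."""
    w = rng.normal(size=proposal_feats.shape[1])   # fake class weights
    return proposal_feats @ w

feats = rng.normal(size=(300, 32))                 # features of 300 proposals
scores = score_proposals(feats)
mined = np.argsort(scores)[::-1][:5]               # top-5 become pseudo boxes
print("mined proposal indices:", mined)
```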

14.
Article in English | MEDLINE | ID: mdl-31831414

ABSTRACT

Object detection has been a challenging task in computer vision. Although significant progress has been made in object detection with deep neural networks, the attention mechanism has yet to be fully developed. In this paper, we propose a hybrid attention mechanism for single-stage object detection. First, we present the modules of spatial attention, channel attention and aligned attention for single-stage object detection. In particular, dilated convolution layers with symmetrically fixed rates are stacked to learn spatial attention. A channel attention mechanism with the cross-level group normalization and squeeze-and-excitation operation is proposed. Aligned attention is constructed with organized deformable filters. Second, the three types of attention are unified to construct the hybrid attention mechanism. We then plug the hybrid attention into Retina-Net and propose the efficient single-stage HAR-Net for object detection. The attention modules and the proposed HAR-Net are evaluated on the COCO detection dataset. The experiments demonstrate that hybrid attention can significantly improve the detection accuracy and that the HAR-Net can achieve a state-of-the-art 45.8% mAP, thus outperforming existing single-stage object detectors.
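
A hedged PyTorch sketch of two of the three attention types combined multiplicatively: a stacked dilated-convolution branch for spatial attention and a squeeze-and-excitation branch for channel attention. Aligned attention is omitted (deformable convolution needs extra ops), the layer sizes are illustrative, and this is not the exact HAR-Net architecture.

```python
# Spatial (dilated-conv) + channel (SE) attention, unified by multiplication.
import torch
import torch.nn as nn

class HybridAttention(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.spatial = nn.Sequential(                 # dilated conv stack
            nn.Conv2d(c, c // 4, 3, padding=1, dilation=1), nn.ReLU(),
            nn.Conv2d(c // 4, c // 4, 3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv2d(c // 4, 1, 3, padding=4, dilation=4), nn.Sigmoid())
        self.channel = nn.Sequential(                 # squeeze-and-excitation
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c, c // 4, 1), nn.ReLU(),
            nn.Conv2d(c // 4, c, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.spatial(x) * self.channel(x)

x = torch.randn(2, 64, 32, 32)
print(HybridAttention(64)(x).shape)    # torch.Size([2, 64, 32, 32])
```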

16.
IEEE Trans Image Process ; 26(7): 3128-3141, 2017 Jul.
Article in English | MEDLINE | ID: mdl-28141521

ABSTRACT

Recently, feature fusion has demonstrated its effectiveness in image search. However, bad features and inappropriate parameters usually introduce false positive images, i.e., outliers, leading to inferior performance. A major challenge for any fusion scheme is therefore robustness to outliers. Towards this goal, this paper proposes a rank-level framework for robust feature fusion. First, we define the Rank Distance to measure the relevance of images at the rank level. Based on it, Bayes similarity is introduced to evaluate the retrieval quality of individual features, through which true matches tend to obtain higher weights than outliers. Then, we construct a directed ImageGraph to encode the relationships between images: each image is connected to its K nearest neighbors by edges weighted with Bayes similarity. Multiple rank lists resulting from different methods are merged via the ImageGraph. Furthermore, local ranking is performed on the fused ImageGraph to re-order the initial rank lists; it aims at local optimization and is thus more robust to global outliers. Extensive experiments on four benchmark data sets validate the effectiveness of our method. The proposed method outperforms two popular fusion schemes, and the results are competitive with the state of the art.
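
A heavily simplified sketch of rank-level fusion: each feature's rank list is merged with a per-list quality weight (standing in for the Bayes similarity) via reciprocal-rank scores. The graph construction and local ranking steps are omitted; all values are illustrative.

```python
# Quality-weighted reciprocal-rank fusion of multiple rank lists (sketch).
import numpy as np

def fuse(rank_lists, weights, n_items):
    score = np.zeros(n_items)
    for ranks, w in zip(rank_lists, weights):
        for pos, item in enumerate(ranks):
            score[item] += w / (pos + 1)       # weighted reciprocal rank
    return np.argsort(score)[::-1]             # fused ranking, best first

lists = [[3, 0, 2, 1, 4], [0, 3, 4, 2, 1]]     # two methods' rank lists
quality = [0.9, 0.4]                           # estimated retrieval quality
print(fuse(lists, quality, n_items=5))
```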

17.
IEEE Trans Pattern Anal Mach Intell ; 38(12): 2501-2514, 2016 12.
Article in English | MEDLINE | ID: mdl-26829777

ABSTRACT

Current person re-identification (ReID) methods typically rely on single-frame imagery features, while ignoring the space-time information in image sequences that is often available in practical surveillance scenarios. Single-frame (single-shot) visual appearance matching is inherently limited for person ReID in public spaces due to the visual ambiguity and uncertainty arising from non-overlapping camera views, where changes in viewing conditions can cause significant variations in people's appearance. In this work, we present a novel model that automatically selects the most discriminative video fragments from noisy/incomplete image sequences of people, from which reliable space-time and appearance features can be computed, while simultaneously learning a video ranking function for person ReID. Using the PRID 2011, iLIDS-VID, and HDA+ image sequence datasets, we conducted extensive comparative evaluations to demonstrate the advantages of the proposed model over contemporary gait recognition, holistic image sequence matching, and state-of-the-art single-/multi-shot ReID methods.
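
A toy sketch of fragment selection: split a sequence into short fragments and keep the one with the highest motion-energy score. This is a crude stand-in for the paper's learned discriminative selection; all values are synthetic.

```python
# Select the most "active" fragment of an image sequence (sketch).
import numpy as np

rng = np.random.default_rng(11)
seq = rng.random((60, 32, 32))                     # 60-frame sequence
frag_len = 10
frags = seq.reshape(-1, frag_len, 32, 32)          # six 10-frame fragments

energy = np.abs(np.diff(frags, axis=1)).mean(axis=(1, 2, 3))
best = int(energy.argmax())
print(f"selected fragment {best} (energy {energy[best]:.4f})")
```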


Subject(s)
Algorithms , Biometric Identification/methods , Discriminant Analysis , Image Interpretation, Computer-Assisted/methods , Pattern Recognition, Automated/methods , Photography/methods , Video Recording/methods , Humans
18.
IEEE Trans Neural Netw Learn Syst ; 27(12): 2740-2747, 2016 12.
Article in English | MEDLINE | ID: mdl-26600377

ABSTRACT

We propose a Boosting approach for multi-instance (MI) classification. The Lp-norm is integrated to localize the witness instances and formulate the bag scores from classifier outputs. The contributions are twofold. First, a flexible and concise model for Boosting is proposed via Lp-norm localization and exponential loss optimization. The scores for bag-level classification are fused directly from the instance feature space without probabilistic assumptions. Second, gradient and Newton descent optimizations are applied to derive the weak learners for Boosting. In particular, instance correlations are exploited by fitting the weights and Newton updates for weak learner construction. The final Boosted classifiers are sums of iteratively chosen weak learners. Experiments demonstrate that the proposed Lp-norm-localized Boosting approach significantly improves MI classification performance. Compared with the state of the art, the approach achieves the highest MI classification accuracy on 7/10 benchmark data sets.
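
The Lp-norm localization idea can be illustrated with a generalized-mean pooling of instance scores into a bag score: large p approaches max-pooling (a single witness dominates), while p = 1 is mean-pooling. This is a sketch of the pooling mechanism only, not the full Boosting model.

```python
# Lp (generalized-mean) pooling of instance scores into a bag score (sketch).
import numpy as np

def bag_score(instance_scores, p=4.0):
    s = np.maximum(instance_scores, 1e-12)         # assume positive scores
    return (np.mean(s ** p)) ** (1.0 / p)

scores = np.array([0.1, 0.2, 0.15, 0.9])           # one likely witness
for p in (1.0, 4.0, 16.0):
    print(f"p={p:>4}: bag score {bag_score(scores, p):.3f}")
```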

19.
IEEE Trans Image Process ; 23(8): 3604-17, 2014 Aug.
Article in English | MEDLINE | ID: mdl-24919200

ABSTRACT

The inverse document frequency (IDF) is prevalently utilized in bag-of-words-based image retrieval. The basic idea is to assign less weight to terms with high frequency, and vice versa. However, in the conventional IDF routine, the estimation of visual word frequency is coarse and heuristic, so its effectiveness is largely compromised and far from optimal. To address this problem, this paper introduces a novel IDF family based on the Lp-norm pooling technique. The proposed IDF takes into account the term frequency, the document frequency, the complexity of images, and the codebook information. We further propose a parameter tuning strategy that helps produce an optimal balance between TF and pIDF weights, yielding the so-called Lp-norm IDF (pIDF). We show that the conventional IDF is a special case of our generalized version, and that two novel IDFs, the average IDF and the max IDF, can also be defined from the pIDF concept. Moreover, by accounting for the term frequency in each image, the proposed pIDF helps alleviate the visual word burstiness phenomenon. Our method is evaluated through extensive experiments on four benchmark data sets (Oxford 5K, Paris 6K, Holidays, and Ukbench). The pIDF works well on large-scale databases and when the codebook is trained on irrelevant data; we report a mean average precision improvement of as much as +13.0% over the baseline TF-IDF approach on a 1M data set. In addition, the pIDF has a wide application scope, from buildings to general objects and scenes, and when combined with post-processing steps it achieves results competitive with state-of-the-art methods. Since the pIDF is computed offline, it introduces no extra computation or memory cost to the online system.
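
To convey the flavor of Lp-pooled IDF, here is one plausible instantiation: per-image term frequencies are Lp-pooled before the log, so p modulates how strongly within-image frequency influences the weight. Treat this exact formula as an illustrative assumption, not the paper's definition.

```python
# An Lp-pooled variant of IDF versus the conventional document-count IDF
# (sketch; the exact pIDF formula is an assumption here).
import numpy as np

def p_idf(tf_matrix, p=2.0, eps=1e-9):
    """tf_matrix: (n_images, n_words) visual-word counts."""
    pooled = (np.sum(tf_matrix ** p, axis=0)) ** (1.0 / p)
    return np.log(tf_matrix.shape[0] / (pooled + eps))

tf = np.array([[3, 0, 1],
               [0, 0, 1],
               [5, 1, 1]], dtype=float)
print("conventional IDF:", np.log(3 / (tf > 0).sum(0)))
print("pIDF (p=2):      ", p_idf(tf, p=2.0).round(3))  # values illustrative
```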


Subject(s)
Algorithms , Databases, Factual , Image Interpretation, Computer-Assisted/methods , Information Storage and Retrieval/methods , Pattern Recognition, Automated/methods , Subtraction Technique , Computer Simulation , Image Enhancement/methods , Models, Statistical , Reproducibility of Results , Sensitivity and Specificity
20.
IEEE Trans Image Process ; 23(8): 3368-80, 2014 Aug.
Article in English | MEDLINE | ID: mdl-24951697

ABSTRACT

Visual matching is a crucial step in image retrieval based on the bag-of-words (BoW) model. In the baseline method, two keypoints are considered a matching pair if their SIFT descriptors are quantized to the same visual word. However, the SIFT visual word has two limitations: it loses most of its discriminative power during quantization, and SIFT describes only the local texture feature. Both drawbacks impair the discriminative power of the BoW model and lead to false positive matches. To tackle this problem, this paper proposes to embed multiple binary features at the indexing level. To model the correlation between features, a multi-IDF scheme is introduced, through which different binary features are coupled into the inverted file. We show that matching verification methods based on binary features, such as Hamming embedding, can be effectively incorporated into our framework. As an extension, we explore the fusion of a binary color feature into image retrieval. The joint integration of the SIFT visual word and binary features greatly enhances the precision of visual matching, reducing the impact of false positive matches. Our method is evaluated through extensive experiments on four benchmark datasets (Ukbench, Holidays, DupImage, and MIR Flickr 1M). We show that our method significantly improves the baseline approach. In addition, large-scale experiments indicate that the proposed method requires acceptable memory usage and query time compared with other approaches. Further, when the global color feature is integrated, our method yields performance competitive with the state of the art.
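
The verification idea behind Hamming-embedding-style matching is compact: two keypoints quantized to the same visual word only count as a match if the Hamming distance between their binary signatures is below a threshold. The signature length and threshold below are illustrative.

```python
# Visual-word match verified by a binary signature (sketch).
import numpy as np

rng = np.random.default_rng(12)

def is_match(word_a, word_b, sig_a, sig_b, max_hamming=12):
    if word_a != word_b:                      # coarse filter: visual word
        return False
    return np.count_nonzero(sig_a != sig_b) <= max_hamming  # fine filter

sig1 = rng.integers(0, 2, 64)                 # 64-bit binary signatures
sig2 = sig1.copy(); sig2[:10] ^= 1            # 10 flipped bits
print(is_match(17, 17, sig1, sig2))           # True: same word, close bits
print(is_match(17, 17, sig1, rng.integers(0, 2, 64)))  # likely False
```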


Subject(s)
Algorithms , Image Interpretation, Computer-Assisted/methods , Pattern Recognition, Automated/methods , Photography/methods , Subtraction Technique , Artificial Intelligence , Image Enhancement/methods , Information Storage and Retrieval/methods , Models, Biological , Models, Statistical , Reproducibility of Results , Sensitivity and Specificity