Results 1 - 20 of 72
1.
Entropy (Basel) ; 23(7), 2021 Jun 26.
Article in English | MEDLINE | ID: mdl-34206941

ABSTRACT

Diabetic retinopathy (DR) is a common complication of diabetes mellitus (DM), and it is necessary to diagnose DR in the early stages of treatment. With the rapid development of convolutional neural networks in image processing, deep learning methods have achieved great success in medical image processing. Various medical lesion detection systems have been proposed to detect fundus lesions. At present, the image classification of diabetic retinopathy ignores the fine-grained properties of the diseased image, and most retinopathy image data sets have severely uneven distributions, which greatly limits the ability of the network to predict the classification of lesions. We propose a new non-homologous bilinear pooling convolutional neural network model and combine it with the attention mechanism to further improve the network's ability to extract specific features of the image. The experimental results show that, compared with the most popular fundus image classification models, the proposed network model can greatly improve prediction accuracy while maintaining computational efficiency.
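
Editorial sketch: the abstract does not spell out its pooling layer, so the following only illustrates generic bilinear pooling over two feature streams (outer product per spatial location, averaged, then signed square root and L2 normalization). The function name `bilinear_pool`, the input layout, and the normalization steps are assumptions for illustration, not the authors' implementation.

```python
import math

def bilinear_pool(feats_a, feats_b):
    """Generic bilinear pooling (illustrative, not the paper's exact layer).

    feats_a, feats_b: one feature vector per spatial location, coming from
    two different (hypothetical) feature extractors.
    """
    if len(feats_a) != len(feats_b) or not feats_a:
        raise ValueError("both streams need the same, non-zero number of locations")
    ca, cb = len(feats_a[0]), len(feats_b[0])
    pooled = [0.0] * (ca * cb)
    # Sum the outer product of the two feature vectors over all locations.
    for fa, fb in zip(feats_a, feats_b):
        for i, x in enumerate(fa):
            for j, y in enumerate(fb):
                pooled[i * cb + j] += x * y
    n = len(feats_a)
    pooled = [v / n for v in pooled]
    # Signed square root followed by L2 normalization.
    pooled = [math.copysign(math.sqrt(abs(v)), v) for v in pooled]
    norm = math.sqrt(sum(v * v for v in pooled))
    return [v / norm for v in pooled] if norm > 0 else pooled

# Toy input: two locations, stream A has 2 channels, stream B has 1 channel.
descriptor = bilinear_pool([[1.0, 0.0], [0.0, 1.0]], [[2.0], [0.0]])
```

The descriptor has one entry per channel pair, which is how bilinear pooling captures pairwise feature interactions that can help with fine-grained classes.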

2.
Soft Matter ; 2020 Sep 10.
Article in English | MEDLINE | ID: mdl-32909580

ABSTRACT

This work investigated the crystalline forms obtained from melt crystallization of the isotactic polybutene-1 (iPB-1) homopolymer by manipulating the temperature at which samples were melted (Tmelt) and the crystallization pressure (Pcry). Under atmospheric conditions, the molten sample crystallized into pure form II, and the crystallization temperature and kinetics were clearly affected by Tmelt. Under high pressure, by contrast, the melted sample crystallized into form II or I', depending on Tmelt and Pcry. The content of form I' decreases with increasing Tmelt or decreasing Pcry. Meanwhile, the critical pressure for the formation of pure form I' increases with increasing Tmelt. The formation of form I' is attributed to two factors: the memory effect of the melt, which preserved some ordered crystal sequences, and the high pressure (Pcry), which suppressed the nucleation and growth of the kinetically favored form II. In addition, the melt-crystallized form II transforms to form I under high pressure conditions; thus forms I, I' and II are observed. The relative contents of the three crystalline forms for samples at different Tmelt and Pcry are obtained in this work. The results show that the crystalline forms in melt crystallization of iPB-1 can be customized by regulating the melt state and crystallization conditions.

3.
BMC Urol ; 20(1): 95, 2020 Jul 11.
Article in English | MEDLINE | ID: mdl-32652967

ABSTRACT

BACKGROUND: Horseshoe kidney (HSK) is a common renal fusion anomaly, occurring in about 1 in 400-600 individuals. In addition, the incidence of duplicated collecting system is about 0.8%. CASE PRESENTATION: This report documents an extremely rare case, which was treated by multiple procedures in the same operative session to accomplish laparoscopic amputation of the HSK isthmus, resection of duplicate kidney and ureteroscopic lithotripsy. CONCLUSION: Results showed that minimally invasive surgery with use of multiple endoscopes may be a feasible choice for this patient population with complicated comorbid renal conditions.


Subject(s)
Abnormalities, Multiple/surgery; Kidney/abnormalities; Kidney/surgery; Laparoscopy; Ureteral Calculi/surgery; Ureteroscopy; Adult; Combined Modality Therapy; Fused Kidney/complications; Humans; Male; Ureteral Calculi/complications
4.
Lasers Med Sci ; 35(5): 1025-1034, 2020 Jul.
Article in English | MEDLINE | ID: mdl-32006262

ABSTRACT

To explore the advantages and limitations of holmium laser resection of the bladder tumor (HOLRBT) versus standard transurethral resection of the bladder tumor (TURBT) in the treatment of non-muscle-invasive bladder cancer (NMIBC), eligible studies were selected from the following databases: PubMed, Cochrane Library, and Embase. Studies comparing HOLRBT and TURBT for patients with NMIBC were included. The outcomes of interest were time of operation, catheterization and hospitalization, rates of recurrence, and perioperative complications, including obturator nerve reflex, bladder perforation, bladder irritation, and urethral stricture. All data were compared and analyzed with Review Manager 5.3. A total of 9 comparative studies were included in the analysis. Pooled data demonstrated that, compared with TURBT, HOLRBT significantly reduced the time of catheterization and hospitalization, the rate of recurrence at 2 years of follow-up, obturator nerve reflex, bladder perforation, and bladder irritation. However, no significant difference was found between HOLRBT and TURBT in the time of operation, the rate of recurrence at 1 year of follow-up, or urethral stricture. These results suggest that HOLRBT would be a better choice than TURBT for patients with NMIBC.


Subject(s)
Lasers, Solid-State/therapeutic use; Muscles/pathology; Urinary Bladder Neoplasms/surgery; Urologic Surgical Procedures; Aged; Catheterization; Female; Hospitalization; Humans; Male; Middle Aged; Neoplasm Recurrence, Local/pathology; Postoperative Complications/etiology; Publication Bias; Treatment Outcome; Urethral Stricture/surgery; Urinary Bladder Neoplasms/pathology
5.
Biochem Biophys Res Commun ; 459(4): 713-9, 2015 Apr 17.
Article in English | MEDLINE | ID: mdl-25778869

ABSTRACT

Cadmium (Cd) is known to induce hepatotoxicity, yet the underlying mechanism is not fully understood. In this study, Cd-induced apoptosis was demonstrated in rat liver cells (BRL 3A) with apoptotic nuclear morphological changes and a decrease in cell index (CI) in a time- and concentration-dependent manner. The role of gap junctional intercellular communication (GJIC) and autophagy in Cd-induced apoptosis was investigated. Cd significantly induced GJIC inhibition as well as downregulation of connexin 43 (Cx43). The prototypical gap junction blocker carbenoxolone disodium (CBX) exacerbated the Cd-induced decrease in CI. Cd treatment was also found to cause autophagy, with an increase in mRNA expression of autophagy-related genes Atg-5, Atg-7, Beclin-1, and microtubule-associated protein light chain 3 (LC3) conversion from cytosolic LC3-I to membrane-bound LC3-II. The autophagic inducer rapamycin (RAP) prevented the Cd-induced CI decrease, while the autophagic inhibitor chloroquine (CQ) caused a further reduction in CI. In addition, CBX promoted Cd-induced autophagy, as well as changes in expression of Atg-5, Atg-7, Beclin-1 and LC3. CQ was found to block the Cd-induced decrease in Cx43 and GJIC inhibition, whereas RAP had the opposite effect. These results demonstrate that autophagy plays a protective role during Cd-induced apoptosis in BRL 3A cells over the 6 h of the experiment, while autophagy exacerbates Cd-induced GJIC inhibition, which has a negative effect on cellular fate.


Subject(s)
Apoptosis/drug effects; Autophagy/drug effects; Cadmium/toxicity; Cell Communication; Gap Junctions/physiology; Liver/drug effects; Animals; Blotting, Western; Liver/cytology; Rats; Reverse Transcriptase Polymerase Chain Reaction
6.
IEEE Trans Image Process ; 33: 3470-3485, 2024.
Article in English | MEDLINE | ID: mdl-38809731

ABSTRACT

Recent years have witnessed the incredible performance boost of data-driven deep visual object trackers. Despite the success, these trackers require millions of sequential manual labels on videos for supervised training, which implies a heavy human annotation burden. This raises a crucial question: how can a powerful tracker be trained from abundant videos using limited manual annotations? In this paper, we challenge the conventional belief that frame-by-frame labeling is indispensable, and show that providing a small number of annotated bounding boxes in each video is sufficient for training a strong tracker. To facilitate that, we design a novel SParsely-supervised Object Tracking (SPOT) framework. It regards the sparsely annotated boxes as anchors and progressively explores the temporal span to discover unlabeled target snapshots. Under the teacher-student paradigm, SPOT leverages the unique transitive consistency inherent in the tracking task as supervision, extracting knowledge from both anchor snapshots and unlabeled target snapshots. We also utilize several effective training strategies, i.e., IoU filtering, asymmetric augmentation, and temporal calibration, to further improve the training robustness of SPOT. The experimental results demonstrate that, given fewer than 5 labels for each video, trackers trained via SPOT perform on par with their fully-supervised counterparts. Moreover, SPOT exhibits two desirable properties: 1) SPOT enables us to fully exploit large-scale video datasets by efficiently allocating sparse labels to more videos even under a limited labeling budget; 2) when equipped with a target discovery module, SPOT can even learn from purely unlabeled videos for performance gain. We hope this work could inspire the community to rethink the current annotation principles and make a step towards practical label-efficient deep tracking.
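
Editorial sketch: the abstract names IoU filtering without describing it. The code below only shows how intersection over union (IoU) between two axis-aligned boxes is computed and how a threshold could be used to discard unreliable box pairs. The pairing of teacher and student boxes, the helper names, and the 0.5 threshold are assumptions for illustration, not details taken from SPOT.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def filter_by_iou(teacher_boxes, student_boxes, threshold=0.5):
    """Keep only the pairs whose boxes agree (hypothetical filtering rule)."""
    return [
        (t, s)
        for t, s in zip(teacher_boxes, student_boxes)
        if iou(t, s) >= threshold
    ]

kept = filter_by_iou(
    [(0, 0, 2, 2), (0, 0, 2, 2)],
    [(0, 0, 2, 2), (1, 1, 3, 3)],
)
```

In this toy call the first pair overlaps perfectly and is kept, while the second pair has an IoU of 1/7 and is dropped.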

7.
IEEE Trans Image Process ; 33: 1898-1910, 2024.
Article in English | MEDLINE | ID: mdl-38451761

ABSTRACT

In this paper, we present a simple yet effective continual learning method for blind image quality assessment (BIQA) with improved quality prediction accuracy, plasticity-stability trade-off, and task-order/-length robustness. The key step in our approach is to freeze all convolution filters of a pre-trained deep neural network (DNN) for an explicit promise of stability, and learn task-specific normalization parameters for plasticity. We assign each new IQA dataset (i.e., task) a prediction head, and load the corresponding normalization parameters to produce a quality score. The final quality estimate is computed by a weighted summation of predictions from all heads with a lightweight K-means gating mechanism. Extensive experiments on six IQA datasets demonstrate the advantages of the proposed method in comparison to previous training techniques for BIQA.
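
Editorial sketch: the abstract describes a weighted summation of per-head predictions with a K-means gate but gives no formula. One plausible reading, assumed here, is that each task keeps a few K-means centroids of its training features, and a test feature is weighted toward the tasks whose centroids lie closest. The softmax over negative distances, the `temperature` parameter, and all names are hypothetical.

```python
import math

def gating_weights(feature, centroids_per_task, temperature=1.0):
    """Softmax over negative distance to each task's nearest centroid (assumed rule)."""
    dists = [
        min(math.dist(feature, c) for c in centroids)
        for centroids in centroids_per_task
    ]
    logits = [-d / temperature for d in dists]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gated_quality(head_scores, weights):
    """Final quality estimate as a weighted sum of the per-head predictions."""
    return sum(w * s for w, s in zip(weights, head_scores))

# Two tasks whose centroids are equally far from the test feature.
weights = gating_weights([0.0, 0.0], [[[1.0, 0.0]], [[0.0, 1.0]]])
score = gated_quality([2.0, 4.0], weights)
```

With equal distances the gate splits evenly, so the estimate is the plain mean of the two head scores.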

8.
IEEE Trans Pattern Anal Mach Intell ; 46(5): 2788-2803, 2024 May.
Article in English | MEDLINE | ID: mdl-37999968

ABSTRACT

World models learn the consequences of actions in vision-based interactive systems. However, in practical scenarios like autonomous driving, noncontrollable dynamics that are independent or sparsely dependent on action signals often exist, making it challenging to learn effective world models. To address this issue, we propose Iso-Dream++, a model-based reinforcement learning approach that has two main contributions. First, we optimize the inverse dynamics to encourage the world model to isolate controllable state transitions from the mixed spatiotemporal variations of the environment. Second, we perform policy optimization based on the decoupled latent imaginations, where we roll out noncontrollable states into the future and adaptively associate them with the current controllable state. This enables long-horizon visuomotor control tasks to benefit from isolating mixed dynamics sources in the wild, such as self-driving cars that can anticipate the movement of other vehicles, thereby avoiding potential risks. On top of our previous work (Pan et al. 2022), we further consider the sparse dependencies between controllable and noncontrollable states, address the training collapse problem of state decoupling, and validate our approach in transfer learning setups. Our empirical study demonstrates that Iso-Dream++ outperforms existing reinforcement learning models significantly on CARLA and DeepMind Control.

9.
IEEE Trans Med Imaging ; PP, 2024 Apr 16.
Article in English | MEDLINE | ID: mdl-38625765

ABSTRACT

Intraoperative imaging techniques for reconstructing deformable tissues in vivo are pivotal for advanced surgical systems. Existing methods either compromise on rendering quality or are excessively computationally intensive, often demanding dozens of hours to perform, which significantly hinders their practical application. In this paper, we introduce Fast Orthogonal Plane (Forplane), a novel, efficient framework based on neural radiance fields (NeRF) for the reconstruction of deformable tissues. We conceptualize surgical procedures as 4D volumes, and break them down into static and dynamic fields comprised of orthogonal neural planes. This factorization discretizes the four-dimensional space, leading to a decreased memory usage and faster optimization. A spatiotemporal importance sampling scheme is introduced to improve performance in regions with tool occlusion as well as large motions and accelerate training. An efficient ray marching method is applied to skip sampling among empty regions, significantly improving inference speed. Forplane accommodates both binocular and monocular endoscopy videos, demonstrating its extensive applicability and flexibility. Our experiments, carried out on two in vivo datasets, the EndoNeRF and Hamlyn datasets, demonstrate the effectiveness of our framework. In all cases, Forplane substantially accelerates both the optimization process (by over 100 times) and the inference process (by over 15 times) while maintaining or even improving the quality across a variety of non-rigid deformations. This significant performance improvement promises to be a valuable asset for future intraoperative surgical applications. The code of our project is now available at https://github.com/Loping151/ForPlane.

10.
Patterns (N Y) ; 5(3): 100929, 2024 Mar 08.
Article in English | MEDLINE | ID: mdl-38487802

ABSTRACT

We described a challenge named "DRAC - Diabetic Retinopathy Analysis Challenge" in conjunction with the 25th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2022). Within this challenge, we provided the DRAC dataset, an ultra-wide optical coherence tomography angiography (UW-OCTA) dataset (1,103 images), addressing three primary clinical tasks: diabetic retinopathy (DR) lesion segmentation, image quality assessment, and DR grading. The scientific community responded positively to the challenge, with 11, 12, and 13 teams submitting different solutions for these three tasks, respectively. This paper presents a concise summary and analysis of the top-performing solutions and results across all challenge tasks. These solutions could provide practical guidance for developing accurate classification and segmentation models for image quality assessment and DR diagnosis using UW-OCTA images, potentially improving the diagnostic capabilities of healthcare professionals. The dataset has been released to support the development of computer-aided diagnostic systems for DR evaluation.

11.
Nat Med ; 30(2): 584-594, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38177850

ABSTRACT

Diabetic retinopathy (DR) is the leading cause of preventable blindness worldwide. The risk of DR progression is highly variable among different individuals, making it difficult to predict risk and personalize screening intervals. We developed and validated a deep learning system (DeepDR Plus) to predict time to DR progression within 5 years solely from fundus images. First, we used 717,308 fundus images from 179,327 participants with diabetes to pretrain the system. Subsequently, we trained and validated the system with a multiethnic dataset comprising 118,868 images from 29,868 participants with diabetes. For predicting time to DR progression, the system achieved concordance indexes of 0.754-0.846 and integrated Brier scores of 0.153-0.241 for all times up to 5 years. Furthermore, we validated the system in real-world cohorts of participants with diabetes. The integration with clinical workflow could potentially extend the mean screening interval from 12 months to 31.97 months, and the percentage of participants recommended to be screened at 1-5 years was 30.62%, 20.00%, 19.63%, 11.85% and 17.89%, respectively, while delayed detection of progression to vision-threatening DR was 0.18%. Altogether, the DeepDR Plus system could predict individualized risk and time to DR progression over 5 years, potentially allowing personalized screening intervals.
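
Editorial sketch: the reported mean screening interval can be checked against the reported percentages. The code assumes that "respectively" maps the five percentages to recommended intervals of 1, 2, 3, 4, and 5 years, and it renormalizes because the published shares sum to 99.99% after rounding. The names `shares_percent` and `mean_interval_months` are illustrative.

```python
# Share of participants (in percent) recommended for each interval in years,
# copied from the abstract under the assumed 1..5 year mapping.
shares_percent = {1: 30.62, 2: 20.00, 3: 19.63, 4: 11.85, 5: 17.89}

def mean_interval_months(shares):
    """Weighted mean interval in months, renormalized by the total share."""
    total = sum(shares.values())
    mean_years = sum(years * share for years, share in shares.items()) / total
    return 12.0 * mean_years

mean_months = mean_interval_months(shares_percent)
```

Under these assumptions the weighted mean comes out at about 31.97 months, which matches the figure quoted in the abstract.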


Subject(s)
Deep Learning; Diabetes Mellitus; Diabetic Retinopathy; Humans; Diabetic Retinopathy/diagnosis; Blindness
12.
IEEE Trans Pattern Anal Mach Intell ; 45(6): 6984-7000, 2023 Jun.
Article in English | MEDLINE | ID: mdl-32750800

ABSTRACT

Graph matching aims to establish node correspondence between two graphs, and it has long been a fundamental problem due to its NP-hard nature. One practical consideration is the effective modeling of the affinity function in the presence of noise, such that the mathematically optimal matching result is also physically meaningful. This paper resorts to deep neural networks to learn the node and edge features, as well as the affinity model for graph matching, in an end-to-end fashion. The learning is supervised by a combinatorial permutation loss over nodes. Specifically, the parameters belong to convolutional neural networks for image feature extraction, graph neural networks for node embedding that convert the structural (beyond second-order) information into node-wise features that lead to a linear assignment problem, as well as the affinity kernel between two graphs. Our approach enjoys flexibility in that the permutation loss is agnostic to the number of nodes, and the embedding model is shared among nodes such that the network can deal with varying numbers of nodes for both training and inference. Moreover, our network is class-agnostic. Experimental results on extensive benchmarks show its state-of-the-art performance. It bears some generalization capability across categories and datasets, and is capable of robust matching against outliers.
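
Editorial sketch: the abstract does not define its permutation loss. One common form, assumed here, is an element-wise binary cross-entropy between a predicted soft matching matrix and the ground-truth permutation matrix. The function name and the clamping constant `eps` are illustrative choices.

```python
import math

def permutation_loss(pred, target, eps=1e-12):
    """Element-wise binary cross-entropy between a soft matching and a permutation.

    pred:   matrix with entries in [0, 1] (e.g. a doubly stochastic matrix)
    target: 0/1 ground-truth permutation matrix of the same shape
    """
    loss = 0.0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
            loss -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return loss

identity = [[1, 0], [0, 1]]
uncertain = permutation_loss([[0.5, 0.5], [0.5, 0.5]], identity)
confident = permutation_loss([[0.9, 0.1], [0.1, 0.9]], identity)
```

Because the loss is a sum over matrix entries, it is defined for any number of nodes, which is consistent with the abstract's remark that the loss is agnostic to graph size.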

13.
IEEE Trans Pattern Anal Mach Intell ; 45(6): 6940-6954, 2023 Jun.
Article in English | MEDLINE | ID: mdl-33085614

ABSTRACT

Capturing the interactions of human articulations lies at the center of skeleton-based action recognition. Recent graph-based methods are inherently limited by weak spatial context modeling, which results from the fixed interaction pattern and inflexible shared weights of GCN. To address these problems, we propose the multi-view interactional graph network (MV-IGNet), which can construct, learn and infer multi-level spatial skeleton context, including view-level (global), group-level, and joint-level (local) context, in a unified way. MV-IGNet leverages different skeleton topologies as multi-views to cooperatively generate complementary action features. For each view, separable parametric graph convolution (SPG-Conv) enables multiple parameterized graphs to enrich local interaction patterns, which provides strong graph-adaption ability to handle irregular skeleton topologies. We also partition the skeleton into several groups, and the higher-level group contexts, including inter-group and intra-group contexts, are then hierarchically captured by the above SPG-Conv layers. A simple yet effective global context adaption (GCA) module facilitates representative feature extraction by learning the input-dependent skeleton topologies. Compared to mainstream works, MV-IGNet can be readily implemented with a smaller model size and faster inference. Experimental results show the proposed MV-IGNet achieves impressive performance on large-scale benchmarks: NTU-RGB+D and NTU-RGB+D 120.

14.
IEEE Trans Pattern Anal Mach Intell ; 45(8): 10500-10518, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37030721

ABSTRACT

Graph matching (GM) has been a long-standing combinatorial problem due to its NP-hard nature. Recently, (deep) learning-based approaches have shown their superiority over traditional solvers, but these methods are mostly based on supervised learning, which can be expensive or even impractical. We develop a unified unsupervised framework that extends from matching two graphs to matching multiple graphs, without correspondence ground truth for training. Specifically, a Siamese-style unsupervised learning framework is devised and trained by minimizing the discrepancy between a second-order classic solver and a first-order (differentiable) Sinkhorn net, which serve as two branches for matching prediction. The two branches share the same CNN backbone for visual graph matching. Our framework further allows unsupervised learning with graphs from a mixture of modes, which is ubiquitous in reality. Specifically, we develop and unify the graduated assignment (GA) strategy for matching two graphs, multiple graphs, and graphs from a mixture of modes, whereby the two-way constraint and the clustering confidence (for the mixture case) are modulated by two separate annealing parameters, respectively. Moreover, for partial and outlier matching, an adaptive reweighting technique is developed to suppress the overmatching issue. Experimental results on real-world benchmarks, including natural image matching, show that our unsupervised method performs comparably to, and sometimes better than, two-graph-based supervised approaches.
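
Editorial sketch: the Sinkhorn branch mentioned in the abstract builds on Sinkhorn normalization, which alternately rescales rows and columns of a positive matrix until it is approximately doubly stochastic. The code below is a generic version for square score matrices; the temperature `tau`, the iteration count, and the toy scores are assumptions, not settings from the paper.

```python
import math

def sinkhorn(scores, tau=1.0, n_iters=50):
    """Map a square score matrix to an approximately doubly stochastic matrix."""
    n = len(scores)
    top = max(max(row) for row in scores)  # shift for numerical stability
    mat = [[math.exp((s - top) / tau) for s in row] for row in scores]
    for _ in range(n_iters):
        # Row normalization.
        row_sums = [sum(row) for row in mat]
        mat = [[mat[i][j] / row_sums[i] for j in range(n)] for i in range(n)]
        # Column normalization.
        col_sums = [sum(mat[i][j] for i in range(n)) for j in range(n)]
        mat = [[mat[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return mat

soft = sinkhorn([[2.0, 0.0], [0.0, 2.0]], tau=0.5)
matching = [row.index(max(row)) for row in soft]  # greedy read-out per row
```

Every step is a differentiable elementwise or normalization operation, which is why such a layer can sit inside a network trained by gradient descent. A lower `tau` pushes the output closer to a hard permutation.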


Subject(s)
Algorithms; Unsupervised Machine Learning; Cluster Analysis
15.
IEEE Trans Neural Netw Learn Syst ; 34(11): 8566-8578, 2023 Nov.
Article in English | MEDLINE | ID: mdl-35226610

ABSTRACT

Mesh is a type of data structure commonly used for 3-D shapes. Representation learning for 3-D meshes is essential in many computer vision and graphics applications. The recent success of convolutional neural networks (CNNs) for structured data (e.g., images) suggests the value of adapting insights from CNNs for 3-D shapes. However, 3-D shape data are irregular since each node's neighbors are unordered. Various graph neural networks for 3-D shapes have been developed with isotropic filters or predefined local coordinate systems to overcome the node inconsistency on graphs. However, isotropic filters or predefined local coordinate systems limit the representation power. In this article, we propose a local structure-aware anisotropic convolutional operation (LSA-Conv) that learns adaptive weighting matrices for each template's node according to its neighboring structure and applies shared anisotropic filters. In fact, the learnable weighting matrix is similar to the attention matrix in the random synthesizer, a new Transformer model for natural language processing (NLP). Since the learnable weighting matrices require a large number of parameters for high-resolution 3-D shapes, we introduce a matrix factorization technique to notably reduce the parameter size, denoted as LSA-small. Furthermore, a residual connection with a linear transformation is introduced to improve the performance of our LSA-Conv. Comprehensive experiments demonstrate that our model produces significant improvement in 3-D shape reconstruction compared to state-of-the-art methods.
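
Editorial sketch: the abstract says a matrix factorization reduces the parameter count but gives no shapes. The arithmetic below shows the generic effect of replacing a dense rows x cols matrix with the product of a rows x rank and a rank x cols factor. The sizes (1024, 1024, rank 16) are made-up illustration values, not figures from the paper.

```python
def full_params(rows, cols):
    """Parameters in a dense rows x cols weighting matrix."""
    return rows * cols

def factorized_params(rows, cols, rank):
    """Parameters when the matrix is stored as (rows x rank) @ (rank x cols)."""
    return rank * (rows + cols)

# Hypothetical sizes chosen only to make the saving visible.
dense = full_params(1024, 1024)
low_rank = factorized_params(1024, 1024, rank=16)
saving = dense / low_rank
```

The factorized form is smaller whenever the rank is below rows * cols / (rows + cols), at the cost of restricting the matrix to that rank.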

16.
IEEE Trans Pattern Anal Mach Intell ; 45(2): 2384-2399, 2023 Feb.
Article in English | MEDLINE | ID: mdl-35412976

ABSTRACT

Small and cluttered objects are common in the real world and are challenging to detect. The difficulty is more pronounced when the objects are rotated, as traditional detectors routinely locate objects with horizontal bounding boxes, so that the region of interest is contaminated with background or nearby interleaved objects. In this paper, we first introduce the idea of denoising to object detection. Instance-level denoising on the feature map is performed to enhance the detection of small and cluttered objects. To handle rotation variation, we also add a novel IoU constant factor to the smooth L1 loss to address the long-standing boundary problem, which, according to our analysis, is mainly caused by the periodicity of angle (PoA) and the exchangeability of edges (EoE). Combining these two features, our proposed detector is termed SCRDet++. Extensive experiments are performed on the large public aerial image datasets DOTA, DIOR, and UCAS-AOD, as well as the natural image dataset COCO, the scene text dataset ICDAR2015, the small traffic light dataset BSTLD, and S2TLD, which we release with this paper. The results show the effectiveness of our approach. The released dataset S2TLD is publicly available and contains 5,786 images with 14,130 traffic light instances across five categories.
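
Editorial sketch: the boundary problem tied to angular periodicity can be seen with plain numbers. Two rotated rectangles at 89 and -89 degrees are nearly identical, yet a naive angle difference is 178 degrees and would be heavily penalized. The code shows the standard smooth L1 function and a wrap-around angle difference. It illustrates the problem only and is not the IoU factor proposed in the paper; the 180 degree period and the function names are assumptions.

```python
def smooth_l1(x, beta=1.0):
    """Standard smooth L1: quadratic near zero, linear for large residuals."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta

def angle_diff_periodic(a_deg, b_deg, period=180.0):
    """Smallest difference between two angles under the given period."""
    d = (a_deg - b_deg) % period
    return min(d, period - d)

naive = abs(89.0 - (-89.0))                 # treats the angles as far apart
wrapped = angle_diff_periodic(89.0, -89.0)  # respects the periodicity
naive_loss = smooth_l1(naive)
wrapped_loss = smooth_l1(wrapped)
```

The gap between the two loss values is the sudden spike a regressor sees at the angular boundary when periodicity is ignored.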

17.
IEEE Trans Pattern Anal Mach Intell ; 45(3): 2864-2878, 2023 Mar.
Article in English | MEDLINE | ID: mdl-35635807

ABSTRACT

The explosive growth of image data facilitates the fast development of image processing and computer vision methods for emerging visual applications, meanwhile introducing novel distortions to processed images. This poses a grand challenge to existing blind image quality assessment (BIQA) models, which are weak at adapting to subpopulation shift. Recent work suggests training BIQA methods on the combination of all available human-rated IQA datasets. However, this type of approach is not scalable to a large number of datasets and is cumbersome to incorporate a newly created dataset as well. In this paper, we formulate continual learning for BIQA, where a model learns continually from a stream of IQA datasets, building on what was learned from previously seen data. We first identify five desiderata in the continual setting with three criteria to quantify the prediction accuracy, plasticity, and stability, respectively. We then propose a simple yet effective continual learning method for BIQA. Specifically, based on a shared backbone network, we add a prediction head for a new dataset and enforce a regularizer to allow all prediction heads to evolve with new data while being resistant to catastrophic forgetting of old data. We compute the overall quality score by a weighted summation of predictions from all heads. Extensive experiments demonstrate the promise of the proposed continual learning method in comparison to standard training techniques for BIQA, with and without experience replay. We made the code publicly available at https://github.com/zwx8981/BIQA_CL.

18.
IEEE Trans Pattern Anal Mach Intell ; 45(11): 13489-13508, 2023 Nov.
Article in English | MEDLINE | ID: mdl-37432801

ABSTRACT

Recently, neural architecture search (NAS) has attracted great interest in academia and industry. It remains a challenging problem due to the huge search space and computational costs. Recent studies in NAS mainly focused on the usage of weight sharing to train a SuperNet once. However, the corresponding branch of each subnetwork is not guaranteed to be fully trained. This may not only incur huge computation costs but also affect the architecture ranking in the retraining procedure. We propose a multi-teacher-guided NAS, which uses an adaptive ensemble and a perturbation-aware knowledge distillation algorithm within the one-shot-based NAS algorithm. An optimization method that aims to find the optimal descent directions is used to obtain adaptive coefficients for the feature maps of the combined teacher model. Besides, we propose a specific knowledge distillation process for optimal architectures and perturbed ones in each searching process to learn better feature maps for later distillation procedures. Comprehensive experiments verify that our approach is flexible and effective. We show improvement in precision and search efficiency on the standard recognition dataset. We also show improvement in the correlation between the accuracy of the search algorithm and the true accuracy on NAS benchmark datasets.

19.
Article in English | MEDLINE | ID: mdl-37030765

ABSTRACT

The significance of artistry in creating animated virtual characters is widely acknowledged, and motion style is a crucial element in this process. There has been a long-standing interest in stylizing character animations with style transfer methods. However, models of this kind can only deal with short-term motions and yield deterministic outputs. To address this issue, we propose a generative model based on normalizing flows for stylizing long and aperiodic animations in the VR scene. Our approach breaks down this task into two sub-problems: motion style transfer and stylized motion generation, both formulated as instances of conditional normalizing flows with multi-class latent space. Specifically, we encode high-frequency style features into the latent space for varied results and control the generation process with style-content labels for disentangled edits of style and content. We have developed a prototype, StyleVR, in Unity, which allows casual users to apply our method in VR. Through qualitative and quantitative comparisons, we demonstrate that our system outperforms other methods in terms of style transfer as well as stochastic stylized motion generation.

20.
Article in English | MEDLINE | ID: mdl-37022864

ABSTRACT

Comprehensive understanding of video content requires both spatial and temporal localization. However, a unified video action localization framework is lacking, which hinders the coordinated development of this field. Existing 3D CNN methods take fixed and limited input length at the cost of ignoring temporally long-range cross-modal interaction. On the other hand, despite having large temporal context, existing sequential methods often avoid dense cross-modal interactions for complexity reasons. To address this issue, we propose a unified framework that handles the whole video in a sequential manner with long-range and dense visual-linguistic interaction in an end-to-end manner. Specifically, a lightweight relevance filtering based transformer (Ref-Transformer) is designed, which is composed of relevance filtering based attention and a temporally expanded MLP. The text-relevant spatial regions and temporal clips in the video can be efficiently highlighted through the relevance filtering and then propagated across the whole video sequence with the temporally expanded MLP. Extensive experiments on three sub-tasks of referring video action localization, i.e., referring video segmentation, temporal sentence grounding, and spatiotemporal video grounding, show that the proposed framework achieves state-of-the-art performance in all referring video action localization tasks.
