ABSTRACT
Facial expression recognition using convolutional neural networks (CNNs) is a prevalent research area, but network complexity poses obstacles to deployment on devices with limited computational resources, such as mobile devices. To address these challenges, researchers have developed lightweight networks that reduce model size and parameter counts without compromising accuracy. The LiteFer method introduced in this study incorporates depthwise separable convolution and a lightweight attention mechanism, effectively reducing network parameters. Comprehensive comparative experiments on the RAF-DB and FERPlus datasets demonstrate its superior performance over various state-of-the-art lightweight expression-recognition methods.
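To make the parameter savings concrete, here is a minimal PyTorch sketch of a standard depthwise separable convolution block — the generic building block the abstract names, not LiteFer's exact layer:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Factorizes a KxK convolution into a per-channel (depthwise)
    convolution followed by a 1x1 (pointwise) convolution."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride,
                                   padding=kernel_size // 2,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Parameter comparison for a 64 -> 128 channel, 3x3 layer:
std = nn.Conv2d(64, 128, 3, padding=1, bias=False)
sep = DepthwiseSeparableConv(64, 128)
print(sum(p.numel() for p in std.parameters()))  # 73728
print(sum(p.numel() for p in sep.parameters()))  # 9024 (incl. BatchNorm)
```

For a 3 × 3 kernel, the factorization cuts parameters and multiply-accumulates by roughly a factor of eight while keeping the same receptive field.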
Subjects
Neural Networks, Computer; Humans; Algorithms; Facial Expression; Pattern Recognition, Automated/methods
ABSTRACT
As an important direction in computer vision, human pose estimation has received extensive attention in recent years. The High-Resolution Network (HRNet), a classical human pose estimation method, achieves effective estimation results, but its complex structure is not conducive to deployment under limited computational resources. Therefore, an improved Efficient and Lightweight HRNet (EL-HRNet) model is proposed. In detail, point-wise and grouped convolutions are used to construct a lightweight residual module that replaces the original 3 × 3 convolution module and reduces the parameter count. To compensate for the information loss caused by this lightweight design, the Convolutional Block Attention Module (CBAM) is introduced after the new lightweight residual module, forming the Lightweight Attention Basicblock (LA-Basicblock) and enabling high-precision human pose estimation. To verify the effectiveness of the proposed EL-HRNet, experiments were carried out on the COCO2017 and MPII datasets. The results show that EL-HRNet requires only 5 million parameters and 2.0 GFLOPs of computation, achieving an AP score of 67.1% on the COCO2017 validation set and a mean PCKh@0.5 of 87.7% on the MPII validation set; EL-HRNet thus strikes a good balance between model complexity and human pose estimation accuracy.
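For reference, a compact PyTorch sketch of CBAM as described in the original paper (channel attention followed by spatial attention); the reduction ratio and kernel size are the common defaults, not necessarily EL-HRNet's settings:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):
        b, c, _, _ = x.shape
        # Shared MLP over average- and max-pooled channel descriptors.
        score = self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3)))
        return x * torch.sigmoid(score).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # Channel-wise average and max describe 'where' to attend.
        desc = torch.cat([x.mean(dim=1, keepdim=True),
                          x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(desc))

class CBAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ca, self.sa = ChannelAttention(channels), SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))
```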
ABSTRACT
Convolutional neural networks (CNNs) have made significant progress in facial expression recognition (FER). However, due to challenges such as occlusion, lighting variations, and changes in head pose, FER in real-world environments remains difficult. At the same time, purely CNN-based methods rely heavily on local spatial features, lack global information, and struggle to balance computational complexity against recognition accuracy, so they still fall short of addressing FER adequately. To address these issues, we propose a lightweight facial expression recognition method based on a hybrid vision transformer. This method captures multi-scale facial features through an improved attention module, achieving richer feature integration, enhancing the network's perception of key facial expression regions, and improving feature extraction. To further enhance performance, we design a patch dropping (PD) module that emulates the attention allocation of the human visual system, guiding the network to focus on the most discriminative local features, reducing the influence of irrelevant ones, and directly lowering computational cost. Extensive experiments demonstrate that our approach significantly outperforms other methods, achieving an accuracy of 86.51% on RAF-DB and nearly 70% on FER2013, with a model size of only 3.64 MB. These results show that our method provides a new perspective for facial expression recognition.
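The abstract does not spell out the PD module's mechanics; one common way to realize attention-guided patch dropping in a vision transformer — shown purely as a hedged sketch, with an illustrative function name and keep ratio — is to keep only the patch tokens the [CLS] token attends to most:

```python
import torch

def drop_patches(tokens, cls_attn, keep_ratio=0.7):
    """tokens:   (B, 1 + N, D) -- [CLS] token followed by N patch tokens
    cls_attn: (B, N)        -- attention from [CLS] to each patch,
                               e.g. averaged over heads of one block"""
    b, n1, d = tokens.shape
    n_keep = max(1, int((n1 - 1) * keep_ratio))
    idx = cls_attn.topk(n_keep, dim=1).indices        # most-attended patches
    idx = idx.unsqueeze(-1).expand(-1, -1, d)
    kept = torch.gather(tokens[:, 1:], 1, idx)
    return torch.cat([tokens[:, :1], kept], dim=1)    # (B, 1 + n_keep, D)
```

Dropping 30% of the tokens shrinks the quadratic self-attention cost of the remaining blocks accordingly.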
Subjects
Facial Expression; Neural Networks, Computer; Humans; Automated Facial Recognition/methods; Algorithms; Image Processing, Computer-Assisted/methods; Face; Pattern Recognition, Automated/methods
ABSTRACT
In the field of autofocus for optical systems, passive focusing methods are widely used for their cost-effectiveness, yet fixed focusing windows and evaluation functions can still cause focusing failures in certain scenes. Additionally, the lack of datasets limits extensive research on deep learning methods. In this work, we propose a neural-network autofocus method that dynamically selects the region of interest (ROI). Our main contributions are as follows: first, we construct a dataset for the automatic focusing of grayscale images; second, we cast autofocus as an ordinal regression problem and propose two focusing strategies, full-stack search and single-frame prediction; and third, we construct a MobileViT network with a linear self-attention mechanism to achieve automatic focusing on dynamic regions of interest. Experiments verify the effectiveness of the proposed method: the focusing MAE of the full-stack search can be as low as 0.094 with a focusing time of 27.8 ms, and the focusing MAE of the single-frame prediction can be as low as 0.142 with a focusing time of 27.5 ms.
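One standard way to cast a discrete focus position as ordinal regression — offered as a hedged sketch rather than the paper's exact head — is to predict K-1 cumulative binary decisions ("is the best focus beyond position k?") trained with binary cross-entropy:

```python
import torch
import torch.nn as nn

class OrdinalFocusHead(nn.Module):
    def __init__(self, feat_dim, num_positions):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_positions - 1)

    def forward(self, feats):
        # (B, K-1) probabilities that the in-focus frame lies beyond rank k.
        return torch.sigmoid(self.fc(feats))

def ordinal_targets(labels, num_positions):
    # Label y -> [1]*y + [0]*(K-1-y); train with nn.BCELoss on the head output.
    ranks = torch.arange(num_positions - 1).unsqueeze(0)
    return (labels.unsqueeze(1) > ranks).float()

def decode_position(probs, threshold=0.5):
    # Predicted focus index = count of cumulative decisions above threshold.
    return (probs > threshold).sum(dim=1)
```

Unlike plain classification, this keeps the ordering of focus positions, so a prediction one step away from the truth is penalized less than one ten steps away.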
ABSTRACT
Video super-resolution (VSR) remains challenging in real-world applications due to complex and unknown degradations. Existing methods lack the flexibility to handle video sequences with different degradation levels and thus fail to reflect real-world scenarios. To address this problem, we propose a degradation-adaptive video super-resolution network (DAVSR) built on a bidirectional propagation network. Specifically, we adaptively apply three distinct degradation levels to input video sequences, obtaining training pairs that reflect a variety of real-world corrupted images. We also equip the network with a pre-cleaning module that reduces noise and artifacts in low-quality video sequences before information propagation. Additionally, compared with previous flow-based methods, we employ an unsupervised optical flow estimator to obtain more precise optical flow for inter-frame alignment. Finally, while maintaining performance, we streamline the propagation branches and the reconstruction module of the baseline network. Experiments on datasets with diverse degradation types validate the effectiveness of DAVSR: our method achieves an average PSNR improvement of 0.18 dB over a recent state-of-the-art approach (DBVSR) and handles real-world video sequences with different degradation levels effectively.
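A hedged sketch of how multi-level degradation can synthesize such training pairs (blur, downsampling, and additive noise at three preset severities; the preset values are illustrative, not the paper's):

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(sigma, size=7):
    x = (torch.arange(size) - size // 2).float()
    g = torch.exp(-x ** 2 / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return (k / k.sum()).view(1, 1, size, size)

def degrade(hr, level):
    """hr: (B, C, H, W) in [0, 1]; returns a 4x-downscaled degraded frame."""
    sigma, noise = {"mild": (0.4, 2 / 255),
                    "medium": (1.2, 5 / 255),
                    "severe": (2.4, 10 / 255)}[level]
    k = gaussian_kernel(sigma).repeat(hr.shape[1], 1, 1, 1)
    lr = F.conv2d(hr, k, padding=3, groups=hr.shape[1])      # blur
    lr = F.interpolate(lr, scale_factor=0.25, mode="bicubic",
                       align_corners=False)                   # downsample
    return (lr + noise * torch.randn_like(lr)).clamp(0, 1)   # noise
```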
ABSTRACT
In clinical settings limited by equipment, lightweight skin lesion segmentation is pivotal because it allows the model to be integrated into diverse medical devices, improving operational efficiency. However, a lightweight design may degrade accuracy, especially on complex images such as skin lesion images with irregular regions, blurred boundaries, and oversized boundaries. To address these challenges, we propose an efficient lightweight attention network (ELANet) for the skin lesion segmentation task. In ELANet, the two attention mechanisms of the bilateral residual module (BRM) provide complementary information, enhancing sensitivity to features in the spatial and channel dimensions, respectively; multiple BRMs are then stacked for efficient feature extraction of the input information. In addition, the network acquires global information and improves segmentation accuracy by passing feature maps of different scales through multi-scale attention fusion (MAF) operations. Finally, we evaluate ELANet on three publicly available datasets, ISIC2016, ISIC2017, and ISIC2018. The experimental results show that our algorithm achieves mIoU scores of 89.87%, 81.85%, and 82.87% on the three datasets with only 0.459 M parameters, an excellent balance between accuracy and lightness that is superior to many existing segmentation methods.
Subjects
Algorithms; Neural Networks, Computer; Humans; Image Processing, Computer-Assisted/methods; Skin/diagnostic imaging; Skin/pathology
ABSTRACT
Multiple dynamic impact signals are widely used in a variety of engineering scenarios, yet they are difficult to identify accurately and quickly because of the signal-adhesion phenomenon caused by nonlinear interference. To address this problem, an intelligent algorithm combining wavelet transforms with lightweight neural networks is proposed. First, the features of multiple impact signals are analyzed by establishing a transfer model for multiple impacts in multibody dynamical systems, and interference is suppressed using the wavelet transform. Second, a lightweight neural network, the fast-activated minimal gated unit (FMGU), is elaborated for multiple impact signals to reduce computational complexity and improve real-time performance. Third, experimental results show that the proposed method maintains excellent feature recognition compared with gated recurrent unit (GRU) and long short-term memory (LSTM) networks on all test datasets with varying impact speeds, while its computational complexity metrics are 50% lower than those of the GRU and LSTM. The proposed method is therefore of great practical value for resource-constrained hardware platforms that must identify multiple dynamic impact signals accurately in real time.
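For context, the minimal gated unit on which the FMGU builds uses a single forget gate in place of the GRU's two or the LSTM's three gates, roughly halving the recurrent parameters; the sketch below is the standard MGU, with the FMGU's fast-activation modifications left out:

```python
import torch
import torch.nn as nn

class MinimalGatedUnit(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.cand = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x, h):
        f = torch.sigmoid(self.gate(torch.cat([x, h], dim=-1)))      # forget gate
        h_new = torch.tanh(self.cand(torch.cat([x, f * h], dim=-1)))  # candidate
        return (1 - f) * h + f * h_new                                # next state
```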
ABSTRACT
To respond effectively to floods and water emergencies in which missing persons drown, timely and effective search and rescue is a critical step in underwater rescue. Given the complex underwater environment and low visibility, unmanned underwater vehicles (UUVs) equipped with sonar and deep learning algorithms can conduct active searches more efficiently than traditional manual methods. In this paper, we construct a sonar-based rescue-target dataset encompassing both source and target domains for deep transfer learning. For the underwater acoustic detection of small rescue targets, which lack precise image features, we propose a two-branch convolution module and improve the YOLOv5s model to design a small-target detection algorithm for acoustic rescue. Because the acoustic dataset is small and the statistical properties of optical and acoustic images differ, directly fine-tuning a model pre-trained on optical images lacks cross-domain adaptability; this paper therefore proposes a hierarchical transfer learning method for heterogeneous information. To reduce false detections of acoustic rescue targets against complex underwater backgrounds, network layers are frozen during the hierarchical transfer to improve detection accuracy. In addition, to better suit the embedded devices carried by UUVs, an underwater acoustic rescue-target detection algorithm based on ShuffleNetv2 is proposed to improve the two-branch convolution module and the YOLOv5s backbone, creating a lightweight model based on the hierarchical transfer of heterogeneous information. Extensive comparative experiments on various acoustic images validate the feasibility and effectiveness of our method, which achieves state-of-the-art performance in underwater search-and-rescue target detection.
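The layer freezing used during the hierarchical transfer can be expressed in a few lines; this is a generic PyTorch sketch with an illustrative parameter-name prefix, not the paper's code:

```python
import torch
from torch import nn

def freeze_and_finetune(model: nn.Module, frozen_prefix="backbone."):
    # Freeze the (domain-general) early layers; only the later layers
    # adapt to the small acoustic dataset.
    for name, p in model.named_parameters():
        if name.startswith(frozen_prefix):
            p.requires_grad = False
    # Optimize only the parameters that remain trainable.
    return torch.optim.SGD((p for p in model.parameters() if p.requires_grad),
                           lr=1e-3, momentum=0.9)
```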
ABSTRACT
To lighten the workload of train drivers and enhance railway transportation safety, a novel and intelligent method for railway turnout identification is investigated based on semantic segmentation. More specifically, a railway turnout scene perception (RTSP) dataset is constructed and annotated manually in this paper, introducing the innovative concept of side rails as part of the labeling process. After that, building on Deeplabv3+ and combining a lightweight design with an attention mechanism, a railway turnout identification network (RTINet) is proposed. Firstly, considering the need for rapid response when deploying the identification model on high-speed trains, MobileNetV2, renowned for its suitability for lightweight deployment, is selected as the backbone of the RTINet model. Secondly, to reduce the computational load while preserving accuracy, depthwise separable convolutions replace the standard convolutions within the network architecture. Thirdly, the bottleneck attention module (BAM) is integrated into the model to enhance the perception of position and feature information and to bolster the robustness and quality of the generated segmentation masks. Finally, to address the foreground-background imbalance in turnout recognition, the Dice loss function is incorporated into network training. Both quantitative and qualitative experimental results demonstrate that the proposed method is feasible for railway turnout identification and outperforms the compared baseline models: RTINet achieves a remarkable mIoU of 85.94% with an inference speed of 78 fps on the customized dataset. Furthermore, an additional ablation study verifies the effectiveness of each optimized component of the proposed RTINet.
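The Dice loss used for the foreground-background imbalance is standard; a minimal binary-segmentation sketch:

```python
import torch

def dice_loss(logits, targets, eps=1.0):
    """logits: (B, 1, H, W) raw scores; targets: (B, 1, H, W) in {0, 1}.
    Unlike pixel-wise cross-entropy, the overlap ratio is insensitive
    to how few foreground (turnout) pixels the image contains."""
    probs = torch.sigmoid(logits)
    inter = (probs * targets).sum(dim=(1, 2, 3))
    union = probs.sum(dim=(1, 2, 3)) + targets.sum(dim=(1, 2, 3))
    return (1 - (2 * inter + eps) / (union + eps)).mean()
```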
ABSTRACT
Most deep-learning-based object detection algorithms suffer from low speed and accuracy in gear surface defect detection because of their high computational costs and complex structures. To solve this problem, a lightweight model for gear surface defect detection, STMS-YOLOv5, is proposed in this paper. Firstly, the ShuffleNetv2 module is employed as the backbone to reduce the GFLOPs and the number of parameters. Secondly, transposed convolution upsampling is used to enhance the learning capability of the network. Thirdly, the max efficient channel attention mechanism is embedded in the neck to compensate for the accuracy loss caused by the lightweight backbone. Finally, the SIoU loss is adopted as the bounding-box regression loss in the prediction part to speed up model convergence. Experiments show that STMS-YOLOv5 achieves 130.4 and 133.5 frames per second on the gear and NEU-DET steel surface defect datasets, respectively. The number of parameters and GFLOPs are reduced by 44.4% and 50.31%, respectively, while mAP@0.5 reaches 98.6% and 73.5% on the two datasets. Extensive ablation and comparative experiments validate the effectiveness and generalization capability of the model in industrial defect detection.
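The transposed-convolution swap is the simplest of the four changes; as a sketch (the channel count is illustrative), it replaces YOLOv5's parameter-free nearest-neighbor upsampling with a learnable 2x upsampling:

```python
import torch
import torch.nn as nn

up_nearest = nn.Upsample(scale_factor=2, mode="nearest")       # YOLOv5 default
up_learned = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)

x = torch.randn(1, 256, 20, 20)
assert up_nearest(x).shape == up_learned(x).shape == (1, 256, 40, 40)
```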
Subjects
Algorithms; Industries; Learning; Neck; Spine
ABSTRACT
In the past few years, 3D Morphable Model (3DMM)-based methods have achieved remarkable results in single-image 3D face reconstruction. However, high-fidelity 3D face texture generation with this approach mostly relies on deep convolutional neural networks during the parameter fitting process, which increases the number of network layers and the computational burden of the model and reduces computational speed. Existing methods increase speed by using lightweight networks for parameter fitting, but at the expense of reconstruction accuracy. To solve these problems, we improved the 3D deformable model and propose an efficient and lightweight network model: Mobile-FaceRNet. First, we combine depthwise separable convolution and multi-scale representation methods to fit the parameters of the 3DMM; then, we introduce a residual attention module during network training to strengthen the network's focus on important features, guaranteeing high-fidelity facial texture reconstruction quality; and, finally, a new perceptual loss function is designed to better enforce smoothness and image similarity. Experimental results show that the proposed method achieves high-precision reconstruction while remaining lightweight and is also more robust to influences such as pose and occlusion.
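The abstract does not give the new perceptual loss in closed form; as a baseline for comparison, a generic VGG16-feature perceptual loss looks like the following sketch (the paper's version adds smoothness constraints on top of such a term):

```python
import torch.nn as nn
import torchvision.models as models

class PerceptualLoss(nn.Module):
    def __init__(self):
        super().__init__()
        # Frozen early VGG16 layers as a fixed feature extractor.
        # Inputs are assumed to be ImageNet-normalized, as VGG expects.
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features[:16]
        for p in vgg.parameters():
            p.requires_grad = False
        self.vgg = vgg.eval()
        self.crit = nn.L1Loss()

    def forward(self, rendered, target):
        return self.crit(self.vgg(rendered), self.vgg(target))
```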
ABSTRACT
Accurate and rapid response in complex driving scenarios is a challenging problem in autonomous driving. If a target is not detected in time, the vehicle cannot react promptly, which can result in fatal accidents. The application of driver assistance systems therefore requires a model that can accurately detect targets in complex scenes and respond quickly. In this paper, a lightweight feature extraction model, ShuffDet, is proposed to replace the CSPDarknet53 backbone used by YOLOX by improving the YOLOX algorithm. At the same time, an attention mechanism is introduced into the path aggregation feature pyramid network (PAFPN) so that the network focuses more on important information, thereby improving the accuracy of the model. The model combining these two methods, called ShuffYOLOX, improves accuracy while keeping the network lightweight. Experiments on the KITTI dataset show that, compared with the original network, ShuffYOLOX reaches a mean average precision (mAP) of 92.20%, while its parameter count is reduced by 34.57%, its GFLOPs by 42.19%, and its FPS is increased by 65%. The ShuffYOLOX model is therefore well suited to autonomous driving applications.
ABSTRACT
Recording the trajectory of table tennis balls in real time enables analysis of an opponent's attacking characteristics and weaknesses. Current analysis of ball paths relies mainly on human viewing and lacks supporting theoretical data. To address the lack of objective data analysis in table tennis competition research, a table tennis trajectory extraction network based on a target detection algorithm is proposed to record the trajectory of table tennis movement in video. The network improves the feature reuse rate to achieve a lightweight design while enhancing detection accuracy. Its core is the "feature store & return" module, which stores the output of the current network layer and passes the features to the input of the network layer at the next moment, enabling efficient feature reuse. Within this module, a Transformer model processes the features a second time, builds global association information, and enhances the feature richness of the feature map. In the designed experiments, the detection accuracy of the network was 96.8% for table tennis balls and 89.1% for target localization. Moreover, the model is only 7.68 MB, and the detection frame rate reached 634.19 FPS on the test hardware. In summary, the proposed network combines lightness and high precision in table tennis detection, and its performance significantly outperforms that of existing models.
ABSTRACT
To meet the demand for road damage object detection under the resource-constrained conditions of mobile terminal devices, in this paper we propose YOLO-LWNet, an efficient lightweight road damage detection algorithm for mobile devices. First, a novel lightweight module, the LWC, is designed, and its attention mechanism and activation function are optimized. Then, a lightweight backbone and an efficient feature fusion network are constructed with the LWC as the basic building unit. Finally, the backbone and feature fusion network in YOLOv5 are replaced with these designs. Two versions of YOLO-LWNet, small and tiny, are introduced. YOLO-LWNet was compared with YOLOv6 and YOLOv5 on the RDD-2020 public dataset across various performance aspects. The experimental results show that YOLO-LWNet outperforms state-of-the-art real-time detectors in balancing detection accuracy, model scale, and computational complexity in the road damage detection task, better meeting the lightweight and accuracy requirements of object detection on mobile terminal devices.
ABSTRACT
Coal flow on belt conveyors is often mixed with foreign objects such as anchor rods, angle irons, wooden bars, gangue, and large coal chunks, leading to belt tearing, blockages at transfer points, or even belt breakage. Fast and effective detection of these foreign objects is vital to the safe and smooth operation of belt conveyors. This paper proposes an improved YOLOv5-based method for rapid, low-parameter detection and recognition of non-coal foreign objects. Firstly, a new dataset containing foreign objects on conveyor belts is established for training and testing. Considering the high-speed operation of belt conveyors and the increased demands on inspection-robot data collection frequency and real-time algorithm processing, this study employs a dark channel dehazing method to preprocess the raw data collected by the inspection robot in harsh mining environments, enhancing image clarity. Subsequently, the backbone and neck of YOLOv5 are improved to obtain a deeply lightweight object detection network that preserves detection speed and accuracy. The experimental results demonstrate that the improved model achieves a detection accuracy of 94.9% on the proposed foreign object dataset. Compared with YOLOv5s, the model parameters, inference time, and computational load are reduced by 43.1%, 54.1%, and 43.6%, respectively, while detection accuracy improves by 2.5%. These findings are significant for speeding up foreign object recognition and facilitating deployment on edge computing devices, ensuring the safe and efficient operation of belt conveyors.
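Dark channel dehazing is the classic He et al. prior; a minimal sketch (without the guided-filter refinement that production code would add):

```python
import cv2
import numpy as np

def dark_channel_dehaze(img, patch=15, omega=0.95, t0=0.1):
    """img: BGR uint8 frame from the inspection robot."""
    I = img.astype(np.float64) / 255.0
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (patch, patch))
    dark = cv2.erode(I.min(axis=2), kernel)              # dark channel prior
    # Atmospheric light: mean color of the brightest 0.1% dark-channel pixels.
    n = max(1, dark.size // 1000)
    idx = np.unravel_index(np.argsort(dark, axis=None)[-n:], dark.shape)
    A = I[idx].mean(axis=0)
    # Transmission estimate, clamped to avoid over-amplification.
    t = np.maximum(1 - omega * cv2.erode((I / A).min(axis=2), kernel), t0)
    J = (I - A) / t[..., None] + A                       # scene radiance
    return np.clip(J * 255, 0, 255).astype(np.uint8)
```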
ABSTRACT
Most traditional VSLAM (visual SLAM) systems assume a static scene, which yields low accuracy in dynamic environments; methods that achieve higher accuracy usually sacrifice real-time performance. In highly dynamic scenes, balancing high accuracy with low computational cost has become a pivotal requirement for VSLAM systems. This paper proposes a new VSLAM system that balances the competing demands of positioning accuracy and computational complexity, thereby further improving the overall system properties. For accuracy, the system applies an improved lightweight target detection network to quickly detect dynamic feature points while features are extracted at the front end, and only the feature points of static targets are used for frame matching. Meanwhile, an attention mechanism is integrated into the target detection network to continuously and accurately capture dynamic factors in more complex dynamic environments. For computational expense, the lightweight GhostNet module is applied as the backbone of the YOLOv5s target detection network, significantly reducing the number of model parameters and improving the overall inference speed of the algorithm. Experimental results on the TUM dynamic dataset show that, compared with the ORB-SLAM3 system, the pose estimation accuracy of the system improves by 84.04%. Compared with dynamic SLAM systems such as DS-SLAM and DVO SLAM, the system significantly improves positioning accuracy, and compared with other deep-learning-based VSLAM algorithms, it offers superior real-time performance while maintaining a similar accuracy index.
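The front-end filtering step reduces to discarding feature points that fall inside detected dynamic-object boxes; a minimal sketch (plain coordinates, not the system's actual data structures):

```python
def keep_static_points(points, dynamic_boxes):
    """points: iterable of (x, y); dynamic_boxes: iterable of (x1, y1, x2, y2).
    Only the surviving points are passed on to frame matching."""
    def inside(p, b):
        x, y = p
        x1, y1, x2, y2 = b
        return x1 <= x <= x2 and y1 <= y <= y2
    return [p for p in points
            if not any(inside(p, b) for b in dynamic_boxes)]
```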
ABSTRACT
This paper discusses the application of deep learning to recognizing vehicle black smoke in road traffic monitoring videos. The use of massive surveillance video data imposes high demands on the real-time performance of vehicle black smoke detection models. The YOLOv5s model, known for its excellent single-stage object detection performance, nevertheless has a complex network structure. This study therefore proposes MGSNet, a lightweight real-time detection model for vehicle black smoke based on the YOLOv5s framework. Road traffic monitoring video data were collected, and a custom vehicle black smoke dataset was created with data augmentation techniques such as changes to image brightness and contrast. Three lightweight networks, ShuffleNetv2, MobileNetv3, and GhostNetv1, were explored for reconstructing the CSPDarknet53 backbone feature extraction network of YOLOv5s; comparative results indicate that reconstruction with MobileNetv3 achieves the best balance between detection accuracy and speed. The squeeze-and-excitation attention mechanism and inverted residual structure of MobileNetv3 effectively reduce the complexity of black smoke feature fusion. Simultaneously, a convolution module, GSConv, is introduced to enhance the expression of black smoke features in the neck network; its combination of depthwise separable and standard convolution further reduces the model's parameter count. After these improvements, the model's parameter count is compressed to 1/6 that of YOLOv5s. The lightweight vehicle black smoke real-time detection network, MGSNet, achieved a detection speed of 44.6 frames per second on the test set, 18.9 frames per second faster than YOLOv5s, while mAP@0.5 still exceeded 95%, meeting the application requirements for real-time and accurate detection of vehicle black smoke.
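For reference, a sketch of GSConv following the common open-source formulation (half standard convolution, half depthwise convolution of that output, then a channel shuffle); the kernel sizes here are illustrative:

```python
import torch
import torch.nn as nn

class GSConv(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, s=1):
        super().__init__()
        c = out_ch // 2
        self.conv = nn.Conv2d(in_ch, c, k, s, k // 2, bias=False)   # standard
        self.dw = nn.Conv2d(c, c, 5, 1, 2, groups=c, bias=False)    # depthwise

    def forward(self, x):
        a = self.conv(x)
        y = torch.cat([a, self.dw(a)], dim=1)
        # Channel shuffle: interleave the standard and depthwise halves.
        b, ch, h, w = y.shape
        return y.view(b, 2, ch // 2, h, w).transpose(1, 2).reshape(b, ch, h, w)
```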
ABSTRACT
Estimating pineapple yield from unmanned aerial vehicle (UAV) photography relies on target recognition to count pineapple buds. This research proposes the SFHG-YOLO method, with YOLOv5s as the baseline, to address the practical need to identify small objects (pineapple buds) in UAV vision and the drawbacks of existing algorithms in real-time performance and accuracy. Pineapple buds in the field are small, densely distributed objects, so a lightweight network model with enhanced spatial attention and adaptive context-information fusion is used to increase detection accuracy and robustness. The lightweight network is constructed from the coordinate attention module and MobileNetV3; additionally, to fully leverage feature information across levels and enhance the perception of tiny objects, we developed both an enhanced spatial attention module and an adaptive context-information fusion module. Experiments were conducted to validate the algorithm's small-object detection performance: the SFHG-YOLO model improves mAP@0.5 and mAP@0.5:0.95 by 7.4% and 31%, respectively, over the YOLOv5s baseline. Considering model size and computational cost as well, the findings underscore the superior performance of the proposed technique in detecting high-density small objects, offering a reliable detection approach for estimating pineapple yield.
ABSTRACT
Hyperreflective foci (HF) reflect inflammatory responses in fundus diseases such as diabetic macular edema (DME), retinal vein occlusion (RVO), and central serous chorioretinopathy (CSC). Because HF appear with high contrast and reflectivity in optical coherence tomography (OCT) images, their automatic segmentation is helpful for the prognosis of fundus diseases. Previous methods were time-consuming and required high computing power. Hence, we propose a lightweight network to segment HF at a speed of 57 ms per OCT image, at least 150 ms faster than other methods. Our framework consists of two stages: an NLM filter and patch-based splitting to preprocess images, and a lightweight DBR neural network to segment HF automatically. Experimental results from 3000 OCT images of 300 patients (100 DME, 100 RVO, and 100 CSC) show that our method segments HF successfully, with Dice similarity coefficients (DSC) of 83.65%, 76.43%, and 82.20% for DME, RVO, and CSC on the test cohort, respectively, at least 5% higher than previous methods. HF in DME were more easily segmented than in the other two diseases. In addition, our DBR network is broadly applicable to clinical practice, with the ability to segment HF across a wide range of fundus diseases.
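Stage one of the framework maps to a few standard library calls; a hedged sketch with illustrative parameter values (filter strength, patch size):

```python
import cv2
import numpy as np

def preprocess(oct_img, patch_size=128):
    """oct_img: grayscale uint8 OCT B-scan -> stack of denoised patches."""
    den = cv2.fastNlMeansDenoising(oct_img, None, h=10,
                                   templateWindowSize=7, searchWindowSize=21)
    rows, cols = den.shape
    patches = [den[y:y + patch_size, x:x + patch_size]
               for y in range(0, rows - patch_size + 1, patch_size)
               for x in range(0, cols - patch_size + 1, patch_size)]
    return np.stack(patches)
```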
Subjects
Diabetic Retinopathy; Macular Edema; Humans; Diabetic Retinopathy/diagnostic imaging; Tomography, Optical Coherence/methods; Macular Edema/diagnostic imaging; Fundus Oculi; Neural Networks, Computer
ABSTRACT
BACKGROUND: Brain tumor segmentation plays a significant role in clinical treatment and surgical planning. Recently, several deep convolutional networks have been proposed for brain tumor segmentation and have achieved impressive performance. However, most state-of-the-art models use 3D convolutions, which incur high computational costs, making it difficult to apply these models to medical equipment in the future. Additionally, due to the large diversity of brain tumors and the uncertain boundaries between sub-regions, some models cannot segment multiple tumors in the brain well at the same time. RESULTS: In this paper, we propose a lightweight hierarchical convolution network, called LHC-Net. Our network uses a multi-scale strategy in which the common 3D convolution is replaced by hierarchical convolution with residual-like connections, improving multi-scale feature extraction while greatly reducing parameters and computational resources. On the BraTS2020 dataset, LHC-Net achieves Dice scores of 76.38%, 90.01%, and 83.32% for ET, WT, and TC, respectively, better than 3D U-Net's 73.50%, 89.42%, and 81.92%. On the multi-tumor set especially, our model shows significant performance improvement. In addition, LHC-Net has 1.65M parameters and 35.58G FLOPs, about half the parameters and a third of the computation of 3D U-Net. CONCLUSION: Our proposed method achieves automatic segmentation of tumor sub-regions from four-modality brain MRI images. LHC-Net achieves competitive segmentation performance with fewer parameters and less computation than state-of-the-art models, which means that our model can be applied under limited medical computing resources. By using the multi-scale strategy on channels, LHC-Net can segment multiple tumors in a patient's brain well and has great potential for application to other multi-scale segmentation tasks.
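The hierarchical convolution is described like the Res2Net family of blocks; below is a hedged 2D sketch of one such residual-like hierarchy (the network itself operates on 3D MRI volumes):

```python
import torch
import torch.nn as nn

class HierarchicalConv(nn.Module):
    """Split channels into groups; each small conv also receives the
    previous group's output, giving multi-scale receptive fields with
    far fewer parameters than one wide 3x3 convolution."""
    def __init__(self, channels, scales=4):
        super().__init__()
        assert channels % scales == 0
        self.scales = scales
        w = channels // scales
        self.convs = nn.ModuleList(
            nn.Conv2d(w, w, 3, padding=1, bias=False)
            for _ in range(scales - 1))

    def forward(self, x):
        parts = torch.chunk(x, self.scales, dim=1)
        out, prev = [parts[0]], parts[0]
        for conv, part in zip(self.convs, parts[1:]):
            prev = conv(part + prev)          # residual-like hand-off
            out.append(prev)
        return torch.cat(out, dim=1)
```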