ABSTRACT
Machine learning algorithms, and in particular deep learning approaches, have recently garnered attention in the field of molecular biology due to their remarkable results. In this chapter, we describe machine learning approaches specifically developed for the design of RNAs, with a focus on the learna_tools Python package, a collection of automated deep reinforcement learning algorithms for secondary structure-based RNA design. We explain the basic concepts of reinforcement learning and its extension, automated reinforcement learning, and outline how these concepts can be successfully applied to the design of RNAs. The chapter is structured to guide the reader through the usage of the different programs with explicit examples, highlighting particular applications of the individual tools.
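To make the RL formulation concrete, the sketch below casts secondary structure-based design as an episodic MDP: the agent fills in one nucleotide per site, and the terminal reward measures how well the folded candidate matches the target structure. This is a minimal illustration, not the learna_tools API; the fold() stub stands in for a real folding engine such as ViennaRNA.

```python
# Minimal sketch of the MDP behind secondary structure-based RNA design.
# NOT the learna_tools API; fold() is a placeholder for a real engine.
import random

def fold(sequence: str) -> str:
    """Placeholder folding oracle. In practice, use a thermodynamic
    folding engine such as ViennaRNA's RNA.fold(sequence)[0]."""
    return "." * len(sequence)  # stub: pretend everything is unpaired

def reward(sequence: str, target: str) -> float:
    """Terminal reward: normalized structural agreement with the target."""
    predicted = fold(sequence)
    matches = sum(p == t for p, t in zip(predicted, target))
    return matches / len(target)

def design_episode(target: str, policy=None) -> tuple[str, float]:
    """One episode: the agent places a nucleotide per site (state = the
    partial sequence); a uniform-random policy stands in for the learned one."""
    sequence = ""
    for _ in range(len(target)):
        action = (policy or (lambda s: random.choice("ACGU")))(sequence)
        sequence += action  # state transition: extend the partial sequence
    return sequence, reward(sequence, target)

candidate, score = design_episode("((((...))))")
print(candidate, score)
```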
Subjects
Algorithms, Machine Learning, Nucleic Acid Conformation, RNA, Software, RNA/chemistry, RNA/genetics, Computational Biology/methods, Deep Learning
ABSTRACT
This study investigated the impairments in social behavior and cognitive flexibility induced by chronic social defeat stress (CSDS) during early and late adolescence (EA and LA). Using the "resident-intruder" stress paradigm, adolescent male Sprague-Dawley rats were exposed to CSDS during either EA (postnatal days 29-38) or LA (postnatal days 39-48) to explore how social defeat at different stages of adolescence affects behavioral and cognitive symptoms commonly associated with psychiatric disorders. After stress exposure, the rats were assessed for anxiety-like behavior in the elevated plus maze, social interaction, and cognitive flexibility through set-shifting and reversal-learning tasks under immediate and delayed reward conditions. The results showed that CSDS during EA, but not LA, led to impaired cognitive flexibility in adulthood, as evidenced by increased perseverative and regressive errors in the set-shifting and reversal-learning tasks, particularly under the delayed reward condition. This suggests that the timing of stress exposure during development has a significant impact on the long-term consequences for behavioral and cognitive function. The findings highlight the vulnerability of the prefrontal cortex, which undergoes critical maturation during early adolescence, to the effects of social stress. Overall, this study demonstrates that the timing of social stressors during adolescence can differentially shape the developmental trajectory of cognitive flexibility, with important implications for understanding the link between childhood/adolescent adversity and the emergence of psychiatric disorders.
Subjects
Sprague-Dawley Rats, Reversal Learning, Social Behavior, Social Defeat, Psychological Stress, Animals, Male, Psychological Stress/physiopathology, Reversal Learning/physiology, Rats, Reward, Anxiety/physiopathology, Cognition/physiology, Executive Function/physiology, Age Factors, Animal Disease Models, Animal Behavior/physiology
ABSTRACT
Research into society's informal rules of conduct, or norms, has recently experienced a surge, extending across multiple academic disciplines. Despite this growth, the theoretical modeling of norms often remains siloed within specific paradigms, as different disciplines tend to favor certain frameworks over others, thereby hindering the spread of innovative ideas. This article breaks through disciplinary barriers to explore recent advancements in the mathematical study of norms. It specifically focuses on cutting-edge theoretical research, structuring the discussion around four general frameworks: game theory, evolutionary game theory, agent-based modeling, and multi-agent reinforcement learning.
ABSTRACT
Threshold voltage (Vth) assignment is a convenient approach to leakage optimization, because leakage power depends exponentially on Vth and logic cells can be swapped without routing effort. However, as an NP-hard problem, it poses a great challenge in large-scale circuit design. Machine learning-based approaches have been proposed to solve this problem, aiming to achieve a good tradeoff between leakage power reduction and runtime speedup without introducing new timing violations. In this paper, a leakage power optimization framework based on reinforcement learning (RL) with a graph neural network (GNN) is proposed for the first time, formulating Vth assignment as an RL process that learns the timing and physical characteristics of each circuit instance with the GNN. Multiple instances are selected in a non-overlapping manner for each RL action iteration to speed up convergence and decouple timing interdependence along circuit paths, and the corresponding reward is carefully defined to trade off leakage reduction against potential timing violations. The proposed framework was validated on the Opencores and IWLS 2005 benchmark circuits with TSMC 28 nm technology. Experimental results demonstrate that our work outperforms prior non-analytical and GNN-based methods with an additional 5% to 17% leakage power reduction, highly consistent with the commercial tool. When the trained RL-based framework is transferred to unseen circuits, it achieves roughly identical leakage optimization results to those on seen circuits and speeds up the runtime by 5.7× to 8.5× compared with the commercial tool.
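The abstract does not give the reward formula, but a plausible shape for a reward that trades off leakage reduction against potential timing violations looks like this (coefficients and names are illustrative):

```python
# Hypothetical reward for a Vth-swap action on a group of cells: rewards
# relative leakage savings, penalizes any negative slack the swap creates.
def reward(leakage_before: float, leakage_after: float,
           worst_slack_after: float, alpha: float = 1.0,
           beta: float = 10.0) -> float:
    leakage_gain = (leakage_before - leakage_after) / leakage_before
    timing_penalty = max(0.0, -worst_slack_after)  # only violations cost
    return alpha * leakage_gain - beta * timing_penalty

# e.g. a swap that saves 12% leakage but creates 5 ps of negative slack:
print(reward(1.00, 0.88, -0.005))
```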
ABSTRACT
Goal-conditioned reinforcement learning is widely used in robot control, manipulating the robot to accomplish specific tasks by maximizing accumulated rewards. However, a useful reward signal is only received when the desired goal is reached, leading to the issue of sparse rewards and reducing the efficiency of policy learning. In this paper, we propose a method to generate highly valued subgoals for efficient goal-conditioned policy learning, enabling the development of smart home robots or autopilots for daily life. The highly valued subgoals are conditioned on the context of the specific tasks and characterized by suitable complexity for efficient goal-conditioned action value learning. The context variable captures the latent representation of the particular tasks, allowing for efficient subgoal generation. Additionally, the goal-conditioned action values regularized by self-adaptive ranges generate subgoals with suitable complexity. Compared to Hindsight Experience Replay, which uniformly samples subgoals from visited trajectories, our method generates subgoals based on the context of the tasks, with suitable difficulty for efficient policy training. Experimental results show that our method achieves stable performance in robotic environments compared to baseline methods.
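For contrast with the proposed context-conditioned subgoals, here is a minimal sketch of the Hindsight Experience Replay baseline's "future" relabeling, in which substitute goals are drawn uniformly from states actually reached later in the episode (field names are illustrative):

```python
# Sketch of HER 'future' relabeling: reached states are reused as
# substitute goals so sparse-reward transitions still carry signal.
import random

def her_relabel(episode, k=4):
    """episode: list of dicts with 'obs', 'action', 'achieved_goal'."""
    relabeled = []
    for t, step in enumerate(episode):
        future = episode[t:]  # goals sampled from later achieved states
        for _ in range(min(k, len(future))):
            goal = random.choice(future)["achieved_goal"]
            reward = 0.0 if step["achieved_goal"] == goal else -1.0
            relabeled.append({**step, "goal": goal, "reward": reward})
    return relabeled

episode = [{"obs": 0, "action": 1, "achieved_goal": (0.1, 0.2)},
           {"obs": 1, "action": 0, "achieved_goal": (0.4, 0.2)}]
print(len(her_relabel(episode)))
```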
ABSTRACT
Urban traffic congestion poses significant economic and environmental challenges worldwide. To mitigate these issues, Adaptive Traffic Signal Control (ATSC) has emerged as a promising solution. Recent advancements in deep reinforcement learning (DRL) have further enhanced ATSC's capabilities. This paper introduces a novel DRL-based ATSC approach named the Sequence Decision Transformer (SDT), employing DRL enhanced with attention mechanisms and leveraging the robust capabilities of sequence decision models, akin to those used in advanced natural language processing, adapted here to tackle the complexities of urban traffic management. Firstly, the ATSC problem is modeled as a Markov Decision Process (MDP), with the observation space, action space, and reward function carefully defined. Subsequently, we propose SDT, specifically tailored to solve the MDP problem. The SDT model uses a transformer-based architecture with an encoder and decoder in an actor-critic structure. The encoder processes observations and outputs both encoded data for the decoder and value estimates for parameter updates. The decoder, as the policy network, outputs the agent's actions. Proximal Policy Optimization (PPO) is used to update the policy network based on historical data, enhancing decision-making in ATSC. This approach significantly reduces training times, effectively manages larger observation spaces, captures dynamic changes in traffic conditions more accurately, and enhances traffic throughput. Finally, the SDT model is trained and evaluated in synthetic scenarios by comparing the number of vehicles, average speed, and queue length against three baselines: PPO, a DQN tailored for ATSC, and FRAP, a state-of-the-art ATSC algorithm. SDT shows improvements of 26.8%, 150%, and 21.7% over traditional ATSC algorithms, and 18%, 30%, and 15.6% over FRAP. This research underscores the potential of integrating Large Language Models (LLMs) with DRL for traffic management, offering a promising solution to urban congestion.
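The PPO update mentioned above relies on the standard clipped surrogate objective; a per-sample version is shown below for reference (the paper's exact hyperparameters are not given in the abstract):

```python
# PPO's clipped surrogate objective, shown per-sample for clarity.
import math

def ppo_loss(logp_new: float, logp_old: float, advantage: float,
             eps: float = 0.2) -> float:
    ratio = math.exp(logp_new - logp_old)            # pi_new / pi_old
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)  # clip to [1-eps, 1+eps]
    # PPO maximizes min(...); we return its negation as a loss to minimize
    return -min(ratio * advantage, clipped * advantage)

print(ppo_loss(logp_new=-0.9, logp_old=-1.1, advantage=0.5))
```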
ABSTRACT
This study explores manipulator control using reinforcement learning, specifically targeting anthropomorphic gripper-equipped robots, with the objective of enhancing the robots' ability to safely exchange diverse objects with humans during human-robot interactions (HRIs). The study integrates an adaptive HRI hand for versatile grasping and incorporates image recognition for efficient object identification and precise coordinate estimation. A tailored reinforcement-learning environment enables the robot to dynamically adapt to diverse scenarios. The effectiveness of this approach is validated through simulations and real-world applications. The HRI hand's adaptability ensures seamless interactions, while image recognition enhances cognitive capabilities. The reinforcement-learning framework enables the robot to learn and refine skills, demonstrated through successful navigation and manipulation in various scenarios. The transition from simulations to real-world applications affirms the practicality of the proposed system, showcasing its robustness and potential for integration into practical robotic platforms. This study contributes to advancing intelligent and adaptable robotic systems for safe and dynamic HRIs.
Subjects
Robotics, Humans, Robotics/methods, Learning, Hand Strength/physiology, Reinforcement (Psychology), Algorithms
ABSTRACT
This paper investigates the single agile optical satellite scheduling problem, which has received increasing attention due to the rapid growth in Earth observation requirements. Owing to the complicated constraints and considerable solution space of this problem, conventional exact and heuristic methods, which are sensitive to the problem scale, demand high computational expense. Thus, an efficient approach is needed to solve this problem, and this paper proposes a deep reinforcement learning algorithm with a local attention mechanism. A mathematical model is first established to describe the problem, which considers a series of complex constraints and takes the profit ratio of completed tasks as the optimization objective. Then, a neural network framework with an encoder-decoder structure is adopted to generate high-quality solutions, and a local attention mechanism is designed to improve solution generation. In addition, an adaptive learning rate strategy is proposed to guide the actor-critic training algorithm to dynamically adjust the learning rate during training to enhance the training effectiveness of the proposed network. Finally, extensive experiments verify that the proposed algorithm outperforms the comparison algorithms in terms of solution quality, generalization performance, and computational efficiency.
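The abstract does not detail the local attention mechanism, but the generic idea is to restrict each decoding step's attention to a window of nearby candidate tasks rather than attending globally; a minimal sketch:

```python
# Sketch of a local attention mask: each decoding step attends only to
# candidate tasks within a window around the current position.
import numpy as np

def local_attention_weights(scores: np.ndarray, center: int,
                            window: int) -> np.ndarray:
    """scores: raw attention logits over all candidate tasks."""
    masked = np.full_like(scores, -np.inf)
    lo, hi = max(0, center - window), min(len(scores), center + window + 1)
    masked[lo:hi] = scores[lo:hi]        # keep only the local neighborhood
    exp = np.exp(masked - masked[lo:hi].max())
    return exp / exp.sum()               # softmax over the window

print(local_attention_weights(np.array([0.2, 1.5, 0.3, 2.0, 0.1]),
                              center=2, window=1))
```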
ABSTRACT
In this study, we investigate the adaptability of artificial agents within a noisy T-maze that use Markov decision processes (MDPs) and successor feature (SF) and predecessor feature (PF) learning algorithms. Our focus is on quantifying how varying the hyperparameters, specifically the reward learning rate (αr) and the eligibility trace decay rate (λ), can enhance their adaptability. Adaptation is evaluated by analyzing cumulative reward, step length, adaptation rate, and adaptation step length, and the relationships between them, using Spearman's correlation tests and linear regression. Our findings reveal that an αr of 0.9 consistently yields superior adaptation across all metrics at a noise level of 0.05. However, the optimal setting for λ varies by metric and context. In discussing these results, we emphasize the critical role of hyperparameter optimization in refining the performance and transfer learning efficacy of learning algorithms. This research advances our understanding of the functionality of PF and SF algorithms, particularly in navigating the inherent uncertainty of transfer learning tasks. By offering insights into optimal hyperparameter configurations, this study contributes to the development of more adaptive and robust learning algorithms, paving the way for future explorations in artificial intelligence and neuroscience.
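To show where the two hyperparameters enter, here is a minimal tabular successor-feature learner with eligibility traces; the paper's exact formulation may differ, and the one-hot state features are illustrative:

```python
# Tabular SF sketch: alpha_r scales reward-weight updates, lambda (lam)
# decays the eligibility trace that spreads credit to earlier states.
import numpy as np

n_states, gamma, alpha_sf, alpha_r, lam = 5, 0.95, 0.1, 0.9, 0.8
psi = np.eye(n_states)        # successor features, one row per state
w = np.zeros(n_states)        # reward weights: r(s) ~= phi(s) @ w
trace = np.zeros(n_states)

def update(s: int, s_next: int, r: float):
    global trace
    phi = np.eye(n_states)[s]                  # one-hot state feature
    trace = gamma * lam * trace + phi          # eligibility trace (lambda)
    td = phi + gamma * psi[s_next] - psi[s]    # SF temporal-difference error
    psi[:] += alpha_sf * np.outer(trace, td)   # credit earlier states too
    w[:] += alpha_r * (r - phi @ w) * phi      # reward learning rate alpha_r

update(0, 1, r=1.0)
print(psi[0] @ w)   # value estimate V(0) = psi(0) . w
```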
Subjects
Algorithms, Spatial Learning, Spatial Learning/physiology, Artificial Intelligence, Markov Chains, Maze Learning/physiology, Humans, Reward
ABSTRACT
Treatment planning for chronic diseases is a critical task in medical artificial intelligence, particularly in traditional Chinese medicine (TCM). However, generating optimized sequential treatment strategies for patients with chronic diseases in different clinical encounters remains a challenging issue that requires further exploration. In this study, we proposed a TCM herbal prescription planning framework based on deep reinforcement learning for chronic disease treatment (PrescDRL). PrescDRL is a sequential herbal prescription optimization model that focuses on long-term effectiveness rather than achieving maximum reward at every step, thereby ensuring better patient outcomes. We constructed a high-quality benchmark dataset for sequential diagnosis and treatment of diabetes and evaluated PrescDRL against this benchmark. Our results showed that PrescDRL achieved a higher curative effect, with the single-step reward improving by 117% and 153% compared to doctors. Furthermore, PrescDRL outperformed the benchmark in prescription prediction, with precision improving by 40.5% and recall improving by 63%. Overall, our study demonstrates the potential of using artificial intelligence to improve clinical intelligent diagnosis and treatment in TCM.
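The distinction between long-term effectiveness and per-step reward maximization is just the discounted return; the toy computation below shows a steady treatment sequence outscoring a greedy one despite a smaller first-step reward (numbers are illustrative):

```python
# Why optimizing the whole treatment sequence beats per-visit greedy choice:
# discounted returns of two hypothetical per-encounter reward sequences.
gamma = 0.9
greedy_rewards = [1.0, 0.2, 0.2, 0.2]   # big early gain, poor follow-up
planned_rewards = [0.6, 0.7, 0.8, 0.9]  # steady long-term improvement

def discounted_return(rewards, gamma):
    return sum(r * gamma**t for t, r in enumerate(rewards))

print(discounted_return(greedy_rewards, gamma))   # ~1.49
print(discounted_return(planned_rewards, gamma))  # ~2.53
```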
ABSTRACT
Neuropsychological data suggest that being overweight or obese is associated with a tendency to perseverate behavior despite negative feedback. This deficit might stem from other cognitive factors, such as working memory (WM) deficits or a decreased ability to deduce model-based strategies when learning by trial-and-error. In the present study, a group of subjects with overweight or obesity (Ow/Ob, n = 30) was compared to normal-weight individuals (n = 42) in a modified Reinforcement Learning (RL) task. The task was designed to control for WM effects on learning by manipulating cognitive load and to foster model-based learning via deductive reasoning. Computational modelling and analysis were conducted to isolate parameters related to RL mechanisms, WM use, and model-based learning (deduction parameter). Results showed that subjects with Ow/Ob made a higher number of perseverative errors and used a weaker deduction mechanism in their performance than control individuals, indicating impairments in negative reinforcement and model-based learning, whereas WM impairments were not responsible for deficits in RL. The present data suggest that obesity is associated with impairments in negative reinforcement and model-based learning.
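As a rough illustration of the modelling approach (in the spirit of RL+WM mixture models, not the authors' exact equations), choice probabilities can blend a slow RL module with a fast, capacity-limited WM module whose influence shrinks under load:

```python
# Sketch of an RL + working-memory mixture policy: choice probabilities
# blend a slow RL module with a capacity-limited WM module. Parameter
# names and values are illustrative.
import numpy as np

def softmax(x, beta=5.0):
    e = np.exp(beta * (x - x.max()))
    return e / e.sum()

def choice_probs(q_rl, q_wm, set_size, capacity=3.0, rho=0.9):
    # WM weight shrinks when the stimulus set exceeds capacity (load effect)
    w = rho * min(1.0, capacity / set_size)
    return w * softmax(q_wm) + (1.0 - w) * softmax(q_rl)

q_rl = np.array([0.4, 0.6])
q_wm = np.array([0.0, 1.0])
print(choice_probs(q_rl, q_wm, set_size=6))  # high load: more weight on RL
```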
ABSTRACT
Energy harvesters based on nanomaterials are becoming increasingly popular, but on their way to commercial availability, some crucial issues still need to be solved. The objective of this study is to select an appropriate nanomaterial. We present a hybrid fuzzy approach that combines a reinforcement-learning Deep Q-Network (DQN) with Fuzzy PROMETHEE to select appropriate materials for a vehicle-environmental-hazardous substance (EHS) combination operating on roadways under traffic conditions. The DQN accumulates useful experience of operating in a dynamic traffic environment, selecting materials that deliver the highest energy output while also taking into account factors such as durability, cost, and environmental impact. Fuzzy PROMETHEE allows human experts to participate in the decision-making process, going beyond the quantitative data typically learned by the DQN through the inclusion of qualitative preferences. This hybrid method thus unites the strengths of the individual approaches, providing material selection that is highly robust and adjustable to real EHS conditions. The results of the study identified materials that deliver high energy efficiency with respect to years of service, price, and environmental effects. The proposed model provides 95% accuracy with a computational efficiency of 300 s, and hypothesis-driven and practical testing of the chosen materials showed their high efficiency in harvesting energy under fluctuating traffic conditions, proving the concept of a hybrid approach in true vehicle-EHS scenarios.
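The DQN component rests on the standard temporal-difference target over material-selection actions; a minimal sketch with illustrative values:

```python
# The core DQN target the hybrid model relies on: a temporal-difference
# target over material-selection actions; values here are illustrative.
import numpy as np

def dqn_target(reward: float, q_next: np.ndarray, gamma: float = 0.99,
               done: bool = False) -> float:
    return reward if done else reward + gamma * float(q_next.max())

# e.g. energy-output reward 0.8, next-state action values below:
print(dqn_target(0.8, np.array([0.5, 0.9, 0.3])))  # 0.8 + 0.99 * 0.9
```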
ABSTRACT
Introduction: Accurately recognizing and understanding human motion actions presents a key challenge in the development of intelligent sports robots. Traditional methods often encounter significant drawbacks, such as high computational resource requirements and suboptimal real-time performance. To address these limitations, this study proposes a novel approach called Sports-ACtrans Net. Methods: In this approach, the Swin Transformer processes visual data to extract spatial features, while the Spatio-Temporal Graph Convolutional Network (ST-GCN) models human motion as graphs to handle skeleton data. By combining these outputs, a comprehensive representation of motion actions is created. Reinforcement learning is employed to optimize the action recognition process, framing it as a sequential decision-making problem. Deep Q-learning is utilized to learn the optimal policy, thereby enhancing the robot's ability to accurately recognize and engage in motion. Results and discussion: Experiments demonstrate significant improvements over state-of-the-art methods. This research advances the fields of neural computation, computer vision, and neuroscience, aiding in the development of intelligent robotic systems capable of understanding and participating in sports activities.
ABSTRACT
Robotic assemblies are widely used in manufacturing processes. However, high-precision assembly remains challenging because of numerous uncertain disturbances. Current research mainly focuses on a single robot or weakly coupled multi-robot assembly. Nevertheless, more complex and uncertainty-filled tightly coupled multi-robot assemblies have been overlooked. This study proposes an efficient skill-acquisition framework to address this challenging task by improving learning efficiency. The framework integrates a dual-loop coupled force-position control (DLCFPC) algorithm, a parallel skill-learning algorithm, and collision detection. The DLCFPC was presented to address simultaneous motion and force control challenges. In addition, a parallel skill-learning algorithm was proposed to accelerate assembly skill acquisition. Simulations and experiments on a multi-robot cooperative peg-in-hole assembly confirm that the framework enables a multi-robot system to accomplish high-precision assembly tasks even without prior knowledge, demonstrating robustness against disturbances.
ABSTRACT
Triple negative breast cancer (TNBC) is one of the most difficult of all types of breast cancer to treat. TNBC is characterized by the absence of the estrogen receptor, progesterone receptor, and human epidermal growth factor receptor 2. The development of effective drugs can help to alleviate the suffering of patients. In this study, the novel nickel(II)-based coordination polymer (CP) [Ni2(HL)(O)(H2O)3·H2O] (1) (where H4L=[1,1':2',1''-triphenyl]-3,3'',4',5'-tetracarboxylic acid) was synthesized via a solvothermal reaction. The overall structure of CP1 was fully identified by SXRD, Fourier transform infrared spectroscopy, and elemental analysis. Using advanced chemical synthesis, we developed Hyaluronic Acid/Carboxymethyl Chitosan-CP1@Doxorubicin (HA/CMCS-CP1@DOX), a nanocarrier system encapsulating doxorubicin (DOX), which was thoroughly characterized using scanning electron microscopy (SEM), Fourier transform infrared spectroscopy (FTIR), and thermogravimetric analysis (TGA). These analyses confirmed the integration of doxorubicin and provided data on the nanocarrier's stability and structure. In vitro experiments showed that this system significantly downregulated Tissue Inhibitor of Metalloproteinases-1 (TIMP-1) in triple-negative breast cancer cells and inhibited their proliferation. Molecular docking simulations revealed that the biological effects of CP1 derive from its carboxyl groups. Using reinforcement learning, multiple new derivatives were generated from this compound, displaying excellent biological activities. These findings highlight the potential clinical applications and the innovative capacity of this nanocarrier system in drug development.
Subjects
Doxorubicin, Drug Carriers, Hydrogels, Triple Negative Breast Neoplasms, Triple Negative Breast Neoplasms/drug therapy, Triple Negative Breast Neoplasms/pathology, Humans, Doxorubicin/chemistry, Doxorubicin/pharmacology, Doxorubicin/administration & dosage, Hydrogels/chemistry, Tumor Cell Line, Female, Drug Carriers/chemistry, Molecular Docking Simulation, Nanoparticles/chemistry, Metal-Organic Frameworks/chemistry, Metal-Organic Frameworks/pharmacology, Chitosan/chemistry, Chitosan/analogs & derivatives, Tissue Inhibitor of Metalloproteinase-1/metabolism, Fourier Transform Infrared Spectroscopy, Hyaluronic Acid/chemistry
ABSTRACT
BACKGROUND: Clinical diagnoses are typically made by following a series of steps recommended by guidelines that are authored by colleges of experts. Accordingly, guidelines play a crucial role in rationalizing clinical decisions. However, they suffer from limitations, as they are designed to cover the majority of the population and often fail to account for patients with uncommon conditions. Moreover, their updates are long and expensive, making them unsuitable for emerging diseases and new medical practices. METHODS: Inspired by guidelines, we formulate the task of diagnosis as a sequential decision-making problem and study the use of Deep Reinforcement Learning (DRL) algorithms to learn the optimal sequence of actions to perform in order to obtain a correct diagnosis from Electronic Health Records (EHRs), which we name a diagnostic decision pathway. We apply DRL to synthetic yet realistic EHRs and develop two clinical use cases: Anemia diagnosis, where the decision pathways follow a decision tree schema, and Systemic Lupus Erythematosus (SLE) diagnosis, which follows a weighted criteria score. We particularly evaluate the robustness of our approaches to noise and missing data, as these frequently occur in EHRs. RESULTS: In both use cases, even with imperfect data, our best DRL algorithms exhibit competitive performance compared to traditional classifiers, with the added advantage of progressively generating a pathway to the suggested diagnosis, which can both guide and explain the decision-making process. CONCLUSION: DRL offers the opportunity to learn personalized decision pathways for diagnosis. Our two use cases illustrate the advantages of this approach: they generate step-by-step pathways that are explainable, and their performance is competitive when compared to state-of-the-art methods.
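A minimal sketch of the sequential-diagnosis MDP described above: each action either orders one more observation from the EHR or commits to a diagnosis. Feature names, costs, and rewards are illustrative, not the paper's values:

```python
# Sketch of a diagnostic-pathway environment step: actions either query
# one EHR feature (small cost) or commit to a terminal diagnosis.
def step(state: dict, action: str, ehr: dict, true_dx: str):
    if action.startswith("measure:"):              # e.g. "measure:hemoglobin"
        feature = action.split(":", 1)[1]
        state = {**state, feature: ehr.get(feature)}  # may be missing/noisy
        return state, -0.1, False                  # small cost per test
    correct = action == f"diagnose:{true_dx}"
    return state, (1.0 if correct else -1.0), True  # terminal reward

state, r, done = step({}, "measure:hemoglobin",
                      ehr={"hemoglobin": 9.1}, true_dx="anemia")
print(state, r, done)
```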
ABSTRACT
Mobile, low-cost, and energy-aware operation of Artificial Intelligence (AI) computations in smart circuits and autonomous robots will play an important role in the next industrial leap in intelligent automation and assistive devices. Neuromorphic hardware with spiking neural network (SNN) architecture utilizes insights from biological phenomena to offer encouraging solutions. Previous studies have proposed reinforcement learning (RL) models for SNN responses in the rat hippocampus to an environment where rewards depend on the context. The scale of these models matches the scope and capacity of small embedded systems in the framework of Internet-of-Bodies (IoB), autonomous sensor nodes, and other edge applications. Addressing energy-efficient artificial learning problems in such systems enables smart micro-systems with edge intelligence. A novel bio-inspired RL system architecture is presented in this work, leading to significant energy consumption benefits without foregoing the real-time autonomous processing and accuracy requirements of the context-dependent task. The hardware architecture successfully models features analogous to synaptic tagging, changes in the exploration schemes, synapse saturation, and spatially localized task-based activation observed in the brain. The design has been synthesized, simulated, and tested on an Intel MAX10 Field-Programmable Gate Array (FPGA). The problem-based bio-inspired approach to SNN edge architectural design results in a 25× reduction in average power compared to the state-of-the-art for a test with real-time context learning and 30 trials. Furthermore, 940× lower energy consumption is achieved due to improvement in the execution time.
ABSTRACT
Enhancing patient response to immune checkpoint inhibitors (ICIs) is crucial in cancer immunotherapy. We aim to create a data-driven mathematical model of the tumor immune microenvironment (TIME) and utilize deep reinforcement learning (DRL) to optimize patient-specific ICI therapy combined with chemotherapy (ICC). Using patients' genomic and transcriptomic data, we develop an ordinary differential equation (ODE)-based TIME dynamic evolutionary model to characterize interactions among chemotherapy, ICIs, immune cells, and tumor cells. A DRL agent is trained to determine the personalized optimal ICC therapy. Numerical experiments with real-world data demonstrate that the proposed TIME model can predict ICI therapy response. The DRL-derived personalized ICC therapy outperforms predefined fixed schedules. For tumors with extremely low CD8+ T cell infiltration ('extremely cold tumors'), the DRL agent recommends high-dosage chemotherapy alone. For tumors with higher CD8+ T cell infiltration ('cold' and 'hot' tumors), an appropriate chemotherapy dosage induces CD8+ T cell proliferation, enhancing ICI therapy outcomes. Specifically, for 'hot' tumors, chemotherapy and ICI are administered simultaneously, while for 'cold' tumors, a mid-dosage of chemotherapy makes the TIME 'hotter' before ICI administration. However, in several 'cold' tumors with rapid resistant tumor cell growth, ICC eventually fails. This study highlights the potential of utilizing real-world clinical data and a DRL algorithm to develop personalized optimal ICC by understanding the complex biological dynamics of a patient's TIME. Our ODE-based TIME dynamic evolutionary model offers a theoretical framework for determining the best use of ICIs, and the proposed DRL agent may guide personalized ICC schedules.
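As a rough illustration of the ODE-based approach (not the authors' equations), a two-compartment tumor/effector model with chemotherapy and ICI dose inputs can be integrated forward to score a dosing schedule:

```python
# Illustrative two-compartment ODE in the spirit of a TIME model (NOT the
# authors' equations): tumor cells T grow logistically, CD8+ effectors E
# kill them; chemo dose c hits both, ICI dose i boosts effector killing.
def derivatives(T, E, c, i, r=0.3, K=1e9, k=1e-7, a=0.2, dE=0.1, b=2.0):
    kill = k * (1 + b * i) * E * T                   # ICI amplifies killing
    dT = r * T * (1 - T / K) - kill - a * c * T      # growth - kill - chemo
    dEdt = 0.05 * kill - dE * E - 0.5 * a * c * E    # chemo also depletes E
    return dT, dEdt

# forward-Euler integration of one schedule: chemo first, then ICI
T, E, dt = 1e6, 1e4, 0.1
for step in range(1000):
    c, i = (1.0 if step < 300 else 0.0), (1.0 if step >= 300 else 0.0)
    dT, dE_ = derivatives(T, E, c, i)
    T, E = max(T + dt * dT, 0.0), max(E + dt * dE_, 0.0)
print(f"tumor burden after schedule: {T:.3e}")
```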
Subjects
Immune Checkpoint Inhibitors, Neoplasms, Tumor Microenvironment, Humans, Tumor Microenvironment/immunology, Immune Checkpoint Inhibitors/therapeutic use, Immune Checkpoint Inhibitors/pharmacology, Neoplasms/drug therapy, Neoplasms/immunology, CD8-Positive T-Lymphocytes/immunology, CD8-Positive T-Lymphocytes/drug effects, Precision Medicine, Immunotherapy
ABSTRACT
The nutcracker optimizer algorithm (NOA) is a metaheuristic method proposed in recent years. This algorithm simulates the behavior of nutcrackers searching and storing food in nature to solve the optimization problem. However, the traditional NOA struggles to balance global exploration and local exploitation effectively, making it prone to getting trapped in local optima when solving complex problems. To address these shortcomings, this study proposes a reinforcement learning-based bi-population nutcracker optimizer algorithm called RLNOA. In the RLNOA, a bi-population mechanism is introduced to better balance global and local optimization capabilities. At the beginning of each iteration, the raw population is divided into an exploration sub-population and an exploitation sub-population based on the fitness value of each individual. The exploration sub-population is composed of individuals with poor fitness values. An improved foraging strategy based on random opposition-based learning is designed as the update method for the exploration sub-population to enhance diversity. Meanwhile, Q-learning serves as an adaptive selector for exploitation strategies, enabling optimal adjustment of the exploitation sub-population's behavior across various problems. The performance of the RLNOA is evaluated using the CEC-2014, CEC-2017, and CEC-2020 benchmark function sets, and it is compared against nine state-of-the-art metaheuristic algorithms. Experimental results demonstrate the superior performance of the proposed algorithm.
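A minimal sketch of Q-learning as an adaptive selector over exploitation strategies, with reward defined as the fitness improvement the chosen strategy just delivered (strategy names and parameters are illustrative):

```python
# Q-learning as an adaptive strategy selector, as in RLNOA's exploitation
# sub-population: reward = fitness improvement a strategy just delivered.
import random

strategies = ["spiral_search", "cache_recovery", "levy_walk"]
Q = {s: 0.0 for s in strategies}
alpha, gamma, eps = 0.1, 0.9, 0.2

def select_strategy():
    if random.random() < eps:          # occasional exploration
        return random.choice(strategies)
    return max(Q, key=Q.get)           # otherwise exploit best-rated

def update_q(s: str, fitness_gain: float):
    best_next = max(Q.values())
    Q[s] += alpha * (fitness_gain + gamma * best_next - Q[s])

s = select_strategy()
update_q(s, fitness_gain=0.05)   # e.g. best fitness improved by 0.05
print(Q)
```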
ABSTRACT
Crawling robots are a focus of intelligent inspection research, and the main feature of this type of robot is the flexibility of in-plane attitude adjustment. The crawling robot HIT_Spibot is a new type of steam generator heat transfer tube inspection robot with a unique mobility capability different from traditional quadrupedal robots. This paper introduces a hierarchical motion planning approach for HIT_Spibot, aiming to achieve efficient and agile maneuverability. The proposed method integrates three distinct planners to handle complex motion tasks: a nonlinear optimization-based base motion planner, a TOPSIS-based base orientation planner, and a Mask-D3QN (MD3QN) algorithm-based gait motion planner. Initially, the robot's base and foot workspaces were delineated through envelope analysis, followed by trajectory computation using Lagrangian methods. Subsequently, the TOPSIS algorithm was employed to establish an evaluation framework for base orientation planning. Finally, the MD3QN algorithm was trained to select foot-points that move the robot along predefined paths. Experimental results demonstrated the method's adaptability across diverse tube structures, showcasing robust performance even in environments with random obstacles. Compared to the D3QN algorithm, MD3QN achieved a 100% success rate, enhanced average overall scores by 6.27%, reduced average stride lengths by 39.04%, and attained a stability rate of 58.02%. These results not only validate the effectiveness and practicality of the method but also showcase the significant potential of HIT_Spibot in the field of industrial inspection.
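The masking idea behind MD3QN can be illustrated in a few lines: infeasible foot placements have their Q-values forced to negative infinity before action selection, so the policy can never choose them (a sketch under that assumption, not the authors' implementation):

```python
# Action masking for a (dueling double) DQN gait planner: infeasible foot
# placements get Q = -inf before argmax, so they are never selected.
import numpy as np

def masked_argmax(q_values: np.ndarray, valid: np.ndarray) -> int:
    """valid: boolean array, True where a foot placement is feasible."""
    masked = np.where(valid, q_values, -np.inf)
    return int(np.argmax(masked))

q = np.array([0.7, 1.2, 0.4, 0.9])
feasible = np.array([True, False, True, True])   # action 1 would collide
print(masked_argmax(q, feasible))                # -> 3, best feasible action
```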