Results 1 - 20 of 32
1.
Sensors (Basel) ; 22(9)2022 May 08.
Article in English | MEDLINE | ID: mdl-35591271

ABSTRACT

When a traditional Deep Deterministic Policy Gradient (DDPG) algorithm is used for mobile robot path planning, the limited observable environment of the robot leads to low training efficiency and slow convergence of the path planning model. In this paper, Long Short-Term Memory (LSTM) is introduced into the DDPG network so that the previous and current states of the mobile robot are combined to determine its actions, and a Batch Norm layer is added after each layer of the Actor network. At the same time, the reward function is optimized to guide the mobile robot toward the target point more quickly. To improve learning efficiency, different normalization methods are used to normalize the distance and angle between the mobile robot and the target point, which serve as the input of the DDPG network model. When the model outputs the next action of the mobile robot, mixed noise composed of Gaussian noise and Ornstein-Uhlenbeck (OU) noise is added. Finally, experiments are conducted in a simulation environment built with a ROS system and the Gazebo platform. The results show that the proposed algorithm accelerates the convergence of DDPG, improves the generalization ability of the path planning model, and increases the efficiency and success rate of mobile robot path planning.
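
The abstract does not give the exact noise parameters; as a rough illustration of the mixed exploration noise described above, the sketch below combines an Ornstein-Uhlenbeck process with Gaussian noise (the values of theta, the two sigmas, and the mixing weight are assumptions, not taken from the paper):

```python
import numpy as np

class MixedExplorationNoise:
    """OU noise (temporally correlated) mixed with Gaussian noise (uncorrelated)."""

    def __init__(self, action_dim, theta=0.15, sigma_ou=0.2, sigma_gauss=0.1,
                 ou_weight=0.5, dt=1e-2, seed=0):
        self.theta, self.sigma_ou, self.sigma_gauss = theta, sigma_ou, sigma_gauss
        self.ou_weight, self.dt = ou_weight, dt
        self.rng = np.random.default_rng(seed)
        self.x = np.zeros(action_dim)          # internal OU state

    def sample(self):
        # Ornstein-Uhlenbeck update: mean-reverting toward zero
        dx = -self.theta * self.x * self.dt + \
             self.sigma_ou * np.sqrt(self.dt) * self.rng.standard_normal(self.x.shape)
        self.x = self.x + dx
        gaussian = self.sigma_gauss * self.rng.standard_normal(self.x.shape)
        return self.ou_weight * self.x + (1.0 - self.ou_weight) * gaussian

noise = MixedExplorationNoise(action_dim=2)
action = np.clip(np.array([0.3, -0.1]) + noise.sample(), -1.0, 1.0)  # actor output + noise
```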


Subjects
Robotics, Algorithms, Computer Simulation, Long-Term Memory, Policies, Robotics/methods
2.
Sensors (Basel) ; 22(12)2022 Jun 17.
Article in English | MEDLINE | ID: mdl-35746364

ABSTRACT

As one of the main elements of reinforcement learning, the design of the reward function is often not given enough attention when reinforcement learning is applied to concrete problems, which leads to unsatisfactory performance. In this study, a reward function matrix is proposed for training various decision-making modes, with emphasis on decision-making styles and, further, on incentives and punishments. Additionally, we model the traffic scene as a graph to better represent the interaction between vehicles, and adopt a graph convolutional network (GCN) to extract features of the graph structure so that connected autonomous vehicles can make decisions directly. Furthermore, we combine the GCN with deep Q-learning and multi-step double deep Q-learning to train four decision-making modes, named the graph convolutional deep Q-network (GQN) and the multi-step double graph convolutional deep Q-network (MDGQN). In simulation, the superiority of the reward function matrix is shown by comparison with a baseline, and evaluation metrics are proposed to verify the performance differences among decision-making modes. Results show that, by adjusting the weight values in the reward function matrix, the trained decision-making modes can satisfy various driving requirements, including task completion rate, safety requirements, comfort level, and completion efficiency. Finally, the decision-making modes trained by MDGQN performed better in an uncertain highway exit scene than those trained by GQN.
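
The paper's exact reward terms and weights are not given in the abstract; the following sketch only illustrates the general idea of a reward function matrix, where each row is a decision-making style and each column weights an incentive or punishment term (all term names and numbers are hypothetical):

```python
import numpy as np

# Rows: decision-making styles; columns: per-step reward terms.
# Hypothetical term order: [task progress, collision, comfort (low jerk), efficiency]
REWARD_MATRIX = np.array([
    [1.0,  -5.0, 0.2, 0.5],   # "aggressive": efficiency-oriented, mild comfort weight
    [0.8, -10.0, 0.5, 0.3],   # "normal"
    [0.6, -20.0, 1.0, 0.1],   # "conservative": heavy collision punishment
])

def step_reward(style_idx, progress, collided, jerk, speed_ratio):
    """Scalar reward = weighted sum of incentive/punishment terms for one style."""
    terms = np.array([progress, float(collided), -abs(jerk), speed_ratio])
    return float(REWARD_MATRIX[style_idx] @ terms)

print(step_reward(style_idx=2, progress=0.4, collided=False, jerk=0.05, speed_ratio=0.7))
```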


Subjects
Automobile Driving, Reward, Benchmarking, Learning, Uncertainty
3.
Sensors (Basel) ; 21(19)2021 Sep 30.
Article in English | MEDLINE | ID: mdl-34640890

ABSTRACT

In recent years, machine learning for trading has been widely studied. Both the direction and the size of a position should be determined in trading decisions based on market conditions. However, no research so far has considered variable position sizes in models developed for trading purposes. In this paper, we propose a deep reinforcement learning model named LSTM-DDPG to make trading decisions with variable positions. Specifically, we model the trading process as a Partially Observable Markov Decision Process, in which a long short-term memory (LSTM) network is used to extract market state features and the deep deterministic policy gradient (DDPG) framework is used to decide the direction and variable size of the position. We test the LSTM-DDPG model on IF300 (index futures of the China stock market) data, and the results show that LSTM-DDPG with variable positions performs better in terms of return and risk than models with fixed or few-level positions. In addition, the investment potential of the model is better tapped by a reward based on the differential Sharpe ratio than by a profit-based reward function.
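
The abstract does not spell out its reward implementation; a minimal sketch of the standard differential Sharpe ratio reward (Moody-Saffell style, with moving first- and second-moment estimates and an assumed adaptation rate eta) is shown below:

```python
class DifferentialSharpeReward:
    """Per-step reward from the differential Sharpe ratio of trading returns."""

    def __init__(self, eta=0.01):
        self.eta = eta        # adaptation rate of the moving moment estimates (assumed)
        self.A = 0.0          # moving estimate of E[R]
        self.B = 0.0          # moving estimate of E[R^2]

    def __call__(self, ret):
        dA = ret - self.A
        dB = ret * ret - self.B
        denom = (self.B - self.A ** 2) ** 1.5
        d_sharpe = 0.0 if denom < 1e-12 else (self.B * dA - 0.5 * self.A * dB) / denom
        self.A += self.eta * dA                # update moments after computing the reward
        self.B += self.eta * dB
        return d_sharpe

reward_fn = DifferentialSharpeReward(eta=0.01)
for r in [0.002, -0.001, 0.003]:               # per-step trading returns
    print(reward_fn(r))
```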


Subjects
Investments, Long-Term Memory, Forecasting, Machine Learning, Policies
4.
Sensors (Basel) ; 21(8)2021 Apr 07.
Article in English | MEDLINE | ID: mdl-33916995

ABSTRACT

One of the critical challenges in deploying cleaning robots is achieving complete coverage of the target area. Current tiling robots for area coverage have fixed forms and are limited to cleaning only certain areas. A reconfigurable system is a creative answer to this optimal coverage problem: by reconfiguring into different shapes according to the area's needs, the tiling robot can cover the entire area. For the sequencing of navigation, it is essential to have a structure that allows the robot to extend its coverage range while saving energy, i.e., to cover larger areas completely with the fewest required actions. This paper presents complete path planning (CPP) for hTetran, a polyabolo-tiled robot, based on a TSP-based reinforcement learning optimization. The approach simultaneously produces robot shapes and sequential trajectories while maximizing the reward of the trained reinforcement learning (RL) model within the predefined polyabolo-based tileset. To this end, a reinforcement learning formulation of the traveling salesman problem (TSP) was trained with the proximal policy optimization (PPO) algorithm over the TSP sequencing. The resulting RL-TSP-based CPP for hTetran was compared, in terms of energy and time spent, with conventional tiled hypothetical models in which the TSP is solved through an evolutionary ant colony optimization (ACO) approach. The CPP demonstrates an ability to generate a Pareto-optimal trajectory that enhances the robot's navigation in the real environment with the least energy and time spent compared with the conventional techniques.
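
The state and action encoding used by the paper is not described in the abstract; as a minimal sketch of the kind of episode-level objective an RL-based TSP sequencer might maximize, the snippet below scores a visiting order of tile waypoints by its negative travel distance (a proxy for energy and time):

```python
import numpy as np

def tour_cost(waypoints, order):
    """Total Euclidean travel distance of visiting the tiles in the given order."""
    path = waypoints[np.asarray(order)]
    return float(np.linalg.norm(np.diff(path, axis=0), axis=1).sum())

def episode_reward(waypoints, order):
    # RL objective: maximize reward == minimize travel cost.
    return -tour_cost(waypoints, order)

rng = np.random.default_rng(0)
tiles = rng.uniform(0, 10, size=(8, 2))        # hypothetical centre points of 8 tile placements
random_order = rng.permutation(len(tiles))
print(episode_reward(tiles, random_order))
```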

5.
Int J Mol Sci ; 22(14)2021 Jul 14.
Article in English | MEDLINE | ID: mdl-34299139

ABSTRACT

Acupuncture affects the central nervous system via the regulation of neurotransmitter transmission. We previously showed that stimulation of the Shenmen (HT7) acupoint decreased cocaine-induced dopamine release in the nucleus accumbens. Here, we used the intracranial self-stimulation (ICSS) paradigm to evaluate whether HT7 stimulation regulates brain reward function in rats. We found that HT7 stimulation triggered a rightward shift of the frequency-rate curve and elevated ICSS thresholds. However, HT7 stimulation did not affect the threshold-lowering effects produced by cocaine. These results indicate that HT7 stimulation effectively regulates the ICSS thresholds of the medial forebrain bundle only in drug-naïve rats.


Subjects
Acupuncture Therapy/methods, Cocaine/administration & dosage, Electric Stimulation/methods, Medial Forebrain Bundle/physiology, Reward, Self Stimulation/physiology, Anesthetics, Local/administration & dosage, Animals, Male, Medial Forebrain Bundle/drug effects, Rats, Rats, Sprague-Dawley, Self Stimulation/drug effects
6.
Sensors (Basel) ; 20(19)2020 Oct 01.
Article in English | MEDLINE | ID: mdl-33019643

ABSTRACT

Autonomous driving based on artificial intelligence technology has been viewed as promising for autonomous vehicles hitting the road in the near future. In recent years, considerable progress has been made with Deep Reinforcement Learning (DRL) for realizing end-to-end autonomous driving. Still, driving safely and comfortably in real dynamic scenarios with DRL is nontrivial because the reward functions are typically pre-defined with expertise. This paper proposes a human-in-the-loop DRL algorithm for learning personalized autonomous driving behavior in a progressive way. Specifically, a progressively optimized reward function (PORF) learning model is built and integrated into the Deep Deterministic Policy Gradient (DDPG) framework, which is called PORF-DDPG in this paper. PORF consists of two parts: the first is a pre-defined typical reward function on the system state; the second is modeled as a Deep Neural Network (DNN) that represents the driving-adjustment intention of the human observer, which is the main contribution of this paper. The DNN-based reward model is progressively learned using front-view images as input and via active human supervision and intervention. The proposed approach is potentially useful for driving in dynamic constrained scenarios where dangerous collision events might occur frequently with classic DRL. The experimental results show that the proposed autonomous driving behavior learning method exhibits online learning capability and environmental adaptability.
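
The network architecture and training signal of PORF are not detailed in the abstract; the sketch below only illustrates the two-part structure described above, combining a pre-defined base reward with a learned image-conditioned correction (the CNN layout is an assumption, and in the paper the correction term would be fitted from human supervision rather than left untrained):

```python
import torch
import torch.nn as nn

class RewardCorrectionNet(nn.Module):
    """Small CNN mapping a front-view image to a scalar reward adjustment."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 1),
        )

    def forward(self, image):
        return self.net(image).squeeze(-1)

def porf_reward(base_reward, image, correction_net):
    """Pre-defined reward on the system state plus the learned human-intention term."""
    with torch.no_grad():
        return base_reward + correction_net(image).item()

net = RewardCorrectionNet()
frame = torch.rand(1, 3, 120, 160)             # dummy front-view frame
print(porf_reward(base_reward=0.35, image=frame, correction_net=net))
```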


Subjects
Artificial Intelligence, Automobile Driving, Neural Networks, Computer, Humans, Reward
7.
Sensors (Basel) ; 20(19)2020 Sep 25.
Article in English | MEDLINE | ID: mdl-32992750

ABSTRACT

This paper proposes a novel incremental training mode to address the problem of Deep Reinforcement Learning (DRL)-based path planning for a mobile robot. First, we evaluate related graph search algorithms and Reinforcement Learning (RL) algorithms in a lightweight 2D environment. Then, we design the DRL-based algorithm, including observation states, reward function, network structure, and parameter optimization, in the 2D environment to avoid the time-consuming work required in a 3D environment. We transfer the designed algorithm to a simple 3D environment for retraining to obtain converged network parameters, including the weights and biases of the deep neural network (DNN). Using these parameters as initial values, we continue training the model in a complex 3D environment. To improve the generalization of the model across different scenes, we propose combining the DRL algorithm Twin Delayed Deep Deterministic policy gradients (TD3) with the traditional global path planning algorithm Probabilistic Roadmap (PRM) as a novel path planner (PRM+TD3). Experimental results show that the incremental training mode can notably improve development efficiency, and that the PRM+TD3 path planner can effectively improve the generalization of the model.
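
The paper's PRM parameters are not given in the abstract; a minimal sketch of the global-planning half of PRM+TD3, under the assumption of uniform sampling and k-nearest-neighbour connections (the collision check is a placeholder, and the returned waypoints would then be tracked by the TD3 local policy), is:

```python
import numpy as np
import networkx as nx

def build_prm(n_samples, k, is_free, rng):
    """Probabilistic roadmap: sample free points, connect each to its k nearest neighbours."""
    points = []
    while len(points) < n_samples:
        p = rng.uniform(0.0, 10.0, size=2)
        if is_free(p):
            points.append(p)
    points = np.array(points)
    graph = nx.Graph()
    for i, p in enumerate(points):
        graph.add_node(i, pos=p)
        dists = np.linalg.norm(points - p, axis=1)
        for j in np.argsort(dists)[1:k + 1]:
            graph.add_edge(i, int(j), weight=float(dists[j]))  # straight-line edge; edge collision check omitted
    return graph, points

def waypoints(graph, points, start_idx, goal_idx):
    node_path = nx.shortest_path(graph, start_idx, goal_idx, weight="weight")
    return points[node_path]   # a TD3 local policy would drive the robot between these waypoints

rng = np.random.default_rng(1)
graph, pts = build_prm(n_samples=50, k=5, is_free=lambda p: True, rng=rng)
print(waypoints(graph, pts, 0, 10))
```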

8.
Int J Neuropsychopharmacol ; 20(5): 403-409, 2017 05 01.
Article in English | MEDLINE | ID: mdl-28031268

ABSTRACT

Background: Opioid and dopamine systems play crucial roles in reward. Similarities and differences in the neural mechanisms of reward that are mediated by these 2 systems have remained largely unknown. Thus, in the present study, we investigated differences in reward function in µ-opioid receptor knockout mice and dopamine transporter knockout mice, which lack molecules that are central to the opioid and dopamine systems, respectively. Methods: Mice were implanted with electrodes in the right lateral hypothalamus (LH). Mice were then trained to put their muzzle into the hole in the head-dipping chamber for intracranial electrical stimulation, and the influences of gene knockout were assessed. Results: Significant differences were observed between the opioid and dopamine systems in reward function. µ-Opioid receptor knockout mice exhibited enhanced intracranial electrical stimulation, which induced dopamine release. They also exhibited greater motility under conditions of "despair" in both the tail suspension test and the water wheel test. In contrast, dopamine transporter knockout mice maintained intracranial electrical stimulation responding even when more active efforts were required to obtain the reward. Conclusions: The absence of the µ-opioid receptor or the dopamine transporter did not abolish intracranial electrical stimulation responsiveness but rather differentially altered it. The present results in µ-opioid receptor knockout mice are consistent with a suppressive involvement of µ-opioid receptors in both positive incentive motivation associated with intracranial electrical stimulation and negative incentive motivation associated with depressive states. In contrast, the results in dopamine transporter knockout mice are consistent with the involvement of dopamine transporters in positive incentive motivation, especially its persistence. Differences in intracranial electrical stimulation in µ-opioid receptor and dopamine transporter knockout mice underscore the multidimensional nature of reward.


Subjects
Analgesics, Opioid/metabolism, Dopamine/metabolism, Hypothalamic Area, Lateral/drug effects, Hypothalamic Area, Lateral/metabolism, Receptors, Opioid, mu/deficiency, Animals, Biophysics, Dopamine Plasma Membrane Transport Proteins/deficiency, Dopamine Plasma Membrane Transport Proteins/genetics, Electric Stimulation, Mice, Mice, Inbred C57BL, Mice, Knockout, Motivation, Motor Activity/drug effects, Receptors, Opioid, mu/genetics, Reward, Self Administration, Time Factors
9.
Anim Cogn ; 20(3): 473-484, 2017 05.
Article in English | MEDLINE | ID: mdl-28102509

ABSTRACT

Optimal performance in temporal decisions requires the integration of timing uncertainty with environmental statistics such as probability or cost functions. Reward maximization under response deadlines constitutes one of the most stringent examples of these problems. The current study investigated whether and how mice can optimize their timing behavior under a response deadline in a complex experimental setting in which reward maximization required the integration of timing uncertainty with a geometrically increasing probability or decreasing cost function. Mice optimized their performance under seconds-long response deadlines when the underlying function was reward probability, and approached this level of performance when the underlying function was reward cost, but only under the assumption of logarithmically scaled subjective costs. The same subjects were then tested in a timed response inhibition task characterized by response rules that conflicted with the initial task: not responding earlier than a scheduled time, as opposed to not missing the deadline. Irrespective of their original test groups, mice optimized the timing of their inhibitory control in the second experiment. These results provide strong support for the ubiquity of optimal temporal risk assessment in mice.


Subjects
Behavior, Animal/physiology, Decision Making/physiology, Mice/physiology, Probability, Animals, Conditioning, Operant, Male, Mice, Inbred C57BL, Time Factors, Uncertainty
10.
BMC Neurosci ; 17(1): 70, 2016 10 28.
Article in English | MEDLINE | ID: mdl-27793098

ABSTRACT

BACKGROUND: Reinforcement learning is a fundamental form of learning that may be formalized using the Bellman equation. Accordingly, an agent determines the state value as the sum of the immediate reward and the discounted value of future states. The value of a state is thus determined by agent-related attributes (action set, policy, discount factor) and by the agent's knowledge of the environment, embodied by the reward function and by hidden environmental factors given by the transition probability. The central objective of reinforcement learning is to solve these two functions outside the agent's control, either with or without a model. RESULTS: In the present paper, using the proactive model of reinforcement learning, we offer insight into how the brain creates simplified representations of the environment and how these representations are organized to support the identification of relevant stimuli and actions. Furthermore, we identify neurobiological correlates of our model by suggesting that the reward and policy functions, attributes of the Bellman equation, are built by the orbitofrontal cortex (OFC) and the anterior cingulate cortex (ACC), respectively. CONCLUSIONS: Based on this, we propose that the OFC assesses cue-context congruence to activate the most relevant context frame. Furthermore, given the bidirectional neuroanatomical link between the OFC and model-free structures, we suggest that model-based input is incorporated into the reward prediction error (RPE) signal, and conversely that the RPE signal may be used to update the reward-related information of context frames and the policy underlying action selection in the OFC and ACC, respectively. Clinical implications for cognitive behavioral interventions are also discussed.
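
For reference, the state-value form of the Bellman equation that the abstract builds on can be written as follows (standard notation, not reproduced from the paper itself):

```latex
V^{\pi}(s) \;=\; \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,
\bigl[\, R(s, a, s') + \gamma\, V^{\pi}(s') \,\bigr]
```

Here π is the agent's policy, R the reward function, P the transition probability, and γ the discount factor — the agent-related and environment-related attributes that the authors map onto OFC and ACC function.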


Subjects
Cerebral Cortex/physiology, Cues, Models, Neurological, Models, Psychological, Reward, Animals, Association, Humans
11.
J Imaging Inform Med ; 2024 Jul 17.
Article in English | MEDLINE | ID: mdl-39020159

ABSTRACT

Large labeled datasets bring significant performance improvements, but acquiring labeled medical data is particularly challenging because annotation is laborious, time-consuming, and requires medical qualification. Semi-supervised learning has been employed to leverage unlabeled data; however, the quality and quantity of the annotated data have a great influence on the performance of a semi-supervised model. Selecting informative samples through active learning is therefore crucial and can improve model performance. We propose a unified semi-supervised active learning architecture (RL-based SSAL) that alternately trains a semi-supervised network and performs active sample selection. The semi-supervised model is first trained for sample selection, and the selected label-required samples are annotated and added to the previously labeled dataset for subsequent semi-supervised training. To learn to select the most informative samples, we adopt a policy-learning-based approach that treats sample selection as a decision-making process. A novel reward function based on the product of predictive confidence and uncertainty is designed, aiming to select samples with both high confidence and high uncertainty. Comparisons with a semi-supervised baseline on a collected lumbar disc herniation dataset demonstrate the effectiveness of the proposed RL-based SSAL, achieving improvements of over 3% across different amounts of labeled data. Comparisons with other active learning methods and ablation studies reveal the superiority of the proposed policy learning for active sample selection and of the reward function. Our model trained with only 200 labeled samples achieves an accuracy of 89.32%, comparable to the performance achieved with the entire labeled dataset, demonstrating its significant advantage.
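
The abstract does not define how confidence and uncertainty are measured; assuming softmax confidence and predictive entropy, a minimal sketch of a confidence-times-uncertainty selection score is:

```python
import numpy as np

def selection_reward(logits):
    """Score per sample: predictive confidence (max softmax prob) x uncertainty (entropy)."""
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    confidence = probs.max(axis=1)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return confidence * entropy

unlabeled_logits = np.random.default_rng(0).normal(size=(100, 4))   # 100 samples, 4 classes
scores = selection_reward(unlabeled_logits)
to_annotate = np.argsort(scores)[-10:]     # query the 10 highest-scoring samples for labels
print(to_annotate)
```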

12.
Biomimetics (Basel) ; 9(7)2024 Jun 25.
Article in English | MEDLINE | ID: mdl-39056825

ABSTRACT

In recent years, remotely controlling an unmanned aerial vehicle (UAV) to perform coverage search missions has become increasingly popular due to the advantages of UAVs, such as small size, high maneuverability, and low cost. However, due to the distance limitations of remote control and the endurance of a UAV, a single UAV cannot effectively perform a search mission over varied and complex regions. Thus, using a group of UAVs for coverage search missions has become a research hotspot in the last decade. In this paper, a differential evolution (DE)-based multi-UAV cooperative coverage algorithm is proposed to handle coverage tasks in different regions. In the proposed algorithm, named DECSMU, the entire coverage process is divided into many coverage stages. Before each coverage stage, every UAV automatically plans its flight path based on DE. To obtain a promising flight trajectory for a UAV, a dynamic reward function is designed to evaluate the quality of the planned path in terms of the coverage rate and the energy consumption of the UAV. In each coverage stage, information is exchanged between UAVs through a communication network, and distributed model predictive control is used to realize the collaborative coverage of multiple UAVs. The experimental results show that the strategy can achieve high coverage and a low energy consumption index under collision-avoidance constraints. The favorable performance of DECSMU on different regions also demonstrates its outstanding stability and generality.
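
DECSMU's encoding, constraints, and reward weights are not given in the abstract; purely as an illustration of scoring a candidate waypoint path by coverage rate minus an energy proxy and optimizing it with differential evolution, one could write (grid size, sensing radius, and penalty weight are all assumptions):

```python
import numpy as np
from scipy.optimize import differential_evolution

GRID = 20                      # coverage grid resolution for a 10 x 10 region
SENSOR_RADIUS = 1.5            # assumed sensing radius of the UAV

def coverage_rate(waypoints):
    ys, xs = np.mgrid[0:GRID, 0:GRID] * (10.0 / GRID)
    cells = np.stack([xs.ravel(), ys.ravel()], axis=1)
    d = np.linalg.norm(cells[:, None, :] - waypoints[None, :, :], axis=2)
    return float((d.min(axis=1) <= SENSOR_RADIUS).mean())

def fitness(flat):             # differential_evolution minimizes, so negate the reward
    wps = flat.reshape(-1, 2)
    path_len = np.linalg.norm(np.diff(wps, axis=0), axis=1).sum()
    return -(coverage_rate(wps) - 0.02 * path_len)   # coverage reward minus energy penalty

n_waypoints = 12
bounds = [(0.0, 10.0)] * (2 * n_waypoints)
result = differential_evolution(fitness, bounds, maxiter=50, seed=0, tol=1e-3)
print("coverage:", coverage_rate(result.x.reshape(-1, 2)))
```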

13.
Int J Neural Syst ; 34(7): 2450037, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38655914

ABSTRACT

Vision and proprioception have fundamental sensory mismatches in delivering locational information, and such mismatches are critical factors limiting the efficacy of motor learning. However, it is still not clear how, and to what extent, this mismatch limits motor learning outcomes. To further the understanding of the effect of sensory mismatch on motor learning outcomes, a reinforcement learning algorithm and a simplified biomechanical elbow joint model were employed to mimic the motor learning process in a computational environment. By applying a reinforcement learning algorithm to the motor learning of an elbow joint flexion task, the simulation results explained how visual-proprioceptive mismatch limits motor learning outcomes in terms of motor control accuracy and task completion speed. The larger the perceived angular offset between the two sensory modalities, the lower the motor control accuracy. Also, the more similar the peak reward amplitudes of the two sensory modalities, the lower the motor control accuracy. In addition, the simulation results suggest that an insufficient exploration rate limits task completion speed, while an excessive exploration rate limits motor control accuracy. Such a speed-accuracy trade-off shows that a moderate exploration rate could serve as another important factor in motor learning.


Subjects
Proprioception, Reinforcement, Psychology, Visual Perception, Humans, Proprioception/physiology, Visual Perception/physiology, Learning/physiology, Elbow Joint/physiology, Psychomotor Performance/physiology, Biomechanical Phenomena/physiology, Computer Simulation, Motor Activity/physiology
14.
Comput Biol Med ; 169: 107877, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38157774

ABSTRACT

Although existing deep reinforcement learning-based approaches have achieved some success in image augmentation tasks, their effectiveness and adequacy for data augmentation in intelligent medical image analysis are still unsatisfactory. Therefore, we propose a novel Adaptive Sequence-length based Deep Reinforcement Learning (ASDRL) model for Automatic Data Augmentation (AutoAug) in intelligent medical image analysis. The improvements of ASDRL-AutoAug are two-fold: (i) To remedy the problem of some augmented images being invalid, we construct a more accurate reward function based on different variations of the augmentation trajectories. This reward function assesses the validity of each augmentation transformation more accurately by introducing different information about the validity of the augmented images. (ii) Then, to alleviate the problem of insufficient augmentation, we further propose a more intelligent automatic stopping mechanism (ASM). ASM feeds a stop signal to the agent automatically by judging the adequacy of image augmentation. This ensures that each transformation before stopping the augmentation can smoothly improve the model performance. Extensive experimental results on three medical image segmentation datasets show that (i) ASDRL-AutoAug greatly outperforms the state-of-the-art data augmentation methods in medical image segmentation tasks, (ii) the proposed improvements are both effective and essential for ASDRL-AutoAug to achieve superior performance, and the new reward evaluates the transformations more accurately than existing reward functions, and (iii) we also demonstrate that ASDRL-AutoAug is adaptive for different images in terms of sequence length, as well as generalizable across different segmentation models.

15.
Front Psychiatry ; 14: 1093784, 2023.
Article in English | MEDLINE | ID: mdl-36896348

ABSTRACT

Objective: Internet gaming disorder (IGD) can seriously impair an individual's physical and mental health. However, unlike the majority of those suffering from substance addiction, individuals with IGD may recover without any professional intervention. Understanding the brain mechanisms of natural recovery from IGD may provide new insight into how to prevent addiction and implement more targeted interventions. Methods: Sixty individuals with IGD were scanned using resting-state fMRI to assess brain region changes associated with IGD. After 1 year, 19 individuals no longer met the IGD criteria and were considered recovered (RE-IGD), 23 still met the criteria (PER-IGD), and 18 left the study. Resting-state brain activity was compared between the 19 RE-IGD and 23 PER-IGD individuals using regional homogeneity (ReHo). Additionally, structural and cue-craving functional MRI data were collected to further support the resting-state results. Results: The resting-state fMRI results revealed that activity in brain regions responsible for reward and inhibitory control [including the orbitofrontal cortex (OFC), the precuneus, and the dorsolateral prefrontal cortex (DLPFC)] was decreased in PER-IGD individuals compared to RE-IGD individuals. In addition, significant positive correlations were found between mean ReHo values in the precuneus and self-reported craving scores for gaming, among both PER-IGD and RE-IGD individuals. Furthermore, similar brain structure and cue-craving differences were found between PER-IGD and RE-IGD individuals, specifically in regions associated with reward processing and inhibitory control (including the DLPFC, anterior cingulate gyrus, insula, OFC, precuneus, and superior frontal gyrus). Conclusion: These findings indicate that the brain regions responsible for reward processing and inhibitory control differ in PER-IGD individuals, which may have consequences for natural recovery. Our study provides neuroimaging evidence that spontaneous brain activity may influence natural recovery from IGD.

16.
Neural Netw ; 167: 104-117, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37647740

ABSTRACT

The implementation of robotic reinforcement learning is hampered by problems such as unspecified reward functions and high training costs. Many previous works have used cross-domain policy transfer to obtain a policy for the problem domain. However, these approaches require paired and aligned dynamics trajectories or other interactions with the environment. We propose a cross-domain dynamics alignment framework for problem-domain policy acquisition that can transfer a policy trained in the source domain to the problem domain. Our framework aims to learn dynamics alignment across two domains that differ in the agents' physical parameters (armature, rotation range, or torso mass) or the agents' morphologies (limbs). Most importantly, we learn dynamics alignment between the two domains using unpaired and unaligned dynamics trajectories. For these two scenarios, we propose a cross-physics-domain policy adaptation algorithm (CPD) and a cross-morphology-domain policy adaptation algorithm (CMD) based on our cross-domain dynamics alignment framework. To improve the performance of the policy in the source domain so that a better policy can be transferred to the problem domain, we propose the Boltzmann TD3 (BTD3) algorithm. We conduct diverse experiments on continuous control domains to demonstrate the performance of our approaches. Experimental results show that our approaches obtain better policies and higher rewards for the agents in the problem domains, even when the problem-domain dataset is small.
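
The abstract names Boltzmann TD3 without detailing it; the snippet below only illustrates the generic Boltzmann (softmax) rule that the name suggests, sampling among candidate actions in proportion to exp(Q/temperature) — the candidate generator, temperature, and critic stand-in are all placeholders:

```python
import numpy as np

def boltzmann_select(q_values, temperature=0.5, rng=None):
    """Sample an action index with probability proportional to exp(Q / temperature)."""
    rng = rng or np.random.default_rng()
    z = (q_values - q_values.max()) / temperature
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(q_values), p=probs)

rng = np.random.default_rng(0)
candidate_actions = rng.uniform(-1, 1, size=(8, 2))       # e.g. perturbed actor outputs
q_values = -np.linalg.norm(candidate_actions, axis=1)      # stand-in for critic scores
chosen = candidate_actions[boltzmann_select(q_values, temperature=0.3, rng=rng)]
print(chosen)
```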


Subjects
Algorithms, Learning, Physics, Policies, Reinforcement, Psychology
17.
J Cheminform ; 15(1): 120, 2023 Dec 13.
Article in English | MEDLINE | ID: mdl-38093324

ABSTRACT

Developing compounds with novel structures is important for the production of new drugs. From an intellectual property perspective, confirming the patent status of newly developed compounds is essential, particularly for pharmaceutical companies. Recent advances in artificial intelligence (AI) have made it possible to generate large numbers of compounds. However, confirming the patent status of these generated molecules has been a challenge because there are no free and easy-to-use tools that can determine, in a timely manner, the novelty of generated compounds in terms of patents, and there are no appropriate reference databases for pharmaceutical patents. In this study, two public databases, SureChEMBL and Google Patents Public Datasets, were used to create a reference database of drug-related patented compounds using the international patent classification. An exact structure search system was constructed using InChIKey and a relational database system to rapidly search for compounds in the reference database. Because drug-related patented compounds are a good source for generative AI to learn useful chemical structures, they were used as the training data. Furthermore, molecule generation was successfully directed by increasing or decreasing the number of generated patented compounds through incorporation of patent status (i.e., patented or not) into learning. The use of patent status enabled the generation of novel molecules with high drug-likeness. Generation using generative AI with patent information should help to efficiently propose novel compounds in terms of pharmaceutical patents. Scientific contribution: In this study, a new molecule-generation method was developed that takes into account the patent status of molecules, a feature that has rarely been considered but is important in drug discovery. The method enables the generation of novel molecules with high drug-likeness based on pharmaceutical patents and will help in the efficient development of effective drug compounds.
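
The paper's database schema is not described in the abstract; as a minimal sketch of the general mechanism of exact-structure lookup keyed on InChIKey, one could use RDKit with SQLite as below (the table and column names are made up, and aspirin stands in for a patented reference compound):

```python
import sqlite3
from rdkit import Chem

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patented (inchikey TEXT PRIMARY KEY, smiles TEXT)")

def add_reference(smiles):
    mol = Chem.MolFromSmiles(smiles)
    conn.execute("INSERT OR IGNORE INTO patented VALUES (?, ?)",
                 (Chem.MolToInchiKey(mol), smiles))

def is_patented(smiles):
    """Exact-structure check: hash the query molecule to an InChIKey and look it up."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    key = Chem.MolToInchiKey(mol)
    return conn.execute("SELECT 1 FROM patented WHERE inchikey = ?",
                        (key,)).fetchone() is not None

add_reference("CC(=O)Oc1ccccc1C(=O)O")           # aspirin as a stand-in reference compound
print(is_patented("OC(=O)c1ccccc1OC(C)=O"))       # same structure, different SMILES -> True
print(is_patented("c1ccccc1"))                    # benzene -> False
```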

18.
Comput Biol Med ; 164: 107253, 2023 09.
Article in English | MEDLINE | ID: mdl-37536094

ABSTRACT

Spike sorting is the basis for analyzing spike firing patterns encoded in high-dimensional information spaces. Because high-density microelectrode arrays record multiple neurons simultaneously, the collected data often suffer from two problems: a small number of overlapping spikes and different neuronal firing rates, both of which belong to the multi-class imbalance problem. Since deep reinforcement learning (DRL) can assign targeted attention to categories through reward functions, we propose ImbSorter to perform spike sorting under multi-class imbalance. We describe spike sorting as a Markov sequence decision and construct a dynamic reward function (DRF) to improve the sensitivity of the agent to minor classes based on the inter-class imbalance ratios. The agent is eventually guided by the optimal strategy to classify spikes. We consider the Wave_Clus dataset, which contains overlapping spikes and diverse noise levels, and a macaque dataset, which has a multi-scale imbalance. ImbSorter is compared with classical DRL architectures, traditional machine learning algorithms, and advanced overlapping-spike sorting techniques on these two datasets. ImbSorter obtained improved results on Macro_F1. The results show that ImbSorter has a promising ability to resist overlapping and noise interference, with high stability and promising performance in processing spikes with different degrees of skewed distribution.
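
ImbSorter's exact DRF is not given in the abstract; purely to illustrate rewarding correct predictions in inverse proportion to class frequency, a sketch might look like the following (counts and the symmetric penalty are assumptions):

```python
import numpy as np

def dynamic_rewards(class_counts):
    """Per-class reward for a correct prediction: majority count / class count."""
    counts = np.asarray(class_counts, dtype=float)
    return counts.max() / counts

def step_reward(true_cls, pred_cls, rewards):
    # Correct minor-class predictions earn more; mistakes are penalised symmetrically.
    return rewards[true_cls] if pred_cls == true_cls else -rewards[true_cls]

counts = [5000, 800, 120]                 # spikes per putative neuron, heavily imbalanced
rewards = dynamic_rewards(counts)         # -> [1.0, 6.25, 41.7]
print(step_reward(true_cls=2, pred_cls=2, rewards=rewards))
```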


Subjects
Neurons, Signal Processing, Computer-Assisted, Action Potentials/physiology, Neurons/physiology, Microelectrodes, Algorithms
19.
Neural Netw ; 167: 847-864, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37741067

ABSTRACT

Adversarial imitation learning (AIL) is a powerful method for automated decision systems because it trains a policy efficiently by mimicking expert demonstrations. However, implicit bias is present in the reward function of these algorithms, which leads to sample inefficiency. To solve this issue, an algorithm referred to as Mutual Information Generative Adversarial Imitation Learning (MI-GAIL) is proposed to correct the biases. In this study, we propose two guidelines for designing an unbiased reward function. Based on these guidelines, we shape the reward function from the discriminator by adding auxiliary information from a potential-based reward function. The primary insight is that the potential-based reward function provides more accurate rewards for the actions identified by the two guidelines. We compare our algorithm with state-of-the-art imitation learning algorithms on a family of continuous control tasks. Experimental results show that MI-GAIL is able to address the issue of bias in AIL reward functions and further improve sample efficiency and training stability.
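
In the standard potential-based shaping formulation, the shaping term is F(s, s') = γΦ(s') − Φ(s) added to the base (here discriminator-derived) reward, which leaves the optimal policy unchanged. A minimal sketch, with a hypothetical distance-to-goal potential:

```python
def shaped_reward(disc_reward, phi_s, phi_next, gamma=0.99):
    """Discriminator reward plus potential-based shaping F(s, s') = gamma * phi(s') - phi(s)."""
    return disc_reward + gamma * phi_next - phi_s

# Example: use negative distance-to-goal as the potential on two consecutive states.
phi = lambda dist_to_goal: -dist_to_goal
print(shaped_reward(disc_reward=0.12, phi_s=phi(3.0), phi_next=phi(2.4)))
```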


Subjects
Implicit Bias, Imitative Behavior, Learning, Algorithms, Policies
20.
Neural Netw ; 164: 419-427, 2023 Jul.
Article in English | MEDLINE | ID: mdl-37187108

ABSTRACT

Although reinforcement learning (RL) has made numerous breakthroughs in recent years, handling reward-sparse environments remains challenging and requires further exploration. Many studies improve the performance of the agent by introducing state-action pairs experienced by an expert. However, such strategies depend largely on the quality of the expert demonstration, which is rarely optimal in a real-world environment, and they struggle to learn from sub-optimal demonstrations. In this paper, a self-imitation learning algorithm based on task-space division is proposed to acquire efficient, high-quality demonstrations during the training process. To determine the quality of a trajectory, well-designed criteria are defined in the task space for finding a better demonstration. The results show that the proposed algorithm improves the success rate of robot control and achieves a high mean Q value per step. The proposed framework shows great potential for learning from demonstrations generated by the agent's own policy in sparse-reward environments, and can be used wherever the task space can be divided.
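
The paper's division criteria are not specified in the abstract; as a minimal sketch under the assumption that the task space is a goal space split into grid cells and that a higher episode return marks a "better demonstration", one could keep the best self-generated trajectory per cell:

```python
import numpy as np

class RegionDemoBuffer:
    """Keep the best self-generated trajectory for each cell of a divided task space."""

    def __init__(self, n_bins=4, low=-1.0, high=1.0):
        self.edges = np.linspace(low, high, n_bins + 1)
        self.best = {}                                   # region index -> (return, trajectory)

    def region_of(self, goal):
        return tuple(np.clip(np.digitize(goal, self.edges) - 1, 0, len(self.edges) - 2))

    def maybe_store(self, goal, trajectory, ep_return):
        key = self.region_of(goal)
        if key not in self.best or ep_return > self.best[key][0]:
            self.best[key] = (ep_return, trajectory)     # becomes the demonstration for that region

    def demonstration(self, goal):
        entry = self.best.get(self.region_of(goal))
        return None if entry is None else entry[1]

buf = RegionDemoBuffer()
buf.maybe_store(goal=np.array([0.2, -0.7]), trajectory=["(s0,a0)", "(s1,a1)"], ep_return=3.1)
print(buf.demonstration(np.array([0.25, -0.65])))
```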


Subjects
Algorithms, Artificial Intelligence, Reinforcement, Psychology, Reward