ABSTRACT
Clinicians and patients must make treatment decisions at a series of key decision points throughout disease progression. A dynamic treatment regime is a set of sequential decision rules that return treatment decisions based on accumulating patient information, such as that commonly found in electronic medical record (EMR) data. When applied to a patient population, an optimal treatment regime leads to the most favorable outcome on average. Identifying optimal treatment regimes that maximize residual life is especially desirable for patients with life-threatening diseases such as sepsis, a complex medical condition that involves severe infections with organ dysfunction. We introduce the residual life value estimator (ReLiVE), an estimator for the expected value of cumulative restricted residual life under a fixed treatment regime. Building on ReLiVE, we present a method for estimating an optimal treatment regime that maximizes expected cumulative restricted residual life. Our proposed method, ReLiVE-Q, conducts estimation via the backward induction algorithm Q-learning. We illustrate the utility of ReLiVE-Q in simulation studies, and we apply ReLiVE-Q to estimate an optimal treatment regime for septic patients in the intensive care unit using EMR data from the Multiparameter Intelligent Monitoring in Intensive Care (MIMIC) database. Ultimately, we demonstrate that ReLiVE-Q leverages accumulating patient information to estimate personalized treatment regimes that optimize a clinically meaningful function of residual life.
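As a rough illustration of the backward-induction Q-learning step that ReLiVE-Q builds on, the sketch below fits a two-stage regime with linear Q-functions on simulated data; the variable names, linear working models, and toy outcome are assumptions for exposition, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy two-stage Q-learning for a dynamic treatment regime (illustrative only).
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)                      # baseline covariate
a1 = rng.integers(0, 2, size=n)              # stage-1 treatment (randomized)
x2 = x1 + 0.5 * a1 + rng.normal(size=n)      # intermediate covariate
a2 = rng.integers(0, 2, size=n)              # stage-2 treatment (randomized)
y = x2 + a2 * (x2 > 0) + rng.normal(size=n)  # outcome, e.g. restricted residual life

# Stage 2: regress the outcome on history and treatment.
H2 = np.column_stack([x2, a2, a2 * x2])
q2 = LinearRegression().fit(H2, y)

# Optimal stage-2 rule: choose the treatment maximizing the fitted Q-function.
def v2(x2_val):
    return max(q2.predict([[x2_val, a, a * x2_val]])[0] for a in (0, 1))

# Stage 1: regress the stage-2 value ("pseudo-outcome") on stage-1 history.
pseudo = np.array([v2(v) for v in x2])
H1 = np.column_stack([x1, a1, a1 * x1])
q1 = LinearRegression().fit(H1, pseudo)
print("stage-1 Q-model coefficients:", q1.coef_)
```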
Subjects
Electronic Health Records, Humans, Sepsis/therapy, Statistical Models
ABSTRACT
The transition state (TS) on the potential energy surface (PES) plays a key role in determining the kinetics and thermodynamics of chemical reactions. Inspired by the fact that the dynamics of complex systems are driven by rare but significant transition events, we propose a TS search method based on the Q-learning algorithm. Appropriate reward functions are set for a given PES to optimize the reaction pathway through continuous trial and error, and the TS is then obtained from the optimized reaction pathway. The validity of this Q-learning method, with reasonable settings for the Q-value table (actions, states, learning rate, greedy rate, discount rate, and so on), is demonstrated on two two-dimensional potential functions. In applications of the Q-learning method to two chemical reactions, we show that it predicts transition states and reaction pathways consistent with those obtained from ab initio calculations. Notably, the PES must be well prepared before the Q-learning method is used, so a coarse-to-fine PES scanning scheme is introduced to save computational time while maintaining the accuracy of the Q-learning prediction. This work offers a simple and reliable Q-learning method to search for all possible transition states and reaction pathways of a chemical reaction, and it may be a new option for effectively exploring PESs in an extensive-search manner.
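A minimal tabular Q-learning loop of the kind described, run on a toy two-dimensional potential grid; the placeholder PES, the reward of negative energy change, and the learning/greedy/discount rates are illustrative assumptions, not the paper's settings.

```python
import numpy as np

# Toy 2D potential on a grid; the agent walks from a reactant basin to a product basin.
rng = np.random.default_rng(1)
N = 21
xs = np.linspace(-1.5, 1.5, N)
E = np.add.outer(xs**2, xs**2) - 1.2 * np.exp(-(np.subtract.outer(xs, xs))**2)  # placeholder PES

actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]        # moves on the grid
Q = np.zeros((N, N, len(actions)))
alpha, gamma, eps = 0.1, 0.95, 0.2                  # learning, discount, greedy rates
goal = (N - 2, N - 2)                               # product basin

for episode in range(2000):
    i, j = 1, 1                                     # reactant basin
    for _ in range(200):
        a = rng.integers(4) if rng.random() < eps else int(np.argmax(Q[i, j]))
        di, dj = actions[a]
        ni, nj = np.clip(i + di, 0, N - 1), np.clip(j + dj, 0, N - 1)
        done = (ni, nj) == goal
        r = -(E[ni, nj] - E[i, j]) + (10.0 if done else 0.0)  # penalize climbing in energy
        target = r + (0.0 if done else gamma * Q[ni, nj].max())
        Q[i, j, a] += alpha * (target - Q[i, j, a])
        i, j = ni, nj
        if done:
            break

# The greedy path through Q approximates the reaction pathway; its highest-energy
# point serves as the transition-state estimate.
```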
ABSTRACT
Sequential multiple assignment randomized trials (SMARTs) are the gold standard for estimating optimal dynamic treatment regimes (DTRs), but they are costly and require a large sample size. We introduce the multi-stage augmented Q-learning estimator (MAQE) to improve the efficiency of estimating optimal DTRs by augmenting SMART data with observational data. Our motivating example comes from the Back Pain Consortium, where one of the overarching aims is to learn how to tailor treatments for chronic low back pain to individual patient phenotypes, knowledge that is currently lacking clinically. The Consortium-wide collaborative SMART and the observational studies within the Consortium collect data on the same participant phenotypes, treatments, and outcomes at multiple time points, so the data sources can easily be integrated. We adapt previously published single-stage augmentation methods for integrating trial and observational study (OS) data to estimate optimal DTRs from SMARTs using Q-learning. Simulation studies show that the MAQE, which integrates phenotype, treatment, and outcome information from multiple studies over multiple time points, estimates the optimal DTR more accurately and attains a higher average value than a comparable Q-learning estimator without augmentation. We demonstrate that this improvement is robust to a wide range of trial and OS sample sizes, the addition of noise variables, and varying effect sizes.
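The following sketch conveys the augmentation idea in a single stage only: pool SMART and observational rows with source weights before fitting the stage Q-function. The published MAQE estimator is more involved; the simulated data, source weights, and decision rule below are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative single-stage augmentation: pool trial and observational rows,
# weighting each source, before fitting the stage Q-function.
rng = np.random.default_rng(2)

def simulate(n, confounded):
    x = rng.normal(size=n)
    p = 0.5 if not confounded else 1 / (1 + np.exp(-x))   # OS treatment depends on x
    a = rng.binomial(1, p, size=n)
    y = x + a * (1 + 0.5 * x) + rng.normal(size=n)
    return x, a, y

x_t, a_t, y_t = simulate(100, confounded=False)    # small SMART
x_o, a_o, y_o = simulate(1000, confounded=True)    # large observational study

x = np.concatenate([x_t, x_o]); a = np.concatenate([a_t, a_o]); y = np.concatenate([y_t, y_o])
w = np.concatenate([np.ones(100), np.full(1000, 0.5)])   # down-weight OS rows (assumed weights)

H = np.column_stack([x, a, a * x])
q = LinearRegression().fit(H, y, sample_weight=w)
rule = lambda xv: int(q.coef_[1] + q.coef_[2] * xv > 0)  # treat when the fitted effect is positive
print("treat a patient with x=0.4?", rule(0.4))
```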
Subjects
Computer Simulation, Low Back Pain, Observational Studies as Topic, Randomized Controlled Trials as Topic, Humans, Observational Studies as Topic/statistics & numerical data, Randomized Controlled Trials as Topic/statistics & numerical data, Low Back Pain/therapy, Sample Size, Treatment Outcome, Statistical Models, Biometry/methods
ABSTRACT
Research on dynamic treatment regimes has attracted extensive interest. Many methods have been proposed in the literature, but they are vulnerable to misclassification in covariates. In particular, although Q-learning has received considerable attention, its applicability to data with misclassified covariates is unclear. In this article, we investigate how ignoring misclassification in binary covariates can affect the determination of optimal decision rules in randomized treatment settings, and we demonstrate its deleterious effects on Q-learning through empirical studies. We present two correction methods to address the effects of misclassification on Q-learning. Numerical studies reveal that misclassification in covariates induces non-negligible estimation bias and that the correction methods successfully ameliorate the bias in parameter estimation.
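A toy demonstration of the problem the article studies: misclassifying a binary covariate attenuates the fitted treatment-covariate interaction in a Q-function regression. The data-generating model and error rates are assumed values, and the article's correction methods are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 5000
x = rng.binomial(1, 0.5, size=n)           # true binary covariate
a = rng.binomial(1, 0.5, size=n)           # randomized treatment
y = 1.0 * x + a * (2.0 * x - 1.0) + rng.normal(size=n)

# Observe x through a noisy channel with assumed sensitivity and specificity.
sens, spec = 0.8, 0.9
x_star = np.where(x == 1, rng.binomial(1, sens, n), rng.binomial(1, 1 - spec, n))

def interaction_coef(xcov):
    H = np.column_stack([xcov, a, a * xcov])
    return LinearRegression().fit(H, y).coef_[2]

print("true-covariate interaction coef: %.2f" % interaction_coef(x))       # near 2.0
print("misclassified interaction coef:  %.2f" % interaction_coef(x_star))  # attenuated
```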
Subjects
Clinical Decision Rules, Machine Learning, Humans
ABSTRACT
We present a trial design for sequential multiple assignment randomized trials (SMARTs) that uses a tailoring function instead of a binary tailoring variable, allowing for simultaneous development of the tailoring variable and estimation of dynamic treatment regimens (DTRs). We apply methods for developing DTRs from observational data: tree-based regression learning and Q-learning. We compare this design to a balanced randomized SMART with equal re-randomization probabilities and to a typical SMART design in which re-randomization depends on a binary tailoring variable and DTRs are analyzed with weighted and replicated regression. This project addresses a gap in clinical trial methodology by presenting SMARTs in which second-stage treatment is based on a continuous outcome, removing the need for a binary tailoring variable. We demonstrate that data from a SMART using a tailoring function can be used to efficiently estimate DTRs and that the design is more flexible under varying scenarios than a SMART using a binary tailoring variable.
Subjects
Randomized Controlled Trials as Topic, Humans, Randomized Controlled Trials as Topic/methods, Research Design, Statistical Models, Regression Analysis, Computer Simulation
ABSTRACT
The balance between exploration and exploitation is essential for decision-making. The present study investigated the role of ventromedial orbitofrontal cortex (vmOFC) glutamate neurons in value-based decision-making, first using optogenetics to manipulate vmOFC glutamate activity in rats during a probabilistic reversal learning (PRL) task. Rats that received vmOFC activation during informative feedback completed fewer reversals and exhibited reduced reward sensitivity relative to control rats. Analysis with a Q-learning computational model revealed that increased vmOFC activity did not affect the learning rate but instead promoted maladaptive exploration. By contrast, vmOFC inhibition increased the number of completed reversals and increased exploitative behavior. In a separate group of animals, calcium activity of vmOFC glutamate neurons was recorded using fiber photometry. Complementing the results above, we found that suppression of vmOFC activity during the latter part of rewarded trials was associated with improved PRL performance, greater win-stay responding, and a greater likelihood of selecting the correct choice on the next trial. These data demonstrate that excessive vmOFC activity during reward feedback disrupts value-based decision-making by increasing maladaptive exploration of lower-valued options. Our findings support the premise that pharmacological interventions that normalize aberrant vmOFC glutamate activity during reward feedback processing may attenuate deficits in value-based decision-making.
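For readers unfamiliar with this class of computational model, here is a minimal Q-learning agent with softmax (exploratory) choice of the kind commonly fit to PRL data; the learning rate, inverse temperature, and reward schedule below are illustrative values, not the study's fitted parameters.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, beta = 0.3, 5.0             # learning rate, inverse temperature (exploration)
p_reward = np.array([0.8, 0.2])    # option reward probabilities; reversed mid-task
Q = np.zeros(2)
correct = []

for t in range(200):
    if t == 100:
        p_reward = p_reward[::-1]                            # reversal
    p_choice = np.exp(beta * Q) / np.exp(beta * Q).sum()     # softmax choice rule
    c = rng.choice(2, p=p_choice)
    r = float(rng.random() < p_reward[c])
    Q[c] += alpha * (r - Q[c])                               # prediction-error update
    correct.append(c == np.argmax(p_reward))

print("accuracy before/after reversal: %.2f / %.2f"
      % (np.mean(correct[:100]), np.mean(correct[100:])))
```

Lowering beta makes choices more exploratory, which is the kind of parameter change the model-based analysis uses to characterize maladaptive exploration.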
Subjects
Prefrontal Cortex, Reward, Rats, Animals, Prefrontal Cortex/physiology, Reversal Learning/physiology, Glutamates, Decision Making/physiology
ABSTRACT
Autism Spectrum Disorder (ASD) is a neurodevelopmental condition that affects an individual's behavior, speech, and social interaction. Early and accurate diagnosis of ASD is pivotal for successful intervention. The limited availability of large datasets for neuroimaging investigations, however, poses a significant challenge to the timely and precise identification of ASD. To address this problem, we propose GARL, a novel approach for ASD diagnosis using neuroimaging data. GARL integrates generative adversarial networks (GANs) and deep Q-learning to augment limited datasets and enhance diagnostic precision. We utilized the Autism Brain Imaging Data Exchange (ABIDE) I and II datasets and employed a GAN to expand them, creating a more robust and diversified dataset for analysis. This approach not only captures the underlying sample distribution within ABIDE I and II but also employs deep reinforcement learning for continuous self-improvement, significantly enhancing the model's ability to generalize and adapt. Our experimental results confirmed that GAN-based data augmentation improved the performance of all prediction models on both datasets, with GARL's combination of InfoGAN and DQN yielding the most notable improvement.
Subjects
Autism Spectrum Disorder, Deep Learning, Neuroimaging, Humans, Autism Spectrum Disorder/diagnostic imaging, Neuroimaging/methods, Child, Neural Networks, Computer, Male, Brain/diagnostic imaging
ABSTRACT
With the rapid development of mobile edge computing (MEC) and wireless power transfer (WPT) technologies, WPT-MEC systems make it possible to provide high-quality data processing services to end users. In a real-world WPT-MEC system, however, the channel gain decreases with transmission distance, leading to a "double near-far effect" in the joint transmission of wireless energy and data, which degrades the quality of the data processing service for end users. Consequently, it is essential to design a reasonable system model to overcome the double near-far effect and to schedule multi-dimensional resources such as energy, communication, and computing so as to guarantee high-quality data processing services. First, this paper designs a relay-collaboration WPT-MEC resource scheduling model to improve wireless energy utilization efficiency; the optimization goal is to minimize a normalized combination of total communication delay and total energy consumption while meeting multiple resource constraints. Second, this paper employs a BK-means algorithm to cluster the end terminals, guaranteeing effective energy reception, and adapts the whale optimization algorithm with an adaptive mechanism (AWOA) for mobile-vehicle path planning to reduce energy waste. Third, this paper proposes an immune differential enhanced deep deterministic policy gradient (IDDPG) algorithm to realize efficient scheduling of the multiple resources and to minimize the optimization goal. Finally, simulation experiments on different data demonstrate the validity of the designed scheduling model and the proposed IDDPG algorithm.
ABSTRACT
With the exponential growth of wireless devices and the demand for real-time processing, traditional server architectures face challenges in meeting ever-increasing computational requirements. This paper proposes a collaborative edge computing framework to offload and process tasks efficiently in such environments. By equipping a moving unmanned aerial vehicle (UAV) as the mobile edge computing (MEC) server, the proposed architecture aims to relieve the burden on roadside unit (RSU) servers. Specifically, we propose a two-layer edge intelligence scheme to allocate network computing resources. The first layer intelligently offloads and allocates tasks generated by wireless devices in the vehicular system, and the second layer formulates the allocation of each processing node's (PN's) computing resources to different tasks as a partially observable stochastic game (POSG), solved by dueling deep Q-learning. Meanwhile, we propose a weighted position optimization algorithm for UAV movement in the system to facilitate task offloading and processing. Simulation results demonstrate the improved performance of the proposed scheme.
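A sketch of the dueling Q-network architecture named above, in PyTorch; the layer sizes are arbitrary, and the POSG-specific state encoding and training loop are omitted.

```python
import torch
import torch.nn as nn

class DuelingDQN(nn.Module):
    def __init__(self, state_dim=16, n_actions=8, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)           # state-value stream V(s)
        self.adv = nn.Linear(hidden, n_actions)     # advantage stream A(s, a)

    def forward(self, s):
        h = self.trunk(s)
        v, a = self.value(h), self.adv(h)
        # Combine streams; subtracting the mean advantage keeps the decomposition identifiable.
        return v + a - a.mean(dim=-1, keepdim=True)

net = DuelingDQN()
q_values = net(torch.randn(1, 16))
print(q_values.shape)   # torch.Size([1, 8]) -- one Q-value per allocation action
```

Separating the value and advantage streams tends to stabilize learning when many resource-allocation actions have similar values.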
ABSTRACT
Vehicular ad hoc networks (VANETs) use multiple channels under the wireless access in vehicular environments (WAVE) standards to provide a variety of vehicle-related applications. The current IEEE 802.11p WAVE channel structure comprises one control channel (CCH) and several service channels (SCHs). SCHs are used for non-safety data transmission, while the CCH is used for broadcasting beacons, control, and safety messages. WAVE devices alternate between the CCH and SCHs when transmitting data, and each channel is active for a duration called the CCH interval (CCHI) or SCH interval (SCHI), respectively. Currently, both intervals are fixed at 50 ms. However, fixed-length intervals cannot respond effectively to dynamically changing traffic loads, and when many vehicles simultaneously use the limited channel resources for data transmission, network performance degrades significantly due to numerous packet collisions. Herein, we propose an adaptive resource allocation technique for efficient data transmission that dynamically adjusts the SCHI and CCHI to improve network performance. Moreover, to reduce data collisions and optimize the network's backoff distribution, the proposed scheme applies reinforcement learning (RL) to provide an intelligent channel access algorithm. Simulation results demonstrate that the proposed scheme ensures high throughput and low transmission delay.
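A minimal sketch of how tabular Q-learning could pick the CCHI/SCHI split from the observed load, assuming a 100 ms sync interval; the load model, reward stub, and candidate splits are placeholders, not the paper's design.

```python
import numpy as np

rng = np.random.default_rng(9)
splits = [30, 40, 50, 60, 70]          # candidate CCHI lengths (ms) out of 100 ms
Q = np.zeros((3, len(splits)))         # states: low / medium / high safety-traffic load
alpha, gamma, eps = 0.1, 0.9, 0.1

def observe_load():
    return rng.integers(3)

def performance(load, cchi):
    """Stub: service throughput favors a short CCHI; safety delivery favors a long one."""
    schi = 100 - cchi
    return 0.01 * schi * (2 - load) + 0.01 * cchi * load - 0.002 * abs(cchi - 50)

s = observe_load()
for t in range(20000):
    a = rng.integers(len(splits)) if rng.random() < eps else int(np.argmax(Q[s]))
    r = performance(s, splits[a]) + rng.normal(scale=0.05)   # noisy observed reward
    s2 = observe_load()
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
    s = s2

print("preferred CCHI (ms) per load level:",
      [splits[int(np.argmax(Q[i]))] for i in range(3)])
```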
ABSTRACT
For a relativistic navigation system in which the position and velocity of a spacecraft are determined by observing relativistic perturbations, including stellar aberration and starlight gravitational deflection, a novel parallel Q-learning extended Kalman filter (PQEKF) is presented to implement measurement bias calibration. The relativistic perturbations are extracted from inter-star angle measurements obtained with a group of high-accuracy star sensors on the spacecraft. Inter-star angle measurement bias caused by misalignment of the star sensors is one of the main error sources in the relativistic navigation system. To suppress the unfavorable effect of measurement bias on navigation performance, the PQEKF estimates the position and velocity together with the calibration parameters, where the Q-learning approach is adopted to automatically fine-tune the process noise covariance matrix of the filter. The high performance of the presented method is illustrated via numerical simulations of medium Earth orbit (MEO) satellite navigation. The simulation results show that, for the considered MEO satellite and the presented PQEKF algorithm, with an inter-star angle measurement accuracy of about 1 mas, the positioning accuracy of the relativistic navigation system after calibration is better than 300 m.
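A conceptual sketch of the Q-learning layer that retunes a filter's process-noise covariance: the action is a scale factor on the covariance and the reward penalizes estimation error. The EKF itself is stubbed out here, and the state binning, candidate scales, and reward are assumptions, not the PQEKF design.

```python
import numpy as np

rng = np.random.default_rng(5)
scales = [0.1, 0.5, 1.0, 2.0, 5.0]    # candidate multipliers for the process-noise covariance
n_states = 3                          # coarse bins of innovation magnitude
Qtab = np.zeros((n_states, len(scales)))
alpha, gamma, eps = 0.2, 0.9, 0.1

def run_filter_step(scale):
    """Stub filter step: returns (innovation-magnitude bin, reward = -estimation error)."""
    err = abs(rng.normal(scale=1.0 / (0.5 + scale))) + 0.2 * scale
    return min(int(err), n_states - 1), -err

s, _ = run_filter_step(1.0)
for step in range(5000):
    a = rng.integers(len(scales)) if rng.random() < eps else int(np.argmax(Qtab[s]))
    s2, r = run_filter_step(scales[a])
    Qtab[s, a] += alpha * (r + gamma * Qtab[s2].max() - Qtab[s, a])
    s = s2

print("preferred noise scale per innovation bin:",
      [scales[int(np.argmax(Qtab[i]))] for i in range(n_states)])
```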
ABSTRACT
The transition to Industry 4.0 and 5.0 underscores the need to integrate humans into manufacturing processes, shifting the focus toward customization and personalization rather than traditional mass production. However, human performance during task execution may vary. To ensure high human-robot teaming (HRT) performance, it is crucial to predict performance without negatively affecting task execution. Performance can be predicted indirectly from significant factors affecting human performance, such as engagement and task load (i.e., the amount of cognitive, physical, and/or sensory resources required to perform a particular task). Hence, we propose a framework to predict and maximize HRT performance. For the prediction of task performance during the development phase, our methodology employs features extracted from physiological data as inputs. The labels for these predictions, categorized as accurate performance or inaccurate performance due to high/low task load, are carefully constructed using a combination of the NASA TLX questionnaire, records of human performance in quality control tasks, and Q-learning applied to derive task-specific weights for the task load indices. This structured approach enables the deployed model to rely exclusively on physiological data for predicting performance, achieving an accuracy of 95.45% in forecasting HRT performance. To maintain optimized HRT performance, this study further introduces a method for dynamically adjusting the robot's speed when performance is low. This strategic adjustment is designed to balance the task load effectively, thereby enhancing the efficiency of human-robot collaboration.
Subjects
Robotics, Task Performance and Analysis, Humans, Robotics/methods, Female, Male, Data Analysis, Man-Machine Systems, Adult, Workload
ABSTRACT
In heterogeneous wireless networked control systems (WNCSs), the age of information (AoI) of the actuation update and the actuation update cost are important performance metrics. To reduce the monetary cost, the control system can wait until a WiFi network becomes available to the actuator and then deliver the update over WiFi in an opportunistic manner, but this increases the AoI of the actuation update. In addition, since different control priorities impose different AoI requirements (i.e., robustness of the AoI of the actuation update), these must be considered when delivering the actuation update. To jointly consider the monetary cost and priority-aware AoI, this paper proposes a priority-aware actuation update scheme (PAUS) in which the control system decides whether to deliver or delay the actuation update to the actuator. For the optimal decision, we formulate a Markov decision process model and derive the optimal policy via Q-learning, aiming to maximize the average reward, which captures the balance between the monetary cost and the priority-aware AoI. Simulation results demonstrate that the PAUS outperforms comparison schemes in terms of average reward under various settings.
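A toy version of such a deliver-now versus wait decision learned with Q-learning; the state space, cost constants, priority weights, and WiFi-availability probability below are assumed values, not the PAUS formulation.

```python
import numpy as np

rng = np.random.default_rng(6)
MAX_AOI, PRIORITIES = 10, (1.0, 2.0)        # AoI cap; priority weights (assumed)
COST_CELL, P_WIFI = 1.0, 0.3                # cellular update cost; WiFi availability
Q = np.zeros((MAX_AOI + 1, 2, 2))           # state: (AoI, wifi up?); actions: wait / deliver
alpha, gamma, eps = 0.1, 0.9, 0.1
prio = 1                                    # consider the high-priority class

aoi, wifi = 0, 0
for step in range(50000):
    a = rng.integers(2) if rng.random() < eps else int(np.argmax(Q[aoi, wifi]))
    if a == 1:                              # deliver the actuation update now
        cost = 0.0 if wifi else COST_CELL   # WiFi is free; cellular is not
        reward = -cost - PRIORITIES[prio] * aoi
        next_aoi = 0
    else:                                   # wait for WiFi; AoI keeps growing
        reward = -PRIORITIES[prio] * aoi
        next_aoi = min(aoi + 1, MAX_AOI)
    next_wifi = int(rng.random() < P_WIFI)
    Q[aoi, wifi, a] += alpha * (reward + gamma * Q[next_aoi, next_wifi].max()
                                - Q[aoi, wifi, a])
    aoi, wifi = next_aoi, next_wifi

print("deliver when WiFi is up at AoI=3?", bool(np.argmax(Q[3, 1])))
```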
ABSTRACT
The transmission environment of underwater wireless sensor networks is open, and important transmission data can easily be intercepted, interfered with, and tampered with by malicious nodes. Malicious nodes can blend into the network and are difficult to distinguish, especially in time-varying underwater environments. To address this issue, this article proposes a GAN-based trusted routing algorithm (GTR). GTR defines the trust feature attributes and trust evaluation matrix of underwater network nodes, constructs a trust evaluation model based on a generative adversarial network (GAN), and detects malicious nodes by establishing a trust feature profile of trusted nodes, which improves detection performance for malicious nodes in underwater networks under unlabeled and imbalanced training data conditions. GTR combines the trust evaluation algorithm with a Q-learning-based adaptive routing algorithm to provide an optimal trusted data forwarding route for underwater network applications, improving the security, reliability, and efficiency of data forwarding. Because GTR relies on the trust feature profile of trusted nodes to distinguish malicious nodes and can adaptively select the forwarding route based on the status of trusted candidate next-hop nodes, it copes better with the changing underwater transmission environment and detects malicious nodes, especially unknown malicious node intrusions, more accurately than baseline algorithms. Simulation experiments showed that, compared to baseline algorithms, GTR provides better malicious node detection and data forwarding performance. With 15% malicious nodes and 10% unknown malicious nodes mixed in, the malicious node detection rate of an underwater network configured with GTR increased by 5.4%, the error detection rate decreased by 36.4%, the packet delivery rate increased by 11.0%, the energy tax decreased by 11.4%, and the network throughput increased by 20.4%.
ABSTRACT
To address traffic flow fluctuations caused by changes in traffic signal control schemes on tidal lanes, and to maintain smooth traffic operations, this paper proposes a method for controlling traffic signal transitions on tidal lanes. First, the proposed method designs an intersection overlap phase scheme based on the traffic flow conflict matrix in the tidal lane scenario, together with a fast and smooth transition method for key intersections based on the flow ratio. The control aims to equalize average queue lengths and minimize average vehicle delays across the different flow directions at the intersection. The study also analyzes various tidal lane scenarios based on the different opening states of the tidal lanes at related intersections; transitions of phase offsets are emphasized after a comprehensive analysis of transition time and smoothing characteristics. In addition, this paper proposes a coordinated method for tidal lanes that optimizes the phase offsets at arterial intersections for smooth and rapid transitions. The method uses deep Q-learning, a reinforcement learning algorithm, for optimal action selection (OSA) to develop an adaptive traffic signal transition control and enhance its efficiency. Finally, a simulation experiment using a traffic control interface validates the proposed approach, showing that the method yields smoother and faster traffic signal transitions across different tidal lane scenarios than the conventional method. Implementing this solution can benefit intersection groups by reducing traffic delays, improving traffic efficiency, and decreasing air pollution caused by congestion.
ABSTRACT
Task scheduling is a critical challenge in cloud computing systems, greatly impacting their performance. It is a nondeterministic polynomial-time hard (NP-hard) problem, which complicates the search for near-optimal solutions. Five major uncertainty parameters, i.e., security, traffic, workload, availability, and price, influence task scheduling decisions. The primary rationale for selecting these uncertainty parameters lies in the difficulty of measuring their values accurately, as empirical estimates often diverge from the actual values. The interval-valued Pythagorean fuzzy set (IVPFS) is a promising mathematical framework for dealing with such parametric uncertainties. The Dyna Q+ algorithm is an updated form of the Dyna Q agent designed specifically for dynamic computing environments, providing bonus rewards to unexploited states. In this paper, the Dyna Q+ agent is enriched with the IVPFS mathematical framework to make intelligent task scheduling decisions. The performance of the proposed IVPFS Dyna Q+ task scheduler is tested using the CloudSim 3.3 simulator. Execution time is reduced by 90%, makespan is reduced by 90%, operating cost is below 50%, and the resource utilization rate is improved by 95%, all meeting the desired standards. The results are further validated using an expected-value analysis methodology, which confirms the good performance of the task scheduler. A better balance between exploration and exploitation is achieved by the Dyna Q+ agent through rigorous action-based learning.
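A minimal Dyna-Q+ loop showing the exploration bonus for long-unvisited state-action pairs that distinguishes it from plain Dyna-Q; the stub environment, bonus coefficient, and planning budget are assumptions, and the IVPFS uncertainty modelling from the paper is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(7)
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
model = {}                                    # learned model: (s, a) -> (r, s')
last_tried = np.zeros((n_states, n_actions))  # timestep each pair was last taken
alpha, gamma, kappa, n_plan = 0.1, 0.95, 1e-3, 10

def env_step(s, a):
    """Stub environment: reward when the action matches the state, random transition."""
    return float(a == s % n_actions), rng.integers(n_states)

s = 0
for t in range(1, 20001):
    a = rng.integers(n_actions) if rng.random() < 0.1 else int(np.argmax(Q[s]))
    r, s2 = env_step(s, a)
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])   # direct RL update
    model[(s, a)] = (r, s2)
    last_tried[s, a] = t
    for _ in range(n_plan):                                  # planning on simulated experience
        (ps, pa), (pr, ps2) = list(model.items())[rng.integers(len(model))]
        bonus = kappa * np.sqrt(t - last_tried[ps, pa])      # Dyna-Q+ exploration bonus
        Q[ps, pa] += alpha * (pr + bonus + gamma * Q[ps2].max() - Q[ps, pa])
    s = s2
```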
ABSTRACT
The rapid development of 6G communications using terahertz (THz) electromagnetic waves has created a demand for highly sensitive THz nanoresonators capable of detecting these waves. Among the potential candidates, THz nanogap loop arrays show promising characteristics but require significant computational resources for accurate simulation. This requirement arises because their unit cells are 10 times smaller than millimeter wavelengths, with nanogap regions that are 1,000,000 times smaller. To address this challenge, we propose a rapid inverse design method using physics-informed machine learning, employing double deep Q-learning with an analytical model of the THz nanogap loop array. In ~39 h on a mid-range personal computer, our approach identifies the optimal structure through 200,000 iterations, achieving an experimental electric field enhancement of 32,000 at 0.2 THz, 300% stronger than prior results. Our analytical model-based approach significantly reduces the computational resources required, offering a practical alternative to numerical simulation-based inverse design for THz nanodevices.
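For context, the core of double deep Q-learning is the decoupled target below, in which the online network selects the next action and the target network evaluates it, curbing the overestimation bias of vanilla DQN. In this sketch the nanogap-loop analytical reward model is replaced by random placeholder tensors, and the tiny linear networks are stand-ins.

```python
import torch
import torch.nn as nn

gamma = 0.99
online, target = nn.Linear(8, 4), nn.Linear(8, 4)   # stand-ins for the two Q-networks
target.load_state_dict(online.state_dict())

s, r, s2 = torch.randn(32, 8), torch.randn(32), torch.randn(32, 8)  # placeholder batch
with torch.no_grad():
    a_star = online(s2).argmax(dim=1)                                   # online net picks the action
    q_next = target(s2).gather(1, a_star.unsqueeze(1)).squeeze(1)       # target net evaluates it
    y = r + gamma * q_next                                              # double-DQN target

a = torch.randint(0, 4, (32,))
q_pred = online(s).gather(1, a.unsqueeze(1)).squeeze(1)
loss = nn.functional.mse_loss(q_pred, y)
loss.backward()   # one gradient step on the online network would follow
```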
ABSTRACT
10-Hz repetitive transcranial magnetic stimulation of the left dorsolateral prefrontal cortex has been shown to increase dopaminergic activity in the dorsal striatum, a region strongly implicated in reinforcement learning. However, the behavioural influence of this effect remains largely unknown. We tested the causal effects of 10-Hz stimulation on behavioural and computational characteristics of reinforcement learning. A total of 40 healthy individuals were randomized into active and sham (placebo) stimulation groups. Each participant underwent one stimulation session (1500 pulses) in which stimulation was applied over the left dorsolateral prefrontal cortex using a robotic arm. Participants then completed a reinforcement learning task sensitive to striatal dopamine functioning. Participants' choices were modelled using a reinforcement learning model (Q-learning) that calculates separate learning rates associated with positive and negative reward prediction errors. Subjects receiving active stimulation exhibited an increased reward rate (number of correct responses per second of task activity) compared with those receiving sham stimulation. Computationally, although no group differences were observed, the active group displayed a higher learning rate for correct trials (αG) than for incorrect trials (αL). Finally, when tested with novel pairs of stimuli, the active group displayed extremely fast reaction times and a trend towards a higher reward rate. This study provides specific behavioural and computational accounts of altered striatal-mediated behaviour, particularly response vigour, induced by a proposed increase in dopamine activity from 10-Hz stimulation of the left dorsolateral prefrontal cortex. Together, these findings bolster the use of repetitive transcranial magnetic stimulation to target neurocognitive disturbances attributed to the dysregulation of dopaminergic-striatal circuits.
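The asymmetric update at the heart of such a model is compact: one learning rate applies when the reward prediction error is positive and another when it is negative. The parameter values below are illustrative, not the study's fitted estimates.

```python
# Q-learning update with separate learning rates for positive (alpha_g) and
# negative (alpha_l) reward prediction errors; values here are illustrative.
def update(q, reward, alpha_g=0.4, alpha_l=0.2):
    delta = reward - q                      # reward prediction error
    alpha = alpha_g if delta > 0 else alpha_l
    return q + alpha * delta

q = 0.0
for r in [1, 1, 0, 1, 0, 0, 1]:             # a toy sequence of reward outcomes
    q = update(q, r)
    print("updated value: %.3f" % q)
```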
Subjects
Dopamine, Transcranial Magnetic Stimulation, Humans, Adult, Dopamine/pharmacology, Reinforcement, Psychology, Learning/physiology, Reward, Prefrontal Cortex/physiology
ABSTRACT
In this work, the impact of implementing Deep Reinforcement Learning (DRL) in predicting the channel parameters for user devices in a Power Domain Non-Orthogonal Multiple Access system (PD-NOMA) is investigated. In the channel prediction process, DRL based on deep Q networks (DQN) algorithm will be developed and incorporated into the NOMA system so that this developed DQN model can be employed to estimate the channel coefficients for each user device in NOMA system. The developed DQN scheme will be structured as a simplified approach to efficiently predict the channel parameters for each user in order to maximize the downlink sum rates for all users in the system. In order to approximate the channel parameters for each user device, this proposed DQN approach is first initialized using random channel statistics, and then the proposed DQN model will be dynamically updated based on the interaction with the environment. The predicted channel parameters will be utilized at the receiver side to recover the desired data. Furthermore, this work inspects how the channel estimation process based on the simplified DQN algorithm and the power allocation policy, can both be integrated for the purpose of multiuser detection in the examined NOMA system. Simulation results, based on several performance metrics, have demonstrated that the proposed simplified DQN algorithm can be a competitive algorithm for channel parameters estimation when compared to different benchmark schemes for channel estimation processes such as deep neural network (DNN) based long-short term memory (LSTM), RL based Q algorithm, and channel estimation scheme based on minimum mean square error (MMSE) procedure.
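One way to cast channel estimation as a DQN problem, sketched under strong simplifying assumptions: the action is a quantized channel-gain estimate, the reward is the resulting sum rate, and a one-step (bandit-style) target is used. The rate stub, dimensions, and two-user power split are placeholders, not the paper's system model.

```python
import torch
import torch.nn as nn

levels = torch.linspace(0.05, 1.0, 20)      # candidate channel-gain magnitudes (actions)
net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, len(levels)))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def sum_rate(h_est, h_true, p=(0.8, 0.2), noise=0.1):
    """Stub two-user NOMA downlink sum rate computed with the estimated gain."""
    snr_weak = p[0] * h_est / (p[1] * h_est + noise)
    snr_strong = p[1] * h_true / noise
    return torch.log2(1 + snr_weak) + torch.log2(1 + snr_strong)

for step in range(1000):
    state = torch.rand(4)                   # pilot-derived features (placeholder)
    h_true = state.mean()                   # stand-in for the true channel gain
    q = net(state)
    a = torch.randint(len(levels), ()) if torch.rand(()) < 0.1 else q.argmax()
    reward = sum_rate(levels[a], h_true)
    loss = (q[a] - reward) ** 2             # one-step TD target (no bootstrapping)
    opt.zero_grad(); loss.backward(); opt.step()
```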
ABSTRACT
In this study, the influence of adopting reinforcement learning (RL) to predict the channel parameters for user devices in a power-domain multi-input single-output non-orthogonal multiple access (MISO-NOMA) system is examined. In the RL-based channel prediction approach, the Q-learning algorithm is developed and incorporated into the NOMA system so that the developed Q-model can predict the channel coefficients for every user device. The purpose of the developed Q-learning procedure is to maximize the received downlink sum rate and decrease the estimation loss. To this end, the developed Q-algorithm is initialized using different channel statistics and then updated through interaction with the environment in order to approximate the channel coefficients for each device. The predicted parameters are utilized at the receiver side to recover the desired data. Furthermore, based on maximizing the sum rate of the examined user devices, the power factors for each user are deduced analytically, allocating the optimal power factor to every user device in the system. In addition, this work examines how the channel prediction based on the developed Q-learning model and the power allocation policy can be incorporated together for multiuser detection in the examined MISO-NOMA system. Simulation results, based on several performance metrics, demonstrate that the developed Q-learning algorithm is competitive for channel estimation when compared with benchmark schemes such as deep-learning-based long short-term memory (LSTM), the RL-based actor-critic algorithm, the RL-based state-action-reward-state-action (SARSA) algorithm, and a standard channel estimation scheme based on the minimum mean square error procedure.
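Since the study benchmarks Q-learning against SARSA, the sketch below contrasts the two updates side by side: Q-learning bootstraps from the greedy next action (off-policy), while SARSA bootstraps from the action actually taken next (on-policy). The stub environment and integer states/actions are generic placeholders, not the MISO-NOMA channel model.

```python
import numpy as np

rng = np.random.default_rng(8)
nS, nA, alpha, gamma, eps = 5, 3, 0.1, 0.9, 0.1

def policy(Q, s):
    return rng.integers(nA) if rng.random() < eps else int(np.argmax(Q[s]))

def step(s, a):
    """Stub environment: reward peaks when the action matches the state."""
    return 1.0 - 0.3 * abs(a - s % nA), rng.integers(nS)

def train(on_policy):
    Q = np.zeros((nS, nA))
    s = 0
    a = policy(Q, s)
    for _ in range(10000):
        r, s2 = step(s, a)
        a2 = policy(Q, s2)
        if on_policy:   # SARSA: bootstrap from the action actually taken next
            target = r + gamma * Q[s2, a2]
        else:           # Q-learning: bootstrap from the greedy next action
            target = r + gamma * Q[s2].max()
        Q[s, a] += alpha * (target - Q[s, a])
        s, a = s2, a2
    return Q

print("Q-learning greedy actions:", train(False).argmax(axis=1))
print("SARSA greedy actions:     ", train(True).argmax(axis=1))
```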