Results 1 - 20 of 168
1.
J Neurosci ; 44(24)2024 Jun 12.
Article in English | MEDLINE | ID: mdl-38670805

ABSTRACT

Reinforcement learning is a theoretical framework that describes how agents learn to select options that maximize rewards and minimize punishments over time. We often make choices, however, to obtain symbolic reinforcers (e.g., money, points) that are later exchanged for primary reinforcers (e.g., food, drink). Although symbolic reinforcers are ubiquitous in our daily lives and widely used in laboratory tasks because they can be motivating, the mechanisms by which they become motivating are less well understood. In the present study, we examined how monkeys learn to make choices that maximize fluid rewards through reinforcement with tokens. The question addressed here is how the value of a state, which is a function of multiple task features (e.g., the current number of accumulated tokens, choice options, task epoch, trials since the last delivery of a primary reinforcer, etc.), affects motivation. We constructed a Markov decision process model that computes the value of task states given task features and then correlated these values with the motivational state of the animal. Fixation times, choice reaction times, and abort frequency were all significantly related to the values of task states during the tokens task (n = 5 monkeys, three males and two females). Furthermore, the model makes predictions for how neural responses could change on a moment-by-moment basis relative to changes in state value. Together, this task and model allow us to capture learning and behavior related to symbolic reinforcement.
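For intuition, the state-value idea can be illustrated with a toy version of the token task. The sketch below is not the authors' model: it assumes a simplified MDP whose state is just the number of accumulated tokens, a fixed probability of a correct choice, and an exchange of six tokens for one fluid reward.

```python
import numpy as np

# Toy token-task MDP: state = accumulated tokens (0..5); earning the 6th
# token triggers exchange for a fluid reward and a reset to 0 tokens.
# Assumed parameters (not from the paper): P_CORRECT, discount GAMMA.
N_TOKENS, GAMMA, P_CORRECT = 6, 0.95, 0.8
V = np.zeros(N_TOKENS)  # V[s] = value of holding s tokens

for _ in range(500):  # iterate the Bellman equation to a fixed point
    V_new = np.empty_like(V)
    for s in range(N_TOKENS):
        if s == N_TOKENS - 1:
            v_correct = 1.0 + GAMMA * V[0]   # set complete: reward, reset
        else:
            v_correct = GAMMA * V[s + 1]     # one more token, no reward yet
        v_error = GAMMA * V[s]               # errors leave the count unchanged
        V_new[s] = P_CORRECT * v_correct + (1 - P_CORRECT) * v_error
    V = V_new

print(V)  # value rises with token count, mirroring rising motivation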


Subjects
Choice Behavior , Macaca mulatta , Motivation , Reinforcement, Psychology , Reward , Animals , Motivation/physiology , Male , Choice Behavior/physiology , Reaction Time/physiology , Markov Chains , Female
2.
Catheter Cardiovasc Interv ; 104(1): 84-91, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38639136

ABSTRACT

Cardiovascular devices are essential for the treatment of cardiovascular diseases, including cerebrovascular, coronary, valvular, congenital, peripheral vascular, and arrhythmic diseases. The regulation and surveillance of vascular devices in real-world practice, however, present challenges during each individual product's life cycle. Four examples illustrate recent challenges and questions regarding safety, appropriate use, and efficacy arising from FDA-approved devices used in real-world practice. We outline potential pathways by which providers, regulators, and payors could deliver high-quality cardiovascular care, identify safety signals, ensure equitable device access, and study potential issues with devices in real-world practice.


Subjects
Device Approval , Product Surveillance, Postmarketing , Humans , United States , Risk Factors , Patient Safety , United States Food and Drug Administration , Risk Assessment , Vascular Access Devices , Endovascular Procedures/instrumentation , Endovascular Procedures/adverse effects , Equipment Design , Cardiovascular Diseases/therapy , Cardiovascular Diseases/diagnosis
3.
Article in English | MEDLINE | ID: mdl-38836923

ABSTRACT

Forty percent of diabetics will develop chronic kidney disease (CKD) in their lifetimes. However, as many as 50% of these CKD cases may go undiagnosed. We developed screening recommendations stratified by age, previous test history, and race and gender group for individuals with diagnosed diabetes and unknown proteinuria status. To do this, we used a Partially Observable Markov Decision Process (POMDP) to identify whether a patient should be screened at each three-month interval from ages 30 to 85. Model inputs were drawn from nationally representative datasets, the medical literature, and a microsimulation that integrates this information into group-specific disease progression rates. We implement the POMDP solution policy in the microsimulation to understand how this policy may impact health outcomes, and we generate an easily implementable, non-belief-based approximate policy for easier clinical interpretability. We found that the status quo policy, which is to screen annually for all ages and races, is suboptimal for maximizing expected discounted future net monetary benefits (NMB). The POMDP policy suggests more frequent screening after age 40 in all race and gender groups, with screenings 2-4 times a year for ages 61-70. Black individuals are recommended for screening more frequently than their White counterparts. This policy would increase NMB over the status quo policy by $1,000 to $8,000 per diabetic patient at a willingness-to-pay of $150,000 per quality-adjusted life year (QALY).
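The belief machinery behind such a POMDP can be sketched as a Bayesian update of the probability that a patient has undiagnosed proteinuria after an imperfect screen; the sensitivity, specificity, and onset numbers below are placeholders, not the paper's estimates.

```python
def update_belief(belief, test_positive, sens=0.9, spec=0.85):
    """Bayes update of P(proteinuria present) after one screening test."""
    if test_positive:
        num = sens * belief
        den = sens * belief + (1 - spec) * (1 - belief)
    else:
        num = (1 - sens) * belief
        den = (1 - sens) * belief + spec * (1 - belief)
    return num / den

def propagate(belief, p_onset=0.01):
    """Between screens, disease may silently develop (3-month step)."""
    return belief + (1 - belief) * p_onset

b = 0.10                                    # prior P(undiagnosed CKD)
b = update_belief(b, test_positive=False)   # negative screen lowers belief
for _ in range(4):                          # one year without screening
    b = propagate(b)
print(round(b, 4))   # screen again once belief re-crosses a threshold
```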

4.
Sensors (Basel) ; 24(13)2024 Jul 03.
Article in English | MEDLINE | ID: mdl-39001102

ABSTRACT

Visible light communication (VLC) is a promising complement to its radio frequency (RF) counterpart for meeting the high quality-of-service (QoS) requirements of intelligent vehicular communications by reusing LED street lights. In this paper, a hybrid handover scheme for vehicular VLC/RF communication networks is proposed to balance QoS against handover costs by considering vertical and horizontal handovers jointly, based on the vehicle's mobility state. A Markov decision process (MDP) is formulated to describe this hybrid handover problem, with a cost function that balances handover consumption, delay, and reliability. A value iteration algorithm is applied to solve for the optimal handover policy. Simulation results demonstrate the performance of the proposed hybrid handover scheme in comparison with other benchmark schemes.
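Value iteration for an MDP of this shape is standard; the sketch below uses randomly generated placeholder states, transitions, and costs rather than the paper's vehicular VLC/RF model.

```python
import numpy as np

# Illustrative MDP: S states (channel/mobility conditions), A actions
# (e.g., stay on VLC, stay on RF, hand over). P[a, s, s'] are transition
# probabilities; C[s, a] is the immediate cost (handover consumption +
# delay + unreliability terms). All values are placeholders.
rng = np.random.default_rng(0)
S, A, GAMMA = 6, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(A, S))   # rows sum to 1 over s'
C = rng.uniform(0.0, 1.0, size=(S, A))

V = np.zeros(S)
while True:  # value iteration on expected discounted cost
    Q = C + GAMMA * np.einsum("asn,n->sa", P, V)
    V_new = Q.min(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
policy = Q.argmin(axis=1)  # optimal handover decision per state
print(policy)
```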

5.
Sensors (Basel) ; 24(2)2024 Jan 07.
Article in English | MEDLINE | ID: mdl-38257450

ABSTRACT

In heterogeneous wireless networked control systems (WNCSs), the age of information (AoI) of the actuation update and the actuation update cost are important performance metrics. To reduce the monetary cost, the control system can wait for a WiFi network to become available to the actuator and then deliver the update over WiFi opportunistically, but this increases the AoI of the actuation update. In addition, different control priorities impose different AoI requirements (i.e., different robustness of the actuation-update AoI), which must be considered when delivering updates. To jointly consider the monetary cost and priority-aware AoI, this paper proposes a priority-aware actuation update scheme (PAUS) in which the control system decides whether to deliver or delay the actuation update to the actuator. For the optimal decision, we formulate a Markov decision process model and derive the optimal policy via Q-learning, maximizing an average reward that balances monetary cost against priority-weighted AoI. Simulation results demonstrate that PAUS outperforms the comparison schemes in terms of average reward under various settings.
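A tabular Q-learning loop for the deliver-or-delay decision might look like the following; the state encoding, reward weights, and WiFi-availability probability are illustrative assumptions.

```python
import random

# State: (aoi, wifi_available); actions: 0 = delay, 1 = deliver now.
# Reward trades the monetary cost of cellular delivery against
# priority-weighted AoI (weights below are illustrative).
ALPHA, GAMMA, EPS, PRIORITY, MAX_AOI, P_WIFI = 0.1, 0.9, 0.1, 2.0, 10, 0.3
Q = {(a, w): [0.0, 0.0] for a in range(MAX_AOI + 1) for w in (0, 1)}

def step(state, action):
    aoi, wifi = state
    if action == 1:                  # deliver: free on WiFi, paid on cellular
        cost = 0.0 if wifi else 1.0
        reward, aoi = -cost - PRIORITY * aoi, 0
    else:                            # delay: AoI keeps growing
        reward, aoi = -PRIORITY * aoi, min(aoi + 1, MAX_AOI)
    return reward, (aoi, int(random.random() < P_WIFI))

state = (0, 0)
for _ in range(50_000):
    a = random.randrange(2) if random.random() < EPS \
        else max((0, 1), key=lambda x: Q[state][x])
    r, nxt = step(state, a)
    Q[state][a] += ALPHA * (r + GAMMA * max(Q[nxt]) - Q[state][a])
    state = nxt

# Learned policy for low AoI levels when WiFi is available:
print([max((0, 1), key=lambda x: Q[(a, 1)][x]) for a in range(5)])
```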

6.
Ecol Lett ; 26(3): 398-410, 2023 Mar.
Article in English | MEDLINE | ID: mdl-36719341

ABSTRACT

Finding a common currency for benefits and hazards is a major challenge in optimal foraging theory, often requiring complex computational methods. We present a new analytic approach that builds on the Marginal Value Theorem and giving-up densities while incorporating the nonlinear effect of predation risk. We map the space of all possible environments into strategy regions, each corresponding to a discrete optimal strategy, providing a generalised quantitative measure of the trade-off between foraging rewards and hazards. This extends a classic optimal diet choice rule of thumb to incorporate the hazard of waiting for better resources to appear. We compare the dynamics of optimal decision-making for three foraging life-history strategies: one in which fitness accrues instantly, and two with delays before the fitness benefit is accrued. Foragers with delayed-benefit strategies are more sensitive to predation risk than to resource quality, as they stand to lose more fitness from a predation event than instant-accrual foragers.


Subjects
Feeding Behavior , Predatory Behavior , Animals , Diet
7.
BMC Med ; 21(1): 359, 2023 09 19.
Article in English | MEDLINE | ID: mdl-37726729

ABSTRACT

BACKGROUND: During the COVID-19 pandemic, a variety of clinical decision support systems (CDSS) were developed to aid patient triage. However, research focusing on the interaction between decision support systems and human experts is lacking. METHODS: Thirty-two physicians were recruited to rate the survival probability of 59 critically ill patients by means of chart review. Subsequently, one of two artificial intelligence systems advised the physician of a computed survival probability; only one of these systems explained the reasons behind its decision-making. In a third step, physicians reviewed the chart once again to determine the final survival probability rating. We hypothesized that the explaining system would exhibit a higher impact on the physicians' second rating (i.e., a higher weight-on-advice). RESULTS: The survival probability rating given by the physician after receiving advice from the clinical decision support system was a median of 4 percentage points closer to the advice than the initial rating. Weight-on-advice did not differ significantly (p = 0.115) between the two systems (with vs. without explanation of the decision), and showed no difference by time of day or between board-qualified and not-yet-board-qualified physicians. Self-reported post-experiment overall trust was awarded a median of 4 out of 10 points; asked after the conclusion of the experiment, overall trust was 5.5/10 (non-explaining system: median 4, IQR 3.5-5.5; explaining system: median 7, IQR 5.5-7.5; p = 0.007). CONCLUSIONS: Although overall trust in the models was low, the median (IQR) weight-on-advice was high (0.33 (0.0-0.56)) and in line with the published literature on expert advice. In contrast to the hypothesis, weight-on-advice was comparable between the explaining and non-explaining systems. In 30% of cases, weight-on-advice was 0, meaning the physician did not change their rating. The median of the remaining weight-on-advice values was 50%, suggesting that physicians either dismissed the recommendation or employed a "meeting halfway" approach. Newer technologies, such as clinical reasoning systems, may be able to augment the decision process rather than simply presenting unexplained advice.
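Weight-on-advice (WOA) is a standard judge-advisor metric; as a reminder of how the reported values arise, here is a minimal computation with made-up ratings.

```python
def weight_on_advice(initial, advice, final):
    """WOA = (final - initial) / (advice - initial); 0 = advice ignored,
    1 = advice fully adopted, 0.5 = 'meeting halfway'."""
    if advice == initial:
        return None  # undefined when the advice matches the prior rating
    return (final - initial) / (advice - initial)

# Hypothetical example: the physician first rates 40% survival, the CDSS
# advises 60%, and the revised rating is 50% -> WOA = 0.5.
print(weight_on_advice(initial=40, advice=60, final=50))  # 0.5
```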


Subjects
COVID-19 , Decision Support Systems, Clinical , Humans , Artificial Intelligence , COVID-19/diagnosis , Pandemics , Triage
8.
Health Care Manag Sci ; 26(1): 93-116, 2023 Mar.
Article in English | MEDLINE | ID: mdl-36284034

ABSTRACT

Preventing chronic diseases is an essential aspect of medical care. To prevent chronic diseases, physicians focus on monitoring their risk factors and prescribing the necessary medication. The optimal monitoring policy depends on the patient's risk factors and demographics. Monitoring too frequently may be unnecessary and costly; on the other hand, monitoring too infrequently means the patient may forgo needed treatment and experience adverse events related to the disease. We propose a finite-horizon, finite-state Markov decision process to define monitoring policies. To build our Markov decision process, we estimate stochastic models from longitudinal observational data in electronic health records for a large cohort of patients seen in the national U.S. Veterans Affairs health system. We use our model to study policies for whether and when to assess the need for cholesterol-lowering medications, and to investigate the role of gender and race in optimal monitoring policies.
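The finite-horizon structure lends itself to backward induction. The sketch below solves a generic monitor-or-wait problem; the risk dynamics and costs are placeholders, not the estimated VA models.

```python
import numpy as np

# States: discretized cholesterol-risk levels 0 (low) .. S-1 (high).
# Actions: 0 = wait, 1 = monitor (test, and treat if indicated).
# All transition and cost numbers are illustrative placeholders.
S, T = 5, 40                      # risk levels, quarterly periods
DRIFT = 0.15                      # chance risk worsens one level if untreated
C_MONITOR, C_EVENT = 1.0, 25.0    # monitoring cost vs expected event cost

V = np.zeros(S)                   # terminal value
policy = np.zeros((T, S), dtype=int)
for t in reversed(range(T)):      # backward induction over periods
    V_new = np.empty(S)
    for s in range(S):
        p_event = 0.02 * s        # event risk grows with risk level
        up = min(s + 1, S - 1)
        q_wait = p_event * C_EVENT + (1 - DRIFT) * V[s] + DRIFT * V[up]
        q_mon = C_MONITOR + 0.5 * p_event * C_EVENT + V[max(s - 1, 0)]
        policy[t, s], V_new[s] = (0, q_wait) if q_wait <= q_mon else (1, q_mon)
    V = V_new

print(policy[0])  # monitor only above a risk threshold in the first period
```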


Subjects
Anticholesteremic Agents , Cardiovascular Diseases , Humans , Cardiovascular Diseases/prevention & control , Risk Factors
9.
Health Care Manag Sci ; 26(1): 1-20, 2023 Mar.
Article in English | MEDLINE | ID: mdl-36044131

ABSTRACT

Alzheimer's Disease (AD) is believed to be the most common type of dementia. Even though screening for AD has been discussed widely, no country has implemented a screening program as part of policy. Current medical research motivates focusing on the preclinical stages of the disease in a modeling initiative. We develop a partially observable Markov decision process model to determine optimal screening programs. The model contains partially observable disease-free and preclinical AD states, and the screening decision is taken while an individual is in one of these states. An observable diagnosed preclinical AD state is integrated along with observable mild cognitive impairment, AD, and death states. Transition probabilities among states are estimated using data from the Knight Alzheimer's Disease Research Center (KADRC) and the relevant literature. With the objective of maximizing expected total quality-adjusted life years (QALYs), the output of the model is an optimal screening program that specifies at what points in time an individual over 50 years of age with a given risk of AD will be directed to undergo screening. The screening test used to diagnose preclinical AD is imperfect and carries a positive disutility; its sensitivity and specificity are estimated using the KADRC data set. We study the impact of a potential intervention with parameterized effectiveness and disutility on model outcomes for three risk profiles (low, medium, and high). When intervention effectiveness and disutility are at their best, the optimal screening policy is to screen every year between ages 50 and 95, with an overall QALY gain of 0.94, 1.9, and 2.9 for the low, medium, and high risk profiles, respectively. As intervention effectiveness diminishes and/or its disutility increases, the optimal policy changes to sporadic screening and then to never screening. Under several scenarios, some screening within the time horizon is optimal from a QALY perspective. Moreover, an in-depth analysis of costs reveals that implementing these policies is either cost-saving or cost-effective.
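Expected-QALY comparisons of this kind are typically computed by pushing a cohort through the Markov model under each policy; below is a compressed sketch with placeholder health states, utilities, and transition probabilities rather than the KADRC-based estimates.

```python
import numpy as np

# States: 0 = disease-free, 1 = preclinical AD, 2 = MCI/AD, 3 = dead.
# P is an annual transition matrix and u the QALY weight per state;
# all numbers are placeholders, not the paper's estimates.
P = np.array([[0.95, 0.03, 0.01, 0.01],
              [0.00, 0.90, 0.08, 0.02],
              [0.00, 0.00, 0.93, 0.07],
              [0.00, 0.00, 0.00, 1.00]])
u = np.array([1.0, 0.9, 0.6, 0.0])
GAMMA = 0.97                      # annual discount factor

def discounted_qalys(P, years=45):
    cohort = np.array([1.0, 0.0, 0.0, 0.0])  # everyone starts disease-free
    total = 0.0
    for t in range(years):
        total += (GAMMA ** t) * cohort @ u    # discounted QALYs this year
        cohort = cohort @ P                   # advance the cohort one year
    return total

base = discounted_qalys(P)
P_screen = P.copy()               # screening + intervention slows 1 -> 2
P_screen[1] = [0.00, 0.94, 0.04, 0.02]
print(round(discounted_qalys(P_screen) - base, 3))  # QALY gain of screening
```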


Subjects
Alzheimer Disease , Humans , Middle Aged , Aged , Aged, 80 and over , Alzheimer Disease/diagnosis , Sensitivity and Specificity , Cost-Benefit Analysis , Quality-Adjusted Life Years , Markov Chains
10.
Health Care Manag Sci ; 26(3): 430-446, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37084163

ABSTRACT

Contagious disease pandemics, such as COVID-19, can cause hospitals around the world to delay nonemergent elective surgeries, resulting in a large surgery backlog. To develop an operational solution for providing patients timely surgical care with limited health care resources, this study proposes a stochastic control process-based method that helps hospitals make operational recovery plans to clear their surgery backlog and restore surgical activity safely. The elective surgery backlog recovery process is modeled as a general discrete-time queueing network system, which is formulated as a Markov decision process. A scheduling optimization algorithm based on a piecewise decaying ε-greedy reinforcement learning algorithm is proposed to make dynamic daily surgery scheduling plans that consider newly arrived patients, waiting time, and clinical urgency. The proposed method is tested on a set of simulated datasets and implemented on the elective surgery backlog that built up in one large general hospital in China after the outbreak of COVID-19. The results show that, compared with the current policy, the proposed method can effectively and rapidly clear the surgery backlog caused by a pandemic while ensuring that all patients receive timely surgical care. These results encourage wider adoption of the proposed method to manage surgery scheduling during all phases of a public health crisis.
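The exploration schedule named here, a piecewise decaying ε-greedy rule, can be sketched as follows; the breakpoints and decay rates are illustrative, not the paper's settings.

```python
import random

def piecewise_epsilon(episode):
    """Exploration rate that decays in stages (illustrative breakpoints)."""
    if episode < 200:
        return 0.5                                # explore heavily at first
    if episode < 1000:
        return 0.5 * 0.997 ** (episode - 200)     # exponential decay stage
    return 0.01                                   # small floor for late exploration

def choose_patient(q_values, episode):
    """Pick the next patient to schedule: explore or exploit."""
    if random.random() < piecewise_epsilon(episode):
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

print([round(piecewise_epsilon(e), 3) for e in (0, 500, 2000)])
```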


Subjects
COVID-19 , Humans , Pandemics , SARS-CoV-2 , Elective Surgical Procedures , Hospitals
11.
Sensors (Basel) ; 23(3)2023 Jan 29.
Article in English | MEDLINE | ID: mdl-36772553

ABSTRACT

In this study, we develop a framework for an intelligent and self-supervised industrial pick-and-place operation in cluttered environments. Our target is to have the agent learn to perform prehensile and non-prehensile robotic manipulations to improve the efficiency and throughput of the pick-and-place task. To achieve this target, we specify the problem as a Markov decision process (MDP) and deploy a deep reinforcement learning (RL) temporal-difference model-free algorithm known as the deep Q-network (DQN). We consider three actions in our MDP: one is 'grasping' from the prehensile manipulation category, and the other two are 'left-slide' and 'right-slide' from the non-prehensile manipulation category. Our DQN is composed of three fully convolutional networks (FCN) based on the memory-efficient DenseNet-121 architecture, which are trained together without causing any bottleneck situations. Each FCN corresponds to one discrete action and outputs a pixel-wise map of affordances for that action. Rewards are allocated after every forward pass, and backpropagation is carried out for weight tuning in the corresponding FCN. In this manner, the agent learns non-prehensile manipulations that can, in turn, enable successful prehensile manipulations later on, and vice versa, increasing the efficiency and throughput of the pick-and-place task. The Results section shows performance comparisons of our approach to a baseline deep learning approach and a ResNet architecture-based approach, along with very promising test results at varying clutter densities across a range of complex scenario test cases.
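The action-selection step implied by this architecture, one pixel-wise Q-map per discrete action with the best (action, pixel) pair executed, reduces to an argmax over a stacked tensor. A sketch with random arrays standing in for the FCN outputs:

```python
import numpy as np

# Stand-ins for the three FCN outputs: one H x W affordance (Q-value)
# map per action. In the real system these come from DenseNet-121 FCNs.
ACTIONS = ["grasp", "left-slide", "right-slide"]
H = W = 224
rng = np.random.default_rng(0)
q_maps = rng.standard_normal((len(ACTIONS), H, W))

# Greedy primitive selection: best action AND best pixel jointly.
a, y, x = np.unravel_index(np.argmax(q_maps), q_maps.shape)
print(f"execute {ACTIONS[a]} at pixel ({y}, {x}), Q = {q_maps[a, y, x]:.3f}")
```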

12.
Sensors (Basel) ; 23(3)2023 Jan 31.
Article in English | MEDLINE | ID: mdl-36772585

ABSTRACT

Existing neural-network-based Direction of Arrival (DOA) estimation methods require large numbers of samples to achieve signal-scene adaptation and accurate angle estimation, and coherent-signal environments demand even more training data. In this paper, DOA estimation for coherent signals is converted into estimating the angle interval in which the incident signal lies, and accurate coherent DOA estimation from small samples is achieved with meta-reinforcement learning (MRL). The method models the angle-interval estimation of coherent signals as a Markov decision process. In the inner loop, a sequence-to-sequence (S2S) neural network represents the angle-interval feature sequence of the incident signal's DOA; by fully exploiting the contextual structure of the spatial spectrum sequence, the S2S network learns a policy for deciding whether an angle interval contains a signal under small-sample conditions. Following the optimal policy, the output sequence is determined step by step to give the angle interval of the incident signal, and the DOA is then obtained by a one-dimensional spectral peak search within that interval. Experiments show that, when a new signal environment appears, the S2S-based meta-reinforcement learning algorithm converges quickly to the optimal state by updating the S2S network parameters with only a small sample set.

13.
Sensors (Basel) ; 23(10)2023 May 17.
Article in English | MEDLINE | ID: mdl-37430735

ABSTRACT

This paper investigates the problem of buffer-aided relay selection to achieve reliable and secure communications in a two-hop amplify-and-forward (AF) network with an eavesdropper. Owing to the fading of wireless signals and the broadcast nature of wireless channels, signals transmitted over the network may be undecodable at the receiver or may be intercepted by eavesdroppers. Most available buffer-aided relay selection schemes consider either the reliability or the security of wireless communications; rarely do they address both. This paper proposes a buffer-aided relay selection scheme based on deep Q-learning (DQL) that considers both reliability and security. Through Monte Carlo simulations, we verify the reliability and security performance of the proposed scheme in terms of the connection outage probability (COP) and secrecy outage probability (SOP), respectively. The simulation results show that a two-hop wireless relay network can achieve reliable and secure communications using the proposed scheme. We also performed comparison experiments against two benchmark schemes; the results indicate that our scheme outperforms the max-ratio scheme in terms of the SOP.
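Both reported metrics can be estimated by straightforward Monte Carlo over fading realizations. The sketch below uses Rayleigh-faded SNRs and illustrative thresholds, not the paper's buffer-aided AF system model.

```python
import numpy as np

# COP: main-link capacity falls below the target rate. SOP: secrecy
# capacity log2(1+snr_main) - log2(1+snr_eve) falls below the target
# secrecy rate. All parameters are illustrative.
rng = np.random.default_rng(1)
N, R_TARGET, R_SECRECY = 100_000, 1.0, 0.5
snr_main = 10.0 * rng.exponential(size=N)  # Rayleigh fading -> exp. SNR
snr_eve = 3.0 * rng.exponential(size=N)

cop = np.mean(np.log2(1 + snr_main) < R_TARGET)
sop = np.mean(np.log2(1 + snr_main) - np.log2(1 + snr_eve) < R_SECRECY)
print(f"COP ~= {cop:.4f}, SOP ~= {sop:.4f}")
```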

14.
Entropy (Basel) ; 25(4)2023 Mar 29.
Article in English | MEDLINE | ID: mdl-37190372

ABSTRACT

Recent success stories in reinforcement learning have demonstrated that leveraging structural properties of the underlying environment is key to devising viable methods capable of solving complex tasks. We study off-policy learning in discounted reinforcement learning where some equivalence relation exists in the environment. We introduce a new model-free algorithm, called QL-ES (Q-learning with equivalence structure), which is a variant of (asynchronous) Q-learning tailored to exploit the equivalence structure in the MDP. We report a non-asymptotic PAC-type sample complexity bound for QL-ES, thereby establishing its sample efficiency. This bound also allows us to quantify the superiority of QL-ES over Q-learning analytically, showing that the theoretical gain in some domains can be massive. We report extensive numerical experiments demonstrating empirically that QL-ES converges significantly faster than (structure-oblivious) Q-learning, and implying that the empirical performance gain obtained by exploiting the equivalence structure can be massive even in simple domains. To the best of our knowledge, QL-ES is the first provably efficient model-free algorithm to exploit the equivalence structure in finite MDPs.
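The core idea, propagating each Q-update to every state-action pair in the same equivalence class, can be sketched as below. The class mapping is assumed given, and this is a schematic variant rather than the paper's exact QL-ES update.

```python
from collections import defaultdict

# eq_class maps (state, action) -> class id; pairs in one class share
# transition/reward structure, so one observed sample updates them all.
def make_ql_es(eq_class, n_states, n_actions, alpha=0.1, gamma=0.95):
    Q = defaultdict(float)
    members = defaultdict(list)          # class id -> list of (s, a)
    for s in range(n_states):
        for a in range(n_actions):
            members[eq_class[(s, a)]].append((s, a))

    def update(s, a, r, s_next):
        target = r + gamma * max(Q[(s_next, b)] for b in range(n_actions))
        for (s_eq, a_eq) in members[eq_class[(s, a)]]:
            Q[(s_eq, a_eq)] += alpha * (target - Q[(s_eq, a_eq)])
    return Q, update

# Toy usage: two states whose actions mirror each other share classes.
eq = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}
Q, update = make_ql_es(eq, n_states=2, n_actions=2)
update(0, 0, r=1.0, s_next=1)
print(Q[(1, 0)])   # updated too, despite never being visited
```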

15.
Ann Stat ; 50(6): 3364-3387, 2022 Dec.
Article in English | MEDLINE | ID: mdl-37022318

ABSTRACT

We consider the batch (offline) policy learning problem in the infinite-horizon Markov decision process. Motivated by mobile health applications, we focus on learning a policy that maximizes the long-term average reward. We propose a doubly robust estimator for the average reward and show that it achieves semiparametric efficiency. Further, we develop an optimization algorithm to compute the optimal policy in a parameterized stochastic policy class. The performance of the estimated policy is measured by the difference between the optimal average reward in the policy class and the average reward of the estimated policy, and we establish a finite-sample regret guarantee. The performance of the method is illustrated by simulation studies and an analysis of a mobile health study promoting physical activity.
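For intuition, the generic single-decision form of a doubly robust off-policy value estimate combines a fitted outcome model with an importance-weighted correction; the paper's average-reward estimator is more involved, but the basic construction looks like this (all data illustrative):

```python
import numpy as np

# Doubly robust estimate of a target policy's value from batch data
# (s_i, a_i, r_i) logged under a uniform behavior policy. q_hat is a
# fitted outcome model; rho are importance weights. Illustrative data.
rng = np.random.default_rng(2)
n = 10_000
s = rng.integers(0, 3, size=n)               # 3 states
a = rng.integers(0, 2, size=n)               # behavior: uniform over 2 actions
r = (s == a).astype(float) + 0.1 * rng.standard_normal(n)

q_hat = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])  # outcome model
pi_b = 0.5                                   # behavior propensity
pi_t = (a == s % 2).astype(float)            # deterministic target policy
rho = pi_t / pi_b                            # importance weights

# DR = model-based baseline + weighted correction on observed residuals.
baseline = q_hat[s, s % 2].mean()            # E_pi[q_hat(s, pi(s))]
correction = (rho * (r - q_hat[s, a])).mean()
print(round(baseline + correction, 4))
```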

16.
Health Care Manag Sci ; 25(3): 363-388, 2022 Sep.
Article in English | MEDLINE | ID: mdl-35687269

ABSTRACT

Depending on personal and hereditary factors, each woman has a different risk of developing breast cancer, one of the leading causes of death for women. For women at high risk of breast cancer, risk can be reduced by two main therapeutic approaches: 1) preventive treatments such as hormonal therapies (i.e., tamoxifen, raloxifene, exemestane); or 2) risk-reducing surgery (i.e., mastectomy). Existing national clinical guidelines either fail to incorporate, or make limited use of, the personal risk of developing breast cancer in their proposed risk reduction strategies. As a result, they do not provide enough resolution on the benefit-risk trade-off of an intervention policy as personal risk changes. In addressing this problem, we develop a discrete-time, finite-horizon Markov decision process (MDP) model with the objective of maximizing the patient's total expected quality-adjusted life years. We find several useful insights, some of which contradict the existing national breast cancer risk reduction recommendations. For example, we find that mastectomy is the optimal choice for borderline high-risk women between ages 22 and 38. Additionally, in contrast to the National Comprehensive Cancer Network recommendations, we find that exemestane is a plausible, and in fact the best, option for high-risk postmenopausal women.


Subjects
Breast Neoplasms , Adult , Breast Neoplasms/prevention & control , Female , Humans , Mastectomy , Policy , Risk Reduction Behavior , Tamoxifen/therapeutic use , Young Adult
17.
Risk Anal ; 42(7): 1585-1602, 2022 07.
Article in English | MEDLINE | ID: mdl-34651336

ABSTRACT

As climate change threatens to cause increasingly frequent and severe natural disasters, decisionmakers must consider costly investments to enhance the resilience of critical infrastructures. Evaluating these potential resilience improvements using traditional cost-benefit analysis (CBA) approaches is often problematic because disasters are stochastic and can destroy even hardened infrastructure, meaning that the lifetimes of investments are themselves uncertain. In this article, we develop a novel Markov decision process (MDP) model for CBA of infrastructure resilience upgrades that offer prevention (reduce the probability of a disaster) and/or protection (mitigate the cost of a disaster) benefits. Stochastic features of the model include disaster occurrences and whether or not a disaster terminates the effective life of an earlier resilience upgrade. From our MDP model, we derive analytical expressions for the decisionmaker's willingness to pay (WTP) to enhance infrastructure resilience, and conduct a comparative static analysis to investigate how the WTP varies with the fundamental parameters of the problem. Following this theoretical portion of the article, we demonstrate the applicability of our MDP framework to real-world decision making by applying it to two case studies of electric utility infrastructure hardening. The first case study considers elevating a flood-prone substation and the second assesses upgrading transmission structures to withstand high winds. Results from these two case studies show that assumptions about the value of lost load during power outages and the distribution of customer types significantly influence the WTP for the resilience upgrades and are material to the decisions of whether or not to implement them.
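The flavor of the analytical WTP expressions can be captured with a geometric-series calculation: discounted expected disaster losses with and without the upgrade, whose difference bounds the willingness to pay. This simple version ignores upgrade destruction, and all numbers are illustrative.

```python
# Expected discounted disaster cost when a disaster occurs each period
# with probability p and costs c, discounted by gamma:
#   E[cost] = p * c / (1 - gamma)   (geometric series over periods)
# An upgrade with prevention (p -> p') and protection (c -> c') benefits
# is worth at most the reduction in that expected cost.
def expected_cost(p, c, gamma=0.95):
    return p * c / (1 - gamma)

baseline = expected_cost(p=0.05, c=10e6)      # unhardened substation
upgraded = expected_cost(p=0.02, c=4e6)       # elevated / hardened
print(f"WTP <= ${baseline - upgraded:,.0f}")  # simple upper bound
```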


Subjects
Disaster Planning , Disasters , Climate Change , Cost-Benefit Analysis , Disaster Planning/methods , Floods
18.
Sensors (Basel) ; 22(8)2022 Apr 14.
Article in English | MEDLINE | ID: mdl-35458985

ABSTRACT

In distributed defense, multi-sensor networks must plan and schedule their sensing resources to achieve continuous, accurate, and rapid target detection. This paper proposes a multi-sensor cooperative scheduling model based on the partially observable Markov decision process (POMDP). Using the POMDP framework and the posterior Cramer-Rao lower bound, a multi-sensor cooperative scheduling model and optimization objective function are established. A beetle swarm optimization algorithm is used to improve the particle filter and thereby its tracking accuracy, and an improved elephant herding optimization algorithm is used to solve for the scheduling scheme, further improving the performance of the solution model. Simulation results show that the model solves the distributed multi-sensor cooperative scheduling problem well, achieves higher solution performance than other algorithms, and meets real-time requirements.
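One cycle of a plain bootstrap particle filter, the baseline the beetle swarm optimization is said to improve, can be sketched as follows; the constant-velocity model and noise levels are placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 1000                               # particles: [position, velocity]
particles = rng.normal(0.0, 1.0, size=(N, 2))
weights = np.full(N, 1.0 / N)

def pf_step(particles, weights, z, dt=1.0, q=0.1, r=0.5):
    """Predict with a constant-velocity model, weight by the range
    measurement z, then resample (one bootstrap PF cycle)."""
    particles[:, 0] += dt * particles[:, 1] + q * rng.standard_normal(N)
    particles[:, 1] += q * rng.standard_normal(N)
    lik = np.exp(-0.5 * ((z - particles[:, 0]) / r) ** 2)  # Gaussian likelihood
    weights = weights * lik
    weights /= weights.sum()
    idx = rng.choice(N, size=N, p=weights)                 # multinomial resample
    return particles[idx], np.full(N, 1.0 / N)

particles, weights = pf_step(particles, weights, z=0.7)
print(particles[:, 0].mean())          # posterior mean position estimate
```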


Subjects
Algorithms , Computer Simulation , Markov Chains
19.
Sensors (Basel) ; 22(21)2022 Oct 25.
Article in English | MEDLINE | ID: mdl-36365857

ABSTRACT

In this paper, to achieve reliable transmission in wireless sensor networks under intelligent malicious jamming, we propose a Distributed Anti-Jamming Algorithm (DAJA) based on an actor-critic method for a multi-agent system. A Multi-Agent Markov Decision Process (MAMDP) is introduced to model the anti-jamming communication process, and the multi-agent system learns the intelligent jamming behavior of the external environment using an actor-critic algorithm. While coping effectively with both external and internal factors, each sensor in the network selects appropriate channels for transmission, so that the system as a whole achieves optimal transmission within each unit time period. In the probabilistic intelligent jamming environment with tracking properties considered in this paper, simulations show that the proposed algorithm outperforms an algorithm based on joint Q-learning and a conventional scheme based on orthogonal frequency hopping in terms of transmission performance. In addition, the proposed algorithm completes both a strategy evaluation and an action selection update in each iteration, giving the system more efficient action selection and better adaptability through its interaction with the external environment, and hence better transmission and convergence performance.
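Each agent's two updates per iteration, a critic evaluation followed by an actor adjustment, can be sketched in tabular form with a softmax channel-selection policy; the jamming probabilities and learning rates are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
N_CHANNELS, ALPHA_V, ALPHA_PI, GAMMA = 4, 0.1, 0.05, 0.9
prefs = np.zeros(N_CHANNELS)   # actor: channel preferences (softmax policy)
v = 0.0                        # critic: value of the (single) network state

def policy():
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

for _ in range(5000):
    p = policy()
    ch = rng.choice(N_CHANNELS, p=p)
    jammed = rng.random() < (0.8 if ch == 0 else 0.1)  # channel 0 is tracked
    reward = 0.0 if jammed else 1.0
    td_error = reward + GAMMA * v - v      # critic evaluation ...
    v += ALPHA_V * td_error
    grad = -p                              # ... then actor update:
    grad[ch] += 1.0                        # d log pi(ch) / d prefs
    prefs += ALPHA_PI * td_error * grad

print(np.round(policy(), 3))  # probability mass shifts off the jammed channel
```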

20.
Sensors (Basel) ; 22(12)2022 Jun 09.
Article in English | MEDLINE | ID: mdl-35746162

ABSTRACT

This paper proposes a low-complexity algorithm for a reinforcement learning-based channel estimator for multiple-input multiple-output (MIMO) systems. The proposed channel estimator utilizes detected symbols to reduce the channel estimation error. However, the detected data symbols may include errors at the receiver owing to the characteristics of wireless channels; thus, the detected data symbols are only selectively used as additional pilot symbols. To this end, a Markov decision process (MDP) problem is defined to optimize the selection of the detected data symbols, and a reinforcement learning algorithm is developed to solve the MDP problem with computational efficiency. The developed algorithm derives the optimal policy in closed form by introducing backup samples and data subblocks to reduce latency and complexity. Simulation results show that the proposed channel estimator significantly reduces the minimum mean square error of the channel estimates, thus improving the block error rate compared with conventional channel estimation.
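The pilot-augmentation idea, re-estimating the channel by least squares over the pilots plus whichever detected symbols the policy selects, can be sketched for a toy MIMO block; the matrix sizes and the selection rule below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
NT, NR, N_PILOT, N_DATA, SNR = 2, 2, 4, 32, 10.0
qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)

# Random channel, transmitted block (pilots first, then data), and noise.
H = (rng.standard_normal((NR, NT)) + 1j * rng.standard_normal((NR, NT))) / np.sqrt(2)
X = qpsk[rng.integers(0, 4, size=(NT, N_PILOT + N_DATA))]
W = (rng.standard_normal((NR, X.shape[1]))
     + 1j * rng.standard_normal((NR, X.shape[1]))) / np.sqrt(2 * SNR)
Y = H @ X + W

def ls_estimate(Y, X):
    """Least-squares channel estimate: H_hat = Y X^H (X X^H)^+."""
    return Y @ X.conj().T @ np.linalg.pinv(X @ X.conj().T)

H_pilot = ls_estimate(Y[:, :N_PILOT], X[:, :N_PILOT])
# Suppose the learned policy selects the first half of the detected data
# symbols (assumed detected correctly here) as additional pilots.
sel = slice(0, N_PILOT + N_DATA // 2)
H_aug = ls_estimate(Y[:, sel], X[:, sel])

mse = lambda Hh: np.mean(np.abs(Hh - H) ** 2)
print(f"pilot-only MSE {mse(H_pilot):.4f} vs augmented MSE {mse(H_aug):.4f}")
```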


Subjects
Algorithms