ABSTRACT
Obsessive-compulsive disorder (OCD) and pathological gambling (PG) are accompanied by deficits in behavioural flexibility. In reinforcement learning, this inflexibility can reflect asymmetric learning from outcomes above and below expectations. In alternative frameworks, it reflects perseveration independent of learning. Here, we examine evidence for asymmetric reward-learning in OCD and PG by leveraging model-based functional magnetic resonance imaging (fMRI). Compared with healthy controls (HC), OCD patients exhibited a lower learning rate for worse-than-expected outcomes, which was associated with the attenuated encoding of negative reward prediction errors in the dorsomedial prefrontal cortex and the dorsal striatum. PG patients showed higher and lower learning rates for better- and worse-than-expected outcomes, respectively, accompanied by higher encoding of positive reward prediction errors in the anterior insula than HC. Perseveration did not differ considerably between the patient groups and HC. These findings elucidate the neural computations of reward-learning that are altered in OCD and PG, providing a potential account of behavioural inflexibility in those mental disorders.
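As a concrete illustration of the asymmetric learning rule examined here, the following minimal Python sketch uses a simple Rescorla-Wagner/Q-learning update with separate learning rates for better- and worse-than-expected outcomes; the parameter names and values (alpha_pos, alpha_neg) are illustrative assumptions rather than the fitted values from the study.

def asymmetric_update(q, reward, alpha_pos=0.4, alpha_neg=0.1):
    """Update a single action value with sign-dependent learning rates."""
    rpe = reward - q  # reward prediction error
    alpha = alpha_pos if rpe > 0 else alpha_neg
    return q + alpha * rpe

# A lower alpha_neg (as reported for the OCD group) slows unlearning after a run
# of worse-than-expected outcomes, one computational route to behavioural inflexibility.
q = 0.8
for r in (0.0, 0.0, 0.0):
    q = asymmetric_update(q, r)
print(round(q, 3))  # the value decays only slowly toward 0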
Subjects
Gambling, Obsessive-Compulsive Disorder, Humans, Reinforcement (Psychology), Reward, Obsessive-Compulsive Disorder/diagnostic imaging, Prefrontal Cortex/diagnostic imaging, Magnetic Resonance Imaging
ABSTRACT
From an associative perspective, the acquisition of new goal-directed actions requires the encoding of specific action-outcome (AO) associations and, therefore, sensitivity to the validity of an action as a predictor of a specific outcome relative to other events. Although competitive architectures have been proposed within associative learning theory to achieve this kind of identity-based selection, whether and how these architectures are implemented by the brain is still a matter of conjecture. To investigate this issue, we trained human participants to encode various AO associations while undergoing functional neuroimaging (fMRI). We then degraded one AO contingency by increasing the probability of the outcome in the absence of its associated action while keeping other AO contingencies intact. We found that this treatment selectively reduced performance of the degraded action. Furthermore, when a signal predicted the unpaired outcome, performance of the action was restored, suggesting that the degradation effect reflects competition between the action and the context for prediction of the specific outcome. We used a Kalman filter to model the contribution of different causal variables to AO learning and found that activity in the medial prefrontal cortex (mPFC) and the dorsal anterior cingulate cortex (dACC) tracked changes in the association of the action and context, respectively, with regard to the specific outcome. Furthermore, we found that the mPFC participated in a network with the striatum and posterior parietal cortex to segregate the influence of the various competing predictors to establish specific AO associations. SIGNIFICANCE STATEMENT: Humans and other animals learn the consequences of their actions, allowing them to control their environment in a goal-directed manner. Nevertheless, it is unknown how we parse environmental causes from the effects of our own actions to establish these specific action-outcome (AO) relationships. Here, we show that the brain learns the causal structure of the environment by segregating the unique influence of actions from other causes in the medial prefrontal and anterior cingulate cortices and, through a network of structures, including the caudate nucleus and posterior parietal cortex, establishes the distinct causal relationships from which specific AO associations are formed.
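The Kalman-filter treatment of action-outcome learning can be sketched as a linear-Gaussian weight update in which the action and the background context compete, through a shared prediction error, to predict the outcome. The noise variances and trial schedule below are illustrative assumptions, not the study's fitted model.

import numpy as np

def kalman_step(w, P, x, y, obs_var=1.0, drift_var=0.01):
    """One trial of Kalman-filter learning of predictor-to-outcome weights.
    w: mean weights; P: weight covariance; x: predictors (action, context); y: outcome."""
    P = P + drift_var * np.eye(len(w))     # weights may drift between trials
    k = P @ x / (x @ P @ x + obs_var)      # Kalman gain: credit shared across predictors
    w = w + k * (y - x @ w)                # update toward the prediction error
    P = P - np.outer(k, x) @ P
    return w, P

# Two predictors: the action (x[0]) and the always-present context (x[1]). Under
# contingency degradation the outcome also occurs when the action is withheld,
# so credit shifts from the action to the context.
w, P = np.zeros(2), np.eye(2)
for x, y in [([1, 1], 1), ([0, 1], 1)] * 50:
    w, P = kalman_step(w, P, np.array(x, float), y)
print(np.round(w, 2))  # the context weight approaches 1 while the action weight stays low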
Subjects
Cingulate Gyrus, Learning, Animals, Corpus Striatum, Humans, Magnetic Resonance Imaging, Parietal Lobe, Prefrontal Cortex, Problem-Based Learning
ABSTRACT
Adversarial examples are carefully crafted input patterns that are surprisingly poorly classified by artificial and/or natural neural networks. Here we examine adversarial vulnerabilities in the processes responsible for learning and choice in humans. Building upon recent recurrent neural network models of choice processes, we propose a general framework for generating adversarial opponents that can shape the choices of individuals in particular decision-making tasks toward the behavioral patterns desired by the adversary. We show the efficacy of the framework through three experiments involving action selection, response inhibition, and social decision-making. We further investigate the strategy used by the adversary in order to gain insights into the vulnerabilities of human choice. The framework may find applications across the behavioral sciences in helping to detect and avoid flawed choices.
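In the paper the adversary is trained against a recurrent neural network model of the subject's choice process; the toy below replaces both with much simpler stand-ins (a Q-learning "subject" and a greedy, budget-limited adversary) purely to illustrate how an opponent that controls outcomes can steer choices toward a target action. All parameters are assumptions for this sketch.

import numpy as np
rng = np.random.default_rng(0)

def steer(n_trials=200, alpha=0.3, beta=5.0, target=0):
    """Q-learning 'subject' versus an adversary that rewards the target action and
    withholds reward otherwise, subject to a fixed 50% reward budget."""
    q = np.zeros(2)
    budget = n_trials // 2              # total rewards the adversary may hand out
    target_choices = 0
    for _ in range(n_trials):
        p = np.exp(beta * q) / np.exp(beta * q).sum()   # softmax choice rule
        a = rng.choice(2, p=p)
        r = 1.0 if (a == target and budget > 0) else 0.0
        budget -= int(r)
        q[a] += alpha * (r - q[a])
        target_choices += int(a == target)
    return target_choices / n_trials

print(steer())  # fraction of choices pushed toward the adversary's target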
Subjects
Decision Making/physiology, Learning/physiology, Reward, Choice Behavior/physiology, Computer Simulation, Humans, Neural Networks (Computer), Reinforcement (Psychology)
ABSTRACT
State-space and action representations form the building blocks of decision-making processes in the brain; states map external cues to the current situation of the agent, whereas actions provide the set of motor commands from which the agent can choose to achieve specific goals. Although these factors differ across environments, it is currently unknown whether, or how accurately, state and action representations are acquired by the agent, because previous experiments have typically provided this information a priori through instruction or pre-training. Here we studied how state and action representations adapt to reflect the structure of the world when such a priori knowledge is not available. We used a sequential decision-making task in which rats were required to pass through multiple states before reaching the goal, and in which the number of states and how they map onto external cues were unknown a priori. We found that, early in training, animals selected actions as if the task was not sequential and outcomes were the immediate consequence of the most proximal action. During the course of training, however, rats recovered the true structure of the environment and made decisions based on the expanded state-space, reflecting the multiple stages of the task. Similarly, we found that the set of actions expanded with training, although the emergence of new action sequences was sensitive to the experimental parameters and specifics of the training procedure. We conclude that the profile of choices shows a gradual shift from simple representations to more complex structures compatible with the structure of the world.
Subjects
Computational Biology/methods, Decision Making/physiology, Learning/physiology, Algorithms, Animals, Animal Behavior, Cues (Psychology), Male, Biological Models, Rats, Wistar Rats
ABSTRACT
Evaluating the future consequences of actions is achievable by simulating a mental search tree into the future. Expanding deep trees, however, is computationally taxing. Therefore, machines and humans use a plan-until-habit scheme that simulates the environment up to a limited depth and then exploits habitual values as proxies for consequences that may arise in the future. Two outstanding questions in this scheme are "in which directions should the search tree be expanded?" and "when should the expansion stop?". Here we propose a principled solution to these questions based on a speed/accuracy tradeoff: deeper expansion in the appropriate directions leads to more accurate planning, but at the cost of slower decision-making. Our simulation results show how this algorithm expands the search tree effectively and efficiently in a grid-world environment. We further show that our algorithm can explain several behavioral patterns in animals and humans, namely the effect of time pressure on the depth of planning, the effect of reward magnitudes on the direction of planning, and the gradual shift from goal-directed to habitual behavior over the course of training. The algorithm also provides several predictions testable in animal/human experiments.
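A minimal sketch of the plan-until-habit backup the scheme relies on: the tree is expanded to a fixed depth and habitual (cached) values stand in for everything beyond the frontier. The tiny environment and fixed depth below are illustrative; the paper's contribution, a speed/accuracy rule for deciding where and how deep to expand, is not reproduced here.

def plan_until_habit(state, depth, transitions, rewards, q_habit, gamma=0.95):
    """Depth-limited lookahead that falls back on cached ('habitual') values at the frontier."""
    if depth == 0:
        return max(q_habit[state].values())   # habitual proxy for the unexpanded subtree
    values = []
    for a, s_next in transitions[state].items():
        v = rewards[state][a] + gamma * plan_until_habit(
            s_next, depth - 1, transitions, rewards, q_habit, gamma)
        values.append(v)
    return max(values)

# Tiny example: at depth 1 the 'left' branch is valued via the cached value at 'B';
# at depth 2 one more step is expanded before falling back on the cache.
transitions = {'A': {'left': 'B', 'right': 'C'}, 'B': {'stay': 'B'}, 'C': {'stay': 'C'}}
rewards     = {'A': {'left': 0.0, 'right': 1.0}, 'B': {'stay': 0.0}, 'C': {'stay': 0.0}}
q_habit     = {'A': {'left': 0.0, 'right': 0.0}, 'B': {'stay': 5.0}, 'C': {'stay': 0.0}}
print(plan_until_habit('A', 1, transitions, rewards, q_habit))  # 4.75
print(plan_until_habit('A', 2, transitions, rewards, q_habit))  # ~4.51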
Subjects
Planning Techniques, Algorithms, Animals, Choice Behavior, Humans, Prospective Studies, Reward
ABSTRACT
Popular computational models of decision-making make specific assumptions about learning processes that may cause them to underfit observed behaviours. Here we suggest an alternative method using recurrent neural networks (RNNs) to generate a flexible family of models that have sufficient capacity to represent the complex learning and decision-making strategies used by humans. In this approach, an RNN is trained to predict the next action that a subject will take in a decision-making task and, in this way, learns to imitate the processes underlying subjects' choices and their learning abilities. We demonstrate the benefits of this approach using a new dataset drawn from patients with either unipolar (n = 34) or bipolar (n = 33) depression and matched healthy controls (n = 34) making decisions on a two-armed bandit task. The results indicate that this new approach is better than baseline reinforcement-learning methods in terms of overall performance and its capacity to predict subjects' choices. We show that the model can be interpreted using off-policy simulations and thereby provides a novel clustering of subjects' learning processes, something that often eludes traditional approaches to modelling and behavioural analysis.
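A sketch of the general approach, assuming a small GRU; the architecture, inputs, and hyperparameters here are placeholders rather than the authors' network. On each trial the network receives the previous action and reward and is trained, by cross-entropy, to predict the subject's next choice, i.e. to imitate the choice process.

import torch
import torch.nn as nn

class ChoiceRNN(nn.Module):
    def __init__(self, n_actions=2, hidden=16):
        super().__init__()
        self.gru = nn.GRU(input_size=n_actions + 1, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_actions)

    def forward(self, prev_actions_onehot, prev_rewards):
        # prev_actions_onehot: (batch, trials, n_actions); prev_rewards: (batch, trials)
        x = torch.cat([prev_actions_onehot, prev_rewards.unsqueeze(-1)], dim=-1)
        h, _ = self.gru(x)                 # hidden state carries the learning history
        return self.out(h)                 # logits for the next action on every trial

model = ChoiceRNN()
loss_fn = nn.CrossEntropyLoss()            # compared against the subject's actual choices
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)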
Subjects
Computer Simulation, Decision Making/physiology, Learning/physiology, Psychological Models, Adult, Bipolar Disorder/physiopathology, Computational Biology, Depressive Disorder/physiopathology, Female, Humans, Male, Middle Aged, Neural Networks (Computer), Young Adult
ABSTRACT
Computational modeling plays an important role in modern neuroscience research. Much previous research has relied on separate statistical methods to address two problems that are actually interdependent. First, given a particular computational model, Bayesian hierarchical techniques have been used to estimate individual variation in parameters over a population of subjects, leveraging their population-level distributions. Second, candidate models are themselves compared, and individual variation in which model each subject expresses is estimated, according to the fits of the models to each subject. The interdependence between these two problems arises because the relevant population for estimating the parameters of a model depends on which other subjects express that model. Here, we propose a hierarchical Bayesian inference (HBI) framework for concurrent model comparison, parameter estimation and inference at the population level, combining previous approaches. We show, both theoretically and experimentally, that this framework has important advantages for parameter estimation and model comparison. The parameters estimated by HBI show smaller errors than those of other methods. Model comparison by HBI is robust against outliers and is not biased towards overly simplistic models. Furthermore, the fully Bayesian approach of our theory enables researchers to make inferences about group-level parameters by performing an HBI t-test.
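The interdependence that HBI exploits can be illustrated with a toy computation; this is not the HBI algorithm itself, which iterates full hierarchical fits, and the use of precomputed per-subject parameter estimates and approximate log evidence is a simplifying assumption. Each subject's responsibility for each model is derived from the model evidence, and those responsibilities then weight the group-level statistics that would serve as priors for the next round of subject-level fits.

import numpy as np

def responsibility_weighted_group(theta, log_evidence):
    """theta: (n_subjects, n_models) per-subject parameter estimates under each model.
    log_evidence: (n_subjects, n_models) approximate log model evidence per subject."""
    r = np.exp(log_evidence - log_evidence.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)                       # responsibility of each model per subject
    group_mean = (r * theta).sum(axis=0) / r.sum(axis=0)    # responsibility-weighted group means
    model_freq = r.mean(axis=0)                             # estimated frequency of each model
    return group_mean, model_freq

theta = np.array([[0.20, 0.55], [0.30, 0.70], [0.90, 0.60]])
log_ev = np.array([[-10.0, -14.0], [-12.0, -11.0], [-20.0, -9.0]])
print(responsibility_weighted_group(theta, log_ev))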
Subjects
Bayes Theorem, Computational Biology/methods, Neurological Models, Computer Simulation, Decision Making/physiology, Humans, Learning/physiology
ABSTRACT
Behavioral evidence suggests that instrumental conditioning is governed by two forms of action control: a goal-directed and a habit learning process. Model-based reinforcement learning (RL) has been argued to underlie the goal-directed process; however, the way in which it interacts with habits and the structure of the habitual process has remained unclear. According to a flat architecture, the habitual process corresponds to model-free RL, and its interaction with the goal-directed process is coordinated by an external arbitration mechanism. Alternatively, the interaction between these systems has recently been argued to be hierarchical, such that the formation of action sequences underlies habit learning and a goal-directed process selects between goal-directed actions and habitual sequences of actions to reach the goal. Here we used a two-stage decision-making task to test predictions from these accounts. The hierarchical account predicts that, because they are tied together as an action sequence, a habitual action selected in the first stage will be followed by a habitual action in the second stage, whereas the flat account predicts that the statuses of the first- and second-stage actions are independent of each other. We found, based on subjects' choices and reaction times, that human subjects combined single actions to build action sequences and that the formation of such action sequences was sufficient to explain habitual actions. Furthermore, based on Bayesian model comparison, a family of hierarchical RL models, assuming a hierarchical interaction between habit and goal-directed processes, provided a better fit to the subjects' behavior than a family of flat models. Although these findings do not rule out all possible model-free accounts of instrumental conditioning, they do show that such accounts are not necessary to explain habitual actions, and they provide a new basis for understanding how goal-directed and habitual action control interact.
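The hierarchical account's key prediction can be sketched as follows, with made-up option names and values: when the first-stage "action" selected is in fact a cached two-action sequence, the second-stage action is emitted as part of that sequence rather than being chosen afresh, so habitual first- and second-stage responses are tied together.

import numpy as np
rng = np.random.default_rng(1)

def choose(options, values, beta=3.0):
    """Softmax selection over a mixed set of single actions and cached sequences."""
    v = np.array([values[o] for o in options])
    p = np.exp(beta * v)
    p /= p.sum()
    return options[rng.choice(len(options), p=p)]

# Stage-1 options: two single actions and one cached two-action sequence.
values = {'a1': 0.4, 'a2': 0.3, ('a1', 'b1'): 0.6}
first = choose(['a1', 'a2', ('a1', 'b1')], values)
if isinstance(first, tuple):
    stage1, stage2 = first    # habitual: the second-stage action follows from the sequence itself
else:
    stage1 = first
    stage2 = choose(['b1', 'b2'], {'b1': 0.5, 'b2': 0.2})   # goal-directed second-stage choice
print(stage1, stage2)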
Subjects
Goals, Bayes Theorem, Decision Making, Humans, Motivation, Reaction Time
ABSTRACT
It is now widely accepted that instrumental actions can be either goal-directed or habitual; whereas the former are rapidly acquired and regulated by their outcome, the latter are reflexive, elicited by antecedent stimuli rather than their consequences. Model-based reinforcement learning (RL) provides an elegant description of goal-directed action. Through exposure to states, actions and rewards, the agent rapidly constructs a model of the world and can choose an appropriate action based on quite abstract changes in environmental and evaluative demands. This model is powerful but has a problem explaining the development of habitual actions. To account for habits, theorists have argued that another action controller is required, called model-free RL, which does not form a model of the world but rather caches action values within states, allowing a state to select an action based on its reward history rather than its consequences. Nevertheless, there are persistent problems with important predictions from the model, most notably the failure of model-free RL to correctly predict the insensitivity of habitual actions to changes in the action-reward contingency. Here, we suggest that introducing model-free RL into instrumental conditioning is unnecessary, and demonstrate that reconceptualizing habits as action sequences allows model-based RL to be applied to both goal-directed and habitual actions in a manner consistent with what real animals do. This approach has significant implications for the way habits are currently investigated and generates new experimental predictions.
Subjects
Goals, Habits, Learning/physiology, Reinforcement (Psychology), Animals, Humans, Random Allocation
ABSTRACT
Instrumental responses are hypothesized to be of two kinds: habitual and goal-directed, mediated by the sensorimotor and the associative cortico-basal ganglia circuits, respectively. The coexistence of these two heterogeneous associative learning mechanisms can be hypothesized to arise from the comparative advantages they have at different stages of learning. In this paper, we assume that the goal-directed system is behaviourally flexible, but slow in choice selection. The habitual system, in contrast, is fast in responding, but inflexible in adapting its behavioural strategy to new conditions. Based on these assumptions and using the computational theory of reinforcement learning, we propose a normative model for arbitration between the two processes that makes an approximately optimal balance between search time and accuracy in decision making. Behaviourally, the model can explain experimental evidence of behavioural sensitivity to outcome at the early stages of learning, but insensitivity at the later stages. It also explains why, when two choices with equal incentive values are available concurrently, behaviour remains outcome-sensitive even after extensive training. Moreover, the model can explain variations in choice reaction time during the course of learning, as well as the experimental observation that reaction time increases with the number of choices. Neurobiologically, by assuming that the phasic and tonic activities of midbrain dopamine neurons carry the reward prediction error and the average reward signals used by the model, respectively, the model predicts that whereas phasic dopamine indirectly affects behaviour by reinforcing stimulus-response associations, tonic dopamine can directly affect behaviour by modulating the competition between the habitual and the goal-directed systems and thus affect reaction time.
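A toy decision rule illustrating the proposed arbitration; the functional form and numbers are assumptions, not the paper's normative derivation. The goal-directed system is consulted only when the value it is expected to recover exceeds the opportunity cost of its slower response, with the average reward rate, standing in for tonic dopamine, setting that cost.

def arbitrate(value_at_stake, deliberation_time, avg_reward_rate, p_habit_correct):
    """Deliberate (goal-directed) only if the expected value gained exceeds the
    average reward foregone while deliberating."""
    expected_gain = (1.0 - p_habit_correct) * value_at_stake
    time_cost = avg_reward_rate * deliberation_time
    return 'goal-directed' if expected_gain > time_cost else 'habitual'

# Early in training the habit is unreliable, so deliberation wins; once the habit is
# reliable, the same time cost tips control toward the fast habitual response.
print(arbitrate(value_at_stake=1.0, deliberation_time=0.5, avg_reward_rate=0.4, p_habit_correct=0.5))
print(arbitrate(value_at_stake=1.0, deliberation_time=0.5, avg_reward_rate=0.4, p_habit_correct=0.95))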
Subjects
Choice Behavior/physiology, Decision Making/physiology, Learning/physiology, Neurological Models, Algorithms, Animals, Animal Behavior, Computer Simulation, Dopamine/physiology, Goals, Humans, Markov Chains, Maze Learning, Neurons/physiology, Rats, Reinforcement (Psychology), Reproducibility of Results
ABSTRACT
Clinical and experimental observations show individual differences in the development of addiction. Increasing evidence supports the hypothesis that dopamine receptor availability in the nucleus accumbens (NAc) predisposes individuals to drug reinforcement. Here, modeling the striatal-midbrain dopaminergic circuit, we propose a reinforcement learning model of addiction based on the actor-critic model of the striatum. Modeling dopamine receptors in the NAc as modulators of the learning rate for appetitive (but not aversive) stimuli in the critic (but not the actor), we define vulnerability to addiction as a relatively lower learning rate for appetitive stimuli, compared to aversive stimuli, in the critic. We hypothesize that an imbalance in this learning parameter used by the appetitive and aversive learning systems can result in addiction. We show that the interaction between the degree of individual vulnerability and the duration of exposure to the drug has two progressive consequences: a worsening of the imbalance and the establishment of an abnormal habitual response in the actor. In computational terms, the proposed model describes how the development of compulsive behavior can be a function of both the degree of drug exposure and individual vulnerability. Moreover, the model describes how the involvement of the dorsal striatum in addiction can increase progressively. The model also interprets other forms of addiction, such as obesity and pathological gambling, through a mechanism shared with drug addiction. Finally, the model provides an answer to the question of why behavioral addictions are triggered in Parkinson's disease patients by dopamine D2 agonist treatments.
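A minimal actor-critic sketch of the proposed vulnerability; the parameter names and values are illustrative. The critic learns more slowly from appetitive than from aversive prediction errors, so for a repeatedly rewarded, drug-like action the prediction error stays positive for longer and the actor's stimulus-response preference is driven disproportionately high.

def critic_update(v, outcome, alpha_appetitive=0.05, alpha_aversive=0.2):
    """Critic with a lower learning rate for appetitive than aversive prediction errors."""
    delta = outcome - v
    alpha = alpha_appetitive if delta > 0 else alpha_aversive
    return v + alpha * delta, delta

def actor_update(preference, delta, alpha_actor=0.1):
    """Actor strengthens the taken action's preference in proportion to the critic's error."""
    return preference + alpha_actor * delta

v, preference = 0.0, 0.0
for _ in range(100):                      # repeated drug-like rewards
    v, delta = critic_update(v, outcome=1.0)
    preference = actor_update(preference, delta)
print(round(v, 2), round(preference, 2))  # ~0.99 and ~1.99; a balanced critic (0.2/0.2) would give ~0.5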
Subjects
Addictive Behavior/physiopathology, Individuality, Nucleus Accumbens/physiopathology, Dopamine Receptors/physiology, Reinforcement (Psychology), Computer Simulation, Humans, Neurological Models, Nerve Net/physiopathology
ABSTRACT
It is now commonly accepted that instrumental actions can reflect goal-directed control; i.e., they can show sensitivity both to changes in the relationship between the action and its consequences and to the value of those consequences. With overtraining, stress, neurodegeneration, psychiatric conditions, or after exposure to various drugs of abuse, goal-directed control declines and instrumental actions are performed independently of their consequences. Although this latter insensitivity has been argued to reflect the development of habitual control, the lack of a positive definition of habits has rendered this conclusion controversial. Here we consider various alternative definitions of habit, including recent suggestions that habits reflect chunked action sequences, to derive criteria with which to categorize responses as habitual. We consider various theories regarding the interaction between goal-directed and habitual controllers and propose a collaborative model based on their hierarchical integration. We argue that this model is consistent with the available data, can be instantiated both at an associative level and computationally, and generates interesting predictions regarding the influence of this collaborative integration on behavior.
ABSTRACT
Within a rational framework, a decision-maker selects actions based on the reward-maximization principle, which stipulates that the decision-maker acquires outcomes with the highest value at the lowest cost. Action selection can be divided into two dimensions: selecting an action from various alternatives, and choosing its vigor, i.e., how fast the selected action should be executed. Both of these dimensions depend on the values of outcomes, which often change as more outcomes are consumed together with their associated actions. Despite this, previous research has only addressed the computational substrate of optimal actions in the specific condition in which the values of outcomes are constant. It is not known what actions are optimal when the values of outcomes are non-stationary. Here, based on an optimal control framework, we derive a computational model for optimal actions when outcome values are non-stationary. The results imply that, even when the values of outcomes are changing, the optimal response rate is constant rather than decreasing. This finding shows that, in contrast to previous theories, commonly observed changes in action rate cannot be attributed solely to changes in outcome value. We then prove that these commonly observed changes can instead be explained by uncertainty about temporal horizons, e.g., the session duration. We further show that, when multiple outcomes are available, the model explains probability matching as well as maximization strategies. The model therefore provides a quantitative analysis of optimal action and explicit predictions for future testing.
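A toy version of the optimization that the model formalizes; the cost function, decay schedule and session length are arbitrary assumptions, and this grid search over a single constant latency does not reproduce the paper's analytical results. Each response carries a vigor cost that grows as latency shrinks, successive outcomes are worth less, and the best constant response rate balances the two.

import numpy as np

def session_return(latency, session_length=100.0, u0=1.0, decay=0.98, vigor_cost=0.5):
    """Total utility of responding at a fixed latency for one session."""
    n = int(session_length // latency)            # responses that fit in the session
    utilities = u0 * decay ** np.arange(n)        # non-stationary (declining) outcome values
    return utilities.sum() - n * vigor_cost / latency

latencies = np.linspace(0.5, 10.0, 200)
best = latencies[np.argmax([session_return(t) for t in latencies])]
print(round(best, 2))                             # best constant latency under these toy parameters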
Subjects
Choice Behavior, Psychological Models, Reward, Humans
ABSTRACT
Choice between actions often requires the ability to retrieve action consequences in circumstances where they are only partially observable. This capacity has recently been argued to depend on the orbitofrontal cortex; however, no direct evidence for this hypothesis has been reported. Here, we examined whether activity in the medial orbitofrontal cortex (mOFC) underlies this critical determinant of decision-making in rats. First, we simulated predictions from this hypothesis for various tests of goal-directed action by removing the assumption that rats could retrieve partially observable outcomes, and then tested those predictions experimentally using manipulations of the mOFC. The results closely followed the predictions: consistent deficits emerged only when action consequences had to be retrieved. Finally, we put action selection based on observable and unobservable outcomes into conflict and found that whereas intact rats selected actions based on the value of retrieved outcomes, rats with mOFC manipulations relied solely on the value of observable outcomes.
Subjects
Choice Behavior/physiology, Prefrontal Cortex/physiology, Psychomotor Performance/physiology, Reward, Animals, Male, Rats, Long-Evans Rats
ABSTRACT
Goal-directed action involves making high-level choices that are implemented using previously acquired action sequences to attain desired goals. Such a hierarchical schema is necessary for goal-directed actions to be scalable to real-life situations, but results in decision-making that is less flexible than when action sequences are unfolded and the decision-maker deliberates step-by-step over the outcome of each individual action. In particular, from this perspective, the offline revaluation of any outcomes that fall within action sequence boundaries will be invisible to the high-level planner, resulting in decisions that are insensitive to such changes. Here, within the context of a two-stage decision-making task, we demonstrate that this property can explain the emergence of habits. Next, we show how this hierarchical account explains the insensitivity of over-trained actions to changes in outcome value. Finally, we provide new data showing that, under extended extinction conditions, habitual behaviour can revert to goal-directed control, presumably as a consequence of decomposing action sequences into single actions. This hierarchical view suggests that the development of action sequences and the insensitivity of actions to changes in outcome value are essentially two sides of the same coin, explaining why these two aspects of automatic behaviour involve a shared neural structure.
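A few lines of Python make the key property concrete; the action names and values are assumptions for illustration only. The value of a cached sequence is compiled before the outcome is revalued, so offline devaluation of an outcome reached mid-sequence leaves the hierarchical choice unchanged until the sequence is decomposed into single actions and re-evaluated step by step.

step_rewards = {'a1': 0.0, 'a2': 1.0}                      # the outcome of a2 lies inside the sequence
sequence_value = step_rewards['a1'] + step_rewards['a2']   # compiled before devaluation

step_rewards['a2'] = 0.0                                   # offline revaluation of the distal outcome

hierarchical_value = sequence_value                              # cached: still 1.0 (habit-like)
decomposed_value = step_rewards['a1'] + step_rewards['a2']       # re-planned step by step: 0.0
print(hierarchical_value, decomposed_value)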
Subjects
Decision Making/physiology, Goals, Habits, Learning/physiology, Animals, Humans, Rodents
ABSTRACT
It is generally assumed that choice between different actions reflects the difference between their action values, yet little direct evidence confirming this assumption has been reported. Here we assess whether the brain calculates the absolute difference between action values or their relative advantage, that is, the probability that one action is better than the other alternatives. We use a two-armed bandit task during functional magnetic resonance imaging and model responses to determine both the size of the difference between action values (D) and the probability that one action value is better than the other (P). The results show haemodynamic signals corresponding to P in the right dorsolateral prefrontal cortex (dlPFC), together with evidence that these signals modulate motor cortex activity in an action-specific manner. We find no significant activity related to D. These findings demonstrate that a distinct neuronal population mediates action-value comparisons, and reveal how these comparisons are implemented to mediate value-based decision-making.
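The distinction between D and P can be made concrete with a toy Bayesian two-armed bandit; the Beta posteriors and particular counts are illustrative, not the study's model. D is the difference between the posterior-mean values of the two actions, whereas P, the posterior probability that one action is better than the other, also depends on how much evidence has accumulated.

import numpy as np
rng = np.random.default_rng(0)

def difference_and_probability(successes, failures, n_samples=100_000):
    """Return D (difference of posterior-mean reward probabilities) and
    P (posterior probability that arm 0 is better than arm 1)."""
    post = [rng.beta(1 + s, 1 + f, n_samples) for s, f in zip(successes, failures)]
    d = abs(post[0].mean() - post[1].mean())
    p = (post[0] > post[1]).mean()
    return round(d, 3), round(p, 3)

# With a similar value difference, P rises sharply as the evidence accumulates.
print(difference_and_probability(successes=[6, 4], failures=[4, 6]))
print(difference_and_probability(successes=[60, 40], failures=[40, 60]))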
Subjects
Choice Behavior/physiology, Decision Making/physiology, Goals, Prefrontal Cortex/anatomy & histology, Prefrontal Cortex/physiology, Adolescent, Adult, Bayes Theorem, Brain Mapping, Female, Humans, Learning/physiology, Magnetic Resonance Imaging, Male, Statistical Models, Task Performance and Analysis, Young Adult
ABSTRACT
Methadone detoxification is a widely used treatment program for opioid dependence. The aim of this study was to identify which baseline patient factors and treatment regimen features predict treatment outcome in an outpatient flexible dose-duration methadone detoxification program. We studied 126 opioid-dependent patients in a naturalistic, nonexperimental clinical setting. The patients were assessed for baseline demographic and drug abuse characteristics, and treatment regimen features were recorded during the program. Successful treatment completion was defined by a final daily methadone dose of less than 15 mg, negative urinalysis during the last two weeks of treatment, and the final clinician-client decision. Of the 126 patients, 60 completed detoxification successfully. Younger age, longer duration of opioid abuse, and higher subjective opiate intoxication severity before treatment entry were all significantly associated with a negative treatment outcome. Among treatment regimen features, a higher maximum methadone dose had a marginally significant independent effect on treatment failure: patients with a maximum methadone dose of more than 75 mg per day had a success rate around ten times lower than those who received smaller doses. These findings could be used to predict treatment outcome and prognosis in a more individualized, patient-tailored approach in real clinical settings. Guidelines for treatment selection and outcome monitoring in addiction medicine, developed from similar studies, could enhance treatment outcomes in clinical services.
Subjects
Methadone/therapeutic use, Opioid-Related Disorders/drug therapy, Adolescent, Adult, Female, Humans, Male, Prospective Studies, Time Factors, Treatment Outcome
ABSTRACT
Based on the dopamine hypotheses of cocaine addiction and the assumption that brain reward system sensitivity decreases after long-term drug exposure, we propose a computational model of cocaine addiction. Utilizing average-reward temporal-difference reinforcement learning, we incorporate the elevation of the basal reward threshold after long-term drug exposure into the model of drug addiction proposed by Redish. Our model is consistent with animal models of drug seeking under punishment. In the case of nondrug rewards, the model explains the increased impulsivity observed after long-term drug exposure. Furthermore, our model predicts the existence of a blocking effect for cocaine.
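A one-function sketch of the prediction-error rule the model builds on, combining Redish's non-compensable drug boost with an elevated basal reward threshold; the function signature and numbers are illustrative assumptions.

def drug_rpe(reward, value_curr, value_next, drug_boost, basal_threshold):
    """Temporal-difference error with a basal reward threshold subtracted from all rewards;
    the drug adds a boost that cannot be predicted away (the max), after Redish."""
    natural = reward - basal_threshold + value_next - value_curr
    return max(natural + drug_boost, drug_boost) if drug_boost > 0 else natural

# After chronic exposure the elevated threshold makes nondrug rewards yield negative
# prediction errors (consistent with increased impulsivity), while the drug's boost persists.
print(drug_rpe(reward=1.0, value_curr=0.5, value_next=0.0, drug_boost=0.0, basal_threshold=0.8))   # -0.3
print(drug_rpe(reward=1.0, value_curr=0.5, value_next=0.0, drug_boost=0.5, basal_threshold=0.8))   #  0.5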