RL models have four key components: a reward signal, a state, a state-dependent set of available actions, and a policy that governs how actions are chosen. Here, a simple Q-learning agent with a softmax policy was designed to model mouse behaviour in the open field as an RL process over endogenous dopamine levels44. The model used endogenous dopamine (that is, syllable-associated dLight) as the reward signal, behavioural syllables as states, and transitions between behavioural syllables as actions. Given a syllable at time t + 1, the dLight peak occurring during the syllable at time t is considered the ‘reward’. The Q-table for the model was initialized as a uniform matrix with the diagonal set to 0, since by definition there are no self-transitions in our data.

For every step of each simulation, given the currently expressed syllable (that is, the state), the model samples possible future syllables (actions) based on the behavioural policy and the expected dLight transient magnitude (the expected reward, specified by the Q-table) associated with each syllable transition. Actions were selected according to the softmax equation

$$p(a \mid s) = \frac{e^{Q_s(a)/\tau}}{\sum_{b=1}^{n} e^{Q_s(b)/\tau}}$$

where τ is the temperature. The model was fed 30-min experiments of actual data, formatted as a sequence of states and syllable-associated dopamine. To update the Q-table and simulate the effect of endogenous dopamine as reward, the syllable-associated dopamine was presented to the model as the reward in a standard Q-learning update:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\right]$$

where Q is the Q-table that defines the value of action a while in state s, α is the learning rate, r is the reward associated with action a and state s (the dLight peak value at the transition between syllable a and syllable s), and γ is the discount factor.

Performance was assessed by taking the Pearson correlation between the model’s Q-table at the end of the simulation and the empirical transition matrix observed in the experimental data, with each row of the empirical transition matrix and of the Q-table separately z-scored before computing the correlation. Note that the learned Q-table is functionally equivalent to a transition matrix in this formulation. To avoid degradation in performance due to syllable sparsity, only the top 10 syllables were used.
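The sketch below illustrates, in Python with NumPy, one way such a pipeline could be assembled: a softmax policy over Q-values, a tabular Q-learning update driven by observed syllable sequences with dLight peaks as rewards, and a row-z-scored Pearson comparison against the empirical transition matrix. The function names, default hyperparameters (alpha, gamma, tau) and the specific way the observed data are replayed through the update are illustrative assumptions, not the authors' published code.

```python
import numpy as np

def softmax(q_row, tau=1.0):
    """Convert one row of the Q-table into action probabilities (temperature tau)."""
    z = (q_row - np.max(q_row)) / tau          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def fit_q_table(syllables, dlight_peaks, n_syllables=10, alpha=0.1, gamma=0.9):
    """Replay one 30-min session of observed syllables and syllable-associated
    dLight peaks through a tabular Q-learner; return the learned Q-table.
    (Assumes syllables are integer labels 0..n_syllables-1.)"""
    # Uniform initialization with the diagonal set to 0 (no self-transitions).
    Q = np.ones((n_syllables, n_syllables))
    np.fill_diagonal(Q, 0.0)

    for t in range(len(syllables) - 1):
        s, a = syllables[t], syllables[t + 1]  # state = current syllable, action = next syllable
        r = dlight_peaks[t]                    # dLight peak during the syllable at time t
        # Standard Q-learning update with the dLight peak as the reward.
        Q[s, a] += alpha * (r + gamma * np.max(Q[a]) - Q[s, a])
    return Q

def model_vs_empirical(Q, empirical_T):
    """Pearson r between the row-z-scored Q-table and the row-z-scored
    empirical transition matrix."""
    def row_z(m):
        return (m - m.mean(axis=1, keepdims=True)) / m.std(axis=1, keepdims=True)
    return np.corrcoef(row_z(Q).ravel(), row_z(empirical_T).ravel())[0, 1]
```

Under these assumptions, a simulated session could also be generated by repeatedly sampling the next syllable with np.random.choice using the probabilities returned by softmax(Q[s], tau) for the current state s.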