RL models have four key components: a reward signal, a state, a state-dependent set of available actions, and a policy that governs how actions are chosen. Here, a simple Q-learning agent with a softmax policy was designed to model mouse behaviour in the open field as an RL process over endogenous dopamine levels44. The model used endogenous dopamine (that is, syllable-associated dLight) as the reward signal, behavioural syllables as states, and transitions between behavioural syllables as actions. Given a syllable at time t + 1, the dLight peak occurring during the syllable at time t is considered the ‘reward’. The Q-table for the model was initialized as a uniform matrix with the diagonal set to 0, since by definition there are no self-transitions in our data.

For every step of each simulation, given the currently expressed syllable (that is, the state), the model samples possible future syllables (actions) based on the behavioural policy and the expected dLight transient magnitude (the expected reward, specified by the Q-table) associated with each syllable transition. Actions were selected according to the softmax equation

$$p(a \mid s) = \frac{e^{Q_s(a)/\tau}}{\sum_{b=1}^{n} e^{Q_s(b)/\tau}}$$

where τ is the temperature. The model was fed 30-min experiments of actual data, formatted as a sequence of states and syllable-associated dopamine. To update the Q-table and simulate the effect of endogenous dopamine as reward, the syllable-associated dopamine was presented to the model as the reward in a standard Q-learning update:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\right]$$

where Q is the Q-table that defines the value of action a while in state s, α is the learning rate, r is the reward associated with action a and state s (the dLight peak value at the transition between syllable a and syllable s), and γ is the discount factor.

Performance was assessed by taking the Pearson correlation between the model’s Q-table at the end of the simulation and the empirical transition matrix observed in the experimental data, with each row of the empirical transition matrix and of the Q-table separately z-scored before computing the correlation. Note that the learned Q-table is functionally equivalent to a transition matrix in this formulation. To avoid degradation in performance due to syllable sparsity, only the top 10 syllables were used.
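The sketch below illustrates, in Python with NumPy, one way such a pipeline could be assembled: a softmax policy over Q-values, a tabular Q-learning update driven by observed syllable sequences with dLight peaks as rewards, and a row-z-scored Pearson comparison against the empirical transition matrix. The function names, default hyperparameters (alpha, gamma, tau) and the specific way the observed data are replayed through the update are illustrative assumptions, not the authors' published code.

```python
import numpy as np

def softmax(q_row, tau=1.0):
    """Convert one row of the Q-table into action probabilities (temperature tau)."""
    z = (q_row - np.max(q_row)) / tau          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def fit_q_table(syllables, dlight_peaks, n_syllables=10, alpha=0.1, gamma=0.9):
    """Replay one 30-min session of observed syllables and syllable-associated
    dLight peaks through a tabular Q-learner; return the learned Q-table.
    (Assumes syllables are integer labels 0..n_syllables-1.)"""
    # Uniform initialization with the diagonal set to 0 (no self-transitions).
    Q = np.ones((n_syllables, n_syllables))
    np.fill_diagonal(Q, 0.0)

    for t in range(len(syllables) - 1):
        s, a = syllables[t], syllables[t + 1]  # state = current syllable, action = next syllable
        r = dlight_peaks[t]                    # dLight peak during the syllable at time t
        # Standard Q-learning update with the dLight peak as the reward.
        Q[s, a] += alpha * (r + gamma * np.max(Q[a]) - Q[s, a])
    return Q

def model_vs_empirical(Q, empirical_T):
    """Pearson r between the row-z-scored Q-table and the row-z-scored
    empirical transition matrix."""
    def row_z(m):
        return (m - m.mean(axis=1, keepdims=True)) / m.std(axis=1, keepdims=True)
    return np.corrcoef(row_z(Q).ravel(), row_z(empirical_T).ravel())[0, 1]
```

Under these assumptions, a simulated session could also be generated by repeatedly sampling the next syllable with np.random.choice using the probabilities returned by softmax(Q[s], tau) for the current state s.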