Here we provide a technical description of the task and how to find the optimal behavior. More details and many helpful intuitions are provided in the Results section. We assume that the state of the world is either H1 or H2, and it is the aim of the decision maker to identify this state (indicated by a choice) based on stochastic evidence. This evidence δx ~ N(μδt, δt) is Gaussian for some small time period δt, with mean μδt and variance δt, where |μ| is the evidence strength, and μ ≥ 0 and μ < 0 correspond to H1 and H2 respectively. Such stochastic evidence corresponds to a diffusion model
dx/dt = μ + η(t),

where η(t) is white noise with unit variance, and x(t) describes the trajectory of a drifting/diffusing particle. We assume the value of μ to be unknown to the decision maker, to be drawn from the prior p(μ) across trials, and to remain constant within a trial. After accumulating evidence δx0…t by observing the stimulus for some time t, the decision maker holds belief g(t) ≡ p(H1|δx0…t) = p(μ ≥ 0|δx0…t) (or 1 − g(t)) that H1 (or H2) is correct (Fig. 1A). The exact form of this belief depends on the prior p(μ) over μ and will be discussed later for different priors. As long as this prior is symmetric, that is, p(μ ≥ 0) = p(μ < 0) = ½, the initial belief at stimulus onset, t = 0, is always g(0) = ½.
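To make the evidence-generating process and the resulting belief concrete, the sketch below simulates a single trial of the diffusion model and tracks g(t). For illustration only, it assumes the simplest symmetric prior, a two-point prior μ ∈ {+μ0, −μ0}, under which g(t) is a logistic function of the accumulated evidence x(t); the parameter values are arbitrary and are not those of the task described here.

```python
import numpy as np

# Minimal sketch (not the paper's code): simulate one trial of the diffusion
# model dx/dt = mu + eta(t) and track the belief g(t) = p(H1 | evidence).
# For illustration we assume a two-point symmetric prior mu in {+mu0, -mu0};
# for more general priors p(mu), the form of g(t) differs.

rng = np.random.default_rng(0)

dt = 1e-3          # time step delta t (s), assumed
mu0 = 1.0          # evidence strength |mu|, assumed
T = 2.0            # simulated stimulus duration (s), assumed

mu = mu0 if rng.random() < 0.5 else -mu0   # true state drawn from the symmetric prior
n_steps = int(T / dt)

x = np.zeros(n_steps + 1)        # accumulated evidence x(t)
g = np.full(n_steps + 1, 0.5)    # belief g(t); g(0) = 1/2 for a symmetric prior

for k in range(n_steps):
    dx = mu * dt + np.sqrt(dt) * rng.standard_normal()   # delta x ~ N(mu dt, dt)
    x[k + 1] = x[k] + dx
    # For the two-point prior the log-likelihood ratio is 2 * mu0 * x(t),
    # so the posterior belief is a logistic function of the accumulated evidence.
    g[k + 1] = 1.0 / (1.0 + np.exp(-2.0 * mu0 * x[k + 1]))

print(f"true mu = {mu:+.2f}, final belief g(T) = {g[-1]:.3f}")
```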
The decision maker receives reward Rij for choosing Hi when Hj is correct. These rewards can be positive or negative, allowing, for instance, for negative reinforcement when subjects pick the wrong hypothesis, that is, when i ≠ j. Additionally, we assume the accumulation of evidence to come at a cost (internal to the decision maker), given by the cost function c(t). This cost is momentary, such that the total cost for accumulating evidence if a decision is made at decision time Td after stimulus onset is
C(Td) = ∫₀^Td c(t) dt (Fig. 1B).

Each trial ends after Tt seconds and is followed by the inter-trial interval ti and an optional penalty time tp for wrong decisions (Fig. 1C). We assume that decision makers aim at maximizing their reward rate, given by
ρ = (〈R〉 − 〈C(Td)〉) / (〈Tt〉 + 〈ti〉 + 〈tp〉),     (1)

where the averages are over choices, decision times, and the randomizations of ti and tp. We differentiate between fixed duration tasks and reaction time tasks. In fixed duration tasks, we assume Tt to be fixed by the experimenter and to be large compared to Td, and tp = 0. This makes the denominator of Eq. (1) constant with respect to the subject’s behavior, such that maximizing the reward rate ρ becomes equal to maximizing the expected net reward 〈R〉 − 〈C(Td)〉 for a single trial. In contrast, in reaction time tasks we need to consider the whole sequence of trials when maximizing ρ, because the trial duration Tt is determined by the subject’s reaction time, such that the denominator of Eq. (1), and with it the reward rate, depends on the subject’s behavior (provided that Tt is not too short compared to the inter-trial interval and penalty time).
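As a concrete illustration of Eq. (1), the following sketch computes the per-trial accumulation cost C(Td) = ∫₀^Td c(t) dt for an assumed momentary cost function and then estimates the reward rate by averaging over simulated trials. All parameter values, the form of c(t), and the stand-in distributions of choices and decision times are assumptions made purely for illustration; they are not taken from the task or data described here.

```python
import numpy as np

# Illustrative sketch of Eq. (1): rho = (<R> - <C(T_d)>) / (<T_t> + <t_i> + <t_p>).
# All numbers, the cost function c(t), and the stand-in choice/decision-time
# statistics below are assumptions for illustration only.

rng = np.random.default_rng(1)
n_trials = 10_000
dt = 1e-3                              # integration step for the cost integral (s)

R_correct, R_wrong = 1.0, 0.0          # reward matrix R_ij (assumed values)
t_i = 2.0                              # inter-trial interval (s), assumed fixed
t_p = 1.0                              # penalty time for wrong decisions (s)

def c(t):
    """Assumed momentary cost of accumulating evidence at time t."""
    return 0.05 + 0.2 * t              # grows linearly with time within a trial

def total_cost(T_d):
    """C(T_d) = integral of c(t) from 0 to the decision time T_d."""
    t = np.arange(0.0, T_d, dt)
    return np.sum(c(t)) * dt

# Stand-ins for the outcome of the decision process (not derived from the model):
correct = rng.random(n_trials) < 0.8           # choice correct with assumed probability
T_d = rng.exponential(0.8, n_trials)           # assumed decision-time distribution (s)

R = np.where(correct, R_correct, R_wrong)      # per-trial reward
C = np.array([total_cost(td) for td in T_d])   # per-trial accumulation cost C(T_d)
T_t = T_d                                      # reaction time task: trial length tracks T_d
penalty = np.where(correct, 0.0, t_p)          # penalty time only after wrong decisions

rho = (R.mean() - C.mean()) / (T_t.mean() + t_i + penalty.mean())
print(f"estimated reward rate rho = {rho:.3f} per second")
```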