PSTH (peri-stimulus time histogram)#

Origin in electrophysiology#

PSTH is a term inherited from electrophysiology. In the spike-counting world, you take a long extracellular recording, find every spike, line up a window around each behavioural event of interest (cue, reward, lever press), and count spikes in fine time bins relative to that event. Average across events and you get a histogram that says: “in the 200 ms after the cue, the neuron fires at roughly 30 Hz on average; before the cue, it fires at about 10 Hz.” The histogram shape is the neuron’s average response, time-locked to the event.

In fiber photometry the operation is identical in spirit but the underlying signal is different. Instead of discrete spikes you have a continuous fluorescence trace (typically z-scored dF/F) that proxies bulk calcium activity in the recorded region. There is nothing to count, only to slice. The “histogram” becomes a continuous amplitude trace: an average of the z-score waveform locked to the event.

The “histogram” name is therefore a vestige of the electrophysiology origin. In photometry it is more accurate to call it an event-aligned average, but the field has standardised on PSTH and that is what GuPPy uses.

Constructing and reading a PSTH#

A PSTH (peri-stimulus time histogram) is the across-event average of a continuous signal in a fixed window around each event. Its purpose is to isolate the event-locked component of the signal: the part that consistently appears at the same time relative to each event, separated from spontaneous activity, drift, and noise that do not. Operationally, it is computed by taking a z-scored signal, a list of event timestamps, and a pre and post window in seconds; extracting the corresponding window around each event; and averaging the extracted windows point by point.
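The operational description above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than GuPPy's own implementation: the function name `compute_psth`, its parameters, the constant sampling rate `fs`, and the synthetic signal are all assumptions made for the example.

```python
import numpy as np

def compute_psth(signal, fs, event_times, pre, post):
    """Average fixed windows of `signal` around each event (hypothetical helper).

    signal      : 1-D array, the z-scored trace
    fs          : sampling rate in Hz (assumed constant)
    event_times : event timestamps in seconds from the start of the trace
    pre, post   : window lengths in seconds before and after each event
    """
    signal = np.asarray(signal, dtype=float)
    n_pre = int(round(pre * fs))
    n_post = int(round(post * fs))
    # Sample offsets relative to the event: -n_pre ... n_post - 1
    offsets = np.arange(-n_pre, n_post)
    # Index of the sample closest to each event
    centers = np.round(np.asarray(event_times, dtype=float) * fs).astype(int)
    # One row per event, one column per time offset
    idx = centers[:, None] + offsets[None, :]
    # This sketch assumes every window fits inside the recording
    if idx.min() < 0 or idx.max() >= signal.size:
        raise ValueError("an event window extends past the recording")
    trials = signal[idx]
    time = offsets / fs
    # Point-by-point average across events
    return time, trials, trials.mean(axis=0)

# Synthetic example: a small bump 0.5 s after each of three events
fs = 100.0
t = np.arange(0, 60, 1 / fs)
events = np.array([10.0, 25.0, 40.0])
signal = np.zeros_like(t)
for e in events:
    signal += np.exp(-0.5 * ((t - (e + 0.5)) / 0.1) ** 2)

time, trials, psth = compute_psth(signal, fs, events, pre=2.0, post=3.0)
```

Returning the per-event matrix `trials` alongside the mean keeps the individual traces available for overlays, SEM bands, and baseline correction.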

PSTHs are computed on the z-scored trace rather than on raw fluorescence or dF/F because z-scoring puts different recordings on a comparable scale, with values expressed in standard-deviations-of-noise units. See the explainer on z-score normalization for more information.

A single event-aligned trace is dominated by noise and by signal unrelated to the event. The panel-3 overlay makes both visible: alongside the typical responses sit a pure-noise trace (orange) with no event-locked activity and an artifact trace (purple) with a spurious post-event peak. Averaging across events recovers the event-locked component because it is the only thing that lines up. Spontaneous transients, slow drift, and the purple trace’s artifact each land at a random offset relative to t = 0, so they are averaged out across the window and contribute negligibly to the mean. Panel 4 is what survives that smearing: the part of the signal that consistently appears at the same time across events.

The recovery only works with enough events. With small event counts a few loud individual traces can dominate the average, and the SEM (which shrinks as \(1/\sqrt{N}\)) is itself unreliable as an uncertainty estimate. Where “too few” begins depends on the recording’s SNR and the across-event variability, but it is a study-design constraint that the analysis cannot rescue.
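For reference, the SEM across events can be computed as the sample standard deviation divided by the square root of the event count. The sketch below assumes the event-aligned traces are stored as an `(n_events, n_samples)` array; the function name `psth_sem` and the toy data are hypothetical.

```python
import numpy as np

def psth_sem(trials):
    """Mean and SEM across events for an (n_events, n_samples) array."""
    trials = np.asarray(trials, dtype=float)
    n_events = trials.shape[0]
    if n_events < 2:
        raise ValueError("SEM needs at least two events")
    mean = trials.mean(axis=0)
    # ddof=1 gives the sample standard deviation across events
    sem = trials.std(axis=0, ddof=1) / np.sqrt(n_events)
    return mean, sem

# Toy check: two events that differ by a constant 2 at every time point
toy = np.array([[0.0, 0.0], [2.0, 2.0]])
toy_mean, toy_sem = psth_sem(toy)

# With pure noise, the average SEM shrinks as the event count grows
rng = np.random.default_rng(0)
sem_small = psth_sem(rng.normal(size=(5, 200)))[1].mean()
sem_large = psth_sem(rng.normal(size=(80, 200)))[1].mean()
```

With only a handful of events, the standard deviation inside this formula is itself estimated from very few values, which is why the SEM band is a poor uncertainty estimate in that regime.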

The flip side is that a PSTH cannot reveal activity that is not consistently locked to the chosen event; transient detection and cross-correlation are tools for those questions.

Correction for long-term drift#

Long photometry sessions are not flat. The trace drifts slowly up and down across the recording for reasons unrelated to the events of interest, including residual photobleaching, gradual changes in the animal’s general state, and slow shifts at the rig. Z-scoring the whole session does not remove this drift. So an event-aligned trace extracted from early in the session has a different starting height than one extracted from late in the session, and the average of those traces inherits the spread: the mean PSTH no longer sits cleanly at zero before the event, and the SEM band is widened by event-to-event differences in starting height that have nothing to do with the response.

Per-event baseline correction takes the simplest possible approach: rather than modelling drift across the whole session, it removes whatever the drift contributed to each event-aligned trace’s own pre-event window. The middle panel of the figure makes the problem concrete. The three highlighted traces have effectively identical event responses, yet their pre-event baselines sit at noticeably different levels: positive for the start-of-session trace (blue), near zero for the middle (red), negative for the end (green). Each baseline level matches where the slow drift in the top panel happens to be when that event fires. The fix is to compute the mean of each trace over the shaded yellow baseline window and subtract it from that trace. The bottom panel shows the result: every trace is anchored at zero pre-event regardless of where in the session the event happened, and the event response itself is unchanged.

This correction sits on top of the session-wide z-score rather than replacing it: the two together give a y-axis in noise units (from z-score) that starts cleanly at zero pre-event (from baseline correction).
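A minimal sketch of per-event baseline subtraction, assuming the event-aligned traces are stored as an `(n_events, n_samples)` array with a shared time axis. The function name `baseline_correct` and its window parameters are illustrative, not taken from GuPPy.

```python
import numpy as np

def baseline_correct(trials, time, baseline_start, baseline_end):
    """Subtract each trace's own mean over [baseline_start, baseline_end)."""
    trials = np.asarray(trials, dtype=float)
    time = np.asarray(time, dtype=float)
    mask = (time >= baseline_start) & (time < baseline_end)
    if not mask.any():
        raise ValueError("baseline window contains no samples")
    # One baseline value per event, kept as a column for broadcasting
    baseline = trials[:, mask].mean(axis=1, keepdims=True)
    return trials - baseline

# Three traces with the same response but different starting heights,
# standing in for events from the start, middle, and end of a session
time = np.linspace(-2.0, 3.0, 501)
response = np.where(time >= 0, np.exp(-time), 0.0)
trials = np.stack([response + 1.0, response, response - 1.0])

corrected = baseline_correct(trials, time, -2.0, 0.0)
```

After correction all three rows coincide with `response`: the offsets are gone and the response shape is untouched, because only a constant is subtracted from each trace.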

Why a separate subtraction step at all? You might wonder whether better z-scoring could absorb the drift directly. It cannot, and the reason is structural. The mean \(\mu\) and standard deviation \(\sigma\) in \(z = (x - \mu) / \sigma\) are single numbers computed over the entire recording, not functions of time. Subtracting \(\mu\) from every sample shifts the whole trace down by the same amount, so it removes only the constant part of any drift, not its time-varying part. Concretely: if the drift is linear, \(\text{drift}(t) = a \cdot t + b\), then \(\mu = a \cdot T/2 + b\), and subtracting \(\mu\) leaves \(a \cdot t - a \cdot T/2\), still linear in \(t\) with the slope \(a\) unchanged. More generally, z-score is an affine map applied uniformly in time: it can shift and scale any input but not change its shape. Whatever time-varying structure was in \(x(t)\) is also in \(z(t)\). Removing drift therefore requires a time-aware operation that uses a different baseline at each time point, which is exactly what per-event baseline correction does, one event at a time.
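The linear-drift argument can be checked numerically. The sketch below z-scores a purely linear drift with arbitrary, made-up values for `a`, `b`, and the session length `T`, then fits a line to the result. Dividing by \(\sigma\) rescales the slope to \(a / \sigma\), but the trace remains a straight line in \(t\).

```python
import numpy as np

T = 600.0                      # session length in seconds (arbitrary)
t = np.linspace(0.0, T, 6001)
a, b = 0.01, 2.0               # made-up drift slope and offset
drift = a * t + b

# Session-wide z-score: one mean and one standard deviation for the whole trace
z = (drift - drift.mean()) / drift.std()

# Fit a straight line to the z-scored trace
slope_z, intercept_z = np.polyfit(t, z, 1)

# Spread between the first and last sample, in z-score units
end_to_end = z[-1] - z[0]
```

The z-scored drift has zero mean, yet its first and last samples still sit several z-score units apart, so event-aligned traces taken early and late in the session start at different heights.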

Summary statistics#

Once a PSTH has been built and corrected for drift, scalars summarise its response into single numbers that can be compared across conditions, sessions, or animals. Three are commonly reported:

Peak amplitude: the largest value of the PSTH inside a chosen post-event window.
Peak latency: the time at which that maximum occurs, relative to the event.
AUC (area under the curve): the integral of the PSTH over the same window.

For responses that go below baseline (suppressions, omitted-reward dips), the relevant version of peak amplitude is the signed minimum rather than the maximum, and peak latency is the time of that minimum.
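The three scalars can be sketched as follows. The function name `summarize_psth`, the `negative` flag for below-baseline responses, and the use of the trapezoidal rule for AUC are assumptions made for this example; other integration rules are equally reasonable.

```python
import numpy as np

def summarize_psth(psth, time, start, end, negative=False):
    """Peak amplitude, peak latency, and AUC inside [start, end] seconds."""
    psth = np.asarray(psth, dtype=float)
    time = np.asarray(time, dtype=float)
    mask = (time >= start) & (time <= end)
    if mask.sum() < 2:
        raise ValueError("window must contain at least two samples")
    seg, seg_t = psth[mask], time[mask]
    # Signed minimum for suppressions, maximum otherwise
    i = np.argmin(seg) if negative else np.argmax(seg)
    # Trapezoidal rule written out explicitly
    auc = float(np.sum(0.5 * (seg[1:] + seg[:-1]) * np.diff(seg_t)))
    return {"peak": float(seg[i]), "latency": float(seg_t[i]), "auc": auc}

# Triangular response: height 2, peaking 1 s after the event, 2 s wide
time = np.linspace(-1.0, 3.0, 401)
psth = 2.0 * np.clip(1.0 - np.abs(time - 1.0), 0.0, None)

up = summarize_psth(psth, time, 0.0, 3.0)
down = summarize_psth(-psth, time, 0.0, 3.0, negative=True)
```

For the triangle, the peak is about 2, the latency about 1 s, and the AUC about 2 (half of base times height). The mirrored trace gives the same latency with a signed minimum of about -2.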

The three scalars answer two different kinds of question. Peak amplitude and AUC both measure how big the response was, and they can disagree in ways that reveal what each is actually measuring. Peak latency measures when the response was largest; two recordings can have the same peak amplitude but different latencies, which generally tells a different biological story (early sensory vs late cognitive, for instance). Latency does not interact with response shape the same way the magnitude pair does, so we focus first on peak amplitude and AUC to understand them in a comparative sense.

The top row is the canonical case: a single clean event-locked peak. Peak and AUC tell the same story (peak ~3.0, AUC ~4.5) and reporting either one would be defensible. The lower two rows are where the metrics begin to disagree, and the comparison is what reveals what each metric is actually measuring.

The broad sustained response in the middle row peaks at less than half the height of the canonical case (peak ~1.2 vs ~3.0), and yet its AUC is higher (~5.4 vs ~4.5) because the trace stays elevated for several seconds. Peak collapses on the lower amplitude; AUC adds across the window and rewards the duration. The opposite asymmetry shows up in the bottom row, where a small real response is contaminated by a tall narrow artifact spike. Peak latches onto the spike and reports a value (~4.5) much larger than the real response, while AUC stays close to what the real response alone would contribute (~2.2) because the spike is too narrow to add much area. AUC’s robustness here is asymmetric, though: a broad artifact (a slow drift bump that survived correction, a contaminating long-duration nuisance signal) would inflate AUC the same way a narrow spike inflates peak.

The structural reason behind both disagreements is that peak is the maximum, a single sample, while AUC is an integral over the window. Two responses can share the same peak but have very different AUCs (one narrow, one broad), or share the same AUC but have very different peaks (one tall and narrow, one short and broad). Reporting both is therefore standard practice: together they distinguish “taller in this condition” from “taller-and-broader in this condition”, and they protect against the artifact-spike failure mode where a single noisy sample dominates the summary.
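The difference between a single-sample maximum and an integral can be illustrated with synthetic shapes. The Gaussian bumps, their parameters, and the one-sample artifact below are invented for the illustration and do not correspond to the figure's data.

```python
import numpy as np

time = np.linspace(0.0, 10.0, 1001)
dt = time[1] - time[0]

def gaussian_bump(height, center, width):
    return height * np.exp(-0.5 * ((time - center) / width) ** 2)

def auc(y):
    # Trapezoidal rule on the uniform time grid
    return float(np.sum(0.5 * (y[1:] + y[:-1])) * dt)

# Same area (height * width is 0.6 for both), very different peaks
tall_narrow = gaussian_bump(3.0, 2.0, 0.2)
short_broad = gaussian_bump(1.0, 4.0, 0.6)

# A one-sample artifact spike added to the broad response
contaminated = short_broad.copy()
contaminated[500] += 4.0

peak_jump = contaminated.max() - short_broad.max()
auc_jump = auc(contaminated) - auc(short_broad)
```

The two bumps have matching AUC but a threefold difference in peak, while the artifact raises the peak by several units and changes the AUC by only a few hundredths.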

Peak latency tells a different kind of story. Where peak amplitude and AUC are competing answers to how big, latency answers when. Two responses can have nearly identical peak amplitude and AUC, and yet peak at very different times relative to the event, and that distinction is biology, not noise. Primary sensory regions peak in tens of milliseconds; downstream associative regions in hundreds; striatal dopamine in the few-hundred-millisecond range. A predicted reward peaks earlier than a surprising one. Peak amplitude alone cannot separate any of these.

Latency does inherit one limitation from peak amplitude. Both are read off a single sample, the maximum, so a narrow contamination spike will hijack both: not just the magnitude reading but also the timing reading. AUC, by contrast, has no analogous timing scalar; the integral has no preferred time point.

All three scalars depend on the post-event window itself, which is a user-set modelling decision rather than a fact of the data. A window that is too short truncates broad sustained responses, leaving AUC understated and possibly clipping a late peak entirely. A window that is too long dilutes a sharp response with post-event activity that is no longer event-locked, which can inflate AUC without changing peak. Different windows on the same PSTH produce different but equally valid summaries that mean slightly different things. The same caveat applies to the burst-rejection threshold below.

Event rejection#

Not every event timestamp produces a usable event-aligned trace. Some sit too close to the start or end of the recording for the extraction window to fit; some arrive in clusters tight enough that adjacent windows overlap and contaminate each other. Both cases would distort the average if left in. Two filters address these failure modes.

Edge rejection drops events whose pre or post extraction window would extend past the recording bounds. There is no data outside the recording, so the missing samples cannot be averaged honestly, and keeping such an event would distort the event-aligned average and any downstream statistics. The figure below shows this visually: the extraction window of an event near the recording start clips past t = 0, into a “no recording” zone where no data exists.
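Edge rejection reduces to a bounds check on each event's window. The sketch below assumes timestamps in seconds measured from the start of the recording and a known recording `duration`; the function name `reject_edge_events` is hypothetical.

```python
import numpy as np

def reject_edge_events(event_times, pre, post, duration):
    """Keep only events whose [t - pre, t + post] window fits in [0, duration]."""
    event_times = np.asarray(event_times, dtype=float)
    fits_start = event_times - pre >= 0.0
    fits_end = event_times + post <= duration
    return event_times[fits_start & fits_end]

# 60 s recording, 2 s pre-window, 3 s post-window
kept = reject_edge_events([1.0, 5.0, 58.5], pre=2.0, post=3.0, duration=60.0)
```

Here the event at 1.0 s is dropped because its pre-window would start before the recording, and the event at 58.5 s is dropped because its post-window would run past the end.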

Burst rejection drops events that fall closer to the previous kept event than a user-set inter-event threshold. This matters for behaviours that come in bursts (rapid licks, repeated lever presses) where adjacent extraction windows would otherwise overlap and contaminate each other. The threshold is task-dependent and is a modelling choice rather than a tunable with a formal optimum: the same recording with two different thresholds produces two valid PSTHs that mean slightly different things.
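One way to sketch burst rejection is a single pass over the sorted timestamps that compares each event with the most recent kept event. The function name `reject_burst_events` and the choice to keep an event that falls exactly at the threshold are assumptions; GuPPy's own handling may differ in such details.

```python
def reject_burst_events(event_times, min_interval):
    """Keep events at least `min_interval` seconds after the last kept event."""
    kept = []
    for t in sorted(event_times):
        # Compare against the last KEPT event, not the last event seen
        if not kept or t - kept[-1] >= min_interval:
            kept.append(t)
    return kept

licks = [0.0, 0.5, 1.0, 5.0, 5.2, 12.0]
kept_2s = reject_burst_events(licks, min_interval=2.0)
kept_half = reject_burst_events(licks, min_interval=0.4)
```

Comparing against the last kept event rather than the immediately preceding event means a long burst does not suppress everything after it: once the threshold has elapsed since the last kept event, the next event is accepted. Running the same timestamps with the two thresholds gives two different event lists, which is the sense in which the threshold is a modelling choice.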