Order of Operations in Data Harmonization¶
When the Labeler cleans a signal it may trim, fill gaps, resample, smooth, normalize, and export. Because each step is an operator on the data, changing their order usually changes the result. This page explains the math, recommends a default pipeline, and shows two examples (one simple, one advanced).
Why the order matters¶
Let \(\mathcal{X}\) be the space of signals with \(x\in\mathcal{X}\). Each step—Trim \((T)\), Fill \((F)\), Resample \((R)\), Smooth \((S)\), Normalize \((N)\)—acts as an operator on \(\mathcal{X}\).
Most harmonization steps do not commute because they either
- depend on statistics of the input (non-linear),
- discard information (non-invertible),
- or both.
| Step | Symbol | Nature | Typical action |
|---|---|---|---|
| Trim | \(T\) | selection (non-invertible) | drop out-of-ROI spans, clip to interval |
| Data fill | \(F\) | non-linear, data-dependent | replace NaNs by interpolation/constant |
| Resampling | \(R\) | linear, often non-invertible | change sampling grid (up/down/irregular→regular) |
| Smoothing | \(S\) | linear, non-invertible | low-pass filter, moving average, Savitzky–Golay |
| Normalization | \(N\) | non-linear, data-dependent | z-score, median–IQR, min–max, robust scaling |
| Export | — | I/O | write artifacts with provenance |
Unless special symmetry holds,
Recommended pipeline (default)¶
Trim → Fill → Resample → Smooth → Normalize → Export
| Step | Reason |
|---|---|
| Trim first | Restricts processing to the region of interest; avoids computing statistics on discarded spans. |
| Fill second | Ensures later steps see complete data and can compute their own statistics. |
| Resample third | Aligns every channel on a common grid before any cosmetic/denoising filters. |
| Smooth fourth | Removes high-frequency noise (and interpolation stair-steps) before scaling. |
| Normalize fifth | Computes scale/offset on the final, de-noised series—the values models will actually consume. |
| Export last | Persist the final artifacts and full provenance (config + versions + checksums). |
DSP caveat: when downsampling uniformly-sampled data
Classical DSP recommends anti-alias smoothing before decimation on a fixed-rate uniform grid to avoid aliasing when you downsample (e.g., large factor reductions or frequency-domain analysis). Polyphase FIR resamplers (e.g., Kaiser-window designs) embed this safeguard by low-pass filtering to the new Nyquist rate during rate conversion. See Crochiere & Rabiner (1983) in References.
Worked example (numbers)¶
Raw signal sampled every \(1\,\mathrm{s}\):
x = [ 2, NaN, 6, 5 ]
Configuration
- Fill – linear interpolation
- Resample – upsample to \(0.5\,\mathrm{s}\) (factor 2) by linear interpolation
- Smooth – 3-point simple moving average
- Normalize – z-score \((y-\mu)/\sigma\)
Pipeline A — Fill → Resample → Smooth → Normalize¶
| Stage | Output |
|---|---|
| Fill \(F\) | \(x_{\text{filled}} = [2,\; 4,\; 6,\; 5]\) |
| Resample \(R\) | \(x_{\text{resampled}} = [2,\; 3,\; 4,\; 5,\; 6,\; 5.5]\) |
| Smooth \(S\) | \(x_{\text{smoothed}} = [\text{NaN},\; 3,\; 4,\; 5,\; 5.5,\; \text{NaN}]\) |
| Normalize \(N\) | mean \(\mu=4.38\), stdev \(\sigma\approx0.96\) → \([\text{NaN},\, -1.43,\, -0.39,\, 0.65,\, 1.17,\, \text{NaN}]\) |
Pipeline B — Fill → Resample → Normalize → Smooth¶
| Stage | Output |
|---|---|
| Fill | \(x_{\text{filled}} = [2,\; 4,\; 6,\; 5]\) |
| Resample | \(x_{\text{resampled}} = [2,\; 3,\; 4,\; 5,\; 6,\; 5.5]\) |
| Normalize | \(\mu=4.25,\;\sigma\approx1.41 \Rightarrow [-1.60,\; -0.89,\; -0.18,\; 0.53,\; 1.24,\; 0.89]\) |
| Smooth | \([\text{NaN},\; -0.89,\; -0.18,\; 0.53,\; 0.89,\; \text{NaN}]\) |
Result – The final vectors differ in magnitude and structure, confirming \(N\circ S \neq S\circ N\).
Reproducible code¶
=== "Python"
import numpy as np
def fill_linear(x):
"""
Linearly interpolate over NaN values in a 1D array.
Parameters
----------
x : array-like of float
Input signal with possible NaN entries. Assumed to be
sampled at uniform time intervals (default 1 s in this workflow).
Returns
-------
x_filled : ndarray
Array where NaN values have been replaced by values obtained from
1D linear interpolation between valid points.
Notes
-----
This implements the 'F' (Fill) step in the preprocessing pipeline.
It assumes the time axis is simply the array index (0, 1, ..., N-1).
"""
x = np.asarray(x, dtype=float)
t = np.arange(len(x), dtype=float)
mask = np.isfinite(x)
# Linear interpolation over finite points only
x_filled = np.interp(t, t[mask], x[mask])
return x_filled
def resample_linear(x_filled, dt_src=1.0, dt_new=0.5):
"""
Uniformly resample a 1D signal to a finer (or coarser) time step
using linear interpolation.
Parameters
----------
x_filled : ndarray
Signal sampled at dt_src seconds, with no NaNs.
dt_src : float, optional
Original sampling interval in seconds. Default is 1.0 s.
dt_new : float, optional
New sampling interval in seconds. Default is 0.5 s.
Returns
-------
y : ndarray
Resampled signal at dt_new spacing, stopping before the final point.
t_new : ndarray
Corresponding time coordinates for the resampled signal.
Notes
-----
This is the 'R' (Resample) step. The time grid stops at T - dt_new,
where T = (N-1) * dt_src, to avoid duplicating the last original sample.
"""
T = dt_src * (len(x_filled) - 1)
t_src = np.arange(0.0, T + 1e-12, dt_src)
t_new = np.arange(0.0, T, dt_new) # excludes the final T point
y = np.interp(t_new, t_src, x_filled)
return y, t_new
def moving_average_3(y):
"""
Apply a centered 3-point moving average to a 1D signal.
Parameters
----------
y : ndarray
Input signal (1D array).
Returns
-------
sm : ndarray
Smoothed signal with NaN at the first and last positions,
because a centered average cannot be computed at endpoints.
Notes
-----
This is the 'S' (Smooth) step. Each valid point is the mean of the
sample itself and its immediate neighbors.
"""
y = np.asarray(y, dtype=float)
sm = np.full_like(y, np.nan)
for i in range(1, len(y)-1):
sm[i] = (y[i-1] + y[i] + y[i+1]) / 3.0
return sm
def zscore(arr, mask=None, ddof=0):
"""
Standardize a signal to zero mean and unit variance (z-score).
Parameters
----------
arr : ndarray
Input signal (1D array) with possible NaNs.
mask : ndarray of bool, optional
Boolean mask of which entries to include in mean/std computation.
If None, all finite values are included.
ddof : int, optional
Delta degrees of freedom for the standard deviation.
ddof=0 gives the population standard deviation,
ddof=1 gives the sample standard deviation.
Returns
-------
z : ndarray
z-scored signal, with NaN where mask is False.
mu : float
Mean of the included values.
sigma : float
Standard deviation of the included values.
Notes
-----
This is the 'N' (Normalize) step in the pipeline.
"""
a = np.asarray(arr, dtype=float)
if mask is None:
mask = np.isfinite(a)
mu = np.nanmean(a[mask])
if ddof > 0:
sigma = np.sqrt(np.nanmean((a[mask] - mu) ** 2) *
(len(a[mask]) / (len(a[mask]) - ddof)))
else:
sigma = np.sqrt(np.nanmean((a[mask] - mu) ** 2))
z = np.full_like(a, np.nan)
z[mask] = (a[mask] - mu) / sigma
return z, mu, sigma
def fmt(arr):
"""Pretty-print "format (fmt)" a NumPy array with consistent formatting."""
return np.array2string(arr, precision=2, separator=", ", suppress_small=False)
# -------------------------------------------------------------------------
# Raw input signal: sampled at 1 s, with a missing value at index 1
x_raw = np.array([2.0, np.nan, 6.0, 5.0])
# Pipeline A: Fill -> Resample -> Smooth -> Normalize
x_filled_A = fill_linear(x_raw)
y_A, tA = resample_linear(x_filled_A, dt_src=1.0, dt_new=0.5)
yA_sm = moving_average_3(y_A)
zA, muA, sigA = zscore(yA_sm, mask=np.isfinite(yA_sm), ddof=0)
# Pipeline B: Fill -> Resample -> Normalize -> Smooth
x_filled_B = fill_linear(x_raw)
y_B, tB = resample_linear(x_filled_B, dt_src=1.0, dt_new=0.5)
zB, muB, sigB = zscore(y_B, ddof=0)
zB_sm = moving_average_3(zB)
# -------------------------------------------------------------------------
# Display results
print("Pipeline A (F -> R -> S -> N)")
print("Fill:", fmt(x_filled_A))
print("Resample:", fmt(y_A))
print("Smooth (3-pt centered MA):", fmt(yA_sm))
print(f"Normalize (z-score over finite smoothed): mu={muA:.2f}, sigma={sigA:.2f}")
print("z:", fmt(zA))
print()
print("Pipeline B (F -> R -> N -> S)")
print("Fill:", fmt(x_filled_B))
print("Resample:", fmt(y_B))
print(f"Normalize (z-score over resampled): mu={muB:.2f}, sigma={sigB:.2f}")
print("z:", fmt(zB))
print("Smooth (on normalized):", fmt(zB_sm))
Advanced example: Non-commutative smoothing & polyphase Kaiser down-sampling¶
We synthesize a \(10\,\mathrm{s}\) record with a slow \(0.6\,\mathrm{Hz}\) sine, mild Gaussian noise, and a brief, high-amplitude \(12\,\mathrm{Hz}\) burst confined to \(t\in(4,4.5)\,\mathrm{s}\). We compare two pipelines:
- SG(31,3): Savitzky–Golay local cubic smoother with window 31 samples.
- Kaiser(d=10, β=2): Kaiser-window polyphase FIR rate conversion with downsample factor 10 and Kaiser–Bessel parameter \(\beta=2\).
What does β control?
The Kaiser window’s \(\beta\) parameter (a.k.a. shape or Kaiser–Bessel parameter) trades main-lobe width against side-lobe attenuation: larger \(\beta\) → stronger stop-band attenuation but wider transition band.
Outcome (quantitative):
- Pipeline A retains the burst’s \(\pm 5\) swing.
- Pipeline B suppresses it almost completely.
- Divergence: \(\max\lvert \text{A} - \text{B} \rvert = 4.6\).
*Pipeline A: Smooth → Kaiser Downsample. The burst survives (attenuated).*
*Pipeline B: Kaiser Downsample → Smooth. The burst is nearly erased.*
Practical guidance¶
- Order is crucial. Swapping steps that compute statistics (fill, normalize) or drop information (decimate, smooth) changes the data.
- Stay consistent. Downstream models implicitly assume identical preprocessing across all inputs.
- Default to Trim → Fill → Resample → Smooth → Normalize → Export;
deviate when domain knowledge demands it (e.g., normalize raw counts before frequency analysis; anti-alias smooth before aggressive decimation).
Reproducible code¶
=== "Python"
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import resample_poly, savgol_filter
# --- 1. Synthetic signal with an ultra‑high‑frequency burst ---------------
np.random.seed(1)
N = 5000
x = np.linspace(0, 10, N)
# Slow sine + short, high‑freq, high‑amplitude burst + mild noise
signal = np.sin(2 * np.pi * 0.6 * x)
burst_mask = (x > 4.0) & (x < 4.5)
signal += burst_mask * 5 * np.sin(2 * np.pi * 12 * x) # 12Hz inside 4-4.5s
signal += 0.1 * np.random.randn(N)
# --- 2. Define operators ---------------------------------------------------
def smooth(sig):
"""Savitzky–Golay local cubic smoothing (window = 31 samples)."""
return savgol_filter(sig, 31, 3)
def kaiser_downsample(sig, factor=10, beta=2.0):
"""
Downsample by `factor' using a *polyphase FIR* whose taps are
shaped by a Kaiser window. A low beta ( approx 2) deliberately relaxes
stop-band attenuation so any aliasing becomes visually obvious.
"""
return resample_poly(sig, up=1, down=factor, window=('kaiser', beta))
factor = 10 # 10x rate reduction
# --- 3. Two noncommuting pipelines ---------------------------------------
pipe_A = kaiser_downsample(smooth(signal), factor) # Smooth → Downsample
pipe_B = smooth(kaiser_downsample(signal, factor)) # Downsample → Smooth
# --- 4. Stretch results back to original grid for overlay -----------------
def stretch(arr, factor, target_len):
return np.repeat(arr, factor)[:target_len]
pipe_A_stretched = stretch(pipe_A, factor, N)
pipe_B_stretched = stretch(pipe_B, factor, N)
# --- 5. Plot 1: Smooth -> Kaiser-Downsample --------------------------------
plt.figure(figsize=(10, 4))
plt.plot(x, signal, label="Original", alpha=0.4, linewidth=1)
plt.plot(x, pipe_A_stretched, label="Smooth → Kaiser‑Downsample", linewidth=1)
plt.title("Pipeline A: Smooth then Kaiser Downsample")
plt.xlabel("Time[s]")
plt.ylabel("Amplitude")
plt.legend()
plt.tight_layout()
plt.show()
# --- 6. Plot 2: Kaiser-Downsample -> Smooth --------------------------------
plt.figure(figsize=(10, 4))
plt.plot(x, signal, label="Original", alpha=0.4, linewidth=1)
plt.plot(x, pipe_B_stretched, label="Kaiser‑Downsample → Smooth", linewidth=1)
plt.title("Pipeline B: Kaiser Downsample then Smooth")
plt.xlabel("Time [s]")
plt.ylabel("Amplitude")
plt.legend()
plt.tight_layout()
plt.show()