Order of Operations in Data Harmonization¶

When the Labeler cleans a signal it may trim, fill gaps, resample, smooth, normalize, and export. Because each step is an operator on the data, changing their order usually changes the result. This page explains the math, recommends a default pipeline, and shows two examples (one simple, one advanced).

Why the order matters¶

Let \(\mathcal{X}\) be the space of signals with \(x\in\mathcal{X}\). Each step—Trim \((T)\), Fill \((F)\), Resample \((R)\), Smooth \((S)\), Normalize \((N)\)—acts as an operator on \(\mathcal{X}\).

\[ A\circ B(x)=B\circ A(x)\quad\text{for all }x\in\mathcal{X}\quad\Longleftrightarrow\quad A\text{ and }B\text{ commute.} \]

Most harmonization steps do not commute because they either

depend on statistics of the input (non-linear),
discard information (non-invertible),
or both.

Step	Symbol	Nature	Typical action
Trim	\(T\)	selection (non-invertible)	drop out-of-ROI spans, clip to interval
Data fill	\(F\)	non-linear, data-dependent	replace NaNs by interpolation/constant
Resampling	\(R\)	linear, often non-invertible	change sampling grid (up/down/irregular→regular)
Smoothing	\(S\)	linear, non-invertible	low-pass filter, moving average, Savitzky–Golay
Normalization	\(N\)	non-linear, data-dependent	z-score, median–IQR, min–max, robust scaling
Export	—	I/O	write artifacts with provenance

Unless special symmetry holds,

\[ N\!\circ S \neq S\!\circ N,\qquad S\!\circ R \neq R\!\circ S,\qquad R\!\circ F \neq F\!\circ R. \]

Recommended pipeline (default)¶

Trim → Fill → Resample → Smooth → Normalize → Export

Step	Reason
Trim first	Restricts processing to the region of interest; avoids computing statistics on discarded spans.
Fill second	Ensures later steps see complete data and can compute their own statistics.
Resample third	Aligns every channel on a common grid before any cosmetic/denoising filters.
Smooth fourth	Removes high-frequency noise (and interpolation stair-steps) before scaling.
Normalize fifth	Computes scale/offset on the final, de-noised series—the values models will actually consume.
Export last	Persist the final artifacts and full provenance (config + versions + checksums).

DSP caveat: when downsampling uniformly-sampled data

Classical DSP recommends anti-alias smoothing before decimation on a fixed-rate uniform grid to avoid aliasing when you downsample (e.g., large factor reductions or frequency-domain analysis). Polyphase FIR resamplers (e.g., Kaiser-window designs) embed this safeguard by low-pass filtering to the new Nyquist rate during rate conversion. See Crochiere & Rabiner (1983) in References.

Worked example (numbers)¶

Raw signal sampled every \(1\,\mathrm{s}\):
x = [ 2, NaN, 6, 5 ]

Configuration

Fill – linear interpolation
Resample – upsample to \(0.5\,\mathrm{s}\) (factor 2) by linear interpolation
Smooth – 3-point simple moving average
Normalize – z-score \((y-\mu)/\sigma\)

Pipeline A — Fill → Resample → Smooth → Normalize¶

Stage	Output
Fill \(F\)	\(x_{\text{filled}} = [2,\; 4,\; 6,\; 5]\)
Resample \(R\)	\(x_{\text{resampled}} = [2,\; 3,\; 4,\; 5,\; 6,\; 5.5]\)
Smooth \(S\)	\(x_{\text{smoothed}} = [\text{NaN},\; 3,\; 4,\; 5,\; 5.5,\; \text{NaN}]\)
Normalize \(N\)	mean \(\mu=4.38\), stdev \(\sigma\approx0.96\) → \([\text{NaN},\, -1.43,\, -0.39,\, 0.65,\, 1.17,\, \text{NaN}]\)

Pipeline B — Fill → Resample → Normalize → Smooth¶

Stage	Output
Fill	\(x_{\text{filled}} = [2,\; 4,\; 6,\; 5]\)
Resample	\(x_{\text{resampled}} = [2,\; 3,\; 4,\; 5,\; 6,\; 5.5]\)
Normalize	\(\mu=4.25,\;\sigma\approx1.41 \Rightarrow [-1.60,\; -0.89,\; -0.18,\; 0.53,\; 1.24,\; 0.89]\)
Smooth	\([\text{NaN},\; -0.89,\; -0.18,\; 0.53,\; 0.89,\; \text{NaN}]\)

Result – The final vectors differ in magnitude and structure, confirming \(N\circ S \neq S\circ N\).

Reproducible code¶

=== "Python"

import numpy as np

def fill_linear(x):
    """
    Linearly interpolate over NaN values in a 1D array.

    Parameters
    ----------
    x : array-like of float
        Input signal with possible NaN entries. Assumed to be
        sampled at uniform time intervals (default 1 s in this workflow).

    Returns
    -------
    x_filled : ndarray
        Array where NaN values have been replaced by values obtained from
        1D linear interpolation between valid points.

    Notes
    -----
    This implements the 'F' (Fill) step in the preprocessing pipeline.
    It assumes the time axis is simply the array index (0, 1, ..., N-1).
    """
    x = np.asarray(x, dtype=float)
    t = np.arange(len(x), dtype=float)
    mask = np.isfinite(x)
    # Linear interpolation over finite points only
    x_filled = np.interp(t, t[mask], x[mask])
    return x_filled


def resample_linear(x_filled, dt_src=1.0, dt_new=0.5):
    """
    Uniformly resample a 1D signal to a finer (or coarser) time step
    using linear interpolation.

    Parameters
    ----------
    x_filled : ndarray
        Signal sampled at dt_src seconds, with no NaNs.
    dt_src : float, optional
        Original sampling interval in seconds. Default is 1.0 s.
    dt_new : float, optional
        New sampling interval in seconds. Default is 0.5 s.

    Returns
    -------
    y : ndarray
        Resampled signal at dt_new spacing, stopping before the final point.
    t_new : ndarray
        Corresponding time coordinates for the resampled signal.

    Notes
    -----
    This is the 'R' (Resample) step. The time grid stops at T - dt_new,
    where T = (N-1) * dt_src, to avoid duplicating the last original sample.
    """
    T = dt_src * (len(x_filled) - 1)
    t_src = np.arange(0.0, T + 1e-12, dt_src)
    t_new = np.arange(0.0, T, dt_new)  # excludes the final T point
    y = np.interp(t_new, t_src, x_filled)
    return y, t_new


def moving_average_3(y):
    """
    Apply a centered 3-point moving average to a 1D signal.

    Parameters
    ----------
    y : ndarray
        Input signal (1D array).

    Returns
    -------
    sm : ndarray
        Smoothed signal with NaN at the first and last positions,
        because a centered average cannot be computed at endpoints.

    Notes
    -----
    This is the 'S' (Smooth) step. Each valid point is the mean of the
    sample itself and its immediate neighbors.
    """
    y = np.asarray(y, dtype=float)
    sm = np.full_like(y, np.nan)
    for i in range(1, len(y)-1):
        sm[i] = (y[i-1] + y[i] + y[i+1]) / 3.0
    return sm


def zscore(arr, mask=None, ddof=0):
    """
    Standardize a signal to zero mean and unit variance (z-score).

    Parameters
    ----------
    arr : ndarray
        Input signal (1D array) with possible NaNs.
    mask : ndarray of bool, optional
        Boolean mask of which entries to include in mean/std computation.
        If None, all finite values are included.
    ddof : int, optional
        Delta degrees of freedom for the standard deviation.
        ddof=0 gives the population standard deviation,
        ddof=1 gives the sample standard deviation.

    Returns
    -------
    z : ndarray
        z-scored signal, with NaN where mask is False.
    mu : float
        Mean of the included values.
    sigma : float
        Standard deviation of the included values.

    Notes
    -----
    This is the 'N' (Normalize) step in the pipeline.
    """
    a = np.asarray(arr, dtype=float)
    if mask is None:
        mask = np.isfinite(a)
    mu = np.nanmean(a[mask])
    if ddof > 0:
        sigma = np.sqrt(np.nanmean((a[mask] - mu) ** 2) *
                        (len(a[mask]) / (len(a[mask]) - ddof)))
    else:
        sigma = np.sqrt(np.nanmean((a[mask] - mu) ** 2))
    z = np.full_like(a, np.nan)
    z[mask] = (a[mask] - mu) / sigma
    return z, mu, sigma


def fmt(arr):
    """Pretty-print "format (fmt)" a NumPy array with consistent formatting."""
    return np.array2string(arr, precision=2, separator=", ", suppress_small=False)


# -------------------------------------------------------------------------
# Raw input signal: sampled at 1 s, with a missing value at index 1
x_raw = np.array([2.0, np.nan, 6.0, 5.0])

# Pipeline A: Fill -> Resample -> Smooth -> Normalize
x_filled_A = fill_linear(x_raw)
y_A, tA = resample_linear(x_filled_A, dt_src=1.0, dt_new=0.5)
yA_sm = moving_average_3(y_A)
zA, muA, sigA = zscore(yA_sm, mask=np.isfinite(yA_sm), ddof=0)

# Pipeline B: Fill -> Resample -> Normalize -> Smooth
x_filled_B = fill_linear(x_raw)
y_B, tB = resample_linear(x_filled_B, dt_src=1.0, dt_new=0.5)
zB, muB, sigB = zscore(y_B, ddof=0)
zB_sm = moving_average_3(zB)

# -------------------------------------------------------------------------
# Display results
print("Pipeline A (F -> R -> S -> N)")
print("Fill:", fmt(x_filled_A))
print("Resample:", fmt(y_A))
print("Smooth (3-pt centered MA):", fmt(yA_sm))
print(f"Normalize (z-score over finite smoothed): mu={muA:.2f}, sigma={sigA:.2f}")
print("z:", fmt(zA))
print()
print("Pipeline B (F -> R -> N -> S)")
print("Fill:", fmt(x_filled_B))
print("Resample:", fmt(y_B))
print(f"Normalize (z-score over resampled): mu={muB:.2f}, sigma={sigB:.2f}")
print("z:", fmt(zB))
print("Smooth (on normalized):", fmt(zB_sm))

Advanced example: Non-commutative smoothing & polyphase Kaiser down-sampling¶

We synthesize a \(10\,\mathrm{s}\) record with a slow \(0.6\,\mathrm{Hz}\) sine, mild Gaussian noise, and a brief, high-amplitude \(12\,\mathrm{Hz}\) burst confined to \(t\in(4,4.5)\,\mathrm{s}\). We compare two pipelines:

\[ \begin{aligned} \text{Pipeline A}:&\ \text{SG}(31,3)\;\longrightarrow\;\text{Kaiser}(d{=}10,\ \beta{=}2),\\[4pt] \text{Pipeline B}:&\ \text{Kaiser}(d{=}10,\ \beta{=}2)\;\longrightarrow\;\text{SG}(31,3). \end{aligned} \]

SG(31,3): Savitzky–Golay local cubic smoother with window 31 samples.
Kaiser(d=10, β=2): Kaiser-window polyphase FIR rate conversion with downsample factor 10 and Kaiser–Bessel parameter \(\beta=2\).

What does β control?

The Kaiser window’s \(\beta\) parameter (a.k.a. shape or Kaiser–Bessel parameter) trades main-lobe width against side-lobe attenuation: larger \(\beta\) → stronger stop-band attenuation but wider transition band.

Outcome (quantitative):

Pipeline A retains the burst’s \(\pm 5\) swing.
Pipeline B suppresses it almost completely.
Divergence: \(\max\lvert \text{A} - \text{B} \rvert = 4.6\).

*Pipeline A: Smooth → Kaiser Downsample. The burst survives (attenuated).*

*Pipeline B: Kaiser Downsample → Smooth. The burst is nearly erased.*

Practical guidance¶

Order is crucial. Swapping steps that compute statistics (fill, normalize) or drop information (decimate, smooth) changes the data.
Stay consistent. Downstream models implicitly assume identical preprocessing across all inputs.
Default to Trim → Fill → Resample → Smooth → Normalize → Export;
deviate when domain knowledge demands it (e.g., normalize raw counts before frequency analysis; anti-alias smooth before aggressive decimation).

Reproducible code¶

=== "Python"

import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import resample_poly, savgol_filter

# --- 1. Synthetic signal with an ultra‑high‑frequency burst ---------------
np.random.seed(1)
N = 5000
x = np.linspace(0, 10, N)

# Slow sine + short, high‑freq, high‑amplitude burst + mild noise
signal = np.sin(2 * np.pi * 0.6 * x)
burst_mask = (x > 4.0) & (x < 4.5)
signal += burst_mask * 5 * np.sin(2 * np.pi * 12 * x)      # 12Hz inside 4-4.5s
signal += 0.1 * np.random.randn(N)

# --- 2. Define operators ---------------------------------------------------
def smooth(sig):
    """Savitzky–Golay local cubic smoothing (window = 31 samples)."""
    return savgol_filter(sig, 31, 3)

def kaiser_downsample(sig, factor=10, beta=2.0):
    """
    Downsample by `factor' using a *polyphase FIR* whose taps are
    shaped by a Kaiser window.  A low beta ( approx 2) deliberately relaxes
    stop-band attenuation so any aliasing becomes visually obvious.
    """
    return resample_poly(sig, up=1, down=factor, window=('kaiser', beta))

factor = 10  # 10x rate reduction

# --- 3. Two noncommuting pipelines ---------------------------------------
pipe_A = kaiser_downsample(smooth(signal), factor)   # Smooth → Downsample
pipe_B = smooth(kaiser_downsample(signal, factor))   # Downsample → Smooth

# --- 4. Stretch results back to original grid for overlay -----------------
def stretch(arr, factor, target_len):
    return np.repeat(arr, factor)[:target_len]

pipe_A_stretched = stretch(pipe_A, factor, N)
pipe_B_stretched = stretch(pipe_B, factor, N)

# --- 5. Plot 1: Smooth -> Kaiser-Downsample --------------------------------
plt.figure(figsize=(10, 4))
plt.plot(x, signal, label="Original", alpha=0.4, linewidth=1)
plt.plot(x, pipe_A_stretched, label="Smooth → Kaiser‑Downsample", linewidth=1)
plt.title("Pipeline A: Smooth then Kaiser Downsample")
plt.xlabel("Time[s]")
plt.ylabel("Amplitude")
plt.legend()
plt.tight_layout()
plt.show()

# --- 6. Plot 2: Kaiser-Downsample -> Smooth --------------------------------
plt.figure(figsize=(10, 4))
plt.plot(x, signal, label="Original", alpha=0.4, linewidth=1)
plt.plot(x, pipe_B_stretched, label="Kaiser‑Downsample → Smooth", linewidth=1)
plt.title("Pipeline B: Kaiser Downsample then Smooth")
plt.xlabel("Time [s]")
plt.ylabel("Amplitude")
plt.legend()
plt.tight_layout()
plt.show()

Order of Operations in Data Harmonization¶

Why the order matters¶

Recommended pipeline (default)¶

Worked example (numbers)¶

Pipeline A — Fill → Resample → Smooth → Normalize¶

Pipeline B — Fill → Resample → Normalize → Smooth¶

Reproducible code¶

Advanced example: Non-commutative smoothing & polyphase Kaiser down-sampling¶

Practical guidance¶

Reproducible code¶

Further reading¶