Autolabeling

Manual labeling of data does not scale to petabyte-level, multimodal datasets. Automated labeling offers a way to generate consistent, reproducible event tags at scale, supporting both offline analysis and real-time decision-making.

dFL supports a variety of automated labeling strategies. The following are five common exemplars, though the space of possible methods is much broader:


Physics-informed (or Application-informed) methods

Rule-based approaches grounded in known signal patterns.

Examples:

  • Peak finding for ELMs, sawteeth, or pellet events (available in the Archaieus Autolabeler; a minimal sketch follows this list).
  • Derivative thresholds for abrupt MHD changes (a simple threshold autolabeler is provided in Custom Autolabeling).
  • Zero-crossing or turning-point detection in simulation outputs (available in the Archaieus Autolabeler).
  • Custom labelers (see the Custom Autolabeling section for an ELM slope-based outlier auto-detection example).
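
As an illustration of the rule-based pattern (outside dFL itself), here is a minimal peak-finding sketch using SciPy. The `height` and `distance` thresholds are placeholder values that would need tuning for a real diagnostic, and the zero-width events stand in for whatever interval convention a labeler uses.

```python
import numpy as np
from scipy.signal import find_peaks

def label_peaks(t, x, height=None, distance=None):
    """Return zero-width candidate events at detected peaks.

    t, x     : 1-D arrays of timestamps and signal values
    height   : minimum peak amplitude (tune per diagnostic)
    distance : minimum sample spacing between accepted peaks
    """
    idx, _ = find_peaks(x, height=height, distance=distance)
    # A real labeler would widen each peak into an interval;
    # here every event is a zero-width (start, end) stub.
    return [(t[i], t[i]) for i in idx]

# Toy example: periodic spikes on a noisy baseline, loosely ELM-like
t = np.linspace(0.0, 1.0, 2000)
x = 0.1 * np.random.default_rng(0).standard_normal(t.size)
x[::200] += 1.0  # inject ten spikes
print(len(label_peaks(t, x, height=0.5, distance=50)), "candidate events")
```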

Statistics-based methods

Probabilistic tools for detecting anomalies, drifts, or regime changes.

Examples:

  • Control charts for calibration shifts
  • Outlier detection
  • Hypothesis testing or change-point detection
  • Kernel density estimation for distributional shifts
  • Custom autolabeling

Three statistics-based autolabeling methods are natively supported in dFL through the "Stats" graph:

1. Moving-Window Z-Score

For each sample \(x_t\), compute the rolling mean \(\mu_t\) and standard deviation \(\sigma_t\) over a window of size \(W\):

\[ Z_t = \frac{x_t - \mu_t}{\sigma_t}, \quad \mu_t = \frac{1}{W}\sum_{i=t-W+1}^t x_i, \quad \sigma_t = \sqrt{\frac{1}{W-1}\sum_{i=t-W+1}^t (x_i - \mu_t)^2}. \]

An anomaly is flagged when \(|Z_t| > \tau_\sigma\), where \(\tau_\sigma\) is the user-defined threshold.

  • Pros: Simple, efficient, interpretable
  • Cons: Sensitive to nonstationarity and outliers
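
As a rough illustration (not the dFL implementation), this detector can be written in a few lines of NumPy; `W` and `tau` correspond to the window size and threshold \(\tau_\sigma\) above.

```python
import numpy as np

def zscore_anomalies(x, W=50, tau=3.0):
    """Flag samples whose trailing-window z-score exceeds tau.

    The window [t-W+1, t] includes the current sample, matching the
    formula above; the first W-1 samples are left unflagged.
    """
    x = np.asarray(x, dtype=float)
    flags = np.zeros(x.size, dtype=bool)
    for t in range(W - 1, x.size):
        win = x[t - W + 1 : t + 1]
        mu = win.mean()
        sigma = win.std(ddof=1)  # ddof=1 gives the 1/(W-1) estimator
        if sigma > 0.0 and abs(x[t] - mu) > tau * sigma:
            flags[t] = True
    return flags
```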

2. Moving-Window CUSUM

Tracks small, persistent shifts in the mean using cumulative sums:

\[ C_t^+ = \max\big[0, C_{t-1}^+ + (x_t - \mu_0 - k)\big], \quad C_t^- = \max\big[0, C_{t-1}^- + (\mu_0 - x_t - k)\big], \]

where \(\mu_0\) is the nominal mean (estimated per window), and \(k\) is a sensitivity parameter.

An anomaly is flagged when \(C_t^+ > h\) or \(C_t^- > h\), where \(h\) is the detection threshold.

  • Pros: Detects gradual drifts and sustained changes
  • Cons: Requires a stable baseline; less effective for transient events
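
Again purely as a sketch, the recursions above translate directly into NumPy. Here `k` and `h` are in signal units, the nominal mean is re-estimated from each trailing window, and resetting the sums after a detection is one common convention rather than part of the formula.

```python
import numpy as np

def cusum_anomalies(x, W=200, k=0.5, h=5.0):
    """Two-sided CUSUM with a per-window nominal mean.

    mu0 is re-estimated from the trailing window of size W; the C+ and
    C- recursions accumulate deviations beyond the slack k, and a flag
    is raised whenever either sum exceeds the threshold h.
    """
    x = np.asarray(x, dtype=float)
    flags = np.zeros(x.size, dtype=bool)
    c_pos = c_neg = 0.0
    for t in range(W, x.size):
        mu0 = x[t - W : t].mean()  # nominal mean, current sample excluded
        c_pos = max(0.0, c_pos + (x[t] - mu0 - k))
        c_neg = max(0.0, c_neg + (mu0 - x[t] - k))
        if c_pos > h or c_neg > h:
            flags[t] = True
            c_pos = c_neg = 0.0  # restart the sums after a detection
    return flags
```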

3. Moving Average with Confidence Intervals

Computes a rolling mean \(\mu_t\) and confidence interval:

\[ \text{CI}_t = \mu_t \pm z_{1-\alpha/2}\,\frac{\sigma_t}{\sqrt{W}}, \]

where \(\sigma_t\) is the rolling standard deviation, \(\alpha\) is the significance level, and \(z_{1-\alpha/2}\) is the standard normal quantile.

An anomaly is flagged when \(x_t \notin \text{CI}_t\).

  • Pros: Statistically interpretable, tunable false-positive rate
  • Cons: Assumes approximate normality and weak autocorrelation
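
A matching sketch for the confidence-interval detector, with the quantile \(z_{1-\alpha/2}\) obtained from SciPy; as before, this is illustrative rather than the dFL code.

```python
import numpy as np
from scipy.stats import norm

def ci_anomalies(x, W=50, alpha=0.05):
    """Flag samples outside the rolling-mean confidence interval.

    Implements CI_t = mu_t +/- z_{1-alpha/2} * sigma_t / sqrt(W) over
    the trailing window of size W, as in the formula above.
    """
    x = np.asarray(x, dtype=float)
    z = norm.ppf(1.0 - alpha / 2.0)  # standard normal quantile
    flags = np.zeros(x.size, dtype=bool)
    for t in range(W - 1, x.size):
        win = x[t - W + 1 : t + 1]
        mu, sigma = win.mean(), win.std(ddof=1)
        if abs(x[t] - mu) > z * sigma / np.sqrt(W):
            flags[t] = True
    return flags
```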

Workflow example in dFL

  1. Select Stats as the graph type.
  2. Choose a detector under Statistical Parameters of First Signal from Selection.
  3. Set window size and sensitivity (Threshold or CI level).
  4. View anomalies directly on the signal (optional: highlight cutoff lines).
  5. Capture anomalies as proposed labels with Capture Anomalies into Proposed Labels.
  6. Confirm them under Auto Labeler → Confirm Autolabels for Current Records.
  7. Export results via the Current Labels panel.

Data-driven methods

Machine learning techniques for noisy, high-dimensional, or partially labeled datasets.

Examples:

  • Pre-trained classifiers (e.g., CNNs for MHD modes)
  • Transfer learning from related time-series tasks
  • Clustering to reveal latent regimes
  • Autoencoders or contrastive methods for feature embeddings
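
For instance, clustering summary statistics of short signal windows can surface latent regimes without any labels. The sketch below uses scikit-learn as a generic illustration; the window length, features, and cluster count are arbitrary choices, and it is not a dFL component.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_regimes(x, W=100, n_regimes=3, seed=0):
    """Propose regime labels by clustering per-window summary statistics.

    Each non-overlapping window of length W is reduced to three features
    (mean, std, mean |first difference|); KMeans then groups the windows
    into candidate regimes, one label per window.
    """
    x = np.asarray(x, dtype=float)
    windows = [x[i * W : (i + 1) * W] for i in range(x.size // W)]
    feats = np.array([
        [w.mean(), w.std(), np.abs(np.diff(w)).mean()] for w in windows
    ])
    km = KMeans(n_clusters=n_regimes, n_init=10, random_state=seed)
    return km.fit_predict(feats)
```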

Pre-trained classifier workflow example in dFL

  1. In the Auto Labeler tab, under the Select Autolabeler dropdown, select "Plasma mode autolabeling".
  2. Add as many graphs as needed to analyze the multimodal signal behavior over the data set.
  3. Select the record you would like to autolabel.
  4. Select the "Target mode" under the Auto Labeler tab options.
  5. Hit Autolabel Record and make sure Display Labels is selected under the Main Control Panel tab.
  6. View labels under Current Proposed Labels.
  7. Confirm them under Auto Labeler → Confirm Autolabels for Current Records.
  8. Export results via the Current Labels panel.

Simulation-driven methods

Labels can also be derived from high-fidelity models or synthetic diagnostics.

Examples:
  • Extended-MHD or gyrokinetic simulations (e.g., from MGKDB) to mark instability onset
  • Transport solvers for L-H transitions or ITB formation
  • Synthetic diagnostics mapping simulations onto experiments

In this case, either (1) a framework for mapping the simulations onto the experimental data must be applied and integrated as an autolabeler, or (2) an autolabeler must be written to identify regions (or patterns) in the simulation data. Both can be developed in dFL using the Custom Autolabeling API.
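
Purely as a sketch of option (2), the hypothetical function below labels contiguous intervals where a simulated growth rate exceeds a threshold. Its signature and names are placeholders, not the actual Custom Autolabeling API, which is documented in the Custom Autolabeling section.

```python
import numpy as np

# Hypothetical sketch only: this placeholder signature is NOT the real
# dFL Custom Autolabeling API (see the Custom Autolabeling section).
def growth_rate_autolabeler(t, gamma, gamma_crit=0.0, label="unstable"):
    """Label contiguous intervals where a simulated growth rate
    exceeds a critical threshold (option 2 above).

    t          : simulation time base
    gamma      : linear growth rate on that time base
    gamma_crit : onset threshold (0 marks marginal stability)
    Returns (start, end, label) tuples, one per unstable span.
    """
    above = np.asarray(gamma) > gamma_crit
    edges = np.diff(above.astype(int))
    starts = list(np.where(edges == 1)[0] + 1)
    ends = list(np.where(edges == -1)[0] + 1)
    if above[0]:
        starts.insert(0, 0)
    if above[-1]:
        ends.append(above.size)
    return [(t[s], t[e - 1], label) for s, e in zip(starts, ends)]
```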


Hybrid / expert-in-the-loop

Pipelines that blend multiple methods with human oversight.

Examples:
  • A derivative-based ELM detector refined by a neural network.
  • Expert-reviewed candidate events for training ML classifiers (this is how the classifier above was trained: first by manual labeling, then by a classifier trained on a labeled subset of shots).


Autolabeling: tip of the iceberg

These exemplars illustrate the flexibility of dFL: users can combine, extend, or replace methods to fit their specific data and workflows. Automated labeling is not a fixed recipe but a toolkit for extracting meaningful events from complex datasets.

Automated labeling in dFL is an enabling technology: it accelerates ML training, supports cross-facility data sharing, and provides reliable event markers for archival analysis, real-time control, or predicting future events.