Autolabeling¶
Manual labeling of data does not scale to petabyte-level, multimodal datasets. Automated labeling offers a way to generate consistent, reproducible event tags at scale, supporting both offline analysis and real-time decision-making.
dFL supports a variety of automated labeling strategies. The following are five common exemplars, though the space of possible methods is much broader:
Physics-informed (or Application-informed) methods¶
Rule-based approaches grounded in known signal patterns.
Examples:
- Peak finding for ELMs, sawteeth, or pellet events (a minimal sketch of this pattern follows the list).
  - Peak finding can be achieved in the Archaieus Autolabeler.
- Derivative thresholds for abrupt MHD changes.
  - A simple threshold autolabeler is provided in Custom Autolabeling.
- Zero-crossing or turning-point detection in simulation outputs.
  - Zero crossings can be achieved in the Archaieus Autolabeler.
- Custom Labelers (see the Custom Autolabeling section for an example of ELM slope-based outlier auto-detection).
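For illustration, here is a minimal peak-finding sketch in the spirit of these rules. It is not the Archaieus Autolabeler itself; the function name, thresholds, and synthetic data are assumptions for the example.

```python
import numpy as np
from scipy.signal import find_peaks

def label_peaks(t, x, height_sigma=3.0, min_separation_s=0.002):
    """Flag candidate spike events (e.g., ELM-like bursts) in a 1-D signal.

    Assumed inputs: `t` (seconds) and `x` (signal values) as equal-length
    arrays; the thresholds here are illustrative, not dFL defaults.
    """
    dt = np.median(np.diff(t))
    # Peaks must rise `height_sigma` standard deviations above the mean
    # and be separated by at least `min_separation_s` seconds.
    height = x.mean() + height_sigma * x.std()
    distance = max(1, int(min_separation_s / dt))
    peaks, _ = find_peaks(x, height=height, distance=distance)
    # Return (time, value) pairs as proposed labels.
    return [(t[i], x[i]) for i in peaks]

# Example on synthetic data: a noisy baseline with three injected spikes.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 5000)
x = rng.normal(0, 0.1, t.size)
x[[1000, 2500, 4000]] += 3.0
print(label_peaks(t, x))
```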
Statistics-based methods¶
Probabilistic tools for detecting anomalies, drifts, or regime changes.
Examples:
- Control charts for calibration shifts
- Outlier detection
- Hypothesis testing or change-point detection
- Kernel density estimation for distributional shifts
- Custom autolabeling
Three statistics-based autolabeling methods are natively supported in dFL through the "Stats" graph:
1. Moving-Window Z-Score¶
For each sample \(x_t\), compute the rolling mean \(\mu_t\) and standard deviation \(\sigma_t\) over a window of size \(W\):

\[
Z_t = \frac{x_t - \mu_t}{\sigma_t}
\]

An anomaly is flagged when \(|Z_t| > \tau_\sigma\), where \(\tau_\sigma\) is the user-defined threshold.
- Pros: Simple, efficient, interpretable
- Cons: Sensitive to nonstationarity and outliers
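A minimal NumPy sketch of this detector (the window size and threshold are illustrative values, not dFL defaults):

```python
import numpy as np

def zscore_anomalies(x, window=100, tau=3.0):
    """Moving-window z-score: flag samples with |Z_t| > tau."""
    x = np.asarray(x, dtype=float)
    flagged = []
    for i in range(window, x.size):
        w = x[i - window:i]              # trailing window (excludes x[i])
        mu, sigma = w.mean(), w.std()
        if sigma > 0 and abs(x[i] - mu) / sigma > tau:
            flagged.append(i)
    return np.array(flagged)             # indices of anomalous samples
```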
2. Moving-Window CUSUM¶
Tracks small, persistent shifts in the mean using cumulative sums:

\[
C_t^{+} = \max\left(0,\, C_{t-1}^{+} + (x_t - \mu_0 - k)\right), \qquad
C_t^{-} = \max\left(0,\, C_{t-1}^{-} + (\mu_0 - x_t - k)\right)
\]

where \(\mu_0\) is the nominal mean (estimated per window), and \(k\) is a sensitivity parameter.
An anomaly is flagged when \(C_t^+ > h\) or \(C_t^- > h\), where \(h\) is the detection threshold.
- Pros: Detects gradual drifts and sustained changes
- Cons: Requires a stable baseline; less effective for transient events
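A corresponding sketch of the windowed two-sided CUSUM; here \(k\) and \(h\) are expressed in units of the windowed standard deviation, and all parameter values are illustrative:

```python
import numpy as np

def cusum_anomalies(x, window=200, k=0.5, h=5.0):
    """Moving-window two-sided CUSUM with per-window standardization."""
    x = np.asarray(x, dtype=float)
    flagged = []
    c_pos = c_neg = 0.0
    for i in range(window, x.size):
        w = x[i - window:i]
        mu0 = w.mean()
        sigma = w.std() or 1.0            # guard against zero variance
        z = (x[i] - mu0) / sigma          # standardize against the window
        c_pos = max(0.0, c_pos + z - k)   # accumulate upward drift
        c_neg = max(0.0, c_neg - z - k)   # accumulate downward drift
        if c_pos > h or c_neg > h:
            flagged.append(i)
            c_pos = c_neg = 0.0           # reset after a detection
    return np.array(flagged)
```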
3. Moving Average with Confidence Intervals¶
Computes a rolling mean \(\mu_t\) and confidence interval:

\[
\text{CI}_t = \left[\mu_t - z_{1-\alpha/2}\,\sigma_t,\; \mu_t + z_{1-\alpha/2}\,\sigma_t\right]
\]

where \(\sigma_t\) is the rolling standard deviation, \(\alpha\) is the significance level, and \(z_{1-\alpha/2}\) is the standard normal quantile.
An anomaly is flagged when \(x_t \notin \text{CI}_t\).
- Pros: Statistically interpretable, tunable false-positive rate
- Cons: Assumes approximate normality and weak autocorrelation
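And a sketch of the rolling confidence-interval detector (again with illustrative parameters):

```python
import numpy as np
from scipy.stats import norm

def ci_anomalies(x, window=100, alpha=0.01):
    """Flag samples falling outside mu_t +/- z_{1-alpha/2} * sigma_t."""
    x = np.asarray(x, dtype=float)
    z = norm.ppf(1 - alpha / 2)           # standard normal quantile
    flagged = []
    for i in range(window, x.size):
        w = x[i - window:i]
        mu, sigma = w.mean(), w.std()
        if abs(x[i] - mu) > z * sigma:    # x_t outside CI_t
            flagged.append(i)
    return np.array(flagged)
```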
Workflow example in dFL¶
- Select Stats as the graph type.
- Choose a detector under Statistical Parameters of First Signal from Selection.
- Set window size and sensitivity (Threshold or CI level).
- View anomalies directly on the signal (optional: highlight cutoff lines).
- Capture anomalies as proposed labels with Capture Anomalies into Proposed Labels.
- Confirm them under Auto Labeler → Confirm Autolabels for Current Records.
- Export results via the Current Labels panel.
Data-driven methods¶
Machine learning techniques for noisy, high-dimensional, or partially labeled datasets.
Examples:
- Pre-trained classifiers (e.g., CNNs for MHD modes)
- Transfer learning from related time-series tasks
- Clustering to reveal latent regimes (a sketch of this idea follows the list)
- Autoencoders or contrastive methods for feature embeddings
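As one concrete instance of the clustering idea, here is a minimal scikit-learn sketch; the windowed summary features are an illustrative choice for the example, not a dFL implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_regimes(x, window=256, n_regimes=3):
    """Cluster windowed summary features to reveal latent regimes.

    The features (mean, std, dominant-frequency bin and power) are an
    illustrative choice; real pipelines would use richer, task-specific
    embeddings (e.g., from an autoencoder).
    """
    x = np.asarray(x, dtype=float)
    feats = []
    for i in range(0, x.size - window, window):
        w = x[i:i + window]
        spec = np.abs(np.fft.rfft(w - w.mean()))
        feats.append([w.mean(), w.std(), spec.argmax(), spec.max()])
    labels = KMeans(n_clusters=n_regimes, n_init=10).fit_predict(feats)
    return labels  # one regime label per window
```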
Pre-trained classifier workflow example in dFL¶
- In the Auto Labeler tab, under the Select Autolabeler dropdown, select "Plasma mode autolabeling".
- Add as many graphs as needed to analyze the multimodal signal behavior over the dataset.
- Select the record you would like to autolabel.
- Select the "Target mode" under the Auto Labeler tab options.
- Hit Autolabel Record and make sure Display Labels is selected under the Main Control Panel tab.
- View labels under Current Proposed Labels.
- Confirm them under Auto Labeler → Confirm Autolabels for Current Records.
- Export results via the Current Labels panel.
Simulation-driven methods¶
Labels derived from high-fidelity models or synthetic diagnostics.
Examples:
- Extended-MHD or gyrokinetic simulations (e.g., from MGKDB) to mark instability onset
- Transport solvers for L-H transitions or ITB formation
- Synthetic diagnostics mapping simulations onto experiments
In this case, either (1) a framework for mapping the simulations onto the experimental data must be applied and integrated as an autolabeler, or (2) an autolabeler must be written to identify regions (or patterns) of the simulation data directly. Both can be developed in dFL using the Custom Autolabeling API.
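To give option (1) some shape, here is a hypothetical sketch that maps simulated instability-onset times onto an experimental time base. The function names, the critical-growth-rate criterion, and the (index, time) label format are assumptions for the example, not the dFL Custom Autolabeling API:

```python
import numpy as np

def map_sim_onsets_to_experiment(sim_t, sim_growth_rate, exp_t, gamma_crit=1e4):
    """Mark experimental samples nearest to simulated instability onsets.

    An onset is taken as the first upward crossing of a critical growth
    rate `gamma_crit`; all names and thresholds here are illustrative.
    """
    sim_t = np.asarray(sim_t)
    rate = np.asarray(sim_growth_rate)
    # Onsets: indices where the growth rate crosses gamma_crit upward.
    crossings = np.flatnonzero((rate[1:] >= gamma_crit) & (rate[:-1] < gamma_crit))
    onsets = sim_t[crossings + 1]
    # Snap each simulated onset to the nearest experimental sample.
    idx = np.searchsorted(exp_t, onsets).clip(0, len(exp_t) - 1)
    return [(int(i), float(exp_t[i])) for i in idx]
```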
Hybrid / expert-in-the-loop¶
Pipelines that blend multiple methods with human oversight.
Examples:
- A derivative-based ELM detector refined by a neural network (a sketch of this pattern follows the list).
- Expert-reviewed candidate events for training ML classifiers (this is how the classifier above was trained: first by manual labeling, then by applying a classifier trained on a labeled subset of shots).
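A minimal sketch of the first pattern, combining a derivative-threshold candidate stage with a lightweight classifier filter; the threshold, snippet features, and class convention are illustrative assumptions:

```python
import numpy as np

def hybrid_elm_detector(t, x, clf, dx_tau=5e4, half_width=10):
    """Two-stage hybrid: a derivative threshold proposes candidate
    events; a trained classifier keeps only the plausible ones.

    `clf` is assumed to be any fitted scikit-learn-style classifier
    over fixed-length waveform snippets, with class 1 = "real event".
    """
    dx = np.gradient(x, t)                       # dx/dt on the time base
    candidates = np.flatnonzero(np.abs(dx) > dx_tau)
    keep = []
    for i in candidates:
        lo, hi = i - half_width, i + half_width
        if lo < 0 or hi > x.size:
            continue                             # skip edge candidates
        snippet = x[lo:hi].reshape(1, -1)        # fixed-length feature row
        if clf.predict(snippet)[0] == 1:
            keep.append(i)
    return np.array(keep)
```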
Autolabeling: tip of the iceberg¶
These exemplars illustrate the flexibility of dFL: users can combine, extend, or replace methods to fit their specific data and workflows. Automated labeling is not a fixed recipe but a toolkit for extracting meaningful events from complex datasets.
Automated labeling in dFL is an enabling technology: it accelerates ML training, supports cross-facility data sharing, and provides reliable event markers for archival analysis, real-time control, and prediction of future events.