Overall Pipeline
End-to-end flow from raw waveforms to model predictions.
Waveform Ingestion & Preprocessing
Load raw waveforms from HDF5 files. Apply baseline subtraction, Gaussian or Savitzky-Golay smoothing, and pole-zero correction to normalize detector response. Convert to frequency space and compute gradients.
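A minimal sketch of this step, assuming a single waveform is already loaded from HDF5 as a NumPy array. The `preprocess` helper and its parameter values are illustrative, and pole-zero correction is omitted because it requires the detector's decay constant:

```python
import numpy as np
from scipy.signal import savgol_filter

def preprocess(waveform: np.ndarray, baseline_samples: int = 200) -> dict:
    """Baseline-subtract and smooth one waveform, then derive
    gradient and frequency-domain views."""
    # Estimate the baseline from the pre-trigger region and subtract it.
    corrected = waveform - waveform[:baseline_samples].mean()
    # Savitzky-Golay smoothing suppresses noise while preserving
    # peak shape better than a plain moving average.
    smoothed = savgol_filter(corrected, window_length=15, polyorder=3)
    return {
        "smoothed": smoothed,
        "gradient": np.gradient(smoothed),          # current-pulse proxy
        "spectrum": np.abs(np.fft.rfft(smoothed)),  # frequency-domain view
    }
```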
Feature Extraction
Compute over 20 physics-informed features spanning rise-time dynamics, local peak structure, tail charge behavior, current-pulse statistics, and frequency-domain characteristics.
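A few of the 20+ features can be sketched as follows. The feature names and the 10%–90% rise-time definition are illustrative choices, not necessarily the exact definitions used in the pipeline:

```python
import numpy as np

def extract_features(wf: np.ndarray, dt: float = 1.0) -> dict:
    """Illustrative pulse-shape features for one preprocessed waveform."""
    peak = wf.max()
    # Rise time: samples between the 10% and 90% amplitude crossings.
    t10 = np.argmax(wf >= 0.10 * peak)
    t90 = np.argmax(wf >= 0.90 * peak)
    current = np.gradient(wf, dt)      # current pulse, I = dQ/dt
    tail = wf[np.argmax(wf):]          # samples from the peak onward
    freqs = np.fft.rfftfreq(len(wf), dt)
    mag = np.abs(np.fft.rfft(wf))
    return {
        "rise_time": (t90 - t10) * dt,
        "max_current": float(current.max()),  # the "A" in A/E-style cuts
        "tail_charge_frac": float(tail.sum() / wf.sum()),
        "spectral_centroid": float((freqs * mag).sum() / mag.sum()),
    }
```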
Data Splitting & Stratification
Split data into 60% training, 20% validation, and 20% test sets using stratified sampling to preserve class proportions — critical given severe imbalances in targets like the delayed charge recovery label.
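The 60/20/20 stratified split can be built from two chained `train_test_split` calls; the seed handling is an assumption for the sketch:

```python
from sklearn.model_selection import train_test_split

def split_60_20_20(X, y, seed=0):
    # First carve off the 40% that becomes validation + test,
    # stratified so every split keeps the imbalanced label proportions.
    X_tr, X_rest, y_tr, y_rest = train_test_split(
        X, y, test_size=0.4, stratify=y, random_state=seed)
    # Then split the remainder in half: 20% validation, 20% test.
    X_val, X_te, y_val, y_te = train_test_split(
        X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed)
    return (X_tr, y_tr), (X_val, y_val), (X_te, y_te)
```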
Model Training
Train classification models to predict four PSD labels differentiating SSE from MSE. Train regression models to reconstruct continuous event energy in keV.
Validation & Tuning
Optimize hyperparameters via 5-fold cross-validation grouped by waveform file to prevent data leakage. Tune decision thresholds and assess final performance on the held-out test set.
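Threshold tuning on the validation split might look like the following sketch, which scans a probability-cutoff grid and keeps the F1-maximizing value; the grid and the F1 criterion are assumptions, and any validation metric could be substituted:

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_threshold(y_val, proba_val, grid=np.linspace(0.05, 0.95, 19)):
    """Pick the probability cutoff maximizing validation F1; the chosen
    threshold is frozen before the test set is ever scored."""
    scores = [f1_score(y_val, (proba_val >= t).astype(int)) for t in grid]
    return float(grid[int(np.argmax(scores))])
```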
Classification Models
Distinguishing signal-like single-site events (SSE) from background multi-site events (MSE) across four PSD targets.
Linear baseline model used to assess whether the engineered feature space was approximately linearly separable.
Ensemble of decision trees that captures nonlinear dependencies. Balanced Random Forest variants undersample the majority class in each tree's bootstrap to handle class imbalance.
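The per-tree undersampling idea can be illustrated with a toy binary version built from plain scikit-learn parts — a simplified stand-in for imbalanced-learn's `BalancedRandomForestClassifier`, not the production implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class BalancedForest:
    """Toy balanced random forest: each tree is fit on a bootstrap that
    undersamples the majority class down to the minority-class size."""

    def __init__(self, n_trees=25, seed=0):
        self.n_trees, self.rng, self.trees = n_trees, np.random.default_rng(seed), []

    def fit(self, X, y):
        classes, counts = np.unique(y, return_counts=True)
        n_min = counts.min()
        for _ in range(self.n_trees):
            # Draw n_min samples (with replacement) from every class.
            idx = np.concatenate([
                self.rng.choice(np.flatnonzero(y == c), n_min) for c in classes])
            tree = DecisionTreeClassifier(
                max_features="sqrt",
                random_state=int(self.rng.integers(1 << 31)))
            self.trees.append(tree.fit(X[idx], y[idx]))
        return self

    def predict(self, X):
        # Majority vote across trees (binary 0/1 labels assumed).
        votes = np.stack([t.predict(X) for t in self.trees])
        return (votes.mean(axis=0) >= 0.5).astype(int)
```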
Gradient-boosted tree model that iteratively corrects errors, achieving the highest precision for the high_avse target.
A multi-layer perceptron with two hidden layers (64 and 32 nodes), which provided the best performance for identifying delayed charge recovery.
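A sketch of the (64, 32) architecture in scikit-learn; the scaler, solver defaults, and iteration budget shown here are illustrative assumptions rather than the tuned configuration:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# MLPs are sensitive to feature scale, so standardize features first.
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64, 32), activation="relu",
                  max_iter=500, random_state=0),
)
```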
Regression Models
Estimating continuous energy values directly from engineered waveform features.
Baseline model that captures broad spectral shapes but struggles with nonlinear charge collection dynamics at higher energies.
Gradient-boosted decision trees with regularization, capable of applying localized, nonlinear corrections to preserve sharp peak structures.
Histogram-based gradient boosting framework that improves computational efficiency while accurately capturing complex feature interactions.
Model Rigor & Validation
Steps taken to ensure fair, leak-free evaluation across all models.
The dataset was divided into a 60% training set, 20% validation set, and 20% test set to ensure models were evaluated on fully unseen data.
Stratified sampling was used during splitting to keep class proportions consistent across all sets, which is critical given severe imbalances in targets like the delayed charge recovery label.
Waveforms collected in the same acquisition batch can share subtle correlations. To prevent leakage, we implemented grouped validation by waveform file, keeping entire batches together during cross-validation.
Hyperparameter tuning was performed using 5-fold cross-validation with grouped folds, preventing the model from inflating performance by memorizing batch-specific noise patterns.
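The grouped search described above can be sketched with `GroupKFold`, passing the per-waveform file label as `groups`; the estimator, grid, and scoring choice are placeholders:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

def grouped_search(X, y, file_ids, param_grid):
    """5-fold CV in which every waveform from one acquisition file
    stays in the same fold, so batch noise cannot leak across folds."""
    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          param_grid, cv=GroupKFold(n_splits=5),
                          scoring="f1")
    # `groups` routes the file labels to GroupKFold's splitter.
    return search.fit(X, y, groups=file_ids)
```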
Evaluation Strategy
Metrics used to assess model performance on both classification and regression tasks.
Matthews Correlation Coefficient and F1 Score were prioritized for classification to account for severe class imbalances across PSD labels.
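The motivation is easy to demonstrate: on a 90/10-imbalanced toy label set, a degenerate classifier that always predicts the majority class scores 90% accuracy yet gets zero on both preferred metrics:

```python
from sklearn.metrics import f1_score, matthews_corrcoef

y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100                 # always predict the majority class

print(matthews_corrcoef(y_true, y_pred))           # 0.0
print(f1_score(y_true, y_pred, zero_division=0))   # 0.0
```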
Area-under-curve metrics summarized class separation across all decision thresholds. PR-AUC was especially useful because it is more sensitive to minority-class performance.
Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) quantified model performance, with RMSE directly measuring energy resolution in keV.
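Because RMSE is expressed in the target's own units, it reads directly as an energy-resolution figure. A minimal helper:

```python
import numpy as np

def rmse_kev(y_true_kev, y_pred_kev):
    """Root mean squared error in keV."""
    err = np.asarray(y_pred_kev, dtype=float) - np.asarray(y_true_kev, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))
```

For example, `rmse_kev([1000, 2000], [1003, 1996])` is sqrt((3² + 4²)/2) ≈ 3.54 keV.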
Final model performance was always reported on the held-out test set — never the validation fold — to give an unbiased estimate of generalization.