Overall Pipeline
End-to-end flow from raw waveforms to model predictions.
Waveform Ingestion & Preprocessing
Load raw waveforms from HDF5 files. Apply baseline subtraction, Gaussian or Savitzky-Golay smoothing, and pole-zero correction to normalize detector response. Convert to frequency space and compute gradients.
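A minimal sketch of this step, assuming a single waveform is already loaded from HDF5 as a NumPy array. The `preprocess` helper and its parameter values are illustrative, and pole-zero correction is omitted because it requires the detector's decay constant:

```python
import numpy as np
from scipy.signal import savgol_filter

def preprocess(waveform: np.ndarray, baseline_samples: int = 200) -> dict:
    """Baseline-subtract and smooth one waveform, then derive
    gradient and frequency-domain views."""
    # Estimate the baseline from the pre-trigger region and subtract it.
    corrected = waveform - waveform[:baseline_samples].mean()
    # Savitzky-Golay smoothing suppresses noise while preserving
    # peak shape better than a plain moving average.
    smoothed = savgol_filter(corrected, window_length=15, polyorder=3)
    return {
        "smoothed": smoothed,
        "gradient": np.gradient(smoothed),          # current-pulse proxy
        "spectrum": np.abs(np.fft.rfft(smoothed)),  # frequency-domain view
    }
```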
Feature Extraction
Compute over 20 physics-informed features spanning rise-time dynamics, local peak structure, tail charge behavior, current-pulse statistics, and frequency-domain characteristics.
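A few of the 20+ features can be sketched as follows. The feature names and the 10%–90% rise-time definition are illustrative choices, not necessarily the exact definitions used in the pipeline:

```python
import numpy as np

def extract_features(wf: np.ndarray, dt: float = 1.0) -> dict:
    """Illustrative pulse-shape features for one preprocessed waveform."""
    peak = wf.max()
    # Rise time: samples between the 10% and 90% amplitude crossings.
    t10 = np.argmax(wf >= 0.10 * peak)
    t90 = np.argmax(wf >= 0.90 * peak)
    current = np.gradient(wf, dt)      # current pulse, I = dQ/dt
    tail = wf[np.argmax(wf):]          # samples from the peak onward
    freqs = np.fft.rfftfreq(len(wf), dt)
    mag = np.abs(np.fft.rfft(wf))
    return {
        "rise_time": (t90 - t10) * dt,
        "max_current": float(current.max()),  # the "A" in A/E-style cuts
        "tail_charge_frac": float(tail.sum() / wf.sum()),
        "spectral_centroid": float((freqs * mag).sum() / mag.sum()),
    }
```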
Data Splitting & Stratification
Split data into 60% training, 20% validation, and 20% test sets using stratified sampling to preserve class proportions — critical given severe imbalances in targets like the delayed charge recovery label.
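The 60/20/20 stratified split can be built from two chained `train_test_split` calls; the seed handling is an assumption for the sketch:

```python
from sklearn.model_selection import train_test_split

def split_60_20_20(X, y, seed=0):
    # First carve off the 40% that becomes validation + test,
    # stratified so every split keeps the imbalanced label proportions.
    X_tr, X_rest, y_tr, y_rest = train_test_split(
        X, y, test_size=0.4, stratify=y, random_state=seed)
    # Then split the remainder in half: 20% validation, 20% test.
    X_val, X_te, y_val, y_te = train_test_split(
        X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed)
    return (X_tr, y_tr), (X_val, y_val), (X_te, y_te)
```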
Model Training
Train classification models to predict four PSD labels differentiating SSE from MSE. Train regression models to reconstruct continuous event energy in keV.
Validation & Tuning
Optimize hyperparameters via 5-fold cross-validation grouped by waveform file to prevent data leakage. Tune decision thresholds and assess final performance on the held-out test set.
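Threshold tuning on the validation split might look like the following sketch, which scans a probability-cutoff grid and keeps the F1-maximizing value; the grid and the F1 criterion are assumptions, and any validation metric could be substituted:

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_threshold(y_val, proba_val, grid=np.linspace(0.05, 0.95, 19)):
    """Pick the probability cutoff maximizing validation F1; the chosen
    threshold is frozen before the test set is ever scored."""
    scores = [f1_score(y_val, (proba_val >= t).astype(int)) for t in grid]
    return float(grid[int(np.argmax(scores))])
```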
Classification Models
Distinguishing signal-like single-site events (SSE) from background multi-site events (MSE) across four PSD targets.
Linear baseline model used to assess whether the engineered feature space was approximately linearly separable.
Ensemble of decision trees that captures nonlinear dependencies. Balanced Random Forest variants undersample the majority class in each tree's bootstrap to handle class imbalance.
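The per-tree undersampling idea can be illustrated with a toy binary version built from plain scikit-learn parts — a simplified stand-in for imbalanced-learn's `BalancedRandomForestClassifier`, not the production implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class BalancedForest:
    """Toy balanced random forest: each tree is fit on a bootstrap that
    undersamples the majority class down to the minority-class size."""

    def __init__(self, n_trees=25, seed=0):
        self.n_trees, self.rng, self.trees = n_trees, np.random.default_rng(seed), []

    def fit(self, X, y):
        classes, counts = np.unique(y, return_counts=True)
        n_min = counts.min()
        for _ in range(self.n_trees):
            # Draw n_min samples (with replacement) from every class.
            idx = np.concatenate([
                self.rng.choice(np.flatnonzero(y == c), n_min) for c in classes])
            tree = DecisionTreeClassifier(
                max_features="sqrt",
                random_state=int(self.rng.integers(1 << 31)))
            self.trees.append(tree.fit(X[idx], y[idx]))
        return self

    def predict(self, X):
        # Majority vote across trees (binary 0/1 labels assumed).
        votes = np.stack([t.predict(X) for t in self.trees])
        return (votes.mean(axis=0) >= 0.5).astype(int)
```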
Gradient-boosted tree model that iteratively corrects errors, achieving the highest precision for the high_avse target.
A multi-layer perceptron with two hidden layers (64 and 32 nodes), which provided the best performance for identifying delayed charge recovery.
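A sketch of the (64, 32) architecture in scikit-learn; the scaler, solver defaults, and iteration budget shown here are illustrative assumptions rather than the tuned configuration:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# MLPs are sensitive to feature scale, so standardize features first.
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64, 32), activation="relu",
                  max_iter=500, random_state=0),
)
```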
Regression Models
Estimating continuous energy values directly from engineered waveform features.
Baseline model that captures broad spectral shapes but struggles with nonlinear charge collection dynamics at higher energies.
Gradient-boosted decision trees with regularization, capable of applying localized, nonlinear corrections to preserve sharp peak structures.
Histogram-based gradient boosting framework that improves computational efficiency while accurately capturing complex feature interactions.
Model Rigor & Validation
Steps taken to ensure fair, leak-free evaluation across all models.
The dataset was divided into a 60% training set, 20% validation set, and 20% test set to ensure models were evaluated on fully unseen data.
Stratified sampling was used during splitting to keep class proportions consistent across all sets, which is critical given severe imbalances in targets like the delayed charge recovery label.
Waveforms collected in the same acquisition batch can share subtle correlations. To prevent leakage, we implemented grouped validation by waveform file, keeping entire batches together during cross-validation.
Hyperparameter tuning was performed using 5-fold cross-validation with grouped folds, preventing the model from inflating performance by memorizing batch-specific noise patterns.
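The grouped search described above can be sketched with `GroupKFold`, passing the per-waveform file label as `groups`; the estimator, grid, and scoring choice are placeholders:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

def grouped_search(X, y, file_ids, param_grid):
    """5-fold CV in which every waveform from one acquisition file
    stays in the same fold, so batch noise cannot leak across folds."""
    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          param_grid, cv=GroupKFold(n_splits=5),
                          scoring="f1")
    # `groups` routes the file labels to GroupKFold's splitter.
    return search.fit(X, y, groups=file_ids)
```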
Evaluation Strategy
Metrics used to assess model performance on both classification and regression tasks.
Matthews Correlation Coefficient and F1 Score were prioritized for classification to account for severe class imbalances across PSD labels.
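The motivation is easy to demonstrate: on a 90/10-imbalanced toy label set, a degenerate classifier that always predicts the majority class scores 90% accuracy yet gets zero on both preferred metrics:

```python
from sklearn.metrics import f1_score, matthews_corrcoef

y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100                 # always predict the majority class

print(matthews_corrcoef(y_true, y_pred))           # 0.0
print(f1_score(y_true, y_pred, zero_division=0))   # 0.0
```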
Area-under-curve metrics summarized class separation across all decision thresholds. PR-AUC was especially useful because it is more sensitive to minority-class performance.
Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) quantified model performance, with RMSE directly measuring energy resolution in keV.
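Because RMSE is expressed in the target's own units, it reads directly as an energy-resolution figure. A minimal helper:

```python
import numpy as np

def rmse_kev(y_true_kev, y_pred_kev):
    """Root mean squared error in keV."""
    err = np.asarray(y_pred_kev, dtype=float) - np.asarray(y_true_kev, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))
```

For example, `rmse_kev([1000, 2000], [1003, 1996])` is sqrt((3² + 4²)/2) ≈ 3.54 keV.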
Final model performance was always reported on the held-out test set — never the validation fold — to give an unbiased estimate of generalization.