This directory contains the files from the ND280 deep learning project, which I did as a minor project in the BIST Master of Multidisciplinary Research degree. If you've decided to work on this voluntarily, welcome. If you were forced to do it, good luck.
The aim of the project was to create a robust electron/muon binary classification method using simulated ND280 HA-TPC data from GEANT4.
Below, I describe the models in the project, the structure of the directory, and some key terms used throughout. The explanations assume basic familiarity with deep learning and PyTorch. In addition to this README, many files contain a brief explanation of their purpose at the very beginning.
The following models are included in the project:
- multilayer perceptron (MLP), built from scratch
- convolutional neural network (CNN), based on efficientnet_b4
- vision transformer (ViT) from the Hugging Face Transformers package
Each model may have variants with 1 or 3 input channels.
Additionally, the following ideas have been tested to some extent:
- CNN with data augmentation
- CNN for positive particles (proton/pion)
The directory contains the following subdirectories:

- data – training data was supposed to be there, but it's not; you can find the training data in /data/neutrinos/common/casado/T2K/ND280Cont/
- models – models were supposed to be here, but are bundled with the training code instead
- notebooks – numerous Jupyter notebooks for data exploration and model testing; refer to individual notebooks for details on their purpose; each notebook starts with a brief explanation
- scripts – all scripts submitted to the cluster; the subdirectory structure is roughly:
  - tune.py / train.py – the tuning/training script executed
  - utils.py – shared utility code
  - *.sh – a shell script that prepares the environment and runs the Python script
  - *.sub – a Condor submit file for the script
  - logs/ – Condor log, output and error files
  - out/ – output files created by the script
The following relevant terms are used in the files, in no particular order:
- Color – this refers to an input variable in the data (`qmax`, `tmax` or `fwhm`), and to the associated input channels in input tensors; the tensors normally have a shape of `(b, c, h, w)`, where `c` is the color; "color" and "channel" may be used interchangeably
- Offset – the data in input files is stored in sparse format, i.e., each row refers to a single non-zero pixel; the file is grouped by event ID, and dataset objects store a table of offsets, i.e., the row indices of the first and last pixel of each event
- Tuning – the exploration of hyperparameter space through low-budget training of numerous models with different setups; in this project, it is performed with Optuna
- Training – in this project, it usually refers to full, high-budget optimization of a model with a good setup identified through tuning
- Eager loading – a type of data loading in which all the data is loaded into RAM at once
- Lazy loading – a type of data loading in which small portions of data are read from the hard drive on demand to conserve memory; also called data streaming
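The offset-table idea above can be sketched as follows. The `rows` array and the `build_offsets` helper are hypothetical illustrations of the concept, not the project's actual code; the real files contain more columns (`qmax`, `tmax`, `fwhm`, pixel coordinates, etc.):

```python
import numpy as np

# Toy stand-in for a sparse pixel file: columns are (event_id, h, w, qmax),
# one row per non-zero pixel, grouped by event ID.
rows = np.array([
    [0, 1, 2, 5.0],
    [0, 3, 4, 2.0],
    [1, 0, 0, 7.0],
    [2, 2, 2, 1.0],
    [2, 5, 1, 3.0],
])

def build_offsets(event_ids):
    """Return {event_id: (first_row, last_row)} for a grouped id column."""
    # A new group starts wherever the id changes (prepend forces index 0).
    starts = np.flatnonzero(np.diff(event_ids, prepend=-1))
    ends = np.append(starts[1:], len(event_ids)) - 1
    return {int(event_ids[s]): (int(s), int(e)) for s, e in zip(starts, ends)}

offsets = build_offsets(rows[:, 0].astype(int))
first, last = offsets[2]
event_pixels = rows[first:last + 1]  # all pixels of event 2
```

With the table in hand, fetching one event is a cheap slice rather than a scan over the whole file.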
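Tuning with Optuna typically follows the pattern sketched below. The objective here is a synthetic stand-in (a quadratic in the learning-rate exponent) rather than the project's actual low-budget training loop, and the parameter names are assumptions:

```python
import math

import optuna

def objective(trial):
    # In the real project this would run a short, low-budget training
    # and return a validation metric; here we fake a score instead.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    hidden = trial.suggest_int("hidden", 32, 512, log=True)
    # Synthetic score: pretend the best learning rate is around 1e-3.
    return (math.log10(lr) + 3) ** 2 + 0.001 * hidden

optuna.logging.set_verbosity(optuna.logging.WARNING)
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)

best = study.best_params  # e.g. {"lr": ..., "hidden": ...}
```

The setup found this way (here `study.best_params`) is then used for the full, high-budget training run.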
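The difference between the two loading modes can be sketched with a plain-Python toy. The CSV example is purely illustrative; the project's actual loaders operate on the sparse pixel files:

```python
import csv
import io

def eager_load(fileobj):
    # Eager: parse every row into a list held in RAM at once.
    return list(csv.reader(fileobj))

def lazy_load(fileobj):
    # Lazy: yield rows one at a time, so memory use stays roughly
    # constant regardless of file size (the idea behind data streaming).
    for row in csv.reader(fileobj):
        yield row

data = "1,2,3\n4,5,6\n"
all_rows = eager_load(io.StringIO(data))  # [['1', '2', '3'], ['4', '5', '6']]
stream = lazy_load(io.StringIO(data))
first = next(stream)                      # ['1', '2', '3']
```

Eager loading gives the fastest epochs when the data fits in RAM; lazy loading trades speed for a much smaller memory footprint.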