Steps for running on Northeastern Univeristy Discovery cluster

How to train the model that detects classes of dice/coins in a given image, using YOLOv5:

The model can be trained either on your own/custom machine or the NEU discovery cluster. The latter one requires a Northeastern University account. (Docs: https://rc-docs.northeastern.edu/en/latest/welcome/welcome.html). Would highly recommend using a GPU for this task, as training takes quite a bit of resources and time to complete. It took ~4 hours on a p100 GPU (https://rc-docs.northeastern.edu/en/latest/using-discovery/workingwithgpu.html).

If you're using your custom machine (which is not the NEU Discovery cluster), you can stop reading here and follow the steps on train_on_own_machine.md.

Steps for running on Northeastern Univeristy Discovery cluster

Connect to the Discovery cluster: ssh @login.discovery.neu.edu

Run Step I and Step II from the train_on_own_machine.md file, to get the dataset ready.

Example image (The model would detect the labels and bouding boxes for the two dice/coins in the images):

Step III. [Optional, only to test if you can really get connected to a GPU] Create the PyTorch environment by running below commands.

srun --partition=gpu --nodes=1 --pty --export=All --gres=gpu:1 --mem=4G --time=01:00:00 /bin/bash 
module load cuda/11.3
module load anaconda3/2021.05
conda create --name pytorch_env_new python=3.7
conda activate pytorch_env_new
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
#(You can also install specific version if you like, conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 -c pytorch, instructions from https://pytorch.org/get-started/previous-versions/)
python -c'import torch; print(torch.cuda.is_available())'
#(This print statement should return "True". If it does not, it implies you were not able to connect to a GPU. You can still proceed, but the commands would be slow as they would instead run on a CPU.)

Step IV. Create a batch file named runModel.batch, or any other name you like, with below content.

#!/bin/bash
#SBATCH --time=8:00:00
#SBATCH --job-name=yolov5_65
#SBATCH --mem=3G
#SBATCH --export=All
#SBATCH --gres=gpu:p100:1
#SBATCH --partition=gpu
#SBATCH --output=output_65epochs/exec.%j.out
module load cuda/11.3
module load anaconda3/2021.05
source /shared/centos7/anaconda3/2021.05/etc/profile.d/conda.sh
conda activate pytorch_env_new
pip install -qr requirements.txt
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
echo $CONDA_DEFAULT_ENV
echo $(python -c'import torch; print(torch.cuda.is_available());')
echo $(python -c 'import torch; print(torch.cuda.current_device())')
echo $(python -c'import torch; print(torch.cuda.get_device_properties(torch.cuda.current_device()));')
echo $(python -c'import torch; print(min((int(arch.split("_")[1]) for arch in torch.cuda.get_arch_list()), default=35))')
time python train.py --epochs 65 --data ./dicedataset/data.yaml --cfg ./models/custom_yolov5s.yaml --weights '' --name yolov5s_results_initial --batch-size=12

Step V. Run the batch file to start training the model. Note the job ID after running the command below.

mkdir output_65epochs # to store logs
sbatch runModel.batch

Step VI. Wait until the batch runs completely. You can run the below commands to check the job status,

seff <jobID>
squeue -u <username>

or, you can also check the output from this batch file. For instance, if you want to check the output of your print statements from the batch file.

less</cat/tail> output_65epochs/exec.<jobID>.out

Step V. Test.

Test can be run either on CPU or GPU. I ran on the GPU by following the same steps as in Step IV above, but this time with the command below. It takes about 2-3 minutes to finish, so you can request a different GPU for less amount of time.

time python detect.py --weights runs/train/yolov5s_results_initial/weights/best.pt --conf-thres 0.6 --source ../test/images --save-txt --save-conf

Open the log file and check the directory path where results are saved. You are looking for a line similar to this one, "2553 labels saved to runs/detect/exp7/labels". Edit the script eval.py with the correct directory path and run it. This script prints out the number of images that were detected correctly in entirety.

A huge thanks to Dr Ravi Sundaram and Khoury College at Northeastern University!

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
README.md		README.md
download_dice_dataset.py		download_dice_dataset.py
eval.py		eval.py
multiply_dataset.py		multiply_dataset.py
train_on_own_machine.md		train_on_own_machine.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

If you're using your custom machine (which is not the NEU Discovery cluster), you can stop reading here and follow the steps on train_on_own_machine.md.

Steps for running on Northeastern Univeristy Discovery cluster

About

Releases

Packages

Languages

guptaaka/dice-detection-using-YOLOv5

Folders and files

Latest commit

History

Repository files navigation

If you're using your custom machine (which is not the NEU Discovery cluster), you can stop reading here and follow the steps on train_on_own_machine.md.

Steps for running on Northeastern Univeristy Discovery cluster

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages