Testing Imbalanced cateGory classifiERS
git clone https://github.com/alan-turing-institute/ARC-TIGERS
cd ARC-TIGERS
python -m pip install .
### scripts/dataset_download.py
This script downloads a Reddit dataset and saves it in an appropriate place in the parent directory.
It takes as arguments:
- `dataset_name`: the name of the dataset to load from Hugging Face, for example `bit0/reddit_dataset_12`.
- `target_subreddits`: a list of the subreddits being used for the experiment. This should be in `.json` format.
- `max_rows`: the maximum number of rows to use in the resultant dataset. This should be an integer.

It saves a `.json` file called `filtered_rows` containing the data, in a subdirectory named using the dataset name and the maximum number of rows.
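
The filtering and saving steps described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the `filter_rows` and `save_filtered_rows` helpers, the `"subreddit"` record key, and the `<dataset>_<max_rows>` directory naming are all assumptions made for the example, and toy records stand in for the downloaded data.

```python
import json
import tempfile
from pathlib import Path


def filter_rows(records, target_subreddits, max_rows):
    """Keep records from the target subreddits, stopping at max_rows."""
    # Assumption: each record stores its subreddit under the key "subreddit".
    targets = set(target_subreddits)
    kept = []
    for record in records:
        if len(kept) >= max_rows:
            break
        if record["subreddit"] in targets:
            kept.append(record)
    return kept


def save_filtered_rows(rows, parent_dir, dataset_name, max_rows):
    """Write rows to <parent_dir>/<dataset>_<max_rows>/filtered_rows.json."""
    # Hypothetical naming scheme; the real script may build the name differently.
    out_dir = Path(parent_dir) / f"{dataset_name.split('/')[-1]}_{max_rows}"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "filtered_rows.json"
    out_path.write_text(json.dumps(rows, indent=2))
    return out_path


# Toy records standing in for the downloaded dataset.
records = [
    {"subreddit": "r/aww", "text": "a"},
    {"subreddit": "r/news", "text": "b"},
    {"subreddit": "r/aww", "text": "c"},
    {"subreddit": "r/aww", "text": "d"},
]
rows = filter_rows(records, ["r/aww"], max_rows=2)
out_path = save_filtered_rows(rows, tempfile.mkdtemp(), "bit0/reddit_dataset_12", 2)
```

Capping on the number of kept rows, rather than the number of rows read, matches the description of `max_rows` as a limit on the resultant dataset, though the script itself may apply the limit differently.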
### scripts/dataset_generation.py
This script generates train and test splits from the downloaded Reddit dataset(s).
It currently takes as arguments:
- `data_dir`: the path to the dataset being used to form the splits.
- `split`: the specific subreddit split to generate, defined in the dictionary `DATASET_COMBINATIONS` in `/data/utils`.
- `r`: the ratio of target subreddits to non-target subreddits.

The script saves two CSV files, `train.csv` and `test.csv`, within a subdirectory `splits`. These contain the train and evaluation splits and are of roughly equal size.
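
To illustrate how a ratio `r` and two roughly equal splits might fit together, here is a minimal sketch using only the standard library. It is not the script's implementation: `make_splits` and `write_csv` are hypothetical helpers, the `"subreddit"` and `"text"` column names are assumed, and the data is synthetic. The sketch keeps every non-target row, samples target rows until the target to non-target ratio is `r`, shuffles, and cuts the result in half.

```python
import csv
import random
import tempfile
from pathlib import Path


def make_splits(rows, targets, r, seed=0):
    """Subsample to a target:non-target ratio of r, then split in half."""
    rng = random.Random(seed)
    target_rows = [row for row in rows if row["subreddit"] in targets]
    other_rows = [row for row in rows if row["subreddit"] not in targets]
    # Keep every non-target row and at most r * len(other_rows) target rows.
    n_target = min(len(target_rows), int(r * len(other_rows)))
    selected = rng.sample(target_rows, n_target) + other_rows
    rng.shuffle(selected)
    half = len(selected) // 2
    return selected[:half], selected[half:]


def write_csv(rows, path):
    """Write rows to a CSV file with the assumed column names."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["subreddit", "text"])
        writer.writeheader()
        writer.writerows(rows)


# Synthetic data: 20 target rows and 40 non-target rows.
rows = [{"subreddit": "r/aww", "text": f"t{i}"} for i in range(20)]
rows += [{"subreddit": "r/news", "text": f"n{i}"} for i in range(40)]

train, test = make_splits(rows, targets={"r/aww"}, r=0.25)

splits_dir = Path(tempfile.mkdtemp()) / "splits"
splits_dir.mkdir()
write_csv(train, splits_dir / "train.csv")
write_csv(test, splits_dir / "test.csv")
```

Note that a plain shuffle does not guarantee the same ratio inside each split; a stratified split would be needed for that, and the sketch makes no claim about which approach the script takes.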
See CONTRIBUTING.md for instructions on how to contribute.
Distributed under the terms of the MIT license.