public datasets for evaluation #45

meigaoms opened this issue Feb 19, 2021 · 4 comments

@meigaoms

Hi there,
I'm trying to set up the public evaluation datasets listed in Table 9, but I got different train/test sizes for some of them:

  1. Facial Emotion Recognition 2013
    The dataset I found on Kaggle has a training set of 28,709 images, a validation (public test) set of 3,589 (train + val = 32,298 in total), and a test (private test) set of 3,589. (A sketch of how the Kaggle CSV is commonly split is at the end of this comment.)
  2. STL-10
    TensorFlow's stl10 has a training set of 5,000 images and a test set of 8,000.
  3. EuroSAT
    TensorFlow's eurosat only has a training set of 27,000 images.
  4. RESISC45
    The site TensorFlow refers to only has a training set, which is 31,500 images.
  5. GTSRB
    The archive I found has two training sets (GTSRB_Final_Training_Images.zip and GTSRB-Training_fixed.zip), but both differ in size from Table 9.

This is what Table 9 shows:

| Dataset | Classes | Train size | Test size | Evaluation metric |
|---|---|---|---|---|
| Facial Emotion Recognition 2013 | 8 | 32,140 | 3,574 | accuracy |
| STL-10 | 10 | 1,000 | 8,000 | accuracy |
| EuroSAT | 10 | 10,000 | 5,000 | accuracy |
| RESISC45 | 45 | 3,150 | 25,200 | accuracy |
| GTSRB | 43 | 26,640 | 12,630 | accuracy |

It would be greatly appreciated if you could point me to the source of data split shown in Table 9.
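
For reference, here is a minimal sketch of how the Kaggle FER-2013 release splits into the 28,709 / 3,589 / 3,589 subsets mentioned above. This is only my own illustration, assuming the Kaggle fer2013.csv file with its emotion, pixels, and Usage columns:

```python
# Illustration only: split the Kaggle fer2013.csv by its "Usage" column.
import pandas as pd

df = pd.read_csv("fer2013.csv")

train = df[df["Usage"] == "Training"]      # expected 28,709 rows
val = df[df["Usage"] == "PublicTest"]      # expected  3,589 rows
test = df[df["Usage"] == "PrivateTest"]    # expected  3,589 rows

print(len(train), len(val), len(test))
```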

@pj-ms

pj-ms commented Sep 13, 2021

Same question here. It would be great, and much appreciated, if you could share which sources you got these public evaluation datasets from. Thanks.

For example, here is the table of stats for the FER-2013 dataset in another paper, which is consistent with the Kaggle page but different from the stats reported in the CLIP paper:

[screenshot: FER-2013 dataset statistics table from the referenced paper]

In the paper about FER-2013 cited by the CLIP paper, it says: "The resulting dataset contains 35887 images, with 4953 “Anger” images, 547 “Disgust” images, 5121 “Fear” images, 8989 “Happiness” images, 6077 “Sadness” images, 4002 “Surprise” images, and 6198 “Neutral” images." This is consistent with the numbers on the Kaggle page but different from the numbers reported in the CLIP paper.

@jongwook
Collaborator

jongwook commented Sep 24, 2021

Hi, thanks for pointing out some details that we handled cursorily or missed; upon investigating, we found that:

  1. Facial Emotion Recognition 2013: We noticed an error in the table-generation script, which reported smaller numbers than it was supposed to; you can use the official numbers. We had a similar issue with the UCF-101 dataset.
  2. STL-10: We reported the average over the 10 pre-defined folds as provided by the official source.
  3. EuroSAT: We realize that the paper is missing critical reproducibility info on EuroSAT; given the lack of official splits, and to make a class-balanced dataset, we randomly sampled 500 train/validation/test images for each class. Below is the code for deterministically sampling those images.
# DATA_ROOT is assumed to point at the directory containing the extracted datasets.
import glob
import os
import random

root = f"{DATA_ROOT}/eurosat/2750"
seed = 42
random.seed(seed)
train_paths, valid_paths, test_paths = [], [], []
# Each class folder contributes 1,500 sampled images: 500 train, 500 valid, 500 test.
for folder in [os.path.basename(folder) for folder in sorted(glob.glob(os.path.join(root, "*")))]:
    keep_paths = random.sample(glob.glob(os.path.join(root, folder, "*")), 1500)
    keep_paths = [os.path.relpath(path, root) for path in keep_paths]
    train_paths.extend(keep_paths[:500])
    valid_paths.extend(keep_paths[500:1000])
    test_paths.extend(keep_paths[1000:])
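
A hypothetical follow-up (not from the paper): since EuroSAT stores each class in its own folder under 2750/, the label of every sampled image can be read directly from its relative path.

```python
# Hypothetical helper: derive class labels from the relative paths sampled above.
import os

def labels_from_paths(relative_paths):
    # e.g. "Forest/Forest_123.jpg" -> "Forest"
    return [rel.split(os.sep)[0] for rel in relative_paths]

# train_labels = labels_from_paths(train_paths)
```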

We could have used a better setup, such as mean-per-class accuracy computed over all available data, and we would encourage future studies to do so. We note, however, that the comparisons in the paper used this same subset across all models, so their relative scores can still be considered “fair”. (A minimal illustration of the mean-per-class metric appears at the end of this comment.)

  4. RESISC-45: Similar to EuroSAT, we used our own custom split, given the lack of an official one:
# DATA_ROOT is assumed to point at the extracted RESISC45 images, and `split`
# is one of 'train', 'valid', or 'test'.
import glob
import os
import random

root = f"{DATA_ROOT}/resisc45"
seed = 42
paths = sorted(glob.glob(os.path.join(root, "*.jpg")))
random.seed(seed)
random.shuffle(paths)
# 10% train, 10% valid, 80% test; with 31,500 images in total this gives
# 3,150 train, 3,150 valid, and 25,200 test images (cf. Table 9).
if split == 'train':
    paths = paths[:len(paths) // 10]
elif split == 'valid':
    paths = paths[len(paths) // 10:(len(paths) // 10) * 2]
elif split == 'test':
    paths = paths[(len(paths) // 10) * 2:]
else:
    raise NotImplementedError

  5. GTSRB: As @pj-ms found in GTSRB dataset issue #156, it turns out that we used an inconsistent train/test split.
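
For reference, here is a minimal illustration of the mean-per-class accuracy mentioned above, i.e. the average of the per-class accuracies; this is just a sketch, not code used for the paper:

```python
# Illustration only: mean-per-class accuracy over integer class labels.
import numpy as np

def mean_per_class_accuracy(y_true, y_pred, num_classes):
    per_class = []
    for c in range(num_classes):
        mask = (y_true == c)
        if mask.any():
            per_class.append((y_pred[mask] == c).mean())
    return float(np.mean(per_class))

# e.g. mean_per_class_accuracy(np.array([0, 0, 1]), np.array([0, 1, 1]), 2) -> 0.75
```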

@pj-ms

pj-ms commented Sep 24, 2021

Hi Jong, thanks so much for all the information! It is super helpful.

I have some questions about two more datasets and would really appreciate it if you could help. Thanks in advance.

Birdsnap: the official site only provides the image URLs. When using the associated script from the official dataset to download the images, I ended up with "NEW_OK:40318, ALREADY_OK:0, DOWNLOAD_FAILED:5030, SAVE_FAILED:0, MD5_FAILED:4481, MYSTERY_FAILED:0." Have you folks experienced similar problems?

CLEVR (Counts): the CLIP paper says "2,500 random samples of the CLEVR dataset (Johnson et al., 2017)", while the official data site says "A training set of 70,000 images and 699,989 questions, A validation set of 15,000 images and 149,991 questions, A test set of 15,000 images and 14,988 questions". The original dataset seems to be a VQA dataset. According to the prompts and the wording in the paper ("counting objects in synthetic scenes (CLEVRCounts)"), it seems it was transformed into a counting classification dataset. Could you share a little information about how this was done, and do you happen to still have the sampling script? Thanks.

[screenshot: CLEVR dataset statistics from the official data site]
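
For context, here is one way such a counting-classification task could be constructed from the official scene annotations. This is only my own sketch, assuming the CLEVR_v1.0 scenes JSON files in which each scene lists its objects, and not necessarily what the paper did:

```python
# Sketch only (not the CLIP authors' script): label each CLEVR image with the
# number of objects in its scene, then draw a fixed random subset.
import json
import random

with open("CLEVR_v1.0/scenes/CLEVR_val_scenes.json") as f:
    scenes = json.load(f)["scenes"]

# CLEVR scenes contain between 3 and 10 objects, so the "class" is the count.
examples = [(s["image_filename"], len(s["objects"])) for s in scenes]

random.seed(0)  # arbitrary seed for this illustration; the paper's setup is unknown
subset = random.sample(examples, 2500)
```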

Thanks!

@meigaoms
Author

@jongwook Thank you so much for sharing these details. I have two more detailed questions:

  1. For the EuroSAT and RESISC45 datasets, is seed=42 always used for deterministic sampling in your experiments?
  2. From Table 9, the train size of RESISC45 is 3,150. With your code I got a train set of 3,150 images and a validation set of 3,150 images. Are they supposed to be added together to form a total train size of 6,300?

We would like to make our benchmarking setup comparable to CLIP's if possible. However, when I run the sampled EuroSAT with a few models (such as ResNet50, ResNet101, efficientnet_b0, and CLIP ViT-B/32), the scores I get are all about 2.3~4.6% lower than their equivalents in Table 10. The RESISC45 scores with the same models are close to those in Table 10, but 5% higher for ViT_base_patch16_224.
