4 random sampling #13

Open: wants to merge 8 commits into main

Conversation

@J-Dymond (Collaborator) commented May 16, 2025

  • Random sampling + analysis scripts

  • Refactoring changes, moving dataset loading functions into separate files

  • Additional configs

J-Dymond linked an issue May 16, 2025 that may be closed by this pull request

Review thread on the following diff:

```diff
 from arc_tigers.utils import load_yaml


-def imbalance_dataset(dataset, seed, class_balance):
+def imbalance_dataset(dataset: Dataset, seed: int, class_balance: float) -> Dataset:
```
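
For context, a minimal sketch of what a function with this typed signature could do, assuming a Hugging Face `datasets.Dataset` with a binary `label` column and `class_balance` interpreted as the target minority-to-majority ratio (assumptions for illustration, not necessarily the PR's implementation):

```python
import numpy as np
from datasets import Dataset


def imbalance_dataset(dataset: Dataset, seed: int, class_balance: float) -> Dataset:
    """Subsample the minority class so n_minority is roughly class_balance * n_majority."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(dataset["label"])
    majority_idx = np.flatnonzero(labels == 0)  # assumed majority class
    minority_idx = np.flatnonzero(labels == 1)  # assumed minority class
    n_keep = min(int(class_balance * len(majority_idx)), len(minority_idx))
    kept_minority = rng.choice(minority_idx, size=n_keep, replace=False)
    keep = np.concatenate([majority_idx, kept_minority])
    rng.shuffle(keep)
    return dataset.select(keep.tolist())
```
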
Collaborator:

Ideally I think changing the imbalance should be handled in the data scripts, not the sampling script.

Collaborator (Author):

Would we want different levels of imbalance when training? And would this just be in the binary case?

Collaborator:

Basically I just think that, as far as the sampling script is concerned, the dataset logic shouldn't be much more than `test_data = load_dataset(config)` or similar. We may want the option of imbalance in the training splits too, but that's separate from anything to do with this script (apart from making sure the test data doesn't overlap with the training data).
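
As an illustration of that shape, the sampling script's dataset handling would reduce to something like the snippet below (`load_dataset` here is a hypothetical project-level helper, not the Hugging Face function of the same name):

```python
from arc_tigers.data import load_dataset  # hypothetical helper, not datasets.load_dataset
from arc_tigers.utils import load_yaml


def get_test_data(config_path: str):
    """The sampling script only sees the final test split; split and imbalance logic stays behind the loader."""
    config = load_yaml(config_path)
    return load_dataset(config)
```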

Collaborator (Author):

I've made a change now to reflect this

Collaborator (Author):

In `arc_tigers.data.get_reddit_data`, if `balanced` is False, it checks for a `class_balance` argument and uses that to imbalance the train and test splits.
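
A hedged sketch of that control flow, reusing the `imbalance_dataset` sketch above (argument names and split handling are assumptions, not the actual `get_reddit_data` code):

```python
from datasets import Dataset

# imbalance_dataset is the function sketched earlier in this thread


def apply_class_balance(
    train_data: Dataset,
    test_data: Dataset,
    balanced: bool,
    class_balance: float | None,
    seed: int,
) -> tuple[Dataset, Dataset]:
    """If balanced is False, imbalance both splits to the requested class_balance."""
    if balanced:
        return train_data, test_data
    if class_balance is None:
        raise ValueError("class_balance must be provided when balanced=False")
    return (
        imbalance_dataset(train_data, seed=seed, class_balance=class_balance),
        imbalance_dataset(test_data, seed=seed, class_balance=class_balance),
    )
```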

@jack89roberts (Collaborator):

There's also a lot going on in the `__main__` block of the random sampling script that could be pulled out into functions (anything that isn't just argparse).

J-Dymond linked an issue May 16, 2025 that may be closed by this pull request
@J-Dymond (Collaborator, Author) commented May 16, 2025

Will refactor `__main__` in `scripts/random_sampling.py`.
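
As an illustration of that refactor, `__main__` could keep only argument parsing, with the rest moved into named functions (function names and arguments below are placeholders, not the PR's final code):

```python
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Randomly sample a test set and evaluate on it.")
    parser.add_argument("config", help="path to a yaml experiment config")
    parser.add_argument("--seed", type=int, default=42)
    return parser.parse_args()


def run_random_sampling(config_path: str, seed: int) -> None:
    """Placeholder for the work currently in __main__: load data, sample, compute metrics, save outputs."""
    raise NotImplementedError


if __name__ == "__main__":
    args = parse_args()
    run_random_sampling(args.config, args.seed)
```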

J-Dymond marked this pull request as ready for review May 16, 2025 14:35
@J-Dymond (Collaborator, Author):

removed conflicts with main

J-Dymond requested a review from klh5 May 16, 2025 15:14
Successfully merging this pull request may close these issues.

  • Benchmark performance using random sampling

  • Assess performance on balanced class problem