4 random sampling #13

Open: wants to merge 8 commits into main

Conversation

@J-Dymond (Collaborator) commented May 16, 2025

  • Random sampling + analysis scripts

  • Refactoring changes, moving dataset loading functions into separate files

  • Additional configs

J-Dymond linked an issue May 16, 2025 that may be closed by this pull request

Review thread on the following diff:

```diff
 from arc_tigers.utils import load_yaml


-def imbalance_dataset(dataset, seed, class_balance):
+def imbalance_dataset(dataset: Dataset, seed: int, class_balance: float) -> Dataset:
```
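
For context, a minimal sketch of what a function with this typed signature could do, assuming a Hugging Face `datasets.Dataset` with a binary `label` column and `class_balance` interpreted as the target minority-to-majority ratio (assumptions for illustration, not necessarily the PR's implementation):

```python
import numpy as np
from datasets import Dataset


def imbalance_dataset(dataset: Dataset, seed: int, class_balance: float) -> Dataset:
    """Subsample the minority class so n_minority is roughly class_balance * n_majority."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(dataset["label"])
    majority_idx = np.flatnonzero(labels == 0)  # assumed majority class
    minority_idx = np.flatnonzero(labels == 1)  # assumed minority class
    n_keep = min(int(class_balance * len(majority_idx)), len(minority_idx))
    kept_minority = rng.choice(minority_idx, size=n_keep, replace=False)
    keep = np.concatenate([majority_idx, kept_minority])
    rng.shuffle(keep)
    return dataset.select(keep.tolist())
```
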
Collaborator:

Ideally I think changing the imbalance should be handled in the data scripts, not the sampling script.

Collaborator (Author):

Would we want different levels of imbalance when training? And would this just be in the binary case?

Collaborator:

Basically I just think that, as far as the sampling script is concerned, the dataset logic shouldn't be much more than `test_data = load_dataset(config)` or similar. We may want the option of imbalance in the training splits too, but that's separate from anything to do with this script (apart from making sure the test data doesn't overlap with the training data).
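
As an illustration of that shape, the sampling script's dataset handling would reduce to something like the snippet below (`load_dataset` here is a hypothetical project-level helper, not the Hugging Face function of the same name):

```python
from arc_tigers.data import load_dataset  # hypothetical helper, not datasets.load_dataset
from arc_tigers.utils import load_yaml


def get_test_data(config_path: str):
    """The sampling script only sees the final test split; split and imbalance logic stays behind the loader."""
    config = load_yaml(config_path)
    return load_dataset(config)
```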

Collaborator (Author):

I've made a change now to reflect this

Collaborator (Author):

In `arc_tigers.data.get_reddit_data`, if `balanced` is False, it checks for a `class_balance` argument and uses that to imbalance the train and test splits.
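
A hedged sketch of that control flow, reusing the `imbalance_dataset` sketch above (argument names and split handling are assumptions, not the actual `get_reddit_data` code):

```python
from datasets import Dataset

# imbalance_dataset is the function sketched earlier in this thread


def apply_class_balance(
    train_data: Dataset,
    test_data: Dataset,
    balanced: bool,
    class_balance: float | None,
    seed: int,
) -> tuple[Dataset, Dataset]:
    """If balanced is False, imbalance both splits to the requested class_balance."""
    if balanced:
        return train_data, test_data
    if class_balance is None:
        raise ValueError("class_balance must be provided when balanced=False")
    return (
        imbalance_dataset(train_data, seed=seed, class_balance=class_balance),
        imbalance_dataset(test_data, seed=seed, class_balance=class_balance),
    )
```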

@jack89roberts (Collaborator):

There's also a lot going on in the `__main__` block of the random sampling script that could be pulled out into functions (anything that isn't just argparse).

J-Dymond linked an issue May 16, 2025 that may be closed by this pull request
@J-Dymond (Collaborator, Author) commented May 16, 2025

Will refactor `__main__` in `scripts/random_sampling.py`.
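
As an illustration of that refactor, `__main__` could keep only argument parsing, with the rest moved into named functions (function names and arguments below are placeholders, not the PR's final code):

```python
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Randomly sample a test set and evaluate on it.")
    parser.add_argument("config", help="path to a yaml experiment config")
    parser.add_argument("--seed", type=int, default=42)
    return parser.parse_args()


def run_random_sampling(config_path: str, seed: int) -> None:
    """Placeholder for the work currently in __main__: load data, sample, compute metrics, save outputs."""
    raise NotImplementedError


if __name__ == "__main__":
    args = parse_args()
    run_random_sampling(args.config, args.seed)
```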

J-Dymond marked this pull request as ready for review May 16, 2025 14:35
@J-Dymond (Collaborator, Author):

removed conflicts with main

J-Dymond requested a review from klh5 May 16, 2025 15:14
Successfully merging this pull request may close these issues.

  • Benchmark performance using random sampling

  • Assess performance on balanced class problem