4 random sampling #13
base: main
…max probabilities for correctly/incorrectly classified samples
…into 4-random-sampling
scripts/random_sampling.py
Outdated
from arc_tigers.utils import load_yaml


-def imbalance_dataset(dataset, seed, class_balance):
+def imbalance_dataset(dataset: Dataset, seed: int, class_balance: float) -> Dataset:
Ideally I think changing the imbalance would be handled in the data scripts, and not the sampling script
Would we want different levels of imbalance when training? And would this just be in the binary case?
Basically I just think that, as far as the sampling script is concerned, the dataset logic shouldn't be much more than `test_data = load_dataset(config)` or similar. We may want the option of imbalance in the training splits too, but that's separate from anything to do with this script (apart from making sure the test data doesn't overlap with the training data).
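A minimal sketch of the separation suggested above, in case it helps. Everything here is hypothetical: this `load_dataset(config)` helper, the config keys, and the integer IDs standing in for samples are illustrative assumptions, not the repository's actual API.

```python
import random


def load_dataset(config: dict) -> dict:
    """Hypothetical data-side loader: owns splitting (and any imbalancing)."""
    rng = random.Random(config["seed"])
    ids = list(range(config["n_samples"]))  # stand-in for real samples
    rng.shuffle(ids)
    n_test = int(len(ids) * config["test_fraction"])
    return {"test": ids[:n_test], "train": ids[n_test:]}


def sample_test_set(config: dict) -> list:
    """Sampling-side logic: load the test data, then draw a random subset."""
    splits = load_dataset(config)
    test_data = splits["test"]
    # The one dataset concern left in the sampling script:
    # the test data must not overlap with the training data.
    assert not set(test_data) & set(splits["train"])
    rng = random.Random(config["seed"])
    return rng.sample(test_data, k=config["n_draws"])


config = {"seed": 0, "n_samples": 100, "test_fraction": 0.2, "n_draws": 5}
subset = sample_test_set(config)
```

With this split of responsibilities, the sampling script never needs to know whether or how the data was imbalanced.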
I've made a change now to reflect this
In `arc_tigers.data.get_reddit_data`, if `balanced` is `False`, it checks for a `class_balance` argument and uses that to imbalance the train and test splits.
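For readers following along, the behaviour described above could look roughly like the sketch below. It is not the code in `arc_tigers.data`: the helper names are made up, it works on plain label lists rather than a `Dataset`, and it assumes `class_balance` means the ratio of class-1 to class-0 samples in the binary case, which the thread does not confirm.

```python
import random
from typing import List, Optional


def imbalance_indices(labels: List[int], seed: int, class_balance: float) -> List[int]:
    """Keep all of class 0 and subsample class 1 to class_balance * n_class0.

    ASSUMPTION: class_balance is the class-1 to class-0 ratio (binary labels).
    """
    rng = random.Random(seed)
    class0 = [i for i, y in enumerate(labels) if y == 0]
    class1 = [i for i, y in enumerate(labels) if y == 1]
    n_keep = min(len(class1), round(len(class0) * class_balance))
    kept = class0 + rng.sample(class1, k=n_keep)
    rng.shuffle(kept)
    return kept


def get_split_indices(
    labels: List[int],
    balanced: bool,
    seed: int = 0,
    class_balance: Optional[float] = None,
) -> List[int]:
    """Hypothetical stand-in for the check described in the comment above."""
    if balanced or class_balance is None:
        # ASSUMPTION: with balanced=True the split is returned untouched.
        return list(range(len(labels)))
    return imbalance_indices(labels, seed, class_balance)


labels = [0] * 100 + [1] * 100
idx = get_split_indices(labels, balanced=False, seed=42, class_balance=0.1)
```

The same call would be applied separately to the train and test splits, so both end up with the requested ratio.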
There's also a lot going on in the
Will refactor
removed conflicts with main
Random sampling + analysis scripts
Refactoring changes, moving dataset loading functions into separate files
Additional configs