
Gracefully skip BigBench tasks with no data & guard final aggregation #3066


Open
wants to merge 1 commit into main

Conversation

NourFahmy
Contributor

Description:
Running the harness on BigBench stalls (or crashes) once it reaches bigbench_simple_arithmetic_multiple_targets_json_generate_until because that task’s configured split (“train”) doesn’t exist, causing:

  • A ValueError: Instruction "train" corresponds to no data! raised by datasets.load_dataset() (reproduced in the sketch after this list).

  • An unguarded “Test One Doc” block in ConfigurableTask.__init__ that blindly indexes eval_docs[0].

  • Later, an UnboundLocalError in evaluate() when referencing show_group_table if all tasks have zero examples.
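
For context, the first failure can be reproduced directly with the `datasets` library when a requested split has no data. This is a minimal sketch, not the harness’s actual call site; the dataset path and config name below are assumptions (the harness builds them from the task’s YAML config):

```python
from datasets import load_dataset

# Hypothetical reproduction: the dataset path/name are assumptions.
# Requesting a split that the chosen config does not provide makes
# `datasets` raise:
#   ValueError: Instruction "train" corresponds to no data!
load_dataset(
    "bigbench",
    "simple_arithmetic_multiple_targets_json",
    split="train",
)
```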

This PR makes the harness robust to missing-split/empty tasks by:

  • Catching missing-split errors in ConfigurableTask.download() and substituting an empty DatasetDict({"default": …}) (see the sketch after this list).

  • Wrapping the “one-doc” sanity checks in ConfigurableTask.__init__ inside if self.eval_docs: (with an else fallback that performs minimal attribute initialization).

  • Filtering out zero-example tasks at the top of evaluate(), so they never build requests or invoke the LM.

  • Initializing show_group_table=False before the group-aggregation step and switching from bitwise & to logical and when inserting the "groups" entry in results_dict (both sketched after the summary below).
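
A minimal sketch of the download() guard from the first bullet. The attribute names (self.dataset, DATASET_PATH, DATASET_NAME) and the error-message matching are assumptions; the contents of the substituted empty DatasetDict are elided in the bullet above, so an empty Dataset is assumed here:

```python
import datasets

def download(self, dataset_kwargs=None):
    try:
        self.dataset = datasets.load_dataset(
            path=self.DATASET_PATH,
            name=self.DATASET_NAME,
            **(dataset_kwargs or {}),
        )
    except ValueError as err:
        # `datasets` raises ValueError when a split instruction matches no data.
        if "corresponds to no data" not in str(err):
            raise
        # Substitute an empty dataset so downstream code sees zero docs
        # instead of crashing; "default" mirrors the key quoted above.
        self.dataset = datasets.DatasetDict(
            {"default": datasets.Dataset.from_dict({})}
        )
```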

These changes resolve the crash and prevent the harness from lagging or stalling indefinitely when it reaches that BigBench task, or any other task with a similar misconfiguration.
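
And a sketch of the evaluate()-side guards from the last two bullets. The identifiers (task_dict, eval_docs, groups_agg, results_dict) are assumptions based on the description, not the harness’s exact names, and the per-task evaluation loop is omitted:

```python
def evaluate(lm, task_dict):
    # Drop tasks that ended up with zero documents so they never build
    # requests or invoke the LM (`eval_docs` is an assumed attribute name).
    task_dict = {
        name: task
        for name, task in task_dict.items()
        if len(getattr(task, "eval_docs", [])) > 0
    }

    results_dict = {"results": {name: {} for name in task_dict}}
    groups_agg = {}  # filled by the group-aggregation step (omitted here)

    # Initialize up front so the reference below is always bound, even when
    # aggregation is skipped because every task had zero examples.
    show_group_table = False

    # ... per-task evaluation and group aggregation would run here,
    # possibly flipping show_group_table to True ...

    # Use logical `and` (short-circuiting) rather than bitwise `&`.
    if bool(groups_agg) and show_group_table:
        results_dict["groups"] = groups_agg

    return results_dict
```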

Gracefully skip BigBench tasks with no data & guard final aggregation
@baberabb
Contributor

Sorry for the tardiness, I'll review soon. My main hesitation is that we do not want tasks with misconfigured configs to silently work, as that would make maintenance much harder: we'd lose visibility into which tasks are actually broken or misconfigured, and it would be especially problematic when aggregating group scores. I'll have to think about which workaround works better. Currently I'm thinking an exclude-tasks param might work, so users can explicitly skip some tasks.
