Support configurability of FIM tokens on SantaCoder #313

@Jay-Roberts

Currently the SantaCoder and StarCoder FIM tasks hard-code their FIM tokens; however, other models may use different FIM tokens. For example, to use Qwen2.5-Coder models, one might have to create a dedicated subclass such as:

class SantaCoderQwen25CoderFIM(SantaCoderFIM):
    DATASET_PATH = "bigcode/santacoder-fim-task"

    def __init__(self):
        fim_prefix = "<|fim_prefix|>"
        fim_middle = "<|fim_middle|>"
        fim_suffix = "<|fim_suffix|>"
        stop_words = ["<|endoftext|>", "<|filename|>"]
        super().__init__(
            stop_words=stop_words,
            requires_execution=False,
            fim_prefix=fim_prefix,
            fim_middle=fim_middle,
            fim_suffix=fim_suffix,
        )

Allowing the FIM parameters to be passed in on the command line (e.g. via new --fim_tokens and --stop_words options), similarly to --instruction_tokens for HumanEval, would let this task be a single class and support future FIM models without a new subclass per model.
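A minimal sketch of what this could look like, independent of the harness itself. Everything here is hypothetical and only for illustration: `FIMConfig`, `parse_fim_args` and `build_prompt` are made-up names, the comma-separated format for `--fim_tokens` is an assumption modeled on how I understand `--instruction_tokens` to be passed, the defaults are assumed SantaCoder-style tokens, and the prefix-suffix-middle prompt ordering is likewise an assumption about what the task does.

```python
import argparse
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FIMConfig:
    """Hypothetical container for the configurable FIM settings."""

    # Assumed SantaCoder-style defaults; not taken from this issue.
    fim_prefix: str = "<fim-prefix>"
    fim_middle: str = "<fim-middle>"
    fim_suffix: str = "<fim-suffix>"
    stop_words: List[str] = field(
        default_factory=lambda: ["<|endoftext|>", "<|filename|>"]
    )


def parse_fim_args(argv: Optional[List[str]] = None) -> FIMConfig:
    """Parse the proposed (not yet existing) --fim_tokens / --stop_words flags."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--fim_tokens",
        default=None,
        help="Comma-separated FIM tokens in the order prefix,middle,suffix",
    )
    parser.add_argument(
        "--stop_words",
        default=None,
        help="Comma-separated stop words",
    )
    args = parser.parse_args(argv)

    config = FIMConfig()
    if args.fim_tokens is not None:
        parts = args.fim_tokens.split(",")
        if len(parts) != 3:
            raise ValueError(
                "--fim_tokens expects exactly three comma-separated tokens: "
                "prefix,middle,suffix"
            )
        config.fim_prefix, config.fim_middle, config.fim_suffix = parts
    if args.stop_words is not None:
        config.stop_words = args.stop_words.split(",")
    return config


def build_prompt(config: FIMConfig, prefix: str, suffix: str) -> str:
    """Assemble a prompt in prefix-suffix-middle order (assumed ordering)."""
    return f"{config.fim_prefix}{prefix}{config.fim_suffix}{suffix}{config.fim_middle}"


if __name__ == "__main__":
    # Example: Qwen2.5-Coder style tokens, as in the subclass above.
    cfg = parse_fim_args(
        [
            "--fim_tokens",
            "<|fim_prefix|>,<|fim_middle|>,<|fim_suffix|>",
            "--stop_words",
            "<|endoftext|>,<|filename|>",
        ]
    )
    print(build_prompt(cfg, "def add(a, b):\n    ", "\n"))
```

With something along these lines, the Qwen2.5-Coder subclass above would reduce to passing the three tokens on the command line. Note that tokens containing `|` need to be quoted in most shells.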
