This repository contains the code and resources used for the experiments conducted as part of my Bachelor's Thesis. The primary focus is fine-tuning language models to investigate model collapse when training on generational synthetic data, evaluating the generated text with several metrics. The experiments involve iteratively fine-tuning successive generations of the OPT-125m (Open Pre-trained Transformer) model from Hugging Face on a custom dataset (WritingPrompts), evaluating each generation on story creation.
- Model: The OPT (Open Pre-trained Transformer) model provided by Meta AI. The specific version used is `facebook/opt-125m`.
- Dataset: A collection of stories (WritingPrompts) pre-processed to remove artifacts such as `<newline>`, `(Edit :)`, and other non-story text. The dataset was further cleaned to ensure high-quality training data.
Before training, the dataset undergoes several preprocessing steps (a minimal cleaning sketch follows the list):
- Removing special artifacts like `<newline>` and `(Edit :)`.
- Eliminating comments or promotional content that do not belong to the story.
- Normalizing whitespace by replacing multiple spaces with a single space.
- Ensuring proper punctuation by removing spaces before periods.
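As an illustration, a minimal cleaning function along these lines could look like the following (the helper name and exact regular expressions are assumptions, not the thesis code verbatim):

```python
import re

def clean_story(text: str) -> str:
    # Remove dataset artifacts such as <newline> and "(Edit :)" markers.
    text = text.replace("<newline>", " ")
    text = re.sub(r"\(\s*Edit\s*:?\s*\)", " ", text, flags=re.IGNORECASE)
    # Drop standalone comment/promotional lines that do not belong to the story.
    text = re.sub(r"(?im)^\s*(edit|update)\s*:.*$", "", text)
    # Normalize whitespace: collapse runs of spaces/newlines into a single space.
    text = re.sub(r"\s+", " ", text)
    # Ensure proper punctuation: remove stray spaces before periods.
    text = text.replace(" .", ".")
    return text.strip()
```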
The fine-tuning process is conducted using the DeepSpeed library to leverage distributed training capabilities. The following configuration is used:
- Training Script: `run_clm.py`
- Configuration:
  - `num_train_epochs`: 5
  - `learning_rate`: 5e-5
  - `per_device_train_batch_size`: 4
  - `gradient_accumulation_steps`: 4
  - `save_strategy`: epoch
  - `deepspeed`: `ds_config_AdamW.json`
- Optimizer: AdamW with a learning rate of 5e-5, beta parameters (0.9, 0.999), epsilon 1e-8, and weight decay 0.01.
- Scheduler: WarmupLR with warmup steps set to 300.
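Assuming the standard Hugging Face `run_clm.py` example script, a launch command consistent with the configuration above might look like this (file paths and the output directory are placeholders):

```bash
deepspeed run_clm.py \
  --model_name_or_path facebook/opt-125m \
  --train_file data/writingprompts_clean.txt \
  --do_train \
  --num_train_epochs 5 \
  --learning_rate 5e-5 \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 4 \
  --save_strategy epoch \
  --deepspeed ds_config_AdamW.json \
  --output_dir outputs/generation-1
```

A `ds_config_AdamW.json` matching the optimizer and scheduler settings above would contain sections like the following (a sketch showing only these two sections; other DeepSpeed options are omitted):

```json
{
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 5e-5,
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "weight_decay": 0.01
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_num_steps": 300
    }
  }
}
```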
After fine-tuning, the model is used to generate stories based on given prompts. The generation parameters are as follows:
- `max_length`: 500 tokens
- `min_length`: 300 tokens
- `temperature`: 0.7
- `top_k`: 50
- `top_p`: 0.9
- `repetition_penalty`: 1.0 / 1.1
- `do_sample`: True
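In Hugging Face Transformers, these parameters map directly onto `model.generate` (the checkpoint path and prompt below are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
# Load a fine-tuned checkpoint from a given generation (path is a placeholder).
model = AutoModelForCausalLM.from_pretrained("outputs/generation-1")

prompt = "Write a story about a lighthouse keeper who ..."
inputs = tokenizer(prompt, return_tensors="pt")

output_ids = model.generate(
    **inputs,
    max_length=500,
    min_length=300,
    temperature=0.7,
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.1,  # 1.0 is used in some runs
    do_sample=True,
)
story = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(story)
```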
To ensure the generated stories meet the minimum length requirement, an iterative augmentation process is used, where additional tokens are generated until the story reaches the desired length.
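One way to implement this loop is to repeatedly extend the running text until the token count passes the threshold (a sketch; the helper name, chunk size, and round limit are assumptions):

```python
def generate_min_length(model, tokenizer, prompt, min_tokens=300,
                        chunk=100, max_rounds=10):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_rounds):
        # Stop once the story (prompt included) reaches the minimum length.
        if input_ids.shape[1] >= min_tokens:
            break
        # Generate a further chunk of tokens, conditioning on everything so far.
        input_ids = model.generate(
            input_ids,
            max_new_tokens=chunk,
            temperature=0.7,
            top_k=50,
            top_p=0.9,
            do_sample=True,
        )
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
```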
- Python 3.8+
- PyTorch
- Transformers
- DeepSpeed
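These dependencies can typically be installed with pip (exact versions are not pinned here):

```bash
pip install torch transformers deepspeed
```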