This repository contains the code and resources used for the experiments conducted as part of my Bachelor's Thesis. The primary focus is fine-tuning language models to investigate model collapse when training on generational synthetic data, evaluating the generated text with several metrics. The experiments involve iteratively fine-tuning successive generations of the OPT-125m (Open Pre-trained Transformer) model from Hugging Face on a custom dataset (WritingPrompts), evaluating each generation on story creation.
- Model: The OPT (Open Pre-trained Transformer) model provided by Meta AI. The specific version used is `facebook/opt-125m`.
- Dataset: A collection of stories (WritingPrompts) pre-processed to remove artifacts such as `<newline>`, `(Edit :)`, and other non-story text. The dataset was further cleaned to ensure high-quality training data.
Before training, the dataset undergoes several preprocessing steps (a minimal cleaning sketch follows the list):
- Removing special artifacts like `<newline>` and `(Edit :)`.
- Eliminating comments or promotional content that do not belong to the story.
- Normalizing whitespace by replacing multiple spaces with a single space.
- Ensuring proper punctuation by removing spaces before periods.
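As an illustration, a minimal cleaning function along these lines could look like the following (the helper name and exact regular expressions are assumptions, not the thesis code verbatim):

```python
import re

def clean_story(text: str) -> str:
    # Remove dataset artifacts such as <newline> and "(Edit :)" markers.
    text = text.replace("<newline>", " ")
    text = re.sub(r"\(\s*Edit\s*:?\s*\)", " ", text, flags=re.IGNORECASE)
    # Drop standalone comment/promotional lines that do not belong to the story.
    text = re.sub(r"(?im)^\s*(edit|update)\s*:.*$", "", text)
    # Normalize whitespace: collapse runs of spaces/newlines into a single space.
    text = re.sub(r"\s+", " ", text)
    # Ensure proper punctuation: remove stray spaces before periods.
    text = text.replace(" .", ".")
    return text.strip()
```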
The fine-tuning process is conducted using the DeepSpeed library to leverage distributed training capabilities. The following configuration is used:
- Training Script: `run_clm.py`
- Configuration:
  - `num_train_epochs`: 5
  - `learning_rate`: 5e-5
  - `per_device_train_batch_size`: 4
  - `gradient_accumulation_steps`: 4
  - `save_strategy`: epoch
  - `deepspeed`: `ds_config_AdamW.json`
- Optimizer: AdamW with a learning rate of 5e-5, beta parameters (0.9, 0.999), epsilon 1e-8, and weight decay 0.01.
- Scheduler: WarmupLR with warmup steps set to 300.
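Assuming the standard Hugging Face `run_clm.py` example script, a launch command consistent with the configuration above might look like this (file paths and the output directory are placeholders):

```bash
deepspeed run_clm.py \
  --model_name_or_path facebook/opt-125m \
  --train_file data/writingprompts_clean.txt \
  --do_train \
  --num_train_epochs 5 \
  --learning_rate 5e-5 \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 4 \
  --save_strategy epoch \
  --deepspeed ds_config_AdamW.json \
  --output_dir outputs/generation-1
```

A `ds_config_AdamW.json` matching the optimizer and scheduler settings above would contain sections like the following (a sketch showing only these two sections; other DeepSpeed options are omitted):

```json
{
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 5e-5,
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "weight_decay": 0.01
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_num_steps": 300
    }
  }
}
```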
After fine-tuning, the model is used to generate stories based on given prompts. The generation parameters are as follows:
- `max_length`: 500 tokens
- `min_length`: 300 tokens
- `temperature`: 0.7
- `top_k`: 50
- `top_p`: 0.9
- `repetition_penalty`: 1.0 / 1.1
- `do_sample`: True
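In Hugging Face Transformers, these parameters map directly onto `model.generate` (the checkpoint path and prompt below are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
# Load a fine-tuned checkpoint from a given generation (path is a placeholder).
model = AutoModelForCausalLM.from_pretrained("outputs/generation-1")

prompt = "Write a story about a lighthouse keeper who ..."
inputs = tokenizer(prompt, return_tensors="pt")

output_ids = model.generate(
    **inputs,
    max_length=500,
    min_length=300,
    temperature=0.7,
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.1,  # 1.0 is used in some runs
    do_sample=True,
)
story = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(story)
```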
To ensure the generated stories meet the minimum length requirement, an iterative augmentation process is used, where additional tokens are generated until the story reaches the desired length.
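One way to implement this loop is to repeatedly extend the running text until the token count passes the threshold (a sketch; the helper name, chunk size, and round limit are assumptions):

```python
def generate_min_length(model, tokenizer, prompt, min_tokens=300,
                        chunk=100, max_rounds=10):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_rounds):
        # Stop once the story (prompt included) reaches the minimum length.
        if input_ids.shape[1] >= min_tokens:
            break
        # Generate a further chunk of tokens, conditioning on everything so far.
        input_ids = model.generate(
            input_ids,
            max_new_tokens=chunk,
            temperature=0.7,
            top_k=50,
            top_p=0.9,
            do_sample=True,
        )
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
```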
- Python 3.8+
- PyTorch
- Transformers
- DeepSpeed
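These dependencies can typically be installed with pip (exact versions are not pinned here):

```bash
pip install torch transformers deepspeed
```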