PDF to Text and JSONL Converter

This is a command line Python tool to convert larger PDF files into a format that AI can use to fine-tune a model or prime a vector search.

This project converts a PDF document into images, extracts text from those images using OCR (Optical Character Recognition), and then processes and formats the text using OpenAI's GPT model by comparing the image and text and creating metadata in a multimodal fashion. (the ai reviews the image and the text extracted by tesseract side by side) The final output is saved as both a formatted text file and a JSONL file with metadata.

I am currently tuning this project for gpt4o and gpt4o-mini and will be making updates all September as I craft it to micro-tune and fade to develop models for specific purposes that react to small changes while maintaining a proper context. Stay Tuned

Features

Convert PDF pages to images
Extract text from images using Tesseract OCR
Analyze Images and text, side by side multimodal and format meta-text with the help of OpenAI's GPT model
Segment text into sections and save as JSONL with metadata

Prerequisites

Before running the script, ensure you have the following installed:

Python 3.6+
Tesseract OCR
Poppler-utils (for PDF to image conversion)
Required Python packages (listed in requirements.txt)

Installation

Clone the repository:

git clone https://github.com/yourusername/pdf-to-text-jsonl.git
cd pdf-to-text-jsonl

Install dependencies:
```
pip install -r requirements.txt
```
Set up Tesseract OCR:
- Windows: Download and install Tesseract from here.
- macOS: Install via Homebrew:
```
brew install tesseract
```
- Linux: Install via package manager:
```
sudo apt-get install tesseract-ocr
```
Set up Poppler-utils:
- Windows: Download and install Poppler from here.
- macOS: Install via Homebrew:
```
brew install poppler
```
- Linux: Install via package manager:
```
sudo apt-get install poppler-utils
```
Configure OpenAI API key:

Replace the placeholder API key in the script with your actual OpenAI API key.
```
client = OpenAI(api_key='your-api-key-here')
```

Usage

Place your PDF file:

Ensure your PDF file (e.g., your_input_file.pdf) is in the project directory.
Run the script:
```
python main.py
```
Follow the prompts to convert the PDF to images, extract text, analyze text, and segment text. The script will skip pipeline steps if it sees the files exist, simply move or delete the appropriate file (listed below under Output) to re-activate that step, (this feature allows for starting and stopping without starting over.)

Script Details

convert_pdf_to_images(pdf_path, output_image_dir): Converts each page of the PDF to an image.
extract_text_from_images(image_paths): Extracts text from the generated images using Tesseract OCR.
analyze_and_format_text_with_images(text_chunks, image_paths): Uses OpenAI's GPT model to analyze and format the extracted text.
segment_text_by_analysis(formatted_text): Segments the formatted text into sections.
prepare_jsonl_data(segmented_text): Prepares the segmented text for JSONL format.
save_text_to_file(text, file_path): Saves text to a file.
save_jsonl_to_file(jsonl_data, file_path): Saves JSONL data to a file.

Output

Images: Saved in the output_images directory.
Extracted Text: Saved as extracted_text.txt.
Formatted Text: Saved as extracted_text_formatted.txt.
JSONL Data: Saved as segmented_text.jsonl.

Example

Here's an example of how to run the script:

python main.py

You will be prompted to convert the PDF to images, extract text from images, analyze and format the text, and segment the text into JSONL format.

Contributing

Feel free to fork this repository, make improvements, and submit pull requests. Contributions are welcome!

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgements

OpenAI for their GPT model.
Tesseract OCR for the OCR engine.
pdf2image for converting PDF to images.
Poppler-utils for PDF rendering.

Name		Name	Last commit message	Last commit date
Latest commit History 40 Commits
.github/workflows		.github/workflows
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
main.py		main.py
transform.jpg		transform.jpg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PDF to Text and JSONL Converter

I am currently tuning this project for gpt4o and gpt4o-mini and will be making updates all September as I craft it to micro-tune and fade to develop models for specific purposes that react to small changes while maintaining a proper context. Stay Tuned

Features

Prerequisites

Installation

Usage

Script Details

Output

Example

Contributing

License

Acknowledgements

About

Releases

Packages

Languages

License

alanchelmickjr/jsonl-data-preparation-for-model-fine-tuning-and-vector-stores

Folders and files

Latest commit

History

Repository files navigation

PDF to Text and JSONL Converter

I am currently tuning this project for gpt4o and gpt4o-mini and will be making updates all September as I craft it to micro-tune and fade to develop models for specific purposes that react to small changes while maintaining a proper context. Stay Tuned

Features

Prerequisites

Installation

Usage

Script Details

Output

Example

Contributing

License

Acknowledgements

About

Resources

License

Security policy

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages