Arabic Chat with PDF is an innovative tool designed to enable users to interactively query Arabic PDF documents. Powered by state-of-the-art language models and document processing libraries, this application extracts, processes, and retrieves meaningful insights from Arabic text documents. Users can ask questions in Arabic, and the system responds in a professional tone, making this an essential tool for Arabic language researchers, educators, and professionals.
Explore the hosted version of this project on Hugging Face Spaces:
Upload an Arabic PDF and start chatting in seconds!
- Seamless PDF Integration: Upload Arabic PDFs under 10 MB, and start chatting.
- Advanced Text Recognition: Utilizes OCR for Arabic text extraction from searchable PDFs.
- Conversational Interface: Interact via a chatbot with RTL (Right-To-Left) support for natural Arabic conversation.
- Multilingual Embeddings: Employs multilingual embeddings for precise text analysis.
- Text-to-Speech: Outputs audio responses in Arabic for accessibility.
- Customizable UI: Designed with Arabic aesthetics and user-friendly components.
- Python Libraries:
  - Gradio: User-friendly UI for interaction.
  - PyPDF2: PDF text extraction.
  - pytesseract: OCR for PDFs.
  - LangChain: Framework for conversational AI with retrieval-based querying.
  - gTTS: Arabic text-to-speech functionality.
- Machine Learning Models:
  - LLMs: Powered by ChatGroq using the gemma2-9b-it model.
  - Embeddings: Utilizes sentence-transformers/paraphrase-multilingual-mpnet-base-v2.
- Vector Store: FAISS for efficient similarity search and retrieval.
Ensure you have:
- Python 3.9+
- pip for package management
- Access to API keys:
  - GROQ_API_KEY
  - HF_TOKEN
- Clone the Repository:
    git clone https://github.com/your-repo/arabic-pdf-chat.git
    cd arabic-pdf-chat
- Install Dependencies:
    pip install -r requirements.txt
- Set Up Environment Variables:
  Create a .env file and add the required API keys:
    GROQ_API_KEY=your_groq_api_key
    HF_TOKEN=your_huggingface_token
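Before launching, it can help to confirm that both keys are actually set. The following is a minimal sketch using only the standard library; this README does not say how app.py reads the keys, so the helper names `parse_env_file` and `missing_keys` are hypothetical and the parser only handles simple KEY=value lines.

```python
import os

# Key names taken from the prerequisites list in this README.
REQUIRED_KEYS = ("GROQ_API_KEY", "HF_TOKEN")


def parse_env_file(text):
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the required key names that are absent or empty in env."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    # Hypothetical usage: check the .env file in the current directory.
    if os.path.exists(".env"):
        with open(".env", encoding="utf-8") as fh:
            absent = missing_keys(parse_env_file(fh.read()))
    else:
        absent = missing_keys()
    if absent:
        print("Missing keys:", ", ".join(absent))
```

Running this from the project directory prints the names of any keys that still need a value, and prints nothing when both are set.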
Launch the app with:
python app.py
The Gradio interface will open in your browser.
The system follows a structured ETL pipeline:
- Extract: Reads Arabic PDFs using OCR (pytesseract) and PyPDF2.
- Transform:
  - Splits text into manageable chunks with CharacterTextSplitter.
  - Converts raw text into vector embeddings using sentence-transformers.
- Load: Stores transformed data in a FAISS vector database for efficient retrieval.
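The Transform and Load steps can be illustrated with a dependency-free toy. This is only a sketch of the data flow, not the project's implementation: it assumes the text has already been extracted, swaps the sentence-transformers embeddings for simple word-count vectors, and swaps FAISS for a linear cosine-similarity scan. All function names (`split_text`, `embed`, `build_index`, `search`) and the chunk sizes are hypothetical.

```python
import math
import re
from collections import Counter


def split_text(text, chunk_size=200, overlap=40):
    """Cut text into fixed-size chunks; neighbours share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous chunk.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


def embed(text):
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(chunks):
    """'Load' step: keep each chunk next to its vector."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def search(index, query, k=2):
    """Return the k chunks most similar to the query (linear scan)."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


if __name__ == "__main__":
    chunks = ["القاهرة عاصمة مصر", "الرياض عاصمة السعودية", "النيل نهر طويل"]
    index = build_index(chunks)
    print(search(index, "ما هي عاصمة مصر", k=1))
```

Word counts only match exact tokens, so this toy cannot relate synonyms or inflected forms the way a multilingual embedding model can; it is meant to show where chunking, vectorization, and retrieval sit relative to each other.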
- File Size: Limited to PDFs under 10 MB.
- Language Support: Optimized only for Arabic text. Non-Arabic content is not supported.
- Scanned Documents: OCR may struggle with low-quality scans.
- Performance: Response times may vary depending on document size and complexity.
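The file-size limit can be checked before any processing starts. The sketch below is an assumption about how such a guard might look, not the app's actual code: the names `MAX_PDF_BYTES`, `is_within_size_limit`, and `check_upload` are hypothetical, and "10 MB" is interpreted here as 10 × 1024 × 1024 bytes.

```python
import os

# Assumption: "under 10 MB" is read as strictly less than 10 * 1024 * 1024 bytes.
MAX_PDF_BYTES = 10 * 1024 * 1024


def is_within_size_limit(size_bytes, limit=MAX_PDF_BYTES):
    """True for non-empty files strictly smaller than the limit."""
    return 0 < size_bytes < limit


def check_upload(path, limit=MAX_PDF_BYTES):
    """Return the file size in bytes, or raise ValueError if it is out of range."""
    size = os.path.getsize(path)
    if not is_within_size_limit(size, limit):
        raise ValueError(f"{path}: {size} bytes is outside the allowed range (0, {limit})")
    return size
```

Rejecting oversized files up front avoids spending time on OCR and embedding for a document the app would not accept anyway.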
M. N. Gaber
GitHub Profile
LinkedIn
This project is licensed under the Apache License.
Special thanks to the developers of the libraries and frameworks that made this project possible.
Contributions are welcome! Please fork the repository and submit a pull request. For major changes, open an issue first to discuss what you would like to change.
- Add support for scanned and handwritten documents.
- Improve OCR accuracy for complex layouts.
- Enhance conversational capabilities with additional LLM models.
Enjoy exploring Arabic text in a whole new way!