DocumentSummariser is a Python application that analyzes and summarizes documents using language models served through Ollama. It provides token-based document chunking, iterative summarization, and a Streamlit web interface.
graph TB
subgraph "Frontend Layer"
A[Streamlit UI]
end
subgraph "Application Layer"
D[Document Chunker]
E[Summarization Engine]
end
subgraph "Service Layer"
G[Ollama Service]
H[Token Counter]
I[Logger]
end
A --> D
A --> E
E --> G
D --> H
E --> H
D --> I
E --> I
- Smart document chunking with configurable parameters
- Token-based text splitting for optimal LLM processing
- Context preservation through sliding window approach
- Real-time token and character statistics
- Multiple summarization strategies based on text length
- Support for various Ollama models
- Configurable output parameters
- Progress tracking and error handling
- Intuitive Streamlit-based UI
- Real-time processing feedback
- Configuration management
- Summary history tracking
- Python 3.11 or higher
- Ollama with at least one model installed
- UV package manager (recommended)
- Clone the repository:
git clone https://github.com/palash-jain-cw/DocumentSummariser.git
cd DocumentSummariser
- Install dependencies (UV is the recommended tool for dependency management):
uv pip install -e .
streamlit run src/documentsummariser/app/Home.py
from documentsummariser.summarisation.summarizer import Summarizer
from documentsummariser.summarisation.document_chunker import DocumentChunker
# Initialize components
chunker = DocumentChunker(chunk_size=256, overlap_size=30)
summarizer = Summarizer(model_name="llama3.2:3b", word_limit=250)
# Process a document (long_text is the document text as a str)
chunks = chunker.chunk_document(long_text)
summary = summarizer.summarize_text(long_text)
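
The usage example above produces chunks and a summary separately. As a rough illustration of how iterative summarization over chunks could work, the sketch below summarizes each chunk, joins the partial summaries, and re-summarizes until the result fits a word limit. This is an assumption about the approach, not the project's actual implementation: `iterative_summary`, `summarize_fn`, and `max_rounds` are hypothetical names introduced here.

```python
from typing import Callable, List


def iterative_summary(
    chunks: List[str],
    summarize_fn: Callable[[str], str],  # hypothetical: any str -> str summarizer
    word_limit: int = 250,
    max_rounds: int = 5,  # hypothetical guard against a summarizer that never shortens
) -> str:
    """Summarize each chunk, then condense the joined result until it fits."""
    if not chunks:
        return ""

    # First pass: one partial summary per chunk.
    partial_summaries = [summarize_fn(chunk) for chunk in chunks]
    combined = " ".join(partial_summaries)

    # Later passes: re-summarize while the combined text exceeds the limit.
    rounds = 0
    while len(combined.split()) > word_limit and rounds < max_rounds:
        combined = summarize_fn(combined)
        rounds += 1
    return combined
```

With the components from the usage example, `summarize_fn` could plausibly be `summarizer.summarize_text` and `chunks` the output of `chunker.chunk_document(long_text)`.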
graph LR
A[documentsummariser] --> B[app]
A --> C[summarisation]
A --> D[logger]
B --> E[Home.py]
B --> F[Document_Chunker.py]
B --> G[Summarizer.py]
C --> H[summarizer.py]
C --> I[document_chunker.py]
D --> J[logger_setup.py]
OLLAMA_HOST=http://localhost:11434
LOG_LEVEL=INFO
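
For illustration, the two environment variables above could be read at startup roughly as follows. This is a minimal sketch, not the project's own loading code: `load_settings` is a hypothetical helper, and the fallback values simply mirror the ones shown above.

```python
import logging
import os
from typing import Mapping, Optional


def load_settings(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Read OLLAMA_HOST and LOG_LEVEL, falling back to the documented values."""
    env = os.environ if environ is None else environ

    # Strip a trailing slash so paths can be appended to the host consistently.
    host = env.get("OLLAMA_HOST", "http://localhost:11434").rstrip("/")

    # Map the level name (e.g. "INFO") to the numeric constant from logging.
    level_name = env.get("LOG_LEVEL", "INFO").upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        raise ValueError(f"Unknown LOG_LEVEL: {level_name!r}")

    return {"ollama_host": host, "log_level": level}
```

Calling `logging.basicConfig(level=load_settings()["log_level"])` would then apply the configured level.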
# Default configuration
config = {
    "chunk_size": 256,
    "overlap_size": 30,
    "model_name": "llama3.2:3b",
    "word_limit": 250,
}
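
To show how `chunk_size` and `overlap_size` interact in a sliding window, here is a minimal sketch. It is not the project's `DocumentChunker`: `sliding_window_chunks` is a hypothetical function, and whitespace-separated words stand in for model tokens, which a real tokenizer would count differently. The requirement that `overlap_size` be smaller than `chunk_size` is also an assumption of this sketch.

```python
from typing import List


def sliding_window_chunks(
    text: str, chunk_size: int = 256, overlap_size: int = 30
) -> List[str]:
    """Split text into chunks of at most chunk_size tokens that overlap.

    Assumption: tokens are whitespace-separated words, not model tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be in [0, chunk_size)")

    tokens = text.split()
    step = chunk_size - overlap_size  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        # Stop once the window has reached the end of the text.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

Each chunk repeats the last `overlap_size` tokens of the previous one, so a sentence that straddles a boundary appears in both chunks. With the defaults, the window advances 226 tokens at a time.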
from typing import Dict, List

class DocumentChunker:
    """Handles document splitting with context preservation."""

    def chunk_document(self, text: str) -> List[str]:
        """Split document into chunks."""

    def get_chunk_info(self, text: str) -> Dict:
        """Get chunking statistics."""

class Summarizer:
    """Manages document summarization process."""

    def summarize_text(self, text: str) -> str:
        """Generate summary for text."""

    def process_records(self, texts: List[str]) -> List[str]:
        """Process multiple documents."""
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
For support, please:
- Check the documentation
- Search existing issues
- Create a new issue if needed