A document summarization engine using local LLM inference with Ollama. Processes large documents through intelligent chunking and context-preserving summarization, featuring a Streamlit interface with real-time tracking. Built with Python for API and standalone use.


Document Summarizer

A powerful Python application for intelligent document analysis and summarization using state-of-the-art language models. Features include smart document chunking, iterative summarization, and an intuitive web interface.

System Architecture

```mermaid
graph TB
    subgraph "Frontend Layer"
        A[Streamlit UI]
    end

    subgraph "Application Layer"
        D[Document Chunker]
        E[Summarization Engine]
    end

    subgraph "Service Layer"
        G[Ollama Service]
        H[Token Counter]
        I[Logger]
    end

    A --> D
    A --> E
    E --> G
    D --> H
    E --> H
    D --> I
    E --> I
```

Features

1. Document Processing

  • Smart document chunking with configurable parameters
  • Token-based text splitting for optimal LLM processing
  • Context preservation through sliding window approach
  • Real-time token and character statistics
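The sliding-window idea behind the last two bullets can be sketched in a few lines. This is an illustration, not the project's actual chunker: it treats tokens as a plain list, whereas the real implementation counts model tokens.

```python
def sliding_window_chunks(tokens, chunk_size=256, overlap_size=30):
    """Split a token list into chunks that overlap by overlap_size tokens."""
    step = chunk_size - overlap_size  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last window already reaches the end of the document
    return chunks
```

Each chunk repeats the last `overlap_size` tokens of its predecessor, which is what preserves context across chunk boundaries.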

2. Summarization

  • Multiple summarization strategies based on text length
  • Support for various Ollama models
  • Configurable output parameters
  • Progress tracking and error handling
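One plausible way to choose a strategy from the text length is sketched below; the threshold and strategy names are illustrative, not the project's actual values:

```python
def choose_strategy(token_count, context_limit=2048):
    """Pick a summarization strategy from the input length (illustrative)."""
    if token_count <= context_limit:
        return "single_pass"      # the whole text fits in one prompt
    return "chunk_then_merge"     # summarize chunks, then merge the partial summaries
```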

3. Web Interface

  • Intuitive Streamlit-based UI
  • Real-time processing feedback
  • Configuration management
  • Summary history tracking

Installation

Prerequisites

  • Python 3.11 or higher
  • Ollama with at least one model installed
  • UV package manager (recommended)

Setup

1. Clone the repository:

```shell
git clone https://github.com/palash-jain-cw/DocumentSummariser.git
cd DocumentSummariser
```

2. Install dependencies (UV is recommended for dependency management):

```shell
uv pip install -e .
```

Usage

1. Start the Application

```shell
streamlit run src/documentsummariser/app/Home.py
```

2. Using the API

```python
from documentsummariser.summarisation.summarizer import Summarizer
from documentsummariser.summarisation.document_chunker import DocumentChunker

# Initialize components
chunker = DocumentChunker(chunk_size=256, overlap_size=30)
summarizer = Summarizer(model_name="llama3.2:3b", word_limit=250)

# Split a document into overlapping chunks
chunks = chunker.chunk_document(long_text)

# Generate a summary
summary = summarizer.summarize_text(long_text)
```
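For texts longer than one chunk, a chunk-then-merge loop can be sketched as below. `chunk_fn` and `summarize_fn` stand in for the chunker and the model call; they are assumptions for illustration, not the project's actual signatures.

```python
def summarize_long(text, chunk_fn, summarize_fn):
    """Map-reduce style summarization: summarize chunks, then merge the partials."""
    chunks = chunk_fn(text)
    if len(chunks) == 1:
        return summarize_fn(chunks[0])  # short text: a single pass suffices
    partial_summaries = [summarize_fn(chunk) for chunk in chunks]
    return summarize_fn(" ".join(partial_summaries))  # reduce step
```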

Module Structure

```mermaid
graph LR
    A[documentsummariser] --> B[app]
    A --> C[summarisation]
    A --> D[logger]

    B --> E[Home.py]
    B --> F[Document_Chunker.py]
    B --> G[Summarizer.py]

    C --> H[summarizer.py]
    C --> I[document_chunker.py]

    D --> J[logger_setup.py]
```

Configuration

1. Environment Variables

```shell
OLLAMA_HOST=http://localhost:11434
LOG_LEVEL=INFO
```
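In Python, these variables can be read with the documented defaults as fallbacks; a minimal sketch:

```python
import os

# Fall back to the documented defaults when the variables are unset
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://localhost:11434")
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
```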

2. Application Settings

```python
# Default configuration
config = {
    "chunk_size": 256,
    "overlap_size": 30,
    "model_name": "llama3.2:3b",
    "word_limit": 250,
}
```

API Documentation

1. Document Chunker

```python
from typing import Dict, List

class DocumentChunker:
    """Handles document splitting with context preservation."""

    def chunk_document(self, text: str) -> List[str]:
        """Split the document into overlapping chunks."""

    def get_chunk_info(self, text: str) -> Dict:
        """Return chunking statistics for the text."""
```

2. Summarizer

```python
from typing import List

class Summarizer:
    """Manages the document summarization process."""

    def summarize_text(self, text: str) -> str:
        """Generate a summary for the text."""

    def process_records(self, texts: List[str]) -> List[str]:
        """Summarize multiple documents."""
```

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Create a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

For support, please:

  1. Check the documentation
  2. Search existing issues
  3. Create a new issue if needed
