
Deploy high-performance AI models and inference pipelines on FastAPI with built-in batching, streaming and more.


Deploy AI models and inference pipelines - ⚡ fast


 

LitServe lets you build high-performance AI inference pipelines on top of FastAPI - no boilerplate. Define one or more models, connect vector DBs, stream responses, batch requests, and autoscale on GPUs out of the box.

LitServe is at least 2x faster than plain FastAPI due to AI-specific multi-worker handling.

✅ (2x)+ faster serving  ✅ Easy to use               ✅ LLMs, non-LLMs and more
✅ Bring your own model  ✅ PyTorch/JAX/TF/...        ✅ Built on FastAPI       
✅ GPU autoscaling       ✅ Batching, Streaming       ✅ Self-host or ⚡️ managed
✅ Inference pipeline    ✅ Integrate with vLLM, etc  ✅ Serverless             
   


 

 

Quick start

Install LitServe via pip (more options):

pip install litserve

Define a server

This toy example with two models (an inference pipeline) shows LitServe's flexibility (see real examples):

# server.py
import litserve as ls

# (STEP 1) - DEFINE THE API ("inference" pipeline)
class SimpleLitAPI(ls.LitAPI):
    def setup(self, device):
        # setup is called once at startup. Define the elements of the pipeline here: load models, connect DBs, load data, etc.
        self.model1 = lambda x: x**2
        self.model2 = lambda x: x**3

    def decode_request(self, request):
        # Convert the request payload to model input.
        return request["input"] 

    def predict(self, x):
        # Run the inference pipeline and return the output.
        a = self.model1(x)
        b = self.model2(x)
        return a + b

    def encode_response(self, output):
        # Convert the model output to a response payload.
        return {"output": output} 

# (STEP 2) - START THE SERVER
if __name__ == "__main__":
    # scale with advanced features (batching, GPUs, etc...)
    server = ls.LitServer(SimpleLitAPI(), accelerator="auto", max_batch_size=1)
    server.run(port=8000)

Now run the server anywhere (locally or in the cloud) from the command line.

# Deploy to the cloud of your choice via Lightning AI (serverless, autoscaling, etc.)
lightning serve server.py

# Or run locally (self host anywhere)
lightning serve server.py --local

Learn more about managed hosting on Lightning AI.

You can also run the server manually:

python server.py

Test the server

Simulate an HTTP request (run this in any terminal):

curl -X POST http://127.0.0.1:8000/predict -H "Content-Type: application/json" -d '{"input": 4.0}'
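
If you prefer Python to curl, here is a minimal client sketch using the requests library (it assumes the server above is running locally on port 8000):

# client.py
import requests

# Send one prediction request to the local LitServe server.
response = requests.post(
    "http://127.0.0.1:8000/predict",
    json={"input": 4.0},
)
# With the toy pipeline above, 4**2 + 4**3 = 80, so this prints {"output": 80.0}
print(response.json())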

LLM serving

LitServe isn't limited to LLM engines like vLLM or Ollama; it serves any AI model with full control over the internals (learn more).
For easy LLM serving, integrate vLLM with LitServe, or use LitGPT (built on LitServe).

litgpt serve microsoft/phi-2
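
For full control over the pipeline, a vLLM engine can also be wrapped directly in a LitAPI. Below is a minimal sketch (it assumes the vllm package is installed and a GPU that fits the chosen model; adapt the model name and sampling settings to your needs):

# llm_server.py
import litserve as ls
from vllm import LLM, SamplingParams

class VLLMLitAPI(ls.LitAPI):
    def setup(self, device):
        # Load the model once per worker with vLLM's offline engine.
        self.llm = LLM(model="microsoft/phi-2")
        self.sampling = SamplingParams(temperature=0.7, max_tokens=128)

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        # vLLM returns one RequestOutput per prompt; take its first completion.
        outputs = self.llm.generate([prompt], self.sampling)
        return outputs[0].outputs[0].text

    def encode_response(self, output):
        return {"text": output}

if __name__ == "__main__":
    server = ls.LitServer(VLLMLitAPI(), accelerator="auto")
    server.run(port=8000)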

Summary

  • LitAPI lets you easily build complex AI systems with one or more models (docs).
  • Use the setup method for one-time tasks like loading models, connecting DBs, and loading data (docs).
  • LitServer handles optimizations like batching, GPU autoscaling, streaming, etc... (docs).
  • Self host on your machines or create a fully managed deployment with Lightning (learn more).

Learn how to make this server 200x faster.
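
Most of those optimizations are switched on through LitServer arguments. Here is a minimal sketch that adds batching and extra workers to the quick-start API (argument names follow the LitServe docs, but their exact placement can differ between LitServe releases, so verify against your installed version):

# scaled_server.py
import litserve as ls
from server import SimpleLitAPI  # the API from the quick start above

if __name__ == "__main__":
    server = ls.LitServer(
        SimpleLitAPI(),
        accelerator="auto",    # pick CPU/GPU automatically
        max_batch_size=8,      # group up to 8 requests per inference call
        batch_timeout=0.05,    # wait at most 50 ms to fill a batch
        workers_per_device=2,  # run 2 inference workers per device
    )
    server.run(port=8000)

Note that with max_batch_size greater than 1, predict receives a batch of decoded inputs, so the toy SimpleLitAPI would also need a batch-aware predict (or batch/unbatch hooks) to benefit.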

 

Featured examples

Here are examples of inference pipelines for common model types and use cases.

Toy model:      Hello world
LLMs:           Llama 3.2, LLM Proxy server, Agent with tool use
RAG:            vLLM RAG (Llama 3.2), RAG API (LlamaIndex)
NLP:            Hugging Face, BERT, Text embedding API
Multimodal:     OpenAI CLIP, MiniCPM, Phi-3.5 Vision Instruct, Qwen2-VL, Pixtral
Audio:          Whisper, AudioCraft, StableAudio, Noise cancellation (DeepFilterNet)
Vision:         Stable Diffusion 2, AuraFlow, Flux, Image Super Resolution (Aura SR),
                Background Removal, Control Stable Diffusion (ControlNet)
Speech:         Text-to-speech (XTTS V2), Parler-TTS
Classical ML:   Random forest, XGBoost
Miscellaneous:  Media conversion API (ffmpeg), PyTorch + TensorFlow in one API, LLM proxy server

Browse 100+ community-built templates

 

Hosting options

Self host LitServe anywhere or deploy to your favorite cloud via Lightning AI.

(Demo video: deploy.mp4)

Self-hosting is ideal for hackers, students, and DIY developers, while fully managed hosting is ideal for enterprise developers who need easy autoscaling, security, release management, 99.995% uptime, and observability.

Note: Lightning offers a generous free tier for developers.

To host on Lightning AI, run the command below, log in, and choose the cloud of your choice.

lightning serve server.py

 

Features

Feature                                   Self-managed         Fully Managed on Lightning
Docker-first deployment                   ✅ DIY               ✅ One-click deploy
Cost                                      ✅ Free (DIY)        ✅ Generous free tier with pay as you go
Full control                              ✅                   ✅
Use any engine (vLLM, etc.)               ✅                   ✅ vLLM, Ollama, LitServe, etc.
Own VPC                                   ✅ (manual setup)    ✅ Connect your own VPC
(2x)+ faster than plain FastAPI           ✅                   ✅
Bring your own model                      ✅                   ✅
Build compound systems (1+ models)        ✅                   ✅
GPU autoscaling                           ✅                   ✅
Batching                                  ✅                   ✅
Streaming                                 ✅                   ✅
Worker autoscaling                        ✅                   ✅
Serve all models (LLMs, vision, etc.)     ✅                   ✅
Supports PyTorch, JAX, TF, etc.           ✅                   ✅
OpenAPI compliant                         ✅                   ✅
OpenAI compatibility                      ✅                   ✅
Authentication                            ❌ DIY               ✅ Token, password, custom
GPUs                                      ❌ DIY               ✅ 8+ GPU types, H100s from $1.75
Load balancing                            ❌                   ✅ Built-in
Scale to zero (serverless)                ❌                   ✅ No machine runs when idle
Autoscale up on demand                    ❌                   ✅ Auto scale up/down
Multi-node inference                      ❌                   ✅ Distribute across nodes
Use AWS/GCP credits                       ❌                   ✅ Use existing cloud commits
Versioning                                ❌                   ✅ Make and roll back releases
Enterprise-grade uptime (99.95%)          ❌                   ✅ SLA-backed
SOC2 / HIPAA compliance                   ❌                   ✅ Certified & secure
Observability                             ❌                   ✅ Built-in, connect 3rd-party tools
CI/CD ready                               ❌                   ✅ Lightning SDK
24/7 enterprise support                   ❌                   ✅ Dedicated support
Cost controls & audit logs                ❌                   ✅ Budgets, breakdowns, logs
Debug on GPUs                             ❌                   ✅ Studio integration
20+ features                              -                    -

 

Performance

LitServe is designed for AI workloads. Specialized multi-worker handling delivers a minimum 2x speedup over FastAPI.

Additional features like batching and GPU autoscaling can drive performance well beyond 2x, scaling efficiently to handle more simultaneous requests than FastAPI and TorchServe.

Reproduce the full benchmarks here (higher is better).


These results are for image and text classification tasks. The performance relationships hold for other ML tasks (embedding, LLM serving, audio, segmentation, object detection, summarization, etc.).
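
For a quick sanity check of throughput on your own hardware before running the full benchmark suite, here is a hypothetical concurrent load-test sketch (not the official benchmark; it assumes the quick-start server is running on localhost:8000):

# load_test.py
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://127.0.0.1:8000/predict"
NUM_REQUESTS = 200

def call(_):
    # One POST per request; return the HTTP status code.
    return requests.post(URL, json={"input": 4.0}).status_code

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=32) as pool:
    statuses = list(pool.map(call, range(NUM_REQUESTS)))
elapsed = time.perf_counter() - start

print(f"{NUM_REQUESTS} requests in {elapsed:.2f}s "
      f"({NUM_REQUESTS / elapsed:.1f} req/s), {statuses.count(200)} OK")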

💡 Note on LLM serving: For high-performance LLM serving (like Ollama/vLLM), integrate vLLM with LitServe, use LitGPT, or build your custom vLLM-like server with LitServe. Optimizations like kv-caching, which can be done with LitServe, are needed to maximize LLM performance.

 

Community

LitServe is a community project accepting contributions. Let's build the world's most advanced AI inference engine.

💬 Get help on Discord
📋 License: Apache 2.0