LitServe lets you build high-performance AI inference pipelines on top of FastAPI - no boilerplate. Define one or more models, connect vector DBs, stream responses, batch requests, and autoscale on GPUs out of the box.
LitServe is at least 2x faster than plain FastAPI due to AI-specific multi-worker handling.
✅ (2x)+ faster serving ✅ Easy to use ✅ LLMs, non-LLMs and more ✅ Bring your own model ✅ PyTorch/JAX/TF/... ✅ Built on FastAPI ✅ GPU autoscaling ✅ Batching, Streaming ✅ Self-host or ⚡️ managed ✅ Inference pipeline ✅ Integrate with vLLM, etc. ✅ Serverless
Install LitServe via pip (more options):
pip install litserve
This toy example with 2 models (inference pipeline) shows LitServe's flexibility (see real examples):
# server.py
import litserve as ls


# (STEP 1) - DEFINE THE API ("inference" pipeline)
class SimpleLitAPI(ls.LitAPI):
    def setup(self, device):
        # setup is called once at startup. Define the pipeline elements: load models, connect DBs, load data, etc.
        self.model1 = lambda x: x**2
        self.model2 = lambda x: x**3

    def decode_request(self, request):
        # Convert the request payload to model input.
        return request["input"]

    def predict(self, x):
        # Run the inference pipeline and return the output.
        a = self.model1(x)
        b = self.model2(x)
        c = a + b
        return {"output": c}

    def encode_response(self, output):
        # Convert the model output to a response payload.
        return {"output": output}


# (STEP 2) - START THE SERVER
if __name__ == "__main__":
    # scale with advanced features (batching, GPUs, etc.)
    server = ls.LitServer(SimpleLitAPI(), accelerator="auto", max_batch_size=1)
    server.run(port=8000)
Now run the server anywhere (local or cloud) via the command-line.
# Deploy to the cloud of your choice via Lightning AI (serverless, autoscaling, etc.)
lightning serve server.py
# Or run locally (self host anywhere)
lightning serve server.py --local
Learn more about managed hosting on Lightning AI.
You can also run the server manually:
python server.py
Simulate an HTTP request (run this on any terminal):
curl -X POST http://127.0.0.1:8000/predict -H "Content-Type: application/json" -d '{"input": 4.0}'
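The same request can also be sent from Python. Here is a minimal client sketch using the requests library (an extra dependency used only for illustration), assuming the server defined above is running locally on port 8000:

# client.py - minimal client sketch (requests is an assumption, not part of LitServe)
import requests

# POST the same payload as the curl example to the /predict endpoint.
response = requests.post(
    "http://127.0.0.1:8000/predict",
    json={"input": 4.0},
    timeout=10,
)
# For input 4.0, the toy pipeline computes 4**2 + 4**3 = 80.0.
print(response.json())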
LitServe isn’t just an LLM-serving engine like vLLM or Ollama; it serves any AI model and gives you full control over the internals (learn more).
For easy LLM serving, integrate vLLM with LitServe, or use LitGPT (built on LitServe).
litgpt serve microsoft/phi-2
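As a rough sketch of the vLLM integration mentioned above (not an official recipe), a LitAPI can load a vLLM engine in setup and call it in predict. The model name and sampling parameters below are placeholders:

# vllm_server.py - hedged sketch of wrapping vLLM in a LitAPI (model name is a placeholder)
import litserve as ls
from vllm import LLM, SamplingParams

class VLLMLitAPI(ls.LitAPI):
    def setup(self, device):
        # Load the vLLM engine once at startup.
        self.llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")
        self.sampling = SamplingParams(max_tokens=128)

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        # vLLM returns a list of RequestOutput objects; take the first completion.
        outputs = self.llm.generate([prompt], self.sampling)
        return outputs[0].outputs[0].text

    def encode_response(self, output):
        return {"text": output}

if __name__ == "__main__":
    ls.LitServer(VLLMLitAPI(), accelerator="auto").run(port=8000)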
- LitAPI lets you easily build complex AI systems with one or more models (docs).
- Use the setup method for one-time tasks like connecting models, DBs, and loading data (docs).
- LitServer handles optimizations like batching, GPU autoscaling, streaming, etc. (docs); see the sketch after this list.
- Self host on your machines or create a fully managed deployment with Lightning (learn more).
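The sketch below shows how those optimizations are enabled through LitServer arguments. accelerator and max_batch_size appear in the example above; batch_timeout and workers_per_device are assumed option names, so verify them against the LitServe docs for your version. Note that with max_batch_size > 1, predict receives a batch of inputs, so a batch-aware predict (or batch/unbatch hooks) is needed.

# scale.py - hedged sketch of LitServer scaling options
# (batch_timeout and workers_per_device are assumed names; check the LitServe docs)
import litserve as ls
from server import SimpleLitAPI  # the API defined earlier

if __name__ == "__main__":
    server = ls.LitServer(
        SimpleLitAPI(),
        accelerator="auto",    # pick CPU/GPU automatically (as in the example above)
        max_batch_size=8,      # group up to 8 requests; predict then receives a batch of inputs
        batch_timeout=0.05,    # assumed option: max seconds to wait while a batch fills
        workers_per_device=2,  # assumed option: parallel inference workers per device
    )
    server.run(port=8000)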
Learn how to make this server 200x faster.
Here are examples of inference pipelines for common model types and use cases.
- Toy model: Hello world
- LLMs: Llama 3.2, LLM proxy server, Agent with tool use
- RAG: vLLM RAG (Llama 3.2), RAG API (LlamaIndex)
- NLP: Hugging Face, BERT, Text embedding API
- Multimodal: OpenAI CLIP, MiniCPM, Phi-3.5 Vision Instruct, Qwen2-VL, Pixtral
- Audio: Whisper, AudioCraft, StableAudio, Noise cancellation (DeepFilterNet)
- Vision: Stable Diffusion 2, AuraFlow, Flux, Image Super Resolution (Aura SR), Background Removal, Control Stable Diffusion (ControlNet)
- Speech: Text-to-speech (XTTS V2), Parler-TTS
- Classical ML: Random forest, XGBoost
- Miscellaneous: Media conversion API (ffmpeg), PyTorch + TensorFlow in one API, LLM proxy server
Browse 100+ community-built templates
Self host LitServe anywhere or deploy to your favorite cloud via Lightning AI.
Self-hosting is ideal for hackers, students, and DIY developers, while fully managed hosting is ideal for enterprise developers who need easy autoscaling, security, release management, observability, and 99.995% uptime.
Note: Lightning offers a generous free tier for developers.
To host on Lightning AI, simply run the command, log in, and choose the cloud of your choice.
lightning serve server.py
Feature | Self Managed | Fully Managed on Lightning |
---|---|---|
Docker-first deployment | ✅ DIY | ✅ One-click deploy |
Cost | ✅ Free (DIY) | ✅ Generous free tier with pay as you go |
Full control | ✅ | ✅ |
Use any engine (vLLM, etc.) | ✅ | ✅ vLLM, Ollama, LitServe, etc. |
Own VPC | ✅ (manual setup) | ✅ Connect your own VPC |
(2x)+ faster than plain FastAPI | ✅ | ✅ |
Bring your own model | ✅ | ✅ |
Build compound systems (1+ models) | ✅ | ✅ |
GPU autoscaling | ✅ | ✅ |
Batching | ✅ | ✅ |
Streaming | ✅ | ✅ |
Worker autoscaling | ✅ | ✅ |
Serve all models: (LLMs, vision, etc.) | ✅ | ✅ |
Supports PyTorch, JAX, TF, etc... | ✅ | ✅ |
OpenAPI compliant | ✅ | ✅ |
OpenAI compatibility | ✅ | ✅ |
Authentication | ❌ DIY | ✅ Token, password, custom |
GPUs | ❌ DIY | ✅ 8+ GPU types, H100s from $1.75 |
Load balancing | ❌ | ✅ Built-in |
Scale to zero (serverless) | ❌ | ✅ No machine runs when idle |
Autoscale up on demand | ❌ | ✅ Auto scale up/down |
Multi-node inference | ❌ | ✅ Distribute across nodes |
Use AWS/GCP credits | ❌ | ✅ Use existing cloud commits |
Versioning | ❌ | ✅ Make and roll back releases |
Enterprise-grade uptime (99.95%) | ❌ | ✅ SLA-backed |
SOC2 / HIPAA compliance | ❌ | ✅ Certified & secure |
Observability | ❌ | ✅ Built-in, connect 3rd party tools |
CI/CD ready | ❌ | ✅ Lightning SDK |
24/7 enterprise support | ❌ | ✅ Dedicated support |
Cost controls & audit logs | ❌ | ✅ Budgets, breakdowns, logs |
Debug on GPUs | ❌ | ✅ Studio integration |
20+ features | - | - |
LitServe is designed for AI workloads. Specialized multi-worker handling delivers a minimum 2x speedup over FastAPI.
Additional features like batching and GPU autoscaling can drive performance well beyond 2x, scaling efficiently to handle more simultaneous requests than FastAPI and TorchServe.
Reproduce the full benchmarks here (higher is better).
These results are for image and text classification ML tasks. The performance relationships hold for other ML tasks (embedding, LLM serving, audio, segmentation, object detection, summarization etc...).
💡 Note on LLM serving: For high-performance LLM serving (like Ollama/vLLM), integrate vLLM with LitServe, use LitGPT, or build your custom vLLM-like server with LitServe. Optimizations like kv-caching, which can be done with LitServe, are needed to maximize LLM performance.
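For token-by-token responses, LitServe supports streaming: predict and encode_response become generators. The sketch below assumes stream=True is the LitServer flag that enables this mode and uses a dummy token generator in place of a real LLM; check the LitServe streaming docs for the exact API in your version.

# stream_server.py - hedged streaming sketch (stream=True and the generator pattern follow
# the LitServe streaming docs; verify against your installed version)
import litserve as ls

class StreamLitAPI(ls.LitAPI):
    def setup(self, device):
        # Placeholder "model": yields fake tokens instead of running a real LLM.
        self.model = lambda prompt: (f"token-{i} " for i in range(10))

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        # Yield tokens one at a time instead of returning a single result.
        yield from self.model(prompt)

    def encode_response(self, output):
        # output is a generator of tokens; stream each one back to the client.
        for token in output:
            yield {"output": token}

if __name__ == "__main__":
    server = ls.LitServer(StreamLitAPI(), stream=True)
    server.run(port=8000)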
LitServe is a community project accepting contributions. Let's make the world's most advanced AI inference engine.