AI Wisdom
🚀

Model Serving

Inference servers, gateways, and deployment platforms for running models in production.

4 Graduated · 4 Incubating · 8 total
โ† All categories

vLLM

Graduated
5/5

High-throughput LLM inference server with PagedAttention

The gold standard for self-hosted model serving. PagedAttention makes KV cache management memory-efficient. Supports continuous batching and an OpenAI-compatible API. Run it on GPUs.
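
A minimal sketch of calling a running vLLM server through that OpenAI-compatible API, assuming it was started with vllm serve on the default port; the model name is illustrative:

```python
# Sketch: query a local vLLM server through its OpenAI-compatible API.
# Assumes the server was started with something like:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct
# (illustrative model name; vLLM's default port is 8000)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
)
print(response.choices[0].message.content)
```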

Ollama

Incubating
4/5

Run open-weight LLMs locally with one-command setup

Best developer experience for running models locally. Pull a model, run it, and get an OpenAI-compatible API in one command. Essential for offline development and demo environments.

Open Source
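
Because the API is OpenAI-compatible, the same client code works against a local Ollama daemon; a minimal sketch, assuming the model has already been pulled (model name illustrative):

```python
# Sketch: call a locally pulled model via Ollama's OpenAI-compatible
# endpoint. Assumes `ollama pull llama3` has been run and the Ollama
# daemon is up on its default port (11434).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Hello from an offline dev box."}],
)
print(response.choices[0].message.content)
```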

LiteLLM

Graduated
5/5

100+ LLM providers behind a single OpenAI-compatible API

Essential for multi-provider strategies. LiteLLM Proxy adds cost tracking, rate limiting, and fallback routing. Use it as a team-wide LLM gateway to avoid provider lock-in.
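
A sketch of the single-API idea via the litellm SDK; the model strings are illustrative and the matching provider keys must be set in the environment:

```python
# Sketch: one call shape across providers with litellm. Model strings
# are illustrative; set the matching API keys (OPENAI_API_KEY,
# ANTHROPIC_API_KEY, ...) in the environment first.
from litellm import completion

messages = [{"role": "user", "content": "Ping"}]

# Same function for every provider; only the model string changes.
openai_resp = completion(model="gpt-4o-mini", messages=messages)
claude_resp = completion(model="anthropic/claude-3-5-sonnet-20240620", messages=messages)

print(openai_resp.choices[0].message.content)
print(claude_resp.choices[0].message.content)
```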

NVIDIA Triton

Graduated
4/5

NVIDIA's production model serving platform: any framework, any GPU

Enterprise-grade serving for HPC/GPU clusters. Supports simultaneous deployment of multiple models. High operational complexity; best for dedicated ML platform teams.

Open Source
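
A sketch of the client side, using Triton's official Python HTTP client against an already-deployed model; the model and tensor names are illustrative and must match the model's config.pbtxt:

```python
# Sketch: call a model already deployed to a Triton server using the
# official HTTP client (pip install tritonclient[http]).
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Illustrative image-shaped batch; tensor names depend on config.pbtxt.
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
model_input = httpclient.InferInput("input__0", batch.shape, "FP32")
model_input.set_data_from_numpy(batch)

result = client.infer(model_name="resnet50", inputs=[model_input])
print(result.as_numpy("output__0").shape)
```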

BentoML

Incubating
3/5

Build, ship, and scale AI applications with Python

Great for wrapping models as services with minimal boilerplate. BentoCloud handles autoscaling. Simpler than Triton for most use cases but lacks fine-grained batching control.
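
A sketch of how little boilerplate a BentoML 1.2+ style service needs; the class and endpoint names are illustrative:

```python
# Sketch of a BentoML 1.2+ style service; names are illustrative.
# Serve locally with: bentoml serve service:Summariser
import bentoml

@bentoml.service
class Summariser:
    @bentoml.api
    def summarise(self, text: str) -> str:
        # A real service would call a loaded model here; truncation
        # stands in for brevity.
        return text[:100]
```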

Ray Serve

Incubating
4/5

Scalable model serving on Ray distributed compute

Best choice when you already use Ray for training or data pipelines. The composition API enables complex serving graphs. Steep learning curve if you are not already in the Ray ecosystem.

Open Source
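
A sketch of that composition API with a two-stage graph; the deployment names are illustrative:

```python
# Sketch of a two-stage Ray Serve graph; names are illustrative.
# Needs: pip install "ray[serve]"
from ray import serve

@serve.deployment
class Preprocessor:
    def __call__(self, text: str) -> str:
        return text.strip().lower()

@serve.deployment
class Model:
    def __init__(self, preprocessor):
        self.preprocessor = preprocessor  # handle to the upstream deployment

    async def __call__(self, text: str) -> str:
        cleaned = await self.preprocessor.remote(text)
        return f"prediction for: {cleaned}"

# Compose the graph and run it; serve.run returns a handle to the ingress.
handle = serve.run(Model.bind(Preprocessor.bind()))
print(handle.remote("  Hello Ray  ").result())
```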

MLflow

Incubating
4/5

Open-source MLOps platform: tracking, registry, and serving

Industry standard for experiment tracking and model registry. MLflow Models lets you deploy to a REST endpoint cleanly. Integrates with Databricks for enterprise scale.

Open Source
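
A sketch of the track-register-serve flow; the parameter and model names are illustrative:

```python
# Sketch: log, register, and then serve a model with MLflow.
# The run parameters and registered model name are illustrative.
import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

with mlflow.start_run():
    mlflow.log_param("max_iter", 200)
    mlflow.sklearn.log_model(model, "model", registered_model_name="iris-clf")

# Then expose the registered model as a REST endpoint:
#   mlflow models serve -m "models:/iris-clf/1" -p 5001
```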

TensorRT-LLM

Graduated
4/5

NVIDIA's optimised inference library for LLMs on GPUs

Highest raw throughput for NVIDIA GPUs. Kernel fusion, INT4/FP8 quantisation, and in-flight batching push tokens/sec to the limit. Pairs with Triton for production serving. Requires NVIDIA hardware.

Open Source
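
A sketch of TensorRT-LLM's high-level LLM API as found in recent releases; the model name is illustrative, and this API surface changes between versions, so treat it as a shape rather than a recipe:

```python
# Sketch of the high-level tensorrt_llm LLM API (recent releases).
# Requires an NVIDIA GPU and the tensorrt_llm wheel; model name is
# illustrative.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # builds or loads a TRT engine
params = SamplingParams(max_tokens=64, temperature=0.7)

for output in llm.generate(["Explain in-flight batching briefly."], params):
    print(output.outputs[0].text)
```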