AI Wisdom
🚀

Model Serving

Inference servers, gateways, and deployment platforms for running models in production.

4 Graduated · 4 Incubating · 8 total
โ† All categories

vLLM

Graduated
5/5

High-throughput LLM inference server with PagedAttention

The gold standard for self-hosted model serving. PagedAttention makes KV cache management memory-efficient. Supports continuous batching and an OpenAI-compatible API. Run it on GPUs.
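
A minimal sketch of calling a running vLLM server through that OpenAI-compatible API, assuming it was started with vllm serve on the default port; the model name is illustrative:

```python
# Sketch: query a local vLLM server through its OpenAI-compatible API.
# Assumes the server was started with something like:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct
# (illustrative model name; vLLM's default port is 8000)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
)
print(response.choices[0].message.content)
```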

Ollama

Incubating
4/5

Run open-weight LLMs locally with one-command setup

Best developer experience for running models locally. Pull a model, run it, and get an OpenAI-compatible API in one command. Essential for offline development and demo environments.

Open Source
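
Because the API is OpenAI-compatible, the same client code works against a local Ollama daemon; a minimal sketch, assuming the model has already been pulled (model name illustrative):

```python
# Sketch: call a locally pulled model via Ollama's OpenAI-compatible
# endpoint. Assumes `ollama pull llama3` has been run and the Ollama
# daemon is up on its default port (11434).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Hello from an offline dev box."}],
)
print(response.choices[0].message.content)
```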

LiteLLM

Graduated
5/5

100+ LLM providers behind a single OpenAI-compatible API

Essential for multi-provider strategies. LiteLLM Proxy adds cost tracking, rate limiting, and fallback routing. Use it as a team-wide LLM gateway to avoid provider lock-in.
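
A sketch of the single-API idea via the litellm SDK; the model strings are illustrative and the matching provider keys must be set in the environment:

```python
# Sketch: one call shape across providers with litellm. Model strings
# are illustrative; set the matching API keys (OPENAI_API_KEY,
# ANTHROPIC_API_KEY, ...) in the environment first.
from litellm import completion

messages = [{"role": "user", "content": "Ping"}]

# Same function for every provider; only the model string changes.
openai_resp = completion(model="gpt-4o-mini", messages=messages)
claude_resp = completion(model="anthropic/claude-3-5-sonnet-20240620", messages=messages)

print(openai_resp.choices[0].message.content)
print(claude_resp.choices[0].message.content)
```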

NVIDIA Triton

Graduated
4/5

NVIDIA's production model serving platform: any framework, any GPU

Enterprise-grade serving for HPC/GPU clusters. Supports simultaneous deployment of multiple models. High operational complexity; best for dedicated ML platform teams.

Open Source
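
A sketch of the client side, using Triton's official Python HTTP client against an already-deployed model; the model and tensor names are illustrative and must match the model's config.pbtxt:

```python
# Sketch: call a model already deployed to a Triton server using the
# official HTTP client (pip install tritonclient[http]).
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Illustrative image-shaped batch; tensor names depend on config.pbtxt.
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
model_input = httpclient.InferInput("input__0", batch.shape, "FP32")
model_input.set_data_from_numpy(batch)

result = client.infer(model_name="resnet50", inputs=[model_input])
print(result.as_numpy("output__0").shape)
```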

BentoML

Incubating
3/5

Build, ship, and scale AI applications with Python

Great for wrapping models as services with minimal boilerplate. BentoCloud handles autoscaling. Simpler than Triton for most use cases but lacks fine-grained batching control.
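
A sketch of how little boilerplate a BentoML 1.2+ style service needs; the class and endpoint names are illustrative:

```python
# Sketch of a BentoML 1.2+ style service; names are illustrative.
# Serve locally with: bentoml serve service:Summariser
import bentoml

@bentoml.service
class Summariser:
    @bentoml.api
    def summarise(self, text: str) -> str:
        # A real service would call a loaded model here; truncation
        # stands in for brevity.
        return text[:100]
```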

Ray Serve

Incubating
4/5

Scalable model serving on Ray distributed compute

Best choice when you already use Ray for training or data pipelines. The composition API enables complex serving graphs. Steep learning curve if you are not already in the Ray ecosystem.

Open Source
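
A sketch of that composition API with a two-stage graph; the deployment names are illustrative:

```python
# Sketch of a two-stage Ray Serve graph; names are illustrative.
# Needs: pip install "ray[serve]"
from ray import serve

@serve.deployment
class Preprocessor:
    def __call__(self, text: str) -> str:
        return text.strip().lower()

@serve.deployment
class Model:
    def __init__(self, preprocessor):
        self.preprocessor = preprocessor  # handle to the upstream deployment

    async def __call__(self, text: str) -> str:
        cleaned = await self.preprocessor.remote(text)
        return f"prediction for: {cleaned}"

# Compose the graph and run it; serve.run returns a handle to the ingress.
handle = serve.run(Model.bind(Preprocessor.bind()))
print(handle.remote("  Hello Ray  ").result())
```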

MLflow

Incubating
4/5

Open-source MLOps platform: tracking, registry, and serving

Industry standard for experiment tracking and model registry. MLflow Models lets you deploy to a REST endpoint cleanly. Integrates with Databricks for enterprise scale.

Open Source
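
A sketch of the track-register-serve flow; the parameter and model names are illustrative:

```python
# Sketch: log, register, and then serve a model with MLflow.
# The run parameters and registered model name are illustrative.
import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

with mlflow.start_run():
    mlflow.log_param("max_iter", 200)
    mlflow.sklearn.log_model(model, "model", registered_model_name="iris-clf")

# Then expose the registered model as a REST endpoint:
#   mlflow models serve -m "models:/iris-clf/1" -p 5001
```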

TensorRT-LLM

Graduated
4/5

NVIDIA's optimised inference library for LLMs on GPUs

Highest raw throughput for NVIDIA GPUs. Kernel fusion, INT4/FP8 quantisation, and in-flight batching push tokens/sec to the limit. Pairs with Triton for production serving. Requires NVIDIA hardware.

Open Source
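
A sketch of TensorRT-LLM's high-level LLM API as found in recent releases; the model name is illustrative, and this API surface changes between versions, so treat it as a shape rather than a recipe:

```python
# Sketch of the high-level tensorrt_llm LLM API (recent releases).
# Requires an NVIDIA GPU and the tensorrt_llm wheel; model name is
# illustrative.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # builds or loads a TRT engine
params = SamplingParams(max_tokens=64, temperature=0.7)

for output in llm.generate(["Explain in-flight batching briefly."], params):
    print(output.outputs[0].text)
```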