LLaVA-NeXT
Incubating: Leading open-source vision-language model with strong reasoning
Best open-source VLM for general-purpose vision understanding. Strong at image captioning, VQA, and document analysis. Multiple size variants from 7B to 72B. Active research community.
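A minimal sketch of captioning/VQA with a LLaVA-NeXT checkpoint via Hugging Face transformers. The model id and the Mistral-style prompt template are assumptions; check the llava-hf model card for the variant you deploy.

```python
# Minimal LLaVA-NeXT inference sketch, assuming the llava-hf 7B Mistral
# variant and its "[INST] ... [/INST]" template; other variants differ.
import torch
from PIL import Image
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # assumed variant
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("chart.png")
prompt = "[INST] <image>\nSummarise this chart in two sentences. [/INST]"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```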
Pixtral Large
Incubating: Mistral's 124B vision-language model with 128K context
Strong vision-language capabilities at 124B parameters. Handles multiple images in a single prompt. Good for document understanding and multi-page analysis. EU data residency.
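A sketch of a multi-image prompt against Mistral's hosted API. The "pixtral-large-latest" model id is an assumption; verify it and the mistralai client version (v1.x shown) against Mistral's documentation.

```python
# Multi-image prompt sketch against Mistral's chat API.
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
response = client.chat.complete(
    model="pixtral-large-latest",  # assumed model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare these two invoice pages and flag discrepancies."},
            {"type": "image_url", "image_url": "https://example.com/page1.png"},
            {"type": "image_url", "image_url": "https://example.com/page2.png"},
        ],
    }],
)
print(response.choices[0].message.content)
```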
InternVL 2.5
Incubating: Open-source VLM rivalling GPT-4o on vision benchmarks
Closest open-source model to GPT-4o vision quality. Strong OCR, chart understanding, and mathematical reasoning from images. Multiple sizes available for different compute budgets.
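A rough sketch of querying InternVL 2.5 through transformers' remote-code path. The model id, the custom .chat() helper, and the single 448x448 tile shown are assumptions taken from the OpenGVLab model cards, which also describe a dynamic tiling scheme for large images that this omits.

```python
# Simplified InternVL 2.5 inference sketch (single image tile).
import torch
import torchvision.transforms as T
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "OpenGVLab/InternVL2_5-8B"  # assumed size variant
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()

transform = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
pixel_values = transform(Image.open("table.png").convert("RGB"))
pixel_values = pixel_values.unsqueeze(0).to(torch.bfloat16).cuda()

# .chat() is a helper defined in the repository's remote code,
# not part of core transformers.
answer = model.chat(tokenizer, pixel_values,
                    "<image>\nExtract this table as Markdown.",
                    dict(max_new_tokens=512))
print(answer)
```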
Qwen-VL-Max
Incubating: Alibaba's flagship vision-language model with video understanding
Excellent at OCR, document parsing, and video understanding. Supports interleaved image-text inputs. Best multilingual VLM for CJK content. Available via DashScope API.
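A sketch using DashScope's OpenAI-compatible endpoint. The base URL shown is the international one (mainland China uses a different host), and the "qwen-vl-max" model id is an assumption to verify against Alibaba's docs.

```python
# Qwen-VL-Max OCR sketch via DashScope's OpenAI-compatible mode.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # assumed
)
response = client.chat.completions.create(
    model="qwen-vl-max",  # assumed model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/receipt.jpg"}},
            {"type": "text", "text": "Transcribe all text on this receipt."},
        ],
    }],
)
print(response.choices[0].message.content)
```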
Florence-2
Graduated: Microsoft's unified vision foundation model for multiple tasks
Versatile vision model handling captioning, detection, segmentation, and OCR in one model. Prompt-based task switching. Excellent for building multi-task vision pipelines.
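A sketch of Florence-2's prompt-based task switching: one checkpoint, with the task token deciding whether you get a caption, object detection, or OCR. The task tokens follow the microsoft/Florence-2 model card and its remote code.

```python
# Florence-2 task switching via task tokens.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True
).eval().cuda()

image = Image.open("street.jpg")
for task in ("<CAPTION>", "<OD>", "<OCR>"):  # caption, detection, OCR
    inputs = processor(text=task, images=image,
                       return_tensors="pt").to("cuda", torch.float16)
    ids = model.generate(input_ids=inputs["input_ids"],
                         pixel_values=inputs["pixel_values"],
                         max_new_tokens=512)
    raw = processor.batch_decode(ids, skip_special_tokens=False)[0]
    # post_process_generation is defined in the repo's remote code
    print(task, processor.post_process_generation(
        raw, task=task, image_size=(image.width, image.height)))
```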
Molmo
Sandbox: Allen AI's fully open VLM with pointing and grounding
Truly open VLM with open weights, data, and code. Unique pointing capability for spatial grounding. Good for academic research and applications requiring full transparency.
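A sketch of Molmo's pointing behaviour: asked to "point", it emits point tags with coordinates. The model id, the generate_from_batch helper, and the 0-100 coordinate scale follow the allenai model card and are assumptions here.

```python
# Molmo pointing sketch; coordinates are parsed from the generated tags.
import re
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

model_id = "allenai/Molmo-7B-D-0924"  # assumed variant
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True,
                                          torch_dtype="auto", device_map="auto")
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True,
                                             torch_dtype="auto", device_map="auto")

inputs = processor.process(images=[Image.open("shelf.jpg")],
                           text="Point to every price tag.")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

# generate_from_batch is a helper defined in the repo's remote code
output = model.generate_from_batch(
    inputs, GenerationConfig(max_new_tokens=256, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
text = processor.tokenizer.decode(output[0, inputs["input_ids"].size(1):],
                                  skip_special_tokens=True)
# Assumed tag shape: <point x="..." y="..."> on a 0-100 scale.
print(re.findall(r'x\d*="([\d.]+)" y\d*="([\d.]+)"', text))
```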
CogVLM2
Sandbox: Zhipu AI's vision-language model with video understanding
Strong at image and video understanding with multi-turn conversation. Good Chinese language support in VLM context. Open weights enable self-hosted multimodal applications.
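A condensed sketch of a self-hosted CogVLM2 chat turn, loosely following the pattern on the THUDM model card; build_conversation_input_ids comes from the repo's remote code, and the model id and input layout below are assumptions to check against the card.

```python
# Self-hosted CogVLM2 single-turn sketch.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "THUDM/cogvlm2-llama3-chat-19B"  # assumed variant
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()

image = Image.open("frame.jpg").convert("RGB")
built = model.build_conversation_input_ids(  # helper from the remote code
    tokenizer, query="What is happening in this frame?",
    history=[], images=[image],
)
inputs = {
    "input_ids": built["input_ids"].unsqueeze(0).cuda(),
    "token_type_ids": built["token_type_ids"].unsqueeze(0).cuda(),
    "attention_mask": built["attention_mask"].unsqueeze(0).cuda(),
    "images": [[built["images"][0].cuda().to(torch.bfloat16)]],
}
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```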
GLM-OCR
Sandbox: Zhipu AI's specialised model for document OCR and extraction
Specialised OCR model that handles complex layouts, tables, and multi-language documents. Good for document digitisation pipelines. Lighter than general VLMs for pure OCR tasks.
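A heavily hedged sketch of calling a Zhipu multimodal endpoint for page extraction. The call shape follows the zhipuai SDK's GLM-4V interface; the "glm-ocr" model id is a placeholder assumption, so confirm the actual id and response format against Zhipu's API reference.

```python
# Document OCR call sketch against Zhipu's API (model id is assumed).
import base64
import os
from zhipuai import ZhipuAI

client = ZhipuAI(api_key=os.environ["ZHIPUAI_API_KEY"])
with open("contract_page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="glm-ocr",  # assumed/placeholder model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_b64}},
            {"type": "text", "text": "Extract all text and tables from this page."},
        ],
    }],
)
print(response.choices[0].message.content)
```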
PaliGemma 2
Incubating: Google's vision-language model built for fine-tuning and transfer learning
Excellent base for fine-tuning custom vision tasks. Strong transfer learning to detection, segmentation, and VQA. Multiple sizes for different deployment targets.
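A sketch of loading a PaliGemma 2 pretrained checkpoint as a transfer-learning starting point. The model id is an assumption, the "pt" checkpoints are meant to be fine-tuned rather than prompted directly, and freezing the vision tower is just one common recipe, not the only option.

```python
# PaliGemma 2 as a fine-tuning base: load, freeze the vision tower,
# and sanity-check with a caption prompt.
import torch
from PIL import Image
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

model_id = "google/paligemma2-3b-pt-224"  # assumed 3B, 224px variant
processor = PaliGemmaProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Train only the language side; a lightweight transfer-learning recipe.
for p in model.vision_tower.parameters():
    p.requires_grad = False

inputs = processor(text="<image>caption en", images=Image.open("photo.jpg"),
                   return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(out[0], skip_special_tokens=True))
```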
BLIP-3 / xGen-MM
Sandbox: Salesforce's multimodal model for enterprise vision tasks
Strong at visual grounding, image captioning, and structured extraction. Good for enterprise workflows requiring custom vision-language pipelines. Apache 2.0 licensed.
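A sketch of loading xGen-MM (BLIP-3) through transformers' remote code, loosely following the Salesforce model card; the model id, the update_special_tokens step, and the Phi-3-style prompt template are assumptions to verify against the card.

```python
# xGen-MM (BLIP-3) structured-extraction sketch.
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForVision2Seq, AutoTokenizer

model_id = "Salesforce/xgen-mm-phi3-mini-instruct-r-v1"  # assumed variant
model = AutoModelForVision2Seq.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True,
                                          use_fast=False, legacy=False)
image_processor = AutoImageProcessor.from_pretrained(model_id,
                                                     trust_remote_code=True)
tokenizer = model.update_special_tokens(tokenizer)  # helper from remote code

image = Image.open("invoice.png")
inputs = image_processor([image], return_tensors="pt")
prompt = "<|user|>\n<image>\nList the line items as JSON.<|end|>\n<|assistant|>\n"
inputs.update(tokenizer([prompt], return_tensors="pt"))
out = model.generate(**inputs, image_size=[image.size],
                     pad_token_id=tokenizer.pad_token_id, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```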