LLaVA-NeXT
Incubating: Leading open-source vision-language model with strong reasoning
Best open-source VLM for general-purpose vision understanding. Strong at image captioning, VQA, and document analysis. Multiple size variants from 7B to 72B. Active research community.
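A minimal sketch of captioning/VQA with a LLaVA-NeXT checkpoint via Hugging Face transformers. The model id and the Mistral-style prompt template are assumptions; check the llava-hf model card for the variant you deploy.

```python
# Minimal LLaVA-NeXT inference sketch, assuming the llava-hf 7B Mistral
# variant and its "[INST] ... [/INST]" template; other variants differ.
import torch
from PIL import Image
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # assumed variant
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("chart.png")
prompt = "[INST] <image>\nSummarise this chart in two sentences. [/INST]"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```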
Pixtral Large
Incubating: Mistral's 124B vision-language model with 128K context
Strong vision-language capabilities at 124B parameters. Handles multiple images in a single prompt. Good for document understanding and multi-page analysis. EU data residency.
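A sketch of a multi-image prompt against Mistral's hosted API. The "pixtral-large-latest" model id is an assumption; verify it and the mistralai client version (v1.x shown) against Mistral's documentation.

```python
# Multi-image prompt sketch against Mistral's chat API.
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
response = client.chat.complete(
    model="pixtral-large-latest",  # assumed model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare these two invoice pages and flag discrepancies."},
            {"type": "image_url", "image_url": "https://example.com/page1.png"},
            {"type": "image_url", "image_url": "https://example.com/page2.png"},
        ],
    }],
)
print(response.choices[0].message.content)
```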
InternVL 2.5
Incubating: Open-source VLM rivalling GPT-4o on vision benchmarks
Closest open-source model to GPT-4o vision quality. Strong OCR, chart understanding, and mathematical reasoning from images. Multiple sizes available for different compute budgets.
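A rough sketch of querying InternVL 2.5 through transformers' remote-code path. The model id, the custom .chat() helper, and the single 448x448 tile shown are assumptions taken from the OpenGVLab model cards, which also describe a dynamic tiling scheme for large images that this omits.

```python
# Simplified InternVL 2.5 inference sketch (single image tile).
import torch
import torchvision.transforms as T
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "OpenGVLab/InternVL2_5-8B"  # assumed size variant
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()

transform = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
pixel_values = transform(Image.open("table.png").convert("RGB"))
pixel_values = pixel_values.unsqueeze(0).to(torch.bfloat16).cuda()

# .chat() is a helper defined in the repository's remote code,
# not part of core transformers.
answer = model.chat(tokenizer, pixel_values,
                    "<image>\nExtract this table as Markdown.",
                    dict(max_new_tokens=512))
print(answer)
```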
Qwen-VL-Max
Incubating: Alibaba's flagship vision-language model with video understanding
Excellent at OCR, document parsing, and video understanding. Supports interleaved image-text inputs. Best multilingual VLM for CJK content. Available via DashScope API.
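A sketch using DashScope's OpenAI-compatible endpoint. The base URL shown is the international one (mainland China uses a different host), and the "qwen-vl-max" model id is an assumption to verify against Alibaba's docs.

```python
# Qwen-VL-Max OCR sketch via DashScope's OpenAI-compatible mode.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # assumed
)
response = client.chat.completions.create(
    model="qwen-vl-max",  # assumed model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/receipt.jpg"}},
            {"type": "text", "text": "Transcribe all text on this receipt."},
        ],
    }],
)
print(response.choices[0].message.content)
```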
Florence-2
Graduated: Microsoft's unified vision foundation model for multiple tasks
Versatile vision model handling captioning, detection, segmentation, and OCR in one model. Prompt-based task switching. Excellent for building multi-task vision pipelines.
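A sketch of Florence-2's prompt-based task switching: one checkpoint, with the task token deciding whether you get a caption, object detection, or OCR. The task tokens follow the microsoft/Florence-2 model card and its remote code.

```python
# Florence-2 task switching via task tokens.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True
).eval().cuda()

image = Image.open("street.jpg")
for task in ("<CAPTION>", "<OD>", "<OCR>"):  # caption, detection, OCR
    inputs = processor(text=task, images=image,
                       return_tensors="pt").to("cuda", torch.float16)
    ids = model.generate(input_ids=inputs["input_ids"],
                         pixel_values=inputs["pixel_values"],
                         max_new_tokens=512)
    raw = processor.batch_decode(ids, skip_special_tokens=False)[0]
    # post_process_generation is defined in the repo's remote code
    print(task, processor.post_process_generation(
        raw, task=task, image_size=(image.width, image.height)))
```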
Molmo
Sandbox: Allen AI's fully open VLM with pointing and grounding
Truly open VLM with open weights, data, and code. Unique pointing capability for spatial grounding. Good for academic research and applications requiring full transparency.
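A sketch of Molmo's pointing behaviour: asked to "point", it emits point tags with coordinates. The model id, the generate_from_batch helper, and the 0-100 coordinate scale follow the allenai model card and are assumptions here.

```python
# Molmo pointing sketch; coordinates are parsed from the generated tags.
import re
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

model_id = "allenai/Molmo-7B-D-0924"  # assumed variant
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True,
                                          torch_dtype="auto", device_map="auto")
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True,
                                             torch_dtype="auto", device_map="auto")

inputs = processor.process(images=[Image.open("shelf.jpg")],
                           text="Point to every price tag.")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

# generate_from_batch is a helper defined in the repo's remote code
output = model.generate_from_batch(
    inputs, GenerationConfig(max_new_tokens=256, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
text = processor.tokenizer.decode(output[0, inputs["input_ids"].size(1):],
                                  skip_special_tokens=True)
# Assumed tag shape: <point x="..." y="..."> on a 0-100 scale.
print(re.findall(r'x\d*="([\d.]+)" y\d*="([\d.]+)"', text))
```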
CogVLM2
Sandbox: Zhipu AI's vision-language model with video understanding
Strong at image and video understanding with multi-turn conversation. Good Chinese language support in VLM context. Open weights enable self-hosted multimodal applications.
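A condensed sketch of a self-hosted CogVLM2 chat turn, loosely following the pattern on the THUDM model card; build_conversation_input_ids comes from the repo's remote code, and the model id and input layout below are assumptions to check against the card.

```python
# Self-hosted CogVLM2 single-turn sketch.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "THUDM/cogvlm2-llama3-chat-19B"  # assumed variant
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()

image = Image.open("frame.jpg").convert("RGB")
built = model.build_conversation_input_ids(  # helper from the remote code
    tokenizer, query="What is happening in this frame?",
    history=[], images=[image],
)
inputs = {
    "input_ids": built["input_ids"].unsqueeze(0).cuda(),
    "token_type_ids": built["token_type_ids"].unsqueeze(0).cuda(),
    "attention_mask": built["attention_mask"].unsqueeze(0).cuda(),
    "images": [[built["images"][0].cuda().to(torch.bfloat16)]],
}
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```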
GLM-OCR
Sandbox: Zhipu AI's specialised model for document OCR and extraction
Specialised OCR model that handles complex layouts, tables, and multi-language documents. Good for document digitisation pipelines. Lighter than general VLMs for pure OCR tasks.
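A heavily hedged sketch of calling a Zhipu multimodal endpoint for page extraction. The call shape follows the zhipuai SDK's GLM-4V interface; the "glm-ocr" model id is a placeholder assumption, so confirm the actual id and response format against Zhipu's API reference.

```python
# Document OCR call sketch against Zhipu's API (model id is assumed).
import base64
import os
from zhipuai import ZhipuAI

client = ZhipuAI(api_key=os.environ["ZHIPUAI_API_KEY"])
with open("contract_page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="glm-ocr",  # assumed/placeholder model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_b64}},
            {"type": "text", "text": "Extract all text and tables from this page."},
        ],
    }],
)
print(response.choices[0].message.content)
```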
PaliGemma 2
Incubating: Google's vision-language model built for fine-tuning and transfer learning
Excellent base for fine-tuning custom vision tasks. Strong transfer learning to detection, segmentation, and VQA. Multiple sizes for different deployment targets.
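A sketch of loading a PaliGemma 2 pretrained checkpoint as a transfer-learning starting point. The model id is an assumption, the "pt" checkpoints are meant to be fine-tuned rather than prompted directly, and freezing the vision tower is just one common recipe, not the only option.

```python
# PaliGemma 2 as a fine-tuning base: load, freeze the vision tower,
# and sanity-check with a caption prompt.
import torch
from PIL import Image
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

model_id = "google/paligemma2-3b-pt-224"  # assumed 3B, 224px variant
processor = PaliGemmaProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Train only the language side; a lightweight transfer-learning recipe.
for p in model.vision_tower.parameters():
    p.requires_grad = False

inputs = processor(text="<image>caption en", images=Image.open("photo.jpg"),
                   return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(out[0], skip_special_tokens=True))
```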
BLIP-3 / xGen-MM
Sandbox: Salesforce's multimodal model for enterprise vision tasks
Strong at visual grounding, image captioning, and structured extraction. Good for enterprise workflows requiring custom vision-language pipelines. Apache 2.0 licensed.
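A sketch of loading xGen-MM (BLIP-3) through transformers' remote code, loosely following the Salesforce model card; the model id, the update_special_tokens step, and the Phi-3-style prompt template are assumptions to verify against the card.

```python
# xGen-MM (BLIP-3) structured-extraction sketch.
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForVision2Seq, AutoTokenizer

model_id = "Salesforce/xgen-mm-phi3-mini-instruct-r-v1"  # assumed variant
model = AutoModelForVision2Seq.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True,
                                          use_fast=False, legacy=False)
image_processor = AutoImageProcessor.from_pretrained(model_id,
                                                     trust_remote_code=True)
tokenizer = model.update_special_tokens(tokenizer)  # helper from remote code

image = Image.open("invoice.png")
inputs = image_processor([image], return_tensors="pt")
prompt = "<|user|>\n<image>\nList the line items as JSON.<|end|>\n<|assistant|>\n"
inputs.update(tokenizer([prompt], return_tensors="pt"))
out = model.generate(**inputs, image_size=[image.size],
                     pad_token_id=tokenizer.pad_token_id, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```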