AI Wisdom
๐Ÿ‘๏ธ

Multimodal & Vision

Vision-language models, OCR, and any-to-any multimodal AI systems.

1 Graduated · 5 Incubating · 4 Sandbox · 10 total
โ† All categories

LLaVA-NeXT

Incubating
4/5

Leading open-source vision-language model with strong reasoning

Best open-source VLM for general-purpose vision understanding. Strong at image captioning, VQA, and document analysis. Multiple size variants from 7B to 72B. Active research community.

Open Source
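As a hedged sketch of what running LLaVA-NeXT locally looks like with Hugging Face transformers: the model id below is one of the community "llava-hf" conversions and is an assumption (other size variants exist up to 72B); verify the checkpoint name on the Hub before use.

```python
# Hedged sketch: LLaVA-NeXT inference via Hugging Face transformers.
# The checkpoint id is an assumption; larger variants follow the same API.

def llava_chat(question: str) -> list[dict]:
    """Build the chat-template message format the llava-hf processors
    expect: one image slot plus the text question."""
    return [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": question}]}]

def run_llava_next(image, question: str) -> str:
    """Requires `pip install transformers torch`; downloads weights on
    first call, and the larger variants need a sizeable GPU."""
    from transformers import (LlavaNextForConditionalGeneration,
                              LlavaNextProcessor)
    model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # assumed 7B checkpoint
    processor = LlavaNextProcessor.from_pretrained(model_id)
    model = LlavaNextForConditionalGeneration.from_pretrained(model_id)
    prompt = processor.apply_chat_template(llava_chat(question),
                                           add_generation_prompt=True)
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=256)
    return processor.decode(out[0], skip_special_tokens=True)
```

The same two-function pattern (message builder plus inference wrapper) transfers to the other llava-hf size variants by swapping the model id.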

Pixtral Large

Incubating
4/5

Mistral's 124B vision-language model with 128K context

Strong vision-language capabilities at 124B parameters. Handles multiple images in a single prompt. Good for document understanding and multi-page analysis. EU data residency.

Open Source

InternVL 2.5

Incubating
4/5

Open-source VLM rivalling GPT-4o on vision benchmarks

Closest open-source model to GPT-4o vision quality. Strong OCR, chart understanding, and mathematical reasoning from images. Multiple sizes available for different compute budgets.

Open Source

Qwen-VL-Max

Incubating
4/5

Alibaba's flagship vision-language model with video understanding

Excellent at OCR, document parsing, and video understanding. Supports interleaved image-text inputs. Best multilingual VLM for CJK content. Available via DashScope API.

Proprietary
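Since Qwen-VL-Max is served through DashScope, a minimal sketch of a call via DashScope's OpenAI-compatible mode might look like the following. The base URL and model id are assumptions from memory; check the DashScope documentation for the correct values in your region.

```python
# Hedged sketch: Qwen-VL-Max via DashScope's OpenAI-compatible endpoint.
# Base URL and model id are assumptions; nothing here is official docs.
import os

def vision_messages(prompt: str, image_urls: list[str]) -> list[dict]:
    """Build an OpenAI-style multimodal message: interleaved image parts
    followed by the text question."""
    content = [{"type": "image_url", "image_url": {"url": u}}
               for u in image_urls]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]

def ask_qwen_vl_max(prompt: str, image_urls: list[str]) -> str:
    """Requires `pip install openai` and a DASHSCOPE_API_KEY env var."""
    from openai import OpenAI
    client = OpenAI(
        api_key=os.environ["DASHSCOPE_API_KEY"],
        # Assumed endpoint for DashScope's OpenAI-compatible mode:
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    )
    resp = client.chat.completions.create(
        model="qwen-vl-max",  # assumed model id
        messages=vision_messages(prompt, image_urls),
    )
    return resp.choices[0].message.content
```

Passing several URLs in `image_urls` exercises the interleaved image-text input mentioned above.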

Florence-2

Graduated
3/5

Microsoft's unified vision foundation model for multiple tasks

Versatile vision model handling captioning, detection, segmentation, and OCR in one model. Prompt-based task switching. Excellent for building multi-task vision pipelines.

Open Source
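Florence-2's prompt-based task switching means one checkpoint performs different vision tasks depending on a task token prefixed to the input. A hedged sketch, with task tokens and model id recalled from the model card (verify both against the checkpoint before relying on them):

```python
# Hedged sketch of Florence-2 task switching: the task token chooses
# the behaviour. Token names and checkpoint id are assumptions.
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "ocr": "<OCR>",
}

def build_prompt(task: str, extra_text: str = "") -> str:
    """Compose the Florence-2 input: task token, optionally followed by
    free text (e.g. a phrase to ground)."""
    return TASK_PROMPTS[task] + extra_text

def run_florence2(image, task: str, extra_text: str = ""):
    """Requires `pip install transformers` and downloads the checkpoint;
    the model ships its own processing code, hence trust_remote_code."""
    from transformers import AutoModelForCausalLM, AutoProcessor
    model_id = "microsoft/Florence-2-large"  # assumed checkpoint name
    processor = AutoProcessor.from_pretrained(model_id,
                                              trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id,
                                                 trust_remote_code=True)
    prompt = build_prompt(task, extra_text)
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    ids = model.generate(**inputs, max_new_tokens=512)
    text = processor.batch_decode(ids, skip_special_tokens=False)[0]
    # The bundled processor parses raw output into task-specific
    # structures (boxes for detection, strings for captions/OCR).
    return processor.post_process_generation(text, task=TASK_PROMPTS[task],
                                             image_size=image.size)
```

Swapping the `task` argument is all a multi-task pipeline needs; the model and processor load once.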

Molmo

Sandbox
3/5

Allen AI's fully open VLM with pointing and grounding

Truly open VLM with open weights, data, and code. Unique pointing capability for spatial grounding. Good for academic research and applications requiring full transparency.

Open Source

CogVLM2

Sandbox
3/5

Zhipu AI's vision-language model with video understanding

Strong at image and video understanding with multi-turn conversation. Good Chinese language support in VLM context. Open weights enable self-hosted multimodal applications.

Open Source

GLM-OCR

Sandbox
3/5

Zhipu AI's specialised model for document OCR and extraction

Specialised OCR model that handles complex layouts, tables, and multi-language documents. Good for document digitisation pipelines. Lighter than general VLMs for pure OCR tasks.

Open Source

PaliGemma 2

Incubating
3/5

Google's vision-language model designed for fine-tuning and transfer learning

Excellent base for fine-tuning custom vision tasks. Strong transfer learning to detection, segmentation, and VQA. Multiple sizes for different deployment targets.

Open Source

BLIP-3 / xGen-MM

Sandbox
3/5

Salesforce's multimodal model for enterprise vision tasks

Strong at visual grounding, image captioning, and structured extraction. Good for enterprise workflows requiring custom vision-language pipelines. Apache 2.0 licensed.

Open Source