Multimodal AI officially left the lab and entered mainstream deployment. Three foundational models—OpenAI’s GPT-4o, Google’s Gemini 1.5, and Anthropic’s Claude 3 Opus—demonstrated unprecedented abilities to reason across text, image, audio, and even video simultaneously.
OpenAI’s GPT-4o (‘o’ for omni) stood out for its native multimodal support. Unlike earlier models that stitched together separate vision and language backends, GPT-4o was trained from the ground up to handle diverse input types through a unified transformer architecture. This allowed it to seamlessly switch between interpreting text, describing images, generating speech, and answering audio-based queries. GPT-4o could, for example, take a screenshot, identify UX flaws, and then recommend frontend improvements in natural language.
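In practice, that flow can be a single API call. Here is a minimal sketch using the OpenAI Python SDK; the screenshot URL is a placeholder, and the exact message format may vary between SDK versions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            # Text instruction and image reference travel in the same message.
            "content": [
                {"type": "text",
                 "text": "Review this screenshot for UX issues and suggest concrete frontend fixes."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/checkout_screenshot.png"}},  # placeholder URL
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

The notable part is that no separate vision service sits in front of the model: the image and the instruction arrive together and are reasoned over jointly.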
Gemini 1.5, on the other hand, leaned heavily into search augmentation and scientific problem solving. Google’s model showed prowess in reading scientific papers, interpreting accompanying charts and figures, and explaining the material at varying levels of complexity. This opened up new possibilities for education platforms, pharma R&D, and technical support agents who needed to parse visuals like diagrams or graphs in real time.
Claude 3 Opus, true to Anthropic’s roots, placed a strong emphasis on safety and intent alignment across modalities. Its ability to provide detailed and emotionally aware descriptions of visual scenes made it ideal for accessibility applications, especially for describing photos or interfaces to visually impaired users. One standout demo showed Claude interpreting a social media post with an embedded video and summarising its tone, subject, and sentiment.
These capabilities didn’t just wow in demos—they started showing real-world value:
- E-commerce: Retailers began using multimodal AI to automatically tag product photos, generate alt text, and create image-based SEO descriptions at scale.
- Healthcare: AI assistants could interpret X-rays and ultrasounds alongside textual reports, flagging discrepancies or generating second opinions.
- Media & Publishing: Newsrooms used models to summarise video footage, generate captions, and translate embedded interviews across languages.
Developers, too, gained new superpowers. Multimodal prompt workflows emerged, in which a single request combined text, images, and diagrams. For example:
“Using the image of this circuit board, explain why the right-hand capacitor might be overheating.”
Or:
“Generate a LinkedIn post for this slide deck, focusing on slide 4’s chart and its key takeaway.”
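In code, prompts like these become structured messages in which the text and the image travel together. A hedged sketch of the circuit-board example, assuming a local photo named circuit_board.jpg and the same OpenAI SDK as above:

```python
import base64

from openai import OpenAI

client = OpenAI()

# Encode the local photo as a data URL so it can be embedded directly in the request.
with open("circuit_board.jpg", "rb") as f:  # placeholder file name
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Using the image of this circuit board, explain why the right-hand capacitor might be overheating."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```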
Tooling also evolved. Platforms like Cursor, Dust, and LangChain introduced support for multimodal pipelines. These systems handled everything from OCR and transcription to image embedding and video segmentation — routing outputs into LLMs for layered reasoning. Vector stores such as Weaviate and Pinecone added support for visual search and hybrid retrieval.
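On the embedding side of those pipelines, a basic text-to-image ("visual") search can be sketched with a CLIP model from the sentence-transformers library. The product_photos folder and the query are placeholders; in production the vectors would live in a store like Weaviate or Pinecone rather than an in-memory list.

```python
from pathlib import Path

from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP-style models map images and text into the same embedding space,
# which is what makes text-to-image search possible.
model = SentenceTransformer("clip-ViT-B-32")

image_paths = sorted(Path("product_photos").glob("*.jpg"))  # placeholder folder
image_embeddings = model.encode([Image.open(p) for p in image_paths])

# Embed the text query into the same space and rank images by cosine similarity.
query_embedding = model.encode("red sneakers on a white background")
scores = util.cos_sim(query_embedding, image_embeddings)[0]

best = int(scores.argmax())
print(f"Best match: {image_paths[best]} (score {scores[best].item():.3f})")
```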
But multimodal AI also raised new challenges:
- Latency: Processing images and video introduced lags that made real-time applications more complex.
- Data privacy: Screenshots, diagrams, and user-submitted media often contained sensitive information, demanding tighter guardrails.
- Evaluation: Traditional NLP benchmarks fell short. New metrics emerged to test visual understanding, spatial reasoning, and cross-modal alignment (a bare-bones version of such a check is sketched below).
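In its simplest form, a cross-modal evaluation is just a handful of image-question-answer cases scored automatically. Everything in this sketch is illustrative: ask_model is a placeholder for whichever multimodal model is under test, the file paths are invented, and real benchmarks use far richer scoring than substring matching.

```python
from dataclasses import dataclass


@dataclass
class VisualQACase:
    image_path: str   # e.g. a chart, diagram, or UI screenshot (placeholder paths below)
    question: str     # what we ask about the image
    expected: str     # substring we expect to appear in the answer


# A toy benchmark; real suites cover spatial reasoning, OCR, chart reading, etc.
CASES = [
    VisualQACase("charts/q3_revenue.png", "Which quarter had the highest revenue?", "Q3"),
    VisualQACase("ui/login_screen.png", "What does the primary button say?", "Sign in"),
]


def ask_model(image_path: str, question: str) -> str:
    """Placeholder: wrap whichever multimodal model is being evaluated."""
    raise NotImplementedError


def run_eval() -> float:
    hits = 0
    for case in CASES:
        answer = ask_model(case.image_path, case.question)
        # Crude substring match; production evals use structured answers or LLM judges.
        hits += case.expected.lower() in answer.lower()
    return hits / len(CASES)
```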
Organisations had to rethink their prompt libraries. Prompt engineering moved from a text-only practice to one that incorporated image pre-processing, region tagging, and temporal sequencing for videos.
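As a small illustration of that shift, a pre-processing step might crop and label the region a prompt refers to before the image is sent, so the model is not left guessing which part of the screen matters. The file names and pixel coordinates below are placeholders.

```python
from PIL import Image

# Region tagging: cut out the area the prompt will reference and give it a name.
full = Image.open("dashboard.png")            # placeholder screenshot
region = full.crop((40, 120, 480, 360))       # (left, upper, right, lower) in pixels
region.save("dashboard_region_sales_widget.png")

# The prompt then refers to the tagged region explicitly.
prompt = (
    "Image 1 is the full dashboard; image 2 is the 'sales widget' region. "
    "Explain why the sparkline in image 2 looks truncated."
)
```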
Educational content providers, UI testers, and game studios were among the first movers. A game studio used multimodal LLMs to generate narrative descriptions of game scenes, making titles more accessible to blind players. A test automation firm used them to identify broken UI components by comparing screenshots of test runs with expected baselines and flagging subtle layout shifts.
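A stripped-down version of that screenshot comparison can be done with a pixel diff before any model is involved; the paths are placeholders and the two images are assumed to share the same dimensions.

```python
from PIL import Image, ImageChops

baseline = Image.open("baseline/login.png").convert("RGB")    # expected rendering
candidate = Image.open("run_142/login.png").convert("RGB")    # screenshot from the test run

# Pixel-level diff; getbbox() returns the bounding box of every changed pixel,
# or None when the two screenshots are identical.
diff = ImageChops.difference(baseline, candidate)
bbox = diff.getbbox()

if bbox is None:
    print("No visual change detected.")
else:
    print(f"Layout shift detected in region {bbox}; escalate both screenshots "
          "to the multimodal model for a natural-language explanation.")
```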
All of this pointed toward a future where language models act more like perception engines—not just answering questions, but seeing, hearing, and understanding context in ways that mirror human cognition.
July 2024 was a breakthrough month. It demonstrated that the fusion of modalities isn’t just technically feasible: it’s commercially viable, usable at scale, and full of untapped potential. As models continue to evolve, expect the next wave of GenAI apps to be powered not just by what you say, but also by what you show.