Elon Musk’s AI company, xAI, is making progress on adding multimodal inputs to its Grok chatbot, according to public developer documents. What this means is that, soon, users may be able to upload photos to Grok and receive text-based answers.
This was first teased in a blog post last month from xAI which said Grok-1.5V will offer “multimodal models in a number of domains.” The latest update to the developer documents appear to show progress on shipping a new model.
In the developer documents, a sample Python script demonstrates how developers can use the xAI software development kit library to generate a response based on both text and images. This script reads an image file, sets up a text prompt, and uses the xAI SDK to generate a response.
This is a big update for Grok, which xAI first released in November 2023 and is available to users who pay for the X Premium Plus subscription. The last update was Grok 1.5 in March, which came with improved reasoning capabilities.
The model is trained “on a variety of text data from publicly available sources from the Internet up to Q3 2023 and data sets reviewed and curated by … human reviewers,” according to a blog post from X. Grok-1 was not trained on X data (including public X posts), the blog added. However, Grok does have “real-time knowledge of the world,” including posts on X.
xAI, founded by Elon Musk in March 2023, is relatively new in the AI field and trails behind competitors such as OpenAI’s ChatGPT. However, according to a blog post from xAI, their Grok 1.5 model is closing the gap with GPT-4 on various benchmarks that span a wide range of grade school to high school competition problems. It’s important to note that benchmarks for large language models are often criticized because the models can perform well on benchmarks if those benchmarks are included in their training data. It’s sort of like memorizing test answers, rather than actually learning the material.
Multimodal conversational chatbots seem to be the next frontier for AI, with multiple advancements announced at Google I/O and OpenAI releasing GPT-4o, so Grok lacking multimodal capabilities has put it behind the curve — until now.