Microsoft has announced the creation of Kosmos-1, a multimodal large language model (MLLM) that can respond to both language and visual cues. The model has been designed to carry out various tasks, such as image captioning and visual question answering.
Take a look at some of the examples of generations from the Kosmos-1 model that can perceive general modalities. Would love to see how something like this might do with more data-specific visuals like charts, etc. pic.twitter.com/HBjtuGJSwN
— elvis (@omarsar0) February 28, 2023
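Kosmos-1 itself was not released as a public checkpoint alongside the paper, so the snippet below is only a rough sketch of how image captioning and visual question answering are typically run against a multimodal model through Hugging Face's vision-to-sequence interface. The checkpoint name is the later, publicly released Kosmos-2 model, used purely as a stand-in, and the image file name is hypothetical.

```python
# Minimal sketch: image captioning / visual question answering with a
# vision-to-sequence model. Kosmos-1 weights were not published, so the
# publicly released Kosmos-2 checkpoint stands in here; "photo.jpg" is a
# hypothetical local file.
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

checkpoint = "microsoft/kosmos-2-patch14-224"  # stand-in for Kosmos-1
processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForVision2Seq.from_pretrained(checkpoint)

image = Image.open("photo.jpg")

# Captioning-style prompt; for visual question answering, phrase the prompt
# as "Question: ... Answer:" instead.
prompt = "<grounding>An image of"
inputs = processor(text=prompt, images=image, return_tensors="pt")

generated_ids = model.generate(**inputs, max_new_tokens=64)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(caption)
```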
Although LLMs such as the well-known GPT (Generative Pre-trained Transformer) models have demonstrated impressive conversational capabilities, they still struggle with multimodal inputs such as images and audio prompts. According to the accompanying research paper, ‘Language Is Not All You Need: Aligning Perception with Language Models’, grounding LLMs in multimodal perception, that is, in knowledge acquired from the real world rather than from text alone, is a necessary step toward artificial general intelligence (AGI).
The paper emphasizes that unlocking multimodal input widens the potential applications of language models to high-value areas such as multimodal machine learning, document intelligence, and robotics. Everyday Robots, an Alphabet-owned robotics firm, and Google’s Brain Team demonstrated the importance of grounding in 2022, using LLMs to teach robots how to perform physical tasks according to human descriptions.
Microsoft’s Prometheus model relies on grounding in a similar way, combining OpenAI’s GPT models with real-world signals from Bing’s search ranking and search results. According to Microsoft, Kosmos-1 can perceive general modalities, follow instructions (zero-shot learning), and learn in context (few-shot learning), aligning perception with LLMs so that models can see and talk.
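The zero-shot versus few-shot distinction is easiest to see in how a multimodal prompt is assembled. The sketch below does not use any real Kosmos-1 API (none was published); it simply illustrates, with hypothetical image paths, how an interleaved image-and-text sequence differs between plain instruction following and in-context learning.

```python
# Illustration only: interleaved image-text prompts for a multimodal LLM.
# No real Kosmos-1 API is assumed; image paths are hypothetical.
from dataclasses import dataclass

@dataclass
class Segment:
    kind: str   # "image" or "text"
    value: str  # image path/URL, or raw text

# Zero-shot (instruction following): one image plus a natural-language question.
zero_shot_prompt = [
    Segment("image", "kitten_with_paper_smile.jpg"),
    Segment("text", "Question: Why is this picture funny? Answer:"),
]

# Few-shot (in-context learning): worked image/answer pairs precede the query,
# so the model infers the task format from the context alone.
few_shot_prompt = [
    Segment("image", "sum_example_1.jpg"),
    Segment("text", "Question: What is the sum shown? Answer: 7"),
    Segment("image", "sum_example_2.jpg"),
    Segment("text", "Question: What is the sum shown? Answer: 12"),
    Segment("image", "sum_query.jpg"),
    Segment("text", "Question: What is the sum shown? Answer:"),
]
```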
Kosmos-1’s capabilities were demonstrated across a range of prompts: explaining why a photograph of a kitten, with a paper smile held over its mouth by a person, is funny; identifying a tennis player with a ponytail in an image; calculating the sum 4 + 5 shown in an image; answering a question about TorchScale based on its GitHub description page; and reading the heart rate from an Apple Watch face.
Microsoft plans to use transformer-based language models like Kosmos-1 to improve Bing’s web page question answering, a task that involves finding answers to questions within web pages by understanding both the semantics and the structure of their text. The researchers also see the task as a useful way to evaluate how well a model handles that combination of meaning and layout.
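As a rough illustration of that setup, the sketch below flattens a page into text that keeps coarse structural cues (headings and list items) and wraps it in a question prompt. The URL and the final model call are hypothetical placeholders; a real web page question-answering pipeline would preserve much richer structure than this.

```python
# Sketch of web page question answering: flatten a page to text while keeping
# coarse structure (headings, list items), then build a QA prompt for an LLM.
# The URL is a hypothetical placeholder.
import requests
from bs4 import BeautifulSoup

def page_to_structured_text(url: str) -> str:
    """Fetch a page and return its text with simple structural prefixes."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for node in soup.find_all(["h1", "h2", "h3", "p", "li"]):
        text = node.get_text(" ", strip=True)
        if text:
            prefix = {"h1": "# ", "h2": "## ", "h3": "### ", "li": "- "}.get(node.name, "")
            lines.append(prefix + text)
    return "\n".join(lines)

context = page_to_structured_text("https://example.com/some-page")
prompt = f"{context}\n\nQuestion: What does this page describe?\nAnswer:"
# `prompt` would then be passed to the language model for completion.
```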
In conclusion, Kosmos-1 shows how a single multimodal model can handle tasks ranging from image captioning and visual question answering to reasoning over web pages. The research paper underlines that grounding language models in multimodal perception is a necessary step toward AGI, and Microsoft intends to use models like Kosmos-1 to improve Bing’s search capabilities, particularly web page question answering.