Describing an Image
Vision-enabled language models can analyze both text and images, but combining them in a single prompt requires specific message formatting. This example demonstrates how to structure a multimodal request by combining a text question with an image URL in the message content array.
import { Mastra } from '@mastra/core';

const mastra = new Mastra();

// Create an LLM instance backed by a vision-capable model
const llm = mastra.LLM({
  provider: 'OPEN_AI',
  name: 'gpt-4-turbo',
});

// A multimodal message: instead of a plain string, the content field
// is an array mixing a text part with an image part that points at a
// publicly accessible URL
const response = await llm.generate([
  {
    role: 'user',
    content: [
      {
        type: 'text',
        text: 'what is that black bold text at the top?',
      },
      {
        type: 'image',
        image: new URL(
          'https://upload.wikimedia.org/wikipedia/commons/thumb/0/03/491_BC_-_1902_AD_-_A_Long_Time_Between_Drinks.jpg/1000px-491_BC_-_1902_AD_-_A_Long_Time_Between_Drinks.jpg',
        ),
      },
    ],
  },
]);

console.log(response.text);
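Mastra's LLM layer builds on the Vercel AI SDK, whose image parts also accept raw image data (a base64 string, Uint8Array, ArrayBuffer, or Buffer) in addition to a URL. Assuming generate passes these through unchanged, a minimal sketch of sending a local file instead of a remote URL might look like the following; './photo.jpg' is a hypothetical path used only for illustration:

import { readFile } from 'node:fs/promises';

// Assumption: the image field accepts raw data (here, a Buffer read
// from disk) the way the underlying AI SDK image part does.
// './photo.jpg' is a hypothetical local file.
const imageData = await readFile('./photo.jpg');

const localResponse = await llm.generate([
  {
    role: 'user',
    content: [
      { type: 'text', text: 'Describe this image in one sentence.' },
      { type: 'image', image: imageData },
    ],
  },
]);

console.log(localResponse.text);

Sending raw data avoids requiring the model provider to fetch the image itself, which matters when the file is not publicly reachable.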