Multimodal Input in an AI Chat with Mastra AI and Vercel AI SDK

Published on 10/2/2025
In recent months, I’ve been working extensively on integrating multimodal AI agents into conversational experiences for my project Sarion, an intelligent assistant that helps you manage your day more effectively.
Thanks to Mastra AI and Vercel AI SDK v4, it’s possible to build workflows that combine text and images in the same chat, keeping the user experience smooth both on the frontend and backend.
In this article, I’ll walk you through how I implemented multimodal input (text + images) in an AI chat, leveraging the experimental_attachments feature of the new SDK.
Why Multimodal Input Matters
An agent that only understands text is useful, but limited.
Think about use cases such as:
- Analyzing images uploaded by users,
- Conversations that include both screenshots and text prompts,
- AI chats that support professional workflows (design, documents, e-commerce).
The real value is allowing users to interact with AI through multiple input channels simultaneously: typing a message, uploading an image, asking the agent to interpret it, and receiving a coherent response.
Technologies Used
- Mastra AI (latest version) → to orchestrate intelligent agents and workflows.
- Vercel AI SDK v4 → to manage chat and message flows, including experimental_attachments.
Frontend: Handling Text and Images
On the frontend, I defined a simple message structure, where each part can be either text or image:
```typescript
type MessagePart =
  | { type: "text"; text: string }
  | { type: "image"; imageUrl: string; mimeType: string };
```
When the user writes a message and attaches files:
- Text is added as text.
- Images are converted to base64 and added as image with mimeType.
- Attachments are sent to the backend via experimental_attachments.
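The attachments passed to `experimental_attachments` follow the AI SDK v4 `Attachment` shape (`name`, `contentType`, `url`). As a minimal sketch, a hypothetical helper (`toAttachment` is my name, not part of the SDK) that wraps an already base64-encoded file into that shape:

```typescript
// Shape expected by the AI SDK v4 `experimental_attachments` field.
interface Attachment {
  name?: string;
  contentType?: string;
  url: string; // a data URL (base64) or a remote URL
}

// Hypothetical helper: wrap a base64-encoded file as a data-URL attachment.
function toAttachment(name: string, mimeType: string, base64: string): Attachment {
  return {
    name,
    contentType: mimeType,
    url: `data:${mimeType};base64,${base64}`,
  };
}

const att = toAttachment("screenshot.png", "image/png", "iVBORw0KGgo=");
// att.url begins with "data:image/png;base64,"
```

In the browser, the base64 payload would typically come from `FileReader.readAsDataURL` on the uploaded `File`.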
The key piece is this block:
```typescript
await append({
  role: "user",
  content: textContent,
  data: {
    ...messageData,
    threadId: currentThreadToUse?.id || "",
    timezone: Intl.DateTimeFormat().resolvedOptions().timeZone,
  },
  experimental_attachments: attachments, // <-- multimodal magic
});
```
Here, Vercel AI SDK v4 allows you to send both text and images together without having to completely restructure your message model.
Best practices:
- Limit each image to 5MB and text to 5000 characters.
- Allow a maximum of 5 images per thread, with an overall cap of 10 threads per user, to ensure performance and scalability.
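These limits can be enforced with a simple guard before sending the message. A sketch under the constraints above; `validateInput` and the constant names are hypothetical, not part of either SDK:

```typescript
// Hypothetical limits matching the best practices above.
const MAX_IMAGE_BYTES = 5 * 1024 * 1024; // 5MB per image
const MAX_TEXT_CHARS = 5000;             // per message
const MAX_IMAGES_PER_THREAD = 5;

// Returns an error message, or null if the input is acceptable.
function validateInput(
  text: string,
  imageSizes: number[],      // sizes (bytes) of images being attached now
  imagesAlreadyInThread: number
): string | null {
  if (text.length > MAX_TEXT_CHARS) {
    return "Text exceeds 5000 characters";
  }
  if (imagesAlreadyInThread + imageSizes.length > MAX_IMAGES_PER_THREAD) {
    return "Too many images in this thread";
  }
  for (const size of imageSizes) {
    if (size > MAX_IMAGE_BYTES) {
      return "Image exceeds 5MB";
    }
  }
  return null;
}
```

Running the check client-side gives immediate feedback, but the same limits should be re-checked on the server, since client-side validation can be bypassed.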
Backend: Validating and Building the Message
On the server side, the main task is to validate attachments and build a consistent message for the AI agent.
The key part is how the multimodal message is constructed:
```typescript
if (hasAttachments) {
  const contentParts = [];

  // text
  if (latestMessage.content) {
    contentParts.push({ type: "text", text: latestMessage.content });
  }

  // images
  for (const att of attachments) {
    if (att.contentType?.startsWith("image/")) {
      contentParts.push({
        type: "image",
        image: att.url,
        mimeType: att.contentType,
      });
    }
  }

  messageToSend = {
    id: uuidv4(),
    role: "user",
    content: contentParts,
  };
} else {
  messageToSend = {
    id: uuidv4(),
    role: "user",
    content: latestMessage.content,
  };
}
```
In practice:
- If the user sends text only, the message is a simple string.
- If they send text + images, the message becomes an array of contentParts.
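To illustrate, the two resulting message shapes look roughly like this (ids and values are placeholders):

```typescript
// Text-only message: content stays a plain string.
const textOnly = {
  id: "1",
  role: "user",
  content: "Hello",
};

// Text + image: content becomes an array of parts.
const multimodal = {
  id: "2",
  role: "user",
  content: [
    { type: "text", text: "What's in this image?" },
    {
      type: "image",
      image: "data:image/png;base64,iVBORw0KGgo=",
      mimeType: "image/png",
    },
  ],
};
```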
I also integrated an input validator into the main agent via a ModerationProcessor, which uses a dedicated LLM to evaluate incoming content and block anything deemed inappropriate.
This validation, combined with the size limits above, is essential to prevent abuse and maintain a consistent UX.
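Conceptually, the moderation gate works like this. The sketch below is a simplified stand-in, not Mastra's actual ModerationProcessor API; `moderate` and `classify` are hypothetical names, with the LLM call abstracted behind the `classify` callback:

```typescript
type ModerationResult = { flagged: boolean; reason?: string };

// Hypothetical gate: run a classifier (in practice, a dedicated LLM call)
// over the user input before it reaches the main agent. Throws if blocked.
async function moderate(
  text: string,
  classify: (t: string) => Promise<ModerationResult>
): Promise<void> {
  const result = await classify(text);
  if (result.flagged) {
    throw new Error(`Input blocked: ${result.reason ?? "inappropriate content"}`);
  }
}
```

Because the gate runs before the agent, a blocked message never consumes agent tokens and never enters the thread history.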
Result: A Multimodal Chat
With this integration, users can type prompts and attach images directly in the chat.
The Mastra agent receives a multimodal input and can:
- Analyze the image,
- Read the text,
- Merge the information to provide intelligent responses.
Conclusion
With just a few lines of code, using Mastra AI and the new Vercel AI SDK v4, you can upgrade a simple text-based chat into a true multimodal experience.
This opens the door to powerful use cases: from assistants that interpret screenshots to systems that combine text descriptions and images to generate reports, analyses, or suggestions.
Tell Me About Your Idea
Do you have an idea you want to turn into reality? Contact me to discuss how I can help you build your MVP.