Send images, audio, video, and other media to models that support multimodal input.
Multimodal models can accept both text and non-text content — images, audio, video, and documents — as part of a single prompt. Genkit uses a unified Part type to represent all content, making it straightforward to mix text and media in the same message.
```python
import asyncio
import base64

from genkit import Genkit
from genkit.plugins.google_genai import GoogleAI

ai = Genkit(plugins=[GoogleAI()])


async def main() -> None:
    # From a URL
    response = await ai.generate(
        prompt=[
            {'media': {'url': 'https://example.com/photo.jpg', 'content_type': 'image/jpeg'}},
            {'text': 'Describe what you see in this image.'},
        ]
    )
    print(response.text)

    # From a local file, embedded as a base64 data URL
    with open('./photo.jpg', 'rb') as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = await ai.generate(
        prompt=[
            {'media': {'url': f'data:image/jpeg;base64,{image_b64}', 'content_type': 'image/jpeg'}},
            {'text': 'What objects are visible?'},
        ]
    )
    print(response.text)


# await is only valid inside a coroutine, so the calls are wrapped in main().
asyncio.run(main())
```
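The local-file case above hardcodes `image/jpeg` in two places. A minimal sketch of a helper that derives the MIME type from the file extension and builds the same media part shape is shown below. The name `media_part_from_file` and its `default_type` parameter are hypothetical, not part of Genkit; only the Python standard library is used.

```python
import base64
import mimetypes
from pathlib import Path


def media_part_from_file(path: str, default_type: str = 'application/octet-stream') -> dict:
    """Build a media part dict from a local file (hypothetical helper, not a Genkit API)."""
    # Guess the MIME type from the file extension; fall back to a generic type.
    content_type, _ = mimetypes.guess_type(path)
    content_type = content_type or default_type

    # Embed the file contents as a base64 data URL.
    encoded = base64.b64encode(Path(path).read_bytes()).decode('ascii')
    return {
        'media': {
            'url': f'data:{content_type};base64,{encoded}',
            'content_type': content_type,
        }
    }
```

The returned dict can be placed in the `prompt` list next to a `{'text': ...}` part. Extension-based guessing can be wrong for unusual files, so pass the type explicitly when it is already known.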
Not all models support multimodal input. Each model exposes its capabilities in the ModelInfo.supports metadata:
```ts
// Check if a model supports media input
const modelRef = googleAI.model('gemini-2.0-flash');
console.log(modelRef.info?.supports?.media); // true
```
The supports object includes:
| Field | Type | Description |
| --- | --- | --- |
| `media` | `boolean` | Can process images, audio, video, documents. |
| `multiturn` | `boolean` | Can process conversation history. |
| `tools` | `boolean` | Can call tools. |
| `systemRole` | `boolean` | Accepts a system message. |
| `output` | `string[]` | Supported output types (e.g., `["text", "media"]`). |
| `contentType` | `string[]` | Accepted MIME types for media input. |
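The fields above can be used to decide whether a given file is worth sending at all. The following is a sketch only: it assumes the capabilities have been loaded into a plain dict whose keys match the field names listed above, and the function name `accepts_media` is hypothetical. It also assumes that a missing or empty `contentType` list means no MIME restriction is declared, and that entries are matched exactly (no wildcards).

```python
def accepts_media(supports: dict, content_type: str) -> bool:
    """Return True if the capabilities dict indicates this MIME type is accepted."""
    # Without the media flag, no media part should be sent.
    if not supports.get('media', False):
        return False

    allowed = supports.get('contentType')
    # Assumption: no declared list means the model does not restrict MIME types.
    if not allowed:
        return True

    # Assumption: entries are exact MIME types, compared literally.
    return content_type in allowed


supports = {'media': True, 'multiturn': True, 'contentType': ['image/jpeg', 'image/png']}
print(accepts_media(supports, 'image/png'))
print(accepts_media(supports, 'audio/wav'))
```

With the sample dict, the first call accepts `image/png` and the second rejects `audio/wav`, because that type is not in the declared list.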
Multimodal support depends on the model. Gemini 1.5 and 2.x models support images, audio, video, and PDF input. Older or text-only models will return an error if media parts are included in the prompt. Check the plugin documentation for the specific model you are using.
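Because a text-only model returns an error when it receives media parts, it can help to validate the prompt before calling the model. The sketch below works on the dict-style parts used earlier on this page. The function `prepare_prompt` and its `drop_unsupported` flag are hypothetical names, and silently dropping media is only one possible policy.

```python
def prepare_prompt(prompt: list[dict], supports_media: bool, drop_unsupported: bool = False) -> list[dict]:
    """Return a prompt that is safe to send, given whether the model accepts media."""
    has_media = any('media' in part for part in prompt)
    if supports_media or not has_media:
        return prompt

    if drop_unsupported:
        # Keep only the non-media parts so a text-only model can still answer.
        return [part for part in prompt if 'media' not in part]

    # Fail early with a clear message instead of waiting for the model to reject the request.
    raise ValueError('Prompt contains media parts, but the model does not support media input.')


prompt = [
    {'media': {'url': 'https://example.com/photo.jpg', 'content_type': 'image/jpeg'}},
    {'text': 'Describe what you see in this image.'},
]
print(prepare_prompt(prompt, supports_media=False, drop_unsupported=True))
```

Raising by default keeps the failure visible. Dropping the media changes what the model is asked, so it is opt-in here.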