Make sense of YouTube videos using Gemini

Gemini provides a surprisingly affordable API for video understanding, and you can just paste a YouTube video link!
In early June, IDinsight participated in OpenNyAI Maker Residency 2025, a week-long hackathon reimagining India’s public grievance redressal ecosystem. My group developed a new interface for citizens to file grievances.
The current form-based system burdens citizens by requiring them to navigate hundreds of categories and attach all required documents for their grievance to be processed. We designed a multi-modal chat interface that actively guides citizens through the filing process, from categorization to identifying the document proof requirements, helping users submit grievances accurately and completely using their preferred modality.
I implemented the video-understanding portion using Google’s Gemini, which allowed users to paste a link to their YouTube video describing their grievance. For testing and demonstration, we used a few of Video Volunteers’ 18,000 reporting videos, filmed by their community correspondents across 20 states in India.
In this post, I want to bring two things to your attention, so that you can go run with them:
- That you can process a YouTube video using Gemini (and how to do it)
- That it’s surprisingly cheap to do so.
It’s pretty cheap
I was surprised to learn it costs only 2 cents to process a 10-minute video with Gemini 2.5 Flash Lite, and 7 cents with Gemini 2.5 Flash (as of July 2025).
Estimated cost for (a) a 10-minute video and (b) a 1-hour video, sampled at 1 fps, assuming 2,500 output tokens (roughly the length of the transcript of a 10-minute speech).
This assumes sampling at 1 frame per second; Gemini recommends slowing down fast-action sequences if a 1-second sampling interval misses too much. For our purpose of identifying the grievance described in a given video, the default parameters with Gemini 2.0 Flash worked quite well, although we didn't get a chance to do a proper evaluation. You can see the full list of (I'd say) reasonable assumptions I made and the cost calculations here, based on Vertex AI's guide and pricing documents.
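If you want to sanity-check the numbers yourself, here is a back-of-the-envelope sketch in Python. The token rates and per-million-token prices are my own assumptions as of July 2025, based on the Gemini and Vertex AI docs linked above; note that some models bill audio tokens at a higher rate than video, which nudges the true figure up a bit.

# Rough cost estimate for Gemini video understanding.
# All rates below are assumptions as of July 2025 -- check the current
# Vertex AI pricing page before relying on them.
FRAME_TOKENS_PER_SEC = 258  # ~258 tokens per sampled frame at 1 fps
AUDIO_TOKENS_PER_SEC = 32   # ~32 tokens per second of audio
OUTPUT_TOKENS = 2500        # roughly a 10-minute speech transcript

# Assumed USD prices per 1M tokens (input charged at the text/image/video rate)
PRICES = {
    "gemini-2.5-flash-lite": {"input": 0.10, "output": 0.40},
    "gemini-2.5-flash": {"input": 0.30, "output": 2.50},
}

def estimate_cost_usd(video_seconds: int, model: str) -> float:
    """Estimate the cost of a single video-understanding call."""
    input_tokens = video_seconds * (FRAME_TOKENS_PER_SEC + AUDIO_TOKENS_PER_SEC)
    price = PRICES[model]
    return (input_tokens * price["input"] + OUTPUT_TOKENS * price["output"]) / 1e6

for minutes in (10, 60):
    for model in PRICES:
        cost = estimate_cost_usd(minutes * 60, model)
        print(f"{minutes}-minute video, {model}: ~${cost:.2f}")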
How to use Gemini’s video understanding on YouTube videos
The nice thing with Gemini’s video understanding API is that you can also just share a link to a YouTube video.
Google AI Studio
Non-developers can try this out on Google AI Studio: just paste the YouTube link along with your instructions. AI Studio also shows you how many tokens the video has.
Vertex AI APIs
For developers, here are the key things to note:
- Use the YouTube video URL in place of the file path (short URLs starting with youtu.be also work)
- Specify video/mp4 as the file type or MIME type
- For single-video processing, the Gemini API recommends putting the video before the text (docs)
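Putting these three together, here is a minimal sketch using Google's own google-genai Python SDK, assuming the Vertex AI environment variables described in the next section and an illustrative model name:

import os

from google import genai
from google.genai import types

# Create a client that talks to Vertex AI (credentials are picked up
# from GOOGLE_APPLICATION_CREDENTIALS)
client = genai.Client(
    vertexai=True,
    project=os.getenv("GOOGLE_VERTEX_PROJECT_ID"),
    location=os.getenv("GOOGLE_VERTEX_LOCATION"),
)

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=types.Content(
        parts=[
            # The video goes first, as a file part pointing at the YouTube URL
            types.Part(file_data=types.FileData(file_uri="{video_URL}")),
            types.Part(text="Please summarize this video."),
        ]
    ),
)
print(response.text)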
How you specify the file and file type varies from package to package. Here are working examples that use the Vertex AI APIs via LiteLLM's Python SDK and Vercel's AI SDK.
Vertex AI authentication
For both examples, you will need to set up the Vertex AI credentials with the following environment variables.
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account_credentials.json
GOOGLE_VERTEX_PROJECT_ID=
GOOGLE_VERTEX_LOCATION=
For more on how to create a Google Cloud Platform service account and a key for that service account, read this and then this. Make sure to add the "Vertex AI User" role to the service account.
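If you prefer the command line, the setup looks roughly like this; the project ID and service account name (gemini-video-sa) are placeholders of my own:

# Create a service account (the name is a placeholder)
gcloud iam service-accounts create gemini-video-sa --project=YOUR_PROJECT_ID

# Grant it the "Vertex AI User" role
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
  --member="serviceAccount:gemini-video-sa@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/aiplatform.user"

# Download a key file to point GOOGLE_APPLICATION_CREDENTIALS at
gcloud iam service-accounts keys create service_account_credentials.json \
  --iam-account=gemini-video-sa@YOUR_PROJECT_ID.iam.gserviceaccount.com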
LiteLLM
For Python LiteLLM users, first install litellm and google-auth:
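pip install litellm google-auth

Then try out the following: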
import json
import os

from litellm import completion

# Load the service account credentials from the path set above
GOOGLE_APPLICATION_CREDENTIALS = os.getenv("GOOGLE_APPLICATION_CREDENTIALS")
with open(GOOGLE_APPLICATION_CREDENTIALS, "r") as file:
    vertex_credentials = json.load(file)
vertex_credentials_json = json.dumps(vertex_credentials)

# LiteLLM call: the YouTube URL goes in file_id, with video/mp4 as the format.
# Note that the video part comes before the text part.
response = completion(
    model="vertex_ai/gemini-2.0-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "file",
                    "file": {
                        "file_id": "{video_URL}",  # replace with your YouTube video URL
                        "format": "video/mp4",
                    },
                },
                {"type": "text", "text": "Please summarize this video."},
            ],
        }
    ],
    max_tokens=1000,
    vertex_credentials=vertex_credentials_json,
    vertex_project=os.getenv("GOOGLE_VERTEX_PROJECT_ID"),
    vertex_location=os.getenv("GOOGLE_VERTEX_LOCATION"),
)
print(response.choices[0].message.content)
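Note that the file part comes before the text part, matching the single-video recommendation above. To line up with the pricing section, you could presumably swap the model string for vertex_ai/gemini-2.5-flash or vertex_ai/gemini-2.5-flash-lite, assuming those models are available in your region.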
AI SDK
Using Vercel's AI SDK, in your typical message content block, you can simply pass the video URL (videoUrl) as the data and set the mimeType to video/mp4. You must first install the SDK and the Google Vertex provider with npm install ai @ai-sdk/google-vertex.
import { createVertex } from '@ai-sdk/google-vertex';
import { generateText } from 'ai';

// Set Vertex AI credentials from the environment variables set up earlier
const GOOGLE_APPLICATION_CREDENTIALS = process.env.GOOGLE_APPLICATION_CREDENTIALS;
const GOOGLE_VERTEX_PROJECT_ID = process.env.GOOGLE_VERTEX_PROJECT_ID;
const GOOGLE_VERTEX_LOCATION = process.env.GOOGLE_VERTEX_LOCATION;

const vertex = createVertex({
  project: GOOGLE_VERTEX_PROJECT_ID,
  location: GOOGLE_VERTEX_LOCATION,
  googleAuthOptions: {
    keyFilename: GOOGLE_APPLICATION_CREDENTIALS,
  },
});

// Your YouTube video URL
const videoUrl = 'https://youtube.com/shorts/abcde';

// AI SDK generateText call: the video file part goes before the text part
(async () => {
  const { text: result } = await generateText({
    model: vertex('gemini-2.0-flash-001'),
    messages: [
      {
        role: 'user',
        content: [
          {
            type: 'file',
            data: videoUrl,
            mimeType: 'video/mp4',
          },
          {
            type: 'text',
            text: 'What is happening in this video?',
          },
        ],
      },
    ],
  });
  console.log(result);
})();
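To try the snippet, save it as, say, video.ts (a file name I've made up) and run it with npx tsx video.ts, assuming a recent Node.js and the environment variables above.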
Here is the full working codebase for the prototype grievance collection chatbot we created, and the pull request that added the video understanding assistant to the chatbot.
A big, big thanks to the AskJunior team for setting up the scaffolding and teaching me a bunch of stuff, to the Agami team for organizing the event so thoughtfully, and to Video Volunteers for the impactful work they do.