core[minor],google-common[minor]: Add support for generic objects in prompts, gemini audio/video docs #5043
Changes from 14 commits
@@ -0,0 +1,56 @@

# Audio/Video Structured Extraction

Google's Gemini API supports audio and video input, along with function calling. We can pair these API features to extract structured data from audio or video input.

In the following examples, we'll demonstrate how to read MP3 and MP4 files, send them to the Gemini API, and receive structured output in response.

## Setup

These examples use the Gemini API, so you'll need Google Vertex AI credentials (a credentials file, or a stringified credentials file if you're in a web environment):

```bash
GOOGLE_APPLICATION_CREDENTIALS="credentials.json"
```

Next, install the `@langchain/google-vertexai` and `@langchain/core` packages:

import IntegrationInstallTooltip from "@mdx_components/integration_install_tooltip.mdx";

<IntegrationInstallTooltip></IntegrationInstallTooltip>

```bash npm2yarn
npm install @langchain/google-vertexai @langchain/core
```

## Video

This example uses a [LangChain YouTube video on datasets and testing in LangSmith](https://www.youtube.com/watch?v=N9hjO-Uy1Vo), sped up to 1.5x. The video is converted to `base64` and sent to Gemini with a prompt asking for a structured list of tasks you can do to improve your knowledge of datasets and testing in LangSmith.
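Reading a local media file and base64-encoding it (as the example files in this PR do with `fs.readFileSync`) can be sketched as below; the temp file here is just a stand-in for a real MP3/MP4:

```typescript
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

// Read a local file and return its contents as a base64 string.
// Gemini's inline media parts expect base64-encoded data.
function fileToBase64(filePath: string): string {
  return fs.readFileSync(filePath, "base64");
}

// Quick demonstration with a throwaway file (stand-in for an MP4):
const tmpFile = path.join(os.tmpdir(), "demo-media.bin");
fs.writeFileSync(tmpFile, "hello");
const encoded = fileToBase64(tmpFile);
console.log(encoded); // "aGVsbG8=" (base64 of "hello")
```

Note that inlining large files as base64 grows the request payload by roughly a third, so this approach is best suited to short clips.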
We create a tool for this using Zod and pass it to the model via the `withStructuredOutput` method.

import CodeBlock from "@theme/CodeBlock";

import VideoExample from "@examples/use_cases/media/video.ts";

<CodeBlock language="typescript">{VideoExample}</CodeBlock>

## Audio

The next example loads an audio (MP3) file containing Mozart's Requiem in D minor and prompts Gemini to return an array of strings, each naming an instrument heard in the piece.

Here, we'll again use the `withStructuredOutput` method to get structured output from the model.

import AudioExample from "@examples/use_cases/media/audio.ts";

<CodeBlock language="typescript">{AudioExample}</CodeBlock>

A quick search shows the piece was composed for the following instruments:

```txt
The Requiem is scored for 2 basset horns in F, 2 bassoons, 2 trumpets in D, 3 trombones (alto, tenor, and bass),
timpani (2 drums), violins, viola, and basso continuo (cello, double bass, and organ).
```

Gemini did pretty well here! Even though music isn't its primary focus, it identified several of the instruments used in the piece and didn't hallucinate any!
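One way to make this kind of comparison concrete is a small set-intersection check between the model's predicted instrument list and the reference instrumentation; this sketch uses illustrative subsets of both lists rather than the full output:

```typescript
// Reference instrumentation (from the scoring above) vs. a predicted list.
const reference = new Set([
  "basset horn", "bassoon", "trumpet", "trombone", "timpani",
  "violin", "viola", "cello", "double bass", "organ",
]);
const predicted = ["violin", "viola", "cello", "double bass", "trumpet", "timpani"];

// Instruments the model got right vs. possible hallucinations.
const matched = predicted.filter((p) => reference.has(p));
const unmatched = predicted.filter((p) => !reference.has(p));
console.log(matched.length, unmatched.length); // 6 0
```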
> **Review comment:** Remove

> **Review comment:** Can we remove? It slows down git.
@@ -0,0 +1,67 @@

```typescript
import {
  ChatPromptTemplate,
  MessagesPlaceholder,
} from "@langchain/core/prompts";
import { ChatVertexAI } from "@langchain/google-vertexai";
import { HumanMessage } from "@langchain/core/messages";
import fs from "fs";
import { z } from "zod";

// Read a local file and return its contents as a base64-encoded string.
function fileToBase64(filePath: string): string {
  return fs.readFileSync(filePath, "base64");
}

const mozartMp3File = "Mozart_Requiem_D_minor.mp3";
const mozartInBase64 = fileToBase64(mozartMp3File);

// Zod schema describing the structured output we want back.
const tool = z.object({
  instruments: z
    .array(z.string())
    .describe("A list of instruments found in the audio."),
});

const model = new ChatVertexAI({
  model: "gemini-1.5-pro-preview-0409",
  temperature: 0,
}).withStructuredOutput(tool, {
  name: "instruments_list_tool",
});

const prompt = ChatPromptTemplate.fromMessages([
  new MessagesPlaceholder("audio"),
]);

const chain = prompt.pipe(model);
const response = await chain.invoke({
  audio: new HumanMessage({
    content: [
      {
        type: "media",
        mimeType: "audio/mp3",
        data: mozartInBase64,
      },
      {
        type: "text",
        text: `The following audio is a song by Mozart. Respond with a list of instruments you hear in the song.

Rules:
Use the "instruments_list_tool" to return a list of instruments.`,
      },
    ],
  }),
});

console.log("response", response);
/*
response {
  instruments: [
    'violin',   'viola',
    'cello',    'double bass',
    'flute',    'oboe',
    'clarinet', 'bassoon',
    'horn',     'trumpet',
    'timpani'
  ]
}
*/
```
@@ -0,0 +1,64 @@

```typescript
import {
  ChatPromptTemplate,
  MessagesPlaceholder,
} from "@langchain/core/prompts";
import { ChatVertexAI } from "@langchain/google-vertexai";
import { HumanMessage } from "@langchain/core/messages";
import fs from "fs";
import { z } from "zod";

// Read a local file and return its contents as a base64-encoded string.
function fileToBase64(filePath: string): string {
  return fs.readFileSync(filePath, "base64");
}

const lanceLsEvalsVideo = "lance_ls_eval_video.mp4";
const lanceInBase64 = fileToBase64(lanceLsEvalsVideo);

// Zod schema describing the structured output we want back.
const tool = z.object({
  tasks: z.array(z.string()).describe("A list of tasks."),
});

const model = new ChatVertexAI({
  model: "gemini-1.5-pro-preview-0409",
  temperature: 0,
}).withStructuredOutput(tool, {
  name: "tasks_list_tool",
});

const prompt = ChatPromptTemplate.fromMessages([
  new MessagesPlaceholder("video"),
]);

const chain = prompt.pipe(model);
const response = await chain.invoke({
  video: new HumanMessage({
    content: [
      {
        type: "media",
        mimeType: "video/mp4",
        data: lanceInBase64,
      },
      {
        type: "text",
        text: `The following video is an overview of how to build datasets in LangSmith.
Given the following video, come up with three tasks I should do to further improve my knowledge around using datasets in LangSmith.
Only reference features that were outlined or described in the video.

Rules:
Use the "tasks_list_tool" to return a list of tasks.
Your tasks should be tailored for an engineer who is looking to improve their knowledge around using datasets and evaluations, specifically with LangSmith.`,
      },
    ],
  }),
});

console.log("response", response);
/*
response {
  tasks: [
    'Explore the LangSmith SDK documentation for in-depth understanding of dataset creation, manipulation, and versioning functionalities.',
    'Experiment with different dataset types like Key-Value, Chat, and LLM to understand their structures and use cases.',
    'Try uploading a CSV file containing question-answer pairs to LangSmith and create a new dataset from it.'
  ]
}
*/
```
@@ -32,6 +32,18 @@ import type {

```diff
 } from "../types.js";
 import { GoogleAISafetyError } from "./safety.js";

+const extractMimeType = (
+  str: string
+): { mimeType: string; data: string } | null => {
+  if (str.startsWith("data:")) {
+    return {
+      mimeType: str.split(":")[1].split(";")[0],
+      data: str.split(",")[1],
+    };
+  }
+  return null;
+};
```
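The helper above only handles `data:` URLs of the shape `data:<mime-type>;base64,<payload>`, returning `null` for everything else. A standalone sketch of the same parsing logic, with illustrative inputs:

```typescript
// Parse a data URL like "data:audio/mp3;base64,SGVsbG8=" into its
// MIME type and base64 payload; return null for non-data URLs.
const extractMimeType = (
  str: string
): { mimeType: string; data: string } | null => {
  if (str.startsWith("data:")) {
    return {
      mimeType: str.split(":")[1].split(";")[0],
      data: str.split(",")[1],
    };
  }
  return null;
};

const parsed = extractMimeType("data:audio/mp3;base64,SGVsbG8=");
console.log(parsed); // { mimeType: 'audio/mp3', data: 'SGVsbG8=' }
console.log(extractMimeType("https://example.com/a.mp3")); // null
```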
```diff
 function messageContentText(
   content: MessageContentText
 ): GeminiPartText | null {
```

@@ -51,17 +63,14 @@ function messageContentImageUrl(

```diff
     typeof content.image_url === "string"
       ? content.image_url
       : content.image_url.url;

   if (!url) {
     throw new Error("Missing Image URL");
   }

-  if (url.startsWith("data:")) {
+  const mimeTypeAndData = extractMimeType(url);
+  if (mimeTypeAndData) {
     return {
-      inlineData: {
-        mimeType: url.split(":")[1].split(";")[0],
-        data: url.split(",")[1],
-      },
+      inlineData: mimeTypeAndData,
     };
   } else {
     // FIXME - need some way to get mime type
```

@@ -74,6 +83,29 @@ function messageContentImageUrl(

```diff
   }
 }

+function messageContentMedia(
+  // eslint-disable-next-line @typescript-eslint/no-explicit-any
+  content: Record<string, any>
+): GeminiPartInlineData | GeminiPartFileData {
+  if ("mimeType" in content && "data" in content) {
+    return {
+      inlineData: {
+        mimeType: content.mimeType,
+        data: content.data,
+      },
+    };
+  } else if ("mimeType" in content && "fileUri" in content) {
+    return {
+      fileData: {
+        mimeType: content.mimeType,
+        fileUri: content.fileUri,
+      },
+    };
+  }
+
+  throw new Error("Invalid media content");
+}
```

> **Review comment (pondering out loud):** This would then turn the … (And it also means that if we add more sophisticated file handling later, we only have to change it in one place.)
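The branching in `messageContentMedia` maps a media content block to either an inline-data part (`mimeType` + `data`) or a file-reference part (`mimeType` + `fileUri`). A minimal standalone sketch of that mapping, using simplified local types in place of the real `GeminiPartInlineData`/`GeminiPartFileData` from `../types.js`:

```typescript
// Simplified stand-ins for the Gemini part types used in the PR.
type InlinePart = { inlineData: { mimeType: string; data: string } };
type FilePart = { fileData: { mimeType: string; fileUri: string } };

// Mirror of the PR's branching: inline base64 data first, then file URIs.
function mediaToPart(content: Record<string, any>): InlinePart | FilePart {
  if ("mimeType" in content && "data" in content) {
    return { inlineData: { mimeType: content.mimeType, data: content.data } };
  }
  if ("mimeType" in content && "fileUri" in content) {
    return { fileData: { mimeType: content.mimeType, fileUri: content.fileUri } };
  }
  throw new Error("Invalid media content");
}

const inline = mediaToPart({ mimeType: "audio/mp3", data: "SGVsbG8=" });
const file = mediaToPart({ mimeType: "video/mp4", fileUri: "gs://bucket/video.mp4" });
console.log(inline, file);
```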
```diff
 export function messageContentToParts(content: MessageContent): GeminiPart[] {
   // Convert a string to a text type MessageContent if needed
   const messageContent: MessageContent =
```

@@ -101,6 +133,8 @@ export function messageContentToParts(content: MessageContent): GeminiPart[] {

```diff
       return messageContentImageUrl(content as MessageContentImageUrl);
     }
     break;
+  case "media":
+    return messageContentMedia(content);
   default:
     throw new Error(
       `Unsupported type received while converting message to message parts`
```

> **Review comment:** I wish I had seen https://github.com/langchain-ai/langchainjs/blame/fc2f9de2910a6728cf9c24f9146b55ba48d3790f/langchain-core/src/messages/index.ts#L56C69-L56C69 when it went in! To be honest, I'm a little anxious about defining `MessageContent` types with magic string values rather than real types, even if they're fundamentally `Record` types. It makes it a lot harder for other implementations to use consistent naming.
> **Review comment:** Love these examples! Perhaps add a note that you don't need to use structured output with audio and video, but it always helps to understand what the results can be.