Commit 5e6f8dc

Merge pull request #622 from hwchase17/nc/custom-pdfjs
Allow passing a custom pdfjs build
2 parents 7bc5f0f + 432567b commit 5e6f8dc

File tree

3 files changed, +36 -2 lines changed
  • docs/docs/modules/indexes/document_loaders/examples/file_loaders
  • langchain/src/document_loaders


docs/docs/modules/indexes/document_loaders/examples/file_loaders/pdf.md

Lines changed: 13 additions & 0 deletions
@@ -33,3 +33,16 @@ const loader = new PDFLoader("src/document_loaders/example_data/example.pdf", {
 
 const docs = await loader.load();
 ```
+
+# Usage, legacy environments
+
+In legacy environments, you can use the `pdfjs` option to provide a function that returns a promise that resolves to the `PDFJS` object. This is useful if you want to use a custom build of `pdfjs-dist` or if you want to use a different version of `pdfjs-dist`. Eg. here we use the legacy build of `pdfjs-dist`, which includes several polyfills that are not included in the default build.
+
+```typescript
+import { PDFLoader } from "langchain/document_loaders";
+
+const loader = new PDFLoader("src/document_loaders/example_data/example.pdf", {
+  pdfjs: () =>
+    import("pdfjs-dist/legacy/build/pdf.js").then((mod) => mod.default),
+});
+```
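
The documented example unwraps the dynamically imported module with `.then((mod) => mod.default)`. Depending on the bundler or runtime, a build may expose its API on `default`, as named exports, or both, so a small guard can make the unwrapping explicit. The following is only a sketch: `PdfjsLike`, `looksLikePdfjs`, and `unwrapDefault` are hypothetical names, not part of langchain or `pdfjs-dist`, and the shape check covers only the two members (`getDocument`, `version`) that the loader is seen to use in this commit.

```typescript
// Hypothetical minimal shape: only the members the loader destructures.
type PdfjsLike = {
  getDocument: (...args: unknown[]) => unknown;
  version: string;
};

// Type guard: true if the value exposes getDocument() and a version string.
function looksLikePdfjs(value: unknown): value is PdfjsLike {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v["getDocument"] === "function" && typeof v["version"] === "string"
  );
}

// Hypothetical helper: prefer the `default` export, fall back to the
// module namespace itself, and fail loudly if neither looks like pdfjs.
function unwrapDefault(mod: unknown): PdfjsLike {
  const inner = (mod as { default?: unknown } | null)?.default;
  if (looksLikePdfjs(inner)) return inner;
  if (looksLikePdfjs(mod)) return mod;
  throw new Error("module does not expose getDocument and version");
}
```

With a helper like this, the option could read `pdfjs: () => import("pdfjs-dist/legacy/build/pdf.js").then(unwrapDefault)`, although a cast to the type `PDFLoader` expects may still be needed, since `PdfjsLike` is narrower than the real module type.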

langchain/src/document_loaders/pdf.ts

Lines changed: 8 additions & 2 deletions
@@ -5,16 +5,22 @@ import { BufferLoader } from "./buffer.js";
 export class PDFLoader extends BufferLoader {
   private splitPages: boolean;
 
-  constructor(filePathOrBlob: string | Blob, { splitPages = true } = {}) {
+  private pdfjs: typeof PDFLoaderImports;
+
+  constructor(
+    filePathOrBlob: string | Blob,
+    { splitPages = true, pdfjs = PDFLoaderImports } = {}
+  ) {
     super(filePathOrBlob);
     this.splitPages = splitPages;
+    this.pdfjs = pdfjs;
   }
 
   public async parse(
     raw: Buffer,
     metadata: Document["metadata"]
   ): Promise<Document[]> {
-    const { getDocument, version } = await PDFLoaderImports();
+    const { getDocument, version } = await this.pdfjs();
     const pdf = await getDocument({
       data: new Uint8Array(raw.buffer),
       useWorkerFetch: false,
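
The change above relies on destructuring defaults: when the caller omits `pdfjs`, the built-in importer is stored, otherwise the supplied function replaces it, and the build is only resolved when `parse` awaits it. A minimal sketch of that pattern follows. `SketchLoader`, `defaultImports`, `PdfjsLike`, and `info` are hypothetical stand-ins; the body of the real `PDFLoaderImports` is not shown in this diff.

```typescript
// Stand-in for the object a pdfjs build exposes (assumed shape).
type PdfjsLike = {
  getDocument: (src: { data: Uint8Array }) => unknown;
  version: string;
};

// Hypothetical default importer, standing in for PDFLoaderImports.
async function defaultImports(): Promise<PdfjsLike> {
  return { getDocument: () => ({}), version: "default-build" };
}

class SketchLoader {
  private splitPages: boolean;

  private pdfjs: typeof defaultImports;

  // Same option pattern as the diff: each field has its own default,
  // and the whole options object defaults to {}.
  constructor({ splitPages = true, pdfjs = defaultImports } = {}) {
    this.splitPages = splitPages;
    this.pdfjs = pdfjs;
  }

  // The importer is invoked here, not in the constructor, so a custom
  // build is loaded only when it is actually needed.
  async info(): Promise<{ version: string; splitPages: boolean }> {
    const { version } = await this.pdfjs();
    return { version, splitPages: this.splitPages };
  }
}
```

Typing the field as `typeof defaultImports` (as the diff does with `typeof PDFLoaderImports`) means a custom function must resolve to a compatible object, which keeps the two destructured names available in `parse`.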

langchain/src/document_loaders/tests/pdf.test.ts

Lines changed: 15 additions & 0 deletions
@@ -26,3 +26,18 @@ test("Test PDF loader from file to single document", async () => {
   expect(docs.length).toBe(1);
   expect(docs[0].pageContent).toContain("Attention Is All You Need");
 });
+
+test("Test PDF loader from file using custom pdfjs", async () => {
+  const filePath = path.resolve(
+    path.dirname(url.fileURLToPath(import.meta.url)),
+    "./example_data/1706.03762.pdf"
+  );
+  const loader = new PDFLoader(filePath, {
+    pdfjs: () =>
+      import("pdfjs-dist/legacy/build/pdf.js").then((mod) => mod.default),
+  });
+  const docs = await loader.load();
+
+  expect(docs.length).toBe(15);
+  expect(docs[0].pageContent).toContain("Attention Is All You Need");
+});
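
The test above runs the real legacy build against a real PDF. The same injection point could, in principle, also accept a stub, so that page handling is exercised without `pdfjs-dist` or a fixture file. The sketch below illustrates the idea under stated assumptions: `stubPdfjs` and `extractPages` are hypothetical, `extractPages` is a simplified stand-in and not the loader's actual `parse`, and the type shapes model only the pdf.js members used here (`getDocument(...).promise`, `numPages`, `getPage`, `getTextContent`).

```typescript
// Minimal shapes modeled on the parts of the pdf.js API used below.
type TextItem = { str: string };
type Page = { getTextContent: () => Promise<{ items: TextItem[] }> };
type PdfDoc = { numPages: number; getPage: (n: number) => Promise<Page> };
type PdfjsLike = {
  getDocument: (src: { data: Uint8Array }) => { promise: Promise<PdfDoc> };
  version: string;
};

// Hypothetical stub: serves fixed page texts instead of parsing bytes.
function stubPdfjs(pages: string[]): () => Promise<PdfjsLike> {
  return async () => ({
    version: "stub",
    getDocument: () => ({
      promise: Promise.resolve({
        numPages: pages.length,
        // pdf.js page numbers are 1-based.
        getPage: async (n: number) => ({
          getTextContent: async () => ({
            items: [{ str: pages[n - 1] ?? "" }],
          }),
        }),
      }),
    }),
  });
}

// Hypothetical stand-in for a parse step: returns one string per page.
async function extractPages(
  pdfjs: () => Promise<PdfjsLike>,
  data: Uint8Array
): Promise<string[]> {
  const { getDocument } = await pdfjs();
  const pdf = await getDocument({ data }).promise;
  const out: string[] = [];
  for (let i = 1; i <= pdf.numPages; i += 1) {
    const page = await pdf.getPage(i);
    const content = await page.getTextContent();
    out.push(content.items.map((item) => item.str).join("\n"));
  }
  return out;
}
```

Passing such a stub to the real `PDFLoader` would probably require a type cast, because the option is typed as `typeof PDFLoaderImports`; the sketch only shows why a function-valued option makes this kind of substitution possible.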
