
[Self-Host] Screenshots are not supported #1028


Open
Vvegetables opened this issue Dec 31, 2024 · 16 comments · May be fixed by #1372

Comments

@Vvegetables

To Reproduce
Steps to reproduce the issue:

First, keep the system config at its defaults.

Then run this code:

import requests

resp = requests.post(
    "http://127.0.0.1:3002/v1/scrape",
    json={
        "url": "https://www.gov.cn/zhengce/zhengceku/2022-08/12/content_5705154.htm",
        "formats": ["html", "screenshot"],
        "headers": {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0"
        },
    },
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
print(resp.json())

Client error log:

{'success': False, 'error': "(Internal server error) - All scraping engines failed! -- Double check the URL to make sure it's not broken. If the issue persists, contact us at [email protected]."}

Server error log: (attached as an image in the original issue)

Environment:

  • OS: Windows 11
  • Firecrawl Version: main branch
  • Node.js Version: 20-slim
  • Docker Version: 27.3.1
@utopia2077

Same issue; it seems to be caused by 'screenshot'.

@watzon

watzon commented Feb 1, 2025

Including a list of actions also causes it to fail. Here are the logs from my Docker Compose service:

2025-02-01 23:51:26 info [queue-worker:processJob]: 🐂 Worker taking job 86663101-8827-4734-8d61-3ad2e111bcfa 
2025-02-01 23:51:26 info [ScrapeURL:]: Scraping URL "http://google.com"... 
2025-02-01 23:51:26 debug [ScrapeURL:]: Engine playwright does not meet feature priority threshold 
2025-02-01 23:51:26 debug [ScrapeURL:]: Engine fetch does not meet feature priority threshold 
2025-02-01 23:51:26 debug [ScrapeURL:]: Engine pdf does not meet feature priority threshold 
2025-02-01 23:51:26 debug [ScrapeURL:]: Engine docx does not meet feature priority threshold 
2025-02-01 23:51:26 warn [ScrapeURL:]: scrapeURL: All scraping engines failed! {"module":"ScrapeURL","scrapeId":"86663101-8827-4734-8d61-3ad2e111bcfa","scrapeURL":"http://google.com","error":{"fallbackList":[],"results":{},"name":"Error","message":"All scraping engines failed! -- Double check the URL to make sure it's not broken. If the issue persists, contact us at [email protected].","stack":"Error: All scraping engines failed! -- Double check the URL to make sure it's not broken. If the issue persists, contact us at [email protected].\n    at scrapeURLLoop (/app/dist/src/scraper/scrapeURL/index.js:224:15)\n    at scrapeURL (/app/dist/src/scraper/scrapeURL/index.js:262:30)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async runWebScraper (/app/dist/src/main/runWebScraper.js:66:24)\n    at async startWebScraperPipeline (/app/dist/src/main/runWebScraper.js:11:12)\n    at async processJob (/app/dist/src/services/queue-worker.js:593:26)\n    at async processJobInternal (/app/dist/src/services/queue-worker.js:197:28)"}}
2025-02-01 23:51:26 error [queue-worker:processJob]: 🐂 Job errored 86663101-8827-4734-8d61-3ad2e111bcfa - Error: All scraping engines failed! -- Double check the URL to make sure it's not broken. If the issue persists, contact us at [email protected]. {"module":"queue-worker","method":"processJob","jobId":"86663101-8827-4734-8d61-3ad2e111bcfa","scrapeId":"86663101-8827-4734-8d61-3ad2e111bcfa","teamId":"bypass","error":{"fallbackList":[],"results":{},"name":"Error","message":"All scraping engines failed! -- Double check the URL to make sure it's not broken. If the issue persists, contact us at [email protected].","stack":"Error: All scraping engines failed! -- Double check the URL to make sure it's not broken. If the issue persists, contact us at [email protected].\n    at scrapeURLLoop (/app/dist/src/scraper/scrapeURL/index.js:224:15)\n    at scrapeURL (/app/dist/src/scraper/scrapeURL/index.js:262:30)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async runWebScraper (/app/dist/src/main/runWebScraper.js:66:24)\n    at async startWebScraperPipeline (/app/dist/src/main/runWebScraper.js:11:12)\n    at async processJob (/app/dist/src/services/queue-worker.js:593:26)\n    at async processJobInternal (/app/dist/src/services/queue-worker.js:197:28)"}}
2025-02-01 23:51:26 error [queue-worker:processJob]: All scraping engines failed! -- Double check the URL to make sure it's not broken. If the issue persists, contact us at [email protected]. {"module":"queue-worker","method":"processJob","jobId":"86663101-8827-4734-8d61-3ad2e111bcfa","scrapeId":"86663101-8827-4734-8d61-3ad2e111bcfa","teamId":"bypass","fallbackList":[],"results":{},"stack":"Error: All scraping engines failed! -- Double check the URL to make sure it's not broken. If the issue persists, contact us at [email protected].\n    at scrapeURLLoop (/app/dist/src/scraper/scrapeURL/index.js:224:15)\n    at scrapeURL (/app/dist/src/scraper/scrapeURL/index.js:262:30)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async runWebScraper (/app/dist/src/main/runWebScraper.js:66:24)\n    at async startWebScraperPipeline (/app/dist/src/main/runWebScraper.js:11:12)\n    at async processJob (/app/dist/src/services/queue-worker.js:593:26)\n    at async processJobInternal (/app/dist/src/services/queue-worker.js:197:28)"}
2025-02-01 23:51:26 error [queue-worker:processJob]: Error: All scraping engines failed! -- Double check the URL to make sure it's not broken. If the issue persists, contact us at [email protected].
    at scrapeURLLoop (/app/dist/src/scraper/scrapeURL/index.js:224:15)
    at scrapeURL (/app/dist/src/scraper/scrapeURL/index.js:262:30)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async runWebScraper (/app/dist/src/main/runWebScraper.js:66:24)
    at async startWebScraperPipeline (/app/dist/src/main/runWebScraper.js:11:12)
    at async processJob (/app/dist/src/services/queue-worker.js:593:26)

My guess would be that the issue stems from here:
https://github.com/mendableai/firecrawl/blob/e0c292f8476be21ee50281c0c1794fd5b2521833/apps/api/src/scraper/scrapeURL/engines/index.ts#L219C7-L219C25

Playwright technically should support screenshots, but maybe it requires additional configuration?
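
For anyone following along, the "does not meet feature priority threshold" messages appear to come from engine selection: an engine is only eligible when it supports every feature the request asks for, and none of the self-hosted engines advertise screenshot (or actions/location) support. Below is a simplified sketch of that kind of filter; the names are illustrative only and are not Firecrawl's actual identifiers.

// Illustrative sketch only: hypothetical names, not Firecrawl's actual code.
type Feature = "html" | "markdown" | "screenshot" | "actions" | "location";

interface Engine {
  name: string;
  supportedFeatures: Set<Feature>;
}

// An engine is only eligible if it supports every requested feature.
function eligibleEngines(engines: Engine[], requested: Feature[]): Engine[] {
  return engines.filter((engine) =>
    requested.every((feature) => engine.supportedFeatures.has(feature)),
  );
}

const selfHostedEngines: Engine[] = [
  { name: "playwright", supportedFeatures: new Set<Feature>(["html", "markdown"]) },
  { name: "fetch", supportedFeatures: new Set<Feature>(["html", "markdown"]) },
];

// With "screenshot" requested, no engine qualifies, the fallback list stays
// empty, and the scrape fails with "All scraping engines failed!".
console.log(eligibleEngines(selfHostedEngines, ["html", "screenshot"])); // []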

@b0o

b0o commented Feb 10, 2025

Same issue here; has anyone found a workaround?

2025-02-10 23:06:51 info [ScrapeURL:]: Scraping URL "https://<redacted>"...
2025-02-10 23:06:51 debug [ScrapeURL:]: Engine playwright does not meet feature priority threshold
2025-02-10 23:06:51 debug [ScrapeURL:]: Engine fetch does not meet feature priority threshold
2025-02-10 23:06:51 debug [ScrapeURL:]: Engine pdf does not meet feature priority threshold
2025-02-10 23:06:51 debug [ScrapeURL:]: Engine docx does not meet feature priority threshold
2025-02-10 23:06:51 warn [ScrapeURL:]: scrapeURL: All scraping engines failed! {"module":"ScrapeURL","scrapeId":"c68e55f2-385d-46c7-ba20-ac1d8c9d6a80","scrapeURL":"https://<redacted>","error":{"fallbackList":[],"results":{},"name":"Error","message":"All scraping engines failed! -- Double check the URL to make sure it's not broken. If the issue persists, contact us at [email protected].","stack":"Error: All scraping engines failed! -- Double check the URL to make sure it's not broken. If the issue persists, contact us at [email protected].\n    at scrapeURLLoop (/app/dist/src/scraper/scrapeURL/index.js:224:15)\n    at scrapeURL (/app/dist/src/scraper/scrapeURL/index.js:262:30)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async runWebScraper (/app/dist/src/main/runWebScraper.js:66:24)\n    at async startWebScraperPipeline (/app/dist/src/main/runWebScraper.js:11:12)\n    at async processJob (/app/dist/src/services/queue-worker.js:593:26)\n    at async processJobInternal (/app/dist/src/services/queue-worker.js:197:28)"}}

@b0o

b0o commented Feb 10, 2025

After some more testing, it seems to work when called like this:

curl -X POST http://firecrawl-api:3002/v1/scrape \
-H 'Content-Type: application/json' \
-d '{
  "url": "https://example.com",
  "formats": ["html", "markdown"]
}'

But this fails:

curl -X POST http://firecrawl-api:3002/v1/scrape \
-H 'Content-Type: application/json' \
-d '{
  "url": "https://example.com",
  "formats": ["html", "markdown"],
  "location": {
    "country": "us",
    "languages": ["en"]
  }
}'
# {"success":false,"error":"(Internal server error) - All scraping engines failed! -- Double check the URL to make sure it's not broken. If the issue persists, contact us at [email protected]."}

@watzon

watzon commented Feb 11, 2025

After investigating the codebase, it looks like some things just aren't supported by the puppeteer engine, such as screenshots and, I guess, location. The options seem to be either to use the hosted Firecrawl service directly or to pay for ScrapingBee and use that engine.

@mogery
Member

mogery commented Feb 20, 2025

Hi there!

Here's a list of what options aren't supported by the self-hosted version:

  • Actions (not built into the playwright microservice yet)
  • Screenshots (this is a hard one -- where do we host the images in a self-hosted environment? all ears for any solutions to this)
  • location, proxy: "stealth" scrape options (requires proxy management, not built into playwright microservice)
  • mobile, blockAds scrape options (not hard fail, but doesn't do anything)

Some stuff that requires extra configuration:

  • Extract / "json" format (requires OPENAI_API_KEY)

And that should be about it! Now that we have a test suite running on the self-hosted version too, you can look at apps/api/src/__tests__/snips/*.test.ts as reference for what is and is not expected to run on self-host.

@watzon

watzon commented Feb 20, 2025

So for hosting screenshots, I would personally do it like this (this is without looking at the codebase, so I could be way off base here; a rough interface sketch follows the list):

  • Create a StorageDriver interface
  • Implement two storage drivers to start with: one for local storage and one for S3-compatible APIs
  • Expose driver options via environment variables
  • Add a proxy route for file fetching
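
For illustration, a rough sketch of what that StorageDriver abstraction could look like; every name here is hypothetical and nothing like it exists in the Firecrawl codebase today.

import { promises as fs } from "fs";
import path from "path";

// Hypothetical interface: store a blob under a key, resolve a key to a URL.
export interface StorageDriver {
  put(key: string, data: Buffer, contentType: string): Promise<void>;
  getPublicUrl(key: string): string;
}

// Minimal local-disk driver, e.g. backed by a mounted Docker volume.
// MEDIA_DIR / MEDIA_PUBLIC_URL are placeholders, not real Firecrawl config.
export class LocalStorageDriver implements StorageDriver {
  constructor(
    private baseDir = process.env.MEDIA_DIR ?? "/data/media",
    private baseUrl = process.env.MEDIA_PUBLIC_URL ?? "http://localhost:3002/v1/media",
  ) {}

  async put(key: string, data: Buffer, _contentType: string): Promise<void> {
    await fs.mkdir(this.baseDir, { recursive: true });
    await fs.writeFile(path.join(this.baseDir, key), data);
  }

  getPublicUrl(key: string): string {
    return `${this.baseUrl}/${encodeURIComponent(key)}`;
  }
}

An S3-compatible driver would implement the same interface with upload calls, and the environment variable (e.g. STORAGE_DRIVER=s3 or local) would pick which one gets instantiated.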

@mogery
Member

mogery commented Feb 20, 2025

That's a good idea. I'm thinking about whether we can roll how we do it in prod together with this, so there wouldn't be too much code divergence. Local storage is tough since most self-hosted environments are dockerized; we would probably need to do static file serving on the playwright service, which means the playwright service would need to be exposed in order to access the screenshots.
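
Purely for illustration, static serving of a screenshots directory is only a few lines with something like Express; the directory, route, and port below are placeholders, and this is not how the playwright service is currently built.

import express from "express";

const app = express();

// Serve screenshots written to a mounted volume, read-only over HTTP.
app.use("/screenshots", express.static("/data/screenshots", { fallthrough: false }));

app.listen(3003);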

@watzon

watzon commented Feb 20, 2025

Maybe avoiding a local storage option right now would be good. If people really want screenshots, I don't think asking them to use an S3-compatible service is too much to ask. For the self-hosted docker-compose file, you could even include MinIO by default to make things especially easy.

Then I'd probably add a route /v1/media/:filename as a proxy and have it look up the S3 object using the filename.

Granted, this also depends on how you guys manage file storage internally, but I'd be pretty surprised if you aren't using some form of object storage.

Lmk if you want any help with this and I'd be happy to look into making a PR myself.
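
As a sketch of that proxy idea (hypothetical route and env var names, assuming @aws-sdk/client-s3 against an S3-compatible store such as MinIO), it could look roughly like this:

import express from "express";
import { Readable } from "stream";
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";

// Placeholder configuration for an S3-compatible backend (e.g. a bundled MinIO).
const s3 = new S3Client({
  endpoint: process.env.MEDIA_S3_ENDPOINT, // e.g. http://minio:9000
  region: process.env.MEDIA_S3_REGION ?? "us-east-1",
  forcePathStyle: true,
  credentials: {
    accessKeyId: process.env.MEDIA_S3_ACCESS_KEY ?? "",
    secretAccessKey: process.env.MEDIA_S3_SECRET_KEY ?? "",
  },
});

const app = express();

// Proxy a stored object back to the client without exposing the bucket directly.
app.get("/v1/media/:filename", async (req, res) => {
  try {
    const object = await s3.send(
      new GetObjectCommand({
        Bucket: process.env.MEDIA_S3_BUCKET ?? "media",
        Key: req.params.filename,
      }),
    );
    if (object.ContentType) res.setHeader("Content-Type", object.ContentType);
    (object.Body as Readable).pipe(res);
  } catch {
    res.status(404).send("Not found");
  }
});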

@mogery mogery changed the title [Self-Host] when query self-host:3002/v1/scrape/ and request formats include screenshot would cause 500 error [Self-Host] Screenshots are not supported Feb 20, 2025
@mogery
Member

mogery commented Feb 20, 2025

Granted, this also depends on how you guys manage file storage internally, but I'd be pretty surprised if you aren't using some form of object storage.

On the prod side, screenshot management is handled either by fire-engine directly, or, if it's coming from a different engine (which usually gives us data URIs in the screenshot field), it's uploaded here in the scrapeURL mechanism (firecrawl/apps/api/src/scraper/scrapeURL/transformers/uploadScreenshot.ts, lines 9 to 31 in da46736):

if (
  process.env.USE_DB_AUTHENTICATION === "true" &&
  document.screenshot !== undefined &&
  document.screenshot.startsWith("data:")
) {
  meta.logger.debug("Uploading screenshot to Supabase...");

  const fileName = `screenshot-${crypto.randomUUID()}.png`;

  supabase_service.storage
    .from("media")
    .upload(
      fileName,
      Buffer.from(document.screenshot.split(",")[1], "base64"),
      {
        cacheControl: "3600",
        upsert: false,
        contentType: document.screenshot.split(":")[1].split(";")[0],
      },
    );

  document.screenshot = `https://service.firecrawl.dev/storage/v1/object/public/media/${encodeURIComponent(fileName)}`;
}

For the self-hosted docker-compose file, you could even include MinIO by default to make things especially easy.

This sounds great. Going to look into it!

Then I'd probably add a route /v1/media/:filename as a proxy and have it look up the S3 object using the filename.

Hmm... I would prefer to try to expose MinIO and point to it, but it might have to come to this.

Lmk if you want any help with this and I'd be happy to look into making a PR myself.

Would love a PR! Even if it's just a draft of the actual upload logic in the above file, I can connect the other bits up afterwards.
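
For anyone picking this up, here is a minimal sketch of what the upload half could look like against an S3-compatible bucket (MinIO included). It mirrors the data-URI handling in the Supabase snippet above, but the function and env var names are made up for the example.

import crypto from "crypto";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

// Placeholder configuration; not existing Firecrawl settings.
const s3 = new S3Client({
  endpoint: process.env.MEDIA_S3_ENDPOINT, // e.g. http://minio:9000
  region: process.env.MEDIA_S3_REGION ?? "us-east-1",
  forcePathStyle: true,
  credentials: {
    accessKeyId: process.env.MEDIA_S3_ACCESS_KEY ?? "",
    secretAccessKey: process.env.MEDIA_S3_SECRET_KEY ?? "",
  },
});

// Takes a "data:image/png;base64,..." URI (the shape the transformer already
// checks for) and returns a URL where the uploaded screenshot can be fetched.
export async function uploadScreenshot(dataUri: string): Promise<string> {
  const contentType = dataUri.split(":")[1].split(";")[0];
  const body = Buffer.from(dataUri.split(",")[1], "base64");
  const fileName = `screenshot-${crypto.randomUUID()}.png`;

  await s3.send(
    new PutObjectCommand({
      Bucket: process.env.MEDIA_S3_BUCKET ?? "media",
      Key: fileName,
      Body: body,
      ContentType: contentType,
    }),
  );

  // Whatever public base URL the deployment serves the bucket (or proxy route) at.
  return `${process.env.MEDIA_PUBLIC_URL}/${encodeURIComponent(fileName)}`;
}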

@KyTechInc


Based on the above state of the screenshot upload function, does it not work to just add Supabase ENVs and create a "media" bucket in your Supabase instance to receive a copy of the screenshot output?

Obviously this is not as inclusive a solution for the self-host case as running a simple MinIO container, but Firecrawl has already opened the door by having Supabase as part of the local Docker config for other purposes?

For my specific use case this would be perfect, since the scraped screenshots were going to end up in a bucket in our self-hosted Supabase stack anyway.

@mogery
Member

mogery commented Mar 5, 2025

If you have Supa set up, you should just be able to point Firecrawl at it and have it work fine.

@KyTechInc

If you have Supa set up, you should just be able to point Firecrawl at it and have it work fine.

I attempted this, but I think there are some missing steps or docs for this setup.

I set up my Supa anon key, URL, and service key, turned on DB auth, and set a TEST_API_KEY, but I just get 401 unauthorized errors from my Supa instance (local dev via the Supabase CLI).

TBH I didn't think this would work, because looking through the rest of the source code for the Supabase client logic, there seems to be schema/migration setup needed. And I don't see any migration actions happening in the Docker Compose logs when the ENVs are set up.

Am I missing something here?

@KyTechInc KyTechInc linked a pull request Mar 21, 2025 that will close this issue
@KyTechInc


FYI, created a draft PR for this: #1372

@wesselhuising

How about base64-encoding the result and returning it as part of the response?

@ocampoje17

  • location

I just removed "location" from the options, and it works. Thank you so much.
