
[Self-Host] Screenshots are not supported #1028


Open
Vvegetables opened this issue Dec 31, 2024 · 16 comments · May be fixed by #1372

Comments

@Vvegetables

To Reproduce
Steps to reproduce the issue:

First, keep the system config at its defaults.

Then run this code:

import requests

resp = requests.post(
    "http://127.0.0.1:3002/v1/scrape",
    json={
        "url": "https://www.gov.cn/zhengce/zhengceku/2022-08/12/content_5705154.htm",
        "formats": ["html", "screenshot"],
        "headers": {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0"
        },
    },
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
print(resp.json())

Client error log:

{'success': False, 'error': "(Internal server error) - All scraping engines failed! -- Double check the URL to make sure it's not broken. If the issue persists, contact us at [email protected]."}

Server error log: (attached as an image in the original issue)

Environment:

  • OS: Windows 11
  • Firecrawl Version: main branch
  • Node.js Version: 20-slim
  • Docker Version: 27.3.1
@utopia2077

Same issue; it seems to be caused by 'screenshot'.

@watzon

watzon commented Feb 1, 2025

Including a list of actions also causes it to fail. Here are the logs from my Docker Compose service:

2025-02-01 23:51:26 info [queue-worker:processJob]: 🐂 Worker taking job 86663101-8827-4734-8d61-3ad2e111bcfa 
2025-02-01 23:51:26 info [ScrapeURL:]: Scraping URL "http://google.com"... 
2025-02-01 23:51:26 debug [ScrapeURL:]: Engine playwright does not meet feature priority threshold 
2025-02-01 23:51:26 debug [ScrapeURL:]: Engine fetch does not meet feature priority threshold 
2025-02-01 23:51:26 debug [ScrapeURL:]: Engine pdf does not meet feature priority threshold 
2025-02-01 23:51:26 debug [ScrapeURL:]: Engine docx does not meet feature priority threshold 
2025-02-01 23:51:26 warn [ScrapeURL:]: scrapeURL: All scraping engines failed! {"module":"ScrapeURL","scrapeId":"86663101-8827-4734-8d61-3ad2e111bcfa","scrapeURL":"http://google.com","error":{"fallbackList":[],"results":{},"name":"Error","message":"All scraping engines failed! -- Double check the URL to make sure it's not broken. If the issue persists, contact us at [email protected].","stack":"Error: All scraping engines failed! -- Double check the URL to make sure it's not broken. If the issue persists, contact us at [email protected].\n    at scrapeURLLoop (/app/dist/src/scraper/scrapeURL/index.js:224:15)\n    at scrapeURL (/app/dist/src/scraper/scrapeURL/index.js:262:30)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async runWebScraper (/app/dist/src/main/runWebScraper.js:66:24)\n    at async startWebScraperPipeline (/app/dist/src/main/runWebScraper.js:11:12)\n    at async processJob (/app/dist/src/services/queue-worker.js:593:26)\n    at async processJobInternal (/app/dist/src/services/queue-worker.js:197:28)"}}
2025-02-01 23:51:26 error [queue-worker:processJob]: 🐂 Job errored 86663101-8827-4734-8d61-3ad2e111bcfa - Error: All scraping engines failed! -- Double check the URL to make sure it's not broken. If the issue persists, contact us at [email protected]. {"module":"queue-worker","method":"processJob","jobId":"86663101-8827-4734-8d61-3ad2e111bcfa","scrapeId":"86663101-8827-4734-8d61-3ad2e111bcfa","teamId":"bypass","error":{"fallbackList":[],"results":{},"name":"Error","message":"All scraping engines failed! -- Double check the URL to make sure it's not broken. If the issue persists, contact us at [email protected].","stack":"Error: All scraping engines failed! -- Double check the URL to make sure it's not broken. If the issue persists, contact us at [email protected].\n    at scrapeURLLoop (/app/dist/src/scraper/scrapeURL/index.js:224:15)\n    at scrapeURL (/app/dist/src/scraper/scrapeURL/index.js:262:30)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async runWebScraper (/app/dist/src/main/runWebScraper.js:66:24)\n    at async startWebScraperPipeline (/app/dist/src/main/runWebScraper.js:11:12)\n    at async processJob (/app/dist/src/services/queue-worker.js:593:26)\n    at async processJobInternal (/app/dist/src/services/queue-worker.js:197:28)"}}
2025-02-01 23:51:26 error [queue-worker:processJob]: All scraping engines failed! -- Double check the URL to make sure it's not broken. If the issue persists, contact us at [email protected]. {"module":"queue-worker","method":"processJob","jobId":"86663101-8827-4734-8d61-3ad2e111bcfa","scrapeId":"86663101-8827-4734-8d61-3ad2e111bcfa","teamId":"bypass","fallbackList":[],"results":{},"stack":"Error: All scraping engines failed! -- Double check the URL to make sure it's not broken. If the issue persists, contact us at [email protected].\n    at scrapeURLLoop (/app/dist/src/scraper/scrapeURL/index.js:224:15)\n    at scrapeURL (/app/dist/src/scraper/scrapeURL/index.js:262:30)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async runWebScraper (/app/dist/src/main/runWebScraper.js:66:24)\n    at async startWebScraperPipeline (/app/dist/src/main/runWebScraper.js:11:12)\n    at async processJob (/app/dist/src/services/queue-worker.js:593:26)\n    at async processJobInternal (/app/dist/src/services/queue-worker.js:197:28)"}
2025-02-01 23:51:26 error [queue-worker:processJob]: Error: All scraping engines failed! -- Double check the URL to make sure it's not broken. If the issue persists, contact us at [email protected].
    at scrapeURLLoop (/app/dist/src/scraper/scrapeURL/index.js:224:15)
    at scrapeURL (/app/dist/src/scraper/scrapeURL/index.js:262:30)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async runWebScraper (/app/dist/src/main/runWebScraper.js:66:24)
    at async startWebScraperPipeline (/app/dist/src/main/runWebScraper.js:11:12)
    at async processJob (/app/dist/src/services/queue-worker.js:593:26)

My guess would be that the issue stems from here:
https://github.com/mendableai/firecrawl/blob/e0c292f8476be21ee50281c0c1794fd5b2521833/apps/api/src/scraper/scrapeURL/engines/index.ts#L219C7-L219C25

Playwright technically should support screenshots, but maybe it requires additional configuration?
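
For anyone following along, the "does not meet feature priority threshold" messages appear to come from engine selection: an engine is only eligible when it supports every feature the request asks for, and none of the self-hosted engines advertise screenshot (or actions/location) support. Below is a simplified sketch of that kind of filter; the names are illustrative only and are not Firecrawl's actual identifiers.

// Illustrative sketch only: hypothetical names, not Firecrawl's actual code.
type Feature = "html" | "markdown" | "screenshot" | "actions" | "location";

interface Engine {
  name: string;
  supportedFeatures: Set<Feature>;
}

// An engine is only eligible if it supports every requested feature.
function eligibleEngines(engines: Engine[], requested: Feature[]): Engine[] {
  return engines.filter((engine) =>
    requested.every((feature) => engine.supportedFeatures.has(feature)),
  );
}

const selfHostedEngines: Engine[] = [
  { name: "playwright", supportedFeatures: new Set<Feature>(["html", "markdown"]) },
  { name: "fetch", supportedFeatures: new Set<Feature>(["html", "markdown"]) },
];

// With "screenshot" requested, no engine qualifies, the fallback list stays
// empty, and the scrape fails with "All scraping engines failed!".
console.log(eligibleEngines(selfHostedEngines, ["html", "screenshot"])); // []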

@b0o

b0o commented Feb 10, 2025

Same issue here; has anyone found a workaround?

2025-02-10 23:06:51 info [ScrapeURL:]: Scraping URL "https://<redacted>"...
2025-02-10 23:06:51 debug [ScrapeURL:]: Engine playwright does not meet feature priority threshold
2025-02-10 23:06:51 debug [ScrapeURL:]: Engine fetch does not meet feature priority threshold
2025-02-10 23:06:51 debug [ScrapeURL:]: Engine pdf does not meet feature priority threshold
2025-02-10 23:06:51 debug [ScrapeURL:]: Engine docx does not meet feature priority threshold
2025-02-10 23:06:51 warn [ScrapeURL:]: scrapeURL: All scraping engines failed! {"module":"ScrapeURL","scrapeId":"c68e55f2-385d-46c7-ba20-ac1d8c9d6a80","scrapeURL":"https://<redacted>","error":{"fallbackList":[],"results":{},"name":"Error","message":"All scraping engines failed! -- Double check the URL to make sure it's not broken. If the issue persists, contact us at [email protected].","stack":"Error: All scraping engines failed! -- Double check the URL to make sure it's not broken. If the issue persists, contact us at [email protected].\n    at scrapeURLLoop (/app/dist/src/scraper/scrapeURL/index.js:224:15)\n    at scrapeURL (/app/dist/src/scraper/scrapeURL/index.js:262:30)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async runWebScraper (/app/dist/src/main/runWebScraper.js:66:24)\n    at async startWebScraperPipeline (/app/dist/src/main/runWebScraper.js:11:12)\n    at async processJob (/app/dist/src/services/queue-worker.js:593:26)\n    at async processJobInternal (/app/dist/src/services/queue-worker.js:197:28)"}}

@b0o

b0o commented Feb 10, 2025

After some more testing, it seems to work when called like this:

curl -X POST http://firecrawl-api:3002/v1/scrape \
-H 'Content-Type: application/json' \
-d '{
  "url": "https://example.com",
  "formats": ["html", "markdown"]
}'

But this fails:

curl -X POST http://firecrawl-api:3002/v1/scrape \
-H 'Content-Type: application/json' \
-d '{
  "url": "https://example.com",
  "formats": ["html", "markdown"],
  "location": {
    "country": "us",
    "languages": ["en"]
  }
}'
# {"success":false,"error":"(Internal server error) - All scraping engines failed! -- Double check the URL to make sure it's not broken. If the issue persists, contact us at [email protected]."}

@watzon

watzon commented Feb 11, 2025

After investigating the codebase, it looks like some things just aren't supported by the puppeteer engine, such as screenshots and, I guess, location. The options seem to be either to use the hosted Firecrawl service directly or to pay for ScrapingBee and use that engine.

@mogery
Member

mogery commented Feb 20, 2025

Hi there!

Here's a list of what options aren't supported by the self-hosted version:

  • Actions (not built into the playwright microservice yet)
  • Screenshots (this is a hard one -- where do we host the images in a self-hosted environment? all ears for any solutions to this)
  • location, proxy: "stealth" scrape options (requires proxy management, not built into playwright microservice)
  • mobile, blockAds scrape options (not hard fail, but doesn't do anything)

Some stuff that requires extra configuration:

  • Extract / "json" format (requires OPENAI_API_KEY)

And that should be about it! Now that we have a test suite running on the self-hosted version too, you can look at apps/api/src/__tests__/snips/*.test.ts as reference for what is and is not expected to run on self-host.

@watzon

watzon commented Feb 20, 2025

So for hosting screenshots, I would personally do it like this (this is without looking at the codebase, so I could be way off base here; a rough interface sketch follows the list):

  • Create a StorageDriver interface
  • Implement two storage drivers to start with: one for local storage and one for S3-compatible APIs
  • Expose driver options via environment variables
  • Add a proxy route for file fetching
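
For illustration, a rough sketch of what that StorageDriver abstraction could look like; every name here is hypothetical and nothing like it exists in the Firecrawl codebase today.

import { promises as fs } from "fs";
import path from "path";

// Hypothetical interface: store a blob under a key, resolve a key to a URL.
export interface StorageDriver {
  put(key: string, data: Buffer, contentType: string): Promise<void>;
  getPublicUrl(key: string): string;
}

// Minimal local-disk driver, e.g. backed by a mounted Docker volume.
// MEDIA_DIR / MEDIA_PUBLIC_URL are placeholders, not real Firecrawl config.
export class LocalStorageDriver implements StorageDriver {
  constructor(
    private baseDir = process.env.MEDIA_DIR ?? "/data/media",
    private baseUrl = process.env.MEDIA_PUBLIC_URL ?? "http://localhost:3002/v1/media",
  ) {}

  async put(key: string, data: Buffer, _contentType: string): Promise<void> {
    await fs.mkdir(this.baseDir, { recursive: true });
    await fs.writeFile(path.join(this.baseDir, key), data);
  }

  getPublicUrl(key: string): string {
    return `${this.baseUrl}/${encodeURIComponent(key)}`;
  }
}

An S3-compatible driver would implement the same interface with upload calls, and the environment variable (e.g. STORAGE_DRIVER=s3 or local) would pick which one gets instantiated.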

@mogery
Member

mogery commented Feb 20, 2025

That's a good idea. I'm thinking about whether we can roll how we do it in prod together with this, so there wouldn't be too much code divergence. Local storage is tough since most self-hosted environments are dockerized; we would probably need to do static file serving on the playwright service, which means the playwright service would need to be exposed in order to access the screenshots.
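
Purely for illustration, static serving of a screenshots directory is only a few lines with something like Express; the directory, route, and port below are placeholders, and this is not how the playwright service is currently built.

import express from "express";

const app = express();

// Serve screenshots written to a mounted volume, read-only over HTTP.
app.use("/screenshots", express.static("/data/screenshots", { fallthrough: false }));

app.listen(3003);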

@watzon

watzon commented Feb 20, 2025

Maybe avoiding a local storage option right now would be good. If people really want screenshots, I don't think asking them to use an S3-compatible service is too much to ask. For the self-hosted docker-compose file, you could even include MinIO by default to make things especially easy.

Then I'd probably add a route /v1/media/:filename as a proxy and have it look up the S3 object using the filename.

Granted, this also depends on how you guys manage file storage internally, but I'd be pretty surprised if you aren't using some form of object storage.

Lmk if you want any help with this and I'd be happy to look into making a PR myself.
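
As a sketch of that proxy idea (hypothetical route and env var names, assuming @aws-sdk/client-s3 against an S3-compatible store such as MinIO), it could look roughly like this:

import express from "express";
import { Readable } from "stream";
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";

// Placeholder configuration for an S3-compatible backend (e.g. a bundled MinIO).
const s3 = new S3Client({
  endpoint: process.env.MEDIA_S3_ENDPOINT, // e.g. http://minio:9000
  region: process.env.MEDIA_S3_REGION ?? "us-east-1",
  forcePathStyle: true,
  credentials: {
    accessKeyId: process.env.MEDIA_S3_ACCESS_KEY ?? "",
    secretAccessKey: process.env.MEDIA_S3_SECRET_KEY ?? "",
  },
});

const app = express();

// Proxy a stored object back to the client without exposing the bucket directly.
app.get("/v1/media/:filename", async (req, res) => {
  try {
    const object = await s3.send(
      new GetObjectCommand({
        Bucket: process.env.MEDIA_S3_BUCKET ?? "media",
        Key: req.params.filename,
      }),
    );
    if (object.ContentType) res.setHeader("Content-Type", object.ContentType);
    (object.Body as Readable).pipe(res);
  } catch {
    res.status(404).send("Not found");
  }
});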

@mogery mogery changed the title [Self-Host] when query self-host:3002/v1/scrape/ and request formats include screenshot would cause 500 error [Self-Host] Screenshots are not supported Feb 20, 2025
@mogery
Member

mogery commented Feb 20, 2025

Granted, this also depends on how you guys manage file storage internally, but I'd be pretty surprised if you aren't using some form of object storage.

On the prod side, screenshot management is handled either by fire-engine directly, or, if it's coming from a different engine (which usually gives us data URIs in the screenshot field), it's uploaded here in the scrapeURL mechanism (firecrawl/apps/api/src/scraper/scrapeURL/transformers/uploadScreenshot.ts, lines 9 to 31 in da46736):

if (
  process.env.USE_DB_AUTHENTICATION === "true" &&
  document.screenshot !== undefined &&
  document.screenshot.startsWith("data:")
) {
  meta.logger.debug("Uploading screenshot to Supabase...");

  const fileName = `screenshot-${crypto.randomUUID()}.png`;

  supabase_service.storage
    .from("media")
    .upload(
      fileName,
      Buffer.from(document.screenshot.split(",")[1], "base64"),
      {
        cacheControl: "3600",
        upsert: false,
        contentType: document.screenshot.split(":")[1].split(";")[0],
      },
    );

  document.screenshot = `https://service.firecrawl.dev/storage/v1/object/public/media/${encodeURIComponent(fileName)}`;
}

For the self-hosted docker-compose file, you could even include MinIO by default to make things especially easy.

This sounds great. Going to look into it!

Then I'd probably add a route /v1/media/:filename as a proxy and have it look up the S3 object using the filename.

Hmm... I would prefer to try to expose MinIO and point to it, but it might have to come to this.

Lmk if you want any help with this and I'd be happy to look into making a PR myself.

Would love a PR! Even if it's just a draft of the actual upload logic in the above file, I can connect the other bits up afterwards.
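
For anyone picking this up, here is a minimal sketch of what the upload half could look like against an S3-compatible bucket (MinIO included). It mirrors the data-URI handling in the Supabase snippet above, but the function and env var names are made up for the example.

import crypto from "crypto";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

// Placeholder configuration; not existing Firecrawl settings.
const s3 = new S3Client({
  endpoint: process.env.MEDIA_S3_ENDPOINT, // e.g. http://minio:9000
  region: process.env.MEDIA_S3_REGION ?? "us-east-1",
  forcePathStyle: true,
  credentials: {
    accessKeyId: process.env.MEDIA_S3_ACCESS_KEY ?? "",
    secretAccessKey: process.env.MEDIA_S3_SECRET_KEY ?? "",
  },
});

// Takes a "data:image/png;base64,..." URI (the shape the transformer already
// checks for) and returns a URL where the uploaded screenshot can be fetched.
export async function uploadScreenshot(dataUri: string): Promise<string> {
  const contentType = dataUri.split(":")[1].split(";")[0];
  const body = Buffer.from(dataUri.split(",")[1], "base64");
  const fileName = `screenshot-${crypto.randomUUID()}.png`;

  await s3.send(
    new PutObjectCommand({
      Bucket: process.env.MEDIA_S3_BUCKET ?? "media",
      Key: fileName,
      Body: body,
      ContentType: contentType,
    }),
  );

  // Whatever public base URL the deployment serves the bucket (or proxy route) at.
  return `${process.env.MEDIA_PUBLIC_URL}/${encodeURIComponent(fileName)}`;
}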

@KyTechInc


Based on the above state of the screenshot upload function, does it not work to just add Supabase ENVs and create a "media" bucket in your Supabase instance to receive a copy of the screenshot output?

Obviously this is not as inclusive a solution for the self-host case as running a simple MinIO container, but Firecrawl has already opened the door by having Supabase as part of the local Docker config for other purposes?

For my specific use case this would be perfect, since the scraped screenshots were going to end up in a bucket in our self-hosted Supabase stack anyway.

@mogery
Member

mogery commented Mar 5, 2025

If you have Supa set up, you should just be able to point Firecrawl at it and have it work fine.

@KyTechInc

If you have Supa set up, you should just be able to point Firecrawl at it and have it work fine.

I attempted this, but I think there are some missing steps or docs for this setup.

I set up my Supa anon key, URL, and service key, turned on DB auth, and set a TEST_API_KEY, but I just get 401 unauthorized errors from my Supa instance (local dev via the Supabase CLI).

TBH I didn't think this would work, because looking through the rest of the source code for the Supabase client logic, there seems to be schema/migration setup needed. And I don't see any migration actions happening in the Docker Compose logs when the ENVs are set up.

Am I missing something here?

@KyTechInc KyTechInc linked a pull request Mar 21, 2025 that will close this issue
@KyTechInc


FYI, created a draft PR for this: #1372

@wesselhuising

How about base64-encoding the result and returning it as part of the response?

@ocampoje17

  • location

I just removed "location" from the options, and it works. Thank you so much.
