
[Feature request] ability to set localStorage/sessionStorage w/o loading a page from target domain #3692


Closed
bluepeter opened this issue Dec 19, 2018 · 9 comments


@bluepeter

We run Puppeteer on AWS Lambda to orchestrate multi-step, multi-page crawl sessions on our SaaS Fluxguard (e.g., login, go to dashboard, go to page C). Due to time constraints on Lambda, and other reasons, each page is handled by its own Lambda execution in sequence. We save all browser state (cookies, localStorage, sessionStorage) in an object store for reuse by subsequent page crawls.
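As a rough illustration of the "save all browser state" step, a snapshot helper might look like the sketch below. The name `captureState` and the shape of the returned object are assumptions made for this example, not something the thread specifies. It relies on `page.cookies()`, `page.evaluate()` and `page.url()`; newer Puppeteer releases may prefer context-level cookie methods, so check the version you use.

```javascript
// Sketch: snapshot cookies and Web Storage from an already-loaded Puppeteer page.
// `captureState` and the snapshot shape are hypothetical, not from the thread.
async function captureState(page) {
  // Cookies visible to the page's current URL
  const cookies = await page.cookies();

  // Runs inside the browser: copy both storage areas into plain objects
  const storage = await page.evaluate(() => {
    const dump = (area) => {
      const out = {};
      for (let i = 0; i < area.length; i++) {
        const key = area.key(i);
        out[key] = area.getItem(key);
      }
      return out;
    };
    return {
      localStorage: dump(window.localStorage),
      sessionStorage: dump(window.sessionStorage),
    };
  });

  // Record the origin so a later crawl knows where the data belongs
  return { origin: new URL(page.url()).origin, cookies, ...storage };
}
```

The result is plain data, so it can be passed through `JSON.stringify` before being written to whatever object store is in use.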

The problem arises when we want to re-use saved local/sessionStorage on subsequent crawls. We cannot set local or session storage w/o first loading a page from the target site via, e.g.:

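// Replace localStorage for whatever origin chromePage currently has loaded.
export const setLocalStorage = async (chromePage, newStorage = {}) =>
  await chromePage.evaluate(newStorage => {
    // Runs in the page context, so localStorage belongs to the loaded origin
    localStorage.clear();
    for (const key in newStorage) {
      localStorage.setItem(key, newStorage[key]);
    }
  }, newStorage);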

We initially loaded the target page twice: first so that we could set storage, and second, once storage was set, to properly load the page w/ appropriate state. However, this is troublesome, as the first load, regardless of whether we disable Javascript/etc, will often pollute the cookie/storage space with new data. It's also messy to have to load the page twice.

Currently, we try/catch loading "innocuous" pages of the target site, such as robots.txt and favicon.ico: we use these then to contextually set storage before loading the target page. This is "fine," but not ideal and introduces its own problems.

It would be great if Puppeteer could set storage generally or for a specific domain without the need to load a page from that domain first.

@aslushnikov
Contributor

It would be great if Puppeteer could set storage generally or for a specific domain without the need to load a page from that domain first.

@bluepeter why don't you use chrome profiles for this?

@bluepeter
Author

Thanks @aslushnikov! Not super familiar.

We aren't guaranteed that we're re-using the same Lambda container on subsequent crawls, so we need to export and re-import cookies + local/sessionStorage for each crawl.

Do you think that's possible with Chrome profiles?

@aslushnikov
Contributor

@bluepeter it should. What I mean is a two-step process:

  1. Launch Chrome with a custom userDataDir and set storage and cookies for the domains you want.
    All the data will be saved to the Chrome profile.
  2. For subsequent launches, use the same userDataDir so that the Chrome instances have all the cookies.
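
The two steps above could be sketched roughly as follows. `withProfile` is a hypothetical helper name, and the `launch` parameter is assumed to behave like `puppeteer.launch` (it is passed in so the sketch does not hard-code how Puppeteer is installed). `userDataDir` is the launch option that tells Chrome where to keep its profile.

```javascript
// Sketch: run a task against a browser that uses a persistent profile directory.
// `withProfile` is a hypothetical helper; `launch` should act like puppeteer.launch.
async function withProfile(launch, userDataDir, task) {
  // The same directory on a later launch brings back the saved cookies and storage
  const browser = await launch({ userDataDir });
  try {
    return await task(browser);
  } finally {
    // Close cleanly so Chrome has a chance to write profile data to disk
    await browser.close();
  }
}
```

A call might look like `withProfile((opts) => puppeteer.launch(opts), '/tmp/profile', async (browser) => { ... })`, where the path is only a placeholder. On Lambda the directory would still have to be copied to and from durable storage between invocations, which this sketch does not cover.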

@bluepeter
Author

bluepeter commented Jan 10, 2019

Thanks @aslushnikov ... so it sounds like that would require launching Chrome each time? There's overhead to that, but it may be unavoidable (save for our current solution of going to, e.g., favicon.ico for a domain to clear and set storage to prior crawl data).

@aslushnikov
Contributor

@bluepeter I think you can launch it once and then open multiple pages. But yes, this might be suboptimal.

However, this is troublesome, as the first load, regardless of whether we disable Javascript/etc, will often pollute the cookie/storage space with new data. It's also messy to have to load the page twice.

Another way of doing this is to use request interception to load a dummy page on the correct security origin, and then use that page to pre-set cookies and local storage:

const page = await browser.newPage();
await page.setRequestInterception(true);
page.on('request', r => {
  r.respond({
    status: 200,
    contentType: 'text/plain',
    body: 'tweak me.'
  });
});
await page.goto('https://pptr.dev');
// Use page to setup cookies and local storage for pptr.dev
// ...

This should be bulletproof compared to loading favicon.ico or robots.txt.
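
To fill in the "use page to setup cookies and local storage" part, the interception approach could be wrapped into a helper along these lines. This is only a sketch: `seedOrigin` and the shape of `state` are assumptions for the example, and it turns interception back off afterwards so the next navigation reaches the real site. Since sessionStorage is scoped to a tab, the same `page` object would need to be reused for the real crawl if session data matters.

```javascript
// Sketch: seed cookies and Web Storage for an origin without loading the real site.
// `seedOrigin` and the shape of `state` are hypothetical.
async function seedOrigin(page, origin, state = {}) {
  const { cookies = [], localStorage: local = {}, sessionStorage: session = {} } = state;

  // Answer every request with a stub body so no site code runs
  const stub = (request) =>
    request.respond({ status: 200, contentType: 'text/plain', body: 'stub' });

  await page.setRequestInterception(true);
  page.on('request', stub);
  try {
    // The stub document still gets the target origin, which is all storage needs
    await page.goto(origin);
    await page.evaluate((local, session) => {
      window.localStorage.clear();
      window.sessionStorage.clear();
      for (const [k, v] of Object.entries(local)) window.localStorage.setItem(k, v);
      for (const [k, v] of Object.entries(session)) window.sessionStorage.setItem(k, v);
    }, local, session);
    if (cookies.length > 0) await page.setCookie(...cookies);
  } finally {
    // Stop stubbing so later navigations load real content
    page.off('request', stub);
    await page.setRequestInterception(false);
  }
}
```

After `await seedOrigin(page, 'https://pptr.dev', savedState)`, a normal `page.goto(...)` on the same page should see the restored state.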

@bluepeter
Author

Thanks... we will try this approach and report back in this issue!

@bluepeter
Author

@aslushnikov great recommendation! This approach is working nicely for us. And as you note it's a lot more bulletproof than hitting favicon.ico or whatever instead. Closing!

@andrasivacson

Just an FYI I had success using page.evaluateOnNewDocument to solve the above problem
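
The comment above does not include code, so the following is only a guess at what such a setup could look like. `page.evaluateOnNewDocument(fn, ...args)` registers a function that runs in each new document before the page's own scripts; `applyStorage` and `preloadLocalStorage` are hypothetical names, and the origin guard is an assumption added so that iframes and other origins are left alone.

```javascript
// Sketch: restore localStorage before any page script runs.
// `applyStorage` executes in the browser; both function names are hypothetical.
function applyStorage(origin, entries) {
  // Skip documents from other origins (iframes, redirects, about:blank)
  if (window.location.origin !== origin) return;
  for (const [key, value] of Object.entries(entries)) {
    window.localStorage.setItem(key, value);
  }
}

async function preloadLocalStorage(page, origin, entries) {
  // The function and its arguments are serialized and replayed in each new document
  await page.evaluateOnNewDocument(applyStorage, origin, entries);
}
```

One trade-off to keep in mind: the registered script runs again on every later navigation in that page, so it would overwrite values the site has changed in the meantime unless it is made conditional.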

@dannyokec

Man, why didn't I see this approach sooner? I implemented this using a custom user data directory, but the total size for around a thousand browsers is far too heavy. I was using a loop to create a separate directory for each bot in order to maintain Google sessions. I noticed that when I applied the saved cookies and local storage for other websites, I stayed logged in, but on Google's sites I was asked to log in again and shown as signed out. So the cookies were indeed applied, but since the state was polluted, the session had to be re-established. If I try this bulletproof method, I'll report back.

