(Probable Playwright Request Overridden issue) Scrapy-playwright does not seem to work on website "njcourts" but playwright works #199

Closed
Binit-Dhakal opened this issue May 14, 2023 · 4 comments

Description

I am trying to scrape https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces, but with scrapy-playwright I cannot get past the homepage, while the same operations all work with plain Playwright. If I click any of the navigation tabs or the search button, I am redirected to the page shown in the attached image (the URL stays the same as above). The website is not blocking us, since the same flow works with plain Playwright, as shown below.
[Image: njcourts_error]

Steps to Reproduce

Scrapy-Playwright Code

import scrapy


class NjcourtsSpider(scrapy.Spider):
    """
    Class that scrapes the njcourts.gov.
    """
    name = 'njcourts2'
    # settings to scrape slowly
    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        'CONCURRENT_REQUESTS': 1,
        'COOKIES_DEBUG': True,
        'PLAYWRIGHT_PROCESS_REQUEST_HEADERS': None
    }

    def start_requests(self):
        yield scrapy.Request(
            "https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces",
            meta={
                'playwright': True,
                "playwright_include_page": True,
            }
        )

    async def parse(self, response):
        page = response.meta['playwright_page']

        judgement_num = page.locator("""
            //a[@onclick="return myfaces.oam.submitForm('judgmentSearchForm','judgmentSearchForm:j_id_jsp_1959880460_15');"]
        """)

        print(await judgement_num.count())  # => 1
        await judgement_num.click()

        await page.wait_for_timeout(10000)  # the browser ends up on the error page shown in the image above

Vanilla Playwright code

import asyncio

from playwright.async_api import async_playwright


async def main():
    # Using async_playwright() as a context manager stops Playwright on exit
    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto("https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces")

        judgement_num = page.locator("""
            //a[@onclick="return myfaces.oam.submitForm('judgmentSearchForm','judgmentSearchForm:j_id_jsp_1959880460_15');"]
        """)

        print(await judgement_num.count())  # => 1
        await judgement_num.click()  # This works

        await page.wait_for_timeout(10000)
        await browser.close()


asyncio.run(main())

Versions

playwright-python: 1.32.1
scrapy-playwright: 0.0.26
scrapy: 2.7.1

Additional Information

The site seems to only work for American IPs.

If you cannot reproduce the issue or need more information, please let me know. I would appreciate it a lot if you could point me in the right direction from here.

Thank you,
Binit

Binit-Dhakal (Author) commented May 15, 2023

I dug deeper and found a similar issue/bug in #100, which was closed after pull request https://github.com/scrapy-plugins/scrapy-playwright/pull/144/files. However, I think the issue is still not resolved.
This is the part of the log where it happens. Maybe this is the cause?

2023-05-15 09:05:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces> (referer: https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces) ['playwright']
2023-05-15 09:05:35 [scrapy-playwright] DEBUG: [Context=default] Request: <POST https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces> (resource type: document, referrer: https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces)
2023-05-15 09:05:35 [scrapy-playwright] DEBUG: [Context=default] Overridden method for Playwright request to https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces: original=POST new=GET
2023-05-15 09:05:36 [scrapy-playwright] DEBUG: [Context=default] Response: <400 https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces> (referrer: None)
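
To make the log above easier to read, here is a minimal sketch of the kind of decision that would produce the "Overridden method" line. It is not the library's code: `resolve_method` is a hypothetical name, and the rule it applies (force the Scrapy request's method whenever the browser request targets the same URL) is only an assumption that happens to be consistent with the log.

```python
def resolve_method(scrapy_url, scrapy_method, browser_url, browser_method):
    """Return the method that would be sent under a URL-only matching rule.

    Hypothetical, simplified model: any browser request whose URL equals the
    URL of the Scrapy request is treated as "the Scrapy request" and gets the
    Scrapy request's method, whatever the browser intended to send.
    """
    if browser_url.rstrip("/") == scrapy_url.rstrip("/"):
        return scrapy_method.upper()
    return browser_method.upper()


URL = "https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces"

# Scrapy fetched the page with GET; clicking the link makes the browser
# submit a form with POST to the very same URL.
print(resolve_method(URL, "GET", URL, "POST"))  # prints GET

# A browser request to a different URL keeps its own method.
print(resolve_method(URL, "GET", URL + "?other=1", "POST"))  # prints POST
```

Under this model the form submission and the initial page load cannot be told apart by URL, which would explain why the server receives a GET and answers with 400.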

Binit-Dhakal changed the title from 'Scrapy-playwright does not seem to work on website "njcourts" but playwright works' to '(Probable Playwright Request Overridden issue) Scrapy-playwright does not seem to work on website "njcourts" but playwright works' on May 15, 2023
Binit-Dhakal (Author) commented

I checked the HAR by recording it through a browser context:

PLAYWRIGHT_CONTEXTS = {
    "har_saver": {
        "record_har_path": "pw.har"
    }
}

The original request should have been a POST, but with scrapy-playwright the request is recorded as a GET, which appears to be a result of this bug. Is there a way to keep scrapy-playwright from modifying the request, or is the override necessary for the library to work?
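
For anyone who wants to repeat this check, a small sketch for listing the method, status, and URL of every entry in a recorded HAR file follows. It relies only on the standard HAR layout (`log.entries[].request` and `.response`); `summarize_har` is a hypothetical helper name, and `pw.har` is the path from the context settings above. As far as I know, Playwright writes the HAR file only once the browser context is closed, so the file may be missing or incomplete while the crawl is still running.

```python
import json
from pathlib import Path


def summarize_har(har_path):
    """Return a list of (method, status, url) tuples, one per HAR entry."""
    har = json.loads(Path(har_path).read_text(encoding="utf-8"))
    rows = []
    for entry in har["log"]["entries"]:
        request = entry["request"]
        response = entry.get("response", {})
        rows.append((request["method"], response.get("status"), request["url"]))
    return rows


if __name__ == "__main__":
    # "pw.har" is the record_har_path used in PLAYWRIGHT_CONTEXTS above.
    if Path("pw.har").exists():
        for method, status, url in summarize_har("pw.har"):
            print(method, status, url)
```

A form submission that was rewritten on its way out would show up here as a GET entry for the search URL where a POST was expected.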

I would appreciate it if you could point me in the right direction and let me know whether this is the real issue.

I feel like something is wrong in this conditional: could we override the request only when it comes from a scrapy.Request? Otherwise, is it necessary to change the request method at all? I would love to hear why this decision was made.
https://github.com/scrapy-plugins/scrapy-playwright/blob/main/scrapy_playwright/handler.py#L505

Thank you,

elacuesta (Member) commented Jul 24, 2023

The code you mentioned in your comment was updated in #177, which has not been released yet. It will likely solve your issue: I suspect that your POST request is not a navigation request, so it will not trigger the block that overrides the method.

elacuesta (Member) commented Jul 24, 2023

#177 was just released as part of v0.0.27.
Closing, feel free to reopen if you continue to experience the behavior.
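
For readers landing here later, a quick way to tell whether an environment already has the release mentioned above is to compare the installed version against 0.0.27. The sketch below uses only the standard library; the helper names are hypothetical, and the version parsing is deliberately naive (it keeps the first three numeric components and ignores pre-release or post-release suffixes).

```python
import re
from importlib.metadata import PackageNotFoundError, version

FIXED_IN = (0, 0, 27)  # release that shipped #177, per the comment above


def parse_version(text):
    """Turn a version string such as '0.0.26' into a tuple of up to three ints."""
    return tuple(int(part) for part in re.findall(r"\d+", text)[:3])


def includes_fix(installed):
    """True if the given version string is at least 0.0.27."""
    return parse_version(installed) >= FIXED_IN


def installed_includes_fix():
    """Check the installed scrapy-playwright; None if it is not installed."""
    try:
        return includes_fix(version("scrapy-playwright"))
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_includes_fix())
```

With the version from the original report (0.0.26), `includes_fix` returns `False`, so upgrading would be the first thing to try before reopening.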
