(Probable Playwright Request Overridden issue) Scrapy-playwright does not seem to work on website "njcourts" but playwright works #199

Binit-Dhakal · 2023-05-14T01:47:50Z

Description

I am trying to scrape the website "https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces", but with scrapy-playwright I cannot get any further than the homepage, whereas every operation works with plain Playwright. If I click on any of the navigation tabs or click search, I get redirected to the page shown in the attached image (the URL is the same as above). The website is not blocking us, since the same steps work with Playwright, as shown below.

Steps to Reproduce

Scrapy-Playwright Code

import scrapy


class NjcourtsSpider(scrapy.Spider):
    """
    Class that scrapes the njcourts.gov.
    """
    name = 'njcourts2'
    # settings to scrape slowly
    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        'CONCURRENT_REQUESTS': 1,
        'COOKIES_DEBUG': True,
        'PLAYWRIGHT_PROCESS_REQUEST_HEADERS': None
    }

    def start_requests(self):
        yield scrapy.Request(
            "https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces",
            meta={
                'playwright': True,
                "playwright_include_page": True,
            }
        )

    async def parse(self, response):
        page = response.meta['playwright_page']

        judgement_num = page.locator("""
            //a[@onclick="return myfaces.oam.submitForm('judgmentSearchForm','judgmentSearchForm:j_id_jsp_1959880460_15');"]
        """)

        print(await judgement_num.count())  # => 1
        await judgement_num.click()

        await page.wait_for_timeout(10000)  # redirected to the page shown in the attached image
        await page.close()  # release the page requested via playwright_include_page

Vanilla Playwright code

import asyncio

from playwright.async_api import async_playwright


async def main():
    # The context manager starts Playwright and stops it on exit
    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto("https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces")

        judgement_num = page.locator("""
            //a[@onclick="return myfaces.oam.submitForm('judgmentSearchForm','judgmentSearchForm:j_id_jsp_1959880460_15');"]
        """)

        print(await judgement_num.count())  # => 1
        await judgement_num.click()  # This works

        await page.wait_for_timeout(10000)
        await browser.close()


asyncio.run(main())

Versions

playwright-python: 1.32.1
scrapy-playwright: 0.0.26
scrapy: 2.7.1

Additional Information

The site seems to only work for American IPs.

If you cannot reproduce the issue or need more information, please let me know. I would appreciate it a lot if you could point me in the right direction from here.

Thank you,
Binit


Binit-Dhakal · 2023-05-15T03:24:53Z

I dug deeper into the issue and found a similar issue/bug in #100, which seems to have been closed after the pull request https://github.com/scrapy-plugins/scrapy-playwright/pull/144/files. But I think the issue is still not resolved.
This is the part of the log file where this happens. Maybe this is the cause?

2023-05-15 09:05:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces> (referer: https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces) ['playwright']
2023-05-15 09:05:35 [scrapy-playwright] DEBUG: [Context=default] Request: <POST https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces> (resource type: document, referrer: https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces)
2023-05-15 09:05:35 [scrapy-playwright] DEBUG: [Context=default] Overridden method for Playwright request to https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces: original=POST new=GET
2023-05-15 09:05:36 [scrapy-playwright] DEBUG: [Context=default] Response: <400 https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces> (referrer: None)

Binit-Dhakal · 2023-05-15T10:21:42Z

I checked the HAR recorded through a browser context:

PLAYWRIGHT_CONTEXTS = {
    "har_saver": {
        "record_har_path": "pw.har"
    }
}

The original request should have been a POST, but with scrapy-playwright the request is recorded as a GET. This is the result of this bug. Is there a way to simply not modify the request with scrapy-playwright, or is this modification necessary for the library to work?
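For anyone who wants to repeat this check, here is a minimal sketch of how the recorded HAR could be inspected. It assumes the file is the `pw.har` from the setting above and that it follows the standard HAR layout (`log.entries[].request`). The helper names `load_har` and `request_methods` are made up for this example. As far as I know, Playwright only writes the HAR file once the context is closed.

```python
import json
from pathlib import Path
from typing import List, Tuple


def load_har(path: str) -> dict:
    """Read a HAR file (plain JSON) from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def request_methods(har: dict, url_substring: str) -> List[Tuple[str, str]]:
    """Return (method, url) pairs for entries whose URL contains url_substring."""
    entries = har.get("log", {}).get("entries", [])
    return [
        (entry["request"]["method"], entry["request"]["url"])
        for entry in entries
        if url_substring in entry["request"]["url"]
    ]


if __name__ == "__main__":
    har_path = Path("pw.har")  # path taken from the PLAYWRIGHT_CONTEXTS setting above
    if har_path.exists():
        for method, url in request_methods(load_har(str(har_path)), "judgmentSearch.faces"):
            print(method, url)
```

If the form submission appears here as a GET instead of a POST, the method was changed before the request left the browser.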

I would appreciate it if you could point me in the right direction and let me know whether this is the real issue.

I feel like something is wrong in this conditional: we could change the request only if it is the scrapy.Request; otherwise, is it necessary to change the request method? I would love to hear why this decision was made.
https://github.com/scrapy-plugins/scrapy-playwright/blob/main/scrapy_playwright/handler.py#L505

Thank you,

elacuesta · 2023-07-24T16:10:27Z

The code you mentioned in your comment was updated in #177, which has not been released yet. It is likely to solve your issue: I suspect that your POST request is not a navigation request, so it will not trigger the block that overrides the method.
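To illustrate the reasoning above, here is a simplified model of the decision. It is not the actual code from `handler.py`. The function `method_override` and its `require_navigation` flag are hypothetical and only mirror the behavior described in this thread: the method is overridden when the Playwright request URL matches the Scrapy request URL, and after the change only when the request is also a navigation request.

```python
from typing import Optional


def method_override(
    scrapy_method: str,
    scrapy_url: str,
    pw_method: str,
    pw_url: str,
    is_navigation_request: bool,
    require_navigation: bool,
) -> Optional[str]:
    """Return the method to force on the Playwright request, or None to leave it alone."""
    # Only requests to the same URL as the original Scrapy request are candidates
    if pw_url.rstrip("/") != scrapy_url.rstrip("/"):
        return None
    # Assumed post-change behavior: skip requests that are not navigation requests
    if require_navigation and not is_navigation_request:
        return None
    if scrapy_method.upper() != pw_method.upper():
        return scrapy_method
    return None


url = "https://portal.njcourts.gov/webe40/JudgmentWeb/jsp/judgmentSearch.faces"
# Without the navigation check, the POST from the log is replaced by GET
print(method_override("GET", url, "POST", url, False, require_navigation=False))
# With the check, a non-navigation POST keeps its method
print(method_override("GET", url, "POST", url, False, require_navigation=True))
```

In this model, a POST to the same URL that is a navigation request would still be overridden. Whether the form submission on this site counts as a navigation request is not confirmed in this thread.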

elacuesta · 2023-07-24T16:18:06Z

#177 was just released as part of v0.0.27.
Closing, feel free to reopen if you continue to experience the behavior.

Binit-Dhakal changed the title ~~Scrapy-playwright doesnot seem to work on website "njcourts" but playwright works~~ (Probable Playwright Request Overridden issue) Scrapy-playwright doesnot seem to work on website "njcourts" but playwright works May 15, 2023

elacuesta closed this as completed Jul 24, 2023
