
[Python] {scrapy-playwright} ERROR: Playwright page not found

Discussion in 'Python' started by Stack, September 11, 2024.

  1. Stack (Participating Member)

    I am trying to do a simple web scrape for the price, title, and URL of all the products on the first page of this website: https://equi.life/pages/search-results-page?q=all%20products&tab=products&sort_by=title&sort_order=asc&page=1

    I keep running into this ERROR: Playwright page not found.

    These are my specs:

    HP ENVY x360 Convertible 15-es1xxx
    Processor 11th Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz 1.80 GHz
    Installed RAM 16.0 GB (15.8 GB usable)
    Device ID 1975E5CA-1426-41A2-94D4-CF532B2C84B8
    Product ID 00342-22041-47520-AAOEM
    System type 64-bit operating system, x64-based processor
    Edition Windows 11 Home
    Version 23H2
    Installed on ‎7/‎5/‎2024
    OS build 22631.4112
    Experience Windows Feature Experience Pack 1000.22700.1034.0


    I am using WSL with Ubuntu.

    These are the packages I have:

    appdirs==1.4.4
    attrs==24.2.0
    Automat==24.8.1
    certifi==2024.8.30
    cffi==1.17.1
    charset-normalizer==3.3.2
    constantly==23.10.4
    cryptography==43.0.1
    cssselect==1.2.0
    defusedxml==0.7.1
    filelock==3.16.0
    greenlet==3.0.3
    hyperlink==21.0.0
    idna==3.8
    importlib_metadata==8.4.0
    incremental==24.7.2
    itemadapter==0.9.0
    itemloaders==1.3.1
    jmespath==1.0.1
    lxml==5.3.0
    packaging==24.1
    parsel==1.9.1
    playwright==1.46.0
    Protego==0.3.1
    pyasn1==0.6.1
    pyasn1_modules==0.4.1
    pycparser==2.22
    PyDispatcher==2.0.7
    pyee==11.1.0
    pyOpenSSL==24.2.1
    queuelib==1.7.0
    requests==2.32.3
    requests-file==2.1.0
    Scrapy==2.11.2
    scrapy-playwright==0.0.41
    service-identity==24.1.0
    setuptools==74.1.2
    tldextract==5.1.2
    tqdm==4.66.5
    Twisted==24.7.0
    typing_extensions==4.12.2
    urllib3==2.2.2
    w3lib==2.2.1
    websockets==10.4
    zipp==3.20.1
    zope.interface==7.0.3


    This is the log, including the error, from running my spider:

    Loading items.py
    2024-09-11 15:28:45 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: webscraper)
    2024-09-11 15:28:45 [scrapy.utils.log] INFO: Versions: lxml 5.3.0.0, libxml2 2.12.9, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.7.0, Python 3.12.3 (main, Jul 31 2024, 17:43:48) [GCC 13.2.0], pyOpenSSL 24.2.1 (OpenSSL 3.3.2 3 Sep 2024), cryptography 43.0.1, Platform Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.39
    2024-09-11 15:28:45 [scrapy.addons] INFO: Enabled addons:
    []
    2024-09-11 15:28:45 [asyncio] DEBUG: Using selector: EpollSelector
    2024-09-11 15:28:45 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
    2024-09-11 15:28:45 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
    2024-09-11 15:28:45 [scrapy.extensions.telnet] INFO: Telnet Password: 24ad54f81b9ec806
    2024-09-11 15:28:46 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
    'scrapy.extensions.telnet.TelnetConsole',
    'scrapy.extensions.memusage.MemoryUsage',
    'scrapy.extensions.logstats.LogStats']
    2024-09-11 15:28:46 [scrapy.crawler] INFO: Overridden settings:
    {'BOT_NAME': 'webscraper',
    'FEED_EXPORT_ENCODING': 'utf-8',
    'NEWSPIDER_MODULE': 'webscraper.spiders',
    'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
    'SPIDER_MODULES': ['webscraper.spiders'],
    'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
    2024-09-11 15:28:47 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
    'webscraper.middlewares.WebscraperDownloaderMiddleware',
    'scrapy.downloadermiddlewares.retry.RetryMiddleware',
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
    'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2024-09-11 15:28:47 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
    'webscraper.middlewares.WebscraperSpiderMiddleware',
    'scrapy.spidermiddlewares.referer.RefererMiddleware',
    'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
    'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2024-09-11 15:28:47 [scrapy.middleware] INFO: Enabled item pipelines:
    ['webscraper.pipelines.WebscraperPipeline']
    2024-09-11 15:28:47 [scrapy.core.engine] INFO: Spider opened
    2024-09-11 15:28:47 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2024-09-11 15:28:47 [equilife] INFO: Spider opened: equilife
    2024-09-11 15:28:47 [equilife] INFO: Spider opened: equilife
    2024-09-11 15:28:47 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
    2024-09-11 15:28:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://equi.life/pages/search-results-page?q=all%20products&tab=products&sort_by=title&sort_order=asc&page=1> (referer: None)
    2024-09-11 15:28:48 [scrapy.core.spidermw] WARNING: Async iterable passed to WebscraperSpiderMiddleware.process_spider_output was downgraded to a non-async one
    2024-09-11 15:28:48 [equilife] ERROR: Playwright page not found
    2024-09-11 15:28:48 [scrapy.core.engine] INFO: Closing spider (finished)
    2024-09-11 15:28:48 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 301,
    'downloader/request_count': 1,
    'downloader/request_method_count/GET': 1,
    'downloader/response_bytes': 68610,
    'downloader/response_count': 1,
    'downloader/response_status_count/200': 1,
    'elapsed_time_seconds': 0.877688,
    'finish_reason': 'finished',
    'finish_time': datetime.datetime(2024, 9, 11, 12, 28, 48, 868664, tzinfo=datetime.timezone.utc),
    'httpcompression/response_bytes': 280665,
    'httpcompression/response_count': 1,
    'log_count/DEBUG': 4,
    'log_count/ERROR': 1,
    'log_count/INFO': 12,
    'log_count/WARNING': 1,
    'memusage/max': 66830336,
    'memusage/startup': 66830336,
    'response_received_count': 1,
    'scheduler/dequeued': 1,
    'scheduler/dequeued/memory': 1,
    'scheduler/enqueued': 1,
    'scheduler/enqueued/memory': 1,
    'start_time': datetime.datetime(2024, 9, 11, 12, 28, 47, 990976, tzinfo=datetime.timezone.utc)}
    2024-09-11 15:28:48 [scrapy.core.engine] INFO: Spider closed (finished)


    I uninstalled and reinstalled scrapy and scrapy-playwright many times, and whenever I was prompted about missing dependencies, I installed them. I also tried force-reinstalling with upgrades.

    I tried adding this code snippet to settings.py:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_playwright.middleware.PlaywrightMiddleware': 543,
    }


    I eventually removed it because I realized it was unnecessary and caused more errors. I have looked on YouTube, Google, and ChatGPT. I have tried everything I can think of.
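
    For what it's worth, scrapy-playwright's README wires the integration in through download handlers rather than a downloader middleware (a scrapy_playwright.middleware module does not appear to exist in the package), which may be why that snippet only caused more errors. A minimal sketch of the documented settings:

    # settings.py -- per the scrapy-playwright README: route http/https
    # requests through the Playwright download handler and keep the
    # asyncio-based Twisted reactor.
    DOWNLOAD_HANDLERS = {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    }
    TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"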

    This is my spider.py:

    import scrapy
    from scrapy_playwright.page import PageMethod
    from webscraper.items import EquiLifeItem

    class EquilifeSpider(scrapy.Spider):
        name = "equilife"
        allowed_domains = ["equi.life"]
        start_urls = ["https://equi.life/pages/search-results-page?q=all%20products&tab=products&sort_by=title&sort_order=asc&page=1"]

        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(
                    url,
                    meta={
                        'playwright': True,
                        'playwright_include_page': True,
                        'playwright_page_methods': [
                            PageMethod('wait_for_selector', 'div#snize-item clearfix ')
                        ]
                    },
                    callback=self.parse
                )

        async def parse(self, response):
            page = response.meta.get('playwright_page')

            if not page:
                self.logger.error('Playwright page not found')
                return

            try:
                content = await page.content()
                selector = scrapy.Selector(text=content, type='html')
                products = selector.css('a.snize-view-link')

                for product in products:
                    product_data = EquiLifeItem()
                    product_data['title'] = product.css('span.snize-title::text').get()
                    product_data['price'] = product.css('span.snize-price::text').get()
                    product_data['url'] = product.css('a').attrib.get('href')
                    yield product_data
            except Exception as e:
                self.logger.error(f'Error processing page: {e}')
            finally:
                await page.close()
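
    A side note on this callback, hedged since I have not run it against the live site: scrapy-playwright builds the response body from the rendered page after the page methods run, so calling page.content() again is usually redundant, and playwright_include_page is only needed to keep interacting with the page. Also, the wait_for_selector argument 'div#snize-item clearfix ' mixes an id with what look like class names; if snize-item and clearfix are classes, the CSS form would be 'div.snize-item.clearfix'. A simplified sketch of the callback under those assumptions:

    # A simplified callback, assuming the Playwright download handler is
    # active: the response already holds the rendered HTML, so no page
    # object is needed and nothing has to be closed afterwards.
    def parse(self, response):
        for product in response.css('a.snize-view-link'):
            yield {
                'title': product.css('span.snize-title::text').get(),
                'price': product.css('span.snize-price::text').get(),
                # product is itself the <a> element, so read href directly;
                # product.css('a') would look for a nested <a> inside it.
                'url': product.attrib.get('href'),
            }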


    This is my settings.py; the Playwright configuration is at the bottom.

    BOT_NAME = "webscraper"

    SPIDER_MODULES = ["webscraper.spiders"]
    NEWSPIDER_MODULE = "webscraper.spiders"


    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False

    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32

    # Configure a delay for requests for the same website (default: 0)

    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16

    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False

    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False

    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    # "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    # "Accept-Language": "en",
    #}

    # Enable or disable spider middlewares

    SPIDER_MIDDLEWARES = {
        "webscraper.middlewares.WebscraperSpiderMiddleware": 543,
    }

    # Enable or disable downloader middlewares

    DOWNLOADER_MIDDLEWARES = {
        "webscraper.middlewares.WebscraperDownloaderMiddleware": 543,
    }

    # Enable or disable extensions

    #EXTENSIONS = {
    # "scrapy.extensions.telnet.TelnetConsole": None,
    #}

    # Configure item pipelines

    ITEM_PIPELINES = {
        "webscraper.pipelines.WebscraperPipeline": 300,
    }

    # Enable and configure the AutoThrottle extension (disabled by default)

    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False

    # Enable and configure HTTP caching (disabled by default)

    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = "httpcache"
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

    # Set settings whose default value is deprecated to a future-proof value
    REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
    TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
    FEED_EXPORT_ENCODING = "utf-8"

    PLAYWRIGHT_ENABLED = True

    PLAYWRIGHT_BROWSER_TYPE = "chromium"

    PLAYWRIGHT_LAUNCH_OPTIONS = {
        'executable_path': r'C:\Program Files\BraveSoftware\Brave-Browser\Application\brave.exe',
        'headless': True,
    }
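
    An observation on these settings, offered as a hypothesis rather than a confirmed fix: the log above contains no scrapy-playwright entries and reports only 301 request bytes, which is consistent with the request going through Scrapy's default HTTP handler, so 'playwright_page' never appears in response.meta and the spider logs "Playwright page not found". The DOWNLOAD_HANDLERS block from the sketch earlier is missing here, and PLAYWRIGHT_ENABLED is not among scrapy-playwright's documented settings. Separately, since the spider runs inside WSL, the Windows-side Brave path in executable_path cannot be launched from Linux; installing Playwright's own browser inside Ubuntu (playwright install chromium) and dropping executable_path avoids that. A sketch of the Playwright block under those assumptions:

    # Assumes the Chromium build was installed inside the WSL Ubuntu
    # environment via `playwright install chromium`. Without
    # executable_path, Playwright launches its own bundled browser,
    # so no Windows-side binary is involved.
    PLAYWRIGHT_BROWSER_TYPE = "chromium"

    PLAYWRIGHT_LAUNCH_OPTIONS = {
        "headless": True,
    }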


    items.py, middlewares.py and pipelines.py are untouched.

