
[Python] {scrapy-playwright} ERROR: Playwright page not found

Discussion in 'Python' started by Stack, September 11, 2024.

  1. Stack (Participating Member)

    I am trying to do a simple web scrape for the price, title, and URL of all the products on the first page of this website: https://equi.life/pages/search-results-page?q=all%20products&tab=products&sort_by=title&sort_order=asc&page=1

    I keep running into this ERROR: Playwright page not found.

    These are my specs:

    HP ENVY x360 Convertible 15-es1xxx
    Processor 11th Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz 1.80 GHz
    Installed RAM 16.0 GB (15.8 GB usable)
    Device ID 1975E5CA-1426-41A2-94D4-CF532B2C84B8
    Product ID 00342-22041-47520-AAOEM
    System type 64-bit operating system, x64-based processor
    Edition Windows 11 Home
    Version 23H2
    Installed on ‎7/‎5/‎2024
    OS build 22631.4112
    Experience Windows Feature Experience Pack 1000.22700.1034.0


    I am using WSL with Ubuntu.

    These are the packages I have:

    appdirs==1.4.4
    attrs==24.2.0
    Automat==24.8.1
    certifi==2024.8.30
    cffi==1.17.1
    charset-normalizer==3.3.2
    constantly==23.10.4
    cryptography==43.0.1
    cssselect==1.2.0
    defusedxml==0.7.1
    filelock==3.16.0
    greenlet==3.0.3
    hyperlink==21.0.0
    idna==3.8
    importlib_metadata==8.4.0
    incremental==24.7.2
    itemadapter==0.9.0
    itemloaders==1.3.1
    jmespath==1.0.1
    lxml==5.3.0
    packaging==24.1
    parsel==1.9.1
    playwright==1.46.0
    Protego==0.3.1
    pyasn1==0.6.1
    pyasn1_modules==0.4.1
    pycparser==2.22
    PyDispatcher==2.0.7
    pyee==11.1.0
    pyOpenSSL==24.2.1
    queuelib==1.7.0
    requests==2.32.3
    requests-file==2.1.0
    Scrapy==2.11.2
    scrapy-playwright==0.0.41
    service-identity==24.1.0
    setuptools==74.1.2
    tldextract==5.1.2
    tqdm==4.66.5
    Twisted==24.7.0
    typing_extensions==4.12.2
    urllib3==2.2.2
    w3lib==2.2.1
    websockets==10.4
    zipp==3.20.1
    zope.interface==7.0.3


    This is the log, including the error, from running my spider:

    Loading items.py
    2024-09-11 15:28:45 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: webscraper)
    2024-09-11 15:28:45 [scrapy.utils.log] INFO: Versions: lxml 5.3.0.0, libxml2 2.12.9, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.7.0, Python 3.12.3 (main, Jul 31 2024, 17:43:48) [GCC 13.2.0], pyOpenSSL 24.2.1 (OpenSSL 3.3.2 3 Sep 2024), cryptography 43.0.1, Platform Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.39
    2024-09-11 15:28:45 [scrapy.addons] INFO: Enabled addons:
    []
    2024-09-11 15:28:45 [asyncio] DEBUG: Using selector: EpollSelector
    2024-09-11 15:28:45 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
    2024-09-11 15:28:45 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
    2024-09-11 15:28:45 [scrapy.extensions.telnet] INFO: Telnet Password: 24ad54f81b9ec806
    2024-09-11 15:28:46 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
    'scrapy.extensions.telnet.TelnetConsole',
    'scrapy.extensions.memusage.MemoryUsage',
    'scrapy.extensions.logstats.LogStats']
    2024-09-11 15:28:46 [scrapy.crawler] INFO: Overridden settings:
    {'BOT_NAME': 'webscraper',
    'FEED_EXPORT_ENCODING': 'utf-8',
    'NEWSPIDER_MODULE': 'webscraper.spiders',
    'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
    'SPIDER_MODULES': ['webscraper.spiders'],
    'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
    2024-09-11 15:28:47 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
    'webscraper.middlewares.WebscraperDownloaderMiddleware',
    'scrapy.downloadermiddlewares.retry.RetryMiddleware',
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
    'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2024-09-11 15:28:47 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
    'webscraper.middlewares.WebscraperSpiderMiddleware',
    'scrapy.spidermiddlewares.referer.RefererMiddleware',
    'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
    'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2024-09-11 15:28:47 [scrapy.middleware] INFO: Enabled item pipelines:
    ['webscraper.pipelines.WebscraperPipeline']
    2024-09-11 15:28:47 [scrapy.core.engine] INFO: Spider opened
    2024-09-11 15:28:47 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2024-09-11 15:28:47 [equilife] INFO: Spider opened: equilife
    2024-09-11 15:28:47 [equilife] INFO: Spider opened: equilife
    2024-09-11 15:28:47 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
    2024-09-11 15:28:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://equi.life/pages/search-results-page?q=all%20products&tab=products&sort_by=title&sort_order=asc&page=1> (referer: None)
    2024-09-11 15:28:48 [scrapy.core.spidermw] WARNING: Async iterable passed to WebscraperSpiderMiddleware.process_spider_output was downgraded to a non-async one
    2024-09-11 15:28:48 [equilife] ERROR: Playwright page not found
    2024-09-11 15:28:48 [scrapy.core.engine] INFO: Closing spider (finished)
    2024-09-11 15:28:48 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 301,
    'downloader/request_count': 1,
    'downloader/request_method_count/GET': 1,
    'downloader/response_bytes': 68610,
    'downloader/response_count': 1,
    'downloader/response_status_count/200': 1,
    'elapsed_time_seconds': 0.877688,
    'finish_reason': 'finished',
    'finish_time': datetime.datetime(2024, 9, 11, 12, 28, 48, 868664, tzinfo=datetime.timezone.utc),
    'httpcompression/response_bytes': 280665,
    'httpcompression/response_count': 1,
    'log_count/DEBUG': 4,
    'log_count/ERROR': 1,
    'log_count/INFO': 12,
    'log_count/WARNING': 1,
    'memusage/max': 66830336,
    'memusage/startup': 66830336,
    'response_received_count': 1,
    'scheduler/dequeued': 1,
    'scheduler/dequeued/memory': 1,
    'scheduler/enqueued': 1,
    'scheduler/enqueued/memory': 1,
    'start_time': datetime.datetime(2024, 9, 11, 12, 28, 47, 990976, tzinfo=datetime.timezone.utc)}
    2024-09-11 15:28:48 [scrapy.core.engine] INFO: Spider closed (finished)


    I uninstalled and reinstalled scrapy and scrapy-playwright many times, and whenever I was prompted about missing dependencies, I installed them. I also tried force-reinstalling with upgrades.

    I tried adding this code snippet to settings.py:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_playwright.middleware.PlaywrightMiddleware': 543,
    }


    I eventually removed it because I realized it was unnecessary and caused more errors. I have looked on YouTube, Google, and ChatGPT. I have tried everything I can think of.
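
    For what it's worth, scrapy-playwright's README wires the integration in through download handlers rather than a downloader middleware (a scrapy_playwright.middleware module does not appear to exist in the package), which may be why that snippet only caused more errors. A minimal sketch of the documented settings:

    # settings.py -- per the scrapy-playwright README: route http/https
    # requests through the Playwright download handler and keep the
    # asyncio-based Twisted reactor.
    DOWNLOAD_HANDLERS = {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    }
    TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"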

    This is my spider.py:

    import scrapy
    from scrapy_playwright.page import PageMethod
    from webscraper.items import EquiLifeItem

    class EquilifeSpider(scrapy.Spider):
        name = "equilife"
        allowed_domains = ["equi.life"]
        start_urls = ["https://equi.life/pages/search-results-page?q=all%20products&tab=products&sort_by=title&sort_order=asc&page=1"]

        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(
                    url,
                    meta={
                        'playwright': True,
                        'playwright_include_page': True,
                        'playwright_page_methods': [
                            PageMethod('wait_for_selector', 'div#snize-item clearfix ')
                        ]
                    },
                    callback=self.parse
                )

        async def parse(self, response):
            page = response.meta.get('playwright_page')

            if not page:
                self.logger.error('Playwright page not found')
                return

            try:
                content = await page.content()
                selector = scrapy.Selector(text=content, type='html')
                products = selector.css('a.snize-view-link')

                for product in products:
                    product_data = EquiLifeItem()
                    product_data['title'] = product.css('span.snize-title::text').get()
                    product_data['price'] = product.css('span.snize-price::text').get()
                    product_data['url'] = product.css('a').attrib.get('href')
                    yield product_data
            except Exception as e:
                self.logger.error(f'Error processing page: {e}')
            finally:
                await page.close()
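
    A side note on this callback, hedged since I have not run it against the live site: scrapy-playwright builds the response body from the rendered page after the page methods run, so calling page.content() again is usually redundant, and playwright_include_page is only needed to keep interacting with the page. Also, the wait_for_selector argument 'div#snize-item clearfix ' mixes an id with what look like class names; if snize-item and clearfix are classes, the CSS form would be 'div.snize-item.clearfix'. A simplified sketch of the callback under those assumptions:

    # A simplified callback, assuming the Playwright download handler is
    # active: the response already holds the rendered HTML, so no page
    # object is needed and nothing has to be closed afterwards.
    def parse(self, response):
        for product in response.css('a.snize-view-link'):
            yield {
                'title': product.css('span.snize-title::text').get(),
                'price': product.css('span.snize-price::text').get(),
                # product is itself the <a> element, so read href directly;
                # product.css('a') would look for a nested <a> inside it.
                'url': product.attrib.get('href'),
            }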


    This is my settings.py; the Playwright configuration is at the bottom.

    BOT_NAME = "webscraper"

    SPIDER_MODULES = ["webscraper.spiders"]
    NEWSPIDER_MODULE = "webscraper.spiders"


    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False

    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32

    # Configure a delay for requests for the same website (default: 0)

    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16

    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False

    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False

    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    # "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    # "Accept-Language": "en",
    #}

    # Enable or disable spider middlewares

    SPIDER_MIDDLEWARES = {
        "webscraper.middlewares.WebscraperSpiderMiddleware": 543,
    }

    # Enable or disable downloader middlewares

    DOWNLOADER_MIDDLEWARES = {
        "webscraper.middlewares.WebscraperDownloaderMiddleware": 543,
    }

    # Enable or disable extensions

    #EXTENSIONS = {
    # "scrapy.extensions.telnet.TelnetConsole": None,
    #}

    # Configure item pipelines

    ITEM_PIPELINES = {
        "webscraper.pipelines.WebscraperPipeline": 300,
    }

    # Enable and configure the AutoThrottle extension (disabled by default)

    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False

    # Enable and configure HTTP caching (disabled by default)

    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = "httpcache"
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

    # Set settings whose default value is deprecated to a future-proof value
    REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
    TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
    FEED_EXPORT_ENCODING = "utf-8"

    PLAYWRIGHT_ENABLED = True

    PLAYWRIGHT_BROWSER_TYPE = "chromium"

    PLAYWRIGHT_LAUNCH_OPTIONS = {
        'executable_path': r'C:\Program Files\BraveSoftware\Brave-Browser\Application\brave.exe',
        'headless': True,
    }
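
    An observation on these settings, offered as a hypothesis rather than a confirmed fix: the log above contains no scrapy-playwright entries and reports only 301 request bytes, which is consistent with the request going through Scrapy's default HTTP handler, so 'playwright_page' never appears in response.meta and the spider logs "Playwright page not found". The DOWNLOAD_HANDLERS block from the sketch earlier is missing here, and PLAYWRIGHT_ENABLED is not among scrapy-playwright's documented settings. Separately, since the spider runs inside WSL, the Windows-side Brave path in executable_path cannot be launched from Linux; installing Playwright's own browser inside Ubuntu (playwright install chromium) and dropping executable_path avoids that. A sketch of the Playwright block under those assumptions:

    # Assumes the Chromium build was installed inside the WSL Ubuntu
    # environment via `playwright install chromium`. Without
    # executable_path, Playwright launches its own bundled browser,
    # so no Windows-side binary is involved.
    PLAYWRIGHT_BROWSER_TYPE = "chromium"

    PLAYWRIGHT_LAUNCH_OPTIONS = {
        "headless": True,
    }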


    items.py, middlewares.py and pipelines.py are untouched.

