[Python] scraping json with scrapy item loaders

Stack · October 8, 2024

I am following this example for scraping a news website. When I check the type returned in the example, it is a scrapy.selector.unified.SelectorList.

In my case the data of interest is enclosed in <script> tags, so I extracted and parsed it into a list with the Python code below.

import json
import re

# run inside the Scrapy shell, which provides fetch() and response
fetch('https://newswebsite.com/news/national')

data = re.findall(r"<script type=.application.ld.json. id=.listing-ld.>{.@graph.:(.+?),.@context.:.http:..schema.org..<.script>", response.body.decode("utf-8"), re.S)

# join the list of matches into one string before parsing it as JSON
jsonData = json.loads(''.join(data))
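For reference, the same extraction can be tried outside the Scrapy shell on a literal HTML string. This is only a minimal sketch: the `sample_html` markup, its headline values, and the helper name `extract_listing` are made up for illustration, and the pattern captures everything between the `listing-ld` script tags and lets `json.loads` pick out the `@graph` key instead of matching it in the regex.

```python
import json
import re

# Hypothetical page fragment; the real site's markup may differ.
sample_html = """
<html><head>
<script type="application/ld+json" id="listing-ld">{"@graph":[{"@type":"NewsArticle","headline":"First story","url":"/news/1"},{"@type":"NewsArticle","headline":"Second story","url":"/news/2"}],"@context":"http://schema.org"}</script>
</head></html>
"""


def extract_listing(html):
    """Return the list stored under "@graph" in the listing-ld script tag."""
    match = re.search(
        r'<script[^>]*id="listing-ld"[^>]*>(.*?)</script>',
        html,
        re.S,
    )
    if match is None:
        # No matching script tag on the page.
        return []
    payload = json.loads(match.group(1))
    return payload.get("@graph", [])


articles = extract_listing(sample_html)
for article in articles:
    print(article["headline"], article["url"])
```

Parsing the whole JSON object and then reading `@graph` avoids depending on the order of the keys inside the script tag.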

Since this returns a list rather than a SelectorList, I cannot keep following the example to implement item loaders.

Could you explain which Python concepts are used in the code below, so that I can familiarize myself with them and adapt it to my use case? Why is the item passed to the item loader before the data is extracted with the CSS selector (.add_css)?

from itemloaders.processors import TakeFirst, MapCompose
from scrapy.loader import ItemLoader

class ChocolateProductLoader(ItemLoader):

    default_output_processor = TakeFirst()
    price_in = MapCompose(lambda x: x.split("£")[-1])
    url_in = MapCompose(lambda x: 'https://www.chocolate.co.uk' + x )

import scrapy
from chocolatescraper.itemloaders import ChocolateProductLoader
from chocolatescraper.items import ChocolateProduct

class ChocolateSpider(scrapy.Spider):

    # The name of the spider
    name = 'chocolatespider'

    # These are the urls that we will start scraping
    start_urls = ['https://www.chocolate.co.uk/collections/all']

    def parse(self, response):
        products = response.css('product-item')

        for product in products:
            chocolate = ChocolateProductLoader(item=ChocolateProduct(), selector=product)
            chocolate.add_css('name', "a.product-item-meta__title::text")
            chocolate.add_css('price', 'span.price', re='<span class="price">\n <span class="visually-hidden">Sale price</span>(.*)</span>')
            chocolate.add_css('url', 'div.product-item-meta a::attr(href)')
            yield chocolate.load_item()

        next_page = response.css('[rel="next"] ::attr(href)').get()

        if next_page is not None:
            next_page_url = 'https://www.chocolate.co.uk' + next_page
            yield response.follow(next_page_url, callback=self.parse)
