
[Python] scraping json with scrapy item loaders

Discussion in 'Python' started by Stack, October 8, 2024.


    I am following this example for scraping a news website. When I check the type returned in the example, it is a scrapy.selector.unified.SelectorList.

    In my case, since the data of interest is enclosed in <script> tags, I managed to extract and parse it into a list with the Python code below.

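    # Run inside the Scrapy shell, where fetch() and response are available
    import json
    import re

    fetch('https://newswebsite.com/news/national')

    # Capture the value of "@graph" from the JSON-LD <script> block
    data = re.findall("<script type=.application.ld.json. id=.listing-ld.>{.@graph.:(.+?),.@context.:.http:..schema.org..<.script>", response.text, re.S)

    # re.findall returns a list of strings; join them before parsing as JSON
    jsonData = json.loads(''.join(data))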
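The regex above depends on the exact key order and spacing of the embedded JSON. A minimal alternative sketch, using only the standard library, is to pull out the whole script body, parse it as JSON, and then read the "@graph" key. The sample HTML, the JsonLdExtractor class and the extract_graph function are hypothetical names made up for illustration; only the script id (listing-ld) and the "@graph"/"@context" keys come from the question.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the text of the JSON-LD <script> tag with a given id."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (tag == "script"
                and attrs.get("type") == "application/ld+json"
                and attrs.get("id") == self.script_id):
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        # Script content may arrive in several chunks, so collect all of them
        if self._inside:
            self.chunks.append(data)


def extract_graph(html, script_id="listing-ld"):
    """Return the list stored under "@graph" in the matching script tag."""
    parser = JsonLdExtractor(script_id)
    parser.feed(html)
    parser.close()
    document = json.loads("".join(parser.chunks))
    return document["@graph"]


# Hypothetical page content, shaped like the structure the regex implies
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json" id="listing-ld">
{"@graph": [{"@type": "NewsArticle", "headline": "First story", "url": "/news/1"},
            {"@type": "NewsArticle", "headline": "Second story", "url": "/news/2"}],
 "@context": "http://schema.org"}
</script>
</head><body></body></html>
"""

articles = extract_graph(SAMPLE_HTML)
for article in articles:
    print(article["headline"], article["url"])
```

Inside a spider, the script text could probably be obtained with a CSS selector such as response.css('script#listing-ld::text').get() and passed to json.loads in the same way, assuming the page really carries that id.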



    Because this returns a list rather than a SelectorList, I cannot keep following the example to implement item loaders.

    Could you explain which Python concepts are used in the code below, so I can familiarize myself with them and adapt it to my use case? Why is the item passed to the item loader before the data is extracted with the CSS selector (.add_css)?

    from itemloaders.processors import TakeFirst, MapCompose
    from scrapy.loader import ItemLoader

    class ChocolateProductLoader(ItemLoader):

        default_output_processor = TakeFirst()
        price_in = MapCompose(lambda x: x.split("£")[-1])
        url_in = MapCompose(lambda x: 'https://www.chocolate.co.uk' + x )
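The loader class above relies on a few plain Python ideas: class attributes, lambda functions, and a naming convention in which an attribute called <field>_in is the input processor for that field and default_output_processor is applied when the item is built. As a rough sketch of what the two processors do, the functions below imitate them with the standard library only. map_compose and take_first are hypothetical stand-ins, not the real itemloaders implementations, and the sample strings are invented.

```python
def map_compose(*functions):
    """Rough stand-in for MapCompose: run each function over every value."""
    def process(values):
        for func in functions:
            next_values = []
            for value in values:
                result = func(value)
                if result is None:
                    continue  # None results are dropped
                if isinstance(result, (list, tuple)):
                    next_values.extend(result)  # list results are flattened
                else:
                    next_values.append(result)
            values = next_values
        return values
    return process


def take_first(values):
    """Rough stand-in for TakeFirst: first value that is not None or ''."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


# Same lambdas as in the loader class above
price_in = map_compose(lambda x: x.split("£")[-1])
url_in = map_compose(lambda x: "https://www.chocolate.co.uk" + x)

# Invented inputs, standing in for what a selector might have extracted
prices = price_in(["Sale price£8.50", "£4.00"])
urls = url_in(["/products/dark-bar"])

print(prices, take_first(prices))
print(urls, take_first(urls))
```

The input processor cleans every extracted value as it is added, while the output processor decides what finally goes into the field, here a single value instead of a list.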


    import scrapy
    from chocolatescraper.itemloaders import ChocolateProductLoader
    from chocolatescraper.items import ChocolateProduct


    class ChocolateSpider(scrapy.Spider):

        # The name of the spider
        name = 'chocolatespider'

        # These are the urls that we will start scraping
        start_urls = ['https://www.chocolate.co.uk/collections/all']

        def parse(self, response):
            products = response.css('product-item')

            for product in products:
                chocolate = ChocolateProductLoader(item=ChocolateProduct(), selector=product)
                chocolate.add_css('name', "a.product-item-meta__title::text")
                chocolate.add_css('price', 'span.price', re='<span class="price">\n <span class="visually-hidden">Sale price</span>(.*)</span>')
                chocolate.add_css('url', 'div.product-item-meta a::attr(href)')
                yield chocolate.load_item()

            next_page = response.css('[rel="next"] ::attr(href)').get()

            if next_page is not None:
                next_page_url = 'https://www.chocolate.co.uk' + next_page
                yield response.follow(next_page_url, callback=self.parse)
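On the question of why the item is passed in before .add_css is called: the loader appears to act as a collector. It is created around an empty item, each add_* call only gathers and cleans values, and the item is filled in when load_item() runs. The real ItemLoader also has an add_value method for values that are already in hand, which is likely the natural fit for a list parsed from JSON. The sketch below imitates that flow with the standard library only. MiniLoader, the field names and the sample data are hypothetical, and the class is a simplified imitation, not the Scrapy implementation.

```python
class MiniLoader:
    """Hypothetical, simplified imitation of the ItemLoader flow."""

    def __init__(self, item, input_processors=None):
        self.item = item                 # empty container, filled in load_item()
        self.input_processors = input_processors or {}
        self._values = {}                # field name -> list of collected values

    def add_value(self, field, value):
        # Collect and clean values; nothing is written to the item yet
        values = value if isinstance(value, list) else [value]
        process = self.input_processors.get(field, lambda v: v)
        cleaned = [process(v) for v in values if v is not None]
        self._values.setdefault(field, []).extend(cleaned)

    def load_item(self):
        # Output step: keep the first collected value per field (TakeFirst-like)
        for field, values in self._values.items():
            if values:
                self.item[field] = values[0]
        return self.item


# Invented entries, standing in for the list parsed from the JSON-LD block
json_data = [
    {"headline": "  First story ", "url": "/news/1"},
    {"headline": "Second story", "url": "/news/2"},
]

processors = {
    "headline": str.strip,
    "url": lambda path: "https://newswebsite.com" + path,
}

items = []
for entry in json_data:
    loader = MiniLoader(item={}, input_processors=processors)
    loader.add_value("headline", entry.get("headline"))
    loader.add_value("url", entry.get("url"))
    items.append(loader.load_item())

print(items)
```

With the real library, the equivalent loop would presumably create the loader without a selector and call add_value for each field, then yield load_item() from the spider callback.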
