
[Python] Page through S3 objects matching specific filename using boto3

Discussion in 'Python' started by Stack, October 7, 2024 at 08:52.

  1. Stack

    Stack, Participating Member

    I have an AWS S3 bucket with a Prefix (or "folder") called /photos. It "contains" a bunch of image files and a smaller number of EVENT.json files. A naive representation might look like this:

    • my-awesome-events-bucket
      • photos
        • image1.jpg
        • image2.jpg
        • 1_EVENT.json
        • image3.jpg
        • 2_EVENT.json
        • ...

    Each EVENT.json file holds an object with path references to an arbitrary number of image files, which groups those images into a specific event. Using the example above, image1.jpg and image2.jpg could appear in 1_EVENT.json, and image3.jpg may belong to 2_EVENT.json.

    As the bucket gets larger, I want to page through the results, requesting only one page at a time from S3 as I need it. The problem I'm running into is that I want to page specifically by keys that contain the word "EVENT". I'm finding this difficult to accomplish without bringing back ALL the objects and then filtering or iterating over the results.

    Using an S3 Paginator, I'm able to get paging working. Assuming my PageSize and MaxItems are set to 6, this is what I might get back for my first page:

    /photos/
    /photos/image1.jpg
    /photos/image2.jpg
    /photos/1_EVENT.json
    /photos/image3.jpg
    /photos/2_EVENT.json
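
    For illustration, a single page like the one above can also be requested without the paginator by calling list_objects_v2 directly. This is a minimal sketch, not the code from the post: the helper name fetch_page and its arguments are assumptions, and MaxKeys plays the role that PageSize plays in PaginationConfig.

```python
def fetch_page(client, bucket, prefix, page_size, token=None):
    """Request one page of keys under `prefix`.

    `client` is expected to behave like boto3.client("s3").
    Returns (keys, next_token); next_token is None on the last page.
    """
    kwargs = {"Bucket": bucket, "Prefix": prefix, "MaxKeys": page_size}
    if token is not None:
        # Resume where the previous page stopped.
        kwargs["ContinuationToken"] = token
    resp = client.list_objects_v2(**kwargs)
    # "Contents" is absent when nothing matches the prefix.
    keys = [obj["Key"] for obj in resp.get("Contents", [])]
    return keys, resp.get("NextContinuationToken")
```

    With a real client (client = boto3.client('s3')), the returned token is passed back in as token to request the following page. The page still contains whatever keys come next under the Prefix, EVENT or not.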


    S3's flat structure means that the listing pages through every object in the bucket that matches the Prefix, and the pagination parameters only limit how many of those objects come back at a time. This means that a given page could easily contain multiple EVENT.json files, or none at all.
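
    The effect can be reproduced locally without calling S3. The sketch below assumes the example keys from this post, sorts them the way a listing returns them (lexicographically by key), and cuts them into fixed-size pages; the page size of 3 is chosen only for illustration.

```python
# Example keys from the post, in the order a listing would return them.
keys = sorted([
    "photos/",
    "photos/image1.jpg",
    "photos/image2.jpg",
    "photos/1_EVENT.json",
    "photos/image3.jpg",
    "photos/2_EVENT.json",
])


def pages(seq, size):
    """Yield consecutive fixed-size slices of seq."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


# Count how many EVENT keys land on each page.
event_counts = [sum("EVENT" in key for key in page) for page in pages(keys, 3)]
print(event_counts)  # prints [2, 0]
```

    The first page holds both EVENT files and the second holds none, so the page size says nothing about how many EVENT keys a page delivers.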

    So I'm looking for something more along the lines of this:

    /photos/1_EVENT.json
    /photos/2_EVENT.json
    /photos/3_EVENT.json
    /photos/4_EVENT.json
    /photos/5_EVENT.json
    /photos/6_EVENT.json


    without first having to request all objects and then slice the result set in some way, which is exactly what I'm doing currently:

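    import boto3

    # `app` is the application object from the original post, defined elsewhere.
    client = boto3.client('s3')
    paginator = client.get_paginator('list_objects_v2')
    # PaginationConfig (MaxItems and PageSize) is left off intentionally.
    page_iterator = paginator.paginate(
        Bucket=app.config.get('S3_BUCKET'),
        Prefix="photos/")
    # The JMESPath filter runs client-side, after each page has been listed.
    filtered_iterator = page_iterator.search(
        "Contents[?contains(Key, `EVENT`)][]")
    for obj in filtered_iterator:
        # Each item is one matching object entry, not a whole page.
        # Do stuff.
        pass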


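    The above is really expensive: it lists every object under photos/ and filters the results client-side, with no paging of the filtered results. It does, however, give me a list of all files containing my "EVENT" search string.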
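
    A lazier variant of the same client-side filter is sketched below. It is an assumption-laden illustration, not a server-side filter: list_objects_v2 narrows results by Prefix only, so non-matching keys are still listed, but the scan stops as soon as enough EVENT keys have been collected. The helper name next_event_page and its parameters (want, batch, needle) are hypothetical.

```python
def next_event_page(client, bucket, prefix, want, start_after=None,
                    needle="EVENT", batch=1000):
    """Collect up to `want` keys containing `needle`, scanning lazily.

    `client` is expected to behave like boto3.client("s3").
    Returns (matches, resume_key). Pass resume_key back as start_after
    to get the next page; None means the listing is exhausted.
    """
    matches = []
    cursor = start_after
    while len(matches) < want:
        kwargs = {"Bucket": bucket, "Prefix": prefix, "MaxKeys": batch}
        if cursor is not None:
            # StartAfter resumes the listing after the last key examined.
            kwargs["StartAfter"] = cursor
        resp = client.list_objects_v2(**kwargs)
        for obj in resp.get("Contents", []):
            cursor = obj["Key"]
            if needle in cursor:
                matches.append(cursor)
                if len(matches) == want:
                    # Stop mid-batch; the rest is rescanned on the next call.
                    return matches, cursor
        if not resp.get("IsTruncated", False):
            return matches, None
    return matches, cursor
```

    The cost of each call depends on how sparsely the EVENT keys are spread among the other keys, so this reduces the work per request without removing the filtering overhead.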

    I specifically want to page results of only EVENT.json objects through S3 using boto3 without the overhead of returning and filtering all objects every request. Is that possible?

    EDIT: I'm already narrowing requests down to just objects with the photos/ Prefix. This is because there are other "folders" in my bucket that may also contain EVENT files. That prevents me from using EVENT or EVENT.json as my Prefix, because the response may be polluted by files from other folders.
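
    As a side note on Prefix semantics, the sketch below mimics how a Prefix narrows a listing: it is compared against the start of the full key, not against any substring. The keys under videos/ and the EVENT_notes.txt key are hypothetical, added only to show the behaviour.

```python
# Hypothetical bucket contents spanning more than one "folder".
keys = [
    "photos/1_EVENT.json",
    "photos/image1.jpg",
    "videos/7_EVENT.json",
    "EVENT_notes.txt",
]


def list_with_prefix(all_keys, prefix):
    """Local stand-in for Prefix filtering: match the start of the key."""
    return [key for key in sorted(all_keys) if key.startswith(prefix)]


print(list_with_prefix(keys, "photos/"))
print(list_with_prefix(keys, "EVENT"))
```

    Under that assumption, a Prefix of EVENT returns only keys that begin with EVENT, so it would miss photos/1_EVENT.json entirely rather than mixing in EVENT files from other folders.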

