[Python] Custom decode binary data in polars

Discussion in 'Python' started by Stack, September 27, 2024 at 22:32.

    When working with binary data, I use a custom function to decode it. This requires map_elements (formerly apply) in polars. Because the function is applied element by element, the computation time increases significantly on large data sets.

    I tried to cast the binary data to List(UInt8), but this is not yet implemented.

    exceptions.ArrowErrorException: NotYetImplemented("Casting from LargeBinary to LargeList(Field { name: \"item\", data_type: UInt8, is_nullable: true, metadata: {} }) not supported")

    Is there a more efficient way of doing it?

    import polars as pl
    import struct
    import io

    data = {"binary": [b'\xFD\x00\xFE\x00\xFF\x00',b'\x10\x00\x20\x00\x30\x00'], "id": [1,2]}
    schema = {"binary": pl.Binary, "id":pl.Int16}

    df = pl.DataFrame(data, schema)

    This returns:

    shape: (2, 2)
    ┌───────────────┬─────┐
    │ binary        ┆ id  │
    │ ---           ┆ --- │
    │ binary        ┆ i16 │
    ╞═══════════════╪═════╡
    │ [binary data] ┆ 1   │
    │ [binary data] ┆ 2   │
    └───────────────┴─────┘

    Now when we apply our function to decode the binary column:

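    def custom_decode(data):
        # Read the value as a stream of little-endian unsigned 16-bit integers.
        bytestream = io.BytesIO(data)
        lst = []

        # Each value in this example is 6 bytes long, i.e. three 2-byte integers.
        while bytestream.tell() < 6:
            lst.append(struct.unpack('<H', bytestream.read(2))[0])

        return lst

    df = df.with_columns(
        pl.col('binary').map_elements(lambda x: custom_decode(x))
    )

    This returns: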

    shape: (2, 2)
    ┌─────────────────┬─────┐
    │ binary          ┆ id  │
    │ ---             ┆ --- │
    │ list[i64]       ┆ i16 │
    ╞═════════════════╪═════╡
    │ [253, 254, 255] ┆ 1   │
    │ [16, 32, 48]    ┆ 2   │
    └─────────────────┴─────┘
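
    For comparison, here is a minimal sketch of the same decoding done with a single struct.unpack call per value instead of a BytesIO loop. The name decode_u16_le is hypothetical, and the sketch assumes every value holds only consecutive little-endian unsigned 16-bit integers, as in the example data. It is not a vectorized solution and does not remove the per-row Python call made by map_elements; it only trims the work done inside each call.

```python
import struct


def decode_u16_le(data: bytes) -> list:
    """Decode bytes as consecutive little-endian unsigned 16-bit integers."""
    count = len(data) // 2  # number of complete 2-byte values
    # One unpack call over the whole buffer instead of one call per integer.
    # Any trailing odd byte is ignored (an assumption of this sketch).
    return list(struct.unpack(f"<{count}H", data[: count * 2]))


print(decode_u16_le(b"\xFD\x00\xFE\x00\xFF\x00"))  # [253, 254, 255]
print(decode_u16_le(b"\x10\x00\x20\x00\x30\x00"))  # [16, 32, 48]
```

    Assuming it behaves as above, it could be passed to map_elements in place of custom_decode, for example pl.col('binary').map_elements(decode_u16_le). Unlike the hard-coded length of 6, it derives the number of integers from the length of each value.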

