
[Python] How to get consistent results in tabular PDF parsing with llama-parse?

Discussion in 'Python' started by Stack, September 13, 2024.

    I was parsing some PDF files with llama-parse in Python using the code below:

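    import os

    import nest_asyncio
    import pandas as pd
    from llama_parse import LlamaParse

    # Allow nested event loops (needed where a loop is already running, e.g. notebooks)
    nest_asyncio.apply()

    key_input = "some_key_id"
    # The client reads LLAMA_CLOUD_API_KEY; the key is also passed explicitly below
    os.environ["LLAMA_CLOUD_API_KEY"] = key_input

    # Run the llama-parse job and return the document as markdown
    doc_parsed = LlamaParse(result_type="markdown", api_key=key_input).load_data(
        r"Path\myfile.pdf"
    )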


    Parsing the same document with this same code now gives different results than it did a month ago. The difference is in the tabular text: the | column separators are gone, and each cell now appears on its own line.

    Is there a way to get the old results back from llama-parse, or to fix some parameters (for example, pin the model) so that the same document always parses the same way? I want to build analytics on top of this output with stable code logic.
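One workaround that does not depend on the service's behavior is to store the first parse of each file on disk and reuse it on later runs, so downstream analytics always see the same text. The following is a minimal sketch, assuming a hypothetical `parse_fn` callable that takes a path and returns a list of page strings (for example, a wrapper around the `load_data` call above). The `cached_parse` name and the cache layout are illustrative and not part of llama-parse.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, List


def cached_parse(
    pdf_path: str,
    parse_fn: Callable[[str], List[str]],
    cache_dir: str = "parse_cache",
) -> List[str]:
    """Return page texts for pdf_path, calling parse_fn only on a cache miss.

    parse_fn is a hypothetical callable (path -> list of page strings).
    Results are keyed by the SHA-256 of the file contents.
    """
    digest = hashlib.sha256(Path(pdf_path).read_bytes()).hexdigest()
    cache_file = Path(cache_dir) / f"{digest}.json"

    if cache_file.exists():
        # Reuse the stored parse so later runs see identical text.
        return json.loads(cache_file.read_text(encoding="utf-8"))

    pages = parse_fn(pdf_path)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(pages), encoding="utf-8")
    return pages
```

With the code from the question, `parse_fn` could be something like `lambda p: [d.text for d in LlamaParse(result_type="markdown", api_key=key_input).load_data(p)]`. This only pins the output for files that were already parsed; it does not make a fresh parse of a new file come back in the old layout.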

    Last month's llama results:

    print(doc_parsed[5].text[:1000])


    # Information

    |Name|: Mr. XXX|
    |---|---|
    |Age/Sex|: XX YRS/M|
    |Lab Id.|: 0124080X|
    |Refered By|: Self|
    |Sample Collection On|: 03/Aug/2024 08:30AM|
    |Collected By|: XXX|
    |Sample Lab Rec. On|: 03/Aug/2024 11:50 AM|
    |Collection Mode|: HOME COLLECTION|
    |Reporting On|: 03/Aug/2024 02:48 PM|
    |BarCode|: XXX|

    # Test Results

    |Test Name|Result|Biological Ref. Int.|Unit|
    |---|---|---|---|


    Llama results on the same PDF now:

    print(doc_parsed[5].text[:1000])


    # Report

    Name: Mr. XXX

    Age/Sex: XXX YRS/M

    Lab Id: 0124080X

    Referred By: Self

    Sample Collection On: 03/Aug/2024 08:30 AM

    Collected By: XXX

    Sample Lab Rec. On: 03/Aug/2024 11:50 AM

    Collection Mode: HOME COLLECTION

    Reporting On: 03/Aug/2024 02:48 PM

    BarCode: XXX

    # Test Results

    Test Name
    Result
    Biological Ref. Int.
    Unit



    Desired Results:

    # The part above doesn't matter, but Test Results should be separated by |
    # Test Results

    |Test Name|Result|Biological Ref. Int.|Unit|
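Since only the `# Test Results` section needs to stay pipe-separated, one option is to normalize it after parsing rather than rely on the raw layout. The sketch below handles both layouts shown above. It assumes the section starts at a line reading exactly `# Test Results`, and that in the newer layout every cell sits on its own line with a fixed number of columns (4 here, matching the header). The helper names `extract_results_rows` and `rows_to_markdown` are hypothetical, and the one-cell-per-line assumption may not hold for rows with empty cells.

```python
from typing import List


def extract_results_rows(page_text: str, n_cols: int = 4) -> List[List[str]]:
    """Return the rows under '# Test Results' as lists of cell strings."""
    lines = [ln.strip() for ln in page_text.splitlines()]
    try:
        start = lines.index("# Test Results") + 1
    except ValueError:
        return []  # heading not found on this page

    # Collect non-blank lines up to the next markdown heading.
    body = []
    for ln in lines[start:]:
        if ln.startswith("#"):
            break
        if ln:
            body.append(ln)

    if any(ln.startswith("|") for ln in body):
        # Older layout: a pipe table. Split cells, skip the |---|---| separator.
        rows = []
        for ln in body:
            if not ln.startswith("|"):
                continue
            inner = ln[1:-1] if ln.endswith("|") else ln[1:]
            cells = [c.strip() for c in inner.split("|")]
            if all(c and set(c) <= set("-:") for c in cells):
                continue
            rows.append(cells)
        return rows

    # Newer layout (assumption): one cell per line, n_cols cells per row.
    # A trailing incomplete group is dropped.
    full = len(body) - len(body) % n_cols
    return [body[i:i + n_cols] for i in range(0, full, n_cols)]


def rows_to_markdown(rows: List[List[str]]) -> str:
    """Render rows as a pipe-separated table, first row as the header."""
    if not rows:
        return ""
    out = ["|" + "|".join(rows[0]) + "|"]
    out.append("|" + "|".join(["---"] * len(rows[0])) + "|")
    out += ["|" + "|".join(r) + "|" for r in rows[1:]]
    return "\n".join(out)
```

Calling `rows_to_markdown(extract_results_rows(doc_parsed[5].text))` would then give the same pipe-separated table for either layout, as long as the assumptions above hold for the page.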


    Did the model behind the service change, and is that what causes the difference? Can I pin the model to get consistent results?
