1. Anuncie Aqui ! Entre em contato fdantas@4each.com.br

[Python] Using Pandas and Patches, Extract Hyperlinks From Excel (.xlsx) files, and create a...

Discussão em 'Python' iniciado por Stack, Outubro 4, 2024 às 00:22.

  1. Stack

    Stack Membro Participativo

    Issue:


    I am attempting to incorporate code that allows for pandas (and openpyxl or any working method you are able to provide) to properly input HYPERLINKS into a DataFrame.

    However, despite seeming successful attempts in the GitHub and StackOverflow solutions (GITHUB: Feature request: read_excel to support hyperlinks split into label and hyperlink ), I cannot successfully export HYPERLINKS from Excel files (in INPUT_DIR) into a DataFrame (and ultimately, into a merged Excel file). I don't receive any errors, it just processes it as if I didn't use the additional--intended solution--code provided by https://gist.github.com/zachschillaci27/887c272cbdda63316d9a02a770d15040. It is possible that the solution exists in the links I provided above, but I am not sure how to implement them with my code (my code is processing multiple excel files into a merged excel file, and the solutions don't seem to incorporate this(?)). Apologies, if my coding is too green to extrapolate a solution.

    Possible root issue: I think it might have to do with the URI relative paths (i.e. ../../../) of the HYPERLINKS I use as opposed to a traditional URL (but not sure?).

    The original excel files I am attempting to merge have HYPERLINKS (e.g. =HYPERLINK("../../../data/POD-00011_2408081740.MP3","../../../data/POD-00011_2408081740.MP3") that work--such that when you click on the cell, it will open/execute the file by going up 3 directories/folders (i.e. ../../../) and then going into "data" or "transcripts_output" directory. I am attempting to merge all the Excel files using pd.read_excel() and pd.concat() (See Method 2. HERE) to merge the Excel files into a single Excel file (with proper HYPERLINKS)

    Referenced Troubleshooting:


    Based on a number of Github and StackOverflow threads:

    https://pastebin.com/0bBMW8fY

    My Setup Environment:


    os platform: Win-64; Win11
    conda: 24.7.1
    python: 3.10
    pandas: 2.2.3
    openpyxl: 3.1.5

    Code:


    import pandas as pd
    import openpyxl

    from pathlib import Path

    def dir_not_empty(_dir: Path) -> bool:
    return any(_dir.iterdir())


    #### Following Script Used From:#################################################################
    # intended to augment pandas read_excel() to accept HYPERLINK cells/data
    # https://gist.github.com/zachschillaci27/887c272cbdda63316d9a02a770d15040
    def _get_link_if_exists(cell) -> str | None:
    try:
    return cell.hyperlink.target
    except AttributeError:
    return None


    def extract_hyperlinks_from_xlsx(
    file_name: str, sheet_name: str, columns_to_parse: list[str], row_header: int = 1
    ) -> pd.DataFrame:
    df = pd.read_excel(file_name, sheet_name)
    ws = openpyxl.load_workbook(file_name)[sheet_name]
    for column in columns_to_parse:
    row_offset = row_header + 1
    column_index = list(df.columns).index(column) + 1
    df[column] = [
    _get_link_if_exists(ws.cell(row=row_offset + i, column=column_index))
    for i in range(len(df[column]))
    ]
    return df
    ##################################################################################################

    # specifying the path to csv files:
    INPUT_DIR = Path("./output/xlsx_output/xlsx_run_outputs")
    OUTPUT_DIR = Path("./output/xlsx_output/merged_xlsx/merged_podcast_outputs.xlsx")

    # csv files in the path
    # list of excel files we want to merge.
    # pd.read_excel(file_path) reads the
    # excel data into pandas dataframe.

    if dir_not_empty(INPUT_DIR):
    file_list = INPUT_DIR.iterdir()

    # Two following lines are commented out/ removed from script and replaced with
    # code that attempts to utilize script made from
    # https://gist.github.com/zachschillaci27/887c272cbdda63316d9a02a770d15040:
    # excl_list = [pd.read_excel(file) for file in file_list]
    # print(excl_list)

    # Replacement code (see https://gist.github.com/zachschillaci27/887c272cbdda63316d9a02a770d15040)
    # meant to allow for HYPERLINKS in pd.read_excel():
    column_list = ["input_urls", "output_srt_urls", "output_json_urls", "output_vtt_urls", "output_txt_urls", "output_tsv_urls"]
    excl_list = [extract_hyperlinks_from_xlsx(file, 'Sheet1', column_list, row_header=1) for file in file_list]
    print(excl_list)

    Actual Output:


    [ Unnamed: 0 input_urls output_srt_urls output_json_urls ... creation_date creation_time word_count file_sizes
    0 0 None None None ... 20240808 1740 13 5036032
    1 1 None None None ... 20240808 1754 6 900096

    [2 rows x 11 columns]]

    Intended Output: Output that has corresponding appropriate HYPERLINKS instead of cells with 'None':


    **APPROPRIATE HYPERLINKS to replace Nones**
    ../../../data/POD-00011_2408081740.MP3 ../../transcripts_output/POD-00011_2408081740.srt ../../transcripts_output/POD-00011_2408081740.json ../../transcripts_output/POD-00011_2408081740.vtt ../../transcripts_output/POD-00011_2408081740.txt ../../transcripts_output/POD-00011_2408081740.tsv
    ../../../data/POD-00012_2408081754.MP3 ../../transcripts_output/POD-00012_2408081754.srt ../../transcripts_output/POD-00012_2408081754.json ../../transcripts_output/POD-00012_2408081754.vtt ../../transcripts_output/POD-00012_2408081754.txt ../../transcripts_output/POD-00012_2408081754.tsv


    Any help would be much appreciated. If my question doesn't meet best practice, please feel free to critique and I will try to make the appropriate adjustments.

    Continue reading...

Compartilhe esta Página