[Python] Using Pandas and Patches, Extract Hyperlinks From Excel (.xlsx) files, and create a...

Stack · Outubro 4, 2024 às 00:22

Issue:

I am attempting to incorporate code that allows for pandas (and openpyxl or any working method you are able to provide) to properly input HYPERLINKS into a DataFrame.

However, despite seeming successful attempts in the GitHub and StackOverflow solutions (GITHUB: Feature request: read_excel to support hyperlinks split into label and hyperlink ), I cannot successfully export HYPERLINKS from Excel files (in INPUT_DIR) into a DataFrame (and ultimately, into a merged Excel file). I don't receive any errors, it just processes it as if I didn't use the additional--intended solution--code provided by https://gist.github.com/zachschillaci27/887c272cbdda63316d9a02a770d15040. It is possible that the solution exists in the links I provided above, but I am not sure how to implement them with my code (my code is processing multiple excel files into a merged excel file, and the solutions don't seem to incorporate this(?)). Apologies, if my coding is too green to extrapolate a solution.

Possible root issue: I think it might have to do with the URI relative paths (i.e. ../../../) of the HYPERLINKS I use as opposed to a traditional URL (but not sure?).

The original excel files I am attempting to merge have HYPERLINKS (e.g. =HYPERLINK("../../../data/POD-00011_2408081740.MP3","../../../data/POD-00011_2408081740.MP3") that work--such that when you click on the cell, it will open/execute the file by going up 3 directories/folders (i.e. ../../../) and then going into "data" or "transcripts_output" directory. I am attempting to merge all the Excel files using pd.read_excel() and pd.concat() (See Method 2. HERE) to merge the Excel files into a single Excel file (with proper HYPERLINKS)

Referenced Troubleshooting:

Based on a number of Github and StackOverflow threads:

https://pastebin.com/0bBMW8fY

My Setup Environment:

os platform: Win-64; Win11
conda: 24.7.1
python: 3.10
pandas: 2.2.3
openpyxl: 3.1.5

Code:

import pandas as pd
import openpyxl

from pathlib import Path

def dir_not_empty(_dir: Path) -> bool:
return any(_dir.iterdir())

#### Following Script Used From:#################################################################
# intended to augment pandas read_excel() to accept HYPERLINK cells/data
# https://gist.github.com/zachschillaci27/887c272cbdda63316d9a02a770d15040
def _get_link_if_exists(cell) -> str | None:
try:
return cell.hyperlink.target
except AttributeError:
return None

def extract_hyperlinks_from_xlsx(
file_name: str, sheet_name: str, columns_to_parse: list[str], row_header: int = 1
) -> pd.DataFrame:
df = pd.read_excel(file_name, sheet_name)
ws = openpyxl.load_workbook(file_name)[sheet_name]
for column in columns_to_parse:
row_offset = row_header + 1
column_index = list(df.columns).index(column) + 1
df[column] = [
_get_link_if_exists(ws.cell(row=row_offset + i, column=column_index))
for i in range(len(df[column]))
]
return df
##################################################################################################

# specifying the path to csv files:
INPUT_DIR = Path("./output/xlsx_output/xlsx_run_outputs")
OUTPUT_DIR = Path("./output/xlsx_output/merged_xlsx/merged_podcast_outputs.xlsx")

# csv files in the path
# list of excel files we want to merge.
# pd.read_excel(file_path) reads the
# excel data into pandas dataframe.

if dir_not_empty(INPUT_DIR):
file_list = INPUT_DIR.iterdir()

# Two following lines are commented out/ removed from script and replaced with
# code that attempts to utilize script made from
# https://gist.github.com/zachschillaci27/887c272cbdda63316d9a02a770d15040:
# excl_list = [pd.read_excel(file) for file in file_list]
# print(excl_list)

# Replacement code (see https://gist.github.com/zachschillaci27/887c272cbdda63316d9a02a770d15040)
# meant to allow for HYPERLINKS in pd.read_excel():
column_list = ["input_urls", "output_srt_urls", "output_json_urls", "output_vtt_urls", "output_txt_urls", "output_tsv_urls"]
excl_list = [extract_hyperlinks_from_xlsx(file, 'Sheet1', column_list, row_header=1) for file in file_list]
print(excl_list)

Actual Output:

[ Unnamed: 0 input_urls output_srt_urls output_json_urls ... creation_date creation_time word_count file_sizes
0 0 None None None ... 20240808 1740 13 5036032
1 1 None None None ... 20240808 1754 6 900096

[2 rows x 11 columns]]

Intended Output: Output that has corresponding appropriate HYPERLINKS instead of cells with 'None':

**APPROPRIATE HYPERLINKS to replace Nones**
../../../data/POD-00011_2408081740.MP3 ../../transcripts_output/POD-00011_2408081740.srt ../../transcripts_output/POD-00011_2408081740.json ../../transcripts_output/POD-00011_2408081740.vtt ../../transcripts_output/POD-00011_2408081740.txt ../../transcripts_output/POD-00011_2408081740.tsv
../../../data/POD-00012_2408081754.MP3 ../../transcripts_output/POD-00012_2408081754.srt ../../transcripts_output/POD-00012_2408081754.json ../../transcripts_output/POD-00012_2408081754.vtt ../../transcripts_output/POD-00012_2408081754.txt ../../transcripts_output/POD-00012_2408081754.tsv

Any help would be much appreciated. If my question doesn't meet best practice, please feel free to critique and I will try to make the appropriate adjustments.

Continue reading...

Logar ou Criar uma Conta

[Python] Using Pandas and Patches, Extract Hyperlinks From Excel (.xlsx) files, and create a...

Stack Membro Participativo

Compartilhe esta Página

Logar ou Criar uma Conta

[Python] Using Pandas and Patches, Extract Hyperlinks From Excel (.xlsx) files, and create a...

Stack Membro Participativo

Compartilhe esta Página

Pesquisas Úteis