I have tried several approaches, but I still cannot find a way to save a dynamically generated XLS file from a website using Selenium so that I can analyze it later.
Situation: I’m using Selenium to do web scraping. At some point, the script needs to click a button to generate an XLS report. Locally this works because the script can create a local directory to save the file. However, in a Hugging Face Streamlit hosting environment, it cannot create folders. How can I save the file generated by the website within Hugging Face so that Python can analyze it later?
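One option I am considering, as a minimal sketch: ask Python for a temporary directory instead of creating one under the working directory. This assumes the hosting environment exposes a writable temp location, which I have not verified there. `make_download_dir` and the `spreadsheets_` prefix are hypothetical names used only for illustration.

```python
import os
import tempfile


def make_download_dir(prefix: str = "spreadsheets_") -> str:
    """Create and return a fresh temporary directory for downloads.

    Hypothetical helper: assumes the host allows writing to the
    location that tempfile picks (for example /tmp on Linux).
    """
    return tempfile.mkdtemp(prefix=prefix)


if __name__ == "__main__":
    directory = make_download_dir()
    print(f"Download directory: {directory}")
    # os.access reports whether the current process may write there
    print(f"Writable: {os.access(directory, os.W_OK)}")
```

If this works on the host, the returned path could be passed as `downloadPath` in place of the directory built from `os.getcwd()`.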
Here is part of the code that saves locally:
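import os
import time

import chardet
import pandas as pd

# `driver` and `download_link` are created earlier in the script (not shown)

# Create the directory to save the XLS file, if it does not already exist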
directory = os.path.join(os.getcwd(), "spreadsheets", "01")
print(f"\n\nDownload directory: {directory}")
os.makedirs(directory, exist_ok=True)
# Set the WebDriver download directory to the created directory
driver.command_executor._commands["send_command"] = (
    "POST",
    "/session/$sessionId/chromium/send_command",
)
params = {
    "cmd": "Page.setDownloadBehavior",
    "params": {"behavior": "allow", "downloadPath": directory},
}
driver.execute("send_command", params)
# Click on the download link
download_link.click()
# Wait for the download to complete
time.sleep(10)  # Ideally replace with an explicit wait that checks the existence of the file
# Assume the downloaded file is the most recent in the directory
xls_file_path = max(
    [os.path.join(directory, f) for f in os.listdir(directory)],
    key=os.path.getctime,
)
# Detect the file's encoding from its raw bytes
with open(xls_file_path, "rb") as f:
    result = chardet.detect(f.read())
# Read the file with the detected encoding
df = pd.read_csv(xls_file_path, delimiter="\t", encoding=result["encoding"])
# Convert the DataFrame to a string in markdown table format
df_string = df.to_markdown()
print(df)
print(f"\n\nDataFrame as a markdown table (df_string):\n{df_string}")
print(f"\n\nXLS DATA FROM xls_file_path: {xls_file_path}")
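The fixed `time.sleep(10)` above could be replaced by polling the download directory until the file is complete. This is a minimal sketch, not tested against the real site. It assumes the directory is empty before the click and that Chrome marks in-progress downloads with a `.crdownload` suffix. `wait_for_download` and its parameters are hypothetical names.

```python
import os
import time
from typing import Optional


def wait_for_download(
    directory: str, timeout: float = 30.0, poll: float = 0.5
) -> Optional[str]:
    """Return the newest finished file in `directory`, or None on timeout.

    Assumption: partial downloads carry a .crdownload suffix (Chrome).
    """
    deadline = time.monotonic() + timeout
    while True:
        names = os.listdir(directory)
        # Any .crdownload entry means a download is still being written
        in_progress = any(n.endswith(".crdownload") for n in names)
        finished = [
            os.path.join(directory, n)
            for n in names
            if not n.endswith(".crdownload")
        ]
        finished = [p for p in finished if os.path.isfile(p)]
        if finished and not in_progress:
            return max(finished, key=os.path.getmtime)
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)
```

Usage would be `xls_file_path = wait_for_download(directory)` right after `download_link.click()`, followed by a check for `None` before trying to read the file.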