I noticed a couple of Spaces where persistence appears to work for saving new records to a CSV file with HF Datasets.
I tried to reproduce it, but nothing actually gets saved.
Can you provide insight into how to use persistence when you modify data and then want to write the cached in-memory version back to a persistent shared Dataset?
Here was my last attempt:
Here are my last 3 failed attempts as Datasets:
These two Spaces appear to have it working, yet I cannot see how. Is it a key/secret or something?
1. Create or clone a repo using `Repository` (see app.py · julien-c/persistent-data at main). These methods use a token, `HF_TOKEN`, which is passed in as a secret from the Hub. Note that they also specify a local directory.
2. Save your data in that directory, e.g. the first Space appends the data to a CSV.
huggingface_hub also has an `upload_file` method, which might be more intuitive: it simply uploads one file at a time to a given dataset, see app.py · chrisjay/afro-speech at main.
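A minimal sketch of the `upload_file` route, assuming a hypothetical dataset repo `user/my-dataset` and an `HF_TOKEN` Space secret (the upload is simply skipped when no token is set, so the append still works locally):

```python
import csv
import os
from datetime import datetime

DATA_FILE = "data.csv"

def save_and_upload(name: str, message: str) -> None:
    # append locally first; write a header only when the file is new
    new_file = not os.path.exists(DATA_FILE)
    with open(DATA_FILE, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "message", "time"])
        if new_file:
            writer.writeheader()
        writer.writerow(
            {"name": name, "message": message, "time": str(datetime.now())}
        )
    token = os.environ.get("HF_TOKEN")  # Space secret; may be unset locally
    if token:
        # push the (small) CSV back to a dataset repo in one call
        from huggingface_hub import upload_file
        upload_file(
            path_or_fileobj=DATA_FILE,
            path_in_repo="data.csv",
            repo_id="user/my-dataset",  # hypothetical repo id
            repo_type="dataset",
            token=token,
        )

save_and_upload("alice", "hello")
```

Because the whole CSV is re-uploaded on each call, this fits small append-only logs; for larger data the cloned-`Repository` approach above avoids re-sending unchanged files.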
Yes. With the Streamlit caching primitives, especially `singleton` and `memo`, the combination of cached shared memory plus state persistence back to Datasets could be a killer app: fast, large data backed by HF Datasets.
I found the cache primitives make a huge difference when you need to store state across a browser window, or better yet across all users in a multiuser system. Thanks for these really cool features; they enable whole new classes of cloud-based AI pipeline apps.
Now that saving to a dataset works, I tried an example with a persistent chatbot which saves the inputs and outputs.
My theory is that for any AI to be really intelligent, it has to remember inputs and responses for retrospective attention; it can then handle semi-supervised corrections by considering follow-up inputs.
It works great so far, and I couldn't have done it without Gradio and the Datasets API. Cool stuff. I think adding a reprocessing loop in a separate process could begin to search and aggregate what you ask it about, creating an intelligence-gathering AI pipeline that factors user personalization into its ongoing contextual training and education.
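The "remember inputs and responses" idea above can be sketched framework-agnostically: each turn is appended to a CSV log, and the most recent turns are read back as context for the next one. The file name and function names here are illustrative, not from the actual Space:

```python
import csv
import os

LOG_FILE = "chat_log.csv"  # hypothetical log file

def remember(user_input: str, response: str) -> None:
    # append one input/response pair; write a header if the log is new
    new_file = not os.path.exists(LOG_FILE)
    with open(LOG_FILE, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["input", "response"])
        writer.writerow([user_input, response])

def recall(last_n: int = 5) -> list[tuple[str, str]]:
    # read back the most recent turns for retrospective attention
    if not os.path.exists(LOG_FILE):
        return []
    with open(LOG_FILE, newline="") as f:
        rows = list(csv.reader(f))[1:]  # skip the header row
    return [tuple(r) for r in rows[-last_n:]]

remember("hi", "hello!")
remember("how are you?", "fine, thanks")
print(recall())
```

In a Space, the same `remember`/`recall` pair would sit behind the Gradio callback, with the log pushed back to a dataset repo as described earlier in the thread.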
Please, is there a better way that doesn't involve downloading and uploading the entire repo on each run? My Space is a bit large, and if each user has to go through this download-upload process, the UX would be terrible.
I tried the upload method Chris used, with the code below:
and I get this error:

```
RepositoryNotFoundError: 401 Client Error. (Request ID: lIqBKT5C99YKUrGHpJNPu)
Repository Not Found for url: https://huggingface.co/api/spaces/AfrodreamsAI/afrodreams/preupload/main.
Please make sure you specified the correct `repo_id` and `repo_type`.
If the repo is private, make sure you are authenticated. Unauthorized
Note: Creating a commit assumes that the repo already exists on the Huggingface Hub. Please use `create_repo` if it's not the case.
```
Yours might be larger due to audio, but opening the file in append mode seems to work to persist back just the saved data.
```python
import csv
from datetime import datetime

def store_message(name: str, message: str):
    if name and message:
        # append the new row to the local CSV (assumes DATA_FILE and the
        # cloned `repo` are set up as in julien-c/persistent-data)
        with open(DATA_FILE, "a") as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=["name", "message", "time"])
            writer.writerow(
                {"name": name, "message": message, "time": str(datetime.now())}
            )
        # commit and push the appended file back to the Hub
        commit_url = repo.push_to_hub()
    return ""
```
That approach might also skip reloading the whole thing, since you'd only need to append unless you display the whole aggregated dataset. Good luck!
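To make the append-vs-reload trade-off concrete, here is a small sketch under assumed names (`messages.csv`, `append_row`, `load_all` are all hypothetical): writes are O(1) appends, and the full file is only re-read for the aggregated view (in a Space you might hand that file to `datasets.load_dataset("csv", ...)`; plain `csv` is used here):

```python
import csv
import os

DATA_FILE = "messages.csv"  # hypothetical file name

def append_row(name: str, message: str) -> None:
    # O(1) append: no need to reload existing rows first
    new_file = not os.path.exists(DATA_FILE)
    with open(DATA_FILE, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "message"])
        if new_file:
            writer.writeheader()
        writer.writerow({"name": name, "message": message})

def load_all() -> list[dict]:
    # full read, only needed when displaying the aggregated dataset
    with open(DATA_FILE, newline="") as f:
        return list(csv.DictReader(f))

append_row("ada", "first")
append_row("lin", "second")
print(len(load_all()))
```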