Save Data from Streamlit Session to Persist Changes to HF Datasets

I noticed a couple of Spaces where persistence appears to work for saving new records to a CSV file in an HF Dataset.

I tried to reproduce it, but it does not appear to save.

Can you provide insight on how to use persistence when you modify data and then want to write it back from the cached in-memory version to a persistent, shared Dataset?

Here was my last attempt:

Here are my last 3 failed attempts as Datasets:

These two Spaces appear to have it working, yet I cannot see how. Is it a key/secret or something?

3 Likes

Maybe @abidlabs or @osanseviero can help! It would be cool to document how to do this in Streamlit in addition to Gradio.

1 Like

To get it working, one option is to use the datasets library with its push_to_hub method.
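
A minimal sketch of that route (the rows, repo id, and token handling below are placeholders, not taken from any of the Spaces):

import os
from datasets import Dataset

# Hypothetical rows collected in memory during the session
records = {"name": ["Alice"], "message": ["hello"], "time": ["2022-08-01 12:00:00"]}
ds = Dataset.from_dict(records)
# Push the dataset to the Hub; HF_TOKEN would be a write token stored as a Space secret
ds.push_to_hub("your-username/your-dataset", token=os.environ["HF_TOKEN"])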

app.py · julien-c/persistent-data at main and app.py · chrisjay/afro-speech at main are examples that work using huggingface_hub (a Python library that wraps the Hugging Face Hub public APIs) via its Repository class.

  1. Create or clone a repo using Repository, as in app.py · julien-c/persistent-data at main. These methods use a token, HF_TOKEN, which is passed as a secret from the Hub. Note that they also specify a local directory.
  2. Save your data in the directory from above. E.g. the first Space appends the data to a CSV (see the sketch after this list).
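
A rough sketch of those two steps, assuming HF_TOKEN is stored as a Space secret and using a placeholder dataset repo id (not one of the Spaces above):

import os
from huggingface_hub import Repository

# Step 1: clone the dataset repo into a local directory using the HF_TOKEN secret
repo = Repository(
    local_dir="data",
    clone_from="user/my-dataset",  # placeholder dataset repo id
    repo_type="dataset",
    use_auth_token=os.environ["HF_TOKEN"],
)

# Step 2: append new rows to a file inside that local clone, then commit and push
DATA_FILE = os.path.join("data", "messages.csv")
with open(DATA_FILE, "a") as f:
    f.write("Alice,hello,2022-08-01 12:00:00\n")
repo.push_to_hub(commit_message="add new row")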

huggingface_hub also has an upload_file method, which might be more intuitive: it just uploads one file at a time to a given dataset, see app.py · chrisjay/afro-speech at main.
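
A hedged sketch of that route, again with placeholder names:

import os
from huggingface_hub import upload_file

upload_file(
    path_or_fileobj="data/messages.csv",  # local file to upload
    path_in_repo="messages.csv",          # destination path inside the repo
    repo_id="user/my-dataset",            # placeholder dataset repo id
    repo_type="dataset",
    token=os.environ["HF_TOKEN"],
)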

2 Likes

Thanks so much, the examples really helped. My final code that worked is below. The push to hub worked like a dream!

Dataset:

Space:


import csv
from datetime import datetime

def store_message(name: str, message: str):
    if name and message:
        # Append the new row to the CSV in the local clone of the dataset repo
        with open(DATA_FILE, "a") as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=["name", "message", "time"])
            writer.writerow(
                {"name": name, "message": message, "time": str(datetime.now())}
            )
        # Commit and push the change back to the Hub
        commit_url = repo.push_to_hub()
        print(commit_url)

    return generate_html()

Much appreciated!

1 Like

Yes, with the Streamlit caching primitives, especially singleton and memo, the combination of cached shared memory plus state persistence back to Datasets could be a killer app: fast, large data backed by HF Datasets.

I found the cache primitives make a huge difference when you need to store state across a browser window, or better yet across all users in a multi-user system. Thanks for these really cool features. They enable whole new classes of cloud-based AI pipeline apps.
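
As a rough sketch of the idea (placeholder names, reusing the Repository setup from earlier in the thread; the singleton primitive here is st.experimental_singleton):

import os
import streamlit as st
from huggingface_hub import Repository

# Clone the dataset repo once per server process and share it across all user sessions
@st.experimental_singleton
def get_repo():
    return Repository(
        local_dir="data",
        clone_from="user/my-dataset",  # placeholder dataset repo id
        repo_type="dataset",
        use_auth_token=os.environ["HF_TOKEN"],
    )

repo = get_repo()
# Each session can now append to files under "data" and call repo.push_to_hub() to persist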

2 Likes

cool, thanks for sharing!

2 Likes

Now that saving to a dataset works, I tried an example with a persistent chatbot which saves the inputs and outputs.

My theory is that for any AI to be really intelligent, it has to remember inputs and responses for retrospective attention; it can then handle semi-supervised corrections by considering follow-up inputs.

It works great so far and I couldn't have done it without Gradio and the datasets API. Cool stuff. I think adding a reprocessing loop in a separate process could begin to search and aggregate what you ask it about, creating an intelligence-gathering AI pipeline that considers user personalization in its ongoing contextual training and education.

Example Chatbot with Memory:
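
For anyone copying the pattern, here is a minimal sketch (not the actual Space code) of logging each chat turn, assuming the same DATA_FILE and repo from the Repository setup above:

import csv
from datetime import datetime

def log_turn(user_input: str, response: str):
    # Append the (input, response) pair to the CSV in the local dataset clone
    with open(DATA_FILE, "a") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=["input", "response", "time"])
        writer.writerow(
            {"input": user_input, "response": response, "time": str(datetime.now())}
        )
    # Push the new turn back to the Hub so the chatbot's memory persists
    repo.push_to_hub()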

Please, is there a better way that doesn't involve downloading and uploading the entire repo on each run? My Space is a bit large and, if each user has to go through this download-upload process, the UX would be terrible.
I tried the upload_file method used by Chris with the code below:

_ = upload_file(
    path_or_fileobj="out.png",
    path_in_repo="remote/" + img_file_name,
    repo_id="AfrodreamsAI/afrodreams",
    repo_type="space",
    token=HF_TOKEN,
)

and I get this error:
RepositoryNotFoundError: 401 Client Error. (Request ID: lIqBKT5C99YKUrGHpJNPu) Repository Not Found for url: https://huggingface.co/api/spaces/AfrodreamsAI/afrodreams/preupload/main. Please make sure you specified the correct repo_id and repo_type. If the repo is private, make sure you are authenticated. Unauthorized. Note: Creating a commit assumes that the repo already exists on the Huggingface Hub. Please use create_repo if it's not the case.

Here is the link to my app: app.py · AfrodreamsAI/afrodreams at main

1 Like

Yours might be larger due to audio, but the open-in-append mode seems to work to persist back just the saved data.
def store_message(name: str, message: str):
    if name and message:
        with open(DATA_FILE, "a") as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=["name", "message", "time"])
            writer.writerow(
                {"name": name, "message": message, "time": str(datetime.now())}
            )
        commit_url = repo.push_to_hub()
    return ""
That approach might also skip reloading the whole thing, since you'd only need to append unless you display the whole aggregated dataset. Good luck!

Using a data store like Firebase might also work for you. Firebase is used in a lot of mobile apps and follows a fairly easy CRUD pattern. My example for that is here: 🗣️Speech 2 Sentiment 2 Save 2 Story 2 Image 2 Video🎥 - a Hugging Face Space by awacke1
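
A rough sketch of that kind of CRUD pattern with the firebase_admin SDK (the service-account path and collection name are placeholders; the linked Space may do it differently):

import firebase_admin
from firebase_admin import credentials, firestore

# Initialize the SDK with a service-account key, e.g. stored alongside the app as a secret
cred = credentials.Certificate("serviceAccount.json")
firebase_admin.initialize_app(cred)
db = firestore.client()

# Create: add a new document to a collection
db.collection("messages").add({"name": "Alice", "message": "hello"})
# Read: fetch all documents back
docs = [d.to_dict() for d in db.collection("messages").stream()]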

Thank you for helping out. My initial code was correct; I was just using the wrong token.

1 Like