I noticed a couple of Spaces where persistence appears to work for saving new records to a CSV file with HF Datasets.
I tried to reproduce it, but nothing actually gets saved.
Can you provide insight into how to use persistence when you modify data and then want to write the cached in-memory version back to a persistent shared Dataset?
Here was my last attempt:
Here are my last 3 failed attempts as Datasets:
These two Spaces appear to have it working, yet I cannot see how. Is it a key/secret or something?
1. Create or clone a repo using `Repository` (see app.py · julien-c/persistent-data at main). These methods use a token, `HF_TOKEN`, which is passed in as a secret from the Hub. Note that they also specify a local directory.
2. Save your data in that directory, e.g. the first Space appends the data to a CSV.
huggingface_hub also has an `upload_file` method, which might be more intuitive: it simply uploads one file at a time to a given dataset, see app.py · chrisjay/afro-speech at main.
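A minimal sketch of the `upload_file` route, assuming a hypothetical dataset repo `user/my-dataset` and an `HF_TOKEN` Space secret (the upload is simply skipped when no token is set, so the append still works locally):

```python
import csv
import os
from datetime import datetime

DATA_FILE = "data.csv"

def save_and_upload(name: str, message: str) -> None:
    # append locally first; write a header only when the file is new
    new_file = not os.path.exists(DATA_FILE)
    with open(DATA_FILE, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "message", "time"])
        if new_file:
            writer.writeheader()
        writer.writerow(
            {"name": name, "message": message, "time": str(datetime.now())}
        )
    token = os.environ.get("HF_TOKEN")  # Space secret; may be unset locally
    if token:
        # push the (small) CSV back to a dataset repo in one call
        from huggingface_hub import upload_file
        upload_file(
            path_or_fileobj=DATA_FILE,
            path_in_repo="data.csv",
            repo_id="user/my-dataset",  # hypothetical repo id
            repo_type="dataset",
            token=token,
        )

save_and_upload("alice", "hello")
```

Because the whole CSV is re-uploaded on each call, this fits small append-only logs; for larger data the cloned-`Repository` approach above avoids re-sending unchanged files.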
Yes. With the Streamlit caching primitives, especially `singleton` and `memo`, the combination of cached shared memory plus state persistence back to Datasets could be a killer app: fast, large data backed by HF Datasets.
I found the cache primitives make a huge difference when you need to store state across a browser window, or better yet across all users in a multiuser system. Thanks for these really cool features; they enable whole new classes of cloud-based AI pipeline apps.
Now that saving to a dataset works, I tried an example with a persistent chatbot which saves the inputs and outputs.
My theory is that for any AI to be really intelligent, it has to remember inputs and responses for retrospective attention; it can then handle semi-supervised corrections by considering follow-up inputs.
It works great so far, and I couldn't have done it without Gradio and the Datasets API. Cool stuff. I think adding a reprocessing loop in a separate process could begin to search and aggregate what you ask it about, creating an intelligence-gathering AI pipeline that factors user personalization into its ongoing contextual training and education.
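The "remember inputs and responses" idea above can be sketched framework-agnostically: each turn is appended to a CSV log, and the most recent turns are read back as context for the next one. The file name and function names here are illustrative, not from the actual Space:

```python
import csv
import os

LOG_FILE = "chat_log.csv"  # hypothetical log file

def remember(user_input: str, response: str) -> None:
    # append one input/response pair; write a header if the log is new
    new_file = not os.path.exists(LOG_FILE)
    with open(LOG_FILE, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["input", "response"])
        writer.writerow([user_input, response])

def recall(last_n: int = 5) -> list[tuple[str, str]]:
    # read back the most recent turns for retrospective attention
    if not os.path.exists(LOG_FILE):
        return []
    with open(LOG_FILE, newline="") as f:
        rows = list(csv.reader(f))[1:]  # skip the header row
    return [tuple(r) for r in rows[-last_n:]]

remember("hi", "hello!")
remember("how are you?", "fine, thanks")
print(recall())
```

In a Space, the same `remember`/`recall` pair would sit behind the Gradio callback, with the log pushed back to a dataset repo as described earlier in the thread.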
Please, is there a better way that doesn't involve downloading and uploading the entire repo on each run? My Space is a bit large, and if each user has to go through this download-upload process, the UX would be terrible.
I tried the upload method Chris used, with the code below:
and I get this error:

```
RepositoryNotFoundError: 401 Client Error. (Request ID: lIqBKT5C99YKUrGHpJNPu)
Repository Not Found for url: https://huggingface.co/api/spaces/AfrodreamsAI/afrodreams/preupload/main.
Please make sure you specified the correct `repo_id` and `repo_type`.
If the repo is private, make sure you are authenticated. Unauthorized
Note: Creating a commit assumes that the repo already exists on the Huggingface Hub. Please use `create_repo` if it's not the case.
```
Yours might be larger due to audio, but opening the file in append mode seems to work to persist back just the saved data.
```python
import csv
from datetime import datetime

def store_message(name: str, message: str):
    if name and message:
        # append the new row to the local CSV (assumes DATA_FILE and the
        # cloned `repo` are set up as in julien-c/persistent-data)
        with open(DATA_FILE, "a") as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=["name", "message", "time"])
            writer.writerow(
                {"name": name, "message": message, "time": str(datetime.now())}
            )
        # commit and push the appended file back to the Hub
        commit_url = repo.push_to_hub()
    return ""
```
That approach might also skip reloading the whole thing, since you'd only need to append unless you display the whole aggregated dataset. Good luck!
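To make the append-vs-reload trade-off concrete, here is a small sketch under assumed names (`messages.csv`, `append_row`, `load_all` are all hypothetical): writes are O(1) appends, and the full file is only re-read for the aggregated view (in a Space you might hand that file to `datasets.load_dataset("csv", ...)`; plain `csv` is used here):

```python
import csv
import os

DATA_FILE = "messages.csv"  # hypothetical file name

def append_row(name: str, message: str) -> None:
    # O(1) append: no need to reload existing rows first
    new_file = not os.path.exists(DATA_FILE)
    with open(DATA_FILE, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "message"])
        if new_file:
            writer.writeheader()
        writer.writerow({"name": name, "message": message})

def load_all() -> list[dict]:
    # full read, only needed when displaying the aggregated dataset
    with open(DATA_FILE, newline="") as f:
        return list(csv.DictReader(f))

append_row("ada", "first")
append_row("lin", "second")
print(len(load_all()))
```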