Twitter datasets - dehydrated?

Hi all - I have noticed that quite a few of the datasets shared on the hub are based on tweets, and many of those seem to contain the tweet texts already.
However, as far as I understand, Twitter's terms do not allow sharing the actual text of tweets. Is there an established way to do on-the-fly hydration of tweets as part of the data loader code?
How are people dealing with this in general?

One can prepare a hydrated dataset themselves, either by using the Twitter API or by third-party scraping libraries. Also check out the archive.org Twitter stream dumps; those already contain the texts.

Thanks - I know how to hydrate datasets using the Twitter API, but I was wondering how best to integrate this with a dataloader, especially for datasets hosted on the HF dataset hub.
I believe (I am not a lawyer) that hosting a dataset on the hub which contains the actual tweet texts instead of just the ids would violate the Twitter TOS, so basically ALL Twitter-based datasets on the hub would require hydrating. Making the hydration part of the loading process could simplify the whole workflow.
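To make the idea concrete, here is a minimal, hypothetical sketch of what on-the-fly hydration inside a loading script could look like. The `fetch_tweets` callable is a stand-in for a real Twitter API lookup (e.g. via tweepy); the batching of 100 ids mirrors the API's per-request lookup limit. None of this is an established convention - just one possible shape.

```python
# Hypothetical sketch: hydrate tweet ids on the fly while generating examples.
# A dehydrated dataset is assumed to store (tweet_id, label) pairs; the texts
# are fetched at load time via the injected `fetch_tweets` callable.

def hydrate_examples(ids_with_labels, fetch_tweets, batch_size=100):
    """Yield (id, {"text": ..., "label": ...}) for each tweet that hydrates.

    ids_with_labels: iterable of (tweet_id, label) pairs.
    fetch_tweets: callable mapping a list of ids to an {id: text} dict;
    deleted or protected tweets are simply absent from the result.
    """
    batch = []
    for tweet_id, label in ids_with_labels:
        batch.append((tweet_id, label))
        if len(batch) == batch_size:
            yield from _emit(batch, fetch_tweets)
            batch = []
    if batch:
        yield from _emit(batch, fetch_tweets)

def _emit(batch, fetch_tweets):
    texts = fetch_tweets([tid for tid, _ in batch])
    for tweet_id, label in batch:
        if tweet_id in texts:  # silently skip tweets that no longer exist
            yield tweet_id, {"text": texts[tweet_id], "label": label}
```

In a `datasets` loading script, a generator like this could be called from `_generate_examples`, with the dehydrated id/label file as input and the user's API credentials supplied via configuration.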

So I was wondering whether there is already experience with this - best practices, example code, etc.?

In particular, I can see the value in a (de facto) standard approach with some sort of local cache of downloaded tweets that is not tied to a single dataset: if a user wants to use several datasets that are different annotation layers over the same set of tweets, they would only need to download the text of each tweet once.
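A dataset-independent cache like that could be sketched roughly as below. This is an assumption-laden illustration, not an existing tool: the cache path, table layout, and `fetch_tweets` callable are all made up for the example. The point is that the store is keyed only by tweet id, so any number of datasets can share it.

```python
# Hypothetical sketch of a dataset-independent tweet cache: texts live in one
# SQLite file (e.g. under ~/.cache/), so several datasets annotating the same
# tweets trigger only a single download per tweet.
import sqlite3

class TweetCache:
    def __init__(self, path):
        # `path` might be something like ~/.cache/tweets/tweets.db
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS tweets (id TEXT PRIMARY KEY, text TEXT)")

    def get_texts(self, ids, fetch_tweets):
        """Return {id: text}, fetching only the ids not already cached."""
        placeholders = ",".join("?" * len(ids))
        cached = dict(self.db.execute(
            "SELECT id, text FROM tweets WHERE id IN (%s)" % placeholders, ids))
        missing = [i for i in ids if i not in cached]
        if missing:
            # In a real implementation this would hit the Twitter lookup API.
            fetched = fetch_tweets(missing)
            self.db.executemany(
                "INSERT OR REPLACE INTO tweets VALUES (?, ?)", fetched.items())
            self.db.commit()
            cached.update(fetched)
        return cached
```

Each dataset's loader would then call `get_texts` with its own id list; only the first dataset to reference a given tweet pays the download cost.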
