Twitter datasets - dehydrated?

Hi all - I have noticed that quite a few of the datasets shared on the hub are based on tweets, and many of those seem to contain the tweet texts already.
However, as far as I understand, Twitter's terms do not allow sharing the actual text of tweets. Is there an established way to do on-the-fly hydration of tweets as part of the data loader code?
How are people dealing with this in general?

One can prepare a hydrated dataset themselves, either by using the Twitter API or by third-party scraping libraries. Also check out the archive.org Twitter stream dumps; those already contain the texts.

Thanks - I know how to hydrate datasets using the Twitter API, but I was wondering how best to integrate this with a dataloader, especially for datasets hosted on the HF dataset hub.
I believe (I am not a lawyer) that hosting a dataset on the hub which contains the actual tweet texts instead of just the ids would violate the Twitter TOS, so basically ALL Twitter-based datasets on the hub would require hydrating. Making the hydration part of the loading process could simplify the whole workflow.
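To make the idea concrete, here is a minimal, hypothetical sketch of what on-the-fly hydration inside a loading script could look like. The `fetch_tweets` callable is a stand-in for a real Twitter API lookup (e.g. via tweepy); the batching of 100 ids mirrors the API's per-request lookup limit. None of this is an established convention - just one possible shape.

```python
# Hypothetical sketch: hydrate tweet ids on the fly while generating examples.
# A dehydrated dataset is assumed to store (tweet_id, label) pairs; the texts
# are fetched at load time via the injected `fetch_tweets` callable.

def hydrate_examples(ids_with_labels, fetch_tweets, batch_size=100):
    """Yield (id, {"text": ..., "label": ...}) for each tweet that hydrates.

    ids_with_labels: iterable of (tweet_id, label) pairs.
    fetch_tweets: callable mapping a list of ids to an {id: text} dict;
    deleted or protected tweets are simply absent from the result.
    """
    batch = []
    for tweet_id, label in ids_with_labels:
        batch.append((tweet_id, label))
        if len(batch) == batch_size:
            yield from _emit(batch, fetch_tweets)
            batch = []
    if batch:
        yield from _emit(batch, fetch_tweets)

def _emit(batch, fetch_tweets):
    texts = fetch_tweets([tid for tid, _ in batch])
    for tweet_id, label in batch:
        if tweet_id in texts:  # silently skip tweets that no longer exist
            yield tweet_id, {"text": texts[tweet_id], "label": label}
```

In a `datasets` loading script, a generator like this could be called from `_generate_examples`, with the dehydrated id/label file as input and the user's API credentials supplied via configuration.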

So I was wondering whether there is already experience with this - best practices, example code, etc.?

In particular, I can see the value in a (de facto) standard approach with some sort of local cache of downloaded tweets that is not tied to a single dataset: if a user wants to use several datasets that are different annotation layers over the same set of tweets, they would only need to download the text of each tweet once.
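A dataset-independent cache like that could be sketched roughly as below. This is an assumption-laden illustration, not an existing tool: the cache path, table layout, and `fetch_tweets` callable are all made up for the example. The point is that the store is keyed only by tweet id, so any number of datasets can share it.

```python
# Hypothetical sketch of a dataset-independent tweet cache: texts live in one
# SQLite file (e.g. under ~/.cache/), so several datasets annotating the same
# tweets trigger only a single download per tweet.
import sqlite3

class TweetCache:
    def __init__(self, path):
        # `path` might be something like ~/.cache/tweets/tweets.db
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS tweets (id TEXT PRIMARY KEY, text TEXT)")

    def get_texts(self, ids, fetch_tweets):
        """Return {id: text}, fetching only the ids not already cached."""
        placeholders = ",".join("?" * len(ids))
        cached = dict(self.db.execute(
            "SELECT id, text FROM tweets WHERE id IN (%s)" % placeholders, ids))
        missing = [i for i in ids if i not in cached]
        if missing:
            # In a real implementation this would hit the Twitter lookup API.
            fetched = fetch_tweets(missing)
            self.db.executemany(
                "INSERT OR REPLACE INTO tweets VALUES (?, ?)", fetched.items())
            self.db.commit()
            cached.update(fetched)
        return cached
```

Each dataset's loader would then call `get_texts` with its own id list; only the first dataset to reference a given tweet pays the download cost.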
