Transformed dataset: to_json/to_csv saves the cached dataset

First of all, this is my first post, so I'm not sure if this is the right category.

I am trying to translate the SQuAD v2 dataset into Dutch using Google Translate. The translation works, but when I try to save the resulting dataset to my computer so I can upload it to HF, it seems to save the SQuAD dataset that is still in the cache, so the English version ends up on disk. See my code below. When I print dataset[0], it does show the Dutch translation. cleanup_cache_files() does not solve the issue for me. I'm on Mac Silicon, if that makes any difference. Does anybody know why I'm running into this issue?

from datasets import load_dataset
from google.cloud import translate_v2 as translate

dataset = load_dataset("squad_v2", split="train")

# Keep only the first 5000 rows for now
dataset = dataset.filter(lambda example, idx: idx < 5000, with_indices=True)

print("Length of training set: ", len(dataset))

def translateIt(text, project_id="..."):
    """Translate text into Dutch with the Google Cloud Translation API."""
    translate_client = translate.Client()

    result = translate_client.translate(text, target_language="nl")

    return result["translatedText"]

def transforms(row):
    # set_transform passes a batch (a dict of lists), so each field is a list here
    row["title"] = [translateIt(title) for title in row["title"]]
    row["context"] = [translateIt(context) for context in row["context"]]
    row["question"] = [translateIt(question) for question in row["question"]]
    row["answers"][0]["text"] = [translateIt(answer) for answer in row["answers"][0]["text"]]
    return row

dataset.set_transform(transforms)

print(dataset[0])

dataset.to_csv('dataset.csv')

Welcome! My understanding of the set_transform method is that it’s used to set a transformation that gets applied when __getitem__ is called, i.e. when you do things like dataset[0]. The underlying dataset isn’t actually transformed. Instead, you can use something like .map (Process) to map the dataset. Then you can save that new dataset, or even call .push_to_hub (Share a dataset to the Hub) to upload the mapped dataset directly to the Hub.
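For example, here is a minimal sketch of that approach. It reuses the translateIt helper from your snippet, and "username/squad_v2_nl" is just a placeholder repo id, so adjust it to your own setup:

# .map materializes a new dataset, unlike set_transform
def translate_row(row):
    # Non-batched map: each field is a single value, and "answers" is a dict of lists
    row["title"] = translateIt(row["title"])
    row["context"] = translateIt(row["context"])
    row["question"] = translateIt(row["question"])
    row["answers"]["text"] = [translateIt(answer) for answer in row["answers"]["text"]]
    return row

translated = dataset.map(translate_row)

translated.to_csv("dataset_nl.csv")              # save locally, or:
translated.push_to_hub("username/squad_v2_nl")   # upload straight to the Hub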

Hope this helps!

That explains a lot. How can I transform the underlying dataset? I did discover .map, but it is very slow for me (2.88 s/row), and I have a dataset of 130k rows. Is there any other way to achieve the same thing without using .map?

The .map method is probably the ideal one for you, but that is indeed quite slow! I don't know much about the Google Translate API, so these options may or may not work:

  • You create a new translate.Client() every time you call translateIt. Does that add extra overhead? Would it improve performance if you instantiated the client once at the global level?
  • Can the translation client accept batches? You could maybe translate the title, context, questions and answers all at once.
  • Following up from the point above, you might even be able to leverage batch mapping (Batch mapping) to map several rows at once; see the sketch after this list.
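Here is a rough sketch of those last two ideas combined: one shared client plus batched mapping. The translate_many / translate_batch names and the batch_size of 32 are just illustrative, and I'm assuming the v2 client accepts a list of strings per call (check the API's size and quota limits yourself):

from google.cloud import translate_v2 as translate

translate_client = translate.Client()   # instantiate once, not inside every call

def translate_many(texts):
    # One API call for a whole list of strings
    results = translate_client.translate(texts, target_language="nl")
    return [r["translatedText"] for r in results]

def translate_batch(batch):
    # With batched=True, every field is a list covering batch_size rows
    batch["title"] = translate_many(batch["title"])
    batch["context"] = translate_many(batch["context"])
    batch["question"] = translate_many(batch["question"])
    batch["answers"] = [
        {**ans, "text": translate_many(ans["text"]) if ans["text"] else []}
        for ans in batch["answers"]
    ]
    return batch

translated = dataset.map(translate_batch, batched=True, batch_size=32)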

If these things aren’t possible for your use case, or if it’s still extremely slow after this, let me know! Feel free to also ping me on the Discord server.

EDIT: I also wanted to add that it's difficult to know why the performance is so slow without debugging it a bit. My intuition is that the bottleneck isn't .map itself, but the individual translation calls. But if you profile it and find that .map is the bottleneck, then maybe we can find a different option!
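A quick-and-dirty way to check where the time goes, for example (translate_row here refers to the helper from the earlier sketch, and the sample string is arbitrary):

import time

sample = "The quick brown fox jumps over the lazy dog."

start = time.perf_counter()
translateIt(sample)
print("single API call:", time.perf_counter() - start, "s")

start = time.perf_counter()
# load_from_cache_file=False so a previously cached result doesn't skew the timing
dataset.select(range(5)).map(translate_row, load_from_cache_file=False)
print("5 rows through .map:", time.perf_counter() - start, "s")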