Transformed dataset: to_json/to_csv saves the cached dataset

First of all, this is my first post, so I'm not sure if this is the right category.

I am trying to translate the SQuAD v2 dataset into Dutch using Google Translate. The translation works, but when I try to save the resulting dataset to my computer so I can upload it to HF, it seems to save the SQuAD dataset that is still in the cache, so the English version ends up on disk. See my code below. When I print dataset[0], it does show the Dutch translation. cleanup_cache_files() does not solve the issue for me. I'm on Mac Silicon, if that makes any difference. Does anybody know why I'm running into this issue?

from datasets import load_dataset
from google.cloud import translate_v2 as translate

dataset = load_dataset("squad_v2", split="train")

# Keep only the first 5000 rows for now
dataset = dataset.filter(lambda example, idx: idx < 5000, with_indices=True)

print("Length of training set: ", len(dataset))

def translateIt(text, project_id="..."):
    """Translate text into Dutch with the Google Cloud Translation API."""
    translate_client = translate.Client()

    result = translate_client.translate(text, target_language="nl")

    return result["translatedText"]

def transforms(row):
    # set_transform passes a batch (a dict of lists), so each field is a list here
    row["title"] = [translateIt(title) for title in row["title"]]
    row["context"] = [translateIt(context) for context in row["context"]]
    row["question"] = [translateIt(question) for question in row["question"]]
    row["answers"][0]["text"] = [translateIt(answer) for answer in row["answers"][0]["text"]]
    return row

dataset.set_transform(transforms)

print(dataset[0])

dataset.to_csv('dataset.csv')

Welcome! My understanding of the set_transform method is that it’s used to set a transformation that gets applied when __getitem__ is called, i.e. when you do things like dataset[0]. The underlying dataset isn’t actually transformed. Instead, you can use something like .map (Process) to map the dataset. Then you can save that new dataset, or even call .push_to_hub (Share a dataset to the Hub) to upload the mapped dataset directly to the Hub.
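For example, here is a minimal sketch of that approach. It reuses the translateIt helper from your snippet, and "username/squad_v2_nl" is just a placeholder repo id, so adjust it to your own setup:

# .map materializes a new dataset, unlike set_transform
def translate_row(row):
    # Non-batched map: each field is a single value, and "answers" is a dict of lists
    row["title"] = translateIt(row["title"])
    row["context"] = translateIt(row["context"])
    row["question"] = translateIt(row["question"])
    row["answers"]["text"] = [translateIt(answer) for answer in row["answers"]["text"]]
    return row

translated = dataset.map(translate_row)

translated.to_csv("dataset_nl.csv")              # save locally, or:
translated.push_to_hub("username/squad_v2_nl")   # upload straight to the Hub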

Hope this helps!

That explains a lot. How can I transform the underlying dataset? I did discover .map, but it is very slow for me (2.88 s/row), and I have a dataset of 130k rows. Is there any other way to achieve the same thing without using .map?

The .map method is probably the ideal one for you, but that is indeed quite slow! I don't know much about the Google Translate API, so these options may or may not work:

  • You create a new translate.Client() every time you call translateIt. Does that add extra overhead? Would it improve performance if you instantiated the client once at the global level?
  • Can the translation client accept batches? You could maybe translate the title, context, questions and answers all at once.
  • Following up from the point above, you might even be able to leverage batch mapping (Batch mapping) to map several rows at once; see the sketch after this list.
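Here is a rough sketch of those last two ideas combined: one shared client plus batched mapping. The translate_many / translate_batch names and the batch_size of 32 are just illustrative, and I'm assuming the v2 client accepts a list of strings per call (check the API's size and quota limits yourself):

from google.cloud import translate_v2 as translate

translate_client = translate.Client()   # instantiate once, not inside every call

def translate_many(texts):
    # One API call for a whole list of strings
    results = translate_client.translate(texts, target_language="nl")
    return [r["translatedText"] for r in results]

def translate_batch(batch):
    # With batched=True, every field is a list covering batch_size rows
    batch["title"] = translate_many(batch["title"])
    batch["context"] = translate_many(batch["context"])
    batch["question"] = translate_many(batch["question"])
    batch["answers"] = [
        {**ans, "text": translate_many(ans["text"]) if ans["text"] else []}
        for ans in batch["answers"]
    ]
    return batch

translated = dataset.map(translate_batch, batched=True, batch_size=32)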

If these things aren’t possible for your use case, or if it’s still extremely slow after this, let me know! Feel free to also ping me on the Discord server.

EDIT: I also wanted to add that it's difficult to know why the performance is so slow without debugging it a bit. My intuition is that the bottleneck isn't .map itself, but the individual translation calls. But if you profile it and find that .map is the bottleneck, then maybe we can find a different option!
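A quick-and-dirty way to check where the time goes, for example (translate_row here refers to the helper from the earlier sketch, and the sample string is arbitrary):

import time

sample = "The quick brown fox jumps over the lazy dog."

start = time.perf_counter()
translateIt(sample)
print("single API call:", time.perf_counter() - start, "s")

start = time.perf_counter()
# load_from_cache_file=False so a previously cached result doesn't skew the timing
dataset.select(range(5)).map(translate_row, load_from_cache_file=False)
print("5 rows through .map:", time.perf_counter() - start, "s")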