Running BigBird on TPUs

This is a continuation of my previous issue filed on this forum (here).

I wanted to try running the run_mlm_flax script on a Colab TPU v2, but I am facing an error:

Module Name: <module 'run_mlm_flax' from '/content/run_mlm_flax.py'>

WARNING:root:TPU has started up successfully with version pytorch-1.9
Traceback (most recent call last):
  File "xla_spawn.py", line 87, in <module>
    main()
  File "xla_spawn.py", line 83, in main
    xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)
AttributeError: module 'run_mlm_flax' has no attribute '_mp_fn'

To my understanding, the script that spawns processes on each TPU core requires a Hugging Face Trainer object to work with. However, we apparently do not have a Trainer script for MLM/CLM for BigBird. Compounding the difficulty is the whole Flax/no-Trainer/TPU combination, which I do not think has been fully tested with BigBird.
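
For context, xla_spawn.py only works with the PyTorch example scripts, which each expose a small _mp_fn wrapper around main() - that is the attribute the traceback above says is missing from run_mlm_flax. A rough sketch (from memory, not the Flax script) of what those PyTorch scripts contain:

# In the PyTorch example scripts (e.g. run_mlm.py), roughly:
def _mp_fn(index):
    # Entry point for torch_xla.distributed.xla_multiprocessing.spawn,
    # called once per TPU core by xla_spawn.py; it just forwards to main().
    main()

The Flax script defines no such function, hence the AttributeError.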

Is there any possible approach to run this on TPUs?

Hey, I think you are mixing Flax & PyTorch scripts. How else would you get a PyTorch error from a Flax script?

FlaxBigBird works perfectly on Cloud TPU. You can also refer to this script.

Also, currently the :hugs: Trainer is only for PyTorch & TensorFlow.
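
The Flax examples don't need xla_spawn.py at all: JAX discovers the TPU cores by itself and the script parallelizes over them with jax.pmap internally. A quick sanity check you can run first (a sketch; the exact TPU setup step depends on your JAX version and Colab runtime):

import jax

# Depending on the JAX version, Colab may need an explicit TPU setup call,
# e.g. jax.tools.colab_tpu.setup_tpu(); newer runtimes detect the TPU automatically.
print(jax.devices())             # should list the TPU cores, e.g. 8 on a TPU v2
print(jax.local_device_count())  # the Flax script shards batches across these cores

# Then launch the example directly, with no spawner in front of it:
#   python run_mlm_flax.py ...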

Sigh, it’s the third time that I have had to change my scripts. Why is it that with HF, half the time you have to change your entire code? The Trainer API was a game changer, until it turns out it doesn’t work with some models :disappointed:
Why do we actually need to use the Flax version of BigBird to run it on TPUs?

I am not asking you to give me a 2-line code solution, but the least Hugging Face could strive to do is to not be so fragmented as it is now. At this point, Transformers has become so bloated with different tasks, changing APIs, and outdated notebooks that it’s a nightmare (I honestly can’t even comprehend how open-source devs contribute here).

Honestly, it would be easier if it were married to a single framework like PyTorch - allowing simple DataLoaders to interface directly with Hugging Face models and be trained easily in a PyTorch-style manner with XLA spawn. I think Hugging Face tries to take control of too many parts of the experience - Datasets for making datasets, for example - and it doesn’t always mesh well.

Or perhaps I am just too dumb, and it is probably unfair to expect PyTorch-like simplicity from such a huge library.

As I have said in the other topic, you don’t. Why are you using an example located in examples/flax/ if you don’t want to use FLAX?

There is really no need for the rant after, it’s completely inappropriate when people are replying to you for free on their own time.


It was Patrick von Platen’s suggestion, and his PR merge, to use the Flax version with BigBird. I asked the same thing, but I didn’t get a reply from him as to what the reason was.

Honestly, if you really think that asking basic questions about the direction of HF is a rant, then that clearly shows the attitude of the company as a whole.