I’m running the language modeling script provided here. I’m training a RoBERTa-base model on an RTX 3090 with 24 GB of memory. Training runs fine until about 9k steps, then an OOM error is thrown. GPU memory usage starts at around 12 GB, stays there for a few steps, and then keeps growing until the OOM error. It looks as if previous batches aren’t being freed from memory, but I’m not sure yet.
I implemented my own dataset class and passed it to the Trainer. Although I load all the raw data into RAM, I only tokenize it in the __getitem__ method, so I don’t think this is the actual issue.
Does anyone have some thoughts on this?
My dataset class:
from pathlib import Path
from typing import Optional, Union

import pandas as pd
import torch
from torch.utils.data import Dataset
from transformers import AutoTokenizer


class LMDataset(Dataset):
    def __init__(
        self,
        base_path: str,
        tokenizer: AutoTokenizer,
        set: str = "train",
    ):
        self.tokenizer = tokenizer
        # Load the whole split into RAM as raw text; tokenization happens lazily.
        src_file = Path(base_path).joinpath("processed", "{}.csv".format(set))
        df = pd.read_csv(src_file, header=0, names=["text"])
        self.samples = df["text"].to_list()

    def __len__(self):
        return len(self.samples)

    def _tokenize(
        self,
        text: str,
        padding: Optional[Union[str, bool]] = False,
        max_seq_length: Optional[int] = None,
    ):
        return self.tokenizer(
            text,
            padding=padding,
            truncation=True,
            max_length=max_seq_length or self.tokenizer.model_max_length,
            return_special_tokens_mask=True,
        )

    def __getitem__(
        self,
        i,
        padding: Optional[Union[str, bool]] = False,
        max_seq_length: Optional[int] = None,
    ):
        # Tokenize on access and return only the input IDs as a tensor
        # (unpadded, variable length).
        input_ids = self._tokenize(self.samples[i], padding, max_seq_length)[
            "input_ids"
        ]
        return torch.tensor(input_ids, dtype=torch.long)
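
For context, here is roughly how the dataset is wired into the Trainer. This is a simplified sketch, not my exact script: it assumes the standard DataCollatorForLanguageModeling to pad each batch and build the MLM labels, and the model name, paths, and hyperparameters are placeholders.

from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    RobertaForMaskedLM,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

train_dataset = LMDataset("data", tokenizer, set="train")

# Pads each batch dynamically and creates the masked LM labels.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="out",            # placeholder
    per_device_train_batch_size=8,
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)
trainer.train()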