Hello HuggingFace community,
I’m working on a custom seq2seq transformer model for translating between two sets of token IDs. My input and translation token IDs range from 0 to 8191.
```python
input_ids = [2034, 4043, ...., 3]                 # length is 2048
translation_input_ids = [3042, 9123, ...., 3285]  # length is 2048
```
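For context, I wrap these pre-generated ID sequences in a plain PyTorch `Dataset` myself; below is a simplified sketch of what each item looks like (the class and variable names are placeholders, not my exact code):

```python
import torch
from torch.utils.data import Dataset

class IdPairDataset(Dataset):
    """Wraps pre-generated (input_ids, translation_ids) pairs; no tokenizer involved."""

    def __init__(self, inputs, targets):
        self.inputs = inputs      # list of ID lists, each of length 2048, values in 0-8191
        self.targets = targets    # list of ID lists, each of length 2048, values in 0-8191

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        input_ids = torch.tensor(self.inputs[idx], dtype=torch.long)
        labels = torch.tensor(self.targets[idx], dtype=torch.long)
        return {
            "input_ids": input_ids,
            "attention_mask": torch.ones_like(input_ids),  # fixed length, so no padding
            "labels": labels,
        }
```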
I’m using `EncoderDecoderModel` with a `BertConfig` for both the encoder and decoder. My sequences are all of length 2048. Here are my questions:
- Vocab Size: Given that my tokens range from 0 to 8191, do I need to set `vocab_size` to 8194 (i.e., 8192 + 2) to account for additional tokens like `decoder_start_token_id` and `pad_token_id`? (My arithmetic is sketched right after this list.)
- Handling `decoder_start_token_id`: How does the model handle the `decoder_start_token_id`? Is it correct to assign it a value outside my token range (e.g., 8192)?
- Padding Token: If I don’t need padding (since all my sequences are the same length), do I still need to provide a `pad_token_id` in the configuration?
- Config Choice: Is using `BertConfig` the right approach for this task, or would something like `BartConfig` or `T5Config` be more appropriate?
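To make the first question concrete, this is the arithmetic I am assuming (just an illustration of my reasoning, happy to be corrected):

```python
# Assumption: embedding rows are indexed 0 .. vocab_size - 1, so the two special
# IDs have to fit inside vocab_size alongside the 8192 data tokens.
num_data_tokens = 8192               # data token IDs 0 .. 8191
decoder_start_token_id = 8192        # first free ID above the data range
pad_token_id = 8193                  # second free ID above the data range

vocab_size = num_data_tokens + 2     # = 8194
assert decoder_start_token_id < vocab_size and pad_token_id < vocab_size
```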
```python
import torch
from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

batch_size = 2
vocab_size = 8192 + 2            # 8192 data tokens + decoder_start_token_id + pad_token_id
embedding_size = 1024
context_window_length = 2048
device_num = 0
device = torch.device(f'cuda:{device_num}' if torch.cuda.is_available() else 'cpu')

# Initialize BertConfig for encoder and decoder with vocab_size=8194
config_encoder = BertConfig(
    vocab_size=vocab_size,
    hidden_size=embedding_size,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=embedding_size * 4,
    max_position_embeddings=context_window_length,
)
config_decoder = BertConfig(
    vocab_size=vocab_size,
    hidden_size=embedding_size,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=embedding_size * 4,
    max_position_embeddings=context_window_length,
    is_decoder=True,
    add_cross_attention=True,
)

config = EncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)
model = EncoderDecoderModel(config=config)
model.config.decoder_start_token_id = 8192  # arbitrary ID outside the 0-8191 data range
model.config.pad_token_id = 8193            # arbitrary ID outside the 0-8191 data range
model.to(device)
```
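On the `decoder_start_token_id` question: my (possibly wrong) understanding is that when I pass `labels`, the model builds `decoder_input_ids` internally by shifting the labels one position to the right and putting `decoder_start_token_id` in front. An illustrative reimplementation of what I think happens:

```python
import torch

def shift_right(labels, decoder_start_token_id, pad_token_id):
    """Illustrative version of the label shifting I assume happens inside the model."""
    decoder_input_ids = labels.new_zeros(labels.shape)
    decoder_input_ids[:, 1:] = labels[:, :-1].clone()
    decoder_input_ids[:, 0] = decoder_start_token_id
    # ignored label positions (-100) must not reach the embedding layer
    decoder_input_ids.masked_fill_(decoder_input_ids == -100, pad_token_id)
    return decoder_input_ids

labels = torch.tensor([[3042, 3285, 17, 5]])
print(shift_right(labels, decoder_start_token_id=8192, pad_token_id=8193))
# tensor([[8192, 3042, 3285,   17]])
```

Is that the right mental model?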
```python
from transformers import Trainer

class CustomTrainer(Trainer):
    """Trainer that uses my own pre-built DataLoaders instead of datasets."""

    def __init__(self, model, args, train_loader, val_loader, **kwargs):
        super().__init__(model=model, args=args, **kwargs)
        self.train_loader = train_loader
        self.val_loader = val_loader

    def get_train_dataloader(self):
        return self.train_loader

    def get_eval_dataloader(self, eval_dataset=None):
        return self.val_loader
```
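For completeness, `train_loader` and `val_loader` are ordinary PyTorch DataLoaders that yield dict batches matching the model's forward arguments; a simplified sketch building on the dataset sketch above (`train_inputs` etc. are placeholders for my generated data):

```python
from torch.utils.data import DataLoader

train_loader = DataLoader(
    IdPairDataset(train_inputs, train_targets),
    batch_size=batch_size,
    shuffle=True,
)
val_loader = DataLoader(
    IdPairDataset(val_inputs, val_targets),
    batch_size=batch_size,
    shuffle=False,
)

# Each batch is a dict of tensors:
#   input_ids      -> (batch_size, 2048)
#   attention_mask -> (batch_size, 2048)
#   labels         -> (batch_size, 2048)
batch = next(iter(train_loader))
print({k: v.shape for k, v in batch.items()})
```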
```python
from transformers import TrainingArguments

# Trainer arguments
training_args = TrainingArguments(
    output_dir="./results_bert2bert",
    fp16=True,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=7,
    logging_dir="./logs",
    save_strategy="epoch",
    logging_steps=50,
    learning_rate=5e-5,
    weight_decay=0.01,
    dataloader_num_workers=32,
    report_to="none",  # disable external loggers (wandb etc.)
    run_name="run-mishuk-bert2bert-02",
)

# Initialize the custom Trainer with my own train and validation DataLoaders
trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_loader=train_loader,
    val_loader=val_loader,
)

trainer.train()
```
Now when I run training, the loss gets stuck at around 6.5 after roughly 1,000 iterations (one epoch is 19,500 iterations) and stays there.
Is this the correct way to run seq2seq training without any tokenizer involved, or am I doing something wrong here?
I don't need a tokenizer here because these sequences come directly from a generation process; they are not text.
I would appreciate any insights or suggestions!
Thanks in advance!