Questions about vocab size, decoder start token, padding token, and appropriate config for a custom seq2seq transformer model without a tokenizer

Hello HuggingFace community,

I’m working on a custom seq2seq transformer model for translating between two sets of token IDs. My input and translation token IDs range from 0 to 8191.

input_ids = [2034, 4043, ...., 3]  # length is 2048
translation_input_ids = [3042, 9123, ...., 3285]  # length is 2048

I’m using EncoderDecoderModel with BertConfig for both the encoder and decoder. My sequences are all of length 2048. Here are my questions:

  1. Vocab Size: Given that my tokens range from 0 to 8191, do I need to set vocab_size to 8194 (i.e., 8192 + 2) to account for additional tokens like decoder_start_token_id and pad_token_id?
  2. Handling decoder_start_token_id: How does the model handle the decoder_start_token_id? Is it correct to assign it a value outside my token range (e.g., 8192)? (A rough sketch of the label shift I have in mind follows this list.)
  3. Padding Token: If I don’t need padding (since all my sequences are the same length), do I still need to provide a pad_token_id in the configuration?
  4. Config Choice: Is using BertConfig the right approach for this task, or would something like BartConfig or T5Config be more appropriate?
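As far as I can tell from the docs, when labels are passed, EncoderDecoderModel builds decoder_input_ids itself by shifting the labels one position to the right, prepending decoder_start_token_id, and replacing any -100 positions with pad_token_id. If that is right, both special IDs must be valid rows of the embedding table even when no padding is used, which is why I set vocab_size = 8192 + 2. A rough sketch of that shift, not the library's exact code (please correct me if I have this wrong):

import torch

def shift_tokens_right(labels, pad_token_id, decoder_start_token_id):
    # Shift the labels one step to the right and prepend the start token,
    # so the decoder predicts position t from positions < t (teacher forcing).
    shifted = labels.new_zeros(labels.shape)
    shifted[:, 1:] = labels[:, :-1].clone()
    shifted[:, 0] = decoder_start_token_id
    # -100 marks label positions ignored by the loss; as decoder inputs they
    # still need to be valid token IDs, so they are replaced by pad_token_id.
    shifted.masked_fill_(shifted == -100, pad_token_id)
    return shifted

labels = torch.tensor([[3042, 7, 3285]])
print(shift_tokens_right(labels, pad_token_id=8193, decoder_start_token_id=8192))
# tensor([[8192, 3042,    7]])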
Here is my current setup:

import torch
from transformers import (
    BertConfig,
    EncoderDecoderConfig,
    EncoderDecoderModel,
    Trainer,
    TrainingArguments,
)

batch_size = 2
vocab_size = 8192 + 2
embedding_size = 1024
context_window_length = 2048

device_num = 0
device = torch.device(f'cuda:{device_num}' if torch.cuda.is_available() else 'cpu')

# Initialize BertConfig for the encoder and decoder
# (vocab_size = 8192 data tokens + 2 special IDs)
config_encoder = BertConfig(
    vocab_size=vocab_size, 
    hidden_size=embedding_size, 
    num_hidden_layers=6, 
    num_attention_heads=8, 
    intermediate_size=embedding_size*4, 
    max_position_embeddings=context_window_length
)
config_decoder = BertConfig(
    vocab_size=vocab_size, 
    hidden_size=embedding_size, 
    num_hidden_layers=6, 
    num_attention_heads=8, 
    intermediate_size=embedding_size*4, 
    max_position_embeddings=context_window_length, 
    is_decoder=True, 
    add_cross_attention=True
)

config = EncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)
model = EncoderDecoderModel(config=config)
model.config.decoder_start_token_id = 8192  # reserved ID outside the 0-8191 data range
model.config.pad_token_id = 8193  # second reserved ID (these two account for the "+ 2" in vocab_size)
model.to(device)
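
# Quick sanity check (a minimal sketch, not part of the actual training run):
# one forward pass on random IDs in [0, 8191] to confirm that length-2048
# sequences are accepted and that passing labels returns a loss without any
# tokenizer involved. The dummy_* names are placeholders.
with torch.no_grad():
    dummy_input = torch.randint(0, 8192, (batch_size, context_window_length), device=device)
    dummy_labels = torch.randint(0, 8192, (batch_size, context_window_length), device=device)
    out = model(input_ids=dummy_input, labels=dummy_labels)
    print(out.loss.item(), out.logits.shape)  # scalar loss, (batch, 2048, vocab_size)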


class CustomTrainer(Trainer):
    def __init__(self, model, args, train_loader, val_loader, **kwargs):
        super().__init__(model=model, args=args, **kwargs)
        self.train_loader = train_loader
        self.val_loader = val_loader

    def get_train_dataloader(self):
        return self.train_loader

    def get_eval_dataloader(self, eval_dataset=None):
        return self.val_loader

# Trainer Arguments
training_args = TrainingArguments(
    output_dir="./results_bert2bert",
    fp16=True,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size, 
    num_train_epochs=7,
    logging_dir="./logs",
    save_strategy="epoch",
    logging_steps=50,
    learning_rate=5e-5,
    weight_decay=0.01,
    dataloader_num_workers=32,
    report_to="none",  # disable reporting to external loggers such as wandb
    run_name="run-mishuk-bert2bert-02",
)

# Initialize Custom Trainer with custom train and validation DataLoader
trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_loader=train_loader,
    val_loader=val_loader,
)
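
In case it matters, train_loader and val_loader (not shown above) just wrap the pre-generated ID tensors; each batch is a dict whose keys match the model's forward signature, since Trainer calls model(**batch). A minimal sketch of the kind of dataset behind them (the class name and placeholder tensors below are illustrative, not my exact code):

import torch
from torch.utils.data import Dataset, DataLoader

class IdPairDataset(Dataset):
    # Pairs of pre-generated ID sequences; no tokenizer anywhere.
    def __init__(self, inputs, targets):
        self.inputs = inputs    # LongTensor, shape (num_examples, 2048)
        self.targets = targets  # LongTensor, shape (num_examples, 2048)

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        # Keys must match EncoderDecoderModel.forward arguments.
        return {"input_ids": self.inputs[idx], "labels": self.targets[idx]}

# Placeholder tensors standing in for the real generated data.
train_inputs = torch.randint(0, 8192, (1000, context_window_length))
train_targets = torch.randint(0, 8192, (1000, context_window_length))
val_inputs = torch.randint(0, 8192, (100, context_window_length))
val_targets = torch.randint(0, 8192, (100, context_window_length))

train_loader = DataLoader(IdPairDataset(train_inputs, train_targets),
                          batch_size=batch_size, shuffle=True)
val_loader = DataLoader(IdPairDataset(val_inputs, val_targets),
                        batch_size=batch_size)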

Now when I run training, the loss gets stuck around 6.5 after roughly 1,000 iterations (one epoch is 19,500 iterations) and stays there.

Is this the correct way to run seq2seq training without a tokenizer involved, or am I doing something wrong?

I do not need a tokenizer here because these sequences come directly from a generation process and are not text.

I would appreciate any insights or suggestions!

Thanks in advance!
