How to process dataset for BI-Encoder type models

qazisaad · October 1, 2023, 1:55pm

I am trying to train a bi-encoder model to rank content by its relevance to a query. A bi-encoder passes the query and then the content through the same model, one after the other. Unlike a traditional modeling pipeline, it has to process two streams of text, so the dataset has twice as many columns (input_ids, attention_mask, token_type_ids for each stream). The Hugging Face Trainer and datasets library appear to be written around a single input_ids column: features such as dynamic padding and group-by-length only look at the input_ids column and do not account for the case where there is more than one input_ids column. Is there an existing way to support this? If not, I am happy to contribute this feature.

These are the columns in my processed dataset: labels, click_input_ids, click_token_type_ids, click_attention_mask, cand_input_ids, cand_token_type_ids, cand_attention_mask.
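
To make the question concrete, here is a minimal sketch of the kind of collator I have in mind: it pads each stream to its own longest sequence in the batch. The function names (`pad_to`, `collate_bi_encoder`), the pad values, and the use of plain Python lists are my own assumptions for illustration, not an existing library API; in practice the lists would be converted to tensors before being returned.

```python
from typing import Dict, List, Sequence

# Column prefixes taken from the dataset columns listed above.
PREFIXES = ("click", "cand")
# Assumed default pad values for the non-id fields.
PAD_VALUES = {"input_ids": 0, "token_type_ids": 0, "attention_mask": 0}


def pad_to(seq: Sequence[int], length: int, value: int) -> List[int]:
    """Right-pad `seq` with `value` until it has `length` items."""
    return list(seq) + [value] * (length - len(seq))


def collate_bi_encoder(features: List[Dict], pad_token_id: int = 0) -> Dict[str, list]:
    """Pad the query ("click") and candidate ("cand") streams independently.

    Each stream is padded to the longest sequence of that stream in the
    batch, so a long candidate does not force padding on short queries.
    """
    batch: Dict[str, list] = {"labels": [f["labels"] for f in features]}
    for prefix in PREFIXES:
        max_len = max(len(f[f"{prefix}_input_ids"]) for f in features)
        for field, default in PAD_VALUES.items():
            # input_ids use the tokenizer's pad id; the other fields use 0.
            value = pad_token_id if field == "input_ids" else default
            key = f"{prefix}_{field}"
            batch[key] = [pad_to(f[key], max_len, value) for f in features]
    return batch
```

Something like this could presumably be passed to the Trainer through its `data_collator` argument, with `remove_unused_columns=False` in `TrainingArguments` so the prefixed columns are not dropped before they reach the collator. For group-by-length, one option might be to precompute a combined length per example and point `length_column_name` at it, but I have not verified how well that works with two streams.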
