I was referred here by @lhoestq from this GitHub issue.
Background
I have a large dataset, ds_all_utts, of user utterances. I load it using load_from_disk because I saved it with save_to_disk:
```python
ds_all_utts = load_from_disk(ds_all_utts_fname)
```
ds_all_utts has 2,732,013 rows and these features:
```python
{'ANY': Value(dtype='int64', id=None),
 'COMPLAINTCLARIFICATION': Value(dtype='int64', id=None),
 'COMPLAINTMISHEARD': Value(dtype='int64', id=None),
 'COMPLAINTPRIVACY': Value(dtype='int64', id=None),
 'COMPLAINTREPETITION': Value(dtype='int64', id=None),
 'CRITICISM': Value(dtype='int64', id=None),
 'NEGATIVENAVIGATION': Value(dtype='int64', id=None),
 'OFFENSIVE': Value(dtype='int64', id=None),
 'STOP': Value(dtype='int64', id=None),
 'embedding': Sequence(feature=Value(dtype='float32', id=None), length=-1, id=None),
 'frequency': Value(dtype='int64', id=None),
 'user_utterance': Value(dtype='string', id=None)}
```
user_utterance is a short piece of text (usually just a few words), embedding is a 1280-dimensional float32 vector representing that utterance, frequency is an int, and the rest are binary labels (0 or 1) for the utterance. The dataset is sorted by descending frequency.
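To make that concrete, a quick sanity check might look like this (a hypothetical snippet, just illustrating the schema above):

```python
# Hypothetical sanity check of the schema described above.
row = ds_all_utts[0]                     # first row = most frequent utterance
print(row['user_utterance'], row['frequency'])
assert len(row['embedding']) == 1280     # fixed-size embedding vector
assert row['OFFENSIVE'] in (0, 1)        # binary label
```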
I have another Dataset called neuralgen_ds whose rows represent turns of dialogue along with their context. It has 385,580 rows and these features:
```python
{'session_id': Value(dtype='string', id=None),
 'treelet': Value(dtype='string', id=None),
 'context': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'bot_utt': Value(dtype='string', id=None),
 'bot_utt_labels': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'user_utt': Value(dtype='string', id=None),
 'user_utt_labels': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'GPT2ED': Value(dtype='bool', id=None),
 '__index_level_0__': Value(dtype='int64', id=None)}
```
Of these, the important one is user_utt, which is the same type of data as ds_all_utts['user_utterance']. Some user utterances appear multiple times in neuralgen_ds; there are 190,602 unique utterances in neuralgen_ds['user_utt'].
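(That unique count comes from a single pass over the column, i.e. something like:)

```python
# One way to get the unique-utterance count quoted above.
len(set(neuralgen_ds['user_utt']))  # -> 190602
```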
What I want to do
For each row of neuralgen_ds, I want to look up the user utterance in ds_all_utts, and copy over certain columns into neuralgen_ds. In particular, I want to copy over embedding and all the capitalized binary labels (ANY, COMPLAINTCLARIFICATION, etc).
My code
First I create a dictionary mapping each user utterance to its position in ds_all_utts:
```python
ds_all_utts_lookup = {utt: idx for idx, utt in enumerate(ds_all_utts['user_utterance'])}
```
Then I use .map to add the columns to neuralgen_ds:
```python
cols = ['embedding', 'ANY', 'COMPLAINTCLARIFICATION', 'COMPLAINTMISHEARD',
        'COMPLAINTPRIVACY', 'COMPLAINTREPETITION', 'CRITICISM',
        'NEGATIVENAVIGATION', 'OFFENSIVE', 'STOP']

def map_fn(examples):
    user_utts = examples['user_utt']  # list of str
    idxs = [ds_all_utts_lookup[user_utt] for user_utt in user_utts]  # list of int
    ds_slice = ds_all_utts[idxs]  # dict of columns, each a list
    result = {col: ds_slice[col] for col in cols}
    return result

neuralgen_ds = neuralgen_ds.map(map_fn, batched=True, batch_size=100)
```
The tqdm estimate says this .map will take over 8 hours. Adjusting batch_size doesn’t seem to help. The slowest part of map_fn is this line:
```python
ds_slice = ds_all_utts[idxs]  # dict
```
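My understanding is that ds_all_utts is a memory-mapped Arrow table, so indexing it with 100 scattered row ids per batch turns into random reads across a large file (the embeddings alone are about 2,732,013 × 1280 × 4 bytes ≈ 14 GB), which is especially painful on a spinning disk. One workaround I've sketched (untested at this scale; it keeps my original assumption that every user_utt appears in ds_all_utts_lookup) is to pull only the ~190k referenced rows into RAM in one sorted pass, then serve every per-batch lookup from NumPy arrays:

```python
import numpy as np

# Sketch of a possible workaround (hypothetical, untested at this scale).
# Pull only the rows that neuralgen_ds actually references into memory,
# then answer every per-batch lookup from NumPy arrays instead of the
# memory-mapped Arrow table.

needed_utts = set(neuralgen_ds['user_utt'])                       # 190,602 unique strings
needed_idxs = sorted(ds_all_utts_lookup[u] for u in needed_utts)  # sorted -> mostly sequential reads
small = ds_all_utts.select(needed_idxs)                           # one gather pass over ds_all_utts

# Re-key the lookup against the smaller copy and materialize the needed
# columns as arrays (~1 GB of float32 embeddings vs ~14 GB for the full table).
small_lookup = {utt: i for i, utt in enumerate(small['user_utterance'])}
col_arrays = {
    col: np.asarray(small[col], dtype=np.float32 if col == 'embedding' else np.int64)
    for col in cols
}

def map_fn(examples):
    idxs = np.array([small_lookup[utt] for utt in examples['user_utt']])
    return {col: col_arrays[col][idxs] for col in cols}

neuralgen_ds = neuralgen_ds.map(map_fn, batched=True, batch_size=1000)
```

The idea behind sorting needed_idxs is that the single .select gather reads the Arrow file roughly in order, after which each batch is pure in-memory array indexing. Whether that up-front pass is actually faster overall is exactly what I'm unsure about.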
Other questions
> Are you on an SSD or an HDD?
I’m not sure, but I followed these instructions and got:

```
>>> lsblk -o name,rota
NAME   ROTA
sda       1
├─sda1    1
├─sda2    1
├─sda5    1
├─sda6    1
├─sda7    1
└─sda8    1
sdb       1
└─sdb1    1
```

Since ROTA is 1 for every device (1 means rotational), it looks like both drives are HDDs, which would explain why random reads are so slow.