I was referred here by @lhoestq from this GitHub issue.
## Background
I have a large dataset, `ds_all_utts`, of user utterances. I load it using `load_from_disk` because I saved it with `save_to_disk`:

```python
ds_all_utts = load_from_disk(ds_all_utts_fname)
```

`ds_all_utts` has 2,732,013 rows and these features:
```python
{'ANY': Value(dtype='int64', id=None),
 'COMPLAINTCLARIFICATION': Value(dtype='int64', id=None),
 'COMPLAINTMISHEARD': Value(dtype='int64', id=None),
 'COMPLAINTPRIVACY': Value(dtype='int64', id=None),
 'COMPLAINTREPETITION': Value(dtype='int64', id=None),
 'CRITICISM': Value(dtype='int64', id=None),
 'NEGATIVENAVIGATION': Value(dtype='int64', id=None),
 'OFFENSIVE': Value(dtype='int64', id=None),
 'STOP': Value(dtype='int64', id=None),
 'embedding': Sequence(feature=Value(dtype='float32', id=None), length=-1, id=None),
 'frequency': Value(dtype='int64', id=None),
 'user_utterance': Value(dtype='string', id=None)}
```
`user_utterance` is a short piece of text (usually just a few words), `embedding` is a length-1280 vector representing that utterance, `frequency` is an int, and the rest are binary labels (0 or 1) for the utterance. The dataset is sorted by descending `frequency`.
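For context, that ordering corresponds to a standard `Dataset.sort` call, along these lines (a sketch, not necessarily the exact preprocessing used):

```python
# Sort by the 'frequency' column, most frequent utterances first.
ds_all_utts = ds_all_utts.sort('frequency', reverse=True)
```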
I have another `Dataset` called `neuralgen_ds` whose rows represent turns of dialogue along with their context. It has 385,580 rows and these features:
```python
{'session_id': Value(dtype='string', id=None),
 'treelet': Value(dtype='string', id=None),
 'context': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'bot_utt': Value(dtype='string', id=None),
 'bot_utt_labels': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'user_utt': Value(dtype='string', id=None),
 'user_utt_labels': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'GPT2ED': Value(dtype='bool', id=None),
 '__index_level_0__': Value(dtype='int64', id=None)}
```
Of these, the important one is `user_utt`, which holds the same kind of data as `ds_all_utts['user_utterance']`. Some user utterances appear multiple times in `neuralgen_ds`; there are 190,602 unique utterances in `neuralgen_ds['user_utt']`.
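(That count comes from a quick check along these lines:)

```python
# Count the distinct user utterances in neuralgen_ds.
print(len(set(neuralgen_ds['user_utt'])))  # 190602
```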
## What I want to do
For each row of `neuralgen_ds`, I want to look up the user utterance in `ds_all_utts` and copy certain columns over into `neuralgen_ds`. In particular, I want to copy `embedding` and all the capitalized binary labels (`ANY`, `COMPLAINTCLARIFICATION`, etc.).
## My code
First I create a dictionary mapping each user utterance to its row index in `ds_all_utts`:

```python
ds_all_utts_lookup = {utt: idx for idx, utt in enumerate(ds_all_utts['user_utterance'])}
```
Then I use `.map` to add the columns to `neuralgen_ds`:
```python
cols = ['embedding', 'ANY', 'COMPLAINTCLARIFICATION', 'COMPLAINTMISHEARD',
        'COMPLAINTPRIVACY', 'COMPLAINTREPETITION', 'CRITICISM',
        'NEGATIVENAVIGATION', 'OFFENSIVE', 'STOP']

def map_fn(examples):
    user_utts = examples['user_utt']  # list of str
    idxs = [ds_all_utts_lookup[user_utt] for user_utt in user_utts]  # list of int
    ds_slice = ds_all_utts[idxs]  # dict of columns, each a list
    result = {col: ds_slice[col] for col in cols}
    return result

neuralgen_ds = neuralgen_ds.map(map_fn, batched=True, batch_size=100)
```
The tqdm estimate says this `.map` will take over 8 hours. Adjusting `batch_size` doesn't seem to help. The slowest part of `map_fn` is this line:

```python
ds_slice = ds_all_utts[idxs]  # dict
```
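One workaround I'm considering (a sketch, assuming the copied columns fit in RAM; the embedding column alone is about 2.7M × 1280 floats, so that's a real assumption): read each needed column into memory once with a sequential scan, then gather rows from plain Python lists instead of slicing the Arrow-backed dataset inside every batch:

```python
# Pay the column-read cost once up front; ds_all_utts[col] scans the
# whole column sequentially and returns it as a list.
in_memory = {col: ds_all_utts[col] for col in cols}

def map_fn_inmem(examples):
    idxs = [ds_all_utts_lookup[utt] for utt in examples['user_utt']]
    # Gather from the in-memory lists rather than the on-disk dataset.
    return {col: [in_memory[col][i] for i in idxs] for col in cols}

neuralgen_ds = neuralgen_ds.map(map_fn_inmem, batched=True, batch_size=100)
```

Even if that works, I'd still like to understand why the batched `ds_all_utts[idxs]` lookup is this slow.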
## Other questions
> Are you on an SSD or an HDD?
I'm not sure, but I followed these instructions and got:

```
>>> lsblk -o name,rota
NAME   ROTA
sda       1
├─sda1    1
├─sda2    1
├─sda5    1
├─sda6    1
├─sda7    1
└─sda8    1
sdb       1
└─sdb1    1
```

If I'm reading that right, `ROTA 1` means the drives are rotational, so it looks like I'm on an HDD.