Chapter 7 questions

The exercises in this course no longer seem practical: there have been too many changes in the datasets, models, and Python library functions between when the course was written and today.

I’m now spending more time debugging problems than learning the intended topic of each section. From here to the end of the course I think I will just read through the material instead of trying to fix the bugs in each exercise.

For example, I got an error in the code that evaluates the ROUGE score:

AttributeError: 'numpy.float64' object has no attribute 'mid'

This is because the ROUGE metric no longer returns a collection of low/mid/high aggregate scores, as the course code expects; it now returns a simple dict of scores keyed by ROUGE type (see ROUGE - a Hugging Face Space by evaluate-metric).
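
The workaround I used (a minimal sketch, assuming a recent version of the evaluate library) is to read the scores directly as floats instead of going through .mid:

import evaluate

rouge_score = evaluate.load("rouge")

scores = rouge_score.compute(
    predictions=["the cat sat on the mat"],
    references=["the cat sat on the mat"],
)

# scores is now a plain dict of floats, e.g.
# {'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0, 'rougeLsum': 1.0}
# So wherever the course code does scores["rouge1"].mid.fmeasure, use scores["rouge1"] instead
print(scores["rouge1"])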

Also, the Amazon Reviews dataset is no longer available. I spent quite a bit of time recreating a similar dataset from the Wikipedia dataset so that I could continue with the course exercise.

# Run once. Can take a very long time.
# Make sure the saved files persist.

from datasets import load_dataset, Dataset, load_from_disk
import random

# The Amazon Reviews dataset is defunct, so we need a replacement.
# The Wikipedia dataset has the title and text fields we need, where the title can be
# treated as a summary of the text. It won't be as good as the original, but it lets
# the course exercises move forward.
# https://huggingface.co/datasets/wikimedia/wikipedia
# However, it does not have train/test/validation splits, and it is HUGE.
# So we manually fake the splits, rename columns, and add the missing column with random values.

spanish_dataset_raw = load_dataset(
    path="wikimedia/wikipedia",
    name="20231101.es",
    trust_remote_code=True
)

english_dataset_raw = load_dataset(
    path="wikimedia/wikipedia",
    name="20231101.en",
    trust_remote_code=True
)

# english_dataset = english_dataset_raw
# spanish_dataset = spanish_dataset_raw

# At the time of this writing the English Wikipedia dataset had about 6.4 million records,
# but the Amazon Reviews dataset was only 200,000/5,000/5,000 for train/validation/test

# Get a smaller portion of records, split into "train" and "test"
english_dataset = english_dataset_raw["train"].train_test_split(test_size=10_000, train_size=200_000)
# Divide the test split in half for "test" and "validation"
english_dataset_test_split = english_dataset["test"].train_test_split(test_size=0.5, train_size=0.5)
# Assemble the various splits into one dictionary
english_dataset['test'] = english_dataset_test_split['test']
english_dataset['validation'] = english_dataset_test_split['train']



# Repeat for the Spanish dataset
spanish_dataset = spanish_dataset_raw["train"].train_test_split(test_size=10_000, train_size=200_000)
# Divide the test split in half for "test" and "validation"
spanish_dataset_test_split = spanish_dataset["test"].train_test_split(test_size=0.5, train_size=0.5)
# Assemble the various splits into one dictionary
spanish_dataset['test'] = spanish_dataset_test_split['test']
spanish_dataset['validation'] = spanish_dataset_test_split['train']

# add the missing product_category column
product_categories = ["home","apparel","wireless","other","beauty","drugstore","kitchen","toy","sports","automotive","lawn_and_garden","home_improvement","pet_products","digital_ebook_purchase","pc","electronics","office_product","shoes","grocery","book"]

def add_product_category(example):
    example["product_category"] = random.choice(product_categories)
    return example
    
english_dataset = english_dataset.map(add_product_category)
spanish_dataset = spanish_dataset.map(add_product_category)

# Rename columns to match course data
english_dataset = english_dataset.rename_column("text", "review_body")
english_dataset = english_dataset.rename_column("title", "review_title")
spanish_dataset = spanish_dataset.rename_column("text", "review_body")
spanish_dataset = spanish_dataset.rename_column("title", "review_title")

# Inspect one record to confirm the renamed columns and the new product_category field
english_dataset['train'][0]

english_dataset.save_to_disk("english_dataset")
spanish_dataset.save_to_disk("spanish_dataset")

# I'm working in Kaggle, so I made sure to save a version of the notebook so that the files persisted between loads

Next cell:

# If the cell above has already been executed and the datasets are available on the notebook's
# persistent disk, do not run it again; run this cell instead to load from disk

english_dataset = load_from_disk("english_dataset")
spanish_dataset = load_from_disk("spanish_dataset")
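
As an optional sanity check after loading, printing the DatasetDict and one record confirms the splits and the renamed columns are there:

# Optional sanity check: splits should be train/test/validation, and each record
# should have review_title, review_body, and product_category fields
print(english_dataset)
print(english_dataset["train"][0]["review_title"])
print(english_dataset["train"][0]["product_category"])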