Chapter 3 questions

Hi, I'm using Python 3.11 with VS Code and tried to run the following code locally, but it was unsuccessful.
The same code runs fine on Colab. I don't know how to solve this problem; any help would be appreciated. Thanks!

from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets
Downloading and preparing dataset None/ax to file://C:/Users/qq575/.cache/huggingface/datasets/parquet/ax-1f514e37fba474dd/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...
Downloading data files: 100% 3/3 [00:00<00:00, 752.16it/s]
Extracting data files: 100% 3/3 [00:00<00:00, 150.19it/s]
Traceback (most recent call last):
  File "e:\python_projects\learning ML\3.1.2.py", line 3, in <module>
    raw_datasets = load_dataset("glue", "mrpc")
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\anaconda3\envs\python3_11\Lib\site-packages\datasets\load.py", line 1797, in load_dataset
    split_info = self.info.splits[split_generator.name]
                 ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^
  File "D:\anaconda3\envs\python3_11\Lib\site-packages\datasets\splits.py", line 530, in __getitem__
    instructions = make_file_instructions(
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\anaconda3\envs\python3_11\Lib\site-packages\datasets\arrow_reader.py", line 112, in make_file_instructions
    name2filenames = {
                     ^
  File "D:\anaconda3\envs\python3_11\Lib\site-packages\datasets\arrow_reader.py", line 113, in <dictcomp>
    info.name: filenames_for_dataset_split(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\anaconda3\envs\python3_11\Lib\site-packages\datasets\naming.py", line 70, in filenames_for_dataset_split
    prefix = filename_prefix_for_split(dataset_name, split)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\anaconda3\envs\python3_11\Lib\site-packages\datasets\naming.py", line 54, in filename_prefix_for_split
    if os.path.basename(name) != name:
       ^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen ntpath>", line 244, in basename
  File "<frozen ntpath>", line 213, in split
TypeError: expected str, bytes or os.PathLike object, not NoneType

As another person said above, upgrading to datasets>=2.17.0 solved the problem.
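In case it helps, the upgrade is a one-liner (run it in the same environment the script uses, then restart the kernel/IDE):

pip install -U "datasets>=2.17.0"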

Your solution is really helpful, thanks a lot!

Hi, I'm new to the community. Just a simple question: in the Chapter 3 "A full training" section, there is only a PyTorch version. Does this section also have a TensorFlow version like the other sections? Thanks!

I've got the same error and could not find a solution.

I am trying to train the model with pre-processed data ("glue", "sst2"). I have transformed the dataset into two sentences and a label. Now I am trying to generate the Dataset, and I am caught in an error.

To generate the custom dataset:

from datasets import Dataset, ClassLabel, Features, Value

# Wrap the schema in a Features object so it can be passed to cast()
features = Features({
    "sentence1": Value("string"),  # String type for sentence1
    "sentence2": Value("string"),  # String type for sentence2
    "label": ClassLabel(names=["not_equivalent", "equivalent"]),  # ClassLabel definition
    "idx": Value("int32"),
})
custom_dataset = Dataset.from_dict(train_pairs)
custom_dataset = custom_dataset.cast(features)
custom_dataset

Here is a sample entry from train_pairs:


{'sentence1': "that 's far too tragic to merit such superficial treatment ", 
'sentence2': "that 's far too tragic to merit such superficial treatment ", 
'label': <ClassLabel.not_equivalent: 0>, 
'idx': 5}

/usr/local/lib/python3.10/dist-packages/pyarrow/lib.cpython-310-x86_64-linux-gnu.so in string.from_py.__pyx_convert_string_from_py_std__in_string()

**TypeError**: expected bytes, int found

So I changed the label to a plain integer:

{'sentence1': "that 's far too tragic to merit such superficial treatment ", 
'sentence2': "that 's far too tragic to merit such superficial treatment ", 
'label': 0, 
'idx': 5}

/usr/local/lib/python3.10/dist-packages/pyarrow/lib.cpython-310-x86_64-linux-gnu.so in string.from_py.__pyx_convert_string_from_py_std__in_string()

**TypeError**: expected bytes, int found

I am trying to achieve what the original code does when preprocessing glue + mrpc.

Any clue how I can fix this?
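For reference, here is a minimal sketch of how such a dataset can be built, assuming train_pairs is a list of dicts shaped like the sample above, with plain Python strings and ints (no ClassLabel enum objects). Dataset.from_dict expects a mapping of column name to list of values, so the rows are converted to columns first:

from datasets import ClassLabel, Dataset, Features, Value

# Hypothetical rows in the same shape as the sample above
train_pairs = [
    {"sentence1": "first sentence", "sentence2": "second sentence", "label": 0, "idx": 0},
    {"sentence1": "third sentence", "sentence2": "fourth sentence", "label": 1, "idx": 1},
]

features = Features({
    "sentence1": Value("string"),
    "sentence2": Value("string"),
    "label": ClassLabel(names=["not_equivalent", "equivalent"]),
    "idx": Value("int32"),
})

# Convert the list of row dicts into a dict of columns, then build the Dataset
columns = {name: [row[name] for row in train_pairs] for name in features}
custom_dataset = Dataset.from_dict(columns, features=features)
print(custom_dataset)
print(custom_dataset.features)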

Please tell me how to segment each face of an assembly-line workpiece. For example, if three faces are visible in a picture, how can I segment those three faces?

Doing this fixed the error for me:

!pip install -U transformers
!pip install -U accelerate

Hello, I recently fine-tuned my model using the training loop on the GLUE SST-2 dataset. Here's my code:

from accelerate import Accelerator
from transformers import AdamW, AutoModelForSequenceClassification, get_scheduler
import torch
from tqdm.auto import tqdm

accelerator = Accelerator()

# checkpoint, num_epochs, train_dataloader and eval_dataloader are defined earlier, as in the course
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

optimizer = AdamW(model.parameters(), lr=3e-5)

train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

num_training_steps = num_epochs * len(train_dl)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dl:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

I wondered if I can use the test set to run inference on it. Can you show me how?
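Not an official answer, but a minimal evaluation/inference loop in the same style as the course's evaluation loop could look like the sketch below; it assumes the dataloader was passed through accelerator.prepare like eval_dl above. Note that the GLUE test splits ship without gold labels, so on the actual test split you would only collect the predictions instead of computing the metric:

import torch
import evaluate

metric = evaluate.load("glue", "sst2")

model.eval()
for batch in eval_dl:  # or a test_dataloader prepared the same way
    with torch.no_grad():
        outputs = model(**batch)
    predictions = torch.argmax(outputs.logits, dim=-1)
    # gather predictions/labels across processes before scoring
    metric.add_batch(
        predictions=accelerator.gather(predictions),
        references=accelerator.gather(batch["labels"]),
    )

print(metric.compute())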

I am doing the Hugging Face NLP course.
In Chapter 3, "Fine-tuning a pretrained model",
subchapter 3, "Fine-tuning a model with the Trainer API",

under the "Evaluation" section, I'm trying to run the following code as specified in the course:

import evaluate

metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

But when I run it, I’m getting the following ValueError

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[12], line 4
      1 import evaluate
      3 metric = evaluate.load("glue", "mrpc")
----> 4 metric.compute(predictions=preds, references=predictions.label_ids)

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/evaluate/module.py:465, in EvaluationModule.compute(self, predictions, references, **kwargs)
    462 if self.process_id == 0:
    463     self.data.set_format(type=self.info.format)
--> 465     inputs = {input_name: self.data[input_name] for input_name in self._feature_names()}
    466     with temp_seed(self.seed):
    467         output = self._compute(**inputs, **compute_kwargs)

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/datasets/arrow_dataset.py:2866, in Dataset.__getitem__(self, key)
   2864 def __getitem__(self, key):  # noqa: F811
   2865     """Can be used to index columns (by string names) or rows (by integer index or iterable of indices or bools)."""
-> 2866     return self._getitem(key)

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/datasets/arrow_dataset.py:2851, in Dataset._getitem(self, key, **kwargs)
   2849 formatter = get_formatter(format_type, features=self._info.features, **format_kwargs)
   2850 pa_subtable = query_table(self._data, key, indices=self._indices)
-> 2851 formatted_output = format_table(
   2852     pa_subtable, key, formatter=formatter, format_columns=format_columns, output_all_columns=output_all_columns
   2853 )
   2854 return formatted_output

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/datasets/formatting/formatting.py:633, in format_table(table, key, formatter, format_columns, output_all_columns)
    631 python_formatter = PythonFormatter(features=formatter.features)
    632 if format_columns is None:
--> 633     return formatter(pa_table, query_type=query_type)
    634 elif query_type == "column":
    635     if key in format_columns:

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/datasets/formatting/formatting.py:399, in Formatter.__call__(self, pa_table, query_type)
    397     return self.format_row(pa_table)
    398 elif query_type == "column":
--> 399     return self.format_column(pa_table)
    400 elif query_type == "batch":
    401     return self.format_batch(pa_table)

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/datasets/formatting/np_formatter.py:94, in NumpyFormatter.format_column(self, pa_table)
     93 def format_column(self, pa_table: pa.Table) -> np.ndarray:
---> 94     column = self.numpy_arrow_extractor().extract_column(pa_table)
     95     column = self.python_features_decoder.decode_column(column, pa_table.column_names[0])
     96     column = self.recursive_tensorize(column)

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/datasets/formatting/formatting.py:162, in NumpyArrowExtractor.extract_column(self, pa_table)
    161 def extract_column(self, pa_table: pa.Table) -> np.ndarray:
--> 162     return self._arrow_array_to_numpy(pa_table[pa_table.column_names[0]])

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/datasets/formatting/formatting.py:197, in NumpyArrowExtractor._arrow_array_to_numpy(self, pa_array)
    191     if any(
    192         (isinstance(x, np.ndarray) and (x.dtype == object or x.shape != array[0].shape))
    193         or (isinstance(x, float) and np.isnan(x))
    194         for x in array
    195     ):
    196         return np.array(array, copy=False, dtype=object)
--> 197 return np.array(array, copy=False)

ValueError: Unable to avoid copy while creating an array as requested.
If using `np.array(obj, copy=False)` replace it with `np.asarray(obj)` to allow a copy when needed (no behavior change in NumPy 1.x).
For more details, see https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword.

I am assuming this is an issue inside the datasets library caused by changes in NumPy after that code was originally written. I can't change the underlying library code myself; it is maintained by the Hugging Face team. So how can I get past this?

To add on, in the next subchapter ("A full training"),
in the "Prepare for training" section,

when I run the code below as specified in the course,

for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

I am getting a similar error

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[9], line 1
----> 1 for batch in train_dataloader:
      2     break
      3 {k: v.shape for k, v in batch.items()}

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/torch/utils/data/dataloader.py:631, in _BaseDataLoaderIter.__next__(self)
    628 if self._sampler_iter is None:
    629     # TODO(https://github.com/pytorch/pytorch/issues/76750)
    630     self._reset()  # type: ignore[call-arg]
--> 631 data = self._next_data()
    632 self._num_yielded += 1
    633 if self._dataset_kind == _DatasetKind.Iterable and \
    634         self._IterableDataset_len_called is not None and \
    635         self._num_yielded > self._IterableDataset_len_called:

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/torch/utils/data/dataloader.py:675, in _SingleProcessDataLoaderIter._next_data(self)
    673 def _next_data(self):
    674     index = self._next_index()  # may raise StopIteration
--> 675     data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    676     if self._pin_memory:
    677         data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py:49, in _MapDatasetFetcher.fetch(self, possibly_batched_index)
     47 if self.auto_collation:
     48     if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__:
---> 49         data = self.dataset.__getitems__(possibly_batched_index)
     50     else:
     51         data = [self.dataset[idx] for idx in possibly_batched_index]

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/datasets/arrow_dataset.py:2870, in Dataset.__getitems__(self, keys)
   2868 def __getitems__(self, keys: List) -> List:
   2869     """Can be used to get a batch using a list of integers indices."""
-> 2870     batch = self.__getitem__(keys)
   2871     n_examples = len(batch[next(iter(batch))])
   2872     return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/datasets/arrow_dataset.py:2866, in Dataset.__getitem__(self, key)
   2864 def __getitem__(self, key):  # noqa: F811
   2865     """Can be used to index columns (by string names) or rows (by integer index or iterable of indices or bools)."""
-> 2866     return self._getitem(key)

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/datasets/arrow_dataset.py:2851, in Dataset._getitem(self, key, **kwargs)
   2849 formatter = get_formatter(format_type, features=self._info.features, **format_kwargs)
   2850 pa_subtable = query_table(self._data, key, indices=self._indices)
-> 2851 formatted_output = format_table(
   2852     pa_subtable, key, formatter=formatter, format_columns=format_columns, output_all_columns=output_all_columns
   2853 )
   2854 return formatted_output

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/datasets/formatting/formatting.py:633, in format_table(table, key, formatter, format_columns, output_all_columns)
    631 python_formatter = PythonFormatter(features=formatter.features)
    632 if format_columns is None:
--> 633     return formatter(pa_table, query_type=query_type)
    634 elif query_type == "column":
    635     if key in format_columns:

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/datasets/formatting/formatting.py:401, in Formatter.__call__(self, pa_table, query_type)
    399     return self.format_column(pa_table)
    400 elif query_type == "batch":
--> 401     return self.format_batch(pa_table)

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/datasets/formatting/torch_formatter.py:110, in TorchFormatter.format_batch(self, pa_table)
    109 def format_batch(self, pa_table: pa.Table) -> Mapping:
--> 110     batch = self.numpy_arrow_extractor().extract_batch(pa_table)
    111     batch = self.python_features_decoder.decode_batch(batch)
    112     batch = self.recursive_tensorize(batch)

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/datasets/formatting/formatting.py:165, in NumpyArrowExtractor.extract_batch(self, pa_table)
    164 def extract_batch(self, pa_table: pa.Table) -> dict:
--> 165     return {col: self._arrow_array_to_numpy(pa_table[col]) for col in pa_table.column_names}

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/datasets/formatting/formatting.py:197, in NumpyArrowExtractor._arrow_array_to_numpy(self, pa_array)
    191     if any(
    192         (isinstance(x, np.ndarray) and (x.dtype == object or x.shape != array[0].shape))
    193         or (isinstance(x, float) and np.isnan(x))
    194         for x in array
    195     ):
    196         return np.array(array, copy=False, dtype=object)
--> 197 return np.array(array, copy=False)

ValueError: Unable to avoid copy while creating an array as requested.
If using `np.array(obj, copy=False)` replace it with `np.asarray(obj)` to allow a copy when needed (no behavior change in NumPy 1.x).
For more details, see https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword.

Do you know any solutions to this issue? I can’t seem to proceed with building my own fine-tuned models if I’m not able to run this or evaluate the model’s accuracy.
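Not a definitive fix, but two commonly suggested workarounds for this NumPy 2.0 copy=False incompatibility are upgrading datasets and evaluate to recent releases, or pinning NumPy below 2.0 until you can upgrade:

pip install -U datasets evaluate
# or, if upgrading is not an option right now:
pip install "numpy<2"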

Hello! Please add a TensorFlow option for "A full training". Thanks!

In Chapter 3 there is a reference to Chapter 10, but it does not exist in the course. Are further chapters planned for the NLP course? HF did a fantastic job helping beginners understand NLP, and nowadays there are significant developments like LLM apps, RAG, etc.

Where is chapter 10? It’s referenced but I didn’t find it.

@sgugger Hello, sir. For those of us who only took the course with TensorFlow, does Chapter 3 stop at the "Fine-tuning a model with Keras" section? How should we manage training loops and accelerators with TensorFlow, as is done in the "A full training" section for PyTorch? Thank you.

The following code:

from datasets import load_dataset
raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

gives me the error:

No (supported) data files or dataset script found in glue.
----> 3 raw_datasets = load_dataset("glue", "mrpc")

Any ideas how to solve this?
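One thing worth checking (a guess, not a confirmed fix): newer versions of the datasets library load GLUE from Parquet files on the Hub instead of the old loading script, so upgrading datasets often resolves this; also make sure there is no local file or folder named glue in your working directory that could shadow the Hub dataset:

pip install -U datasets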

Hi, for the fine-tuning with the Keras APIs, why did you train the whole model?
I was expecting to freeze the pretrained model and only train the new classification head.

I believe so, since TensorFlow natively relies on Keras to build the training loop through the model class. Also, parallelism is natively supported without extra libraries.
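For illustration, a minimal sketch of what that looks like; the checkpoint name and the tf_train_dataset / tf_validation_dataset pipelines are assumed to be prepared as in the "Fine-tuning a model with Keras" section:

import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

strategy = tf.distribute.MirroredStrategy()  # native multi-GPU data parallelism
with strategy.scope():
    model = TFAutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=3)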

(Screenshot 2024-11-26 231210: error message not reproduced here)
While running the code, the above error shows up.
When I tried to find the model in the Hugging Face Models section, a 404 error showed up.
Please suggest some other models that can be used instead of bert-base-uncased.
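Not an official answer, but bert-base-uncased now lives under the google-bert organization on the Hub (the short name normally still redirects there), and checkpoints such as distilbert-base-uncased or roberta-base are common drop-in alternatives for the course examples. A quick sanity check:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Canonical repo id; the short name "bert-base-uncased" usually redirects to it.
checkpoint = "google-bert/bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)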

I’ve been exploring fine-tuning strategies for pre-trained models like BERT and BART, and I’ve noticed some interesting points that I’d love to get your insights on.

In several research papers, freezing layers during fine-tuning is highlighted as an important technique for preventing catastrophic forgetting, reducing computational cost, and mitigating overfitting, especially with smaller datasets. However, I haven’t seen this practice as widely adopted in recent practical examples and tutorials.

Additionally, I noticed that Hugging Face's fine-tuning course, which I found to be very informative, doesn't explicitly mention freezing layers. Considering the course was published around 2021, when freezing layers was a common practice for models like bert-base-uncased, I'm curious about the rationale behind this.

I understand that deep learning best practices evolve rapidly, and I’m eager to learn more about the current recommendations for fine-tuning. Could you shed some light on the relevance of freezing layers in contemporary workflows? Is it still considered a valuable technique, or have advancements in model architectures and training strategies made it less critical?

Any insights you can share on this topic would be greatly appreciated, as it would help me and other learners gain a more comprehensive understanding of fine-tuning strategies and ensure we’re adopting the most effective approaches.

Thank you
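For concreteness, here is one common way to freeze the pretrained encoder and train only the classification head in the PyTorch setup used in this chapter; this is a sketch, not a course-endorsed recipe, and the checkpoint name is just an example:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Freeze every parameter of the pretrained encoder (model.base_model points to it),
# leaving only the newly initialized classification head trainable.
for param in model.base_model.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,}")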