Chapter 3 questions

Hi, I am using Python 3.11 with VS Code. I tried to run the following code locally, but it failed.
The same code runs fine on Colab. I don’t know how to solve this problem; any help would be appreciated. Thanks!

from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets
Downloading and preparing dataset None/ax to file://C:/Users/qq575/.cache/huggingface/datasets/parquet/ax-1f514e37fba474dd/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...
Downloading data files: 100%|██████████| 3/3 [00:00<00:00, 752.16it/s]
Extracting data files: 100%|██████████| 3/3 [00:00<00:00, 150.19it/s]
Traceback (most recent call last):
  File "e:\python_projects\learning ML\3.1.2.py", line 3, in <module>
    raw_datasets = load_dataset("glue", "mrpc")
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\anaconda3\envs\python3_11\Lib\site-packages\datasets\load.py", line 1797, in load_dataset
    split_info = self.info.splits[split_generator.name]
                 ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^
  File "D:\anaconda3\envs\python3_11\Lib\site-packages\datasets\splits.py", line 530, in __getitem__
    instructions = make_file_instructions(
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\anaconda3\envs\python3_11\Lib\site-packages\datasets\arrow_reader.py", line 112, in make_file_instructions
    name2filenames = {
                     ^
  File "D:\anaconda3\envs\python3_11\Lib\site-packages\datasets\arrow_reader.py", line 113, in <dictcomp>
    info.name: filenames_for_dataset_split(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\anaconda3\envs\python3_11\Lib\site-packages\datasets\naming.py", line 70, in filenames_for_dataset_split
    prefix = filename_prefix_for_split(dataset_name, split)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\anaconda3\envs\python3_11\Lib\site-packages\datasets\naming.py", line 54, in filename_prefix_for_split
    if os.path.basename(name) != name:
       ^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen ntpath>", line 244, in basename
  File "<frozen ntpath>", line 213, in split
TypeError: expected str, bytes or os.PathLike object, not NoneType

As another person said above, upgrading to datasets>=2.17.0 solved the problem.
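For anyone who wants to check their environment first, here is a minimal sketch that reports whether the installed datasets package is older than 2.17.0. The helper names `version_tuple` and `needs_upgrade` are made up for this example, and the version parsing is deliberately simple (it ignores suffixes such as `rc1`).

```python
from importlib.metadata import PackageNotFoundError, version


def version_tuple(text):
    """Turn '2.17.0' into (2, 17, 0); suffixes like 'rc1' are ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:  # pad short versions such as '2' to (2, 0, 0)
        parts.append(0)
    return tuple(parts)


def needs_upgrade(package="datasets", minimum="2.17.0"):
    """Return True if the package is missing or older than `minimum`."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return True
    return version_tuple(installed) < version_tuple(minimum)


if needs_upgrade():
    print('Try: pip install -U "datasets>=2.17.0"')
else:
    print("datasets is already recent enough")
```

If it reports an old version, running `pip install -U "datasets>=2.17.0"` in the same environment (here, the conda env used by VS Code) should bring it up to date.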

Your solution is really helpful, thanks a lot!

Hi, I’m new to the community. Just a simple question: in the Chapter 3 “A full training” section, there is only a PyTorch version. Does this section also have a TensorFlow version like the other sections? Thanks!

I’ve got the same error and could not find a solution

I am trying to train the model with pre-processed data (“glue”, “sst2”). I have transformed the dataset into pairs of two sentences plus a label. Now I am trying to build a Dataset from it and I am stuck on an error.

Code to generate the custom dataset:

from datasets import ClassLabel, Dataset, Features, Value

# Dataset.cast expects a Features object rather than a plain dict
features = Features({
    "sentence1": Value("string"),  # String type for sentence1
    "sentence2": Value("string"),  # String type for sentence2
    "label": ClassLabel(names=["not_equivalent", "equivalent"]),  # ClassLabel definition
    "idx": Value("int32"),
})
# train_pairs is built earlier (not shown here)
custom_dataset = Dataset.from_dict(train_pairs)
custom_dataset = custom_dataset.cast(features)
custom_dataset

Here is a sample entry from train_pairs:

{'sentence1': "that 's far too tragic to merit such superficial treatment ", 
'sentence2': "that 's far too tragic to merit such superficial treatment ", 
'label': <ClassLabel.not_equivalent: 0>, 
'idx': 5}

/usr/local/lib/python3.10/dist-packages/pyarrow/lib.cpython-310-x86_64-linux-gnu.so in string.from_py.__pyx_convert_string_from_py_std__in_string()

**TypeError**: expected bytes, int found

So I changed the label to a plain integer, but I get the same error:

{'sentence1': "that 's far too tragic to merit such superficial treatment ", 
'sentence2': "that 's far too tragic to merit such superficial treatment ", 
'label': 0, 
'idx': 5}

/usr/local/lib/python3.10/dist-packages/pyarrow/lib.cpython-310-x86_64-linux-gnu.so in string.from_py.__pyx_convert_string_from_py_std__in_string()

**TypeError**: expected bytes, int found

I am trying to achieve the same result as the original code that preprocesses glue + mrpc.

Any clue how I can fix this?
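One possible cause, offered as a guess since the post does not show how train_pairs is built: `Dataset.from_dict` expects a dict that maps column names (strings) to lists of values, not a list of rows or a dict keyed by integers. The sketch below converts row dicts into that column layout; `rows_to_columns` and the sample rows are hypothetical names and data used only for illustration.

```python
import enum


def rows_to_columns(rows):
    """Convert a list of row dicts into one dict of column lists."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            # Enum members (e.g. <ClassLabel.not_equivalent: 0>) become plain ints
            if isinstance(value, enum.Enum):
                value = int(value.value)
            # Column names must be strings, never ints
            columns.setdefault(str(key), []).append(value)
    return columns


# Hypothetical sample rows with the same keys as the entry shown above
rows = [
    {"sentence1": "first text", "sentence2": "second text", "label": 0, "idx": 5},
    {"sentence1": "third text", "sentence2": "fourth text", "label": 1, "idx": 6},
]
columns = rows_to_columns(rows)
print(columns)

# With the datasets library installed, the result can then be passed on, e.g.
# custom_dataset = Dataset.from_dict(columns, features=features)
# where `features` is a datasets.Features object.
```

If train_pairs is a dict keyed by integers (for example `{0: {...}, 1: {...}}`), passing `list(train_pairs.values())` to the helper would give the same column layout.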

Please tell me how to segment each face of the assembly line workpiece. For example, if you see three faces in a picture, you can segment the three faces.

Doing this fixed the error for me:

!pip install -U transformers
!pip install -U accelerate

Hello, I recently fine-tuned my model on the GLUE SST-2 dataset using the training loop. Here’s my code:

import torch
from accelerate import Accelerator
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch one
from tqdm.auto import tqdm
from transformers import AutoModelForSequenceClassification, get_scheduler

# checkpoint, num_epochs, train_dataloader and eval_dataloader are assumed
# to be defined earlier (not shown here)

accelerator = Accelerator()

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

optimizer = AdamW(model.parameters(), lr=3e-5)

# Let Accelerate wrap the dataloaders, model and optimizer
train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

num_training_steps = num_epochs * len(train_dl)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dl:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

I am wondering whether I can use the test dataset to run inference with the fine-tuned model. Can you show me how?
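A minimal sketch of one way to do this, assuming each batch is a dict of tensors and the model returns an object with a `.logits` attribute (as the sequence classification models in transformers do). `predict_labels` is a hypothetical helper name. Note that the GLUE test splits ship with placeholder labels (-1), so you can produce predictions on them but not compute accuracy; use the validation split for metrics.

```python
import torch


def predict_labels(model, dataloader, device="cpu"):
    """Return the predicted class id for every example in `dataloader`."""
    model.eval()  # switch off dropout for inference
    all_preds = []
    with torch.no_grad():  # gradients are not needed at inference time
        for batch in dataloader:
            # Drop the labels if present: the test split has no usable ones
            inputs = {k: v.to(device) for k, v in batch.items() if k != "labels"}
            logits = model(**inputs).logits
            all_preds.append(logits.argmax(dim=-1).cpu())
    return torch.cat(all_preds)
```

With the training code above, something like `preds = predict_labels(model, eval_dl, accelerator.device)` would give one class id per example; a dataloader built from the tokenized test split can be passed in the same way.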

I am doing the Hugging Face NLP course.
In chapter 3 - Fine-tuning a pretrained model,
subchapter 3 - Fine-tuning a model with the Trainer API,

under the “Evaluation” section, I’m trying to run the following code as specified in the course:

import evaluate

metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

But when I run it, I’m getting the following ValueError

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[12], line 4
      1 import evaluate
      3 metric = evaluate.load("glue", "mrpc")
----> 4 metric.compute(predictions=preds, references=predictions.label_ids)

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/evaluate/module.py:465, in EvaluationModule.compute(self, predictions, references, **kwargs)
    462 if self.process_id == 0:
    463     self.data.set_format(type=self.info.format)
--> 465     inputs = {input_name: self.data[input_name] for input_name in self._feature_names()}
    466     with temp_seed(self.seed):
    467         output = self._compute(**inputs, **compute_kwargs)

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/datasets/arrow_dataset.py:2866, in Dataset.__getitem__(self, key)
   2864 def __getitem__(self, key):  # noqa: F811
   2865     """Can be used to index columns (by string names) or rows (by integer index or iterable of indices or bools)."""
-> 2866     return self._getitem(key)

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/datasets/arrow_dataset.py:2851, in Dataset._getitem(self, key, **kwargs)
   2849 formatter = get_formatter(format_type, features=self._info.features, **format_kwargs)
   2850 pa_subtable = query_table(self._data, key, indices=self._indices)
-> 2851 formatted_output = format_table(
   2852     pa_subtable, key, formatter=formatter, format_columns=format_columns, output_all_columns=output_all_columns
   2853 )
   2854 return formatted_output

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/datasets/formatting/formatting.py:633, in format_table(table, key, formatter, format_columns, output_all_columns)
    631 python_formatter = PythonFormatter(features=formatter.features)
    632 if format_columns is None:
--> 633     return formatter(pa_table, query_type=query_type)
    634 elif query_type == "column":
    635     if key in format_columns:

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/datasets/formatting/formatting.py:399, in Formatter.__call__(self, pa_table, query_type)
    397     return self.format_row(pa_table)
    398 elif query_type == "column":
--> 399     return self.format_column(pa_table)
    400 elif query_type == "batch":
    401     return self.format_batch(pa_table)

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/datasets/formatting/np_formatter.py:94, in NumpyFormatter.format_column(self, pa_table)
     93 def format_column(self, pa_table: pa.Table) -> np.ndarray:
---> 94     column = self.numpy_arrow_extractor().extract_column(pa_table)
     95     column = self.python_features_decoder.decode_column(column, pa_table.column_names[0])
     96     column = self.recursive_tensorize(column)

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/datasets/formatting/formatting.py:162, in NumpyArrowExtractor.extract_column(self, pa_table)
    161 def extract_column(self, pa_table: pa.Table) -> np.ndarray:
--> 162     return self._arrow_array_to_numpy(pa_table[pa_table.column_names[0]])

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/datasets/formatting/formatting.py:197, in NumpyArrowExtractor._arrow_array_to_numpy(self, pa_array)
    191     if any(
    192         (isinstance(x, np.ndarray) and (x.dtype == object or x.shape != array[0].shape))
    193         or (isinstance(x, float) and np.isnan(x))
    194         for x in array
    195     ):
    196         return np.array(array, copy=False, dtype=object)
--> 197 return np.array(array, copy=False)

ValueError: Unable to avoid copy while creating an array as requested.
If using `np.array(obj, copy=False)` replace it with `np.asarray(obj)` to allow a copy when needed (no behavior change in NumPy 1.x).
For more details, see https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword.

I am assuming this comes from the datasets library, whose code was written before NumPy changed the behavior of the copy keyword. I can’t change the library’s code myself; it was presumably written by the Hugging Face team. So how can I get past this?
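The traceback points at `np.array(array, copy=False)`, which NumPy 2.0 made strict: it now raises when a copy cannot be avoided. The sketch below only demonstrates that behavior change so you can confirm which NumPy you are running; `describe_copy_behaviour` is a made-up helper name. A likely workaround, stated as an assumption rather than a confirmed fix for this exact setup, is to install `numpy<2` in the environment or to upgrade datasets to a release that supports NumPy 2.

```python
import numpy as np


def describe_copy_behaviour():
    """Show how np.array(..., copy=False) behaves on the installed NumPy."""
    data = [1, 2, 3]  # a Python list always needs a copy to become an array
    major = int(np.__version__.split(".")[0])
    try:
        np.array(data, copy=False)
        strict = False  # NumPy 1.x silently copies when it has to
    except ValueError:
        strict = True  # NumPy 2.x refuses because a copy is unavoidable
    as_array = np.asarray(data)  # allowed to copy on every version
    return major, strict, as_array


major, strict, as_array = describe_copy_behaviour()
print(f"NumPy major version: {major}, copy=False is strict: {strict}")
```

If this prints a major version of 2, running `pip install "numpy<2"` inside the same virtual environment is the quickest thing to try.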

To add on, in the next subchapter (“A full training”),
in the “Prepare for training” section,

when I run the code below as specified in the course,

for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

I am getting a similar error

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[9], line 1
----> 1 for batch in train_dataloader:
      2     break
      3 {k: v.shape for k, v in batch.items()}

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/torch/utils/data/dataloader.py:631, in _BaseDataLoaderIter.__next__(self)
    628 if self._sampler_iter is None:
    629     # TODO(https://github.com/pytorch/pytorch/issues/76750)
    630     self._reset()  # type: ignore[call-arg]
--> 631 data = self._next_data()
    632 self._num_yielded += 1
    633 if self._dataset_kind == _DatasetKind.Iterable and \
    634         self._IterableDataset_len_called is not None and \
    635         self._num_yielded > self._IterableDataset_len_called:

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/torch/utils/data/dataloader.py:675, in _SingleProcessDataLoaderIter._next_data(self)
    673 def _next_data(self):
    674     index = self._next_index()  # may raise StopIteration
--> 675     data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    676     if self._pin_memory:
    677         data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py:49, in _MapDatasetFetcher.fetch(self, possibly_batched_index)
     47 if self.auto_collation:
     48     if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__:
---> 49         data = self.dataset.__getitems__(possibly_batched_index)
     50     else:
     51         data = [self.dataset[idx] for idx in possibly_batched_index]

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/datasets/arrow_dataset.py:2870, in Dataset.__getitems__(self, keys)
   2868 def __getitems__(self, keys: List) -> List:
   2869     """Can be used to get a batch using a list of integers indices."""
-> 2870     batch = self.__getitem__(keys)
   2871     n_examples = len(batch[next(iter(batch))])
   2872     return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/datasets/arrow_dataset.py:2866, in Dataset.__getitem__(self, key)
   2864 def __getitem__(self, key):  # noqa: F811
   2865     """Can be used to index columns (by string names) or rows (by integer index or iterable of indices or bools)."""
-> 2866     return self._getitem(key)

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/datasets/arrow_dataset.py:2851, in Dataset._getitem(self, key, **kwargs)
   2849 formatter = get_formatter(format_type, features=self._info.features, **format_kwargs)
   2850 pa_subtable = query_table(self._data, key, indices=self._indices)
-> 2851 formatted_output = format_table(
   2852     pa_subtable, key, formatter=formatter, format_columns=format_columns, output_all_columns=output_all_columns
   2853 )
   2854 return formatted_output

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/datasets/formatting/formatting.py:633, in format_table(table, key, formatter, format_columns, output_all_columns)
    631 python_formatter = PythonFormatter(features=formatter.features)
    632 if format_columns is None:
--> 633     return formatter(pa_table, query_type=query_type)
    634 elif query_type == "column":
    635     if key in format_columns:

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/datasets/formatting/formatting.py:401, in Formatter.__call__(self, pa_table, query_type)
    399     return self.format_column(pa_table)
    400 elif query_type == "batch":
--> 401     return self.format_batch(pa_table)

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/datasets/formatting/torch_formatter.py:110, in TorchFormatter.format_batch(self, pa_table)
    109 def format_batch(self, pa_table: pa.Table) -> Mapping:
--> 110     batch = self.numpy_arrow_extractor().extract_batch(pa_table)
    111     batch = self.python_features_decoder.decode_batch(batch)
    112     batch = self.recursive_tensorize(batch)

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/datasets/formatting/formatting.py:165, in NumpyArrowExtractor.extract_batch(self, pa_table)
    164 def extract_batch(self, pa_table: pa.Table) -> dict:
--> 165     return {col: self._arrow_array_to_numpy(pa_table[col]) for col in pa_table.column_names}

File ~/Desktop/Hugging Faces/transformers-course/.env/lib/python3.12/site-packages/datasets/formatting/formatting.py:197, in NumpyArrowExtractor._arrow_array_to_numpy(self, pa_array)
    191     if any(
    192         (isinstance(x, np.ndarray) and (x.dtype == object or x.shape != array[0].shape))
    193         or (isinstance(x, float) and np.isnan(x))
    194         for x in array
    195     ):
    196         return np.array(array, copy=False, dtype=object)
--> 197 return np.array(array, copy=False)

ValueError: Unable to avoid copy while creating an array as requested.
If using `np.array(obj, copy=False)` replace it with `np.asarray(obj)` to allow a copy when needed (no behavior change in NumPy 1.x).
For more details, see https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword.

Do you know any solutions to this issue? I can’t seem to proceed with building my own fine-tuned models if I’m not able to run this or evaluate the model’s accuracy.

Hello! Please add a TensorFlow option for “A full training”. Thanks!

Chapter 3 references chapter 10, but that chapter does not exist in the course. Are further chapters planned for the NLP course? HF did a fantastic job helping beginners understand NLP, and there have been significant developments recently, such as LLM apps, RAG, etc.

Where is chapter 10? It’s referenced but I didn’t find it.

@sgugger Hello Sir. For those of us who only took the course with TensorFlow, does chapter 3 stop at the ‘Fine-tuning a model with Keras’ section? How should we manage training loops and accelerators with TF, as is done in the ‘A full training’ section for PyTorch? Thank you.

The following code:
from datasets import load_dataset
raw_datasets = load_dataset("glue", "mrpc")
raw_datasets
gives me the error:
No (supported) data files or dataset script found in glue.
----> 3 raw_datasets = load_dataset("glue", "mrpc")
Any ideas how to solve this?
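One thing worth ruling out, as a guess since the post does not show the working directory: `load_dataset` looks at local paths before it goes to the Hub, so a folder or file named `glue` next to the script can shadow the Hub dataset and produce this kind of message. The sketch below only checks for such a path; `find_shadowing_path` is a hypothetical helper name.

```python
from pathlib import Path


def find_shadowing_path(dataset_name, cwd="."):
    """Return a local path that has the same name as the dataset, if any."""
    candidate = Path(cwd) / dataset_name
    return candidate if candidate.exists() else None


shadow = find_shadowing_path("glue")
if shadow is not None:
    print(f"A local path named {shadow} exists and may be picked up instead of the Hub dataset")
else:
    print("No local path named 'glue' here; the datasets version may be worth checking next")
```

If a local `glue` path turns up, renaming it or running the script from another directory is a cheap test. Otherwise, upgrading datasets as suggested earlier in this thread is the next thing to try.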