Using hyperparameter-search in Trainer

I have a question: if I want to test different learning rates, should I write "learning_rate": tune.loguniform(1e-4, 2e-5, 5e-5, 1e-5, 1e-2), or will tune.loguniform(1e-4, 1e-2) already try different learning rates?
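In other words, I am not sure whether I am supposed to list the candidate values explicitly or just give a range and let Ray Tune sample from it; roughly the difference between these two (just a sketch of what I mean):

from ray import tune

# give a range: each trial samples a learning rate between 1e-4 and 1e-2 (log scale)
hp_space_range = {"learning_rate": tune.loguniform(1e-4, 1e-2)}

# or list explicit candidates: each trial picks one of these values
hp_space_choices = {"learning_rate": tune.choice([1e-5, 2e-5, 5e-5, 1e-4, 1e-2])}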

Hello,

I am using this code to find the best parameters for my model.

from ray.tune.schedulers import PopulationBasedTraining
from random import randint, uniform

scheduler = PopulationBasedTraining(
    mode="max",
    metric="exact_match",  # or mean_accuracy
    perturbation_interval=2,
    hyperparam_mutations={
        "weight_decay": lambda: uniform(0.0, 0.3),
        "learning_rate": lambda: uniform(1e-5, 5e-5),
        "per_gpu_train_batch_size": [3, 4, 5],
        "num_train_epochs": [10, 11, 12],
        "warmup_steps": lambda: randint(0, 500),
    },
)

best_trial = trainer.hyperparameter_search(
    direction="maximize",
    backend="ray",
    n_trials=4,
    keep_checkpoints_num=2,
    scheduler=scheduler
)

However, I am getting this error. Do you have any advice?

/usr/local/lib/python3.7/dist-packages/pyarrow/io.pxi in pyarrow.lib.Buffer.__reduce_ex__()

AttributeError: module 'pickle' has no attribute 'PickleBuffer'

Some people recommend using Python 3.8 instead of Python 3.7; however, this workaround did not resolve the issue for me. I am working in Google Colab.

Thanks in advance.


I see strange behaviour when I am using a custom HP space function.
The results are the same across all trials and epochs.

default example:


from datasets import load_metric
import numpy as np

def compute_metrics(eval_preds):
    metric = load_metric("f1")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels, average='weighted')

args = TrainingArguments(
    MODEL_NAME,
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=TR_BATCH_SIZE,
    per_device_eval_batch_size=TEST_BATCH_SIZE,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model='f1',
    push_to_hub=False,
)
train_dataset = tokenized_train["train"].shard(index=1, num_shards=10) 
trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=tokenized_test['train'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")

The results are:
(screenshot of the default search results)

but when I am using a custom HP space:

def my_hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-4, 1e-2, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 3),
        "seed": trial.suggest_int("seed", 1, 40),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [1, 2, 4,6, 8]),
    }
trainer.hyperparameter_search(direction="maximize", hp_space=my_hp_space)

(screenshot of the custom search results, identical across trials)

This helped me on Google Colab:
!pip install pickle5
Then
import pickle5 as pickle
After the first run there will be the pickle warning telling you to restart the notebook, and the same error. After the second "Restart and run all" the Ray Tune hyperparameter search begins.
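Put together, the workaround is just these two steps in Colab (with the runtime restart in between, as noted above):

# 1) install the pickle protocol 5 backport, then restart the runtime when prompted
!pip install pickle5

# 2) after the restart, alias it so code looking for pickle.PickleBuffer finds it
import pickle5 as pickle
print(hasattr(pickle, "PickleBuffer"))  # should print True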

Hey @sgugger, do you know if it's possible to use cross-validation with Optuna for the hyperparameter search?
I found this, which resembles what I'm looking for. I was wondering if it is implemented inside the Trainer?
https://optuna.readthedocs.io/en/stable/reference/generated/optuna.integration.OptunaSearchCV.html
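For reference, the kind of thing I have in mind is roughly the sketch below: a manual k-fold loop inside an Optuna objective, with a fresh Trainer per fold. This reuses model_init, tokenizer and compute_metrics from the snippets above and assumes train_data is a tokenized datasets.Dataset; I don't know whether something equivalent already exists inside the Trainer.

import numpy as np
import optuna
from sklearn.model_selection import KFold
from transformers import Trainer, TrainingArguments

def objective(trial):
    lr = trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True)
    batch_size = trial.suggest_categorical("per_device_train_batch_size", [8, 16, 32])

    scores = []
    splitter = KFold(n_splits=3, shuffle=True, random_state=42)
    for fold, (train_idx, val_idx) in enumerate(splitter.split(np.arange(len(train_data)))):
        args = TrainingArguments(
            output_dir=f"cv_trial{trial.number}_fold{fold}",
            learning_rate=lr,
            per_device_train_batch_size=batch_size,
            num_train_epochs=2,
            evaluation_strategy="epoch",
            save_strategy="no",
            report_to="none",
        )
        trainer = Trainer(
            model_init=model_init,
            args=args,
            train_dataset=train_data.select(train_idx),
            eval_dataset=train_data.select(val_idx),
            tokenizer=tokenizer,
            compute_metrics=compute_metrics,
        )
        trainer.train()
        scores.append(trainer.evaluate()["eval_f1"])

    # Optuna maximizes the mean validation F1 across folds
    return float(np.mean(scores))

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10)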

Thanks !


Hi!

I'm trying to use trainer.hyperparameter_search() with a wav2vec2 model and the Ray backend, but I'm experiencing some issues. Do you see any reason why this shouldn't work for a wav2vec2 model (I notice that most previous posts concern text models rather than speech models)? Below are more details on my issue(s) and a "minimal" example for recreating it.

I have basically taken this Fine-Tune Wav2Vec2 for English ASR with :hugs: Transformers tutorial and added the hyperparameter tuning step at the end, following the Hyperparameter Search with Transformers and Ray Tune tutorial.

I initially encountered a FileNotFoundError which I don't understand, as I have quadruple-checked that the file is in the correct place and that the relative path is correctly written. I have no issues with loading the data when I'm not using hyperparameter_search:

Traceback FileNotFoundError
2022-07-08 12:51:44,934 ERROR trial_runner.py:883 -- Trial _objective_b72b4_00000: Error processing event.
Traceback (most recent call last):
  File "wav2vec2_finetuning_ASR.py", line 90, in <module>
    main(args)
  File "wav2vec2_finetuning_ASR.py", line 29, in main
    r = model.train(training_args)
  File "/home/jovyan/work/private/robustASR/robustASR/ModelWrapper.py", line 136, in train
    best_trial = self._trainer.hyperparameter_search(
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 2218, in hyperparameter_search
    best_run = backend_dict[backend](self, n_trials, direction, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/integrations.py", line 294, in run_hp_search_ray
    analysis = ray.tune.run(
  File "/opt/conda/lib/python3.8/site-packages/ray/tune/tune.py", line 718, in run
    runner.step()
  File "/opt/conda/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 778, in step
    self._wait_and_handle_event(next_trial)
  File "/opt/conda/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 755, in _wait_and_handle_event
    raise e
  File "/opt/conda/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 736, in _wait_and_handle_event
    self._on_executor_error(trial, result[ExecutorEvent.KEY_EXCEPTION])
  File "/opt/conda/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 884, in _on_executor_error
    raise e
  File "/opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 934, in get_next_executor_event
    future_result = ray.get(ready_future)
  File "/opt/conda/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/ray/worker.py", line 1831, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::ImplicitFunc.train() (pid=111865, ip=172.29.0.4, repr=_objective)
  File "/opt/conda/lib/python3.8/site-packages/ray/tune/trainable.py", line 360, in train
    result = self.step()
  File "/opt/conda/lib/python3.8/site-packages/ray/tune/function_runner.py", line 404, in step
    self._report_thread_runner_error(block=True)
  File "/opt/conda/lib/python3.8/site-packages/ray/tune/function_runner.py", line 574, in _report_thread_runner_error
    raise e
  File "/opt/conda/lib/python3.8/site-packages/ray/tune/function_runner.py", line 277, in run
    self._entrypoint()
  File "/opt/conda/lib/python3.8/site-packages/ray/tune/function_runner.py", line 349, in entrypoint
    return self._trainable_func(
  File "/opt/conda/lib/python3.8/site-packages/ray/tune/function_runner.py", line 645, in _trainable_func
    output = fn()
  File "/opt/conda/lib/python3.8/site-packages/transformers/integrations.py", line 288, in dynamic_modules_import_trainable
    return trainable(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/ray/tune/utils/trainable.py", line 409, in inner
    fn_kwargs[k] = parameter_registry.get(prefix + k)
  File "/opt/conda/lib/python3.8/site-packages/ray/tune/registry.py", line 225, in get
    return ray.get(self.references[k])
ray.exceptions.RaySystemError: System error: [Errno 2] Failed to open local file 'data/ls_clean/train.100/cache-cf0d2969e1a61b07.arrow'. Detail: [errno 2] No such file or directory
traceback: Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/ray/serialization.py", line 340, in deserialize_objects
    obj = self._deserialize_object(data, metadata, object_ref)
  File "/opt/conda/lib/python3.8/site-packages/ray/serialization.py", line 237, in _deserialize_object
    return self._deserialize_msgpack_data(data, metadata_fields)
  File "/opt/conda/lib/python3.8/site-packages/ray/serialization.py", line 192, in _deserialize_msgpack_data
    python_objects = self._deserialize_pickle5_data(pickle5_data)
  File "/opt/conda/lib/python3.8/site-packages/ray/serialization.py", line 182, in _deserialize_pickle5_data
    obj = pickle.loads(in_band)
  File "/opt/conda/lib/python3.8/site-packages/datasets/table.py", line 987, in __setstate__
    table = _memory_mapped_arrow_table_from_file(path)
  File "/opt/conda/lib/python3.8/site-packages/datasets/table.py", line 49, in _memory_mapped_arrow_table_from_file
    memory_mapped_stream = pa.memory_map(filename)
  File "pyarrow/io.pxi", line 883, in pyarrow.lib.memory_map
  File "pyarrow/io.pxi", line 844, in pyarrow.lib.MemoryMappedFile._open
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 113, in pyarrow.lib.check_status
FileNotFoundError: [Errno 2] Failed to open local file 'data/ls_clean/train.100/cache-cf0d2969e1a61b07.arrow'. Detail: [errno 2] No such file or directory
(_objective pid=111865) 2022-07-08 12:51:44,915 ERROR serialization.py:342 -- [Errno 2] Failed to open local file 'data/ls_clean/train.100/cache-cf0d2969e1a61b07.arrow'. Detail: [errno 2] No such file or directory
(_objective pid=111865) Traceback (most recent call last):
(_objective pid=111865)   File "/opt/conda/lib/python3.8/site-packages/ray/serialization.py", line 340, in deserialize_objects
(_objective pid=111865)     obj = self._deserialize_object(data, metadata, object_ref)
(_objective pid=111865)   File "/opt/conda/lib/python3.8/site-packages/ray/serialization.py", line 237, in _deserialize_object
(_objective pid=111865)     return self._deserialize_msgpack_data(data, metadata_fields)
(_objective pid=111865)   File "/opt/conda/lib/python3.8/site-packages/ray/serialization.py", line 192, in _deserialize_msgpack_data
(_objective pid=111865)     python_objects = self._deserialize_pickle5_data(pickle5_data)
(_objective pid=111865)   File "/opt/conda/lib/python3.8/site-packages/ray/serialization.py", line 182, in _deserialize_pickle5_data
(_objective pid=111865)     obj = pickle.loads(in_band)
(_objective pid=111865)   File "/opt/conda/lib/python3.8/site-packages/datasets/table.py", line 987, in __setstate__
(_objective pid=111865)     table = _memory_mapped_arrow_table_from_file(path)
(_objective pid=111865)   File "/opt/conda/lib/python3.8/site-packages/datasets/table.py", line 49, in _memory_mapped_arrow_table_from_file
(_objective pid=111865)     memory_mapped_stream = pa.memory_map(filename)
(_objective pid=111865)   File "pyarrow/io.pxi", line 883, in pyarrow.lib.memory_map
(_objective pid=111865)   File "pyarrow/io.pxi", line 844, in pyarrow.lib.MemoryMappedFile._open
(_objective pid=111865)   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
(_objective pid=111865)   File "pyarrow/error.pxi", line 113, in pyarrow.lib.check_status
(_objective pid=111865) FileNotFoundError: [Errno 2] Failed to open local file 'data/ls_clean/train.100/cache-cf0d2969e1a61b07.arrow'. Detail: [errno 2] No such file or directory
(_objective pid=111865) 2022-07-08 12:51:44,916 ERROR function_runner.py:286 -- Runner Thread raised error.
(_objective pid=111865) Traceback (most recent call last):
(_objective pid=111865)   File "/opt/conda/lib/python3.8/site-packages/ray/tune/function_runner.py", line 277, in run
(_objective pid=111865)     self._entrypoint()
(_objective pid=111865)   File "/opt/conda/lib/python3.8/site-packages/ray/tune/function_runner.py", line 349, in entrypoint
(_objective pid=111865)     return self._trainable_func(
(_objective pid=111865)   File "/opt/conda/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 462, in _resume_span
(_objective pid=111865)     return method(self, *_args, **_kwargs)
(_objective pid=111865)   File "/opt/conda/lib/python3.8/site-packages/ray/tune/function_runner.py", line 645, in _trainable_func
(_objective pid=111865)     output = fn()
(_objective pid=111865)   File "/opt/conda/lib/python3.8/site-packages/transformers/integrations.py", line 288, in dynamic_modules_import_trainable
(_objective pid=111865)     return trainable(*args, **kwargs)
(_objective pid=111865)   File "/opt/conda/lib/python3.8/site-packages/ray/tune/utils/trainable.py", line 409, in inner
(_objective pid=111865)     fn_kwargs[k] = parameter_registry.get(prefix + k)
(_objective pid=111865)   File "/opt/conda/lib/python3.8/site-packages/ray/tune/registry.py", line 225, in get
(_objective pid=111865)     return ray.get(self.references[k])
(_objective pid=111865)   File "/opt/conda/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
(_objective pid=111865)     return func(*args, **kwargs)
(_objective pid=111865)   File "/opt/conda/lib/python3.8/site-packages/ray/worker.py", line 1833, in get
(_objective pid=111865)     raise value
(_objective pid=111865) ray.exceptions.RaySystemError: System error: [Errno 2] Failed to open local file 'data/ls_clean/train.100/cache-cf0d2969e1a61b07.arrow'. Detail: [errno 2] No such file or directory
(_objective pid=111865) traceback: Traceback (most recent call last):
(_objective pid=111865)   File "/opt/conda/lib/python3.8/site-packages/ray/serialization.py", line 340, in deserialize_objects
(_objective pid=111865)     obj = self._deserialize_object(data, metadata, object_ref)
(_objective pid=111865)   File "/opt/conda/lib/python3.8/site-packages/ray/serialization.py", line 237, in _deserialize_object
(_objective pid=111865)     return self._deserialize_msgpack_data(data, metadata_fields)
(_objective pid=111865)   File "/opt/conda/lib/python3.8/site-packages/ray/serialization.py", line 192, in _deserialize_msgpack_data
(_objective pid=111865)     python_objects = self._deserialize_pickle5_data(pickle5_data)
(_objective pid=111865)   File "/opt/conda/lib/python3.8/site-packages/ray/serialization.py", line 182, in _deserialize_pickle5_data
(_objective pid=111865)     obj = pickle.loads(in_band)
(_objective pid=111865)   File "/opt/conda/lib/python3.8/site-packages/datasets/table.py", line 987, in __setstate__
(_objective pid=111865)     table = _memory_mapped_arrow_table_from_file(path)
(_objective pid=111865)   File "/opt/conda/lib/python3.8/site-packages/datasets/table.py", line 49, in _memory_mapped_arrow_table_from_file
(_objective pid=111865)     memory_mapped_stream = pa.memory_map(filename)
(_objective pid=111865)   File "pyarrow/io.pxi", line 883, in pyarrow.lib.memory_map
(_objective pid=111865)   File "pyarrow/io.pxi", line 844, in pyarrow.lib.MemoryMappedFile._open
(_objective pid=111865)   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
(_objective pid=111865)   File "pyarrow/error.pxi", line 113, in pyarrow.lib.check_status
(_objective pid=111865) FileNotFoundError: [Errno 2] Failed to open local file 'data/ls_clean/train.100/cache-cf0d2969e1a61b07.arrow'. Detail: [errno 2] No such file or directory

When I try to recreate this error on another system, I instead get this TypeError saying that schedulers can't be pickled. The thing is that I'm not using any scheduler, so I don't see where this is coming from.

Traceback TypeError
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [67], in <cell line: 1>()
----> 1 trainer.hyperparameter_search(direction="maximize",backend="ray",n_trials=5,fail_fast="raise",resources_per_trial={'cpu':1})

File /usr/local/lib/python3.8/dist-packages/transformers/trainer.py:2085, in Trainer.hyperparameter_search(self, hp_space, compute_objective, n_trials, direction, backend, hp_name, **kwargs)
   2077 self.compute_objective = default_compute_objective if compute_objective is None else compute_objective
   2079 backend_dict = {
   2080     HPSearchBackend.OPTUNA: run_hp_search_optuna,
   2081     HPSearchBackend.RAY: run_hp_search_ray,
   2082     HPSearchBackend.SIGOPT: run_hp_search_sigopt,
   2083     HPSearchBackend.WANDB: run_hp_search_wandb,
   2084 }
-> 2085 best_run = backend_dict[backend](self, n_trials, direction, **kwargs)
   2087 self.hp_search_backend = None
   2088 return best_run

File /usr/local/lib/python3.8/dist-packages/transformers/integrations.py:268, in run_hp_search_ray(trainer, n_trials, direction, **kwargs)
    256     if isinstance(
    257         kwargs["scheduler"], (ASHAScheduler, MedianStoppingRule, HyperBandForBOHB, PopulationBasedTraining)
    258     ) and (not trainer.args.do_eval or trainer.args.evaluation_strategy == IntervalStrategy.NO):
    259         raise RuntimeError(
    260             "You are using {cls} as a scheduler but you haven't enabled evaluation during training. "
    261             "This means your trials will not report intermediate results to Ray Tune, and "
   (...)
    265             "Trainer `args`.".format(cls=type(kwargs["scheduler"]).__name__)
    266         )
--> 268 trainable = ray.tune.with_parameters(_objective, local_trainer=trainer)
    270 @functools.wraps(trainable)
    271 def dynamic_modules_import_trainable(*args, **kwargs):
    272     """
    273     Wrapper around `tune.with_parameters` to ensure datasets_modules are loaded on each Actor.
    274 
   (...)
    277     Assumes that `_objective`, defined above, is a function.
    278     """

File /usr/local/lib/python3.8/dist-packages/ray/tune/utils/trainable.py:348, in with_parameters(trainable, **kwargs)
    346 prefix = f"{str(trainable)}_"
    347 for k, v in kwargs.items():
--> 348     parameter_registry.put(prefix + k, v)
    350 trainable_name = getattr(trainable, "__name__", "tune_with_parameters")
    352 if inspect.isclass(trainable):
    353     # Class trainable

File /usr/local/lib/python3.8/dist-packages/ray/tune/registry.py:208, in _ParameterRegistry.put(self, k, v)
    206 self.to_flush[k] = v
    207 if ray.is_initialized():
--> 208     self.flush()

File /usr/local/lib/python3.8/dist-packages/ray/tune/registry.py:220, in _ParameterRegistry.flush(self)
    218         self.references[k] = v
    219     else:
--> 220         self.references[k] = ray.put(v)
    221 self.to_flush.clear()

File /usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py:105, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
    103     if func.__name__ != "init" or is_client_mode_enabled_by_default:
    104         return getattr(ray, func.__name__)(*args, **kwargs)
--> 105 return func(*args, **kwargs)

File /usr/local/lib/python3.8/dist-packages/ray/worker.py:1872, in put(value, _owner)
   1870 with profiling.profile("ray.put"):
   1871     try:
-> 1872         object_ref = worker.put_object(value, owner_address=serialize_owner_address)
   1873     except ObjectStoreFullError:
   1874         logger.info(
   1875             "Put failed since the value was either too large or the "
   1876             "store was full of pinned objects."
   1877         )

File /usr/local/lib/python3.8/dist-packages/ray/worker.py:305, in Worker.put_object(self, value, object_ref, owner_address)
    300 if self.mode == LOCAL_MODE:
    301     assert (
    302         object_ref is None
    303     ), "Local Mode does not support inserting with an ObjectRef"
--> 305 serialized_value = self.get_serialization_context().serialize(value)
    306 # This *must* be the first place that we construct this python
    307 # ObjectRef because an entry with 0 local references is created when
    308 # the object is Put() in the core worker, expecting that this python
    309 # reference will be created. If another reference is created and
    310 # removed before this one, it will corrupt the state in the
    311 # reference counter.
    312 return ray.ObjectRef(
    313     self.core_worker.put_serialized_object_and_increment_local_ref(
    314         serialized_value, object_ref=object_ref, owner_address=owner_address
   (...)
    317     skip_adding_local_ref=True,
    318 )

File /usr/local/lib/python3.8/dist-packages/ray/serialization.py:413, in SerializationContext.serialize(self, value)
    411     return RawSerializedObject(value)
    412 else:
--> 413     return self._serialize_to_msgpack(value)

File /usr/local/lib/python3.8/dist-packages/ray/serialization.py:391, in SerializationContext._serialize_to_msgpack(self, value)
    389 if python_objects:
    390     metadata = ray_constants.OBJECT_METADATA_TYPE_PYTHON
--> 391     pickle5_serialized_object = self._serialize_to_pickle5(
    392         metadata, python_objects
    393     )
    394 else:
    395     pickle5_serialized_object = None

File /usr/local/lib/python3.8/dist-packages/ray/serialization.py:353, in SerializationContext._serialize_to_pickle5(self, metadata, value)
    351 except Exception as e:
    352     self.get_and_clear_contained_object_refs()
--> 353     raise e
    354 finally:
    355     self.set_out_of_band_serialization()

File /usr/local/lib/python3.8/dist-packages/ray/serialization.py:348, in SerializationContext._serialize_to_pickle5(self, metadata, value)
    346 try:
    347     self.set_in_band_serialization()
--> 348     inband = pickle.dumps(
    349         value, protocol=5, buffer_callback=writer.buffer_callback
    350     )
    351 except Exception as e:
    352     self.get_and_clear_contained_object_refs()

File /usr/local/lib/python3.8/dist-packages/ray/cloudpickle/cloudpickle_fast.py:73, in dumps(obj, protocol, buffer_callback)
     69 with io.BytesIO() as file:
     70     cp = CloudPickler(
     71         file, protocol=protocol, buffer_callback=buffer_callback
     72     )
---> 73     cp.dump(obj)
     74     return file.getvalue()

File /usr/local/lib/python3.8/dist-packages/ray/cloudpickle/cloudpickle_fast.py:620, in CloudPickler.dump(self, obj)
    618 def dump(self, obj):
    619     try:
--> 620         return Pickler.dump(self, obj)
    621     except RuntimeError as e:
    622         if "recursion" in e.args[0]:

File /usr/local/lib/python3.8/dist-packages/apscheduler/schedulers/base.py:90, in BaseScheduler.__getstate__(self)
     89 def __getstate__(self):
---> 90     raise TypeError("Schedulers cannot be serialized. Ensure that you are not passing a "
     91                     "scheduler instance as an argument to a job, or scheduling an instance "
     92                     "method where the instance contains a scheduler as an attribute.")

TypeError: Schedulers cannot be serialized. Ensure that you are not passing a scheduler instance as an argument to a job, or scheduling an instance method where the instance contains a scheduler as an attribute.

Example code for recreation
from datasets import load_metric , load_dataset , load_from_disk
import numpy as np

from transformers import (
    Wav2Vec2ForCTC,
    TrainingArguments,
    Trainer,
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Processor
)

import torch , json

from dataclasses import dataclass
from typing import Dict, List, Optional, Union



@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor (:class:`~transformers.Wav2Vec2Processor`)
            The processor used for processing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence if provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
              different lengths).
        max_length (:obj:`int`, `optional`):
            Maximum length of the ``input_values`` of the returned list and optionally padding length (see above).
        max_length_labels (:obj:`int`, `optional`):
            Maximum length of the ``labels`` returned list and optionally padding length (see above).
        pad_to_multiple_of (:obj:`int`, `optional`):
            If set will pad the sequence to a multiple of the provided value.
            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
            7.5 (Volta).
    """

    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True
    max_length: Optional[int] = None
    max_length_labels: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    pad_to_multiple_of_labels: Optional[int] = None

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=self.padding,
                max_length=self.max_length_labels,
                pad_to_multiple_of=self.pad_to_multiple_of_labels,
                return_tensors="pt",
            )

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch

df = load_dataset('librispeech_asr','clean')

#df.save_to_disk('data/ls_clean')
#df = load_from_disk("data/ls_clean")

df = df.remove_columns(["id", "chapter_id", "speaker_id"])

def extract_all_chars(batch):
    all_text = " ".join(batch["text"])
    vocab = list(set(all_text))
    return {"vocab": [vocab], "all_text": [all_text]}

vocabs = df.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=df.column_names["test"])

vocab_list = list(set(vocabs["train.100"]["vocab"][0]) | set(vocabs["test"]["vocab"][0]) | set(vocabs["train.360"]["vocab"][0]) | set(vocabs["validation"]["vocab"][0]))

vocab_dict = {v: k for k, v in enumerate(vocab_list)}
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)

with open('vocab_test.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file)

tokenizer = Wav2Vec2CTCTokenizer("./vocab_test.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")

feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=False)

processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

def prepare_dataset(batch):
    audio = batch["audio"]

    # batched output is "un-batched" to ensure mapping is correct
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    
    with processor.as_target_processor():
        batch["labels"] = processor(batch["text"]).input_ids
    return batch

df = df.map(prepare_dataset, remove_columns=df.column_names["test"], num_proc=4)

data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)
wer_metric = load_metric("wer")

def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer = wer_metric.compute(predictions=pred_str, references=label_str)

    # the Trainer expects compute_metrics to return a dict of named metrics
    return {"wer": wer}


def model_init():
    return Wav2Vec2ForCTC.from_pretrained(
        "facebook/wav2vec2-base", 
        ctc_loss_reduction="mean", 
        pad_token_id=processor.tokenizer.pad_token_id,
    )
    #model.freeze_feature_encoder()
    #return model


training_args = TrainingArguments(
  output_dir='/raytest',
  group_by_length=True,
  per_device_train_batch_size=8,
  per_device_eval_batch_size=8,
  evaluation_strategy="steps",
  num_train_epochs=20,
  fp16=True,
  save_steps=500,
  eval_steps=500,
  logging_steps=500,
  learning_rate=1e-4,
  weight_decay=0.005,
  warmup_steps=1000,
  save_total_limit=2,
)


trainer = Trainer(
    args=training_args,
    tokenizer=processor.feature_extractor,
    train_dataset=df["train.100"],
    eval_dataset=df["test"],
    data_collator=data_collator,
    model_init=model_init,
    compute_metrics=compute_metrics,
)

trainer.hyperparameter_search(
    direction="minimize",
    backend="ray",
    n_trials=5,
    fail_fast="raise",
    resources_per_trial={'cpu':1}
    )

Thankful for any response.

Here's a recent blog post by @matteopilotto about using W&B Sweeps with HF transformers: http://wandb.me/hf-sweeps

You can use hyperparameter_search(backend='wandb', ...), or you can use the W&B logger and let Sweeps control the search; you also get a plot in the W&B UI to understand your metrics.

To use W&B Sweeps, you define a config with your search params and then create the sweep with wandb.sweep(config, project='your-project-name').

def train(config=None):
  with wandb.init(config=config):
    # set sweep configuration
    config = wandb.config
    training_args = TrainingArguments(
        output_dir='vit-sweeps',
        report_to='wandb',  # Turn on Weights & Biases logging
        num_train_epochs=config.epochs,
        learning_rate=config.learning_rate,
        weight_decay=config.weight_decay,
        per_device_train_batch_size=config.batch_size,
        ...
    )
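For completeness, the config and agent side could look roughly like this (the parameter names just mirror the train function above; the metric name and distributions are only an example, adapt them to what your Trainer actually logs):

import wandb

sweep_config = {
    "method": "random",  # or "grid" / "bayes"
    "metric": {"name": "eval/accuracy", "goal": "maximize"},
    "parameters": {
        "epochs": {"values": [2, 3, 4]},
        "batch_size": {"values": [8, 16, 32]},
        "learning_rate": {"distribution": "log_uniform_values", "min": 1e-5, "max": 5e-5},
        "weight_decay": {"values": [0.0, 0.01, 0.1]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="your-project-name")
# run the train() function defined above once per sampled configuration
wandb.agent(sweep_id, train, count=10)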


The strange results actually come from the network being unable to learn anything because of the learning rate, which is very high in your case, as you can see.

Transformers need a much lower fine-tuning learning rate (e.g. 5e-5).
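For example, something in this range (just a sketch adapted from your custom space; the exact bounds are a judgment call) should behave much more like the default search:

def my_hp_space(trial):
    return {
        # keep the learning rate in a typical fine-tuning range instead of 1e-4 to 1e-2
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 3),
        "seed": trial.suggest_int("seed", 1, 40),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [4, 8, 16]),
    }

trainer.hyperparameter_search(direction="maximize", hp_space=my_hp_space)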

Hi @scottire

I'm new to using wandb sweeps with transformers. I tried to use the example in the blog post and adjust it to multi-class text classification. However, I always run into two errors:

- With a data collator function:
Run b4wuzajc errored: TypeError('expected Tensor as element 0 in argument 0, but got str')

- Without a data collator function:
ValueError("Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (text_clean in this case) have excessive nesting (inputs type list where type int is expected).")

Here is the notebook. Happy to share some example data.
Would be really grateful if someone can tell me what's going on! Thanks guys.


Hi everyone,

I fixed it. There was only one line of code missing:

tokenized_datasets = tokenized_datasets.remove_columns(["text_clean", "target", "label_name"])

Just had to remove the unwanted columns from the tokenized dataset.


Hello, I am getting this error when I want to use hyperparameter search:

File "/uw/test/expe_5/expe_5/traitements1/entrainement_test.py", line 553, in <module>
trainer, outdir = prepare_fine_tuning(PRE_TRAINED_MODEL_NAME, train_dataset, val_dataset, tokenizer, sigle, train_name, datatype)
File "/uw/test/expe_5/expe_5/traitements1/entrainement_test.py", line 402, in prepare_fine_tuning
trainer = Trainer(
File "/uw/.conda/envs/bert/lib/python3.9/site-packages/transformers/trainer.py", line 366, in __init__
model = model.to(args.device)
AttributeError: 'function' object has no attribute 'to'

def model_init():
    set_seed = 42
    num_labels = 3
    return CamembertForSequenceClassification.from_pretrained(PRE_TRAINED_MODEL_NAME, num_labels=num_labels)


def my_hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-4, 1e-2, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 5),
        "seed": trial.suggest_int("seed", 1, 40),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [4, 8, 16, 32, 64]),
    }


trainer = Trainer(
    model=model_init,                          # the instantiated 🤗 Transformers model to be trained
    args=training_args,                        # training arguments, defined above
    train_dataset=train_dataset,               # training dataset
    eval_dataset=val_dataset,                  # evaluation dataset
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(3, 0.0)], # early stopping if results don't improve after 3 epochs
    compute_metrics=compute_metrics,
)

best_run = trainer.hyperparameter_search(
    direction="maximize",
    n_trials=5,
    keep_checkpoints_num=1,
    hp_space=my_hp_space,
)

Use model_init as a keyword rather than model in Trainer.

Don't do this: Trainer(model=model_init)
Do this: Trainer(model_init=model_init)
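i.e. keeping the rest of your setup as it is, roughly (I left out keep_checkpoints_num, which as far as I know is a Ray Tune argument rather than an Optuna one):

trainer = Trainer(
    model_init=model_init,                     # re-instantiated for every trial
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(3, 0.0)],
    compute_metrics=compute_metrics,
)

best_run = trainer.hyperparameter_search(
    direction="maximize",
    n_trials=5,
    hp_space=my_hp_space,
)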

Is it at all possible to pass the model itself to the hp_space? For example, if you want to test out different models such as "bert-base-uncased", "roberta-base", "mpnet-base-v2", etc.

I currently have it set up like this:

epochs = trial.suggest_categorical("epochs", EPOCHS)
batch_size = trial.suggest_categorical("batch_size", BATCH_SIZE)
learning_rate = trial.suggest_categorical("learning_rate", LEARNING_RATES)
scheduler = trial.suggest_categorical("scheduler", SCHEDULERS)
model_name = trial.suggest_categorical("model_name", MODEL_NAMES)

hp_space = {
    "model_name": model_name,
    "batch_size": batch_size,
    "learning_rate": learning_rate,
    "scheduler": scheduler,
    "epochs": epochs,
}

I am not sure how to pass this in correctly to the Trainer.
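The closest thing I can think of is letting model_init itself read from the trial; I believe the Trainer passes the Optuna trial to model_init when it accepts one argument, but I'm not certain this is the intended way (MODEL_NAMES, LEARNING_RATES, EPOCHS and BATCH_SIZE are the lists from my snippet above):

from transformers import AutoModelForSequenceClassification

def model_init(trial):
    # the Trainer passes the current trial when model_init takes one argument;
    # trial is None for the initial model instantiation, so fall back to a default
    if trial is None:
        model_name = MODEL_NAMES[0]
    else:
        model_name = trial.suggest_categorical("model_name", MODEL_NAMES)
    return AutoModelForSequenceClassification.from_pretrained(model_name)

def hp_space(trial):
    # only TrainingArguments fields can go in the hp_space dict itself
    return {
        "learning_rate": trial.suggest_categorical("learning_rate", LEARNING_RATES),
        "num_train_epochs": trial.suggest_categorical("num_train_epochs", EPOCHS),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", BATCH_SIZE),
    }

# trainer is built with Trainer(model_init=model_init, ...)
best_run = trainer.hyperparameter_search(direction="maximize", hp_space=hp_space, n_trials=10)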

EDIT: It seems that indeed (as mentioned in one of your comments above) just passing the wandb backend instead of ray worked, i.e.:

trainer.hyperparameter_search(
    direction="maximize",
    backend="wandb",
    n_trials=4,
    keep_checkpoints_num=1,
    scheduler=get_scheduler())

I tested the provided example and managed to run it successfully.

However, I do not completely understand how you would integrate W&B Sweeps with a PopulationBasedTraining() scheduler, for example.

Could you shed some light on that please?

Thank you.

*** I did read this article: Weights & Biases, but in that case I only see how the configuration for the different hyperparameter strategies is defined, not how to integrate it with wandb.config() and wandb.agent() as in the example you posted.

To be more precise:

from ray import tune
from ray.tune.schedulers import PopulationBasedTraining



def get_scheduler():
    #Creating the PBT scheduler
    scheduler = PopulationBasedTraining(
        mode = "max",
        metric='eval_f1',
        perturbation_interval=2,
        hyperparam_mutations={
            "weight_decay": tune.uniform(0.0, 0.3),
            "learning_rate": tune.uniform(1e-5, 5e-5),
            "per_device_train_batch_size": tune.choice([8, 16, 24, 32, 48]),
            "num_train_epochs": tune.choice([5]),
            "warmup_steps": tune.choice(range(0, 500))
        }
    )
    return scheduler


...



trainer.hyperparameter_search(
    direction="maximize",
    backend="ray",
    n_trials=4,
    keep_checkpoints_num=1,
    scheduler=get_scheduler())

How should I integrate W&B for Sweeps analysis?

Hey @Calin, right now it doesn't look like you can use 2 backends with the Trainer's hyperparameter_search.

Maybe you can try setting report_to = "wandb" in the TrainingArguments?
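i.e. roughly (just the relevant arguments; everything else as in your setup):

training_args = TrainingArguments(
    output_dir="pbt_with_wandb",   # hypothetical output directory
    evaluation_strategy="steps",
    eval_steps=500,
    report_to="wandb",             # log each Ray Tune trial to Weights & Biases
)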


That worked indeed. The only issue is that I cannot manage to name my wandb sweep ID like in the tutorial, meaning that all the training runs fall into one "uncategorized" directory in my W&B project.

Have you tried setting the project name with the WANDB_PROJECT environment variable?

WANDB_PROJECT=amazon_sentiment_analysis
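For example, from Python it would be something like this (it has to be set before the W&B run is created):

import os

# set the W&B project before the Trainer / wandb run is initialized
os.environ["WANDB_PROJECT"] = "amazon_sentiment_analysis"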

Hi, I am able to manually set up cross-validation as well as hyperparameter tuning with Optuna separately, but I am having difficulty figuring out how to perform HPO with CV. Any help is appreciated. Thanks.

Do you have any answer? I have the same question as you.

Did you manage to figure this out? I'm also having trouble making it work.