I have a question: if I want to test different learning rates, should I write "learning_rate": tune.loguniform(1e-4, 2e-5, 5e-5, 1e-5, 1e-2), or will tune.loguniform(1e-4, 1e-2) already try different learning rates?
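For reference, here is a minimal sketch (not part of the original question) of how the two intents are usually expressed with Ray Tune: tune.loguniform(low, high) samples a continuous range on a log scale and takes only a lower and upper bound, while a fixed list of candidate values would go through tune.choice instead.

from ray import tune

# Continuous search space: every trial samples a learning rate
# log-uniformly between the two bounds.
continuous_space = {"learning_rate": tune.loguniform(1e-5, 1e-2)}

# Discrete search space: every trial picks one value from the list.
discrete_space = {"learning_rate": tune.choice([1e-5, 2e-5, 5e-5, 1e-4, 1e-2])}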
Hello,
I am using this code to find the best parameters for my model.
from ray.tune.schedulers import PopulationBasedTraining
from ray.tune import uniform
from random import randint

scheduler = PopulationBasedTraining(
    mode="max",
    metric="exact_match",  # mean_accuracy
    perturbation_interval=2,
    hyperparam_mutations={
        "weight_decay": lambda: uniform(0.0, 0.3),
        "learning_rate": lambda: uniform(1e-5, 5e-5),
        "per_gpu_train_batch_size": [3, 4, 5],
        "num_train_epochs": [10, 11, 12],
        "warmup_steps": lambda: randint(0, 500),
    }
)
best_trial = trainer.hyperparameter_search(
    direction="maximize",
    backend="ray",
    n_trials=4,
    keep_checkpoints_num=2,
    scheduler=scheduler,
)
However, I am getting this error. Do you have any advice?
/usr/local/lib/python3.7/dist-packages/pyarrow/io.pxi in pyarrow.lib.Buffer.__reduce_ex__()
AttributeError: module 'pickle' has no attribute 'PickleBuffer'
Some people recommend using Python 3.8 instead of Python 3.7; however, this workaround did not resolve the issue for me. I am working in Google Colab.
Thanks in advance.
I see strange behaviour when I am using a custom HP space function: the results are the same on all trials and epochs.
Default example:
def compute_metrics(eval_preds):
    metric = load_metric("f1")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    # evaluate(labels, predictions)
    return metric.compute(predictions=predictions, references=labels, average='weighted')
args = TrainingArguments(
    MODEL_NAME,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=TR_BATCH_SIZE,
    per_device_eval_batch_size=TEST_BATCH_SIZE,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model='f1',
    push_to_hub=False,
)
train_dataset = tokenized_train["train"].shard(index=1, num_shards=10)

trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=tokenized_test['train'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")
The results are:
but when I am using a custom HP space:
def my_hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-4, 1e-2, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 3),
        "seed": trial.suggest_int("seed", 1, 40),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [1, 2, 4, 6, 8]),
    }
trainer.hyperparameter_search(direction="maximize", hp_space=my_hp_space)
This helped me on Google Colab:
!pip install pickle5
Then
import pickle5 as pickle
After the first run there will be the pickle warning to restart the notebook and the same error. After the second "Restart and run all" the Ray Tune hyperparameter search begins.
Hey @sgugger, do you know if it's possible to use cross-validation with Optuna for the hyperparameter search?
I found this, which resembles what I'm looking for. I was wondering if it is implemented inside the Trainer?
https://optuna.readthedocs.io/en/stable/reference/generated/optuna.integration.OptunaSearchCV.html
Thanks !
Hi!
I'm trying to use trainer.hyperparameter_search() with a wav2vec2 model and the Ray backend, but I'm experiencing some issues. Do you see any reason why this shouldn't work for a wav2vec2 model (I notice that most previous posts concern text models rather than speech models)? Below are more details on my issue(s) and a "minimal" example for recreating it.
I have basically taken this Fine-Tune Wav2Vec2 for English ASR with Transformers tutorial and added the hyperparameter tuning step at the end, following the Hyperparameter Search with Transformers and Ray Tune tutorial.
I initially encountered a FileNotFoundError which I don't understand, as I have quadruple-checked that the file is in the correct place and that the relative path is correctly written. I have no issues loading the data when I'm not using hyperparameter_search:
Traceback FileNotFoundError
2022-07-08 12:51:44,934 ERROR trial_runner.py:883 -- Trial _objective_b72b4_00000: Error processing event.
Traceback (most recent call last):
File "wav2vec2_finetuning_ASR.py", line 90, in <module>
main(args)
File "wav2vec2_finetuning_ASR.py", line 29, in main
r = model.train(training_args)
File "/home/jovyan/work/private/robustASR/robustASR/ModelWrapper.py", line 136, in train
best_trial = self._trainer.hyperparameter_search(
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 2218, in hyperparameter_search
best_run = backend_dict[backend](self, n_trials, direction, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/integrations.py", line 294, in run_hp_search_ray
analysis = ray.tune.run(
File "/opt/conda/lib/python3.8/site-packages/ray/tune/tune.py", line 718, in run
runner.step()
File "/opt/conda/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 778, in step
self._wait_and_handle_event(next_trial)
File "/opt/conda/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 755, in _wait_and_handle_event
raise e
File "/opt/conda/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 736, in _wait_and_handle_event
self._on_executor_error(trial, result[ExecutorEvent.KEY_EXCEPTION])
File "/opt/conda/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 884, in _on_executor_error
raise e
File "/opt/conda/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 934, in get_next_executor_event
future_result = ray.get(ready_future)
File "/opt/conda/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/ray/worker.py", line 1831, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::ImplicitFunc.train() (pid=111865, ip=172.29.0.4, repr=_objective)
File "/opt/conda/lib/python3.8/site-packages/ray/tune/trainable.py", line 360, in train
result = self.step()
File "/opt/conda/lib/python3.8/site-packages/ray/tune/function_runner.py", line 404, in step
self._report_thread_runner_error(block=True)
File "/opt/conda/lib/python3.8/site-packages/ray/tune/function_runner.py", line 574, in _report_thread_runner_error
raise e
File "/opt/conda/lib/python3.8/site-packages/ray/tune/function_runner.py", line 277, in run
self._entrypoint()
File "/opt/conda/lib/python3.8/site-packages/ray/tune/function_runner.py", line 349, in entrypoint
return self._trainable_func(
File "/opt/conda/lib/python3.8/site-packages/ray/tune/function_runner.py", line 645, in _trainable_func
output = fn()
File "/opt/conda/lib/python3.8/site-packages/transformers/integrations.py", line 288, in dynamic_modules_import_trainable
return trainable(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/ray/tune/utils/trainable.py", line 409, in inner
fn_kwargs[k] = parameter_registry.get(prefix + k)
File "/opt/conda/lib/python3.8/site-packages/ray/tune/registry.py", line 225, in get
return ray.get(self.references[k])
ray.exceptions.RaySystemError: System error: [Errno 2] Failed to open local file 'data/ls_clean/train.100/cache-cf0d2969e1a61b07.arrow'. Detail: [errno 2] No such file or directory
traceback: Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/ray/serialization.py", line 340, in deserialize_objects
obj = self._deserialize_object(data, metadata, object_ref)
File "/opt/conda/lib/python3.8/site-packages/ray/serialization.py", line 237, in _deserialize_object
return self._deserialize_msgpack_data(data, metadata_fields)
File "/opt/conda/lib/python3.8/site-packages/ray/serialization.py", line 192, in _deserialize_msgpack_data
python_objects = self._deserialize_pickle5_data(pickle5_data)
File "/opt/conda/lib/python3.8/site-packages/ray/serialization.py", line 182, in _deserialize_pickle5_data
obj = pickle.loads(in_band)
File "/opt/conda/lib/python3.8/site-packages/datasets/table.py", line 987, in __setstate__
table = _memory_mapped_arrow_table_from_file(path)
File "/opt/conda/lib/python3.8/site-packages/datasets/table.py", line 49, in _memory_mapped_arrow_table_from_file
memory_mapped_stream = pa.memory_map(filename)
File "pyarrow/io.pxi", line 883, in pyarrow.lib.memory_map
File "pyarrow/io.pxi", line 844, in pyarrow.lib.MemoryMappedFile._open
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 113, in pyarrow.lib.check_status
FileNotFoundError: [Errno 2] Failed to open local file 'data/ls_clean/train.100/cache-cf0d2969e1a61b07.arrow'. Detail: [errno 2] No such file or directory
(_objective pid=111865) 2022-07-08 12:51:44,915 ERROR serialization.py:342 -- [Errno 2] Failed to open local file 'data/ls_clean/train.100/cache-cf0d2969e1a61b07.arrow'. Detail: [errno 2] No such file or directory
(_objective pid=111865) Traceback (most recent call last):
(_objective pid=111865) File "/opt/conda/lib/python3.8/site-packages/ray/serialization.py", line 340, in deserialize_objects
(_objective pid=111865) obj = self._deserialize_object(data, metadata, object_ref)
(_objective pid=111865) File "/opt/conda/lib/python3.8/site-packages/ray/serialization.py", line 237, in _deserialize_object
(_objective pid=111865) return self._deserialize_msgpack_data(data, metadata_fields)
(_objective pid=111865) File "/opt/conda/lib/python3.8/site-packages/ray/serialization.py", line 192, in _deserialize_msgpack_data
(_objective pid=111865) python_objects = self._deserialize_pickle5_data(pickle5_data)
(_objective pid=111865) File "/opt/conda/lib/python3.8/site-packages/ray/serialization.py", line 182, in _deserialize_pickle5_data
(_objective pid=111865) obj = pickle.loads(in_band)
(_objective pid=111865) File "/opt/conda/lib/python3.8/site-packages/datasets/table.py", line 987, in __setstate__
(_objective pid=111865) table = _memory_mapped_arrow_table_from_file(path)
(_objective pid=111865) File "/opt/conda/lib/python3.8/site-packages/datasets/table.py", line 49, in _memory_mapped_arrow_table_from_file
(_objective pid=111865) memory_mapped_stream = pa.memory_map(filename)
(_objective pid=111865) File "pyarrow/io.pxi", line 883, in pyarrow.lib.memory_map
(_objective pid=111865) File "pyarrow/io.pxi", line 844, in pyarrow.lib.MemoryMappedFile._open
(_objective pid=111865) File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
(_objective pid=111865) File "pyarrow/error.pxi", line 113, in pyarrow.lib.check_status
(_objective pid=111865) FileNotFoundError: [Errno 2] Failed to open local file 'data/ls_clean/train.100/cache-cf0d2969e1a61b07.arrow'. Detail: [errno 2] No such file or directory
(_objective pid=111865) 2022-07-08 12:51:44,916 ERROR function_runner.py:286 -- Runner Thread raised error.
(_objective pid=111865) Traceback (most recent call last):
(_objective pid=111865) File "/opt/conda/lib/python3.8/site-packages/ray/tune/function_runner.py", line 277, in run
(_objective pid=111865) self._entrypoint()
(_objective pid=111865) File "/opt/conda/lib/python3.8/site-packages/ray/tune/function_runner.py", line 349, in entrypoint
(_objective pid=111865) return self._trainable_func(
(_objective pid=111865) File "/opt/conda/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 462, in _resume_span
(_objective pid=111865) return method(self, *_args, **_kwargs)
(_objective pid=111865) File "/opt/conda/lib/python3.8/site-packages/ray/tune/function_runner.py", line 645, in _trainable_func
(_objective pid=111865) output = fn()
(_objective pid=111865) File "/opt/conda/lib/python3.8/site-packages/transformers/integrations.py", line 288, in dynamic_modules_import_trainable
(_objective pid=111865) return trainable(*args, **kwargs)
(_objective pid=111865) File "/opt/conda/lib/python3.8/site-packages/ray/tune/utils/trainable.py", line 409, in inner
(_objective pid=111865) fn_kwargs[k] = parameter_registry.get(prefix + k)
(_objective pid=111865) File "/opt/conda/lib/python3.8/site-packages/ray/tune/registry.py", line 225, in get
(_objective pid=111865) return ray.get(self.references[k])
(_objective pid=111865) File "/opt/conda/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
(_objective pid=111865) return func(*args, **kwargs)
(_objective pid=111865) File "/opt/conda/lib/python3.8/site-packages/ray/worker.py", line 1833, in get
(_objective pid=111865) raise value
(_objective pid=111865) ray.exceptions.RaySystemError: System error: [Errno 2] Failed to open local file 'data/ls_clean/train.100/cache-cf0d2969e1a61b07.arrow'. Detail: [errno 2] No such file or directory
(_objective pid=111865) traceback: Traceback (most recent call last):
(_objective pid=111865) File "/opt/conda/lib/python3.8/site-packages/ray/serialization.py", line 340, in deserialize_objects
(_objective pid=111865) obj = self._deserialize_object(data, metadata, object_ref)
(_objective pid=111865) File "/opt/conda/lib/python3.8/site-packages/ray/serialization.py", line 237, in _deserialize_object
(_objective pid=111865) return self._deserialize_msgpack_data(data, metadata_fields)
(_objective pid=111865) File "/opt/conda/lib/python3.8/site-packages/ray/serialization.py", line 192, in _deserialize_msgpack_data
(_objective pid=111865) python_objects = self._deserialize_pickle5_data(pickle5_data)
(_objective pid=111865) File "/opt/conda/lib/python3.8/site-packages/ray/serialization.py", line 182, in _deserialize_pickle5_data
(_objective pid=111865) obj = pickle.loads(in_band)
(_objective pid=111865) File "/opt/conda/lib/python3.8/site-packages/datasets/table.py", line 987, in __setstate__
(_objective pid=111865) table = _memory_mapped_arrow_table_from_file(path)
(_objective pid=111865) File "/opt/conda/lib/python3.8/site-packages/datasets/table.py", line 49, in _memory_mapped_arrow_table_from_file
(_objective pid=111865) memory_mapped_stream = pa.memory_map(filename)
(_objective pid=111865) File "pyarrow/io.pxi", line 883, in pyarrow.lib.memory_map
(_objective pid=111865) File "pyarrow/io.pxi", line 844, in pyarrow.lib.MemoryMappedFile._open
(_objective pid=111865) File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
(_objective pid=111865) File "pyarrow/error.pxi", line 113, in pyarrow.lib.check_status
(_objective pid=111865) FileNotFoundError: [Errno 2] Failed to open local file 'data/ls_clean/train.100/cache-cf0d2969e1a61b07.arrow'. Detail: [errno 2] No such file or directory
When I try to recreate this error on another system, I instead get this TypeError saying that schedulers can't be pickled. The thing is that I'm not using any scheduler, so I don't see where this is coming from.
Traceback TypeError
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [67], in <cell line: 1>()
----> 1 trainer.hyperparameter_search(direction="maximize",backend="ray",n_trials=5,fail_fast="raise",resources_per_trial={'cpu':1})
File /usr/local/lib/python3.8/dist-packages/transformers/trainer.py:2085, in Trainer.hyperparameter_search(self, hp_space, compute_objective, n_trials, direction, backend, hp_name, **kwargs)
2077 self.compute_objective = default_compute_objective if compute_objective is None else compute_objective
2079 backend_dict = {
2080 HPSearchBackend.OPTUNA: run_hp_search_optuna,
2081 HPSearchBackend.RAY: run_hp_search_ray,
2082 HPSearchBackend.SIGOPT: run_hp_search_sigopt,
2083 HPSearchBackend.WANDB: run_hp_search_wandb,
2084 }
-> 2085 best_run = backend_dict[backend](self, n_trials, direction, **kwargs)
2087 self.hp_search_backend = None
2088 return best_run
File /usr/local/lib/python3.8/dist-packages/transformers/integrations.py:268, in run_hp_search_ray(trainer, n_trials, direction, **kwargs)
256 if isinstance(
257 kwargs["scheduler"], (ASHAScheduler, MedianStoppingRule, HyperBandForBOHB, PopulationBasedTraining)
258 ) and (not trainer.args.do_eval or trainer.args.evaluation_strategy == IntervalStrategy.NO):
259 raise RuntimeError(
260 "You are using {cls} as a scheduler but you haven't enabled evaluation during training. "
261 "This means your trials will not report intermediate results to Ray Tune, and "
(...)
265 "Trainer `args`.".format(cls=type(kwargs["scheduler"]).__name__)
266 )
--> 268 trainable = ray.tune.with_parameters(_objective, local_trainer=trainer)
270 @functools.wraps(trainable)
271 def dynamic_modules_import_trainable(*args, **kwargs):
272 """
273 Wrapper around `tune.with_parameters` to ensure datasets_modules are loaded on each Actor.
274
(...)
277 Assumes that `_objective`, defined above, is a function.
278 """
File /usr/local/lib/python3.8/dist-packages/ray/tune/utils/trainable.py:348, in with_parameters(trainable, **kwargs)
346 prefix = f"{str(trainable)}_"
347 for k, v in kwargs.items():
--> 348 parameter_registry.put(prefix + k, v)
350 trainable_name = getattr(trainable, "__name__", "tune_with_parameters")
352 if inspect.isclass(trainable):
353 # Class trainable
File /usr/local/lib/python3.8/dist-packages/ray/tune/registry.py:208, in _ParameterRegistry.put(self, k, v)
206 self.to_flush[k] = v
207 if ray.is_initialized():
--> 208 self.flush()
File /usr/local/lib/python3.8/dist-packages/ray/tune/registry.py:220, in _ParameterRegistry.flush(self)
218 self.references[k] = v
219 else:
--> 220 self.references[k] = ray.put(v)
221 self.to_flush.clear()
File /usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py:105, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
103 if func.__name__ != "init" or is_client_mode_enabled_by_default:
104 return getattr(ray, func.__name__)(*args, **kwargs)
--> 105 return func(*args, **kwargs)
File /usr/local/lib/python3.8/dist-packages/ray/worker.py:1872, in put(value, _owner)
1870 with profiling.profile("ray.put"):
1871 try:
-> 1872 object_ref = worker.put_object(value, owner_address=serialize_owner_address)
1873 except ObjectStoreFullError:
1874 logger.info(
1875 "Put failed since the value was either too large or the "
1876 "store was full of pinned objects."
1877 )
File /usr/local/lib/python3.8/dist-packages/ray/worker.py:305, in Worker.put_object(self, value, object_ref, owner_address)
300 if self.mode == LOCAL_MODE:
301 assert (
302 object_ref is None
303 ), "Local Mode does not support inserting with an ObjectRef"
--> 305 serialized_value = self.get_serialization_context().serialize(value)
306 # This *must* be the first place that we construct this python
307 # ObjectRef because an entry with 0 local references is created when
308 # the object is Put() in the core worker, expecting that this python
309 # reference will be created. If another reference is created and
310 # removed before this one, it will corrupt the state in the
311 # reference counter.
312 return ray.ObjectRef(
313 self.core_worker.put_serialized_object_and_increment_local_ref(
314 serialized_value, object_ref=object_ref, owner_address=owner_address
(...)
317 skip_adding_local_ref=True,
318 )
File /usr/local/lib/python3.8/dist-packages/ray/serialization.py:413, in SerializationContext.serialize(self, value)
411 return RawSerializedObject(value)
412 else:
--> 413 return self._serialize_to_msgpack(value)
File /usr/local/lib/python3.8/dist-packages/ray/serialization.py:391, in SerializationContext._serialize_to_msgpack(self, value)
389 if python_objects:
390 metadata = ray_constants.OBJECT_METADATA_TYPE_PYTHON
--> 391 pickle5_serialized_object = self._serialize_to_pickle5(
392 metadata, python_objects
393 )
394 else:
395 pickle5_serialized_object = None
File /usr/local/lib/python3.8/dist-packages/ray/serialization.py:353, in SerializationContext._serialize_to_pickle5(self, metadata, value)
351 except Exception as e:
352 self.get_and_clear_contained_object_refs()
--> 353 raise e
354 finally:
355 self.set_out_of_band_serialization()
File /usr/local/lib/python3.8/dist-packages/ray/serialization.py:348, in SerializationContext._serialize_to_pickle5(self, metadata, value)
346 try:
347 self.set_in_band_serialization()
--> 348 inband = pickle.dumps(
349 value, protocol=5, buffer_callback=writer.buffer_callback
350 )
351 except Exception as e:
352 self.get_and_clear_contained_object_refs()
File /usr/local/lib/python3.8/dist-packages/ray/cloudpickle/cloudpickle_fast.py:73, in dumps(obj, protocol, buffer_callback)
69 with io.BytesIO() as file:
70 cp = CloudPickler(
71 file, protocol=protocol, buffer_callback=buffer_callback
72 )
---> 73 cp.dump(obj)
74 return file.getvalue()
File /usr/local/lib/python3.8/dist-packages/ray/cloudpickle/cloudpickle_fast.py:620, in CloudPickler.dump(self, obj)
618 def dump(self, obj):
619 try:
--> 620 return Pickler.dump(self, obj)
621 except RuntimeError as e:
622 if "recursion" in e.args[0]:
File /usr/local/lib/python3.8/dist-packages/apscheduler/schedulers/base.py:90, in BaseScheduler.__getstate__(self)
89 def __getstate__(self):
---> 90 raise TypeError("Schedulers cannot be serialized. Ensure that you are not passing a "
91 "scheduler instance as an argument to a job, or scheduling an instance "
92 "method where the instance contains a scheduler as an attribute.")
TypeError: Schedulers cannot be serialized. Ensure that you are not passing a scheduler instance as an argument to a job, or scheduling an instance method where the instance contains a scheduler as an attribute.
Example code for recreation
from datasets import load_metric , load_dataset , load_from_disk
import numpy as np
from transformers import (
Wav2Vec2ForCTC,
TrainingArguments,
Trainer,
Wav2Vec2CTCTokenizer,
Wav2Vec2FeatureExtractor,
Wav2Vec2Processor
)
import torch , json
from dataclasses import dataclass
from typing import Dict, List, Optional, Union
@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor (:class:`~transformers.Wav2Vec2Processor`)
            The processor used for processing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
              different lengths).
        max_length (:obj:`int`, `optional`):
            Maximum length of the ``input_values`` of the returned list and optionally padding length (see above).
        max_length_labels (:obj:`int`, `optional`):
            Maximum length of the ``labels`` returned list and optionally padding length (see above).
        pad_to_multiple_of (:obj:`int`, `optional`):
            If set will pad the sequence to a multiple of the provided value.
            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
            7.5 (Volta).
    """

    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True
    max_length: Optional[int] = None
    max_length_labels: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    pad_to_multiple_of_labels: Optional[int] = None

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=self.padding,
                max_length=self.max_length_labels,
                pad_to_multiple_of=self.pad_to_multiple_of_labels,
                return_tensors="pt",
            )

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        batch["labels"] = labels
        return batch
df = load_dataset('librispeech_asr','clean')
#df.save_to_disk('data/ls_clean')
#df = load_from_disk("data/ls_clean")
df = df.remove_columns(["id", "chapter_id", "speaker_id"])
def extract_all_chars(batch):
    all_text = " ".join(batch["text"])
    vocab = list(set(all_text))
    return {"vocab": [vocab], "all_text": [all_text]}
vocabs = df.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=df.column_names["test"])
vocab_list = list(set(vocabs["train.100"]["vocab"][0]) | set(vocabs["test"]["vocab"][0]) | set(vocabs["train.360"]["vocab"][0]) | set(vocabs["validation"]["vocab"][0]))
vocab_dict = {v: k for k, v in enumerate(vocab_list)}
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)
with open('vocab_test.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file)
tokenizer = Wav2Vec2CTCTokenizer("./vocab_test.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=False)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
def prepare_dataset(batch):
    audio = batch["audio"]
    # batched output is "un-batched" to ensure mapping is correct
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    with processor.as_target_processor():
        batch["labels"] = processor(batch["text"]).input_ids
    return batch
df = df.map(prepare_dataset, remove_columns=df.column_names["test"], num_proc=4)
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)
wer_metric = load_metric("wer")
def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)
    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id
    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)
    wer = wer_metric.compute(predictions=pred_str, references=label_str)
    # Trainer expects compute_metrics to return a dict of named metrics
    return {"wer": wer}
def model_init():
    return Wav2Vec2ForCTC.from_pretrained(
        "facebook/wav2vec2-base",
        ctc_loss_reduction="mean",
        pad_token_id=processor.tokenizer.pad_token_id,
    )
    # model.freeze_feature_encoder()
    # return model
training_args = TrainingArguments(
    output_dir='/raytest',
    group_by_length=True,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    evaluation_strategy="steps",
    num_train_epochs=20,
    fp16=True,
    save_steps=500,
    eval_steps=500,
    logging_steps=500,
    learning_rate=1e-4,
    weight_decay=0.005,
    warmup_steps=1000,
    save_total_limit=2,
)

trainer = Trainer(
    args=training_args,
    tokenizer=processor.feature_extractor,
    train_dataset=df["train.100"],
    eval_dataset=df["test"],
    data_collator=data_collator,
    model_init=model_init,
    compute_metrics=compute_metrics,
)

trainer.hyperparameter_search(
    direction="minimize",
    backend="ray",
    n_trials=5,
    fail_fast="raise",
    resources_per_trial={'cpu': 1},
)
Thankful for any response.
Here's a recent blog post by @matteopilotto about using W&B Sweeps with HF transformers: http://wandb.me/hf-sweeps
You can use hyperparameter_search(backend='wandb', ...), or you can use the W&B logger and use Sweeps to control the search, which gives you a plot to understand your metrics.
To use W&B Sweeps, you define a config with your search params and then create the sweep with wandb.sweep(config, project='your-project-name').
def train(config=None):
    with wandb.init(config=config):
        # set sweep configuration
        config = wandb.config

        training_args = TrainingArguments(
            output_dir='vit-sweeps',
            report_to='wandb',  # Turn on Weights & Biases logging
            num_train_epochs=config.epochs,
            learning_rate=config.learning_rate,
            weight_decay=config.weight_decay,
            per_device_train_batch_size=config.batch_size,
            ...
        )
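For completeness, here is a minimal sketch (not from the blog post) of the sweep definition and agent that would drive a train function like the one above. The parameter keys, the metric name, and the project name are assumptions and must match whatever train() reads from wandb.config and logs.

import wandb

sweep_config = {
    "method": "random",  # or "grid" / "bayes"
    "metric": {"name": "eval/f1", "goal": "maximize"},  # assumed metric name
    "parameters": {
        "epochs": {"values": [3, 5]},
        "learning_rate": {"values": [1e-5, 2e-5, 5e-5]},
        "weight_decay": {"values": [0.0, 0.01, 0.1]},
        "batch_size": {"values": [8, 16, 32]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="your-project-name")
wandb.agent(sweep_id, function=train, count=10)  # run 10 trials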
The strange results are actually a consequence of the network being unable to learn anything because of the learning rate, which is very high in your case. Transformers need a much lower fine-tuning learning rate (e.g. 5e-5).
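As an illustration of that advice (a sketch, not code from this thread), an Optuna hp_space that stays inside the usual fine-tuning range would look roughly like this:

def my_hp_space(trial):
    # Keep the learning rate in the typical fine-tuning range instead of 1e-4 to 1e-2.
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 3),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16, 32]),
    }

trainer.hyperparameter_search(direction="maximize", hp_space=my_hp_space)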
Hi @scottire,
I'm new to using wandb sweeps with transformers. I tried to use the example in the blog post and adjust it to multi-class text classification. However, I always run into two errors:
- With data collator function:
Run b4wuzajc errored: TypeError('expected Tensor as element 0 in argument 0, but got str')
- Without data collator function:
ValueError("Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (
text_cleanin this case) have excessive nesting (inputs type
listwhere type
int is expected).")
Here is the notebook. Happy to share some example data.
Would be really grateful if someone could tell me what's going on! Thanks, guys.
Hi everyone,
I fixed it. There was only one line of code missing:
tokenized_datasets = tokenized_datasets.remove_columns(["text_clean", "target", "label_name"])
Just had to remove the unwanted columns from the tokenized dataset.
Hello, I am getting this error when I want to use hyperparameter search:
"File "/uw/test/expe_5/expe_5/traitements1/entrainement_test.py", line 553, in
trainer, outdir = prepare_fine_tuning(PRE_TRAINED_MODEL_NAME, train_dataset, val_dataset, tokenizer, sigle, train_name, datatype)
File "/uw/test/expe_5/expe_5/traitements1/entrainement_test.py", line 402, in prepare_fine_tuning
trainer = Trainer(
File "/uw/.conda/envs/bert/lib/python3.9/site-packages/transformers/trainer.py", line 366, in __init__
model = model.to(args.device)
AttributeError: 'function' object has no attribute 'to'
def model_init():
    set_seed = 42
    num_labels = 3
    return CamembertForSequenceClassification.from_pretrained(PRE_TRAINED_MODEL_NAME, num_labels=num_labels)
def my_hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-4, 1e-2, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 5),
        "seed": trial.suggest_int("seed", 1, 40),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [4, 8, 16, 32, 64]),
    }
trainer = Trainer(
    model=model_init,             # the instantiated 🤗 Transformers model to be trained
    args=training_args,           # training arguments, defined above
    train_dataset=train_dataset,  # training dataset
    eval_dataset=val_dataset,     # evaluation dataset
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(3, 0.0)],  # early stopping if results don't improve after 3 epochs
    compute_metrics=compute_metrics,
)

best_run = trainer.hyperparameter_search(
    direction="maximize",
    n_trials=5,
    keep_checkpoints_num=1,
    hp_space=my_hp_space,
)
Use model_init as a keyword rather than model in Trainer.
Don't do this: Trainer(model=model_init)
Do this: Trainer(model_init=model_init)
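Applied to the snippet above, the construction would look roughly like this (a sketch reusing the names from that post):

trainer = Trainer(
    model_init=model_init,   # pass the factory function, not an instantiated model
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(3, 0.0)],
    compute_metrics=compute_metrics,
)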
Is it at all possible to pass in the model itself to the hp_space? For example if you want to test out different models such as "base-base-uncased" or "roberta-base" or "mpnet-base-v2" etc.
I currently have it set up like this:
epochs = trial.suggest_categorical("epochs", EPOCHS)
batch_size = trial.suggest_categorical("batch_size", BATCH_SIZE)
learning_rate = trial.suggest_categorical("learning_rate", LEARNING_RATES)
scheduler = trial.suggest_categorical("scheduler", SCHEDULERS)
model_name = trial.suggest_categorical("model_name", MODEL_NAMES)

hp_space = {
    "model_name": model_name,
    "batch_size": batch_size,
    "learning_rate": learning_rate,
    "scheduler": scheduler,
    "epochs": epochs,
}
I am not sure how to pass this in correctly to Trainer
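One pattern that may work here (a sketch, not something confirmed in this thread): with the Optuna backend, the Trainer passes the current trial to model_init when that function accepts one argument, so the checkpoint name can be suggested there instead of in hp_space. MODEL_NAMES and the sequence-classification head below are assumptions.

from transformers import AutoModelForSequenceClassification

MODEL_NAMES = ["roberta-base", "bert-base-uncased"]  # hypothetical candidate checkpoints

def model_init(trial):
    # The Trainer calls model_init(None) once at construction time and with the
    # current Optuna trial during the search, so fall back to a default checkpoint.
    if trial is None:
        model_name = MODEL_NAMES[0]
    else:
        model_name = trial.suggest_categorical("model_name", MODEL_NAMES)
    return AutoModelForSequenceClassification.from_pretrained(model_name)

# trainer = Trainer(model_init=model_init, args=training_args, ...)

Note that anything tied to the checkpoint (tokenizer, data collator) would have to be kept consistent with the sampled model as well.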
EDIT: It seems that indeed (as mentioned in one of your comments above) just passing the backend as wandb instead of ray worked, i.e.
trainer.hyperparameter_search(
    direction="maximize",
    backend="wandb",
    n_trials=4,
    keep_checkpoints_num=1,
    scheduler=get_scheduler(),
)
I tested the provided example and managed to run it successfully.
However, I do not completely understand how you would integrate W&B Sweeps with a PopulationBasedTraining() scheduler, for example.
Could you shed some light on that, please?
Thank you.
*** I did read this article: Weights & Biases, but in that case I only see how the configuration for the different hyperparameter strategies is defined, not how to integrate it with wandb.config() and wandb.agent() like in the example you posted.
To be more precise:
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

def get_scheduler():
    # Creating the PBT scheduler
    scheduler = PopulationBasedTraining(
        mode="max",
        metric="eval_f1",
        perturbation_interval=2,
        hyperparam_mutations={
            "weight_decay": tune.uniform(0.0, 0.3),
            "learning_rate": tune.uniform(1e-5, 5e-5),
            "per_device_train_batch_size": tune.choice([8, 16, 24, 32, 48]),
            "num_train_epochs": tune.choice([5]),
            "warmup_steps": tune.choice(range(0, 500)),
        }
    )
    return scheduler
...

trainer.hyperparameter_search(
    direction="maximize",
    backend="ray",
    n_trials=4,
    keep_checkpoints_num=1,
    scheduler=get_scheduler(),
)
How should I integrate W&B for Sweeps analysis?
Hey @Calin, right now it doesn't look like you can use two backends with the Trainer's hyperparameter_search.
Maybe you can try setting report_to = "wandb" in the TrainingArguments?
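Concretely, that would be something like the following (a sketch; the output directory and evaluation settings are placeholders):

training_args = TrainingArguments(
    output_dir="raytune-with-wandb-logging",  # hypothetical output directory
    report_to="wandb",                        # log each Ray Tune trial to Weights & Biases
    evaluation_strategy="steps",
    eval_steps=500,
)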
That worked indeed. The only issue is that I cannot manage to name my wandb sweep ID as mentioned in the tutorial, meaning that all the trainings fall into one "Uncategorized" directory in my W&B project base.
Have you tried setting the project name with the WANDB_PROJECT environment variable?
WANDB_PROJECT=amazon_sentiment_analysis
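If you are launching from a notebook rather than a shell, the same variable can be set in Python before training starts; the project name here is just the example above:

import os

# Must be set before the first wandb run is initialized by the Trainer.
os.environ["WANDB_PROJECT"] = "amazon_sentiment_analysis"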
Hi, I am able to manually set up cross-validation as well as hyperparameter tuning with Optuna separately, but I am having difficulty figuring out how to perform HPO with CV. Any help is appreciated. Thanks.
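For what it's worth, one way to combine the two that may work (a sketch under assumptions, not an official Trainer feature): run a plain Optuna study whose objective builds a Trainer per fold and returns the mean validation metric. The dataset, compute_metrics function, metric name, and checkpoint below are placeholders.

import numpy as np
import optuna
from sklearn.model_selection import KFold
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# `dataset` (a datasets.Dataset) and `compute_metrics` (returning {"f1": ...}) are
# assumed to exist already; the checkpoint name is a placeholder.
def objective(trial):
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True)
    num_train_epochs = trial.suggest_int("num_train_epochs", 1, 3)

    kfold = KFold(n_splits=5, shuffle=True, random_state=42)
    fold_scores = []
    for fold, (train_idx, val_idx) in enumerate(kfold.split(np.arange(len(dataset)))):
        args = TrainingArguments(
            output_dir=f"cv_trial{trial.number}_fold{fold}",
            learning_rate=learning_rate,
            num_train_epochs=num_train_epochs,
            evaluation_strategy="epoch",
            report_to="none",
        )
        trainer = Trainer(
            model_init=lambda: AutoModelForSequenceClassification.from_pretrained("bert-base-uncased"),
            args=args,
            train_dataset=dataset.select(train_idx),
            eval_dataset=dataset.select(val_idx),
            compute_metrics=compute_metrics,
        )
        trainer.train()
        fold_scores.append(trainer.evaluate()["eval_f1"])
    # The objective is the mean validation score across folds.
    return float(np.mean(fold_scores))

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10)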