I got "not equivalent" for the 15th entry of the training dataset and "equivalent" for the 87th entry in the activity.
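A minimal way to check those two entries (a sketch, assuming the MRPC dataset from this chapter and that the second entry refers to the validation split):

from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
# ClassLabel names for MRPC: ['not_equivalent', 'equivalent']
label_names = raw_datasets["train"].features["label"].names
print(label_names[raw_datasets["train"][15]["label"]])
print(label_names[raw_datasets["validation"][87]["label"]])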
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "sst2")

# Tokenizer checkpoint as used elsewhere in the chapter
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Number of feature columns in the training split (not the number of sentences)
num_columns = len(raw_datasets["train"].features)
print(num_columns)

def tokenizer_function(example):
    if "sentence" in example:  # single-sentence datasets such as SST-2
        return tokenizer(example["sentence"], truncation=True)
    elif "sentence1" in example and "sentence2" in example:  # sentence-pair datasets such as MRPC
        return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
    else:
        raise ValueError(
            "Invalid dataset format: example must contain either a 'sentence' field "
            "or 'sentence1' and 'sentence2' fields."
        )

tokenized_datasets = raw_datasets.map(tokenizer_function, batched=True)

# Drop whichever raw text columns the dataset actually has
try:
    tokenized_datasets = tokenized_datasets.remove_columns(["idx", "sentence1", "sentence2"])
except ValueError:
    pass
try:
    tokenized_datasets = tokenized_datasets.remove_columns(["idx", "sentence"])
except ValueError:
    pass

tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets = tokenized_datasets.with_format("torch")

batch = data_collator(tokenized_datasets["train"][:8])
[len(x) for x in tokenized_datasets["train"][:8]["input_ids"]]
{k: v.shape for k, v in batch.items()}
My solution to the exercise.
I'm getting a "You must call wandb.init() before wandb.log()" error when I attempt to run
trainer.train()
I tried to import wandb as Gemini recommended, but then it asked for a key upon pip install, so I do not believe that is the correct solution. Any help is appreciated.
full message:
Error                                     Traceback (most recent call last)
in <cell line: 1>()
----> 1 trainer.train()

7 frames
/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/preinit.py in preinit_wrapper(*args, **kwargs)
     34     ) -> Callable:
     35         def preinit_wrapper(*args: Any, **kwargs: Any) -> Any:
---> 36             raise wandb.Error(f"You must call wandb.init() before {name}()")
     37
     38         preinit_wrapper.__name__ = str(name)
Hi NLP Course Team,
I’ve been exploring the fine-tuning content in your NLP course and creating educational materials in this space. The PEFT library tutorial is very comprehensive for implementation aspects. While working through the material, I noticed an opportunity to deepen the theoretical foundations - explaining the ‘why’ behind the ‘what’ and ‘how’ of these methods:
- Deriving PEFT methods from first principles
- Mathematical intuition behind different approaches when applicable
- Trade-offs between various fine-tuning methods
- Building foundations for innovation in PEFT
I’ve been writing about these topics on LinkedIn and creating tutorials that bridge theory with implementation. Understanding the underlying principles can help us not just implement existing methods, but also innovate and develop our own approaches.
I strongly believe in the power of knowledge sharing and collaboration, and my goal is to be part of something greater than myself—building tools and fostering communities that inspire others to learn, innovate, and create.
Looking forward to your thoughts!
Calling the trainer.train() function in Colab asks for the wandb API key. Why is that needed? Is it free? Could someone explain a bit about it? (I might have missed it during the course.)
If you don't specify the report_to option in TrainingArguments, the Trainer reports to every installed logging integration (including wandb), which is why it asks for a key. Just a strange default.
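A minimal sketch of turning the integration off (the "test-trainer" output directory matches the one used in the chapter):

from transformers import TrainingArguments

# report_to="none" disables all logging integrations (wandb, TensorBoard, ...),
# so trainer.train() no longer asks for a wandb API key
training_args = TrainingArguments("test-trainer", report_to="none")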
I got this error with the following block of code, could you please guide me?
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

# checkpoint and opt are defined earlier in the notebook (not shown in this post)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=opt, loss=loss, metrics=["accuracy"])
ERROR:
ValueError: Could not interpret optimizer identifier: <keras.src.optimizers.adam.Adam object at 0x7e80ed478950>
It seems that you need to change the import statement depending on the version of Keras.
#from keras.optimizers import Adam
from keras.optimizers import adam_v2
Oh, thank you bro
Sorry, but
from keras.optimizers import adam_v2
didn’t work for me
Legacy Keras issue…?
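If the error comes from mixing an optimizer built with the standalone keras package and the tf.keras model that transformers creates, a sketch that may avoid the mismatch is to create the optimizer from tf.keras directly (the checkpoint name and learning rate here are assumptions matching the chapter):

import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

checkpoint = "bert-base-uncased"  # assumption: the checkpoint used in the chapter
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Build the optimizer from the same tf.keras namespace as the model,
# instead of importing it from the standalone keras package
opt = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=opt, loss=loss, metrics=["accuracy"])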
The example in chapter 3 has an error: the AdamW function needs to be imported as follows:
from torch.optim import AdamW
and not from the transformers library.
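For reference, a minimal self-contained sketch with that import (the checkpoint and learning rate are assumptions matching the chapter):

from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification

# AdamW is no longer provided by transformers; it lives in torch.optim
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = AdamW(model.parameters(), lr=5e-5)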
It seems to have disappeared…
In the Dynamic Padding part, why does the code not throw an error? tokenized_datasets does not have a "labels" field, only a "label" field, yet setting
label_cols=["labels"]
in the .to_tf_dataset() method works, even though the available columns are ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'].
Also, after setting the learning rate with PolynomialDecay, the accuracy after 3 epochs is only 62%. That is not acceptable; how can I improve it? Even traditional approaches like the Viterbi algorithm achieve more than this.
This is because DataCollatorWithPadding returns a dict that includes "labels" among its keys. If you want to handle other data appropriately, you will need to write your own collate_fn (data collator).
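As an illustration, a minimal sketch of a custom collate_fn that pads the inputs and exposes "label" under the name "labels" (the checkpoint and field names are assumptions based on the MRPC columns listed above):

import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumption: the chapter's checkpoint

def collate_fn(examples):
    # Pad input_ids / token_type_ids / attention_mask to the longest example in the batch
    batch = tokenizer.pad(
        [{k: ex[k] for k in ("input_ids", "token_type_ids", "attention_mask")} for ex in examples],
        return_tensors="np",
    )
    # Expose the label under the name "labels", which is what label_cols=["labels"] refers to
    batch["labels"] = np.array([ex["label"] for ex in examples])
    return batch

Such a function can then be passed to .to_tf_dataset(..., collate_fn=collate_fn).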
Why is section 4, "A full training", not available for TensorFlow? Can I skip this section? I have never worked with torch. What do I do? I need guidance.
The AdamW function has moved out of the transformers library; it now needs to be imported as follows:
from torch.optim import AdamW