Hi everyone, I have the code below, but it isn't saving the model. Can anyone give me suggestions or tell me why it isn't saving?
import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW
import pandas as pd
from app.utils import data_utils
from datasets import Dataset, load_dataset
from transformers import AutoTokenizer, DistilBertForSequenceClassification, get_scheduler
from tqdm.auto import tqdm
import evaluate
# Fine-tuning DistilBERT for news sentiment analysis (using Kaggle news sentiment analysis) -> 3 labels
# 0 = Negative
# 1 = Neutral
# 2 = Positive
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=3)
# Other test news sentiment
# ds = load_dataset("sara-nabhani/ML-news-sentiment")
# Tokenize Function
def tokenize_function(data):
    return tokenizer(data["text"], padding="max_length", truncation=True)
# https://www.kaggle.com/datasets/clovisdalmolinvieira/news-sentiment-analysis?resource=download
# dataset = load_dataset("yelp_review_full")
# print(dataset)
# Load in dataset and shape it into pandas dataframe
df = data_utils.load_raw_data("kaggle_news_sentiment_analysis.csv")
df["text"] = df["Title"] + ": " + df["Description"]
df = df[["text", "Sentiment"]]
replacements = {"negative": 0, "neutral": 1, "positive": 2}
df["Sentiment"] = df["Sentiment"].map(replacements).fillna(df["Sentiment"])
df.rename(columns={"Sentiment": "labels"}, inplace=True)
print(df)
# Create train and test datasets
train = df.head(3000)
test = df.tail(500)
# Convert from pandas DataFrames to Hugging Face Datasets
train = Dataset.from_pandas(train)
print(train)
test = Dataset.from_pandas(test)
print(test)
# Tokenize the datasets
tokenized_train = train.map(tokenize_function, batched=True)
tokenized_train = tokenized_train.remove_columns(["text"])
tokenized_train.set_format(type="torch")
print(tokenized_train)
tokenized_test = test.map(tokenize_function, batched=True)
tokenized_test = tokenized_test.remove_columns(["text"])
tokenized_test.set_format(type="torch")
print(tokenized_test)
# Create data loaders
train_dataloader = DataLoader(tokenized_train, shuffle=True, batch_size=10)
test_dataloader = DataLoader(tokenized_test, batch_size=10)
# Initialize optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)
# Initialize learning rate scheduler
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)
# Specify device to train on
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
# Training loop
print("STARTING TRAINING LOOP ==============")
progress_bar = tqdm(range(num_training_steps))
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}  # Move batch tensors to the training device
        outputs = model(**batch)  # Pass batch through model and get output
        loss = outputs.loss  # The model returns the loss because the batch includes "labels"
        loss.backward()  # Backpropagation: compute gradients from the loss
        optimizer.step()  # Update parameters
        lr_scheduler.step()  # Advance the learning-rate schedule
        optimizer.zero_grad()  # Reset gradients for the next batch
        progress_bar.update(1)  # Update progress bar
# Save model
model.save_pretrained(save_directory="./saved_models/distilbert")
This may fix the problem. If it doesn’t, there may be a bug in one of the libraries.
# Save model
model.to("cpu") # Added
model.save_pretrained(save_directory="./saved_models/distilbert")
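If that doesn't change anything, it may help to confirm that the output directory is actually being written and to try the non-safetensors code path. A small debugging sketch (the path is the one from your script; safe_serialization=False just switches to the older pytorch_model.bin format):

import os

save_dir = "./saved_models/distilbert"
os.makedirs(save_dir, exist_ok=True)  # Make sure the directory exists before saving

model.to("cpu")
model.save_pretrained(save_dir, safe_serialization=False)  # Write a .bin checkpoint instead of safetensors
tokenizer.save_pretrained(save_dir)  # Save the tokenizer alongside the model so from_pretrained works later

print(os.listdir(save_dir))  # config.json and the weights file should show up here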
GitHub issue (opened 12 Mar 2024, closed 19 Mar 2024):
### System Info
Transformers == 4.38.2
Platform == TPU V4 on GKE
Python == 3.10
### Who can help?
@ArthurZucker
### Information
- [ ] The official example scripts
- [X] My own modified scripts
### Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [x] My own task or dataset (give details below)
### Reproduction
I ran some tests on a GKE Cluster with TPU V4 with 4 nodes.
https://gist.github.com/moficodes/1492228c80a3c08747a973b519cc7cda
This run fails with
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/safetensors/torch.py", line 13, in storage_ptr
return tensor.untyped_storage().data_ptr()
RuntimeError: Attempted to access the data pointer on an invalid python storage.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "//fsdp.py", line 112, in <module>
model.save_pretrained(new_model_id)
File "/usr/local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2448, in save_pretrained
safe_save_file(shard, os.path.join(save_directory, shard_file), metadata={"format": "pt"})
File "/usr/local/lib/python3.10/site-packages/safetensors/torch.py", line 281, in save_file
serialize_file(_flatten(tensors), filename, metadata=metadata)
File "/usr/local/lib/python3.10/site-packages/safetensors/torch.py", line 470, in _flatten
shared_pointers = _find_shared_tensors(tensors)
File "/usr/local/lib/python3.10/site-packages/safetensors/torch.py", line 72, in _find_shared_tensors
if v.device != torch.device("meta") and storage_ptr(v) != 0 and storage_size(v) != 0:
File "/usr/local/lib/python3.10/site-packages/safetensors/torch.py", line 17, in storage_ptr
return tensor.storage().data_ptr()
File "/usr/local/lib/python3.10/site-packages/torch/storage.py", line 956, in data_ptr
return self._data_ptr()
File "/usr/local/lib/python3.10/site-packages/torch/storage.py", line 960, in _data_ptr
return self._untyped_storage.data_ptr()
RuntimeError: Attempted to access the data pointer on an invalid python storage.
### Expected behavior
Save the model and push to hugging face.
I’ll try it out after this, but I was wondering whether it’s possible to train the model with native PyTorch and still save it using Hugging Face’s save_pretrained() function, or whether that could be another source of the problem. I currently don’t get any error messages; it just doesn’t save.
If save_pretrained fails, it usually raises an error, so a silent failure smells like a bug…
Basically, all that HF’s save_pretrained writes is the torch model’s state_dict plus the configuration JSON, so converting a model trained with native torch to HF format is usually not that difficult.
The only real problem is when the state_dict keys change.
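For example, a round trip from a natively trained model to a from_pretrained-compatible folder can look roughly like this (a minimal sketch; the checkpoint filename and output directory are just placeholders):

import torch
from transformers import DistilBertForSequenceClassification

# After the native training loop, keep a plain torch checkpoint of the weights
torch.save(model.state_dict(), "distilbert_native.pt")

# Later (possibly in another script): rebuild the HF model and load the weights back in.
# As long as the keys match the HF architecture, this round-trips cleanly.
hf_model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3
)
hf_model.load_state_dict(torch.load("distilbert_native.pt", map_location="cpu"))

# save_pretrained then writes config.json plus the weights in the HF layout
hf_model.save_pretrained("./saved_models/distilbert_from_native")

If the keys don’t line up (for example, a custom wrapper adds a prefix), you’d have to rename them before calling load_state_dict; that’s the case where it gets fiddly. In your case the model already is a transformers model, so you can call save_pretrained on it directly after the native loop; the sketch above is only needed when all you have is a raw state_dict.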
GitHub issue (opened 22 Apr 2024, labeled enhancement):
I'm finding this repo to be a user friendly, extensible, memory efficient solution for training/fine-tuning models. However, when it comes to inference, there is a usability gap that could be solved by converting the model into a format that can be loaded by HF's [`from_pretrained()`](https://huggingface.co/docs/transformers/v4.40.0/en/model_doc/auto#transformers.AutoModel.from_pretrained) function.
The specific thing I want to do is load a model fine-tuned with torchtune into a [Gradio chatbot, complete with token streaming](https://www.gradio.app/guides/creating-a-chatbot-fast#example-using-a-local-open-source-llm-with-hugging-face). I imagine many other downstream tasks would be made easier with this functionality as well.
Would it be reasonable to add the following options to the checkpointer?
- Save full model weights in such a way that they can be loaded with [`from_pretrained()`](https://huggingface.co/docs/transformers/v4.40.0/en/model_doc/auto#transformers.AutoModel.from_pretrained).
- Save a LoRA adapter in such a way that it can be loaded into a model with [`get_peft_model()`](https://huggingface.co/docs/peft/v0.10.0/en/package_reference/peft_model#peft.get_peft_model).
If this seems like a valid addition, and isn't a huge lift, I would be happy to give it a try.
This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.