Issues with GPU Configuration and Model Deployment on Hugging Face Spaces

It is very sad that I cannot get any answer to this:

Dear M.,

As a newbie, it is extremely demotivating not only to have problems with the fine-tuning, where I am not helped, but also to have to pay for my wasted time.

So I will be leaving Hugging Face again very soon.

Best regards from

Andreas H. Drescher

Current period: May 1 - May 31

Subscriptions (paid): $9.00

Spaces: $37.00

A100 large, 9 hours and 15 minutes: $37.00

Subject: Issues with GPU Configuration and Model Deployment on Hugging Face Spaces

Dear Megan,

I hope this message finds you well.

I am writing to seek assistance with a few persistent issues I’m encountering while attempting to use the A100 GPU on Hugging Face Spaces for my project. Despite following the provided guidelines and successfully upgrading to the A100, I have faced several obstacles that are preventing me from progressing.

Here are the details:

  1. CUDA Availability:
  • The CUDA availability check in my app.py script confirms that CUDA is available and correctly identifies the NVIDIA A100 GPU. This step works as expected.
  • Logs:

Is CUDA available: True
CUDA device: NVIDIA A100-SXM4-40GB

  2. Model and Tokenizer Loading:
  • The model and tokenizer from the Hugging Face Hub (LeoLM/leo-hessianai-70B-chat) load without any issues.
  • Successful login to Hugging Face using the token is also confirmed.
  3. Dataset and Tokenization:
  • The dataset loads and tokenizes correctly. No errors appear in this process.
  • Logs indicate successful loading and tokenization of the dataset.
  4. Training and UI Issues:
  • The main issue arises when attempting to train the model or push it to the Hub. The training process does not start, and the UI does not display the expected output or buttons correctly.
  • Logs show successful execution up to the point where the training should begin, but there is no further progress or output.
  5. Potential Loop or Restart Issues:
  • Occasionally, the application appears to restart or refresh unexpectedly, causing the process to start over. This might be linked to the “Page not found” errors I encounter intermittently.
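
For scale, a rough estimate of the memory needed just to hold the model weights may be relevant to point 4. The sketch below assumes that "70B" in the model name means roughly 70 billion parameters and that the GPU has the 40 GB shown in the log above. It counts weights only and ignores gradients, optimizer state, and activations, which training would add on top. The function name and constants are illustrative, not taken from any library.

```python
# Rough weight-memory estimate; all figures are assumptions, not measurements.
ASSUMED_PARAMS = 70e9   # "70B" in the model name, taken as ~70 billion parameters
GPU_MEMORY_GB = 40      # from the log line "NVIDIA A100-SXM4-40GB"

BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weights_gb(num_params: float, dtype: str) -> float:
    """Return the memory needed for the weights alone, in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    needed = weights_gb(ASSUMED_PARAMS, dtype)
    fits = "fits" if needed <= GPU_MEMORY_GB else "does not fit"
    print(f"{dtype}: ~{needed:.0f} GB of weights, {fits} in {GPU_MEMORY_GB} GB")
```

Under these assumptions the weights alone would exceed the card's memory in both precisions, which might be worth ruling out before debugging the UI.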

My current app.py script and requirements.txt are as follows:

app.py:

import torch
import os
import streamlit as st
from huggingface_hub import login
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from datasets import load_dataset
import logging

logging.basicConfig(level=logging.INFO)

# Check CUDA availability
print(f"Is CUDA available: {torch.cuda.is_available()}")
print(f"CUDA device: {torch.cuda.get_device_name(torch.cuda.current_device())}")

# Log in to Hugging Face using the token from environment variables
hf_token = os.getenv('HF_TOKEN')
if hf_token:
    login(hf_token)
    st.write("Successfully logged in to Hugging Face")
else:
    st.error("Hugging Face token not found in environment variables")
    raise ValueError("Hugging Face token not found in environment variables")

# Load the model and the tokenizer
try:
    model = AutoModelForCausalLM.from_pretrained("LeoLM/leo-hessianai-70B-chat")
    tokenizer = AutoTokenizer.from_pretrained("LeoLM/leo-hessianai-70B-chat")
except Exception as e:
    st.error(f"Error loading model or tokenizer: {e}")
    logging.error(f"Error loading model or tokenizer: {e}")
    raise e

# Load the tuple files as a Hugging Face dataset
try:
    dataset = load_dataset("text", data_files={"train": "data/*.txt"})

    # Tokenize the tuples
    def tokenize_function(examples):
        return tokenizer(examples["text"], truncation=True, max_length=512)

    tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
except Exception as e:
    st.error(f"Error loading or tokenizing dataset: {e}")
    logging.error(f"Error loading or tokenizing dataset: {e}")
    raise e

# Create a data collator that prepares the sequences for language modeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Define the TrainingArguments
training_args = TrainingArguments(
    output_dir="./results",
    # No eval_dataset is passed to the Trainer below, so evaluation is disabled
    # (the original value "epoch" would fail when evaluation is attempted).
    evaluation_strategy="no",
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_total_limit=2,
    save_steps=5000,
    push_to_hub=False,
)

# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    data_collator=data_collator,
)

# Streamlit UI
st.title('Finetune LeoLM Model')

if st.button('Start Training'):
    with st.spinner('Training in progress...'):
        try:
            trainer.train()
            st.success("Training completed")
        except Exception as e:
            st.error(f"Error during training: {e}")
            logging.error(f"Error during training: {e}")

if st.button('Push to Hub'):
    with st.spinner('Pushing model to Hub...'):
        try:
            trainer.push_to_hub()
            st.success("Model pushed to Hugging Face Hub")
        except Exception as e:
            st.error(f"Error pushing model to hub: {e}")
            logging.error(f"Error pushing model to hub: {e}")
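
One thing that may be related to the restarts in point 5: Streamlit re-executes the whole script from top to bottom each time a widget such as a button is used, so module-level work like from_pretrained runs again on every click. The sketch below imitates that behaviour in plain Python, with no Streamlit calls. The names run_script, load_model_stub, cached_load, and _cache are hypothetical stand-ins; in a real app the analogous tool would be Streamlit's st.cache_resource decorator.

```python
# Plain-Python imitation of a script that is re-executed on every interaction.
# Nothing here calls Streamlit; all names are hypothetical stand-ins.
load_calls = 0
_cache = {}


def load_model_stub(name: str) -> str:
    """Stand-in for an expensive load such as from_pretrained."""
    global load_calls
    load_calls += 1
    return f"model:{name}"


def cached_load(name: str) -> str:
    """Load once per name and reuse the result on later runs."""
    if name not in _cache:
        _cache[name] = load_model_stub(name)
    return _cache[name]


def run_script(use_cache: bool) -> str:
    """One top-to-bottom execution of the 'script'."""
    loader = cached_load if use_cache else load_model_stub
    return loader("demo-model")


# Three interactions without caching: the load runs three times.
for _ in range(3):
    run_script(use_cache=False)
uncached_calls = load_calls

# Three interactions with caching: the load runs only once more.
for _ in range(3):
    run_script(use_cache=True)
cached_calls = load_calls - uncached_calls

print(uncached_calls, cached_calls)
```

If this is what happens in the Space, each button click would repeat the model and dataset loading before the training call is reached, which could look like the app starting over.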

requirements.txt:

--extra-index-url https://download.pytorch.org/whl/cu113
torch
transformers==4.28.1
datasets==2.12.0
streamlit==1.22.0
huggingface-hub==0.14.1

I would greatly appreciate any assistance or guidance you can provide to resolve these issues. Thank you for your time and support.

Best regards, Andreas