We have set up an AWS EC2 instance and initialized the Qwen/Qwen2-VL-7B-Instruct (https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) and mistralai/Pixtral-12B-2409 models from Hugging Face.
When we create an AMI from this VM and launch a new instance from it, the transformers library isn't able to load the pretrained models.
When I delete .cache on the newly created instance, it is able to download and load the models again.
Is this expected behaviour? The instance configuration is exactly the same.
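A quick way to check whether the hub cache baked into the AMI is actually resolvable offline is a small diagnostic like the sketch below (an assumption on my side, not something from the thread); it only asks huggingface_hub to resolve the two repos from the local cache:

```python
# Diagnostic sketch: check whether the hub cache baked into the AMI resolves
# without any network access. Model IDs are the ones from the question above.
from huggingface_hub import snapshot_download

for repo_id in ("Qwen/Qwen2-VL-7B-Instruct", "mistralai/Pixtral-12B-2409"):
    try:
        # local_files_only=True resolves purely from the local hub cache
        path = snapshot_download(repo_id, local_files_only=True)
        print(f"{repo_id}: cache OK at {path}")
    except Exception as exc:
        print(f"{repo_id}: cache not usable on this instance: {exc}")
```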
The symptoms are different, but the problem could be similar to this one:
Hi, I’m trying to train a model using a HuggingFace estimator in SageMaker but I keep getting this error after a few minutes:
[1,15]: File “pyarrow/ipc.pxi”, line 365, in pyarrow.lib._CRecordBatchWriter.write_batch
[1,15]: File “pyarrow/error.pxi”, line 97, in pyarrow.lib.check_status
[1,15]:OSError: [Errno 28] Error writing bytes to file. Detail: [errno 28] No space left on device
[1,15]:
I’m not sure what is triggering this problem because the volume size is high (volume_size=1024)
…
Space doesn't seem to be the issue; I have enough free space on the VM. The moment I delete .cache, it re-downloads the weights and starts working.
Your error is coming from caching the dataset. The datasets library caches the dataset on disk to work with it properly. The default cache_dir is ~/.cache/huggingface/datasets. This directory seems not to be on the mounted EBS volume.
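If that were the cause, pointing the datasets cache at the mounted EBS volume would be the fix. A minimal sketch, with a hypothetical mount path and a placeholder dataset name:

```python
# Minimal sketch (mount path and dataset name are placeholders): keep the
# datasets cache on the large mounted volume instead of the default
# ~/.cache/huggingface/datasets.
import os
os.environ["HF_DATASETS_CACHE"] = "/opt/ml/data/hf_datasets_cache"  # set before importing datasets

from datasets import load_dataset

ds = load_dataset("imdb", cache_dir="/opt/ml/data/hf_datasets_cache")  # or pass cache_dir per call
```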
I think it's not that; it's around here.
I'm not trying to cache a dataset; it's just the model weights, and the AMI already has the weights on its root volume. The primary purpose is to boot the process as fast as possible.
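One way (an assumption on my side, not something from the thread) to make weights baked into an AMI independent of the hub cache metadata is to snapshot them into a plain directory in the image and load from that path at boot:

```python
# Sketch, assuming a hypothetical /opt/models path baked into the AMI.
# Build time: materialise the repo into a plain directory (no cache layout involved).
from huggingface_hub import snapshot_download

snapshot_download("Qwen/Qwen2-VL-7B-Instruct", local_dir="/opt/models/qwen2-vl-7b-instruct")

# Boot time: load from the local path; no Hub lookups or cache resolution needed.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("/opt/models/qwen2-vl-7b-instruct")
# the model class used in the original setup loads from the same local path
```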
Perhaps an unresolved issue?
opened 05:46PM - 21 Feb 24 UTC · closed 08:04AM - 05 Apr 24 UTC
### System Info
I am on Ubuntu, torch = "2.0.0".
The following code always re-downloads the models instead of re-using the cached files:
```python
# `accelerator` and `logger` come from the surrounding training script (not shown).
import torch
from diffusers import (
    AutoencoderKL,
    EulerAncestralDiscreteScheduler,
    StableDiffusionXLAdapterPipeline,
    T2IAdapter,
)

adapter = T2IAdapter.from_pretrained(
    "TencentARC/t2i-adapter-canny-sdxl-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to(accelerator.device)

model_id = "stabilityai/stable-diffusion-xl-base-1.0"
logger.info("model loading..")
euler_a = EulerAncestralDiscreteScheduler.from_pretrained(model_id, subfolder="scheduler")
logger.info("model loading...")
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)
logger.info("model loading....")
pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
    model_id,
    vae=vae,
    adapter=adapter,
    scheduler=euler_a,
    torch_dtype=torch.float16,
    variant="fp16",
).to(accelerator.device)
pipe.enable_xformers_memory_efficient_attention()
logger.info("model weights loading")
pipe.load_lora_weights(
    "stabilityai/stable-diffusion-xl-base-1.0",
    weight_name="sd_xl_offset_example-lora_1.0.safetensors",
)
```
I also tried with the `cache_dir` param, but got the same result.
(Screenshot from 2024-02-21 attached.)
### Who can help?
_No response_
### Information
- [ ] The official example scripts
- [ ] My own modified scripts
### Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
### Reproduction
Same code as in the "System Info" section above.
### Expected behavior
Use the cached models/files and not download any files from the internet.
I wonder if it’s possible to avoid this by manually clearing the cache before execution.
I have seen this issue in the GitHub issues.
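For what it's worth, a minimal sketch of that workaround, assuming the default cache location (adjust if HF_HOME points elsewhere):

```python
# Sketch: clear the hub cache before the process starts, as described above.
# Assumes the default cache location; adjust if HF_HOME is set.
import shutil
from pathlib import Path

hub_cache = Path.home() / ".cache" / "huggingface" / "hub"
if hub_cache.exists():
    shutil.rmtree(hub_cache)
```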
I'm not facing this, but in my case it starts loading the weights and remains stuck at 0% when I spin up a new instance.
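If the hang happens while the library re-validates cached files against the Hub on the fresh instance, forcing offline mode would rule that out (an assumption, not a confirmed fix):

```python
# Sketch: force offline resolution so the baked cache is used without
# contacting the Hub. Set these before importing transformers/huggingface_hub.
import os
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")  # resolves from the local cache only
```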