python: 3.9.7
datasets: 2.1.0
accelerate: 0.9.0
I've written a datasets loading script. At the outset, the size of my home directory is 25GB. I run the datasets-cli test command:
datasets-cli test path/to/my/dataset/full_dataset --save_infos --all_configs
Q1) Is it necessary to run the datasets-cli dummy-data command in order to load a dataset from a loading script (i.e. using load_dataset('path/to/my/dataset/full_dataset'))?
After running the datasets-cli test command, the size of my home directory becomes 41GB and the files in ~/.cache/huggingface/ are:
/home/aclifton/.cache/huggingface/datasets
full_data
_home_aclifton_.cache_huggingface_datasets_full_data_default_0.0.0_7d846c1ada953e8a7f4733f43d00690bcd762473b720dfd4f8e7fde8b4bc2542.incomplete.lock
_home_aclifton_.cache_huggingface_datasets_full_data_default_0.0.0_7d846c1ada953e8a7f4733f43d00690bcd762473b720dfd4f8e7fde8b4bc2542.lock
/home/aclifton/.cache/huggingface/datasets/full_data/default/0.0.0/7d846c1ada953e8a7f4733f43d00690bcd762473b720dfd4f8e7fde8b4bc2542
dataset_info.json full_data-full.arrow
/home/aclifton/.cache/huggingface/modules
datasets_modules __init__.py
/home/aclifton/.cache/huggingface/modules/datasets_modules
datasets __init__.py __pycache__
/home/aclifton/.cache/huggingface/modules/datasets_modules/datasets
full_dataset full_dataset.lock __init__.py __pycache__
/home/aclifton/.cache/huggingface/modules/datasets_modules/datasets/full_dataset
7d846c1ada953e8a7f4733f43d00690bcd762473b720dfd4f8e7fde8b4bc2542 __init__.py __pycache__
/home/aclifton/.cache/huggingface/modules/datasets_modules/datasets/full_dataset/7d846c1ada953e8a7f4733f43d00690bcd762473b720dfd4f8e7fde8b4bc2542
dataset_infos.json full_dataset.json full_dataset.py __init__.py __pycache__
I run the following code:
from accelerate import Accelerator
from datasets import load_dataset

class MyAccelerator:
    def __init__(self, experiment_tracker):
        self.accelerator = Accelerator(log_with=experiment_tracker)
        self.accelerator.init_trackers('my_project')

class MyClass:
    def __init__(
        self,
        dataset_path: str,
        accelerator_obj = None):
        self.data_file_dir = dataset_path
        if accelerator_obj is not None:
            self.accel_obj = accelerator_obj
        self.dataset = load_dataset(self.data_file_dir)

    def do_a_thing(self, labels_col_name: str):
        def _tmp_do_a_thing(example, labels_col: str):
            example[labels_col] = ['cool' for label in example[labels_col]]
            return example
        self.dataset = self.dataset.map(_tmp_do_a_thing,
                                        fn_kwargs={'labels_col': labels_col_name},
                                        batched=True,
                                        num_proc=4
                                        )

dataset_dir = 'path/to/my/dataset/full_dataset'
tracker = 'wandb'
my_accelerator = MyAccelerator(tracker)
my_class = MyClass(dataset_dir, accelerator_obj=my_accelerator)
my_class.do_a_thing('labels')
After running that code, the size of my home directory is ____. The output is:
wandb: Tracking run with wandb version 0.12.17
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
Using custom data configuration default
Reusing dataset full_data (/home/aclifton/.cache/huggingface/datasets/full_data/default/0.0.0/7d846c1ada953e8a7f4733f43d00690bcd762473b720dfd4f8e7fde8b4bc2542)
0%| | 0/1 [00:00<?, ?it/s]wandb: Tracking run with wandb version 0.12.17
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
100%|██████████| 1/1 [00:00<00:00, 138.12it/s]
Using custom data configuration default
Reusing dataset full_data (/home/aclifton/.cache/huggingface/datasets/full_data/default/0.0.0/7d846c1ada953e8a7f4733f43d00690bcd762473b720dfd4f8e7fde8b4bc2542)
0%| | 0/1 [00:00<?, ?it/s]wandb: Tracking run with wandb version 0.12.17
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
100%|██████████| 1/1 [00:00<00:00, 128.70it/s]
Using custom data configuration default
Reusing dataset full_data (/home/aclifton/.cache/huggingface/datasets/full_data/default/0.0.0/7d846c1ada953e8a7f4733f43d00690bcd762473b720dfd4f8e7fde8b4bc2542)
100%|██████████| 1/1 [00:00<00:00, 108.42it/s]
wandb: Tracking run with wandb version 0.12.17
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
Using custom data configuration default
Reusing dataset full_data (/home/aclifton/.cache/huggingface/datasets/full_data/default/0.0.0/7d846c1ada953e8a7f4733f43d00690bcd762473b720dfd4f8e7fde8b4bc2542)
100%|██████████| 1/1 [00:00<00:00, 98.82it/s]
#0: 100%|██████████| 1348/1348 [15:48<00:00, 1.42ba/s]
#0: 100%|██████████| 1348/1348 [16:53<00:00, 1.33ba/s]
#1: 100%|██████████| 1348/1348 [17:01<00:00, 1.32ba/s]
#0: 100%|██████████| 1348/1348 [17:43<00:00, 1.27ba/s]
#1: 100%|██████████| 1348/1348 [18:24<00:00, 1.22ba/s]
#0: 96%|█████████ | 1291/1348 [18:27<00:30, 1.86ba/swandb: Waiting for W&B process to finish... (success).█████████ | 1235/1348 [18:27<01:11, 1.57ba/s]
wandb:
#0: 96%|█████████ | 1293/1348 [18:28<00:29, 1.86ba/swandb: You can sync this run to the cloud by running:█████████ | 1237/1348 [18:28<01:09, 1.59ba/s]
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220802_125826-3mxrkyjz
wandb: Find logs at: ./wandb/offline-run-20220802_125826-3mxrkyjz/logs
#1: 100%|██████████| 1348/1348 [18:30<00:00, 1.21ba/s]
#0: 97%|█████████ | 1311/1348 [18:38<00:19, 1.87ba/s]wandb: Waiting for W&B process to finish... (success).█████████ | 1252/1348 [18:37<00:58, 1.63ba/s]
wandb: - 0.000 MB of 0.000 MB uploaded (0.000 MB deduped) wwandb: █████████ | 1253/1348 [18:38<00:58, 1.61ba/s]
#0: 97%|█████████ | 1314/1348 [18:39<00:18, 1.87ba/s]wandb: You can sync this run to the cloud by running:█████████ | 1255/1348 [18:39<00:57, 1.62ba/s]
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220802_125826-2lpxfnpk
wandb: Find logs at: ./wandb/offline-run-20220802_125826-2lpxfnpk/logs
#0: 100%|██████████| 1348/1348 [18:57<00:00, 1.19ba/s]
#1: 95%|█████████ | 1283/1348 [18:56<00:40, 1.62ba/swandb: Waiting for W&B process to finish... (success).█████████ | 1293/1348 [19:03<00:35, 1.54ba/s]
wandb:
wandb: You can sync this run to the cloud by running:█████████ | 1295/1348 [19:04<00:33, 1.58ba/s]
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220802_125826-2lw39b7a
wandb: Find logs at: ./wandb/offline-run-20220802_125826-2lw39b7a/logs
#1: 100%|██████████| 1348/1348 [19:36<00:00, 1.15ba/s]
wandb: Waiting for W&B process to finish... (success).██████████| 1348/1348 [19:36<00:00, 2.13ba/s]
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220802_125826-3nudbfln
wandb: Find logs at: ./wandb/offline-run-20220802_125826-3nudbfln/logs
The files in ~/.cache/huggingface are:
/home/aclifton/.cache/huggingface/datasets
full_data
_home_aclifton_.cache_huggingface_datasets_full_data_default_0.0.0_7d846c1ada953e8a7f4733f43d00690bcd762473b720dfd4f8e7fde8b4bc2542.incomplete.lock
_home_aclifton_.cache_huggingface_datasets_full_data_default_0.0.0_7d846c1ada953e8a7f4733f43d00690bcd762473b720dfd4f8e7fde8b4bc2542.lock
/home/aclifton/.cache/huggingface/datasets/full_data/default/0.0.0/7d846c1ada953e8a7f4733f43d00690bcd762473b720dfd4f8e7fde8b4bc2542
cache-543ca50f512f451b.arrow cache-c433fad29c12001e.arrow cache-e159d7c56a2a81a2.arrow cache-f05fabd9c721d745.arrow dataset_info.json full_data-full.arrow
/home/aclifton/.cache/huggingface/modules
datasets_modules __init__.py
/home/aclifton/.cache/huggingface/modules/datasets_modules
datasets __init__.py __pycache__
/home/aclifton/.cache/huggingface/modules/datasets_modules/datasets
full_dataset full_dataset.lock __init__.py __pycache__
/home/aclifton/.cache/huggingface/modules/datasets_modules/datasets/full_dataset
7d846c1ada953e8a7f4733f43d00690bcd762473b720dfd4f8e7fde8b4bc2542 __init__.py __pycache__
/home/aclifton/.cache/huggingface/modules/datasets_modules/datasets/full_dataset/7d846c1ada953e8a7f4733f43d00690bcd762473b720dfd4f8e7fde8b4bc2542
dataset_infos.json full_dataset.json full_dataset.py __init__.py __pycache__
It looks like using the map() method on a dataset creates copies of that dataset or some other kind of cached files. I'm not sure this is the best way to run operations on a datasets object in the accelerate library, as the number of files also seems to depend on the num_proc parameter (more cached files for higher num_proc).
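To see how much of the growth comes from map() output, a minimal sketch like the following could total the Arrow files under the cache directory. It assumes, based on the listing above, that files written by map() are named cache-*.arrow while the prepared dataset is not; the helper name summarize_arrow_files is hypothetical.

```python
from pathlib import Path


def summarize_arrow_files(cache_dir):
    """Return (map_cache_bytes, other_bytes) for all *.arrow files under cache_dir.

    Assumption: files produced by map() start with "cache-", as in the
    directory listing above; every other .arrow file is counted as "other".
    """
    map_cache_bytes = 0
    other_bytes = 0
    # rglob walks the directory tree recursively.
    for path in Path(cache_dir).expanduser().rglob("*.arrow"):
        size = path.stat().st_size
        if path.name.startswith("cache-"):
            map_cache_bytes += size
        else:
            other_bytes += size
    return map_cache_bytes, other_bytes


if __name__ == "__main__":
    cached, other = summarize_arrow_files("~/.cache/huggingface/datasets")
    print(f"map() cache files: {cached / 1e9:.2f} GB")
    print(f"other arrow files: {other / 1e9:.2f} GB")
```

Running it before and after the map() call would show whether the cache-*.arrow files account for the change in home directory size.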
Q2) Is there a way to avoid creating these files and keep the size of my home directory the same? I'm not particularly tied to this workflow. I can always run my operations outside of the accelerate library when I create the initial full dataset using my loading script.
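One option that may avoid the extra files is passing keep_in_memory=True to map(), which keeps the result in RAM instead of writing it to a cache file. The sketch below is only an illustration under assumptions: the mapped dataset must fit in memory (in every process, if several are launched), num_proc is dropped for simplicity, and the helper names relabel_batch and relabel_in_memory are hypothetical.

```python
def relabel_batch(batch, labels_col, new_label="cool"):
    """Replace every value in `labels_col` of a batch (a dict of lists)."""
    batch[labels_col] = [new_label for _ in batch[labels_col]]
    return batch


def relabel_in_memory(dataset, labels_col):
    """Apply relabel_batch without asking map() to write cache-*.arrow files.

    Assumption: `dataset` is the object returned by load_dataset(), whose
    map() method accepts keep_in_memory. The result lives in RAM only,
    so it has to fit there.
    """
    return dataset.map(
        relabel_batch,
        fn_kwargs={"labels_col": labels_col},
        batched=True,
        keep_in_memory=True,
    )
```

Inside MyClass.do_a_thing this would be called as self.dataset = relabel_in_memory(self.dataset, labels_col_name). Files already written by earlier map() calls can be removed with the dataset's cleanup_cache_files() method, which deletes cache files in the dataset's cache directory other than the ones currently in use.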
Any advice is much appreciated. Thanks in advance!!