How to run Llama 2 7B Chat locally on a 3060 with 6 GB of VRAM?

I am trying to use Llama 2 7B Chat for one of my RAG projects, but I don't know how to run it locally on a Linux machine with a 3060 that only has 6 GB of VRAM. Any help would be highly appreciated. Thanks. Here is the command I ran and the error I get:

torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 512 --max_batch_size 6
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
/home/aryan/miniconda3/envs/pytorch/lib/python3.12/site-packages/torch/__init__.py:696: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at /opt/conda/conda-bld/pytorch_1708025845206/work/torch/csrc/tensor/python_tensor.cpp:451.)
  _C._set_default_tensor_type(t)
Traceback (most recent call last):
  File "/media/aryan/sandisk_ex/llama2/llama/example_chat_completion.py", line 104, in <module>
    fire.Fire(main)
  File "/home/aryan/miniconda3/envs/pytorch/lib/python3.12/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aryan/miniconda3/envs/pytorch/lib/python3.12/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/home/aryan/miniconda3/envs/pytorch/lib/python3.12/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/media/aryan/sandisk_ex/llama2/llama/example_chat_completion.py", line 35, in main
    generator = Llama.build(
                ^^^^^^^^^^^^
  File "/media/aryan/sandisk_ex/llama2/llama/llama/generation.py", line 119, in build
    model = Transformer(model_args)
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/aryan/sandisk_ex/llama2/llama/llama/model.py", line 443, in __init__
    self.layers.append(TransformerBlock(layer_id, params))
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/aryan/sandisk_ex/llama2/llama/llama/model.py", line 375, in __init__
    self.attention = Attention(args)
                     ^^^^^^^^^^^^^^^
  File "/media/aryan/sandisk_ex/llama2/llama/llama/model.py", line 228, in __init__
    self.wo = RowParallelLinear(
              ^^^^^^^^^^^^^^^^^^
  File "/home/aryan/miniconda3/envs/pytorch/lib/python3.12/site-packages/fairscale/nn/model_parallel/layers.py", line 349, in __init__
    self.weight = Parameter(torch.Tensor(self.out_features, self.input_size_per_partition))
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 5.77 GiB of which 39.12 MiB is free. Process 35536 has 17.52 MiB memory in use. Including non-PyTorch memory, this process has 5.12 GiB memory in use. Of the allocated memory 5.00 GiB is allocated by PyTorch, and 1.83 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[2024-03-09 00:21:33,658] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 62595) of binary: /home/aryan/miniconda3/envs/pytorch/bin/python
Traceback (most recent call last):
  File "/home/aryan/miniconda3/envs/pytorch/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.2.1', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aryan/miniconda3/envs/pytorch/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/aryan/miniconda3/envs/pytorch/lib/python3.12/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/aryan/miniconda3/envs/pytorch/lib/python3.12/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/aryan/miniconda3/envs/pytorch/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aryan/miniconda3/envs/pytorch/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_chat_completion.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-09_00:21:33
  host      : ar
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 62595)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
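From what I understand, the fp16 7B weights alone are roughly 13 GB, so they won't fit in 6 GB of VRAM no matter how small max_seq_len and max_batch_size are made; running a quantized copy of the model seems to be the usual workaround. Below is a minimal sketch of what 4-bit loading might look like with Hugging Face transformers + bitsandbytes instead of the Meta reference code above (this assumes the gated meta-llama/Llama-2-7b-chat-hf checkpoint on Hugging Face and a working install of transformers, accelerate and bitsandbytes; it is not the official example script):

# Rough sketch: load Llama-2-7b-chat in 4-bit so the weights take ~3.5-4 GB of VRAM.
# Assumes: pip install transformers accelerate bitsandbytes
# and access granted to the meta-llama/Llama-2-7b-chat-hf repo.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"  # gated repo; requires accepting Meta's license

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit at load time
    bnb_4bit_quant_type="nf4",              # NF4 quantization
    bnb_4bit_compute_dtype=torch.float16,   # do the matmuls in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                      # put layers on the 6 GB GPU, spill to CPU if needed
)

prompt = "[INST] What is retrieval-augmented generation? [/INST]"  # Llama 2 chat prompt format
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Alternatively, a 4-bit GGUF quant of the same model running under llama.cpp (with some layers offloaded to the GPU via -ngl) should also fit comfortably on a 6 GB card, if staying with the PyTorch stack is not a requirement.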