Transformers pipeline hangs and produces no output

System Info

  • transformers version: 4.42.4
  • Platform: Linux-5.4.0-105-generic-x86_64-with-glibc2.31
  • Python version: 3.12.4
  • Huggingface_hub version: 0.23.5
  • Safetensors version: 0.4.3
  • Accelerate version: 0.32.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.3.1+cu121 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: No; just a class wrapped around a transformers pipeline.
  • RAM 120GB
  • Architecture: x86_64
  • CPU op-mode(s): 32-bit, 64-bit
  • Byte Order: Little Endian
  • Address sizes: 40 bits physical, 48 bits virtual
  • CPU(s): 24
  • On-line CPU(s) list: 0-23
  • Thread(s) per core: 1
  • Core(s) per socket: 24
  • Socket(s): 1
  • NUMA node(s): 1
  • Vendor ID: AuthenticAMD
  • CPU family: 23
  • Model: 49
  • Model name: AMD EPYC 7282 16-Core Processor
  • Stepping: 0
  • CPU MHz: 2794.748
  • BogoMIPS: 5589.49
  • Hypervisor vendor: KVM
  • Virtualization type: full
  • L1d cache: 1.5 MiB
  • L1i cache: 1.5 MiB
  • L2 cache: 12 MiB
  • L3 cache: 16 MiB
  • NUMA node0 CPU(s): 0-23
  • Vulnerability Itlb multihit: Not affected
  • Vulnerability L1tf: Not affected
  • Vulnerability Mds: Not affected
  • Vulnerability Meltdown: Not affected
  • Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  • Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  • Vulnerability Spectre v2: Mitigation; LFENCE, IBPB conditional, STIBP disabled, RSB filling
  • Vulnerability Srbds: Not affected
  • Vulnerability Tsx async abort: Not affected
  • Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core ssbd ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 wbnoinvd arat umip arch_capabilities

Who can help?

@Narsil @ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

#
# IMPORTS
#
import asyncio  # CORE: asyncio library for handling asynchronous operations.
import functools  # CORE: functools library for handling functional programming.
from concurrent.futures import ThreadPoolExecutor  # CORE: ThreadPoolExecutor for running blocking code off the event loop.
import torch  # PIP: PyTorch library for managing tensors.
import transformers  # PIP: Hugging Face Transformers library for managing the model and tokenizer.
from transformers import AutoTokenizer  # PIP: Tokenizer loader.
from dotsinspace_debugger import Debugger  # PIP: Debugger library for handling errors.


#
# CLASS
#
class Groot:
    # Constructor.
    def __init__(self, directory):
        # Check if MPS is available, else use CUDA if available, otherwise use CPU
        if torch.backends.mps.is_available():
            # Set device to MPS
            self.device = torch.device('mps')
        elif torch.cuda.is_available():
            # Set device to CUDA
            self.device = torch.device('cuda')
        else:
            # Set device to CPU
            self.device = torch.device('cpu')
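
        # NOTE: self.device is computed above but never used below; the
        # pipeline is created with device_map="auto", which decides placement.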

        # Property assignment.
        self.baseModel = 'tiiuae/falcon-40b'
        self.Tokenizer = None
        self.Pipeline = None
        self.Executor = ThreadPoolExecutor()
        
    # Function to asynchronously load the pipeline
    # and tokenizer.
    async def Load(self):
        # Variable assignment.
        functionName = 'Root -> Load'
        Debug = Debugger(name=functionName)

        # Error handling.
        try:
            # Style guide.
            Debug.info(f"Loading root.")

            # Load tokenizer asynchronously.
            self.Tokenizer = await self.AsyncLoadTokenizer(self.baseModel)
            
            # Load pipeline asynchronously.
            self.Pipeline = await self.AsyncLoadPipeline(
                'text-generation',
                model=self.baseModel,
                tokenizer=self.Tokenizer,
                torch_dtype=torch.bfloat16,
                trust_remote_code=True,
                device_map="auto",
            )

            # Style guide.
            Debug.info(f"Loaded root.")

            # Return success.
            return {
                'message': 'Root has been loaded.',
                'status': 'LOAD_SUCCESSFUL'
            }
        except Exception:
            # Report failure, preserving the original traceback.
            raise

    # Helper function for async tokenizer loading.
    async def AsyncLoadTokenizer(self, *args, **kwargs):
        # Get the running event loop (preferred over get_event_loop inside coroutines).
        loop = asyncio.get_running_loop()

        # Variable assignment.
        FunctionToolPipeline = functools.partial(AutoTokenizer.from_pretrained, *args, **kwargs)

        # Run the function in a separate thread.
        return await loop.run_in_executor(self.Executor, FunctionToolPipeline)
    
    # Helper function for async pipeline loading.
    async def AsyncLoadPipeline(self, *args, **kwargs):
        # Get the running event loop (preferred over get_event_loop inside coroutines).
        loop = asyncio.get_running_loop()

        # Variable assignment.
        FunctionToolPipeline = functools.partial(transformers.pipeline, *args, **kwargs)

        # Run the function in a separate thread.
        return await loop.run_in_executor(self.Executor, FunctionToolPipeline)
    
    # Helper function for async pipeline execution
    async def AsyncExecutePipeline(self, *args, **kwargs):
        # Get the running event loop (preferred over get_event_loop inside coroutines).
        loop = asyncio.get_running_loop()

        # Variable assignment.
        FunctionToolPipeline = functools.partial(self.Pipeline, *args, **kwargs)

        # Run the function in a separate thread.
        return await loop.run_in_executor(self.Executor, FunctionToolPipeline)

    # Function to talk to root asynchronously.
    async def Talk(self, message):
        # Variable assignment.
        functionName = 'Root -> Talk'
        Debug = Debugger(name=functionName)

        # Error handling.
        try:
            # Style guide.
            Debug.info(f"Root is responding.")
            
            # If pipeline does not exist.
            # then load pipeline.
            if not self.Pipeline:
                # Load pipeline.
                await self.Load()

            # Get response from root asynchronously.
            response = await self.AsyncExecutePipeline(
                message,
                max_length=200,
                do_sample=True,
                top_k=10,
                num_return_sequences=1,
                eos_token_id=self.Tokenizer.eos_token_id,
                truncation=True,
            )
            print(response)
            # Return response.
            return {
                'response': response,
                'status': 'READ_SUCCESSFUL'
            }
        except Exception:
            # Report failure, preserving the original traceback.
            raise

When the Talk function is called it outputs nothing. Am I missing something here? Note that I did not train anything; this class simply wraps Falcon itself so I can check its responses without any fine-tuning. A minimal synchronous check is sketched below.
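
For what it's worth, here is a minimal synchronous sketch that exercises the same pipeline arguments without the async wrapper. The smaller tiiuae/falcon-7b-instruct checkpoint is an assumption on my part, picked only so the test can finish in reasonable time on a CPU-only host like the one above (falcon-40b in bfloat16 on CPU can take so long to load and generate that it looks like a hang). If this prints output but the wrapper does not, the problem is likely in the async plumbing rather than the model:

import torch
import transformers
from transformers import AutoTokenizer

# ASSUMPTION: smaller checkpoint so the test finishes quickly on CPU.
base_model = 'tiiuae/falcon-7b-instruct'

tokenizer = AutoTokenizer.from_pretrained(base_model)
pipeline = transformers.pipeline(
    'text-generation',
    model=base_model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# Same generation arguments as Talk() above.
response = pipeline(
    "Hello, who are you?",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    truncation=True,
)
print(response)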

Expected behavior

It should output at least a few tokens.
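
For reference, a text-generation pipeline call like the one above normally returns a list of dicts, so I would expect something shaped like the following (illustrative text, not a real run):

[{'generated_text': 'Hello, who are you? I am ...'}]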