Announcing ConvaiCausalLM: A Foundational Hindi Causal Language Model (102M, yes, small!)

Sharing our new model, ConvaiCausalLM: a 102M-parameter Hindi language model we’ve been working on at Convai Innovations! :tada:

Most LLMs focus on English, but Hindi is spoken by 600+ million people worldwide and deserves more attention! That’s why we built this foundational model for Hindi text generation from scratch.

Where to find it:

  • Model: convaiinnovations/hindi-causal-lm
  • Tokenizer: convaiinnovations/hindi-embedding-foundational-model (important - you need this specific tokenizer!)
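
If you'd rather fetch everything programmatically instead of cloning, huggingface_hub's `snapshot_download` works too. A minimal sketch, nothing model-specific, just standard Hub downloading with the repo ids from the links above:

```python
# Download both repos to the local HF cache and print their local paths.
from huggingface_hub import snapshot_download

model_dir = snapshot_download(repo_id="convaiinnovations/hindi-causal-lm")
tokenizer_dir = snapshot_download(repo_id="convaiinnovations/hindi-embedding-foundational-model")
print(model_dir)
print(tokenizer_dir)
```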

Quick facts about the model:

  • ~102 million parameters
  • Decoder-only Transformer (Pre-LayerNorm)
  • Uses Grouped Query Attention (GQA) with 16 query heads and 4 KV heads (see the sketch after this list)
  • Standard learned positional embeddings (not RoPE)
  • Context length: 512 tokens
  • Vocab size: 16,000 tokens
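
For anyone curious what the GQA layout means in practice: with 16 query heads and only 4 KV heads, each KV head is shared by a group of 4 query heads. A toy, shape-only sketch (illustrative names, not the model's actual modules; the causal mask is omitted):

```python
# Shows how 4 KV heads are broadcast to serve 16 query heads in GQA.
import torch

batch, seq, hidden = 1, 8, 768
n_q_heads, n_kv_heads, head_dim = 16, 4, 768 // 16  # head_dim = 48

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Each KV head serves a group of 16 / 4 = 4 query heads: repeat K/V along the head axis.
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)   # (1, 16, 8, 48)
v = v.repeat_interleave(group, dim=1)

attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim ** 0.5, dim=-1)
out = attn @ v                          # (1, 16, 8, 48)
print(out.shape)
```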

We built this model with a custom SentencePiece tokenizer specifically designed for Hindi. This was one of our key design decisions to better handle the nuances of Hindi script and vocabulary.
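
If you want to poke at the tokenizer on its own, the raw sentencepiece library is enough. A small sketch, assuming tokenizer.model is sitting in your working directory (e.g. after cloning the repo as shown below):

```python
# Inspect the custom Hindi SentencePiece tokenizer directly.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
pieces = sp.encode("भारत एक विशाल देश है", out_type=str)
ids = sp.encode("भारत एक विशाल देश है", out_type=int)
print(pieces)        # subword pieces tuned for Devanagari
print(ids)           # token ids from the 16,000-entry vocabulary
print(sp.decode(ids))
```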

Tech details (for the nerds among us :grinning_face_with_smiling_eyes:):

  • Hidden layers: 12
  • Hidden size: 768
  • Intermediate size: 3072
  • Activation: SiLU (SwiGLU) in the feed-forward layers
  • Standard torch.nn.LayerNorm for normalization
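
Put together, the configuration implied by those numbers looks roughly like the dictionary below. Key names here are illustrative; the authoritative ones live in the repo's config.json, which ConvaiCausalLMConfig consumes in the loading code further down:

```python
# Rough shape of the model configuration implied by the numbers above.
config = {
    "vocab_size": 16000,
    "hidden_size": 768,
    "num_hidden_layers": 12,
    "num_attention_heads": 16,     # query heads
    "num_key_value_heads": 4,      # KV heads (GQA)
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "hidden_act": "silu",
}
```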

What can you use it for?

  • Basic Hindi text generation
  • Fine-tuning for specific Hindi NLP tasks (summarization, QA, etc.) - see the sketch after this list
  • Experimenting with a relatively lightweight Hindi LM
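
If fine-tuning is what you're after, the native classes from the usage section below are all you need. A deliberately minimal training-step sketch; it assumes the model returns raw logits of shape (batch, seq, vocab), which is how the generation code below treats its output:

```python
# Minimal causal-LM fine-tuning step on a list of Hindi strings.
import torch
import torch.nn.functional as F

def finetune_step(model, tokenizer, texts, optimizer, device, max_len=512):
    model.train()
    losses = []
    for text in texts:
        ids = tokenizer.sp_model.EncodeAsIds(text)[:max_len]
        input_ids = torch.tensor([ids], dtype=torch.long, device=device)
        logits = model(input_ids)                       # (1, seq, vocab)
        # Next-token prediction: shift logits/labels by one position.
        loss = F.cross_entropy(
            logits[:, :-1, :].reshape(-1, logits.size(-1)),
            input_ids[:, 1:].reshape(-1),
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return sum(losses) / len(losses)

# optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# avg_loss = finetune_step(model, tokenizer, hindi_texts, optimizer, device)
```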

How to use it:

The easiest way to get started is:

  1. Clone the repo:

```bash
git clone https://huggingface.co/convaiinnovations/hindi-causal-lm
cd hindi-causal-lm
```

  2. Then you can load and run the model with our native implementation:

```python
import torch
from hindi_embeddings import SentencePieceTokenizerWrapper
from convaicausallm_model import ConvaiCausalLM, ConvaiCausalLMConfig
from safetensors.torch import load_file
import os 

class HindiLLMGenerator:
    def __init__(self, model_path, device=None):
        # Set device
        if device is None:
            self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        else:
            self.device = torch.device(device)
            
        print(f"Using device: {self.device}")
        
        # Load tokenizer
        tokenizer_path = os.path.join(model_path, "tokenizer.model")
        self.tokenizer = SentencePieceTokenizerWrapper(tokenizer_path)
        
        # Load model config
        config_path = os.path.join(model_path, "config.json")
        import json
        with open(config_path, 'r') as f:
            config_dict = json.load(f)
            
        self.config = ConvaiCausalLMConfig(**config_dict)
        
        # Load model - try safetensors first, fall back to PyTorch bin if needed
        safetensors_path = os.path.join(model_path, "model.safetensors")
        pytorch_path = os.path.join(model_path, "pytorch_model.bin")
        
        self.model = ConvaiCausalLM(self.config)
        
        # Check which format is available and load accordingly
        if os.path.exists(safetensors_path):
            print("Loading model from SafeTensors")
            state_dict = load_file(safetensors_path, device="cpu")
            self.model.load_state_dict(state_dict)
        elif os.path.exists(pytorch_path):
            print("Loading model from PyTorch bin")
            self.model.load_state_dict(torch.load(pytorch_path, map_location="cpu"))
        else:
            raise FileNotFoundError(f"No model weights found in {model_path}")
        
        # Move model to device and set to evaluation mode
        self.model.to(self.device)
        self.model.eval()
    
    def generate(self, prompt, max_length=100, temperature=0.8, top_k=50, top_p=0.9, 
                 repetition_penalty=1.1, do_sample=True):
        # Tokenize the prompt
        input_ids = self.tokenizer.sp_model.EncodeAsIds(prompt)
        input_tensor = torch.tensor([input_ids], dtype=torch.long).to(self.device)
        
        # Start with the input tensor
        output_sequence = input_tensor.clone()
        
        # Generate tokens one by one
        for _ in range(max_length - len(input_ids)):
            with torch.no_grad():
                # Get the model's output for the current sequence
                outputs = self.model(output_sequence)
                next_token_logits = outputs[0, -1, :]
                
                # Apply temperature
                if temperature > 0:
                    next_token_logits = next_token_logits / temperature
                
                # Apply repetition penalty (sign-aware, so already-seen tokens are
                # always pushed down, even when their logits are negative)
                if repetition_penalty > 1.0:
                    for token_id in set(output_sequence[0].tolist()):
                        if next_token_logits[token_id] > 0:
                            next_token_logits[token_id] /= repetition_penalty
                        else:
                            next_token_logits[token_id] *= repetition_penalty
                
                # Filter with top-k sampling
                if top_k > 0:
                    top_k_values, top_k_indices = torch.topk(next_token_logits, top_k)
                    next_token_logits = torch.full_like(next_token_logits, float('-inf'))
                    next_token_logits.scatter_(0, top_k_indices, top_k_values)
                
                # Filter with top-p/nucleus sampling
                if top_p < 1.0 and do_sample:
                    sorted_logits, sorted_indices = torch.sort(next_token_logits, descending=True)
                    cumulative_probs = torch.cumsum(torch.softmax(sorted_logits, dim=-1), dim=-1)
                    
                    # Remove tokens with cumulative probability above the threshold
                    sorted_indices_to_remove = cumulative_probs > top_p
                    # Shift the indices to the right to keep the first token above the threshold
                    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
                    sorted_indices_to_remove[..., 0] = 0
                    
                    indices_to_remove = sorted_indices[sorted_indices_to_remove]
                    next_token_logits[indices_to_remove] = float('-inf')
                
                # Sample or choose the next token
                if do_sample:
                    probs = torch.softmax(next_token_logits, dim=-1)
                    next_token = torch.multinomial(probs, num_samples=1)
                else:
                    next_token = torch.argmax(next_token_logits, dim=-1).unsqueeze(0)
                
                # Add the next token to the sequence
                output_sequence = torch.cat([output_sequence, next_token.unsqueeze(0)], dim=1)
                
                # Check if we've generated an end token
                if next_token.item() == self.tokenizer.eos_token_id:
                    break
        
        # Decode the generated sequence
        generated_ids = output_sequence[0].tolist()
        generated_text = self.tokenizer.sp_model.DecodeIds(generated_ids)
        
        return generated_text

# Example usage
if __name__ == "__main__":
    generator = HindiLLMGenerator(".")  # Use current directory after cloning
    result = generator.generate("भारत एक विशाल देश है")
    print(result)
```
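
Once the generator is up, the sampling knobs are worth experimenting with. For instance, the same prompt with greedy decoding vs. a more exploratory setup (both use only the parameters exposed by the generate method above):

```python
# Deterministic decoding (argmax at every step) vs. a hotter sampling setup.
greedy = generator.generate("भारत एक विशाल देश है", do_sample=False)
creative = generator.generate("भारत एक विशाल देश है", temperature=1.0, top_p=0.95, top_k=100)
print(greedy)
print(creative)
```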

Alternatively, you can use the transformers library as shown in my previous example, if you prefer that approach.

Limitations to keep in mind:

  • It’s a foundational model, so expect some repetition in longer generations
  • You MUST use our custom tokenizer (regular Hindi tokenizers won’t work well)
  • Limited to a 512-token context (I know, I know - we’re working on it!) - see the prompt-truncation snippet after this list
  • Knowledge is limited to what was in our training data
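
Because of the 512-token window, it's worth trimming long prompts before calling generate. A simple keep-the-tail truncation sketch, using the same SentencePiece calls as the code above (truncate_prompt is just a helper name here, not part of the repo):

```python
def truncate_prompt(tokenizer, prompt, max_prompt_tokens=256):
    # Keep only the most recent tokens so prompt + generation stays under
    # the model's 512-token context window.
    ids = tokenizer.sp_model.EncodeAsIds(prompt)
    if len(ids) > max_prompt_tokens:
        ids = ids[-max_prompt_tokens:]
    return tokenizer.sp_model.DecodeIds(ids)

# Example: result = generator.generate(truncate_prompt(generator.tokenizer, long_hindi_text))
```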

What’s next?

We’re working on a PR to get the ConvaiCausalLM architecture into the main transformers library. This should make it even easier to use.

I’d really appreciate your feedback! If you try it out, let me know how it performs, what you think of the tokenizer, and any interesting applications you find for it.

Thanks for checking out our work!

~ The Convai Innovations Team

P.S. Looking for collaborators interested in Hindi NLP! DM me if you want to chat about potential projects :blush:
