Using hugging face models with private company data?

I have an internal hackathon project idea for my company that involves training an LLM on some released and unreleased user manual documents. I can’t use ChatGPT, but having just discovered Hugging Face, it might be exactly what I need, as it can work offline with pretrained models.

I’ve also recently discovered LlamaIndex and LangChain! They both appear to be similar in that they let you use private data with LLMs, but I’m unsure whether they are something I need to use.

Can anyone shed light on whether my understanding of the above is correct, or share their experience if they’ve been in a similar situation where they needed to use an LLM with private data?

Many thanks,
Sean


TL;DR: You can install the transformers library (🤗 Transformers) plus a backend such as PyTorch or TensorFlow, download an appropriate pre-trained model from the Hugging Face Hub (check that its license allows commercial usage), and use it all entirely locally. Fine-tuning is also possible.
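For example, a minimal local-inference sketch (using "google/flan-t5-base" here purely as an illustration of a permissively licensed checkpoint; pick whatever model suits your task) might look like:

from transformers import pipeline

# The checkpoint is downloaded once and cached; after that, inference runs
# entirely on your own machine and no data is sent anywhere.
generator = pipeline("text2text-generation", model="google/flan-t5-base")

print(generator("Summarize: Power off the device before replacing the battery.")[0]["generated_text"])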

Longer answer from ChatGPT on “how can I use and fine-tune a model from Hugging Face locally on confidential data?”:

Fine-tuning a model from Hugging Face’s Transformers library on confidential data can be done locally, ensuring data privacy. Here’s a step-by-step guide to help you through the process.

Step 1: Install Required Packages

First, make sure you’ve installed the Transformers library and a suitable backend (e.g., PyTorch or TensorFlow).

pip install transformers
pip install torch  # for PyTorch

Step 2: Data Preparation

Prepare your confidential data, ensuring it is divided into training, validation, and test sets. Tokenize the data and convert it into a format compatible with the model you are using.
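As a rough sketch of the split (this assumes the datasets library, pip install datasets, and a hypothetical local manuals.csv file with a text column):

from datasets import load_dataset

# Load confidential documents from a local file only; nothing is uploaded
dataset = load_dataset("csv", data_files="manuals.csv")["train"]

# Example 80/10/10 split into train / validation / test sets
splits = dataset.train_test_split(test_size=0.2, seed=42)
holdout = splits["test"].train_test_split(test_size=0.5, seed=42)
train_dataset = splits["train"]
val_dataset = holdout["train"]
test_dataset = holdout["test"]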

Step 3: Load Pre-trained Model

Load the pre-trained model and corresponding tokenizer.

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# "model-name" is a placeholder for the checkpoint you want to fine-tune
# (e.g. "distilbert-base-uncased"); NUM_LABELS is the number of classes in your task.
tokenizer = AutoTokenizer.from_pretrained("model-name")
model = AutoModelForSequenceClassification.from_pretrained("model-name", num_labels=NUM_LABELS)

Step 4: Fine-Tuning

Fine-tune the model on your confidential data. If you are using PyTorch, this might look like:

from torch.utils.data import DataLoader
from torch.optim import AdamW  # transformers.AdamW is deprecated; use torch's implementation

# Tokenize your training data and build train_dataset
# ...

# Create a DataLoader
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Initialize optimizer
optimizer = AdamW(model.parameters(), lr=1e-5)

# Training loop
model.train()
epochs = 3
for epoch in range(epochs):
    for batch in train_loader:
        optimizer.zero_grad()
        inputs = tokenizer(batch['text'], padding=True, truncation=True, return_tensors="pt")
        labels = batch['labels']
        outputs = model(**inputs, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

Step 5: Model Evaluation

After fine-tuning, evaluate the model’s performance on a validation set to ensure it meets your criteria.
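A rough sketch of such an evaluation (assuming a val_loader built the same way as the train_loader above and a classification task):

import torch

# Compute accuracy on the validation set
model.eval()
correct, total = 0, 0
with torch.no_grad():
    for batch in val_loader:
        inputs = tokenizer(batch['text'], padding=True, truncation=True, return_tensors="pt")
        predictions = model(**inputs).logits.argmax(dim=-1)
        correct += (predictions == batch['labels']).sum().item()
        total += len(batch['labels'])

print(f"Validation accuracy: {correct / total:.3f}")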

Step 6: Use Model

Once you’re satisfied with the performance, the fine-tuned model can be integrated into your application and used locally for inference, ensuring that your confidential data does not leave your local environment.

Step 7: Save Fine-Tuned Model

You can save your fine-tuned model locally for future use.

model.save_pretrained("path/to/save/")
tokenizer.save_pretrained("path/to/save/")  # save the tokenizer alongside the model
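Later, for local inference (as in Step 6), the saved model can be reloaded from that directory, roughly like this:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Reload the fine-tuned model from local disk; no network access is needed
tokenizer = AutoTokenizer.from_pretrained("path/to/save/")
model = AutoModelForSequenceClassification.from_pretrained("path/to/save/")
model.eval()

inputs = tokenizer("Example sentence from a user manual.", return_tensors="pt")
with torch.no_grad():
    predicted_class = model(**inputs).logits.argmax(dim=-1).item()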

Please note that this is a simplified guide. The actual implementation may require additional steps such as data preprocessing, handling imbalanced data, model validation, and more. Always remember to review the licensing terms of any pre-trained model you decide to use.


Excellent reply. Thanks for the detailed information and the GPT answer. Is there a need to use LangChain then? It does seem to hook into Hugging Face nicely by the looks of it.

I am actually not quite sure about this question (when would I need LangChain as a Hugging Face user, or choose LangChain over HF?). You might find this short tutorial interesting, which combines HF and LangChain in a complete image-to-audio-story demo application (including a GUI): Hugging Face and LangChain Tutorial
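For a flavour of how the two can fit together: LangChain can wrap a locally running Hugging Face pipeline behind its LLM interface. A rough sketch (the exact import path and call style depend on your LangChain version, and "google/flan-t5-base" is just an example model):

from langchain.llms import HuggingFacePipeline

# Runs a local Hugging Face model behind LangChain's LLM interface;
# the model is downloaded once and everything then stays on your machine.
llm = HuggingFacePipeline.from_model_id(
    model_id="google/flan-t5-base",
    task="text2text-generation",
)

print(llm("How do I reset the device to factory settings?"))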

Since I find the question interesting, I also asked ChatGPT “How do HF and langchain compare?”. Here is the answer which I found quite informative (with the usual caution):

ChatGPT: As of my last update in January 2022, Hugging Face (HF) and Langchain serve different needs and operate in distinct contexts within the AI and NLP landscape. Here are some key differences:

Hugging Face:

  1. Open Source: HF has a strong commitment to open-source development. Their Transformers library is widely used and community-supported.

  2. Extensive Model Hub: HF offers a broad range of pre-trained models for various languages and tasks, such as BERT, GPT-2, T5, and more.

  3. Community and Collaboration: HF has a large user base contributing models, fine-tuning techniques, and other resources.

  4. Research-Oriented: HF actively collaborates with research institutions and aims to bring cutting-edge research into their platform.

  5. General NLP: HF focuses on general-purpose NLP models and tasks such as translation, summarization, classification, etc.

  6. Commercial Services: HF also provides commercial services like model hosting, data annotation, and enterprise solutions.

Langchain:

  1. Domain-Specific: Langchain seems to focus on specific industry applications, such as finance, rather than providing a general-purpose NLP library.

  2. Custom Solutions: They may offer tailored NLP solutions designed for specific enterprise needs.

  3. End-to-End Service: Langchain might provide an end-to-end service that includes not just the model but also the data pipeline, integration, and other aspects.

  4. Proprietary: Unlike HF, Langchain may operate on a more proprietary basis, offering custom solutions as a service.

  5. Limited Model Range: They may not offer the wide range of models and tasks that HF provides, focusing instead on specialized applications.

  6. Business-Oriented: The focus might be more on delivering immediate business value through specialized applications rather than contributing to or utilizing open-source developments.

Because information about Langchain is limited in my database, the above observations may not fully capture the scope and offerings of Langchain.

In summary, while Hugging Face aims to be a general-purpose, open-source NLP resource, Langchain appears to be more industry-specific and may operate on a proprietary basis. Depending on your needs—whether you are looking for a wide range of pre-trained models and community support, or a custom, industry-specific solution—one may be more suitable than the other.

Another option would be to investigate the Enterprise version of Hugging Face. It might be cheaper than running your own GPUs locally, depending on whether or not you already have a powerful enough computer.

When working with private data and language models, privacy and security considerations are paramount. If you’re dealing with sensitive or confidential information, you should take steps to ensure that the data is protected. Some general strategies to consider include:

  • Data Anonymization: Remove or encrypt any personally identifiable information (PII) from the documents before training or using the models.
  • Local Deployment: If possible, deploy the models on your own infrastructure to have more control over data security. Some pre-trained models can be fine-tuned on your private data and used locally.
  • Access Control: Implement strict access control mechanisms to restrict who can access and use the models and data.

It sounds like you have an interesting project idea for your internal hackathon involving training a large language model (LLM) on user manual documents. I’ll provide some clarification on the tools you mentioned: Hugging Face, LLMs, LlamaIndex, and LangChain.

Hugging Face:

  • Overview: Hugging Face is a platform that provides a variety of natural language processing (NLP) resources, including pre-trained models, datasets, and tools for working with transformers.
  • Relevance: Hugging Face’s Transformers library offers easy access to pre-trained models, including those for language generation. It provides a wide range of pre-trained models, and you can fine-tune them on your specific task or data.

Considerations for Using LLMs with Private Data:

  • Privacy and Compliance: Ensure that your approach complies with privacy regulations and your company’s data handling policies.
  • Data Security: Evaluate tools like Llama Index or LangChain for secure interactions with models if you’re dealing with sensitive or private information.
  • Fine-tuning: If fine-tuning on private data is part of your plan, be cautious about potential information leakage from the training data.

@theseancronin Did you manage to achieve what you wanted? We’ve actually been building this functionality into both the desktop application and the Python SDK of our federated AI platform (www.bitfount.com). We’re planning to make this free to use with HF models. We’d love to get your feedback on it if you’re open to trying it out.