Converting text stories into embeddings with metadata and uploading to Pinecone for chatbot and content creation

Hello everyone,

We have embarked on our first project, which involves a few hundred text stories (around 500 MB in total). Our goal is to convert these text files into embeddings, add metadata specifying the source of each story, and upload the data to Pinecone. Once uploaded, we aim to use models such as Koala-13B, GPT4-x-alpaca-13b-native-4bit-128g, and others with our private Pinecone data to develop chatbots and generate content.

We would be grateful for any suggestions, tutorials, or other valuable information to help us bring this project to life. We have all the necessary infrastructure in place and will be using Python as our preferred programming language.

Looking forward to your insights and guidance!

Best regards,
Ram.Sh

Starting with the first step, here is our script. It reads the text files, cleans them, generates embeddings, and uploads the data to Pinecone with the metadata included.

Would that be the advised way to convert the text into vectors for our final goal?

import os
import re
import uuid
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from gensim.models import Word2Vec
import pinecone

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize objects
stop_words = set(stopwords.words('english'))
stemmer = nltk.SnowballStemmer('english')
lemmatizer = nltk.WordNetLemmatizer()

# Define function to clean text
def clean_text(text):
    # Lowercase text
    text = text.lower()
    # Remove non-alphabetic characters
    text = re.sub('[^A-Za-z]+', ' ', text)
    # Tokenize text
    words = word_tokenize(text)
    # Remove stopwords
    words = [word for word in words if word not in stop_words]
    # Rejoin words
    cleaned_text = ' '.join(words)
    return cleaned_text

# Define function to preprocess text for embeddings
def preprocess_text(text):
    # Tokenize the text into words
    words = nltk.word_tokenize(text.lower())
    # Remove stop words
    words = [word for word in words if word not in stop_words]
    # Stem and lemmatize the words
    words = [stemmer.stem(lemmatizer.lemmatize(word, pos='v')) for word in words]
    return words

# Define function to generate a single story embedding
def generate_story_embeddings(text):
    # Preprocess text
    words = preprocess_text(text)
    # Train a Word2Vec model on this story
    # (gensim 4.x renamed the old `size` parameter to `vector_size`)
    model = Word2Vec([words], vector_size=100, window=5, min_count=1, workers=4)
    # Average the word vectors into one fixed-length vector,
    # since Pinecone expects a single vector per record
    embedding = model.wv[words].mean(axis=0)
    return embedding

# Initialize Pinecone (recent pinecone-client versions also require the environment)
pinecone.init(api_key="your-pinecone-api-key", environment="your-pinecone-environment")
pinecone_index = pinecone.Index("your-pinecone-index-name")

# Process text files
data_dir = "data/"

for filename in os.listdir(data_dir):
    if filename.endswith(".txt"):
        # Read text file
        with open(os.path.join(data_dir, filename), "r", encoding="utf-8") as f:
            text = f.read()

        # Clean and generate the story embedding
        cleaned_text = clean_text(text)
        embedding = generate_story_embeddings(cleaned_text)

        # Add metadata with filename and hardcoded category
        metadata = {'filename': filename, 'category': 'hardcoded_category'}

        # Upload the embedding and metadata to Pinecone
        # (upsert takes a list of (id, values, metadata) tuples)
        pinecone_index.upsert(vectors=[(str(uuid.uuid4()), embedding.tolist(), metadata)])

# Note: recent versions of pinecone-client no longer provide pinecone.deinit(),
# so no explicit teardown call is needed here.
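
A caveat we are unsure about: the script above trains a separate Word2Vec model per story, so vectors from different stories would not share an embedding space, which seems like it would make cross-story similarity search unreliable. An alternative we are considering is a pretrained sentence encoder. Below is a minimal sketch, assuming the sentence-transformers package and its all-MiniLM-L6-v2 model (384-dimensional output, so the Pinecone index would need dimension 384); the chunk size, overlap, and index name are placeholders:

import os
import uuid
import pinecone
from sentence_transformers import SentenceTransformer

# Pretrained sentence encoder; all-MiniLM-L6-v2 produces 384-dim vectors
model = SentenceTransformer("all-MiniLM-L6-v2")

pinecone.init(api_key="your-pinecone-api-key", environment="your-pinecone-environment")
index = pinecone.Index("your-pinecone-index-name")  # index dimension must be 384

def chunk_text(text, chunk_size=1000, overlap=200):
    # Split a story into overlapping character chunks so each piece
    # fits comfortably in the encoder's input window
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

data_dir = "data/"
for filename in os.listdir(data_dir):
    if not filename.endswith(".txt"):
        continue
    with open(os.path.join(data_dir, filename), "r", encoding="utf-8") as f:
        text = f.read()
    chunks = chunk_text(text)
    vectors = model.encode(chunks)  # one embedding per chunk
    records = [
        (str(uuid.uuid4()), vec.tolist(),
         {"filename": filename, "chunk": i, "text": chunk})
        for i, (vec, chunk) in enumerate(zip(vectors, chunks))
    ]
    index.upsert(vectors=records)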
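
For the chatbot step itself, our rough plan for retrieval looks like the sketch below (the question text and top_k are placeholders; the retrieved chunks would be pasted into the prompt we send to Koala or Alpaca as context):

# Embed the user's question with the same encoder used at indexing time
query = "What happens in the story about the lighthouse keeper?"
query_vector = model.encode([query])[0].tolist()

# Retrieve the most similar story chunks along with their metadata
results = index.query(vector=query_vector, top_k=5, include_metadata=True)

# Assemble the retrieved chunk texts into a context block for the LLM prompt
context = "\n\n".join(match["metadata"]["text"] for match in results["matches"])
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"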