Q1: What's the equivalent of SBERT's model.encode() when using the Hugging Face transformers library with AutoModel.from_pretrained('distilbert-base-cased')?
Here is the SBERT code that works, whose behavior I want to replicate with HF's DistilBERT:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('distilbert-base-nli-mean-tokens')
distilBERT_sentence_embeddings = model.encode(list(x_train), show_progress_bar=True)
Here is what I've tried, but it does not give me the embeddings I'm looking for:
from transformers import AutoTokenizer, AutoModel
import torch

# Use the Auto* classes that were actually imported (the original mixed in
# DistilBertTokenizer/DistilBertModel, which would raise a NameError)
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = AutoModel.from_pretrained('distilbert-base-uncased')

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)  # outputs.last_hidden_state: (1, seq_len, 768)
Note that I haven't plugged in the x_train list yet; the goal is to eventually substitute it for "Hello, my dog is cute" once the code works.
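For context on what I believe encode() does under the hood: my understanding is that SBERT's "mean-tokens" models mean-pool the per-token embeddings using the attention mask, rather than taking a single token's output. A minimal sketch of that pooling (the helper name mean_pool is mine, and the tensor shapes assume the HF DistilBERT output above) would be:

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool token embeddings, ignoring padded positions.

    last_hidden_state: (batch, seq_len, hidden) -- e.g. outputs.last_hidden_state
    attention_mask:    (batch, seq_len)         -- e.g. inputs["attention_mask"]
    """
    mask = attention_mask.unsqueeze(-1).float()     # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)  # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)        # number of real tokens per sentence
    return summed / counts                          # (batch, hidden)

# With the HF model above, sentence embeddings would then be:
# sentence_embeddings = mean_pool(outputs.last_hidden_state, inputs["attention_mask"])
```

If this is right, the per-sentence vectors it produces would play the same role as the rows returned by SBERT's model.encode().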
Q2: Additionally, this SBERT model is trained on an NLI task using a siamese network that requires pairs of inputs. Is it acceptable to use it for non-NLI tasks (such as sequence classification or sentiment analysis), and for tasks that don't have sentence pairs?