My gut feeling is one thing, but is there a way to break down text into its most granular form, one that can be applied to any model? (For instance IOB tagging, tokens, etc.?)
hey @Joanna if i understand correctly, your question is fundamentally about how we tokenize raw text into tokens, right? for example, BERT uses a WordPiece tokenizer that can be used for all downstream tasks, provided some alignment between the tokens and labels is given for tasks like NER.
there’s a nice walkthrough of the various tokenization strategies employed in transformers here: [Summary of the tokenizers](https://huggingface.co/docs/transformers/tokenizer_summary)
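to make the token/label alignment point concrete, here’s a minimal sketch (my own toy illustration, not from the docs; the sentence and IOB tags are made up) of how a fast tokenizer’s `word_ids()` can propagate word-level labels onto WordPiece sub-tokens:

```python
# illustrative only: align word-level IOB tags with WordPiece sub-tokens
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

words = ["Joanna", "visited", "Tokyo"]  # made-up example sentence
labels = ["B-PER", "O", "B-LOC"]        # one IOB tag per word

# is_split_into_words=True tells the tokenizer the input is pre-split
encoding = tokenizer(words, is_split_into_words=True)
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# special tokens ([CLS]/[SEP]) appear, and rarer words may split into '##' pieces

# word_ids() maps each sub-token back to its originating word (None for
# special tokens), so every sub-token can inherit its word's IOB tag
aligned = [
    "IGN" if word_id is None else labels[word_id]
    for word_id in encoding.word_ids()
]
print(aligned)
```

(in the transformers token-classification examples, those ignored positions are usually set to label id -100 so the loss function skips them)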
Thank you @lewtun this is really helpful and straightforward. Depending on the goal and purpose of processing the text, do you think text tokenized one way could then be passed into an NLTK, spaCy, or HuggingFace model? Or could text be broken down into some format that moves through/across multiple models from various software libraries? (thank you again, I have so much respect for Hugging Face!)
this idea of having interoperable tokenizers is an interesting one, although there’s a lot more involved when interfacing with the API of each framework, and tokenization doesn’t seem to be the bottleneck across them. you might be interested in checking out the [Tokenizers library](https://github.com/huggingface/tokenizers), which is framework-agnostic and quite versatile
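for instance, here’s a tiny sketch of what training a tokenizer from scratch with that library looks like (the `corpus.txt` file is a placeholder you’d swap for your own data):

```python
# sketch of the framework-agnostic `tokenizers` library:
# train a BPE tokenizer on a text file and encode a sentence
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # placeholder corpus

output = tokenizer.encode("Hello, y'all! How are you?")
print(output.tokens)

# the trained tokenizer serializes to a single self-contained JSON file
tokenizer.save("tokenizer.json")
```

the resulting `tokenizer.json` is one self-contained file, which is a big part of what makes it portable across frameworks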
Thank you, what do you think would be the bottleneck?
as far as interoperability across frameworks is concerned, here are a few examples that i’ve personally run into before:

- saving a tokenizer in one framework (e.g. `transformers`) and re-using it in another (e.g. `scikit-learn`)
- using `transformers` or `spacy` tokenizers inside `scikit-learn` pipelines (sketched below)

for approaches that try to address these issues, you might find ONNX and [tokenwiser](https://github.com/koaning/tokenwiser) of interest.
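to illustrate the second bullet, here’s a rough sketch (my own toy example, nothing official; the texts and labels are invented) that drops a `transformers` WordPiece tokenizer into a `scikit-learn` pipeline by using it as the analyzer of a `CountVectorizer`:

```python
# sketch: a transformers tokenizer feeding a scikit-learn text pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from transformers import AutoTokenizer

hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

pipe = make_pipeline(
    # use WordPiece sub-tokens instead of scikit-learn's default word regex
    CountVectorizer(analyzer=hf_tokenizer.tokenize),
    LogisticRegression(),
)

texts = ["i loved this movie", "terrible and boring"]  # toy data
labels = [1, 0]
pipe.fit(texts, labels)
print(pipe.predict(["an absolutely wonderful film"]))
```

the design choice here is that `CountVectorizer` accepts any callable as its analyzer, so the HF tokenizer slots in without scikit-learn knowing anything about transformers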
hth!