A universal granular method to break down text for modeling

My gut feeling is one thing, but:

Is there a way to break down text into its most granular form so that it can be applied to any model? (For instance, IOB tagging, tokens, etc.?)

hey @Joanna, if I understand correctly, your question is fundamentally about how we tokenize raw text into tokens, right? For example, BERT uses a WordPiece tokenizer that can be used for all downstream tasks, provided some alignment between the tokens and labels is supplied for tasks like NER.
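
as a rough illustration (the words and IOB tags below are just placeholders), that alignment can be done with a fast tokenizer's `word_ids()` mapping:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# toy example: one IOB tag per whitespace-separated word
words = ["Joanna", "lives", "in", "Berlin"]
labels = ["B-PER", "O", "O", "B-LOC"]

encoding = tokenizer(words, is_split_into_words=True)

aligned = []
for word_id in encoding.word_ids():
    if word_id is None:                  # special tokens like [CLS] / [SEP]
        aligned.append("O")
    else:
        aligned.append(labels[word_id])  # subword pieces inherit their word's tag

print(encoding.tokens())  # WordPiece tokens, some words split into '##' pieces
print(aligned)            # one label per token, ready for NER training
```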

there’s a nice walkthrough of the various tokenization strategies employed in transformers here: Summary of the tokenizers — transformers 4.5.0.dev0 documentation
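
to get a quick feel for the differences, you can compare how a couple of pretrained tokenizers split the same text (the exact splits depend on each model's vocabulary):

```python
from transformers import AutoTokenizer

for name in ["bert-base-uncased", "gpt2"]:
    tok = AutoTokenizer.from_pretrained(name)
    # WordPiece marks continuation pieces with '##';
    # GPT-2's byte-level BPE marks pieces that follow a space with 'Ġ'
    print(name, tok.tokenize("tokenization is granular"))
```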

Thank you @lewtun, this is really helpful and straightforward. Depending on the goal and purpose of processing text, do you think text tokenized one way could then be passed into an NLTK, spaCy, or Hugging Face model? Or is there a way of breaking down text into some format that can move through/across multiple models from various software libraries? (Thank you again, I have so much respect for Hugging Face!)

this idea of having interoperable tokenizers is an interesting one, although there’s a lot more involved when interfacing with the API of each framework, and tokenization doesn’t seem to be the bottleneck across them. You might be interested in checking out the :hugs: Tokenizers library (link), which is framework-agnostic and quite versatile.
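
as a rough sketch of how it works (the corpus file name here is a placeholder), you can train a small BPE tokenizer and save it as a single JSON file that any framework can load:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# build and train a byte-pair-encoding tokenizer from scratch
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["my_corpus.txt"], trainer=trainer)  # hypothetical corpus file

# the trained tokenizer serializes to one JSON file and can be re-loaded anywhere
tokenizer.save("tokenizer.json")
output = tokenizer.encode("Hello, how are you?")
print(output.tokens, output.ids)
```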

Thank you, what do you think would be the bottleneck?

as far as interoperability across frameworks is concerned, here are a few examples that I’ve personally run into before:

  • training a model in one framework (e.g. transformers) and re-using it in another (e.g. scikit-learn)
  • using components from one framework inside another, e.g. transformers or spaCy components inside scikit-learn pipelines (sketched below)

for approaches that try to address these issues, you might find ONNX and [tokenwiser](https://github.com/koaning/tokenwiser) of interest.
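
to make the second bullet concrete, here’s a minimal sketch (my own wrapper, not tokenwiser's API; the model name and toy data are placeholders) of using a transformers model as a featurizer inside a scikit-learn pipeline:

```python
import numpy as np
import torch
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from transformers import AutoModel, AutoTokenizer


class BertFeaturizer(BaseEstimator, TransformerMixin):
    """Turns raw strings into mean-pooled hidden states from a pretrained encoder."""

    def __init__(self, model_name="distilbert-base-uncased"):
        self.model_name = model_name
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)

    def fit(self, X, y=None):
        return self  # nothing to fit, the encoder stays frozen

    def transform(self, X):
        features = []
        with torch.no_grad():
            for text in X:
                enc = self.tokenizer(text, truncation=True, return_tensors="pt")
                hidden = self.model(**enc).last_hidden_state   # (1, seq_len, dim)
                features.append(hidden.mean(dim=1).squeeze(0).numpy())
        return np.vstack(features)


# hypothetical toy data, just to show the pipeline wiring
pipe = Pipeline([("featurizer", BertFeaturizer()), ("clf", LogisticRegression())])
pipe.fit(["great movie", "terrible movie"], [1, 0])
print(pipe.predict(["pretty good film"]))
```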

hth!