A universal granular method to break down text for modeling

My gut feeling is one thing, but:

Is there a way to break down text into its most granular form that can be applied to any model? (For instance, IOB tagging, tokens, etc.)

hey @Joanna if i understand correctly, your question is fundamentally about how we split raw text into tokens, right? for example, BERT uses a WordPiece tokenizer that can be used for all downstream tasks, provided some alignment between the tokens and labels is given for tasks like NER.
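to make the token/label alignment point concrete, here's a toy sketch: a greedy longest-match-first split (the same idea WordPiece uses, though the vocab below is made up for illustration, not BERT's real vocab), followed by aligning word-level IOB tags to the resulting subwords.

```python
# Illustrative greedy WordPiece-style splitting with a toy vocabulary,
# plus alignment of word-level IOB tags to subword tokens.
# (A real BERT tokenizer loads its vocab from a pretrained checkpoint;
# this vocab is invented for the example.)

VOCAB = {"hug", "##ging", "##s", "face", "new", "york", "in"}

def wordpiece(word, vocab=VOCAB):
    """Greedily split one word into the longest matching subwords."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand  # continuation-piece prefix
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no subword matches: whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def align_labels(words, tags):
    """Give each word's tag to its first subword; mark the remaining
    subwords -100 so the loss function ignores them (a common convention)."""
    tokens, labels = [], []
    for word, tag in zip(words, tags):
        pieces = wordpiece(word)
        tokens.extend(pieces)
        labels.extend([tag] + [-100] * (len(pieces) - 1))
    return tokens, labels

tokens, labels = align_labels(["huggings", "in", "new", "york"],
                              ["B-ORG", "O", "B-LOC", "I-LOC"])
print(tokens)  # ['hug', '##ging', '##s', 'in', 'new', 'york']
print(labels)  # ['B-ORG', -100, -100, 'O', 'B-LOC', 'I-LOC']
```

the `-100` sentinel is just one popular choice (it's the default ignore index in PyTorch's cross-entropy loss); another option is to repeat `I-` tags across the continuation pieces.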

there’s a nice walkthrough of the various tokenization strategies employed in transformers here: Summary of the tokenizers — transformers 4.5.0.dev0 documentation

Thank you @lewtun, this is really helpful and straightforward. Depending on the goal and purpose of processing text, do you think text tokenized one way could then be passed into an NLTK, spaCy, or Hugging Face model? Or could text be broken down into some format that moves through/across multiple models from various software libraries? (Thank you again, I have so much respect for Hugging Face!)

this idea of having interoperable tokenizers is an interesting one, although there’s a lot more involved when interfacing with the API of each framework, and tokenization doesn’t seem to be the bottleneck across them. you might be interested in checking out the :hugs: Tokenizers library (link) which is framework-agnostic and quite versatile
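as a small sketch of what "framework-agnostic" means in practice: you can train a tiny BPE tokenizer entirely in memory with the 🤗 Tokenizers library, and the encodings you get back are plain Python strings and ints that any downstream framework (PyTorch, TensorFlow, scikit-learn, ...) can consume (the corpus here is made up for the example).

```python
# Train a tiny BPE tokenizer in memory with the 🤗 Tokenizers library;
# the resulting ids are plain ints with no framework dependency.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

corpus = ["the cat sat on the mat", "the cat ate the rat"]  # toy corpus
tokenizer.train_from_iterator(corpus, BpeTrainer(special_tokens=["[UNK]"]))

enc = tokenizer.encode("the cat sat")
print(enc.tokens)  # subword strings
print(enc.ids)     # plain Python ints -- feed these to any framework
```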

Thank you, what do you think would be the bottleneck?

as far as interoperability across frameworks is concerned, here are a few examples that i’ve personally run into before:

  • training a model in one framework (e.g. transformers) and re-using it in another (e.g. scikit-learn)
  • using components from one framework inside another, e.g. transformers or spaCy components inside scikit-learn pipelines

for approaches that try to address these issues, you might find ONNX and [tokenwiser](https://github.com/koaning/tokenwiser) of interest.