Static type checking (with mypy): What's the official position?

PEP 484 introduced type hinting into Python (and several later PEPs have extended it). mypy (documentation is here) is the most popular type checker that conforms to these PEPs.

Is there an official position from Hugging Face on typing the public API of its libraries? I tried looking through Discourse and GitHub issues, but I couldn’t find any mention of plans for static type checking support.

My impression of the codebase is that, while the “public-facing API” is fully documented in terms of types (via docstrings), the actual type annotations (whether inline or in a separate stub-only package) are not there.

PS: If this question is better suited for GitHub issues, let me know.

1 Like

Type annotations are half there, half not there, mainly because the transformers library used to support Python 2.7. For instance, a more recent file like pipelines.py has almost everything properly type-annotated.

As I work on the documentation, I try to add them where they’re missing, but if you want to help with PRs, feel free to do so :wink:

3 Likes

Thanks for the quick reply! That’s very good to know. I was deciding between keeping a separate stub-only package or creating a fork and annotating the parts I use inline, and now I know which way to go.

I’ll probably be annotating only the parts that I use, but I’ll make an effort to make sure everything I do is clean enough for a PR.

3 Likes

An update: currently, I’m annotating the .from_pretrained() methods on AutoTokenizer, AutoModel, and AutoConfig.

Using @overload together with Literal[] argument types lets one specify whether AutoConfig.from_pretrained(...) returns a BertConfig, a T5Config, etc. However, since there are dozens of pretrained models, all the overloads quickly get in the way of readability.
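
To make that concrete, here is roughly what the pattern looks like (a trimmed sketch with only two checkpoint names; the real signatures take many more parameters):

from typing import Any, Literal, overload

from transformers import BertConfig, PretrainedConfig, T5Config

# Sketch of how the annotations could look; the real dispatch lives in transformers.
class AutoConfig:
    @classmethod
    @overload
    def from_pretrained(
        cls, pretrained_model_name_or_path: Literal["bert-base-uncased"], **kwargs: Any
    ) -> BertConfig: ...

    @classmethod
    @overload
    def from_pretrained(
        cls, pretrained_model_name_or_path: Literal["t5-small"], **kwargs: Any
    ) -> T5Config: ...

    # ...one overload per supported checkpoint/architecture, plus a catch-all...
    @classmethod
    @overload
    def from_pretrained(
        cls, pretrained_model_name_or_path: str, **kwargs: Any
    ) -> PretrainedConfig: ...

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: str, **kwargs: Any) -> PretrainedConfig:
        ...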

I think this issue must be addressed in one or more of the following ways.

  1. Separate out the annotations into a stub-only package (which IMO is less intrusive, yet requires the same level of maintenance as .pyi files in the repo itself).
  2. Just return a PretrainedConfig, and avoid the super-specific overloading.

I don’t think option 2 by itself is enough since, to the best of my knowledge, it’s impossible to type AutoTokenizer.from_pretrained according to its docstring without changing the actual signature (or having a “single” overload, which mypy doesn’t allow, which means we’d have to create a “dummy” overload whose only use is to fool mypy).
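
To illustrate the “single overload” problem: an overload could spell out a couple of the keyword arguments from the docstring (e.g. cache_dir, use_fast) while leaving the real *inputs/**kwargs signature untouched, but mypy rejects a lone @overload (reporting something like “Single overload definition, multiple required”), so a second, otherwise useless variant would have to be added just to keep it happy:

from typing import Any, Optional, overload

from transformers import PreTrainedTokenizer

class AutoTokenizer:
    # A single overload spelling out some documented kwargs, on top of the unchanged signature below.
    # mypy: error: Single overload definition, multiple required
    @classmethod
    @overload
    def from_pretrained(
        cls,
        pretrained_model_name_or_path: str,
        *inputs: Any,
        cache_dir: Optional[str] = ...,
        use_fast: bool = ...,
        **kwargs: Any,
    ) -> PreTrainedTokenizer: ...

    # The actual signature, unchanged.
    @classmethod
    def from_pretrained(
        cls, pretrained_model_name_or_path: str, *inputs: Any, **kwargs: Any
    ) -> PreTrainedTokenizer:
        ...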

@sgugger , what do you think?

No, you can’t be as detailed as the docstring. Just return an “AutoTokenizer”.
In general, I put the more general type hints and use the docstring to clarify if needed.

Sounds good!

Would returning a PreTrainedTokenizer make more sense than an AutoTokenizer?

It’s probably clearer for the doc.

Hey! I’m just about to make a PR, and I realized now that I might not have understood you completely here. Did you mean “annotate with AutoTokenizer, and let the documentation take care of all the other details”?

Currently, the PR annotates it as follows:

class AutoTokenizer:
    @classmethod
    def from_pretrained(<the arguments>) -> "PreTrainedTokenizer":
        ...

since all instances returned by the method are indeed instances of subclasses of PreTrainedTokenizer. Let me know if you think AutoTokenizer would be better, and I’ll change my PR.
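
For reference, here’s what this gives a downstream caller (a hypothetical call site; “bert-base-uncased” is just an example checkpoint):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# With the return type annotated as PreTrainedTokenizer, mypy checks attribute
# access and calls on `tok` instead of treating it as Any:
ids = tok.encode("Hello world")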

I have no strong opinion since both are valid. I guess I’d have more thoughts once I see the actual code in the PR.

1 Like

PreTrainedTokenizer is definitely more correct than AutoTokenizer, which is not a class but just a factory (no object is ever an instance of AutoTokenizer)

1 Like