Static type checking (with mypy): What's the official position?

PEP 484 introduced type hinting into Python (and several later PEPs have extended it). mypy (documentation is here) is the most popular type checker that conforms to these PEPs.

Is there an official position from Hugging Face on typing the public API of its libraries? I tried looking through Discourse and GitHub issues, but I couldn’t find any mention of plans for static type checking support.

My impression of the codebase is that, while the “public-facing API” is fully documented in terms of types (via docstrings), the actual type annotations (whether inline or in a separate stub-only package) are not there.

PS: If this question is better suited for GitHub issues, let me know.

1 Like

Type annotations are half there, half not there, mainly because the transformers library used to support Python 2.7. For instance, a more recent file like pipelines.py has almost everything properly type-annotated.

As I work on the documentation, I try to add them where they’re missing, but if you want to help with PRs, feel free to do so :wink:

3 Likes

Thanks for the quick reply! That’s very good to know. I was deciding between keeping a separate stub-only package or creating a fork and annotating the parts I use inline, and now I know which way to go.

I’ll probably be annotating only the parts that I use, but I’ll make an effort to make sure everything I do is clean enough for a PR.

3 Likes

An update: currently, I’m annotating the .from_pretrained() methods on AutoTokenizer, AutoModel, and AutoConfig.

Using @overload together with Literal[] argument types lets one specify whether AutoConfig.from_pretrained(...) returns a BertConfig, a T5Config, etc. However, since there are dozens of pretrained models, all the overloads quickly get in the way of readability.
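
To make that concrete, here is roughly what the pattern looks like (a trimmed sketch with only two checkpoint names; the real signatures take many more parameters):

from typing import Any, Literal, overload

from transformers import BertConfig, PretrainedConfig, T5Config

# Sketch of how the annotations could look; the real dispatch lives in transformers.
class AutoConfig:
    @classmethod
    @overload
    def from_pretrained(
        cls, pretrained_model_name_or_path: Literal["bert-base-uncased"], **kwargs: Any
    ) -> BertConfig: ...

    @classmethod
    @overload
    def from_pretrained(
        cls, pretrained_model_name_or_path: Literal["t5-small"], **kwargs: Any
    ) -> T5Config: ...

    # ...one overload per supported checkpoint/architecture, plus a catch-all...
    @classmethod
    @overload
    def from_pretrained(
        cls, pretrained_model_name_or_path: str, **kwargs: Any
    ) -> PretrainedConfig: ...

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: str, **kwargs: Any) -> PretrainedConfig:
        ...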

I think this issue must be addressed in one or more of the following ways.

  1. Separate out the annotations into a stub-only package (which IMO is less intrusive, yet requires the same level of maintenance as .pyi files in the repo itself).
  2. Just return a PretrainedConfig, and avoid the super-specific overloading.

I don’t think option 2 by itself is enough since, to the best of my knowledge, it’s impossible to type AutoTokenizer.from_pretrained according to its docstring without changing the actual signature (or having a “single” overload, which mypy doesn’t allow, which means we’d have to create a “dummy” overload whose only use is to fool mypy).
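
To illustrate the “single overload” problem: an overload could spell out a couple of the keyword arguments from the docstring (e.g. cache_dir, use_fast) while leaving the real *inputs/**kwargs signature untouched, but mypy rejects a lone @overload (reporting something like “Single overload definition, multiple required”), so a second, otherwise useless variant would have to be added just to keep it happy:

from typing import Any, Optional, overload

from transformers import PreTrainedTokenizer

class AutoTokenizer:
    # A single overload spelling out some documented kwargs, on top of the unchanged signature below.
    # mypy: error: Single overload definition, multiple required
    @classmethod
    @overload
    def from_pretrained(
        cls,
        pretrained_model_name_or_path: str,
        *inputs: Any,
        cache_dir: Optional[str] = ...,
        use_fast: bool = ...,
        **kwargs: Any,
    ) -> PreTrainedTokenizer: ...

    # The actual signature, unchanged.
    @classmethod
    def from_pretrained(
        cls, pretrained_model_name_or_path: str, *inputs: Any, **kwargs: Any
    ) -> PreTrainedTokenizer:
        ...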

@sgugger , what do you think?

No, you can’t be as detailed as the docstring. Just return an “AutoTokenizer”.
In general, I put the more general type hints and use the docstring to clarify if needed.

Sounds good!

Would returning a PreTrainedTokenizer make more sense than an AutoTokenizer?

It’s probably clearer for the doc.

Hey! I’m just about to make a PR, and I realized now that I might not have understood you completely here. Did you mean “annotate with AutoTokenizer, and let the documentation take care of all the other details”?

Currently, the PR annotates it as follows:

class AutoTokenizer:
    @classmethod
    def from_pretrained(<the arguments>) -> "PreTrainedTokenizer":
        ...

since all instances returned by the method are indeed instances of subclasses of PreTrainedTokenizer. Let me know if you think AutoTokenizer would be better, and I’ll change my PR.
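
For reference, here’s what this gives a downstream caller (a hypothetical call site; “bert-base-uncased” is just an example checkpoint):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# With the return type annotated as PreTrainedTokenizer, mypy checks attribute
# access and calls on `tok` instead of treating it as Any:
ids = tok.encode("Hello world")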

I have no strong opinion since both are valid. I guess I’d have more thoughts once I see the actual code in the PR.

1 Like

PreTrainedTokenizer is definitely more correct than AutoTokenizer, which is not a class but just a factory (no object is ever an instance of AutoTokenizer)

1 Like