Hello, my name is Paul O’Leary McCann. I’m the creator of fugashi, the MeCab wrapper which is used for Japanese tokenization in some BERT models. I also maintain mecab-python3, which was used in Transformers before fugashi. I have also done a lot of work on the Japanese models in spaCy, and I worked on improving speed in SudachiPy last year.
My interest in Japanese NLP:
I live in Japan and have worked in Japanese industry for years, and as an independent consultant for the past year and a half. I think that while there are good tools for working with the unique challenges Japanese presents, there are often issues with ease of use or maintenance, so a lot of my open source work has focused on that.
Projects I am working on or interested in starting:
I just released Kanji Club, a kanji search site, though it isn’t a machine learning project. Other than that, I’m not working on anything in Transformers actively at the moment but I enjoy keeping up with community developments and look forward to seeing what people come up with! Please do feel free to @ me if you have any trouble with or questions about the MeCab wrappers.
Actually, if the Livedoor News Corpus isn’t in datasets yet it should probably be added, so that’s a project idea…
Some of my projects:
Kanji Club: A just-released site for instant kanji search by parts
fugashi: Pythonic Cython-based MeCab wrapper
cutlet: A library to Romanize Japanese