Hello, my name is Paul O'Leary McCann. I'm the creator of fugashi, the MeCab wrapper which is used for Japanese tokenization in some BERT models. I also maintain mecab-python3, which was used in Transformers before fugashi. I also have done a lot of work on the Japanese models in spaCy and worked on improving speed in SudachiPy last year.
My interest in Japanese NLP:
I live in Japan and have worked in Japanese industry for years, and as an independent consultant for the past year and a half. I think that while there are good tools for working with the unique challenges Japanese presents, there are often issues with ease of use or maintenance, so a lot of my open source work has focused on that.
Projects I am working on or interested in starting:
I just released Kanji Club, a kanji search site, though it isn't a machine learning project. Other than that, I'm not working on anything in Transformers actively at the moment but I enjoy keeping up with community developments and look forward to seeing what people come up with! Please do feel free to @ me if you have any trouble with or questions about the MeCab wrappers.
Actually, if the Livedoor News Corpus isn't in datasets yet it should probably be added, so that's a project idea…
Some of my projects:
- Kanji Club: Just released instant kanji search-by-parts site
- fugashi: Pythonic Cython-based MeCab wrapper
- cutlet: A library to Romanize Japanese
Elsewhere online: