My interest in Japanese NLP:
Although I mainly work with English texts as my research subject, my first language, which I use daily, is Japanese.
I was involved in the Japanese translation project of spacy-course. I would be pleased if I could make some contribution to the “Languages at Hugging Face” thread.
Hello, my name is Paul O’Leary McCann. I’m the creator of fugashi, the MeCab wrapper used for Japanese tokenization in some BERT models. I also maintain mecab-python3, which was used in Transformers before fugashi. Beyond that, I’ve done a lot of work on the Japanese models in spaCy, and last year I worked on improving speed in SudachiPy.
My interest in Japanese NLP:
I live in Japan and have worked in Japanese industry for years, and as an independent consultant for the past year and a half. I think that while there are good tools for working with the unique challenges Japanese presents, there are often issues with ease of use or maintenance, so a lot of my open source work has focused on that.
Projects I am working on or interested in starting:
I just released Kanji Club, a kanji search site, though it isn’t a machine learning project. Other than that, I’m not working on anything in Transformers actively at the moment but I enjoy keeping up with community developments and look forward to seeing what people come up with! Please do feel free to @ me if you have any trouble with or questions about the MeCab wrappers.
Actually, if the Livedoor News Corpus isn’t in datasets yet, it should probably be added, so that’s a project idea…
Some of my projects:
Kanji Club: a just-released instant kanji search-by-parts site
My interest in Japanese NLP:
I would like to add more Japanese language models so that Japanese developers can easily build Japanese language solutions.
Some projects you are working on or interested in starting:
I am planning to add a Japanese pre-trained model (considering XLNet and RoBERTa).
Building datasets and developing tokenizers are crucial concerns in my planning. I would be happy to cooperate with you.
Hello, I am Kazuma Takaoka.
I am a developer of Japanese morphological analyzers, Sudachi and SudachiPy.
We have recently started a project to use SudachiPy as a tokenizer for transformers, and we are planning to release models using Sudachi.
Hi, NLPers! My name is Shunsuke Kitada. I’m a Ph.D. student interested in deep learning for NLP and in multimodal research combining NLP and computer vision. @whitphx san told me about this awesome thread on Twitter. Like you guys, one of my interests is Japanese NLP.
Lately, I have been favoring Hugging Face (HF) datasets. With HF datasets, a dataset can be published separately from the cumbersome data-loader part, and the library has a simple and very easy-to-use interface. I hope it will be used by a wide variety of people.
I’m currently releasing implementations of several Japanese datasets/benchmarks to be made available on HF datasets. Here are some of them:
In addition to these datasets, I am planning to release HF datasets implementations for other Japanese language datasets as well. I hope that these activities contribute to boosting the Japanese NLP field.
I’ve been maintaining an article titled フリーで使える日本語の主な大規模言語モデルまとめ (a summary of the major freely available Japanese large language models), which collects various pre-trained language models specific to Japanese.
Although recent trends favor very-large-scale models like ChatGPT, I believe there is still some value in gathering information about “medium-size” models we can fine-tune on our own.
If you want to add a new model or make a correction, feel free to comment on the article (or contact me via hellorusk1998[at]gmail.com).
Welcome @kaisugi san!
I often look at your site and use it as a reference. Thank you for maintaining such a useful information source and I’m so happy to see you in this thread!
My name is Akim Mousterou. I was born and raised in Paris, France, and am 39 years old. I currently live in the city of light but move around a lot (SF, HK, and Tokyo). My work in NLP revolves mostly around NER/knowledge bases for strategic insights. Apart from NLP, I am passionate about network effects, alternative datasets, and quantitative research.
My interest in Japanese NLP:
In university, I did Japanese studies, and I have a Master’s degree in multilanguage engineering and NLP with a focus on the Japanese language. From 2009 to 2010, I was in Tokyo, Japan on a working holiday visa. Over the years, I have worked as a business consultant with European companies in Japan and a few Japanese companies.
Recently, I passed the JLPT N2 for fun, but my Japanese is a little bit rusty. : )
My research in NLP & Quantitative research:
I shared a few notes on my GitHub about the specificities of Japanese in NLP, which you may already be aware of.
NER specificities in Japanese for Masa of SoftBank on Twitter, testing of ASR Whisper on earnings calls of Uniqlo, and an introduction to Quantum NLP for Japanese → AkimParis · GitHub
My latest project is an Anki deck with around 400 Japanese words (with English and French translations) about machine learning, statistics, and natural language processing, to promote communication among NLP practitioners. → Vocabulary Japanese (En/Fr) about Machine Learning & NLP/CV - AnkiWeb
Please feel free to connect on LinkedIn (Akim Mousterou) or GitHub (AkimfromParis).
Thank you for sharing your work!
Promoting communication among NLP practitioners (and team members with different expertise) is what I am interested in, and your deck seems helpful!
It seems that I cannot update my self-introduction post, so let me add some updated information.
I’m Yusuke Mori.
In 2021, I got my Ph.D. in the field of Information Science and Technology.
Now I am working as a researcher in the field of NLP.
My interest in Japanese NLP:
I am interested in storytelling, machine learning, and natural language processing, which I believe have a tight relationship with creativity.
Please visit my website if you are interested in the following topics:
COMPASS (a writing support system to COMPlement Author unaware Story gapS)
I’m a grad student in Cognitive and Brain Science at USC (a university in California). Right now I’m primarily working on Whisper, trying to create an accurate model for Japanese translations of pop-culture content such as TV anime.