AMA with Julien Chaumond, HF CTO

Last week we hosted @julien-c, Hugging Face CTO, in the Hugging Face Community Discord Server in an open AMA. Here you can find the questions and answers :hugs:

1. What is your favourite drink and why is it chai?

I have universal taste, i.e. I love both tea and coffee. I have to say though, many team members at HF are coffee fans! We even have a PyTorch coffee machine (someone tried to rebrand it to TF once).

2. What was (is) the most difficult technical problem to solve when creating (maintaining) the HF Hub?

So, personally I’m working a lot on the HF Hub; that’s where I spend a lot of my time at HF.

I wouldn’t say anything was super hard technically, but the big defining decisions were:

  • when we started, it was just direct uploads to an S3 bucket (with auth on top of it). We then started needing versioning. There’s a versioning feature on top of S3, but we wondered if we should just try to implement a git server. At first we thought it might be over-engineered, but after experimenting with building our own git server, it became clear that it was going to be a good decision!

  • trying to keep a super simple platform vs. building too many inconsistent features at the same time.
    In a way we can choose to be more opinionated than a generic git host like GitHub or GitLab, because we’re building something tailored for ML, i.e. models, datasets and Spaces. When you think about it, GitHub is nicely done but some of the underlying concepts are super complex. We’re trying to make them a little bit easier, because we’re 100% focused on ML.
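
Editor's sketch (not from the AMA): one practical thing a git-backed Hub gives you over plain bucket uploads is that every file can be addressed at a specific branch, tag, or commit. The snippet below assumes the Hub's public `resolve` URL layout for model repos. `hub_file_url` is a hypothetical helper written for illustration, and the repo name and tag are only examples.

```python
from urllib.parse import quote

HUB_ENDPOINT = "https://huggingface.co"  # assumption: the public Hub endpoint


def hub_file_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build the download URL for one file of a model repo at a given git revision.

    `revision` can be a branch, a tag, or a commit hash, which is what a
    git-backed store offers compared with unversioned bucket uploads.
    """
    # Escape the revision fully (a ref such as "refs/pr/1" contains "/"),
    # but keep "/" in the filename so files in subfolders still resolve.
    rev = quote(revision, safe="")
    return f"{HUB_ENDPOINT}/{repo_id}/resolve/{rev}/{quote(filename)}"


latest = hub_file_url("bert-base-uncased", "config.json")
pinned = hub_file_url("bert-base-uncased", "config.json", revision="v1.0")  # hypothetical tag
print(latest)
print(pinned)
```

Pinning a revision like this is what makes a download reproducible: the default `main` URL follows whatever was pushed last, while a tag or commit hash keeps pointing at the same content.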

3. If Hugging Face had an official mascot, would it be a hugging alpaca or a hugging llama?

This question didn’t get an answer, but it came out that the bear is @julien-c’s totem animal.

4. Are there any research projects going on at HF? Where does one find them?

Yes! We have a super awesome research team now at HF (about 20 people, including super cool team members like Lucile Saulnier, Douwe Kiela, Meg Mitchell, Victor Sanh, etc.). One big project we are working on right now is BigScience: it’s a large language model the size of GPT-3, and when training is completed it’ll be the largest multilingual open-source LM, which is super exciting.

You can follow the training in real time on the hub at the top of🙂

PS: it’s an open collaboration with many different partners from academia and companies. You should join if you’re interested in the research side!

5. Why did you choose to use git-lfs over something like DVC to store datasets and models?

We did experiment with DVC a bit, but we liked the fact that git-lfs was already widely used and tested, and more seamlessly integrated into git itself. git-lfs is very similar to the data versioning layer of DVC, if I remember correctly.

I wrote up a few more details in “[Announcement] Model Versioning: Upcoming changes to the model hub”, IIRC.
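
Editor's sketch (not from the AMA): the core idea behind git-lfs is that git itself only tracks a tiny text "pointer" file, while the large content lives in separate storage and is looked up by its SHA-256 hash. The snippet below is a simplified illustration of that pointer format using only the standard library. The helper names `make_lfs_pointer` and `parse_lfs_pointer` are hypothetical, the sample bytes are a stand-in for a real checkpoint, and real git-lfs does more (transfer, locking, extensions).

```python
import hashlib

LFS_SPEC = "https://git-lfs.github.com/spec/v1"


def make_lfs_pointer(content: bytes) -> str:
    """Return the text of a git-lfs style pointer file for `content`."""
    oid = hashlib.sha256(content).hexdigest()
    return f"version {LFS_SPEC}\noid sha256:{oid}\nsize {len(content)}\n"


def parse_lfs_pointer(pointer: str) -> dict:
    """Parse a pointer file into its fields (version, hash algorithm, oid, size)."""
    fields = dict(line.split(" ", 1) for line in pointer.splitlines() if line)
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "oid": digest,
        "size": int(fields["size"]),
    }


weights = b"pretend these bytes are a large model checkpoint"  # stand-in data
pointer = make_lfs_pointer(weights)  # this small text is what git would version
info = parse_lfs_pointer(pointer)
print(pointer)
```

Because the pointer is only a few lines of text, cloning and diffing stay fast even when the file it stands for is many gigabytes.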

6. After BigScience, can we expect Hugging Face to train its guns on the vision domain and do something grand on image generation, like DALL-E 2 or something of the sort? :star_struck:

There’s a community member named @boris who’s doing an amazing replication of DALL-E 2. You should ping him! (Note from Omar: you can check out a demo here, and I suggest checking out Boris’s awesome Twitter threads.)

From Omar: Btw, apart from huge models, there is lots of work on using transformers for CV. We can already do image segmentation, classification, and object detection. @nielsr and the team have been doing some very cool stuff, and we also recently had a GAN sprint. @NielsR_ has been sharing cool stuff in the computer-vision channel, feel free to lurk over there if that’s interesting to you :slightly_smiling_face:

7. Maybe not a really technical question, but as ethical AI is quite important for HF, do you actually have a policy that forbids “unethical” companies from being active on the Hub and spaces and if yes, how does that work out, so who decides?

Super important question. On the open source side we don’t have a policy for this. We will work on a “Flag this repo” feature, for the community to be able to more seamlessly report bad stuff directly on the website (for now they ping us and we work with e.g. model authors to add disclaimers in case there are ethical issues and/or big limitations to applications of the model/dataset/space)

On the business side however, in the past we’ve declined working with some companies or institutions for ethical reasons (for instance, we do not wish to work with the intelligence industry – in any country – because the potential for bad things is just too high).

I’m actually looking for feedback on those points. Please get in touch if you want to discuss more.

8. Do you have users/usage of HF in legal activities? (lawyers, academics, …)

Yes! I’ve seen models for summarization of legal documents (in several different languages), entity extraction (for instance for anonymization of legal decisions before releasing them in open data). Many different applications!

9. Adjacent to the role of HF question, what’s the long-term business model for HF?

As I understand today it’s mostly funded by VC, which eventually runs out. And of course, it’s generally tricky to build a sustainable and large business out of open source. I imagine hosted inference may be important, but I’m sure there’s more :slightly_smiling_face:

We’re already generating significant revenue through a combination of hosted services, open source support, and partnerships with e.g. ML hardware or cloud companies. For instance, it’s public that we work with AWS to make sure our software works well on their cloud, including their SageMaker platform.

Longer term I think a lot of value is going to be at the intersection of “collaboration” and “compute”, i.e. what if you’re a company/team using ML and we make it super easy to launch training jobs or inference endpoints directly from the collaboration platform? Then you don’t even necessarily have to go through your AWS or Azure account, etc. And because ML compute is hard, if we make it more seamless it could be a game changer for the adoption and success of ML itself :slightly_smiling_face:

10. What excites you the most about the future of Hugging Face and what’s the next big thing for it?

On the Hub side (where I’m working most of the time) I think we’re still super early in the sense that the platform is functional but there’s way more stuff we want to build to provide more support for anyone working on ML and collaborating on ML.

“Collaboration” and “seamless integration to compute” are the two big areas where we are working on the Hub side! Stay tuned =)

11. Any uses of the hub so far that have really stood out to you/made you smile? Anything unexpected you’ve had to deal with in eg. spaces?

bitcoin miners :slightly_smiling_face:

11. I have been a big fan of transformers for ~2 years, but only in the last few weeks realized the amazing new features like Spaces. Can you share thoughts on what makes HF so successful at continuing to innovate?

From Omar: HF is community centric. No, really…it’s suuuuper community centric. It’s extremely transparent, open, collaborative. We have channels and discussions with lots of open source companies, libraries, research groups, etc, and we really push to solve problems that the community has. My opinion is that this has allowed HF to innovate and be very successful.

From Julien: yes 100%. Also THE TEAM!! We’re super super lucky to have assembled a small team of super super talented individuals, where everyone is passionate about building the future of AI collaboratively with the community. Every day I’m grateful to be able to do the best work of my life – and something meaningful – with this team :heart: :blue_heart: :purple_heart:

12. Ever had any comments or pushback from (“the serious”) VCs/tech peeps etc about the adorable company logo/name?

No, I think the biggest complaint is maybe from Alien fans, for whom it brings Facehuggers to mind a little bit too much :slightly_smiling_face: :slightly_smiling_face:

13. Hugging Face is already the open source leader, what are the big unsolved challenges that you are excited about? With Transformers rising in Vision, do you envision creating a bigger focus on CV inside the team?

Yes, because the tooling and platform work well for all fields of ML, we are progressively expanding support to CV, but also Audio, RL, even structured data… But our goal is to be driven by the community and what’s most useful to everyone in ML.

I think one big challenge is that the ML community is going to grow way larger in the coming years (today it’s ~50x smaller than software engineering for instance, it may get as big or even bigger?) and the challenge for us is to make ML as accessible as possible to as many potential contributors as possible (including people just getting started with ML or software).

It’s going to be super important to enable a diverse, large community to work on ML at scale.

Thanks everyone for joining!