Funcom Dataset for summarization

Hi everybody!
I’ve just started with NLP and I’m working on my degree thesis, which involves experimenting with some datasets. I found the funcom dataset, which is made of pieces of Java code and their Javadocs. My question is: has anybody ever tested SOTA summarization models on this dataset? Would it give good results? Or does the pretraining of such models not give them any knowledge of source code?

Thanks in advance :smiley:

Hi airnicco8,

I’m not an expert, but that looks a bit tricky. What do you intend to do with the funcom data? Are you trying to build a seq-2-seq model that translates from Java code to a comment string?
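
To make the question concrete, here is a rough sketch of what one (code, comment) training pair for such a seq-2-seq model might look like. The Java snippet, the regex, and the function name `split_code_and_summary` are all made up for illustration, and I am assuming the raw input is a method with a leading Javadoc block; the actual funcom files may already be split differently.

```python
import re

# Hypothetical example input; not taken from the funcom dataset itself.
JAVA_SOURCE = '''
/**
 * Returns the sum of two integers.
 *
 * @param a first operand
 * @param b second operand
 */
public int add(int a, int b) {
    return a + b;
}
'''

# Matches the first Javadoc block (/** ... */), across multiple lines.
JAVADOC_RE = re.compile(r"/\*\*(.*?)\*/", re.DOTALL)


def split_code_and_summary(source: str):
    """Split a Java method with a leading Javadoc into (code, summary)."""
    match = JAVADOC_RE.search(source)
    if match is None:
        # No Javadoc found: return the code with an empty summary.
        return source.strip(), ""

    # Strip the leading "*" decoration from each comment line.
    lines = [
        re.sub(r"^\s*\*\s?", "", line).strip()
        for line in match.group(1).splitlines()
    ]

    # Keep only the description before the first blank line or @tag.
    summary_lines = []
    for line in lines:
        if line.startswith("@"):
            break
        if not line:
            if summary_lines:
                break
            continue
        summary_lines.append(line)

    summary = " ".join(summary_lines)
    # The model input is the source with the Javadoc removed.
    code = (source[:match.start()] + source[match.end():]).strip()
    return code, summary


code, summary = split_code_and_summary(JAVA_SOURCE)
print("INPUT :", code)
print("TARGET:", summary)
```

The model would then take `code` as its input sequence and be trained to produce `summary` as the output sequence.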

If you are supposed to be doing NLP, then Java code might not be appropriate, as Java is not a natural language.

A big advantage of the Hugging Face library is that it includes many pre-trained models that you can fine-tune on your own data. I don’t think there are any models pre-trained on Java code. See this page for the list of models available in Hugging Face: https://huggingface.co/transformers/pretrained_models.html

I suggest you start with something simpler.

Thanks for the reply. That’s what I thought too, but I wanted to ask to double-check!