All I know is that GPT-J is a GPT-3-like model, and that GPT-3 is a scaled-up implementation of the GPT-2 architecture. So is there any technical difference between GPT-2 and GPT-J?
An answer from the new Bing:
GPT-J and GPT-2 are both transformer models that can generate text from a prompt. However, they have some differences in their size, architecture, training data and performance.
GPT-J is a newer and larger model than GPT-2. It has 6 billion parameters, while GPT-2 has 1.5 billion parameters¹. GPT-J also uses Rotary Position Embedding (RoPE) to encode the position of each token, while GPT-2 uses sinusoidal position embedding².
GPT-J was trained on the Pile, a large-scale curated dataset created by EleutherAI, which contains diverse sources of text such as books, Wikipedia, GitHub and Reddit². GPT-2 was trained on WebText, a smaller dataset scraped from the internet by OpenAI, which mainly consists of news articles and blog posts³.
GPT-J is claimed to approach GPT-3-level performance on some natural language processing tasks, such as arithmetic, logic, quantitative reasoning and pseudocode generation. GPT-2 is less capable of these tasks and somewhat underperforms equal-sized GPT-3 models¹.
Sources:
(2) EleutherAI/gpt-j-6B · Hugging Face. Accessed 2023/3/31.
(3) EleutherAI claims new NLP model approaches GPT-3-level performance | VentureBeat. Accessed 2023/3/31.
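One way to see the concrete differences for yourself is to compare the two model configs with the `transformers` library. This is just a rough sketch (it assumes `transformers` is installed and can download the config files for `gpt2-xl` and `EleutherAI/gpt-j-6B`):

```python
from transformers import AutoConfig

# Download only the configuration files (no weights) for both models.
gpt2 = AutoConfig.from_pretrained("gpt2-xl")              # the 1.5B-parameter GPT-2 checkpoint
gptj = AutoConfig.from_pretrained("EleutherAI/gpt-j-6B")  # the 6B-parameter GPT-J checkpoint

# Print the hyperparameters where the architectures differ.
for name, cfg in [("GPT-2 XL", gpt2), ("GPT-J", gptj)]:
    print(
        name,
        "| layers:", cfg.n_layer,
        "| hidden size:", cfg.n_embd,
        "| heads:", cfg.n_head,
        "| context length:", cfg.n_positions,
        # GPT-J's config has a rotary_dim field (RoPE); GPT-2's config has no such field.
        "| rotary_dim:", getattr(cfg, "rotary_dim", None),
    )
```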
While Bing's answer is helpful, I believe GPT-J uses sinusoidal positions as well, according to the Hugging Face implementation: https://github.com/huggingface/transformers/blob/172f42c512e1bf32554ef910fe82f07916b4d4af/src/transformers/models/gptj/modeling_gptj.py#L115. In that file the sinusoidal sin/cos values feed the rotary embedding rather than being added to the token embeddings.
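To make that concrete, here is a minimal self-contained sketch of the idea (my own simplification, not the actual `modeling_gptj.py` code): the sinusoid defines per-position rotation angles, and those rotations are applied to the query/key channels inside attention.

```python
import torch

def sinusoidal_table(seq_len: int, rotary_dim: int):
    """Sin/cos table over positions, same recipe as the classic sinusoidal embedding."""
    inv_freq = 1.0 / (10000 ** (torch.arange(0, rotary_dim, 2).float() / rotary_dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq_len, rotary_dim // 2)
    return angles.sin(), angles.cos()

def rotate_every_two(x):
    """Map adjacent pairs (x0, x1) -> (-x1, x0): the 'imaginary part' of a 2-D rotation."""
    x1, x2 = x[..., ::2], x[..., 1::2]
    return torch.stack((-x2, x1), dim=-1).flatten(-2)

def apply_rotary(x, sin, cos):
    """Rotate each (even, odd) channel pair by its position-dependent angle."""
    sin = torch.repeat_interleave(sin, 2, dim=-1)
    cos = torch.repeat_interleave(cos, 2, dim=-1)
    return x * cos + rotate_every_two(x) * sin

# Toy usage: rotate per-position query vectors (rotary_dim channels each).
seq_len, rotary_dim = 4, 8
q = torch.randn(seq_len, rotary_dim)          # pretend these are query vectors
sin, cos = sinusoidal_table(seq_len, rotary_dim)
q_rot = apply_rotary(q, sin, cos)             # positions are now encoded as rotations
print(q_rot.shape)                            # torch.Size([4, 8])
```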
I don't think GPT-2 uses sinusoidal position embedding, though. According to the HF implementation (transformers/src/transformers/models/gpt2/modeling_gpt2.py), GPT-2 uses a learned position embedding (`wpe` is an `nn.Embedding`) whose rows are simply added to the token embeddings.
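For comparison, a minimal sketch of that GPT-2-style scheme (again simplified, not the actual modeling_gpt2.py code):

```python
import torch
import torch.nn as nn

class ToyGPT2Embeddings(nn.Module):
    """GPT-2-style input embeddings: learned token + learned position vectors, summed."""

    def __init__(self, vocab_size=50257, max_positions=1024, hidden_size=768):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, hidden_size)     # token embeddings
        self.wpe = nn.Embedding(max_positions, hidden_size)  # learned (not sinusoidal) positions

    def forward(self, input_ids):
        positions = torch.arange(input_ids.size(-1), device=input_ids.device)
        return self.wte(input_ids) + self.wpe(positions)     # simple addition at the input

emb = ToyGPT2Embeddings()
hidden = emb(torch.tensor([[15496, 995]]))   # arbitrary token ids
print(hidden.shape)                          # torch.Size([1, 2, 768])
```

So as far as position handling goes, the real difference seems to be learned absolute embeddings added once at the input (GPT-2) versus rotary embeddings, built from a sinusoidal table, applied to the queries and keys inside every attention layer (GPT-J).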