Hello, I’m looking for a guide or some help to train GPT-2 from scratch on a small corpus. If it makes any difference, the corpus is in Italian, and I only know PyTorch.
About tokenizers: is GPT-2’s tokenizer somehow language-agnostic, or does the language of the corpus matter?
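For context, my (possibly wrong) understanding is that GPT-2 uses byte-level BPE, which only looks at raw UTF-8 bytes, so nothing in it is specific to English. Here’s a toy, pure-Python sketch of just the pair-counting step of BPE — not the real GPT-2 tokenizer code, just to illustrate why accented Italian characters shouldn’t be a problem:

```python
from collections import Counter

def byte_pair_counts(corpus):
    """First step of byte-level BPE: count adjacent byte pairs in raw UTF-8.
    There is no language-specific logic here -- accented characters like
    'é' are just a sequence of bytes like any other."""
    pairs = Counter()
    for text in corpus:
        b = text.encode("utf-8")
        for left, right in zip(b, b[1:]):
            pairs[(left, right)] += 1
    return pairs

corpus = ["perché no", "perché sì"]
counts = byte_pair_counts(corpus)
# 'é' encodes to the two bytes 0xC3 0xA9, so that byte pair is counted
# once per occurrence of "perché"; a BPE trainer would eventually merge it.
```

Is this roughly the right mental model, i.e. training a byte-level BPE tokenizer directly on my Italian corpus should just work?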
About the corpus: is it supposed to be formatted in a specific way? I’ve seen some corpus files formatted with a space surrounding every punctuation symbol (before and after).
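To be concrete, my current plan (just a sketch of what I had in mind, and I may be wrong) is to leave the text exactly as written, with no space-padding around punctuation, and simply join documents with GPT-2’s `<|endoftext|>` separator:

```python
# My tentative plan: keep the raw text untouched (no spaces added around
# punctuation) and concatenate documents with GPT-2's end-of-text separator.
docs = [
    "Il gatto dorme, credo.",  # natural Italian text, left as-is
    "Che bella giornata!",
]
corpus = "<|endoftext|>".join(docs)
```

Is that enough, or is the pre-spaced punctuation style actually required for some tokenizers?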