Amharic NLP - Train BERT-style model

yosiasz · February 25, 2021, 4:43am

@israel Here is a thread where we can collaborate on work to pre-train a BERT-style model for Amharic on OSCAR data.

One thing I have noticed about a lot of NLP efforts is that they have a high barrier to entry. I believe the documentation needs to be clear enough that anyone coming after us (with minimal data science knowledge) can implement the process by following easy step-by-step instructions.

Thanks

israel · February 25, 2021, 6:43am

Please DM me.

yosiasz · February 27, 2021, 9:20pm

@yjernite

This part has some issues in my Jupyter notebook:

def tokenize_function(examples):
    return tokenizer(examples["text"])

when called from

tokenized_datasets = datasets.map(
    tokenize_function, batched=True, num_proc=4, remove_columns=["text"]
)

File "", line 2, in tokenize_function
NameError: name 'tokenizer' is not defined

yjernite · March 1, 2021, 12:17am

Are you talking about the Colab you shared in another thread? Did you run the cells in order? It looks like you haven't run tokenizer = AutoTokenizer.from_pretrained(model_checkpoint) yet.
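
For anyone hitting the same error, here is a minimal sketch of why cell order matters. Python looks up the name tokenizer when tokenize_function is called, not when it is defined, so the call fails until the cell that creates the tokenizer has run. To keep the sketch runnable without downloading a model, whitespace_tokenizer is a hypothetical stand-in for AutoTokenizer.from_pretrained(model_checkpoint), and the batch dictionary is made-up sample data, not the OSCAR dataset.

```python
# Hypothetical stand-in for: tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
# It only splits on whitespace so the example runs without any downloads.
def whitespace_tokenizer(texts):
    return {"tokens": [text.split() for text in texts]}


def tokenize_function(examples):
    # `tokenizer` is resolved at call time, so it must exist before this runs.
    return tokenizer(examples["text"])


# Made-up sample batch in the shape a batched map passes to the function.
batch = {"text": ["ሰላም ዓለም", "hello world"]}

# Calling before the tokenizer cell has run reproduces the NameError.
try:
    tokenize_function(batch)
    error_message = None
except NameError as err:
    error_message = str(err)

# This assignment plays the role of the notebook cell that was skipped.
tokenizer = whitespace_tokenizer

# The same call now succeeds.
result = tokenize_function(batch)
print(error_message)
print(result["tokens"])
```

In the notebook, the equivalent fix is to run the cell that assigns tokenizer before the cell that calls datasets.map, or to restart the kernel and run all cells from the top.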
