Training a model to add HTML formatting to a web article?

chumpbo · June 17, 2021, 5:43pm

Hello everyone, I want to train a seq2seq model that takes a web/blog article in plain text as input and adds basic HTML formatting to it (such as <h3> or <h2> on some titles, <strong> or <i> on some keywords or phrases, etc.).

I have a dataset with thousands of articles containing those HTML tags and for each article I have its plain text version.
Am I supposed to fine-tune an existing pretrained model like BERT or mT5, or should I build one from scratch? How would I go about doing such a thing? All I have found online are tutorials on fine-tuning pre-existing models for sentiment analysis and translation, but I can't find anything close to what I want, and I don't know if it's even possible to achieve.
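
To make the setup concrete, here is a minimal sketch of what one training pair in my dataset looks like: the plain text is the model input and the tagged article is the target. The names `TagStripper`, `to_plain_text` and `make_pair` are hypothetical helpers written for this illustration (they only use Python's standard `html.parser`), and the example article is made up.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text content of an HTML string, dropping every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.parts.append(data)


def to_plain_text(html_article):
    """Return the article with all HTML tags removed."""
    stripper = TagStripper()
    stripper.feed(html_article)
    stripper.close()
    return "".join(stripper.parts)


def make_pair(html_article):
    """Build one (source, target) example: plain text in, formatted HTML out."""
    return {"source": to_plain_text(html_article), "target": html_article}


# Made-up example article, only for illustration.
example = "<h2>Why tea?</h2> Tea is <strong>cheap</strong> and <i>calming</i>."
pair = make_pair(example)
print(pair["source"])  # Why tea? Tea is cheap and calming.
print(pair["target"])
```

So the task would be to learn the mapping from `source` to `target` over thousands of such pairs.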

I would enormously appreciate any kind of help or pointers. Thank you very much.
