Should I Include Poet Information as a Feature in LLM Training with 3,356 Unique Poets?

JanaGu · October 9, 2024, 8:45am

Hello, I am working on a project where I am training a large language model (LLM) to generate Arabic poetry. My dataset includes poems from 3,356 unique poets, and I am considering whether to include the poet as a feature (e.g., adding a special token for each poet).

My main concern is whether this will make the model more complex and hinder its ability to learn other important patterns, such as rhyme schemes, meter, and thematic elements. Given the large number of poets, would adding a unique token for each one lead to slower convergence or confusion during training? Or is it generally fine to include poet-specific tokens without negatively impacting the model’s learning of other patterns?
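
To make the idea concrete, here is a minimal sketch of what I mean by per-poet tokens, together with a rough count of the extra embedding parameters they would add. The token format `<poet_N>`, the helper names, and the hidden size of 768 are all assumptions for illustration, not details of my actual setup.

```python
from collections import Counter

def poet_token(poet_id: int) -> str:
    """Hypothetical control-token format, e.g. '<poet_17>'."""
    return f"<poet_{poet_id}>"

def build_example(poet_id: int, poem: str) -> str:
    """Prepend the poet token so the model can condition on the poet."""
    return f"{poet_token(poet_id)} {poem}"

def extra_embedding_params(num_poets: int, hidden_size: int, tied: bool = True) -> int:
    """Count new parameters: one embedding row per poet token.

    If input and output embeddings are not tied, the output layer
    gains the same number of rows, so the count doubles.
    """
    rows = num_poets * hidden_size
    return rows if tied else 2 * rows

def rare_poets(poet_ids, min_poems: int):
    """Return poets with fewer than min_poems poems; their tokens see few updates."""
    counts = Counter(poet_ids)
    return sorted(p for p, c in counts.items() if c < min_poems)

NUM_POETS = 3356
HIDDEN_SIZE = 768  # assumption: depends entirely on the base model

poet_vocab = [poet_token(i) for i in range(NUM_POETS)]
overhead = extra_embedding_params(NUM_POETS, HIDDEN_SIZE)
```

Under these assumptions the tokens add about 2.6 million embedding parameters (3,356 × 768), so I suspect the bigger issue is poets with very few poems, whose tokens would rarely be updated. With Hugging Face `transformers`, I understand such tokens would typically be registered via `tokenizer.add_special_tokens` followed by `model.resize_token_embeddings`.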
