Fine-tuning an LLM

I want to fine-tune a small LLM (LLaMA, etc.) to chat with arXiv publications.

Example questions:

  • Show me all publications issued in 2022
  • Show me the names of all authors who issued publications in 2022

Question

  • Is it even possible, and how do I start?
  • Can I restrict an LLM after fine-tuning to only the new data (arXiv publications)?

Thank you for having me.

  1. Your idea is possible.

It is a feasible task. My research team fine-tunes models on new datasets, especially medical datasets, so fine-tuning a model on arXiv publications is not a problem if the dataset is prepared correctly.

  2. Arxiv dataset

In my opinion, you should use the LangChain framework instead of fine-tuning your model. Fine-tuning might be the right way to build your own arXiv assistant, but it costs too much, so I recommend using LangChain. LangChain supports an arXiv module.
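For example, here is a minimal sketch of LangChain's arXiv integration. It assumes the `langchain-community` and `arxiv` packages are installed; exact imports and metadata keys can vary between LangChain versions:

```python
# Minimal sketch: query arXiv through LangChain's retriever.
# Assumes `pip install langchain-community arxiv`.
from langchain_community.retrievers import ArxivRetriever

# Fetch up to 3 papers matching the query directly from the arXiv API.
retriever = ArxivRetriever(load_max_docs=3)
docs = retriever.invoke("large language model fine-tuning")

for doc in docs:
    # Each document carries the paper's metadata (title, publication date, ...).
    print(doc.metadata.get("Title"), "-", doc.metadata.get("Published"))
```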


Yes, it is absolutely possible to fine-tune a small LLM like LLaMA to work specifically with a dataset such as arXiv publications. Fine-tuning allows the model to specialize in a domain, enabling it to answer queries like “Show me all publications issued in 2022” or “List all authors who published in 2022.”

How to start:

  1. Collect and preprocess data – Gather arXiv metadata, abstracts, or full-text PDFs and structure it in a clean, machine-readable format (JSON, CSV, etc.).

  2. Choose your LLM and fine-tuning method – Small models like LLaMA or Alpaca are suitable. Use techniques like LoRA (Low-Rank Adaptation) for efficient fine-tuning without retraining the full model (see the sketch after this list).

  3. Train on your dataset – Feed the model the arXiv-specific data, including prompts and expected responses.

  4. Test and validate – Ensure the model provides accurate answers for queries like publications by year, author names, or topics.
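As a rough illustration of steps 1–3, a LoRA run with the Hugging Face `transformers`, `peft`, and `datasets` libraries could look like the sketch below. The model name, the toy training record, and the `target_modules` choice are illustrative assumptions, not a fixed recipe:

```python
# Rough sketch of LoRA fine-tuning with Hugging Face transformers/peft/datasets.
# Model name and the sample record are placeholders, not recommendations.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any small causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Step 1: structure arXiv metadata as prompt/response pairs (one toy record here).
records = [{
    "prompt": "Which authors published arXiv paper 2201.00001?",
    "response": "Jane Doe and John Smith.",  # hypothetical example record
}]
dataset = Dataset.from_list(records).map(
    lambda r: tokenizer(r["prompt"] + "\n" + r["response"], truncation=True),
    remove_columns=["prompt", "response"],
)

# Step 2: wrap the base model with LoRA adapters instead of retraining all weights.
# target_modules are model-specific; q_proj/v_proj match LLaMA-style attention.
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Step 3: train only the small adapter weights.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="arxiv-lora", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

In practice you would use thousands of such records built from arXiv metadata rather than a single toy example.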

Restricting the model to new data:

While you can fine-tune the model to prioritize arXiv publications, most LLMs retain knowledge from pretraining. To strictly restrict answers to your dataset, you can combine fine-tuning with a retrieval-augmented generation (RAG) setup. In this approach, the model queries the arXiv dataset dynamically, ensuring responses are always grounded in the latest publications.
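For instance, here is a minimal embedding-based retrieval sketch with `sentence-transformers`; the encoder name and the toy abstracts are illustrative assumptions. Only the retrieved passages would then be handed to the LLM as context:

```python
# Minimal embedding-based retrieval sketch using sentence-transformers.
# The encoder name and toy abstracts are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Toy corpus standing in for arXiv abstracts fetched via the arXiv API.
abstracts = [
    "We present a survey of transformer architectures published in 2022.",
    "A study of protein folding with deep learning.",
]
corpus_embeddings = encoder.encode(abstracts, convert_to_tensor=True)

# At query time, retrieve the closest abstracts and pass only those to the LLM,
# so answers stay grounded in the dataset rather than in pretraining knowledge.
query_embedding = encoder.encode("transformer papers from 2022",
                                 convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)[0]
for hit in hits:
    print(abstracts[hit["corpus_id"]], hit["score"])
```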

In short, you can specialize an LLM for arXiv and even enforce strict domain constraints using RAG or embedding-based retrieval, giving you an AI model focused entirely on your dataset.
