Does anyone know of a list of LLMs that disclose the data sources used in their training?

Does anyone know of a list of LLMs that disclose the data sources used in their training? I am studying LLMs and how they relate to open science concepts.


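I don't know of a ready-made list, and not many model authors actually fill it in, but the model card metadata on the Hugging Face Hub has a datasets field where authors can list the datasets they used for training. The metadata template looks like this: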

---
language: 
  - "List of ISO 639-1 code for your language"
  - lang1
  - lang2
thumbnail: "url to a thumbnail used in social sharing"
tags:
- tag1
- tag2
license: "any valid license identifier"
datasets:
- dataset1
- dataset2
metrics:
- metric1
- metric2
base_model: "base model Hub identifier"
---
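If you want to check whether a given model card fills in that field, here is a minimal sketch in Python. It only uses the standard library and assumes the card text is already available as a string and that datasets is written as a block list, as in the template above. The names `EXAMPLE_CARD` and `extract_datasets` are made up for illustration, and this is not a full YAML parser (inline lists such as `datasets: [a, b]` are not handled), so a real YAML library would be a sturdier choice.

```python
# Hypothetical example card text, modeled on the template above.
EXAMPLE_CARD = """---
language:
  - en
tags:
- tag1
license: "any valid license identifier"
datasets:
- dataset1
- dataset2
metrics:
- metric1
---
# Model description
"""


def extract_datasets(card_text):
    """Return the entries listed under the top-level `datasets:` key.

    Assumes the metadata block starts on the first line with '---' and
    that `datasets` uses one '- item' line per entry.
    """
    lines = card_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no metadata block at the top of the card

    datasets = []
    in_datasets = False
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":
            break  # end of the metadata block
        if in_datasets and stripped.startswith("- "):
            # A list item belonging to the datasets key.
            datasets.append(stripped[2:].strip().strip("\"'"))
        elif stripped and not line[0].isspace() and not stripped.startswith("-"):
            # A new top-level key: track whether it is `datasets`.
            in_datasets = stripped.split(":", 1)[0] == "datasets"
    return datasets


if __name__ == "__main__":
    print(extract_datasets(EXAMPLE_CARD))
```

An empty result means the card either has no metadata block or the author left the datasets field out, which in practice is the common case.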