Does anyone know of a list of LLMs that disclose the data sources used in their training? I am studying LLMs and how they relate to open science concepts.
Although not many model authors actually fill it in, the model card metadata has a datasets field where you can list the datasets used for training:
---
language:
- "List of ISO 639-1 code for your language"
- lang1
- lang2
thumbnail: "url to a thumbnail used in social sharing"
tags:
- tag1
- tag2
license: "any valid license identifier"
datasets:
- dataset1
- dataset2
metrics:
- metric1
- metric2
base_model: "base model Hub identifier"
---
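
If you want to check which models declare their training data, one option is to read that datasets field out of each model card. Below is a minimal sketch in Python, assuming the card text starts with a metadata block delimited by --- lines like the template above. The function name extract_datasets and the sample card are hypothetical, and the parsing is deliberately simplistic (it only handles a flat list under datasets:), so a real YAML parser would be more robust.

```python
from typing import List


def extract_datasets(card_text: str) -> List[str]:
    """Return the entries listed under `datasets:` in a model card's metadata block.

    Hypothetical helper: handles only a flat `- item` list, not full YAML.
    """
    lines = card_text.splitlines()
    # The metadata block is assumed to open with '---' on the first line.
    if not lines or lines[0].strip() != "---":
        return []

    datasets: List[str] = []
    in_datasets = False
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":  # closing delimiter: metadata block is over
            break
        if not stripped:  # ignore blank lines
            continue
        if stripped.startswith("- "):
            # List item: keep it only if it belongs to the datasets key.
            if in_datasets:
                datasets.append(stripped[2:].strip().strip("\"'"))
        else:
            # Any other line is a new key; track whether it is `datasets:`.
            in_datasets = stripped == "datasets:"
    return datasets


# Made-up sample card following the template above.
CARD = """---
license: "any valid license identifier"
datasets:
- dataset1
- dataset2
metrics:
- metric1
---
Model description goes here.
"""

print(extract_datasets(CARD))
```

An empty result means the author did not declare any datasets in the metadata, which, as noted above, is the common case.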