Agent Course final project: can we see the score for each question?

When our agent submits its results for the 20 test questions, we get this response:

{
  "username": "string",
  "score": 0,
  "correct_count": 0,
  "total_attempted": 0,
  "message": "string",
  "timestamp": "string"
}

That doesn’t tell us which questions we got wrong or right, just the total number. Is there any way to see individual results?


Hmm… there doesn’t appear to be an endpoint in the official Space that returns individual results for each item.

The official Unit 4 scoring API does not return per-question correctness. It only returns an aggregate ScoreResponse (username, score, correct_count, total_attempted, message, timestamp). That is exactly what the /submit endpoint is coded to return. (Hugging Face)

What the API exposes (and what it does not)

The course docs list the public routes:

  • GET /questions
  • GET /random-question
  • GET /files/{task_id}
  • POST /submit (returns the overall score) (Hugging Face)

There is no route documented (or implemented) that returns “task_id → correct/incorrect” for your submission. (Hugging Face)
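For context, the POST /submit route takes a JSON body that pairs each task_id with your answer. Building that payload is straightforward; the field names below follow the course docs for Unit 4, but treat them as an assumption if your Space version differs:

```python
import json

def build_submission(username: str, agent_code: str, answers: dict[str, str]) -> str:
    """Serialize agent results into the /submit payload shape (assumed field names)."""
    body = {
        "username": username,
        "agent_code": agent_code,  # link to your Space code, per the course docs
        "answers": [
            {"task_id": tid, "submitted_answer": ans}
            for tid, ans in answers.items()
        ],
    }
    return json.dumps(body)

payload = build_submission("alice", "https://huggingface.co/spaces/alice/agent", {"t1": "42"})
```

Whatever you send, the response is still only the aggregate ScoreResponse shown at the top of this thread.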

How scoring is computed (why you only see totals)

Inside the scoring Space code, /submit:

  1. Iterates over your submitted answers list.

  2. Looks up each task_id in an internal ground_truth_answers map.

  3. Compares your answer against the ground truth using a simple normalization:

    • submitted.strip().lower() == ground_truth.strip().lower() (Hugging Face)
  4. Returns only counts and the derived percentage, not a per-item breakdown. (Hugging Face)
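Put together, the four steps amount to something like this (a hedged sketch — the function and variable names are assumptions; only the strip().lower() comparison and the aggregate-only return are taken from the Space code):

```python
def normalize(ans: str) -> str:
    # The matching rule the scorer applies: trim whitespace, lowercase
    return str(ans).strip().lower()

def score_submission(answers: list[dict], ground_truth_answers: dict[str, str]) -> dict:
    """Aggregate-only scoring, mirroring the /submit behavior described above.

    `answers` items look like {"task_id": ..., "submitted_answer": ...}.
    """
    correct = 0
    attempted = 0
    for item in answers:
        truth = ground_truth_answers.get(item["task_id"])
        if truth is None:
            continue  # unknown task_id: not counted
        attempted += 1
        if normalize(item["submitted_answer"]) == normalize(truth):
            correct += 1
    score = round(100 * correct / attempted, 2) if attempted else 0.0
    # Only totals survive — the per-item booleans are discarded here
    return {"score": score, "correct_count": correct, "total_attempted": attempted}
```

The per-item result exists transiently inside the loop; it is simply never returned, which is why the API cannot show it to you.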

Also note: the course “Hands-On” page explicitly warns the scoring is “EXACT MATCH” style (format matters), and reminds you not to include “FINAL ANSWER” boilerplate in the submission output. (Hugging Face)
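To see why format matters under this rule: case and surrounding whitespace are forgiven by strip().lower(), but nothing else is. A few illustrative (made-up) answers:

```python
def norm(x: str) -> str:
    return str(x).strip().lower()

# Case and surrounding whitespace are forgiven...
assert norm("  Paris ") == norm("paris")

# ...but anything else is not: separators, units, and boilerplate all break the match
assert norm("1,000") != norm("1000")
assert norm("3 hours") != norm("3")
assert norm("FINAL ANSWER: 3") != norm("3")
```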

Practical way to get per-question results (local check)

If your goal is debugging (which questions you miss and why), the cleanest approach is to reproduce the scorer locally.

Key fact: the course says the 20 questions come from GAIA level 1, validation split. (Hugging Face)
And the scoring Space itself loads GAIA validation and reads the “Final answer” field. (Hugging Face)

So you can:

  • Fetch the 20 task_ids via GET /questions.
  • Load gaia-benchmark/GAIA validation split locally.
  • Join on task_id.
  • Apply the same normalization (strip().lower()) and compute a per-task boolean.

Minimal Python sketch

# deps: datasets, requests
from datasets import load_dataset
import requests

API = "https://agents-course-unit4-scoring.hf.space"

# 1) Pull the exact evaluation questions (task_id list)
resp = requests.get(f"{API}/questions", timeout=30)
resp.raise_for_status()
questions = resp.json()
task_ids = {q["task_id"] for q in questions}

# 2) Load GAIA validation (same split the scorer uses)
# Note: GAIA is gated — accept the terms on the Hub and log in first (huggingface-cli login)
ds = load_dataset("gaia-benchmark/GAIA", "2023_level1", split="validation", trust_remote_code=True)

# 3) Build ground truth lookup for the 20 tasks
gt = {}
for ex in ds:
    tid = str(ex.get("task_id"))
    if tid in task_ids:
        gt[tid] = str(ex.get("Final answer", ""))

def norm(x: str) -> str:
    return str(x).strip().lower()

# 4) Compare with your agent outputs
# Replace this with your saved run results: {task_id: submitted_answer}
my_answers = {
    # "some-task-id": "my answer",
}

per_question = []
for q in questions:
    tid = q["task_id"]
    submitted = my_answers.get(tid, "")
    expected = gt.get(tid, "")
    ok = norm(submitted) == norm(expected)
    per_question.append((tid, ok))

# Print or log
for tid, ok in per_question:
    print(tid, "✅" if ok else "❌")

This gives you exactly what you asked for: correctness per task_id, for the same 20 items, using the same matching rule as the server. (Hugging Face)

If you want the server to return per-question correctness

That would require the course maintainers to change the API contract (extend ScoreResponse to include an array of per-task results, then return it). Right now the Space implementation does not do that. (Hugging Face)
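For reference, such an extension could look something like this (a purely hypothetical sketch, not the Space's actual code — every field beyond the documented ScoreResponse ones is invented):

```python
from dataclasses import dataclass, field

@dataclass
class TaskResult:
    # Hypothetical per-task entry — not part of the current API contract
    task_id: str
    is_correct: bool

@dataclass
class ScoreResponse:
    # Fields the real /submit response already returns
    username: str
    score: float
    correct_count: int
    total_attempted: int
    message: str
    timestamp: str
    # Hypothetical addition: per-question breakdown
    results: list[TaskResult] = field(default_factory=list)
```

Until something like this lands upstream, the local reproduction above is the only way to get per-question feedback.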

Quick summary

  • The official /submit response is aggregate-only. No per-question breakdown exists in the API. (Hugging Face)
  • Scoring is basically strip().lower() + exact string equality against GAIA “Final answer.” (Hugging Face)
  • To see which questions you missed, reproduce the scorer locally by loading GAIA validation and matching on task_id. (Hugging Face)

Thank you so much for the detailed response! The workarounds are good enough.
