

MAGPIE Technical Appendix - Multi-Turn Reasoning Evaluation

Detailed evaluation of multi-turn reasoning capabilities in the MAGPIE model family using MMLU-Pro, SouthernCross-AussieQA-1k benchmark, and Australian localisation scoring.

By SCX.ai Research Team

1. Overview

This appendix details the evaluation of multi-turn reasoning capabilities in the MAGPIE model family.

MMLU-Pro measures general knowledge with multiple-choice questions across 14 domains: Biology, Business, Chemistry, Computer Science, Economics, Engineering, Health, History, Law, Math, Philosophy, Physics, Psychology, and an "Others" category.

These experiments were conducted using MAGPIE-Medium (120B parameters) and MAGPIE-Small (20B parameters), both trained using LoRA fine-tuning on Australia-specific datasets. Reasoning and model output were both conditioned on Australian English.

In addition to MMLU-Pro, SCX.ai evaluated Australian localisation using the SouthernCross-AussieQA-1k benchmark and its internal Aussie Sovereign Check.

2. Training and Evaluation

2.1 Training Pipeline

Project MAGPIE is trained with a three-stage fine-tuning pipeline built on the CUDA 12.9 toolkit.

Stage 0: Sets up the CUDA / Unsloth toolchain, and generates the Hugging Face Accelerate configuration.

Stage 1: Extracts 1,000 samples from the corpus and does a dry run.

Stage 2: Trains on the full Australia-focused dataset. The trainer loads unsloth/gpt-oss-120b-unsloth-bnb-4bit and freezes all base weights, training only the LoRA adapters (r = 16, α = 32) applied to the q/k/v/o and MLP projections.
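
For illustration, a minimal sketch of this adapter setup using the Unsloth FastLanguageModel API. The sequence length, dropout setting, and exact projection-module names are assumptions and may differ from the production training script, particularly for a mixture-of-experts base model.

```python
# Sketch of the Stage 2 adapter setup, assuming the Unsloth FastLanguageModel API.
# Sequence length and module names below are illustrative assumptions.
from unsloth import FastLanguageModel

# Load the 4-bit base checkpoint named in the text.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gpt-oss-120b-unsloth-bnb-4bit",
    max_seq_length=2048,   # placeholder; the actual training length may differ
    load_in_4bit=True,
)

# Attach LoRA adapters (r = 16, alpha = 32) to the attention and MLP projections;
# every other parameter stays frozen.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections (assumed names)
    ],
    lora_dropout=0.0,
    bias="none",
)
```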

2.2 Evaluation

A modified version of the MMLU-Pro benchmark presents questions from all 14 subjects (Law, Biology, etc.) to the model using deterministic prompts and a 2,048-token generation cap. The overall score is the average of the individual subject scores.
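
As a concrete illustration, a small sketch of the overall-score computation, assuming an unweighted average over per-subject accuracies as described above; the `subject_results` counts are hypothetical placeholders, not reported numbers.

```python
# Illustrative macro-average over the 14 MMLU-Pro subjects.
# `subject_results` maps each subject to (correct, total) counts; values are hypothetical.
subject_results = {
    "Law": (410, 1100),
    "Biology": (590, 720),
    # ... remaining subjects omitted for brevity
}

per_subject = {
    subject: correct / total
    for subject, (correct, total) in subject_results.items()
}

# Overall score: unweighted mean of the individual subject accuracies.
overall = sum(per_subject.values()) / len(per_subject)
print(f"overall MMLU-Pro accuracy: {overall:.3f}")
```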

For evaluating Australia-specific questions, the SouthernCross-AussieQA-1k benchmark was used (see below).

3. Scoring and Confirmation

3.1 MMLU-Pro Benchmark

Every MMLU-Pro request logs the prompt, the raw model answer, the extracted option, and a per-subject accuracy summary. Failed extractions fall back to a random guess so that accuracy totals are not inflated by dropped questions.
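
A minimal sketch of the extraction-and-fallback step described above; the answer regex and the A-J option range are assumptions about the prompt format rather than the harness's actual parser.

```python
import random
import re

# MMLU-Pro presents up to ten options (A-J). The regex below assumes an
# "Answer: X" style response; the real harness may extract differently.
OPTION_LETTERS = "ABCDEFGHIJ"

def extract_option(raw_answer: str, num_options: int, rng: random.Random) -> tuple[str, bool]:
    """Return (chosen_option, extracted_ok). Falls back to a random guess on failure."""
    match = re.search(r"answer\s*(?:is)?\s*:?\s*\(?([A-J])\)?", raw_answer, re.IGNORECASE)
    if match:
        return match.group(1).upper(), True
    # Failed extraction: a random guess keeps the accuracy denominator honest.
    return rng.choice(OPTION_LETTERS[:num_options]), False
```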

3.2 Human Review

Human review was conducted solely on the SouthernCross-AussieQA-1k dataset and results.

4. SouthernCross-AussieQA-1k - Australian-Sovereign Evaluation

4.1 Dataset

SouthernCross-AussieQA-1k is a curated evaluation set of 1,000 question-answer pairs designed to probe Australian-specific knowledge and language, spanning:

  • Australian English spelling, idioms, and phrasing
  • Culture and entertainment
  • Travel and safety (beach conditions, outback driving, wildlife, bushwalking)
  • Indigenous history and contemporary issues
  • Civic and institutional knowledge (parliaments, ANZAC traditions, superannuation, Medicare, etc.)

4.2 Category-Balanced Evaluation

Questions are labelled into high-level categories (language, culture, travel, entertainment, general, Indigenous), and results are reported per category to ensure that models do not overfit to a single slice of Australian life. Category-wise scores are used to identify strengths (e.g., travel safety) and gaps (e.g., specific policy domains).
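
A short sketch of the per-category breakdown, assuming each scored item is reduced to a (category, correct) pair; the field layout is illustrative, not the benchmark's internal schema.

```python
from collections import defaultdict

# Each scored item is assumed to be a (category, is_correct) pair.
def category_breakdown(items: list[tuple[str, bool]]) -> dict[str, float]:
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [correct, total]
    for category, is_correct in items:
        counts[category][0] += int(is_correct)
        counts[category][1] += 1
    return {cat: correct / total for cat, (correct, total) in counts.items()}

# Example with the categories named above (language, culture, travel, ...).
scores = category_breakdown([("language", True), ("travel", False), ("travel", True)])
```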

4.3 Automatic Scoring + Manual Spot Checks

Evaluation combines automatic and human review:

Automatic scoring: exact and semi-exact matching against reference answers, including normalisation for punctuation and minor paraphrase (sketched after the list below).

Manual review: a sampled subset of outputs is reviewed by Australian human annotators, focusing on:

  • Cultural tone and appropriateness
  • Safety-critical content (beach, bush, wildlife, driving)
  • Indigenous topics and historical sensitivity
  • Avoidance of US/UK-centric framing where Australian practice differs
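
As flagged above, a minimal sketch of the normalised exact / semi-exact matching used in automatic scoring; the normalisation rules and the 0.9 similarity threshold are assumptions, not the production scorer.

```python
import re
import string
from difflib import SequenceMatcher

def normalise(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace before comparison."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def score_answer(prediction: str, reference: str, fuzzy_threshold: float = 0.9) -> bool:
    """Exact match after normalisation, with a similarity fallback for minor paraphrase.
    The 0.9 threshold is an illustrative assumption."""
    pred, ref = normalise(prediction), normalise(reference)
    if pred == ref:
        return True
    return SequenceMatcher(None, pred, ref).ratio() >= fuzzy_threshold
```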

MAGPIE-Medium passes SCX.ai's internal Aussie Sovereign criteria, with reviewers confirming natural Australian English, appropriate safety guidance, and culturally respectful handling of Indigenous content.

4.4 Regression and Comparison

The same SouthernCross-AussieQA-1k suite is run on:

  • MAGPIE-Small and MAGPIE-Medium
  • Top-performing open source models
  • Subsequent MAGPIE iterations and domain variants

SouthernCross-AussieQA-1k enables:

  • Regression tracking: ensuring future changes do not degrade Australian localisation.
  • Comparative reporting: e.g., "X-point lift on SouthernCross-AussieQA-1k overall, with the largest improvements in travel safety and Indigenous history questions."
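
To make the regression-tracking use above concrete, a hedged sketch of a per-category delta report between a baseline run and a candidate run; the function name and tolerance are illustrative.

```python
# Compare per-category SouthernCross-AussieQA-1k scores between two runs and
# flag regressions. Both inputs map category names to accuracies in [0, 1].
def regression_report(baseline: dict[str, float], candidate: dict[str, float],
                      tolerance: float = 0.01) -> dict[str, float]:
    """Return per-category deltas; drops larger than `tolerance` count as regressions."""
    deltas = {cat: candidate.get(cat, 0.0) - score for cat, score in baseline.items()}
    regressions = {cat: d for cat, d in deltas.items() if d < -tolerance}
    if regressions:
        print(f"Localisation regressions detected: {regressions}")
    return deltas
```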

5. Future Work

Future updates will integrate retrieval-augmented generation (RAG) for domain-specific recall and increase context length to over 32,768 tokens for cross-jurisdictional legal testing.

SouthernCross-AussieQA-1k will be expanded beyond 1,000 items and further diversified by state/territory and demographic perspective.

6. Reproducibility

All prompts, evaluation scripts, and reference outputs are versioned in the internal MAGPIE evaluation repository. The SouthernCross-AussieQA-1k benchmark and Aussie Sovereign scoring are maintained in a separate repository to reduce the risk of benchmark content leaking into training data.

External researchers or partners may request benchmarking materials via info@scx.ai.

Related Topics

MAGPIE, multi-turn reasoning, Australian AI, MMLU-Pro, SouthernCross-AussieQA-1k, sovereign AI, model evaluation, fine-tuning, LoRA, Australian English, CUDA, Unsloth