MiniBARD

Benchmark for Aesthetics, Roleplay, and Depth


Aesthetics: Captures the Creative Writing and the "beauty" of the prose.

Roleplay: Directly addresses the RP-Bench and character immersion.

Depth: Covers the Reasoning and EQ—the model's ability to understand complex subtext and provide nuanced, multi-layered responses.

This project sets up a local benchmarking pipeline for comparing the creative writing and roleplay capabilities of two 12B large language models (Qliphoth v1a and v1b) without relying on external API keys. Running on a RunPod 3090 GPU server with 500GB of local storage, the system skips large, generic benchmarks such as MMLU in favor of a targeted "Council of Judges" approach. Five diverse, high-parameter judge models, including Gemma 3, Nemotron, and Cydonia, evaluate model outputs across 80 complex prompts. To reduce bias, the pipeline runs a blind pairwise comparison that shuffles the order of the two responses, which counters position (first-entry) bias, and it uses "abliterated" judges so that moralizing or refusals do not skew the creative scores. The final result is a majority-rule verdict across the judges, reported as a win rate: a compute-efficient data point on which model version produces better prose and instruction-following.
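The blind pairwise comparison and majority-rule verdict described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `blind_pair`, `council_verdict`, `win_rate`, and the stand-in judge `prefer_longer` are hypothetical names, and a real judge would be a call to a locally hosted model rather than a length check.

```python
import random
from collections import Counter

def blind_pair(response_a, response_b, rng):
    """Randomly order two responses and record which slot holds which model."""
    if rng.random() < 0.5:
        return (response_a, response_b), {"1": "A", "2": "B"}
    return (response_b, response_a), {"1": "B", "2": "A"}

def council_verdict(response_a, response_b, judges, rng):
    """Collect one blind vote per judge and return the majority winner."""
    votes = []
    for judge in judges:
        # Each judge sees its own shuffled order, so slot position carries no signal.
        (first, second), slot_to_model = blind_pair(response_a, response_b, rng)
        choice = judge(first, second)  # hypothetical judge: returns "1" or "2"
        votes.append(slot_to_model[choice])
    counts = Counter(votes)
    if counts["A"] == counts["B"]:
        return "tie"
    return "A" if counts["A"] > counts["B"] else "B"

def win_rate(verdicts, model):
    """Share of decided (non-tie) prompts won by the given model."""
    decided = [v for v in verdicts if v != "tie"]
    return sum(v == model for v in decided) / len(decided) if decided else 0.0

# Stand-in judge for demonstration only: prefers the longer response.
def prefer_longer(first, second):
    return "1" if len(first) >= len(second) else "2"

rng = random.Random(0)
judges = [prefer_longer] * 5
verdict = council_verdict(
    "a short reply", "a much longer, more detailed reply", judges, rng
)
print(verdict)
```

Using an odd number of judges avoids ties as long as every judge returns a preference; the `"tie"` branch only matters if a judge is dropped or the council size is even.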

The full BARD suite evaluates LLM reasoning, emotional intelligence, creative writing, and roleplay. All non-English prompts are removed.

The BARD tool (designed for RunPod) lets the user specify either API keys or a locally hosted council of LLM judges. The benchmark output JSONs are then passed to those judges for evaluation.

=== PROMPT LOADING COMPLETE ===
Total English prompts loaded: 2863
 - mt_bench: 80 prompts
 - eq_bench: 1573 prompts
 - cw_bench: 419 prompts
 - rp_bench: 791 prompts
===============================

MiniBARD uses Bernoulli sampling with a fixed seed (random.seed(420)) to draw a reproducible, representative slice of 80 prompts from each benchmark. The aim is to track the full BARD score closely while running only 320 of the 2,863 prompts, roughly 11% of the compute.

=== MINI-B.A.R.D. PROMPT LOADING COMPLETE ===
Total representative prompts loaded: 320
 - mt_bench: 80 prompts
 - eq_bench: 80 prompts
 - cw_bench: 80 prompts
 - rp_bench: 80 prompts
=============================================
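The slicing step can be sketched as below. One caveat: textbook Bernoulli sampling keeps each item independently with probability p, so the sample size varies from run to run, whereas the log above shows exactly 80 prompts per benchmark. This sketch therefore shows both variants and uses a seeded fixed-size draw (`random.sample`) for the per-benchmark slice. That substitution is an assumption, not a description of the project's actual code, and the function names and placeholder prompt IDs are hypothetical.

```python
import random

def bernoulli_sample(items, p, rng):
    """Keep each item independently with probability p (sample size varies)."""
    return [item for item in items if rng.random() < p]

def fixed_size_slice(prompts_by_bench, per_bench, seed=420):
    """Draw exactly per_bench prompts from each benchmark, reproducibly."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return {
        name: rng.sample(prompts, min(per_bench, len(prompts)))
        for name, prompts in prompts_by_bench.items()
    }

# Placeholder prompt IDs sized like the full-suite loader output.
full_suite = {
    "mt_bench": [f"mt_{i}" for i in range(80)],
    "eq_bench": [f"eq_{i}" for i in range(1573)],
    "cw_bench": [f"cw_{i}" for i in range(419)],
    "rp_bench": [f"rp_{i}" for i in range(791)],
}

mini = fixed_size_slice(full_suite, per_bench=80)
total = sum(len(prompts) for prompts in mini.values())
print(total)
```

Because the seed is fixed, repeated calls return the same slice, which keeps scores comparable across runs and across model versions.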

BARD is a composite of the following benchmarks, listed here by their loader keys: mt_bench, eq_bench, cw_bench, and rp_bench.

Thanks to @sam-paech for releasing the EQ-Bench suite.

Examples

Current Judge Models

Cydonia 4.3 24B
Mag Mell 12B
Nemotron 8B Ablit
~~Gemma 3 27B Ablit~~

Future Features Planned

Slop tests
Censorship tests
Q0 Benchmark scores
Possible representative slicing of MMLU, HellaSwag, etc.