(as shown above), the aggregated ELO rank lacks a solid foundation. Our experiments measuring "Ranking Set Stability" via a "crossover score" quantify this fragility: in one test, a model's ELO-derived haiku rankings reshuffled significantly (Crossover Score: 66; lower is better) when only ...