SCALER: A Procedurally Generated, Leakage-Resistant Benchmark for Evaluating Multi-Step Reasoning in Large Language Models