On the Limits of LLM Reasoning: Evidence From Contamination, Translation, and Answer Modification in Multiple-Choice Benchmarks | Publicación