SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks

Association for Computational Linguistics 2025; Dmitrii Babaev; Alena Fenogenova; Rodion Levichev; Ivan Lopatin; Valentin Malykh; Mikhail Ivanov; Pavel Adamenko; Aidar Valeev; Pavel Zadorozhny

doi:10.48448/k8mg-8q47

Shared by NobleBlocks on Oct 10, 2025 • 12:00 AM UTC

Authors:

Association for Computational Linguistics 2025

Babaev, Dmitrii

Fenogenova, Alena

Abstract

The rapid advancement of Large Language Models (LLMs) in software engineering has revealed critical limitations in existing benchmarks, particularly the widely used SWE-bench dataset. Recent studies have uncovered severe data contamination issues, e.g., SWE-bench (Jimenez et al., 2023) reports 32.67%...

Subject

Computer science

Discriminative model

Benchmark (surveying)
