Aider's refactoring benchmark exercises based on popular python repos
This repository holds exercises for a coding benchmark used by the aider AI coding tool.
This benchmark was designed to provoke “lazy coding” in the GPT-4 Turbo models,
which are widely reported to have this problem.
It assisted in the design and evaluation of a solution to the
lazy coding problem.
Asking GPT-4 Turbo to format code changes as unified diffs
reduced lazy coding
by 3X.
Aider has long used a
benchmark suite based on 133 Exercism python exercises.
But these are mostly small coding problems,
usually requiring only a few dozen lines of code.
GPT-4 Turbo is typically only lazy on 2-3 of these exercises:
the ones with the most code and which involve refactoring.
Based on this observation, I set out to build a benchmark based on refactoring
a non-trivial amount of code found in fairly large files.
To do this, I used python’s ast
module to analyze
9 popular open source python repositories
to identify challenging refactoring tasks.
The goal was to find source files containing non-trivial class methods that don't use their
self
parameter, so they can be trivially refactored out of the class.
We can then turn each of these source files into a task for the benchmark,
where we ask GPT to do something like:
Refactor the _set_csrf_cookie method in the CsrfViewMiddleware class
to be a stand alone, top level function.
Name the new function _set_csrf_cookie, exactly the same name as the existing method.
Update any existing self._set_csrf_cookie calls to work with the new _set_csrf_cookie function.
A simple python AST scanning script
found 89 suitable files
and packaged them up as benchmark tasks.
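The scanning logic can be sketched roughly like this. This is a simplified illustration, not the actual benchmark script; the node-count threshold and function name here are assumptions:

```python
import ast

MIN_METHOD_NODES = 100  # assumed size threshold, not the benchmark's actual cutoff


def find_candidate_methods(source: str):
    """Yield (class_name, method_name) for sizable methods that never use `self`."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, ast.ClassDef):
            continue
        for method in node.body:
            if not isinstance(method, ast.FunctionDef):
                continue
            method_nodes = list(ast.walk(method))
            if len(method_nodes) < MIN_METHOD_NODES:
                continue  # too small to make an interesting refactor
            # `self` shows up as an ast.Name node wherever the method body uses it.
            uses_self = any(
                isinstance(n, ast.Name) and n.id == "self" for n in method_nodes
            )
            if not uses_self:
                yield node.name, method.name
```

A filter along these lines surfaces methods that can be hoisted out of their class without touching any instance state.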
Each task has a test
that checks whether the refactor
was performed roughly correctly.
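A check of this kind can be sketched as follows. This is a simplified illustration, not the benchmark's actual test; the 10% node-count tolerance and the function signature are assumptions:

```python
import ast


def check_refactor(original_src: str, updated_src: str,
                   class_name: str, method_name: str,
                   tolerance: float = 0.1) -> bool:
    """Rough sanity check that a class method was hoisted to a top-level function intact."""
    # The edited file must still parse as valid python.
    try:
        updated_tree = ast.parse(updated_src)
    except SyntaxError:
        return False

    def size(func: ast.FunctionDef) -> int:
        return sum(1 for _ in ast.walk(func))

    # Measure the original class method.
    original_size = None
    for node in ast.walk(ast.parse(original_src)):
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            for item in node.body:
                if isinstance(item, ast.FunctionDef) and item.name == method_name:
                    original_size = size(item)
    if original_size is None:
        return False

    # The method must now exist as a top-level function of roughly the same size,
    # i.e. no code was elided and replaced with comments.
    for item in updated_tree.body:
        if isinstance(item, ast.FunctionDef) and item.name == method_name:
            return abs(size(item) - original_size) <= tolerance * original_size
    return False
```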
To be clear, this is not a rigorous test that the refactor was performed correctly.
But it does serve as a basic sanity check that the refactor was essentially done as a cut & paste, without eliding any code as comments.
And it correlates well with other laziness metrics
gathered during benchmarking, such as the
introduction of new comments that contain “…”.
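A metric of that kind might be computed along these lines. This is a hypothetical sketch, not aider's benchmarking code:

```python
def count_lazy_comments(unified_diff: str) -> int:
    """Count added comment lines containing an ellipsis, a common "lazy" placeholder."""
    count = 0
    for line in unified_diff.splitlines():
        # Added lines in a unified diff start with '+' ('+++' is the file header).
        if line.startswith("+") and not line.startswith("+++"):
            stripped = line[1:].strip()
            if stripped.startswith("#") and ("..." in stripped or "…" in stripped):
                count += 1
    return count
```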
The result is a pragmatic
benchmark suite that provokes, detects and quantifies GPT coding laziness.
The refactoring exercises are based on code from the following
repositories: