Developed by DeepMind, Google’s sister company, will AlphaCode set the standard among code-generating AIs?
Let P be a prime number. Find two integers a and b such that P mod a = P mod b and 2 ≤ a < b ≤ P. This is one of the problems that DeepMind, Google’s sister company, put to AlphaCode.
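For readers who want to see what a valid answer looks like, here is a minimal Python sketch (our own, not an AlphaCode output), assuming P is an odd prime greater than 3: since P is odd, P mod 2 and P mod (P - 1) are both equal to 1.

```python
def find_pair(p: int) -> tuple[int, int]:
    # For an odd prime p >= 5, p % 2 == 1 and p % (p - 1) == 1,
    # so the pair (2, p - 1) satisfies 2 <= a < b <= p with equal remainders.
    return 2, p - 1

print(find_pair(7))  # (2, 6): 7 % 2 == 1 == 7 % 6
```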
The role of this machine learning model? Generating code in response to algorithmic problems. And not trivial ones: we are dealing with potentially long programs of several hundred lines.
In this field, benchmarks show resolution rates of no more than 5%. That is not counting, says DeepMind, the proportion of false positives (30 to 60%) due to the lack of tests in the datasets used.
These datasets generally come from competitions whose very purpose is to solve such algorithmic problems. Codeforces is one of them, and AlphaCode was set to work on challenges posted on that platform.
Before being put to the test, it had to be trained. First, on 715 GB of data from public GitHub repositories; specifically, a “snapshot” taken on July 14, 2021, covering projects in C++, C#, Go, Java, JavaScript, Lua, PHP, Python, Ruby, Rust, Scala and TypeScript.
This training instilled the “fundamentals” of code generation in AlphaCode. A second dataset, named CodeContests, was then used to fine-tune its performance. Its content: problems, solutions and tests from Codeforces and two other resources, Description2Code and CodeNet. DeepMind divided it into three subsets (training, validation and testing), each drawn from a separate time interval to avoid contamination of the evaluation sets by the training data.
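To illustrate the principle of such a temporal split, here is a minimal Python sketch; the records and cut-off dates are hypothetical, not DeepMind’s actual ones.

```python
from datetime import datetime

# Hypothetical records: each problem carries the date it was published.
problems = [
    {"name": "problem_a", "published": datetime(2020, 3, 1)},
    {"name": "problem_b", "published": datetime(2021, 2, 15)},
    {"name": "problem_c", "published": datetime(2021, 9, 30)},
]

# Illustrative cut-off dates, not DeepMind's actual split points.
TRAIN_END = datetime(2021, 1, 1)
VALID_END = datetime(2021, 6, 1)

# Every validation/test problem post-dates the training problems.
train = [p for p in problems if p["published"] < TRAIN_END]
valid = [p for p in problems if TRAIN_END <= p["published"] < VALID_END]
test = [p for p in problems if p["published"] >= VALID_END]
```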
In the middle of the developer pack
Unfortunately, Codeforces’ public data does not display tests in full when they exceed 400 characters. Additional tests therefore had to be generated from the existing ones, for example by incrementing or decrementing integers, or by swapping elements within strings.
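A minimal Python sketch of this kind of test mutation, for illustration only (the exact perturbations DeepMind applied may differ):

```python
import random

def mutate_tokens(tokens: list[str]) -> list[str]:
    """Derive a new test input from an existing one: nudge integers up or
    down by one, swap two characters in string tokens. Illustrative only."""
    mutated = []
    for tok in tokens:
        try:
            # Integer token: increment or decrement it.
            mutated.append(str(int(tok) + random.choice([-1, 1])))
        except ValueError:
            if len(tok) >= 2:
                # String token: swap two characters.
                chars = list(tok)
                i, j = random.sample(range(len(chars)), 2)
                chars[i], chars[j] = chars[j], chars[i]
                mutated.append("".join(chars))
            else:
                mutated.append(tok)
    return mutated

print(mutate_tokens(["5", "abc"]))  # e.g. ['6', 'bac']
```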
With “basic” training, AlphaCode performs on a par with Codex, we are told. It surpasses it when certain techniques are applied: among others, metadata conditioning, modifying the probability distribution of the tokens emitted by the decoder, and hints about the relevance of the solutions being generated.
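As an illustration of metadata conditioning, one can imagine prepending structured fields (target language, difficulty rating, tags) to the problem statement before it reaches the encoder. The field names and layout below are our own assumptions, not AlphaCode’s exact format.

```python
def build_conditioned_input(description: str, language: str,
                            difficulty: int, tags: list[str]) -> str:
    """Prepend metadata to the problem statement before encoding.
    Field names and layout are illustrative, not AlphaCode's exact scheme."""
    header = (
        f"LANGUAGE: {language}\n"
        f"DIFFICULTY: {difficulty}\n"
        f"TAGS: {', '.join(tags)}\n\n"
    )
    return header + description

print(build_conditioned_input(
    "Let P be a prime number. Find two integers a and b ...",
    "PYTHON", 800, ["math", "brute force"]))
```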
To challenge the different versions of the model, DeepMind retrieved ten exercises recently submitted on Codeforces. The inference phase ran on a configuration of 3,750 TPUv4 and 3,750 TPUv4i chips. The objective: for each exercise, generate solutions en masse (half in C++, half in Python), then filter them by running the tests provided in the problem description. A clustering step reduces the sample further, the idea being to keep at most 10 candidates.
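This filtering-then-clustering stage can be sketched as follows in Python. The run helper, which executes a candidate program on a given input and returns its output, is hypothetical, and the grouping criterion (identical outputs on extra inputs) is a simplified reading of the approach.

```python
from collections import defaultdict

def filter_then_cluster(candidates, example_tests, extra_inputs, run, keep=10):
    """Discard candidates that fail the example tests, then group the
    survivors by their behaviour on extra inputs and keep one program per
    group. `run(program, stdin)` is a hypothetical execution helper."""
    # 1. Filtering: drop programs that fail any test from the description.
    passing = [
        prog for prog in candidates
        if all(run(prog, inp) == expected for inp, expected in example_tests)
    ]

    # 2. Clustering: programs producing identical outputs on the extra
    #    inputs are treated as equivalent and land in the same cluster.
    clusters = defaultdict(list)
    for prog in passing:
        signature = tuple(run(prog, inp) for inp in extra_inputs)
        clusters[signature].append(prog)

    # 3. Keep one representative per cluster, capped at `keep` submissions.
    return [group[0] for group in clusters.values()][:keep]
```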
Result: averaged over the ten exercises, the best version of AlphaCode performs better than 47.7% of the participants. It reaches this score with an average of 2.4 submissions per problem.
When the limit is raised to 1 million samples, AlphaCode solves 34.2% of the problems in the “validation” split of the CodeContests dataset.
AlphaCode, a model with an environmental impact
Beyond AlphaCode’s performance relative to the developers who tackled these exercises, what should we take away? In particular that:
- AlphaCode tends to reuse code from the training data, but more for data processing than for pure logic
- The syntax is not always correct, particularly in the C++ solutions
- AlphaCode generates about the same amount of “useless” code as a human
- The resolution rate increases with the number of parameters (even when limited to 10 samples), the computational power, the number of samples and the size of the data sets
- “Bigger” models are more sample-efficient (they achieve a better resolution rate for the same number of samples)
- The simpler the description, the higher the resolution rate
- An asymmetric architecture between encoder and decoder improves sampling speed
- Results are much better when training models on multiple programming languages, including non-targeted output languages
- Without filtering, fewer than 1% of the proposals pass the tests
- AlphaCode has a significant environmental impact: training and sampling consumed hundreds of petaflop/s-days. Even so, in the name of frugality, the 41-billion-parameter model was not trained to the same level as the others.