GPT-4 is incredible at programming!

NISARG SHAH
5 min read · Sep 12, 2023

In this new benchmark of LLMs, several models were asked the same set of questions. Their responses let us gauge their relative competence at a given task, which makes it easier to identify the best model for the job.

Let’s see an example. To evaluate each model’s programming competence, every model was asked to write a function that finds the nth number in the Fibonacci sequence. Here’s the exact prompt:

Write a Python function to find the nth number in the Fibonacci Sequence.

A fairly simple prompt.

Here are some responses you would get from different models.

GPT-3.5 uses an iterative loop to calculate the nth number. It also adds some explanation.

Here is a Python function that finds the nth number in the Fibonacci Sequence:

```python
def fibonacci(n):
    if n <= 0:
        return "Invalid input. Please enter a positive integer."
    elif n == 1:
        return 0
    elif n == 2:
        return 1
    else:
        a, b = 0, 1
        for _ in range(3, n+1):
            a, b = b, a + b
        return b
```

To use this function, you can call it with the desired value of n:

```python
print(fibonacci(10)) # Output: 34
print(fibonacci(20)) # Output: 4181
```

The function first checks if the input is valid (i.e., a positive integer). If n is less than or equal to 0, it returns an error message. If n is 1, it returns 0, and if n is 2, it returns 1, as these are the first two numbers in the Fibonacci Sequence.

For n greater than 2, the function uses a loop to calculate the nth number in the sequence. It initializes two variables, a and b, with the values 0 and 1, respectively. Then, it iterates from 3 to n, updating the values of a and b in each iteration by swapping them and adding them together. Finally, it returns the value of b, which represents the nth number in the Fibonacci Sequence.
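
As a quick sanity check of my own (not part of the model’s response), the iterative version can be compared against the naive recursive definition of the sequence:

```python
def fib_recursive(n):
    # Reference implementation taken straight from the definition (1-indexed, starting at 0).
    if n == 1:
        return 0
    if n == 2:
        return 1
    return fib_recursive(n - 1) + fib_recursive(n - 2)

# The iterative answer matches the definition for the first 15 terms.
assert all(fibonacci(n) == fib_recursive(n) for n in range(1, 16))
```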

NSQL LLaMA-2 (7B) responds with the following:

```sql
SELECT FLOOR(SQRT(5) * SQRT(5)) / 2 AS "nth_number" FROM (VALUES (1), (1)) AS "fibonacci_sequence"
```

Looking at such a comparison, you might prefer GPT-3.5 over NSQL LLaMA-2 for a simple programming task. By reviewing the responses of more models, you get a clear picture of their relative ability at writing simple programs. Now let’s look at a slightly more complex task.

Each model was given the following prompt:

Write a Python function that prints the next 20 leap years. 
Reply with only the function.

You would expect some key elements in the response (a rough sketch of how such checks could be automated follows the list):

  1. The response contains valid Python code.
  2. The response contains only a function, with no extra text.
  3. The function produces exactly 20 years in its output.
  4. The function does not produce any unintended functionality.
  5. Each produced year is actually a leap year in the future.
  6. The function can be run at any time in the future and still give a valid result.
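
Here is a rough, hypothetical sketch of how a few of these checks could be automated. It is my own illustration, not how the benchmark actually scores responses, and a real harness should sandbox model output rather than exec()-ing it directly:

```python
import ast
import calendar
import contextlib
import datetime
import io

def check_leap_year_response(code: str) -> list[str]:
    """Return a list of problems found in a model's reply; an empty list means all checks pass."""
    problems = []

    # Checks 1 and 2: the reply must parse, and contain nothing but function definitions.
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return ["response is not valid Python"]
    funcs = [node for node in tree.body if isinstance(node, ast.FunctionDef)]
    if not funcs:
        return ["response does not define a function"]
    if len(tree.body) > len(funcs):
        problems.append("response contains extra statements besides the function")

    # Checks 3, 5 and 6: run the function, capture stdout, and inspect the printed years.
    namespace = {}
    exec(code, namespace)  # never exec untrusted model output outside a sandbox
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        namespace[funcs[0].name]()
    years = [int(token) for token in buffer.getvalue().split() if token.isdigit()]
    if len(years) != 20:
        problems.append(f"printed {len(years)} years instead of 20")
    current_year = datetime.date.today().year
    if any(not calendar.isleap(y) or y < current_year for y in years):
        problems.append("printed a year that is not a future leap year")

    return problems
```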

Let us look at some examples.

We will start with Alpaca (7B).

```python
def print_leap_years():
    for i in range(20):
        if i % 400 == 0:
            print(i)
    print("End")

print_leap_years()</s>
```

There are several issues with this program:

  1. Invalid syntax — the </s> at the end prevents the program from running.
  2. It starts to search for leap years from 0 instead of the current year (2023).
  3. Its output only contains one leap year (0) instead of 20.
  4. It prints End at the end of the sequence, which it was not instructed to do. When using generative AI programmatically, this can easily introduce unintended functionality.

Let’s look at Chronos Hermes (13B) next:

```python
def leap_years(n):
    years = [2000, 2004, 2008, 2012, 2016, 2020, 2024, 2028, 2032, 2036, 2040, 2044, 2048, 2052, 2056, 2060, 2064, 2068, 2072, 2076, 2080, 2084, 2088, 2092, 2096, 2100]
    for i in range(n):
        print(years[i])

leap_years(20)
```

This is an interesting approach but has several limitations.

  1. All values are hardcoded.
  2. The hardcoded list includes 2100, which is not actually a leap year: years divisible by 100 only count as leap years if they are also divisible by 400 (see the snippet below).
  3. The values produced are not actually the “next” 20 leap years; they are the first 20 leap years starting from the year 2000.
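
If you want to double-check that century rule, Python’s standard library can confirm it:

```python
import calendar

print(calendar.isleap(2000))  # True: divisible by 400
print(calendar.isleap(2100))  # False: divisible by 100 but not by 400
print(calendar.isleap(2024))  # True: divisible by 4 and not by 100
```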

Let’s review the response of Claude v2:

Here is a Python function to print the next 20 leap years:

```python
def print_next_20_leap_years():
    current_year = 2023
    num_leap_years = 0
    while num_leap_years < 20:
        if current_year % 4 == 0 and (current_year % 100 != 0 or current_year % 400 == 0):
            print(current_year)
            num_leap_years += 1
        current_year += 1
```

This is much better, but it still has two problems.

  1. There is some extra text at the top that is not a comment, so if you took the response programmatically and tried to run it, it would fail.
  2. It assumes the current year is 2023 instead of treating it as a parameter or a dynamic value, so if you ran the same function in 2025, it would produce incorrect results.

Code Llama (34B) avoids one problem, but introduces another:

```python
def leap_years(year):
    for i in range(year, year + 20):
        if i % 4 == 0 and i % 100 != 0 or i % 400 == 0:
            print(i)

leap_years(2020)
```

It cleverly adds a parameter for the year, which keeps it useful in the future, but it has a problem of its own: it scans only 20 consecutive years, so instead of producing 20 leap years it prints at most 5.
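
A minimal fix of my own (keeping the year parameter the model sensibly introduced) would be to count leap years rather than years scanned:

```python
def leap_years(year):
    count = 0
    while count < 20:
        # Same leap-year test as above, applied until 20 leap years have been printed.
        if year % 4 == 0 and (year % 100 != 0 or year % 400 == 0):
            print(year)
            count += 1
        year += 1

leap_years(2024)
```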

Finally, let’s look at GPT-4:

```python
def print_leap_years():
    import datetime
    current_year = datetime.datetime.now().year
    count = 0
    while count < 20:
        if (current_year % 4 == 0 and current_year % 100 != 0) or (current_year % 400 == 0):
            print(current_year)
            count += 1
        current_year += 1

print_leap_years()
```

This is pretty good. Let’s go over our original checklist:

  1. It produces valid Python code.
  2. The response contains only code, with no extra text.
  3. It produces exactly 20 years.
  4. There is no unintended functionality.
  5. Each produced year is actually a leap year in the future.
  6. I can run the function in the year 2025 and still get a valid list of the next 20 leap years (a sketch of how to simulate this is shown after the list).
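
If you don’t want to wait to verify that last point, one rough way (my own sketch, assuming the function above is already defined in the session) is to fake the clock with unittest.mock:

```python
from unittest import mock

# Pretend the clock says 2028 and confirm the function still starts from a future leap year.
with mock.patch("datetime.datetime") as fake_datetime:
    fake_datetime.now.return_value.year = 2028
    print_leap_years()  # the first printed year should be 2028
```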

The only flaw: the prompt asked it to write a function, not execute it. But since most models made the same mistake, I will ignore it for now. Overall, GPT-4 performed really well. Here’s the report card of the models I evaluated:

Of course, for any real-world project, you’d need to perform much more analysis to determine a model’s suitability in the given domain, but such benchmarks allow us to narrow further analysis down to 2–3 models.

If you are curious, you can see responses from other models here: LLM Benchmark (llmonitor.com)
