Favoring SQL over Redis for an evergreen leaderboard
At PDX Tools, one can upload an EU4 save file and the app will deduce which achievements were earned and how many in-game days have elapsed in the save.
With this information we can create a leaderboard for individual achievements.
Here's the Postgres database schema to start us off, where we store the save id, the time it was uploaded, which achievements were detected, and the elapsed in-game time.
CREATE TABLE saves (
  save_id TEXT PRIMARY KEY,
  created_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
  achievements INTEGER[] NOT NULL,
  elapsed INTEGER NOT NULL
);
Let’s insert some data.
INSERT INTO saves
(save_id, achievements, elapsed)
VALUES
('hF3ebtQHsB', '{1, 5, 10}', 6213),
('mn1tEAnROs', '{}', 8501),
('rAtKpZX3T0', '{5}', 5134),
('H2pghiu3pR', '{5, 10}', 3679);
Then let's say we have a page, like /leaderboard/achievement/5, so we select all the save files that completed achievement 5 and order them by their elapsed score, with the created_at column deciding any tie breakers.
SELECT *, RANK() OVER (
ORDER BY elapsed, created_at ASC
) FROM saves
WHERE 5 = ANY(achievements)
The above will output the following saves:
| save_id    | achievements | elapsed | rank |
|------------|--------------|---------|------|
| H2pghiu3pR | 5,10         | 3679    | 1    |
| rAtKpZX3T0 | 5            | 5134    | 2    |
| hF3ebtQHsB | 1,5,10       | 6213    | 3    |
It’s a great start, but it hasn’t addressed keeping the leaderboard fresh.
Evergreen Leaderboard
One of the ways to keep the leaderboard fresh is to introduce a tax on saves from older versions of the game whenever a new patch is released, so recent runs are favored.
With the SQL approach, we can add a patch column and a computed elapsed_adj column:
CREATE FUNCTION tax_factor(patch INTEGER) RETURNS INTEGER
LANGUAGE SQL IMMUTABLE STRICT
RETURN 10 + (34 - LEAST(patch, 34));

CREATE TABLE saves (
  save_id TEXT PRIMARY KEY,
  -- ...
  patch INTEGER NOT NULL,
  elapsed INTEGER NOT NULL,
  elapsed_adj INTEGER GENERATED ALWAYS AS (
    elapsed * tax_factor(patch) / 10
  ) STORED NOT NULL
);
The above defines the latest patch for the leaderboard as patch 34, and the tax is a 10% penalty per version behind 34.
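If you want a quick sanity check of the math, an illustrative query (not part of the schema) can evaluate the function for a few patches:
-- Illustrative only: evaluate the tax factor for a few patch versions.
-- patch 34 -> 10 (no penalty), 30 -> 14 (+40%), 25 -> 19 (+90%)
SELECT patch, tax_factor(patch) AS factor
FROM (VALUES (34), (30), (25)) AS p(patch);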
Now we can refresh our data with patch information:
INSERT INTO saves
(save_id, achievements, patch, elapsed)
VALUES
('hF3ebtQHsB', '{1, 5, 10}', 34, 6213),
('mn1tEAnROs', '{}', 34, 8501),
('rAtKpZX3T0', '{5}', 30, 5134),
('H2pghiu3pR', '{5, 10}', 25, 3679);
And get the rankings (which have a new order!) for achievement 5 by referencing our new elapsed_adj column:
SELECT *, RANK() OVER (
ORDER BY elapsed_adj, created_at ASC
) FROM saves
WHERE 5 = ANY(achievements)
| save_id    | achievements | elapsed | elapsed_adj | rank |
|------------|--------------|---------|-------------|------|
| hF3ebtQHsB | 1,5,10       | 6213    | 6213        | 1    |
| H2pghiu3pR | 5,10         | 3679    | 6990        | 2    |
| rAtKpZX3T0 | 5            | 5134    | 7187        | 3    |
Due to limitations in Postgres, it's not possible to alter a generated column's expression, so when patch 35 is released we'd have to drop and re-add the column. A workaround is to replace the function and issue an UPDATE to force a recalculation of the generated column.
CREATE OR REPLACE FUNCTION tax_factor(patch INTEGER) RETURNS INTEGER
LANGUAGE SQL IMMUTABLE STRICT
RETURN 10 + (35 - LEAST(patch, 35));

UPDATE saves SET patch = patch;
The UPDATE may look like an innocuous no-op, but it carries a large performance caveat: it rewrites every single row in the table.
Perhaps more enticingly, one might want to relegate elapsed_adj to a regular column and push its management and calculation into the application. That way, users could see their adjusted score before committing, without duplicating the calculation logic between the application and the database. Then, when a new patch comes out, the application can issue an UPDATE statement.
UPDATE saves SET elapsed_adj =
elapsed * (10 + (35 - LEAST(patch, 35))) / 10;
All the same performance caveats apply, though.
Optionally, one can skip updating entries with no achievements, as they would never be on a leaderboard:
UPDATE saves SET elapsed_adj =
elapsed * (10 + (35 - LEAST(patch, 35))) / 10
WHERE cardinality(achievements) != 0;
If users of the app rarely upload saves with achievements, we can avoid a full table scan by creating an expression index over cardinality.
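A sketch of what that index could look like (the index name is made up; also note that a btree index can't serve a != predicate directly, so the UPDATE's filter would be phrased as a range):
-- Hypothetical expression index over the achievement count.
-- Pair it with a range predicate, e.g. WHERE cardinality(achievements) > 0.
CREATE INDEX idx_saves_achievement_count
ON saves (cardinality(achievements));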
Redis
Why even use a SQL database for this? Why not use Redis, which has sorted set guides and use cases catering directly to leaderboards? Redis should unlock better performance when querying the leaderboard, especially around answering questions like “what is save X’s achievement 5 rank?”
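For comparison, answering that question in SQL means ranking the whole result set in a subquery first. A sketch against the schema above, looking up one of the example saves:
-- Sketch: the rank of one specific save for achievement 5.
SELECT save_id, rank FROM (
  SELECT save_id,
         RANK() OVER (ORDER BY elapsed_adj, created_at ASC) AS rank
  FROM saves
  WHERE 5 = ANY(achievements)
) ranked
WHERE save_id = 'H2pghiu3pR';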
I agree that Redis is the more appropriate tool for leaderboards. I even wrote about how to design an evergreen leaderboard with it a couple years ago.
Redis should be the first thing one reaches for when building leaderboards, but the pitfalls I identified back then still exist.
To break ties in score, an epoch timestamp can be introduced as a fractional component of the score (hopefully your score is only natural numbers). In SQL, we can specify the created_at column as a secondary factor in the ordering.
In Redis, each achievement and patch combination would have its own sorted set. The leaderboard for a given achievement would then be yet another sorted set that behaves as a ZUNIONSTORE view over the sets for each combination of achievement id and patch, weighted according to the patch tax. Thus, if there are 100 achievements and 10 patches, there are now 1,100 sorted sets.
The proliferation of sorted sets can seem overwhelming compared to a single SQL table. And now the application needs to insert the raw score into one set and the adjusted score into another. Data is being duplicated between Redis and the database and that can be a source of confusion and potential inconsistencies.
The epoch that we're storing as a fractional part of the score comes back to cause problems. Executing ZUNIONSTORE with the tax weight could cause the fractional part to overflow the score into the next whole number, which could affect rankings. This can be mitigated by relegating the epoch to a smaller portion of the fraction, but then one risks precision loss.
There are other cross-cutting concerns whenever introducing a new data layer to an application:
- Deployment
- Configuration
- Authentication
- Logging
- Monitoring
- Backup
- Dependency management
So even though Redis is easy to use and seems purpose-built for leaderboards, it has a non-zero cost.
If a leaderboard is a core feature of an application, or Redis is already in use, then Redis is the natural choice. Otherwise, I'd see if the existing database is sufficient.
Schema Design and Indexes
Assuming that maintaining everything within a SQL database is desired, we should spend some time scrutinizing the schema and queries shown.
Right now, the query for an achievement's leaderboard does a full table scan, plus another scan through the achievements array for each row. This is suboptimal for any app, regardless of whether the leaderboard is a headlining feature.
Unfortunately, Postgres can't use an index for our leaderboard query as written, since the ANY construct is not an indexable operator. We have to massage our query to use the array containment operator:
SELECT *, RANK() OVER (
ORDER BY elapsed_adj, created_at ASC
) FROM saves
WHERE achievements @> '{5}'
Then index creation would look like:
CREATE INDEX idx_saves_achievements
ON saves USING GIN(achievements);
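To double check that the planner actually picks up the index (plans vary with table size and statistics, so treat this as a sketch):
-- Verify the GIN index is used for the containment filter.
EXPLAIN (ANALYZE)
SELECT save_id FROM saves
WHERE achievements @> '{5}';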
Alternatively, there's the Postgres intarray extension, which is said to bring additional operators and better performance when indexing integer arrays.
It kinda feels like we're going down a rabbit hole, and a very Postgres-specific one. Maybe we should pause for a moment and ask ourselves how deeply we want to tie the data model to Postgres features and syntax.
With database access on the edge approaching, via solutions like Cloudflare D1 and PlanetScale's new JS driver (SQLite and MySQL, respectively), it could be a good idea to code against the least common denominator to ease any potential data migrations. Or maybe there's a desire for a local, thick-client mode backed by SQLite.
Normalization
By now the local database pundit is knocking on my door, telling me to eschew arrays and normalize the data by creating a separate achievements table:
CREATE TABLE achievements (
  save_id TEXT REFERENCES saves(save_id) NOT NULL,
  achievement INTEGER NOT NULL
);
To me, this is a tough sell:
- Pedantic: There is nothing else related to achievements in the database.
- Cumbersome: A new save now requires one insert into saves plus one insert per detected achievement (see the sketch after this list). And if the detected achievements later turn out to be incorrect, a batch of deletes and inserts is required, instead of the single update an array column would need.
- Inefficient: Each achievement array doesn't take up much room on its own, but the normalized form repeats the save id, a text value that can be much longer, alongside every achievement.
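To make the cumbersome point concrete, here's a rough sketch of the insert overhead under the normalized schema, assuming the array column has been dropped from saves and reusing the earlier example data:
-- Hypothetical normalized insert: one row in saves plus one row per achievement.
INSERT INTO saves (save_id, patch, elapsed)
VALUES ('hF3ebtQHsB', 34, 6213);

INSERT INTO achievements (save_id, achievement)
VALUES ('hF3ebtQHsB', 1),
       ('hF3ebtQHsB', 5),
       ('hF3ebtQHsB', 10);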
A denormalized form often has performance benefits compared to its normalized sibling, though this varies depending on the situation, and is debatable. Jeff Atwood wrote about database normalization and touches upon performance. However, I think the most important takeaway from the article is this sentence:
when it comes to database design […], try to err heavily on the side of sane, simple design
I can’t add too much more than that. In this case, a denormalized form with an array is the sane and simple design.
Sometimes it seems like every stackexchange answer about SQL arrays mentions normalization, and I grow tired of it.
Json
MySQL and SQLite don’t have an array data type, but Postgres, MySQL, and SQLite all understand and can query Json.
The syntax for querying Json differs a bit between databases, but at least they all agree that a given column contains Json, which should alleviate migration concerns.
We’ll modify our Postgres schema to be Json:
CREATE TABLE saves (
  -- ...
  achievements JSONB NOT NULL
);
No changes are needed to our index, as JSONB is indexable. Optionally, one can leverage the jsonb_path_ops operator class to further tune performance.
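That tuned index might look like the following sketch; jsonb_path_ops yields a smaller, faster GIN index for containment queries at the cost of supporting fewer operators:
-- Optional replacement for the earlier index, using the jsonb_path_ops
-- operator class (only the @> family of operators is supported).
CREATE INDEX idx_saves_achievements
ON saves USING GIN (achievements jsonb_path_ops);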
Our update statement and leaderboard queries only need slight tweaks:
UPDATE saves SET elapsed_adj =
elapsed * (10 + (35 - LEAST(patch, 35))) / 10
WHERE jsonb_array_length(achievements) != 0;
-- ...
SELECT *, RANK() OVER (
ORDER BY elapsed_adj, created_at ASC
) FROM saves
WHERE achievements @> '5'
MySQL is even more adept at indexing a Json array, thanks to multi-valued indexes that work with the MEMBER OF() operator: our exact use case! And of course, MySQL supports ranking.
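Here's an untested sketch of how that could look in MySQL 8; the table layout and index name are assumptions rather than anything pulled from PDX Tools:
-- MySQL 8 sketch: a multi-valued index over the Json array of achievements.
CREATE TABLE saves (
  save_id VARCHAR(32) PRIMARY KEY,
  achievements JSON NOT NULL,
  elapsed_adj INT NOT NULL,
  INDEX idx_achievements ((CAST(achievements AS UNSIGNED ARRAY)))
);

-- MEMBER OF() can use the multi-valued index; RANK() handles the ordering.
SELECT save_id,
       RANK() OVER (ORDER BY elapsed_adj) AS ranking
FROM saves
WHERE 5 MEMBER OF (achievements);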
SQLite is the one left relatively out in the cold. It has no dedicated type for valid Json, but it does have a myriad of Json functions that would allow us to query achievements for a leaderboard, just not efficiently, as there is no index over SQLite Json arrays.
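For completeness, here is a sketch of what that query could look like in SQLite, assuming the same table layout carries over; json_each() expands the array, but every row is still scanned:
-- SQLite sketch: filter by achievement via json_each(), rank with a window function.
SELECT s.save_id,
       RANK() OVER (ORDER BY s.elapsed_adj, s.created_at) AS ranking
FROM saves AS s, json_each(s.achievements) AS j
WHERE j.value = 5;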
Conclusion
Am I advocating for others to ditch Redis in favor of a single SQL table with Json when implementing a leaderboard?
Writing that question out loud almost makes me recoil. But that is what this article is converging towards.
And that’s ok.
No one knows your app better than you. If your app is already employing Redis or the leaderboard is a core feature where performance can’t be compromised, then continue down the Redis path.
The point of this article is that a SQL database is a great second choice, especially if it allows one to avoid complicating their app with an additional database. And as this article has shown, storing data as a Json array is well supported across Postgres, MySQL, and SQLite, with both Postgres and MySQL exposing indexes for efficient querying.
A single SQL table with Json columns could be the sane and simple database design you crave.