Dissecting the Postgres Bulk Insert and Binary Format
Introduction
This post delves deep into the world of bulk loading data with
PostgreSQL. I’ve written about this subject
before in Optimizing Innodb for Bulk Insert for MySQL-esque databases. The two
posts differ not merely in which database they target, but also in their
audience: the Innodb post is geared toward a write-heavy setup with many writes
arriving in batches. This post, by contrast, deals with inserting a data set
into the database as fast as possible using COPY, and with evaluating
pg_bulkload.
For those interested in optimizing more frequent but smaller writes, see:
- PostgreSQL Insert Strategies – Performance Test
- How to Insert Data [into PostgreSQL] - as Fast as Possible
Scenario
The data is a set of Reddit voting data, which is the same data used in Processing arbitrary amount of data in Python. This is what the data looks like (username, link, and score):
00ash00,t3_as2t0,1
00ash00,t3_ascto,-1
00ash00,t3_asll7,1
00ash00,t3_atawm,1
00ash00,t3_avosd,1
0-0,t3_6vlrz,1
0-0,t3_6vmwa,1
0-0,t3_6wdiv,1
0-0,t3_6wegp,1
0-0,t3_6wegz,1
There are 7.5 million rows.
The test machine specs:
- Ubuntu 14.04 Server
- 2x Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz
- 4GB DDR3 RAM at 1600MHz
- 5400 RPM 8MB cache HDD
- Postgres 9.5 Alpha2
- Relevant Postgres Variables
- maintenance_work_mem: 1GB
- shared_buffers: 1GB
- checkpoint_segments: 16MB
Our table will be simple, but I’ll talk through some of the business logic.
- A surrogate key is defined to make referring to a vote easier. The surrogate key is defined as serial, so it will auto increment on insertion.
- Since the data doesn’t contain any nulls, we’ll include NOT NULL on the columns as a sanity check.
- Once the table is loaded, there will be indexes on username and link, which will simulate the need to do queries by username or link. I say “once the table is loaded” because the table will start out with zero indexes (discussed later).
CREATE TABLE votes (
voteId serial PRIMARY KEY,
username text NOT NULL,
link text NOT NULL,
score int NOT NULL
);
-- We're going to have a couple indexes on the data but we are going to add them
-- only after we have loaded the data for performance reasons. So don't actually
-- execute these statements now.
CREATE INDEX ON votes (username);
CREATE INDEX ON votes (link);
For those of you screaming that the schema is not normalized, relax. Read
Maybe Normalizing Isn’t
Normal, When Not
to Normalize your SQL
Database,
and understand that our data is immutable. Normalizing the schema, such that
there would be a users
and a link
table would only complicate the loading
process and provide no benefits. I don’t want to split a three column csv into
a three table database just so I can check off the box that says “Normalized”.
For performance, the schema will start out defining zero indexes:
If you are loading a freshly created table, the fastest method is to create the table, bulk load the table’s data using COPY, then create any indexes needed for the table. Creating an index on pre-existing data is quicker than updating it incrementally as each row is loaded.
We’ll take the recommendations in Populating a Database and
increase the size of maintenance_work_mem and max_wal_size (keep reading for
explanations). However, double check any variables you set, because some
require a restart and affect all queries, and we’re only interested in speeding
up the bulk load.
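As a concrete sketch of what that looks like (the 1GB maintenance_work_mem matches the test setup above; the max_wal_size value is only a placeholder, since the value used in the tests isn’t listed):
-- maintenance_work_mem can be raised for just this session
SET maintenance_work_mem = '1GB';
-- max_wal_size (the 9.5 replacement for checkpoint_segments) is a server
-- setting; ALTER SYSTEM plus a reload applies it without a restart
ALTER SYSTEM SET max_wal_size = '4GB'; -- placeholder value
SELECT pg_reload_conf();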
For each test we’ll be clearing the data and all indexes. The process will be repeated several times and the best time will be reported.
Copy Csv
The easiest case:
COPY votes (username, link, score)
FROM '/tmp/votes.csv'
WITH (FORMAT 'csv', HEADER false);
CREATE INDEX ON votes (username);
CREATE INDEX ON votes (link);
Results: 1m7.799s
That’s about 110,000 rows inserted per second – and we’re just getting started.
Copy Text
Luckily the data we are working with follows Postgres’s text format, so we’ll see if there is any performance difference, as there should be less parsing necessary.
COPY votes (username, link, score)
FROM '/tmp/votes.csv'
WITH (FORMAT 'text', DELIMITER ',');
CREATE INDEX ON votes (username);
CREATE INDEX ON votes (link);
Results: 1m5.189s
A slight improvement, but if you have csv data, it is safer to use the csv format and let Postgres handle quoting and escaping rather than hoping the data also parses cleanly as text.
Copy Binary
The last format available for COPY is binary, and the documentation offers only
a half-hearted statement about its performance:
It is somewhat faster than the text and CSV formats.
Being curious programmers, we want numbers, so let’s convert our csv into binary. The next problem: how do we convert it? The documentation is sparse and full of jargon that took me several read-throughs to understand. Still, the format isn’t difficult to implement; I cracked it on my first try, even for more complicated types, so I thought I’d document a bit of it here to share the knowledge.
The Format
First things first: the header. It consists of an 11-byte signature, a 4-byte flags field, and a 4-byte header extension length; you’d be fine ignoring any extensions and hard-coding these 19 bytes as shown in the example code.
Next comes the data, which can be described with a very simple formula (in pseudo code below); the size of each field is in parentheses.
for every row of data:
    write the number of columns to be written (short)
    for every column in row:
        write the amount of data in bytes about to be inserted (integer)
        write the data (length varies by type and contents)
For instance if we had a row with three integer columns:
for every row of data:
    # Each row has three integer columns
    writeShort(3)
    for every column in row:
        # Each integer is four bytes long
        writeInt(4)
        # Write the integer
        writeInt(value)
The Code
Let’s take a look at what it would take to convert our csv into Postgres' binary format in Python. The decision to use Python was arbitrary; any language should be able to accomplish this task.
#!/usr/bin/env python2.7
import csv
import fileinput
import sys
import struct
from collections import namedtuple
from itertools import imap

# Converts each line of the csv into a vote
RedditVote = namedtuple('RedditVote', ['username', 'link', 'score'])
votes = imap(RedditVote._make, csv.reader(fileinput.input()))
votes = imap(lambda x: x._replace(score=int(x.score)), votes)

# Postgres' binary format is big-endian, so we declare a couple helper objects
# that will convert integers to the binary format of specified length.
# 'shorter' is for writing 16 bit fields
# 'inter' is for writing 32 bit fields
shorter = struct.Struct('>h')
inter = struct.Struct('>i')

# The file header: an 11-byte signature, a 4-byte flags field (0), and the
# length of the optional header extension, which we also mark as 0.
sys.stdout.write(b'PGCOPY\n')
sys.stdout.write(b'\xFF')
sys.stdout.write(b'\r\n\x00')
sys.stdout.write(inter.pack(0))
sys.stdout.write(inter.pack(0))

def writeStr(s):
    """Writes a string in Postgres' binary format"""
    # Prefix string content with the length of the string
    sys.stdout.write(inter.pack(len(s)))
    sys.stdout.write(s)

for vote in votes:
    # Each binary row is prefixed with how many fields (columns) it contains.
    # Since we are inserting username, link, score (voteId is auto generated),
    # we are inserting three fields
    sys.stdout.write(shorter.pack(3))

    # Write out username and link
    writeStr(vote.username)
    writeStr(vote.link)

    # Prefix score content with the length of the score data type (integer),
    # which is 4 bytes
    sys.stdout.write(inter.pack(4))
    sys.stdout.write(inter.pack(vote.score))

# The file trailer: a 16-bit integer containing -1
sys.stdout.write(shorter.pack(-1))
To generate the binary file (we could pipe directly into Postgres, but then we’d be measuring the Python code, which isn’t what we’re interested in):
./to-bin.py < votes.csv > votes.bin
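The load itself is then just another COPY pointed at the binary file (a sketch, assuming the file was dropped in /tmp alongside the csv); the indexes are created afterwards exactly as before:
COPY votes (username, link, score)
FROM '/tmp/votes.bin'
WITH (FORMAT 'binary');
CREATE INDEX ON votes (username);
CREATE INDEX ON votes (link);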
How fast was Postgres able to load a format that it natively knew?
Results: 1m6.406s
I don’t know if I can accurately describe the disappointment when I first saw
that number. The binary format, in this case, is not faster. I emphasize “in
this case” because the binary format had to work with a significantly larger
file (50% bigger!). While in csv format 1 and -1 take up 1 and 2 bytes
respectively, in the binary format they consume 8 bytes. We could change the
schema to use smallint, but that would only save 2 bytes per row. Additionally,
strings need a four byte prefix to denote their length, which is a large
fraction of the strings themselves given how short they are. Clearly the binary
format isn’t an obvious win, and depending on the situation it could even lead
to worse performance.
Advanced Format
Since serializing text and integers was easy, I’ll walk through how I figured out how to serialize Postgres arrays into the binary format. This requires a dive into the Postgres source code, so a basic understanding of C helps (from the documentation):
To determine the appropriate binary format for the actual tuple data you should consult the PostgreSQL source, in particular the *send and *recv functions for each column’s data type
Let’s pretend that instead of inserting a single vote, the column contains an
array of votes (maybe someone can now vote multiple times on a link?). Now comes
the hard part: interpreting array_recv.
Here is some info we glean from the function:
- An array’s content is prefixed by a header:
- The first integer is the number of dimensions in the array. For a single-dimension array, put 1.
- The second integer represents the flags. I’ll be marking 0 for flags.
- The third integer is the element type OID, as arrays in Postgres are homogeneous. Since the element type is integer, aka int4, we find the OID with a simple select: SELECT oid FROM pg_type WHERE typname = 'int4'
- The fourth integer is the number of elements in the array to be inserted.
- The fifth integer is the lower bound, the index at which elements start. Since Postgres arrays are indexed from one, this value should not be smaller than one.
Then for each element in the array, we use the previous formula of writing the size of the content followed by the content.
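To make that concrete, here is a minimal sketch (building on the Python script above, so it assumes the inter helper and sys are in scope) of how one might write a single-dimension int4[] column; 23 is the int4 OID on a stock installation, but confirm it with the SELECT above:
def writeIntArray(values):
    """A sketch of writing a one-dimensional int4[] column."""
    header = (inter.pack(1) +            # number of dimensions
              inter.pack(0) +            # flags
              inter.pack(23) +           # element type OID (int4; confirm via pg_type)
              inter.pack(len(values)) +  # number of elements in this dimension
              inter.pack(1))             # lower bound (Postgres arrays start at 1)
    # Each element: its length in bytes followed by the element itself
    body = b''.join(inter.pack(4) + inter.pack(v) for v in values)
    payload = header + body
    # The column as a whole still gets the usual byte-length prefix
    sys.stdout.write(inter.pack(len(payload)))
    sys.stdout.write(payload)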
Unlogged Copy
A new feature in Postgres 9.5 is the ability to alter whether a table writes to the WAL. Previous versions (down to 9.1) allowed creating unlogged tables, but the tables were permanently unlogged. Considering that pg_bulkload achieves superior write performance by bypassing the WAL, we should be able to attain similar results. A blog post from 2011 reports a 2x performance bump for unlogged tables, so we’ll try it out.
What are unlogged tables good for? A reputable Postgres blog explains the uses for the (at the time) upcoming 9.1 release:
- you don’t need it to stay alive over Pg crash
- you don’t need it replicated to slaves (well - at least using warm/hot standby mechanisms)
- you do need as fast writes as possible
What happens when you set a previously UNLOGGED
table to LOGGED
? The wiki
explains:
Note that setting an unlogged table to logged will generate WAL which will contain all data in the table, so this would cause a spike in replication traffic for large tables
CREATE UNLOGGED TABLE votes (
voteId serial PRIMARY KEY,
username text NOT NULL,
link text NOT NULL,
score int NOT NULL
);
COPY votes (username, link, score)
FROM '/tmp/votes.csv'
WITH (FORMAT 'csv', HEADER false);
CREATE INDEX ON votes (username);
CREATE INDEX ON votes (link);
ALTER TABLE votes SET LOGGED;
Results: 1m28.054s
This number is higher than I expected, so I cleared the database and executed
the same statements without the final ALTER TABLE votes SET LOGGED.
New Results: 0m34.219s
This is twice as fast as any of the logged tables: 220,000 rows a second! It
turns out the blog post mentioned earlier was accurate with its 2x performance
claim. Writing to an UNLOGGED table would be even faster with a faster
processor, because Postgres was pegged at 100% CPU for the entire duration.
Think about that. A database that is CPU-bound is incredible.
We can conclude from these numbers that, even with no replication involved, there is significant overhead in writing to the WAL. Later in the article, we’ll discuss whether the compromise of speed versus safety is worth making.
Time for Fun?
Let’s take a break and have a little fun. Using CTEs (Common Table
Expressions), we can elegantly query for the top n voting users in a way that,
unlike a LIMIT query, won’t cut off users at the end of the results who are
tied with other users.
WITH scored_users AS (
SELECT username AS username,
COUNT(CASE WHEN score = -1 THEN 1 END) AS downvotes,
COUNT(CASE WHEN score = 1 THEN 1 END) AS upvotes,
COUNT(*) AS votes
FROM votes
GROUP BY username
), results AS (SELECT username, votes,
(upvotes - downvotes) AS score,
(upvotes::float / votes) as positive,
dense_rank() OVER(ORDER BY votes DESC) AS vote_rank,
dense_rank() OVER(ORDER BY (upvotes - downvotes) DESC) score_rank
FROM scored_users)
SELECT username, votes, score, positive, vote_rank, score_rank
FROM results
WHERE vote_rank < 6
ORDER BY vote_rank,positive DESC,username;
username | votes | score | positive | vote_rank | score_rank
----------------+-------+-------+-----------+-----------+------------
32bites | 2000 | 0 | 0.5 | 1 | 1000
42omle | 2000 | 0 | 0.5 | 1 | 1000
48VDC | 2000 | 0 | 0.5 | 1 | 1000
aa001m | 2000 | 0 | 0.5 | 1 | 1000
aberant | 2000 | 0 | 0.5 | 1 | 1000
ADAP2K | 2000 | 0 | 0.5 | 1 | 1000
...
Interesting: there seems to be a cap at 2000 votes (max 1000 upvotes and max 1000 downvotes). The data, when grouped by link, is much more interesting.
WITH scored_links AS (
SELECT link AS link,
COUNT(CASE WHEN score = -1 THEN 1 END) AS downvotes,
COUNT(CASE WHEN score = 1 THEN 1 END) AS upvotes,
COUNT(*) AS votes
FROM votes
GROUP BY link
), results AS (SELECT link, votes,
(upvotes - downvotes) AS score,
(upvotes::float / votes) as positive,
dense_rank() OVER(ORDER BY votes DESC) AS vote_rank,
dense_rank() OVER(ORDER BY (upvotes - downvotes) DESC) score_rank
FROM scored_links)
SELECT link, votes, score, positive, vote_rank, score_rank
FROM results
WHERE vote_rank < 6
ORDER BY vote_rank;
link | votes | score | positive | vote_rank | score_rank
----------+-------+-------+-------------------+-----------+------------
t3_beic5 | 1660 | 1102 | 0.831927710843373 | 1 | 2
t3_92dd8 | 1502 | 1186 | 0.894806924101198 | 2 | 1
t3_9mvs6 | 1162 | 650 | 0.779690189328744 | 3 | 15
t3_bge1p | 1116 | 988 | 0.942652329749104 | 4 | 3
t3_9wdhq | 1050 | 204 | 0.597142857142857 | 5 | 290
Here we see that links that get a lot of votes don’t necessarily get a lot of upvotes. You can’t see it here because I truncated the output, but the link with the 36th most votes has a score of -55. Still, the trend is that stories receiving many votes tend to have a positive score.
pg_bulkload
Back to business.
pg_bulkload is a high speed data loading utility for PostgreSQL.
pg_bulkload achieves its performance by skipping the shared buffers and WAL
logging (similar to UNLOGGED
tables).
Installation
Installation is somewhat non-trivial; a rough sketch of the steps follows the list.
- Download the source from github
- Install all the dependencies
- make; make install
- Connect to the target database and execute CREATE EXTENSION pg_bulkload
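A rough sketch of those steps (assuming the build dependencies, such as the Postgres server development headers, are already in place):
git clone https://github.com/ossc-db/pg_bulkload.git
cd pg_bulkload
make
sudo make install
# then, connected to the target database:
psql reddit -c "CREATE EXTENSION pg_bulkload"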
Limitation
pg_bulkload can’t handle default or missing columns. I find this a huge limitation that immediately eliminates many use cases. All is not lost, though: we can hack our way around it by generating our own primary keys and adding them to the csv.
Schema Change
Change the schema to pull the sequence out of the table so that we can call it ourselves. Index creation moves into the initial schema because pg_bulkload is optimized to work with pre-built indexes.
CREATE SEQUENCE voteId_seq;
CREATE TABLE votes (
voteId integer PRIMARY KEY DEFAULT nextval('voteId_seq'),
username text NOT NULL,
link text NOT NULL,
score int NOT NULL
);
CREATE INDEX ON votes (username);
CREATE INDEX ON votes (link);
Now calculate the 7.5 million sequence ids and join them with the original dataset.
psql reddit -c "
COPY (SELECT nextval('voteId_seq')
FROM generate_series(1, $(wc -l < /tmp/votes.csv)))
TO STDOUT" >/tmp/vote_ids
paste /tmp/vote_ids /tmp/votes.csv > /tmp/votes2.csv
First Results
Calculating the 7.5 million sequence ids took 8 seconds. I feel like that is slow, but I have nothing to compare it against. At least pg_bulkload will now accept the file. Now for the real load:
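pg_bulkload is driven by a control file; here is a sketch of what mine could look like (option names are from the pg_bulkload documentation, the paths are assumptions, and the DELIMITER must match however the ids were joined to the csv above):
cat > /tmp/votes.ctl <<'EOF'
OUTPUT = votes
INPUT = /tmp/votes2.csv
TYPE = CSV
DELIMITER = ","
WRITER = DIRECT
EOF
pg_bulkload /tmp/votes.ctl -d reddit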
Results: 0m42.569s
It’s a 35% improvement, which is significant, but not significant enough to warrant the trouble with installation and the lack of documentation. On top of that, I had to generate the ids out of band, which knocks it down to about a 20% improvement.
Parallel
pg_bulkload has one more trick up its sleeve: a mode where it reads the input
file in one process and writes the rows in another. One thing to note, and this
is documented by pg_bulkload, is that when specifying WRITER = PARALLEL I had
to edit /etc/postgresql/9.5/main/pg_hba.conf and use the “trust” method to
authenticate. Be careful: trust authentication can be used inappropriately and
leave the database open to security vulnerabilities.
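Concretely, the change amounts to switching the writer in the control file and loosening local authentication (a sketch; the pg_hba.conf line shown is the broad local trust rule, so confine it to a test machine):
# in /tmp/votes.ctl, replace the writer line:
sed -i 's/^WRITER = DIRECT/WRITER = PARALLEL/' /tmp/votes.ctl
# in /etc/postgresql/9.5/main/pg_hba.conf, allow local trust auth, e.g.:
#   local   all   all   trust
sudo service postgresql reload
pg_bulkload /tmp/votes.ctl -d reddit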
So how did it do?
Results: 0m36.634s
A small improvement did occur. It would be interesting to investigate whether adding more cores would help (I’m guessing not).
Advantages Compared With COPY
COPY cannot handle constraint violations: duplicate rows are not permitted if they violate a unique constraint, and a single bad row fails the entire load. MySQL, by contrast, can handle them with its REPLACE or IGNORE options.
On the other hand, pg_bulkload contains options such as DUPLICATE_ERRORS
and
ON_DUPLICATE_KEEP = NEW | OLD
to customize the behavior on duplicate
constraint violations. The only constraints that pg_bulkload won’t check are
foreign key constraints.
Investigation into maintenance_work_mem
pg_bulkload contains an interesting statement about maintenance_work_mem, which is a server configuration variable:
The maintenance_work_mem, PostgreSQL parameter, affects the performance of pg_bulkload. The duration becomes almost 15% shorter if this parameter is changed from 64 MB to 1 GB.
To cross-reference maintenance_work_mem, let’s check the official documentation:
Specifies the maximum amount of memory to be used by maintenance operations, such as VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY. It defaults to 64 megabytes (64MB). Since only one of these operations can be executed at a time by a database session, and an installation normally doesn’t have many of them running concurrently, it’s safe to set this value significantly larger than work_mem. Larger settings might improve performance for vacuuming and for restoring database dumps.
At first glance, this doesn’t seem to have anything to do with bulk loading data, so why would bumping the value from 64MB to 1GB cause a 15% speedup?
Examining the pg_bulkload and Postgres source code is illuminating:
/*
* We size the sort area as maintenance_work_mem rather than work_mem to
* speed index creation. This should be OK since a single backend can't
* run multiple index creations in parallel. Note that creation of a
* unique index actually requires two BTSpool objects. We expect that the
* second one (for dead tuples) won't get very full, so we give it only
* work_mem.
*/
Increasing maintenance_work_mem allows indexes to be created faster, because
the sort that builds them is sized by maintenance_work_mem, and since a single
session can’t build two indexes at the same time, it is safe to set it to a
larger value.
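Tying that back to our loads, one could bump the setting just for the session that builds the indexes (a minimal sketch; the 1GB value is the one from the test setup):
SET maintenance_work_mem = '1GB';
CREATE INDEX ON votes (username);
CREATE INDEX ON votes (link);
RESET maintenance_work_mem;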
WAL Analysis
Databases are interesting in how they record changes. I’ve selected a couple quotes from the book, Database Management Systems. ‘Any change to a database object is first recorded in the log; the record in the log must be written to stable storage before the change to the database object is written to disk’. The database appends work to be done to the log and at an undetermined time in the future it will actually do the work. The alternative would require the database to write the data immediately to provide the same crash recovery guarantees, which is very slow.
The danger of disabling the log can’t be overstated. ‘If a transaction made a change and committed, [and WAL is disabled], some of these changes may not have been written to disk at the time of a subsequent crash’. This could mean that a client or user sees a successful action (e.g. a new user registers), only for the database to crash a few seconds (or minutes!) later. Without the WAL there is no guarantee that the registration was persisted, so the user would end up terribly confused and need to re-register.
Our UNLOGGED table doesn’t write to the WAL. Let’s say that once the table is
loaded, it is immutable (there are no subsequent inserts) and won’t be
replicated to another server. Even then, there is still one very large
downside. The original commit message for unlogged tables sets the stage:
[Unlogged tables] are not available on standby servers and are truncated whenever the database system enters recovery
If we load the voting data into the database and the following commands are executed:
service postgresql restart
pkill -9 -f postgres
service postgresql start
psql reddit -c "SELECT COUNT(*) FROM votes"
The result will always be 0, no matter how long you wait. If anything at any
time causes the database to shut down irregularly (power loss, kill -9, etc.),
the table will be truncated because the database doesn’t know whether it is in
a good state. That is why, unless the data can be trivially restored from some
other source, you should not use UNLOGGED for long-term data.
We can recover a bit of the UNLOGGED speed using a trick with transactions.
When you execute COPY against a table that was created in the same
transaction, the WAL writes can be optimized away (note that this optimization
only applies when WAL archiving and streaming replication are off, i.e.
wal_level is minimal). From the book PostgreSQL 9.0 High Performance:
The purpose of the write-ahead log is to protect you from partially committed data being left behind after a crash. If you create a new table in a transaction, add some data to it, and then commit at the end, at no point during that process is the WAL really necessary. (pg. 402)
We can see that optimization in action with the following:
BEGIN;
CREATE TABLE votes (
voteId serial PRIMARY KEY,
username text NOT NULL,
link text NOT NULL,
score int NOT NULL
);
COPY votes (username, link, score)
FROM '/tmp/votes.csv'
WITH (FORMAT 'csv', HEADER false);
CREATE INDEX ON votes (username);
CREATE INDEX ON votes (link);
COMMIT;
Results: 0m51.810s
Notice that this is roughly a 25% improvement over the previous csv COPY. I
find this a fine compromise.
For more information on some of the use cases for UNLOGGED tables, there’s a
fine article, Reducing writes with Unlogged Tables.
In case you are curious, synchronous_commit will not provide a performance
benefit here, because it only comes into play when committing many separate
transactions.
Parallel Copy
The COPY insertions we’ve been doing have been sequential, and they have been
pretty CPU intensive. If we have multiple cores, can we split the input into
multiple chunks and load them separately, in parallel? Note that this will
take several connections, and therefore several transactions, because a
transaction is inherently single threaded and the copies can’t be intermixed
within one transaction.
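A sketch of one way to do that from the shell (hypothetical chunk names, four chunks, GNU split); each psql invocation gets its own connection, and therefore its own transaction:
# split the csv into four chunks without breaking lines
split -n l/4 /tmp/votes.csv /tmp/votes-part-
# load the chunks concurrently, one connection per chunk
for part in /tmp/votes-part-*; do
    psql reddit -c "\copy votes (username, link, score) from '$part' with (format 'csv')" &
done
wait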
In my tests I have not noticed an improvement for logged tables. The work was
divided evenly between the cores, but each ran at half usage, so total
insertion time stayed approximately the same. However, others have found
contrary evidence, and parallel COPY commands are listed in PostgreSQL 9.0
High Performance as the fastest way to insert data.
Where I did see a major improvement was with UNLOGGED tables. Insertion time
decreased linearly with the number of cores (i.e. all cores were saturated
during insertion). At that point, building an index on the data takes longer
than the actual data insertion.
In general, I would not recommend separate COPY commands for the same data set,
because it breaks the all-or-nothing semantics of a single COPY, which can be
confusing if some of the data loads and some doesn’t.
pgloader
pgloader is a project dedicated solely to loading data into Postgres. It has a lot going for it:
- Well documented
- Well maintained
- Flexible (imports multiple data formats: MySQL, SQLite, etc.)
- Resilient (loads only the data that conforms to the schema)
- On the homepage, its performance is lauded
Unfortunately, tucked away in the section about performance we find the limitation:
When given a file that the PostgreSQL COPY command knows how to parse, and if the file contains no erroneous data, then pgloader will never be as fast as just using the PostgreSQL COPY command.
pgloader looks to be an amazing tool, but not for our purposes.
Conclusion
| loading | time |
|---|---|
| Copy csv | 1m7.799s |
| Copy text | 1m5.189s |
| Copy binary | 1m6.406s |
| Copy unlogged, then SET LOGGED | 1m28.054s |
| Copy unlogged (no WAL) | 0m34.219s |
| Copy unlogged parallel | 0m25.314s |
| Copy csv in one txn | 0m51.810s |
| pg_bulkload | 0m42.569s |
| pg_bulkload parallel | 0m36.634s |
- If you need a short-lived table that can be reconstructed easily, use an UNLOGGED table
- Else if you need the fastest insertion, use multiple COPY commands against an UNLOGGED table and alter the table to LOGGED at some point in the future
- Else if you need to handle duplicate rows but still want something approaching UNLOGGED speed, use pg_bulkload
- Else COPY should be good enough
Comments
If you'd like to leave a comment, please email [email protected]