Optimizing Innodb for Bulk Insert

Update: October 2nd, 2015: This article contains an important omission, as it does not consider the LOAD DATA INFILE command. If the data can be loaded as is, or with minimal tampering, I’d recommend trying LOAD DATA INFILE first, and then applying the MySql server variable changes I recommend as needed. Additionally, this post focuses only on the Innodb storage engine; if TokuDB is available on your platform, it may be worth looking into, as TokuDB’s primary use case is write-intensive environments.
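
For reference, here is a rough sketch of what a LOAD DATA INFILE call looks like; the file name, table, and field options below are hypothetical and would need to match the actual CSV:

LOAD DATA INFILE '/tmp/input.csv'
INTO TABLE staging_rows
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;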

Background

I need to parse, validate, and split a number of CSVs, ranging from 10,000 to over a million lines, into several Innodb tables. Each CSV needs to be loaded in a single transaction, such that if a problem occurs on the last line, the entire file is rolled back, as sketched below. As can be imagined, the database comes under immense write pressure during a small window of time. This post was warranted after seeing underwhelming performance; here I document my quest.
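
In SQL terms the pattern is one transaction per file, roughly like this (the table and column names are made up for illustration; the real schema differs):

START TRANSACTION;
-- each parsed and validated CSV line becomes inserts into the split tables
INSERT INTO orders (id, customer_id, total) VALUES (1, 42, 19.99);
INSERT INTO order_lines (order_id, sku, qty) VALUES (1, 'ABC-123', 3);
-- ... repeated for every line in the file ...
COMMIT; -- or ROLLBACK as soon as any line fails validation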

The goal is to eke out as much performance as I can from the test machine without compromising data integrity.

Test machine:

Tour of MySql Variables

After exploring the MySql server variables, I think I’ve come up with a pretty good list of variables to optimize bulk insert speed. Note that some of these variables are, by default, set to sensible values depending on the MySql version, but I believe explicit is better than implicit, so I replicate them here.
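
To see what your server is currently running with, the standard SHOW VARIABLES statement works well, for example:

-- list every Innodb-related server variable and its current value
SHOW VARIABLES LIKE 'innodb%';
-- or inspect a single variable
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';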

Why the file system matters

I discovered why the file system matters while investigating why I wasn’t getting the performance I expected. I thought I knew where the bottleneck was, but like all the other stories out there, I was wrong. Using iotop, I found that a process called jbd2/sda1-8 was using 90-99% of the IO time, and that MySql throughput was limited to about 500KB/s. Scrounging around the web, I discovered it was the journaling process of ext4, which protects the file system against corruption. But wait, doesn’t Innodb protect itself against corruption?

You can disable the Innodb doublewrite feature on transactional file systems. The innodb_doublewrite variable dictates whether Innodb writes data twice before it lands in its final place in the database, though the MySql manual states that “the double write buffer does not require twice as much I/O overhead or twice as many I/O operations”. A transactional file system guarantees that partial writes are impossible, which is exactly what the doublewrite feature protects against. One type of transactional file system is a journaling file system, and ext4 has journaling baked in by default, so one can save writes by skipping innodb_doublewrite.

For completeness, this was how I mounted the drive:

/dev/sdb /mnt/database ext4 defaults,nodiratime 0 0

Pretty standard mount; the only oddity is nodiratime, which “disables writing file access times to the drive every time you read a file [and] disables the writing of file access times only for directories while other files still get access times written.” This is done to improve performance, since file access times aren’t important for a database (and since a database accesses files a lot, this does indeed help). I did not benchmark the effects of nodiratime.

Some may argue to keep innodb_doublewrite enabled and instead disable the journaling through data=writeback, but I disagree, because disabling the journaling will affect all files on the drive, not just the database. I reasoned that it would be far more apparent why a database variable was set a certain way than why the file system was mounted a certain way.
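
For comparison, the mount line for that alternative would look something like this (same device and mount point as above, shown only to illustrate the option I decided against):

/dev/sdb /mnt/database ext4 defaults,nodiratime,data=writeback 0 0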

Options Not Considered

INSERT DELAYED. “INSERT DELAYED is slower than a normal INSERT if the table is not otherwise in use.” “As of MySQL 5.6.6, INSERT DELAYED is deprecated, and will be removed in a future release.” Also, Innodb can’t use INSERT DELAYED.

There is a way to skip checking for duplicates when inserting data:

If you have UNIQUE constraints on secondary keys, you can speed up table imports by temporarily turning off the uniqueness checks during the import session:

SET unique_checks=0;
... SQL import statements ...
SET unique_checks=1;

For big tables, this saves a lot of disk I/O because InnoDB can use its insert buffer to write secondary index records in a batch. Be certain that the data contains no duplicate keys.

Not an option, because when splitting the CSV into multiple tables, checking for uniqueness and eliminating duplicates is key.

Another option is to not check foreign key constraints when loading the data:

If you have FOREIGN KEY constraints in your tables, you can speed up table imports by turning off the foreign key checks for the duration of the import session:

SET foreign_key_checks=0;
... SQL import statements ...
SET foreign_key_checks=1;

For big tables, this can save a lot of disk I/O.

A convincing argument can be made for using this technique. It is valid to use because the data files are independent of each other, so there will be no foreign key constraint violations. I decided against implementing it because these aren’t “big” tables (we’re talking about <10 million rows), and most, if not all, of the data should fit in the buffer pool. If the data is in the buffer pool, it is trivial to check foreign key constraints.

Config

[mysqld]
innodb_change_buffering=all        # buffer all types of secondary index changes, not just inserts
innodb_change_buffer_max_size=25   # let the change buffer use up to 25% of the buffer pool
# innodb_buffer_pool_instances=1 Change at your own discretion
innodb_buffer_pool_size=3072M      # most of the working set should fit in memory
innodb_log_file_size=384M          # larger redo logs mean fewer checkpoint flushes mid-load
innodb_log_buffer_size=128M        # large transactions can buffer more redo before hitting disk
innodb_flush_log_at_trx_commit=1   # keep full durability: flush the redo log on every commit
skip-innodb_doublewrite            # ext4 journaling already protects against partial writes
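
After restarting MySql with this config, it’s worth sanity-checking from a client session that the settings actually took effect, for example:

SHOW VARIABLES LIKE 'innodb_doublewrite';
SHOW VARIABLES LIKE 'innodb_log_file_size';
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';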

The Result

Increasing the log buffer and log file sizes and skipping the doublewrite buffer cut the time for loading the files to a tenth of the original. I’m very satisfied with this result.

Resources:

Comments

If you'd like to leave a comment, please email [email protected]

2018-02-13 - Nils

It’s worth noting that LOAD DATA will be a single transaction; it makes sense to split this up into smaller batches, otherwise you’ll have very sizeable undo log records created.

2018-11-16 - Nuno Valente

HERE BE DRAGONS!

Hi, if you arrived here like me you are trying to improve the write performance of your bulk inserts. Check this please: https://www.percona.com/blog/2014/05/23/improve-innodb-performance-write-bound-loads/

DISABLING THE innodb doublewrite MAY CORRUPT YOUR DATA.