Saturday 3 October 2009

Scaling writes in MySQL

Scaling writes in MySQL: "We use MySQL on most of our projects. One of these projects has a an access pattern unlike any other I've worked on. Several million records a day need to be written to a table. These records are then read out once at the end of the day, summarised and then very rarely touched again. Each record is about 104 bytes long (thre's one VARCHAR column, everything else is fixed), and that's after squeezing out every byte possible. The average number of records that we write in a day is 40 million, but this could go up.

A little bit about the setup. We have fairly powerful boxes with large disks in RAID 1/0 and 16GB of RAM, though at the time of these tests they had only 4GB. For BCP (business continuity planning), we have a multi-master setup across two colos with statement-level replication. We used MySQL 5.1.

My initial tests with various parameters that affect writes showed that MyISAM performed slightly better than InnoDB while the tables were small, but its performance deteriorated quickly once the table size crossed a certain point. InnoDB performance deteriorated as well, but at a larger table size. That threshold turned out to be related to innodb_buffer_pool_size, which in turn was capped by the amount of RAM we had on the system.
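For reference, a quick way to see how close a table has grown to that threshold; the schema and table names below ('mydb', 'events') are placeholders, not the ones from this project:
    -- Compare the on-disk size of the table (data + indexes) with the buffer pool.
    SELECT table_name,
           ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
      FROM information_schema.TABLES
     WHERE table_schema = 'mydb' AND table_name = 'events';

    SHOW VARIABLES LIKE 'innodb_buffer_pool_size';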

I decided to go with InnoDB since we also needed transactions for the summary tables, and I preferred not to divide my RAM between two different engines. I stripped out all indexes and retained only the primary key. Since InnoDB clusters the table on the primary key, I decided that rather than use an auto_increment column, I'd span several columns with the primary key to guarantee uniqueness. This had the added advantage that if the same record was inserted more than once, it would not result in duplicates. This small point was crucial for BCP, because it meant that we did not have to keep track of which records had already been inserted. If something crashed, we could just reinsert the last 30 minutes' worth of data, possibly into the secondary master, and not have any duplicates at the end of it. I used INSERT IGNORE to get this done automatically.
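As a rough sketch of the idea (the column names below are invented for illustration; the actual schema isn't shown here), the table and the idempotent insert looked something like this:
    -- Illustrative only: a composite primary key over the identifying columns,
    -- so re-inserting an identical record is skipped instead of creating a duplicate.
    CREATE TABLE events (
        source_id  INT UNSIGNED     NOT NULL,
        event_type TINYINT UNSIGNED NOT NULL,
        time       INT UNSIGNED     NOT NULL,  -- Unix timestamp of the record
        label      VARCHAR(40)      NOT NULL,  -- the single VARCHAR column
        PRIMARY KEY (source_id, event_type, time, label)
    ) ENGINE=InnoDB;

    -- Replaying the last 30 minutes of data after a crash is safe: rows that
    -- already exist are ignored rather than raising a duplicate-key error.
    INSERT IGNORE INTO events (source_id, event_type, time, label)
    VALUES (17, 3, 1254574800, 'example');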

Now to get back to the table size limit that we were facing. Initial tests showed that we could insert at most 2100 records per second until the table size grew to a little over the innodb_buffer_pool_size, at which point it degraded fairly rapidly to around 150 records per second. This was unacceptable, because records were coming into the system at an average rate of 1000 per second. Since we only needed to read these records at the end of the day, it was safe to accumulate them in a text file and periodically insert them in bulk. I decided to insert 40,000 records at a time. The number I chose was arbitrary, but later tests that I ran on batches of 10K, 20K and 80K showed no difference in insert rates. With batch inserts we managed to get an insert rate of 10,000 records per second, but this too degraded to around 150 records per second as soon as we hit the same limit.
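The exact bulk-load statement isn't shown here; one way to do it from the accumulated text file (the path, the default tab separator and the column list are assumptions) is LOAD DATA INFILE, whose IGNORE keyword keeps the same duplicate-skipping behaviour:
    -- Load one accumulated batch (roughly 40,000 tab-separated rows) in a single statement.
    LOAD DATA INFILE '/data/incoming/batch-1230.txt'
    IGNORE INTO TABLE events
    (source_id, event_type, time, label);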

System stats on the database box showed that the disk was almost idle for most of the run and then suddenly shot up to 90-100% utilisation once we hit this limit, so it was obvious that at that point the DB was constantly shuttling data between the buffer pool and disk.

At this point, someone suggested that we try partitioning, which was available in MySQL 5.1. My first instinct was to partition based on the primary key so that we could read data out easily. However, reads weren't really our problem since we had no restriction on how fast they needed to be (at least not as much as writes). Instead, I decided to partition my table based on the pattern of incoming data.

The first part was obvious: use a separate table for each day's data. On a table of this size, DROP TABLE is much faster than DELETE FROM <table> WHERE ..., and it also reclaims the space. I should mention at this point that we used innodb_file_per_table as well, to make sure that each table had its own file rather than everything going into a single InnoDB tablespace file.
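With innodb_file_per_table on, dropping yesterday's table is essentially a file delete and the space goes straight back to the filesystem; the DELETE alternative is shown only for contrast (table names are illustrative):
    -- Fast: removes the previous day's table and its .ibd file in one go.
    DROP TABLE IF EXISTS events_20091002;

    -- Slow on tens of millions of rows, and the freed space stays inside the tablespace:
    -- DELETE FROM events_20091002 WHERE time < 1254528000;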

Secondly, each table was partitioned on time: 12 partitions per day, with 2 hours of data per partition. The MySQL docs on partitioning were quite useful in understanding what to do. The command ended up looking like this:
    CREATE TABLE <table> (
        ...
    ) PARTITION BY RANGE( ( time DIV 3600 ) MOD 24 ) (
        PARTITION p0  VALUES LESS THAN (2),
        PARTITION p1  VALUES LESS THAN (4),
        PARTITION p2  VALUES LESS THAN (6),
        PARTITION p3  VALUES LESS THAN (8),
        PARTITION p4  VALUES LESS THAN (10),
        PARTITION p5  VALUES LESS THAN (12),
        PARTITION p6  VALUES LESS THAN (14),
        PARTITION p7  VALUES LESS THAN (16),
        PARTITION p8  VALUES LESS THAN (18),
        PARTITION p9  VALUES LESS THAN (20),
        PARTITION p10 VALUES LESS THAN (22),
        PARTITION p11 VALUES LESS THAN (24)
    );
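A quick sanity check of the bucketing, reusing the illustrative events name from the sketch above: a timestamp from 13:00 UTC should land in the partition covering VALUES LESS THAN (14), and information_schema shows how rows spread across the partitions:
    -- 1254574800 is 2009-10-03 13:00:00 UTC; the expression yields 13,
    -- i.e. the 12:00-14:00 bucket (VALUES LESS THAN (14)).
    SELECT ( 1254574800 DIV 3600 ) MOD 24;

    -- Approximate row counts per partition for the day's table.
    SELECT partition_name, table_rows
      FROM information_schema.PARTITIONS
     WHERE table_name = 'events';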
The time field is the timestamp of incoming records, and since time always moves forward (at least in my universe), this meant that I would never write to more than 2 partitions at any point in time. Now, a little back-of-the-envelope calculation:
    44M x 102 bytes                 = approx 4.2GB
    2x for InnoDB overhead          = approx 8.4GB
    +10% for partitioning overhead  = approx 9.2GB
    /12 partitions                  = approx 760MB per partition
This turned out to be more or less correct. In most cases the total table size ranges between 8 and 10GB, and sometimes it goes up to 13GB. Partition sizes range from less than 700MB to over 1GB depending on the time of day. With 4GB of RAM, we had innodb_buffer_pool_size set at 2.7GB, which was enough to hold two partitions, but not enough to work on any other tables or do anything else on the box. Boosting the RAM to 16GB meant that we could have a 12GB buffer pool and leave 4GB for the system. This was enough for 2 partitions even if the total number of records went up, and we could work on other tables as well.
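In my.cnf terms the relevant knobs come down to a couple of lines; the excerpt below is illustrative and uses the post-upgrade figure described above:
    # Illustrative my.cnf excerpt
    [mysqld]
    innodb_buffer_pool_size = 12G   # was 2.7GB on the 4GB boxes
    innodb_file_per_table           # one .ibd file per table, so DROP TABLE frees disk space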

After partitioning, tests showed that we could sustain an insert rate of 10K rows per second for some time. As the table size grew past 10 million records, the insert rate dropped to about 8500 rows per second, but it stayed at that rate for well over 44 million records. I tested inserts up to 350 million records and we were able to sustain an insert rate of around 8500 rows per second. Coincidentally, during Michael Jackson's memorial service, we actually did hit an incoming rate of a little over 8000 records per second for a few hours.

One more back-of-the-envelope calculation:
    8500 rows per second  x  86400 seconds per day = 734.4 Million records per day
Considering that before this system was redesigned it was handling about 7 Million records per day, I'd say that we did pretty well.

