Buffer Files for MySQL Inserts

Recently, we faced a problem with MySQL insertion performance. We need to insert a lot of data in a short period of time, repeatedly. The inserted data must not contain duplicates; moreover, when we insert a record that already exists in the database, we need to update it instead. For instance, suppose we insert a tweet object that has a retweet count property. When we first inserted the tweet, it had X retweets. Later, when we insert a fresher copy of the same tweet, it has Y retweets. In these cases we need to update the older row in the database instead of inserting a new one.

The no-brainer way would be to search the database every time we want to insert a record, to check whether it is already there. If it is found, update it; if not, insert it. This method is very costly, since every insert requires a SELECT against the table first. So here we will use a trick: the “ON DUPLICATE KEY UPDATE” clause.
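To make the cost of that approach concrete, here is a minimal runnable sketch of the no-brainer version, using SQLite as a stand-in for MySQL (the table and column names follow the example later in this post; the round-trip pattern is the same either way):

```python
import sqlite3

# Minimal sketch of the "no-brainer" upsert, using SQLite as a
# stand-in for MySQL. The schema mirrors the example in this post.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE tweet (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    tweet_id TEXT UNIQUE,
    content TEXT,
    retweet_count INTEGER)""")

def upsert_naive(conn, tweet_id, content, retweet_count):
    # One SELECT per insert just to decide between INSERT and UPDATE --
    # this extra round trip is exactly the cost we want to avoid.
    row = conn.execute(
        "SELECT id FROM tweet WHERE tweet_id = ?", (tweet_id,)).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO tweet (tweet_id, content, retweet_count) "
            "VALUES (?, ?, ?)",
            (tweet_id, content, retweet_count))
    else:
        conn.execute(
            "UPDATE tweet SET retweet_count = ? WHERE tweet_id = ?",
            (retweet_count, tweet_id))

upsert_naive(conn, "1000001", "This is a tweet", 1)
upsert_naive(conn, "1000001", "This is a tweet", 9)  # updates in place
```

After both calls the table still holds a single row for tweet_id 1000001, now with the newer retweet count.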

Inserting with the “ON DUPLICATE KEY UPDATE” clause tells the database to attempt the insert and, when it finds a conflict on a unique key constraint, to update the existing row instead of inserting a new one. Let’s see the example below:

The `tweet` table has columns like this:


id INT AUTO_INCREMENT PRIMARY KEY,

tweet_id VARCHAR(50) UNIQUE,

content TEXT,

retweet_count INT


It already has one row:

id: 1;  tweet_id: “1000001”;  content: “This is a tweet”;  retweet_count: 1;


Now we want to insert some tweets like these:

tweet_id: “1000001”;  content: “This is a tweet”;  retweet_count: 9;

tweet_id: “1000002”;  content: “I love twitter”;  retweet_count: 0;

tweet_id: “1000003”;  content: “I like facebook better”;  retweet_count: 0;


Notice that one of the new records has the same unique key (tweet_id “1000001”). If we insert the data using the clause “ON DUPLICATE KEY UPDATE retweet_count=VALUES(retweet_count)”, the data in the table will look like this:


id: 1;  tweet_id: “1000001”;  content: “This is a tweet”;  retweet_count: 9;

id: 3;  tweet_id: “1000002”;  content: “I love twitter”;  retweet_count: 0;

id: 4;  tweet_id: “1000003”;  content: “I like facebook better”;  retweet_count: 0;


Notice that the auto-increment `id` jumped from 1 to 3. When the insert hits the duplicate key constraint, the “ON DUPLICATE KEY UPDATE” clause kicks in and the existing row is updated in place; no extra row is ever stored, but the attempted insert still consumes an auto-increment value, which is why the ids skip one. Using this method, we tried inserting 1000 records and benchmarked it against the no-brainer way. The result: around a 40% performance improvement.
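The upsert flow above can be reproduced end to end. SQLite does not speak MySQL’s “ON DUPLICATE KEY UPDATE”, so this sketch uses its equivalent “ON CONFLICT ... DO UPDATE” clause; the behavior shown (update in place, no duplicate row, no SELECT round trip) is the same idea:

```python
import sqlite3

# Runnable upsert sketch. SQLite's "ON CONFLICT ... DO UPDATE" plays
# the role of MySQL's "ON DUPLICATE KEY UPDATE" here.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE tweet (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    tweet_id TEXT UNIQUE,
    content TEXT,
    retweet_count INTEGER)""")

rows = [("1000001", "This is a tweet", 1),
        ("1000001", "This is a tweet", 9),       # duplicate key: updated
        ("1000002", "I love twitter", 0),
        ("1000003", "I like facebook better", 0)]
for r in rows:
    conn.execute(
        """INSERT INTO tweet (tweet_id, content, retweet_count)
           VALUES (?, ?, ?)
           ON CONFLICT(tweet_id) DO UPDATE
           SET retweet_count = excluded.retweet_count""", r)

counts = dict(conn.execute("SELECT tweet_id, retweet_count FROM tweet"))
```

The duplicate tweet does not create a second row; its retweet count simply overwrites the old value.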


Now we use another trick to improve things further. This time we use a buffer file to collect the data, then insert it into the database in bulk. The tricky part is that, instead of building one query per record, we build one query containing all the data in the buffer. It is similar to what you get when you dump a database including its data.

To keep it simple: we collect our data until it reaches a certain limit; when it reaches the limit, we build one insert query and commit it to the database. If successful, we empty the buffer file. In our case, we collect the data by writing it to a CSV file while keeping track of how many records are in there.
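A minimal sketch of that buffer-file idea. The `flush` callback is where the single bulk INSERT would happen; here it just records what it was given, so the sketch is runnable without a database (the class and file name are illustrative, not our production code):

```python
import csv
import os

# Append rows to a CSV buffer file; once the row count hits the limit,
# call flush() with everything in the buffer, then empty the file.
class CsvBuffer:
    def __init__(self, path, limit, flush):
        self.path, self.limit, self.flush = path, limit, flush
        self.count = 0

    def add(self, row):
        with open(self.path, "a", newline="") as f:
            csv.writer(f).writerow(row)
        self.count += 1
        if self.count >= self.limit:
            with open(self.path, newline="") as f:
                self.flush(list(csv.reader(f)))   # one bulk insert here
            open(self.path, "w").close()          # empty the buffer file
            self.count = 0

flushed = []
buf = CsvBuffer("tweets_buffer.csv", limit=3, flush=flushed.append)
buf.add(["1000001", "This is a tweet", "9"])
buf.add(["1000002", "I love twitter", "0"])
buf.add(["1000003", "I like facebook better", "0"])  # hits the limit
os.remove("tweets_buffer.csv")
```

In real use, `flush` would build the one big upsert query shown below and commit it in a single round trip.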

The one-query approach I mentioned before would generate a query like this for the case above:

INSERT INTO `tweet` (`tweet_id`,`content`,`retweet_count`) VALUES
('1000001','This is a tweet','9'),
('1000002','I love twitter','0'),
('1000003','I like facebook better','0')
ON DUPLICATE KEY UPDATE `retweet_count` = VALUES(`retweet_count`);


instead of the usual separate queries:


INSERT INTO `tweet` (`tweet_id`,`content`,`retweet_count`) VALUES
('1000001','This is a tweet','9');

INSERT INTO `tweet` (`tweet_id`,`content`,`retweet_count`) VALUES
('1000002','I love twitter','0');

INSERT INTO `tweet` (`tweet_id`,`content`,`retweet_count`) VALUES
('1000003','I like facebook better','0');


Using this method, we get a whopping 2000% performance improvement. It is of course harder to develop and maintain, so you might want to study your case carefully before implementing it.
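The batched statement can be generated programmatically. Real code should use parameter placeholders rather than splicing values into the SQL string; this sketch builds a `%s` placeholder query plus a flat parameter list, the shape expected by MySQL drivers such as mysqlclient or PyMySQL (the function name is illustrative):

```python
# Build one multi-row upsert query with %s placeholders and the
# matching flat parameter list, instead of interpolating values
# directly into the SQL (which would invite SQL injection).
def build_bulk_upsert(rows):
    placeholders = ",\n".join(["(%s, %s, %s)"] * len(rows))
    query = (
        "INSERT INTO `tweet` (`tweet_id`,`content`,`retweet_count`) VALUES\n"
        + placeholders
        + "\nON DUPLICATE KEY UPDATE `retweet_count` = VALUES(`retweet_count`)")
    params = [v for row in rows for v in row]
    return query, params

query, params = build_bulk_upsert([
    ("1000001", "This is a tweet", 9),
    ("1000002", "I love twitter", 0),
    ("1000003", "I like facebook better", 0)])
# cursor.execute(query, params) sends the whole batch in one round trip.
```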