Open Source Is Free, but Not Free

Before discussing how companies with open source products earn revenue, let's briefly cover what an open source product is. Before the turn of the millennium, most software was paid, and its code could not be viewed, let alone modified, by others. The open source movement emerged in 1983, but it only gained real momentum toward the end of the 1990s. Open source software is software whose source code (for a lay audience, think of it as the raw material) is licensed under an open source license, and the owner can distribute that source code to anyone to view, develop, and use. This has had a huge impact on software development, especially within the developer (programmer) community, since open source is free. For more info, just read the wiki: http://en.wikipedia.org/wiki/Open-source_software

Back to the main topic.

I used to wonder why anyone would be kind enough to go to all the trouble of writing code and then give it away without asking for a cent. How generous can you get? 😀 Even more puzzling: there are companies whose products are open source. How do they make money?

Ok, so let me clarify. "Free" in English can mean either liberty or zero cost. In the context of Open Source Software (OSS), "free" means freedom: the freedom to do whatever you want with the software, not a zero price. Almost all open source software does happen to be gratis (not sure about all of it; let me know if you find a paid one), but the idea is that the source code is open to everyone, so anyone is free to use it and build on it.

The business model for OSS is to profit from training, support, and maintenance fees, not from selling the software or charging for its use. Let's take an example.

Say we use the MongoDB database at our company. As the company grows, our requirements become more complex and the data we handle grows larger. At some point we can no longer rely on the open source product alone, and we need a formal partnership with its developer. Why? So that when shit suddenly happens, we can get help from the product's developer quickly and get the problem resolved. Speed and security are what we need. We could work around this by having a team of rockstar developers, but not every company wants, or feels the need, to employ rockstar (extremely skilled) developers.

So the bottom line: open source is free (as in freedom), not free (as in gratis).


Buffer File MySQL Inserts

Recently, we faced a problem with MySQL insertion performance. Basically, we need to insert lots of data in a short period of time, repeatedly. The inserted data must not contain duplicates; moreover, when we insert a record that already exists in the database, we need to update it instead. For instance, say we want to insert a tweet object that has a retweet count property. When we first inserted the tweet, it had X retweets. Later, when we insert a more up-to-date copy, it has Y retweets. In cases like this, we need to update the older row in the database instead of inserting a new one.

The no-brainer way would be: every time we want to insert a record, search the database to find out whether it is already there. If it is found, update it; if not, insert it. This method is very costly, since every insert requires a SELECT against the table first. So here we will use a trick: the "ON DUPLICATE KEY UPDATE" clause.
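For illustration, here is a minimal sketch of that no-brainer approach in Python, using an in-memory SQLite database as a stand-in for MySQL (the table and column names follow the tweet example later in this post; the helper name is my own):

```python
import sqlite3

# In-memory SQLite database standing in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tweet (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        tweet_id TEXT UNIQUE,
        content TEXT,
        retweet_count INTEGER
    )
""")

def naive_upsert(conn, tweet_id, content, retweet_count):
    # The "no brainer" way: one SELECT to check, then INSERT or UPDATE.
    row = conn.execute(
        "SELECT id FROM tweet WHERE tweet_id = ?", (tweet_id,)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO tweet (tweet_id, content, retweet_count) VALUES (?, ?, ?)",
            (tweet_id, content, retweet_count),
        )
    else:
        conn.execute(
            "UPDATE tweet SET retweet_count = ? WHERE tweet_id = ?",
            (retweet_count, tweet_id),
        )
    conn.commit()

naive_upsert(conn, "1000001", "This is a tweet", 1)
naive_upsert(conn, "1000001", "This is a tweet", 9)  # same key: updates, not inserts
print(conn.execute("SELECT COUNT(*), MAX(retweet_count) FROM tweet").fetchone())
# (1, 9)
```

Every call pays for an extra round trip to check the table, which is exactly the cost the trick below avoids.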

Inserting with the "ON DUPLICATE KEY UPDATE" clause makes the database attempt the insert, and when it finds a conflict on a unique key constraint, it updates the existing row with the new values instead of inserting a duplicate. Let's see the example below:

The `tweet` table has columns like this:

id INT PRIMARY KEY AUTO_INCREMENT,

tweet_id VARCHAR(50) UNIQUE,

content TEXT,

retweet_count INT

 

It already has one row:

id: 1;  tweet_id: "1000001";  content: "This is a tweet";  retweet_count: 1;

 

Now we want to insert some tweets like these:

tweet_id: "1000001";  content: "This is a tweet";  retweet_count: 9;

tweet_id: "1000002";  content: "I love twitter";  retweet_count: 0;

tweet_id: "1000003";  content: "I like facebook better";  retweet_count: 0;

 

Notice that we are inserting a record with the same unique key (tweet_id "1000001"). If we insert the data with the clause "ON DUPLICATE KEY UPDATE retweet_count=VALUES(retweet_count)", the data in the table would look like this:

 

id: 1;  tweet_id: "1000001";  content: "This is a tweet";  retweet_count: 9;

id: 3;  tweet_id: "1000002";  content: "I love twitter";  retweet_count: 0;

id: 4;  tweet_id: "1000003";  content: "I like facebook better";  retweet_count: 0;

 

Notice that the auto-increment `id` jumped from 1 to 3. This is because the database first attempts a regular insert, allocating a new auto-increment value; when it detects the duplicate key, it triggers the "ON DUPLICATE KEY UPDATE" clause and updates the existing row instead, but the auto-increment value it already allocated is not reused. Using this method, we tried inserting 1000 records and benchmarked it against the no-brainer way. The result: around a 40% performance improvement.
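The same upsert behavior can be sketched in Python. Note this uses SQLite (3.24+) as a stand-in, which spells MySQL's "ON DUPLICATE KEY UPDATE col = VALUES(col)" as "ON CONFLICT(col) DO UPDATE SET col = excluded.col"; the effect on the conflicting row is the same, though SQLite's auto-increment gap behavior may differ from MySQL's:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tweet (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        tweet_id TEXT UNIQUE,
        content TEXT,
        retweet_count INTEGER
    )
""")
conn.execute("INSERT INTO tweet (tweet_id, content, retweet_count) "
             "VALUES ('1000001', 'This is a tweet', 1)")

# SQLite's equivalent of MySQL's ON DUPLICATE KEY UPDATE:
# on a unique-key conflict, the existing row is updated in place
# instead of the statement failing or inserting a duplicate.
conn.execute("""
    INSERT INTO tweet (tweet_id, content, retweet_count)
    VALUES ('1000001', 'This is a tweet', 9)
    ON CONFLICT(tweet_id) DO UPDATE SET retweet_count = excluded.retweet_count
""")
conn.commit()

print(conn.execute("SELECT tweet_id, retweet_count FROM tweet").fetchall())
# [('1000001', 9)]
```

Still one row, with the retweet count refreshed, and no extra SELECT needed.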

 

Now, we use another trick to improve things further. This time we use a buffer file to collect the data, then insert it into the database in one go. The tricky part is that instead of building one query per record, we build a single query containing all the data in the buffer. It is similar to what happens when you dump a database along with its data.

To keep it simple: we collect our data until it reaches a certain limit; when it does, we build one insert query and commit it to the database. If that succeeds, we empty the buffer file. In our case, we collect the data by writing it to a CSV file while keeping track of how many records it holds.

The one-query approach I mentioned before would generate a query like this for the case above:

INSERT INTO `tweet` (`tweet_id`,`content`,`retweet_count`) VALUES

('1000001','This is a tweet','9'),

('1000002','I love twitter','0'),

('1000003','I like facebook better','0')

ON DUPLICATE KEY UPDATE `retweet_count` = VALUES(`retweet_count`);

 

instead of the usual one query per record:

 

INSERT INTO `tweet` (`tweet_id`,`content`,`retweet_count`) VALUES

('1000001','This is a tweet','9');

INSERT INTO `tweet` (`tweet_id`,`content`,`retweet_count`) VALUES

('1000002','I love twitter','0');

INSERT INTO `tweet` (`tweet_id`,`content`,`retweet_count`) VALUES

('1000003','I like facebook better','0');

 

Using this method, we got a whopping 2000% performance improvement. This is of course harder to develop and maintain, so you may want to study your own case carefully before implementing it.
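As a rough sketch of the buffered approach, again in Python on SQLite as a stand-in for MySQL: the buffer limit, the helper names, and the in-memory list standing in for the CSV buffer file are all my own assumptions, not the original implementation. The point is that a flush builds one multi-row upsert statement for everything collected so far:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tweet (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        tweet_id TEXT UNIQUE,
        content TEXT,
        retweet_count INTEGER
    )
""")

BUFFER_LIMIT = 3   # flush threshold; the right value is workload-dependent
buffer = []        # stands in for the CSV buffer file

def flush(conn, buffer):
    # Build ONE multi-row upsert for everything in the buffer,
    # mirroring the single big INSERT ... ON DUPLICATE KEY UPDATE query.
    if not buffer:
        return
    placeholders = ",".join(["(?, ?, ?)"] * len(buffer))
    params = [value for row in buffer for value in row]
    conn.execute(
        "INSERT INTO tweet (tweet_id, content, retweet_count) "
        f"VALUES {placeholders} "
        "ON CONFLICT(tweet_id) DO UPDATE SET retweet_count = excluded.retweet_count",
        params,
    )
    conn.commit()
    buffer.clear()  # like emptying the buffer file after a successful commit

def add(conn, buffer, row):
    buffer.append(row)
    if len(buffer) >= BUFFER_LIMIT:
        flush(conn, buffer)

add(conn, buffer, ("1000001", "This is a tweet", 9))
add(conn, buffer, ("1000002", "I love twitter", 0))
add(conn, buffer, ("1000003", "I like facebook better", 0))  # hits the limit -> flush
print(conn.execute("SELECT COUNT(*) FROM tweet").fetchone()[0])
# 3
```

A real implementation would also need to handle a failed flush (keep the buffer file, retry later) and escape or parameterize values read back from the CSV, which is where the extra maintenance cost comes from.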