Continued from Columnar Databases I ...
So if a situation allows us to live with the limitations of columnar databases, how good are they? To find out for myself, I set up an experiment comparing a popular row-based database with compression turned on (Oracle 11g) against a columnar database (Infobright) built on an open-source database engine (MySQL). The experiment mainly explored compression of storage rather than query performance, as I did not have the resources to set up a more elaborate experiment.

For the OLTP-style data structures of the TPC-H(tm) benchmark, the compressed Oracle row store used about 10 GB of storage, including some indexes, which such databases require. The Infobright database, on the other hand, used about 1.9 GB, roughly one-fifth the storage. This is a significant saving even with few indexes in the row store; had more indexes been added for performance reasons, the storage comparison would have favored the columnar store even more.

For the Star Schema Benchmark database, the raw ASCII extracts were about 6.7 GB when pulled from Infobright (which by default emits quoted strings, etc.) versus about 6 GB when pulled from the Oracle tables using a custom pipe delimiter. Loaded into Infobright, the data compressed to about 800 MB, again using no indexes. Loaded into Oracle with the same compression scheme as before, the data used about 6.5 GB, and that was with three large indexes on the lineorder table and smaller indexes on the smaller tables. From these observations I conclude that while Oracle does provide compression, the columnar database (Infobright) delivered an order of magnitude better compression than both the raw text data and the equivalent row-store (Oracle) database with basic compression. For lack of the appropriate storage (an Exadata machine), I could not test the more aggressive compression scheme available for the Oracle row-store database.
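A small sketch can show why a columnar layout compresses so much better than a row layout for the same data: storing each column contiguously puts similar, repetitive values next to each other, where a general-purpose compressor can exploit them. This is only an illustration with made-up data and a generic compressor (zlib), not the compression scheme either product actually uses.

```python
import random
import zlib

# Synthetic table: a sequential id, a low-cardinality category, and a
# random quantity. Names and values are invented for this sketch.
random.seed(0)
n = 5000
ids = list(range(1, n + 1))
categories = [("widget", "gadget", "gizmo")[i % 3] for i in range(n)]
quantities = [random.randint(1, 100) for _ in range(n)]

# Row layout: one delimited record per row, mixed value types interleaved.
row_layout = "\n".join(
    f"{i}|{c}|{q}" for i, c, q in zip(ids, categories, quantities)
).encode()

# Column layout: each column stored contiguously, so similar values sit
# together and repeat within the compressor's window.
col_layout = "\n".join(
    [
        "\n".join(map(str, ids)),
        "\n".join(categories),
        "\n".join(map(str, quantities)),
    ]
).encode()

row_size = len(zlib.compress(row_layout))
col_size = len(zlib.compress(col_layout))
print(f"row layout compressed:    {row_size} bytes")
print(f"column layout compressed: {col_size} bytes")
```

With the same bytes of raw data, the column-ordered serialization compresses smaller, because the repetitive category column collapses almost entirely and the sequential ids share long common prefixes. Real columnar engines go further with per-column encodings (run-length, dictionary), which is where the larger ratios reported above come from.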
Query timings were also better for the Infobright database: extracting the large fact table to a flat file took about 12 minutes, versus about 42 minutes for Oracle. This highlights, at the least, the benefit of smaller storage, since the Oracle queries did not use any index either; these queries read all of the data from the tables in the join. Notably, the star schema shows a higher compression ratio for the columnar database even though the large fact table consists mainly of numeric data types.

References:
Abadi, D., Boncz, P., Harizopoulos, S., "Column-oriented Database Systems," VLDB '09, August 24-28, 2009, Lyon, France.
Abadi, D. J., Madden, S. R., Ferreira, M., "Integrating Compression and Execution in Column-oriented Database Systems," Proc. SIGMOD, 2006.
Hodak, W., Jernigan, K., "Advanced Compression with Oracle Database 11g Release 2," Oracle white paper, September 2009.
Oracle, "Oracle 11g SQL Reference Guide," otn.oracle.com.
Oracle, "Oracle 11g Utilities Guide," otn.oracle.com.
Infobright, "Direct Data Load Guide," infobright.org.
2 Comments
Jonathan Finerfrock
11/1/2013 02:32:15 am
We have been using Infobright Enterprise Edition now for 4 years. Like you, we saw a huge compression result from our raw data. We moved off of SQL Server 2005. While admittedly we do not have a huge data warehouse, we did see significant increases in query performance. The "limitations" that we have on Infobright are for insert and update. While Infobright claims that you can do them, the performance is so slow it is not really an option. Our workaround is really not that onerous: we truncate several of our tables and reload. Infobright has a very fast data loader that allows us to do this with little impact on uptime. For our bigger tables we delete large sections, like the current month, and then insert the revised current month. These workarounds are minimal compared to the return that we get from Infobright. Remember, once it is set up and running there is minimal administration... no indexes, no partitions, no tuning! So yes, it is worth the limitations when my end users can get a result back in 2 seconds on IB versus the 20 seconds on SQL Server, and all the work that I had to do to get it down to 20 seconds.
SARBJIT PARMAR
11/3/2013 05:10:00 am
Updates are definitely a problem for columnar databases because of the very nature of the problem being tackled, and this is well documented in the literature for various columnar databases. I am surprised that you list inserts as a problem too. Updates are slow for row-based databases as well, especially when a large number of rows must be updated; the required logging makes the operation inherently slow, and delete/insert approaches tend to work better even for row stores in those cases, while for a columnar store this approach would be orders of magnitude better. So even though updates are supported, I am sure they are not encouraged by the vendors or communities supporting columnar databases.
About Sarbjit Parmar: A practitioner with technical and business knowledge in Data Management (online transaction processing; relational, hierarchical, and dimensional data modeling; S/M/L/XL/XXL and XML data; application design; batch processing; analytics, i.e. reporting plus some statistical analysis) and Project Management / Product and Software Development Life Cycle Management. MBA + DBA.