Category: Bigdata

Job Scheduling Best Practices

10/13/2013

While working on various data warehousing and batch jobs, I have thought about job scheduling, run proof-of-concept projects to evaluate job scheduling tools, and implemented jobs so that they run in minimal time, using all the resources available in the execution environment while avoiding pitfalls such as excessive parallelism. I have come up with some observations, ideals, and requirements in this area, as described below.

Make the jobs ready for as quick a restart as possible.
Job dependency should be managed robustly.
Every job should report its status back to the parent process or job scheduler. This is especially important when dependencies are managed based on the success or failure of a predecessor job.
Every failed job should capture the error reported so that a root cause analysis can be performed.
The job plan in the job scheduler should ideally not hardcode the degree of parallelism.
Resource consumption requirements should be defined for every job. A scheduler should be able to use this information to decide which job can be scheduled next. Some resource requirements are readily visible to the scheduler because it is running those tasks itself. Others may not be, e.g. database resource consumption. Good scheduling tools allow resources to be described through metadata, as either actual or virtual resources needed to run the jobs.
The scheduler should know the overall resources (either actually detected or virtually defined) available in the system (cluster/SMP/MPP), so that it can keep track of which resources are in use and which are free, and run the jobs whose dependencies have been satisfied and whose resources are available. It should manage resource consumption so that jobs are not overscheduled. This is very important for avoiding excessive context switching and an overall slowdown in performance. Letting the scheduler allocate resources also makes the schedule flexible, so that it does not need constant tuning whenever the job plans change.
After a job is complete the allocated resources should be released back to the scheduler.
Every job should be run at the earliest available opportunity and no later.
An ideal scheduler provides a very flexible calendar for running jobs, e.g. on the first day, nth day, or last day of the week, month, or year. The day may be a regular day, a working day, a holiday, etc., and these types of days should be easily configurable.
An ideal scheduler should also be able to subscribe to other events for triggering jobs, including the arrival of certain files and notifications from email or queuing systems.
An ideal scheduler provides the ability to transfer files, preferably with a pipelining mechanism so that files are transferred for immediate consumption and not landed on disk.
There should be multiple notification mechanisms available to users when certain events occur, and these events should be easy to define and configure. Notifications may be delivered through operator consoles, email, text messages, etc.
Every batch job should take resources away from online users for the minimum possible time. Make smart use of an offline processing area.
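
Several of the points above (dependencies, per-job resource requirements, a known resource pool, releasing resources on completion, and running each job at the earliest opportunity) can be illustrated together. The following is only a minimal simulation sketch, not the behavior of any particular scheduling product. The names Job, simulate, slots, and duration are hypothetical, and "slots" stands in for whatever virtual resource unit a real tool would let you define.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    slots: int                          # virtual resource units held while running (assumed metadata)
    duration: int                       # simulated run time in ticks
    depends_on: frozenset = frozenset() # names of predecessor jobs


def simulate(jobs, total_slots):
    """Start each job as soon as its predecessors are done and enough slots are free.

    Returns a dict mapping job name to the tick at which the job started.
    """
    pending = {job.name: job for job in jobs}
    running = {}                        # name -> (finish_tick, slots held)
    done = set()
    start_times = {}
    free = total_slots
    tick = 0
    while pending or running:
        # Release the resources of every job that has finished by this tick.
        for name, (finish, slots) in list(running.items()):
            if finish <= tick:
                free += slots
                done.add(name)
                del running[name]
        # Start every pending job whose dependencies are met and which fits.
        for name in sorted(pending):
            job = pending[name]
            if job.depends_on <= done and job.slots <= free:
                free -= job.slots
                running[name] = (tick + job.duration, job.slots)
                start_times[name] = tick
                del pending[name]
        if running:
            # Jump to the next completion; nothing changes before then.
            tick = min(finish for finish, _ in running.values())
        elif pending:
            # Nothing is running and nothing can start: bad plan or oversized job.
            raise RuntimeError(f"cannot schedule: {sorted(pending)}")
    return start_times


plan = [
    Job("extract_a", slots=2, duration=3),
    Job("extract_b", slots=2, duration=2),
    Job("extract_c", slots=2, duration=2),
    Job("load", slots=3, duration=1,
        depends_on=frozenset({"extract_a", "extract_b", "extract_c"})),
]
print(simulate(plan, total_slots=4))
```

Note that the plan never states how many extracts may run in parallel. With 4 slots, two extracts start at tick 0 and the third starts at tick 2 when a slot pair is released; with 6 slots, all three would start at once without any change to the plan.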


Columnar Databases II

10/8/2013


Continued from Columnar Databases I ...

So if a situation allows us to live with the limitations of columnar databases, how good are they? To find out for myself, I set up an experiment comparing a popular row-based database (Oracle 11g) with compression turned on against a columnar database (Infobright) that relies on an open source database engine (MySQL). I designed the experiment mainly to explore storage compression rather than query performance, as I did not have the resources to set up that elaborate an experiment.

For OLTP-style data structures (TPC-H(tm) benchmark), the Oracle compressed row data store used about 10 GB of storage, including some indexes, which are required for such databases. The Infobright database, on the other hand, used about 1.9 GB, roughly one fifth of the storage. This is a significant saving given that the row store did not carry many indexes; had more indexes been added for performance reasons, the storage comparison would have favored the columnar database even more.

For the Star Schema Benchmark database, the data extracts were about 6.7 GB of raw ASCII data when pulled from Infobright (which by default provides quoted strings, etc.) vs. about 6 GB of raw ASCII data when pulled from the Oracle tables using a custom pipe delimiter. When loaded into Infobright, the data compressed to about 800 MB, again using no indexes. When loaded into the Oracle database with the same compression scheme as before, the data used about 6.5 GB. From these observations I conclude that Oracle does provide some compression, considering that its figure includes three large indexes on the lineorder table and smaller indexes on the smaller tables. The columnar database (Infobright), however, provided close to an order of magnitude of compression relative to both the raw text data and the equivalent row-store (Oracle) database with basic compression. For lack of appropriate storage (an Exadata machine), I could not test the more aggressive compression schemes available for the Oracle row store.

Query timings were also better with Infobright: extracting the large fact table to a flat file took about 12 minutes, while the same extraction from Oracle took about 42 minutes. This highlights, at the least, the benefit of smaller storage, because the Oracle queries did not use any indexes either; these queries read all the data from the tables in the join.

The star schema shows a higher compression ratio for the columnar database even though its large fact table uses mainly numeric data types.
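
For convenience, the ratios implied by the approximate figures reported above can be worked out in a few lines. This is only arithmetic on the rounded sizes and timings quoted in this post; the variable names are mine.

```python
def ratio(larger, smaller):
    """How many times bigger (or slower) the first figure is than the second."""
    return larger / smaller


# Approximate figures as reported in the text (GB and minutes).
tpch_oracle_gb, tpch_infobright_gb = 10.0, 1.9
ssb_raw_gb, ssb_oracle_gb, ssb_infobright_gb = 6.7, 6.5, 0.8
extract_oracle_min, extract_infobright_min = 42, 12

print(f"TPC-H, Oracle vs Infobright storage: {ratio(tpch_oracle_gb, tpch_infobright_gb):.1f}x")
print(f"SSB, raw text vs Infobright storage: {ratio(ssb_raw_gb, ssb_infobright_gb):.1f}x")
print(f"SSB, Oracle vs Infobright storage:   {ratio(ssb_oracle_gb, ssb_infobright_gb):.1f}x")
print(f"Fact table extract, Oracle vs Infobright time: {ratio(extract_oracle_min, extract_infobright_min):.1f}x")
```

The results are roughly 5.3x for TPC-H storage, 8.4x and 8.1x for the star schema storage, and 3.5x for the extract time.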

References:

Abadi, D., Boncz, P., Harizopoulos, S. "Column-oriented Database Systems" in VLDB ’09, August 24-28, 2009, Lyon, France.
Abadi, D.J., Madden, S.R., and Ferreira, M. "Integrating compression and execution in column-oriented database systems" In Proc. SIGMOD, 2006.
Hodak, W., Jernigan, K. "Advanced Compression with Oracle Database 11g Release 2" An Oracle White Paper from Oracle Corporation, September 2009.
Oracle, "Oracle 11g SQL Reference Guide" from otn.oracle.com
Oracle, "Oracle 11g Utilities Guide" from otn.oracle.com
Infobright.org, "Direct data load guide" available from infobright.org


Columnar Databases - Part I

10/6/2013


There is a growing need to process more and more data, and more data is being captured electronically from various collection points across commercial, non-profit, government, and research communities. This phenomenon is termed Big Data in industry parlance. Making sense of the data being gathered requires a large amount of processing power. The data may be available in granular form or as documents, and sometimes the two are correlated. Over time, the nature of the data gathered has been changing. Traditionally most data was transactional in nature, requiring CRUD (create, read, update, delete) operations. Now a larger share of the data being created is usually never updated and is deleted only when it is no longer needed, typically after a much longer period than transactional data. OLTP databases provided the ability to perform CRUD operations with ACID (atomic, consistent, isolated, and durable) properties on granular data, and were later enhanced to store various types of documents (text, pictures, etc.), again with OLTP-type transactions in mind. These databases typically use a normalized data model for storage. But given the cost of providing ACID guarantees and the need to handle different types and volumes of data for analytical purposes, the data could no longer be contained in those models. Data warehouses were therefore designed using the same type of databases but with different data models (typically dimensional, though not always). Data warehouses allowed the data to be separated from the OLTP systems, but they still grew fairly large in volume, and they typically serve far more reads than updates or writes.

While working with large volumes of data, I noticed that at times a large number of columns in a table have low cardinality even though the table itself may be fairly large. This led me to believe that the size of the data as stored on disk could be reduced with compression techniques. Since disk access is usually the slowest part of accessing a database, a smaller data footprint would presumably lead to faster retrieval from the slower medium; however, there is an associated CPU cost in compressing and decompressing the data. Since data warehouses carry the largest amounts of data, there are presumably tradeoff scenarios in choosing one technique or the other. Even though normalization theory is about reducing redundant data, and therefore about efficient storage, there have to be other techniques that can be combined with it to reduce overall query time. One of the recent technologies that focuses on this aspect is the columnar storage database.
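
To make the low-cardinality observation concrete, here is a minimal sketch of dictionary encoding, one common way such columns are shrunk. It is an illustration under my own assumptions (a made-up status column, one byte per code), not a description of how Oracle or Infobright actually store data.

```python
def dictionary_encode(values):
    """Replace each value with a small integer code plus a lookup table."""
    dictionary = sorted(set(values))
    code_of = {value: code for code, value in enumerate(dictionary)}
    codes = [code_of[value] for value in values]
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Rebuild the original column from the lookup table and the codes."""
    return [dictionary[code] for code in codes]


# Hypothetical low-cardinality column: 100,000 rows, only 4 distinct values.
statuses = ["SHIPPED", "PENDING", "RETURNED", "CANCELLED"]
column = [statuses[i % 4] for i in range(100_000)]

dictionary, codes = dictionary_encode(column)

# Rough size estimate: ASCII characters for the raw strings, versus
# one byte per code (enough for up to 256 distinct values) plus the dictionary.
raw_bytes = sum(len(value) for value in column)
encoded_bytes = len(codes) + sum(len(value) for value in dictionary)

print(raw_bytes, encoded_bytes)
```

The CPU cost mentioned above shows up here as the encode and decode passes; the saving is that the encoded column is several times smaller than the raw strings.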

[Abadi, Boncz, Harizopoulos, 2009] provide a brief introductory tutorial on columnar databases. They describe them as follows:
"Column-stores, in a nutshell, store each database table column separately, with attribute values belonging to the same column stored contiguously, compressed, and densely packed, as opposed to traditional database systems that store entire records (rows) one after the other."

They trace the history of column stores back to the 1970s, when the use of transposed files was explored. In the 1980s, the benefits of the decomposition storage model (DSM) over row-based storage were explored. It was only in the 2000s that these data stores finally took off.

Because the values of each column are stored contiguously on disk pages and tend to resemble one another, the data lends itself to better compression schemes, which may be lightweight in their CPU utilization yet still provide heavy compression.
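
A small sketch can show why contiguous column values compress well under a lightweight scheme. Run-length encoding is used here purely as an example of such a scheme; the table contents are hypothetical, and I am not claiming this is the scheme any specific product uses.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


# Hypothetical table of (region, year) rows, sorted by region and year.
rows = [("EAST", 2012)] * 3 + [("EAST", 2013)] * 3 + [("WEST", 2013)] * 4

# Row layout: values from different columns are interleaved on the page.
row_layout = [value for row in rows for value in row]

# Column layout: each column's values sit next to each other.
region_column = [region for region, _ in rows]
year_column = [year for _, year in rows]

print(len(run_length_encode(row_layout)))   # number of runs in the row layout
print(run_length_encode(region_column))     # runs in the region column
print(run_length_encode(year_column))       # runs in the year column
```

In the row layout no two neighboring values are equal, so run-length encoding gains nothing (20 values, 20 runs). In the column layout each column collapses to two runs.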

However, these databases are challenged in their ability to handle updates, and in the tuple construction required by applications that access data through ODBC/JDBC-type interfaces. Tuple construction is needed to present the data in the row format that these applications expect.
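
The two challenges can be sketched with columns held as plain Python lists. This is a toy model with hypothetical names (construct_tuples, update_row) and made-up data, meant only to show where the extra work comes from.

```python
def construct_tuples(columns, column_names, row_ids):
    """Stitch one value per column back together into row-shaped tuples."""
    return [tuple(columns[name][i] for name in column_names) for i in row_ids]


def update_row(columns, row_id, new_values):
    """Updating one logical row touches every affected column separately."""
    for name, value in new_values.items():
        columns[name][row_id] = value


# Hypothetical column store: one list per column, aligned by position.
columns = {
    "order_id": [101, 102, 103],
    "region": ["EAST", "EAST", "WEST"],
    "amount": [250, 75, 410],
}

# A row-oriented client (e.g. one reading a result set) wants tuples back.
print(construct_tuples(columns, ["order_id", "region", "amount"], [0, 2]))

# A single-row update has to visit each column's storage in turn.
update_row(columns, 1, {"region": "WEST", "amount": 80})
print(construct_tuples(columns, ["order_id", "region", "amount"], [1]))
```

In a real column store the columns are also compressed, so both operations involve decompression (and, for updates, recompression) on top of the per-column access shown here.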

Continued at Columnar Databases II


About Sarbjit Parmar

A practitioner with technical and business knowledge in the areas of Data Management (online transaction processing, data modeling (relational, hierarchical, dimensional, etc.), S/M/L/XL/XXL & XML data, application design, batch processing, analytics (reporting + some statistical analysis), MBA+DBA), Project Management, and Product/Software Development Life Cycle Management.
