Seems like the presentation got deleted by ... be back soon.
Introduction A modern DBMS is necessarily a fairly complex system. The complexity arises from the fact that these systems need to abstract transaction management techniques from the application developers; need to provide simultaneous access to the database from multiple applications or users in timesharing and multiprocessing systems; and, usage of data caching or buffering techniques to provide fast access to frequently and recently used data. These systems could experience events(e.g. failures, conflicts) in hardware, software and media that could result in incomplete application of the data from a single transaction to the database, requiring recovery of the database. Recovery may also be necessary when certain user or programming errors result in un-intentional loss in data.
The recovery techniques vary widely according to the type of interruption and by the sophistication of the features provided by the DBMS. A failure in the last part of a transaction may require only an undo of the changes made to the database for the earlier parts of the transaction. A failure in media, however, may require shutdown of the database instance, recovery of media from a backup, and redoing the transactions which were committed before the failure happened and undoing the transactions that were in progress when the failure happened. In the database providing multi-versioning support, it might be possible to recover the data before the error happened.
This report surveys literature describing the general techniques used for databases recovery and their relative merits and flaws and how the field has progressed over time. This report then describes how these techniques are being used in Oracle relational database management system.
Literature Survey [Elmasri, Rameez 2007] provides an introduction to the concepts of transaction processing, desirable properties of transactions and operations required to realize transactions. It presents how multi-user systems could lead to incorrect data in the database if it allows uncontrolled execution of transactions. It also describes types of failures that happen during transaction processing. It discusses various steps in a transaction and some of the recovery and concurrency control mechanisms. It introduces to the concept of DBMS system log that keeps track of database access and how it could be used to implement recovery from failures. It also defines a schedule as an execution sequence of several transaction and characterizes schedules in terms of their recoverability. Of significance are recoverable schedules that make sure that once a transaction commits it never has to be undone. Cascade-less schedules require that an aborted transaction does not require aborting another transaction. Strict transaction schedules require that a DBMS be able to restore the old values of items that have been changed by an aborted transaction.
If the nature of the failure, e.g. a disk crash, resulted in extensive damage to large parts of the database, then the recovery method uses a database that was backed up to archives and apply the transaction recorded in the system log up to the point of failure. However, when failure in transactions leaves the database in inconsistent state a combination of undo & redo of operations recorded in the system log would get the data in consistent state.
Since DBMS has to handle large amount of data and disk access is costly, it buffers the data in memory and employs strategies to minimize the disk access, by using the buffers first. When a block is updated a dirty bit could be set for the block and when its time to write to the disk only dirty blocks are written. These modified blocks could be updated in-place i.e. at the original location on the disk or a shadow block could be written in a new place. When in place updating is used before image(BFIM) is written to the log and flushed to the disk before this block is overwritten by after image(AFIM) in the database and this is called UNDO entry. This process is known as write ahead logging. When the AFIM is also stored in the log it is called REDO entry. When a buffer block can’t be written to disk before the transaction commits, it is called a no steal approach, however, if the protocol allows writing the buffer before commit its called a steal(no pun intended). If all the buffer blocks updated by a transaction are written to disk before commit, its called a force approach, otherwise a no-force approach. A typical DBMS employs steal/no-force approach. With steal approach less RAM is required for storing buffer data and with no-force approach less i/o would be required as the same block could potentially be updated by another transaction. To facilitate recovery DBMSes keep a list of active, committed and aborted transactions and information about last checkpoints.
DBMS’s recovery manager would write all the modified buffers at regular intervals (measured in terms of time or number of transactions) to the database. These writes are called checkpoints. After every checkpoint the DBMS writes an entry to system log. So in case of recovery the data blocks before the checkpoint don’t have to be redone. A checkpoint consists of the following actions:
1. Suspend execution of transactions temporarily.
2. Force-write all main memory buffers that have been modified to the disk.
3. Write a record to the log and force write the log to the disk.
4. Resume executing transactions.
The checkpoint record may contain information about active transactions and their first and most recently modified records, this information would help with quicker undoing of any transactions that must be rolled back. However, large amount of data for step 2 could keep the transactions suspended for too long, so it a revised technique called fuzzy check-pointing may be used. In this technique the DBMS does not wait for the step 2 to finish, instead launch step 3 and allows the transactions to resume as soon as the checkpoint log is written. However, the log is in invalid state and it keeps a pointer to the last valid checkpoint long entry. Only after step 2 finishes, it makes the checkpoint log valid and move the pointer to new log record.
Deferred update is a technique where actual changes to database are postponed till transaction completes its execution successfully and reaches commit point. The transaction reaches the commit point only after the update operations are recorded in the log and complete writing to the disk. It is also called a NO-UNDO/REDO recovery algorithm. In a single user environment the procedure involves keeping two transaction lists, one for the committed transaction since the last checkpoint and another one with active transactions. After a failure, read the committed transaction list after the last checkpoint and read the log file and apply the WRITE operations to the database; the active transactions would need to be re-started either by the user or recovery process. The redo operation should be idempotent – that is repeating it any number of times produces the same output. In multi user system the approach in intimately intertwined with the concurrency management approach. If the approach uses strict two-phase locking it ensures a strict and serializable schedule. The technique is essentially same as that described for the single user case, however, the changes(WRITE operations) are applied in the same order in which they were written to the system log. An improvement to efficiency of the algorithm can be made by applying only the last WRITE operation for item X to the database. The main benefit of this approach is that nothing has to be UNDONE. The main limitation is that it reduces the concurrency as the items need to be kept locked till the transaction reaches the commit point.
There are two main categories of recovery techniques in case of Immediate Update to the database i.e. applying the changes to the database before the transaction has committed. In these techniques the update has to be recorded in the log file before it gets applied to the db using the write ahead protocol. This would help us undo the changes to the database in case of failed or rolled-back transaction. One of these techniques(UNDO/NO-REDO) assures that all the changes have been applied to the database and only then the transaction is committed. The other variation is to allow the transaction to commit before the changes to the database are made. The later technique is a more general case and is called UNDO/REDO, and its recovery procedure uses the same two transaction lists(committed since last checkpoint & active ones). First all the operations for active transactions are un-done in the reverse order in which they are written to the log. Then, all the write operations of the committed transactions are re-done. The later step could be optimized by applying on the last write operation for any item. This could be accomplished by walking the redo log in reverse order and keeping an ignore list of items that have already been applied to the database.
Shadow Paging technique considers the database to be made of a number of fixed size blocks for recovery purposes. A directory for those blocks is constructed with ith entry in directory pointing to the ith block in the database. When a transaction begins the directory which points to the most recent database pages on the disk is copied to a shadow directory. The shadow directory is saved to the disk and the current directory is used for changes required by the transaction. The transaction never modifies the shadow directory. When some update to data block is performed the modified block is copied to a new location, however, the old copy stays. The current directory now points to the new block. To recover from a failed transaction the modified blocks are freed up and current directory is discarded. The shadow directory still point to the before image and is reinstated. Committing the transaction requires saving the current directory and discarding the shadow directory. As no work is required to be redone or undone this technique could be called a NO-UNDO/NO-REDO technique. With single users no logs are required, however, with multi-user environments requiring concurrent transactions the techniques incorporates system logs and checkpoints. This technique has limitations that updated database pages change location that makes the task of keeping related pages close harder. Large directories could require large RAM or paging to the disk, and also copying the shadow directories to disk is expensive. Another complication is to keep track of freed up resources when transaction commits. The operation to migrate between old and new directories should be atomic.
The book also describes the ARIES algorithm which is reviewed as part of another paper review. It also describes the two-phase commit protocol that is used by multi-database systems.
[Mohan, et.al. 1992] In this paper the authors introduce a novel method called ARIES (Algorithm for Recovery and Isolation Exploiting Semantics) for recovery in concurrent, transactional database management systems using write ahead logging while supporting partial rollback of transactions and fine grained locking strategies. It uses a variation on the UNDO/REDO protocol described above in the review of material from Elmasri,et.al’s book. It uses a transaction table that describes the transactions in flight and a dirty page table that describes the buffers in the pool that are yet to be written to the database. This protocol uses a a system log for recording the changes. It contains records for WRITE, COMMIT, ABORT, END and UNDO. Write record is written when a block is updated, commit, when the transaction is finalized, and abort when a transaction is cancelled. These operations are common to the other protocols described above. It also records a compensating log record when an UNDO operation happens and END is used when a transaction is done with COMMIT or ABORT operations. It uses a new concept of Log Sequence Number(LSN). LSN is a monotonically increasing number and points to the address of a record in the system log. Each of the buffers in the pool contains the LSN for the latest change in that block. Another feature of this protocol is check-pointing. When check-pointing is done a dirty page and transaction table structures are written to the disk. That means that data does not have to be flushed to the disk.
Recovery according to this protocol is three stepped approach. First step is Analysis, which starts with the last checkpoint and starts with the begin_checkpoint record and then proceeds towards the end. It reads the recovery data structures(dirty pages & transaction table) and applies the appropriate changes from the records and records the appropriate LSN. Second step is REDO. This step works efficiently by reading the dirty table records and noting the lowest LSN and highest LSN. Lowest SCN marks the start point for writing the data to the database and the highest LSN marks the end of required REDO. When REDO is done, UNDO can start by looking at the active transactions and apply the before images by walking backward from the end of the system log. It writes a Compensating Log Record(CLR) when an UNDO operation is finished, so that the same operation is not performed again in case there was a server crash in the middle of db failure.
[Agrawal, Dewitt 1985] This paper presents analysis of the behavior and performance characteristics of concurrency control and recovery mechanisms. Since the concurrency control and recovery mechanisms are closely associated, this paper analyzes the performance of these mechanisms in an integrated manner for their application to multi-user, transaction oriented centralized database management systems. The authors conclude:
The choice of the “best” integrated concurrency control and recovery mechanism depends on the database environment. If there is a mix of transactions of varying sizes, log+locking emerges as the most appropriate mechanism. If there are only large transactions with only sequential access pattern, the shadow+locking mechanism is a possible alternative. In an environment of medium and large sized transactions, the differential+locking is a viable alternative to the log+locking mechanism.
The authors also compare the optimistic concurrency control and locking with deadlock detection and conclude that “with the optimistic approach, not only are there a higher number of transaction restarts, but each restart is also more expensive” and that “optimistic method of concurrency control should only be considered in an environment where transactions are small with a very low probability of conflict”.
When comparing the recovery mechanisms they recommend the usage of undo with logging, even though its more expensive than shadow or differential files, because, “logging puts a smaller burden on a successful transaction” and “most of the transactions succeed rather than abort”
[Verhofstad, 1978] is the earliest work on summarizing the various techniques used for “recovery, backing out, restarts, maintenance of consistency and crash resistance” in filing systems, databases and operating systems, with the focus being on database management systems. It categorizes these techniques, presents the basic principles and the situations where they could be applied. It also provides how these techniques were applied in the systems contemporary to the time when this paper was written. I have chosen not dwell on the individual techniques as they have been covered elsewhere in this survey. However, it is mentioned here as one of the earliest works and its usefulness in clarifying the concepts.
[Elhardt, Rudolf 1984] is one of the earliest works that demonstrates the importance of a database cache for high performance and quick restarts in databases. The database cache is demonstrated to be highly effective in applications that require high throughput of short transactions. Per this paper the db cache holds the frequently accessed pages, those pages can be changed without traffic to the database. So it is easier to not store the before images of the unfinished transactions on the database. It states that the commit processing is very fast because of sequential output of the after images to the safe. Transaction failures don’t have to touch the database, but instead could be handled on the cache itself. It provides for a fast restart of the database after a failure. It also supports sophisticated lock protocols with concurrent reading and writing. This paper also recognizes the need to support longer running transactions and long update transactions. They only require cache support for fixed pages. The secondary storage required to keep the recovery information is proportional to the update activity, but not tied to the length of transaction per se. It allows the commit and abort processing to be serialized with other TAs and therefore need not be fast.
[Lau, Madden 2006] represent an novel approach to providing high transactional output and fast recovery in distributed data warehouses. Their approach assumes that distributed databases are used for providing high availability. They show how the data redundancy could be used to approach recovery in such an environment. Their approach does not use a stable log and provides recovery without quiescing the system. Traditional data warehouses are built mostly to provide read access to historical data gathered from the operational database systems. They usually use bulk loads of data instead of insert/update activity and did not require the usage of traditional recovery and transactional control semantics. However, the trend in data warehouses is to provide more frequent inserts and sometimes even updates of data for missing or incorrect data. This updatable data warehouses require the databases to provide both recovery and transaction support. Their approach is called HARBOR (High availability and replication based online recovery). Its based on a requirement that such a system have some form of replication to support high availability. They show how to exploit this redundancy to provide efficient crash recovery without the use of a stable log, forced writes during commit or complex recovery techniques like ARIES.
Their approach is flexible in that it does not require the replicated version of data to be stored in identical format, as it possible to store that data in different formats so that optimizer could use different storage characteristics to partition the workload to use the format that provides fastest response or highest throughput. Data warehouses workloads have high number of ad-hoc queries over large datasets mixed with small amount of OLTP transactions to correct problems with the data. So to avoid contention a technique of snapshot isolation is used. Their approach is similar and uses timestamped(insertion and deletion at each tuple level) and versioned representation of data to isolate reads. Timestamps are written at the time of commit, and a 0 in deletion timestamp says that the tuple is not deleted. Historical queries read data as of past time and the timestamps allow isolation of read data from the most recent update transactions. Their model assumes K safety(i.e. database is still available if k sites fail) is provided by the database designer using data replication.
The recovery algorithm consists of three phases. The first one is to use the checkpoint mechanism to determine the most recent time(T) up to which the updates/inserts have been flushed to the disk. Historical queries provide the updates that were made to the database after T and will be applied to the local database of recovering site. A standard dirty pages table with the identity of all in memory pages with modified flags. Check-pointing consist of taking snapshots of the dirty pages table and then taking latches on the dirty pages and flushing them to the disk. After all these blocks have been flushed the time T is recorded to a well known location on the disk. In second phase the recovering site executes historical queries on remote sites that contain replicated data to bring the missed updates and get current. The use of historical queries avoids read locks and therefore quiesce the system during recovery of large amounts of data. The third phase uses read locks to catch up with any transactions that were committed since the start of phase two and current time. The coordinator keeps a queue of update requests, that would be forwarded to the crashed site so that it could join the ongoing transactions. The authors show how their algorithm surpasses the performance of ARIES algorithm.
[Tukka et.al. 2008] present a survey of current and proposed work in using various multi-version index structures. As the databases are moving towards providing multiple versions of database items for running historical queries as well as providing audit ability, they need to have special structures for storing the data to handle query time and space constraints. There are a large number of these structures that have been made available, however, the authors are not satisfied with the concurrency control and recovery methods available for those structures. They compare the pros and cons of using standard B+ tree structures, R-tree structures, multi-version B+ trees, and time split B+ trees for providing single and multiple versions of the data. Multi-version B+ trees don’t provide comprehensive concurrency control. Time split B+ trees provide concurrency control and recovery methods, but don’t provide space efficiencies. They are interested in designing modified versions of these structures so that more comprehensive concurrency control and recovery could be provided and compare and contrast their results for each of these implementations.
They describe that their recovery methods would be based on the de facto standard ARIES algorithm. They describe that their version of read only transactions would implement the operations:- begin_read_only, query(key,version), range query( range [k1,k2], version v ), commit read only. An updating transaction will not provide a version number in its actions, so a multi-version concurrency control algorithm will be applied to maintain serializable view of the most recent version of the database for reach transaction. They discuss the use of lazy time-stamping techniques[Torp, et.al. 00] for implementing the version control. To implement the update transactions the plan to implement the following operations:- begin_update, query(key), range query( range [k1,k2]), insert( key k, data w), delete(key k), commit-update, abort, undo-insert(log record r), undo-delete(log record r), finish-rollback.
[Lahiri et.al 2001] discuss a very important applied aspect of recovery algorithms when it comes to Oracle database server. The discuss an implementation of important high availability feature of Oracle trademarked as Fast-Start Fault recovery that allows the users to impose predictable bounds to the time required for recovering a crashed database.
Oracle supports a shared disk architecture and a typical database configuration consists of one or more Oracle instances(user and oracle processes and memory structures) accessing shared set of data files that constitute the database. Each instance has its own buffer cache and different caches are kept in synch with through a distributed lock management protocol. This architecture in Oracle is known as OPS(oracle parallel server) or RAC(real application clusters) in more recent versions.
Each instance has its own set of REDO log files and the changes made by transactions to the buffers in cache(also known collectively as SGA) are recorded to the log file for the instance. They describe the log as ever increasing sequence of bytes indexed by Redo Byte Address(RBA). The recovery begins from a Thread Checkpoint RBA and ends with and RBA called Tail.
Each transaction is assigned a Rollback Segment where it records the changes made by the transaction to the database and will be used if a rollback of the transaction is required. When a change is made to a buffer in cache, redo records are written to the log file atomically for the changes both to the buffer as well as to the Rollback Segment.
Oracle uses a version based concurrency system known as Consistent Read. Every transaction is associated with a snapshot time. When it reads a block in buffer it first reconstructs the image of data as of the snapshot time. It uses the undo data stored in the rollback segments. This allows reading the data without the use of locks. Reads/writes don’t block each other.
In Oracle only writers need to use locks. These locks are stored within the blocks of data for each of the rows and are persistent as they can be recovered from the REDO following a crash.
In OPS cluster whenever an instance discovers that another one has failed it first suspends access to the database and then rolls forward the changes in failed instance’s REDO thread from the last checkpoint to the tail of the log. After the roll forward the database is opened for normal online processing. Meanwhile the SMON process applies the UNDO from the rollback segment for any transactions that were in flight at the time of failure. The time taken for roll forward is determined by the REDO log I/Os and Data file I/Os. The former being the number of blocks between the checkpoint and its tail, and the later being the number of data blocks that were modified but were not written to the database. This size is controlled by the periodicity of check pointing.
Dirty buffers are kept in a Buffer Checkpoint Queue in ascending order of the RBA of the first change to the buffer. Writing buffers in the low RBA order advances the checkpoints. Periodic write the database writers to advance the checkpoint is called Fast Start Checkpointing. While conventional checkpointing is exhaustive the one implemented by this approach is incremental, writing only those many buffers as are needed to advance the thread to required position in the log.
Oracle allows the database administrators to specify three dynamic configuration control parameters for controlling the rate of checkpointing. They describe the parameters as:
FAST_START_IO_TARGET: This parameter specifies a bound on the number of blocks that would need recovery at any given time on a running system. These blocks require random read and write IOs and dominate the roll-forward time. If an administrator believes that the I/O subsystem can perform approximately 1000 random IOs per second, she should set “FAST_START_IO_TARGET = 50000” to limit roll-forward time to approximately 50 seconds.
LOG_CHECKPOINT_TIMEOUT: This parameter specifies a bound on the time elapsed since the tail of the log was at the position presently indicated by the thread checkpoint position. For example, setting “LOG_CHECKPOINT_TIMEOUT = 100” implies that the thread checkpoint should lag the tail of the log by no more than the amount of redo generated in the last 100 seconds. Since the time it takes to generate a certain amount of redo is approximately the same as the time it takes to apply that redo, LOG_CHECKPOINT_TIMEOUT may be interpreted as an upper bound on recovery time. It is an upper bound since workloads typically contain a query component that does not generate any redo.
LOG_CHECKPOINT_INTERVAL: This parameter bounds the number of redo blocks between the thread checkpoint and the tail of the log. This parameter therefore imposes an upper bound on the number of redo blocks that need to be processed during roll-forward.
Oracle provides another mechanism called Fast Start Rollback for quicker recovery. As Oracle keeps its lock information in the data blocks itself, database could be opened to new transactions while the rollback is in progress. It provides two mechanisms to make the rollback fast. It parallelizes the rollback of uncommitted transactions and also provides on demand rollback.
Oracle uses inter and intra transaction parallelism algorithms to make the rollback faster. When there are a large number of transactions for requiring rollback, each of them could be rolled back in parallel. Sometimes large transactions could be broken into partitions of independent work and scheduled for rollback in parallel. It uses FAST_START_PARALLEL_ROLLBACK to control the number of threads that would be used for perform parallel recovery. On demand rollback mechanism uses the ability of Oracle to rollback those blocks that are required by other transactions for reading, etc. It uses the rollback segments to provide the data required by consistent reads and therefore could use the same facts to recover those blocks at a time earlier than it would be done normally and this extra work of reading redo to provide consistent image won’t have to be done time and again.
Applications This survey was done with the goal of understanding the role of recovery in databases. Most of the commercial systems provide recovery capabilities at various levels. These recovery capabilities provide a lot of benefits from integrity of the database to high availability. However, these benefits come at significant cost. A large amount of work has been done to come up with the most efficient algorithms for making the recovery quickly. The researcher has been involved in both transactional and data warehousing systems. He appreciates the application of technology in the OLTP domain. He, however, struggled quite a bit by the limitation imposed by these mechanisms where automated recovery by the database might not be the most efficient way to handle recovery for all objects in the database. [Lau et. Al. 2006] refer to the similar goals. Whereas databases like Oracle provide the ability to make the objects no logging, i.e. without the ability to roll forward any changes from the last back up, they don’t provide the ability to not use rollback segments, to the best of researcher’s knowledge. For large transactions this is a problem, when neither the ability to roll forward or rollback is require because it would be more efficient to provide the recovery by the loading mechanism itself and the business might be open to not having the latest data available everyday due to the very nature of data warehouses. They may be able to live with the restore from the last back up from a few days ago and allow the load of data from the source system. There was definitely a need felt for making this tradeoff. Traditional database administrators are not used to supporting large data updates and may not size the ROLLBACK segments(or equivalent terms in different databases, e.g. UNDO table space in more recent versions of oracle) for handling those transactions and balk at providing the required disk space. This may lead to a large number of transaction failures requiring large rollbacks and therefore extra work in the database. However, given a database, and a better understanding of the mechanisms used for recovery, it would lead to better application architecture.
Summary This survey paper was intended to get a better understanding of the concepts and algorithms available for recovery mechanisms. Since no particular order was followed for understanding the concepts, the chronological order was not maintained due to the constraints on time. However, a large number of these concepts were explored and progression of algorithm sophistication was observed. ARIES was observed to be a benchmark algorithm in sophistication and most of the current databases seem to use a variation on this algorithm for their own implementations. However, most of these traditional algorithms are not necessarily most optimal in case of large data set based updates. Research seems to be leading to addressing some of these concerns. Another focus of the recent research has been to improve the ability of the databases to provide multi-versioning of the datasets, when the requirements from watch dog agencies for providing the auditable changes and also to address security concerns with malicious data updates. Lastly, the project also explored the application of these algorithms in Oracle database by reviewing the work done for providing its FAST-START feature set.
Elmasri, Rameez and Navathe, Shamkant. Fundamentals of Database Systems. Boston: Addison Wesley, 2007.
Mohan, C., Haderle, D., Lindsay, B., Pirahesh, H. and Schwarz, P “ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging” ACM Transactions On Database Systems, Vol. 17, No. 1, March 1992: 94-162.
Verhofstad, J. “Recovery Techniques for Database Systems” Computing Surveys June 1978: 167-195.
Agrawal, R. and Dewitt, D. J. “Integrated Concurrency Control and Recovery Mechanisms: Design and Performance Evaluation” ACM Transactions On Database Systems, Vol. 10, No. 4, December 1985: 529-564.
Elhardt, K. and Bayer, R. “A Database Cache for High Performance and Fast Restart in Database Systems” ACM Transactions On Database Systems, Vol. 9, No. 4, December 1984: 503-525.
Lau, E. and Madden, S. “An Integrated Approach to Recovery and High Availability in an Updatable, Distributed Data Warehouse” VLDB’06, September 12-15, 2006, Seoul, Korea.
Haapsalo,T., Jaluta, I., Sippu, S., and Soisalon-Soininen, E. “Concurrency Control and Recovery for Multi-version Database Structures” ACM Workshop of PIKM08, October 30, 2008, Napa Valley, California, USA.
Torp, K., Jensen, C., and Snodgrass, R. “Effective timestamping in databases” The VLDB Journal- The international Journal on Very Large Data Bases,8(3-4), 2000: 267-288.
Lahiri, T., Ganesh, A., Weiss, R., and Joshi A. “Fast-start: Quick Fault Recovery in Oracle” ACM SIGMOD, May 21-24 2001: Santa Barbara, California, USA.
About Sarbjit Parmar
A practitioner with technical and business knowledge in areas of Data Management( Online transaction processing, data modeling(relational, hierarchical, dimensional, etc.), S/M/L/XL/XXL & XML data, application design, batch processing, analytics(reporting + some statistical analysis), MBA+DBA), Project Management / Product/Software Development Life Cycle Management.