Home > hamsterdb > A new implementation for Transactions

A new implementation for Transactions

October 14th, 2010

Over the last few weeks i started merging the hamsterdb2 Transaction handling into hamsterdb. (Some time ago i already announced that hamsterdb2 will be discontinued due to lack of resources [= time]).

The code is merged on a new branch “bleeding_edge” (http://github.com/cruppstahl/hamsterdb/tree/bleeding_edge).

What’s so special about the Transaction stuff I’m now writing?

Traditional implementations of Transaction

Traditional implementations of Transactions are often based on one of the following methods:

  1. Modified database pages are locked when they’re modified; the lock is held till the Transaction is committed or aborted
  2. Modified database pages are copied to a “shadow page”; upon commit, the original page and the shadow page are exchanged, usually by swapping pointers
  3. Only one Transaction is allowed (currently this is hamsterdb’s way of dealing with Transactions)
  4. Transactions are not allowed at all

While the last two options obviously have some limitations, the first two are a little bit better. Locking pages can block your Transactions when they modify keys on the same database page:

begin T1
begin T2
insert T1, key1, record1
insert T2, key2, record2 –> this will block till T1 is committed or aborted

Depending on the implementation, “block” means that the calling thread will only continue when T1 is committed or aborted. This will always lead to deadlocks if T1 and T2 are used in the same thread.

Shadow pages are costly – a new page has to be allocated, the memory has to be copied etc.

How will hamsterdb deal with all this?

I have found a different way of dealing with Transactions in hamsterdb. They are based on the assumption that Transaction support is mainly interesting for desktop and enterprise environments, not so much for embedded ones. And a conclusion from that assumption is that there’s sufficient RAM to store additional data. (Another conclusion is that multi-core CPUs are available and some costly operations will be processed in the background; but this is not part of my early milestones.)

Whenever a key/value pair is inserted in a transactional environment, hamsterdb will create an in-memory tree structure for the modified database. The key/value pair is then inserted in this tree. As soon as this Transaction is committed, the tree is flushed to disk. Since RAM is volatile and the tree is lost if the application crashes, a logical journal is used to log the operations. (Lookup and erase operations are similar.)

What maybe sounds a bit complex will be much simpler when reading the pseudo code.

def insert(db, txn, key, value):
    // get or create a new Tree structure for this Database; each
    // Database has just one such tree
    tree=db_get_or_create_tree(db)

    // get or create a new node in this tree which manages all operations of this key
    // each Tree has just one node *per key+
    node=tree_get_or_create_node(tree, key)

    // check if another active transaction (neither committed nor aborted)
    // also modifies this key. If yes then we have two conflicting transactions and
    // we return an error. The function is listed below.
    check_for_conflicts(db, txn, node, key)

    // otherwise we append another operation to this node
    node_append_operation(node, INSERT, record, current_lsn, flags)

    // and write the operation to the journal
    log_append_operation(INSERT, key, record, flags)

    // that's it!
    return

As promised here’s the function to check for conflicting Transactions. It will also check for other errors, i.e. if we’re inserting a key which was already inserted in another committed Transaction (this would create a duplicate key which is not allowed unless explicitly requested by the user).

def check_for_conflicts(db, txn, node, key):
    // iterate over all database operations (insert/erase) which modify this key.
    foreach Operation op in node_get_operations(node):
        // get a reference to the Transaction of this operation. We need to see if it's
        // already committed, aborted or still active.
        optxn=operation_get_txn(op)

        // if the txn was aborted then it's not interesting for us
        if txn_is_aborted(optxn):
            continue

        // if the txn was committed (or if it's the current txn) then we have to have a
        // closer look; if the operation was INSERT then we have a duplicate. If it's ERASE
        // then we can return successfully, because the key no longer exists and we can
        // insert it.
        if txn_is_committed(optxn) or txn==optxn:
            if op_type==INSERT:
                return DUPLICATE_KEY
            if op_type==ERASE:
                return SUCCESS

        // if the txn is neither committed or aborted then it's still running.
        // if two running txn's modify the same key then we return an error
        if txn_is_active(optxn):
            return CONFLICT

    // we're nearly through - we now checked all Transaction trees that were not
    // yet committed. Now we have to check the committed transactions that were
    // already flushed to disk. This basically is a simple btree lookup.
    // if the key does NOT yet exist, we're fine. otherwise we insert a duplicate.
    switch status=btree_find(db, key, flags):
    case SUCCESS:
        return DUPLICATE_KEY
    case KEY_NOT_FOUND:
        return SUCCESS
    default:
        return status

That’s it. hamsterdb allows inserting duplicate keys and it can also overwrite existing keys. To simplify the code I have not included these cases here.

Challenges

While the algorithm above works similar for lookup/erase operations, there are still a couple of challenges and problems that i have not yet solved.

Replacing the physical logfile with a logical journal will bring vast performance benefits, and the file size of the journal will be much smaller than the size of the logfile. However, it also requires that all operations in the journal are idempotent, which means that they can be re-applied even if they were already applied partially or completely. For insert I already know how to solve this problem. For erase the situation is more complex because the SMOs (structure modification operations – the btree merging/shifting/collapsing) are maybe impossible to become idempotent. This will require a simplification of the erase algorithm, i.e. in combination with “lazy splits”, or maybe I will have to fallback to physical logging in such cases.

The Cursor operations will be tricky to implement. Internally, two Cursors have to be created; one for the Transaction tree, another for the Btree. At runtime, both Cursors then have to be consolidated.

Similar problems appear when reading partial keys or when doing approximate matching. In both scenarios the Transaction tree and the Btree have to be consolidated.

An outlook

On the other hand, benefits of the new approach are obvious. By keeping all Transactions in RAM, performance will be high and concurrency (as in having several Transactions in parallel) will be high, too. No pages have to be locked. If the amount of consumed RAM is too high, large keys or records can be swapped to disk immediately.

In addition, as mentioned earlier, committed Transactions can be flushed to disk asynchronously. On today’s multi-core CPUs this offer another performance benefit for many use cases.

I hope I can come up with a first release this year, although most likely with limited functionality. If you’re interested in the progress then drop me a mail, follow the blog or the code – the github link is above. If you’re just curious about the new Transactional abilities, you can browse a few of the unittests. The file is here:

http://github.com/cruppstahl/hamsterdb/blob/bleeding_edge/unittests/txn.cpp

Look at the tests called txnInsertConflict*, txnInsertFind*, txnInsertFindErase*.

chris hamsterdb