Tuesday, October 27, 2009

Divergence Index in Distributed Databases with Replication

This post is basically a synopsis of database architecture/design I recently worked on for a RFP response.

Suppose that in a geographically distributed database deployment scenario, two databases (with copies of the same data) at geographically separated locations accept DMLs simultaneously. To ensure data convergence, these databases also engage in asynchronous bi-directional replication with each other along with standard measures like conflict detection and resolution. However, since the replication overhead is finite, at any instant in time, it is acceptable that a particular record in the database at one site is not available in the database at the other site. Let's assume that the total time taken for a record (after it is updated at the database at one site) to appear in the database at the other site is T. Thus T will comprise of the capture overhead at the source site, the network latency across sites and the apply overhead at the destination site. A very important consideration here is that the capture and apply processes at the databases are assumed to be not specific to tables - meaning that replication across any given pair of tables will undergo a total lag of T. This kind of an arrangement requires strict vigil on the replication throughput to ensure that severe data divergence does not occur. We need to develop metrics to quantify which site 'lags' and by how much. Let's give this metric a jazzy name - the 'Divergence Index' a.k.a DI. I'll put forward a simple approach toward calculating and interpreting the value of DI.

Oracle's documentation on Streams replication discusses the concept of a 'heartbeat table' that is something like this: assume that using some mechanism like NTP (network time protocol), the system date of the databases at both sites are synchronized. At both databases, we create a table 'heartbeat' with just two columns - a siteID (varchar) and a timestamp (defaults to the current time). Using a scheduled recurring job, we update (or insert) a record in this table. Thus this table will contain a record with values corresponding to that site's ID and the timestamp of the last update. Now, this table is configured as a publisher as well as a subscriber to the corresponding table of the other database. Therefore, at any instant in time, this table will tell us (i)when it was locally updated last and (ii)the timestamp of the most current update at the other database that has appeared locally. Neat, isn't it?. Note that this concept is nowhere specific to Oracle - you could do this with any decent RDBMS. So, what does this heartbeat table have to do with calculating DI?
Let's see.
DI measured at site A can be calculated as
DI(A) =( t(A) - t(B) ) / T
where,t(A) represents the highest value of the timestamp column with siteID = A andt(B) represents the highest value of the timestamp column with siteID = B

There are some very interesting and useful deductions possible from this. But before that, remember that the heartbeat tables are merely instrumented tables. The actual replication load is generated by other tables that are configured for application data. So, the updates on the heartbeat tables alone can never be of significance as far as calculating DI is concerned. Agreed? great!

Quite obviously, the ideal value of DI is 1.If DI > 1, either records from one database are taking longer than T to get applied at the other database or the scheduled update of the heartbeat table is not happening at the other site.If DI <> DI(B), updates from A reach B mach faster than they reach A from B - this means that the database at site B is relatively much more loaded with DMLs than A. This indicates that (possibly) the load balancing across databases is improper.

For DI to be an useful metric, it should be measured at an interval not less than the heartbeat update interval. A lot can be achieved by maintaining a history of the value of DI and charting them over a period of hours, days and weeks. For example, if you notice that over a period of time, the DI at both sites keep increasing, then you could additionally see, for a given site, say A, how the value of t(B) has changed over the same period. In the worst case, if t(B) has not changed (thereby DI has increased), it clearly means that A has stopped applying B's updates. If you notice similar behaviour at the other site, you could well have a total replication breakdown.

No comments:

Computers Add to Technorati Favorites Programming Blogs - BlogCatalog Blog Directory