When using streaming replication in PostgreSQL 9, it's important to know what the latency is between the master and slaves, especially when deploying on cloud based instances. Ideally, we'd like to know by how many bytes the WAL logs are lagging.  PostgreSQL offers a neat way to check just that between a given slave and its master via the pg_current_xlog_location() and pg_last_xlog_replay_location() functions.  However, the output seems cryptic.

postgres=# select * from pg_current_xlog_location();
(1 row)

OK, so what does that mean?  Looking at the WAL files in the pg_xlog/ directory, we see this file that appears to be related.


Since WAL files are fixed chunks, we'd expect that the internal xlog pointer will have a byte level position in some future WAL file that has yet to be written. We'd like to decipher the output of the functions and determine what the current byte is.

One of the great things about using open source is that you can quickly drill down into the source code and find out why things are the way they are.  PostgreSQL has converted to using GIT, so let's check it out:

git clone git://git.postgresql.org/git/postgresql.git

After searching a bit, we can find the relevant piece of code in bufpage.h, which defines a struct responsible for tracking the current byte position of the PostgreSQL log,

This tells us that the output format of these functions is "${xlogid}/${xrecoff}" and that (xlogid << 32 | xrecoff) is the 64 bit number that represents the current byte position of the xlog.

From here, it's straightforward to create something that can poll both functions on master and slave, takes the difference, and submits it to some monitoring system.

Here's a simple ruby script that does this and posts the delta to graphite.