Ever heard that “clocks are unreliable” or that “you cannot trust timestamps” and wondered what that is all about? Or have you ever been asked why Google data centres have GPS clocks?

After reading this you will be able to guess the answers. We will briefly describe the issue with clocks and show its “consequences” using Redis cache expiration as an example.

Timestamps and local clocks

To keep machines’ clocks in sync, a dedicated protocol called the Network Time Protocol (NTP) is used. There are command-line tools to inspect NTP status and force a sync if required: timedatectl / timesyncd on Ubuntu and the w32time service on Windows.
As NTP needs to talk to a time source over a network, its precision is limited: milliseconds on a local network, tens of milliseconds over the Internet, and it depends heavily on network conditions. So even with NTP, local clocks drift apart.
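To get a feel for the scale of the problem, here is a toy back-of-the-envelope calculation (not real NTP, and the drift rates are made-up illustrative numbers): quartz clocks typically drift on the order of tens of parts per million, so two machines that have not synced for a while can end up noticeably apart.

```python
# Toy calculation of two machines' clocks drifting apart between NTP syncs.
# The drift rates (ppm) are hypothetical numbers chosen for illustration.

def clock_offset_ms(drift_ppm: float, seconds_since_sync: float) -> float:
    """Offset accumulated by a clock drifting at `drift_ppm` parts per million."""
    # drift_ppm microseconds per second, converted to milliseconds
    return drift_ppm * seconds_since_sync / 1000.0

hour = 3600.0
a = clock_offset_ms(50, hour)    # machine A drifts +50 ppm
b = clock_offset_ms(-30, hour)   # machine B drifts -30 ppm
gap_ms = a - b                   # how far apart the two clocks end up

print(f"A-B gap after one hour without sync: {gap_ms:.0f} ms")
```

Even these modest drift rates open a gap of a few hundred milliseconds within an hour, which is plenty for the ordering problems described below.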

Simple example

Pay In/Withdraw example — S1 and S2 clocks in sync

Now imagine that S1’s time got out of sync and its clock is running fast:

Pay In/Withdraw example — S1 clock is ahead of S2

Because S1’s clock is ahead of S2’s, the transactions are processed in a different order, which results in different transaction outcomes (Failure where it used to be Success) and a different end state (Current Balance 100).

Timestamping on the client machine rather than on the servers would fix the ordering of events from the same client, but not the total order of events.
Overall the issue would still exist because client machines’ timestamps can be even further apart: after all, you are in full control of your local machine’s time and can set it to anything.
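The pay-in/withdraw scenario above can be sketched in a few lines. This is a simplified model (the amounts and timestamps are made up): events are replayed in timestamp order, and a fast clock on S1 flips the order, turning a successful withdrawal into a failed one.

```python
# Hypothetical pay-in/withdraw scenario: S1 records a pay-in and S2 records a
# withdrawal that actually happened later. Events are replayed in timestamp
# order, so a fast clock on S1 can reorder them and change the outcome.

def replay(events, balance=0):
    for ts, op, amount in sorted(events):          # order by timestamp
        if op == "pay_in":
            balance += amount
        elif op == "withdraw" and balance >= amount:
            balance -= amount                      # otherwise the withdrawal fails
    return balance

# Real order: pay in 100 at t=1000, withdraw 100 at t=1001.
in_sync = [(1000, "pay_in", 100), (1001, "withdraw", 100)]

# Same events, but S1's clock is 5 units ahead, so its pay-in is stamped 1005.
s1_fast = [(1005, "pay_in", 100), (1001, "withdraw", 100)]

print(replay(in_sync))  # 0   - the withdrawal succeeds
print(replay(s1_fast))  # 100 - the withdrawal is replayed first and fails
```

With synchronized clocks the final balance is 0; with S1’s clock running fast the withdrawal is replayed before the pay-in, fails, and the end state is 100, just as in the illustration.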

More realistic example

A practical example is synchronization between replicas with “last write wins” (LWW) conflict resolution.

LWW and lost updates

As you can see from this illustration, LWW and the time drift may lead to lost updates.

The node with a “slow”, lagging clock will not be able to overwrite updates made by the “fast” node until NTP kicks in and the clocks are synchronized again. The wider the time difference between the nodes, the higher the probability of a lost update.
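A minimal LWW register makes the lost update easy to see. This is a sketch, not any particular database’s implementation; the clock values are hypothetical:

```python
# Toy last-write-wins register: each write carries the local timestamp of the
# node that produced it; a write is accepted only if its timestamp is newer.

class LWWRegister:
    def __init__(self):
        self.value, self.ts = None, -1

    def write(self, value, ts):
        if ts > self.ts:                 # the last write (by timestamp) wins
            self.value, self.ts = value, ts

reg = LWWRegister()

fast_clock = 2000   # node A's clock, running ahead
slow_clock = 1000   # node B's clock, lagging behind

reg.write("from A", fast_clock)
# B writes *after* A in real time, but its clock is behind,
# so the update is silently discarded: a lost update.
reg.write("from B", slow_clock + 1)

print(reg.value)  # "from A" - B's later update is lost
```

B’s update is dropped without any error: from the register’s point of view it simply looks older than A’s.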

How do systems deal with unreliable clocks?

A more interesting example is how Redis manages to expire keys on read-only replicas.

- After a specified timeout (TTL) a key should expire on both master and replica nodes.
- Due to unreliable clocks, Redis cannot rely on the replicas’ clocks being in sync with the master’s.

Approach 1 — Expire the keys independently on the master and the replicas.
The issue with this approach is that a replica can expire a given key at time T (by the replica’s clock), and then at T+x a command arrives from the master that assumes the key is still present.
For example, a *STORE command (SUNIONSTORE, SDIFFSTORE, SINTERSTORE) may arrive after the replica has already decided that an involved key has expired.
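Approach 1 can be modelled with a toy key store, assuming (hypothetically) that the replica’s clock runs 10 seconds ahead of the master’s:

```python
# Toy model of Approach 1: each node expires keys by its own clock.
# The clock skew and timings are made-up numbers for illustration.

class Node:
    def __init__(self, clock_skew=0):
        self.data, self.expiry = {}, {}
        self.skew = clock_skew          # how far this node's clock runs ahead

    def set(self, key, value, expire_at):
        self.data[key] = value
        self.expiry[key] = expire_at

    def get(self, key, real_time):
        local_now = real_time + self.skew
        if key in self.expiry and local_now >= self.expiry[key]:
            self.data.pop(key, None)    # expired by the local clock
            self.expiry.pop(key, None)
        return self.data.get(key)

master = Node(clock_skew=0)
replica = Node(clock_skew=10)           # replica's clock is 10s ahead

for node in (master, replica):
    node.set("foo", "bar", expire_at=100)

# At real time 95 the key is alive on the master but already gone on the
# replica, so a replicated command that reads "foo" (a SUNIONSTORE-style
# operation, say) would observe different data on the two nodes.
print(master.get("foo", 95))    # "bar"
print(replica.get("foo", 95))   # None
```

The master and the replica now disagree about whether “foo” exists, which is exactly the inconsistency this approach suffers from.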

Approach 2 — Expire the keys on the master only.
This way replicas would never expire any keys themselves but wait for the master node to issue a DEL command for each expired key.
This keeps the data consistent between the master and the replicas, but only until there is a network delay or partition and the replicas get disconnected from the master: since replicas do not expire keys themselves, they will keep returning stale data.
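Approach 2 can be sketched the same way (again, a simplified model, not Redis internals): the replica has no local expiry logic at all, so while it is partitioned away from the master it serves stale values.

```python
# Toy model of Approach 2: the replica never expires keys itself and only
# deletes them when the master replicates a DEL.

class MasterDrivenReplica:
    def __init__(self):
        self.data = {}

    def apply_set(self, key, value):
        self.data[key] = value

    def apply_del(self, key):           # DEL replicated from the master
        self.data.pop(key, None)

    def get(self, key):
        return self.data.get(key)       # no local expiry check at all

replica = MasterDrivenReplica()
replica.apply_set("foo", "bar")

# The key's TTL has elapsed, but the replica is partitioned away from the
# master, so no DEL arrives: reads keep returning stale data.
print(replica.get("foo"))   # "bar" (stale)

# Once the partition heals, the master's DEL finally lands.
replica.apply_del("foo")
print(replica.get("foo"))   # None
```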

What Redis actually does is a combination of the two.
Redis replicas wait for the master to send a DEL command to expire a key, but at the same time, for read queries, a replica reports a key as expired when it is logically expired according to the replica’s local clock.
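The combined behaviour can be sketched like this (a simplified model of the idea, not Redis’s actual code): the replica keeps the key physically in its dataset until the master’s DEL arrives, while read commands consult the local clock and hide logically expired keys.

```python
# Sketch of the combined approach: the key stays in the replica's dataset
# until the master's DEL arrives, but reads treat it as already expired
# once the replica's own clock says its TTL has elapsed.

class CombinedReplica:
    def __init__(self):
        self.data, self.expiry = {}, {}

    def apply_set(self, key, value, expire_at):
        self.data[key] = value
        self.expiry[key] = expire_at

    def apply_del(self, key):               # DEL replicated from the master
        self.data.pop(key, None)
        self.expiry.pop(key, None)

    def get(self, key, local_now):
        if key in self.expiry and local_now >= self.expiry[key]:
            return None                     # logically expired for reads...
        return self.data.get(key)

    def holds(self, key):
        return key in self.data             # ...yet still physically present

replica = CombinedReplica()
replica.apply_set("foo", "bar", expire_at=100)

print(replica.get("foo", local_now=150))  # None - reads see it as expired
print(replica.holds("foo"))               # True - still waiting for the DEL

replica.apply_del("foo")                  # the master finally expires the key
print(replica.holds("foo"))               # False
```

The gap between `get` and `holds` in this sketch is the same gap that produced the old EXISTS/GET inconsistency mentioned below.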

This led to an amusing bug in old versions of Redis (inconsistent behaviour of EXISTS and GET on expired keys on replicas).

Time for some local experimentation. To check how this works, we can run a master on port 6379 (the Redis default) and a replica on port 6380, and set a key “foo” to expire on the master.

Set key with expiration on master

Then, as it is difficult to emulate a network partition when both the master and the replica are running on the same machine, let’s make the master “freeze” by running a slow Lua script:

Make master “unavailable”

While the master is busy churning through loops in Lua, we can query the replica for the TTL of “foo”:

Check key TTL on replica

And run the MONITOR command in another session so that we can see the order of commands arriving at the replica:

MONITOR on replica

As you can see, “foo” expires independently on the replica, and only once the master is free from the Lua script freeze does it send the DEL command for the “foo” key, which has already expired on the replica.