Today, from approximately 16:30 UTC to 17:45 UTC, the Resostats Dashboard, which provides various public metrics on Resonite, was offline.

Background#

Routine maintenance was being done on the machine hosting Resostats: updating packages and containers, and cleaning up some debugging tools.
Configuration changes were committed to try to make the TSDB sync faster to the S3 storage bucket that backs the whole instance.

Metrics stored on the Mimir instances do not have any set expiration.

The S3 bucket itself is fully replicated and backed up with Restic to multiple locations, including an external one on rsync.net.
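For context, backing up to such a Restic repository looks roughly like the sketch below. The repository path and password file are the same ones that appear in the restore command later in this post; the exact invocation used by the actual backup jobs isn't shown here, so treat this as an illustration rather than the real setup:

$ restic --password-file=/sec/pass -r sftp:user@user.rsync.net:bck/st-fi/mimir-blocks backup /srv/pool/mimir-blocks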

The cause#

While changes to the Mimir configuration were being committed, the compactor_blocks_retention_period configuration key was inadvertently changed from 0 to 12h.

The compactor_blocks_retention_period configuration key in Mimir specifies the retention period for blocks: anything older than the configured duration gets marked for deletion and is then cleaned up.
You can read more about this in the official Mimir configuration documentation.
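For illustration, the relevant fragment of a Mimir configuration would look roughly like this. The file layout here is an assumption (the key is set as a limit, per the documentation above), but the important part is the value: 0 disables retention entirely, which is what the instance was supposed to keep running with.

limits:
  # 0 means no retention limit: blocks are kept indefinitely
  # the erroneous commit set this to 12h
  compactor_blocks_retention_period: 0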

This prompted the Mimir instances to start marking blocks older than 12h for deletion, inadvertently wiping years of historical data.

Restoration#

The error in the configuration was quickly spotted and corrected, but the blocks that had already been marked for deletion were being cleaned up regardless.
Given that the backup hosted on rsync.net was the closest and fastest option available to this server, the decision was made to restore everything from there.
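As a side note, the snapshots available in a Restic repository can be listed to confirm what "latest" points at; this is a generic Restic invocation using the same repository and password file as the restore command below, not necessarily a step that was run during the incident:

$ restic --password-file=/sec/pass -r sftp:user@user.rsync.net:bck/st-fi/mimir-blocks snapshots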

The restoration process was easy enough, given Restic provides a nice command for this:

$ restic --password-file=/sec/pass -r sftp:user@user.rsync.net:bck/st-fi/mimir-blocks restore latest --target /srv/pool/mimir-blocks

Most of the time was spent in the stressful wait for the backup to download onto the machine.

In the end, about 12h of metrics were lost, which is not that much considering the scale of the outage.

Learnings#

From now on, a backup will be taken before starting any maintenance.
The current backup strategy has also proven robust enough to withstand an event like this one.

Turns out having a proper backup strategy is damn effective.