PostgreSQL Backup and Recovery Orchestration: Recovery
PostgreSQL continuous backups are very powerful, if you know how to use them for recovery. There’s nothing else to do to be sure about that other than actually trying it. Personally, I see recovery as a single process with two possibly different outcomes:
- you’re recovering to the same state your cluster is/was in (because of a hardware failure, provider switch, …) - it’s more of a data migration, but you need your backup anyway
- you’re doing a point-in-time-recovery (someone dropped the wrong table, data got corrupted, …)
Both scenarios follow the same steps and differ slighty at the end.
- Stop the PostgreSQL cluster.
- Copy the current
PGDATA_DIRsomewhere safe, just in case you screw up.
- Replace the
PGDATA_DIRwith the full backup. If you start the cluster right away, it will boot to the last full backup state (in my case, missing a week of WAL segments tops).
In this case, you’re trying to recover as far as possible. With previous steps done succesfully, the next follow:
- Copy all archived WAL segments created after the last full backup to
PGDATA_DIR/pg_xlog. These can be found with
find -newercommand run against the corresponding
.backupfile in your
- If your full backup strategy includes
recovery.conffile creation, you cane safely move it or remove it.
- Start the database cluster again. It is going to boot to the last working state.
If you’re about to migrate your data, you might be better off with simple
pg_restore commands rather than using full backup/WAL segments combination.
PostgreSQL’s PITR can help you restore your accidentally deleted/corrupted data. After the first three steps mentioned above, you should follow with these:
- Copy all archived segments created after the last full backup somewhere the PostgreSQL user can read them (
- Set up the
recovery.conffile properly. If you know something nasty happened at 2018-01-29 08:00:00, try to recover right to that point (or to any other, as described in the documentation).
restore_command = 'cp /your-wal-recovery-folder/%f "%p"' recovery_target_time = '2018-01-29 08:00:00'
- Start the database cluster again. It is going to boot to the last full backup and then play all the WAL segments until the recovery target. Depending on how many WAL segments are about to be used, this might take a while.
You don’t want to find yourself in the middle of the biggest database failure of the century just to find out your backups don’t work, and even if they did, you would have no idea how to use them. Or, even worse, there are no backups at all, because your backup strategy has been failing silently without a single notice for several months.
Try to recover from your backups once in a while.
I forget things and make mistakes. We all do. That’s why I built an ensemble that takes care of our database automatically. Nothing fancy, just a bunch of good old Bash scripts managed with systemd rathern than cron. Next time, I’d like to show you the code and walk you through our current setup.