Originally posted by me on the Pythian blog. This is an older post that I somehow forgot to post on my own blog, but another recent redundancy foul-up reminded me of it.

The story you are about to read is based on actual events. Names and paths have been changed to protect the innocent. I call this scenario “The Perfect Storm” because it took just the right combination of events and configurations. Sadly, this doesn’t make it an unlikely occurrence, so I’m posting it here in hopes that you’ll be able to save yourselves before it’s too late.

I have always had a preternatural dislike for using REDUNDANCY as a retention policy for Oracle RMAN, greatly preferring RECOVERY WINDOW instead, simply because REDUNDANCY doesn’t really guarantee anything valuable to me, whereas RECOVERY WINDOW guarantees that I’ll be able to do a point-in-time recovery to anytime within the past x days. Plus, I had already been burned once by a different client using REDUNDANCY. With the story I’m about to tell, this dislike has turned into violent hatred. I’m going to be light on the technical details, but I hope you’ll still feel the full pain.

First some table setting:
  • Standalone instance (no RAC, no DataGuard/Standby)
  • RMAN retention policy set to REDUNDANCY 2
  • Backups stored in the Flash Recovery Area (FRA)
A few months ago, we had a datafile corruption on this relatively new instance (data had been migrated from an old server about a week prior). The on-call DBA followed up the page by checking for corruptions in the datafile with this command:

RMAN> backup check logical datafile '/path/to/foobar_data.dbf';

This, my friends, led to the major fall, though we did not know it for many hours. You see, the FRA was already almost full. This causes the FRA to automatically delete obsolete files to free up space. That last backup command, while only intended to check for logical corruption, did actually perform a backup of the file, and rendered the earliest backup of the file obsolete since there were two newer copies. That earliest file happened to be from the level 0 backup from which we would later want to restore.

Of course, at first we didn’t know why the file was missing. Logs showed that it was on disk no less than two hours before the problem started. Later, scanning the alert log for the missing backup filename yielded this:

Deleted Oracle managed file /path/to/flash_recovery_area/FOO_DB/backupset/2008_12_01/o1_xxxx.bkp

Oracle deleted the one backup file that we needed!

Even worse, it wasn’t until this time on a Monday night that we realized that the level 0 taken the previous weekend had failed to push the backup files to tape because of a failure on the NetBackup server. The problem was reported as part of Monday morning’s routine log checks, but the missing files had not yet been pushed to tape.

In the end, we were able to drop and restore the tablespace to a previous point in time on a test instance from another backup file and exp/imp data back over. It was ugly, but it got things back online. Many DBAs better than myself gave their all on this mission.

To summarize, the ingredients:
  • Oracle RMAN
  • Flash Recovery Area near full, obediently deleting obsolete files.
  • Tape backup failure
Add in an innocent backup command and . . . BOOM! Failure Surprise.

The two biggest points to take away are:

  • Tape backup failures are still serious backup failures and should be treated as such, even if you backup to disk first.
  • REDUNDANCY is not a viable retention policy. In my house, it is configuration non grata.