Search This Blog

2012-09-03

Oracle: RMAN Backup Script Recommendation

Product: Oracle RDBMS
Version: 8.0 onward (current version is 11g Release 2)
License Precaution:
1. Oracle Advance Compression Option license required if used CONFIGURE COMPRESSION ALGORITHM 'BASIC/MEDIUM/HIGH'
2. Oracle Advance Security Option license required if used

I have developed as well as reviewed a lot of backup policies, or take over other DBA's job specifically to backup. This post will share my expertise and crucial tips which many DBA are either not aware of, or not fully understand the implication of them

1. Always include following statement prior of shutdown database (for backup)

sql 'alter system checkpoint';

Reason: This will force database to write all the database memory buffer (buffer cache) into dbf files, and online redo log. It provides following implications

1.1. If database shutdown abort, data loss is down to minimum (or zero)
1.2. During subsequent startup, if no shutdown abort, it will be no auto recovery, and faster
1.3. Faster to shutdown immediate later
1.4. If there is any hard disk related problem, it will be able to detect at this point
1.5. If there is any disk full problem, it will be able to detect at this point

2. No not issue "alter system switch logfile" command in RMAN

Reason: RMAN will automatically switch online log file for every backup start and end (issue 2 times automatically), so it is redundant to do so

3. Find out from enterprise backup admin how many tape drives in the tape library which integrate with RMAN (regardless of Veritas, Tivoli, HP Data Protector, they all use tape library for big servers). Allocate number of channels according to number of tape drives allow

If 3 tape drives are available in a 40 tape slot tape library, then issue following

run {allocate channel t1 type sbt;
allocate channel t2 type sbt;
allocate channel t3 type sbt;
...other backup commands...}

Reason:

3.1. RMAN only allocate 1 tape drive by default, which cause mean long backup time, and accumulation of archive log files in the server, long recovery time, and other impact not cover here
3.2. Allocating multiple tape channels will make tape backup runs in parallel, and cut the backup time significantly (in above example, by about 1/3 of the total time)
3.3. During recovery/restore, the it will restore the tapes in parallel as well, so again, it will cut the restore time significantly
3.4. If "backup as compressed backupset" is used, then it will compress in parallel, so it will utilize more CPU (similar to multi thread)
3.5. Continuation from (3.4) during restore, the files (backup set) will uncompress in parallel, so it will utilize more CPU (similar to multi thread)

4. Use "backup as compressed backupset" to enable at least basic compression
5. For Enterprise license users, highly recommend to use CONFIGURE COMPRESSION ALGORITHM 'HIGH'

Reason: I recommend to use high compression instead of low or medium because modern CPU is more powerful, which can compress in shorter time. During recovery, everyone's concern about about restoration time, this can cut short the time to read from tape, number of tapes, shorter network x'fer time, and finally most importantly the restore and recovery time. If it significantly increase the total backup time (such as 1 TB database), then look into tape library, multiple tape drives, and multiplex backup channel in this post

6. If there is a standby database, then run RMAN backup against it rather than production database. Moreover, it is required to manually force production primary database's online redo log, so that it will ship across and apply into standby database

Reason:
6.1. Move the load in both disk I/O, network, CPU to different standby database server
6.2. Primary database performance will not be affected by RMAN backup
6.3. Hardware resource, e.g. tape, disk, CPU, memory, network, in primary database can put into better use, such as data warehouse, reporting, etc

7. Keep 1 - 2 most recent backup on disk, or even SATA drives. Spend some money, as this is well worth the ROI, especially setup Flashback Recovery Area together with RMAN backup destination

8. Setup Flashback recovery for at least 1 - 3 hr, depending on DBA respond time. 12 hr is normally more than sufficient, but this could be hard for database which generate 100 GB/hr of archive logs. Purchase a slow SATA drive and slot into the main server if needed to save the cost in SAN/NAS, or even a external 3.5" USB drive

Reason: Flashback allows near immediate recovery of human error (drop table, deletion of some records, revert whole database back to 15 minutes ago) which RMAN backup is unable to restore in split second. It offers a quick undo feature like Microsoft Word, which allows export consultant like me to rewind/forward at split second (and get salute from managers/directors)

9. If tapes ship to off-site office for data protection, then consider either a tape duplication or disk backup be available (covering 2 days).

Reason: I have experience major delay from IBM and Capgemini when asking them to recall yesterday's tape. Their normal respond time is beyond 24 hr on weekend, or need extra could thousand charge for prompt reply. There are 2 incidents in 3 years where they recall partial media, and need another 6 hours to recall additional tapes. Those backup administrators often have no clue how RMAN stores into tapes, and do not pay attention to auto generated tag and virtual file name in the tape media or backup server

10. Test database restoration annually with tape recall (if store tape media stored off-site)

Reason: Many corporations segregate the backup/restore role into many parties. For example, Capgemini and IBM needs to engage following when recalling an off-site tape
1. level 1 support - just log a ticket with brief message to recall tape media. Respond time 1 hr. Even though you tell them to recall tape, they will not recall the tape
2. backup administrator - verify that backup resides in off-site tape media, and needs to spend 0.5 - 1.5 hr to identify the tapes required
3. off-site tape media administrator - middle man to recall tape media, and remove the media. This is a trouble maker who often interrupted the restore as he ejects the tapes daily. It takes about 1 hr to respond
4. "office boy" - the person who physically fetch the tape media. I never find out who this resource is. Respond time 1 - 4 hr + traveling time
5. Tape storage warehouse administrator - person who keep and catalog all the tape medias, as well as give the required tapes you recall. Respond time 1 - 4 hr

On daily basis, the off-site tape media administrator will remove recall tapes, and pass to "office boy" again to keep off-site. There is 70% chance that this resource will remove the restoring tapes at 10 am and abort the whole database restore. Once tape is ejected, it is impossible to ask them to put the tape back (even you discovered in 1 min), and need to repeat the whole process above, because IBM and Capgemini do not have dedicated resource who you can talk to and ask them to undo the damage

A lot of DBA does not consider RTO (Recovery Time Objective) and RPO (Recovery Point Objective) when writing database backup script (including RMAN). They only consider these 2 when designing a high availability architecture, which is a big mistake that many big cooperation makes. Certainly DR database for Oracle is cheap and fast to setup using low end PC (Linux) with simple SATA drive (USD$150 for 3 TB drive), but everyone should seriously sit down and consider the RTO and RPO for the primary database. You are not going to use DR database forever, and it cost a lot to point the application back to primary server, not to mention to setup primary database up. Therefore, it is essential to consider the database restoration in to the picture when talk about backup.



If you need any advice or assistant, I can provide my expertise (paid). Kindly donate CAD$100 per question, or CAD$1000 per solution (8 hour work day)

No comments: