Friday, May 29, 2009

Fast-Start Failover on Oracle Error Conditions

Using Oracle 11g Data Guard configuration you can initiate Fast-Start Failover
based on different Oracle errors - Configure Conditions for Fast-start Failover

I decided to test it and created DG configuration using not only documentation
but running "ENABLE FAST_START FAILOVER" command, getting errors :) ,
looking into drc...log files (it is 11g, they are under trace directory),
correcting problems like: standby redo log files, flashback mode, increasing
Fast-Start Failover Lag Limit and finally got SUCCESS status for
"SHOW CONFIGURATION VERBOSE":
DGMGRL> show configuration verbose

Configuration
Name: bee
Enabled: YES
Protection Mode: MaxPerformance
Databases:
bee - Primary database
beesby - Physical standby database
- Fast-Start Failover target

Fast-Start Failover: ENABLED
Threshold: 30 seconds
Target: beesby
Observer: oel1.pythian
Lag Limit: 30 seconds
Shutdown Primary: TRUE
Auto-reinstate: TRUE

Current status for "bee":
SUCCESS

To see conditions which are configured to initiate FSF you need to execute
another command (I usually run help command and do copy/paste
of commands that I need):
DGMGRL> SHOW FAST_START FAILOVER;

Fast-Start Failover: ENABLED
Threshold: 30 seconds
Target: beesby
Observer: oel1.pythian
Lag Limit: 30 seconds
Shutdown Primary: TRUE
Auto-reinstate: TRUE

Configurable Failover Conditions
Health Conditions:
Corrupted Controlfile YES
Corrupted Dictionary YES
Inaccessible Logfile NO
Stuck Archiver NO
Datafile Offline YES

Oracle Error Conditions:
(none)

There were predefined logically named health conditions for Fast-Start Failover
and no Oracle Error Conditions.

Using help command and provided syntax I set FSF for ORA-00001 error (!)
that means failover should be initiated after violation of unique constraint
(That's terrific condition!)
DGMGRL> ENABLE FAST_START FAILOVER CONDITION 1;
Succeeded.

alert.log also mentioned it but in decoded manner:
OS Pid: 2964 executed alter system set events '1 trace name failover level 1'

where PID=2964 belonged to DMON process. The error now is on conditions list:
DGMGRL> SHOW FAST_START FAILOVER;

Fast-Start Failover: ENABLED
Threshold: 30 seconds
Target: beesby
Observer: oel1.pythian
Lag Limit: 30 seconds
Shutdown Primary: TRUE
Auto-reinstate: TRUE

Configurable Failover Conditions
Health Conditions:
Corrupted Controlfile YES
Corrupted Dictionary YES
Inaccessible Logfile NO
Stuck Archiver NO
Datafile Offline YES

Oracle Error Conditions:
ORA-00001: unique constraint (%s.%s) violated

The only thing left wass to raise the error - I created the table with PK and tried to insert
duplicated value, got the error:
SQL> insert into t2 values(1);
insert into t2 values(1)
*
ERROR at line 1:
ORA-00001: unique constraint (AG.SYS_C009503) violated


And failover was started:
alert.log on primary
...
OS Pid: 2964 executed alter system set events '1 trace name failover level 1'
Wed May 27 19:30:58 2009
Foreground: user-configured ORA-1 requesting FSFO
Wed May 27 19:31:03 2009
A user-configurable Fast-Start Failover condition was detected. The primary is shutting down due to ORA-00001: unique constraint (%s.%s) violated.
Database bee will not be automatically reinstated.
USER (ospid: 3587): terminating the instance
Instance terminated by USER, pid = 3587
...

alert.log on standby
...
Attempting Fast-Start Failover due to detection of condition ORA-00001: unique constraint (%s.%s) violated.
Database bee will not be automatically reinstated.
...
Failover succeeded. Primary database is now beesby.
...

observer output
...
19:31:05.81 Wednesday, May 27, 2009
Initiating Fast-Start Failover to database "beesby"...
Performing failover NOW, please wait...
Failover succeeded, new primary is "beesby"
19:31:19.15 Wednesday, May 27, 2009
...

Wonderful, failover was done and now I had to reinstate the standby

And that was completed but through several steps (I did not use static configuration of global database name
in listener - that might was the cause of failure to accomplish it at once - I need to check it):

  1. startup new standby database in mount mode (bee)
  2. REINSTATE DATABASE bee (from dgmgrl connected as sys/sys from beesby environment)
  3. startup standby database in mount mode (bee)
  4. REINSTATE DATABASE bee (from dgmgrl connected as sys/sys from beesby environment)
=== 1 ===
SQL>
startup mount
...
Database mounted.

=== 2 ===
[oracle@oel1 beesby ~]$ dgmgrl sys/sys
DGMGRL for Linux: Version 11.1.0.7.0 - Production

Copyright (c) 2000, 2005, Oracle. All rights reserved.

Welcome to DGMGRL, type "help" for information.
Connected.
DGMGRL>
REINSTATE DATABASE bee
Reinstating database "bee", please wait...
Operation requires shutdown of instance "bee" on database "bee"
Shutting down instance "bee"...
ORA-01109: database not open

Database dismounted.
ORACLE instance shut down.
Operation requires startup of instance "bee" on database "bee"
Starting instance "bee"...
Unable to connect to database
ORA-12514: TNS:listener does not currently know of service requested in connect descriptor

Failed.
You are no longer connected to ORACLE
Please connect again.
Unable to start instance "bee"
You must start instance "bee" manually
Reinstatement of database "bee" failed
DGMGRL> exit

=== 3 ===
SQL> startup mount
...
Database mounted.

===4 ===
[oracle@oel1 beesby ~]$ dgmgrl sys/sys
DGMGRL for Linux: Version 11.1.0.7.0 - Production

Copyright (c) 2000, 2005, Oracle. All rights reserved.

Welcome to DGMGRL, type "help" for information.
Connected.
DGMGRL>
REINSTATE DATABASE bee
Reinstating database "bee", please wait...
Reinstatement of database "bee" succeeded


DGMGRL>
show configuration verbose

Configuration
Name: bee
Enabled: YES
Protection Mode: MaxPerformance
Databases:
beesby - Primary database
bee - Physical standby database
- Fast-Start Failover target

Fast-Start Failover: ENABLED
Threshold: 30 seconds
Target: bee
Observer: oel1.pythian
Lag Limit: 30 seconds
Shutdown Primary: TRUE
Auto-reinstate: TRUE

Current status for "bee":
SUCCESS


Have a good day!

1 comment:

Unknown said...

Hi Andrey, I believe that you can also try the recover corrupted sql system database in 2005 utility, it fixes database errors automatically