Oracle DBA / Architect: 2012

Friday, October 5, 2012

Data Guard Errors: ORA-00308 ORA-17503 ORA-15012

Error
DataGuard redo apply is broken. No more logs are applied. Alert Log of Standby shows following error.
This happened after server was crashed.
Errors with log +PSSFLASH/PRODSBY/archivelog/2012_10_04/thread_3_seq_19610.689.795807751
Errors in file /ora/app/oracle/diag/rdbms/PRODSBY/PRODSBY2/trace/PRODSBY2_pr00_18575.trc:
ORA-00308: cannot open archived log '+PSSFLASH/PRODSBY/archivelog/2012_10_04/thread_3_seq_19610.689.795807751'
ORA-17503: ksfdopn:2 Failed to open file +PSSFLASH/PRODSBY/archivelog/2012_10_04/thread_3_seq_19610.689.795807751

Missing File
As per alertlog below is missing file
+PSSFLASH/PRODSBY/archivelog/2012_10_04/thread_3_seq_19610.689.795807751

Production (PROD)
Login to asmcmd and copy file from prod(PROD) to filesystem
asmcmd -p
cd +PSSFLASH/PROD/ARCHIVELOG/2012_10_04
cp thread_3_seq_19610.2065.795819191 /tmp/thread_3_seq_19610.2065.795819191

Standby Side (PRODSBY)
Stop Recovery if any in progress [ It would have already broken with error ]
SQL > ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
scp <primaryhost>:/tmp/thread_3_seq_19610.2065.795819191 /prodstage01/keepsafe
SQL> ALTER DATABASE REGISTER OR REPLACE PHYSICAL LOGFILE '/prodstage01/keepsafe/thread_3_seq_19610.2065.795819191';
SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE;

Problem is fixed
Standby Starts Media recovery and will soon catchup

Tuesday, September 4, 2012

ORA-00600: internal error code, arguments: [kqd-objerror$]

Database : 11.2.0.2
Problem   : ORA-00600: internal error code, arguments: [kqd-objerror$]
Reason    : This issue is due to inconsistency in data dictionary
SQL>alter TRIGGER ADDRESS_PREVENT_OVERLAP compile ;
alter TRIGGER ADDRESS_PREVENT_OVERLAP compile
              *
ERROR at line 1:
ORA-04045: errors during recompilation/revalidation of
PUBLIC.TRIGGER_OVERLAP_PKG
ORA-00600: internal error code, arguments: [kqd-objerror$], [U], [0], [369],
[EMAIL_PREVENT_OVERLAP_AFTER], [], [], [], [], [], [], []
SQL>alter TRIGGER CUSP.ADDRESS_PREVENT_OVERLAP compile ;
alter TRIGGER ADDRESS_PREVENT_OVERLAP compile
              *
ERROR at line 1:
ORA-04045: errors during recompilation/revalidation of
PUBLIC.TRIGGER_OVERLAP_PKG
ORA-00600: internal error code, arguments: [kqd-objerror$], [U], [0], [369],
[EMAIL_PREVENT_OVERLAP_AFTER], [], [], [], [], [], [], []
drop public synonym TRIGGER_OVERLAP_PKG
                    *
ERROR at line 1:
ORA-00600: internal error code, arguments: [kqd-objerror$], [U], [0], [369], [EMAIL_PREVENT_OVERLAP], [], [], [], [], [], [], []

Actions
1] Tried purge dba_recyclebin - Did not help
2] Tried recyclebin off and bounce Instance - Did not help
3] Tried dropping object EMAIL_PREVENT_OVERLAP_AFTER - Gives same ORA-600
3] Tried dropping public synonym TRIGGER_OVERLAP_PKG - Gives same ORA-600

Solution1] Find the object_id of the object giving problem for e.g. synonym TRIGGER_OVERLAP_PKG
select object_id from dba_objects where object_name='TRIGGER_OVERLAP_PKG' and object_type='SYNONYM';
2] insert record in sys.objerror$
insert into sys.objerror$ values(99601);
commit;
3] Very important - Shutdown abort. Do not gracefully shutdown. Abort will not let rowcache to be flushed to dictionary
4] Start Instance
5] Drop public synonym TRIGGER_OVERLAP_PKG - Success
6] recreate public synonym TRIGGER_OVERLAP_PKG - Success
7] compile trigger CUSP.ADDRESS_PREVENT_OVERLAP - Success

Disclaimer : UNLESS ORACLE SUPPORT PROVIDES SUCH INSTRUCTION THE DATA DICTIONARY SHOULD NOT BE TAMPERED. RUN YOUR FINDINGS WITH ORACLE BEFORE EXECUTING ON YOUR ORACLE/VENDOR SUPPORTED ENVIRONMENT.

Thursday, August 23, 2012

Wait Event : log file sequential read

Case study: log file sequential read

Recently one of my clients asked me to analyze his database problem post factum. At the time of request the problem itself was already solved of "self-solved", so the task was to explain it's causes and prevent similar issues in future.

Symptoms

Here's problem's description given by client:

third party software developers were rebuilding (REBUILD ONLINE) some indexes in the database;
suddenly, the whole system became much slower; main database wait event observed was "log file sequential read";
IO workload for storage partition holding online redo was so significant that all users of that storage device noticed performance problems - and it was hi-end storage system, main storage device for that enterprise;
problem disappeared when Oracle instance had been bounced.

Diagnostics

As far as "log file sequential read" isn't the most frequently seen wait event, I've looked up Oracle documentation for it's description. It says: "Waiting for the read from this log file to return. This is used to read redo records from the log file". Obvious but useless, as for me, as there's no information why may Oracle read online redo. Google and Metalink gave me a couple of link describing ARCH and Logminer performance problems, but they were not the case.

However, there was an interesting message in alert.log: "Some indexes or index [sub]partitions of table SOMESCHEMA.SOMETABLE have been marked unusable" followed by number of "Errors in file..." records without actual error messages. All indexes on SOMESCHEMA.SOMETABLE were valid at the time of investigation. But in trace files mentioned in alert.log I saw, inter alia, message "oer 8102.2 - obj# 55780", i.e. "ORA-8102: index key not found" for index with object_id = 55780.

After that it was easy to:

find a bug #7329252 "Index corruption after rebuild index ONLINE" saying "If there is a heavy concurrency on a table during online index rebuild, the index can be corrupt missing some keys." Fixed in 10.2.0.4.4 and 11.2.0.1;
understand that "log file sequential read" wait event was caused by sessions that had found corrupted block and tried to dump corresponding redo blocks. By the way, documentation hints that "log file sequential read" is used when dumping redo in event parameters description: "log# - The relative sequence number of the logfiles within a log group (used only when dumping the logfiles)";
understand that initial information "problem disappeared when Oracle instance had been bounced" was incorrect. In fact, software developers just rebuilt corrupted index after RDBMS reboot and concealed that fact from client

Key findings:

now I know at least three situations when "log file sequential read" can be observed: when ARCH as reading file to archive it; when Logmine is working and when some session is trying to create dump of redo;
again, REBUID ONLINE is not so ONLINE as it seems - at least, in Oracle 10g.

ref : http://ua-dba.blogspot.com/2010/09/case-study-log-file-sequential-read.html

Tuesday, July 24, 2012

AWR data is not purged - SYSAUX keeps growing

Today noticed SYSAUX tablespace crossed 80% usage limit. First few things came in mind was
1] SYSAUX is used by some DBA collegues.
2] New database features are installed
3] AWR retention might be too big
4] Snapshot frequency is altered/increased

On examiniation none of above was true. In fact, AWR retention was just 8 days.

Found link to metalink document
WRH$_ACTIVE_SESSION_HISTORY Does Not Get Purged [ID 387914.1]
Folowed this document but setting event as per doc never errored out and it was always successful. So this document scenerio is different from the one I am facing

AWR Retention

select snap_interval, retention, most_recent_purge_time from sys.wrm$_wr_control;
SNAP_INTERVAL        RETENTION            MOST_RECENT_PURGE_TIME
-------------------- -------------------- ---------------------------------
+00000 00:30:00.0    +00008 00:00:00.0    23-JUL-12 10.06.25.744 PM

select INSTANCE_NUMBER,SNAP_ID,BEGIN_INTERVAL_TIME,END_INTERVAL_TIME
from DBA_HIST_SNAPSHOT
where SNAP_ID=(select min(SNAP_ID) from DBA_HIST_SNAPSHOT);
INSTANCE_NUMBER    SNAP_ID BEGIN_INTERVAL_TIME       END_INTERVAL_TIME
--------------- ---------- ------------------------- -------------------------
              1      25385 03-MAY-12 01.45.05.264 PM 03-MAY-12 02.00.00.499 PM
              2      25385 03-MAY-12 01.45.05.144 PM 03-MAY-12 02.00.00.653 PM
              3      25385 03-MAY-12 01.45.05.268 PM 03-MAY-12 02.00.00.863 PM
              4      25385 03-MAY-12 01.45.05.262 PM 03-MAY-12 02.00.00.630 PM

Strange observation

As per DBA_HIST_SNAPSHOT, DBA_HIST_ASH_SNAPSHOT oldest snapshot was of 03-MAY-12 but as per below query oldest retained data was of 07-APR-12

select INSTANCE_NUMBER,min(snap_id),min(SAMPLE_TIME) from sys.WRH$_ACTIVE_SESSION_HISTORY group by INSTANCE_NUMBER;

INSTANCE_NUMBER MIN(SNAP_ID) MIN(SAMPLE_TIME)
--------------- ------------ -------------------------------------------------
              3        22913 07-APR-12 07.59.59.976 PM
              2        22913 07-APR-12 08.00.05.535 PM
              1        22913 07-APR-12 08.00.04.182 PM
              4        22913 07-APR-12 08.00.00.830 PM

Action Taken

exec DBMS_WORKLOAD_REPOSITORY.DROP_SNAPSHOT_RANGE(22913,22913);

It took very long to delete this snapshot but along with it, it also started purging old data.

SQL>select INSTANCE_NUMBER,min(SAMPLE_TIME) from sys.WRH$_ACTIVE_SESSION_HISTORY
2 group by INSTANCE_NUMBER
3 order by 1;

INSTANCE_NUMBER MIN(SAMPLE_TIME)
--------------- --------------------------------
              1 10-APR-12 03.30.07.620 PM
              2 10-APR-12 03.30.13.158 PM
              3 10-APR-12 03.15.01.995 PM
              4 10-APR-12 03.30.07.117 PM
SQL>/
INSTANCE_NUMBER MIN(SAMPLE_TIME)
--------------- --------------------------------
              1 11-APR-12 04.59.56.218 PM
              2 13-APR-12 05.30.30.874 AM
              3 13-APR-12 06.59.07.517 AM
              4 11-APR-12 04.54.07.145 PM
SQL>/
INSTANCE_NUMBER MIN(SAMPLE_TIME)
--------------- --------------------------------
              1 13-APR-12 06.31.06.346 PM
              2 13-APR-12 06.44.59.087 PM
              3 13-APR-12 07.01.01.946 PM
              4 13-APR-12 09.16.29.905 PM
SQL>/
INSTANCE_NUMBER MIN(SAMPLE_TIME)
--------------- --------------------------------
              1 14-APR-12 07.14.42.125 AM
              2 14-APR-12 06.14.46.841 AM
              3 14-APR-12 06.00.13.102 AM
              4 14-APR-12 06.15.20.638 AM
SQL>/
INSTANCE_NUMBER MIN(SAMPLE_TIME)
--------------- --------------------------------
              1 03-MAY-12 01.44.56.561 PM
              2 03-MAY-12 01.45.01.572 PM
              3 03-MAY-12 01.44.59.825 PM
              4 03-MAY-12 01.45.03.027 PM