震惊:Oracle 11gR2 RAC ADG 并没有高可用
作者 | JiekeXu
来源 | JiekeXu之路(ID: JiekeXu_IT)
转载请联系授权 | (微信ID:xxq1426321293)
大家好,我是 JiekeXu,很高兴又和大家见面了,今天分享一篇 Oracle 11gR2 RAC ADG 并没有高可用的案例。本文首发于微信公众号【JiekeXu之路】,欢迎点击上方蓝字关注我,标星或置顶,更多干货第一时间到达!
摘 要:上个月中的一次主机超过运行天数重启主机时发现 Oracle 11gR2 RAC ADG 并没有高可用,Linux 下 11204 的物理备库由于节点一主机关闭导致节点二异常关闭。
因 ADG 灾备环境在上个月时主机超过运行天数需要重启主机 RAC1 即节点一,便直接关闭了 GI 集群,理所当然本节点的数据库将正常关闭,但是却发现另一个节点数据库状态是 Mounted 的,这就有点奇怪了,下面来一起看看。
--检查集群状态全为 ONLINE,数据库为 Open,Readonly
JiekeXu-rac1:/home/grid$crsctl stat res -t
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.ARCH.dg
ONLINE ONLINE JiekeXu-rac1
ONLINE ONLINE JiekeXu-rac2
ora.DATA.dg
ONLINE ONLINE JiekeXu-rac1
ONLINE ONLINE JiekeXu-rac2
ora.LISTENER.lsnr
ONLINE ONLINE JiekeXu-rac1
ONLINE ONLINE JiekeXu-rac2
ora.OCR.dg
ONLINE ONLINE JiekeXu-rac1
ONLINE ONLINE JiekeXu-rac2
ora.asm
ONLINE ONLINE JiekeXu-rac1 Started
ONLINE ONLINE JiekeXu-rac2 Started
ora.gsd
OFFLINE OFFLINE JiekeXu-rac1
OFFLINE OFFLINE JiekeXu-rac2
ora.net1.network
ONLINE ONLINE JiekeXu-rac1
ONLINE ONLINE JiekeXu-rac2
ora.ons
ONLINE ONLINE JiekeXu-rac1
ONLINE ONLINE JiekeXu-rac2
ora.registry.acfs
ONLINE ONLINE JiekeXu-rac1
ONLINE ONLINE JiekeXu-rac2
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
1 ONLINE ONLINE JiekeXu-rac1
ora.cvu
1 ONLINE ONLINE JiekeXu-rac1
ora.JiekeXuadg.db
1 ONLINE ONLINE JiekeXu-rac1 Open,Readonly
2 ONLINE ONLINE JiekeXu-rac2 Open,Readonly
ora.JiekeXu-rac1.vip
1 ONLINE ONLINE JiekeXu-rac1
ora.JiekeXu-rac2.vip
1 ONLINE ONLINE JiekeXu-rac2
ora.oc4j
1 ONLINE ONLINE JiekeXu-rac1
ora.scan1.vip
1 ONLINE ONLINE JiekeXu-rac1
JiekeXu-rac1:/home/grid$
JiekeXu-rac1:/home/grid$exit
JiekeXu-rac1:/home/oracle$sqlplus / as sysdba
SQL*Plus: Release 11.2.0.4.0 Production on Fri Oct 16 21:39:51 2020
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
SQL> set linesize 600;
col NAME for a20;
col PCT_FREE for a30;
select GROUP_NUMBER,NAME,TYPE,STATE,total_mb/1024 total_gb,free_mb/1024 free_gb,round((free_mb/total_mb)*100,2)||'%' pct_free from v$asm_diskgroup;
SQL> SQL> SQL>
GROUP_NUMBER NAME TYPE STATE TOTAL_GB FREE_GB PCT_FREE
------------ -------------------- ------ ----------- ---------- ---------- ------------------------------
1 OCR NORMAL MOUNTED 3 2.09570313 69.86%
2 ARCH EXTERN CONNECTED 100 21.8701172 21.87%
3 DATA EXTERN CONNECTED 200 12.4785156 6.24%
--检查数据库磁盘使用率,主备同步情况
SQL> set linesize 150;
set pagesize 20;
column name format a13;
column value format a20;
column unit format a30;
column TIME_COMPUTED format a30;
select name,value,unit,time_computed from v$dataguard_stats where name in ('transport lag','apply lag');
SQL> SQL> SQL> SQL> SQL> SQL>
NAME VALUE UNIT TIME_COMPUTED
------------- -------------------- ------------------------------ ------------------------------
transport lag +00 00:00:00 day(2) to second(0) interval 10/16/2020 21:39:58
apply lag +00 00:00:00 day(2) to second(0) interval 10/16/2020 21:39:58
SQL> set line 345
col instance_name for a15
col host_name for a30
select inst_id,INSTANCE_NAME,HOST_name,status,version,STARTUP_TIME from gv$instance;
SQL> SQL> SQL>
INST_ID INSTANCE_NAME HOST_NAME STATUS VERSION STARTUP_TIME
---------- --------------- ------------------------------ ------------ ----------------- -------------------
1 JiekeXuadg1 JiekeXu-rac1 OPEN 11.2.0.4.0 2020-08-16 03:36:59
2 JiekeXuadg2 JiekeXu-rac2 OPEN 11.2.0.4.0 2020-08-16 02:01:57
--检查数据库主机名,状态,启动时间,打开模式等等
SQL> select name,DATABASE_ROLE,OPEN_MODE,DB_UNIQUE_NAME,GUARD_STATUS,CREATED from gv$database;
NAME DATABASE_ROLE OPEN_MODE DB_UNIQUE_NAME GUARD_S CREATED
------------- ---------------- -------------------- ------------------------------ ------- -------------------
JiekeXu PHYSICAL STANDBY READ ONLY WITH APPLY JiekeXuadg NONE 2018-10-09 09:48:19
JiekeXu PHYSICAL STANDBY READ ONLY WITH APPLY JiekeXuadg NONE 2018-10-09 09:48:19
--检查 MRP0 进程应用日志情况
SQL> SELECT PROCESS, STATUS,SEQUENCE#,thread# FROM V$MANAGED_STANDBY;
PROCESS STATUS SEQUENCE# THREAD#
--------- ------------ ---------- ----------
ARCH CLOSING 4450 2
ARCH CLOSING 4471 2
ARCH CONNECTED 0 0
ARCH CLOSING 4477 2
ARCH CLOSING 4508 2
ARCH CLOSING 4477 1
ARCH CLOSING 4475 1
ARCH CLOSING 4446 2
ARCH CLOSING 4476 1
ARCH CLOSING 4476 2
MRP0 APPLYING_LOG 4513 2
RFS IDLE 0 0
RFS IDLE 0 0
13 rows selected.
SQL> exit
查看应用连接,直接停止集群,然后 reboot
JiekeXu-rac1:/home/oracle$exit
JiekeXu-rac1:~ # cd /app/product/11.2.0/grid/bin
JiekeXu-rac1:/app/product/11.2.0/grid/bin # ps -ef | grep LOCAL=NO | grep -v grep | awk '{print $2}' | wc -l
2
JiekeXu-rac1:/app/product/11.2.0/grid/bin # ./crsctl stop crs
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'JiekeXu-rac1'
CRS-2673: Attempting to stop 'ora.crsd' on 'JiekeXu-rac1'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'JiekeXu-rac1'
CRS-2673: Attempting to stop 'ora.ARCH.dg' on 'JiekeXu-rac1'
CRS-2673: Attempting to stop 'ora.OCR.dg' on 'JiekeXu-rac1'
CRS-2673: Attempting to stop 'ora.registry.acfs' on 'JiekeXu-rac1'
CRS-2673: Attempting to stop 'ora.JiekeXuadg.db' on 'JiekeXu-rac1'
CRS-2673: Attempting to stop 'ora.oc4j' on 'JiekeXu-rac1'
CRS-2673: Attempting to stop 'ora.cvu' on 'JiekeXu-rac1'
CRS-2673: Attempting to stop 'ora.LISTENER_SCAN1.lsnr' on 'JiekeXu-rac1'
CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'JiekeXu-rac1'
CRS-2677: Stop of 'ora.cvu' on 'JiekeXu-rac1' succeeded
CRS-2672: Attempting to start 'ora.cvu' on 'JiekeXu-rac2'
CRS-2677: Stop of 'ora.LISTENER_SCAN1.lsnr' on 'JiekeXu-rac1' succeeded
CRS-2673: Attempting to stop 'ora.scan1.vip' on 'JiekeXu-rac1'
CRS-2677: Stop of 'ora.LISTENER.lsnr' on 'JiekeXu-rac1' succeeded
CRS-2673: Attempting to stop 'ora.JiekeXu-rac1.vip' on 'JiekeXu-rac1'
CRS-2676: Start of 'ora.cvu' on 'JiekeXu-rac2' succeeded
CRS-2677: Stop of 'ora.ARCH.dg' on 'JiekeXu-rac1' succeeded
CRS-2677: Stop of 'ora.registry.acfs' on 'JiekeXu-rac1' succeeded
CRS-2677: Stop of 'ora.scan1.vip' on 'JiekeXu-rac1' succeeded
CRS-2672: Attempting to start 'ora.scan1.vip' on 'JiekeXu-rac2'
CRS-2677: Stop of 'ora.JiekeXu-rac1.vip' on 'JiekeXu-rac1' succeeded
CRS-2672: Attempting to start 'ora.JiekeXu-rac1.vip' on 'JiekeXu-rac2'
CRS-2676: Start of 'ora.scan1.vip' on 'JiekeXu-rac2' succeeded
CRS-2672: Attempting to start 'ora.LISTENER_SCAN1.lsnr' on 'JiekeXu-rac2'
CRS-2676: Start of 'ora.JiekeXu-rac1.vip' on 'JiekeXu-rac2' succeeded
CRS-2677: Stop of 'ora.oc4j' on 'JiekeXu-rac1' succeeded
CRS-2672: Attempting to start 'ora.oc4j' on 'JiekeXu-rac2'
CRS-2676: Start of 'ora.LISTENER_SCAN1.lsnr' on 'JiekeXu-rac2' succeeded
CRS-2677: Stop of 'ora.JiekeXuadg.db' on 'JiekeXu-rac1' succeeded
CRS-2673: Attempting to stop 'ora.DATA.dg' on 'JiekeXu-rac1'
CRS-2677: Stop of 'ora.DATA.dg' on 'JiekeXu-rac1' succeeded
CRS-2677: Stop of 'ora.OCR.dg' on 'JiekeXu-rac1' succeeded
CRS-2673: Attempting to stop 'ora.asm' on 'JiekeXu-rac1'
CRS-2677: Stop of 'ora.asm' on 'JiekeXu-rac1' succeeded
CRS-2676: Start of 'ora.oc4j' on 'JiekeXu-rac2' succeeded
CRS-2673: Attempting to stop 'ora.ons' on 'JiekeXu-rac1'
CRS-2677: Stop of 'ora.ons' on 'JiekeXu-rac1' succeeded
CRS-2673: Attempting to stop 'ora.net1.network' on 'JiekeXu-rac1'
CRS-2677: Stop of 'ora.net1.network' on 'JiekeXu-rac1' succeeded
CRS-2792: Shutdown of Cluster Ready Services-managed resources on 'JiekeXu-rac1' has completed
CRS-2677: Stop of 'ora.crsd' on 'JiekeXu-rac1' succeeded
CRS-2673: Attempting to stop 'ora.mdnsd' on 'JiekeXu-rac1'
CRS-2673: Attempting to stop 'ora.ctssd' on 'JiekeXu-rac1'
CRS-2673: Attempting to stop 'ora.evmd' on 'JiekeXu-rac1'
CRS-2673: Attempting to stop 'ora.asm' on 'JiekeXu-rac1'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'JiekeXu-rac1'
CRS-2677: Stop of 'ora.ctssd' on 'JiekeXu-rac1' succeeded
CRS-2677: Stop of 'ora.evmd' on 'JiekeXu-rac1' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'JiekeXu-rac1' succeeded
CRS-2677: Stop of 'ora.drivers.acfs' on 'JiekeXu-rac1' succeeded
CRS-2677: Stop of 'ora.asm' on 'JiekeXu-rac1' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'JiekeXu-rac1'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'JiekeXu-rac1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'JiekeXu-rac1'
CRS-2677: Stop of 'ora.cssd' on 'JiekeXu-rac1' succeeded
CRS-2673: Attempting to stop 'ora.crf' on 'JiekeXu-rac1'
CRS-2677: Stop of 'ora.crf' on 'JiekeXu-rac1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'JiekeXu-rac1'
CRS-2677: Stop of 'ora.gipcd' on 'JiekeXu-rac1' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'JiekeXu-rac1'
CRS-2677: Stop of 'ora.gpnpd' on 'JiekeXu-rac1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'JiekeXu-rac1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
JiekeXu-rac1:/app/product/11.2.0/grid/bin #
Broadcast message from root (pts/0) (Fri Oct 16 21:45:46 2020):
The system is going down for reboot NOW!
Connection to 192.16.11X.1XX closed by remote host.
Connection to 192.16.11X.1XX closed.
bye
重启主机后检查集群状态
JiekeXu-rac1:/home/grid$crsctl stat res -t
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.ARCH.dg
ONLINE ONLINE JiekeXu-rac1
ONLINE ONLINE JiekeXu-rac2
ora.DATA.dg
ONLINE ONLINE JiekeXu-rac1
ONLINE ONLINE JiekeXu-rac2
ora.LISTENER.lsnr
ONLINE ONLINE JiekeXu-rac1
ONLINE ONLINE JiekeXu-rac2
ora.OCR.dg
ONLINE ONLINE JiekeXu-rac1
ONLINE ONLINE JiekeXu-rac2
ora.asm
ONLINE ONLINE JiekeXu-rac1 Started
ONLINE ONLINE JiekeXu-rac2 Started
ora.gsd
OFFLINE OFFLINE JiekeXu-rac1
OFFLINE OFFLINE JiekeXu-rac2
ora.net1.network
ONLINE ONLINE JiekeXu-rac1
ONLINE ONLINE JiekeXu-rac2
ora.ons
ONLINE ONLINE JiekeXu-rac1
ONLINE ONLINE JiekeXu-rac2
ora.registry.acfs
ONLINE ONLINE JiekeXu-rac1
ONLINE ONLINE JiekeXu-rac2
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
1 ONLINE ONLINE JiekeXu-rac2
ora.cvu
1 ONLINE ONLINE JiekeXu-rac2
ora.JiekeXuadg.db
1 ONLINE OFFLINE Instance Shutdown
2 ONLINE INTERMEDIATE JiekeXu-rac2 Mounted (Closed)
ora.JiekeXu-rac1.vip
1 ONLINE ONLINE JiekeXu-rac1
ora.JiekeXu-rac2.vip
1 ONLINE ONLINE JiekeXu-rac2
ora.oc4j
1 ONLINE ONLINE JiekeXu-rac2
ora.scan1.vip
1 ONLINE ONLINE JiekeXu-rac2
启动节点一数据库,应用 MRP0 进程检查同步情况
JiekeXu-rac1:/home/grid$exit
logout
JiekeXu-rac1:~ # su - oracle
JiekeXu-rac1:/home/oracle$
JiekeXu-rac1:/home/oracle$sqlplus / as sysdba
SQL*Plus: Release 11.2.0.4.0 Production on Fri Oct 16 22:13:25 2020
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to an idle instance.
SQL> startup
ORACLE instance started.
Total System Global Area 7516033024 bytes
Fixed Size 2267872 bytes
Variable Size 1996490016 bytes
Database Buffers 5502926848 bytes
Redo Buffers 14348288 bytes
Database mounted.
Database opened.
SQL> set linesize 150;
set pagesize 20;
column name format a13;
column value format a20;
column unit format a30;
column TIME_COMPUTED format a30;
select name,value,unit,time_computed from v$dataguard_stats where name in ('transport lag','apply lag');
SQL> SQL> SQL> SQL> SQL> SQL>
NAME VALUE UNIT TIME_COMPUTED
------------- -------------------- ------------------------------ ------------------------------
transport lag +00 00:00:00 day(2) to second(0) interval 10/16/2020 22:13:58
apply lag day(2) to second(0) interval 10/16/2020 22:13:58
SQL> alter database recover managed standby database using current logfile disconnect from session;
Database altered.
SQL> select name,value,unit,time_computed from v$dataguard_stats where name in ('transport lag','apply lag');
NAME VALUE UNIT TIME_COMPUTED
------------- -------------------- ------------------------------ ------------------------------
transport lag +00 00:00:00 day(2) to second(0) interval 10/16/2020 22:14:31
apply lag day(2) to second(0) interval 10/16/2020 22:14:31
SQL> SELECT PROCESS, STATUS,SEQUENCE#,thread# FROM V$MANAGED_STANDBY;
PROCESS STATUS SEQUENCE# THREAD#
--------- ------------ ---------- ----------
ARCH CONNECTED 0 0
ARCH CONNECTED 0 0
ARCH CONNECTED 0 0
ARCH CONNECTED 0 0
ARCH CONNECTED 0 0
ARCH CONNECTED 0 0
ARCH CONNECTED 0 0
ARCH CONNECTED 0 0
ARCH CONNECTED 0 0
ARCH CONNECTED 0 0
RFS IDLE 0 0
RFS IDLE 4516 1
RFS IDLE 4516 2
RFS IDLE 0 0
MRP0 APPLYING_LOG 4516 2
15 rows selected.
SQL> select name,value,unit,time_computed from v$dataguard_stats where name in ('transport lag','apply lag');
NAME VALUE UNIT TIME_COMPUTED
------------- -------------------- ------------------------------ ------------------------------
transport lag +00 00:00:00 day(2) to second(0) interval 10/16/2020 22:14:39
apply lag +00 00:00:00 day(2) to second(0) interval 10/16/2020 22:14:39
SQL>
SQL> exit
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
登陆节点二查看数据库状态也是变成了 MOUNTED 这里并没有做任何操作,实在是奇怪吧。
JiekeXu-rac1:/home/oracle$ssh JiekeXu-rac2
Last login: Sun Aug 16 02:34:33 2020 from JiekeXu-rac1
JiekeXu-rac2:/home/oracle$sqlplus / as sysdba
SQL*Plus: Release 11.2.0.4.0 Production on Fri Oct 16 21:56:04 2020
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
SQL> set line 345
col instance_name for a15
col host_name for a30
select inst_id,INSTANCE_NAME,HOST_name,status,version,STARTUP_TIME from gv$instance;
SQL> SQL> SQL>
INST_ID INSTANCE_NAME HOST_NAME STATUS VERSION STARTUP_TIME
---------- --------------- ------------------------------ ------------ ----------------- -------------------
2 JiekeXuadg2 JiekeXu-rac2 MOUNTED 11.2.0.4.0 2020-08-16 02:01:57
1 JiekeXuadg1 JiekeXu-rac1 OPEN 11.2.0.4.0 2020-10-16 22:13:29
--这里数据库是 MOUNTED 由之前的 open 直接变成了 MOUNTED。
--故在这里说一句,众所周知数据库启动过程为started、mounted、open 三阶段,
--那么关闭过程是对应有三阶段的过程 close 到 MOUNTED,dismount 到 STARTED 然后 shutdown 关闭实例。
--如下所示,单点数据库实例关闭过程
SYS@JiekeXu>col HOST_NAME for a18
SYS@JiekeXu>select INSTANCE_NAME,HOST_NAME,VERSION,STATUS from v$instance;
INSTANCE_NAME HOST_NAME VERSION STATUS
---------------- ------------------ ----------------- ------------
JiekeXu JiekeXu 11.2.0.4.0 OPEN
SYS@JiekeXu>alter database close;
Database altered.
SYS@JiekeXu>select INSTANCE_NAME,HOST_NAME,VERSION,STATUS from v$instance;
INSTANCE_NAME HOST_NAME VERSION STATUS
---------------- ------------------ ----------------- ------------
JiekeXu JiekeXu 11.2.0.4.0 MOUNTED
SYS@JiekeXu>alter database dismount;
Database altered.
SYS@JiekeXu>select INSTANCE_NAME,HOST_NAME,VERSION,STATUS from v$instance;
INSTANCE_NAME HOST_NAME VERSION STATUS
---------------- ------------------ ----------------- ------------
JiekeXu JiekeXu 11.2.0.4.0 STARTED
SYS@JiekeXu>shutdown;
ORA-01507: database not mounted
ORACLE instance shut down.
SYS@JiekeXu>
这里节点二直接 open 数据库检查同步情况
SQL> alter database open;
Database altered.
SQL> set line 345
col instance_name for a15
col host_name for a30
select inst_id,INSTANCE_NAME,HOST_name,status,version,STARTUP_TIME from gv$instance;
SQL> SQL> SQL>
INST_ID INSTANCE_NAME HOST_NAME STATUS VERSION STARTUP_TIME
---------- --------------- ------------------------------ ------------ ----------------- -------------------
2 JiekeXuadg2 JiekeXu-rac2 OPEN 11.2.0.4.0 2020-08-16 02:01:57
1 JiekeXuadg1 JiekeXu-rac1 OPEN 11.2.0.4.0 2020-10-16 22:13:29
SQL> set linesize 150;
set pagesize 20;
column name format a13;
column value format a20;
column unit format a30;
column TIME_COMPUTED format a30;
select name,value,unit,time_computed from v$dataguard_stats where name in ('transport lag','apply lag');
SQL> SQL> SQL> SQL> SQL> SQL>
NAME VALUE UNIT TIME_COMPUTED
------------- -------------------- ------------------------------ ------------------------------
transport lag +00 00:00:00 day(2) to second(0) interval 10/17/2020 00:52:52
apply lag +00 00:00:00 day(2) to second(0) interval 10/17/2020 00:52:52
--查看数据库补丁 PSU 11.2.0.4.180717
SQL> set line 150
SQL> col ACTION_TIME for a30
SQL> col ACTION for a8
col NAMESPACE for a8
col VERSION for a10
col BUNDLE_SERIES for a5
col COMMENTS for a20
select * from dba_registry_history;
ACTION_TIME ACTION NAMESPAC VERSION ID BUNDL COMMENTS
------------------------------ -------- -------- ---------- ---------- ----- --------------------
24-AUG-13 12.03.45.119862 PM APPLY SERVER 11.2.0.4 0 PSU Patchset 11.2.0.2.0
09-OCT-18 09.49.44.442749 AM APPLY SERVER 11.2.0.4 180717 PSU PSU 11.2.0.4.180717
下面则通过 alert 日志查看问题
--节点 2 的 alert 日志
Fri Oct 16 21:41:17 2020
Reconfiguration started (old inc 12, new inc 14)
List of instances:
2 (myinst: 2)
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Fri Oct 16 21:41:17 2020
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
--集群重配完成后实例直接 crash
Fri Oct 16 21:41:18 2020
Recovery session aborted due to instance crash
Close the database due to aborted recovery session
SMON: disabling tx recovery
All dispatchers and shared servers shutdown
CLOSE: killing server sessions.
CLOSE: all sessions shutdown successfully.
SMON: disabling tx recovery
SMON: disabling cache recovery
Fri Oct 16 21:41:29 2020
RFS[4]: Possible network disconnect with primary database
Fri Oct 16 21:41:32 2020
idle dispatcher 'D000' terminated, pid = (27, 1)
Fri Oct 16 21:41:50 2020
RFS[5]: Possible network disconnect with primary database
Fri Oct 16 21:42:17 2020
Decreasing number of real time LMS from 2 to 0
Fri Oct 16 21:47:00 2020
Primary database is in MAXIMUM PERFORMANCE mode
RFS[8]: Assigned to RFS process 27227
RFS[8]: Selected log 5 for thread 1 sequence 4514 dbid 1813902435 branch 989056101
Fri Oct 16 21:47:01 2020
Primary database is in MAXIMUM PERFORMANCE mode
RFS[9]: Assigned to RFS process 27236
RFS[9]: Selected log 9 for thread 2 sequence 4514 dbid 1813902435 branch 989056101
Fri Oct 16 21:47:01 2020
RFS[10]: Assigned to RFS process 27264
RFS[10]: Selected log 8 for thread 2 sequence 4513 dbid 1813902435 branch 989056101
Fri Oct 16 21:47:02 2020
Archived Log entry 5708 added for thread 2 sequence 4513 ID 0x6fa9692f dest 1:
Fri Oct 16 21:47:54 2020
Archived Log entry 5709 added for thread 1 sequence 4514 ID 0x6fa9692f dest 1:
Fri Oct 16 21:47:54 2020
RFS[8]: Selected log 5 for thread 1 sequence 4515 dbid 1813902435 branch 989056101
Fri Oct 16 21:47:55 2020
RFS[9]: Selected log 8 for thread 2 sequence 4515 dbid 1813902435 branch 989056101
Fri Oct 16 21:47:55 2020
Archived Log entry 5710 added for thread 2 sequence 4514 ID 0x6fa9692f dest 1:
Fri Oct 16 21:52:25 2020
Reconfiguration started (old inc 14, new inc 16)
List of instances:
1 2 (myinst: 2)
Global Resource Directory frozen
Communication channels reestablished
Fri Oct 16 21:52:25 2020
* domain 0 valid = 0 according to instance 1
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Fri Oct 16 21:52:25 2020
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Fri Oct 16 21:52:25 2020
LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
Fri Oct 16 21:52:32 2020
Managed Standby Recovery starting Real Time Apply
Fri Oct 16 21:53:25 2020
Increasing number of real time LMS from 0 to 2
Reconfiguration started (old inc 16, new inc 18)
List of instances:
2 (myinst: 2)
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Fri Oct 16 21:53:34 2020
Fri Oct 16 21:53:34 2020
LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
Fri Oct 16 21:53:51 2020
RFS[11]: Assigned to RFS process 30085
RFS[11]: Selected log 6 for thread 1 sequence 4513 dbid 1813902435 branch 989056101
Fri Oct 16 21:53:53 2020
Archived Log entry 5711 added for thread 1 sequence 4513 ID 0x6fa9692f dest 1:
Fri Oct 16 21:54:35 2020
Decreasing number of real time LMS from 2 to 0
Reconfiguration started (old inc 18, new inc 20)
List of instances:
1 2 (myinst: 2)
Global Resource Directory frozen
Communication channels reestablished
Fri Oct 16 21:54:44 2020
* domain 0 valid = 0 according to instance 1
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Fri Oct 16 21:54:44 2020
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Fri Oct 16 21:54:44 2020
LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
Fri Oct 16 21:54:51 2020
Managed Standby Recovery starting Real Time Apply
Fri Oct 16 21:54:54 2020
Archived Log entry 5712 added for thread 1 sequence 4515 ID 0x6fa9692f dest 1:
Fri Oct 16 21:54:55 2020
Archived Log entry 5713 added for thread 2 sequence 4515 ID 0x6fa9692f dest 1:
Fri Oct 16 21:55:33 2020
Managed Standby Recovery starting Real Time Apply
Fri Oct 16 21:55:33 2020
Reconfiguration started (old inc 20, new inc 22)
List of instances:
1 2 (myinst: 2)
Global Resource Directory frozen
Communication channels reestablished
Fri Oct 16 21:55:33 2020
* domain 0 valid = 1 according to instance 1
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Fri Oct 16 21:55:33 2020
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Fri Oct 16 21:55:33 2020
LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Fri Oct 16 21:56:23 2020
alter database open
Picked Lamport scheme to generate SCNs
Fri Oct 16 21:56:23 2020
SMON: enabling cache recovery
No Resource Manager plan active
Physical standby database opened for read only access.
Completed: alter database open
Fri Oct 16 21:56:34 2020
Increasing number of real time LMS from 0 to 2
Fri Oct 16 23:58:42 2020
从日志中发现 "Close the database due to aborted recovery session" 节点二数据库 close 是因为节点一 recover session 终止了,下面查看节点一的日志直接 abort 了,没有相信的日志信息。
节点 1 的 alert 日志
--21:41:16 节点一直接 abort 没有相信日志信息,相当于异常断电
Fri Oct 16 21:41:16 2020
Shutting down instance (abort)
License high water mark = 13
USER (ospid: 26959): terminating the instance
Instance terminated by USER, pid = 26959
Fri Oct 16 21:41:18 2020
Instance shutdown complete
-- 22:11:08 数据库再一次启动,大页相关配置信息,可见这里并没有配置,不过后面有推荐设置 3601
Fri Oct 16 22:11:08 2020
Starting ORACLE instance (normal)
************************ Large Pages Information *******************
Per process system memlock (soft) limit = UNLIMITED
Total Shared Global Region in Large Pages = 0 KB (0%)
Large Pages used by this instance: 0 (0 KB)
Large Pages unused system wide = 0 (0 KB)
Large Pages configured system wide = 0 (0 KB)
Large Page size = 2048 KB
RECOMMENDATION:
Total System Global Area size is 7202 MB. For optimal performance,
prior to the next instance restart:
1. Increase the number of unused large pages by
at least 3601 (page size 2048 KB, total size 7202 MB) system wide to
get 100% of the System Global Area allocated with large pages
********************************************************************
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 4
Private Interface 'eth1:1' configured from GPnP for use as a private interconnect.
[name='eth1:1', type=1, ip=169.254.111.0, mac=00-50-56-b9-xx-xx, net=10.xx.xx.0/16, mask=255.255.0.0, use=haip:cluster_interconnect/62]
Public Interface 'eth0' configured from GPnP for use as a public interface.
[name='eth0', type=1, ip=10.xx.xx.8, mac=fa-16-3e-77-xx-xx, net=10.xx.xx.0/24, mask=255.255.255.0, use=public/1]
Public Interface 'eth0:1' configured from GPnP for use as a public interface.
[name='eth0:1', type=1, ip=10.xx.xx.1, mac=fa-16-3e-77-xx-xx, net=10.xx.xx.0/24, mask=255.255.255.0, use=public/1]
Picked latch-free SCN scheme 3
--这里全部都是数据库参数相关设置
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options.
ORACLE_HOME = /app/product/11.2.0/db
System name: Linux
Node name: JiekeXu-rac1
Release: 3.0.76-0.11-default
Version: #1 SMP Fri Jun 14 08:21:43 UTC 2013 (ccab990)
Machine: x86_64
VM name: VMWare Version: 6
Using parameter settings in server-side pfile /app/product/11.2.0/db/dbs/initJiekeXuadg1.ora
System parameters with non-default values:
processes = 2000
sessions = 3024
event = "28401 TRACE NAME CONTEXT FOREVER, LEVEL 1"
event = "10949 TRACE NAME CONTEXT FOREVER"
spfile = "+DATA/JiekeXuadg/parameterfile/spfile.256.1046662725"
sga_target = 7200M
control_files = "+DATA/JiekeXuadg/controlfile/current.257.995197687"
db_file_name_convert = "+DATA"
db_file_name_convert = "+DATA"
log_file_name_convert = "+DATA"
log_file_name_convert = "+DATA"
db_block_size = 8192
compatible = "11.2.0.4.0"
log_archive_dest_1 = "location=+ARCH valid_for=(all_logfiles,all_roles) db_unique_name=JiekeXuadg"
log_archive_dest_2 = "service=JiekeXu LGWR ASYNC valid_for=(online_logfiles,primary_role) db_unique_name=JiekeXu"
log_archive_dest_state_1 = "enable"
log_archive_dest_state_2 = "enable"
fal_client = "JiekeXuadg"
fal_server = "JiekeXu"
log_archive_config = "dg_config=(JiekeXuadg,JiekeXu)"
log_archive_format = "%t_%s_%r.dbf"
log_archive_max_processes= 10
db_files = 1024
cluster_database = TRUE
db_create_file_dest = "+DATA"
db_recovery_file_dest = ""
db_recovery_file_dest_size= 30G
standby_file_management = "AUTO"
thread = 1
undo_tablespace = "UNDOTBS1"
undo_retention = 1440
instance_number = 1
remote_login_passwordfile= "EXCLUSIVE"
db_domain = ""
service_names = "JiekeXu"
dispatchers = "(PROTOCOL=TCP) (SERVICE=JiekeXuXDB)"
remote_listener = "db-cluster-scan:1521"
session_cached_cursors = 100
job_queue_processes = 15
audit_file_dest = "/app/oracle/admin/JiekeXu/adump"
audit_trail = "NONE"
db_name = "JiekeXu"
db_unique_name = "JiekeXuadg"
open_cursors = 500
pga_aggregate_target = 2397M
deferred_segment_creation= TRUE
enable_ddl_logging = TRUE
diagnostic_dest = "/app/oracle"
Cluster communication is configured to use the following interface(s) for this instance
169.254.111.0
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
Fri Oct 16 22:11:16 2020
PMON started with pid=2, OS id=8557
Fri Oct 16 22:11:16 2020
PSP0 started with pid=3, OS id=8559
Fri Oct 16 22:11:17 2020
VKTM started with pid=4, OS id=8748 at elevated priority
VKTM running at (1)millisec precision with DBRM quantum (100)ms
这种情况还真是比较少见的,于是只能通过搜索引擎来查看相关问题了。MOS 中 Doc ID 1357597.1 给出了明确解释:
In an Active Data Guard RAC standby, if the redo apply instance crashes in the middle of media recovery, it leaves the RAC cache fusion locks on the surviving instances
and the data files on disk in an in-flux state. In such a situation, queries on the surviving instances can potentially see inconsistent data. To resolve this in-flux state, the entire standby database is closed.
Upon opening the first instance after such a close, the buffer caches and datafiles are made consistent again.
翻译下来就是:在 Active Data Guard RAC 备库中,如果 redo apply 实例在恢复过程中崩溃,它会使幸存实例和磁盘上的数据文件上的 RAC 缓存融合锁定处于通量状态。在这种情况下,对幸存实例的查询可能会看到不一致的数据。
要解决此处于流动状态,将关闭整个备用数据库。在关闭后打开第一个实例时,缓冲区缓存和数据文件再次保持一致。
大概意思就是说,如果 apply redo 应用日志的实例进程异常终止后,其它所有OPEN READ ONLY 的实例会 close, 因为在 RAC ADG 环境中,如果实例在应用日志过程中中断 crash, 会把CACHE FUSION 的锁留到残留幸存的实例中,会导致数据查询不一致,因次需要关闭数据库,重新打开来保证 buffer cache 和 datafile 的一致状态。如果配置了 DG Broker 这个操作可以自动完成, 版本大于11.2.0.2,如果没有配置 DG broker,手动方式直接 open 就可以了,接着手动执行应用日志命令,不需要重启实例,继续在幸存的节点上应用日志。
在墨天轮中,Anbob 大神写到从 12.1 版本引入了新特性 ”ADG instance recovery” ,解决的是当 redo apply instance crash 时,影响其它实例也 close 问题,从 12.1 以后保存下来的 ADG 实例会自动做 adg instance recovery ,保证数据一致性。11204 中要安装了较新的 PSU,修复了 bug 18331944 和 19516448,同时再配置隐藏参数””_adg_instance_recovery=TRUE””。关于隐含参数介绍可查看我之前的一篇文章:
Oracle 隐含参数查看 https://www.modb.pro/db/24621
SQL> col name for a30
SQL> col value for a30
SQL> col DESCRIPTION for a70
SQL> set line 234
SQL> select a.ksppinm name, b.ksppstvl value,a.ksppdesc description from x$ksppi a, x$ksppcv b where a.indx = b.indx and a.ksppinm like '_adg_instance_recovery';
NAME VALUE DESCRIPTION
------------------------------ ------------------------------ ----------------------------------------------------------------------
_adg_instance_recovery FALSE enable ADG instance recovery
EXPLANATION
--------------
Before this feature (prior to 12.1), when the ADG apply instance crashed in the middle of applying changes (standby media recovery session is on-going), all remaining open instances will be closed.
But from 12.1 when the apply instance crashed in the middle of applying changes, one of the remaining open instances will be automatically posted to do "ADG instance recovery", after the ADG instance recovery.We can see this,
ADG instance recovery by checking the alert log, for the messages like "Beginning ADG Instance Recovery" and "Completed ADG Instance Recovery".
If DG broker is enabled then Broker will start the MRP on any of the surviving instances.
Please note that the new feature is enabled by default in 12.1. We backported it to 11.2.0.4, you can enable it on 11.2.0.4 by setting "_adg_instance_recovery=TRUE".
The default behaviour on 11.2.0.4 is to close all remaining open instances
翻译:在此特性之前(12.1之前),当 ADG apply 实例在应用更改的过程中崩溃(备用媒体恢复会话正在进行)时,所有剩余的打开实例将被关闭。
但是从 12.1 开始,当应用实例在应用更改的过程中崩溃时,在 ADG 实例恢复之后,剩余的一个打开的实例将被自动提交去做“ADG实例恢复”。我们可以通过查看警报日志看到 ADG 实例恢复,对于“开始ADG实例恢复”和“完成ADG实例恢复”这样的消息。如果 DG broker 是启用的,那么 broker 将在任何存活的实例上启动 MRP。
请注意,12.1 中的新特性是默认启用的。我们将它向后移植到 11.2.0.4,您可以通过设置 “_adg_instance_recovery=TRUE” 在 11.2.0.4 上启用它。
11.2.0.4 的默认行为是关闭所有其余打开的实例。
下面是一个 MOS 上测试案例分享一下:
Test case
------------
I have 2 node primary and 3 node physical standby(instance number 2,4 and 5).
MRP running on instance 2 and other instances are in open read only.
1. On instance 2,
SQL> select d.open_mode,i.instance_number from v$database d, v$instance i;
OPEN_MODE INSTANCE_NUMBER
-------------------- ---------------
READ ONLY WITH APPLY 2
SQL> select process,status,sequence#,thread# from v$managed_standby where process like '%MRP%';
PROCESS STATUS SEQUENCE# THREAD#
--------- ------------ ---------- ----------
MRP0 WAIT_FOR_LOG 22 1
SQL> show parameter _adg
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
_adg_instance_recovery boolean TRUE
2. On instance 4,
SQL> select process,status,sequence#,thread# from v$managed_standby where process like '%MRP%';
no rows selected
SQL> show parameter _adg
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
_adg_instance_recovery boolean TRUE
3. Shut abort instance 2.
SQL> shut abort
ORACLE instance shut down.
SQL> exit
4. Instance 4 still open_read only.
SQL> select d.open_mode,i.instance_number from v$database d, v$instance i;
OPEN_MODE INSTANCE_NUMBER
-------------------- ---------------
READ ONLY 4
NOTE :
1. As said before if DG broker configured then Broker will start MRP on any one of the available instances.
2. If you want to use ADG instance recovery feature on 11.2.0.4 (including 11.2.0.4 BP or PSUs), please make sure the following patches for following fixes are present. The ADG instance recovery feature is not usable without those fixes.
- Bug fix for 18331944.
- Bug fix for 19516448.
当然,我的生产灾备环境是没有相关补丁和参数设置的,PSU 11.2.0.4.180717 过旧,故会产生此类问题。
参考链接:
https://www.modb.pro/db/32021
Active Data Guard RAC Standby – Apply Instance Node Failure Impact (Doc ID 1357597.1)
Behavior of Active Dataguard(ADG) When Apply Node Aborts/Crash (Doc ID 1613719.1)
今天的分享就到这里了,如果本文对您有一丁点儿帮助,请多支持“在看”与转发,不求小费了哪怕是一个小小的赞,您的鼓励都将是我熬夜写文章最大的动力,让我有一直写下去的动力,最后一起加油,奥利给!
Oracle 12c 及以上版本补丁更新说明及下载方法(收藏版)
Oracle 11.2.0.4 RAC 最新补丁下载(11.2.0.4.200714)