Using OCLUMON to analyze Cluster Health

Cluster Health Monitor & OCLUMON

The Cluster Health Monitor (CHM) stores real-time operating system metrics in the CHM repository so that you can use them later for triage, with the help of Oracle Support, if you have cluster issues.

It consists of the System Monitor Service, the Cluster Logger Service, and the CHM Repository.

The OCLUMON command-line tool is included with CHM and you can use it to query the CHM repository to display node-specific metrics for a specified time period. You can also use oclumon to query and print the durations and the states for a resource on a node during a specified time period.

System Monitor Service

There is one system monitor service on every RAC node. The system monitor service (osysmond) is the monitoring and operating system metric collection service; it sends the data to the cluster logger service. The cluster logger service receives the information from all the nodes and persists it in a CHM repository-based database.

[root@pk3-iub-rp-od01 bin]# ps -ef | grep osys
root 7229 27949 0 12:34 pts/0 00:00:00 grep osys
root 9865 1 4 Nov06 ? 2-05:33:49 /u01/app/11.2.0.4/grid/bin/osysmond.bin

Cluster Logger Service

There is one cluster logger service (ologgerd) per cluster, running on only one node. The cluster logger service chooses another node to house the standby for the master cluster logger service. If the master cluster logger service fails (because the service is not able to come up after a fixed number of retries or the node where the master was running is down), the node where the standby resides takes over as master and selects a new node for standby. The master manages the operating system metric database in the CHM repository and interacts with the standby to manage a replica of the master operating system metrics database.
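
The takeover described above can be pictured with a small model. This is only an illustrative sketch, not how ologgerd is implemented: the `LoggerState` class, the `fail_over` function, and the rule of picking the first remaining healthy node as the new standby are all assumptions made for the example (the text does not say how the new standby is selected).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LoggerState:
    """Hypothetical snapshot of which node is master and which is standby."""
    master: str
    standby: Optional[str]


def fail_over(state: LoggerState, healthy_nodes: List[str]) -> LoggerState:
    """Promote the standby to master and choose a new standby, if one exists."""
    if state.standby is None:
        raise RuntimeError("no standby available to take over as master")
    new_master = state.standby
    # Assumption: the first healthy node other than the new master becomes
    # the standby. The real selection rule is not documented here.
    candidates = [node for node in healthy_nodes if node != new_master]
    new_standby = candidates[0] if candidates else None
    return LoggerState(master=new_master, standby=new_standby)


if __name__ == "__main__":
    before = LoggerState(master="node1", standby="node2")
    # node1 is down; node2 and node3 are still healthy.
    print(fail_over(before, ["node2", "node3"]))
```

In a two-node cluster such as the one shown in this post, the model ends up with no standby after a failover, because no third node is available.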

[root@pk3-iub-rp-od01 bin]# ps -ef |grep olog
root 10450 1 0 Nov06 ? 10:54:06 /u01/app/11.2.0.4/grid/bin/ologgerd -m pk3-iub-rp-od02 -r -d /u01/app/11.2.0.4/grid/crf/db/pk3-iub-rp-od01
root 12824 8395 0 12:38 pts/0 00:00:00 grep olog

[root@pk3-iub-rp-od01 bin]# ./oclumon manage -get master
Master = pk3-iub-rp-od01
Done
[root@pn3-esk-rp-od01 bin]# ./oclumon manage -get replica
Replica = pk3-iub-rp-od02
Done

CHM Repository

The CHM repository, by default, resides within the Grid Infrastructure home and requires 1 GB of disk space per node in the cluster. You can adjust its size and location, and Oracle supports moving it to shared storage. You manage the CHM repository with OCLUMON.

[root@pk3-iub-rp-od01 bin]# ./oclumon manage -get reppath
CHM Repository Path = /u01/app/11.2.0.4/grid/crf/db/pk3-iub-rp-od01
Done

[root@pk3-iub-rp-od01 bin]# ./oclumon manage -get repsize
CHM Repository Size = 61511
Done
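
If you want to collect these settings from a script, the `Key = value` lines printed by `oclumon manage -get` are easy to parse. The following is a minimal sketch that works on captured output text such as the samples above; the function name `parse_manage_get` is made up for this example, and the output format is assumed to match what is shown in this post.

```python
import re
from typing import Dict

# One "Key = value" line, e.g. "Master = pk3-iub-rp-od01".
_KV_LINE = re.compile(r"^\s*(?P<key>[^=]+?)\s*=\s*(?P<value>.+?)\s*$")


def parse_manage_get(output: str) -> Dict[str, str]:
    """Return the key/value pairs found in captured 'manage -get' output."""
    settings: Dict[str, str] = {}
    for line in output.splitlines():
        match = _KV_LINE.match(line)
        if match:  # lines such as "Done" carry no '=' and are skipped
            settings[match.group("key")] = match.group("value")
    return settings


if __name__ == "__main__":
    captured = "CHM Repository Size = 61511\nDone\n"
    print(parse_manage_get(captured))
```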

OCLUMON Usage

Use the oclumon dumpnodeview command to view log information from the system monitor service in the form of a node view. A node view consists of the following seven views:

SYSTEM: Lists system metrics such as CPU COUNT, CPU USAGE, and MEM USAGE
TOP CONSUMERS: Lists the top consuming processes
PROCESSES: Lists process metrics such as PID, name, number of threads, memory usage, and number of file descriptors
DEVICES: Lists device metrics such as disk read and write rates, queue length, and wait time per I/O
NICS: Lists network interface card metrics such as network receive and send rates, effective bandwidth, and error rates
FILESYSTEMS: Lists file system metrics, such as total, used, and available space
PROTOCOL ERRORS: Lists any protocol errors. All protocol errors are cumulative values since system startup.

-- Below retrieves the info for the last one hour for a specific node
[root@pk3-iub-rp-od01 bin]# ./oclumon dumpnodeview -n pk3-iub-rp-od01 -last "01:00:00" > /tmp/oclumon.txt
-- for specific time
./oclumon dumpnodeview -allnodes -s "time_stamp" -e "time_stamp"
[root@pk3-iub-rp-od01 bin]# ./oclumon dumpnodeview -allnodes -s "2014-12-24 10:05:00" -e "2014-12-24 10:10:00" > /tmp/oclumon.txt
-- Without -v you will have only the SYSTEM and TOP CONSUMERS views
[root@pk3-iub-rp-od01 bin]# ./oclumon dumpnodeview -allnodes -v -s "2014-12-24 10:05:00" -e "2014-12-24 10:05:10" > /tmp/oclumon.txt
-- with -warning only node views with warning will be shown
[root@pk3-iub-rp-od01 bin]# ./oclumon dumpnodeview -allnodes -warning -v -s "2014-12-24 10:05:00" -e "2014-12-24 10:05:10" > /tmp/oclumon.txt
-- To see the objects (NICs, OS processes, disks) present in a node at a particular time
[root@pn3-esk-rp-od01 bin]# ./oclumon showobjects
Following nodes are attached to the loggerd
pk3-iub-rp-od01
pk3-iub-rp-od02

[root@pn3-esk-rp-od01 bin]# ./oclumon showobjects -n pn3-esk-rp-od01 -time "2014-12-24 15:00:00"
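
When these commands are run from a script, it helps to build the argument list in one place. The sketch below only uses the dumpnodeview options that appear in the commands above (-n, -allnodes, -last, -s, -e, -v, -warning). The function name, its parameters, and the two validation rules are choices made for this example, not documented restrictions of the tool.

```python
from typing import List, Optional


def build_dumpnodeview_args(
    node: Optional[str] = None,
    last: Optional[str] = None,
    start: Optional[str] = None,
    end: Optional[str] = None,
    verbose: bool = False,
    warning_only: bool = False,
    oclumon: str = "./oclumon",
) -> List[str]:
    """Build an argument list for 'oclumon dumpnodeview'."""
    # Restrictions imposed by this sketch, mirroring the usage shown above.
    if (start is None) != (end is None):
        raise ValueError("start and end must be given together")
    if last is not None and start is not None:
        raise ValueError("use either last or start/end, not both")

    args = [oclumon, "dumpnodeview"]
    args += ["-n", node] if node else ["-allnodes"]
    if warning_only:
        args.append("-warning")
    if verbose:
        args.append("-v")
    if last is not None:
        args += ["-last", last]
    if start is not None and end is not None:
        args += ["-s", start, "-e", end]
    return args


if __name__ == "__main__":
    print(build_dumpnodeview_args(node="pk3-iub-rp-od01", last="01:00:00"))
```

The list can be handed to `subprocess.run` as is, which avoids shell quoting problems with the time stamps that contain spaces.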

Sample Output

----------------------------------------
Node: pk3-iub-rp-od02 Clock: '12-24-14 10.05.01' SerialNo:64028
----------------------------------------

SYSTEM:
#pcpus: 2 #vcpus: 32 cpuht: Y chipname: Intel(R) cpu: 2.43 cpuq: 3 physmemfree: 195428192 physmemtotal: 264536300 mcache: 29107440 swapfree: 25165816 swaptotal: 25165816 ior: 0 iow: 1135 ios: 222 swpin: 0 swpout: 0 pgin: 0 pgout: 543 netr: 42.281 netw: 115.057 procs: 1282 rtprocs: 80 #fds: 29952 #sysfdlimit: 6815744 #disks: 5 #nics: 4 nicErrors: 0

TOP CONSUMERS:
topcpu: 'osysmond.bin(9813) 5.99' topprivmem: 'java(9716) 408552' topshm: 'ora_lms2_iubDB2(12021) 5600244' topfd: 'ocssd.bin(9865) 196' topthread: 'java(9716) 47'

PROCESSES:

name: 'osysmond.bin' pid: 9813 #procfdlimit: 65536 cpuusage: 5.99 privmem: 32672 shm: 58076 #fd: 66 #threads: 12 priority: -100 nice: 0
name: 'oraagent.bin' pid: 11285 #procfdlimit: 65536 cpuusage: 0.79 privmem: 25532 shm: 17752 #fd: 89 #threads: 26 priority: 20 nice: 0
name: 'tnslsnr' pid: 11607 #procfdlimit: 65536 cpuusage: 0.39 privmem: 4392 shm: 10180 #fd: 19 #threads: 3 priority: 20 nice: 0
name: 'orarootagent.bi' pid: 11289 #procfdlimit: 65536 cpuusage: 0.39 privmem: 11720 shm: 14472 #fd: 32 #threads: 11 priority: 20 nice: 0
name: 'ocssd.bin' pid: 9865 #procfdlimit: 65536 cpuusage: 0.39 privmem: 78568 shm: 55952 #fd: 196 #threads: 26 priority: -100 nice: 0
name: 'ora_dia0_IUBDB2' pid: 11998 #procfdlimit: 65536 cpuusage: 0.39 privmem: 34392 shm: 125936 #fd: 27 #threads: 1 priority: 20 nice: 0
......
name: 'oracle+ASM2' pid: 11074 #procfdlimit: 65536 cpuusage: 0.0 privmem: 2812 shm: 18224 #fd: 16 #threads: 1 priority: 20 nice: 0
name: 'oracle+ASM2' pid: 11335 #procfdlimit: 65536 cpuusage: 0.0 privmem: 2616 shm: 17460 #fd: 18 #threads: 1 priority: 20 nice: 0
DEVICES:
dm-2 ior: 0.000 iow: 0.000 ios: 0 qlen: 0 wait: 0 type: SWAP
dm-3 ior: 0.000 iow: 25.631 ios: 6 qlen: 0 wait: 0 type: SYS
dm-1 ior: 0.000 iow: 0.000 ios: 0 qlen: 0 wait: 0 type: SYS
dm-0 ior: 0.000 iow: 542.253 ios: 135 qlen: 0 wait: 0 type: SYS
sda ior: 0.000 iow: 567.884 ios: 80 qlen: 0 wait: 0 type: SYS
sda3 ior: 0.000 iow: 0.000 ios: 0 qlen: 0 wait: 0 type: SYS
sda2 ior: 0.000 iow: 569.485 ios: 81 qlen: 0 wait: 0 type: SYS
sda1 ior: 0.000 iow: 0.000 ios: 0 qlen: 0 wait: 0 type: SYS
NICS:
lo netrr: 1.741 netwr: 1.741 neteff: 3.482 nicerrors: 0 pktsin: 11 pktsout: 11 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 11 innonunicast: 0 type: PUBLIC
eth0 netrr: 0.000 netwr: 0.000 neteff: 0.000 nicerrors: 0 pktsin: 0 pktsout: 0 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 0 innonunicast: 0 type: PUBLIC
bondeth0 netrr: 35.239 netwr: 109.135 neteff: 144.374 nicerrors: 0 pktsin: 154 pktsout: 157 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 154 innonunicast: 0 type: PUBLIC
bondib0 netrr: 5.300 netwr: 4.181 neteff: 9.480 nicerrors: 0 pktsin: 13 pktsout: 14 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 13 innonunicast: 0 type: PRIVATE latency: <1

FILESYSTEMS:
mount: /u01 type: ext3 total: 103212320 used: 59980932 available: 37988508 used%: 61 ifree%: 96 [ORACLE_HOME IUBBRM2 IUBDB2]

mount: / type: rootfs total: 0 used: 0 available: 0 used%: 0 ifree%: -1 [IUBBRM2 iubDB2]

PROTOCOL ERRORS:
IPHdrErr: 0 IPAddrErr: 0 IPUnkProto: 0 IPReasFail: 0 IPFragFail: 0 TCPFailedConn: 35 TCPEstRst: 10738 TCPRetraSeg: 1203010 UDPUnkPort: 224 UDPRcvErr: 0
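
Output like the above is regular enough to parse with a couple of regular expressions. The following sketch reads the node view header and the SYSTEM line. It assumes the layout shown in this sample, including the `mm-dd-yy hh.mi.ss` clock format and a single-token `chipname` value; the function names are made up for the example.

```python
import re
from datetime import datetime
from typing import Dict, Tuple

_HEADER = re.compile(r"Node:\s+(\S+)\s+Clock:\s+'([^']+)'\s+SerialNo:\s*(\d+)")
# A "key: value" pair; some keys start with '#', e.g. "#vcpus: 32".
_PAIR = re.compile(r"(#?\w+):\s+(\S+)")


def parse_header(line: str) -> Tuple[str, datetime, int]:
    """Split a node view header into node name, sample time and serial number."""
    match = _HEADER.search(line)
    if match is None:
        raise ValueError(f"not a node view header: {line!r}")
    node, clock, serial = match.groups()
    return node, datetime.strptime(clock, "%m-%d-%y %H.%M.%S"), int(serial)


def parse_system_line(line: str) -> Dict[str, str]:
    """Return the SYSTEM metrics as a dict of raw strings keyed by metric name."""
    return dict(_PAIR.findall(line))


if __name__ == "__main__":
    header = "Node: pk3-iub-rp-od02 Clock: '12-24-14 10.05.01' SerialNo:64028"
    print(parse_header(header))
    print(parse_system_line("#pcpus: 2 #vcpus: 32 cpuht: Y chipname: Intel(R) cpu: 2.43"))
```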

Metric Descriptions

SYSTEM View

Metric	Description
#pcpus	Number of physical CPUs in the system
#vcpus	Number of logical compute units
chipname	Type of CPU
cpuht	CPU hyperthreading enabled (Y) or disabled (N)
cpu	Average CPU utilization per processing unit within the current sample interval (%).
cpuq	Number of processes waiting in the run queue within the current sample interval
physmemfree	Amount of free RAM (KB)
physmemtotal	Amount of total usable RAM (KB)
mcache	Amount of physical RAM used for file buffers plus the amount of physical RAM used as cache memory (KB) Note: This metric is not available on Solaris or Windows systems.
swapfree	Amount of swap memory free (KB)
swaptotal	Total amount of physical swap memory (KB)
ior	Average total disk read rate within the current sample interval (KB per second)
iow	Average total disk write rate within the current sample interval (KB per second)
ios	Average total disk I/O operation rate within the current sample interval (I/O operations per second)
swpin	Average swap in rate within the current sample interval (KB per second) Note: This metric is not available on Windows systems.
swpout	Average swap out rate within the current sample interval (KB per second) Note: This metric is not available on Windows systems.
pgin	Average page in rate within the current sample interval (pages per second)
pgout	Average page out rate within the current sample interval (pages per second)
netr	Average total network receive rate within the current sample interval (KB per second)
netw	Average total network send rate within the current sample interval (KB per second)
procs	Number of processes
rtprocs	Number of real-time processes
#fds	Number of open file descriptors (number of open handles on Windows)
#sysfdlimit	System limit on number of file descriptors Note: This metric is not available on Windows systems.
#disks	Number of disks
#nics	Number of network interface cards
nicErrors	Average total network error rate within the current sample interval (errors per second)
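
A few derived figures make the SYSTEM metrics easier to read at a glance. The sketch below computes memory and swap usage percentages and the run queue length per logical CPU from already-parsed numeric values. The function names and the choice of derived figures are assumptions for this example. Note that memory counted as used here includes file cache (`mcache`).

```python
from typing import Dict


def used_pct(free_kb: float, total_kb: float) -> float:
    """Percentage of a resource in use, given its free and total size in KB."""
    if total_kb <= 0:
        return 0.0
    return 100.0 * (total_kb - free_kb) / total_kb


def summarize_system(system: Dict[str, float]) -> Dict[str, float]:
    """Derive a few ratios from SYSTEM view metrics (numeric values)."""
    return {
        "mem_used_pct": used_pct(system["physmemfree"], system["physmemtotal"]),
        "swap_used_pct": used_pct(system["swapfree"], system["swaptotal"]),
        # Waiting processes per logical compute unit.
        "runq_per_vcpu": system["cpuq"] / system["#vcpus"],
    }


if __name__ == "__main__":
    # Values taken from the SYSTEM line of the sample output.
    sample = {
        "physmemfree": 195428192, "physmemtotal": 264536300,
        "swapfree": 25165816, "swaptotal": 25165816,
        "cpuq": 3, "#vcpus": 32,
    }
    for name, value in summarize_system(sample).items():
        print(f"{name}: {value:.2f}")
```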

PROCESSES View Metric Descriptions

Metric	Description
name	The name of the process executable
pid	The process identifier assigned by the operating system
#procfdlimit	Limit on number of file descriptors for this process Note: This metric is not available on Windows, Solaris, AIX, and HP-UX systems.
cpuusage	Process CPU utilization (%) Note: The utilization value can be up to 100 times the number of processing units.
privmem	Process private memory usage (KB)
shm	Process shared memory usage (KB) Note: This metric is not available on Windows, Solaris, and AIX systems.
workingset	Working set of a program (KB) Note: This metric is only available on Windows.
#fd	Number of file descriptors open by this process (number of open handles by this process on Windows)
#threads	Number of threads created by this process
priority	The process priority
nice	The nice value of the process
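
To find the heaviest processes in a verbose node view, you can parse the PROCESSES lines and sort them by any numeric metric. This is a rough sketch based on the line layout in the sample output (where the private memory field is printed as `privmem`); the function names are made up for the example.

```python
import re
from typing import Dict, List, Union

Value = Union[str, int, float]
_NAME = re.compile(r"name:\s+'([^']*)'")
_PAIR = re.compile(r"(#?\w+):\s+(-?\d+(?:\.\d+)?)")


def parse_process_line(line: str) -> Dict[str, Value]:
    """Parse one PROCESSES line into a dict; numbers become int or float."""
    name = _NAME.search(line)
    if name is None:
        raise ValueError(f"not a PROCESSES line: {line!r}")
    record: Dict[str, Value] = {"name": name.group(1)}
    # Only scan the text after the quoted name for numeric pairs.
    for key, raw in _PAIR.findall(line[name.end():]):
        record[key] = float(raw) if "." in raw else int(raw)
    return record


def top_by(processes: List[Dict[str, Value]], metric: str, count: int = 3) -> List[Dict[str, Value]]:
    """Return the 'count' processes with the highest value for 'metric'."""
    return sorted(processes, key=lambda p: p.get(metric, 0), reverse=True)[:count]


if __name__ == "__main__":
    line = ("name: 'ocssd.bin' pid: 9865 #procfdlimit: 65536 cpuusage: 0.39 "
            "privmem: 78568 shm: 55952 #fd: 196 #threads: 26 priority: -100 nice: 0")
    print(parse_process_line(line))
```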

DEVICES View Metric Descriptions

Metric	Description
ior	Average disk read rate within the current sample interval (KB per second)
iow	Average disk write rate within the current sample interval (KB per second)
ios	Average disk I/O operation rate within the current sample interval (I/O operations per second)
qlen	Number of I/O requests in wait state within the current sample interval
wait	Average wait time per I/O within the current sample interval (msec)
type	If applicable, identifies what the device is used for. Possible values are `SWAP`, `SYS`, `OCR`, `ASM`, and `VOTING`.
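
The DEVICES lines can be ranked by throughput in the same way. The sketch below parses each line and orders devices by combined read and write rate; it assumes the layout from the sample output, and the function names are invented for the example.

```python
import re
from typing import Dict, List, Union

_PAIR = re.compile(r"(\w+):\s+(\S+)")


def parse_device_line(line: str) -> Dict[str, Union[str, float]]:
    """Parse one DEVICES line; the first token is the device name."""
    name, _, rest = line.strip().partition(" ")
    record: Dict[str, Union[str, float]] = {"device": name}
    for key, raw in _PAIR.findall(rest):
        # 'type' is a label such as SYS or SWAP; everything else is numeric.
        record[key] = raw if key == "type" else float(raw)
    return record


def busiest(devices: List[Dict[str, Union[str, float]]], count: int = 3) -> List[Dict[str, Union[str, float]]]:
    """Return the devices with the highest combined read+write rate (KB/s)."""
    return sorted(devices, key=lambda d: d["ior"] + d["iow"], reverse=True)[:count]


if __name__ == "__main__":
    sample = [
        "dm-0 ior: 0.000 iow: 542.253 ios: 135 qlen: 0 wait: 0 type: SYS",
        "sda2 ior: 0.000 iow: 569.485 ios: 81 qlen: 0 wait: 0 type: SYS",
    ]
    for dev in busiest([parse_device_line(l) for l in sample]):
        print(dev["device"], dev["iow"])
```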

NICS View Metric Descriptions

Metric	Description
netrr	Average network receive rate within the current sample interval (KB per second)
netwr	Average network send rate within the current sample interval (KB per second)
neteff	Average effective bandwidth within the current sample interval (KB per second)
nicerrors	Average error rate within the current sample interval (errors per second)
pktsin	Average incoming packet rate within the current sample interval (packets per second)
pktsout	Average outgoing packet rate within the current sample interval (packets per second)
errsin	Average error rate for incoming packets within the current sample interval (errors per second)
errsout	Average error rate for outgoing packets within the current sample interval (errors per second)
indiscarded	Average drop rate for incoming packets within the current sample interval (packets per second)
outdiscarded	Average drop rate for outgoing packets within the current sample interval (packets per second)
inunicast	Average packet receive rate for unicast within the current sample interval (packets per second)
type	Whether PUBLIC or PRIVATE
innonunicast	Average packet receive rate for multi-cast (packets per second)
latency	Estimated latency for this network interface card (msec)
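
For the NICS view, a quick health check is to look for any non-zero error or discard rate. The sketch below keeps all values as strings (the `latency` field can be printed as `<1`) and converts only the error-related fields. The function names and the list of fields treated as problem indicators are choices made for this example.

```python
import re
from typing import Dict

_PAIR = re.compile(r"(\w+):\s+(\S+)")
_ERROR_FIELDS = ("nicerrors", "errsin", "errsout", "indiscarded", "outdiscarded")


def parse_nic_line(line: str) -> Dict[str, str]:
    """Parse one NICS line; the first token is the interface name."""
    name, _, rest = line.strip().partition(" ")
    record = {"nic": name}
    record.update(_PAIR.findall(rest))
    return record


def nic_problems(record: Dict[str, str]) -> Dict[str, float]:
    """Return the error/discard rates of this interface that are above zero."""
    return {
        field: float(record[field])
        for field in _ERROR_FIELDS
        if float(record.get(field, "0")) > 0
    }


if __name__ == "__main__":
    line = ("bondib0 netrr: 5.300 netwr: 4.181 neteff: 9.480 nicerrors: 0 "
            "pktsin: 13 pktsout: 14 errsin: 0 errsout: 0 indiscarded: 0 "
            "outdiscarded: 0 inunicast: 13 innonunicast: 0 type: PRIVATE latency: <1")
    print(nic_problems(parse_nic_line(line)) or "no errors reported")
```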

FILESYSTEMS View Metric Descriptions

Metric	Description
total	Total amount of space (KB)
used	Amount of used space (KB)
available	Amount of available space (KB)
used%	Percentage of used space (%)
mft%	Percentage of master file table utilization
ifree%	Percentage of free file nodes (%) Note: This metric is not available on Windows systems.
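
The FILESYSTEMS lines lend themselves to a simple space check. The sketch below parses a line and flags file systems above a usage threshold or low on free file nodes. The thresholds, the function names, and the treatment of a negative `ifree%` (as printed for rootfs in the sample) as "not reported" are assumptions for this example.

```python
import re
from typing import Dict, List, Union

_PAIR = re.compile(r"([\w%]+):\s+(\S+)")
_TAGS = re.compile(r"\[([^\]]*)\]")


def parse_filesystem_line(line: str) -> Dict[str, Union[str, List[str]]]:
    """Parse one FILESYSTEMS line, including the bracketed list at the end."""
    record: Dict[str, Union[str, List[str]]] = dict(_PAIR.findall(line))
    tags = _TAGS.search(line)
    record["tags"] = tags.group(1).split() if tags else []
    return record


def is_nearly_full(record: Dict[str, Union[str, List[str]]],
                   used_limit: int = 80, ifree_limit: int = 10) -> bool:
    """True if used% is at or above used_limit or ifree% is below ifree_limit."""
    used = int(str(record["used%"]))
    ifree = int(str(record.get("ifree%", "-1")))
    # Assumption: a negative ifree% means the value is not reported.
    return used >= used_limit or 0 <= ifree < ifree_limit


if __name__ == "__main__":
    line = ("mount: /u01 type: ext3 total: 103212320 used: 59980932 "
            "available: 37988508 used%: 61 ifree%: 96 [ORACLE_HOME IUBBRM2 IUBDB2]")
    fs = parse_filesystem_line(line)
    print(fs["mount"], "nearly full" if is_nearly_full(fs) else "ok")
```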

PROTOCOL ERRORS View Metric Descriptions

Metric	Description
IPHdrErr	Number of input datagrams discarded due to errors in their IPv4 headers
IPAddrErr	Number of input datagrams discarded because the IPv4 address in their IPv4 header's destination field was not a valid address to be received at this entity
IPUnkProto	Number of locally-addressed datagrams received successfully but discarded because of an unknown or unsupported protocol
IPReasFail	Number of failures detected by the IPv4 reassembly algorithm
IPFragFail	Number of IPv4 discarded datagrams due to fragmentation failures
TCPFailedConn	Number of times that TCP connections have made a direct transition to the CLOSED state from either the SYN-SENT state or the SYN-RCVD state, plus the number of times that TCP connections have made a direct transition to the LISTEN state from the SYN-RCVD state
TCPEstRst	Number of times that TCP connections have made a direct transition to the CLOSED state from either the ESTABLISHED state or the CLOSE-WAIT state
TCPRetraSeg	Total number of TCP segments retransmitted
UDPUnkPort	Total number of received UDP datagrams for which there was no application at the destination port
UDPRcvErr	Number of received UDP datagrams that could not be delivered for reasons other than the lack of an application at the destination port
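
Because the protocol error values are cumulative since system startup, a single node view says little about what is happening now; the difference between two samples is more telling. Here is a minimal sketch of that comparison. The function names are made up, and the second sample in the usage example is hypothetical data, not real output.

```python
import re
from typing import Dict

_COUNTER = re.compile(r"(\w+):\s+(\d+)")


def parse_counters(line: str) -> Dict[str, int]:
    """Parse a PROTOCOL ERRORS line into a dict of cumulative counters."""
    return {name: int(value) for name, value in _COUNTER.findall(line)}


def counter_deltas(earlier: Dict[str, int], later: Dict[str, int]) -> Dict[str, int]:
    """Return how much each counter grew between two samples."""
    return {name: later[name] - earlier[name] for name in later if name in earlier}


if __name__ == "__main__":
    first = parse_counters(
        "IPHdrErr: 0 IPAddrErr: 0 IPUnkProto: 0 IPReasFail: 0 IPFragFail: 0 "
        "TCPFailedConn: 35 TCPEstRst: 10738 TCPRetraSeg: 1203010 UDPUnkPort: 224 UDPRcvErr: 0"
    )
    # Hypothetical later sample, invented for illustration.
    second = dict(first, TCPRetraSeg=1203050, TCPEstRst=10740)
    changed = {k: v for k, v in counter_deltas(first, second).items() if v != 0}
    print(changed)
```

A negative delta would suggest that the counters were reset between the two samples, for example by a reboot.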

DBMentors - Inam Bukhari's Blog

Wednesday, December 24, 2014