Note: All the posts are based on practical approach avoiding lengthy theory. All have been tested on some development servers. Please don’t test any post on production servers until you are sure.

Wednesday, December 24, 2014

Using OCLUMON to analyze Cluster Health

Cluster Health Monitor & OCLUMON



The Cluster Health Monitor (CHM) stores real-time operating system metrics in the CHM repository that you can use for later triage with the help of Oracle Support should you have cluster issues.

CHM consists of three components: the System Monitor Service, the Cluster Logger Service, and the CHM repository.


The OCLUMON command-line tool is included with CHM and you can use it to query the CHM repository to display node-specific metrics for a specified time period. You can also use oclumon to query and print the durations and the states for a resource on a node during a specified time period.

System Monitor Service

There is one system monitor service on every RAC node. The system monitor service (osysmond) collects operating system metrics and sends them to the cluster logger service. The cluster logger service receives the information from all the nodes and persists it in the CHM repository.

[root@pk3-iub-rp-od01 bin]# ps -ef | grep osys
root 7229 27949 0 12:34 pts/0 00:00:00 grep osys
root 9865 1 4 Nov06 ? 2-05:33:49 /u01/app/11.2.0.4/grid/bin/osysmond.bin

Cluster Logger Service

There is one cluster logger service (ologgerd), and it runs on only one node in a cluster. The cluster logger service chooses another node to house the standby for the master cluster logger service. If the master fails (because the service is not able to come up after a fixed number of retries, or because the node where the master was running is down), the node where the standby resides takes over as master and selects a new node for the standby. The master manages the operating system metric database in the CHM repository and interacts with the standby to maintain a replica of that database.

[root@pk3-iub-rp-od01 bin]# ps -ef |grep olog
root      10450      1  0 Nov06 ?        10:54:06 /u01/app/11.2.0.4/grid/bin/ologgerd -m pk3-iub-rp-od02 -r -d /u01/app/11.2.0.4/grid/crf/db/pk3-iub-rp-od01
root      12824   8395  0 12:38 pts/0    00:00:00 grep olog

[root@pk3-iub-rp-od01 bin]# ./oclumon manage -get master
Master = pk3-iub-rp-od01
 Done
[root@pn3-esk-rp-od01 bin]# ./oclumon manage -get replica
Replica = pk3-iub-rp-od02
 Done
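If you want to record which node currently holds each role, the "Key = Value" lines printed above are easy to pick apart in a script. The following is a minimal sketch in Python; the function name parse_manage_output is made up for this illustration, and the sample text is simply the output shown above pasted together.

```python
import re

def parse_manage_output(text):
    """Collect 'Key = Value' lines such as those printed by `oclumon manage -get ...`.

    Lines without an '=' (for example the trailing ' Done') are ignored.
    """
    result = {}
    for line in text.splitlines():
        match = re.match(r"^\s*([^=]+?)\s*=\s*(.+?)\s*$", line)
        if match:
            result[match.group(1)] = match.group(2)
    return result

# Output copied from the two commands above.
sample = """Master = pk3-iub-rp-od01
 Done
Replica = pk3-iub-rp-od02
 Done
"""

roles = parse_manage_output(sample)
print(roles["Master"], roles["Replica"])
```

Comparing the result against an earlier run is one simple way to notice that a failover of the master logger has happened.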

CHM Repository

The CHM repository, by default, resides within the Grid Infrastructure home and requires 1 GB of disk space per node in the cluster. You can adjust its size and location, and Oracle supports moving it to shared storage. You manage the CHM repository with OCLUMON.

[root@pk3-iub-rp-od01 bin]# ./oclumon manage -get reppath
CHM Repository Path = /u01/app/11.2.0.4/grid/crf/db/pk3-iub-rp-od01
 Done

[root@pk3-iub-rp-od01 bin]# ./oclumon manage -get repsize
CHM Repository Size = 61511
 Done
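As far as I know, the 11.2 documentation describes the repsize value as a retention time in seconds, not as a size in bytes. Assuming that is the case here, a tiny Python sketch converts the number into something easier to read (the helper name is hypothetical):

```python
from datetime import timedelta

def retention_from_repsize(repsize_seconds):
    """Interpret the repsize value as a retention window.

    Assumption: the value is in seconds, as described for 11.2.
    """
    return timedelta(seconds=repsize_seconds)

retention = retention_from_repsize(61511)  # value from the output above
print(retention)                           # hours:minutes:seconds
print(round(retention.total_seconds() / 3600, 1), "hours")
```

Under that assumption the repository above holds roughly 17 hours of data, so a dumpnodeview request for an older time window would not be expected to return anything.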

OCLUMON Usage

Use the oclumon dumpnodeview command to view log information from the system monitor service in the form of a node view. A node view consists of seven views when you display verbose output (-v).

SYSTEM: Lists system metrics such as CPU COUNT, CPU USAGE, and MEM USAGE
TOP CONSUMERS: Lists the top consuming processes 
PROCESSES: Lists process metrics such as PID, name, number of threads, memory usage, and number of file descriptors
DEVICES: Lists device metrics such as disk read and write rates, queue length, and wait time per I/O
NICS: Lists network interface card metrics such as network receive and send rates, effective bandwidth, and error rates
FILESYSTEMS: Lists file system metrics, such as total, used, and available space
PROTOCOL ERRORS: Lists any protocol errors. All protocol errors are cumulative values since system startup.

-- Below retrieves the info for the last one hour for a specific node
[root@pk3-iub-rp-od01 bin]# ./oclumon dumpnodeview -n pk3-iub-rp-od01 -last "01:00:00" > /tmp/oclumon.txt
-- For a specific time window; the general form is:
./oclumon dumpnodeview -allnodes -s "time_stamp" -e "time_stamp"
[root@pk3-iub-rp-od01 bin]# ./oclumon dumpnodeview -allnodes -s "2014-12-24 10:05:00" -e "2014-12-24 10:10:00" > /tmp/oclumon.txt
-- Without -v you will get only the SYSTEM and TOP CONSUMERS views
[root@pk3-iub-rp-od01 bin]# ./oclumon dumpnodeview -allnodes -v -s "2014-12-24 10:05:00" -e "2014-12-24 10:05:10" > /tmp/oclumon.txt
-- With -warning, only node views that contain warnings are shown
[root@pk3-iub-rp-od01 bin]# ./oclumon dumpnodeview -allnodes -warning -v -s "2014-12-24 10:05:00" -e "2014-12-24 10:05:10" > /tmp/oclumon.txt
-- To see the objects (NICs, OS processes, disks) present in a node at a particular time
[root@pn3-esk-rp-od01 bin]# ./oclumon showobjects
 Following nodes are attached to the loggerd
pk3-iub-rp-od01
pk3-iub-rp-od02

[root@pn3-esk-rp-od01 bin]# ./oclumon showobjects -n pn3-esk-rp-od01 -time "2014-12-24 15:00:00"
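When the same dumpnodeview query has to be run for many time windows, it can help to build the command line in a script. Below is a rough Python sketch that only uses the options shown above (-n, -allnodes, -v, -warning, -s, -e). The function name build_dumpnodeview and its parameters are invented for this example, and the ./oclumon path is an assumption taken from the prompts above.

```python
from datetime import datetime, timedelta

# Timestamp layout used with -s and -e in the commands above.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

def build_dumpnodeview(start, end, node=None, verbose=False,
                       warning_only=False, oclumon="./oclumon"):
    """Return the argument list for an `oclumon dumpnodeview` call."""
    cmd = [oclumon, "dumpnodeview"]
    if node is None:
        cmd.append("-allnodes")
    else:
        cmd += ["-n", node]
    if warning_only:
        cmd.append("-warning")
    if verbose:
        cmd.append("-v")
    cmd += ["-s", start.strftime(TIME_FORMAT), "-e", end.strftime(TIME_FORMAT)]
    return cmd

start = datetime(2014, 12, 24, 10, 5, 0)
cmd = build_dumpnodeview(start, start + timedelta(seconds=10), verbose=True)
print(cmd)
```

Because the result is an argument list, the timestamps need no extra shell quoting. On a cluster node the list could be handed to subprocess.run(cmd, capture_output=True, text=True) to capture the node views as text.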


Sample Output


----------------------------------------
Node: pk3-iub-rp-od02 Clock: '12-24-14 10.05.01' SerialNo:64028 
----------------------------------------

SYSTEM:
#pcpus: 2 #vcpus: 32 cpuht: Y chipname: Intel(R) cpu: 2.43 cpuq: 3 physmemfree: 195428192 physmemtotal: 264536300 mcache: 29107440 swapfree: 25165816 swaptotal: 25165816 ior: 0 iow: 1135 ios: 222 swpin: 0 swpout: 0 pgin: 0 pgout: 543 netr: 42.281 netw: 115.057 procs: 1282 rtprocs: 80 #fds: 29952 #sysfdlimit: 6815744 #disks: 5 #nics: 4  nicErrors: 0

TOP CONSUMERS:
topcpu: 'osysmond.bin(9813) 5.99' topprivmem: 'java(9716) 408552' topshm: 'ora_lms2_iubDB2(12021) 5600244' topfd: 'ocssd.bin(9865) 196' topthread: 'java(9716) 47'

PROCESSES:

name: 'osysmond.bin' pid: 9813 #procfdlimit: 65536 cpuusage: 5.99 privmem: 32672 shm: 58076 #fd: 66 #threads: 12 priority: -100 nice: 0
name: 'oraagent.bin' pid: 11285 #procfdlimit: 65536 cpuusage: 0.79 privmem: 25532 shm: 17752 #fd: 89 #threads: 26 priority: 20 nice: 0
name: 'tnslsnr' pid: 11607 #procfdlimit: 65536 cpuusage: 0.39 privmem: 4392 shm: 10180 #fd: 19 #threads: 3 priority: 20 nice: 0
name: 'orarootagent.bi' pid: 11289 #procfdlimit: 65536 cpuusage: 0.39 privmem: 11720 shm: 14472 #fd: 32 #threads: 11 priority: 20 nice: 0
name: 'ocssd.bin' pid: 9865 #procfdlimit: 65536 cpuusage: 0.39 privmem: 78568 shm: 55952 #fd: 196 #threads: 26 priority: -100 nice: 0
name: 'ora_dia0_IUBDB2' pid: 11998 #procfdlimit: 65536 cpuusage: 0.39 privmem: 34392 shm: 125936 #fd: 27 #threads: 1 priority: 20 nice: 0
......
name: 'oracle+ASM2' pid: 11074 #procfdlimit: 65536 cpuusage: 0.0 privmem: 2812 shm: 18224 #fd: 16 #threads: 1 priority: 20 nice: 0
name: 'oracle+ASM2' pid: 11335 #procfdlimit: 65536 cpuusage: 0.0 privmem: 2616 shm: 17460 #fd: 18 #threads: 1 priority: 20 nice: 0
DEVICES:
dm-2 ior: 0.000 iow: 0.000 ios: 0 qlen: 0 wait: 0 type: SWAP
dm-3 ior: 0.000 iow: 25.631 ios: 6 qlen: 0 wait: 0 type: SYS
dm-1 ior: 0.000 iow: 0.000 ios: 0 qlen: 0 wait: 0 type: SYS
dm-0 ior: 0.000 iow: 542.253 ios: 135 qlen: 0 wait: 0 type: SYS
sda ior: 0.000 iow: 567.884 ios: 80 qlen: 0 wait: 0 type: SYS
sda3 ior: 0.000 iow: 0.000 ios: 0 qlen: 0 wait: 0 type: SYS
sda2 ior: 0.000 iow: 569.485 ios: 81 qlen: 0 wait: 0 type: SYS
sda1 ior: 0.000 iow: 0.000 ios: 0 qlen: 0 wait: 0 type: SYS
NICS:
lo netrr: 1.741  netwr: 1.741  neteff: 3.482  nicerrors: 0 pktsin: 11  pktsout: 11  errsin: 0  errsout: 0  indiscarded: 0  outdiscarded: 0  inunicast: 11  innonunicast: 0  type: PUBLIC 
eth0 netrr: 0.000  netwr: 0.000  neteff: 0.000  nicerrors: 0 pktsin: 0  pktsout: 0  errsin: 0  errsout: 0  indiscarded: 0  outdiscarded: 0  inunicast: 0  innonunicast: 0  type: PUBLIC 
bondeth0 netrr: 35.239  netwr: 109.135  neteff: 144.374  nicerrors: 0 pktsin: 154  pktsout: 157  errsin: 0  errsout: 0  indiscarded: 0  outdiscarded: 0  inunicast: 154  innonunicast: 0  type: PUBLIC 
bondib0 netrr: 5.300  netwr: 4.181  neteff: 9.480  nicerrors: 0 pktsin: 13  pktsout: 14  errsin: 0  errsout: 0  indiscarded: 0  outdiscarded: 0  inunicast: 13  innonunicast: 0  type: PRIVATE latency: <1

FILESYSTEMS:
mount: /u01 type: ext3 total: 103212320 used: 59980932 available: 37988508 used%: 61 ifree%: 96 [ORACLE_HOME IUBBRM2 IUBDB2]

mount: / type: rootfs total: 0 used: 0 available: 0 used%: 0 ifree%: -1 [IUBBRM2 iubDB2]

PROTOCOL ERRORS:
IPHdrErr: 0 IPAddrErr: 0 IPUnkProto: 0 IPReasFail: 0 IPFragFail: 0 TCPFailedConn: 35 TCPEstRst: 10738 TCPRetraSeg: 1203010 UDPUnkPort: 224 UDPRcvErr: 0  
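The node view is plain "key: value" text, so it can be loaded into a script for trending or alerting. The sketch below, in Python, parses the header line and a SYSTEM line like the ones in the sample output. The regular expressions are my own guess at the layout based only on this sample; a chip name containing spaces, for example, would need extra handling.

```python
import re
from datetime import datetime

HEADER_RE = re.compile(r"Node:\s*(\S+)\s+Clock:\s*'([^']+)'\s+SerialNo:\s*(\d+)")
PAIR_RE = re.compile(r"(\S+):\s+(\S+)")

def parse_header(line):
    """Parse the 'Node: ... Clock: ... SerialNo: ...' line of a node view."""
    match = HEADER_RE.search(line)
    if match is None:
        return None
    node, clock, serial = match.groups()
    return {
        "node": node,
        # The sample prints the clock as month-day-year with dots in the time.
        "clock": datetime.strptime(clock, "%m-%d-%y %H.%M.%S"),
        "serial": int(serial),
    }

def parse_metrics(line):
    """Turn 'key: value key: value ...' into a dict, numbers as floats."""
    metrics = {}
    for key, raw in PAIR_RE.findall(line):
        try:
            metrics[key] = float(raw)
        except ValueError:
            metrics[key] = raw  # non-numeric values such as cpuht: Y
    return metrics

# Lines taken from the sample output (SYSTEM line abridged).
header = parse_header("Node: pk3-iub-rp-od02 Clock: '12-24-14 10.05.01' SerialNo:64028 ")
system = parse_metrics("#pcpus: 2 #vcpus: 32 cpuht: Y chipname: Intel(R) cpu: 2.43 "
                       "cpuq: 3 physmemfree: 195428192 physmemtotal: 264536300")
print(header["node"], header["clock"], system["cpu"])
```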

Metric Descriptions

SYSTEM View Metric Descriptions

#pcpus: Number of physical CPUs in the system
#vcpus: Number of logical compute units
chipname: Type of CPU
cpuht: CPU hyperthreading enabled (Y) or disabled (N)
cpu: Average CPU utilization per processing unit within the current sample interval (%)
cpuq: Number of processes waiting in the run queue within the current sample interval
physmemfree: Amount of free RAM (KB)
physmemtotal: Amount of total usable RAM (KB)
mcache: Amount of physical RAM used for file buffers plus the amount of physical RAM used as cache memory (KB). Not available on Solaris or Windows systems.
swapfree: Amount of swap memory free (KB)
swaptotal: Total amount of physical swap memory (KB)
ior: Average total disk read rate within the current sample interval (KB per second)
iow: Average total disk write rate within the current sample interval (KB per second)
ios: Average total disk I/O operation rate within the current sample interval (I/O operations per second)
swpin: Average swap in rate within the current sample interval (KB per second). Not available on Windows systems.
swpout: Average swap out rate within the current sample interval (KB per second). Not available on Windows systems.
pgin: Average page in rate within the current sample interval (pages per second)
pgout: Average page out rate within the current sample interval (pages per second)
netr: Average total network receive rate within the current sample interval (KB per second)
netw: Average total network send rate within the current sample interval (KB per second)
procs: Number of processes
rtprocs: Number of real-time processes
#fds: Number of open file descriptors (number of open handles on Windows)
#sysfdlimit: System limit on number of file descriptors. Not available on Windows systems.
#disks: Number of disks
#nics: Number of network interface cards
nicErrors: Average total network error rate within the current sample interval (errors per second)
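A few of the SYSTEM metrics are easier to read as ratios. The Python sketch below derives three such figures from the values in the sample output. The function name and the choice of ratios are my own and are not something oclumon prints. Note that "used" memory here still includes the page cache reported in mcache.

```python
def summarize_system(m):
    """Derive a few ratios from SYSTEM view metrics (memory values in KB)."""
    mem_used_pct = 100.0 * (m["physmemtotal"] - m["physmemfree"]) / m["physmemtotal"]
    swap_used_kb = m["swaptotal"] - m["swapfree"]
    # Run queue length per logical CPU; sustained values above 1 suggest CPU pressure.
    runq_per_cpu = m["cpuq"] / m["#vcpus"]
    return {
        "mem_used_pct": mem_used_pct,
        "swap_used_kb": swap_used_kb,
        "runq_per_cpu": runq_per_cpu,
    }

# Values copied from the SYSTEM line of the sample output.
sample = {
    "#vcpus": 32, "cpuq": 3,
    "physmemfree": 195428192, "physmemtotal": 264536300,
    "swapfree": 25165816, "swaptotal": 25165816,
}
summary = summarize_system(sample)
print(round(summary["mem_used_pct"], 1), summary["swap_used_kb"], summary["runq_per_cpu"])
```

For the sample node this gives about 26% of RAM in use, no swap in use, and a run queue far below the number of logical CPUs.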

PROCESSES View Metric Descriptions

name: The name of the process executable
pid: The process identifier assigned by the operating system
#procfdlimit: Limit on number of file descriptors for this process. Not available on Windows, Solaris, AIX, and HP-UX systems.
cpuusage: Process CPU utilization (%). The utilization value can be up to 100 times the number of processing units.
privmem: Process private memory usage (KB)
shm: Process shared memory usage (KB). Not available on Windows, Solaris, and AIX systems.
workingset: Working set of a program (KB). Only available on Windows.
#fd: Number of file descriptors open by this process (number of open handles by this process on Windows)
#threads: Number of threads created by this process
priority: The process priority
nice: The nice value of the process
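The TOP CONSUMERS view only names a single winner per category. To rank processes yourself, the PROCESSES lines can be parsed and sorted. Here is a short Python sketch working on three lines from the sample output; the helper names parse_process and top_by are hypothetical, and the regular expressions assume the line layout seen in this sample.

```python
import re

NAME_RE = re.compile(r"name:\s*'([^']*)'")
FIELD_RE = re.compile(r"(#?\w+):\s+(-?\d+(?:\.\d+)?)")

def parse_process(line):
    """Parse one PROCESSES line into a dict; returns None for other lines."""
    name = NAME_RE.search(line)
    if name is None:
        return None
    proc = {"name": name.group(1)}
    # Only scan the text after the quoted name, so names such as
    # 'oracle+ASM2' cannot be mistaken for metric fields.
    for key, value in FIELD_RE.findall(line[name.end():]):
        proc[key] = float(value)
    return proc

def top_by(procs, metric, count=3):
    """Return the `count` processes with the highest value of `metric`."""
    return sorted(procs, key=lambda p: p.get(metric, 0.0), reverse=True)[:count]

lines = [
    "name: 'osysmond.bin' pid: 9813 #procfdlimit: 65536 cpuusage: 5.99 privmem: 32672 shm: 58076 #fd: 66 #threads: 12 priority: -100 nice: 0",
    "name: 'tnslsnr' pid: 11607 #procfdlimit: 65536 cpuusage: 0.39 privmem: 4392 shm: 10180 #fd: 19 #threads: 3 priority: 20 nice: 0",
    "name: 'ocssd.bin' pid: 9865 #procfdlimit: 65536 cpuusage: 0.39 privmem: 78568 shm: 55952 #fd: 196 #threads: 26 priority: -100 nice: 0",
]
procs = [p for p in map(parse_process, lines) if p is not None]
print(top_by(procs, "cpuusage", 1)[0]["name"])
print(top_by(procs, "privmem", 1)[0]["name"])
```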

DEVICES View Metric Descriptions

ior: Average disk read rate within the current sample interval (KB per second)
iow: Average disk write rate within the current sample interval (KB per second)
ios: Average disk I/O operation rate within the current sample interval (I/O operations per second)
qlen: Number of I/O requests in wait state within the current sample interval
wait: Average wait time per I/O within the current sample interval (msec)
type: If applicable, identifies what the device is used for. Possible values are SWAP, SYS, OCR, ASM, and VOTING.

NICS View Metric Descriptions

netrr: Average network receive rate within the current sample interval (KB per second)
netwr: Average network send rate within the current sample interval (KB per second)
neteff: Average effective bandwidth within the current sample interval (KB per second)
nicerrors: Average error rate within the current sample interval (errors per second)
pktsin: Average incoming packet rate within the current sample interval (packets per second)
pktsout: Average outgoing packet rate within the current sample interval (packets per second)
errsin: Average error rate for incoming packets within the current sample interval (errors per second)
errsout: Average error rate for outgoing packets within the current sample interval (errors per second)
indiscarded: Average drop rate for incoming packets within the current sample interval (packets per second)
outdiscarded: Average drop rate for outgoing packets within the current sample interval (packets per second)
inunicast: Average packet receive rate for unicast within the current sample interval (packets per second)
type: Whether PUBLIC or PRIVATE
innonunicast: Average packet receive rate for multicast (packets per second)
latency: Estimated latency for this network interface card (msec)
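In the sample output, neteff equals netrr plus netwr to within rounding on every interface. That is only an observation from this one sample, not a documented definition, but it makes a handy sanity check when parsing the NICS view. A small Python sketch, with hypothetical helper names and NIC lines abridged from the sample:

```python
import re

NIC_RE = re.compile(
    r"^(\S+)\s+netrr:\s*([\d.]+)\s+netwr:\s*([\d.]+)\s+neteff:\s*([\d.]+)"
)

def nic_rates(line):
    """Return (name, netrr, netwr, neteff) for a NICS line, else None."""
    match = NIC_RE.match(line)
    if match is None:
        return None
    name, rr, wr, eff = match.groups()
    return name, float(rr), float(wr), float(eff)

def neteff_matches_sum(rr, wr, eff, tolerance=0.002):
    """Check the observed relation neteff ~= netrr + netwr (KB per second)."""
    return abs((rr + wr) - eff) <= tolerance

lines = [
    "lo netrr: 1.741  netwr: 1.741  neteff: 3.482  nicerrors: 0",
    "bondeth0 netrr: 35.239  netwr: 109.135  neteff: 144.374  nicerrors: 0",
    "bondib0 netrr: 5.300  netwr: 4.181  neteff: 9.480  nicerrors: 0",
]
checks = {}
for line in lines:
    name, rr, wr, eff = nic_rates(line)
    checks[name] = neteff_matches_sum(rr, wr, eff)
print(checks)
```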

FILESYSTEMS View Metric Descriptions

total: Total amount of space (KB)
used: Amount of used space (KB)
available: Amount of available space (KB)
used%: Percentage of used space (%)
mft%: Percentage of master file table utilization
ifree%: Percentage of free file nodes (%). Not available on Windows systems.
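One detail worth knowing when you alert on used%: in the sample output for /u01, used plus available is less than total (presumably reserved blocks on ext3), and the reported 61% matches used / (used + available), not used / total. This is my reading of the sample, not something taken from the documentation. The Python sketch below reproduces the arithmetic; the function name is made up.

```python
import re

FS_RE = re.compile(
    r"mount:\s*(\S+)\s+type:\s*(\S+)\s+total:\s*(\d+)\s+used:\s*(\d+)"
    r"\s+available:\s*(\d+)\s+used%:\s*(\d+)"
)

def parse_filesystem(line):
    """Parse a FILESYSTEMS line and recompute the usage percentages (sizes in KB)."""
    match = FS_RE.search(line)
    if match is None:
        return None
    mount, fstype = match.group(1), match.group(2)
    total, used, available, reported = (int(v) for v in match.groups()[2:])
    usable = used + available
    return {
        "mount": mount,
        "type": fstype,
        "reported_used_pct": reported,
        # Guard against pseudo file systems that report all zeros.
        "used_pct_of_usable": 100.0 * used / usable if usable else None,
        "used_pct_of_total": 100.0 * used / total if total else None,
    }

fs = parse_filesystem(
    "mount: /u01 type: ext3 total: 103212320 used: 59980932 available: 37988508 "
    "used%: 61 ifree%: 96 [ORACLE_HOME IUBBRM2 IUBDB2]"
)
print(fs["reported_used_pct"], round(fs["used_pct_of_usable"], 1), round(fs["used_pct_of_total"], 1))
```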

PROTOCOL ERRORS View Metric Descriptions

IPHdrErr: Number of input datagrams discarded due to errors in their IPv4 headers
IPAddrErr: Number of input datagrams discarded because the IPv4 address in their IPv4 header's destination field was not a valid address to be received at this entity
IPUnkProto: Number of locally-addressed datagrams received successfully but discarded because of an unknown or unsupported protocol
IPReasFail: Number of failures detected by the IPv4 reassembly algorithm
IPFragFail: Number of IPv4 discarded datagrams due to fragmentation failures
TCPFailedConn: Number of times that TCP connections have made a direct transition to the CLOSED state from either the SYN-SENT state or the SYN-RCVD state, plus the number of times that TCP connections have made a direct transition to the LISTEN state from the SYN-RCVD state
TCPEstRst: Number of times that TCP connections have made a direct transition to the CLOSED state from either the ESTABLISHED state or the CLOSE-WAIT state
TCPRetraSeg: Total number of TCP segments retransmitted
UDPUnkPort: Total number of received UDP datagrams for which there was no application at the destination port
UDPRcvErr: Number of received UDP datagrams that could not be delivered for reasons other than the lack of an application at the destination port
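Because the protocol error counters are cumulative since system startup, a single value such as TCPRetraSeg: 1203010 says little on its own. What matters is how fast a counter grows between two node views. The Python sketch below computes per-second rates from two samples. The first sample is from the output above; the second sample and the 60-second gap are invented purely for illustration.

```python
def counter_rates(earlier, later, seconds):
    """Per-second growth of cumulative counters between two samples.

    A counter that went down is assumed to have restarted from zero
    (for example after a reboot), so its later value is used as the delta.
    """
    rates = {}
    for name, old in earlier.items():
        if name not in later:
            continue
        new = later[name]
        delta = new - old if new >= old else new
        rates[name] = delta / seconds
    return rates

# First sample: PROTOCOL ERRORS values from the sample output above.
first = {"TCPFailedConn": 35, "TCPEstRst": 10738, "TCPRetraSeg": 1203010, "UDPUnkPort": 224}
# Second sample: hypothetical values 60 seconds later (not real data).
second = {"TCPFailedConn": 35, "TCPEstRst": 10741, "TCPRetraSeg": 1203070, "UDPUnkPort": 224}

rates = counter_rates(first, second, seconds=60)
print(rates)
```

With these made-up numbers, TCP retransmissions grow by one segment per second while the failed-connection counter stays flat.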


