Nmon CPU_SUMM vs CPU_ALL in an SMT environment

On an SMT CPU, each physical core exposes a primary logical processor and one or more SMT logical processors. Processes can be scheduled on any of them, but a primary logical processor and an SMT one do not deliver the same performance. Usually the SMT one provides about 20-30% of the power of the primary one.

This impacts the way performance is measured.

The nmon tool reports two different pieces of information related to this. The CPU_ALL report gives you the real CPU consumption, meaning the real capacity of your environment to take on more load.

CPU_SUMM gives information CPU by CPU, applying the efficiency factor to the SMT logical cores. As a consequence, the load on these CPUs is lower than on the primary ones (and never reaches 100%), so the overall figure is lower than CPU_ALL and should never reach 100% (this point still needs to be confirmed). However, the load of each CPU can be compared with the others, and the average load is expressed relative to the capacity of a primary CPU.

Moreover, as the SMT VPs use the cycles lost by the primary VP, the sum of the primary and SMT loads for one core should not exceed 100%.

So we can also consider that the average, over all cores, of the primary + SMT sums given by CPU_SUMM is equivalent to CPU_ALL.
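To make that last statement concrete, here is a minimal Python sketch of the calculation. It assumes the per-CPU loads are already expressed against the primary CPU capacity and are ordered core by core (primary first, then its SMT threads). The function name and the sample figures are made up for illustration; they do not come from nmon.

```python
def core_average(per_cpu_load, threads_per_core=2):
    """Average, over all cores, of the summed load of each core's logical CPUs.

    per_cpu_load: loads in percent, ordered core by core
                  (primary first, then its SMT threads) -- an assumption.
    """
    if threads_per_core < 1 or len(per_cpu_load) % threads_per_core != 0:
        raise ValueError("per_cpu_load must contain whole cores")
    # Sum the primary + SMT loads of each core.
    core_sums = [
        sum(per_cpu_load[i:i + threads_per_core])
        for i in range(0, len(per_cpu_load), threads_per_core)
    ]
    # Average those per-core sums.
    return sum(core_sums) / len(core_sums)


# Two cores with one SMT thread each (hypothetical figures).
print(core_average([70.0, 20.0, 50.0, 10.0]))  # (90 + 60) / 2 = 75.0
```

If the reasoning above holds, the value returned should be close to what CPU_ALL shows for the same interval.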

I discovered this while trying to understand why Oracle Enterprise Manager (OEM) gave a CPU utilization lower than the one reported by nmon. OEM computes CPU usage in another way, based on counters, calculating the counter difference every 5 minutes.
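I do not know the exact OEM implementation, so the following is only a sketch of the general counter-difference approach described above: take two snapshots of cumulative time counters and compute the busy share of the elapsed time. The dictionary keys and the sample values are assumptions chosen for the example.

```python
FIELDS = ("user", "sys", "wait", "idle")


def utilization_from_counters(prev, curr):
    """CPU utilization in percent between two cumulative counter snapshots.

    prev, curr: dicts holding cumulative ticks for each name in FIELDS.
    Only user + sys time is counted as busy here (an assumption).
    """
    deltas = {name: curr[name] - prev[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return 0.0  # no time elapsed, or the counters were reset
    busy = deltas["user"] + deltas["sys"]
    return 100.0 * busy / total


# Two snapshots taken 5 minutes apart (hypothetical values).
before = {"user": 1000, "sys": 200, "wait": 50, "idle": 8750}
after = {"user": 1600, "sys": 400, "wait": 100, "idle": 9900}
print(utilization_from_counters(before, after))  # 800 busy / 2000 total = 40.0
```

Nothing in this calculation knows whether a tick was spent on a primary or on an SMT logical CPU, which is the weakness discussed next.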

The counters measure the user, system, wait and idle time consumed on each of the virtual CPUs. The problem, from my point of view, is that these timers do not take the SMT factor into account, so they are not fully relevant in this case. Imagine we have 2 logical CPUs, 1 primary and 1 SMT, and the system runs 2 processes for 1 ms: both consume this time, but the primary one does more work than the SMT one during that time. As a consequence, if I consider my system loaded at 50% (running on the primary only) and I decide to double the load on it, by adding users for example, the system will not be able to support it. The real margin I have is not 50% but something like 25%, corresponding to the SMT factor.
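The example above can be put into numbers with a small Python sketch. It compares a purely time-based utilization with one weighted by the relative power of each logical CPU. The weight of 0.25 for the SMT thread is an assumption taken from the 20-30% estimate, and the function names are hypothetical.

```python
def time_based_utilization(busy):
    """Percent of logical-CPU time that was busy; all CPUs count the same."""
    return 100.0 * sum(busy) / len(busy)


def capacity_utilization(busy, weights):
    """Percent of deliverable work in use, weighting each CPU by its power."""
    used = sum(b * w for b, w in zip(busy, weights))
    return 100.0 * used / sum(weights)


def headroom(busy, weights):
    """Extra load, in percent of the current load, that can still be absorbed."""
    used = sum(b * w for b, w in zip(busy, weights))
    return 100.0 * (sum(weights) - used) / used


busy = [1.0, 0.0]      # primary fully busy, SMT thread idle
weights = [1.0, 0.25]  # assumed relative power: primary = 1, SMT = 0.25

print(time_based_utilization(busy))         # 50.0
print(capacity_utilization(busy, weights))  # 80.0
print(headroom(busy, weights))              # 25.0
```

The counters say the system is half loaded, while under these assumptions it can only take about a quarter more work.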

But this method also has a consequence: if all my cores are used the whole time, the measured load will be 100%, unlike with CPU_SUMM as seen before.

In conclusion, I think we have to be careful with the Oracle way of measuring CPU utilization, as it is not a linear indicator. The higher the SMT ratio (number of SMT cores / physical cores) and the lower the SMT factor (power ratio), the less linear the CPU utilization becomes. This also really depends on the way the scheduler assigns jobs to the primary or SMT VPs.
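To illustrate the non-linearity, here is a simplified Python model of the utilization a time-based counter would report for a given amount of work. It assumes the scheduler fills the primary logical CPUs first and only then the SMT ones, and that SMT capacity simply adds to primary capacity. Both are simplifications, and the function and parameter names are hypothetical.

```python
def reported_utilization(work, cores=1, smt_threads=1, smt_factor=0.25):
    """Time-based utilization (percent) reported for `work`.

    work: offered load, in units of one primary CPU's capacity.
    Assumes primaries are filled first, then SMT threads (a simplification).
    """
    if smt_factor <= 0:
        raise ValueError("smt_factor must be positive")
    primary_capacity = cores
    smt_capacity = cores * smt_threads * smt_factor
    if work < 0 or work > primary_capacity + smt_capacity:
        raise ValueError("work exceeds what the system can deliver")
    # Number of primary logical CPUs kept busy.
    primary_busy = min(work, primary_capacity)
    # Each SMT logical CPU only delivers smt_factor units of work when busy.
    smt_busy = (work - primary_busy) / smt_factor
    logical_cpus = cores * (1 + smt_threads)
    return 100.0 * (primary_busy + smt_busy) / logical_cpus


for work in (0.5, 1.0, 1.125, 1.25):
    print(work, reported_utilization(work))
# 0.5 -> 25.0, 1.0 -> 50.0, 1.125 -> 75.0, 1.25 -> 100.0
```

In this model the first 50% of reported utilization corresponds to one full unit of work, while the second 50% corresponds to only 0.25 unit.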
