Wednesday, June 3, 2015

Summary for Accidental DBA Book / Chapter 3: High CPU Utilization

Chapter 3: High CPU Utilization


  • If you notice that one or more of the CPUs are working at close to maximum capacity, along with a dramatic decrease in server performance, then you've likely got a CPU issue. The source of CPU pressure is not always easy to pinpoint since what seems like a CPU problem may actually have its root cause elsewhere, such as insufficient memory, causing SQL Server to constantly move data in and out of memory. The source of the CPU pressure may also be a non-SQL Server process running on the server … P73.
  • The three main tools used to measure CPU usage, and diagnose CPU pressure, in SQL Server are Performance Monitor, SQL Trace, and the Dynamic Management Views (DMVs).
  • Performance Monitor will confirm whether the excessive CPU usage is due to SQL Server activity, is caused by other processes on the server, or comes from the operating system itself.
  • The primary PerfMon counters that are of value for monitoring CPU usage are the following with brief explanations:
    • Processor/ %Privileged Time – percentage of time the processor spends on execution of Microsoft Windows kernel commands such as core operating system activity and device drivers.
    • Processor/ %User Time – percentage of time the processor spends on executing user processes such as SQL Server. This includes I/O requests from SQL Server.
    • Process (sqlservr.exe)/ %Processor Time – the sum of processor time on each processor for all threads of the process … P74.
  • The main usage for SQL Trace is to identify specific queries that are consuming large amounts of CPU.
  • Along with a wait_type column, indicating the type of wait, the sys.dm_os_wait_stats DMV returns several useful wait times, including:
    • wait_time_ms – the total amount of time that tasks have waited on this wait type; this value includes the time in the signal_wait_time_ms column. The timer runs from the moment a task stops execution, to wait for a resource, until the point at which it resumes execution.
    • signal_wait_time_ms – the total amount of time tasks took to start executing after being signaled (i.e. after the resource it was waiting for became available); this is time spent on the runnable queue, and is pure CPU wait … P78.
  • If the signal wait time is a significant portion of the total wait time, it means that tasks are waiting a relatively long time to resume execution after the resources that they were waiting for became available. This can indicate either that there are lots of CPU-intensive queries, which may need optimizing, or that the server needs more CPU. The query in Listing 3.1 will provide a measure of how much of the total wait time is signal wait time … P79.
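  • A minimal sketch of such a measurement, in the spirit of Listing 3.1 (the book's exact query may differ):
      -- Percentage of total wait time that is signal (pure CPU) wait
      SELECT SUM(signal_wait_time_ms) AS TotalSignalWaitMs,
             SUM(wait_time_ms) AS TotalWaitMs,
             CAST(100.0 * SUM(signal_wait_time_ms) / SUM(wait_time_ms)
                  AS DECIMAL(5, 2)) AS PercentSignalWait
      FROM sys.dm_os_wait_stats;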
  • Since the wait times in this DMV are aggregated, it is hard to relate a wait time to a specific query, unless you're on a test system with only one session running. However, in SQL Server 2008, it's possible to do this if you use Extended Events. See my blog post: http://sqlblog.com/blogs/jonathan_kehayias/archive/2010/12/30/an-event-a-day-30-of-31-tracking-session-and-statement-level-waits.aspx ... P79.
  • We can also use the sys.dm_os_wait_stats DMV to find out which resource waits are the most common in our CPU-bound system, as shown in Listing 3.2, where we identify the top wait events, ordered according to the total amount of time processes have waited (wait_time_ms) on this event. Three interesting wait types to look out for, in regard to CPU pressure, are SOS_SCHEDULER_YIELD, CXPACKET and CMEMTHREAD … P80.
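  • A sketch of such a top-waits query, along the lines of Listing 3.2 (the filter on benign background waits is illustrative, not the book's exact list):
      SELECT TOP (10) wait_type,
             wait_time_ms,
             signal_wait_time_ms,
             waiting_tasks_count
      FROM sys.dm_os_wait_stats
      WHERE wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP',
                              'BROKER_TO_FLUSH', 'XE_TIMER_EVENT')
      ORDER BY wait_time_ms DESC;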
  • SOS_SCHEDULER_YIELD measures the time a task spends waiting to resume execution after voluntarily relinquishing the CPU. If queries show high wait times in sys.dm_exec_requests or sys.dm_os_waiting_tasks for the SOS_SCHEDULER_YIELD wait type, it's an indication that the query is extremely CPU-intensive. If there are high wait times for this wait type overall on the server, it can indicate either that there are lots of CPU-intensive queries, which may need optimizing, or that the server needs more CPU … P81.
  • The query shown in Listing 3.4 returns the top ten most costly queries in cache by total worker time. The *_worker_time columns represent the time spent on the CPU, and the *_elapsed_time columns show the total execution time … P83.
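  • A sketch of such a query, similar to Listing 3.4 (the column list is trimmed for brevity):
      SELECT TOP (10)
             qs.total_worker_time, qs.total_elapsed_time, qs.execution_count,
             SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
                 ((CASE qs.statement_end_offset
                       WHEN -1 THEN DATALENGTH(st.text)
                       ELSE qs.statement_end_offset
                   END - qs.statement_start_offset) / 2) + 1) AS statement_text
      FROM sys.dm_exec_query_stats AS qs
      CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
      ORDER BY qs.total_worker_time DESC;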
  • Lack of appropriate indexing is one of the most common causes of heavy CPU and I/O utilization in SQL Server.
  • Outdated statistics lead the SQL Server optimizer to choose an inappropriate plan for a query, resulting in poor performance. To determine whether a query is affected by outdated statistics, compare the estimated and actual row counts in the query's execution plan; if the two counts differ significantly, one possibility is that the statistics are outdated. The UPDATE STATISTICS <table name> statement updates all statistics on a table; to update just one specific statistics set, use UPDATE STATISTICS <table name> <statistics name> … P89.
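  • For example (the table and statistics names are hypothetical):
      -- Update all statistics on the table
      UPDATE STATISTICS dbo.Orders;
      -- Update just one specific statistics object
      UPDATE STATISTICS dbo.Orders IX_Orders_OrderDate;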
  • To help prevent outdated statistics, set the AUTO_UPDATE_STATISTICS database option to ON … P99.
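  • For example:
      ALTER DATABASE YOUR_DATABASE_NAME SET AUTO_UPDATE_STATISTICS ON;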
  • SARGable, where SARG stands for Search Argument, means that a predicate can be used in an index seek operation. The general rule for SARGable predicates is that the column should be compared directly (equality or inequality) to an expression; any function applied to the column makes the predicate non-SARGable. In other words, WHERE SomeFunction(Column) = @Value is not SARGable, whereas WHERE Column = SomeOtherFunction(@Value) is SARGable. Non-SARGable predicates can result in table or index scans and, as with missing indexes, this causes significant CPU usage as SQL Server has to read and process far more rows than necessary. Compare the execution plans of the following examples:
    • For example:
      select * from table1 where CONVERT(VARCHAR(10), recordInsertedDate) = '01/01/2010'
      is non-SARGable (assuming that recordInsertedDate is of type datetime and that there is an index on that field). Such a case can be rewritten in a SARGable form as:
      select * from table1 where recordInsertedDate >= '01/01/2010' and recordInsertedDate < '02/01/2010'
    • Another example of a non-SARGable predicate is WHERE ISNULL(SomeCol, 0) > 0 … P89, P90, P91, P92.
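    • Since rows where SomeCol is NULL fail both forms of the predicate, this one can usually be rewritten in a SARGable form as WHERE SomeCol > 0.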
  • Implicit conversions, which sometimes happen in JOINs and WHERE clauses (for example, converting varchar to nvarchar), cause the same non-SARGable problem. To fix this, make sure that any variables, parameters or constants are of the same type as the columns to which they are being compared. If they are not, make careful use of conversion functions (CAST, CONVERT) on the variables, parameters or constants so that they match the data type of the column … P94.
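  • For example (the table and column names are hypothetical; LastName is assumed to be varchar):
      -- The N'...' literal is nvarchar, which forces an implicit conversion
      -- of the varchar column and makes the predicate non-SARGable
      SELECT * FROM dbo.Customers WHERE LastName = N'Smith';
      -- Matching the column's data type keeps the predicate SARGable
      SELECT * FROM dbo.Customers WHERE LastName = 'Smith';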
  • Parameter sniffing is a process used by SQL Server when creating an execution plan for a stored procedure, function, or parameterized query. The first time the plan is compiled, SQL Server will examine, or "sniff", the input parameter values supplied and use them, in conjunction with the column statistics, to estimate the number of rows that will be touched by the query. It then uses that estimate in its costing of various possible execution plans. A problem only arises if the values passed as input parameters on initial plan creation result in a row count that is atypical of future executions of the procedure, which can significantly degrade the performance of some of those executions. Parameter sniffing only occurs at the time a plan is compiled or recompiled, and all subsequent executions of the stored procedure, function, or parameterized query will use the same plan. The following are some possible solutions (P95-P99):
    • Enable Trace Flag 4136: doing so disables parameter sniffing for the whole SQL Server instance, so the optimizer falls back on average-density estimates, which are often less accurate. This will benefit the procedures with the sniffing problem but, although this option is available, parameter sniffing is beneficial to most procedures that are written to use typical values, and turning it off may inadvertently affect their plans in a negative way. As such, this Trace Flag should be considered an absolute last resort if nothing else fixes the problem … P100.
    • Use the OPTIMIZE FOR hint to specify a parameter value for the optimizer to use when compiling the plan, or the OPTIMIZE FOR UNKNOWN hint, which instructs SQL Server not to use parameter sniffing at all … P101.
    • Use the OPTION (RECOMPILE) or WITH RECOMPILE option. This forces a recompile, and the generation of a new plan, on every execution. Row estimations will then always be based on the current parameter values, but at the cost of increasing the execution time of the procedure; these techniques are therefore useful when the overhead of the additional compilations is small compared with the performance degradation caused by reuse of inappropriate plans (a sketch of these hints follows this list) … P103.
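    • A minimal sketch of these hints in a hypothetical procedure (the three alternatives are shown together for brevity; in practice you would pick one):
        CREATE PROCEDURE dbo.GetOrdersByCustomer @CustomerID INT
        AS
        BEGIN
            -- Alternative 1: compile the plan for a specific, representative value
            SELECT * FROM dbo.Orders WHERE CustomerID = @CustomerID
            OPTION (OPTIMIZE FOR (@CustomerID = 42));
            -- Alternative 2: ignore the sniffed value and use average-density estimates
            SELECT * FROM dbo.Orders WHERE CustomerID = @CustomerID
            OPTION (OPTIMIZE FOR UNKNOWN);
            -- Alternative 3: recompile on every execution, so estimates
            -- reflect the current parameter value
            SELECT * FROM dbo.Orders WHERE CustomerID = @CustomerID
            OPTION (RECOMPILE);
        END;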
  • Ad hoc queries are statements sent to the optimizer that are not predefined using stored procedures, sp_executesql or other mechanisms that force reuse of execution plans. Ad hoc queries cause an execution plan to be generated for each and every statement. For example, take three otherwise identical queries, each with a different value hard-coded in the WHERE clause. These three statements should produce the same execution plan, but they don't: the optimizer sees them as completely different queries, and hence generates separate execution plans. (Note that, for very simple queries, SQL Server can use a technique called simple parameterization to replace the fixed values with parameters, and so allow plan reuse.) The problem with non-parameterized queries is two-fold:
    • The plan cache fills up with lots of single-use plans from ad hoc queries. This may cause reusable plans to be discarded from the cache under memory pressure.
    • The compilation of these single-use plans wastes CPU. Compilation is expensive, using relatively large amounts of CPU … P104.
  • Cases where a lack of parameterization is causing excessive plan compilation, or where simple (or forced) parameterization is attempted but fails, can be identified using the following PerfMon counters:
    • SQLServer: SQL Statistics: SQL Compilations/Sec
    • SQLServer: SQL Statistics: Auto-Param Attempts/Sec
    • SQLServer: SQL Statistics: Failed Auto-Params/Sec … P104.
  • There are two ways to solve non-parameterized ad hoc queries (the first option is preferable):
    1. Fix the application by moving those queries from the application into stored procedures, or by changing the ad hoc queries to their parameterized versions (Listings 3.18 and 3.19 are an example) … P105.
    2. If the first option is not possible, there are two options in SQL Server that can help alleviate the problem:
      1. Set the database-level PARAMETERIZATION option to FORCED, as follows:
        ALTER DATABASE YOUR_DATABASE_NAME SET PARAMETERIZATION FORCED
        The downside of this setting is that it forces SQL Server to use one plan for all matching queries, no matter what the values in the WHERE clause are, so we may run into the same parameter sniffing problems mentioned earlier … P106.
      2. Use optimize for ad hoc workloads. With this server-level setting, the first time a query runs, SQL Server caches only a small plan stub noting that the query has been seen; the full plan is cached only when the query runs a second time. This won't reduce the number of compiles for ad hoc queries, but it makes it less likely that the plan cache will grow as much, since the initial stub takes up very little memory. As such, it reduces the chances that other, reusable plans will be discarded due to memory pressure (Listing 3.21; a sketch follows this list) … P107.
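    • A sketch of option 1, in the spirit of Listings 3.18 and 3.19 (the table, column, and parameter names are hypothetical):
        -- Ad hoc: a new plan is compiled for every distinct hard-coded value
        SELECT * FROM dbo.Orders WHERE CustomerID = 42;
        -- Parameterized: one plan is cached and reused for all values
        EXEC sys.sp_executesql
            N'SELECT * FROM dbo.Orders WHERE CustomerID = @CustomerID',
            N'@CustomerID INT',
            @CustomerID = 42;
      And the server-level setting from option 2.2 (covered by Listing 3.21):
        EXEC sys.sp_configure 'show advanced options', 1;
        RECONFIGURE;
        EXEC sys.sp_configure 'optimize for ad hoc workloads', 1;
        RECONFIGURE;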

Inappropriate parallelism

  • Query parallelism is the mechanism used by the SQL query execution engine to split the work of a query into multiple threads, each of which executes on a separate scheduler. Parallel query processing can reduce the time required to process a query by partitioning the work across multiple logical CPUs. This can be very beneficial to data warehouse and reporting operations, which have a few large queries that deal with large volumes of data, and where only a few requests occur concurrently. However, it can be very detrimental to OLTP environments, where the workload consists of lots of smaller queries executing concurrently, since a parallel operation can utilize up to all of the processor cores on the server, causing other requests to wait to execute. SQL Server has two configuration options that control the parallel execution of queries by the engine (a configuration sketch follows this list):
    1. The cost threshold for parallelism option sets a cost threshold that, when exceeded, causes a parallel execution plan to be generated to execute the user request. Analyzing the results of the query in Listing 3.22, we can determine what might be an appropriate value for the cost threshold for parallelism option … P109-110.
    2. The max degree of parallelism option limits the number of processors that will share the load of executing a single query; this option exists to prevent a single query from utilizing all of the processor cores on a SQL Server. Generally speaking, the appropriate value for the max degree of parallelism option depends largely on the type of workload being executed, and on the ability of the other hardware subsystems to cope with the additional workload associated with parallel execution. If your system is experiencing parallelism-related issues (see the diagnosing inappropriate parallelism points below), then it may be necessary to limit the degree of parallelism, in conjunction with tuning the cost threshold for parallelism, to resolve the problem ... P110-113.
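  • A sketch of setting both options (the values shown are purely illustrative, not recommendations):
      EXEC sys.sp_configure 'show advanced options', 1;
      RECONFIGURE;
      -- Only plans whose estimated cost exceeds this threshold go parallel
      EXEC sys.sp_configure 'cost threshold for parallelism', 25;
      -- Cap the number of schedulers a single query can use
      EXEC sys.sp_configure 'max degree of parallelism', 4;
      RECONFIGURE;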
  • Hyper-threading is an Intel technology designed to improve parallel execution by presenting to the operating system two logical cores for each physical core. This means that instead of one scheduler per processor core you get two, and so two threads can be executed "simultaneously". Generally, enable hyper-threading unless you find a good reason to turn it off … P113-115
  • If parallel processing is causing a resource bottleneck in a specific system, the CXPACKET wait type will show up as one of the top waits for SQL Server. This wait type is set whenever a parallel worker has to wait in the exchange iterator for another worker before it can continue processing. Sometimes CXPACKET waits are a symptom of a problem rather than the problem itself. For example, when the disk I/O subsystem can't keep up with the demand generated by the parallel execution of a query, the root wait type may be IO_COMPLETION, ASYNC_IO_COMPLETION, or a PAGEIOLATCH_* wait type; if scaling out the I/O subsystem is not possible, reducing the level of parallelism can improve overall system performance. Another example: if one of the top latch waits is ACCESS_METHODS_DATASET_PARENT, in conjunction with CXPACKET, LATCH_*, and SOS_SCHEDULER_YIELD as the top wait types, the level of parallelism on the system is the cause of the bottleneck, and reducing the 'max degree of parallelism' sp_configure option may solve the problem. However, before adjusting the cost threshold for parallelism or max degree of parallelism options, always make sure that there are no missing indexes, outdated statistics, or badly written queries, because these can easily cause inappropriate parallelism … P116-117.

  • The TokenAndPermUserStore cache is an optimization that allows caching of the results of permissions checks by users against database objects. However, it could be the cause of performance problems, especially in earlier builds of SQL Server 2005. Using the query in Listing 3.23, we can check whether the cache constantly grows in size; if it does, and that growth is accompanied by queries waiting on the CMEMTHREAD wait type, then the size of the cache may be the cause of the high CPU usage. To solve the problem on SQL Server 2005, apply SP4; a better long-term solution is generally to reduce the usage of ad hoc or dynamic SQL, and move as much logic as possible into stored procedures. This cache problem should not appear on SQL Server 2008 and later, as it did on SQL Server 2005 … P118.
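  • A sketch of checking the cache's size, in the spirit of Listing 3.23 (note that on SQL Server 2012 and later the two page columns were merged into pages_kb):
      SELECT SUM(single_pages_kb + multi_pages_kb) AS CacheSizeKB
      FROM sys.dm_os_memory_clerks
      WHERE name = 'TokenAndPermUserStore';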

  • Windows Server or BIOS power saving options can have a negative impact on processor performance, as this feature under-clocks the processors on the server. Power saving also affects the % Processor Usage performance counter: the value for this counter is calculated from the currently used CPU frequency divided by the available CPU frequency. As such, an under-clocked CPU causes Windows to report higher CPU usage values, leading people to believe that the server is under heavier load than it is in reality.
  • To identify whether this is a problem on your system, use the free tool CPU-Z and look at the CPU Specification, which shows the type of processor and its rated clock speed, and the Core Speed, which shows the current clock speed of the processors in the system. If the Core Speed is lower than the rated specification, then power management is reducing the performance of the system. The first thing to do is to check the current Windows power management scheme: if it is set to Balanced, change it to High Performance. If it is already High Performance, check the BIOS power management setting: if it is set to Hardware, change it to OS Control … P121.
