2012. 4. 5. 19:19 Brain Trainning/DataBase

This is a post about wait states.

A quick read-through of it will help a lot.


Source: http://www.sqlskills.com/BLOGS/PAUL/post/Wait-statistics-or-please-tell-me-where-it-hurts.aspx

How many times have you walked up to a SQL Server that has a performance problem and wondered where to start looking? 

One of the most under-utilized performance troubleshooting methodologies in the SQL Server world is one called "waits and queues" (also known simply as "wait stats"). The basic premise is that SQL Server is permanently tracking why execution threads have to wait. You can ask SQL Server for this information and then use the results to narrow down where to start digging to unearth the cause of performance issues. The "waits" are what SQL Server tracks. The "queues" are the resources that the threads are waiting for. There are a myriad of waits in the system and they all indicate different resources being waited for. For example, a PAGEIOLATCH_EX wait means a thread is waiting for a data page to be read into the buffer pool from disk. A LCK_M_X wait means a thread is waiting to be granted an exclusive lock on something.

The great thing about all of this is that SQL Server *knows* where the performance issues are, and you just need to ask it... and then interpret what it tells you, which can be a little tricky.

Now - where people sometimes get hung up is trying to track down every last wait and figure out what's causing it. Waits *always* occur. It's the way SQL Server's scheduling system works.

A thread is using the CPU (called RUNNING) until it needs to wait for a resource. It then moves to an unordered list of threads that are SUSPENDED. In the meantime, the next thread on the FIFO (first-in-first-out) queue of threads waiting for the CPU (called being RUNNABLE) is given the CPU and becomes RUNNING. If a thread on the SUSPENDED list is notified that its resource is available, it becomes RUNNABLE and is put on the bottom of the RUNNABLE queue. Threads continue this clockwise movement from RUNNING to SUSPENDED to RUNNABLE to RUNNING again until the task is completed. You can see processes in these states using the sys.dm_exec_requests DMV.
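
To see this in action, here's a minimal sketch (mine, not part of the original post) that lists the state each active request is in right now, using the sys.dm_exec_requests DMV mentioned above; the session_id > 50 filter is just a common convention for hiding most system sessions:

SELECT
    r.session_id,
    r.status,               -- running / runnable / suspended / sleeping / background
    r.command,
    r.wait_type,            -- NULL unless the request is currently waiting
    r.wait_time,            -- current wait time, in milliseconds
    r.last_wait_type,
    r.blocking_session_id
FROM sys.dm_exec_requests AS r
WHERE r.session_id > 50;    -- assumption: only user sessions are interesting here
GO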

SQL Server keeps track of the time that elapses between leaving the RUNNING state and becoming RUNNING again (called the "wait time") and the time spent on the RUNNABLE queue (called the "signal wait time" - i.e. how long does the thread need to wait for the CPU after being signaled that its resource is available). We need to work out the time spent waiting on the SUSPENDED list (called the "resource wait time") by subtracting the signal wait time from the overall wait time.
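
To make that arithmetic concrete, here's a small sketch (again mine, not from the original post) that pulls the raw counters straight out of sys.dm_os_wait_stats and derives the resource wait time by subtraction:

SELECT TOP (10)
    wait_type,
    wait_time_ms,                                                -- RUNNING -> RUNNING again
    signal_wait_time_ms,                                         -- time spent on the RUNNABLE queue
    wait_time_ms - signal_wait_time_ms AS resource_wait_time_ms  -- time spent SUSPENDED
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;
GO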

Tom Davidson of Microsoft wrote a fantastic and very accessible whitepaper on "waits and queues" which I encourage you to read: Performance Tuning Using Waits and Queues. My good friend Joe Sack (blog|twitter) who runs the MCM program also published an excellent slide deck on the subject that you can download here. And of course Books Online has a section on the sys.dm_os_wait_stats DMV that gives info on some of the newer wait types. PSS is putting together a repository of info on all the wait types but not much progress has been made. And there's a free video+demo recording as part of the MCM online training we recorded for Microsoft - see here.

You can ask SQL Server for the cumulative wait statistics using the sys.dm_os_wait_stats DMV, and many people prefer to wrap the DMV call in some aggregation code. I use code based on a query that I got from fellow-MVP Glenn Berry (blog|twitter) and then modified quite a bit. See below for the version updated to take account of the results discussed below:

WITH Waits AS
    (SELECT
        wait_type,
        wait_time_ms / 1000.0 AS WaitS,
        (wait_time_ms - signal_wait_time_ms) / 1000.0 AS ResourceS,
        signal_wait_time_ms / 1000.0 AS SignalS,
        waiting_tasks_count AS WaitCount,
        100.0 * wait_time_ms / SUM (wait_time_ms) OVER() AS Percentage,
        ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS RowNum
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN (
        'CLR_SEMAPHORE', 'LAZYWRITER_SLEEP', 'RESOURCE_QUEUE', 'SLEEP_TASK',
        'SLEEP_SYSTEMTASK', 'SQLTRACE_BUFFER_FLUSH', 'WAITFOR', 'LOGMGR_QUEUE',
        'CHECKPOINT_QUEUE', 'REQUEST_FOR_DEADLOCK_SEARCH', 'XE_TIMER_EVENT', 'BROKER_TO_FLUSH',
        'BROKER_TASK_STOP', 'CLR_MANUAL_EVENT', 'CLR_AUTO_EVENT', 'DISPATCHER_QUEUE_SEMAPHORE',
        'FT_IFTS_SCHEDULER_IDLE_WAIT', 'XE_DISPATCHER_WAIT', 'XE_DISPATCHER_JOIN', 'BROKER_EVENTHANDLER',
        'TRACEWRITE', 'FT_IFTSHC_MUTEX', 'SQLTRACE_INCREMENTAL_FLUSH_SLEEP',
        'BROKER_RECEIVE_WAITFOR', 'ONDEMAND_TASK_QUEUE', 'DBMIRROR_EVENTS_QUEUE',
        'DBMIRRORING_CMD', 'BROKER_TRANSMITTER', 'SQLTRACE_WAIT_ENTRIES',
        'SLEEP_BPOOL_FLUSH', 'SQLTRACE_LOCK')
    )
SELECT
    W1.wait_type AS WaitType, 
    CAST (W1.WaitS AS DECIMAL(14, 2)) AS Wait_S,
    CAST (W1.ResourceS AS DECIMAL(14, 2)) AS Resource_S,
    CAST (W1.SignalS AS DECIMAL(14, 2)) AS Signal_S,
    W1.WaitCount AS WaitCount,
    CAST (W1.Percentage AS DECIMAL(5, 2)) AS Percentage,
    CAST ((W1.WaitS / W1.WaitCount) AS DECIMAL (14, 4)) AS AvgWait_S,
    CAST ((W1.ResourceS / W1.WaitCount) AS DECIMAL (14, 4)) AS AvgRes_S,
    CAST ((W1.SignalS / W1.WaitCount) AS DECIMAL (14, 4)) AS AvgSig_S
FROM Waits AS W1
    INNER JOIN Waits AS W2 ON W2.RowNum <= W1.RowNum
GROUP BY W1.RowNum, W1.wait_type, W1.WaitS, W1.ResourceS, W1.SignalS, W1.WaitCount, W1.Percentage
HAVING SUM (W2.Percentage) - W1.Percentage < 95; -- percentage threshold
GO

This will show the waits grouped together as a percentage of all waits on the system, in decreasing order. The waits to be concerned about (potentially) are those at the top of the list as this represents the majority of where SQL Server is spending its time waiting. You can see that a bunch of waits are being filtered out of consideration - as I said above, waits happen all the time and these are the benign ones we can usually ignore.

You can also reset the aggregated statistics using this code:

DBCC SQLPERF ('sys.dm_os_wait_stats', CLEAR);

And of course you can very easily come up with a way to persist the results every few hours or every day and do some time-series analysis to figure out trends or automatically spot problems as they start to happen. You can also use Performance Dashboard to see these graphically in 2005 and Data Collector in 2008. On SQL Server 2000 you can use DBCC SQLPERF ('waitstats').
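
One simple way to do that persistence, shown here only as an illustrative sketch (the WaitStatsSnapshot table is my own naming, not anything built in), is to copy the DMV into a dated table from a scheduled job and diff consecutive snapshots later:

CREATE TABLE dbo.WaitStatsSnapshot
(
    capture_time        DATETIME     NOT NULL DEFAULT (GETDATE()),
    wait_type           NVARCHAR(60) NOT NULL,
    waiting_tasks_count BIGINT       NOT NULL,
    wait_time_ms        BIGINT       NOT NULL,
    signal_wait_time_ms BIGINT       NOT NULL
);
GO

-- Run this from a SQL Agent job every few hours; the deltas between
-- consecutive capture_time values give you the time series.
INSERT INTO dbo.WaitStatsSnapshot
    (wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms)
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats;
GO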

Once you get the results, you then start figuring out how to interpret them and where to go looking. The whitepaper I referenced above has a ton of good info on most of the wait types (except those added in 2008). There are various ways you can dig in deeper to this information that I'll go into in later posts.

I'm going to start blogging about wait stats analysis, either as standalone posts or as part of other things - and I've already done so in the last post (at time of writing this) in my benchmarking series.

For now, I want to report on the results of the wait stats survey I posted a couple of months back. I asked people to run the code above and let me know the results. I received results for a whopping 1823 SQL Servers out there - thank you!

Here's a graphical view of the results:

[Survey results chart: number of systems per most prevalent wait type; the same breakdown is given in the list below]

I'm not surprised at all by the top four results as I see these over and over on client systems.

For the remainder of this post, I'm going to list all the top wait types reported by survey respondents, in descending order, and give a few words about what they might mean if they are the most prevalent wait on your system. The list format shows the number of systems with that wait type as the most prevalent, and then the wait type.

  • 505: CXPACKET
    • This is commonly where a query is parallelized and the parallel threads are not given equal amounts of work to do, or one thread blocks. One thread may have a lot more to do than the others, and so the whole query is blocked while the long-running thread completes. If this is combined with a high number of PAGEIOLATCH_XX waits, it could be large parallel table scans going on because of incorrect non-clustered indexes, or out-of-date statistics causing a bad query plan. If neither of these is the issue, you might want to try setting MAXDOP to 4, 2, or 1 for the offending queries (or possibly the whole instance; there's a short MAXDOP sketch after this list). Make sure that if you have a NUMA system that you try setting MAXDOP to the number of cores in a single NUMA node first to see if that helps the problem. You also need to consider the MAXDOP effect on a mixed-load system.
  • 304: PAGEIOLATCH_XX
    • This is where SQL Server is waiting for a data page to be read from disk into memory. It commonly indicates a bottleneck at the IO subsystem level, but could also indicate buffer pool pressure (i.e. not enough memory for the workload).
  • 275: ASYNC_NETWORK_IO
    • This is commonly where SQL Server is waiting for a client to finish consuming data. It could be that the client has asked for a very large amount of data or just that it's consuming it reeeeeally slowly because of poor programming.
  • 112: WRITELOG
    • This is the log management system waiting for a log flush to disk. It commonly indicates a problem with the IO subsystem where the log is, but on very high-volume systems it could also be caused by waiting for the LOGCACHE_ACCESS spinlock (which you can't do anything about). To be sure it's the IO subsystem, use the DMV sys.dm_io_virtual_file_stats to examine the IO latency for the log file (there's an example query after this list).
  • 109: BROKER_RECEIVE_WAITFOR
    • This is just Service Broker waiting around for new messages to receive. I would add this to the list of waits to filter out and re-run the wait stats query. 
  • 086: MSQL_XP
    • This is SQL Server waiting for an extended stored-proc to finish. This could indicate a problem in your XP code. 
  • 074: OLEDB
    • As its name suggests, this is a wait for something communicating using OLEDB - e.g. a linked server. It could be that the linked server has a performance issue.
  • 054: BACKUPIO
    • This shows up when you're backing up directly to tape, which is slooooow. I'd be tempted to filter this out.
  • 041: LCK_M_XX
    • This is simply the thread waiting for a lock to be granted and indicates blocking problems. These could be caused by unwanted lock escalation or bad programming, but could also be from IOs taking a long time causing locks to be held for longer than usual. Look at the resource associated with the lock using the DMV sys.dm_os_waiting_tasks (there's an example query after this list).
  • 032: ONDEMAND_TASK_QUEUE
    • This is normal and is part of the background task system (e.g. deferred drop, ghost cleanup).  I would add this to the list of waits to filter out and re-run the wait stats query.
  • 031: BACKUPBUFFER
    • This shows up when you're backing up directly to tape, which is slooooow. I'd be tempted to filter this out.
  • 027: IO_COMPLETION
    • This is SQL Server waiting for IOs to complete and is a sure indication of IO subsystem problems.
  • 024: SOS_SCHEDULER_YIELD
    • If this is a very high percentage of all waits (hard to say, but Joe suggests 80%) then this is likely indicative of CPU pressure.
  • 022: DBMIRROR_EVENTS_QUEUE
  • 022: DBMIRRORING_CMD
    •  These two are database mirroring just sitting around waiting for something to do. I would add these to the list of waits to filter out and re-run the wait stats query.
  • 018: PAGELATCH_XX
    • This is contention for access to in-memory copies of pages. The most well-known cases of these are the PFS, SGAM, and GAM contention that can occur in tempdb under certain workloads. To find out what page the contention is on, you'll need to use the DMV sys.dm_os_waiting_tasks to figure out what page the latch is for. For tempdb issues, my friend Robert Davis (blog|twitter) has a good post showing how to do this. Another common cause I've seen is an index hot-spot with concurrent inserts into an index with an identity value key.
  • 016: LATCH_XX
    • This is contention for some non-page structure inside SQL Server - so not related to IO or data at all. These can be hard to figure out and you're going to be using the DMV sys.dm_os_latch_stats. More on this in future posts.
  • 013: PREEMPTIVE_OS_PIPEOPS
    • This is SQL Server switching to pre-emptive scheduling mode to call out to Windows for something. These were added for 2008 and haven't been documented yet (anywhere) so I don't know exactly what it means.
  • 013: THREADPOOL
    • This says that there aren't enough worker threads on the system to satisfy demand. You might consider raising the max worker threads setting.
  • 009: BROKER_TRANSMITTER
    • This is just Service Broker waiting around for new messages to send. I would add this to the list of waits to filter out and re-run the wait stats query. 
  • 006: SQLTRACE_WAIT_ENTRIES
    • Part of SQL Trace. I would add this to the list of waits to filter out and re-run the wait stats query.
  • 005: DBMIRROR_DBM_MUTEX
    •  This one is undocumented and is contention for the send buffer that database mirroring shares between all the mirroring sessions on a server. It could indicate that you've got too many mirroring sessions.
  • 005: RESOURCE_SEMAPHORE
    • This is queries waiting for execution memory (the memory used to process the query operators - like a sort). This could be memory pressure or a very high concurrent workload. 
  • 003: PREEMPTIVE_OS_AUTHENTICATIONOPS
  • 003: PREEMPTIVE_OS_GENERICOPS
    • These are SQL Server switching to pre-emptive scheduling mode to call out to Windows for something. These were added for 2008 and haven't been documented yet (anywhere) so I don't know exactly what they mean.
  • 003: SLEEP_BPOOL_FLUSH
    • This is normal to see and indicates that checkpoint is throttling itself to avoid overloading the IO subsystem. I would add this to the list of waits to filter out and re-run the wait stats query.
  • 002: MSQL_DQ
    • This is SQL Server waiting for a distributed query to finish. This could indicate a problem with the distributed query, or it could just be normal.
  • 002: RESOURCE_SEMAPHORE_QUERY_COMPILE
    • When there are too many concurrent query compilations going on, SQL Server will throttle them. I don't remember the threshold, but this can indicate excessive recompilation, or maybe single-use plans.
  • 001: DAC_INIT
    • I've never seen this one before and BOL says it's because the dedicated admin connection is initializing. I can't see how this is the most common wait on someone's system... 
  • 001: MSSEARCH
    • This is normal to see for full-text operations.  If this is the highest wait, it could mean your system is spending most of its time doing full-text queries. You might want to consider adding this to the filter list. 
  • 001: PREEMPTIVE_OS_FILEOPS
  • 001: PREEMPTIVE_OS_LIBRARYOPS
  • 001: PREEMPTIVE_OS_LOOKUPACCOUNTSID
  • 001: PREEMPTIVE_OS_QUERYREGISTRY
    • These are SQL Server switching to pre-emptive scheduling mode to call out to Windows for something. These were added for 2008 and haven't been documented yet (anywhere) so I don't know exactly what they mean.
  • 001: SQLTRACE_LOCK
    • Part of SQL Trace. I would add this to the list of waits to filter out and re-run the wait stats query.
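
As promised in the CXPACKET entry above, here's what those two MAXDOP knobs look like, purely as an illustrative sketch (dbo.SomeBigTable is a made-up placeholder, not anything from the original post):

-- Per-query: cap parallelism for one offending statement.
SELECT SomeColumn
FROM dbo.SomeBigTable        -- hypothetical table, just to show the hint placement
OPTION (MAXDOP 4);
GO

-- Instance-wide: cap parallelism for everything ('max degree of parallelism'
-- is an advanced option, so it has to be made visible first).
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 4;
RECONFIGURE;
GO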
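
For the WRITELOG entry, this is a minimal sketch (mine, not from the original post) of using sys.dm_io_virtual_file_stats to check the write latency of the log files:

-- Average write latency per log file; consistently high values point at the IO subsystem.
SELECT
    DB_NAME(vfs.database_id) AS database_name,
    mf.physical_name,
    vfs.num_of_writes,
    vfs.io_stall_write_ms,
    CASE WHEN vfs.num_of_writes = 0 THEN 0
         ELSE vfs.io_stall_write_ms / vfs.num_of_writes
    END AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats (NULL, NULL) AS vfs
    INNER JOIN sys.master_files AS mf
        ON mf.database_id = vfs.database_id
        AND mf.file_id = vfs.file_id
WHERE mf.type_desc = N'LOG'
ORDER BY avg_write_latency_ms DESC;
GO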
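
And for the LCK_M_XX and PAGELATCH_XX entries, a small sketch of the sys.dm_os_waiting_tasks query they refer to; resource_description is where the locked resource, or the page in database_id:file_id:page_id form, shows up:

SELECT
    wt.session_id,
    wt.wait_type,
    wt.wait_duration_ms,
    wt.resource_description,    -- the lock resource, or the page for PAGELATCH waits
    wt.blocking_session_id
FROM sys.dm_os_waiting_tasks AS wt
WHERE wt.session_id > 50        -- assumption: only user sessions are interesting here
ORDER BY wt.wait_duration_ms DESC;
GO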

I hope you found this interesting! Let me know if there's anything in particular you're interested in seeing or just that you're following along and enjoying the ride!

posted by LifeisSimple