SQL Server Troublshooting

by Steve Gray
steve@4penny.net
VP Software Development
4Penny.net
Steve is a MCDBA specializing in both SQL Server administration and development. He has published
numerous internet articles and works full time in the field.
At times, SQL Server can seem like a frustrating black box. Your application performance
is in the pits, the network engineers
are blaming the programmers, the programmers are blaming the network engineers... you
get the picture.
Here's a primer on how to start to look at your SQL Server and determine where the bottleneck is.
In this article, we'll look at five trouble spots: CPU, memory, disk, network, and code. We'll
use the Windows Performance Monitor to look inside the box. Performance Monitor is the tool
of choice here, find it under Start > Administrative Tools > Performance.
I'm asked occasionally
about tools that monitor SQL Server, but I don't use most of them. The good ones come with a
hefty price tag, and it would be a stretch to get all our clients to spring for a license
of whatever tool we chose. I suppose they're fine for in-house DBAs, but that's not our
milieu.
CPU
Open Performance Monitor and add Processor: % Processor Time. This counter is in the default that
is already there when you open Perf Mon.

An overloaded processor has a distinctive and unmistakable performance monitor profile.
The % Processor Time trace looks like a curtain hanging down from an imaginary ceiling.
Text books quote thresholds of between 70 - 85 percent for % Processor Time, but practical
experience tells me that users will start to complain if the processor is constantly past
30-40%. The key point is that the counter is continuously high. It is normal for the trace to
show a sharp increase when any program executes; you can safely ignore spikes. Here is a good
reference for troublshooting high CPU numbers:
http://www.sql-server-performance.com/performance_monitor_counters.asp
Memory
- Memory: Available Bytes
- Memory: Page Reads/sec
- Process: Working Set:sqlserver
- SQL Server: Buffer Manager: Buffer Cache Hit Ratio
- SQL Server: Memory Manager: Total Server Memory (KB)
The
Memory: Available Bytes Shows the amount of physical memory, in bytes,
immediately available for allocation to a process or for system use. This is memory that is
not currently in use by the system.
The Memory: Page Reads/sec Shows the rate, in incidents per second, at which the
disk was read to resolve hard page faults. This counter shows numbers of read operations, without
regard to the number of pages retrieved in each operation. Hard page faults occur when a process
references a page in virtual memory that must be retrieved from disk because it is not in its
working set or elsewhere in physical memory. This counter is a primary indicator for the kinds
of faults that cause system-wide delays.
A low number for Available Bytes indicates that there may not be enough memory available;
or processes, including SQL Server, may not be releasing memory. A high number of Pages Faults/sec
indicate excessive paging. Taking a more in-depth look at individual instances of Process:Page
Faults/sec, to see if the SQL Server process, for example, has excessive paging, may be
necessary. A low rate of Pages Faults/sec (commonly 5-10 per second) is normal, as the
operating system will continue to do some house keeping on the working set.
Optimally, we'd like for SQL Server to be the only application on the server, and using most of
the memory on the box. Adding memory to a server and configuring SQL Server to use it is
beyond the scope of this article, but the more, the better.
The Process: Working Set:sqlserver instance shows the amount of memory that SQL Server
is using, in bytes. If the number is consistently lower than the amount SQL Server is configured
to use by the MIN SERVER MEMORY and MAX SERVER MEMORY options, then SQL Server is
configured for too much memory. Otherwise, you may need to increase RAM and MAX SERVER MEMORY.
Buffer Cache Hit Ratio should be consistently greater than 90. This indicates that the
data cache supplied 90 per cent of the requests for data. If this value is consistently
low, it is a very good indicator that more memory is needed by SQL Server. If Available
Bytes is low, this means that we need to add more RAM.
If Total Server Memory for SQL Server is consistently higher than the overall server memory,
it indicates that there is not enough RAM. This counter is in KB, as opposed to bytes for Process: Working Set:sqlserver
Disk
Begin disk performance monitoring by looking at the following counters:
- PhysicalDisk: Percent Disk Time
- PhysicalDisk: Current Disk Queue Length
- PhysicalDisk: Avg. Disk Queue Length
Applications and systems that are I/O-bound may keep the disk constantly active.
This is called disk thrashing.
The PhysicalDisk: Percent Disk Time counter monitors the percentage of time that the
disk is working. Check the PhysicalDisk: Current Disk Queue Length counter to see the
number of requests that are queued up waiting for disk access.
It is important at this point to be familiar with your disk subsystem. If the number
of waiting I/O requests has a sustained value more than 1.5 to 2 times the number of
spindles making up the physical disk, you have a disk bottleneck. For example, a RAID 5
configuration with seven spindles/disks would be a candidate for disk performance tuning
should the Current Disk Queue Length continually rest above 12-14.
To improve performance in this situation, consider adding faster disk drives, moving
some processes to an additional controller-disk subsystem, or adding additional disks to a RAID 5 array.
Excessive disk I/O might also be a symptom of a fragmented SQL Server database.
This article from SQL Server Magazine covers that, but it
requires a login (subscription). If you are reading this article and you don't have one,
you need it anyway.
Network
Here, we'll look at
Network Interface:Bytes Total/sec
This refers to the number of bytes per second that were sent and received by the
Network Interface. In essence, this counter measures how
busy your network interface card is, and it should be tracked over time. Analyzing
this counter would enable you to determine that you need to add extra items such as
another NIC card. Note that this value is selected by instance (with the instance
being your adapters). Note also that this counter assumes you are using TCP/IP with
the SNMP service installed (the SNMP adds several additional TCP/IP-related counters
to Performance Monitor).
Code
Now that we've got a good look at our server, it's time to analyze the code. While
the exact procedures needed to optimize queries for SQL Server are beyond the
scope of this article, we do need to be able to identify the queries that are taking
a long time to complete. To do this, open SQL Profiler and add the TSQL:SQL:StmtCompleted
event. Then go to the Filters tab and add under Duration add a value to 'Greater than or
equal'. The value is in milliseconds, so if you wanted 5 seconds you'd put 5000.
I generally start with 10 seconds or so and let it run for a few hours. Identify the longest
running queries and attack them. See what can be done to make them run faster and more efficiently.
Maybe adding an index or two would help. Also at issue are queries that run in a second, but are
called thousands of times. There is no good way to count them except by running a complete trace
and reading through it.
If you have a SQL Server and it is running slowly, the solution is probably above. I'd love to
hear your feedback, comments, and suggestions regarding the above, or any SQL topic. Reach me at
steve@4penny.net.