Monitoring for Suspect Pages

Recently I have spoken with a number of people about the key things that they monitor when it comes to SQL Server, with the key metrics CPU, Memory, and I/O mentioned. Some others set up alerts for specific errors or severity of errors, however one thing that comes up very rarely is monitoring for Suspect Pages in the dbo.suspect_pages table located in the MSDB database. It is my view that having a mechanism in place to monitor and alert on this table is something that every DBA should be doing as part of their standard monitoring and server health management.

Database Corruption

Avoiding and preventing database corruption is something that we all want, however the simple answer is that you cannot prevent corruption. As a result, we need to look at identifying potential corruption as early as possible, and then recover from it with as little impact as we can. The key to this is taking a defense-in-depth approach. There are a number of key tasks that we ought to be doing to help catch corruption, and these are:

Ensure that user databases are using CHECKSUM page verification
Perform DBCC CHECKDB on a regular basis as often as is feasible
Use the CHECKSUM option when taking backups
Monitor and alerts for 823, 824, & 825 errors
Monitor the content of dbo.suspect_pages in MSDB

It is the last item in this list that we will be looking at in this post.

What is dbo.suspect_pages?

dbo.suspect_pages is a table that resides in the MSDB database and is where SQL Server logs information about corrupt database pages (limited to 1,000 rows) that it encounters, not just when DBCC CHECKB is run but during normal querying of the database. So if you have a DML operation that accesses a corrupt page, it will be logged here, this means that you have a chance of identifying a corruption in your database outside of the normal DBCC CHECKDB routine.

A full explanation of dbo.suspect_pages can be found here, and even more details are available here.

Monitoring For Corruption

When monitoring this table it is important to understand that there are a number of different status values that will be set based on the type of corruption detected and whether it has been fixed or not. As such a simple check for more than one record returned from the table is not going to be sufficient.

Status ID	Status Description
1	823 or 824 Error that was not a bad Checksum or Torn Page
2	Bad Checksum
3	Torn Page
4	Restored
5	Repaired
7	Deallocated by DBCC

First and foremost we are interested in the first four status values, these indicate that the corruption is present and has not yet been resolved and as such needs urgent attention. We want to identify these pages and which database they are in, this is easy enough to do when we join out to sys.databases and sys.master_files, as seen here:

SELECT d.name AS databaseName,
  mf.name AS logicalFileName,
  mf.physical_name AS physicalFileName,
  sp.page_id,
  case sp.event_type
    when 1 then N'823 or 824 error'
    when 2 then N'Bad Checksum'
    when 3 then N'Torn Page'
    when 4 then N'Restored'
    when 5 then N'Repaired'
    when 7 then N'Deallocated'
  end AS eventType,
  sp.error_count,
  sp.last_update_date
from msdb.dbo.suspect_pages as sp
join sys.databases as d ON sp.database_id = d.database_id
join sys.master_files as mf on sp.[file_id] = mf.[file_id]
  and d.database_id = mf.database_id;

The result of this query will give you a high level view of where you have potential corruption in your databases, from here it is important to use tools such as DBCC CHECKDB and your backups to recover from in line with your RPO and RTO.

Monitoring Tasks – Suspect Pages

When it comes to monitoring the dbo.suspect_pages table, we are most interested in identifying the number of suspect pages that have been encountered and logged; when this value is greater than 0, then we want to have some form of notification. Here we have a couple of options; if you have a monitoring tool in place such as SQL Sentry Performance Advisor, then writing a custom condition to perform this check is probably your best option. If not, then I would use a scheduled job to query the table, and then notify from there via database mail or failing the job; it is my personal preference to have the job step to send an e-mail if I am not using more standardized monitoring and alerting.

Monitoring Tasks – Fixed Pages

In addition to tracking the presence of suspect pages, it is also very important to track where you have records that indicate suspect pages have been fixed in one form or another, especially so if you are using Availability Groups or Database Mirroring in Enterprise Edition. The reason for this is that both of these features have a capability known as Automatic Page Repair, which allows the suspect page to be repaired by retrieving a good (not-yet-corrupted) copy of the page from the mirror/replica. As such, monitoring for when you have fixed records is just as important, as you could have a potential corruption issue that you would not otherwise know about.

If you have SQL Sentry software, a set of Custom Conditions can be downloaded here. There are three conditions:

one to monitor current suspect pages;
one to watch for an increase in fixed pages; and,
one to alert when the table gets close to 1,000 rows.

(In a future post, I’ll go into more detail about these conditions.)

If you don’t have SQL Sentry, a code sample for a SQL Agent job step can be found here. This code uses Database Mail to send an e-mail with a short summary of the number of suspect and fixed pages in the subject line and the results of the query as the e-mail body.

Managing dbo.suspect_pages

As I mentioned earlier in this post, the dbo.suspect_pages table only has a capacity of 1,000 rows; as a result, it is important to have a maintenance job in place to clear out entries that have been dealt with. Microsoft has an article on MSDN here that gives some basic insight into this process, however the general recommendation that is made around simply deleting rows where the event_type value is set to 4 / 5 / 7 does not really take into account the fact that Automatic Page Repair could be at work in the background. So I would recommend that you not automate this clean-up but rather make it part of your recovery process when dealing with suspect pages or fixed pages that have been identified as part of your monitoring.

Summary

Hopefully now you have a little bit more insight into another tool that you can use in the battle against database corruption; this little table has the potential to give you insight into possible corruption issues in your databases sooner rather than later if you are encountering issues during normal processing. Unfortunately this will not highlight suspect pages that are not being regularly accessed; but, for the minimal effort of putting some monitoring in place, it has the potential to add a lot of benefit.