Subscribe

RSS Feed (xml)

Powered By

Skin Design:
Free Blogger Skins

Powered by Blogger


Friday, September 26, 2008

How SQL Cluster Resource DLLs work with Cluster Components to Improve the Failover Process

The objective of this article is to explain the SQL Server clustering internal process. This article will explain the functions used to interact with the cluster components. The article doesn’t really show how to cluster SQL Server. Most of this article talks about SQL Server 2000 but the same information applies to later versions of SQL Server. The article doesn’t talk about SQL Resources and their functionality in a clustered environment. The functionality of SQL Server resources are the same as in a standard environment (e.g. without clustering). The focus of this article is more on the Resource DLLs used by SQL Server in a clustered environment.

Windows Clustering and SQL Server
Microsoft introduced its first version of clustering software in Windows NT 4.0 Enterprise Edition. Microsoft has significantly improved the clustering software in Windows 2000, Windows Server 2003 and Windows Server 2008. There are two types of clustering technologies: Server Cluster (formerly known as MSCS) and NLB. MSCS or Server Cluster is basically used for High Availability and NLB is used to load balance the TCP/IP traffic. The MSCS or Server Cluster is also known as Failover Clustering.

SQL Server supports Failover Clustering. Clustering support for SQL Server was first introduced in Version 6.5 Enterprise Edition.

Windows Clustering includes many components such as Cluster Service, Resource Monitors, Node Manager, Membership Manager, Event Log Processor, Failover Manager, and Cluster Database Manager and so on. The whole purpose of Failover clustering is to provide high availability of application resources. Clustering doesn’t really need to decide/know how much CPU and Memory should be utilized by an application.

An application running in the clustering environment must be cluster-aware. A cluster-aware application supports the functions executed by the cluster service or its components as mentioned below in the Figure 1.1. There is no way for Cluster Service to know about the availability of resources of an application in the cluster unless the application is cluster-aware. For example, if a node holding the application resources fails, the Cluster Service running on the failed node must be notified in order to start the failover process for the application resources. Cluster Service does this by receiving the responses from Resource Monitor. The Resource Monitors monitor the application resources with the help of Resource DLLs of the application. The Resource DLLs for SQL Server are SQSRVRES.DLL and SQAGTRES.DLL.

The SQL Server 6.5 is not a fully cluster-aware application because it didn’t include its own resource DLL and it used generic resource DLL of Windows Clustering software. On other side, SQL Server 2000 and 2005 are fully cluster-aware applications. It supports/responds all the functions executed by cluster service. The main Resource DLLs of SQL Server, which supports all the functions, is SQSRVRES.DLL.

Every cluster-aware application uses its own resource DLL to monitor the status of its resources. If application is not cluster-aware then it can’t respond to the functions executed by the clustering software. As an example, SQSVRRES.DLL is the SQL Resource DLL which is used by the Resource Monitor to check the resources availability by performing two checks: IsAlive and LooksAlive. These two are the basic checks performed by the Resource Monitor for every cluster-aware application and are supported by the Resource DLL (SQSRVRES.DLL). A cluster-aware application should implement application specific functions in its Resource DLL. Clustering software doesn’t need to know about application-specific functions. Cluster Service just executes its functions and these functions are supported by the Resource DLLs. SQL Server implements many other functions in its Resource DLL. These functions are SQL application-specific and not related to cluster in anyway.



FIGURE 1.1–Cluster Components and SQL Resource DLLs.
In Figure 1.1 you can see the relationship between clustering components and Resource DLLs of the SQL Server. There are three Resource DLLs shown in above figure which controls the SQL Resources: SQSRVRES.DLL, SQAGTRES.DLL and GATHERCL.DLL.

The above DLLs are installed when the SQL Server setup realizes that is it going to operate in a clustered environment along with the following files:
VDVAPI32.DLL
VERNEL32.DLL
VSRVSVC.EXE
VTWDBLIB.DLL

SQSRVRES.DLL is the core Resource DLL which controls all the SQL Resources except SQL Server Agent Service resource and SQL Server Full-text search. The SQL Server Agent Service resource is controlled by the SQAGTRES.DLL Resource DLL. The SQL Server Full-text search resource is controlled by the GATHERCL.DLL. Other DLLs just help in failover process but not related to cluster. The SQSRVRES.DLL which sits between Resource Monitor and SQL Resources plays an important role in failover process. Without this the SQL can not function as a cluster-aware application. Also if this DLL is missing the whole purpose of SQL clustering also fails.

The above Figure also shows the functions defined in the SQSRVRES.DLL. The SQL-Specific functions are mapped with the cluster-specific functions. For example, Cluster’s IsAlive and LooksAlive functions are mapped with SQL’s IsAlive and LooksAlive respectively. However, there is no static mappings defined in the SQSRVRES.DLL but it knows which function to execute.

SQL’s IsAlive and LooksAlive functions, the basic functions, are executed by SQL Resource DLL at a predefined interval defined in the Resource DLL. LooksAlive is executed every 5 seconds and IsAlive is executed every 60 seconds. By default, the default interval for LooksAlive and IsAlive of GATHERCL.DLL is 5 seconds and 30 seconds respectively. Both the queries play an important role for monitoring of the SQL Resources. You can change the LooksAlive and IsAlive internal on the property of the Resource DLL as shown in below figure 1.2



FIGURE 1.2-Changing the default IsAlive and LooksAlive polling interval for SQL Resource DLLs



LooksAlive

As shown in above figure, LooksAlive call is executed from the SQAGTRES.DLL which in turn only checks the SQL Services such as SQL Server Agent. It does the following to check the SQL Agent Server resource:

a. Make a call to Service Control Manager (SCM) to check the registry entries of SQL Server Agent and also make sure that service is up and running. SCM uses its own internal mechanism to check the status of the server. The status returned by the SCM to SQAGTRES.DLL may be either FALSE or TRUE. SQAGTRES.DLL decides the next action on basis of the status returned by the SCM.
b. After SCM, the SQL Server Agent resource is checked in the cluster database with the help of Resource Monitor.

LooksAlive runs under the security context of Cluster Service Account to check the status of SQL Server Agent resource. LooksAlive call doesn’t guarantee that the SQL Virtual Server, system database and SQL instance are operational. This is taken care by the IsAlive call which is discussed in the next section.

IsAlive
The IsAlive is more detailed check against all the SQL resources. SQL’s IsAlive in SQSRVRES.DLL is implemented in such a way that it performs all the checks for SQL Resources. It checks to make sure all the:
  • SQL resources are online.
  • SQL resources are configured with correct dependencies.
  • Dependent SQL Resources are online.
  • The registry entries for SQL resources are configured correctly.
  • SQL Virtual Server Instance is operating normally.
  • SQL System Databases are functional by performing a SQL Transact query.

Note: The IsAlive call doesn’t perform any check against the user databases.

Please note that Resource Monitor executes IsAlive and LooksAlive queries against the whole Cluster Group. It is the responsibility of the Resource DLL (SQSRVRES.DLL) to execute its own IsAlive and LooksAlive against its resources.

When you setup SQL cluster for the first time, the Cluster Service running on the node takes a snapshot of the cluster configuration and saves in HKLM\Cluster key as shown in Figure 1.1. This Key contains the cluster configuration such as resources name, their GUID, node holding the resources and status. This is generally called cluster configuration database. As an example, for SQL it includes the following resources:

Resource Name GUID Node Name Status Flags
SQL Server Network Name {GUID1} Node1 Online 1
SQL Server IP Address {GUID2} Node1 Online 1
SQL Server {GUID3} Node1 Online 1
SQL Server Agent {GUID4} Node1 Online 1
SQL Server Full-Text {GUID5} Node1 Online 0
MSTDC {GUID6} Node1 Online 0

Before Resource Monitor executes any cluster function against the SQL Cluster Group, it decides to look at the cluster configuration database to check the status of all the resources and their GUIDs. For example, there is a cluster group by name “SQLVS”. All the SQL resources reside in this group. When IsAlive interval expires, the Resource Monitor executes the IsAlive call against the “SQLVS” Cluster Group. It passes the Resource GUID and the Status of SQL resources to SQL Resource DLL (SQSRVRES.DLL). SQSRVRES.DLL in turn executes IsAlive call to check the resources availability. Please note that SQSRVRES.DLL doesn’t really know about the status of SQL Resources. It is the Resource Monitor who supplies this information to SQSRVRES.DLL at the time of executing IsAlive queries.

Next functions are Open, Close, Online and Offline. These functions are called whenever the SQL Resources are moved or taken offline/online or when there is a need to call them. For example, you might want to take SQL Resources offline for maintenance purpose on a node. In that case, Resource Monitor executes Offline function and in turn SQSRVRES.DLL executes the Offline function to take the resources offline. We will discuss these functions later in this article with an example. As a whole, these functions are executed by the Cluster Service and supported by SQL Resource DLL. That’s why SQL 2000 and later versions are known as a pure cluster-aware application!

Resource Monitors determine the state of resources by checking the flag value in the registry. This value could be either 1 or 0. 1 is for Online and 0 is for Offline. If you stop an SQL Service on a cluster node from the Cluster Administrator, the value 0 is set for that service or resource in the registry. If you stop the service using command line or any other tool, the value is not set. It is left intact because this operation occurred out of the cluster operation. Any operation occurred out of the cluster doesn’t reflect any changes to the cluster configuration database. In this case, the IsAlive query may not function correctly because in next interval, IsAlive will not take any action on stopped service because the value supplied by the Resource Monitor indicates that the SQL Service or resource is already online. Please note these values are not maintained by the SQL Resource DLL rather maintained by the Resource Monitor and supplied at the time of performing IsAlive queries.


Status Messages and Resource Monitor
The status messages shown above are generated through IsAlive and LooksAlive calls. The messages generated by both the calls are same. In our example, we will discuss using IsAlive queries. When IsAlive interval expires, Resource Monitor executes IsAlive call against the whole cluster group and this, in turn, is handled by SQL Resource DLL to run the query against its resources except SQL Service and SQL Server Text-Search resource. The messages returned by these calls include the following:
  • Online/Offline
  • Online/Offline Pending
  • Failed
  • FALSE

The above status messages are passed back to Resource Monitor which in turn report to Cluster Service to take any action.

As shown in Figure 1.1, the Resource Monitor sits between the SQL Resource DLLs and Cluster Service. Any call made to SQL resources has to take place in the Resource DLL of the SQL Resource DLLs (either SQSRVRES.DLL or SQAGTRES.DLL). For example, if Cluster Service needs to check the availability of SQL resources, it will make a call to Resource Monitor; the Resource Monitor in turn will poll the SQL Resource DLLs to check the status of the SQL Resources. If Resource Monitor doesn’t receive any response from SQSRVRES.DLL or SQAGTRES.DLL or it can not detect the resources availability, it will pass the status back to Cluster Service. Cluster Service then passes this status message to related Managers as shown in the Figure 1.1. The related Managers could be Failover Manager. Managers take the action as per the status passed by lower layer components. The status message could indicate a failure of SQL resources or could indicate a simple status message. These messages and cluster actions are discussed later in this article with an example.

Moreover, if function executed by Resource Monitor doesn’t exist in the Resource DLL of the SQL Server, the request is simply discarded and no operation is carried out. In this case a event is logged in the event viewer. In fact, this happens only in case of the cluster-unaware applications.

How does SQL Resource DLLs help in failover process?
SQL Server doesn’t really utilize its own mechanism to failover the resources on the surviving node rather SQL Resource DLLs are written to “support” the failover process. The following figure shows a simple failover process:



FIGURE 1.3 – SQSRVRES.DLL and Status Messages in Cluster Failover Process

After IsAlive interval expires, Cluster Service asks the Resource Monitor to report the status of SQL resources, obviously after 60 seconds.

Resource Monitor checks the status of SQL Resources in Cluster configuration database (HKLM\Cluster). It provides SQSRVRES.DLL with the SQL Resource GUID and its current status.

SQSRVRES.DLL executes its IsAlive after it receives a signal from Resource Monitor to perform a check on the SQL Resources. It checks and reports back the status messages back to Resource Monitor. SQSRVRES.DLL reports the following status messages: Online/Offline, Online/Offline Pending, Failed, FALSE

After Resource Monitor receives the status, it, first, compares the status messages received from SQSRVRES.DLL with the stored one (Cluster configuration database). It takes the action as per the status messages reported by the SQSRVRES.DLL listed below:

a. If comparison is successful, no action is taken. For example, status message received in step 2 is “Online” and SQSRVRES.DLL also reports the same status.
b. If comparison is unsuccessful, the following actions are taken.

For example, if status message received in step 2 is “Online”, and SQSRVRES.DLL reports the status message: “Offline”. Resource Monitor executes an “Online” function. SQSRVRES.DLL receives this message from Resource Monitor and executes Online function to bring the SQL resource online. If resource is restarted successfully or it comes online, the end of this process is the status messages will reflect same in next interval.

Note: Actually speaking, Resource Monitor doesn’t take any action for Online/Offline status messages because an Administrator might have stopped the resource for maintenance purpose but the same should also reflect in Cluster configuration database before IsAlive queries. Resource Monitor takes action only when the comparison is not successful as stated above. Moreover, there shouldn’t be any inconsistencies in the Cluster configuration database. If there is any inconsistencies, that wouldn’t be longer than 60 seconds because IsAlive calls performed by Resource Monitor always updates the status in Cluster configuration database.

5. The mechanism isn’t really straight forward. There could be one more message returned by the SQSRVRES.DLL that is “Failed”. Before Resource DLL reports the “Failed” message back to Resource Monitor, it executes the Online function to bring the SQL resources online. When Online function fails, It retries five more times by performing a SQL Transact query; “SELECT @@SERVERNAME Transact SQL”. If all five retries fail, the resource is considered to be failed. The status is sent back to Resource Monitor.

(Please note SQSRVRES.DLL doesn’t really implement a separate Restart function rather it always uses its own implemented Online function to bring the “Failed” resources online). If resource doesn’t come online within the specified interval or after few attempts, the resource is considered to be failed as stated above.

6. After a resource has failed, the message is passed back to Resource Monitor. Cluster Service receives this message from Resource Monitor and starts the failover process with the help of Failover Manager.

7. If the resource is started successfully after few attempts, the failover process doesn’t occur. After next IsAlive interval these steps are repeated.

8. One more status message is “FALSE” that is returned by the IsAlive call to Resource Monitor when SQL Server is busy and can’t respond to the IsAlive requests. In this case, Resource DLL doesn’t retry five times and failover process starts.

If there is no Resource DLL for SQL Server, the failover process could take a longer time to move the resources from one node to surviving node. Because SQL Resource DLL is competent enough to handle the cluster functions executed by the Clustering Software, it doesn’t need to wait to decide which action to take. As stated above, the cluster-aware functions are mapped with SQL-specific functions, so it is easier for SQL Resource DLL to execute the functions as soon as they are executed from the Resource Monitor.

Conclusion
To summaries, SQL 6.5 was not a fully cluster-aware application. It made use of a generic DLL of the Windows Clustering. SQL 2000 and later versions are the cluster-aware application through the use of its own Resource DLL (SQSRVRES.DLL).

You also read that the Cluster Service doesn’t talk to SQL Resource DLL directly. In fact, it uses its Resource Monitors to talk to SQSRVRES.DLL. The status messages passed by the SQL Resource DLLs are received by the Resource Monitor and then appropriate actions are taken by the Resource Monitor.

Now we also know that the SQL Resource DLLs play an important role for its resources. The Resource DLLs make SQL Server a fully cluster-aware database application. The functions executed by the Resource Monitor on behalf of cluster service are supported by the SQL Resource DLL. This makes failover process faster.

You also read about the IsAlive and LooksAlive which play an important role in failover process. Without these calls, it’s not possible for SQL Resource DLL to report back the status of its resources.

To summarise:

  • There are different intervals for LooksAlive and IsAlive calls
  • LooksAlive is executed every 5 seconds
  • LooksAlive only check the SQL Services with the help of SQAGTRESS.DLL. It doesn’t perform any check against the instance of the SQL Virtual Server.
  • IsAlive is executed every 60 seconds.
  • IsAlive is executed against the system database and SQL Virtual Instance by running a SQL Transact query. The query include “SELECT @@SERVERNAME Transact SQL”
  • The failover process starts when IsAlive 5 retires attempts fail.
  • The IsAlive and LooksAlive calls run under the Cluster Service Account
  • Both the SQL Resource DLLs run under the security context of cluster service account.

No comments:

Post a Comment

Recent Posts

Archives