This Is Your SQL Server on Machine Learning


Applying Machine Learning models to database management turns the old paradigms upside down. Folks of a certain age remember the old “this is your brain on drugs” commercials from the 80s. For this post, we are going to borrow from this analogy to observe your SQL Server on Machine Learning.

What Is the Benefit of Applying Machine Learning Models to SQL Server?

Machine Learning enables you to:

  • Predict performance trends, capacity and potential security and/or compliance breaches
  • Correlate system spikes and/or anomalous behavior to specific events, actions and code
  • Model all possible fixes and identify the remediation that has the highest likelihood for success

The Power of Influence

It all starts with understanding what factors within the database itself influence each other. This varies with each use case and is influenced by business requirements, maintenance patterns and available system resources. Basically, databases are like people. Would you expect your doctor to prescribe the same medication for three random people just because they share the characteristic of being human?

Delta Bravo’s machine learning algorithms track the relationships between critical performance metrics for each SQL Server database. Here is a heatmap that shows, for this particular database, which metrics influence each other the most. High influence is reflected by a positive number and dark red tones; no influence is reflected by zero and gray tones; negative influence is reflected by negative numbers and black tones.
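
As a rough illustration of the kind of pairwise relationship a heatmap like this summarizes, here is a minimal Python sketch that uses the Pearson correlation coefficient. This is an assumption made for illustration, not Delta Bravo’s actual algorithm, and the metric names and sample values are invented.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length series (-1 to +1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series shows no measurable relationship
    return cov / sqrt(var_x * var_y)

def influence_matrix(metrics):
    """Correlate every metric with every other metric."""
    names = list(metrics)
    return {a: {b: pearson(metrics[a], metrics[b]) for b in names} for a in names}

# Hypothetical samples taken at the same five points in time.
samples = {
    "cpu_pct": [20, 35, 50, 65, 80],
    "batch_requests": [100, 175, 250, 325, 400],
    "page_life_expectancy": [900, 750, 600, 450, 300],
}
matrix = influence_matrix(samples)
```

In this made-up data, CPU and batch requests rise together (a value near +1, the dark red end of the scale), while CPU and Page Life Expectancy move in opposite directions (a value near -1).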

Translating Models into Action

For the sake of brevity (further detail is available in our whitepaper), we’re going to focus on the following use case:

  • Identify a problematic system trend that has NOT reached a threshold* or triggered an alert
  • Quantify the trend and verify that trend is going to continue into the future
  • Associate the trend with a specific event, measure impact of event
  • Identify root cause, quantify impact, identify specific action causing impact
  • Provide remediation recommendation

The work you are about to see was performed in 4 clicks (45 seconds) using the Delta Bravo UI. 

Let’s start with a quick view of the Delta Bravo System Health panel for SQL Server Instance DemoSQL-2.

We observe a problematic trend with this SQL Server Instance’s CPU. Is this trend temporary?  Seasonal? Let’s use Predictive Analytics to find out.

We see that the problematic trend is forecasted to continue, growing at a rate of nearly 90% over the next 14 days. However, our system thresholds* have not been hit yet. This means the system is acting in an anomalous fashion. Let’s identify the specific anomalies that are influencing this CPU trend.
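
To make the idea of projecting a trend forward concrete, here is a minimal sketch that fits a straight line to daily CPU averages and extrapolates it 14 days ahead. A simple least-squares line is an assumption chosen for illustration; the post does not say which forecasting model Delta Bravo uses, and the sample values are invented.

```python
def fit_line(values):
    """Least-squares slope and intercept for values sampled once per day."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def forecast_growth(values, horizon_days):
    """Projected value and percent growth relative to today's fitted value."""
    slope, intercept = fit_line(values)
    last_day = len(values) - 1
    current = intercept + slope * last_day
    projected = intercept + slope * (last_day + horizon_days)
    growth_pct = (projected - current) / current * 100
    return projected, growth_pct

# Hypothetical daily average CPU % over the past week.
daily_cpu = [30, 32, 34, 36, 38, 40, 42]
projected, growth_pct = forecast_growth(daily_cpu, horizon_days=14)
```

With this sample data the fitted line climbs 2 points per day, so the 14-day projection is 70% CPU, roughly 67% higher than today.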


In the graphs above, the gray shadow is generated by a machine learning algorithm and represents the “acceptable range,” or baseline, for system behavior associated with that metric. We see that, while no thresholds have been reached for these metrics, behavior is outside the scope of the baselined “norm.” Why?
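
One simple way to picture a baseline band is the historical mean plus or minus a few standard deviations. The sketch below uses that approach as a stand-in; it is an assumption for illustration rather than the algorithm behind the gray shadow, and the numbers are invented. It flags readings that fall outside the band while still sitting below a static threshold.

```python
from statistics import mean, stdev

def baseline_band(history, k=3.0):
    """Acceptable range: mean +/- k sample standard deviations."""
    center = mean(history)
    spread = stdev(history)
    return center - k * spread, center + k * spread

def find_anomalies(history, recent, static_threshold, k=3.0):
    """Readings outside the baseline band that a static threshold would miss."""
    low, high = baseline_band(history, k)
    return [
        (i, value)
        for i, value in enumerate(recent)
        if (value < low or value > high) and value < static_threshold
    ]

# Hypothetical metric history and the latest readings.
history = [20, 22, 21, 19, 20, 21, 22, 20, 19, 21]
recent = [21, 23, 35, 48]
anomalies = find_anomalies(history, recent, static_threshold=90)
```

Here the last two readings (35 and 48) are far below a threshold of 90, yet well outside the range the metric normally occupies, so they are reported as anomalies.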

By selecting one of the graphs, we’re able to zoom in for more detail. The blue lines represent specific Events that influenced the rise in that metric.


By selecting the line prior to the large red spike, we see that an Object was altered. This procedure impacted Query behavior adversely. We are able to see the code that was used to alter the Object, as well as the quantified impact this change had on Query performance.


Using AI to Recommend and Implement a Fix

From here, the AI runs through a series of possible fixes and identifies which ones will have the highest likelihood of success and prioritizes their impact. In this case, the recommended fix is adding a series of Indexes.


Similar workflows are applied to Security, Capacity planning and other aspects of database management. We believe the use case is changing; it’s no longer about monitoring, daily care and feeding. Using Machine Learning and AI to manage large database deployments helps your best people scale where you need them most, and helps your systems run at peak efficiency and performance.

*Delta Bravo has the ability to set thresholds, but we feel this is a dated and reactive way to monitor/manage system behavior.

Delta Bravo Performance Counters:   SQL Server Target vs Total Memory


Why is this an important SQL Server Performance Indicator?

Delta Bravo uses this counter to assess the degree of memory pressure the system is under.  High memory pressure is a cost driver, necessitating additional resources before user experience is impacted.

Target Server Memory (KB) is the amount of memory that SQL Server is willing to allocate to the buffer pool under its current load. Total Server Memory (KB) is the amount of memory currently assigned to SQL Server. When SQL Server starts, its total memory is low; it grows throughout the warm-up period as SQL Server brings pages into its buffer pool, until it reaches a steady state. Once the steady state is reached, Total Server Memory should not decrease significantly, as that would indicate that SQL Server is being forced to dynamically deallocate memory due to system-level memory pressure.


If Total Server Memory is still growing, the server has not yet reached its steady state and is still trying to populate the cache and get pages loaded into memory. Performance will likely be somewhat slower during this time, since more disk I/O is required at this stage. This behavior is normal. Eventually Total Server Memory should approximate Target Server Memory, keeping a ratio close to 1.

If the Total Server Memory value is significantly lower than the Target Server Memory value during normal SQL Server operation, it can mean that there is memory pressure on the server, so SQL Server cannot get as much memory as it needs, or that the max server memory option is set too low.
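
The comparison described above comes down to a single ratio. Here is a minimal sketch, assuming the two counter values have already been collected; the 0.9 cutoff for “close to 1” and the function names are illustrative assumptions, not values from Delta Bravo or Microsoft.

```python
def memory_ratio(total_kb, target_kb):
    """Total Server Memory divided by Target Server Memory."""
    if target_kb <= 0:
        raise ValueError("Target Server Memory must be positive")
    return total_kb / target_kb

def assess_memory(total_kb, target_kb, warmed_up, min_ratio=0.9):
    """Rough reading of the ratio; min_ratio is an assumed cutoff."""
    ratio = memory_ratio(total_kb, target_kb)
    if ratio >= min_ratio:
        return "steady"
    if not warmed_up:
        return "warming up"  # still populating the buffer pool, which is normal
    return "investigate"  # possible memory pressure or max server memory set too low

# Hypothetical counter values in KB.
healthy = assess_memory(60_000_000, 64_000_000, warmed_up=True)
suspect = assess_memory(40_000_000, 64_000_000, warmed_up=True)
```

In the first call the ratio is about 0.94, so the instance is treated as steady; in the second it is about 0.63 on a warmed-up server, which is the case worth investigating further.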

So when do I need to add more memory?

If Total Server Memory is less than Target Server Memory, it can be a sign of memory pressure. Before going to the business to ask for more money for more memory, though, evaluate some other counters to validate that SQL Server really is under memory contention.

Start with Page Life Expectancy, which should be well above 300. This tells you how long, in seconds, pages are staying in the buffer pool, so a value of 300 equates to 5 minutes. If you have 120 GB of buffer pool and it is churning every 5 minutes, that equates to 409.6 MB/sec of sustained disk I/O, which is a lot of disk activity for the system to sustain.
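
The 409.6 MB/sec figure is the buffer pool size divided by the Page Life Expectancy. A short sketch of that arithmetic follows; the function name is hypothetical.

```python
def churn_rate_mb_per_sec(buffer_pool_gb, ple_seconds):
    """Sustained read rate needed to replace the whole buffer pool once per PLE interval."""
    buffer_pool_mb = buffer_pool_gb * 1024
    return buffer_pool_mb / ple_seconds

# 120 GB of buffer pool turning over every 300 seconds (5 minutes).
rate = churn_rate_mb_per_sec(120, 300)  # 122880 MB / 300 s = 409.6 MB/sec
```

The same function shows why a fixed floor of 300 matters more on large servers: the bigger the buffer pool, the more disk I/O a given Page Life Expectancy implies.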

Examine Lazy Writes/sec, which tells you how often the buffer pool flushed dirty pages to disk outside of the CHECKPOINT process. This should be near zero. Also review Free Pages and Free List Stalls/sec. You don’t want to see Free Pages bottom out, which results in a Free List Stall while the buffer pool frees pages for use. Lastly, look at Memory Grants Pending, which tells you whether you have processes waiting on workspace memory in order to execute.

If these supporting counters also point to memory pressure, then it may be time to increase memory allocation.
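
The checks above can be collected into one pass over the supporting counters. This is a sketch under stated assumptions: the dictionary keys are hypothetical names for values you have already gathered, and the cutoff used for “near zero” Lazy Writes/sec is an illustrative guess, not a published limit.

```python
def memory_pressure_signals(counters, ple_floor=300, lazy_writes_ceiling=1.0):
    """List which supporting counters suggest memory pressure."""
    signals = []
    if counters["page_life_expectancy"] <= ple_floor:
        signals.append("Page Life Expectancy is not above the floor")
    if counters["lazy_writes_per_sec"] > lazy_writes_ceiling:  # assumed cutoff
        signals.append("Lazy Writes/sec is not near zero")
    if counters["free_list_stalls_per_sec"] > 0:
        signals.append("Free List Stalls are occurring")
    if counters["memory_grants_pending"] > 0:
        signals.append("queries are waiting on workspace memory")
    return signals

# Hypothetical snapshots of the supporting counters.
healthy = {"page_life_expectancy": 4200, "lazy_writes_per_sec": 0.0,
           "free_list_stalls_per_sec": 0, "memory_grants_pending": 0}
pressured = {"page_life_expectancy": 180, "lazy_writes_per_sec": 35.0,
             "free_list_stalls_per_sec": 4, "memory_grants_pending": 2}

healthy_signals = memory_pressure_signals(healthy)
pressured_signals = memory_pressure_signals(pressured)
```

An empty list suggests the gap between Total and Target Server Memory is not yet hurting the workload; several signals together make a stronger case for adding memory.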

Delta Bravo Performance Counters:   CPU % vs. Process Privileged Time (Total)


Why is this an important SQL Server Performance Indicator?

Delta Bravo uses this metric to determine whether the processor problems originate from internal Windows processes, or are caused by a user application. If Delta Bravo identifies high CPU usage on a SQL Server instance, the next step is to narrow down the high CPU problem to the lowest possible level–the component which is causing high CPU.

The CPU % vs. Process Privileged Time (Total) counter helps Delta Bravo understand how much time the system spends on Windows kernel commands, such as SQL Server I/O requests. If the CPU % vs. Process Privileged Time value is high, kernel mode processes are using a lot of processor time: the machine is busy executing basic operating system tasks and cannot run user processes and other applications, such as SQL Server. The recommended values for CPU % vs. Process Privileged Time are 5 to 10%, or a maximum of 20%, of the % Total Processor Time.
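
Read as a share of total processor time, the recommended range can be checked with a few lines of arithmetic. The sketch below assumes that reading (privileged time divided by total processor time); the function names and the labels it returns are hypothetical.

```python
def privileged_share(privileged_pct, total_pct):
    """Privileged time as a percentage of total processor time."""
    if total_pct <= 0:
        return 0.0  # an idle processor has no share to report
    return privileged_pct / total_pct * 100

def classify_privileged_time(privileged_pct, total_pct):
    """Apply the 5 to 10% recommended range and the 20% maximum."""
    share = privileged_share(privileged_pct, total_pct)
    if share <= 10:
        return "normal"
    if share <= 20:
        return "elevated"
    return "high"

# Hypothetical readings: (% Privileged Time, % Total Processor Time).
normal = classify_privileged_time(6, 80)     # 7.5% of total
elevated = classify_privileged_time(12, 80)  # 15% of total
high = classify_privileged_time(30, 80)      # 37.5% of total
```

A "high" result points toward kernel-mode work, such as drivers or I/O handling, rather than user-mode query processing as the place to look next.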


Do you know what’s keeping your processor busy?

There are two different states to be aware of when talking about processors executing instructions: Privileged mode and User mode.  Some operating system threads and interrupts (including all device driver functions) as well as Kernel-mode threads execute in privileged mode.

When dealing with Privileged mode operations, there are two modes to consider: Interrupt mode and Deferred Procedure Call (DPC) mode.

Interrupt mode is reserved for interrupt service routines, which are device driver functions. When looking at this in Performance Monitor, % Interrupt Time is the time the processor spends receiving and servicing hardware interrupts during sample intervals. This value is an indirect indicator of the activity of devices that generate interrupts, such as the system clock, the mouse, disk drivers, data communication lines, network interface cards and other peripheral devices. These devices normally interrupt the processor when they have completed a task or require attention. Normal thread execution is suspended during interrupts. Most system clocks interrupt the processor every 10 milliseconds, creating a background of interrupt activity.

DPC mode is time spent in routines known as deferred procedure calls, which are routines scheduled by device drivers to complete interrupt processing. DPCs are often referred to as soft interrupts. From the Performance Monitor perspective, the % DPC Time counter shows the percentage of the time that the system was executing in DPC mode. Measuring these counters separately can provide insight into whether there are issues with the interrupt service routine or its DPC.

Taking Action with Delta Bravo

If you notice that the % Interrupt Time counter is much higher than your baseline, you may have a problem with a device driver or piece of hardware. Compare the Interrupts/sec counter between the baseline and your current performance log. If the current rate is proportional to the level in the baseline, then the device driver code is the most likely culprit. If the interrupt rate is significantly higher, you are probably experiencing a hardware issue.
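
That decision can be written as a small function. This is a sketch only: the multipliers that stand in for “much higher” and “significantly higher” are assumptions chosen for illustration, and the function and parameter names are hypothetical.

```python
def diagnose_interrupts(base_time_pct, cur_time_pct, base_rate, cur_rate,
                        time_factor=2.0, rate_factor=1.5):
    """Compare current interrupt counters with a baseline.

    time_factor and rate_factor are assumed cutoffs, not published values.
    """
    if cur_time_pct < base_time_pct * time_factor:
        return "no interrupt problem indicated"
    if cur_rate > base_rate * rate_factor:
        # More interrupts are arriving: a device is likely misbehaving.
        return "suspect hardware"
    # Similar interrupt count, but each one costs more time to service.
    return "suspect device driver"

# Hypothetical baseline vs. current values:
# (% Interrupt Time, % Interrupt Time, Interrupts/sec, Interrupts/sec).
driver_case = diagnose_interrupts(2.0, 9.0, 5000, 5200)
hardware_case = diagnose_interrupts(2.0, 9.0, 5000, 14000)
quiet_case = diagnose_interrupts(2.0, 2.5, 5000, 5100)
```

The reasoning is that a similar interrupt rate with much more time spent servicing interrupts points at the servicing code (the driver), while a much higher rate points at whatever is generating the interrupts (the hardware).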