ScaleOut Watchdog Service for Microsoft Azure

The ScaleOut Watchdog Service is a Windows service that enables a ScaleOut in-memory data grid to detect and handle maintenance events when running on Microsoft Azure virtual machines. This helps maintain the stability of the in-memory data grid.

Motivation

Microsoft Azure virtual machines are frequently frozen or migrated when Microsoft updates the underlying Azure infrastructure. A freeze may last up to 30 seconds, which can disrupt the heartbeating of ScaleOut’s peer-to-peer membership protocols.

The ScaleOut Watchdog Service polls the Azure Metadata Service to learn about upcoming maintenance events. If a new maintenance event is about to occur, the Watchdog Service attempts to have the local host cleanly leave the in-memory data grid and then stops the local ScaleOut service, which prevents disruption of the other hosts in the grid.
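
The polling step can be sketched roughly as follows. This is a minimal Python illustration, not the Watchdog Service's actual implementation. It assumes the Azure Instance Metadata Service's Scheduled Events endpoint and its documented JSON shape (an "Events" list whose entries carry an "EventId"); the helper names fetch_scheduled_events and pending_events are hypothetical, and the api-version shown is only one of the published versions.

```python
import json
import urllib.request

# Scheduled Events endpoint of the Azure Instance Metadata Service.
# It is reachable only from inside the VM. The api-version is an assumption;
# use whichever published version your environment supports.
EVENTS_URL = "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"


def fetch_scheduled_events(url=EVENTS_URL, timeout=5):
    """Return the parsed Scheduled Events document for this VM."""
    # The metadata service rejects requests that lack the Metadata header.
    request = urllib.request.Request(url, headers={"Metadata": "true"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def pending_events(document, seen_ids):
    """Return the events whose EventId has not been handled yet."""
    return [e for e in document.get("Events", []) if e["EventId"] not in seen_ids]
```

A watchdog built along these lines would call fetch_scheduled_events once per polling interval and react only to the entries returned by pending_events.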

Installing the Watchdog Service

  1. Configure the local ScaleOut service to automatically join the store upon service startup (“Auto-Join”). Otherwise, the local ScaleOut host may be left in an inactive state after a maintenance event completes. Auto-Join can be enabled either through the ScaleOut Management Console’s “Host Configuration” tab or from the command line by running soss.exe set auto_join=1.

  2. Run the soss_azure_watchdog.msi installer on each VM running the ScaleOut service. After installation, a new service called “ScaleOut Azure Watchdog” will be visible in the Windows Services control panel, running as a process named soss_azure_watchdog.exe.

Operation

The Watchdog Service is designed to run in the background of a virtual machine and operate without user intervention.

When the Watchdog Service detects an imminent Azure maintenance event, it instructs the local ScaleOut service to gracefully leave the distributed data grid and then stops the ScaleOut service.

Note

If there is only one host in your ScaleOut data grid, the Watchdog Service will allow it to remain active to prevent data loss.

The local ScaleOut service typically remains stopped for about 15 minutes around a maintenance event. In most cases, Microsoft Azure provides a notification 15 minutes before the start of maintenance, and most maintenance events then complete within 5 to 30 seconds, so nearly all of the downtime is the advance-notice window.

Once the maintenance event is complete, the Watchdog Service will restart the local ScaleOut service. Assuming the ScaleOut service is configured to Auto-Join, it will rejoin the other active hosts in the ScaleOut data grid and resume normal operation after load balancing.
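
The cycle described above can be summarized as a small decision function. This is a hedged Python sketch only: the function name next_action, its inputs, and the returned action labels are all hypothetical, and it assumes the watchdog remembers whether it was the one that stopped the local service.

```python
def next_action(event_pending, stopped_by_watchdog, active_host_count):
    """Decide what a watchdog would do on one polling cycle.

    event_pending:        a maintenance event is announced for this VM
    stopped_by_watchdog:  the local ScaleOut service was stopped earlier
                          by the watchdog (assumed bookkeeping)
    active_host_count:    number of active hosts in the data grid
    """
    if event_pending and not stopped_by_watchdog:
        # A lone host stays active so that the grid's data is not lost.
        if active_host_count <= 1:
            return "stay_active"
        return "leave_and_stop"
    if not event_pending and stopped_by_watchdog:
        # Maintenance is over: restart the service; Auto-Join handles rejoining.
        return "restart"
    return "none"
```

For example, next_action(True, False, 3) yields "leave_and_stop", while the same event on a single-host grid yields "stay_active".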

Tip

During normal day-to-day operation, do not allow inactive (unjoined) instances of the ScaleOut service to sit idle in your ScaleOut host group. Unused hosts should have their Azure VMs shut down to prevent the Watchdog Service from inadvertently rejoining them after a maintenance event.

Configuration

The service can be configured by editing the soss_azure_watchdog.exe.config file, located in the installation directory. The following appSettings may be edited:

<appSettings>
  <add key="pollingIntervalSecs" value="1" />
  <add key="mgtRestPort" value="4000" />
  <add key="serverPort" value="721" />
</appSettings>

pollingIntervalSecs

Controls how frequently the Watchdog Service queries the Azure Metadata Service for information about maintenance events. Microsoft recommends polling every 1 second, as emergency maintenance may only provide several seconds of warning.

mgtRestPort

The port used by the local ScaleOut Management REST service, as configured in C:\Program Files\ScaleOut_Software\StateServer\soss_mgt_rest\soss_mgt_rest.json. The default is 4000.

serverPort

The server port for the local ScaleOut StateServer service. The default is 721.
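
To check what a given configuration file resolves to, the three settings can be read with a few lines of Python. This is an illustrative sketch, not part of the product: the function name load_watchdog_settings is hypothetical, and the fallback values are simply those shown in the sample above (whether the service itself falls back to them when a key is missing is an assumption).

```python
import xml.etree.ElementTree as ET

# Values taken from the sample appSettings above. Treating them as fallbacks
# for missing keys is an assumption made for this sketch.
FALLBACKS = {"pollingIntervalSecs": 1, "mgtRestPort": 4000, "serverPort": 721}


def load_watchdog_settings(config_xml):
    """Parse the appSettings section of .config XML text into a dict of ints."""
    settings = dict(FALLBACKS)
    root = ET.fromstring(config_xml)
    # iter() also matches the root itself, so a bare <appSettings> works too.
    for section in root.iter("appSettings"):
        for entry in section.findall("add"):
            key = entry.get("key")
            if key in settings:
                settings[key] = int(entry.get("value"))
    return settings
```

Passing the contents of soss_azure_watchdog.exe.config as a string returns the effective values; keys other than the three documented here are ignored.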

Logging

Log information is available in two locations.

  • Windows Event Log: Startup events and errors are written to the Windows Application Event Log, where the event source is soss_azure_watchdog.

  • Trace Files: Detailed tracing is written to the %PROGRAMDATA%\ScaleOut Software\soss_azure_watchdog directory. Log files in this directory roll over every day, and they are retained for the prior 10 days. All activity performed by the Watchdog Service is recorded in these log files.
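
When troubleshooting, the most recent trace file is usually the one of interest. The following Python sketch locates it by modification time, since the file naming scheme is not described here. The helper name newest_trace_file is hypothetical, and the directory is the one named above.

```python
import os

# Trace directory named above. %PROGRAMDATA% is expanded on Windows only;
# on other systems the string is left as-is and the lookup returns None.
TRACE_DIR = os.path.expandvars(r"%PROGRAMDATA%\ScaleOut Software\soss_azure_watchdog")


def newest_trace_file(directory=TRACE_DIR):
    """Return the path of the most recently modified file, or None."""
    if not os.path.isdir(directory):
        return None
    files = [entry.path for entry in os.scandir(directory) if entry.is_file()]
    # default=None covers an existing but empty directory.
    return max(files, key=os.path.getmtime, default=None)
```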