ScaleOut Watchdog Service for Microsoft Azure

The ScaleOut Watchdog Service is a Windows service that enables a ScaleOut in-memory data grid to detect and handle maintenance events when running on Microsoft Azure virtual machines. This helps maintain the stability of the in-memory data grid.

Motivation

Microsoft Azure virtual machines are frequently frozen, migrated, or stopped when Microsoft updates the underlying Azure infrastructure or adjusts the size of a Virtual Machine Scale Set. This infrastructure activity can disrupt the heartbeating of ScaleOut’s peer-to-peer membership protocols.

The ScaleOut Watchdog Service polls the Azure Metadata Service to learn about upcoming maintenance events. If a new maintenance event is about to occur, the Watchdog Service attempts to have the local ScaleOut service leave the in-memory data grid cleanly and then stops it, which prevents disruption of the other hosts in the grid.
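As an illustration of the kind of data being polled, the following Python sketch filters a Scheduled Events document from the Azure Metadata Service down to the events that affect one VM. It is not the Watchdog Service's own code. The function name `events_for_vm`, the VM name `soss-host-1`, and the sample payload are assumptions made for this example; the HTTP request itself is omitted so that the sketch runs outside Azure.

```python
import json
from email.utils import parsedate_to_datetime

# Azure Instance Metadata Service address for Scheduled Events. A real
# poller would send an HTTP GET to this URL with the header "Metadata: true"
# from inside the VM; that call is left out so the sketch runs anywhere.
SCHEDULED_EVENTS_URL = (
    "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
)


def events_for_vm(document_text, vm_name):
    """Return (event_type, not_before) pairs for events that affect vm_name."""
    document = json.loads(document_text)
    matches = []
    for event in document.get("Events", []):
        if vm_name not in event.get("Resources", []):
            continue
        raw = event.get("NotBefore", "")
        # NotBefore can be empty once an event has already started.
        not_before = parsedate_to_datetime(raw) if raw else None
        matches.append((event["EventType"], not_before))
    return matches


# Hypothetical payload shaped like a Scheduled Events response.
SAMPLE = """
{
  "DocumentIncarnation": 2,
  "Events": [
    {
      "EventId": "602d9444-d2cd-49c7-8624-8643e7171297",
      "EventType": "Freeze",
      "ResourceType": "VirtualMachine",
      "Resources": ["soss-host-1"],
      "EventStatus": "Scheduled",
      "NotBefore": "Mon, 19 Sep 2016 18:29:47 GMT"
    }
  ]
}
"""

for event_type, not_before in events_for_vm(SAMPLE, "soss-host-1"):
    print(event_type, not_before)
```

Each returned pair gives the event type and the earliest time at which Azure may start the event, which is the information needed to decide how much time remains for a clean leave.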

In addition to maintenance events, the Watchdog Service exposes an endpoint that can be used by the Azure Application Health extension. This supports running rolling upgrades of the ScaleOut data grid when deployed as an Azure Scale Set.

More Information

Installing the Watchdog Service

Configure the local ScaleOut service to automatically join the store upon service startup (“Auto-Join”). Otherwise, the local ScaleOut host may be left in an inactive state after a maintenance event completes. Auto-Join can be enabled either through the ScaleOut Management Console’s “Host Configuration” tab or from the command line by running soss.exe set auto_join=1.
Run the soss_azure_watchdog.msi installer on each VM running the ScaleOut service. After installation, a new service called “ScaleOut Azure Watchdog” will be visible in the Windows Services control panel, running as a process named soss_azure_watchdog.exe.
If you are running the ScaleOut service in an Azure Scale Set, optionally configure your Scale Set to receive termination notifications. The Watchdog Service uses these notifications to coordinate graceful shutdowns of the ScaleOut service when VM instances in your Scale Set are terminated.

Operation

The Watchdog Service is designed to run in the background of a virtual machine and operate without user intervention.

Maintenance Events

When the Watchdog Service detects an imminent Azure maintenance event, it will instruct the local ScaleOut service to gracefully leave the distributed data grid, and then the ScaleOut service will be stopped.

Note

If there is only one host in your ScaleOut data grid then the Watchdog Service will allow it to remain active to prevent data loss.
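The behavior described above, combined with the minNotificationTimeForLeave setting covered under Configuration, can be summarized as a small decision function. This is a sketch under stated assumptions, not the service's actual logic: the function name `choose_action` and the three return labels are invented for illustration, and the four-minute default mirrors the documented default for minNotificationTimeForLeave.

```python
from datetime import datetime, timedelta, timezone


def choose_action(not_before, now, active_hosts,
                  min_notice=timedelta(minutes=4)):
    """Pick a response to an upcoming maintenance event.

    Returns one of three labels invented for this sketch:
    "remain", "leave_then_stop", or "stop_only".
    """
    # A lone host stays active so that the grid's data is not lost.
    if active_hosts <= 1:
        return "remain"
    # With enough advance notice, leave the grid cleanly before stopping.
    if not_before - now >= min_notice:
        return "leave_then_stop"
    # Too little notice for a clean leave: stop the service directly.
    return "stop_only"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(choose_action(now + timedelta(minutes=15), now, active_hosts=3))
print(choose_action(now + timedelta(seconds=30), now, active_hosts=3))
print(choose_action(now + timedelta(minutes=15), now, active_hosts=1))
```

The three calls cover a typical 15-minute notification, a short-notice event, and a single-host grid.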

It is typical for the local ScaleOut service to remain stopped for about 15 minutes during a maintenance event. (In most cases, Microsoft Azure provides a notification 15 minutes before the start of maintenance, and most maintenance events complete within 5 to 30 seconds.)

Once the maintenance event is complete, the Watchdog Service will restart the local ScaleOut service. Assuming the ScaleOut service is configured to Auto-Join, it will rejoin the other active hosts in the ScaleOut data grid and resume normal operation after load balancing.

Tip

During normal day-to-day operation, do not allow inactive (unjoined) instances of the ScaleOut service to sit idle in your ScaleOut host group. Unused hosts should have their Azure VMs shut down to prevent the Watchdog Service from inadvertently rejoining them after a maintenance event.

Azure Scale Set Termination Events

The Watchdog Service can coordinate graceful shutdowns when the ScaleOut service runs in Azure Scale Set instances. If your Scale Set reduces the number of instances due to a scale-in policy, the Watchdog Service detects the event and coordinates the terminations so that ScaleOut service instances leave the data grid and stop one at a time, which minimizes disruption and avoids data loss. The Watchdog Service gives each terminating instance 2 minutes to leave and shut down before allowing Azure to terminate the next host in the Scale Set.
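To make the timing concrete, the sketch below computes when each instance's two-minute window would open if several instances are terminated in one scale-in operation. The function name, the instance names, and the assumption that instances are processed in list order are all hypothetical; how the Watchdog Service actually orders terminations is not specified here.

```python
from datetime import timedelta

# Time each terminating instance is given to leave the grid and shut down.
LEAVE_WINDOW = timedelta(minutes=2)


def termination_schedule(instance_names, window=LEAVE_WINDOW):
    """Map each instance to the offset at which its leave window opens.

    Instances are handled one at a time in list order (an assumption).
    """
    return {name: index * window for index, name in enumerate(instance_names)}


schedule = termination_schedule(["vm_0", "vm_1", "vm_2"])
for name, offset in schedule.items():
    print(name, "starts leaving after", offset)
```

Under these assumptions, scaling in by three instances takes up to six minutes in total, because the third instance's window opens four minutes after the first and then lasts two minutes.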

Important

Your Azure Scale Set must be configured to receive termination notifications for the Watchdog Service to coordinate graceful terminations. See Terminate notification for Azure Virtual Machine Scale Set instances for details on how to opt in to receive these notifications.

Azure Scale Set Rolling Upgrades

The Watchdog Service exposes an HTTP endpoint that can be used by the Azure Application Health extension to support rolling upgrades of the ScaleOut data grid when deployed as an Azure Scale Set. The extension can be configured to use the Watchdog Service’s health check endpoint to determine when a VM instance is ready to perform the upgrade operation.
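When troubleshooting a rolling upgrade, it can help to probe the health check endpoint by hand. The sketch below performs the same kind of check that an HTTP health probe does, treating an HTTP 200 response as healthy and anything else as unhealthy. The function name `is_healthy` is an assumption made for this example, and the endpoint's request path is not documented in this section, so any URL passed to it is a placeholder to be replaced with the real one.

```python
import urllib.error
import urllib.request


def is_healthy(url, timeout_secs=5.0):
    """Return True when an HTTP GET to url answers with status 200."""
    # Bypass any configured proxy, since the probe targets the local VM.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout_secs) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Non-2xx responses, refused connections, and timeouts all count
        # as unhealthy.
        return False
```

For example, `is_healthy("http://localhost:9910/")` would probe the port used in the sample configuration, with `/` standing in for the actual request path.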

Configuration

The service can be configured by editing the soss_azure_watchdog.exe.config file, located in the installation directory. The following appSettings may be edited:

<appSettings>
  <add key="pollingIntervalSecs" value="1" />
  <add key="mgtRestPort" value="4000" />
  <add key="minNotificationTimeForLeave" value="00:04:00" />
  <add key="healthCheckPort" value="9910" />
</appSettings>

pollingIntervalSecs: Controls how frequently the Watchdog Service queries the Azure Metadata Service for information about maintenance events. Microsoft recommends polling every 1 second, as emergency maintenance may only provide several seconds of warning.
mgtRestPort: The port used by the local ScaleOut Management REST service, as configured in C:\Program Files\ScaleOut_Software\StateServer\soss_mgt_rest\soss_mgt_rest.json. The default is 4000.
minNotificationTimeForLeave: The minimum amount of advance notice that the Watchdog Service needs to perform a clean leave operation. If an event's notification time is shorter than this, the Watchdog Service stops the local ScaleOut service without first performing a leave. The default is four minutes.
healthCheckPort: The port on which the Watchdog Service exposes its HTTP health check endpoint for use by the Azure Application Health extension.
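To check edited values before restarting the service, the settings can be read back with a short script. The following Python sketch parses the appSettings shown above into typed values. The function names `parse_hms` and `load_settings` are invented for this example, and the sketch assumes that minNotificationTimeForLeave is always written in hh:mm:ss form, as in the sample.

```python
import xml.etree.ElementTree as ET
from datetime import timedelta

# Copy of the appSettings fragment shown above.
SAMPLE_APP_SETTINGS = """
<appSettings>
  <add key="pollingIntervalSecs" value="1" />
  <add key="mgtRestPort" value="4000" />
  <add key="minNotificationTimeForLeave" value="00:04:00" />
  <add key="healthCheckPort" value="9910" />
</appSettings>
"""


def parse_hms(text):
    """Convert an hh:mm:ss string into a timedelta (assumed format)."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def load_settings(xml_text):
    """Read the four documented keys from an appSettings XML string."""
    root = ET.fromstring(xml_text.strip())
    # iter() also finds <add> elements when appSettings is nested inside
    # a larger configuration document.
    raw = {node.get("key"): node.get("value") for node in root.iter("add")}
    return {
        "pollingIntervalSecs": int(raw["pollingIntervalSecs"]),
        "mgtRestPort": int(raw["mgtRestPort"]),
        "minNotificationTimeForLeave": parse_hms(
            raw["minNotificationTimeForLeave"]
        ),
        "healthCheckPort": int(raw["healthCheckPort"]),
    }


settings = load_settings(SAMPLE_APP_SETTINGS)
print(settings)
```

A missing key raises a KeyError and a malformed number raises a ValueError, so a typo in the file surfaces immediately instead of being silently ignored.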

Logging

Log information is available in two locations.

Windows Event Log: Startup events and errors are written to the Windows Application Event Log, where the event source is soss_azure_watchdog.
Trace Files: Detailed tracing is written to the %PROGRAMDATA%\ScaleOut Software\soss_azure_watchdog directory. Log files in this directory roll over every day, and they are retained for the prior 10 days. All activity performed by the Watchdog Service is recorded in these log files.