Implementing Graceful Shutdown in Windows Container

Kubernetes Linux Pod中,当通过kubectl删除一个Pod或rolling update一个Pod时, 每Terminating的Pod中的每个Container中PID为1的进程会收到SIGTERM信号, 通知进程进行资源回收并准备退出. 如果在Pod spec.terminationGracePeriodSeconds指定的时间周期内进程没有退出, 则Kubernetes接着会发出SIGKILL信号KILL这个进程。

通过 kubectl delete –force –grace-period=0 … 的效果等同于直接发SIGKILL信号.

但SIGTERM和SIGKILL方式在Windows Container中并不工作, 目前Windows Container的表现是接收到Terminating指令5秒后直接终止。。。

参见:https://v1-18.docs.kubernetes.io/docs/setup/production-environment/windows/intro-windows-in-kubernetes/#v1-pod

  • V1.Pod.terminationGracePeriodSeconds – this is not fully implemented in Docker on Windows, see: reference. The behavior today is that the ENTRYPOINT process is sent CTRL_SHUTDOWN_EVENT, then Windows waits 5 seconds by default, and finally shuts down all processes using the normal Windows shutdown behavior. The 5 second default is actually in the Windows registry inside the container, so it can be overridden when the container is built.

基于社区的讨论结果及多次尝试, 目前Windows容器中行之有效的Graceful Shutdown方法是:

1. Build docker image时通过修改注册表延长等待时间

...
RUN reg add hklm\system\currentcontrolset\services\cexecsvc /v ProcessShutdownTimeoutSeconds /t REG_DWORD /d 300 && \
    reg add hklm\system\currentcontrolset\control /v WaitToKillServiceTimeout /t REG_SZ /d 300000 /f
...

上面两个注册表位置, 第1个单位为秒, 第2个为毫秒

2. 在应用程序中注册kernel32.dll中的SetConsoleCtrlHandler函数捕获CTRL_SHUTDOWN_EVENT事件, 进行资源回收

以一个.net framework 的Console App为例说明用法:

using System;
using System.Runtime.InteropServices;
using System.Threading;

namespace Q1.Foundation.SocketServer
{
    class Program
    {
        internal delegate bool HandlerRoutine(CtrlType CtrlType);
        private static HandlerRoutine ctrlTypeHandlerRoutine = new HandlerRoutine(ConsoleCtrlHandler);

        private static bool cancelled = false;
        private static bool cleanupCompleted = false;

        internal enum CtrlType
        {
            CTRL_C_EVENT = 0,
            CTRL_BREAK_EVENT = 1,
            CTRL_CLOSE_EVENT = 2,
            CTRL_LOGOFF_EVENT = 5,
            CTRL_SHUTDOWN_EVENT = 6
        }

        [DllImport("Kernel32")]
        internal static extern bool SetConsoleCtrlHandler(HandlerRoutine handler, bool add);

        static void Main()
        {
            var result = SetConsoleCtrlHandler(handlerRoutine, true);

            // INITIAL AND START APP HERE

            while (true)
            {
                if (cancelled) break;
            }

            // DO CLEANUP HERE
            ...
            cleanupCompleted = true;
        }

        private static bool ConsoleCtrlHandler(CtrlType type)
        {
            cancelled = true;

            while (!cleanupCompleted)
            {
                // Watting for clean-up to be completed...
            }

            return true;
        }
    }
}

代码解释:

  • 引入Kernel32并声明extern函数SetConsoleCtrlHandler
  • 创建static的HandlerRoutine.
  • 调用SetConsoleCtrlHandler注册处理函数进行事件捕获
  • 捕获后在HandlerRoutine应用程序中进行资源清理
  • 清理完成后在HandlerRoutine中返回true允许应用程序退出

上述两个步骤即完成了Graceful Shutdown.

 

需要注意的点是:

1. 传统.net Console App中的事件捕获( 比如: Console.CancelKeyPressSystemEvents.SessionEnding )在容器中都不会生效,AppDomain.CurrentDomain.ProcessExit的触发时间又太晚, 只有SetConsoleCtrlHandler可行. 更多的尝试代码请参见: https://github.com/moby/moby/issues/25982#issuecomment-250490552

2. 要防止程序退出前HandlerRoutine实例被回收, 所以上面示例中使用了static的HandlerRoutine. 这点很重要, 如果HandlerRoutine在应用程序未结束的时候被回收掉, 就会引发错误, 看如下代码:

static void Main()
{
    // Initialize here

    ...
    using
    {
        var sysEventHandler = new HandlerRoutine(type =>
        {
            cancelled = true;

            while (!cleanCompleted)
            {
                // Watting for clean-up to be completed...
            }

            return true;
        });
		
        var sysEventSetResult = SetConsoleCtrlHandler(sysEventHandler, true);
        ...
    }
    ...

    // Cleanup here
}

在应用程序退出前, HandlerRoutine实例已经被回收掉了,在CTRL_SHUTDOWN_EVENT 被触发时就会引发NullReferenceException, 具体错误信息如下:

Managed Debugging Assistant 'CallbackOnCollectedDelegate':
A callback was made on a garbage collected delegate of type 'Program+HandlerRoutine::Invoke'. This may cause application crashes, corruption and data loss. When passing delegates to unmanaged code, they must be kept alive by the managed application until it is guaranteed that they will never be called.

类似场景: CallbackOnCollectedDelegate was detected

 

关于SetConsoleCtrlHandler的使用参考:

SetConsoleCtrlHandler function

HandlerRoutine callback function

 

最后, 如果要处理的应用程序类型不是Console App, 而是图形化的界面应用,则要处理的消息应该是WM_QUERYENDSESSION, 参见文档:

https://docs.microsoft.com/en-us/windows/console/setconsolectrlhandler#remarks

WM_QUERYENDSESSION message