Tags: google-compute-engine, google-kubernetes-engine, oom
I have an issue with Container-Optimized OS on GKE. If I run this simple command https://pastebin.com/raw/0WPAnAzn to consume all the RAM, at some point the host freezes and doesn't respond to anything. The expected behaviour is that the process gets killed by the OOM killer. I've tried this on stock Ubuntu and CentOS images and they work perfectly: the process gets killed without a freeze.
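The exact script behind the pastebin link isn't reproduced here; a memory eater along the same lines might look like the following sketch. This is a bounded, hypothetical stand-in, not the linked code — the cap keeps it safe to run, whereas removing the cap would reproduce the actual OOM scenario.

```python
# Hypothetical stand-in for the linked RAM-eater script (the original is
# only available via the pastebin URL above). cap_bytes bounds the
# allocation so this sketch is safe to run.
def eat_memory(cap_bytes, chunk=1 << 20):
    blocks = []       # hold references so nothing is garbage-collected
    allocated = 0
    while allocated < cap_bytes:
        blocks.append(bytearray(chunk))  # zero-filled pages, already touched
        allocated += chunk
    return allocated

print(eat_memory(8 << 20))  # allocate 8 MiB in 1 MiB chunks
```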

There are three possible kmsg outputs in Serial console:

  1. In some cases the log doesn't contain anything related to the freeze
  2. Sometimes there is a series of OOM kills of other processes, followed by a freeze without any related message
  3. And most interesting: OOM kills followed by a kernel panic (https://pastebin.com/raw/gtdsg6vQ)

Freezes are accompanied by near 100% CPU load.

So is this expected behaviour, or is there something wrong?

    • I was able to reproduce the behavior using the code you provided. In my case I only got about 30% CPU load, but the node certainly crashed. There could be a significant difference in the way the OOM killer acts in Chrome OS. I will try to dig into it.
    • I tested using a standalone VM with the COS image and it worked like any other Linux distribution. It seems the issue is strictly related to how Kubernetes manages OOM. I believe containers inside the node are getting killed, but not the main process that is eating the memory.

After some experiments I've found that it's not GKE or GCP related. It's not even related to the COS image.

Actually, it's how the Linux kernel handles OOM. The OOM killer starts too late and acts in a highly memory-constrained environment. It decides which process to kill using each process's oom_score.

When running Kubernetes on a host, there are many processes with a high oom_score_adj value (these are pods without memory limits in their spec). If your RAM-eater pod has limits set, its resulting oom_score will probably be lower than that of many other processes.
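The kubelet derives these oom_score_adj values from each pod's QoS class. A rough sketch of that policy, simplified from the kubelet's QoS code (the clamp bounds are approximated, so treat the exact constants as assumptions):

```python
# Simplified sketch of the kubelet's QoS-based oom_score_adj policy.
# Guaranteed pods are protected, BestEffort pods die first, and
# Burstable pods scale with the fraction of node memory they request.
def oom_score_adj(qos, memory_request=0, node_capacity=1):
    if qos == "Guaranteed":
        return -997          # very unlikely to be OOM-killed
    if qos == "BestEffort":
        return 1000          # first in line for the OOM killer
    # Burstable: the larger the request, the lower (safer) the score
    adj = 1000 - (1000 * memory_request) // node_capacity
    return max(2, min(adj, 999))  # clamp between the two extremes

print(oom_score_adj("BestEffort"))                    # 1000
print(oom_score_adj("Burstable", 4 << 30, 16 << 30))  # 750
```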

In this case the OOM killer will first kill the many processes with the highest oom_score before it gets a chance to kill the really greedy process. I don't know why, but in this situation Linux freezes completely.
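You can inspect this ranking yourself: the kernel exposes the final score per process under /proc, and the OOM killer picks the highest. A small sketch (Linux-only) that lists the top candidates:

```python
# List processes by the kernel's oom_score, the value the OOM killer
# consults when choosing a victim (highest score dies first).
# Linux-only: relies on the /proc filesystem.
import os

def oom_scores():
    scores = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/oom_score") as f:
                score = int(f.read())
            with open(f"/proc/{pid}/comm") as f:
                comm = f.read().strip()
            scores.append((score, pid, comm))
        except OSError:
            continue  # process exited while we were scanning
    return sorted(scores, reverse=True)

for score, pid, comm in oom_scores()[:10]:
    print(f"{score:5d}  {pid:>7}  {comm}")
```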

As a workaround I found this tool. Installing it as a DaemonSet solves the problem. It kills greedy processes without mercy.


The behavior is expected and is not caused by the COS image per se. Instead it is related to how Kubernetes handles node OOM. In this case the script is being run on the node itself and not in a Pod. Containers are being killed, but not the main process starving the memory.

There are some proposals and implementations to help reserve resources for the node OS daemons.

Beginning with nodes running version 1.7.6, Google Container Engine reserves a portion of each node's compute resources for system overhead using the Kubernetes Node Allocatable feature. This increases the reliability of system components and is not an increase in overhead: the change explicitly reserves compute resources that system components may already consume.
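Node Allocatable works by subtracting the reservations from the node's raw capacity; a minimal sketch of the arithmetic (the numbers below are illustrative values, not GKE's actual defaults):

```python
# Kubernetes Node Allocatable, conceptually:
#   allocatable = capacity - kube-reserved - system-reserved - eviction-threshold
# All values in MiB; the figures below are made-up examples.
def allocatable_mib(capacity, kube_reserved, system_reserved, eviction_threshold):
    return capacity - kube_reserved - system_reserved - eviction_threshold

print(allocatable_mib(capacity=7680, kube_reserved=1024,
                      system_reserved=256, eviction_threshold=100))  # 6300
```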

