Tags: google-compute-engine, kubernetes, google-kubernetes-engine

[Question rewritten with details of findings.]

I am running a Google Container Engine cluster with about 100 containers that perform roughly 100,000 API calls a day. Some of the pods started seeing about a 50% failure rate in DNS resolution. I dug into this and found that it only happens for pods on nodes that are running kube-dns. I also noticed that it only happens just before a node in the cluster gets shut down for being out of memory.
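To see which workloads are sharing a node with kube-dns, a couple of kubectl queries are enough. A minimal sketch, assuming the GKE defaults of the kube-system namespace and the k8s-app=kube-dns label:

    # Show the kube-dns pods and the nodes they were scheduled onto.
    kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide

    # List every pod with its node assignment to spot workers co-located with kube-dns.
    kubectl get pods --all-namespaces -o wide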

The background resque jobs connect to Google APIs and then upload data to S3. When the jobs fail, they fail with "Temporary failure in name resolution." This happens for both "accounts.google.com" and "s3.amazonaws.com".

When I log into the server and try to resolve these (or other) hosts with host, nslookup, or dig, it works just fine. When I connect to the rails console and run the same code that's failing in the queues, I can't get a failure to happen. However, as I said, these background failures are intermittent (about 50% of the time for the workers running on nodes that run kube-dns).
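Testing from inside the affected pods (rather than on the node) is what eventually surfaced the intermittent failures for me. A rough sketch; the pod name is a placeholder, and the container image needs getent (or nslookup/dig) available:

    # Run repeated lookups from inside one of the affected worker pods.
    for i in $(seq 1 50); do
      kubectl exec worker-pod-abc123 -- getent hosts accounts.google.com \
        > /dev/null || echo "lookup $i failed"
    done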

So far, my interim fix has been to delete the failing pods, let Kubernetes reschedule them, and repeat until they land on a node that is not running kube-dns.
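Concretely, the workaround is just repeated deletes and checks, along these lines (the pod name and the app=worker label are placeholders):

    # Delete a failing worker pod; its controller recreates it, possibly on another node.
    kubectl delete pod worker-pod-abc123

    # Check where the replacement landed and whether that node also hosts kube-dns.
    kubectl get pods -o wide -l app=worker
    kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide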

Incidentally, removing the failing node did not resolve this. It just caused Kubernetes to move everything to other nodes, and the problem moved with it.

    • Failing other options, I deleted the VM in the cluster. GKE re-created it, but in doing so relocated the pods to one of the other running nodes, which immediately began exhibiting the DNS resolution problem. So now I'm thinking it's a Kubernetes problem. I noticed that kube-dns is running as two pods, both on the same node, and that node is the one exhibiting the problem.
    • I noticed in the logs that it appears to be trying to resolve local addresses for things that should not be local. From the kube-dns logs: Received DNS Request:accounts.google.com.default.svc.cluster.local., exact:false
    • Also, we are having this problem again and it's only happening for containers that are on the nodes where kube-dns is running (in this case it's running two pods on two different nodes).
    • I think that line from the kube-dns logs may be a red herring; it's just the resolver trying the cluster-local search suffixes before trying the name as-is (see the resolv.conf sketch below). I was able to resolve this again, temporarily, by deleting the affected pods until k8s scheduled them onto nodes that are not running kube-dns.
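For anyone puzzled by names like accounts.google.com.default.svc.cluster.local in the logs: that is ordinary search-path expansion by the pod's resolver, not kube-dns inventing local addresses. You can see it in the resolver config the pod gets (pod name is a placeholder; the exact nameserver and search list will differ per cluster):

    # Print the resolver config inside an affected pod.
    kubectl exec worker-pod-abc123 -- cat /etc/resolv.conf

    # Typical output looks roughly like:
    #   nameserver 10.0.0.10
    #   search default.svc.cluster.local svc.cluster.local cluster.local
    #   options ndots:5
    #
    # With ndots:5, a name such as accounts.google.com has fewer than five dots, so
    # each search suffix is tried first and kube-dns logs the
    # .default.svc.cluster.local attempt before the external lookup succeeds.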

I solved this by upgrading to Kubernetes 1.4.

The 1.4 release included several fixes to keep Kubernetes from crashing under out-of-memory conditions. I think this helped reduce the likelihood of hitting this issue, although I'm not convinced the core issue was fixed (unless the issue was that one of the kube-dns instances had crashed or become non-responsive because the Kubernetes system was unstable when a node hit OOM).
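For reference, on Container Engine the upgrade itself is done with gcloud; roughly like this, with a placeholder cluster name and zone (check gcloud container clusters upgrade --help for the flags your SDK version supports):

    # Upgrade the master first, then the nodes.
    gcloud container clusters upgrade my-cluster --zone us-central1-b --master
    gcloud container clusters upgrade my-cluster --zone us-central1-b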

    • @TrevorHartman agreed that this issue is still happening, although we see it less than we used to. We have had a couple of instances where DNS inside the cluster just seems to fail for several minutes and then recover, but these have been very rare. We also added memory requests and limits to our containers (as sketched below) to help the Kubernetes scheduler work better and to ensure that the containers didn't crash Kubernetes.
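A sketch of what adding those requests and limits can look like; the "worker" deployment/container name and the memory sizes are placeholders, not values from our cluster:

    # Strategic-merge patch (the kubectl default) that adds memory requests/limits to
    # one container; the containers list is merged by name, so other fields are kept.
    kubectl patch deployment worker -p '
    {"spec": {"template": {"spec": {"containers": [
      {"name": "worker",
       "resources": {"requests": {"memory": "256Mi"},
                     "limits":   {"memory": "512Mi"}}}
    ]}}}}'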
