Has anyone had success in restarting the #kubernetes cluster after a power outage with an #OOM #Beijing ONAP


bw372p@...
 

After a power outage, about 75% of the pods come back, but for the most part the functionality is not working. I am seeing a bunch of errors for pods that look like this: container "portal-db-job" in pod "onap-portal-db-config-n6lrn" is waiting to start: PodInitializing


Syed Atif Husain
 

I have faced the same issue, but I haven’t found a solution so far except for reinstalling ONAP.

 

Regards,

Atif

 

From: onap-discuss@... <onap-discuss@...> On Behalf Of bw372p@...
Sent: Tuesday, November 27, 2018 12:10 AM
To: onap-discuss@...
Subject: [onap-discuss] Has anyone had success in restarting the #kubernetes cluster after a power outage with an #OOM #Beijing ONAP

 



Michael O'Brien <frank.obrien@...>
 

Hi,

   This is a Kubernetes undercloud issue, not one specific to ONAP.

   I regularly restart my VMware VMs, especially when duplicating VMs across laptops. The scenarios I run into are listed below. I would like to rule out whether you are experiencing #2, an IP change, which renders the cluster unusable until the IP in ~/.kube/config is modified to match. Is this the issue?

 

  1. OK: I shut down and restart the original VM to change the number of vCores. The VM starts up within a couple of minutes with all pods up (all 1/1, 2/2, 3/3), sometimes with ONAP pods running and sometimes without.
  2. IP: I do the above but move or copy the VM. This gives the VM a different IP, which requires a change in ~/.kube/config. These start up fine as well.
  3. Upgrade: when I upgrade the system from, say, k8s 1.10 to 1.11, everything goes: I do a docker stop/rm on the rancher server/client and reinstall. There is no need in this case for a full clean - https://wiki.onap.org/display/DW/ONAP+Development#ONAPDevelopment-RemoveaDeployment
  4. OK: on public cloud I register an elastic IP and create a Route 53 DNS record. If my spot VM restarts with a different IP, no problem: I have re-attached the persistent EIP that rancher was originally registered to.
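
For the IP-change scenario, the edit to ~/.kube/config can be scripted. The following is only a minimal sketch, assuming the kubeconfig holds the cluster address in a `server:` line with an `http(s)://host...` URL; the function name `replace_server_ip` and the sample config contents are hypothetical, and IPv6 addresses are not handled.

```python
import re

# Hypothetical kubeconfig excerpt, for illustration only.
SAMPLE = """\
apiVersion: v1
kind: Config
clusters:
- cluster:
    server: "https://192.168.1.10:8080/r/projects/1a7/kubernetes:6443"
  name: onap
"""

# Captures: (prefix up to the scheme)(host)(rest of the line).
_SERVER_LINE = re.compile(
    r'^([ \t]*server:[ \t]*"?https?://)([^:/\s"]+)(.*)$', re.MULTILINE
)


def replace_server_ip(kubeconfig_text: str, new_host: str) -> str:
    """Return the kubeconfig text with the host of every server URL replaced."""
    # A function is used as the replacement so that digits in new_host
    # cannot be misread as part of a regex group reference.
    return _SERVER_LINE.sub(
        lambda m: m.group(1) + new_host + m.group(3), kubeconfig_text
    )


if __name__ == "__main__":
    print(replace_server_ip(SAMPLE, "10.0.0.9"))
```

To apply it to a real file, one would read ~/.kube/config, keep a backup copy, and write the returned text back; only the host part of the URL changes, so the port and path stay as they were.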

 

Could you post details of which pods are having issues, and in particular whether the Kubernetes pods come up first? If your cluster IP changed, it will be more evident.
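
One way to collect those details is to filter the JSON printed by `kubectl get pods --all-namespaces -o json` for containers that are still waiting. This is a rough sketch, not a definitive tool: the function name `pods_with_issues` is hypothetical, and the sample data is made up apart from the pod and container names quoted in the original report.

```python
import json

# Made-up sample in the shape of `kubectl get pods -o json` output.
# "example-readiness" and "example-healthy-pod" are hypothetical names.
SAMPLE = """
{"items": [
  {"metadata": {"name": "onap-portal-db-config-n6lrn"},
   "status": {
     "initContainerStatuses": [
       {"name": "example-readiness", "state": {"running": {}}}],
     "containerStatuses": [
       {"name": "portal-db-job",
        "state": {"waiting": {"reason": "PodInitializing"}}}]}},
  {"metadata": {"name": "example-healthy-pod"},
   "status": {"containerStatuses": [
       {"name": "app", "state": {"running": {}}}]}}
]}
"""


def pods_with_issues(pod_list: dict) -> list:
    """Return (pod, container, reason) for every container in a waiting state."""
    issues = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        status = pod.get("status", {})
        # Check init containers and regular containers alike.
        statuses = status.get("initContainerStatuses", []) + status.get(
            "containerStatuses", []
        )
        for container in statuses:
            waiting = container.get("state", {}).get("waiting")
            if waiting:
                reason = waiting.get("reason", "unknown")
                issues.append((pod_name, container["name"], reason))
    return issues


if __name__ == "__main__":
    for row in pods_with_issues(json.loads(SAMPLE)):
        print(*row)
```

Against a live cluster, the same function could be fed `json.load(sys.stdin)` with the kubectl output piped in.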

 

Thank you

/michael

From: onap-discuss@... <onap-discuss@...> On Behalf Of Syed Atif Husain
Sent: Tuesday, November 27, 2018 12:03 AM
To: onap-discuss@...; bw372p@...
Subject: Re: [onap-discuss] Has anyone had success in restarting the #kubernetes cluster after a power outage with an #OOM #Beijing ONAP

 


This email and the information contained herein is proprietary and confidential and subject to the Amdocs Email Terms of Service, which you may review at https://www.amdocs.com/about/email-terms-of-service


Mike Elliott
 

Jobs are used in ONAP primarily to perform one-time database initialization. A limitation of using a Job is that it runs to completion and will not be restarted. If there are Pods that depend on the completion of a Job, those Pods will become stuck indefinitely, as was observed here when the infrastructure failed. Unfortunately, the only course of action is to reinstall the Helm Charts that apply the above-mentioned Job/Dependency pattern.
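
To see which charts are affected, it may help to list the Jobs that never reached completion. The sketch below assumes the JSON shape printed by `kubectl get jobs --all-namespaces -o json` and treats a Job with no explicit `completions` value as needing one success, which is a simplification. The function name `incomplete_jobs` and both Job names in the sample are hypothetical.

```python
import json

# Made-up sample in the shape of `kubectl get jobs -o json` output.
# Both Job names are hypothetical.
SAMPLE = """
{"items": [
  {"metadata": {"name": "onap-portal-db-config"},
   "spec": {"completions": 1},
   "status": {"active": 1}},
  {"metadata": {"name": "example-finished-job"},
   "spec": {"completions": 1},
   "status": {"succeeded": 1}}
]}
"""


def incomplete_jobs(job_list: dict) -> list:
    """Return the names of Jobs with fewer successes than requested."""
    names = []
    for job in job_list.get("items", []):
        # Simplification: a missing or null `completions` is treated as 1.
        wanted = job.get("spec", {}).get("completions") or 1
        # `succeeded` is absent from the status until a pod has succeeded.
        succeeded = job.get("status", {}).get("succeeded", 0)
        if succeeded < wanted:
            names.append(job["metadata"]["name"])
    return names


if __name__ == "__main__":
    for name in incomplete_jobs(json.loads(SAMPLE)):
        print(name)
```

Each name printed points at a Job whose dependent Pods are likely still waiting, and therefore at a chart that is a candidate for reinstalling.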

 

Some of this behavior may have been corrected in Casablanca, but I suspect there still may be ONAP components that suffer from this. I encourage you to raise defects against the ONAP components that failed to restart after the power outage. This will help drive the need for better resiliency testing in future releases. A production-grade platform is a priority for the OOM team. In the Dublin release our team will use feedback like this to further push this agenda.  

 

Thanks,

Mike

 

--

Mike Elliott

ONAP OOM PTL

Amdocs Senior Architect

 

 



Victor Morales <victor.morales@...>
 

Hey there,

 

Maybe this can help with this topic: I found that Kured[1] helps to drain the workers before they’re restarted. It is installed through Kubernetes objects and supports several K8s versions.

 

Regards,

Victor Morales

 

[1] https://github.com/weaveworks/kured

 

From: <onap-discuss@...> on behalf of Mike Elliott <mike.elliott@...>
Reply-To: "onap-discuss@..." <onap-discuss@...>, "mike.elliott@..." <mike.elliott@...>
Date: Tuesday, November 27, 2018 at 7:13 AM
To: "onap-discuss@..." <onap-discuss@...>, "Syed_ah@..." <Syed_ah@...>, "bw372p@..." <bw372p@...>
Subject: Re: [onap-discuss] Has anyone had success in restarting the #kubernetes cluster after a power outage with an #OOM #Beijing ONAP

 
