APPC install error: It may not be safe to bootstrap the cluster from this node. #appc



Dear community,

After more than 2 months (with Casablanca), the APPC Ansible server pod went into CrashLoopBackOff state. I tried to update its deployment, without success. One day later, other APPC pods went into CrashLoopBackOff state too.

Now I am trying with a simple deployment that consists only of the APPC service.
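
(By "simple deployment" I mean redeploying via OOM with only APPC enabled; roughly something like the command below, although the exact override values are not important for this issue:

  helm deploy dev local/onap --namespace onap --set appc.enabled=true
)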

Below is the state of the pods from the Kubernetes dashboard:

[dashboard screenshot showing the APPC pod states]

Describing "dev-appc-appc-db-0":

ubuntu@rancher:~/oom/kubernetes$ kubectl describe pod/dev-appc-appc-db-0 -n onap
Name:           dev-appc-appc-db-0
Namespace:      onap
Node:           k8s-dev/10.0.0.31
Start Time:     Wed, 27 Mar 2019 14:16:01 +0000
Labels:         app=dev-appc-appc-db
                controller-revision-hash=dev-appc-appc-db-7cf88ff6b7
                statefulset.kubernetes.io/pod-name=dev-appc-appc-db-0
Annotations:    pod.alpha.kubernetes.io/initialized=true
Status:         Running
IP:             10.42.5.166
Controlled By:  StatefulSet/dev-appc-appc-db
Init Containers:
  mariadb-galera-prepare:
    Container ID:  docker://610ebd70b379ac6e7a223ba7f4dfbb7b6becf2a8965f0850ad8b9cf11f28b520
    Image:         nexus3.onap.org:10001/busybox
    Image ID:      docker-pullable://nexus3.onap.org:10001/busybox@sha256:4415a904b1aca178c2450fd54928ab362825e863c0ad5452fd020e92f7a6a47e
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      chown -R 27:27 /var/lib/mysql
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 27 Mar 2019 14:16:14 +0000
      Finished:     Wed, 27 Mar 2019 14:16:14 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/mysql from dev-appc-appc-db-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-sbhz4 (ro)
Containers:
  appc-db:
    Container ID:   docker://03cfd676d76fc852dd7c75fb795e0223c731b4f4a22ee903f9784df70737d080
    Image:          nexus3.onap.org:10001/adfinissygroup/k8s-mariadb-galera-centos:v002
    Image ID:       docker-pullable://nexus3.onap.org:10001/adfinissygroup/k8s-mariadb-galera-centos@sha256:fbcb842f30065ae94532cb1af9bb03cc6e2acaaf896d87d0ec38da7dd09a3dde
    Ports:          3306/TCP, 4444/TCP, 4567/TCP, 4568/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP, 0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 27 Mar 2019 14:19:42 +0000
      Finished:     Wed, 27 Mar 2019 14:19:45 +0000
    Ready:          False
    Restart Count:  5
    Liveness:       exec [mysqladmin ping] delay=30s timeout=5s period=10s #success=1 #failure=3
    Readiness:      exec [/usr/share/container-scripts/mysql/readiness-probe.sh] delay=15s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAMESPACE:        onap (v1:metadata.namespace)
      MYSQL_USER:           my-user
      MYSQL_PASSWORD:       <set to the key 'user-password' in secret 'dev-appc-appc-db'>  Optional: false
      MYSQL_DATABASE:       my-database
      MYSQL_ROOT_PASSWORD:  <set to the key 'db-root-password' in secret 'dev-appc-appc-db'>  Optional: false
    Mounts:
      /etc/localtime from localtime (ro)
      /var/lib/mysql from dev-appc-appc-db-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-sbhz4 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  dev-appc-appc-db-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  dev-appc-appc-db-data-dev-appc-appc-db-0
    ReadOnly:   false
  localtime:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/localtime
    HostPathType:
  default-token-sbhz4:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-sbhz4
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age              From               Message
  ----     ------            ----             ----               -------
  Warning  FailedScheduling  4m (x3 over 4m)  default-scheduler  pod has unbound PersistentVolumeClaims
  Normal   Scheduled         4m               default-scheduler  Successfully assigned onap/dev-appc-appc-db-0 to k8s-dev
  Normal   Pulling           4m               kubelet, k8s-dev   pulling image "nexus3.onap.org:10001/busybox"
  Normal   Pulled            4m               kubelet, k8s-dev   Successfully pulled image "nexus3.onap.org:10001/busybox"
  Normal   Created           4m               kubelet, k8s-dev   Created container
  Normal   Started           4m               kubelet, k8s-dev   Started container
  Normal   Pulled            3m (x4 over 4m)  kubelet, k8s-dev   Container image "nexus3.onap.org:10001/adfinissygroup/k8s-mariadb-galera-centos:v002" already present on machine
  Normal   Created           3m (x4 over 4m)  kubelet, k8s-dev   Created container
  Normal   Started           3m (x4 over 4m)  kubelet, k8s-dev   Started container
  Warning  BackOff           3m (x9 over 4m)  kubelet, k8s-dev   Back-off restarting failed container


And here are the logs from the "appc-db" container:
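(Retrieved with something like: kubectl logs dev-appc-appc-db-0 -c appc-db -n onap)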

+ CONTAINER_SCRIPTS_DIR=/usr/share/container-scripts/mysql
+ EXTRA_DEFAULTS_FILE=/etc/my.cnf.d/galera.cnf
+ '[' -z onap ']'
+ echo 'Galera: Finding peers'
Galera: Finding peers
++ hostname -f
++ cut -d. -f2
+ K8S_SVC_NAME=appc-dbhost
+ echo 'Using service name: appc-dbhost'
+ cp /usr/share/container-scripts/mysql/galera.cnf /etc/my.cnf.d/galera.cnf
Using service name: appc-dbhost
+ /usr/bin/peer-finder -on-start=/usr/share/container-scripts/mysql/configure-galera.sh -service=appc-dbhost
2019/03/27 15:27:45 Peer list updated
was []
now [dev-appc-appc-db-0.appc-dbhost.onap.svc.cluster.local]
2019/03/27 15:27:45 execing: /usr/share/container-scripts/mysql/configure-galera.sh with stdin: dev-appc-appc-db-0.appc-dbhost.onap.svc.cluster.local
2019/03/27 15:27:45
2019/03/27 15:27:46 Peer finder exiting
+ '[' '!' -d /var/lib/mysql/mysql ']'
+ exec mysqld
2019-03-27 15:27:46 139879155882240 [Note] mysqld (mysqld 10.1.24-MariaDB) starting as process 1 ...
2019-03-27 15:27:47 139879155882240 [Note] WSREP: Read nil XID from storage engines, skipping position init
2019-03-27 15:27:47 139879155882240 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
2019-03-27 15:27:47 139879155882240 [Note] WSREP: wsrep_load(): Galera 25.3.20(r3703) by Codership Oy <info@...> loaded successfully.
2019-03-27 15:27:47 139879155882240 [Note] WSREP: CRC-32C: using hardware acceleration.
2019-03-27 15:27:47 139879155882240 [Note] WSREP: Found saved state: 84b0f5c0-12b6-11e9-a817-1b6ad3281ac6:-1, safe_to_bootsrap: 0
2019-03-27 15:27:47 139879155882240 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = dev-appc-appc-db-0.appc-dbhost.onap.svc.cluster.local; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_
2019-03-27 15:27:47 139879155882240 [Note] WSREP: GCache history reset: old(84b0f5c0-12b6-11e9-a817-1b6ad3281ac6:0) -> new(84b0f5c0-12b6-11e9-a817-1b6ad3281ac6:-1)
2019-03-27 15:27:47 139879155882240 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
2019-03-27 15:27:47 139879155882240 [Note] WSREP: wsrep_sst_grab()
2019-03-27 15:27:47 139879155882240 [Note] WSREP: Start replication
2019-03-27 15:27:47 139879155882240 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
2019-03-27 15:27:47 139879155882240 [ERROR] WSREP: It may not be safe to bootstrap the cluster from this node. It was not the last one to leave the cluster and may not contain all the updates. To force cluster bootstrap with this node, edit the grastate.dat file manually and set safe_to_bootstrap to 1 .
2019-03-27 15:27:47 139879155882240 [ERROR] WSREP: wsrep::connect(gcomm://) failed: 7
2019-03-27 15:27:47 139879155882240 [ERROR] Aborting

It seems to be related to the "safe_to_bootstrap" feature (some info here). I am not sure how to change this parameter, nor why deploying APPC now hits this issue (I am using the same images, same environment, everything; just deploy/undeploy).
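
From that error message, I understand the file to touch would be /var/lib/mysql/grastate.dat on the pod's persistent volume (the same /var/lib/mysql mounted from the dev-appc-appc-db-data PVC shown above). A rough sketch of what I have in mind, assuming the file can still be reached while the container keeps crashing (e.g. directly on the node backing the PV, or from a temporary pod mounting the same claim):

  # hypothetical; the path depends on where the persistent volume actually lives
  sed -i 's/safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /var/lib/mysql/grastate.dat

But I am not sure whether that is safe or appropriate for an OOM/APPC deployment, hence this question.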

Would appreciate some help!

Kind regards,
Xoan

