One of our internal production is running an OpenShift 3.7 cluster. As OKD 3.10 is rolling out (and renamed from OpenShift Origin), 3.7 is far too old and missing some useful new features.
"Let's get it upgraded!" I shouted out, and nightmares started too to pour out.

Day 1: Let's Upgrade, Ansibly

OKD is managed heavily dependent on Ansible scripts. By the documents, upgrading OpenShift could be done in either in-place way or blue-green way. In the latter process, control planes, or masters in our cluster, are upgraded first to the next release, then a number of nodes with blue color label is added into the cluster. Then the pods are evacuate to blue nodes, and old nodes are deleted from the cluster.

Blue-green upgrade should be a safer way. Since it is the first time I upgrade OpenShift cluster, I would like to run the ansible scripts to upgrade masters first. Since the upgrade process could not skip a release, running the script would only bump version to the next release. There was, of course, an exception which is from 3.7 to 3.9 because 3.8 hasn't been officially released. Running upgrade on OpenShift 3.7 will let the script upgrade from 3.7 to 3.8, then again from 3.8 to 3.9 automatically.

Unfortunatly, the upgrade is incredibly slow, and failed at TASK [Upgrade all storage] for the first time. OpenShift Web Console and all the pods are still working, however. So I just retried running the ansible script before bed.

And that was the first day.

Day 2: Versions in Chaos

When I returned to the terminal, everyting looks fine - and still the same for me. Everything is working and familiar until I checked the version.

Masters and nodes are running the different version, as expected. But OpenShift masters are upgraded to 3.10 instead of expected 3.9.

It was a mess! I quickly confirmed that openshift-ansible repository is on release-3.9 branch and openshift_version is set to 3.9. I ran the ansible playbook to upgrade nodes, but since the masters are running 3.10, nodes are not able to be upgraded to 3.9. Neither them to be upgraded to 3.10 because they are not running 3.10!

Day 4: Rebuild Masters and Node Config Maps

OpenShift 3.10 has a new openshift-node project and node configmaps to bootstrap nodes. Therefore, every node must specify a openshift_node_group_name property in inventory files. Somehow by magic, This cluster with OKD 3.10 masters is runnning without it! I had to fix it first by running

$ ansible-playbook -i <inventory_file> openshift_node_group.yml

and then I could perform most other operations.

Day 5: Scale-up and Replace

To fix the version inconsistency between nodes and masters, I came out with an new idea to scale-up the cluster and knock out old nodes. While pods could be evacuated to other nodes, this method should just work like blue-green upgrade. And it worked. For at least a while.

Day 8: Certificates and LDAP

Our LDAP server is fored to be accessed under TLS, and the server certificate is signed by our own CA. So we need to download the CA certificate, and specify oauthConfig:identityProviders:provider:ca to the file. After adding the line and restart master API, logging is working.

But there is still a thing strange: we haven't done this before. How did it work? From OKD 3.10, masters is running containerized. Thus, system CA roots could not be used as default in the container. We have to specify one if we need a CA certificate.

Day 9: Logs and Terminal Outage

Logs and Terminal is was still not available. I googled and checked for masters' log and found a huge amount of x509: certificate signed by unknown authority error. I redepolyed the certificate using:

$ ansible-playbook -i <inventory_file> /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/redeploy-certificates.yml

However, this caused etcd failed to start. By checking the nodes, I surprisingly found that the playbook wrongly set the permission of /etc/etcd directory and its certificates. I corrected the permission and restarted etcd, and things went fine then.

# chmod +x /etc/etcd
# chmod 644 /etc/etcd/ca.crt /etc/etcd/peer.* /etc/etcd/server.*

After redeploying the certificate, OpenShift web console will constantly return 502. OpenShift Origin issue #20005 show a similar problem which caused by openshift-ansible not regenerating the certificate of web console. Simply delete the secret and running pods will solve the problem:

$ oc delete secret webconsole-serving-cert
$ oc delete pods webconsole=true

Day 10: Rebuild Registry, Router and Metrics

Most services seemed to be running, but the cluster could not schedule new pods with Image Pull Back-off error. I tried to build a image to test the functionality of the built-in docker registry and it failed pushing docker image. I checked the registry console but it worked as a charm and images were out there!

I decided to rebuild the whole default project. I deleted docker-registry deploymentconfig, and tried to follow Setting up the Registry chatper in the document with

$ oc adm registry --config=admin.kubeconfig --service-account=registry

but it still didn't work. I tried pulling image on a node and got HTTP response to a HTTPS client. I quickly realized that I need to follow the rest of the procedures to secure and expose the registry which I didn't before. Is there any automatical way?

Yes! There is actually a openshift-hosted part in openshift-ansible. We could utilize it to deploy the registry with console and router within 1-click. I deleted all of them, set up variables and ran:

$ ansible-playbook -i <inventory_file> playbooks/openshift-hosted/config.yml
$ ansible-playbook -i <inventory_file> playbooks/openshift-hosted/deploy-router.yml

In my deployment, there was a unsupported parameter IMAGE_NAME in roles/cockpit-ui/tasks/main.yml. It would be fine to find the line and delete it.

And I was happy to see that they are up and working.

Day 12: Logs and Terminal Outage, Again

Logs is down again without a known reason. By checking the logs on running nodes, I found the following error:

http: TLS handshake error from 10.128.4.69:46722: no serving certificate available for the kubelet
http: TLS handshake error from 10.128.4.69:46740: no serving certificate available for the kubelet

It's strange that the TLS handshake failure is between the user and node but not host and node.

Through googling, I found a bug on Red Hat BZ#1571515 related to it. It should be fixed in latest version of openshift-ansible by PR#9711.

To verify the bug, use

$ oc get csr
NAME        AGE       REQUESTOR             CONDITION
csr-25tvg   8h        system:node:infra0    Pending
csr-29mhr   12h       system:node:infra0    Pending
csr-2cswz   17h       system:node:master0   Pending
csr-2gj4h   22h       system:node:infra1    Pending
csr-2h8sx   4h        system:node:master1   Pending
csr-2hf4t   10h       system:node:infra1    Pending
csr-2hzhc   8h        system:node:master1   Pending
csr-2j4jq   11h       system:node:infra0    Pending
csr-2jcnl   20h       system:node:infra1    Pending
....

It would print all the CSRs where we could find all of them are in pending conditon. To make it work properly, we need to approve all the CSRs.

$ oc get csr | grep Pending | awk -F ' ' '{print $1}' | xargs oc adm certificate approve

Ending and Conclusion

After all this week, OpenShift is working fine though some services needs to be rebuilt. But what I want to say most is, it is extremely important to keep OpenShift cluster up-to-date with the latest major version, or just never touch it after a success deployment if the deployment type is origin. If you did and met problems, I hope this post would help a little.

Happy K8S and wish you a happy ops life!

Day 1: Let's Upgrade, Ansibly

Day 2: Versions in Chaos

Day 4: Rebuild Masters and Node Config Maps

Day 5: Scale-up and Replace

Day 8: Certificates and LDAP

Day 9: Logs and Terminal Outage

Day 10: Rebuild Registry, Router and Metrics

Day 12: Logs and Terminal Outage, Again

Ending and Conclusion

Related Articles

Working on Fedora Sway Spin

Install FreeDOS - but Manually

Proxmox VE for Workstation: Install Windows with GPU Passthrough

Proxmox VE for Workstation: Configure GPU Passthrough

Labels in OpenShift 3 Web Console

Why does Nextcloud not Work with MariaDB>10.6?