Round 1: etcdmain: read .wal: input/output error
2021-11-21 20:43:04.262587 I | embed: initial cluster =
2021-11-21 20:43:04.280202 W | wal: ignored file 000000000000035c-00000000045d83a6.wal.broken in wal
2021-11-21 20:43:04.280256 W | wal: ignored file 0000000000000375-0000000004764aca.wal.broken in wal
2021-11-21 20:43:08.251726 C | etcdmain: read /var/lib/rancher/etcd/member/wal/0000000000000375-0000000004764aca.wal: input/output error
We can see that the missing file has been renamed from .wal to .wal.broken. We could rename the file back to .wal and try again.
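A minimal sketch of that rename, assuming etcd is already stopped and using the WAL path from the log above:
# cd /var/lib/rancher/etcd/member/wal
# mv 0000000000000375-0000000004764aca.wal.broken 0000000000000375-0000000004764aca.wal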
Round 2: etcdmain: walpb: crc mismatch
2021-11-21 20:50:19.312596 I | embed: initial cluster =
2021-11-21 20:50:19.323572 W | wal: ignored file 000000000000035c-00000000045d83a6.wal.broken in wal
2021-11-21 20:50:19.507366 C | etcdmain: walpb: crc mismatch
To solve this problem, one option is to save and restore a snapshot.
Before starting, let's prepare some environment variables and an etcdctl alias:
# export NODE_NAME=container-1-paas-central-sakuragawa-cloud
# export KUBE_CA=/etc/kubernetes/ssl/kube-ca.pem
# export ETCD_PEM=/etc/kubernetes/ssl/kube-etcd-$NODE_NAME.pem
# export ETCD_KEY=/etc/kubernetes/ssl/kube-etcd-$NODE_NAME-key.pem
# export ETCD_SERVER=https://container-2:2379
# alias etcdctl="etcdctl --cert=$ETCD_PEM --key=$ETCD_KEY --cacert=$KUBE_CA --endpoints=$ETCD_SERVER"
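As a quick sanity check that the alias, certificates, and endpoint all work (member list is a standard etcdctl subcommand; output omitted here):
# etcdctl member list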
Save a snapshot from one of the working nodes:
# etcdctl snapshot save snapshot.db
{"level":"info","ts":1637576476.7098856,"caller":"snapshot/v3_snapshot.go:68","msg":"created temporary db file","path":"snapshot.db.part"}
{"level":"info","ts":1637576476.722268,"logger":"client","caller":"v3/maintenance.go:211","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":1637576476.7223496,"caller":"snapshot/v3_snapshot.go:76","msg":"fetching snapshot","endpoint":"https://container-2:2379"}
{"level":"info","ts":1637576478.531543,"logger":"client","caller":"v3/maintenance.go:219","msg":"completed snapshot read; closing"}
{"level":"info","ts":1637576479.3304148,"caller":"snapshot/v3_snapshot.go:91","msg":"fetched snapshot","endpoint":"https://container-2:2379","size":"25 MB","took":"2 seconds ago"}
{"level":"info","ts":1637576479.3305554,"caller":"snapshot/v3_snapshot.go:100","msg":"saved","path":"snapshot.db"}
Snapshot saved at snapshot.db
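Optionally, we can verify the snapshot before using it. snapshot status is a standard etcdctl subcommand (newer releases steer you toward etcdutl, but it still works) that prints the hash, revision, total keys, and size:
# etcdctl snapshot status snapshot.db --write-out=table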
Then stop the etcd server and restore the snapshot:
# docker stop etcd
# mv /var/lib/etcd/ /var/lib/etcd.old/
# etcdctl snapshot restore snapshot.db --data-dir=/var/lib/etcd/
Round 3: Restore from Snapshot (Again)
After etcd starts, it does not come back to the cluster as expected:
2021-11-22 11:46:37.003240 E | rafthttp: request cluster ID mismatch (got e8061948cdccadb7 want cdf818194e3a8c32)
First, stop the node agent (again) and the etcd container, then delete all the local etcd data:
# docker stop etcd
# rm -f /var/lib/etcd/member/wal/*
# rm -f /var/lib/etcd/member/snap/*
To restore the node, we need some parameters from the original node. We can run docker inspect etcd and collect the following from its Args and Env (see the commands after this list):
--name
--initial-cluster
--initial-cluster-token
--initial-advertise-peer-urls
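A quick way to pull those values out of the existing container, a sketch assuming the container is named etcd as above (in docker inspect output, Args holds the command-line flags and Config.Env the environment):
# docker inspect etcd --format '{{json .Args}}'
# docker inspect etcd --format '{{json .Config.Env}}'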
Then use these parameters when restoring from a snapshot:
# etcdctl snapshot restore snapshot.db \
--cert=$ETCD_PEM \
--key=$ETCD_KEY \
--cacert=$KUBE_CA \
--name=etcd-container-1 \
--initial-cluster="etcd-container-1=https://container-1:2380,etcd-container-2=https://container-2:2380" \
--initial-cluster-token="etcd-cluster-1" \
--initial-advertise-peer-urls="https://container-1:2380" \
--data-dir /var/lib/etcd
Deprecated: Use `etcdutl snapshot restore` instead.
2021-11-22T21:07:11+08:00 info netutil/netutil.go:112 resolved URL Host {"url": "https://container-1:2380", "host": "container-1:2380", "resolved-addr": "192.168.1.21:2380"}
2021-11-22T21:07:11+08:00 info netutil/netutil.go:112 resolved URL Host {"url": "https://container-1:2380", "host": "container-1:2380", "resolved-addr": "192.168.1.21:2380"}
2021-11-22T21:07:11+08:00 info snapshot/v3_snapshot.go:251 restoring snapshot {"path": "snapshot-20211122.db", "wal-dir": "/var/lib/etcd/member/wal", "data-dir": "/var/lib/etcd", "snap-dir": "/var/lib/etcd/member/snap", "stack": "go.etcd.io/etcd/etcdutl/v3/snapshot.(*v3Manager).Restore\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/etcdutl/snapshot/v3_snapshot.go:257\ngo.etcd.io/etcd/etcdutl/v3/etcdutl.SnapshotRestoreCommandFunc\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/etcdutl/etcdutl/snapshot_command.go:147\ngo.etcd.io/etcd/etcdctl/v3/ctlv3/command.snapshotRestoreCommandFunc\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/etcdctl/ctlv3/command/snapshot_command.go:128\ngithub.com/spf13/cobra.(*Command).execute\n\t/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:856\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:960\ngithub.com/spf13/cobra.(*Command).Execute\n\t/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:897\ngo.etcd.io/etcd/etcdctl/v3/ctlv3.Start\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/etcdctl/ctlv3/ctl.go:107\ngo.etcd.io/etcd/etcdctl/v3/ctlv3.MustStart\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/etcdctl/ctlv3/ctl.go:111\nmain.main\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/etcdctl/main.go:59\nruntime.main\n\t/home/remote/sbatsche/.gvm/gos/go1.16.3/src/runtime/proc.go:225"}
2021-11-22T21:07:12+08:00 info membership/store.go:141 Trimming membership information from the backend...
2021-11-22T21:07:12+08:00 info membership/cluster.go:421 added member {"cluster-id": "e8061948cdccadb7", "local-member-id": "0", "added-peer-id": "94447210df55f5df", "added-peer-peer-urls": ["https://container-2:2380"]}
2021-11-22T21:07:12+08:00 info membership/cluster.go:421 added member {"cluster-id": "e8061948cdccadb7", "local-member-id": "0", "added-peer-id": "e089ac38e0c2282f", "added-peer-peer-urls": ["https://container-1:2380"]}
2021-11-22T21:07:12+08:00 info snapshot/v3_snapshot.go:272 restored snapshot {"path": "snapshot-20211122.db", "wal-dir": "/var/lib/etcd/member/wal", "data-dir": "/var/lib/etcd", "snap-dir": "/var/lib/etcd/member/snap"}
Finally, start etcd and the node will be back in the cluster:
$ docker start etcd
$ docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','") etcd etcdctl endpoint status --write-out table
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://container-2:2379 | 94447210df55f5df | 3.4.14 | 25 MB | true | false | 82444 | 75080617 | 75080617 | |
| https://container-1:2379 | e089ac38e0c2282f | 3.4.14 | 25 MB | false | false | 82444 | 75080617 | 75080617 | |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+