Table of Contents
- Overview
- What is etcd?
- Quick Look at etcd Pod on Kubernetes
- Backup etcd Data Using etcdctl
- Automate etcd Data Backup
- Summary
Overview
When it comes to operating Kubernetes, cluster backup is one of the key operations in our backup strategy. We need to properly back up and secure the application data, which is typically stored in persistent volumes. On the other hand, we also need to back up the etcd data, because it is the operational data store that Kubernetes relies on to function. In this post we look at how to schedule etcd data backups using out-of-the-box Kubernetes features.
Just for your information, there are 2 ways to back up the Kubernetes etcd data. You can choose either a volume snapshot or the etcd built-in snapshot.
If your etcd is running on a storage volume that supports backup, you can schedule the etcd data backup by taking a snapshot of the storage volume.
Another approach is to use the etcd built-in snapshot feature provided by the etcdctl tool. This will be the main focus of the discussion in this post, and it is the approach I chose to back up my Kubernetes cluster on RPI4.
One of the reasons for choosing this approach is that my etcd instances run with their data stored on the operating system disk instead of a storage volume. In addition, I need a lightweight and simple approach to back up my etcd data without introducing too much workload to my Kubernetes cluster.
The etcdctl tool is lightweight and simple!
What is etcd?
etcd (pronounced et-see-dee) is an open source, distributed, consistent key-value store for shared configuration, service discovery, and scheduler coordination of distributed systems or clusters of machines. etcd helps to facilitate safer automatic updates, coordinates work being scheduled to hosts, and assists in the set up of overlay networking for containers.
etcd is the primary datastore of Kubernetes, the de-facto standard system for container orchestration. By using etcd, cloud-native applications can maintain more consistent uptime and remain working. etcd distributes configuration data providing redundancy and resiliency for the configuration of Kubernetes nodes.
In other words, if your etcd data is corrupted or lost, with a total loss of the etcd cluster, your Kubernetes cluster will not be able to continue running. In the worst case, without a proper backup, you will have to re-install your Kubernetes cluster from scratch.
Quick Look at etcd Pod on Kubernetes
Let’s take a quick look at the etcd details we need to know before we proceed to perform the data snapshot using the etcdctl tool.
When using the default installation approach, as I did, one etcd Pod is deployed on each Kubernetes control plane node. In my case, I have 3 control plane nodes, thus I have 3 etcd Pods.
Here is how they look in the kube-system namespace.
$ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
...
etcd-kube0.internal 1/1 Running 4 (37d ago) 63d
etcd-kube1.internal 1/1 Running 4 (37d ago) 63d
etcd-kube2.internal 1/1 Running 4 63d
...
Let's take a closer look at one of the etcd Pods.
We can see that the Pod is using HostPath volumes for both the PKI certs (/etc/kubernetes/pki/etcd) and the data (/var/lib/etcd).
This tells us exactly where the PKI certs are located on the Kubernetes control plane node. We need these certificates when we use the etcdctl tool to perform the snapshot.
$ kubectl describe pod etcd-kube0.internal -n kube-system
Name: etcd-kube0.internal
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Node: kube0.internal/10.0.0.110
Start Time: Fri, 10 Dec 2021 16:59:46 +0800
Labels: component=etcd
tier=control-plane
Annotations: kubeadm.kubernetes.io/etcd.advertise-client-urls: https://10.0.0.110:2379
kubernetes.io/config.hash: f57e628f54788b476c77d3e2010f5878
kubernetes.io/config.mirror: f57e628f54788b476c77d3e2010f5878
kubernetes.io/config.seen: 2021-12-06T10:54:08.812868241Z
kubernetes.io/config.source: file
seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status: Running
IP: 10.0.0.110
IPs:
IP: 10.0.0.110
Controlled By: Node/kube0.internal
...
...
Command:
etcd
--advertise-client-urls=https://10.0.0.110:2379
--cert-file=/etc/kubernetes/pki/etcd/server.crt
--client-cert-auth=true
--data-dir=/var/lib/etcd
--initial-advertise-peer-urls=https://10.0.0.110:2380
--initial-cluster=kube0.internal=https://10.0.0.110:2380
--key-file=/etc/kubernetes/pki/etcd/server.key
--listen-client-urls=https://127.0.0.1:2379,https://10.0.0.110:2379
--listen-metrics-urls=http://127.0.0.1:2381
--listen-peer-urls=https://10.0.0.110:2380
--name=kube0.internal
--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
--peer-client-cert-auth=true
--peer-key-file=/etc/kubernetes/pki/etcd/peer.key
--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
--snapshot-count=10000
--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
...
...
Mounts:
/etc/kubernetes/pki/etcd from etcd-certs (rw)
/var/lib/etcd from etcd-data (rw)
...
...
Volumes:
etcd-certs:
Type: HostPath (bare host directory volume)
Path: /etc/kubernetes/pki/etcd
HostPathType: DirectoryOrCreate
etcd-data:
Type: HostPath (bare host directory volume)
Path: /var/lib/etcd
HostPathType: DirectoryOrCreate
...
We also observe that the Command stanza provides the public IP address that we can use to access the etcd Pod directly, via the value of --advertise-client-urls.
Now, let's proceed to the backup procedure in the next section.
Backup etcd Data Using etcdctl
I have come across 2 approaches for using the etcdctl tool to back up the etcd data via snapshots.
Based on this understanding and experience, I also cover a 3rd approach: creating your own etcdctl container and running it with a CronJob so that you can schedule the backup periodically.
Before we proceed further, let's look at the etcdctl command parameters for snapshot.
Let’s run the following command to print the command usage.
$ kubectl -n kube-system exec -it etcd-kube0.internal -- sh -c 'ETCDCTL_API=3 etcdctl snapshot save -h'
NAME:
snapshot save - Stores an etcd node backend snapshot to a given file
USAGE:
etcdctl snapshot save <filename> [flags]
OPTIONS:
-h, --help[=false] help for save
GLOBAL OPTIONS:
--cacert="" verify certificates of TLS-enabled secure servers using this CA bundle
--cert="" identify secure client using this TLS certificate file
--command-timeout=5s timeout for short running command (excluding dial timeout)
--debug[=false] enable client-side debug logging
--dial-timeout=2s dial timeout for client connections
-d, --discovery-srv="" domain name to query for SRV records describing cluster endpoints
--discovery-srv-name="" service name to query when using DNS discovery
--endpoints=[127.0.0.1:2379] gRPC endpoints
--hex[=false] print byte strings as hex encoded strings
--insecure-discovery[=true] accept insecure SRV records describing cluster endpoints
--insecure-skip-tls-verify[=false] skip server certificate verification (CAUTION: this option should be enabled only for testing purposes)
--insecure-transport[=true] disable transport security for client connections
--keepalive-time=2s keepalive time for client connections
--keepalive-timeout=6s keepalive timeout for client connections
--key="" identify secure client using this TLS key file
--password="" password for authentication (if this option is used, --user option shouldn't include password)
--user="" username[:password] for authentication (prompt if password is not supplied)
-w, --write-out="simple" set the output format (fields, json, protobuf, simple, table)
From the command usage above, we need at least the --endpoints, --cacert, --cert and --key options. The minimum command will look like the following.
ETCDCTL_API=3 etcdctl --endpoints <etcd endpoint> \
snapshot save <filename with path> \
--cacert=<ca.crt> \
--cert=<server.crt> \
--key=<server.key>
This forms the command baseline that we need to perform the snapshot.
You can refer to the etcdctl usage documentation here.
Let's look at how we can use this information in the next sections.
Using a Local Container
In this approach, we execute the etcdctl tool in a container on our local PC and connect to the Kubernetes etcd instance.
For this scenario, we need to copy the certificates and key from the respective etcd instance to our local PC.
Let's ssh into one of the Kubernetes control plane nodes. We can see that the following certificates and keys are available.
$ sudo ls -al /etc/kubernetes/pki/etcd/
total 44
drwxr-xr-x 3 root root 4096 Jan 27 05:00 .
drwxr-xr-x 3 root root 4096 Dec 6 10:53 ..
-rw-r--r-- 1 root root 1086 Dec 6 10:53 ca.crt
-rw------- 1 root root 1675 Dec 6 10:53 ca.key
-rw-r--r-- 1 root root 1159 Dec 6 10:53 healthcheck-client.crt
-rw------- 1 root root 1679 Dec 6 10:53 healthcheck-client.key
-rw-r--r-- 1 root root 1212 Dec 6 10:53 peer.crt
-rw------- 1 root root 1675 Dec 6 10:53 peer.key
-rw-r--r-- 1 root root 1212 Dec 6 10:53 server.crt
-rw------- 1 root root 1675 Dec 6 10:53 server.key
Let's proceed to create a local directory on our PC to store the required certificates and key.
$ mkdir etcd-backup
We need to copy ca.crt, server.crt and server.key to the etcd-backup directory we created above.
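Assuming you have SSH access to the control plane node (the hostname kube0.internal below comes from my cluster; replace the user and host with yours), the copy could look something like this. sudo is needed on the node because the key is only readable by root.
# Pull the CA cert, server cert and key from the node into the local etcd-backup directory.
$ ssh <user>@kube0.internal "sudo cat /etc/kubernetes/pki/etcd/ca.crt" > etcd-backup/ca.crt
$ ssh <user>@kube0.internal "sudo cat /etc/kubernetes/pki/etcd/server.crt" > etcd-backup/server.crt
$ ssh <user>@kube0.internal "sudo cat /etc/kubernetes/pki/etcd/server.key" > etcd-backup/server.key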
Next, let's proceed to run the etcd container to perform the snapshot. Make sure that you are using the same etcd container version as the one used in your Kubernetes environment.
$ docker run -it --rm -v $(pwd):/backup \
k8s.gcr.io/etcd:3.5.0-0 sh -c \
'ETCDCTL_API=3 etcdctl --endpoints https://10.0.0.110:2379 \
snapshot save /backup/etcd-kube0.backup \
--cacert="/backup/ca.crt" \
--cert="/backup/server.crt" \
--key="/backup/server.key" '
You can find out the etcd container version by running the respective container engine command on the control plane node. I am using the containerd engine, and the following is the output of the crictl command.
$ sudo crictl images
...
IMAGE TAG IMAGE ID SIZE
...
k8s.gcr.io/etcd 3.5.0-0 2252d5eb703b0 158MB
...
On successful execution of the etcd container, you should see output like the following, indicating success.
{"level":"info","ts":1643260094.5110564,"caller":"snapshot/v3_snapshot.go:68","msg":"created temporary db file","path":"/backup/etcd-kube0.backup.part"}
{"level":"info","ts":1643260094.5735552,"logger":"client","caller":"v3/maintenance.go:211","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":1643260094.5736437,"caller":"snapshot/v3_snapshot.go:76","msg":"fetching snapshot","endpoint":"https://10.0.0.110:2379"}
{"level":"info","ts":1643260095.5915964,"logger":"client","caller":"v3/maintenance.go:219","msg":"completed snapshot read; closing"}
{"level":"info","ts":1643260095.598498,"caller":"snapshot/v3_snapshot.go:91","msg":"fetched snapshot","endpoint":"https://10.0.0.110:2379","size":"12 MB","took":"1 second ago"}
{"level":"info","ts":1643260095.6047354,"caller":"snapshot/v3_snapshot.go:100","msg":"saved","path":"/backup/etcd-kube0.backup"}
Copy this etcd-kube0.backup file into a secured backup storage that you have.
Repeat the same procedure for the rest of your etcd instances.
Note: Remember to remove the copied certificates and key files once you are done, so that no one can get hold of them for malicious access.
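For example:
# Remove the local copies of the certificates and key once the backup is done.
$ rm -f etcd-backup/ca.crt etcd-backup/server.crt etcd-backup/server.key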
The advantage of this approach is that the snapshot file is created locally, so you can copy it to your backup storage easily. On the downside, we need to copy the certificates and key from the Kubernetes nodes.
Using the etcd Pod in Kubernetes
In this approach, we run the snapshot using the etcdctl tool that resides in the etcd Pod.
Let's start by ssh-ing into the Kubernetes control plane node and creating a backup copy of the directory /etc/kubernetes/pki/etcd in case anything goes wrong.
Proceed to create a backup directory under /etc/kubernetes/pki/etcd.
To make things simple, we are going to create our snapshot inside the /etc/kubernetes/pki/etcd directory.
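On the control plane node, those preparation steps could look like the following (the name and location of the safety copy is just an example):
# Keep a safety copy of the PKI directory in case anything goes wrong.
$ sudo cp -a /etc/kubernetes/pki/etcd /root/etcd-pki-copy
# Create the backup directory that will hold the snapshot.
$ sudo mkdir -p /etc/kubernetes/pki/etcd/backup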
Next, let’s run the following command to execute the snapshot.
$ kubectl -n kube-system exec -it \
etcd-kube1.internal -- sh -c \
'ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 \
snapshot save /etc/kubernetes/pki/etcd/backup/etcd-kube1.backup \
--cacert="/etc/kubernetes/pki/etcd/ca.crt" \
--cert="/etc/kubernetes/pki/etcd/server.crt" \
--key="/etc/kubernetes/pki/etcd/server.key" '
The above command will use the etcdctl tool inside the etcd Pod and create a snapshot at the location indicated (/etc/kubernetes/pki/etcd/backup/etcd-kube1.backup).
Next, we can copy this snapshot file into our secured backup storage.
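Since /etc/kubernetes/pki/etcd is a HostPath mount, the snapshot file also lives on the control plane node itself, so one simple way to pull it to your machine is over SSH (user and host are placeholders; sudo may be needed because the file is owned by root):
$ ssh <user>@kube1.internal "sudo cat /etc/kubernetes/pki/etcd/backup/etcd-kube1.backup" > ./etcd-kube1.backup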
Repeat the same steps for each etcd instance on each Kubernetes control plane node.
The advantage of this approach is that you do not need to copy the certificates and key from the control plane nodes. The downside is that you need to modify the /etc/kubernetes/pki/etcd/ directory to create a backup folder, and you still need to copy the snapshot file to your backup storage.
Automate etcd Data Backup
In this section, we are going to use what we have learned in the previous sections to create our own etcdctl container and use a Kubernetes CronJob to schedule a Pod that snapshots the etcd data.
Instead of snapshotting each etcd instance's data individually, we are going to create a script that performs the snapshot for multiple etcd instances in one go.
The etcd Pods' Command stanza contains the information we need, and we can use the kubectl command to retrieve it in a format that is easy to parse.
$ kubectl get pods etcd-kube0.internal -n kube-system -o=jsonpath='{.spec.containers[0].command}' | jq
[
"etcd",
"--advertise-client-urls=https://10.0.0.110:2379",
"--cert-file=/etc/kubernetes/pki/etcd/server.crt",
"--client-cert-auth=true",
"--data-dir=/var/lib/etcd",
"--initial-advertise-peer-urls=https://10.0.0.110:2380",
"--initial-cluster=kube0.internal=https://10.0.0.110:2380",
"--key-file=/etc/kubernetes/pki/etcd/server.key",
"--listen-client-urls=https://127.0.0.1:2379,https://10.0.0.110:2379",
"--listen-metrics-urls=http://127.0.0.1:2381",
"--listen-peer-urls=https://10.0.0.110:2380",
"--name=kube0.internal",
"--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt",
"--peer-client-cert-auth=true",
"--peer-key-file=/etc/kubernetes/pki/etcd/peer.key",
"--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt",
"--snapshot-count=10000",
"--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt"
]
From the output above, instead of hardcoding the values, we can use the kubectl command to obtain the certificates, key and IP address of the respective etcd instance. This can then be used in the etcdctl container that we are going to create.
Obviously we need the kubectl tool inside the container in order to obtain this information. To parse the retrieved JSON, we also need the jq command.
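As an illustration of the kind of query the backup script will run later, the following extracts the client URL of a single etcd Pod (the Pod name is from my cluster):
# Prints the value of --advertise-client-urls, e.g. https://10.0.0.110:2379 for this Pod.
$ kubectl get pods etcd-kube0.internal -n kube-system \
  -o=jsonpath='{.spec.containers[0].command}' \
  | jq -r '.[] | select(startswith("--advertise-client-urls")) | split("=")[1]'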
The following Dockerfile summarises everything we need to build our container.
FROM alpine:3.15.0
ENV ETCD_VERSION="v3.5.0"
ENV ETCD_PATH="/etcd"
ENV ETCD_ENDPOINTS=""
ENV BAKCUP_PATH="/backup"
ENV ETCD_PKI_HOSTPATH="/etc/kubernetes/pki/etcd"
ENV ETCD_CACERT="ca.crt"
ENV ETCD_SERVER_CERT="server.crt"
ENV ETCD_SERVER_KEY="server.key"
ENV LOG_DIR="${BAKCUP_PATH}/logs"
ENV TZ="Etc/GMT"
RUN apk add --no-cache --virtual .build-deps git go
RUN apk add --no-cache bash jq tzdata \
&& apk update \
&& mkdir -p ${ETCD_PATH} \
&& mkdir -p /etcd-source \
&& git clone -b ${ETCD_VERSION} https://github.com/etcd-io/etcd.git /etcd-source \
&& cd /etcd-source \
&& ./build.sh \
&& cp /etcd-source/bin/etcdctl ${ETCD_PATH}/ \
&& rm -rf /etcd-source \
&& chmod -R +x ${ETCD_PATH}/* \
&& mkdir -p ${LOG_DIR} \
&& chown -R root:root ${BAKCUP_PATH} ${ETCD_PATH} ${LOG_DIR} \
&& cp /usr/share/zoneinfo/$TZ /etc/localtime
RUN apk del .build-deps
COPY --chown=root:root ./run-backup.sh ${ETCD_PATH}/
WORKDIR ${ETCD_PATH}/
USER root
CMD "${ETCD_PATH}/run-backup.sh"
From the Dockerfile above, you can see that we are building the etcdctl binary from source to support the respective container architecture. This Dockerfile provides the definition to build the etcd-backup-base base image.
Because server.key is only readable by the root user, as we can see from the directory listing earlier, we will have to run the container as root.
Another option is to run a dummy container as root that copies the certificates and key into a shared volume, following the initContainer pattern. The etcdctl container can then access these certificates and the key from the shared volume. However, we are not going down that path here.
Next, we are going to create a bash script to implement our etcd data snapshot logic. You can refer to the complete bash file here. The following shows an excerpt of the bash script.
function backup(){
  ETCD_PODS_NAME=$(kubectl get pod -l component=etcd -o jsonpath="{.items[*].metadata.name}" -n kube-system)
  for etcd in $ETCD_PODS_NAME
  do
    log "Starting snapshot for $etcd ... "
    COMMANDS=$(kubectl get pods $etcd -n kube-system -o=jsonpath='{.spec.containers[0].command}')
    for row in $(echo "${COMMANDS}" | jq -r '.[]'); do
      if [[ ${row} = --advertise-client-urls* ]]; then
        ADVERTISED_CLIENT_URL=$(paramValue ${row} "=")
        log "ADVERTISED_CLIENT_URL = ${ADVERTISED_CLIENT_URL}"
      elif [[ ${row} = --cert-file* ]]; then
        ETCD_SERVER_CERT=$(paramValue ${row} "=")
        log "ETCD_SERVER_CERT = ${ETCD_SERVER_CERT}"
      elif [[ ${row} = --key-file* ]]; then
        ETCD_SERVER_KEY=$(paramValue ${row} "=")
        log "ETCD_SERVER_KEY = ${ETCD_SERVER_KEY}"
      elif [[ ${row} = --trusted-ca-file* ]]; then
        ETCD_CACERT=$(paramValue ${row} "=")
        log "ETCD_CACERT = ${ETCD_CACERT}"
      fi
    done
    cp ${ETCD_CACERT} /tmp/ca.crt && cp ${ETCD_SERVER_CERT} /tmp/server.crt && cp ${ETCD_SERVER_KEY} /tmp/server.key
    TIMESTAMP=$(date '+%Y-%m-%d-%H-%M-%s')
    log "Backing up $etcd ... Snapshot file: ${BAKCUP_PATH}/$etcd-${TIMESTAMP} ..."
    OUTPUT=$( (ETCDCTL_API=3 ${ETCD_PATH}/etcdctl --endpoints ${ADVERTISED_CLIENT_URL} \
      snapshot save ${BAKCUP_PATH}/$etcd-${TIMESTAMP} \
      --cacert="/tmp/ca.crt" \
      --cert="/tmp/server.crt" \
      --key="/tmp/server.key") 2>&1 )
    log "${OUTPUT}"
  done
}
In the bash script above, we use the kubectl command to retrieve the certificates, key and IP address from each etcd Pod's Command stanza.
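The script relies on a small paramValue helper that is not shown in the excerpt; the real implementation is in the linked bash file, but a minimal sketch of such a helper could look like this:
# Hypothetical sketch: return the portion of a "--flag=value" string after the given delimiter.
function paramValue(){
  echo "$1" | cut -d "$2" -f2-
}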
Let's proceed to build our container. You need to create a multi-arch builder profile if you have not already done so. Please refer here on how to build multi-architecture containers using Docker.
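If you have not created one yet, a typical buildx setup looks like this (the builder name is arbitrary):
# Create a buildx builder that can produce multi-architecture images and select it.
$ docker buildx create --name multiarch-builder --use
$ docker buildx inspect --bootstrap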
The following command builds the base image for the amd64 and arm64 architectures and pushes it to docker.io. You can also push it to your own internal registry.
$ docker buildx build --platform linux/arm64,linux/amd64 -t chengkuan/etcd-backup-base:1.0.0 -f Dockerfile.alpine --push .
Using this etcd-backup-base base image, we build the amd64 (Dockerfile.alpine.amd64) and arm64 (Dockerfile.alpine.arm64) etcd-backup container images with the kubectl binary for the matching architecture. Please refer to the README.md for the complete steps to build the container images.
Next, we can proceed to deploy the container using the image onto Kubernetes.
Let’s first create the necessary definition using the YAML file below.
We need to create a PersistentVolumeClaim to store the snapshots.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  namespace: etcd-backup
  labels:
    app: etcd-backup
    app-group: etcd-backup
  name: etcd-backup-snapshot-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
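Note that the resources above target an etcd-backup namespace; if your full YAML does not already create it, create it first:
$ kubectl create namespace etcd-backup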
In order to allow our etcd-backup Pod to access the etcd Pods in the kube-system namespace, we need to define a ClusterRole and bind it to the default ServiceAccount via a ClusterRoleBinding.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: etcd-backup
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: etcd-backup
subjects:
  - kind: ServiceAccount
    name: default
    namespace: etcd-backup
roleRef:
  kind: ClusterRole
  name: etcd-backup
  apiGroup: rbac.authorization.k8s.io
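Once the YAML has been applied (see the kubectl apply step below), you can verify the binding with an impersonation check like this:
# Should print "yes" once the ClusterRoleBinding is in place.
$ kubectl auth can-i list pods -n kube-system --as=system:serviceaccount:etcd-backup:default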
Next, we need to create the definition for the Kubernetes CronJob, configure the schedule and define the container. Please note that the CronJob timezone is currently UTC+0:00; there is no option to set the timezone for a CronJob, as reported here.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: etcd-backup
  labels:
    app: etcd-backup
    app-group: etcd-backup
spec:
  # Change the schedule here. Currently scheduled to run at 12:00AM UTC.
  schedule: "0 0 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: etcd-backup
              # Change this image according to your OS architecture.
              image: chengkuan/etcd-backup:arm64-1.0.0
              imagePullPolicy: IfNotPresent
              resources:
                requests:
                  cpu: 100m
                  memory: 100Mi
              volumeMounts:
                # The etcd PKI location on the control plane node
                - mountPath: /etc/kubernetes/pki/etcd
                  name: etcd-certs
                  readOnly: true
                # The etcdctl snapshot location. Also the log file location.
                - mountPath: /backup
                  name: snapshot-dir
          restartPolicy: OnFailure
          volumes:
            - hostPath:
                path: /etc/kubernetes/pki/etcd
                type: Directory
              name: etcd-certs
            - name: snapshot-dir
              persistentVolumeClaim:
                claimName: etcd-backup-snapshot-pvc
Note: You can refer to the complete YAML file at the GitHub here.
Proceed to run the following command to deploy the components.
$ kubectl apply -f etcd-backup.yaml
You should see the following components deployed.
$ kubectl get all -n etcd-backup
NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE
cronjob.batch/etcd-backup 0 0 * * * False 0 120m 24h
You can run the following command to immediately test the container.
# Create a job to test the Cronjob
$ kubectl create job testjob --from=cronjob/etcd-backup -n etcd-backup
# View the log
$ kubectl logs <pod-name>
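The Job's Pod carries a job-name label, so you can locate it and read its logs like this:
# List the Pod created by the test Job and view its logs.
$ kubectl get pods -n etcd-backup -l job-name=testjob
$ kubectl logs -n etcd-backup -l job-name=testjob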
You can now schedule a storage volume snapshot at the storage server where the PVC (etcd-backup-snapshot-pvc) is configured.
This approach completely automates your Kubernetes etcd data backup on a schedule determined by you. Now you can grab a coffee and enjoy your day.
Note: Please always refer to the GitHub for complete container build and deployment guide.
Summary
I hope this post gives you a good idea of how we can utilize existing tools from the open source community to back up your Kubernetes etcd data.
The last approach that I presented is the most elegant and simple: you can schedule the etcd data snapshots and then periodically back up the PV at your storage server. In addition, it provides logging so that you can always check back if anything goes wrong.
You should be able to deploy this onto OpenShift without problems. There is a built-in backup script in OpenShift 4, which opens up the possibility of using that script with this container image in OpenShift.
Feel free to contribute and enhance the source code. Please let me know how this is helping you and if you have any feedback.