Anthos 1.2 GKE-OP backup script issues

4 February 2020, Konrad Clapa

To back up your Anthos GKE-OP cluster, Google provides a nice script that you can schedule to run as a cron job.

https://cloud.google.com/gke-on-prem/docs/how-to/backing-up

The only problem is that it is missing the last line, the one that actually copies the snapshot of the admin cluster etcd database.

Add the line below to make it work:

kubectl --kubeconfig=${ADMIN_CLUSTER_KUBECONFIG} cp kube-system/${admin_etcd}:admin_snapshot.db $BACKUP_DIR/

Once added, it works like a charm 🙂

ubuntu@admin-workstation2:~/backup$ ls
admin_snapshot.db gke-03-usercluster01_snapshot.db pki

In the output you might see some errors related to the issue I explained here: https://gcpfellow.com/2020/02/04/anthos-1-x-issue-when-running-sudo-commands/, but you can ignore them.
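
Since the script usually runs unattended from cron, it can help to check that the snapshot really landed in the backup directory. The following is only a sketch: verify_snapshot is a hypothetical helper, not part of Google's script, and it assumes the file name admin_snapshot.db and the $BACKUP_DIR variable used above.

```shell
# Hypothetical helper, not part of Google's backup script.
# Succeeds only if the snapshot file exists and is non-empty.
verify_snapshot() {
    snapshot="$1"
    if [ -s "$snapshot" ]; then
        echo "OK: $snapshot"
    else
        echo "MISSING OR EMPTY: $snapshot" >&2
        return 1
    fi
}

# Possible use at the end of the backup script:
#   verify_snapshot "$BACKUP_DIR/admin_snapshot.db" || exit 1
```

A non-zero exit code makes the cron job fail visibly instead of silently leaving an incomplete backup.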

Anthos 1.x issue when running sudo commands

4 February 2020, Konrad Clapa

When you log in to GKE-OP nodes and try to run a sudo command, you will get the following warning:

sudo: unable to resolve host [nodename]

Your command will still execute but will show this warning. It is related to Ubuntu OS settings. To resolve it add the following line into the /etc/hosts file on the node:

127.0.0.1 [node-name]
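
If you need to apply this on several nodes, the edit can be scripted so that it is safe to run more than once. This is a minimal sketch; ensure_hosts_entry is a hypothetical helper name, and it assumes the node name is what the hostname command prints.

```shell
# Hypothetical helper: add "127.0.0.1 <node name>" to a hosts file unless the
# name already appears as a word on an uncommented line.
ensure_hosts_entry() {
    hosts_file="$1"
    node_name="$2"
    if grep -v '^[[:space:]]*#' "$hosts_file" | grep -qwF -- "$node_name"; then
        return 0   # entry already present, nothing to do
    fi
    printf '127.0.0.1 %s\n' "$node_name" >> "$hosts_file"
}

# On a node, as root:
#   ensure_hosts_entry /etc/hosts "$(hostname)"
```

Like the manual edit, this change lives only on the node, so it has to be reapplied if the node is recreated.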

I hope this will be fixed soon, as Google has already identified the issue. I guess they will add the record during the provisioning process for the nodes.

Professional Cloud DevOps Engineer Certificate tips

2 February 2020 (updated 13 March 2020), Konrad Clapa

The DevOps Engineer exam is finally GA! I took the Beta in December 2019 and received the good news that I am now certified! Here are a couple of tips you can use to get this certification.

The starting point is the official Google site: https://cloud.google.com/certification/cloud-devops-engineer.

In the exam guide you can find basic info on the content: https://cloud.google.com/certification/guides/cloud-devops-engineer/

I must admit that the exam guide is very accurate, and mastering all the topics in it is the key to nailing this exam.

The main topics to concentrate on are:

SRE ca. 30% of the exam:

https://www.coursera.org/learn/site-reliability-engineering-slos

The books at https://landing.google.com/sre/books/ might be a great resource, but the Coursera course is good enough to cover topics like SLO, SLI, SLA, post-mortem reports, error budgets, blameless culture etc.

Stackdriver ca. 30% of the exam:

Make sure you know all the APM services, log-based metrics and the integration with Fluentd. Understand advanced log filtering.

Go through all the docs under: https://cloud.google.com/stackdriver/docs/

and Qwiklabs: https://www.qwiklabs.com/quests/35?catalog_rank=%7B%22rank%22%3A2%2C%22num_filters%22%3A0%2C%22has_search%22%3Atrue%7D&search_id=4476323

Beginners can find a good summary of Stackdriver in the Monitoring chapter of our book.

DevOps ca. 20% of the exam:

Have a basic understanding of tools like Terraform, Jenkins, Cloud Build, Spinnaker and Deployment Manager. I suggest finding some 101 videos on YouTube, e.g. https://www.youtube.com/results?search_query=spinnaker+demo

GKE ca. 10% of the exam:

https://www.coursera.org/learn/google-kubernetes-engine

https://www.qwiklabs.com/quests/63?catalog_rank=%7B%22rank%22%3A1%2C%22num_filters%22%3A0%2C%22has_search%22%3Atrue%7D&search_id=4476352

and for beginners: https://www.qwiklabs.com/quests/29?catalog_rank=%7B%22rank%22%3A3%2C%22num_filters%22%3A0%2C%22has_search%22%3Atrue%7D&search_id=4476352

I would say that, if you already have Professional Cloud Architect level knowledge, you should need 40-80 hours to prepare for this exam.

Good luck to all of you and let me know if you have any questions!

Google Cloud Certified Fellow

30 January 2020, Konrad Clapa

Today the Google Cloud Certified Fellow program was launched. I am happy to announce that I am recognised as Fellow #5.

It is a certification outside of the standard Professional program and is directed at technical leaders working with Anthos. This is how Google describes it:

„The Google Cloud Certified Fellow program is for elite cloud architects and technical leaders who are experts in designing enterprise solutions. This certification program recognizes individuals with deep technical expertise who can translate business requirements into technical solutions using Anthos and Google Cloud.

The Hybrid Multi-cloud Certification is the first certification in this program and assesses both technical skills and business expertise. Achieving this certification demonstrates your leadership, business impact, and technical acumen, as well as your ability to:

• Design hybrid and multi-cloud solution architectures with Anthos
• Design for security and compliance
• Provision a solution infrastructure
• Optimize technical and business processes
• Ensure solution and operations reliability”

I will post more information on the program soon. In the meantime you can get more info here: https://cloud.google.com/certification/hybrid-multi-cloud

Anthos 1.2 Time Sync issues on GKE On-Prem nodes

9 January 2020, Konrad Clapa

When you log in to your GKE-OP nodes, you might find that the time is synced with your ESXi host rather than with the time server configured in the DHCP options or static IP files used for provisioning the GKE-OP clusters.

This issue is actually related to Ubuntu 18 and the settings of the systemd-timesyncd service.

To see if you are experiencing the issue, run:

sudo SYSTEMD_LOG_LEVEL=debug /lib/systemd/systemd-timesyncd

ubuntu@gke-03-user0103:~$ sudo SYSTEMD_LOG_LEVEL=debug /lib/systemd/systemd-timesyncd
Failed to create state directory: Permission denied
ubuntu@gke-03-user0103:~$ sudo SYSTEMD_LOG_LEVEL=debug /lib/systemd/systemd-timesyncd
Added new server 172.20.101.25.
Added new server ntp.ubuntu.com.
Selected server 172.20.101.25.
Resolving 172.20.101.25…
Resolved address 172.20.101.25:123 for 172.20.101.25.
Selected address 172.20.101.25:123 of server 172.20.101.25.
Connecting to time server 172.20.101.25:123 (172.20.101.25).
Sent NTP request to 172.20.101.25:123 (172.20.101.25).
Server has too large root distance. Disconnecting.
Waiting after exhausting servers.

Root cause: network delay pushes the root distance reported for the NTP server above the limit that timesyncd accepts (RootDistanceMaxSec, 5 seconds by default), so the server is rejected with "Server has too large root distance".

Solution: There is no permanent solution for this issue, as the NTP settings are created when the nodes are deployed using the DHCP or static IP files. You can only fix this issue after your nodes are deployed, and the settings will be lost when you redeploy.

To work around this issue, edit the timesyncd.conf file and set RootDistanceMaxSec=20 (you might need to experiment to find the sweet spot).

sudo cat /etc/systemd/timesyncd.conf

[Time]
#NTP=
#FallbackNTP=ntp.ubuntu.com
RootDistanceMaxSec=20
#PollIntervalMinSec=32
#PollIntervalMaxSec=2048

Now check that the connection works fine:
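
Because the setting is lost on every redeploy, it may be worth scripting the edit. The sketch below is an assumption-laden example: set_root_distance is a hypothetical helper, it relies on GNU sed (as shipped with Ubuntu), and it expects the option to sit in the [Time] section as in the file above.

```shell
# Hypothetical helper: set RootDistanceMaxSec in a timesyncd.conf-style file.
# Rewrites an existing line (commented out or not), otherwise appends one.
set_root_distance() {
    conf_file="$1"
    seconds="$2"
    if grep -q '^#\{0,1\}RootDistanceMaxSec=' "$conf_file"; then
        sed -i "s/^#\{0,1\}RootDistanceMaxSec=.*/RootDistanceMaxSec=$seconds/" "$conf_file"
    else
        printf 'RootDistanceMaxSec=%s\n' "$seconds" >> "$conf_file"
    fi
}

# On a node, as root:
#   set_root_distance /etc/systemd/timesyncd.conf 20
#   systemctl restart systemd-timesyncd   # so the service rereads the file
```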

ubuntu@gke-03-user0103:~$ sudo SYSTEMD_LOG_LEVEL=debug /lib/systemd/systemd-timesyncd
Added new server 172.20.101.25.
…
freq offset : +0 (0 ppm)
interval/delta/delay/jitter/drift 64s/+0.033s/0.001s/0.000s/+0ppm
Synchronized to time server 172.20.101.25:123 (172.20.101.25).
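
When checking many nodes, reading the debug output by eye gets tedious. As a rough sketch, the two messages quoted in this article can be turned into a one-word verdict; timesync_verdict is a hypothetical helper and it only recognises those two log lines.

```shell
# Hypothetical helper: read systemd-timesyncd debug output on stdin and
# print a one-word verdict based on the two messages quoted in this article.
timesync_verdict() {
    log=$(cat)
    case "$log" in
        *"Synchronized to time server"*) echo "synced" ;;
        *"Server has too large root distance"*) echo "root-distance-too-large" ;;
        *) echo "unknown" ;;
    esac
}

# Possible use; the daemon keeps running, so cap it with timeout:
#   sudo SYSTEMD_LOG_LEVEL=debug timeout 15 /lib/systemd/systemd-timesyncd 2>&1 | timesync_verdict
```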

Summary of 2019

31 December 2019, Konrad Clapa

It has been a great year for me, with some major goals achieved. I am very thankful to everyone who made this come true!

Book – special thanks to my co-author @Brian Gerrad. Only you know what the true cost of writing this book was 🙂

Professional Cloud Architect – Google Cloud Certification Guide

Certifications – transcript

Professional Cloud Architect ’18
Professional Data Engineer ’18
Associate Cloud Engineer
Professional Cloud Developer
Professional Cloud Network Engineer
Professional Cloud Security Engineer
Professional DevOps Engineer (Beta results pending)
There is one more that will be announced in January… cannot wait!

Conferences

BitConf Speaker – link
vBrownBags Speaker – link
Google Next San Francisco ’19
Google Next London ’19
Google Developer Group Leads Lisbon
GSI Champions Conference in Sunnyvale

Google Developer Group Cloud Bydgoszcz

5 Meetups this year with around 50 participants each!

4 Onsite Meetups
1 Online Meetup

AtoS

GSI Champion
Google Cloud Platform Learning Ambassador
Started developing Anthos on DPC/DHC
Decided to stay with the company despite an offer from one of my top 5 companies to work for.

Missed goals

GCP Certified Trainer – lack of time
Google Developer Expert – building portfolio
Cloud Guru Instructor – lack of time

Goals 2020

Installing Istio on GKE-OP for Anthos

30 December 2019 (updated 3 January 2020), Konrad Clapa

THIS ARTICLE IS STILL UNDER DEVELOPMENT

GKE-OP 1.1.2 supports open source Istio version 1.1.13. To perform the installation you need a user cluster that is already installed and validated. The installation procedure can be found here: https://archive.istio.io/v1.1/docs/setup/kubernetes/install/helm/

In this article we will show how to install Istio and a simple microservice application. We will generate some traffic to that application and visualise the flows with Kiali.

The high level steps are as follows:

install Helm
deploy Istio CRDs
deploy Istio
expose Telemetry services
install BookInfo application

All the steps are performed from the admin workstation.

Installing Helm

Download Helm by running:

curl https://get.helm.sh/helm-v2.16.1-linux-amd64.tar.gz --output helm-v2.16.1-linux-amd64.tar.gz

Unpack it, move the binary to the bin folder and check that it reports its version:

tar -zxvf helm-v2.16.1-linux-amd64.tar.gz

mv linux-amd64/helm /usr/local/bin/helm

helm version --client

Install CRDs

helm template install/kubernetes/helm/istio-init --name istio-init --namespace istio-system | kubectl apply -f -

Set up Kiali credentials

KIALI_USERNAME=$(read -p 'Kiali Username: ' uval && echo -n $uval | base64)

KIALI_PASSPHRASE=$(read -sp 'Kiali Passphrase: ' pval && echo -n $pval | base64)

When prompted, enter the username and passphrase.

NAMESPACE=istio-system

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: kiali
  namespace: $NAMESPACE
  labels:
    app: kiali
type: Opaque
data:
  username: $KIALI_USERNAME
  passphrase: $KIALI_PASSPHRASE
EOF

Install Istio using the demo profile; this includes Kiali, Grafana and Jaeger.

helm template install/kubernetes/helm/istio --name istio --namespace istio-system \
  --values install/kubernetes/helm/istio/values-istio-demo.yaml | kubectl apply -f -

Check that the services and pods are running

kubectl get service -n istio-system

kubectl get pods -n istio-system

Edit the Istio ingress gateway service to assign an IP address to the Istio gateway.

kubectl edit svc -n istio-system istio-ingressgateway

and add the loadBalancerIP field under spec:

spec:
  loadBalancerIP: <IP_Address>
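
If you prefer not to open an editor, the same field can be set with kubectl patch. This is a sketch rather than the documented procedure: lb_ip_patch is a hypothetical helper that only builds the patch body and does a loose format check on the address.

```shell
# Hypothetical helper: build the JSON patch body for spec.loadBalancerIP.
# Rejects anything that is not four dot-separated groups of 1-3 digits.
lb_ip_patch() {
    ip="$1"
    if ! printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
        echo "not an IPv4 address: $ip" >&2
        return 1
    fi
    printf '{"spec":{"loadBalancerIP":"%s"}}\n' "$ip"
}

# Possible use, with the address that appears in the Kiali URL in this article:
#   kubectl patch svc istio-ingressgateway -n istio-system -p "$(lb_ip_patch 172.16.15.111)"
```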

Check that the IP is assigned

kubectl get service -n istio-system

Expose Kiali service

For reference you can use: https://istio.io/docs/tasks/observability/gateways/

cat <<EOF | kubectl apply -f -
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: kiali-gateway
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 15029
      name: http-kiali
      protocol: HTTP
    hosts:
    - "*"
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: kiali-vs
  namespace: istio-system
spec:
  hosts:
  - "*"
  gateways:
  - kiali-gateway
  http:
  - match:
    - port: 15029
    route:
    - destination:
        host: kiali
        port:
          number: 20001
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: kiali
  namespace: istio-system
spec:
  host: kiali
  trafficPolicy:
    tls:
      mode: DISABLE
---
EOF

Connect to Kiali at http://172.16.15.111:15029/kiali/ (replace the address with the IP you assigned to the ingress gateway).

Deploy the application

kubectl apply -f <(istioctl kube-inject -f samples/bookinfo/platform/kube/bookinfo.yaml)

watch kubectl get pods

kubectl apply -f samples/bookinfo/networking/bookinfo-gateway.yaml
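
To have something to look at in Kiali, the application needs traffic. One simple way, sketched here under assumptions, is to request the Bookinfo /productpage path in a loop. generate_traffic is a hypothetical helper, and it assumes the ingress gateway serves the sample over plain HTTP at the address you assigned to it.

```shell
# Hypothetical helper: request the Bookinfo product page N times.
# CURL can be overridden to dry-run the loop without a cluster.
generate_traffic() {
    gateway="$1"   # host or host:port of the Istio ingress gateway
    count="$2"
    i=0
    while [ "$i" -lt "$count" ]; do
        ${CURL:-curl} -s -o /dev/null "http://$gateway/productpage"
        i=$((i + 1))
    done
}

# Possible use:
#   generate_traffic <IP_Address> 100
```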

Issues with F5 BIG-IP load balancer in GKE-OP Anthos 1.x – K8s APIs not responding

30 December 2019, Konrad Clapa

When using the F5 BIG-IP load balancer with GKE On-Prem, you might want to use an evaluation license. Keep in mind that this license restricts total bandwidth to 2MBps. Even with a single user cluster, GKE-OP can saturate it and make the K8s API respond slowly. With multiple clusters and Istio installed, the API can stop responding altogether. Note that F5 might not show that the bandwidth is saturated when you use the CLI tools.

Resolution: use a full license or request a 10GBps evaluation license.

Problems creating pre-check VM in Anthos 1.2 GKE-OP

30 December 2019, Konrad Clapa

Anthos 1.2 has a new feature that creates a test VM to check connectivity before you deploy your GKE-OP clusters. It helps to avoid issues during the installation.

When installing GKE On-Prem using the following documentation: https://cloud.google.com/gke-on-prem/docs/how-to/install-dhcp, you perform the checks with the following command

gkectl check-config --config [PATH_TO_CONFIG]

and you will get an error like the one below:

Validation Category: F5 BIG-IP
- [FAILURE] Admin Cluster VIP and NodeIP: Failed to create VM: failed to create VM (not retriable): failed to find VM template "gke-on-prem-osimage-1.14.7-gke.24mage-1.14.7-gke.24-20191120-f71f9a709b' not found
- [FAILURE] User Cluster VIP and NodeIP: Failed to create VM: failed to create VM (not retriable): failed to find VM template "gke-on-prem-osimage-1.14.7-gke.24-age-1.14.7-gke.24-20191120-f71f9a709b' not found

Root cause: This is caused by the OS image not being present on the datastore. The installation steps in the GCP docs are in the wrong order.

Solution: first run

gkectl prepare --config [CONFIG_FILE] --validate-attestations

After that the VMs get created and the connectivity checks can be performed.
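
To keep the order right, the two commands from this post can be wrapped so that the checks only run once the prepare step has succeeded. This is just a sketch; run_precheck and the GKECTL override are hypothetical names introduced for illustration.

```shell
# Hypothetical wrapper: upload the OS image first, then run the checks.
# GKECTL can be overridden to dry-run the sequence without gkectl installed.
run_precheck() {
    config="$1"
    ${GKECTL:-gkectl} prepare --config "$config" --validate-attestations || return 1
    ${GKECTL:-gkectl} check-config --config "$config"
}

# Possible use:
#   run_precheck [PATH_TO_CONFIG]
```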

Anthos 1.2 is GA

23 December 2019 (updated 3 January 2020), Konrad Clapa

Anthos 1.2 is GA, with the new GKE On-Prem supporting vSphere 6.7 Update 3.

The release notes for each of the components can be found below:

Google Cloud Certified Fellow #5

All about Google Cloud Platform and Anthos

Author: Konrad Clapa