Building a k8s + Kubeflow 1.7 Machine Learning Platform on a Single Machine from Scratch
Preface
Kubeflow is a machine learning platform built on top of k8s, covering the development, training, tuning, deployment, and management stages of machine learning. Since I am doing everything on a single machine, the k8s environment is stood up quickly with kind. Kind is a tool that creates a local k8s cluster by simulating nodes with docker containers. In short: Kubeflow depends on k8s, and the k8s cluster created by kind lives inside docker containers.
1. Preparing the base environment
- CentOS version: CentOS Linux release 7.6.1810
- Docker version: 24.0.7
- kind version: 0.17.0 [note: kind 0.17.0 uses k8s v1.25.3 by default]
- kubectl version: v1.25.3 [note: client and server versions should ideally match]
- kustomize version: v5.0.3
- Kubernetes version: v1.25.3
2. Installing Docker
- Remove old versions (optional)
yum remove -y docker docker-common docker-selinux docker-engine
- Install the required packages
yum install -y yum-utils device-mapper-persistent-data lvm2
- Add a domestic yum mirror for faster downloads
yum-config-manager --add-repo http://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo
- Install the three Docker components
yum install docker-ce docker-ce-cli containerd.io
- docker-ce builds, runs, and manages containers; it mainly consists of the Docker Daemon (server) and the Docker Client.
- docker-ce-cli is Docker's command-line tool. Through the CLI you interact with the Docker engine and run the commands used to build, manage, and monitor containers.
- containerd.io provides the core functionality for container lifecycle management, image transfer, and image storage.
- Start Docker and enable it at boot
sudo systemctl start docker   # start it now
sudo systemctl enable docker  # start on boot
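A quick sanity check that the installation worked (just a sketch; the hello-world smoke test needs access to Docker Hub or a configured mirror):
docker version                    # both the client and server sections should appear
sudo docker run --rm hello-world  # optional smoke test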
3. Installing kind
kind is just a single binary, so download it and drop it into a bin directory.
Pitfall: match the kind release (and thus the k8s version it creates) to your CentOS setup, otherwise cluster creation may fail, because kind relies on the corresponding cgroup configuration when creating the cluster.
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.17.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind
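To confirm the binary is in place (the output below is roughly what I would expect for this release):
kind version
# prints something like: kind v0.17.0 go1.19.2 linux/amd64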
4. Installing kubectl and kustomize
Kind only creates the cluster (and writes the kubeconfig for it); to operate the cluster afterwards you need to install kubectl yourself.
# install the latest kubectl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin
# (optional) install a specific kubectl version, provided that release exists
curl -LO https://dl.k8s.io/release/v1.25.3/bin/linux/amd64/kubectl
chmod +x kubectl
sudo mv kubectl /usr/local/bin
# install the matching kustomize version
curl -o install_kustomize.sh "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh"
sh install_kustomize.sh 5.0.3 .
cp kustomize /bin/
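Another small sanity check, assuming both binaries ended up on the PATH:
kubectl version --client
kustomize version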
5. Creating the k8s cluster
docker pull kindest/node:v1.25.3
Then create the cluster, pointing it at the image we just pulled and at the config yaml file:
kind create cluster --config=kind-config.yaml --image kindest/node:v1.25.3 --name kubeflow -v 5
If nothing goes wrong, the kubeflow cluster is ready within a few minutes.
The kind-config.yaml file:
apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
nodes:
- role: control-plane
  extraPortMappings:
  - containerPort: 30000        # port inside the node container
    hostPort: 30000             # port exposed on the host
    listenAddress: "0.0.0.0"    # optional, defaults to "0.0.0.0"
    protocol: tcp               # optional, defaults to tcp
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"
Common commands
# list kind clusters
kind get clusters
# delete a cluster
# syntax: kind delete cluster --name ${clustername}
kind delete cluster --name kubeflow
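A few more commands I find handy when poking at a kind cluster (a sketch; the names follow the kubeflow cluster created above):
# kind registers a kubeconfig context named kind-<cluster-name>
kubectl cluster-info --context kind-kubeflow
# the "nodes" are really docker containers
kubectl get nodes -o wide
docker ps --filter name=kubeflow-control-plane
# open a shell inside the node container
docker exec -it kubeflow-control-plane bash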
Pay special attention here; this becomes a big pitfall later!
Because kind deploys the cluster by simulating nodes with docker containers, it differs from an ordinary cluster in a few ways, mainly:
- File system
  - the kind cluster cannot directly access files on the host
  - the kind cluster cannot directly use images stored on the host
- Network
  - for example, services inside the kind cluster cannot be reached directly from the host
kind provides solutions for all of these, such as image loading, port mapping, and directory mapping; detailed solutions are available at the link, and a quick sketch of the ones used in this post follows.
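For reference, this is roughly how the image and directory problems are handled (a sketch; the extraMounts block is my own assumption about how you might map /data/k8s, adjust the paths to your setup; the port-mapping case is already covered by extraPortMappings in kind-config.yaml above):
# load an image from the host into the cluster nodes
kind load docker-image <image-name:tag> --name kubeflow
And for directory mapping, an extraMounts entry under the node in kind-config.yaml:
  extraMounts:
  - hostPath: /data/k8s
    containerPath: /data/k8s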
6. Installing Kubeflow on a domestic private cloud
6.1 Download the official manifests
Check the official prerequisites: https://github.com/kubeflow/manifests#prerequisites
# official v1.7.0 manifests
wget https://github.com/kubeflow/manifests/archive/refs/tags/v1.7.0.zip
# unzip and rename
unzip v1.7.0.zip
mv manifests-1.7.0/ manifests
6.2 Pull the foreign images locally
Since the private cloud cannot reach the public internet and none of the workarounds I tried earlier solved it, the only option was a domestic mirror source. Many thanks to DaoCloud for the work they have done; it makes accessing foreign images from inside China a lot easier.
# cd into the manifests directory
cd manifests
# list the gcr images; in my case only gcr.io is unreachable, adjust to your own needs
kustomize build example | grep 'image: gcr.io' | awk '$2 != "" { print $2 }' | sort -u
# save the gcr images to pull into image_list.txt
# (the command above prints exactly those images; copy and paste them by hand)
vim image_list.txt
# run the script given below
sh pull_images.sh
# check the local docker images
docker images
The pull_images.sh file:
registry_prefix="m.daocloud.io/"
# read the image list
while IFS= read -r image; do
  # build the full mirrored image path
  full_image="${registry_prefix}${image}"
  # pull the image
  docker pull "$full_image"
  # add any other steps here if you need them
done < image_list.txt
This is the pitfall mentioned in section 5: the pulled images only exist on the host and cannot be used directly from inside the cluster, so they still have to be loaded into the cluster.
# load images into the cluster; for batches you can write a small script similar to the sh file above (see the sketch below), or load them one by one by hand
kind load docker-image m.daocloud.io/gcr.io/knative-releases/knative.dev/serving/cmd/webhook:bc1376 --name kubeflow
# once everything is loaded, check inside the node container; if the list matches the host, all is well
# list the images inside the node
docker exec -it kubeflow-control-plane crictl images
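If you do not want to run kind load by hand for every image, something along these lines works as the batch script mentioned above (a sketch; it assumes all mirrored images were pulled with the m.daocloud.io/ prefix and that the digest-only knative images have already been retagged as described in the next pitfall):
#!/bin/bash
# load every locally pulled m.daocloud.io image into the kind cluster named "kubeflow"
docker images --format '{{.Repository}}:{{.Tag}}' \
  | grep '^m.daocloud.io/' \
  | grep -v '<none>' \
  | while read -r image; do
      kind load docker-image "$image" --name kubeflow
    done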
The second pitfall in this part: the knative-releases images are pulled by digest and end up without a tag, so you have to tag them yourself. Make sure the tag is exactly right, because kustomization.yaml relies on it later.
# tag the image; I use the first six characters of the digest that follows the @ in the original reference
# docker tag <IMAGE ID> <REPOSITORY-name:tag-name>
docker tag 28336a010382 m.daocloud.io/gcr.io/knative-releases/knative.dev/serving/cmd/activator:c3bbf3
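Tagging a dozen knative images by hand gets tedious, so here is a rough script that derives the six-character tag from the digest for every @sha256 entry in image_list.txt (a sketch, assuming image_list.txt still holds the original gcr.io references; double-check the resulting tags against the kustomization.yaml below):
#!/bin/bash
registry_prefix="m.daocloud.io/"
grep '@sha256:' image_list.txt | while IFS= read -r image; do
  repo="${image%%@*}"             # e.g. gcr.io/knative-releases/knative.dev/serving/cmd/activator
  digest="${image##*@sha256:}"    # the full 64-character digest
  docker tag "${registry_prefix}${image}" "${registry_prefix}${repo}:${digest:0:6}"
done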
6.3 Prepare the sc, pv, and pvc
- Create the local directories
mkdir -p /data/k8s/istio-authservice /data/k8s/katib-mysql /data/k8s/minio /data/k8s/mysql-pv-claim
# fix the permissions of the auth path
sudo chmod -R 777 /data/k8s/istio-authservice/
- Write kubeflow-storage.yaml; change each hostPath (e.g. path: "/data/k8s/istio-authservice") to the directories you created above
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: authservice
  namespace: istio-system
  labels:
    type: local
spec:
  storageClassName: local-storage
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/data/k8s/istio-authservice"
---
apiVersion: v1
kind: PersistentVolume
metadata:
  namespace: kubeflow
  name: katib-mysql
  labels:
    type: local
spec:
  storageClassName: local-storage
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/data/k8s/katib-mysql"
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: minio
  namespace: kubeflow
  labels:
    type: local
spec:
  storageClassName: local-storage
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/data/k8s/minio"
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mysql-pv-claim
  namespace: kubeflow
  labels:
    type: local
spec:
  storageClassName: local-storage
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/data/k8s/mysql-pv-claim"
Apply it:
# apply the file to create the sc / pv / pvc
kubectl apply -f kubeflow-storage.yaml
# Another pitfall: when this finishes, check that the NAME and CLAIM columns are matched up correctly.
# If they are not, minio-* and ml-pipeline will fail; you then need to delete the mismatched pv and pvc
# and recreate the pv separately, see the fifth pitfall in 6.6.
# check the created sc / pv / pvc
kubectl get sc
kubectl get pv -n kubeflow
kubectl get pvc -n kubeflow
6.4 Modify the installation manifests
Add an images section to example/kustomization.yaml; this rewrites the image references globally. I also pinned the visualization-server and frontend pods to 2.0.0-alpha.7, because the original manifest pulls the latest images for them.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
sortOptions:
  order: legacy
  legacySortOptions:
    orderFirst:
    - Namespace
    - ResourceQuota
    - StorageClass
    - CustomResourceDefinition
    - MutatingWebhookConfiguration
    - ServiceAccount
    - PodSecurityPolicy
    - Role
    - ClusterRole
    - RoleBinding
    - ClusterRoleBinding
    - ConfigMap
    - Secret
    - Endpoints
    - Service
    - LimitRange
    - PriorityClass
    - PersistentVolume
    - PersistentVolumeClaim
    - Deployment
    - StatefulSet
    - CronJob
    - PodDisruptionBudget
    orderLast:
    - ValidatingWebhookConfiguration
resources:
# Cert-Manager
- ../common/cert-manager/cert-manager/base
- ../common/cert-manager/kubeflow-issuer/base
# Istio
- ../common/istio-1-16/istio-crds/base
- ../common/istio-1-16/istio-namespace/base
- ../common/istio-1-16/istio-install/base
# OIDC Authservice
- ../common/oidc-authservice/base
# Dex
- ../common/dex/overlays/istio
# KNative
- ../common/knative/knative-serving/overlays/gateways
- ../common/knative/knative-eventing/base
- ../common/istio-1-16/cluster-local-gateway/base
# Kubeflow namespace
- ../common/kubeflow-namespace/base
# Kubeflow Roles
- ../common/kubeflow-roles/base
# Kubeflow Istio Resources
- ../common/istio-1-16/kubeflow-istio-resources/base
# Kubeflow Pipelines
- ../apps/pipeline/upstream/env/cert-manager/platform-agnostic-multi-user
# Katib
- ../apps/katib/upstream/installs/katib-with-kubeflow
# Central Dashboard
- ../apps/centraldashboard/upstream/overlays/kserve
# Admission Webhook
- ../apps/admission-webhook/upstream/overlays/cert-manager
# Jupyter Web App
- ../apps/jupyter/jupyter-web-app/upstream/overlays/istio
# Notebook Controller
- ../apps/jupyter/notebook-controller/upstream/overlays/kubeflow
# Profiles + KFAM
- ../apps/profiles/upstream/overlays/kubeflow
# Volumes Web App
- ../apps/volumes-web-app/upstream/overlays/istio
# Tensorboards Controller
- ../apps/tensorboard/tensorboard-controller/upstream/overlays/kubeflow
# Tensorboard Web App
- ../apps/tensorboard/tensorboards-web-app/upstream/overlays/istio
# Training Operator
- ../apps/training-operator/upstream/overlays/kubeflow
# User namespace
- ../common/user-namespace/base
# KServe
- ../contrib/kserve/kserve
- ../contrib/kserve/models-web-app/overlays/kubeflow
images:
- name: gcr.io/ml-pipeline/metadata-envoy:2.0.0-alpha.7
  newName: m.daocloud.io/gcr.io/ml-pipeline/metadata-envoy
  newTag: "2.0.0-alpha.7"
- name: gcr.io/arrikto/kubeflow/oidc-authservice:e236439
  newName: m.daocloud.io/gcr.io/arrikto/kubeflow/oidc-authservice
  newTag: "e236439"
- name: gcr.io/knative-releases/knative.dev/eventing/cmd/controller@sha256:33d78536e9b38dbb2ec2952207b48ff8e05acb48e7d28c2305bd0a0f7156198f
  newName: m.daocloud.io/gcr.io/knative-releases/knative.dev/eventing/cmd/controller
  newTag: "33d785"
- name: gcr.io/knative-releases/knative.dev/eventing/cmd/mtping@sha256:282b5265e1ef26309b3343038c9b4f172654e06cbee46f6ddffd23ea9ad9a3be
  newName: m.daocloud.io/gcr.io/knative-releases/knative.dev/eventing/cmd/mtping
  newTag: "282b52"
- name: gcr.io/knative-releases/knative.dev/eventing/cmd/webhook@sha256:d217ab7e3452a87f8cbb3b45df65c98b18b8be39551e3e960cd49ea44bb415ba
  newName: m.daocloud.io/gcr.io/knative-releases/knative.dev/eventing/cmd/webhook
  newTag: "d217ab"
- name: gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:2b484d982ef1a5d6ff93c46d3e45f51c2605c2e3ed766e20247d1727eb5ce918
  newName: m.daocloud.io/gcr.io/knative-releases/knative.dev/net-istio/cmd/controller
  newTag: "2b484d"
- name: gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:59b6a46d3b55a03507c76a3afe8a4ee5f1a38f1130fd3d65c9fe57fff583fa8d
  newName: m.daocloud.io/gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook
  newTag: "59b6a4"
- name: gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:c3bbf3a96920048869dcab8e133e00f59855670b8a0bbca3d72ced2f512eb5e1
  newName: m.daocloud.io/gcr.io/knative-releases/knative.dev/serving/cmd/activator
  newTag: "c3bbf3"
- name: gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:caae5e34b4cb311ed8551f2778cfca566a77a924a59b775bd516fa8b5e3c1d7f
  newName: m.daocloud.io/gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler
  newTag: "caae5e"
- name: gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:38f9557f4d61ec79cc2cdbe76da8df6c6ae5f978a50a2847c22cc61aa240da95
  newName: m.daocloud.io/gcr.io/knative-releases/knative.dev/serving/cmd/controller
  newTag: "38f955"
- name: gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping@sha256:763d648bf1edee2b4471b0e211dbc53ba2d28f92e4dae28ccd39af7185ef2c96
  newName: m.daocloud.io/gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping
  newTag: "763d64"
- name: gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook@sha256:a4ba0076df2efaca2eed561339e21b3a4ca9d90167befd31de882bff69639470
  newName: m.daocloud.io/gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook
  newTag: "a4ba00"
- name: gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:505179c0c4892ea4a70e78bc52ac21b03cd7f1a763d2ecc78e7bbaa1ae59c86c
  newName: m.daocloud.io/gcr.io/knative-releases/knative.dev/serving/cmd/queue
  newTag: "505179"
- name: gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:bc13765ba4895c0fa318a065392d05d0adc0e20415c739e0aacb3f56140bf9ae
  newName: m.daocloud.io/gcr.io/knative-releases/knative.dev/serving/cmd/webhook
  newTag: "bc1376"
- name: gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1
  newName: m.daocloud.io/gcr.io/kubebuilder/kube-rbac-proxy
  newTag: "v0.13.1"
- name: gcr.io/kubebuilder/kube-rbac-proxy:v0.8.0
  newName: m.daocloud.io/gcr.io/kubebuilder/kube-rbac-proxy
  newTag: "v0.8.0"
- name: gcr.io/ml-pipeline/api-server:2.0.0-alpha.7
  newName: m.daocloud.io/gcr.io/ml-pipeline/api-server
  newTag: "2.0.0-alpha.7"
- name: gcr.io/ml-pipeline/cache-server:2.0.0-alpha.7
  newName: m.daocloud.io/gcr.io/ml-pipeline/cache-server
  newTag: "2.0.0-alpha.7"
- name: gcr.io/ml-pipeline/frontend:2.0.0-alpha.7
  newName: m.daocloud.io/gcr.io/ml-pipeline/frontend
  newTag: "2.0.0-alpha.7"
- name: gcr.io/ml-pipeline/frontend
  newName: m.daocloud.io/gcr.io/ml-pipeline/frontend
  newTag: "2.0.0-alpha.7"
- name: gcr.io/ml-pipeline/metadata-writer:2.0.0-alpha.7
  newName: m.daocloud.io/gcr.io/ml-pipeline/metadata-writer
  newTag: "2.0.0-alpha.7"
- name: gcr.io/ml-pipeline/minio:RELEASE.2019-08-14T20-37-41Z-license-compliance
  newName: m.daocloud.io/gcr.io/ml-pipeline/minio
  newTag: "RELEASE.2019-08-14T20-37-41Z-license-compliance"
- name: gcr.io/ml-pipeline/mysql:8.0.26
  newName: m.daocloud.io/gcr.io/ml-pipeline/mysql
  newTag: "8.0.26"
- name: gcr.io/ml-pipeline/persistenceagent:2.0.0-alpha.7
  newName: m.daocloud.io/gcr.io/ml-pipeline/persistenceagent
  newTag: "2.0.0-alpha.7"
- name: gcr.io/ml-pipeline/scheduledworkflow:2.0.0-alpha.7
  newName: m.daocloud.io/gcr.io/ml-pipeline/scheduledworkflow
  newTag: "2.0.0-alpha.7"
- name: gcr.io/ml-pipeline/viewer-crd-controller:2.0.0-alpha.7
  newName: m.daocloud.io/gcr.io/ml-pipeline/viewer-crd-controller
  newTag: "2.0.0-alpha.7"
- name: gcr.io/ml-pipeline/workflow-controller:v3.3.8-license-compliance
  newName: m.daocloud.io/gcr.io/ml-pipeline/workflow-controller
  newTag: "v3.3.8-license-compliance"
- name: gcr.io/tfx-oss-public/ml_metadata_store_server:1.5.0
  newName: m.daocloud.io/gcr.io/tfx-oss-public/ml_metadata_store_server
  newTag: "1.5.0"
- name: gcr.io/ml-pipeline/visualization-server
  newName: m.daocloud.io/gcr.io/ml-pipeline/visualization-server
  newTag: "2.0.0-alpha.7"
- name: gcr.io/ml-pipeline/visualization-server:2.0.0-alpha.7
  newName: m.daocloud.io/gcr.io/ml-pipeline/visualization-server
  newTag: "2.0.0-alpha.7"
Then add the following line to each of the files below (a sketch of one edited file is given right after the list):
storageClassName: local-storage
apps/katib/upstream/components/mysql/pvc.yaml
apps/pipeline/upstream/third-party/minio/base/minio-pvc.yaml
apps/pipeline/upstream/third-party/mysql/base/mysql-pv-claim.yaml
common/oidc-authservice/base/pvc.yaml
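As an example, this is roughly what common/oidc-authservice/base/pvc.yaml looks like after the edit (a sketch; names and sizes may differ slightly in your copy of the manifests, the storageClassName line is the only addition):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: authservice-pvc
spec:
  storageClassName: local-storage   # the added line
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi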
6.5 One-command install
cd manifests
# run the install, retrying until every resource applies cleanly
while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
Once this finishes, wait roughly half an hour, because a lot of pods have to pull their images. Check the pod status with the command below; everything is fine once all pods are Running.
kubectl get pods --all-namespaces
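To spot the stragglers quickly, I filter out everything that is already healthy (just a convenience sketch):
kubectl get pods --all-namespaces | grep -vE 'Running|Completed'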
Once all pods are Running, run the following command to access the Kubeflow dashboard.
kubectl port-forward --address 0.0.0.0 svc/istio-ingressgateway -n istio-system 8080:80
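If you want the forward to survive your SSH session, run it in the background (a sketch; the login account is whatever is configured in common/dex, which defaults to user@example.com / 12341234 if you have not changed it):
nohup kubectl port-forward --address 0.0.0.0 svc/istio-ingressgateway -n istio-system 8080:80 > port-forward.log 2>&1 &
# then open http://<host-ip>:8080 in a browser and log in with the Dex account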
6.6 Pitfalls
- First pitfall: as described at the end of 6.2, the images on the host must be loaded into the cluster; if you are not using a kind-created cluster, skip this step.
- Second pitfall: images referenced as @sha256 digests have to be retagged by hand, see the end of 6.2.
- Third pitfall: the pods in kubeflow-user-example-com still pull from gcr.io even after kustomization.yaml was changed, so they have to be edited manually. All three pitfalls above show up as ImagePullBackOff errors, mostly because the images are hosted abroad and unreachable from inside China; the fix is to prefix the image with m.daocloud.io/ so it is pulled from the domestic mirror.
# edit the pod
kubectl edit pod -n kubeflow-user-example-com <NAME>
# find the image field and prefix it with m.daocloud.io/
- Fourth pitfall: a permission problem when authservice creates its persistent volume, reported as CrashLoopBackOff.
# check the error log
kubectl logs authservice-0 -n istio-system
# check the log of the previous (crashed) run
kubectl logs -p authservice-0 -n istio-system
# describe the pod for events
kubectl describe pod authservice-0 -n istio-system
# the error message
the authservice-0 pod fails to start: Error opening bolt store: open /var/lib/authservice/data.db: permission denied
Fix: edit common/oidc-authservice/base/statefulset.yaml and add the following:
initContainers:
- name: fix-permission
  image: busybox
  command: ['sh', '-c']
  args: ['chmod -R 777 /var/lib/authservice;']
  volumeMounts:
  - mountPath: /var/lib/authservice
    name: data
Or edit it in place with:
kubectl edit statefulset -n istio-system authservice
and make the same change directly in the live object.
- Fifth pitfall: minio-* and ml-pipeline-* report CrashLoopBackOff, mainly because the persistent volumes are bound wrong: the volumes bound to minio-pvc and mysql-pv-claim got swapped.
# check the pv and pvc
kubectl get pvc -n kubeflow
kubectl get pv -n kubeflow
Solution: delete the old pv and pvc.
# kubectl delete pvc -n <namespace> <pv-name OR pvc-name>
kubectl delete pvc -n kubeflow minio-pvc
kubectl delete pv -n kubeflow minio-pvc
# It turns out they cannot be deleted just like that, because k8s protects bound volumes with a
# finalizer; clear the finalizers first, then delete (same procedure for the pvc).
kubectl patch pv <pv-name> -p '{"metadata":{"finalizers":null}}'
kubectl delete pv <pv-name>
Take the minio and mysql-pv-claim parts out of the kubeflow-storage.yaml from 6.3 into a separate file and recreate them:
# recreate the auth and mysql persistent volumes
kubectl apply -f *.yaml
After recreating them, because these pods use dynamic storage, the affected pods also have to be deleted before the fix takes effect (a batch sketch follows below).
# delete a pod
# kubectl delete pod -n <namespace> <pod-name>
kubectl delete pod -n kubeflow minio
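Since the pod names carry random suffixes, a pattern-based delete saves some typing (a sketch; narrow the grep to the pods that are actually stuck on your cluster):
kubectl get pods -n kubeflow --no-headers \
  | grep -E 'minio|ml-pipeline|mysql' \
  | awk '{print $1}' \
  | xargs kubectl delete pod -n kubeflow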
7. Summary
After several days of fiddling, I wrote this post as a record, hoping it helps whoever needs it. If it helped you, a like as encouragement would be much appreciated; writing it all up was no small effort, thank you.