

Building k8s and a Kubeflow 1.7 Machine Learning Platform from Scratch on a Single Machine (mainland-China network environment)


Preface

Kubeflow is a machine learning platform built on top of Kubernetes, covering the development, training, tuning, deployment, and management stages of the ML lifecycle. Since I am working on a single machine, the k8s environment is created quickly with kind. Kind is a tool that creates a local k8s cluster by using Docker containers to simulate nodes. In short: Kubeflow depends on k8s, and the k8s cluster created by kind lives inside Docker containers.

1. Preparing the base environment

  • CentOS version: CentOS Linux release 7.6.1810
  • Docker version: 24.0.7
  • kind version: 0.17.0 [note: kind 0.17.0 uses k8s v1.25.3 by default]
  • kubectl version: v1.25.3 [note: client and server versions should ideally match]
  • kustomize version: v5.0.3
  • Kubernetes version: v1.25.3

2. Installing Docker

  1. Uninstall old versions (optional): yum remove -y docker docker-common docker-selinux docker-engine
  2. Install the required packages: yum install -y yum-utils device-mapper-persistent-data lvm2
  3. Add a domestic yum mirror for faster downloads: yum-config-manager --add-repo http://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo
  4. Install the three Docker components: yum install docker-ce docker-ce-cli containerd.io
     - docker-ce builds, runs, and manages containers; it mainly contains the Docker daemon (server) and the Docker client.
     - docker-ce-cli is Docker's command-line tool. Through the CLI you interact with the Docker engine, run Docker commands, and build, manage, and monitor containers.
     - containerd.io provides the core functionality for container lifecycle management, image transfer, and image storage.
  5. Start Docker and enable it at boot (a quick check follows the commands):
sudo systemctl start docker   # start manually
sudo systemctl enable docker  # start on boot
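
As a quick sanity check (optional, and the output will vary by system), confirm that the daemon is active and that the client can reach it:

sudo systemctl is-active docker   # should print "active"
docker version                    # shows both client and server versions
docker run --rm hello-world       # optional end-to-end test with a throwaway container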

3. Installing kind

kind is just a single binary, so you only need to download it and drop it into a bin directory.

Pitfall: pick a kind release whose corresponding k8s version is compatible with your CentOS version, otherwise cluster creation may fail, because kind relies on the host's cgroup setup when it creates the cluster (a quick cgroup check follows the install commands).

curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.17.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind
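
Regarding the cgroup pitfall above: a small, hedged check of which cgroup version the host is running (CentOS 7.x normally uses cgroup v1), plus a confirmation that the binary is on the PATH:

# "tmpfs" means cgroup v1, "cgroup2fs" means cgroup v2
stat -fc %T /sys/fs/cgroup/
# Confirm the installed kind release
kind version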

4. Installing kubectl and kustomize

Kind only creates the cluster (and writes the kubeconfig for it); to operate the cluster afterwards you need to install kubectl yourself.

# Install the latest kubectl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin

# (Optional) install a specific kubectl version, provided that release exists
curl -LO https://dl.k8s.io/release/v1.25.3/bin/linux/amd64/kubectl
chmod +x kubectl
sudo mv kubectl /usr/local/bin

# Install the matching kustomize version
curl -o install_kustomize.sh "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh"
sh install_kustomize.sh 5.0.3 .
cp kustomize /bin/
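
A quick check that both tools are installed and at the expected versions:

kubectl version --client
kustomize version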

5. Creating the k8s cluster

docker pull kindest/node:v1.25.3

Then create the cluster, specifying the image we just pulled and the config yaml file:

kind create cluster --config=kind-config.yaml --image kindest/node:v1.25.3 --name kubeflow -v 5

If nothing goes wrong, the kubeflow cluster will be created within a few minutes (a quick verification follows the config file below).

kind-config.yaml:

apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
nodes:
- role: control-plane
  extraPortMappings:
  - containerPort: 30000      # container port to expose
    hostPort: 30000           # host port it maps to
    listenAddress: "0.0.0.0"  # optional, defaults to "0.0.0.0"
    protocol: tcp             # optional, defaults to tcp
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"
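
Once the cluster is up, a quick sanity check (kind names the kubeconfig context kind-<cluster-name>, so here it is kind-kubeflow):

# Point kubectl at the new cluster and confirm the control-plane endpoint
kubectl cluster-info --context kind-kubeflow
# The single control-plane node should report Ready after a short while
kubectl get nodes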

Common commands

# List existing kind clusters
kind get clusters
# Delete a cluster
# syntax: kind delete cluster --name ${clustername}
kind delete cluster --name kubeflow

Pay special attention here; this becomes a big pitfall later!

Because kind deploys the cluster by simulating nodes with Docker containers, it differs from an ordinary cluster in a few ways, mainly:

  • File system
    - files on the host cannot be accessed directly from inside the kind cluster
    - images on the host cannot be used directly inside the kind cluster
  • Network
    - for example, services running inside the kind cluster cannot be reached directly from the host

kind provides solutions for all of the above, such as image loading, port mapping, and directory mapping; see the linked reference for details, and the sketch below.
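
A minimal sketch of those three workarounds, using the cluster name kubeflow from this guide; the nginx image and the /data/shared path are only illustrative:

# 1) Image import: copy an image from the host's Docker into the kind node
docker pull nginx:alpine
kind load docker-image nginx:alpine --name kubeflow

# 2) Port mapping and 3) directory mapping are declared in the kind config
#    at cluster-creation time, for example:
cat <<'EOF' > kind-config-with-mounts.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraPortMappings:
  - containerPort: 30000
    hostPort: 30000
  extraMounts:
  - hostPath: /data/shared        # directory on the host
    containerPath: /data/shared   # same path inside the node container
EOF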

6. Installing Kubeflow on a private cloud inside China

6.1 Download the official manifests

Check the official prerequisites first: https://github.com/kubeflow/manifests#prerequisites

# Official v1.7.0 manifest archive
wget https://github.com/kubeflow/manifests/archive/refs/tags/v1.7.0.zip

# Unzip and rename the directory
unzip v1.7.0.zip
mv manifests-1.7.0/ manifests

6.2 Pull the overseas images locally

Since the private cloud cannot reach the external network, I tried many approaches without success and eventually fell back to a domestic mirror. Many thanks to DaoCloud for their work, which makes accessing overseas images from inside China much easier.

# cd into the manifests directory
cd manifests
# List the gcr images; in my network only gcr.io is unreachable, so adjust the filter to your needs
kustomize build example | grep 'image: gcr.io' | awk '$2 != "" { print $2 }' | sort -u
# Save the gcr images printed above into image_list.txt (copy and paste them by hand)
vim image_list.txt
# Run the pull script given below
sh pull_images.sh
# Check the local Docker images
docker images

pull_images.sh:

registry_prefix="m.daocloud.io/"

# Read the image list line by line
while IFS= read -r image; do
  # Build the full image path behind the mirror prefix
  full_image="${registry_prefix}${image}"
  # Pull through the DaoCloud mirror
  docker pull "$full_image"
  # Add any other per-image handling here if you need it
done < image_list.txt

This is exactly the pitfall mentioned in section 5: the pulled images live on the host and cannot be accessed directly from inside the cluster, so they still have to be loaded into the cluster.

# Load the images into the cluster. For a batch you can write a small script similar to
# the one above (see the sketch below), or load them one by one by hand.
kind load docker-image m.daocloud.io/gcr.io/knative-releases/knative.dev/serving/cmd/webhook:bc1376 --name kubeflow
# Once everything is loaded, check inside the node container; if the list matches the host, you are fine.
# List the images inside the node
docker exec -it kubeflow-control-plane crictl images
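
A minimal batch-loading sketch, assuming image_list.txt still contains the original gcr.io names and that each image was pulled with the m.daocloud.io/ prefix as above (the digest-pinned knative images may need the re-tag step from the next pitfall first):

registry_prefix="m.daocloud.io/"

# Load every mirrored image from the host's Docker into the kind node
while IFS= read -r image; do
  kind load docker-image "${registry_prefix}${image}" --name kubeflow
done < image_list.txt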

The second pitfall in this part: the knative-releases images have no usable tag, so you have to tag them yourself. Make sure the tag is exactly right, because kustomization.yaml will reference it later; a bulk re-tag sketch follows the example.

# Tag the image; I use the first six characters after the @ of the original repository reference
# docker tag <IMAGE ID> <REPOSITORY-name:tag-name>
docker tag 28336a010382 m.daocloud.io/gcr.io/knative-releases/knative.dev/serving/cmd/activator:c3bbf3
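
A hedged sketch for re-tagging all the digest-pinned entries in one go, following the author's convention of using the first six characters of the digest as the tag (assumes the mirror preserves the upstream digests and that the images were already pulled):

registry_prefix="m.daocloud.io/"

grep '@sha256:' image_list.txt | while IFS= read -r image; do
  repo="${registry_prefix}${image%%@*}"    # mirrored repository without the digest
  digest="${image##*@sha256:}"             # bare sha256 digest
  short_tag="${digest:0:6}"                # first six characters, e.g. c3bbf3
  # Resolve the locally pulled digest reference to an image ID, then tag it
  image_id=$(docker inspect --format '{{.Id}}' "${registry_prefix}${image}")
  docker tag "$image_id" "${repo}:${short_tag}"
done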

6.3 Prepare the StorageClass, PVs, and PVCs

  1. Prepare the local directories and fix the permissions on the auth path:
mkdir -p /data/k8s/istio-authservice /data/k8s/katib-mysql /data/k8s/minio /data/k8s/mysql-pv-claim
sudo chmod -R 777 /data/k8s/istio-authservice/
  2. Write kubeflow-storage.yaml, changing each hostPath (e.g. path: "/data/k8s/istio-authservice") to the corresponding directory created above:
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: authservice
  namespace: istio-system
  labels:
    type: local
spec:
  storageClassName: local-storage
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/data/k8s/istio-authservice"

---
apiVersion: v1
kind: PersistentVolume
metadata:
  namespace: kubeflow
  name: katib-mysql
  labels:
    type: local
spec:
  storageClassName: local-storage
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/data/k8s/katib-mysql"

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: minio
  namespace: kubeflow
  labels:
    type: local
spec:
  storageClassName: local-storage
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/data/k8s/minio"

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mysql-pv-claim
  namespace: kubeflow
  labels:
    type: local
spec:
  storageClassName: local-storage
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/data/k8s/mysql-pv-claim"

Apply it:

# Apply the manifest to create the sc / pv / pvc
kubectl apply -f kubeflow-storage.yaml
# Pitfall: when this finishes, check that each PV's NAME matches its CLAIM. If they are crossed,
# the minio-* and ml-pipeline pods will fail; you then have to delete the affected pv/pvc and
# recreate the pv separately (see the fifth pitfall in 6.6). A helper command for this check
# follows the block.
# Inspect the created sc / pv / pvc
kubectl get sc
kubectl get pv -n kubeflow
kubectl get pvc -n kubeflow
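
To make the NAME/CLAIM check easier, this prints each PV next to the claim it is bound to (plain kubectl, no extra tooling):

kubectl get pv -o custom-columns=PV:.metadata.name,CLAIM:.spec.claimRef.name,NAMESPACE:.spec.claimRef.namespace,STATUS:.status.phase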

6.4 Modify the installation manifests

Add an images section to example/kustomization.yaml; this rewrites the image references globally. I also pinned the visualization-server and frontend pods to 2.0.0-alpha.7, because the original manifest pulls the latest image for those two.

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

sortOptions:
  order: legacy
  legacySortOptions:
    orderFirst:
    - Namespace
    - ResourceQuota
    - StorageClass
    - CustomResourceDefinition
    - MutatingWebhookConfiguration
    - ServiceAccount
    - PodSecurityPolicy
    - Role
    - ClusterRole
    - RoleBinding
    - ClusterRoleBinding
    - ConfigMap
    - Secret
    - Endpoints
    - Service
    - LimitRange
    - PriorityClass
    - PersistentVolume
    - PersistentVolumeClaim
    - Deployment
    - StatefulSet
    - CronJob
    - PodDisruptionBudget
    orderLast:
    - ValidatingWebhookConfiguration

resources:
# Cert-Manager
- ../common/cert-manager/cert-manager/base
- ../common/cert-manager/kubeflow-issuer/base
# Istio
- ../common/istio-1-16/istio-crds/base
- ../common/istio-1-16/istio-namespace/base
- ../common/istio-1-16/istio-install/base
# OIDC Authservice
- ../common/oidc-authservice/base
# Dex
- ../common/dex/overlays/istio
# KNative
- ../common/knative/knative-serving/overlays/gateways
- ../common/knative/knative-eventing/base
- ../common/istio-1-16/cluster-local-gateway/base
# Kubeflow namespace
- ../common/kubeflow-namespace/base
# Kubeflow Roles
- ../common/kubeflow-roles/base
# Kubeflow Istio Resources
- ../common/istio-1-16/kubeflow-istio-resources/base

# Kubeflow Pipelines
- ../apps/pipeline/upstream/env/cert-manager/platform-agnostic-multi-user
# Katib
- ../apps/katib/upstream/installs/katib-with-kubeflow
# Central Dashboard
- ../apps/centraldashboard/upstream/overlays/kserve
# Admission Webhook
- ../apps/admission-webhook/upstream/overlays/cert-manager
# Jupyter Web App
- ../apps/jupyter/jupyter-web-app/upstream/overlays/istio
# Notebook Controller
- ../apps/jupyter/notebook-controller/upstream/overlays/kubeflow
# Profiles + KFAM
- ../apps/profiles/upstream/overlays/kubeflow
# Volumes Web App
- ../apps/volumes-web-app/upstream/overlays/istio
# Tensorboards Controller
- ../apps/tensorboard/tensorboard-controller/upstream/overlays/kubeflow
# Tensorboard Web App
- ../apps/tensorboard/tensorboards-web-app/upstream/overlays/istio
# Training Operator
- ../apps/training-operator/upstream/overlays/kubeflow
# User namespace
- ../common/user-namespace/base

# KServe
- ../contrib/kserve/kserve
- ../contrib/kserve/models-web-app/overlays/kubeflow

images:
- name: gcr.io/ml-pipeline/metadata-envoy:2.0.0-alpha.7
  newName: m.daocloud.io/gcr.io/ml-pipeline/metadata-envoy
  newTag: "2.0.0-alpha.7"
- name: gcr.io/arrikto/kubeflow/oidc-authservice:e236439
  newName: m.daocloud.io/gcr.io/arrikto/kubeflow/oidc-authservice
  newTag: "e236439"
- name: gcr.io/knative-releases/knative.dev/eventing/cmd/controller@sha256:33d78536e9b38dbb2ec2952207b48ff8e05acb48e7d28c2305bd0a0f7156198f
  newName: m.daocloud.io/gcr.io/knative-releases/knative.dev/eventing/cmd/controller
  newTag: "33d785"
- name: gcr.io/knative-releases/knative.dev/eventing/cmd/mtping@sha256:282b5265e1ef26309b3343038c9b4f172654e06cbee46f6ddffd23ea9ad9a3be
  newName: m.daocloud.io/gcr.io/knative-releases/knative.dev/eventing/cmd/mtping
  newTag: "282b52"
- name: gcr.io/knative-releases/knative.dev/eventing/cmd/webhook@sha256:d217ab7e3452a87f8cbb3b45df65c98b18b8be39551e3e960cd49ea44bb415ba
  newName: m.daocloud.io/gcr.io/knative-releases/knative.dev/eventing/cmd/webhook
  newTag: "d217ab"
- name: gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:2b484d982ef1a5d6ff93c46d3e45f51c2605c2e3ed766e20247d1727eb5ce918
  newName: m.daocloud.io/gcr.io/knative-releases/knative.dev/net-istio/cmd/controller
  newTag: "2b484d"
- name: gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:59b6a46d3b55a03507c76a3afe8a4ee5f1a38f1130fd3d65c9fe57fff583fa8d
  newName: m.daocloud.io/gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook
  newTag: "59b6a4"
- name: gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:c3bbf3a96920048869dcab8e133e00f59855670b8a0bbca3d72ced2f512eb5e1
  newName: m.daocloud.io/gcr.io/knative-releases/knative.dev/serving/cmd/activator
  newTag: "c3bbf3"
- name: gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:caae5e34b4cb311ed8551f2778cfca566a77a924a59b775bd516fa8b5e3c1d7f
  newName: m.daocloud.io/gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler
  newTag: "caae5e"
- name: gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:38f9557f4d61ec79cc2cdbe76da8df6c6ae5f978a50a2847c22cc61aa240da95
  newName: m.daocloud.io/gcr.io/knative-releases/knative.dev/serving/cmd/controller
  newTag: "38f955"
- name: gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping@sha256:763d648bf1edee2b4471b0e211dbc53ba2d28f92e4dae28ccd39af7185ef2c96
  newName: m.daocloud.io/gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping
  newTag: "763d64"
- name: gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook@sha256:a4ba0076df2efaca2eed561339e21b3a4ca9d90167befd31de882bff69639470
  newName: m.daocloud.io/gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook
  newTag: "a4ba00"
- name: gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:505179c0c4892ea4a70e78bc52ac21b03cd7f1a763d2ecc78e7bbaa1ae59c86c
  newName: m.daocloud.io/gcr.io/knative-releases/knative.dev/serving/cmd/queue
  newTag: "505179"
- name: gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:bc13765ba4895c0fa318a065392d05d0adc0e20415c739e0aacb3f56140bf9ae
  newName: m.daocloud.io/gcr.io/knative-releases/knative.dev/serving/cmd/webhook
  newTag: "bc1376"
- name: gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1
  newName: m.daocloud.io/gcr.io/kubebuilder/kube-rbac-proxy
  newTag: "v0.13.1"
- name: gcr.io/kubebuilder/kube-rbac-proxy:v0.8.0
  newName: m.daocloud.io/gcr.io/kubebuilder/kube-rbac-proxy
  newTag: "v0.8.0"
- name: gcr.io/ml-pipeline/api-server:2.0.0-alpha.7
  newName: m.daocloud.io/gcr.io/ml-pipeline/api-server
  newTag: "2.0.0-alpha.7"
- name: gcr.io/ml-pipeline/cache-server:2.0.0-alpha.7
  newName: m.daocloud.io/gcr.io/ml-pipeline/cache-server
  newTag: "2.0.0-alpha.7"
- name: gcr.io/ml-pipeline/frontend:2.0.0-alpha.7
  newName: m.daocloud.io/gcr.io/ml-pipeline/frontend
  newTag: "2.0.0-alpha.7"
- name: gcr.io/ml-pipeline/frontend
  newName: m.daocloud.io/gcr.io/ml-pipeline/frontend
  newTag: "2.0.0-alpha.7"
- name: gcr.io/ml-pipeline/metadata-writer:2.0.0-alpha.7
  newName: m.daocloud.io/gcr.io/ml-pipeline/metadata-writer
  newTag: "2.0.0-alpha.7"
- name: gcr.io/ml-pipeline/minio:RELEASE.2019-08-14T20-37-41Z-license-compliance
  newName: m.daocloud.io/gcr.io/ml-pipeline/minio
  newTag: "RELEASE.2019-08-14T20-37-41Z-license-compliance"
- name: gcr.io/ml-pipeline/mysql:8.0.26
  newName: m.daocloud.io/gcr.io/ml-pipeline/mysql
  newTag: "8.0.26"
- name: gcr.io/ml-pipeline/persistenceagent:2.0.0-alpha.7
  newName: m.daocloud.io/gcr.io/ml-pipeline/persistenceagent
  newTag: "2.0.0-alpha.7"
- name: gcr.io/ml-pipeline/scheduledworkflow:2.0.0-alpha.7
  newName: m.daocloud.io/gcr.io/ml-pipeline/scheduledworkflow
  newTag: "2.0.0-alpha.7"
- name: gcr.io/ml-pipeline/viewer-crd-controller:2.0.0-alpha.7
  newName: m.daocloud.io/gcr.io/ml-pipeline/viewer-crd-controller
  newTag: "2.0.0-alpha.7"
- name: gcr.io/ml-pipeline/workflow-controller:v3.3.8-license-compliance
  newName: m.daocloud.io/gcr.io/ml-pipeline/workflow-controller
  newTag: "v3.3.8-license-compliance"
- name: gcr.io/tfx-oss-public/ml_metadata_store_server:1.5.0
  newName: m.daocloud.io/gcr.io/tfx-oss-public/ml_metadata_store_server
  newTag: "1.5.0"
- name: gcr.io/ml-pipeline/visualization-server
  newName: m.daocloud.io/gcr.io/ml-pipeline/visualization-server
  newTag: "2.0.0-alpha.7"
- name: gcr.io/ml-pipeline/visualization-server:2.0.0-alpha.7
  newName: m.daocloud.io/gcr.io/ml-pipeline/visualization-server
  newTag: "2.0.0-alpha.7"

Add the line below to each of the following files (a quick grep check follows the list):

storageClassName: local-storage
apps/katib/upstream/components/mysql/pvc.yaml
apps/pipeline/upstream/third-party/minio/base/minio-pvc.yaml
apps/pipeline/upstream/third-party/mysql/base/mysql-pv-claim.yaml
common/oidc-authservice/base/pvc.yaml
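
A quick way to confirm the line landed in all four files (run from the manifests directory):

grep -n "storageClassName: local-storage" \
  apps/katib/upstream/components/mysql/pvc.yaml \
  apps/pipeline/upstream/third-party/minio/base/minio-pvc.yaml \
  apps/pipeline/upstream/third-party/mysql/base/mysql-pv-claim.yaml \
  common/oidc-authservice/base/pvc.yaml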

[screenshot]

6.5 One-command installation

cd manifests
# Apply everything, retrying until all resources are accepted
while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done

After this finishes, wait roughly half an hour, because many pods have to pull their images. Run the command below to check pod status; everything is fine once all pods are Running (a kubectl wait alternative follows the screenshot).

kubectl get pods --all-namespaces

[screenshot]
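
If you prefer not to poll by hand, a hedged alternative is to let kubectl block until the pods in a namespace report Ready (repeat per namespace as needed):

# Wait up to 30 minutes for every pod in the kubeflow namespace to become Ready
kubectl wait --for=condition=Ready pods --all -n kubeflow --timeout=1800s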

Once all pods are Running, run the following command and open the Kubeflow dashboard, as shown in the screenshot below.

kubectl port-forward --address 0.0.0.0 svc/istio-ingressgateway -n istio-system 8080:80

[screenshot]

6.6 Pitfalls encountered

  1. First pitfall: as described at the end of 6.2, the images on the host must be loaded into the cluster; skip this step if your cluster was not created with kind.
  2. Second pitfall: images referenced as @<digest> have to be re-tagged by hand; see the end of 6.2.
  3. Third pitfall: the pods in kubeflow-user-example-com still pull from gcr.* even though kustomization.yaml was modified, so they have to be patched manually. All three pitfalls show up as ImagePullBackOff errors, almost always because the images sit on registries that cannot be reached from inside China; the fix is to substitute the domestic mirror by prepending m.daocloud.io/:
# Edit the pod
kubectl edit pod -n kubeflow-user-example-com <NAME>
# Find the image field and prepend m.daocloud.io/ by hand, as shown in the screenshot below

[screenshot]
4. Fourth pitfall: a permission problem on the persistent volume created for authservice; the pod reports CrashLoopBackOff.

# View the error log
kubectl logs authservice-0 -n istio-system
# View the previous container's log
kubectl logs -p authservice-0 -n istio-system
# Describe the pod for events and details
kubectl describe pod authservice-0 -n istio-system
# Error message
The authservice-0 pod fails to start: Error opening bolt store: open /var/lib/authservice/data.db: permission denied

Fix: edit

common/oidc-authservice/base/statefulset.yaml

and add the following content:

      initContainers:
      - name: fix-permission
        image: busybox
        command: ['sh', '-c']
        args: ['chmod -R 777 /var/lib/authservice;']
        volumeMounts:
        - mountPath: /var/lib/authservice
          name: data

Or use the command

kubectl edit statefulset -n istio-system authservice

to edit it directly; the modified yaml is shown in the screenshot below.
[screenshot]

  5. Fifth pitfall: minio-* and ml-pipeline-* report CrashLoopBackOff. The root cause is a wrong persistent-volume binding: the volumes bound to minio-pvc and mysql-pv-claim were swapped.
# Check the pv and pvc
kubectl get pvc -n kubeflow
kubectl get pv -n kubeflow
Solution: delete the old pv and pvc.
# kubectl delete pvc -n <namespace> <pvc-name>
kubectl delete pvc -n kubeflow minio-pvc
kubectl delete pv -n kubeflow minio-pvc
It turns out they cannot be deleted just like that, because k8s protects these objects and the protection (the finalizers) has to be removed first; the pvc is handled the same way:
kubectl patch pv <pv-name> -p '{"metadata":{"finalizers":null}}'
kubectl delete pv <pv-name>
Take the minio and mysql-pv-claim parts out of the kubeflow-storage.yaml from 6.3 into a separate file and recreate them:
# Recreate the auth and mysql persistent volumes
kubectl apply -f *.yaml
Because these pods use dynamically bound storage, the affected pods also have to be deleted before the change takes effect:
# Delete the pod
# kubectl delete pod -n <namespace> <pod-name>
kubectl delete pod -n kubeflow minio

7. Summary

After several days of fiddling I wrote this post as a record, hoping it helps whoever needs it. If it helped you, please give it a like as encouragement; writing it all up was not easy. Thank you.

References

  1. Domestic image mirror: https://github.com/DaoCloud/public-image-mirror
  2. Installing k8s with kind: https://www.lixueduan.com/posts/kubernetes/15-kind-kubernetes-in-docker
  3. Kubeflow installation (1): https://blog.csdn.net/weixin_40548136/article/details/131481520
  4. Kubeflow installation (2): https://blog.csdn.net/yanqianglifei/article/details/128432784

This article is reposted from: https://blog.csdn.net/king_super123/article/details/134671914
Copyright belongs to the original author, king_super123. In case of infringement, please contact us for removal.
