Building a machine learning environment on Ubuntu 16.04 with Kubernetes + arena

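Environment information:

After several days of testing, I have put together the rough procedure for building a GPU machine learning training environment on Kubernetes: Ubuntu 16.04 as the operating system, Kubernetes installed with GPU support added, and arena, Alibaba's open-source tool, for submitting training jobs.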

  • ubuntu:16.04
  • kubernetes:1.10.4
  • cuda:9.2
  • cudnn:7.2.1.38
  • nvidia-driver:390.77
  • docker-ce:18.03
  • nvidia-docker2
  • helm:2.8.2

Download links for all required packages:

https://st.zhusl.com/univer/go-1.10.tgz

https://st.zhusl.com/univer/NVIDIA-Linux-x86_64-390.77.run

https://st.zhusl.com/univer/cuda_9.2.148_396.37_linux.run

https://st.zhusl.com/univer/cuda_9.2.148.1_linux.run

https://st.zhusl.com/univer/cudnn-9.2-linux-x64-v7.2.1.38.tgz

https://st.zhusl.com/univer/helm

https://st.zhusl.com/univer/k8s.1-10-4.tar.gz
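
The files listed above can be fetched in one pass. The following is only a minimal sketch, assuming wget is installed and the mirror still serves these files; the helper names print_urls and download_all are illustrative and not part of the original setup.

```shell
#!/bin/sh
# Base URL and file names are taken from the download list above.
BASE_URL="https://st.zhusl.com/univer"
FILES="go-1.10.tgz
NVIDIA-Linux-x86_64-390.77.run
cuda_9.2.148_396.37_linux.run
cuda_9.2.148.1_linux.run
cudnn-9.2-linux-x64-v7.2.1.38.tgz
helm
k8s.1-10-4.tar.gz"

# Print one full download URL per line.
print_urls() {
    for f in $FILES; do
        echo "$BASE_URL/$f"
    done
}

# Download every file into the given directory (default: current directory).
# -c resumes partial downloads, -P sets the target directory.
download_all() {
    dest="${1:-.}"
    print_urls | while read -r url; do
        wget -c -P "$dest" "$url" || echo "failed: $url" >&2
    done
}
```

Calling download_all with a directory argument places everything there; print_urls on its own is handy for checking the list before downloading.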

Installation node:

ID   IP           Spec                    GPUs
1    10.10.0.51   56 cores, 128 GB RAM    4 x Tesla V100

Installation:

Base system environment

Download the required installation files:

cuda_9.2.148.1_linux.run

cuda_9.2.148_396.37_linux.run

cudnn-9.2-linux-x64-v7.2.1.38.tgz

NVIDIA-Linux-x86_64-390.77.run

Upgrade the packages and the system kernel:

Original kernel: 4.4.0-116-generic

Install the base packages (kernel after the upgrade: 4.4.0-134-generic):

apt-get install dkms build-essential linux-headers-generic

Blacklist the nouveau driver (the open-source NVIDIA graphics driver that ships with the system):

vi /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off

echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
update-initramfs -u

Reboot.
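
After the reboot it is worth confirming that the blacklist took effect before running the NVIDIA installer. A small sketch, assuming the kernel module table is readable at /proc/modules (the file that lsmod formats); the function names are illustrative.

```shell
# Succeeds (exit status 0) if the nouveau module is listed in the given
# module table. Defaults to /proc/modules.
nouveau_loaded() {
    modules_file="${1:-/proc/modules}"
    grep -q '^nouveau ' "$modules_file"
}

# Print a one-line verdict for the given module table (or the running system).
report_nouveau() {
    if nouveau_loaded "$@"; then
        echo "nouveau is still loaded; recheck the blacklist and initramfs"
    else
        echo "nouveau is not loaded"
    fi
}
```

Running report_nouveau with no arguments checks the live system.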

Install the NVIDIA driver:

chmod +x NVIDIA-Linux-x86_64-390.77.run
./NVIDIA-Linux-x86_64-390.77.run

When the installer asks the following question, answer "no":

Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel

Check the GPU information (optionally run nvidia-smi -pm 1 first to enable persistence mode):

nvidia-smi
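
To confirm that all four cards are visible without reading the table by eye, the device list from nvidia-smi -L can be counted. A rough sketch; the function names are illustrative, and it assumes each device is printed on a line beginning with "GPU <index>:".

```shell
# Count GPUs in `nvidia-smi -L` style output read from stdin.
# `|| true` keeps the exit status at 0 when grep finds no match.
count_gpus() {
    grep -c '^GPU [0-9][0-9]*:' || true
}

# Compare the detected count (from stdin) with the expected one.
check_gpu_count() {
    expected="$1"
    found=$(count_gpus)
    if [ "$found" -eq "$expected" ]; then
        echo "ok: $found GPUs"
    else
        echo "mismatch: expected $expected, found $found"
        return 1
    fi
}
```

On this node the check would be run as: nvidia-smi -L | check_gpu_count 4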

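Install CUDA (9.2):

chmod +x cuda_9.2.148_396.37_linux.run
./cuda_9.2.148_396.37_linux.run

Answer the installer prompts in this order: accept, n, y, y, n. This most likely corresponds to accepting the license, skipping the bundled 396.37 driver, installing the toolkit, creating the /usr/local/cuda symlink, and skipping the samples.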

Install the patch:

chmod +x cuda_9.2.148.1_linux.run
./cuda_9.2.148.1_linux.run

Add the environment variables (to /etc/profile):

export PATH="/usr/local/cuda-9.2/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-9.2/lib64:$LD_LIBRARY_PATH"
export CUDA_HOME="/usr/local/cuda"

Apply the environment variables:

source /etc/profile
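
A quick way to check that the exports are active in the current shell is to look for the CUDA directories in the two path lists. This is a sketch with illustrative helper names; the directory names are the ones used in the exports above.

```shell
# Succeed if a colon-separated list (such as $PATH) contains the given entry.
list_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Check that the CUDA 9.2 directories are visible in the current shell.
cuda_env_ok() {
    list_contains "$PATH" "/usr/local/cuda-9.2/bin" &&
    list_contains "${LD_LIBRARY_PATH:-}" "/usr/local/cuda-9.2/lib64"
}
```

If cuda_env_ok succeeds, running nvcc --version should then report the 9.2 release.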

Install cuDNN:

tar -xzvf cudnn-9.2-linux-x64-v7.2.1.38.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
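
The copied header can be used to confirm which cuDNN version ended up under /usr/local/cuda. A sketch, assuming cuDNN 7.x defines the CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL macros in cudnn.h; the function name is illustrative.

```shell
# Print the cuDNN version as major.minor.patch, read from the header macros.
# $1: path to cudnn.h (defaults to the location used above).
cudnn_version() {
    header="${1:-/usr/local/cuda/include/cudnn.h}"
    awk '
        $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
        $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
        $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
        END { print major "." minor "." patch }
    ' "$header"
}
```

For the archive used here, the first three components should match the 7.2.1 in the file name.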

Install docker-ce:

apt-get -y install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL http://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | sudo apt-key add -
add-apt-repository "deb [arch=amd64] http://mirrors.aliyun.com/docker-ce/linux/ubuntu $(lsb_release -cs) stable"
apt-get update
apt-get install docker-ce=18.03.1~ce-0~ubuntu


Install nvidia-docker2:

# The first two commands clean up leftovers from an earlier nvidia-docker install and can be skipped
docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
apt-get purge -y nvidia-docker
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey |   sudo apt-key add - 

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)

curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list |   sudo tee /etc/apt/sources.list.d/nvidia-docker.list
apt-get  update
# Without a proxy this step is very slow; you can download the packages directly and install them afterwards
apt-get install  nvidia-docker2=2.0.3+docker18.03.1-1 nvidia-container-runtime=2.0.0+docker18.03.1-1

Edit docker's systemd service file:

vim /lib/systemd/system/docker.service
.....
ExecStart=/usr/bin/dockerd --default-runtime=nvidia --log-level error --log-opt max-size=50m --log-opt max-file=5
......


Restart the docker service:

systemctl  daemon-reload
systemctl  restart docker
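
An alternative to editing the unit file is to keep the same settings in docker's daemon configuration file. The sketch below assumes the default location /etc/docker/daemon.json and that the matching flags are not also present on the dockerd command line, since docker refuses to start when an option is set in both places. The helper name write_daemon_json is illustrative, and the function overwrites the target file, including the one nvidia-docker2 installs there.

```shell
# Write a daemon.json that makes nvidia the default runtime and applies
# the same logging settings as the ExecStart flags above.
# $1: target file (defaults to /etc/docker/daemon.json).
write_daemon_json() {
    target="${1:-/etc/docker/daemon.json}"
    cat > "$target" <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "log-level": "error",
    "log-opts": {
        "max-size": "50m",
        "max-file": "5"
    }
}
EOF
}
```

With either approach, docker info should list nvidia as the default runtime after the restart.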

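Install Kubernetes (1.10.4)

Adjust the kubelet parameters to enable Kubernetes GPU support; the Kubernetes GPU documentation is at https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/

Add this flag when starting kubelet: --feature-gates="Accelerators=true"

Install the NVIDIA device plugin so that Kubernetes can discover the GPU resources (the plugin version should match the Kubernetes version):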

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.10/nvidia-device-plugin.yml
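
Once the plugin is running, a throwaway pod that requests a GPU is a simple way to confirm that scheduling works end to end. The following sketch prints such a manifest; the function name, the pod name gpu-smoke-test and the default image nvidia/cuda:9.2-base are assumptions chosen for illustration, not part of the original setup.

```shell
# Print a pod manifest that requests GPUs through the nvidia.com/gpu
# resource exposed by the device plugin.
# $1: number of GPUs (default 1); $2: image (the default is an assumption).
gpu_test_pod() {
    gpus="${1:-1}"
    image="${2:-nvidia/cuda:9.2-base}"
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: $image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: $gpus
EOF
}
```

The manifest can be piped straight into kubectl (gpu_test_pod 1 | kubectl create -f -), after which kubectl logs gpu-smoke-test should show the nvidia-smi table from inside the container.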

Install arena (see the official documentation)

Install the dependency helm (2.8.2)

wget https://st.zhusl.com/univer/helm
chmod +x helm
mv helm /usr/local/bin/

helm init --upgrade -i registry.cn-hangzhou.aliyuncs.com/google_containers/tiller:v2.8.2 --stable-repo-url https://kubernetes.oss-cn-hangzhou.aliyuncs.com/charts

kubectl create serviceaccount --namespace kube-system tiller
kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
kubectl patch deploy --namespace kube-system tiller-deploy -p '{"spec":{"template":{"spec":{"serviceAccount":"tiller"}}}}'
# If the list command runs without errors, the installation succeeded
helm list

Create the helm charts directory

mkdir /charts
git clone https://github.com/AliyunContainerService/arena.git
cp -r arena/charts/* /charts

Install TFJob support (kubeflow)

#Install TFJob Controller
kubectl create -f arena/kubernetes-artifacts/jobmon/jobmon-role.yaml
kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-operator.yaml

#Install Dashboard (optional)
kubectl create -f arena/kubernetes-artifacts/dashboard/dashboard.yaml

#Install MPIJob Controller
kubectl create -f arena/kubernetes-artifacts/mpi-operator/mpi-operator.yaml

Install arena

https://github.com/kubeflow/arena

Go must be installed first:

wget  https://st.zhusl.com/univer/go-1.10.tgz
tar zxf  go-1.10.tgz ; mv go /usr/local/

Configure the environment variables:

vi /etc/profile
# append the following lines
export GOPATH=/var/lib/go
export GOROOT=/usr/local/go
export PATH=/usr/local/go/bin:$PATH
# apply the environment variables
source /etc/profile

Build arena

mkdir -p $GOPATH/src/github.com/kubeflow
cd $GOPATH/src/github.com/kubeflow
git clone https://github.com/AliyunContainerService/arena.git
cd arena
make

Add arena to PATH

vi /etc/profile
export PATH=/var/lib/go/src/github.com/kubeflow/arena/bin:$PATH
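
Because /etc/profile is edited several times in this walkthrough, a helper that appends a line only when it is missing avoids duplicate exports on repeated runs. A minimal sketch; append_once is an illustrative name.

```shell
# Append a line to a file only if that exact line is not already present.
# -x matches whole lines, -F treats the line as a fixed string.
append_once() {
    line="$1"
    file="$2"
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

For example, the PATH line above could be added with: append_once 'export PATH=/var/lib/go/src/github.com/kubeflow/arena/bin:$PATH' /etc/profile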

Test by viewing the node information:

arena top node

Example of running a piece of TensorFlow code (see the official documentation for detailed usage):

# Official example
arena submit tf \
             --name=tf-git \
             --gpus=1 \
             --image=zhusl/tensorflow:1.5.0-devel-gpu \
             --syncMode=git \
             --syncSource=https://github.com/cheyang/tensorflow-sample-code.git \
             "python code/tensorflow-sample-code/tfjob/docker/mnist/main.py --max_steps 100"

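After a job is submitted, it is usually followed up with arena's list, get, logs and delete subcommands. The sketch below wraps them in two illustrative helpers; the ARENA variable is an assumption added here so the commands can be printed instead of executed (a dry run) by setting it to echo.

```shell
# Command used to invoke arena; set ARENA=echo to print the commands
# instead of running them.
ARENA="${ARENA:-arena}"

# List all jobs, then show the details and the logs of one job.
inspect_job() {
    job="${1:-tf-git}"
    "$ARENA" list
    "$ARENA" get "$job"
    "$ARENA" logs "$job"
}

# Remove the job once it has finished.
cleanup_job() {
    job="${1:-tf-git}"
    "$ARENA" delete "$job"
}
```

For the job submitted above, inspect_job tf-git shows its status and training output, and cleanup_job tf-git removes it afterwards.
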