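Installation environment:

After several days of testing, I have summarized the rough process of building a GPU machine learning training environment on kubernetes: Ubuntu 16.04 is the operating system, kubernetes is installed with GPU support added, and training jobs are submitted with arena, an open-source tool from Alibaba.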

ubuntu:16.04
kubernetes:1.10.4
cuda:9.2
cudnn:7.2.1.38
nvidia-driver:390.77
docker-ce:18.03
nvidia-docker2
helm:2.8.2

Download links for all required packages:

https://st.zhusl.com/univer/go-1.10.tgz

https://st.zhusl.com/univer/NVIDIA-Linux-x86_64-390.77.run

https://st.zhusl.com/univer/cuda_9.2.148_396.37_linux.run

https://st.zhusl.com/univer/cuda_9.2.148.1_linux.run

https://st.zhusl.com/univer/cudnn-9.2-linux-x64-v7.2.1.38.tgz

https://st.zhusl.com/univer/helm

https://st.zhusl.com/univer/k8s.1-10-4.tar.gz

Installation node:

ID	IP	Spec	GPUs
1	10.10.0.51	56 cores, 128 GB RAM	4 x Tesla V100

Installation:

Base system environment

Download the packages:

Download the required installers:

cuda_9.2.148.1_linux.run

cuda_9.2.148_396.37_linux.run

cudnn-9.2-linux-x64-v7.2.1.38.tgz

NVIDIA-Linux-x86_64-390.77.run

Upgrade packages and the system kernel:

Original kernel: 4.4.0-116-generic

Install the base packages (kernel after the upgrade: 4.4.0-134-generic):

apt-get install dkms build-essential linux-headers-generic

Blacklist the nouveau driver (the open-source NVIDIA graphics driver that ships with the system):

vi /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off

echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
update-initramfs -u

Reboot.
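After the reboot it is worth confirming that the blacklist took effect before running the NVIDIA installer. The sketch below is one way to do that; the helper name `nouveau_loaded` is made up for illustration, and it assumes the usual `/proc/modules` layout where the module name is the first field on each line.

```shell
#!/bin/sh
# Return success if a kernel module named nouveau appears in a module listing.
# $1: module listing file, defaults to /proc/modules (module name is the first field).
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit !found }' "${1:-/proc/modules}"
}

if [ -r /proc/modules ]; then
    if nouveau_loaded; then
        echo "nouveau is still loaded; recheck the blacklist and rerun update-initramfs -u"
    else
        echo "nouveau is not loaded"
    fi
fi
```

If nouveau still shows up, the NVIDIA installer will most likely refuse to proceed, so it is cheaper to catch this here.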

Install the NVIDIA driver:

chmod +x NVIDIA-Linux-x86_64-390.77.run
 ./NVIDIA-Linux-x86_64-390.77.run

Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel (no)

Check the GPU information (nvidia-smi -pm 1 enables persistence mode):

nvidia-smi
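To check the result in a script rather than by eye, the GPU list can be counted and compared with what the node is supposed to have. This is a minimal sketch: `count_gpus` and `EXPECTED_GPUS` are names introduced here, and the expected value of 4 comes from the node table earlier in this post.

```shell
#!/bin/sh
# Count non-empty lines on stdin; each line is expected to be one GPU name.
count_gpus() {
    awk 'NF { n++ } END { print n + 0 }'
}

EXPECTED_GPUS=4   # assumption taken from the node table: 4 Tesla V100 cards

if command -v nvidia-smi >/dev/null 2>&1; then
    # --query-gpu=name with csv,noheader prints one GPU name per line
    found=$(nvidia-smi --query-gpu=name --format=csv,noheader | count_gpus)
    if [ "$found" -eq "$EXPECTED_GPUS" ]; then
        echo "all $found GPUs are visible"
    else
        echo "expected $EXPECTED_GPUS GPUs, found $found"
    fi
fi
```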


Install CUDA (9.2):

chmod +x cuda_9.2.148_396.37_linux.run
./cuda_9.2.148_396.37_linux.run
(answers to the installer prompts, in order: accept, n, y, y, n)

Install the patch:

chmod +x cuda_9.2.148.1_linux.run
./cuda_9.2.148.1_linux.run

Add the environment variables (to /etc/profile):

export PATH="/usr/local/cuda-9.2/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-9.2/lib64:$LD_LIBRARY_PATH"
export CUDA_HOME="/usr/local/cuda"

Apply the environment variables:

source /etc/profile
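A quick way to confirm the profile change was picked up is to test whether the CUDA bin directory is actually on PATH and whether nvcc resolves. The helper `path_contains` below is a hypothetical name used only for this sketch.

```shell
#!/bin/sh
# Return success if directory $1 is one of the colon-separated entries in $2.
path_contains() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

if path_contains /usr/local/cuda-9.2/bin "$PATH"; then
    echo "PATH includes the CUDA 9.2 bin directory"
else
    echo "PATH is missing /usr/local/cuda-9.2/bin"
fi

# nvcc is only present once the toolkit is installed
if command -v nvcc >/dev/null 2>&1; then
    nvcc --version
fi
```

Wrapping both sides in colons avoids false matches on a prefix such as /usr/local/cuda-9.2/bin-old.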

Install cuDNN:

tar -xzvf cudnn-9.2-linux-x64-v7.2.1.38.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
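Since cuDNN is installed by copying files, there is no package manager to ask for the version afterwards. One common check is to read the version macros from the copied header; the sketch below assumes cudnn.h defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL as cuDNN 7 headers do, and `cudnn_version` is a name made up here.

```shell
#!/bin/sh
# Print MAJOR.MINOR.PATCHLEVEL read from the #define lines of a cudnn.h file.
cudnn_version() {
    awk '
        $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
        $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
        $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
        END { print major "." minor "." patch }
    ' "$1"
}

header=/usr/local/cuda/include/cudnn.h
if [ -r "$header" ]; then
    echo "cuDNN version: $(cudnn_version "$header")"
fi
```

For the archive used here the first three components should agree with the 7.2.1 in the file name.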

Install docker-ce:

apt-get -y install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL http://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | sudo apt-key add -
add-apt-repository "deb [arch=amd64] http://mirrors.aliyun.com/docker-ce/linux/ubuntu $(lsb_release -cs) stable"
apt-get update
apt-get install docker-ce=18.03.1~ce-0~ubuntu

Install nvidia-docker2:

# The first two commands clean up an old nvidia-docker install and can be skipped
docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
apt-get purge -y nvidia-docker
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey |   sudo apt-key add - 

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)

curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list |   sudo tee /etc/apt/sources.list.d/nvidia-docker.list
apt-get  update
# Without a proxy this can be very slow; alternatively, download the packages and install them directly
apt-get install  nvidia-docker2=2.0.3+docker18.03.1-1 nvidia-container-runtime=2.0.0+docker18.03.1-1

Edit the docker systemd service file:

vim /lib/systemd/system/docker.service 
.....
ExecStart=/usr/bin/dockerd --default-runtime=nvidia --log-level error --log-opt max-size=50m --log-opt max-file=5
......

Restart the docker service:

systemctl  daemon-reload
systemctl  restart docker
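Once docker is back up, it helps to confirm that the daemon really picked up the new default runtime, because kubernetes will only see the GPUs if containers start under the nvidia runtime by default. A minimal sketch, with `is_nvidia_runtime` being a helper name invented for this example:

```shell
#!/bin/sh
# Return success if the runtime name passed as $1 is "nvidia".
is_nvidia_runtime() {
    [ "$1" = "nvidia" ]
}

if command -v docker >/dev/null 2>&1; then
    # DefaultRuntime is a field of the structure printed by "docker info"
    runtime=$(docker info --format '{{.DefaultRuntime}}' 2>/dev/null)
    if is_nvidia_runtime "$runtime"; then
        echo "default runtime is nvidia"
    else
        echo "default runtime is '${runtime:-unknown}', expected nvidia"
    fi
fi
```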

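Install kubernetes (1.10.4)

Omitted.

Adjust the kubelet parameters to support GPUs

Enable GPU support in Kubernetes; the Kubernetes GPU documentation is at https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/

Add --feature-gates="Accelerators=true" to the kubelet startup flags. The device plugin installed in the next step relies on the DevicePlugins feature gate, which is beta and enabled by default in kubernetes 1.10.

Install the NVIDIA device plugin so that kubernetes can discover GPU resources (the plugin version must match the kubernetes version):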

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.10/nvidia-device-plugin.yml
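After the plugin pod starts, each GPU node should advertise an nvidia.com/gpu resource. The sketch below totals the allocatable GPUs across nodes; `sum_gpus` is a hypothetical helper, and the custom-columns expression is just one way to read that field.

```shell
#!/bin/sh
# Sum the second column (GPU count per node) of "name count" lines on stdin.
# Nodes without GPUs show "<none>", which awk treats as 0.
sum_gpus() {
    awk '{ total += $2 } END { print total + 0 }'
}

if command -v kubectl >/dev/null 2>&1; then
    # The dot inside the resource name nvidia.com/gpu has to be escaped
    kubectl get nodes --no-headers \
        -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
        | sum_gpus
fi
```

On the single node described in this post the total should match the number of installed cards; a result of 0 usually means the plugin pod is not running or docker is not using the nvidia runtime.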

Install arena (see the official documentation)

Install the helm dependency (2.8.2)

wget https://st.zhusl.com/univer/helm
chmod +x helm
mv helm /usr/local/bin/

helm init --upgrade -i registry.cn-hangzhou.aliyuncs.com/google_containers/tiller:v2.8.2 --stable-repo-url https://kubernetes.oss-cn-hangzhou.aliyuncs.com/charts

kubectl create serviceaccount --namespace kube-system tiller
kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
kubectl patch deploy --namespace kube-system tiller-deploy -p '{"spec":{"template":{"spec":{"serviceAccount":"tiller"}}}}'
# If helm list runs without errors, the installation succeeded
helm list
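Right after the patch command the tiller pod is restarted, so helm list may fail for a short while even though the installation is fine. A small retry wrapper can cover that window; this is a sketch, and the function name `retry` plus the 12 attempts with a 5 second delay are arbitrary choices made for this example.

```shell
#!/bin/sh
# Run a command up to $1 times, sleeping $2 seconds between failed attempts.
# Returns success as soon as the command succeeds, failure otherwise.
retry() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        # Only sleep when another attempt will follow
        [ "$i" -le "$attempts" ] && sleep "$delay"
    done
    return 1
}

if command -v helm >/dev/null 2>&1; then
    retry 12 5 helm list || echo "tiller did not become ready in time"
fi
```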

Create the helm charts directory

mkdir /charts
git clone https://github.com/AliyunContainerService/arena.git
cp -r arena/charts/* /charts

Install TFJob support (kubeflow)

#Install TFJob Controller
kubectl create -f arena/kubernetes-artifacts/jobmon/jobmon-role.yaml
kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-operator.yaml

#Install Dashboard (optional)
kubectl create -f arena/kubernetes-artifacts/dashboard/dashboard.yaml

#Install MPIJob Controller
kubectl create -f arena/kubernetes-artifacts/mpi-operator/mpi-operator.yaml

Install arena

https://github.com/kubeflow/arena

Go must be installed first:

wget  https://st.zhusl.com/univer/go-1.10.tgz
tar zxf  go-1.10.tgz ; mv go /usr/local/

Configure the environment variables:

vi /etc/profile
export GOPATH=/var/lib/go
export GOROOT=/usr/local/go
export PATH=/usr/local/go/bin:$PATH
# Apply the environment variables
source /etc/profile

Build arena

mkdir -p $GOPATH/src/github.com/kubeflow
cd $GOPATH/src/github.com/kubeflow
git clone https://github.com/AliyunContainerService/arena.git
cd arena
make

Add arena to PATH

vi /etc/profile
export PATH=/var/lib/go/src/github.com/kubeflow/arena/bin:$PATH
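This post edits /etc/profile by hand several times. If the steps are being scripted, appending each line only when it is missing keeps the file clean across reruns. The sketch below shows the idea on a scratch file; `append_once`, `PROFILE_FILE` and the /tmp path are names chosen for this example only.

```shell
#!/bin/sh
# Append line $1 to file $2 unless an identical line is already present.
append_once() {
    grep -qxF -- "$1" "$2" 2>/dev/null || printf '%s\n' "$1" >> "$2"
}

# Demonstrated on a scratch file; set PROFILE_FILE=/etc/profile to apply it for real.
profile=${PROFILE_FILE:-/tmp/profile.demo}
append_once 'export PATH=/var/lib/go/src/github.com/kubeflow/arena/bin:$PATH' "$profile"
```

The -x and -F flags make grep match the whole line literally, so the $PATH in the export line is not interpreted as a pattern.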

Test by viewing the node information

arena top node


Example of running a piece of TensorFlow code (see the official documentation for detailed usage):

# Official example
arena submit tf \
             --name=tf-git \
             --gpus=1 \
             --image=zhusl/tensorflow:1.5.0-devel-gpu \
             --syncMode=git \
             --syncSource=https://github.com/cheyang/tensorflow-sample-code.git \
             "python code/tensorflow-sample-code/tfjob/docker/mnist/main.py --max_steps 100"

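After submitting, the job can be followed with arena list and arena logs. The sketch below pulls the status of the tf-git job out of the listing; it assumes the list output has NAME in the first column and STATUS in the second, which may differ between arena versions, and `job_status` is a helper name invented here.

```shell
#!/bin/sh
# Print the STATUS column for job $1 from "arena list"-style text on stdin.
# Assumption: NAME is the first column and STATUS is the second.
job_status() {
    awk -v name="$1" '$1 == name { print $2 }'
}

if command -v arena >/dev/null 2>&1; then
    status=$(arena list | job_status tf-git)
    echo "tf-git status: ${status:-not found}"
    # Show the training output of the job
    arena logs tf-git
fi
```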
Permalink: https://zhusl.com/post/2018-09-01-ubuntu-gpu-install.html

--EOF--

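Published on 2018-09-01 07:51:00 and tagged "tensorflow".

This site is licensed under Creative Commons Attribution 4.0 International; when reposting, please credit the author and the original URL.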


ubuntu16.04-kubernetes+arena: building a machine learning environment