MXNet Training
Out of date
This guide contains outdated information pertaining to Kubeflow 1.0. This guide needs to be updated for Kubeflow 1.1.Alpha
This Kubeflow component has alpha status with limited support. See the Kubeflow versioning policies. The Kubeflow team is interested in your feedback about the usability of the feature.This guide walks you through using MXNet with Kubeflow.
Installing MXNet Operator
If you haven’t already done so please follow the Getting Started Guide to deploy Kubeflow.
A version of MXNet support was introduced with Kubeflow 0.2.0. You must be using a version of Kubeflow newer than 0.2.0.
Verify that MXNet support is included in your Kubeflow deployment
Check that the MXNet custom resource is installed
kubectl get crd
The output should include mxjobs.kubeflow.org
NAME                                           AGE
...
mxjobs.kubeflow.org                            4d
...
If it is not included you can add it as follows
git clone https://github.com/kubeflow/manifests
cd manifests/mxnet-job/mxnet-operator
kubectl kustomize base | kubectl apply -f -
Alternatively, you can deploy the operator with default settings without using kustomize by running the following from the repo:
git clone https://github.com/kubeflow/mxnet-operator.git
cd mxnet-operator
kubectl create -f manifests/crd-v1beta1.yaml 
kubectl create -f manifests/rbac.yaml 
kubectl create -f manifests/deployment.yaml
Creating a MXNet training job
You create a training job by defining a MXJob with MXTrain mode and then creating it with
kubectl create -f examples/v1beta1/train/mx_job_dist_gpu.yaml
Creating a TVM tuning job (AutoTVM)
TVM is a end to end deep learning compiler stack, you can easily run AutoTVM with mxnet-operator. You can create a auto tuning job by define a type of MXTune job and then creating it with
kubectl create -f examples/v1beta1/tune/mx_job_tune_gpu.yaml
Before you use the auto-tuning example, there is some preparatory work need to be finished in advance. To let TVM tune your network, you should create a docker image which has TVM module. Then, you need a auto-tuning script to specify which network will be tuned and set the auto-tuning parameters, For more details, please see https://docs.tvm.ai/tutorials/autotvm/tune_relay_mobile_gpu.html#sphx-glr-tutorials-autotvm-tune-relay-mobile-gpu-py. Finally, you need a startup script to start the auto-tuning program. In fact, mxnet-operator will set all the parameters as environment variables and the startup script need to reed these variable and then transmit them to auto-tuning script. We provide an example under examples/v1beta1/tune/, tuning result will be saved in a log file like resnet-18.log in the example we gave. You can refer it for details.
Monitoring a MXNet Job
To get the status of your job
kubectl get -o yaml mxjobs ${JOB}
Here is sample output for an example job
apiVersion: kubeflow.org/v1beta1
kind: MXJob
metadata:
  creationTimestamp: 2019-03-19T09:24:27Z
  generation: 1
  name: mxnet-job
  namespace: default
  resourceVersion: "3681685"
  selfLink: /apis/kubeflow.org/v1beta1/namespaces/default/mxjobs/mxnet-job
  uid: cb11013b-4a28-11e9-b7f4-704d7bb59f71
spec:
  cleanPodPolicy: All
  jobMode: MXTrain
  mxReplicaSpecs:
    Scheduler:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: mxjob/mxnet:gpu
            name: mxnet
            ports:
            - containerPort: 9091
              name: mxjob-port
            resources: {}
    Server:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: mxjob/mxnet:gpu
            name: mxnet
            ports:
            - containerPort: 9091
              name: mxjob-port
            resources: {}
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - args:
            - /incubator-mxnet/example/image-classification/train_mnist.py
            - --num-epochs
            - "10"
            - --num-layers
            - "2"
            - --kv-store
            - dist_device_sync
            - --gpus
            - "0"
            command:
            - python
            image: mxjob/mxnet:gpu
            name: mxnet
            ports:
            - containerPort: 9091
              name: mxjob-port
            resources:
              limits:
                nvidia.com/gpu: "1"
status:
  completionTime: 2019-03-19T09:25:11Z
  conditions:
  - lastTransitionTime: 2019-03-19T09:24:27Z
    lastUpdateTime: 2019-03-19T09:24:27Z
    message: MXJob mxnet-job is created.
    reason: MXJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: 2019-03-19T09:24:27Z
    lastUpdateTime: 2019-03-19T09:24:29Z
    message: MXJob mxnet-job is running.
    reason: MXJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: 2019-03-19T09:24:27Z
    lastUpdateTime: 2019-03-19T09:25:11Z
    message: MXJob mxnet-job is successfully completed.
    reason: MXJobSucceeded
    status: "True"
    type: Succeeded
  mxReplicaStatuses:
    Scheduler: {}
    Server: {}
    Worker: {}
  startTime: 2019-03-19T09:24:29Z
Feedback
Was this page helpful?
Glad to hear it! Please tell us how we can improve.
Sorry to hear that. Please tell us how we can improve.