TensorFlow on GPUs
An open-source software library for Machine Intelligence
Available Modules
module load TensorFlow/2.2.0-gimkl-2018b-Python-3.8.1
TensorFlow is an open source library for machine learning. TensorFlow can train and run deep neural networks. It can also serve as a backend for other techniques requiring automatic differentiation and GPU acceleration.
TensorFlow is callable from Python, with the numerically intensive parts of the algorithms implemented in C++ for efficiency. This page focuses on running TensorFlow with GPU support.
See also
- To request GPU resources using the --gpus-per-node option of Slurm, see the GPU use on NeSI documentation page.
- To run TensorFlow on CPUs instead, have a look at our article TensorFlow on CPUs for tips on how to configure TensorFlow and Slurm for optimal performance.
Use NeSI modules
TensorFlow is available on Mahuika as an environment module:
module load TensorFlow/2.4.1-gimkl-2020a-Python-3.8.2
Note that this will automatically load the right versions of the CUDA and cuDNN modules needed to run TensorFlow on GPUs.
You can list the available versions of the module using:
module spider TensorFlow
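Once the module is loaded, a quick way to confirm that TensorFlow can see a GPU is to list the physical devices from Python. The following is a minimal sketch, not part of the NeSI tooling: the helper name `visible_gpus` is hypothetical, and it returns `None` when TensorFlow cannot be imported (for example when the module has not been loaded). Run it inside a job that requested a GPU, otherwise the list is expected to be empty.

```python
import importlib.util


def visible_gpus():
    """Return the names of the GPUs TensorFlow can see.

    Returns None if TensorFlow is not importable in the current environment.
    `visible_gpus` is a hypothetical helper name used for illustration.
    """
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf  # imported lazily, since loading it takes a while

    # tf.config.list_physical_devices is available in TensorFlow 2.1 and later
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not available; load the module first")
    elif gpus:
        print("GPUs visible to TensorFlow:", ", ".join(gpus))
    else:
        print("TensorFlow is installed but sees no GPU")
```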
To install additional Python packages for your project, you can either:
- install packages in your home folder,
- install packages in a dedicated Python virtual environment for your project.
The first option is easy but will consume space in your home folder and
can create conflicts if you have multiple projects with different
version requirements. To install packages this way, you need to use
pip install --user. For example, to install the SciKeras package:
pip install --user scikeras
The second option provides a better separation between projects.
Additionally, it saves space in your home folder if you create your
virtual environments in the project or nobackup folder. The following
example illustrates how to create a virtual environment, activate it and
install the SciKeras package in it with pip:
$ export PYTHONNOUSERSITE=1
$ python3 -m venv --system-site-packages my_venv
$ source my_venv/bin/activate
(my_venv) $ pip install scikeras
where my_venv is the path of the virtual environment folder.
The --system-site-packages option allows the virtual environment to
access the TensorFlow package provided by the environment module
previously loaded:
$ module load TensorFlow/2.4.1-gimkl-2020a-Python-3.8.2
$ source my_venv/bin/activate
(my_venv) $ python -c "import tensorflow as tf; print(tf.__version__)"
[...]
2.4.1
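To see which copy of a package the interpreter would pick up (the one from the environment module, the virtual environment, or the home folder), you can ask Python where it would import the package from. This is an illustrative sketch using only the standard library; the function names `in_virtual_env` and `package_origin` are hypothetical, and the `sys.prefix` comparison only detects environments created with `venv`, not conda environments.

```python
import importlib.util
import sys


def in_virtual_env():
    """True when the interpreter runs inside a venv-style virtual environment."""
    return sys.prefix != sys.base_prefix


def package_origin(name):
    """Return the file a top-level package would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    print("inside a virtual environment:", in_virtual_env())
    for name in ("tensorflow", "scikeras"):
        print(f"{name}: {package_origin(name) or 'not found'}")
```

With the module loaded and the virtual environment activated, the path reported for tensorflow should point into the module installation, while the path for scikeras should point into the virtual environment folder.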
Don't forget to activate the virtual environment before calling your Python scripts in a Slurm submission script, using:
source <path_to_virtual_environment>/bin/activate
Virtual environment isolation
Use export PYTHONNOUSERSITE=1 to ensure that your virtual
environment is isolated from packages installed in your home folder
~/.local/lib/python3.8/site-packages/.
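The effect of this variable can be checked from Python itself: the standard `site` module exposes an `ENABLE_USER_SITE` flag, which is `False` when the user site directory is disabled. The sketch below starts a fresh interpreter with and without the variable and reports the flag; the helper name `user_site_enabled` is hypothetical and is only meant as a quick sanity check.

```python
import os
import subprocess
import sys


def user_site_enabled(no_user_site):
    """Report whether a fresh interpreter would use the user site directory.

    A new Python process is started with PYTHONNOUSERSITE set or unset,
    and asked for the value of site.ENABLE_USER_SITE.
    """
    env = dict(os.environ)
    env.pop("PYTHONNOUSERSITE", None)
    if no_user_site:
        env["PYTHONNOUSERSITE"] = "1"
    result = subprocess.run(
        [sys.executable, "-c", "import site; print(site.ENABLE_USER_SITE)"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "True"


if __name__ == "__main__":
    print("with PYTHONNOUSERSITE=1:", user_site_enabled(True))
    print("without it:", user_site_enabled(False))
```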
Conda environments
As an alternative, you can also create conda environments to install a specific version of Python, TensorFlow and any additional packages required for your project. On Mahuika, use the Miniconda3 module:
export PYTHONNOUSERSITE=1
module load Miniconda3/4.9.2
conda create -p my_conda_venv python=3.8
Note that here we use the -p option to create the conda environment in
a local my_conda_venv folder. Use a subfolder in your project or
nobackup folder to save space in your home folder.
Then activate the conda environment and install TensorFlow using
conda install or pip install, depending on your preferences:
source $(conda info --base)/etc/profile.d/conda.sh # if you didn't use "conda init" to set your .bashrc
conda activate ./my_conda_venv
pip install tensorflow==2.5.0
To use TensorFlow on GPUs, you also need to load cuDNN/CUDA modules with
the proper versions. See the official documentation about tested
configurations for compatible versions. For example, TensorFlow 2.5.0
requires you to load the cuDNN/8.1.1.33-CUDA-11.2.0 module:
module load cuDNN/8.1.1.33-CUDA-11.2.0 # for TensorFlow 2.5
You can list the available versions of cuDNN (and associated CUDA module) using:
module spider cuDNN
Please contact us at support@nesi.org.nz if you need a version not available on the platform.
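If you are unsure which CUDA and cuDNN versions a pip- or conda-installed TensorFlow expects, the package can report the versions it was built against. The sketch below assumes `tf.sysconfig.get_build_info()` is available (to my knowledge, TensorFlow 2.3 and later); the helper name `cuda_build_info` is hypothetical, and it returns `None` when TensorFlow is not importable.

```python
import importlib.util


def cuda_build_info():
    """Return the CUDA/cuDNN versions TensorFlow was built against.

    Returns None when TensorFlow is not importable. Relies on
    tf.sysconfig.get_build_info(), assumed available in TensorFlow >= 2.3.
    """
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf

    info = tf.sysconfig.get_build_info()
    # CPU-only builds may not define the CUDA keys, hence .get()
    return {
        "cuda_version": info.get("cuda_version"),
        "cudnn_version": info.get("cudnn_version"),
    }


if __name__ == "__main__":
    print(cuda_build_info())
```

The reported versions can then be compared with the output of module spider cuDNN to pick a matching module.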
Māui Ancillary Nodes
- Load the Anaconda3 module instead of Miniconda3 to manipulate conda environments:
module load Anaconda3/2020.02-GCC-7.1.0
- Use module avail to list available versions of modules, e.g. module avail cuDNN
Additionally, depending on your version of TensorFlow, you may need to take the following into consideration:
- install the tensorflow-gpu Python package if you are using TensorFlow 1,
- make sure to use a supported version of Python when creating the conda environment (e.g. TensorFlow 1.14.0 requires Python 3.3 to 3.7),
- use conda install (not pip install) if your version of TensorFlow relies on GCC 4.8 (TensorFlow < 1.15).
Tip
Make sure to use module purge before loading Miniconda3, so that no
other loaded Python module can interfere with your conda environment.
module purge
module load Miniconda3/4.9.2
export PYTHONNOUSERSITE=1
source $(conda info --base)/etc/profile.d/conda.sh # if you didn't use "conda init" to set your .bashrc
conda ... # any conda commands (create, activate, install...)
Singularity containers
You can use containers to run your application on the NeSI platform. We provide support for Singularity containers, which users can run without additional privileges. Note that Docker containers can be converted into Singularity containers.
For TensorFlow, we recommend using the official container provided by NVIDIA. More information about using Singularity with GPU enabled containers is available on the NVIDIA GPU Containers support page.
Specific versions for A100
Here are the recommended options to run TensorFlow on the A100 GPUs:
- If you use TensorFlow 1, use the TF1 container provided by NVIDIA, which comes with a version of TensorFlow 1.15 compiled specifically to support the A100 GPUs (Ampere architecture). Other official Python packages won't support the A100, triggering various crashes and slowdowns.
- If you use TensorFlow 2, any version from 2.4 onwards supports the A100 GPUs.
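The two rules above can be written down as a small version check, for instance to fail early in a job script. This is only a sketch of the recommendations as stated on this page: the function name `a100_recommendation` and the returned strings are hypothetical, and the advice for TensorFlow 2 releases older than 2.4 is an assumption, since the text does not cover that case explicitly.

```python
def a100_recommendation(version):
    """Map a TensorFlow version string such as "2.4.1" to the advice above."""
    major, minor = (int(part) for part in version.split(".")[:2])
    if major == 1:
        return "use the NVIDIA TF1 container"
    if (major, minor) >= (2, 4):
        return "supported"
    # Not stated in the text: assumption that older TF2 releases need an upgrade
    return "upgrade to TensorFlow 2.4 or later"


if __name__ == "__main__":
    for version in ("1.15.0", "2.3.0", "2.4.1"):
        print(version, "->", a100_recommendation(version))
```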
Example Slurm script
In the following example, we will use the make_image_classifier tool provided by TensorFlow Hub to illustrate a training workflow. The example task consists of retraining the last layers of an already trained deep neural network so that it classifies pictures of flowers. This type of task is known as "transfer learning".
- Create a virtual environment to install the tensorflow-hub[make_image_classifier] package:
module purge # start from a clean environment
module load TensorFlow/2.4.1-gimkl-2020a-Python-3.8.2
export PYTHONNOUSERSITE=1
python3 -m venv --system-site-packages tf_hub_venv
source tf_hub_venv/bin/activate
pip install tensorflow-hub[make_image_classifier]~=0.12
- Download and uncompress the example dataset containing labelled photos of flowers (daisies, dandelions, roses, sunflowers and tulips):
wget http://download.tensorflow.org/example_images/flower_photos.tgz -O - | tar -xz
- Copy the following code in a job submission script named flowers.sl:
#!/bin/bash -e
#SBATCH --job-name=flowers-example
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --time 00:10:00
#SBATCH --mem 4G

# load TensorFlow module and activate the virtual environment
module purge
module load TensorFlow/2.4.1-gimkl-2020a-Python-3.8.2
export PYTHONNOUSERSITE=1
source tf_hub_venv/bin/activate

# select a model to train, here MobileNetV2
MODEL_URL="https://tfhub.dev/google/tf2-preview/mobilenet_v2/feature_vector/4"

# run the training script
make_image_classifier \
    --image_dir flower_photos \
    --tfhub_module "$MODEL_URL" \
    --image_size 224 \
    --saved_model_dir "model-${SLURM_JOBID}"
- Submit the job:
sbatch flowers.sl
Once the job has finished, the trained model will be saved in a
model-JOBID folder, where JOBID is the Slurm job ID number.
All messages printed by TensorFlow during the training, including
training and validation accuracies, are captured in the Slurm output
file, named slurm-JOBID.out by default.
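When writing your own training script, it can help to record in that same output file which job and which GPUs the script is running on. The sketch below reads the SLURM_JOBID variable used in the submission script above and the CUDA_VISIBLE_DEVICES variable; that Slurm sets CUDA_VISIBLE_DEVICES for GPU jobs is an assumption about the cluster configuration, and the helper name `slurm_gpu_summary` is hypothetical.

```python
import os


def slurm_gpu_summary(environ=os.environ):
    """Summarise the Slurm job ID and the GPU indices exposed to the job."""
    job_id = environ.get("SLURM_JOBID", "")
    devices = environ.get("CUDA_VISIBLE_DEVICES", "")
    # CUDA_VISIBLE_DEVICES is a comma-separated list; empty entries are dropped
    gpus = [item.strip() for item in devices.split(",") if item.strip()]
    return {"job_id": job_id or None, "gpus": gpus}


if __name__ == "__main__":
    summary = slurm_gpu_summary()
    print(f"Slurm job: {summary['job_id']}, GPUs exposed: {summary['gpus']}")
```

Calling it at the top of a training script prints one line to the Slurm output file, which makes it easier to match a log to the resources the job actually received.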
Tip
While your job is running, you can monitor the progress of model
training using tail -f on the Slurm output file:
tail -f slurm-JOBID.out # replace JOBID with an actual number