AI Computing Resources at ISSAI

Our AI Computing cluster consists of many interconnected computing elements (nodes). The nodes operate in parallel with each other, providing the processing power needed to train deep learning models. The current AI computing resources at ISSAI consist of 7 computing nodes.

We are committed to sharing our know-how and resources with the Kazakhstani research community. Here, we provide information on how to interact with us to use these resources.

ACCESS

To obtain access credentials, please fill in the application form. After you submit the completed application, ISSAI will review it and contact you to inform you of the next steps. You will need to sign a user agreement.

GENERAL GUIDELINES

Information on general principles that should be followed when working with ISSAI AI Computing resources and information on technical support.

HARDWARE SPECIFICATIONS OF THE ISSAI AI COMPUTING RESOURCES

Information on the system parameters, the access network, the ISSAI computing cluster, data storage, and node names and parameters.

SOFTWARE SPECIFICATIONS OF THE ISSAI AI COMPUTING RESOURCES

Information on the software installed on the AI Computing cluster.

USER INTERFACES

Information on the command line, the graphical interface of the software, and the web interface.

GETTING STARTED WITH THE ISSAI AI COMPUTING RESOURCES

Information on connecting to the cluster, the user workspace, copying files, and using software modules.

USING GPUs AND CUDA

Information on requesting graphics processing units (GPUs) for a task and developing with the CUDA Toolkit – preparing the working environment, usage examples for the cluster, and GPU code generation options.

GETTING ACCESS

To obtain access credentials, please complete the application form. After you submit the completed application, ISSAI will review it and contact you to inform you of the next steps. You will need to sign a user agreement.

GENERAL GUIDELINES FOR USING THE AI COMPUTING RESOURCES

The terms of use are available online here: Terms of Use.

The maximum execution time for a task is two weeks. After this period, the task is automatically canceled. If more time is needed, the user can arrange an extension with the ISSAI administration by submitting a separate request form.

If a task requires intensive disk operations, it is advised to use the local SSD instead of the network directories (/home/* or /raid/*). The local disk is mounted in the /scratch directory. When the task is complete, the user should manually remove the files created in this directory.
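For example, a typical workflow with the local disk might look like the following (a sketch; the directory layout under /scratch and the dataset/result names are only placeholders):

mkdir -p /scratch/$USER/job1                        # create a personal working directory on the local SSD
cp -r /raid/$USER/dataset /scratch/$USER/job1/      # stage the input data from the network storage
# ... run the I/O-intensive task against /scratch/$USER/job1 ...
cp -r /scratch/$USER/job1/results /raid/$USER/      # copy the results back to the network storage
rm -rf /scratch/$USER/job1                          # manually remove your files from /scratch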

Contacts for technical support

Technical support is provided to all AI Computing cluster users. The support includes help with task preparation, application software installation, instruction on how to work with the cluster, and advice.

We can also explore the possibility of providing scientific support by helping you to select the most appropriate simulation tool for your problem or to create a simulation model.

  • AI computing user support by e-mail: helpissai@nu.edu.kz.
  • Please report security incidents by e-mail: helpissai@nu.edu.kz.

HARDWARE SPECIFICATIONS OF THE ISSAI AI COMPUTING RESOURCES

The ISSAI AI Computing resources consist of 7 computing nodes for task execution and one head node that performs the cluster management function. All nodes are interconnected via a fast InfiniBand network. Each computing node is equipped with two x86_64 central processing units (CPUs), and some of the nodes are additionally equipped with 2 or 4 Nvidia Tesla GPUs. The cluster architecture is heterogeneous and combines nodes of different generations and technical parameters.

Several network-attached storage systems are available for storing user data. For tasks with intensive I/O, a dedicated NVMe disk array with an NFS file system is available.

General system parameters

7 computing nodes

1296 CPU cores

6.7 TB RAM

72 Nvidia Tesla GPUs (Volta and Ampere architectures)

145 TFLOPS overall performance (25 TFLOPS GPU)

124 TB data storage

SOFTWARE SPECIFICATIONS OF THE ISSAI AI COMPUTING RESOURCES

This section provides an overview of the scientific software installed on the ISSAI AI Computing resources.

Programming tools and compilers

  • GNU Compiler Collection (ver. 4.8.5, 5.4.0, 7.3.0, 3.0)
  • OpenMPI (ver. 1.10.7, 3.1.1, 4.0.5)
  • Conda/Anaconda (ver. 4.7.5)
  • CUDA Toolkit (ver. 8.0, 9.2, 10.1, 10.2), cuDNN
  • Python (ver. 2.7, 3.6)

On request, the installation of additional software or other versions can be considered. The CentOS, EPEL, and OpenHPC repositories are used on the cluster. Tools available via the OpenHPC repository can be found at: https://github.com/openhpc/ohpc/wiki/Component-List-v1.3.8.
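OpenHPC-based clusters typically manage the installed tools via environment modules; if that is the case here, available versions can be listed and loaded roughly as follows (a sketch; the gnu7 module name is only an illustration and may differ on the cluster):

module avail              # list the software modules installed on the cluster
module load gnu7          # load a specific compiler module (name is illustrative)
module list               # show the currently loaded modules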

Users can also install/compile the software themselves in their user area if the installation process does not require administrator (root) permissions.
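For example, with the preinstalled Conda (see the list above) you can create an isolated Python environment entirely inside your home directory, without root permissions (a sketch; the environment path and the numpy package are only placeholders):

conda create --prefix ~/envs/myproject python=3.6 -y   # create a personal environment in your home directory
conda activate ~/envs/myproject                         # switch to the new environment
pip install numpy                                       # install packages without administrator rights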

USER INTERFACES

Information on the command line and the web interface.

Several options are available for working with the cluster:

  • command line (remote terminal) interface

Command line. This is a traditional way of accessing the cluster. A user connects via the Internet to a remote terminal (usually via a secure shell (SSH) protocol) where commands can be executed in text mode. It is possible to edit and compile a code, queue a task (simulation), and monitor the progress of the execution of the task. Graphical tools can also be called from the command line. The instructions in this guide are mainly intended for access via the command line.

GETTING STARTED WITH THE CLUSTER

Accessing the cluster

Working with the cluster is done through a dedicated server (login node) with the Ubuntu operating system and workload management tools installed. Once connected, the user can use the command line with Unix commands, and it is possible to edit and compile a code, submit a task (simulation) and monitor its execution on the computing resources. From the command line, the user can also invoke graphical tools/windows described later in this document.

We ask users not to use the login nodes for resource-intensive operations. The login nodes are only for copying files to or from the cluster, preparing tasks, submitting tasks to a queue, and monitoring the results. Testing a short (< 5 min.) task or compiling code on a few CPU cores is acceptable. Also note that running code or simulations on the login node (even opening a software GUI) does not automatically guarantee that the task will run on the computing nodes. For this, you must use the workload management tools described in the Job Management section.

Command-line access parameters:

  • Primary login node: issai_gateway
  • Protocol: Secure Shell (SSH)
  • Port: 11223
  • For authentication, use the login name and password that you received when registering for the cluster access.
  • Access is possible from any IP address.
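For illustration, a direct connection with these parameters (without a preconfigured SSH config file) could look like the following, where user_name is the login you received and the IP address is the gateway address from the example config shown below:

ssh -p 11223 user_name@87.255.216.119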

Tools for connecting to the ISSAI server

The SSH connection can be established with the following tools:

  • Command Prompt (CMD), PowerShell, or a Unix terminal

You can connect to the ISSAI server using SSH by opening the command line (terminal) and executing the following command, provided you have preconfigured the SSH config file sent to you in the email with the connection setup:

ssh remote_server

Example of ~/.ssh/config file

Host issai_gateway
    HostName 87.255.216.119
    User user_name
    Port 11223
    IdentityFile ~/.ssh/id_rsa
    ForwardAgent yes

Host remote_server
    HostName 10.10.25.12
    User user_name
    Port 22
    ProxyJump issai_gateway
    IdentityFile ~/.ssh/id_rsa
    ForwardAgent yes
    LocalForward 222 localhost:22
  • For the Windows operating system, you can also use the convenient MobaXterm software.

Using the MobaXterm software, you can initialize a new SSH connection with the following settings:

Go to the Advanced SSH settings tab and fill in the Remote host and Specify username fields with the data sent to you in the email with the connection setup.

On the Network settings tab, add an SSH gateway (jump host) by providing the data from the email with the connection settings.

How to use a graphical interface through the command line?

If you use the server via an SSH client, you can use not only the command line terminal but also graphical user interfaces. SSH provides X11 forwarding from the server to the user’s computer.

Use the -X parameter when connecting to the server in a Linux environment.

ssh -X remote_server
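After connecting with -X, you can verify that X11 forwarding works by starting a simple graphical program (a sketch, assuming an X client such as xclock is installed on the server and an X server is running on your local machine):

xclock   # a window should appear on your local desktop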

USER WORKSPACE

Each user has a prepared workspace where they can store files related to their tasks. When the user logs into the system using the command line, they are automatically directed to the home directory.

/home/user_name

How to transfer files? To copy files to the server from a personal computer, first run the following command to forward the port:

ssh -f user_name@issai_gateway -p 11223 -L 2222:remote_server:22 -N

Then copy the required files:

scp -r -P 2222 SomeFileOrDir user_name@127.0.0.1:/raid/user_name/

Instead of scp, you can use the convenient file transfer software FileZilla (assuming the port forwarding above is active) with the following settings: protocol SFTP, host 127.0.0.1, port 2222, and the username and password you received for cluster access.

On the left side, you can see your computer’s files, and on the right side, the working directory on the server. You can drag and drop files from one window to the other with the mouse.
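For larger command-line transfers, rsync can also be used over the same forwarded port (a sketch, assuming rsync is installed both on your computer and on the server, and that the port forwarding command above is running):

rsync -avP -e "ssh -p 2222" SomeFileOrDir user_name@127.0.0.1:/raid/user_name/   # resumable copy with progress output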

USING GPUs AND CUDA ON THE CLUSTER

This section describes best practices for using GPUs and the CUDA framework on the AI Computing cluster.

GPUs available on the ISSAI cluster:

GPU model    Arch     CUDA   FP64 power           Tensor power   Memory   Feature (sub)
Tesla A100   Ampere   11.0   9.746 TFLOPS (1:2)   624 TFLOPS*    40 GB    a100
Tesla V100   Volta    11.4   7.8 TFLOPS           125 TFLOPS     16 GB    v100

For more details, please see the section Hardware specifications of the ISSAI AI Computing resources.

Docker

Check available GPUs:

nvidia-smi

How to run Docker?

Docker is an open platform for developing, shipping, and running applications. Docker enables you to separate your applications from your infrastructure so you can deliver software quickly.
(See the Docker overview and “What is a container?” pages in the Docker documentation.)

Log in to the ISSAI server and type id to find out your user ID (UID).
Then write your Dockerfile, based on one of the performance-optimized NVIDIA containers for AI/ML and high-performance computing.

# Base image: complete the path with the desired NVIDIA NGC container and tag
FROM nvcr.io/nvidia/

LABEL maintainer="your email"

# Set your time zone, login name, and the UID reported by the id command
ENV TZ=Asia/Almaty \
    USER=user_name \
    UID=yourID

# Create a user inside the container that matches your UID on the cluster
RUN groupadd -g ${UID} ${USER} && useradd -l -r -m -s /bin/bash -u ${UID} ${USER} -g ${USER}

# Install your required software (noninteractive frontend avoids tzdata prompts during the build)
RUN apt update -y && \
    DEBIAN_FRONTEND=noninteractive apt -y install ca-certificates tzdata software-properties-common cmake

USER ${USER}

WORKDIR /home/user_name/
COPY . .

# Create a virtual environment and put it on PATH so the following RUN commands use it
# (activating with "source" inside a RUN does not persist to later layers)
RUN python3 -m venv venv
ENV PATH="/home/user_name/venv/bin:${PATH}"

RUN python3 -m pip install --upgrade pip
RUN pip3 install -U -r requirements.txt

Now you can build the image with the command

docker build -t image_name:v1 .

Optionally, you can list the most recently created images

docker images

To start the container, use the following command

docker run --runtime=nvidia -d -it --rm \
    -p 9060:9060 --shm-size=10g \
    -e NVIDIA_VISIBLE_DEVICES=7,8 \
    --mount type=bind,src=/raid/user_name/,dst=/home/user_name/ \
    --name=container_name image_name:v1

where:

  • -d -it runs the container in background and interactive mode
  • --rm removes the container when it stops
  • -p maps a port (local machine port:docker container port)
  • --shm-size sets the shared memory size
  • -e NVIDIA_VISIBLE_DEVICES selects the GPU indices visible to the container
  • --mount type=bind attaches a filesystem mount (bind mount) to the container
  • --name assigns a name to the container
  • image_name:v1 is the name and tag of the image from which you want to run the container

Optionally, you can list the running containers

docker container ls

or

docker ps

Now you can enter the running container by executing the bash command

docker exec -it container_name bash
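Once inside the container, it is worth checking that the requested GPUs are actually visible (a sketch; the Python check assumes PyTorch was installed via your requirements.txt):

nvidia-smi                                                     # should list only the GPUs selected via NVIDIA_VISIBLE_DEVICES
python3 -c "import torch; print(torch.cuda.device_count())"   # optional check from Python, assuming PyTorch is installed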

Now you can run your projects in Docker on the ISSAI server.

For example, the following command runs training of a model for event-based camera face detection:

CUDA_VISIBLE_DEVICES=4,5,6 taskset --cpu-list 73-84 python3 train_detection.py results_dir dataset_dir

The model was trained on several GPUs (CUDA_VISIBLE_DEVICES=4,5,6) and pinned to specific CPU cores (taskset --cpu-list 73-84), using the Python script train_detection.py and specifying the directory for saving the results (results_dir) and the dataset itself (dataset_dir).