AI Computing Resources at ISSAI

Our AI Computing cluster consists of many interconnected computing elements (nodes). The nodes operate in parallel with each other, providing the processing power required to train deep learning models. The current AI computing resources at ISSAI consist of 7 computing nodes.

We are committed to sharing our know-how and resources with the Kazakhstani research community. Here we provide information on how to interact with us to use these resources.

ACCESS

Information on how to obtain access credentials and the user agreement.

GENERAL GUIDELINES

Information on general principles that should be followed when working with ISSAI AI Computing resources and information on technical support.

HARDWARE SPECIFICATIONS OF THE ISSAI AI COMPUTING RESOURCES

Information on the system parameters, the access network, the ISSAI computing cluster, data storage, and node names and parameters.

SOFTWARE SPECIFICATIONS OF THE ISSAI AI COMPUTING RESOURCES

Information on the software installed on the AI Computing cluster.

USER INTERFACES

Information on the command line, the graphical interface of the software, and the web interface.

GETTING STARTED WITH THE ISSAI AI COMPUTING RESOURCES

Information on connecting to the cluster, the user workspace, copying files, and using software modules.

USING GPUs AND CUDA

Information on requesting graphics processing units (GPUs) for a task and developing with the CUDA Toolkit – preparing the working environment, usage examples for the cluster, and GPU code generation options.

GETTING ACCESS

To obtain access credentials, please complete the application form. After you submit the application, ISSAI will review it and contact you to inform you of the next steps. You will need to sign a user agreement.

GENERAL GUIDELINES FOR USING THE AI COMPUTING RESOURCES

The terms of use are available online here: Terms of Use.

The maximum execution time for a task is two weeks. After this period, the task is automatically canceled. If more time is needed, the user can make a separate arrangement with the ISSAI administration.

If a task requires intensive disk operations, it is advised to use the local SSD instead of the network directories (/home/* or /raid/*). The local disk is mounted at /scratch. When the task is complete, the user should manually remove the files created in this directory.
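For example, a task could stage its data on the local disk and clean up afterwards (the per-job directory below is illustrative):

# create a job directory on the local SSD mounted at /scratch (name is illustrative)
mkdir -p /scratch/$USER/job1
# copy input data from the network directory to the local disk
cp -r /home/$USER/input /scratch/$USER/job1/
# ... run the task against /scratch/$USER/job1 ...
# remove the files manually when the task is complete
rm -rf /scratch/$USER/job1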

Contacts for technical support

Technical support is provided to all AI Computing cluster users. The support includes help with task preparation, application software installation, instruction on how to work with the cluster, and advice.

We can also explore the possibility of providing scientific support by helping you to select the most appropriate simulation tool for your problem or to create a simulation model.

  • AI computing user support by e-mail: helpissai@nu.edu.kz.
  • Please report security incidents by e-mail: helpissai@nu.edu.kz.

HARDWARE SPECIFICATIONS OF THE ISSAI AI COMPUTING RESOURCES

The ISSAI AI Computing resources consist of 7 computing nodes for task execution and one head node that performs the cluster management function. All nodes are interconnected via a fast InfiniBand network. Each computing node is equipped with two x86_64 architecture central processing units (CPUs) and some of the nodes are additionally equipped with 2 or 4 Nvidia Tesla GPUs. The cluster architecture is heterogeneous and combines nodes of different generations and technical parameters.

Several network-attached storage systems are available for storing user data. For tasks with intensive I/O, a dedicated NVMe disk array with an NFS file system is available.

General system parameters

  • 7 computing nodes
  • 1296 CPU cores
  • 6.7 TB RAM
  • 72 Nvidia Volta Tesla graphics processing units (GPUs)
  • 145 TFLOPS overall performance (25 TFLOPS GPU)
  • 124 TB data storage

SOFTWARE SPECIFICATIONS OF THE ISSAI AI COMPUTING RESOURCES

This section provides an overview of the scientific software installed on the ISSAI AI Computing resources.

Programming tools and compilers

  • GNU Compiler Collection (ver. 4.8.5, 5.4.0, 7.3.0, 3.0)
  • OpenMPI (ver. 1.10.7, 3.1.1, 4.0.5)
  • Conda/Anaconda (ver. 4.7.5)
  • CUDA Toolkit (ver. 8.0, 9.2, 10.1, 10.2), cuDNN
  • Python (ver. 2.7, 3.6)

On request, the installation of additional software or other versions can be considered. The CentOS, EPEL, and OpenHPC repositories are used on the cluster. Tools available via the OpenHPC repository can be found at: https://github.com/openhpc/ohpc/wiki/Component-List-v1.3.8.
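OpenHPC clusters typically expose installed tools through environment modules (Lmod); assuming that is the case here, a toolchain can be selected as sketched below (module names are illustrative and may differ on the cluster):

# list the modules available on the cluster
module avail
# load a compiler and MPI stack, e.g. GCC 7.3.0 and OpenMPI 3.1.1
module load gnu7/7.3.0 openmpi3/3.1.1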

Users can also install/compile the software themselves in their user area if the installation process does not require administrator (root) permissions.
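For example, Python packages can be installed into the user area without root permissions (the package and environment names below are placeholders):

# install a package into ~/.local with pip
pip install --user numpy
# or create and use an isolated conda environment in your home directory
conda create -n myenv python=3.6
conda activate myenv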

USER INTERFACES

Information on the command line and the web interface.

Several options are available for working with the cluster:

  • command line (remote terminal) interface

Command line. This is the traditional way of accessing the cluster. A user connects over the Internet to a remote terminal (usually via the secure shell (SSH) protocol) where commands can be executed in text mode. It is possible to edit and compile code, queue a task (simulation), and monitor its execution. Graphical tools can also be called from the command line. The instructions in this guide are mainly intended for access via the command line.

GETTING STARTED WITH THE CLUSTER

Accessing the cluster

Working with the cluster is done through a dedicated server (login node) running the Ubuntu operating system with workload management tools installed. Once connected, the user can use the command line with Unix commands; it is possible to edit and compile code, submit a task (simulation), and monitor its execution on the computing resources. From the command line, the user can also invoke the graphical tools/windows described later in this document.

We ask users not to use the login nodes for resource-intensive operations. The login nodes are only for copying files to or from the cluster, preparing tasks, passing tasks to a queue, and monitoring the results. Testing a short (< 5 min.) task or compiling code on a few CPU cores is acceptable. Also note that running code or simulations on the login node (even opening a software GUI) does not automatically mean that the task will run on the computing nodes. For this, you must use the workload management tools described in the Job Management section.
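As an illustration only, assuming the workload manager is Slurm (the tools actually installed are described in the Job Management section), a minimal batch script could look like the sketch below; the job name, resource requests, and program name are placeholders.

#!/bin/bash
#SBATCH --job-name=my_task        # illustrative job name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=01:00:00           # well under the two-week limit
#SBATCH --gres=gpu:1              # request one GPU, if needed

srun python train.py              # train.py is a placeholder for your program

Under the same assumption, such a script would be submitted from the login node with sbatch and monitored with squeue.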

Command-line access parameters:

  • Primary login node: issai_gateway
  • Protocol: Secure Shell (SSH)
  • Port: 11223
  • For authentication, use the login name and password that you received when registering for the cluster access.
  • Access is allowed from any IP address.

Tools for accessing command line. The SSH connection can be established with the following tools:

  • If you are using a Linux or macOS operating system, you can connect to the cluster using SSH by opening the command line (terminal) and executing the following command: ssh username@issai_gateway -p 11223
  • For the Windows operating system, you can use PuTTY (download it from: http://www.putty.org/). The parameters you need to enter can be found above in the section Command-line access parameters. The example is shown in the figure below.

After you have opened the connection, a login name and password are required. After entering these, you can use a remote terminal with Unix commands.
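For convenience on Linux or macOS, the connection parameters can also be stored in your local ~/.ssh/config file so that the port does not have to be typed each time (username is a placeholder):

# ~/.ssh/config on your own computer
Host issai
    HostName issai_gateway
    Port 11223
    User username

After that, the connection is simply: ssh issai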

Graphical interface through command line

If you use the cluster via an SSH client (like PuTTY), you can use not only the command-line terminal but also graphical tools. SSH provides X11 forwarding from the cluster to the user’s computer.

Use the -X parameter when connecting to the cluster in a Linux environment.

ssh -X username@issai_gateway -p 11223

To use this function under MS Windows, additional preparatory steps must be carried out on the user’s personal computer.

  1. Install an X Window server, which can be downloaded from: https://sourceforge.net/projects/vcxsrv/.
  2. Run it in the background as a service. The first time you start it, you may be asked to change the settings; leave the default settings.
  3. You also need to enable X11 forwarding in the PuTTY configuration (Connection -> SSH -> X11). You can then verify the setup as shown below.
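After connecting (with ssh -X on Linux/macOS, or via PuTTY with X11 forwarding enabled on Windows), the forwarding can be verified by starting a simple X application, assuming one such as xclock is installed on the login node:

xclock &      # a clock window should open on your local screen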

USER WORKSPACE

Each user has a prepared workspace where they can store files related to their tasks. When the user logs into the system using the command line, they are automatically placed in their work directory.

/home/username

Users can install/compile software in their workspace if the installation process does not require administrator (root) permissions.

To copy files from your computer to the login node of the cluster, MS Windows users can use tools such as WinSCP or the FAR file manager. Before doing so, you should set up an SSH tunnel to the server (ssh -L 2222:server_ip:22 -p 11223 username@issai_gateway).

WinSCP can be downloaded here. The connection setup is similar to that of PuTTY:

On the left-hand side, you can see your computer’s files, and on the right-hand side, the working directory on the cluster. You can drag and drop files from one window to another with the mouse.

On a macOS or Linux operating system, use the SCP command from the command line or a suitable graphical tool. An example of copying a file from a Linux command line:

ssh -L 2222:server_ip:22 -p 11223 username@issai_gateway
scp -r -P 2222 my.file username@127.0.0.1:/raid/username
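Alternatively, for large or resumable transfers over the same tunnel, rsync can be used (a sketch under the same port-forwarding assumptions as above; directory names are placeholders):

rsync -avP -e "ssh -p 2222" my_dir/ username@127.0.0.1:/raid/username/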

USING GPUs AND CUDA ON THE CLUSTER

This section describes best practices for using GPUs and the CUDA framework on the AI Computing cluster.

GPUs available on the ISSAI cluster:

GPU model    Architecture  CUDA  FP64 performance     Tensor performance  Memory  Feature (sub)
Tesla A100   Ampere        11.0  9.746 TFLOPS (1:2)   624 TFLOPS*         40 GB   a100
Tesla V100   Volta         11.4  7.8 TFLOPS           125 TFLOPS          16 GB   v100

For more details, please see the section Hardware specifications of the ISSAI AI Computing resources.
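Because the cluster combines Volta (compute capability 7.0) and Ampere (8.0) GPUs, CUDA code is usually built with code generation options that cover both architectures; a minimal nvcc sketch (the source file name and optimization flag are illustrative):

# compile for both V100 (sm_70) and A100 (sm_80)
nvcc -O2 \
     -gencode arch=compute_70,code=sm_70 \
     -gencode arch=compute_80,code=sm_80 \
     my_kernel.cu -o my_kernel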

Docker

Check available GPUs:

nvidia-smi
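nvidia-smi also supports query flags for a compact listing of the GPUs and their memory (a quick sketch; the exact fields available may depend on the driver version):

nvidia-smi --query-gpu=index,name,memory.total,memory.used --format=csv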

Run docker:

docker run --gpus all -it --rm -v local_dir:container_dir nvcr.io/nvidia/<repository>:<xx.xx>

  • -it: run in interactive mode with a terminal attached to the container
  • --rm: automatically remove the container when it exits (changes made inside the container are discarded)
  • -p local_port:container_port: publish a container port on the local machine
  • -v [host-src:]container-dest[:<options>]: bind mount a volume
  • --shm-size: size of the shared memory available to the container
  • -e NVIDIA_VISIBLE_DEVICES=<n>: select which GPUs are visible inside the container
  • nvcr.io/nvidia/pytorch:20.01-py3 is an example image of the form nvcr.io/nvidia/<repository>:<tag>
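Putting the options together, a hedged example that starts the NGC PyTorch image and mounts a user directory from the storage into the container (the paths, image tag, and shared-memory size should be adapted to your task):

docker run --gpus all -it --rm \
    -v /raid/username:/workspace/data \
    --shm-size=8g \
    nvcr.io/nvidia/pytorch:20.01-py3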