Our AI Computing cluster consists of many interconnected computing elements (nodes). The nodes in the cluster operate in parallel with each other, providing the combined processing power needed to train deep learning models. The current AI computing resources at ISSAI consist of 7 computing nodes:
We are committed to sharing our know-how and resources with the Kazakhstani research community. Here, we provide information on how to interact with us to use these resources.
To obtain access credentials, please fill in the application form. After you submit the completed application, ISSAI will review it and contact you to inform you of the following steps. You will need to sign a user agreement.
Information on the general principles to follow when working with the ISSAI AI Computing resources and on technical support.
HARDWARE SPECIFICATIONS OF THE ISSAI AI COMPUTING RESOURCES
Information on the system parameters, the access network, the ISSAI computing cluster, data storage, and node names and parameters.
SOFTWARE SPECIFICATIONS OF THE ISSAI AI COMPUTING RESOURCES
Information on the software installed on the AI Computing cluster.
Information on the command line, the graphical interface of the software, and the web interface.
GETTING STARTED WITH THE ISSAI AI COMPUTING RESOURCES
Information on connecting to the cluster, the user workspace, copying files, and using software modules.
Information on requesting graphics processing units (GPUs) for a task and developing with the CUDA Toolkit – preparing the working environment, usage examples for the cluster, and GPU code generation options.
To obtain access credentials, please complete the application form. After you submit the completed application, ISSAI will review it and contact you to inform you of the following steps. You will need to sign a user agreement.
The terms of use are available online here: Terms of Use.
The maximum execution time for a task is two weeks. After this period, the task is automatically canceled. If more time is needed, the user can arrange an extension separately with the ISSAI administration.
If a task requires intensive disk operations, it is advised to use the local SSD instead of the network directories (/home/* or /raid/*). The local disk is mounted in the /scratch directory. When the task is complete, the user should manually remove the files created in this directory.
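A minimal sketch of this workflow is shown below; the job directory and file names are illustrative, not prescribed by ISSAI:

# create a private working directory on the local SSD
mkdir -p /scratch/$USER/my_job
cp /home/$USER/input.dat /scratch/$USER/my_job/

# ... run the task against /scratch/$USER/my_job ...

# copy the results back to network storage and clean up /scratch
cp -r /scratch/$USER/my_job/results /home/$USER/
rm -rf /scratch/$USER/my_job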
Technical support is provided to all AI Computing cluster users. The support includes help with task preparation, application software installation, instructions on how to work with the cluster, and general advice.
We can also explore the possibility of providing scientific support by helping you to select the most appropriate simulation tool for your problem or to create a simulation model.
The ISSAI AI Computing resources consist of 7 computing nodes for task execution and one head node that performs the cluster management function. All nodes are interconnected via a fast InfiniBand network. Each computing node is equipped with two x86_64 architecture central processing units (CPUs), and some of the nodes are additionally equipped with two or four Nvidia Tesla GPUs. The cluster architecture is heterogeneous and combines nodes of different generations and technical parameters.
Several network-attached storage systems are available for storing user data. For tasks with intensive I/O, a dedicated NVMe disk array with an NFS file system is available.
7 computing nodes
1296 CPU cores
6.7 TB RAM
72 Nvidia Tesla GPUs (Volta and Ampere architectures)
145 TFLOPS overall performance (25 TFLOPS GPU)
124 TB data storage
This section provides an overview of the scientific software installed on the ISSAI AI Computing resources.
On request, the installation of additional software or other versions can be considered. The CentOS, EPEL, and OpenHPC repositories are used on the cluster. Tools available via the OpenHPC repository can be found at: https://github.com/openhpc/ohpc/wiki/Component-List-v1.3.8.
Users can also install/compile the software themselves in their user area if the installation process does not require administrator (root) permissions.
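Centrally installed software on OpenHPC-based clusters is usually made available through environment modules; assuming that is the case here, a typical session looks like the following (the module name and version are only illustrative):

# list the software modules available on the cluster
module avail

# load a module into the current session (name and version are illustrative)
module load gnu8/8.3.0

# show which modules are currently loaded
module list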
Information on the command line and the web interface.
Several options are available for working with the cluster:
Command line. This is the traditional way of accessing the cluster. A user connects via the Internet to a remote terminal (usually via the secure shell (SSH) protocol) where commands can be executed in text mode. It is possible to edit and compile code, queue a task (simulation), and monitor its execution. Graphical tools can also be called from the command line. The instructions in this guide are mainly intended for access via the command line.
Working with the cluster is done through a dedicated server (login node) with the Ubuntu operating system and workload management tools installed. Once connected, the user can use the command line with Unix commands; it is possible to edit and compile code, submit a task (simulation), and monitor its execution on the computing resources. From the command line, the user can also invoke the graphical tools/windows described later in this document.
We ask users not to use the login nodes for resource-intensive operations. The login nodes are only for copying files to or from the cluster, preparing tasks, submitting tasks to a queue, and monitoring the results. Testing a short (< 5 min.) task or compiling code on a few CPU cores is acceptable. Also note that running code or simulations on the login node (even opening a software GUI) does not automatically mean that the task will run on the computing nodes. For this, you must use the workload management tools described in the Job Management section; a minimal example is sketched below.
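As a hedged illustration, and assuming the workload manager is Slurm (the scheduler commonly shipped with OpenHPC; the actual tool and option names on this cluster may differ), a minimal batch script could look like this:

#!/bin/bash
#SBATCH --job-name=my_task        # illustrative job name
#SBATCH --ntasks=1                # a single task
#SBATCH --cpus-per-task=4         # four CPU cores for that task
#SBATCH --time=01:00:00           # one-hour wall-time limit

# run the actual program (script name is illustrative)
python my_script.py

Such a script would then be submitted with sbatch and its progress monitored with squeue.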
Command-line access parameters:
Tools for accessing the command line. The SSH connection can be established with the following tools:
After you have opened the connection, a login name and password are required. After entering these, you can use a remote terminal with Unix commands.
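For example, using the gateway address and port that appear in the connection examples later in this guide, a connection from a Linux or macOS terminal would look like this (replace username with your own login name):

ssh username@issai_gateway -p 11223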
Graphical interface through the command line
If you use the cluster via an SSH client (like PuTTY), you can use not only the command-line terminal but also graphical tools. SSH provides X11 forwarding from the cluster to the user’s computer.
Use the -X parameter when connecting to the cluster in a Linux environment.
ssh -X username@issai_gateway -p 11223
To use this function under MS Windows, additional preparatory steps must be carried out on the user’s personal computer.
Each user has a prepared workspace where they can store files related to their tasks. When the user logs into the system from the command line, they are automatically placed in their working directory:
/home/username
Users can install/compile software in their workspace if the installation process does not require administrator (root) permissions.
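As an illustration of installing software without root permissions, a Python package can be placed in the home directory, and a typical source package can be built with an installation prefix inside the home directory (the package name and paths below are only examples):

# install a Python package into your home directory
pip install --user numpy

# build and install a typical source package under your home directory
./configure --prefix=$HOME/software
make
make install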
To copy files from your computer to the login node of the cluster, MS Windows users can use tools such as WinSCP or the FAR file manager. Before doing so, you should set up an SSH tunnel to the server (ssh -L 2222:server_ip:22 -p 11223 username@issai_gateway).
WinSCP can be downloaded here. The connection is similar to that of PuTTY:
On the left-hand side, you can see your computer files, and on the right-hand side – the working directory on the cluster. You can drag and drop files from one window to another with the computer mouse:
On a macOS or Linux operating system, use the scp command from the command line or a suitable graphical tool. An example of copying a file from a Linux command line (first open the tunnel, then copy through it):
ssh -L 2222:server_ip:22 -p 11223 username@issai_gateway
scp -P 2222 -r my.file username@127.0.0.1:/raid/username
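Copying in the other direction (from the cluster back to your computer) works through the same tunnel with the source and destination swapped; the file and directory names below are only examples:

scp -P 2222 -r username@127.0.0.1:/raid/username/results ./results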
This section describes best practices for using GPUs and the CUDA framework on the AI Computing cluster.
GPUs available on the ISSAI cluster:
GPU model  | Architecture | CUDA | FP64 performance   | Tensor performance | Memory | Feature (for job submission)
-----------|--------------|------|--------------------|--------------------|--------|-----------------------------
Tesla A100 | Ampere       | 11.0 | 9.746 TFLOPS (1:2) | 624 TFLOPS*        | 40 GB  | a100
Tesla V100 | Volta        | 11.4 | 7.8 TFLOPS         | 125 TFLOPS         | 16 GB  | v100
For more details, please see the section Hardware specifications of the ISSAI AI Computing resources.
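The feature names in the last column of the table are typically what you pass to the workload manager when requesting a specific GPU model. Assuming the cluster uses Slurm and exposes these names as node features (the exact options on this system may differ; see the Job Management section), a request could look like this:

# request one GPU of a specific model for an interactive session (Slurm assumed)
srun --gres=gpu:1 --constraint=v100 --pty bash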
Check available GPUs:
nvidia-smi
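nvidia-smi can also print a compact summary of selected fields, which is convenient for a quick check before starting a task:

# list GPU names, total memory, and current utilization in CSV form
nvidia-smi --query-gpu=name,memory.total,utilization.gpu --format=csv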
Run docker:
docker run --gpus all -it --rm -v local_dir:container_dir nvcr.io/nvidia/<repository>:<xx.xx>
-it                          run the container in interactive mode with a terminal attached
--rm                         automatically remove the container (and any changes made inside it) when it exits
-p                           port mapping, local machine port:docker container port
-v                           volume, [host-src:]container-dest[:<options>]: bind mount a directory into the container
--shm-size                   amount of shared memory (RAM) available to the container
-e NVIDIA_VISIBLE_DEVICES    index(es) of the GPUs to make visible inside the container
Container images follow the pattern nvcr.io/nvidia/<repository>:<tag>, for example nvcr.io/nvidia/pytorch:20.01-py3.
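Putting these options together, a concrete invocation with the PyTorch image mentioned above could look like the following (the mounted paths and shared-memory size are illustrative):

# start an interactive PyTorch container with all GPUs and a project directory mounted
docker run --gpus all -it --rm \
    -v /raid/username/project:/workspace/project \
    --shm-size=8g \
    nvcr.io/nvidia/pytorch:20.01-py3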