A Slurm-based HPC workload management environment, driven by Ansible.
This repository contains playbooks and configuration to define a Slurm-based HPC environment. This includes:
The repository is expected to be forked for a specific HPC site, but can contain multiple environments (e.g. development, staging and production clusters)
sharing a common configuration. It has been designed to be modular and extensible, so if you add features for your HPC site please feel free to submit PRs
back upstream to us!
While it is tested on OpenStack, it should work on any cloud with appropriate OpenTofu configuration files.
The default configuration in this repository may be used to create a cluster to explore use of the appliance. It provides:
Note that the Open OnDemand portal and its remote apps are not usable with this default configuration.
It requires an OpenStack cloud, and an Ansible “deploy host” with access to that cloud.
Before starting ensure that:
The following operating systems are supported for the deploy host:
These instructions assume the deployment host is running Rocky Linux 8:
sudo yum install -y git python38
git clone https://github.com/stackhpc/ansible-slurm-appliance
cd ansible-slurm-appliance
./dev/setup-env.sh
You will also need to install OpenTofu.
Run the following from the repository root to activate the venv:
. venv/bin/activate
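Later steps assume the venv is active in the current shell. A minimal sketch of a check for this, assuming the venv was created with the standard Python tooling (whose activate script exports VIRTUAL_ENV); the function name venv_is_active is hypothetical and not part of this repository:

```shell
# Succeed only if a Python virtual environment is active in this shell.
# VIRTUAL_ENV is exported by the standard venv activate script.
venv_is_active() {
    [ -n "${VIRTUAL_ENV:-}" ]
}
```

For example, `venv_is_active || . venv/bin/activate` run from the repository root activates the venv only when it is not already active.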
Use the cookiecutter template to create a new environment to hold your configuration:
cd environments
cookiecutter skeleton
and follow the prompts to complete the environment name and description.
NB: In subsequent sections this new environment is referred to as $ENV.
Activate the new environment:
. environments/$ENV/activate
And generate secrets for it:
ansible-playbook ansible/adhoc/generate-passwords.yml
Create an OpenTofu variables file to define the required infrastructure, e.g.:
# environments/$ENV/terraform/terraform.tfvars:
cluster_name = "mycluster"
cluster_net = "some_network" # *
cluster_subnet = "some_subnet" # *
key_pair = "my_key" # *
control_node_flavor = "some_flavor_name"
login_nodes = {
    login-0: "login_flavor_name"
}
cluster_image_id = "rocky_linux_9_image_uuid"
compute = {
    general = {
        nodes: ["compute-0", "compute-1"]
        flavor: "compute_flavor_name"
    }
}
Variables marked * refer to OpenStack resources which must already exist. The above is a minimal configuration - for all variables and descriptions see environments/$ENV/terraform/terraform.tfvars.
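Since the starred variables must name resources that already exist, it may help to check them before deploying. The following is a sketch only, assuming the `openstack` command-line client is installed and that credentials are already configured; the function name check_openstack_prereqs is hypothetical and not part of this repository:

```shell
# Return non-zero if any pre-existing OpenStack resource is missing.
# Arguments: network name, subnet name, keypair name, i.e. the values given
# for cluster_net, cluster_subnet and key_pair in terraform.tfvars.
check_openstack_prereqs() {
    missing=0
    openstack network show "$1" >/dev/null 2>&1 \
        || { echo "missing network: $1" >&2; missing=1; }
    openstack subnet show "$2" >/dev/null 2>&1 \
        || { echo "missing subnet: $2" >&2; missing=1; }
    openstack keypair show "$3" >/dev/null 2>&1 \
        || { echo "missing keypair: $3" >&2; missing=1; }
    return "$missing"
}
```

With the example values above this would be called as `check_openstack_prereqs some_network some_subnet my_key`. Each missing resource is reported on stderr, so all problems are visible in one run rather than one at a time.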
To deploy this infrastructure, ensure the venv and the environment are activated and run:
export OS_CLOUD=openstack
cd environments/$ENV/terraform/
tofu apply
and follow the prompts. Note that setting OS_CLOUD=openstack assumes that OpenStack credentials are defined in a clouds.yaml file in a default location, using the default cloud name of openstack.
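To see which clouds.yaml would be picked up, the default locations can be searched in order. This is a sketch under the assumption that the usual OpenStack client search order applies (current directory, then the per-user config directory, then /etc/openstack); the function name find_clouds_yaml is hypothetical:

```shell
# Print the first clouds.yaml found in the usual search locations.
# Returns non-zero if none is found.
find_clouds_yaml() {
    for candidate in \
        "./clouds.yaml" \
        "${HOME}/.config/openstack/clouds.yaml" \
        "/etc/openstack/clouds.yaml"
    do
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}
```

If the file found does not define a cloud named openstack, set OS_CLOUD to whichever cloud name it does define.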
To configure the appliance, ensure the venv and the environment are activated and run:
ansible-playbook ansible/site.yml
Once it completes you can log in to the cluster using:
ssh rocky@$login_ip
where the IP of the login node is given in environments/$ENV/inventory/hosts.yml.
environments/: See docs/environments.md.
ansible/: Contains the Ansible playbooks to configure the infrastructure.
packer/: Contains automation to use Packer to build machine images for an environment - see the README in this directory for further information.
dev/: Contains development tools.
For further information see the docs directory.