Declarative cluster management using constraint programming, where constraints are described using SQL.
Modern cluster management systems like Kubernetes routinely grapple
with hard combinatorial optimization problems: load balancing,
placement, scheduling, and configuration. Implementing application-specific algorithms to
solve these problems is notoriously hard, which makes it challenging to evolve the system over time
and add new features.
DCM is a tool to overcome this challenge. It enables programmers to build schedulers
and cluster managers using a high-level declarative language (SQL).
Specifically, developers represent cluster state in an SQL database and write the constraints
and policies that should apply to that state using SQL. From this SQL specification, the DCM compiler synthesizes a
program that, at runtime, can be invoked to compute policy-compliant cluster management decisions given the latest
cluster state. Under the covers, the generated program efficiently encodes the cluster state as an
optimization problem that can be solved using off-the-shelf solvers, freeing developers from having to
design ad-hoc heuristics.
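For instance, a placement policy like "never assign tasks to cordoned machines" could be sketched as a constraint along the following lines. This is only an illustrative sketch: the tasks and machines tables mirror the quickstart further below, the cordoned column is hypothetical, and it assumes membership (IN) predicates over sub-queries are available; see the quickstart for a complete, runnable constraint.

create constraint respect_cordon as
select * from tasks
check controllable__worker_id in
    (select id from machines where cordoned = false)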
The high-level architecture is shown in the diagram below.
The DCM project’s groupId is com.vmware.dcm and its artifactId is dcm.
We make DCM’s artifacts available through Maven Central.
To use DCM from a Maven-based project, use the following dependency:
<dependency>
    <groupId>com.vmware.dcm</groupId>
    <artifactId>dcm</artifactId>
    <version>0.15.0</version>
</dependency>
To use within a Gradle-based project:
implementation 'com.vmware.dcm:dcm:0.15.0'
We test regularly on JDK 11 and 16, on both macOS and Ubuntu 20.04.
We currently support two solver backends.
Google OR-tools CP-SAT (version 9.1.9490). This is available by default when using the Maven dependency.
MiniZinc (version 2.3.2). This backend is currently being deprecated. If you still want to use it
in your project, or if you want to run all tests in this repository, you will have to install MiniZinc out-of-band.
To do so, download MiniZinc from https://www.minizinc.org/software.html and make sure you are able to invoke the
minizinc binary from your command line.
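Once installed, you can check that the binary is reachable from your shell (the version printed should match whichever release you downloaded):
$: minizinc --version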
Here is a complete program that you can run to get a feel for DCM.
import com.vmware.dcm.Model;
import org.jooq.DSLContext;
import org.jooq.impl.DSL;
import org.junit.jupiter.api.Test;

import java.util.List;

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertTrue;

public class QuickStartTest {

    @Test
    public void quickStart() {
        // Create an in-memory database and get a JOOQ connection to it
        final DSLContext conn = DSL.using("jdbc:h2:mem:");

        // A table representing some machines
        conn.execute("create table machines(id integer)");

        // A table representing tasks that need to be assigned to machines by DCM.
        // To do so, create a variable column (prefixed by controllable__).
        conn.execute("create table tasks(task_id integer, controllable__worker_id integer, " +
                     "foreign key (controllable__worker_id) references machines(id))");

        // Add four machines
        conn.execute("insert into machines values(1)");
        conn.execute("insert into machines values(3)");
        conn.execute("insert into machines values(5)");
        conn.execute("insert into machines values(8)");

        // Add two tasks
        conn.execute("insert into tasks values(1, null)");
        conn.execute("insert into tasks values(2, null)");

        // Time to specify a constraint! Just for fun, let's assign tasks to machines such that
        // the machine IDs sum up to 6.
        final String constraint = "create constraint example_constraint as " +
                "select * from tasks check sum(controllable__worker_id) = 6";

        // Create a DCM model using the database connection and the above constraint
        final Model model = Model.build(conn, List.of(constraint));

        // Solve and return the tasks table. The controllable__worker_id column will either be [1, 5] or [5, 1]
        final List<Integer> column = model.solve("TASKS")
                .map(e -> e.get("CONTROLLABLE__WORKER_ID", Integer.class));
        assertEquals(2, column.size());
        assertTrue(column.contains(1));
        assertTrue(column.contains(5));
    }
}
The Model class serves as DCM’s public API. It exposes two methods: Model.build() and model.solve().
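Since the constraints are compiled when the model is built, whereas model.solve() only reads the latest contents of the database, a natural usage pattern is to build the model once and call solve() repeatedly as cluster state changes. The following is a minimal sketch of that pattern, reusing the quickstart's schema and constraint; the class name and the printed output are purely illustrative.

import com.vmware.dcm.Model;
import org.jooq.DSLContext;
import org.jooq.impl.DSL;

import java.util.List;

public class BuildOnceSolveRepeatedly {
    public static void main(final String[] args) {
        // Same schema as the quickstart: machines, and tasks with a variable column
        final DSLContext conn = DSL.using("jdbc:h2:mem:");
        conn.execute("create table machines(id integer)");
        conn.execute("create table tasks(task_id integer, controllable__worker_id integer, " +
                     "foreign key (controllable__worker_id) references machines(id))");
        conn.execute("insert into machines values(1)");
        conn.execute("insert into machines values(5)");

        // Build the model once: the constraint below is compiled against the schema
        final Model model = Model.build(conn, List.of(
                "create constraint example_constraint as " +
                "select * from tasks check sum(controllable__worker_id) = 6"));

        // As new tasks show up, record them in the database and re-run the solver
        conn.execute("insert into tasks values(1, null)");
        conn.execute("insert into tasks values(2, null)");
        model.solve("TASKS").forEach(r ->
                System.out.println(r.get("TASK_ID") + " -> " + r.get("CONTROLLABLE__WORKER_ID")));
    }
}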
We welcome all feedback and contributions! ❤️
Please use GitHub issues for user questions and bug reports.
Check out the contributing guide if you’d like to send us a pull request.
The entire build, including unit tests, can be triggered from the root folder with the following command (make
sure to set up both solvers first):
$: ./gradlew build
To avoid documentation drift, code snippets in a documentation file (like the README or tutorial)
are embedded directly from source files that are continuously tested. To refresh these documentation
files:
$: npx embedme <file>
The Kubernetes scheduler also comes with integration tests that run against a real Kubernetes cluster.
It goes without saying that you should not point them at a production cluster, as these tests repeatedly delete all
running pods and deployments. To run these integration tests, make sure you have a valid KUBECONFIG
environment variable that points to a Kubernetes cluster.
We recommend setting up a local multi-node cluster and a corresponding KUBECONFIG using kind.
Once you’ve installed kind, run the following to create a test cluster:
$: kind create cluster --config k8s-scheduler/src/test/resources/kind-test-cluster-configuration.yaml --name dcm-it
The above step will create a configuration file in your home folder (~/.kube/kind-config-dcm-it).
Make sure you initialize a KUBECONFIG environment variable to point to that path.
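For example, with a bash-like shell:
$: export KUBECONFIG=~/.kube/kind-config-dcm-it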
You can then execute the following command to run integration tests against the created local cluster:
$: KUBECONFIG=~/.kube/kind-config-dcm-it ./gradlew :k8s-scheduler:integrationTest
To run a specific integration test class (for example, SchedulerIT from the k8s-scheduler module):
$: KUBECONFIG=~/.kube/kind-config-dcm-it ./gradlew :k8s-scheduler:integrationTest --tests SchedulerIT
To learn more about DCM, we suggest going through the following references:
Talks:
Research papers:
Building Scalable and Flexible Cluster Managers Using Declarative Programming
Lalith Suresh, Joao Loff, Faria Kalim, Sangeetha Abdu Jyothi, Nina Narodytska, Leonid Ryzhyk, Sahan Gamage, Brian Oki, Pranshu Jain, Michael Gasch.
In the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2020).
Automating Cluster Management with Weave
Lalith Suresh, Joao Loff, Faria Kalim, Nina Narodytska, Leonid Ryzhyk, Sahan Gamage, Brian Oki, Zeeshan Lokhandwala, Mukesh Hira, Mooly Sagiv. arXiv preprint arXiv:1909.03130 (2019).
Synthesizing Cluster Management Code for Distributed Systems
Lalith Suresh, João Loff, Nina Narodytska, Leonid Ryzhyk, Mooly Sagiv, and Brian Oki. In Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS 2019).
ACM, New York, NY, USA, 45-50. DOI: https://doi.org/10.1145/3317550.3321444