kennel

Datadog monitors/dashboards/slos as code, avoid chaotic management via UI

134

Ruby

Manage Datadog Monitors / Dashboards / Slos as code

DRY, searchable, audited, documented
Changes are PR reviewed and applied on merge
Updating shows diff before applying
Automated import of existing resources
Resources are grouped into projects that belong to teams and inherit tags
No copy-pasting of ids to create new resources
Automated cleanup when removing code
Helpers for automating common tasks

Applying changes

Example code

# teams/foo.rb
module Teams
  class Foo < Kennel::Models::Team
    defaults(mention: -> { "@slack-my-team" })
  end
end

# projects/bar.rb
class Bar < Kennel::Models::Project
  defaults(
    team: -> { Teams::Foo.new }, # use mention and tags from the team
    tags: -> { super() + ["project:bar"] }, # unique tag for all project components
    parts: -> {
      [
        Kennel::Models::Monitor.new(
          self, # the current project
          type: -> { "query alert" },
          kennel_id: -> { "load-too-high" }, # pick a unique name
          name: -> { "Foobar Load too high" }, # nice descriptive name that will show up in alerts and emails
          message: -> {
            <<~TEXT
              This is bad!
              #{super()} # inserts mention from team
            TEXT
          },
          query: -> { "avg(last_5m):avg:system.load.5{hostgroup:api} by {pod} > #{critical}" },
          critical: -> { 20 }
        )
      ]
    }
  )
end

Installation

create a new private kennel repo for your organization (do not fork this repo)

use the template folder as starting point:

git clone [email protected]:your-org/kennel.git
git clone [email protected]:grosser/kennel.git seed
mv seed/template/* kennel/
cd kennel && git add . && git commit -m 'initial'

add a basic projects and teams so others can copy-paste to get started
setup CI build for your repo (travis and Github Actions supported)
uncomment .travis.yml section for datadog updates on merge (TODO: example setup for Github Actions)
follow Setup in your repos Readme.md

Structure

projects/ monitors/dashboards/etc scoped by project
teams/ team definitions
parts/ monitors/dashboards/etc that are used by multiple projects
generated/ projects as json, to show current state and proposed changes in PRs

About the models

Kennel provides several classes which act as models for different purposes:

Kennel::Models::Dashboard, Kennel::Models::Monitor, Kennel::Models::Slo, Kennel::Models::SyntheticTest;
these models represent the various Datadog objects
Kennel::Models::Project; a container for a collection of Datadog objects
Kennel::Models::Team; provides defaults and values (e.g. tags, mentions) for the other models.

After loading all the *.rb files under projects/, Kennel’s starting point
is to find all the subclasses of Kennel::Models::Project, and for each one,
create an instance of that subclass (via .new) and then call #parts on that
instance. parts should return a collection of the Datadog-objects (Dashboard / Monitor / etc).

Model Settings

Each of the models defines various settings; for example, a Monitor has name, message,
type, query, tags, and many more.

When defining a subclass of a model, one can use defaults to provide default values for
those settings:

class MyMonitor < Kennel::Models::Monitor
  defaults(
    name: "Error rate",
    type: "query alert",
    critical: 5.0,
    query: -> {
      "some datadog metric expression > #{critical}"
    },
    # ...
  )
end

This is equivalent to defining instance methods of those names, which return those values:

class MyMonitor < Kennel::Models::Monitor
  def name
    "Error rate"
  end

  def type
    "query alert"
  end

  def critical
    5.0
  end

  def query
    "some datadog metric expression > #{critical}"
  end
end

except that defaults will complain if you try to use a setting name which doesn’t
exist. Note also that you can use either plain values (critical: 5.0), or procs
(query: -> { ... }). Using a plain value is equivalent to using a proc which returns
that same value; use whichever suits you best.

When you instantiate a model class, you can pass settings in the constructor, after
the project:

project = Kennel::Models::Project.new
my_monitor = MyMonitor.new(
  project,
  critical: 10.0,
  message: -> {
    <<~MESSAGE
      Something bad is happening and you should be worried.

      #{super()}
    MESSAGE
  },
)

This works just like defaults (it checks the setting names, and it accepts
either plain values or procs), but it applies just to this instance of the class,
rather than to the class as a whole (i.e. it defines singleton methods, rather
than instance methods).

Most of the examples in this Readme use the proc syntax (critical: -> { 5.0 }) but
for simple constants you may prefer to use the plain syntax (critical: 5.0).

Workflows

Adding a team

mention is used for all team monitors via super()
renotify_interval is used for all team monitors (defaults to 0 / off)
tags is used for all team monitors/dashboards (defaults to team:<team-name>)

# teams/my_team.rb
module Teams
  class MyTeam < Kennel::Models::Team
    defaults(
      mention: -> { "@slack-my-team" }
    )
  end
end

Adding a new monitor

use datadog monitor UI to create a monitor
see below

Updating an existing monitor

use datadog monitor UI to find a monitor
run URL='https://app.datadoghq.com/monitors/123' bundle exec rake kennel:import and copy the output
find or create a project in projects/
add the monitor to parts: [ list, for example:

# projects/my_project.rb
class MyProject < Kennel::Models::Project
  defaults(
    team: -> { Teams::MyTeam.new }, # use existing team or create new one in teams/
    parts: -> {
      [
        Kennel::Models::Monitor.new(
          self,
          id: -> { 123456 }, # id from datadog url, not necessary when creating a new monitor
          type: -> { "query alert" },
          kennel_id: -> { "load-too-high" }, # make up a unique name
          name: -> { "Foobar Load too high" }, # nice descriptive name that will show up in alerts and emails
          message: -> {
            # Explain what behavior to expect and how to fix the cause
            # Use #{super()} to add team notifications.
            <<~TEXT
              Foobar will be slow and that could cause Barfoo to go down.
              Add capacity or debug why it is suddenly slow.
              #{super()}
            TEXT
          },
          query: -> { "avg(last_5m):avg:system.load.5{hostgroup:api} by {pod} > #{critical}" }, # replace actual value with #{critical} to keep them in sync
          critical: -> { 20 }
        )
      ]
    }
  )
end

run PROJECT=my_project bundle exec rake plan, an Update to the existing monitor should be shown (not Create / Delete)
alternatively: bundle exec rake generate to only locally update the generated json files
review changes then git commit
make a PR … get reviewed … merge
datadog is updated by CI

Deleting

Remove the code that created the resource. The next update will delete it (see above for PR workflow).

Adding a new dashboard

go to datadog dashboard UI and click on New Dashboard to create a dashboard
see below

Updating an existing dashboard

go to datadog dashboard UI and click on New Dashboard to find a dashboard
run URL='https://app.datadoghq.com/dashboard/bet-foo-bar' bundle exec rake kennel:import and copy the output
find or create a project in projects/
add a dashboard to parts: [ list, for example:

class MyProject < Kennel::Models::Project
  defaults(
    team: -> { Teams::MyTeam.new }, # use existing team or create new one in teams/
    parts: -> {
      [
        Kennel::Models::Dashboard.new(
          self,
          id: -> { "abc-def-ghi" }, # id from datadog url, not needed when creating a new dashboard
          title: -> { "My Dashboard" },
          description: -> { "Overview of foobar" },
          template_variables: -> { ["environment"] }, # see https://docs.datadoghq.com/api/?lang=ruby#timeboards
          kennel_id: -> { "overview-dashboard" }, # make up a unique name
          layout_type: -> { "ordered" },
          definitions: -> {
            [ # An array or arrays, each one is a graph in the dashboard, alternatively a hash for finer control
              [
                # title, viz, type, query, edit an existing graph and see the json definition
                "Graph name", "timeseries", "area", "sum:mystats.foobar{$environment}"
              ],
              [
                # queries can be an Array as well, this will generate multiple requests
                # for a single graph
                "Graph name", "timeseries", "area", ["sum:mystats.foobar{$environment}", "sum:mystats.success{$environment}"],
                # add events too ...
                events: [{q: "tags:foobar,deploy", tags_execution: "and"}]
              ]
            ]
          }
        )
      ]
    }
  )
end

Updating existing resources with id

Setting id makes kennel take over a manually created datadog resource.
When manually creating to import, it is best to remove the id and delete the manually created resource.

When an id is set and the original resource is deleted, kennel will fail to update,
removing the id will cause kennel to create a new resource in datadog.

Organizing projects with many resources

When project files get too long, this structure can keep things bite-sized.

# projects/project_a/base.rb
module ProjectA
  class Base < Kennel::Models::Project
    defaults(
      kennel_id: -> { "project_a" },
      parts: -> {
        [
          Monitors::FooAlert.new(self),
          ...
        ]
      }
      ...

# projects/project_a/monitors/foo_alert.rb
module ProjectA
  module Monitors
    class FooAlert < Kennel::Models::Monitor
      ...

Updating a single project or resource

Use PROJECT=<kennel_id> for single project:

Use the projects kennel_id (and if none is set then snake_case of the class name including modules)
to refer to the project. For example for class ProjectA use PROJECT=project_a but for Foo::ProjectA use foo_project_a.
Use TRACKING_ID=<project-kennel_id>:<resource-kennel_id> for single resource:

Use the project kennel_id and the resources kennel_id, for example class ProjectA and FooAlert would give project_a:foo_alert.

Skipping validations

Some validations might be too strict for your usecase or just wrong, please open an issue and
to unblock use the validate: -> { false } option.

Linking resources with kennel_id

Link resources with their kennel_id in the format project kennel_id + : + resource kennel_id,
this should be used to create dependent resources like monitor + slos,
so they can be created in a single update and can be re-created if any of them is deleted.

Resource	Type	Syntax
Dashboard	uptime	`monitor: {id: "foo:bar"}`
Dashboard	alert_graph	`alert_id: "foo:bar"`
Dashboard	slo	`slo_id: "foo:bar"`
Dashboard	timeseries	`queries: [{ data_source: "slo", slo_id: "foo:bar" }]`
Monitor	composite	`query: -> { "%{foo:bar} && %{foo:baz}" }`
Monitor	slo alert	`query: -> { "error_budget(\"%{foo:bar}\").over(\"7d\") > 123.0" }`
Slo	monitor	`monitor_ids: -> ["foo:bar"]`

Debugging changes locally

rebase on updated master to not undo other changes
figure out project name by converting the class name to snake_case
run PROJECT=foo bundle exec rake kennel:update_datadog to test changes for a single project (monitors: remove mentions while debugging to avoid alert spam)
- use PROJECT=foo,bar,... for multiple projects

Reuse

Add to parts/<folder>.

module Monitors
  class LoadTooHigh < Kennel::Models::Monitor
    defaults(
      name: -> { "#{project.name} load too high" },
      message: -> { "Shut it down!" },
      type: -> { "query alert" },
      query: -> { "avg(last_5m):avg:system.load.5{hostgroup:#{project.kennel_id}} by {pod} > #{critical}" }
    )
  end
end

Reuse it in multiple projects.

class Database < Kennel::Models::Project
  defaults(
    team: -> { Kennel::Models::Team.new(mention: -> { '@slack-foo' }, kennel_id: -> { 'foo' }) },
    parts: -> { [Monitors::LoadTooHigh.new(self, critical: -> { 13 })] }
  )
end

Helpers

Listing un-muted alerts

Run rake kennel:alerts TAG=service:my-service to see all un-muted alerts for a given datadog monitor tag.

Validating mentions work

rake kennel:validate_mentions should run as part of CI

Grepping through all of datadog

rake kennel:dump > tmp/dump
cat tmp/dump | grep foo

focus on a single type: TYPE=monitors

Show full resources or just their urls by pattern:

rake kennel:dump_grep DUMP=tmp/dump PATTERN=foo URLS=true
https://foo.datadog.com/dasboard/123
https://foo.datadog.com/monitor/123

Find all monitors with No-Data

rake kennel:nodata TAG=team:foo

Finding the tracking id of a resource

When trying to link resources together, this avoids having to go through datadog UI.

rake kennel:tracking_id ID=123 RESOURCE=monitor

Development

Benchmarking

Setting FORCE_GET_CACHE=true will cache all get requests, which makes benchmarking improvements more reliable.
Setting STORE=false will make rake plan not update the files on disk and save a bit of time

Integration testing

rake play
cd template
rake plan

Then make changes to play around, do not commit changes and make sure to revert with a rake kennel:update_datadog after deleting everything.

To make changes via the UI, make a new free datadog account and use it’s credentials instead.

Author

Michael Grosser

[email protected]

License: MIT