Adventures of a wannabe geek!

Ranting within

Building an Autodiscovering Apache Zookeeper Cluster in AWS Using Packer, Ansible and Terraform

Following my pattern of building AMIs for applications, I create my Apache Zookeeper cluster with Packer for the AMI and Terraform for the infrastructure. The Zookeeper nodes automatically discover the other members of the cluster.

Building Zookeeper AMIs with Packer

The packer template looks as follows:

{
  "variables": {
    "ami_id": "",
    "private_subnet_id": "",
    "security_group_id": "",
    "packer_build_number": "",
  },
  "description": "Zookeeper Image",
  "builders": [
    {
      "ami_name": "zookeeper-{{user `packer_build_number`}}",
      "availability_zone": "eu-west-1a",
      "iam_instance_profile": "app-server",
      "instance_type": "t2.small",
      "region": "eu-west-1",
      "run_tags": {
        "role": "packer"
      },
      "security_group_ids": [
        "{{user `security_group_id`}}"
      ],
      "source_ami": "{{user `ami_id`}}",
      "ssh_timeout": "10m",
      "ssh_username": "ubuntu",
      "subnet_id": "{{user `private_subnet_id`}}",
      "tags": {
        "Name": "zookeeper-packer-image"
      },
      "type": "amazon-ebs"
    }
  ],
  "provisioners": [
    {
      "type": "shell",
      "inline": [ "sleep 10" ]
    },
    {
      "type": "shell",
      "script": "install_dependencies.sh",
      "execute_command": "echo '' | {{ .Vars }} sudo -E -S sh '{{ .Path }}'"
    },
    {
      "type": "ansible-local",
      "playbook_file": "zookeeper.yml",
      "extra_arguments": [
        "--module-path=./modules"
      ],
      "playbook_dir": "../../"
    }
  ]
}
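Building the image from this template follows the same pattern as the other AMIs on this page. A minimal invocation looks something like the following; the template filename (zookeeper.json) and the variable values are placeholders:

# build the Zookeeper AMI; every value here is a placeholder
packer build \
  -var ami_id=ami-xxxxxxxx \
  -var security_group_id=sg-xxxxxxxx \
  -var private_subnet_id=subnet-xxxxxxxx \
  -var packer_build_number=42 \
  zookeeper.json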

The install_dependencies.sh script is as described previously.

The ansible playbook for Zookeeper looks as follows:

- hosts: all
  sudo: yes

  pre_tasks:
    - ec2_tags:
    - ec2_facts:

  roles:
    - base
    - zookeeper
    - exhibitor

The playbook first applies a base role that installs the common pieces of my system (e.g. Logstash, Sensu client, Prometheus node_exporter) and then proceeds to install Zookeeper. As a last step, it installs Exhibitor, a co-process for monitoring, backup/recovery, cleanup and visualisation of Zookeeper.

The zookeeper ansible role looks as follows:

- name: Download ZooKeeper
  get_url: url=http://www.mirrorservice.org/sites/ftp.apache.org/zookeeper/current/zookeeper-{{ zookeeper_version }}.tar.gz dest=/tmp/zookeeper-{{ zookeeper_version }}.tar.gz mode=0440

- name: Unpack Zookeeper
  command: tar xzf /tmp/zookeeper-{{ zookeeper_version }}.tar.gz -C /opt/ creates=/opt/zookeeper-{{ zookeeper_version }}

- name: Link to Zookeeper Directory
  file: src=/opt/zookeeper-{{ zookeeper_version }}
        dest=/opt/zookeeper
        state=link
        force=yes

- name: Create zookeeper group
  group: name=zookeeper system=true state=present

- name: Create zookeeper user
  user: name=zookeeper groups=zookeeper system=true home=/opt/zookeeper

- name: Create Zookeeper Config Dir
  file: path={{zookeeper_conf_dir}} owner=zookeeper group=zookeeper recurse=yes state=directory mode=0644

- name: Create Zookeeper Transactions Dir
  file: path=/opt/zookeeper/transactions owner=zookeeper group=zookeeper recurse=yes state=directory mode=0644

- name: Create Zookeeper Log Dir
  file: path={{zookeeper_log_dir}} owner=zookeeper group=zookeeper recurse=yes state=directory mode=0644

- name: Create Zookeeper DataStore Dir
  file: path={{zookeeper_datastore_dir}} owner=zookeeper group=zookeeper recurse=yes state=directory mode=0644

- name: Setup log4j
  template: dest="{{zookeeper_conf_dir}}/log4j.properties" owner=root group=root mode=644 src=log4j.properties.j2

The role itself is very simple: the Zookeeper cluster is managed by Exhibitor, so there are very few settings passed to Zookeeper at this point. One thing to note, though: this requires a Java JDK installation to work.
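In my case the JDK arrives via the base role. If you were provisioning a node by hand on Ubuntu 14.04, a minimal equivalent would be something like this (an assumption on my part that OpenJDK 7 is acceptable for Zookeeper 3.4.x):

# assumption: OpenJDK 7 installed manually; the base role normally handles this
sudo apt-get update
sudo apt-get install -y openjdk-7-jdk
java -version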

The exhibitor playbook looks as follows:

- name: Install Maven
  apt: pkg=maven state=latest update_cache=yes

- name: Create Exhibitor Install Dir
  file: path={{ exhibitor_install_dir }} state=directory mode=0644

- name: Create Exhibitor Build Dir
  file: path={{ exhibitor_install_dir }}/{{ exhibitor_version }} state=directory mode=0644

- name: Create Exhibitor POM
  template: src=pom.xml.j2
            dest={{ exhibitor_install_dir }}/{{ exhibitor_version }}/pom.xml

- name: Build Exhibitor jar
  command: '/usr/bin/mvn clean package -f {{ exhibitor_install_dir }}/{{ exhibitor_version }}/pom.xml creates={{ exhibitor_install_dir }}/{{ exhibitor_version }}/target/exhibitor-{{ exhibitor_version }}.jar'

- name: Copy Exhibitor jar
  command: 'cp {{ exhibitor_install_dir }}/{{ exhibitor_version }}/target/exhibitor-{{ exhibitor_version }}.jar {{exhibitor_install_dir}}/exhibitor-standalone-{{ exhibitor_version }}.jar creates={{exhibitor_install_dir}}/exhibitor-standalone-{{ exhibitor_version }}.jar'

- name: Exhibitor Properties Config
  template: src=exhibitor.properties.j2
            dest={{ exhibitor_install_dir }}/exhibitor.properties

- name: exhibitor upstart config
  template: src=upstart.j2 dest=/etc/init/exhibitor.conf mode=644 owner=root

- service: name=exhibitor state=started

This role has a lot more configuration to set as it essentially manages zookeeper. The template files for configuration are all available to download.

The variables for the entire playbook look as follows:

zookeeper_hosts: ":2181"
zookeeper_version: 3.4.6
zookeeper_conf_dir: /etc/zookeeper/conf
zookeeper_log_dir: /var/log/zookeeper
zookeeper_datastore_dir: /var/lib/zookeeper
zk_s3_bucket_name: "mys3bucket"
monasca_log_level: WARN
exhibitor_version: 1.5.5
exhibitor_install_dir: /opt/exhibitor

The main thing to note here is that the exhibitor process starts with the following configuration:

exec java -jar {{ exhibitor_install_dir }}/exhibitor-standalone-{{exhibitor_version}}.jar --port 8181 --defaultconfig /opt/exhibitor/exhibitor.properties --configtype s3 --s3config {{ zk_s3_bucket_name }}:{{ ansible_ec2_placement_region }} --s3backup true --hostname {{ ec2_private_ip_address }} > /var/log/exhibitor.log 2>&1

This means that the node will check itself into a configuration file in S3 and that all other Zookeeper nodes will read the same configuration file and can form the required cluster. You can read more about Exhibitor shared configuration on its GitHub wiki.

When I launch the instances now, the Zookeeper cluster forms itself.
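A quick way to sanity-check that the shared configuration worked is to query Exhibitor's REST API on any node. This is only a sketch: the hostname is a placeholder and the port matches the --port 8181 flag above.

# lists the servers Exhibitor knows about and whether each one is serving
curl -s http://zookeeper-node.internal:8181/exhibitor/v1/cluster/status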

Deploying Zookeeper Infrastructure with Terraform

The infrastructure of the Zookeeper cluster is pretty simple:

resource "aws_security_group" "zookeeper" {
  name = "digit-zookeeper-sg"
  description = "Zookeeper Security Group"
  vpc_id = "${aws_vpc.default.id}"

  ingress {
    from_port = 0
    to_port   = 0
    protocol  = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port = "0"
    to_port = "0"
    protocol = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags {
    Name = "ZooKeeper Node"
  }
}

resource "aws_launch_configuration" "zookeeper_launch_config" {
  image_id = "${var.zookeeper_ami_id}"
  instance_type = "${var.zookeeper_instance_type}"
  iam_instance_profile = "zookeeper-server"
  key_name = "${aws_key_pair.terraform.key_name}"
  security_groups = ["${aws_security_group.zookeeper.id}","${aws_security_group.node.id}"]
  enable_monitoring = false

  lifecycle {
    create_before_destroy = true
  }

  root_block_device {
    volume_size = "${var.digit_zookeeper_volume_size}"
  }
}

resource "aws_autoscaling_group" "zookeeper_autoscale_group" {
  name = "zookeeper-autoscale-group"
  availability_zones = ["${aws_subnet.primary-private.availability_zone}","${aws_subnet.secondary-private.availability_zone}","${aws_subnet.tertiary-private.availability_zone}"]
  vpc_zone_identifier = ["${aws_subnet.primary-private.id}","${aws_subnet.secondary-private.id}","${aws_subnet.tertiary-private.id}"]
  launch_configuration = "${aws_launch_configuration.zookeeper_launch_config.id}"
  min_size = 0
  max_size = 100
  desired_capacity = 3

  tag {
    key = "Name"
    value = "zookeeper"
    propagate_at_launch = true
  }

  tag {
    key = "role"
    value = "zookeeper"
    propagate_at_launch = true
  }
}
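Rolling this out is just the usual Terraform workflow. If the Zookeeper resources live alongside the rest of the estate, -target can limit the run to them and their dependencies (a sketch; the resource name matches the configuration above):

terraform plan -target=aws_autoscaling_group.zookeeper_autoscale_group
terraform apply -target=aws_autoscaling_group.zookeeper_autoscale_group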

When Terraform is applied, a three-node Zookeeper cluster is created. You can go to Exhibitor and see the configuration, e.g.

[screenshots]

Replacing a Node in a Riak Cluster

The instances that run in my infrastructure get a lifespan of 14 days. This allows me to continually test that I can replace my environment at any point. People always ask me if I follow the same principle for data nodes. I posted previously about replacing nodes in an ElasticSearch cluster; this post will detail how I replace nodes in a Riak cluster.

NOTE: This post assumes that you have the Riak Control console enabled for Riak. You can find out how to enable that in the post I wrote on configuring Riak.

When you go to Riak Control, you can find the following screens:

Cluster Health

[screenshot]

Ring Status

[screenshot]

Cluster Management

[screenshot]

Node Management

[screenshot]

Removing a node from the Cluster

In order to remove a node from the cluster, go to the cluster management screen. Find the node you want to replace in the list and click on the Actions toggle. It will reveal actions as follows:

[screenshot]

As the node is currently running, I tend to choose the Allow this node to leave normally option (if the node had died or was unresponsive, I would usually choose Force this node to leave). Clicking on the Stage button details a plan of what is going to happen:

[screenshot]

If the proposed changes look good, Commit the plan. Watch the partitions drain from the node to be replaced:

[screenshot]

When all the partitions have drained, we have a two-node cluster where the partitions are split 50:50:

[screenshot]

We can now destroy the node and let the autoscaling group launch another to replace it.

Adding a new node to the Cluster

Assuming a new node has already been launched and is ready to join the cluster, go to the cluster management page in the portal and enter the new node's details. The node name should follow the format riak@<ipaddress>

[screenshot]

The list of actions that are pending on the cluster:

[screenshot]

Commit the changes and watch the partitions rebalance across the cluster:

[screenshot]

The cluster will return to being 3 nodes with an equal partition split and will then show as green again:

[screenshot]
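For completeness, the same leave/join workflow can be driven from the command line with riak-admin instead of Riak Control. This is a sketch only; the node name is a placeholder:

# on the node that should leave the cluster
riak-admin cluster leave
# on any existing member, once Riak is running on the replacement node
riak-admin cluster join riak@10.0.1.25
# review the staged changes, then commit them
riak-admin cluster plan
riak-admin cluster commit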

Building a Riak Cluster in AWS With Packer and Terraform

Following my pattern of building AMIs for applications, I create my Riak cluster with Packer for the AMI and Terraform for the infrastructure.

Building Riak AMIs with Packer

{
  "variables": {
    "ami_id": "",
    "private_subnet_id": "",
    "security_group_id": "",
    "packer_build_number": "",
  },
  "description": "Riak Image",
  "builders": [
    {
      "ami_name": "riak-{{user `packer_build_number`}}",
      "availability_zone": "eu-west-1a",
      "iam_instance_profile": "app-server",
      "instance_type": "t2.small",
      "region": "eu-west-1",
      "run_tags": {
        "role": "packer"
      },
      "security_group_ids": [
        "{{user `security_group_id`}}"
      ],
      "source_ami": "{{user `ami_id`}}",
      "ssh_timeout": "10m",
      "ssh_username": "ubuntu",
      "subnet_id": "{{user `private_subnet_id`}}",
      "tags": {
        "Name": "riak-packer-image"
      },
      "type": "amazon-ebs"
    }
  ],
  "provisioners": [
    {
      "type": "shell",
      "inline": [ "sleep 10" ]
    },
    {
      "type": "shell",
      "script": "install_dependencies.sh",
      "execute_command": "echo '' | {{ .Vars }} sudo -E -S sh '{{ .Path }}'"
    },
    {
      "type": "ansible-local",
      "playbook_file": "riak.yml",
      "extra_arguments": [
        "--module-path=./modules"
      ],
      "playbook_dir": "../../"
    }
  ]
}

The install_dependencies.sh script is as described previously.

The ansible playbook for Riak looks as follows:

- hosts: all
  sudo: yes

  pre_tasks:
    - ec2_tags:
    - ec2_facts:

  roles:
    - base
    - riak

The playbook first applies a base role that installs the common pieces of my system (e.g. Logstash, Sensu client, Prometheus node_exporter) and then proceeds to install Riak.

The riak ansible role looks as follows:

- action: apt_key url={{ riak_key_url }} state=present

- action: apt_repository repo='{{ riak_deb_repo }}' state=present update_cache=yes

- apt: name=riak={{ riak_version }} state=present update_cache=yes

- name: set ulimit
  copy: src=etc-default-riak dest=/etc/default/riak owner=root group=root mode=0644

- name: template riak configuration
  template: src=riak.j2 dest=/etc/riak/riak.conf owner=riak mode=0644
  register: restart_riak

- name: restart riak
  service: name=riak state=started

The role itself is very simple. The riak cluster settings are all held in the riak.j2 template file. Notice that the riak template has the following line in it:

riak_control = on

The variables for the riak playbook look as follows:

riak_key_url: "https://packagecloud.io/gpg.key"
riak_deb_repo: "deb https://packagecloud.io/basho/riak/ubuntu/ trusty main"
riak_version: 2.1.1-1

Deploying Riak with Terraform

The infrastructure of the Riak cluster is pretty simple:

resource "aws_elb" "riak_v2_elb" {
  name = "riak-elb-v2"
  subnets = ["${aws_subnet.primary-private.id}","${aws_subnet.secondary-private.id}","${aws_subnet.tertiary-private.id}"]
  security_groups = ["${aws_security_group.riak_elb.id}"]
  cross_zone_load_balancing = true
  connection_draining = true
  internal = true

  listener {
    instance_port      = 8098
    instance_protocol  = "tcp"
    lb_port            = 8098
    lb_protocol        = "tcp"
  }

  health_check {
    healthy_threshold   = 2
    unhealthy_threshold = 2
    interval            = 10
    target              = "HTTP:8098/ping"
    timeout             = 5
  }
}

resource "aws_security_group" "riak" {
  name = "riak-sg"
  description = "Riak Security Group"
  vpc_id = "${aws_vpc.default.id}"

  ingress {
    from_port = 8098
    to_port   = 8098
    protocol  = "tcp"
    security_groups = ["${aws_security_group.riak_elb.id}"]
  }

  ingress {
    from_port = 8098
    to_port   = 8098
    protocol  = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port = "0"
    to_port = "0"
    protocol = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags {
    Name = "Riak Node"
  }
}

resource "aws_security_group_rule" "riak_all_tcp" {
    type = "ingress"
    from_port = 0
    to_port = 65535
    protocol = "tcp"
    security_group_id = "${aws_security_group.riak.id}"
    source_security_group_id = "${aws_security_group.riak.id}"
}

resource "aws_security_group" "riak_elb" {
  name = "riak-elb-sg"
  description = "Riak Elastic Load Balancer Security Group"
  vpc_id = "${aws_vpc.default.id}"

  ingress {
    from_port = 8098
    to_port   = 8098
    protocol  = "tcp"
    security_groups = ["${aws_security_group.node.id}"]
  }

  ingress {
    from_port = 8098
    to_port   = 8098
    protocol  = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port = "0"
    to_port = "0"
    protocol = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags {
    Name = "Riak Load Balancer"
  }
}

resource "aws_autoscaling_group" "riak_v2_autoscale_group" {
  name = "riak-v2-autoscale-group"
  availability_zones = ["${aws_subnet.primary-private.availability_zone}","${aws_subnet.secondary-private.availability_zone}","${aws_subnet.tertiary-private.availability_zone}"]
  vpc_zone_identifier = ["${aws_subnet.primary-private.id}","${aws_subnet.secondary-private.id}","${aws_subnet.tertiary-private.id}"]
  launch_configuration = "${aws_launch_configuration.riak_launch_config.id}"
  min_size = 0
  max_size = 100
  health_check_type = "EC2"

  tag {
    key = "Name"
    value = "riak"
    propagate_at_launch = true
  }

  tag {
    key = "role"
    value = "riak"
    propagate_at_launch = true
  }

  tag {
    key = "elb_name"
    value = "${aws_elb.riak_v2_elb.name}"
    propagate_at_launch = true
  }

  tag {
    key = "elb_region"
    value = "${var.aws_region}"
    propagate_at_launch = true
  }
}

resource "aws_launch_configuration" "riak_launch_config" {
  image_id = "${var.riak_ami_id}"
  instance_type = "${var.riak_instance_type}"
  iam_instance_profile = "app-server"
  key_name = "${aws_key_pair.terraform.key_name}"
  security_groups = ["${aws_security_group.riak.id}","${aws_security_group.node.id}"]
  enable_monitoring = false

  lifecycle {
    create_before_destroy = true
  }

  root_block_device {
    volume_size = "${var.driak_volume_size}"
  }
}
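The ELB health check above uses Riak's HTTP ping endpoint, and the same check is handy from the command line once the cluster is up (the load balancer hostname is a placeholder):

# returns OK when a healthy node answers behind the load balancer
curl -s http://riak-elb.internal:8098/ping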

Replacing the Nodes in an AWS ElasticSearch Cluster

In a previous post, I talked about how I have tended towards the philosophy of 'Immutable Infrastructure'. As part of that philosophy, when a box is created in my environment, it has a lifespan of 14 days. On the 14th day, I get a notification to tell me that the box is due for renewal. When it comes to ElasticSearch nodes, there is a process I follow to renew a box.

I have an example 3 node cluster of ElasticSearch up and running to test this on:

[screenshot]

Let's assume that instance i- was due for renewal. Firstly, I would usually disable shard allocation. This stops unnecessary data transfer between nodes and minimises wasted I/O.

curl -XPUT localhost:9200/_cluster/settings -d '{
                "transient" : {
                    "cluster.routing.allocation.enable" : "none"
                }
        }'

As shard allocation is now disabled, I can continue with the node replacement. There are a few ways to do this. Prior to ElasticSearch 2.0, it could be done with the ElasticSearch API:

curl -XPOST 'http://localhost:9200/_cluster/nodes/MYNODEIP/_shutdown'

If you are using ElasticSearch 2.0, you are more than likely running ElasticSearch as a service. To shut down the node, stop the service.
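In practice that just means stopping the service on the node being retired, for example:

sudo service elasticsearch stop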

By looking at the status of the cluster now, I can see the following:

curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 160,
  "active_shards" : 317,
  "relocating_shards" : 0,
  "initializing_shards" : 2,
  "unassigned_shards" : 151,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}

I can see that it tells me the cluster is yellow and that I have 2 nodes in it. I can proceed with the instance termination.

[screenshot]

I have an AWS Autoscale Group configured for ElasticSearch to keep 3 instances running. Therefore, the node that I destroyed will fail the Autoscale Group Healthcheck and a new instance will be spawned to replace it.

Using the ElasticSearch Cluster Health API, I can determine when the new node is in place:

curl -XGET 'http://localhost:9200/_cluster/health?wait_for_nodes=3&timeout=100s'

The command will continue running until the cluster has 3 nodes in it. If you want to replace more nodes in the cluster, then repeat the steps above. If you are finished, then it is important to re-enable the shard reallocation:

curl -XPUT localhost:9200/_cluster/settings -d '{
                "transient" : {
                    "cluster.routing.allocation.enable" : "all"
                }
        }'

The time taken to rebalance the cluster will depend on the number and size of the shards.

You can monitor the health of the cluster until it turns green:

curl -XGET 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=100s'

The cluster is now green and all is working as expected again:

curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "elasticsearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 160,
  "active_shards" : 470,
  "relocating_shards" : 1,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}

Deploying Kibana Using Nginx as an SSL Proxy

In my last post, I described how I use Packer and Terraform to deploy an ElasticSearch cluster. In order to make the logs stored in ElasticSearch searchable, I use Kibana. I follow the previous pattern and deploy Kibana using Packer to build an AMI and then create the infrastructure using Terraform. The Packer template has already taken into account that I want to use nginx as a proxy.

Building Kibana AMIs with Packer and Ansible

The template looks as follows:

{
  "variables": {
    "ami_id": "",
    "private_subnet_id": "",
    "security_group_id": "",
    "packer_build_number": "",
  },
  "description": "Kibana Image",
  "builders": [
    {
      "ami_name": "kibana-{{user `packer_build_number`}}",
      "availability_zone": "eu-west-1a",
      "iam_instance_profile": "app-server",
      "instance_type": "t2.small",
      "region": "eu-west-1",
      "run_tags": {
        "role": "packer"
      },
      "security_group_ids": [
        "{{user `security_group_id`}}"
      ],
      "source_ami": "{{user `ami_id`}}",
      "ssh_timeout": "10m",
      "ssh_username": "ubuntu",
      "subnet_id": "{{user `private_subnet_id`}}",
      "tags": {
        "Name": "kibana-packer-image"
      },
      "type": "amazon-ebs"
    }
  ],
  "provisioners": [
    {
      "type": "shell",
      "inline": [ "sleep 10" ]
    },
    {
      "type": "shell",
      "script": "install_dependencies.sh",
      "execute_command": "echo '' | {{ .Vars }} sudo -E -S sh '{{ .Path }}'"
    },
    {
      "type": "ansible-local",
      "playbook_file": "kibana.yml",
      "extra_arguments": [
        "--module-path=./modules"
      ],
      "playbook_dir": "../../"
    }
  ]
}

The install_dependencies.sh script is as described previously.

The ansible playbook for Kibana looks as follows:

- hosts: all
  sudo: yes

  pre_tasks:
    - ec2_tags:
    - ec2_facts:

  roles:
    - base
    - kibana
    - reverse_proxied

The playbook first applies a base role that installs the common pieces of my system (e.g. Logstash, Sensu client, Prometheus node_exporter), then installs Kibana and finally sets up the nginx reverse proxy.

The Kibana role looks as follows:

- name: Download Kibana
  get_url: url=https://download.elasticsearch.org/kibana/kibana/kibana-{{ kibana_version }}-linux-x64.tar.gz dest=/tmp/kibana-{{ kibana_version }}-linux-x64.tar.gz mode=0440

- name: Untar Kibana
  command: tar xzf /tmp/kibana-{{ kibana_version }}-linux-x64.tar.gz -C /opt creates=/opt/kibana-{{ kibana_version }}-linux-x64

- name: Link to Kibana Directory
  file: src=/opt/kibana-{{ kibana_version }}-linux-x64
        dest=/opt/kibana
        state=link
        force=yes

- name: Link Kibana to ElasticSearch
  lineinfile: >
    dest=/opt/kibana/config/kibana.yml
    regexp="^elasticsearch_url:"
    line='elasticsearch_url: "{{ elasticsearch_url }}"'

- name: Create Kibana Init Script
  copy: src=initd.conf dest=/etc/init.d/kibana mode=755 owner=root

- name: Ensure Kibana is running
  service: name=kibana state=started

The reverse_proxied ansible role looks as follows:

- name: download private key file
  command: aws s3 cp {{ reverse_proxy_private_key_s3_path }} /etc/ssl/private/{{ reverse_proxy_private_key }}

- name: private key permissions
  file: path=/etc/ssl/private/{{ reverse_proxy_private_key }} mode=600

- name: download certificate file
  command: aws s3 cp {{ reverse_proxy_cert_s3_path }} /etc/ssl/certs/{{ reverse_proxy_cert }}

- name: download DH 2048bit encryption
  command: aws s3 cp {{ reverse_proxy_dh_pem_s3_path }} /etc/ssl/{{ reverse_proxy_dh_pem }}

- name: certificate permissions
  file: path=/etc/ssl/certs/{{ reverse_proxy_cert }} mode=644

- apt: pkg=nginx

- name: remove default nginx site from sites-enabled
  file: path=/etc/nginx/sites-enabled/default state=absent

- template: src=nginx.conf.j2 dest=/etc/nginx/nginx.conf mode=644 owner=root group=root

- service: name=nginx state=restarted

- file: path=/var/log/nginx
        mode=0755
        state=directory

This role downloads a private SSL key and a certificate from an S3 bucket that is security-controlled through IAM. This allows us to configure nginx to act as a proxy. The nginx proxy template is available to view.
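The key, certificate and Diffie-Hellman parameters need to exist in that bucket before the role runs. A one-off preparation step might look like this; the bucket and domain names are placeholders matching the variables listed below:

# generate 2048-bit DH parameters and upload the TLS material to S3
openssl dhparam -out dhparams.pem 2048
aws s3 cp dhparams.pem s3://my-bucket/certs/dhparams.pem
aws s3 cp mydomain.key s3://my-bucket/certs/mydomain/mydomain.key
aws s3 cp mydomain.crt s3://my-bucket/certs/mydomain/mydomain.crt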

We can then pass a number of variables to our role for use within ansible:

reverse_proxy_private_key: mydomain.key
reverse_proxy_private_key_s3_path: s3://my-bucket/certs/mydomain/mydomain.key
reverse_proxy_cert: mydomain.crt
reverse_proxy_cert_s3_path: s3://my-bucket/certs/mydomain/mydomain.crt
reverse_proxy_dh_pem_s3_path: s3://my-bucket/certs/dhparams.pem
reverse_proxy_dh_pem: dhparams.pem
proxy_urls:
  - reverse_proxy_url: /
    reverse_proxy_upstream_port: 3000
kibana_version: 4.1.0
elasticsearch_url: http://myes.com:9200

This allows me to easily change the nginx configuration to patch security vulnerabilities.

Deploying Kibana with Terraform

The infrastructure for the Kibana cluster is pretty simple. The Terraform configuration looks as follows:

resource "aws_security_group" "kibana" {
  name = "kibana-sg"
  description = "Kibana Security Group"
  vpc_id = "${aws_vpc.default.id}"

  ingress {
    from_port = 443
    to_port   = 443
    protocol  = "tcp"
    security_groups = ["${aws_security_group.kibana_elb.id}"]
  }

  ingress {
    from_port = 80
    to_port   = 80
    protocol  = "tcp"
    security_groups = ["${aws_security_group.kibana_elb.id}"]
  }

  egress {
    from_port = "0"
    to_port = "0"
    protocol = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags {
    Name = "Kibana Node"
  }
}

resource "aws_security_group" "kibana_elb" {
  name = "kibana-elb-sg"
  description = "Kibana Elastic Load Balancer Security Group"
  vpc_id = "${aws_vpc.default.id}"

  ingress {
    from_port = 443
    to_port   = 443
    protocol  = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port = 80
    to_port   = 80
    protocol  = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port = "0"
    to_port = "0"
    protocol = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags {
    Name = "Kibana Load Balancer"
  }
}

resource "aws_elb" "kibana_elb" {
  name = "kibana-elb"
  subnets = ["${aws_subnet.primary-private.id}","${aws_subnet.secondary-private.id}","${aws_subnet.tertiary-private.id}"]
  security_groups = ["${aws_security_group.kibana_elb.id}"]
  cross_zone_load_balancing = true
  connection_draining = true
  internal = true

  listener {
    instance_port      = 443
    instance_protocol  = "tcp"
    lb_port            = 443
    lb_protocol        = "tcp"
  }

  listener {
    instance_port      = 80
    instance_protocol  = "tcp"
    lb_port            = 80
    lb_protocol        = "tcp"
  }

  health_check {
    healthy_threshold   = 2
    unhealthy_threshold = 2
    interval            = 10
    target              = "TCP:443"
    timeout             = 5
  }
}

resource "aws_launch_configuration" "kibana_launch_config" {
  image_id = "${var.kibana_ami_id}"
  instance_type = "${var.kibana_instance_type}"
  iam_instance_profile = "app-server"
  key_name = "${aws_key_pair.terraform.key_name}"
  security_groups = ["${aws_security_group.kibana.id}","${aws_security_group.node.id}"]
  enable_monitoring = false

  root_block_device {
    volume_size = "${var.kibana_volume_size}"
  }

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_autoscaling_group" "kibana_autoscale_group" {
  name = "kibana-autoscale-group"
  availability_zones = ["${aws_subnet.primary-private.availability_zone}","${aws_subnet.secondary-private.availability_zone}","${aws_subnet.tertiary-private.availability_zone}"]
  vpc_zone_identifier = ["${aws_subnet.primary-private.id}","${aws_subnet.secondary-private.id}","${aws_subnet.tertiary-private.id}"]
  launch_configuration = "${aws_launch_configuration.kibana_launch_config.id}"
  min_size = 2
  max_size = 100
  health_check_type = "EC2"
  load_balancers = ["${aws_elb.kibana_elb.name}"]

  tag {
    key = "Name"
    value = "kibana"
    propagate_at_launch = true
  }

  tag {
    key = "role"
    value = "kibana"
    propagate_at_launch = true
  }

  tag {
    key = "elb_name"
    value = "${aws_elb.kibana_elb.name}"
    propagate_at_launch = true
  }

  tag {
    key = "elb_region"
    value = "${var.aws_region}"
    propagate_at_launch = true
  }
}

This allows me to scale my system up or down just by changing the values in my Terraform configuration. When the instances are instantiated, the Kibana instances are added to the ELB and are then available to serve traffic.

Building an ElasticSearch Cluster in AWS With Packer and Terraform

As discussed in a previous post, I like to build separate AMIs for each of my systems. This allows me to scale up and recycle nodes easily. I have been doing this with ElasticSearch for a while now. I usually build an AMI with Packer and Ansible, and I use Terraform to roll out the infrastructure.

Building ElasticSearch AMIs with Packer

The packer template looks as follows:

{
  "variables": {
    "ami_id": "",
    "private_subnet_id": "",
    "security_group_id": "",
    "packer_build_number": "",
  },
  "description": "ElasticSearch Image",
  "builders": [
    {
      "ami_name": "elasticsearch-{{user `packer_build_number`}}",
      "availability_zone": "eu-west-1a",
      "iam_instance_profile": "app-server",
      "instance_type": "t2.small",
      "region": "eu-west-1",
      "run_tags": {
        "role": "packer"
      },
      "security_group_ids": [
        "{{user `security_group_id`}}"
      ],
      "source_ami": "{{user `ami_id`}}",
      "ssh_timeout": "10m",
      "ssh_username": "ubuntu",
      "subnet_id": "{{user `private_subnet_id`}}",
      "tags": {
        "Name": "elasticsearch-packer-image"
      },
      "type": "amazon-ebs"
    }
  ],
  "provisioners": [
    {
      "type": "shell",
      "inline": [ "sleep 10" ]
    },
    {
      "type": "shell",
      "script": "install_dependencies.sh",
      "execute_command": "echo '' | {{ .Vars }} sudo -E -S sh '{{ .Path }}'"
    },
    {
      "type": "ansible-local",
      "playbook_file": "elasticsearch.yml",
      "extra_arguments": [
        "--module-path=./modules"
      ],
      "playbook_dir": "../../"
    }
  ]
}

This is a fairly simple template that builds an AWS instance in a private subnet of my choice in eu-west-1a.

install_dependencies.sh

The first part of the script just installs the dependencies that my system has:

#!/bin/bash

apt-get update
apt-get upgrade -y
apt-get install -y software-properties-common git
apt-add-repository -y ppa:ansible/ansible
apt-get update

# workaround for ubuntu pip bug - https://bugs.launchpad.net/ubuntu/+source/python-pip/+bug/1306991
rm -rf /usr/local/lib/python2.7/dist-packages/requests
apt-get install -y python-dev

ssh-keyscan -H github.com > /etc/ssh/ssh_known_hosts

wget https://raw.github.com/pypa/pip/master/contrib/get-pip.py
python get-pip.py

pip install ansible paramiko PyYAML jinja2 httplib2 netifaces boto awscli six

Ansible playbook for ElasticSearch

The ElasticSearch playbook looks as follows:

- hosts: all
  sudo: yes

  pre_tasks:
    - ec2_tags:
    - ec2_facts:

  roles:
    - base
    - elasticsearch

The playbook first applies a base role that installs the common pieces of my system (e.g. Logstash, Sensu client, Prometheus node_exporter) and then proceeds to install ElasticSearch.

The ElasticSearch role looks as follows:

- ec2_facts:
- ec2_tags:

- name: Add Oracle Java Repository
  apt_repository: repo='ppa:webupd8team/java'

- name: Accept Java 8 Licence
  shell: echo oracle-java8-installer shared/accepted-oracle-license-v1-1 select true | tee /etc/oracle-java-8-licence-acceptance | /usr/bin/debconf-set-selections
  args:
    creates: /etc/oracle-java-8-licence-acceptance

- name: Add ElasticSearch repo public signing key
  apt_key: id=46095ACC8548582C1A2699A9D27D666CD88E42B4 url=https://packages.elastic.co/GPG-KEY-elasticsearch state=present

- name: Add ElasticSearch repository
  apt_repository:
    repo: 'deb http://packages.elasticsearch.org/elasticsearch/{{ es_release }}/debian stable main'
    state: present

- name: Install Oracle Java 8
  apt: name={{item}} state=latest
  with_items:
    - oracle-java8-installer
    - ca-certificates
    - oracle-java8-set-default

- name: Install ElasticSearch
  apt: name=elasticsearch={{ es_version }} state=present
  notify: Restart elasticsearch

- name: Copy /etc/default/elasticsearch
  template: src=elasticsearch dest=/etc/default/elasticsearch
  notify: Restart elasticsearch

- name: Copy /etc/elasticsearch/elasticsearch.yml
  template: src=elasticsearch.yml dest=/etc/elasticsearch/elasticsearch.yml
  notify: Restart elasticsearch

- name: Set elasticsearch service to start on boot
  service: name=elasticsearch enabled=yes

- name: Install plugins
  command: bin/plugin --install {{item.name}}
  args:
    chdir: "{{ es_home }}"
    creates: "{{ es_home }}/plugins/{{ item.plugin_file | default(item.name.split('/')[1]) }}"
  with_items: es_plugins
  notify: Restart elasticsearch

- name: Set elasticsearch to be running
  service: name=elasticsearch state=started enabled=yes

These are just some basic Ansible tasks to get the apt repo, packages and plugins installed on the system. You can find the templates used here. The important part to note is that variables are used both in the role and in the templates to set the cluster up as required.

My variables look as follows:

es_release: "1.6"
es_version: "1.6.0"
es_home: /usr/share/elasticsearch
es_wait_for_listen: yes
es_etc:
  cluster_name: central_logging_cluster
  discovery.type: ec2
  discovery.ec2.groups: elasticsearch-sg
  cloud.aws.region: ""
es_default_es_heap_size: 4g
es_plugins:
  - name: elasticsearch/elasticsearch-cloud-aws/2.6.0
  - name: elasticsearch/marvel/latest
  - name: mobz/elasticsearch-head
es_etc_index_number_of_replicas: 2

As I have specified the elasticsearch-sg security group and installed the elasticsearch-cloud-aws plugin, my nodes can auto-discover each other in the AWS region. I can build the Packer image as follows:

#!/bin/bash

LATEST_UBUNTU_IMAGE=$(curl http://cloud-images.ubuntu.com/locator/ec2/releasesTable | grep eu-west-1 | grep trusty | grep amd64 | grep "\"hvm:ebs\"" | awk -F "[<>]" '{print $3}')

packer build \
  -var ami_id=$LATEST_UBUNTU_IMAGE \
  -var security_group_id=MYSGID\
  -var private_subnet_id=MYSUBNETID \
  -var packer_build_number=PACKERBUILDNUMBER \
  elasticsearch.json

We are now ready to build the infrastructure for the cluster.

Building an ElasticSearch Cluster with Terraform

The infrastructure for the ElasticSearch cluster is pretty simple. I deploy my nodes into a VPC and onto private subnets so that they are not externally accessible. I have an ELB in place across the nodes so that I can easily get to the ElasticSearch plugins like Marvel and Head.

The Terraform configuration is as follows:

resource "aws_security_group" "elasticsearch" {
  name = "elasticsearch-sg"
  description = "ElasticSearch Security Group"
  vpc_id = "${aws_vpc.default.id}"

  ingress {
    from_port = 9200
    to_port   = 9400
    protocol  = "tcp"
    security_groups = ["${aws_security_group.elasticsearch_elb.id}"]
  }

  ingress {
    from_port = 9200
    to_port   = 9400
    protocol  = "tcp"
    security_groups = ["${aws_security_group.node.id}"]
  }

  egress {
    from_port = "0"
    to_port = "0"
    protocol = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags {
    Name = "ElasticSearch Node"
  }
}


resource "aws_security_group" "elasticsearch_elb" {
  name = "elasticsearch-elb-sg"
  description = "ElasticSearch Elastic Load Balancer Security Group"
  vpc_id = "${aws_vpc.default.id}"

  ingress {
    from_port = 9200
    to_port   = 9200
    protocol  = "tcp"
    security_groups = ["${aws_security_group.node.id}"]
  }

  egress {
    from_port = "0"
    to_port = "0"
    protocol = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags {
    Name = "ElasticSearch Load Balancer"
  }
}

resource "aws_elb" "elasticsearch_elb" {
  name = "elasticsearch-elb"
  subnets = ["${aws_subnet.primary-private.id}","${aws_subnet.secondary-private.id}","${aws_subnet.tertiary-private.id}"]
  security_groups = ["${aws_security_group.elasticsearch_elb.id}"]
  cross_zone_load_balancing = true
  connection_draining = true
  internal = true

  listener {
    instance_port      = 9200
    instance_protocol  = "tcp"
    lb_port            = 9200
    lb_protocol        = "tcp"
  }

  health_check {
    healthy_threshold   = 2
    unhealthy_threshold = 2
    interval            = 10
    target              = "TCP:9200"
    timeout             = 5
  }
}

resource "aws_autoscaling_group" "elasticsearch_autoscale_group" {
  name = "elasticsearch-autoscale-group"
  availability_zones = ["${aws_subnet.primary-private.availability_zone}","${aws_subnet.secondary-private.availability_zone}","${aws_subnet.tertiary-private.availability_zone}"]
  vpc_zone_identifier = ["${aws_subnet.primary-private.id}","${aws_subnet.secondary-private.id}","${aws_subnet.tertiary-private.id}"]
  launch_configuration = "${aws_launch_configuration.elasticsearch_launch_config.id}"
  min_size = 3
  max_size = 100
  desired_capacity = 3
  health_check_grace_period = "900"
  health_check_type = "EC2"
  load_balancers = ["${aws_elb.elasticsearch_elb.name}"]

  tag {
    key = "Name"
    value = "elasticsearch"
    propagate_at_launch = true
  }

  tag {
    key = "role"
    value = "elasticsearch"
    propagate_at_launch = true
  }

  tag {
    key = "elb_name"
    value = "${aws_elb.elasticsearch_elb.name}"
    propagate_at_launch = true
  }

  tag {
    key = "elb_region"
    value = "${var.aws_region}"
    propagate_at_launch = true
  }
}

resource "aws_launch_configuration" "elasticsearch_launch_config" {
  image_id = "${var.elasticsearch_ami_id}"
  instance_type = "${var.elasticsearch_instance_type}"
  iam_instance_profile = "app-server"
  key_name = "${aws_key_pair.terraform.key_name}"
  security_groups = ["${aws_security_group.elasticsearch.id}","${aws_security_group.node.id}"]
  enable_monitoring = false

  lifecycle {
    create_before_destroy = true
  }

  root_block_device {
    volume_size = "${var.elasticsearch_volume_size}"
  }
}

This allows me to scale my system up or down just by changing the values in my Terraform configuration. When the instances are instantiated, the ElasticSearch cloud plugin discovers the other members of the cluster and allows the node to join the cluster.
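A quick way to confirm that discovery worked and that new instances have joined is to ask the cluster for its node list through the internal ELB (the hostname is a placeholder):

# lists every node currently in the cluster
curl -s 'http://elasticsearch-elb.internal:9200/_cat/nodes?v'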

Autoscaling Group Notifications With Terraform and AWS Lambda

I use Autoscaling Groups in AWS for all of my systems. The main benefit for me here was to make sure that when a node died in AWS, the Autoscaling Group policy made sure that the node was replaced. I wanted to get some visibility of when the Autoscaling Group was launching and terminating nodes and decided that posting notifications to Slack would be a good way of getting this. With Terraform and AWS Lambda, I was able to make this happen.

This post assumes that you are already set up and running with Terraform.

Create an IAM role that AWS Lambda can assume:

resource "aws_iam_role" "slack_iam_lambda" {
    name = "slack-iam-lambda"
    assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF
}

Create a lambda function as follows:

resource "aws_lambda_function" "slack_notify" {
  filename = "slackNotify.zip"
  function_name = "slackNotify"
  role = "${aws_iam_role.slack_iam_lambda.arn}"
  handler = "slackNotify.handler"
}

We assume here that you have already created a Slack integration. The hook URL from that integration is required for the lambda contents.

The filename slackNotify.zip is a zip of a file called slackNotify.js. The contents of that js file are available to view.
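Terraform uploads whatever is inside slackNotify.zip, so the zip only needs to contain that one file:

zip slackNotify.zip slackNotify.js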

Terraform currently does not support hooking AWS Lambda up to SNS Event Sources. Therefore, unfortunately, there is a manual step required to configure the Lambda to talk to the SNS Topic. There is a PR in Terraform to allow this to be automated as well.

In the AWS Console, go to Lambda and then choose the Lambda function.

[screenshot]

Go to the Event Sources tab:

[screenshot]

Click on Add Event Source, choose SNS from the dropdown and make sure you choose the correct SNS Topic name:

[screenshot]
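If you would rather script this manual step than click through the console, the same wiring can be done with the AWS CLI. This is a sketch; both ARNs and the account ID are placeholders:

# allow SNS to invoke the function, then subscribe the function to the topic
aws lambda add-permission \
  --function-name slackNotify \
  --statement-id asg-slack-notifications \
  --action lambda:InvokeFunction \
  --principal sns.amazonaws.com \
  --source-arn arn:aws:sns:eu-west-1:123456789012:asg_slack_notifications

aws sns subscribe \
  --topic-arn arn:aws:sns:eu-west-1:123456789012:asg_slack_notifications \
  --protocol lambda \
  --notification-endpoint arn:aws:lambda:eu-west-1:123456789012:function:slackNotify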

We then use another Terraform resource to attach the Autoscale Groups to the Lambda as follows:

resource "aws_autoscaling_notification" "slack_notifications" {
  group_names = [
    "admin-api-autoscale-group",
    "rundeck-autoscale-group",
  ]
  notifications  = [
    "autoscaling:EC2_INSTANCE_LAUNCH",
    "autoscaling:EC2_INSTANCE_TERMINATE",
    "autoscaling:EC2_INSTANCE_LAUNCH_ERROR",
    "autoscaling:EC2_INSTANCE_TERMINATE_ERROR",
    "autoscaling:TEST_NOTIFICATION"
  ]
  topic_arn = "${aws_sns_topic.asg_slack_notifications.arn}"
}

As we have configured notifications for autoscaling:TEST_NOTIFICATION, when you apply this configuration with Terraform, you will see something similar to the following in Slack:

[screenshot]

In the current infrastructure I manage, there are 27 Autoscale groups. I don't really want to add 27 hardcoded group names to the aws_autoscaling_notification resource in Terraform.

I wanted to take advantage of a Terraform module. In a nutshell, the module does a lookup of all the Autoscaling Groups in a region and then passes that list into the Terraform resource.
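Under the hood that lookup is essentially the query you could run yourself with the AWS CLI (a sketch):

# list every Autoscaling Group name in the region as a comma-separated string
aws autoscaling describe-auto-scaling-groups \
  --region eu-west-1 \
  --query 'AutoScalingGroups[].AutoScalingGroupName' \
  --output text | tr '\t' ','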

The output of the module looks as follows:

{
  "variable": {
    "autoscalegroup_names": {
      "description": "List of autoscalegroup names for a region",
      "default": {
        "eu-west-1": "admin-api-autoscale-group,dash-autoscale-group,demo-autoscale-group,docker-v2-autoscale-group,elasticsearch-autoscale-group,faces-autoscale-group,internal-api-autoscale-group,jenkins-master-autoscale-group,kafka-autoscale-group,landscapes-autoscale-group",
        "ap-southeast-1": "",
        "ap-southeast-2": "",
        "eu-central-1": "",
        "ap-northeast-1": "",
        "us-east-1": "",
        "sa-east-1": "",
        "us-west-1": "",
        "us-west-2": ""
      }
    }
  }
}

I then pass this into the Terraform resource as follows:

module "autoscalegroups" {
  source = "github.com/stack72/tf_aws_autoscalegroup_names"
  region = "${var.aws_region}"
}

resource "aws_autoscaling_notification" "slack_notifications" {
  group_names = [
    "${split(",", module.autoscalegroups.asg_names)}",
  ]
  notifications  = [
    "autoscaling:EC2_INSTANCE_LAUNCH",
    "autoscaling:EC2_INSTANCE_TERMINATE",
    "autoscaling:EC2_INSTANCE_LAUNCH_ERROR",
    "autoscaling:EC2_INSTANCE_TERMINATE_ERROR",
    "autoscaling:TEST_NOTIFICATION"
  ]
  topic_arn = "${aws_sns_topic.asg_slack_notifications.arn}"
}

When anything happens within an Autoscaling Group, I now get notifications as follows:

[screenshots]

The Quest for Infrastructure Management 2.0

I've long been a fan of configuration management tools. I have blogged, spoken at conferences and used Puppet as well as Chef and Ansible. The more I use these tools now, the more I realise I'm actually not making my life any easier.

Currently, the infrastructure I manage is 100% AWS Cloud based. This has actually changed how I work:

  1. I have learned to always expect problems, so I should have everything 100% automated.

  2. No server is kept in production for more than 2 weeks.

By combining these 2 ways of working, I can easily recover from outages. The speed of recovery comes down to being able to provision the pieces of my system as fast as possible. The simplest way to provision instances fast is to build my own AMIs with Packer. I have come to the realisation that when I boot an instance, I don't really want to wait for a configuration management tool to run. I have also begun to realise that having a tool change my systems in production can introduce unneeded risk. The Packer templates that build the AMIs have serverspec tests built into them, which means that at build time I know whether an AMI has been built correctly.

The AWS infrastructure itself is managed by Terraform. I tend to use AutoScalingGroups and LaunchConfig for the instances and when Terraform is checking the state of the infrastructure, it will look up the latest AMI ID and make sure that it is part of the Launch Configuration. If it isn't, Terraform will update the Launch Config so that the next machine will be booted from the new AMI.

I use Rundeck for orchestrating changes to the infrastructure. I have a job in Rundeck that allows me to recycle all instances in an AutoScalingGroup one at a time and in a HA manner. From building a new AMI, to fully recycling an AutoScalingGroup is about 20 minutes (the packer build itself takes about 12 minutes). So, in theory, it takes me about 20 minutes to release new security patches to all instances in an AutoScalingGroup.

Isn't this just 'Golden Images'?

Technically, yes. But the important thing for me is being able to roll out a fully tested AMI and then not making any additional changes to it in production. I would like to say that my infrastructure is 100% immutable, but after reading a recent article by @emmajanehw, I now realise that can never be the case. Each of my AMIs is versioned and I have a nightly Rundeck job that tells me what version of an AMI a system is built / released with.

Do I Consider Configuration Management Dead?

Not at all. I simply do not want to make additional changes to my environments when I know they are working. Right now, I use Ansible to provision my AMIs as part of my Packer scripts, so I do believe these tools still need to be part of our ecosystem. I could substitute in any configuration management tool to help build my AMIs; the purists could even use bash / shell scripts to do the same job.

Can I only do this if I use *nix / AWS?

Not at all. At $JOB[-1], we were actually changing our provisioning to allow us to spin up images much faster. We were using a mix of AMIs and VMware templates for Windows and Ubuntu. Moving in that direction would reduce the time taken to provision a box from maybe an hour to minutes.

In my opinion, moving to a more immutable style of infrastructure is the next phase of infrastructure management for me. I believe the lessons learned from using config management tools in production across 1000s of nodes have helped me move in this direction, but YMMV.

DevOps and .Net Conference

So I just tweeted the following:

[screenshot]

Firstly, I'd like to say that this is not about naming and shaming. Secondly, I am not annoyed with the conference at all about the response. The conference I spoke to advertises itself as “engineering talks only” so I wanted to post a few things about that.

In my opinion, the writing of code and the ecosystem of a specific platform is only 10% (or rather a small portion) of what we need to be aware of as software engineers. I am a software developer who works in the infrastructure / ops world now. When I was writing application only code, I was not involved in understanding the entire ecosystem of the software I was working on. In hindsight, I really feel I missed out by not being part of it. Since being part of the infrastructure world, I feel it has actually helped me develop better & more robust software.

Organising conferences is a huge amount of work and is, frankly, hard. I understand that conferences cannot cater for every part of an ecosystem. One thing I do think conferences should do is strive to make developers better. DevOps, Continuous Delivery and Infrastructure are (or should be) things that we, as developers, care about. To dismiss these kinds of topics from a conference that advertises "engineering talks only" can hinder developers from delivering the best products they can. It may also stop developers from understanding the importance of software being in production and making money.

Food for thought…

Recommended Ops Books

I created a Github repo to store all the book recommendations I have had since moving more into the operations space. The book categories (so far), include:

  • DevOps / Culture
  • Continuous Delivery
  • Systems Thinking
  • Web Operations
  • System Architecture
  • Tooling

I want this repository to continue to grow with a list of links to books that people would recommend. If you feel you want to make some recommendations, then please feel free to send a PR!

Otherwise, enjoy the suggestions!