Tech at HackGT: Infrastructure

Ehsan Asdar / March 20, 2018

HackGT is a student organization at Georgia Tech that runs a series of events every year that empower students to learn and hone their skills in Computer Science. Our marquee event (of the same name, HackGT) takes place each fall, hosting over 1000 hackers at Georgia Tech for 36 hours of learning and creating, while other events such as Catalyst and HackGTeeny strive to show students how intriguing and rewarding careers in STEM, and specifically CS, can be.

The HackGT tech team builds software to support our event organizing teams. Our responsibilities range from developing educational content (workshops, tech talks, CTFs) to writing applications that manage hacker data. This is the first in a series of posts about tech at HackGT and will focus on how we built our infrastructure to ensure a clean, consistent development-release cycle, complete with zero-downtime deployment, monitoring and alerting, and effortless metrics collection.

Motivation

As HackGT has grown, so have the organization's tech requirements. We now run an assortment of applications that are written in several languages and require various external resources (disk, databases, APIs). In the past, we relied on a single VPS to host all of our applications, but this quickly became a pain to manage. Early in our search for a replacement, we chose Docker as a way to containerize our apps. Docker lets us abstract the deployment requirements of each application away from the servers it runs on. Instead of installing build tools and interpreters (such as Node, Python, Ruby, etc.) on each server, an application's exact requirements come pre-packaged in its Docker image.

Docker is great, but it still left some problems to solve. Namely, we wanted to automate the process of building, running and monitoring all our container-packaged applications. The rest of this post describes Beehive, a custom infrastructure automation system we wrote to do just that. Let's dive in!

Overview

Hosting

After researching a range of container orchestration tools, we settled on Kubernetes as a system to run our Docker-packaged applications. Kubernetes comes with some useful built-in features such as round-robin load balancing between multiple containers, flexible scaling, and centralized log management (more on these later). We run our Kubernetes cluster on Google Kubernetes Engine.

Application Configuration

To host application configurations, we wrote a light wrapper around Kubernetes Deployment objects. This file, called deployment.yaml, lives at the root level of each application deployed into our cluster (this is an example). The file defines the basic information the cluster needs to run the application: network ports, RAM/CPU requirements, environment variables, any configuration files or resources, and any application secrets required (like API keys). The contents of secret files and environment variables are uploaded separately to the cluster as Kubernetes Secret resources.
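As a rough illustration, a deployment.yaml might look something like the sketch below. The field names here are hypothetical stand-ins, not an exact copy of our wrapper's schema:

   # deployment.yaml: illustrative sketch, field names are hypothetical
   name: registration
   ports:
     - 3000
   resources:
     memory: 512Mi
     cpu: 250m
   env:
     NODE_ENV: production
   secrets:
     - GITHUB_API_KEY   # contents uploaded separately as a Kubernetes Secret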

Building an Application

Each application in our cluster is built using a custom Travis CI configuration. We require each application in our cluster to contain a Dockerfile at the root level. On each commit, Travis builds a Docker image from the application source and pushes the built image to Docker Hub for storage.
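In rough terms, the Travis configuration drives a flow like the one sketched below. The image name, credential variables, and steps are placeholders; our actual shared configuration does more:

   # .travis.yml: simplified sketch of the build-and-push flow
   sudo: required
   services:
     - docker
   script:
     - docker build -t hackgt/myapp:$TRAVIS_COMMIT .
   after_success:
     - docker login -u "$DOCKER_USERNAME" -p "$DOCKER_PASSWORD"
     - docker push hackgt/myapp:$TRAVIS_COMMIT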

Running an Application

A sample service file for our registration application

The picture above shows a sample service file from our Beehive repository. Each service file specifies the Git repository of the application to be run, along with a Git branch, commit hash, or tag. Optionally, the service file can specify per-deployment overrides of fields in the application's deployment.yaml.

Each service file is named according to the subdomain of hack.gt that the service should be deployed on. For example, registration.yaml specifies the application that will be deployed to registration.hack.gt, and dev/registration.yaml specifies the application that will be deployed to registration.dev.hack.gt. Another Travis CI script works with services running in the Kubernetes cluster to automatically provision SSL certificates for each subdomain and wire up DNS entries. Branch-based services are updated every time a commit is made to the corresponding Git branch of the application's repository.
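For illustration, a service file might look something like this hypothetical registration.yaml (the exact field names in Beehive may differ):

   # registration.yaml: hypothetical sketch, deployed to registration.hack.gt
   repository: https://github.com/HackGT/registration
   ref: master                # a branch, commit hash, or tag
   overrides:                 # optional per-deployment overrides of deployment.yaml
     env:
       EVENT_NAME: HackGT 2018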

Monitoring

Slack alerts

Kubernetes periodically health-checks every application in the cluster. Should an application ever hang or crash, Kubernetes immediately restarts it and alerts our dev team on Slack. All critical services in the cluster are replicated, so the loss of one container (or even a full Kubernetes node) won't cause downtime.

Any time we deploy a new application version into the cluster, it is blue-green deployed: Kubernetes starts an instance of the new version but will not direct any traffic to it until it is verified to be running and healthy. If we need to roll back a deployment, we simply modify the corresponding service file in Beehive, pinning it to a known-good commit hash.
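Both behaviors map onto standard Kubernetes probes: liveness probes trigger restarts, and readiness probes gate traffic. A generated Deployment spec includes a section along these lines (the endpoint path, port, and timings are illustrative):

   # excerpt from a Kubernetes Deployment spec (illustrative values)
   livenessProbe:             # failing this probe triggers an automatic restart
     httpGet:
       path: /healthz         # hypothetical health endpoint
       port: 3000
     periodSeconds: 10
   readinessProbe:            # traffic is withheld until this probe passes
     httpGet:
       path: /healthz
       port: 3000
     initialDelaySeconds: 5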

Developer Experience

Automated PR builds!

Beehive makes deploying and configuring applications easy. To host one of our applications for an event, we commit a new service file to Beehive. To test a feature, we create a service file in the dev folder pointing to the corresponding feature branch. Cleaning up a deployment is just as easy: delete the corresponding service file and the application's resources are automatically cleaned up.

After building out the core infrastructure, we wrote a bot to take advantage of this flexibility and address a pain point we identified: how best to review pull requests! The bot automatically generates a service file in Beehive for each open pull request in our application repositories, then posts a link to a running version of the pull request's code for developers to test.

Extra

Running our applications in a cluster has let us explore some interesting side projects. Here are some systems we're working on now:

Metrics

Application metrics visualized through Grafana

Kubernetes gives us centralized access to logs across all running applications. We developed a system called HackGT Metrics to standardize our application logs into a common format that is simple to search, query, and visualize.

Here's how it works:

To implement HackGT Metrics support, all an application has to do is log a one-line JSON-formatted message to STDOUT that conforms to the following schema:

{
   hackgtmetricsversion: 1,
   serviceName: string,
   values: object,
   tags: object
}
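For example, a hypothetical registration service counting signups might emit a line like this:

   {"hackgtmetricsversion": 1, "serviceName": "registration", "values": {"signups": 1}, "tags": {"event": "HackGT 2018"}}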

Each server in our cluster runs an instance of Fluentd that constantly monitors all application logs. When a HackGT Metrics-formatted log line is detected, the corresponding event information is pushed into InfluxDB, where it can be queried as needed through InfluxDB's SQL-like query language.
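For instance, a query along these lines (the measurement and field names are hypothetical, matching the log-line example above) would chart signups per hour over the past week:

   SELECT SUM("signups") FROM "registration" WHERE time > now() - 7d GROUP BY time(1h)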

Our developers love this system because we can add tracking for any event with a simple console print (no need for external libraries!). InfluxDB also supports a rich ecosystem of integrations that let us visualize stored metrics. The screenshot above is from Grafana, our go-to tool for visualizing and reporting on our metrics.

Wrap-Up

You might be thinking that this is overkill or too homebrewed and plagued by Not Invented Here Syndrome, but let us emphasize the design goals we had in mind when architecting this system:

  1. Transparency (our open source code should go through an open build process)
  2. Durability (if people write bad code, mitigate its effects as much as possible)
  3. Cheapness (use as many free services, as in beer and as in freedom, as possible)
  4. Resource efficiency (cut costs by running only the services we need)
  5. Reproducibility (all build steps live in code; no secret build steps passed down orally)
  6. Idempotent deployment (a version-controlled definition of the entire state of the cluster)
  7. Agnosticism (let people choose their own stacks)
  8. Modularity (we want other people to use our projects and Docker images!)

We wanted to really focus on the ideas above. Notice that 'scalability' is not listed, nor is 'speed': it's unlikely that we'll ever see traffic that justifies a focus on those things. Instead, we know that our members don't get paid, that everyone is constantly learning how to write software, that students enter and leave with high turnover (graduation), and that we don't have the financial resources of even a small company. Taking all of this into account, we think we've built a system that best fits our needs, with minimal homebrewed extras.

At the time of this writing, Travis has already recorded 1455 configuration changes to the cluster made with our new system (a count that doesn't include updates, i.e. new code pushed to an already-configured project)! We've tested it through many of our events, which gives us confidence that we can rely on it for years to come. This overhaul, like many of our others, was motivated by the need for a consistent code stack that will survive generations of students, and we hope Beehive stands the test of time.