Sunday, 19 August 2018

Deploying docker images on elastic beanstalk

Elastic Beanstalk is an Amazon Web Services service that allows you to quickly deploy and manage a web app, it takes a lot of the overhead out of monitoring and configuration. However EB seems to want to build your app for itself, such that it wants you to upload a tarball of your code structured in a generic way and it will try and pick up the source code to build or pre-built binaries it finds.

In the pursuit of reproducable builds I prefer to have aws run a pre-built containerised build that I can examine or run myself on my own machine also. This article is designed to help the reader do this by getting Elastic Beanstalk set up to use Elastic Container Service.

Push a built image to ECR, see ECR for how to do this, you will need to authenticate your aws client to do so. You need to use a properly qualified image name corresponding to your repository URI in order to do this successfuly.

To enable EB to pull from ECR you need to give your EB role (in this case aws-elasticbeanstalk-ec2-role) the policy AmazonEC2ContainerRegistryReadOnly via the IAM service.

To have EB pull from ECR rather than having it try to build your code itself you need to upload a Dockerrun.aws.json file in the 'Upload and Deploy' step where you normally upload a tarball, this seems incredibly non-obvious from the wording of the dialogue (see below).




















An example file (with an example image name) exposing port 1234 follows:


Documentation for the Dockerrun.aws.json is found here, note Version 2 is required for multi-container builds.

Monday, 4 December 2017

Better error logging with with error wrapping

This December, why not give the gift of better error logging?


For a while I have thought that the functionality provided by the stdlib errors package in the standard library was insufficient. One issue with errors is that they lack the context, namely stack information, that say, an exception would provide in another language. With the stdlib package, you can see the line number where an error is logged but you cannot see where the error was generated which is usually more important. This is because good error handling practice is to log the error at its highest level of propagation, but it is here where it is most far removed from the site of the error generation. You do get the string that is set when the error is generated but these can often be dynamically generated or reused in multiple places so they are no replacement for a stack trace.

This problem can be resolved by use of the third party pkg/errors package github.com/pkg/errors. This package was designed to be used as a drop in replacement for the stdlib errors package, providing a superset of the stlib package's functionality.

pkg/errors provides some functions that can annotate errors with stack information and messages at certain levels of the stack. New, Wrap and WithStack all annotate an error with stack information whilst Wrap also enables you to provide a message relevant to that level of the stack. WithStack and Wrap are smart enough, using the runtime package to examine an error and work out where it was generated even after that function has returned, which is useful if you have no control over the code that generates the error. Where you do it is best to annotate the error at generation to be sure that you don't forget to do it.

Here is an example which demonstrates annotating an error with stack information at generation and providing a contextual message at another point of the call stack. What's so great about pkg/errors is that it is instantly swappable with the stdlib package so you can enrich your errors with a one line import change!







Pretty cool, huh, happy wrapping!

Tuesday, 31 October 2017

On Using Multiple Source Control Repositories for Microservices

The nature of microservices and its small services with limited bounded contexts means that most functionality within the platform entails communication between services. In terms of network overhead this is largely mitigated by efficient and low latency communication protocols such as gRPC. However, this means that most functional changes made by developers require making changes to multiple services. Certainly, it follows that the smaller we make our services, the more likely a change is to cross service boundaries.


A graph showing the relationship between service size and the typical number of services affected by a change




This overhead, along with many others such as frequent re-implementation of common functionality, is an accepted cost of using microservices as services are naturally more granular, composable and reusable.

There are two accepted processes for version controlling microservices, the 'monorepo' and the repository per service 'manyrepo' approach. With the monorepo, all services are kept in the same source control repository. When studying microservices from a theoretical perspective it seems logical to add yet another layer of separation between services by adopting manyrepo but there are some real issues and overheads associated with doing so, which your author has recently been experiencing first hand! Below I have attempted to communicate the relative pros and cons for the 'manyrepo' approach.


Pros

  • The separation of repos with explicit dependencies upon one another makes the scope of any commit easy to reason about. That is, to say which services a commit affects and requires deployment of. With a single repo 'the seams' between services aren't as well defined / bounded. This makes it harder to release services independently with confidence, thus hampering the continuous deployment of independent releases that is considered crucial to  microservices based architectures. That is to say the monorepo, to an extent encourages lock-step releases.
  • It scales better with large organisations of many developers, it ensures that developer's version control workflows do not become congested.
  • It helps repos stay smaller and so keeps inbound network traffic for pulls smaller for developers providing a faster workflow.

Cons

  • With multiple repos changelogs become fragmented across multiple pull requests making it harder to review, deploy and roll back a single feature. Indeed, deploying a feature can require deployments across mulitiple services, making it more difficult to track the status of features through environments. There is a lot of version control overhead here.
  • Making common changes to each service. Making a certain change to N services requires N git pushes.
  • It is harder to detect API breakage as the tests in the consumer of the API will not run until a change to that service is pushed. Note that this can be resolved with 'contract testing', that is, testing of the API of a service within the service itself, which you do not get for free in the consumer.
  • It is more difficult to build a continuous integration system where all dependencies are not immediately available via the same repository.

This a very interesting topic that definitely warrants further discussion and debate, indeed the the choice of monorepo vs manyrepo has interesting implications for release strategy and whether supporting varied permutations of co-existing versions is worth the very real and visible overhead that it incurs.

Lots of the workflow pros of the manyrepo don't really come into affect until you have a very large engineering staff which most companies won't / are unlikely to ever have. Also google have solved some of these problems for monorepos by doing fancy stuff with virtual filesystems.

Personally I considered the manyrepo to be superior until I had to experience the increased developer overhead first hand. However it remains to be seen if this encouragement to think about services as separate distinct entities is worth it in the long run.

Thursday, 19 October 2017

Go Concurrency Patterns #1 - Signalling Multiple Readers - Closing Channels in Go

The traditional use of a channel close is to signal from the writer to the reader that there is no more data to be consumed. When a channel is closed and empty a reader will exit its 'for range' loop over such a channel. This is because of some handy shorthand in 'for range', note that the two expressions are functionally equivalent:


In the first snippet the check for the channel close is implicit. The two value receive format is how we normally check for a channel close, this is what the spec has to say on it:

 A receive expression used in an assignment or initialization of the special form

 x, ok = <-ch
 x, ok := <-ch
 var x, ok = <-ch
 var x, ok T = <-ch

 yields an additional untyped boolean result reporting whether the communication succeeded. The value of ok is true if the value received was delivered by a successful send operation to the channel, or false if it is a zero value generated because the channel is closed and empty.

Here is a functional example of a writer closing a data channel to signal a reader to stop reading, this is handled nicely in the language with the implicit close check in for range.


The close of a data channel should be performed by the writer as a write on a closed channel causes a panic, thus a close by the reader has the potential to induce a panic. Similarly a close with multiple writers could also induce a panic. The good news is that the excellent go race detector can detect such races.

However the reader can signal the writer to stop writing by utilising a secondary signalling channel. This could be done via a send to the secondary channel. However, as a signalling mechanism, closing a channel instead of a channel send has the benefit of working for multiple readers of the signalling channel. It is important this signalling channel has no data written to it as we only receive a close signal when the channel is both closed and empty.

The use of a channel of type empty struct has the benefit that it cannot be used to pass data and thus its usage as a signalling channel is more apparent. Dave Cheney has an interesting post on the curious empty struct type.

Here is an example that demonstrates the use of a secondary channel to shutdown multiple writers initiated by the reader:

Hopefully you find this useful for shutting down errant goroutines. Indeed, this is how the context package implements its own cancellation functionality. Context is the idiomatic way to perform cancellation that should also be honoured by http transports and libraries and the like.

Wednesday, 13 September 2017

Initialising String Literals at Compile Time in Go

Recently I was working on a service and realised that we had no established way of querying it for its version information. Previously, on an embedded device, I have written a file at build time to be read by the service at run time, with the location set by an env var. It would also be possible to set the version in an env var itself but these are overridable

  However a colleague suggested injecting the version string into the binary itself at compile time so I decided to investigate this novel approach.

go tool link specifies that the -X argument allows us to define arbitrary string values

go tool link
...
  -X definition
        add string value definition of the form importpath.name=value
...

go build help explains that there is an 'ldflags' option which allows the user to specify a flag list passed through to go tool link

go build help
...
    -ldflags 'flag list'
        arguments to pass on each go tool link invocation.
...

So we can pass a string definition through the go command (build, run etc)!



In the above program we can see that we define a variable foo in package main.
Thus the fully qualified name of this variable is main.foo, thus this is the name we pass.

$ go run main.go
injected var is []

$ go run -ldflags '-s -X main.foo=bar' main.go
injected var is [bar]

This can be used in novel ways to inject the output of external commands at compile time such as the build date.

$ go run -ldflags "-X 'main.foo=$(date -u '+%Y-%m-%d %H:%M:%S')'" main.go
injected var is [2017-09-13 13:44:59]

This is a nice feature that I'm sure has applications beyond my reckoning at this time. In my use case it makes for a compact, neat solution, however it does cause some indirection when reading the code making it harder to follow so I would argue should be used sparingly. It has the nice quality of being immutable over env vars which could potentially be rewritten. Anyhow it is a pretty cool linker feature!

Thursday, 7 September 2017

Struct Tags in Go

Dictionary.com defines 'tag' as:

a label attached to someone or something for the purpose of identification or to give other information

The go docs only have a short paragraph covering struct tags which follows:

A field declaration may be followed by an optional string literal tag, which becomes an attribute for all the fields in the corresponding field declaration. The tags are made visible through a reflection interface and take part in type identity for structs but are otherwise ignored.

Basically struct tags allow you to attach metadata to struct fields in the form of arbitrary string values.

The format of the metadata is: `key:val` or `key:val1,val2` in the case of multiple values. By convention the key corresponds to the name of the package consuming it.

This metadata can be intended for another package or you can make use of it yourself. It can be consumed via reflection using the types reflect.Type, reflect.StructField and reflect.StructTag as can be seen in this example.

A common use case is for serialisation/ deserialisation, the 'json' tag which is used by the encoding/json package is an example of this.

It also allows you to incorporate your initialisation code in the struct definition. An example of this is the github.com/caarlos0/env package which uses struct tags to define config structures with associated environment variables and perform boiler-plate free parsing. The package also allows you to set defaults via struct tags which again reduces initialisation code, however this is not compile-time safe.

If you are interested there is a good talk available here: video pdf

Friday, 1 September 2017

REST vs RPC

Internal vs External comms in a
microservices architecture


A Microservices architecture advocates the proliferation of many small inter-communicating services. As a result, using such an architecture can result in an increased overhead in the number of network bound, 'internal' inter-service communications.

This is in addition to the 'external' existing network comms of serving the clients accessing our service. I will refer to these two forms of communication as internal and external respectively. These two forms of communication can be observed to possess differing characteristics.

Internal comms are high frequency in nature and completely under our control, that is that we can coordinate upgrade of the client and server as necessary.

External comms are of lower frequency, can come from varied sources and backwards compatibility / versioning is more of a concern as we do not have control over the clients.

REST and RPC are two technologies that can be used for network bound communications. In my experience REST is better suited to use in an external API and RPC to use in internal comms, primarily for these reasons:

REST for external comms
- easy to version using headers
- well understood and ubiquitous, ease of client implementation
- json much easier to parse and debug than a binary format

RPC for internal comms
- low cost, around 20 times less than REST
- less boilerplate required for calls (which will be frequent), it is autogenerated
- provides easy streaming

In RPC changes to models require synchronised client and server releases, thus models often become expand (ie. add fields) only. This can lead to a lot of deprecated fields and lockstep client and server releases when models are altered, making RPC a little brittle for external APIs.

The boon of low latency comms that rpc provides cannot be understated and provides us the freedom to disregard much of the costs of network bound requests and fully subscribe to a microservices architecture.