One of the more interesting observations I made early in my career was that if you asked engineers a question you would rarely get a response… but if you made a statement that was clearly wrong, they would take all the time needed to explain to you why you were wrong.
Credit: xkcd
So with that in mind, below is my attempt to reason about monitoring, observability, alerting, etc. I am going to frame it in terms of the tools, because without a practical way to think about an implementation it just feels like a philosophical debate. I am sure I am wrong about much of it, and I am looking to understand why.
Concepts
Telemetry
This is the data being emitted from the system. It comes from all over the place and isn’t necessarily being used by anything, but it is there. In the best case it has lots of metadata, but much of the data will just be a number. The interesting thing here, IMHO, is that most of the useful telemetry from the platform is flowing into tools that many of us in operations and engineering don’t even have access to.
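To make the metadata point concrete, here is a minimal sketch of the difference between a bare number and a telemetry event that carries context. The field names are purely illustrative, not taken from any particular tool:

```python
# A bare metric: useful, but you have to guess at the context.
cpu_utilization = 87.5

# The same signal as a telemetry event with metadata attached.
# Field names here are invented for illustration.
telemetry_event = {
    "name": "cpu.utilization",
    "value": 87.5,
    "unit": "percent",
    "host": "web-03",
    "region": "us-east-1",
    "service": "checkout",
    "deploy_id": "2024-01-15-abc123",
}
```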
Monitoring
I like to think about monitoring as the gauge cluster on my car. It tells me incredibly important things that I can usually get by without. Let’s be honest, many of us who have been driving for a while could get by without almost all of the gauges on our cars. Even the fuel gauge is redundant, because we would learn to estimate how far we could go between fill-ups and, as a backup, carry an extra gallon or two in the car.
The interesting part is that we would never (for most values of never) do this. You would never want to operate a vehicle without some basic gauges, and the same is true of our infrastructure. We want a basic understanding of things like CPU, disk, HTTP errors, etc. But here is the crucial thing… those bits are only interesting when you understand the context around them. Sure, the engine overheating is always bad, but is the speed of the car being 70 MPH always good? Even on the same stretch of road? Even worse, as many of us have learned, all the gauges can look fine and your car can still be in a bad spot.
Monitoring, for me, is the set of signals I choose to watch. It is the set of things where I “know” what good and bad look like in 95% of cases. As the system gets more complex there are more gauges, and getting poked to go look at them isn’t going to help anything. In the modern world, the value is going to be in finding as few things as possible for humans to actively watch, with the rest of the cases just being conditions in the system that are handled automatically.
Observability
This is for when things go wrong or when we are tuning. Continuing with the terrible car analogy, we only want to look at the high-cardinality, detailed events of what is going on in the car when we are tuning it to go faster, or when it is in a bad state and we are trying to figure out which part of the car to take apart. The battery gauge says zero charge, but is the issue the battery? The alternator? The wiring harness?
The best part of this example is just how bad this is in a car, and how many of our systems share the same problem. If I am racing in Formula 1 or NASCAR, the team can look at the telemetry from the car and know exactly what the problem is. And, by the way, it is being sent wirelessly in real time. For those of us who are not at companies like Facebook or Google, we basically have to do things like swap in a new battery or charge the battery externally to figure out the problem. The idea with observability is being able to get back to the source code of the problem as quickly as possible. What is actually causing the issue? What companies like Honeycomb and projects like OpenTelemetry seem to want to bring us is that Formula 1 tech, made available to us in our Kia.
What resonates the most with me is that most of the observability bits will not be things that generate alerts. It will be user reports and wonky-looking monitors that cause us to dig out our observability tools and figure out what is wrong. Don’t get me wrong, the observability tools will emit metrics, but that will not be their core focus.
Classes of Tools
Metrics
These are the raw numbers that the system is generating: load, CPU, disk consumption, requests, etc. This is by far one of the most mature parts of the stack at this point. On a positive note, SNMP-based metrics collection has mostly gone away. The tools I associate with this space:
- Datadog
- StatsD
- Prometheus
- SignalFx
- collectd
- Telegraf
Interestingly enough, the storage and visualization pieces are in many ways separate from the actual metrics gathering bits. In the visualization space, Grafana is the current winner, with Graphite being the previous generation. On the storage side, I don’t know that I have heard anything other than InfluxDB mentioned in a long time.
With the tools figured out, the core thought here is that the metrics gathering mechanisms almost entirely map to the monitoring side. These are the systems that power the gauge cluster. I still think it is important to have raw metrics, but when debugging a system they frequently send you on wild goose chases. Much like the gauges, I can solve simple problems with my metrics, like the disk being full, but in most cases they don’t actually get me closer to solving the underlying problem.
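As a concrete (and hedged) sketch of the gauge-cluster side of things, here is roughly what emitting a couple of metrics looks like with the Prometheus Python client. The metric names, port, and fake disk reading are all made up for illustration:

```python
# pip install prometheus_client
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

# Illustrative metric names; use whatever matches your own conventions.
REQUESTS_TOTAL = Counter("app_requests_total", "Total HTTP requests handled")
DISK_USED_BYTES = Gauge("app_disk_used_bytes", "Bytes used on the data volume")

if __name__ == "__main__":
    # Expose /metrics on port 8000 for Prometheus to scrape.
    start_http_server(8000)
    while True:
        REQUESTS_TOTAL.inc()
        DISK_USED_BYTES.set(random.randint(0, 10**9))  # stand-in for a real reading
        time.sleep(5)
```

These are exactly the kinds of numbers that power dashboards and simple alerts, but, as noted above, they rarely point at the underlying cause on their own.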
Tracing
This is the actual instrumentation of code and tracing it through execution. When I look at this space Datadog has an offering, but Honeycomb seems to be the only vendor that is actually digging in deep on this side of the world. In the OSS space, the tools are basically OpenTelemetry (which is a merger of the OpenCensus and OpenTracing projects), Jaeger, and Zipkin.
To me, this is the heart of the observability side of the world. The traces are what allow you to understand what is going on in the system. Much like the little bit of insight you got from running New Relic on PHP systems, it is the ability to visually see where things are going wrong and dig in. This is the type of diagnostic data the race teams are using. It is what gives those big systems companies a leg up when dealing with their systems.
The most important thing here is tying the events that are happening in the system back to the code base. It should not be hard to go from a span in Jaeger or Honeycomb to the source code that generated it. It is also trivial to tag your spans inside the system, and you get timing for free.
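As a rough sketch of what that looks like with OpenTelemetry’s Python SDK (the exporter here just prints spans to the console; in real life it would ship to Jaeger, Honeycomb, or a collector, and the span and attribute names are invented):

```python
# pip install opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire up a tracer that prints finished spans to stdout.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def handle_checkout(cart_id: str, user_id: str) -> None:
    # Timing starts when the span opens and ends when the block exits.
    with tracer.start_as_current_span("checkout.handle") as span:
        span.set_attribute("cart.id", cart_id)   # illustrative attribute names
        span.set_attribute("user.id", user_id)
        ...  # the actual work goes here

handle_checkout("cart-123", "user-456")
```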
Listening to Charity Majors and Liz Fong-Jones talk about this and having played with tracing a bit, I am convinced that we should be moving away from a LOT of the logging we do and instead be shoving most of that data into spans that are reported back to something like Honeycomb or Jaeger.
Logging
Logging is the bane of my existence. The reality is that the way we do logging today, it really isn’t useful in anything more than an auditing context. The massive effort to index and collate the logs so that you, just maybe, will know the right incantation to put into Splunk is a killer. As I said above, I am moving away from the log-everything world to the create-a-span-for-everything world.
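As a hedged illustration of that shift, assuming the OpenTelemetry setup from the tracing sketch above (the function and field names are invented), the facts that would have gone into a log line can ride along on the span instead:

```python
import logging

from opentelemetry import trace

tracer = trace.get_tracer(__name__)
logger = logging.getLogger(__name__)

def process_order(order_id: str, total_cents: int) -> None:
    with tracer.start_as_current_span("order.process") as span:
        # The log-everything version: a string someone has to grep for later.
        logger.info("processing order %s for %s cents", order_id, total_cents)

        # The span-for-everything version: the same facts, attached as
        # structured attributes and events that stay tied to this request.
        span.set_attribute("order.id", order_id)
        span.set_attribute("order.total_cents", total_cents)
        span.add_event("payment.authorized", attributes={"processor": "example"})
```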
Synthetics, APM, and Checks
I mention these here because they are such a broad set of tools that do ALL THE THINGS. Some of the synthetics platforms are insanely great observability tools, helping you dig into a waterfall of the page load and giving you a clear understanding of what is actually going wrong. The problem is that most of these tools are ridiculously expensive and live in their own ecosystem. I really want a day when I can pass a token from a Catchpoint synthetic test into my system and be able to link it up to a set of traces in the rest of my system.
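Something like the following is what I am imagining. To be clear, this is hypothetical: the `X-Synthetic-Test-Id` header and the Flask handler are made up for illustration, not an actual Catchpoint integration, and it assumes the app is already instrumented with OpenTelemetry:

```python
from flask import Flask, request
from opentelemetry import trace

app = Flask(__name__)

@app.route("/checkout")
def checkout():
    # Hypothetical: the synthetic monitoring platform stamps each test
    # request with an ID we can attach to our own traces.
    synthetic_id = request.headers.get("X-Synthetic-Test-Id")
    if synthetic_id:
        trace.get_current_span().set_attribute("synthetic.test_id", synthetic_id)
    return "ok"

if __name__ == "__main__":
    app.run()
```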
There are other, much blunter signals that come from these systems. Pingdom and Nagios-style checks are their own thing. They are binary: up or down. These are monitors, but only monitors of last resort. Is it up? That question is much less important in the modern app stack. Don’t get me wrong, you should ask it, but if the only thing you are looking for is a 200 at the front door, you have bigger issues. That is like asking whether the car started but never checking whether it will shift into gear or move.
Other tools
There are a few other classes of tools that we should be discussing in terms of observability and monitoring that frequently get left off the list.
From a monitoring and telemetry perspective, there is a HUGE amount of information being gathered by marketing and product systems: Google Analytics, Segment, Adobe Audience Manager, etc. Have you ever wondered when people use the site? How about what they do when they are on the site? What browsers are used most frequently? Product knows.
On the other side of the coin there are the exception tracking systems like Rollbar or Sentry. These systems are incredibly useful for observing how the code is breaking in production, assuming you are using a language that throws exceptions.
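For completeness, a minimal sketch with the Sentry Python SDK (the DSN is a placeholder): once initialized, unhandled exceptions are reported automatically, and handled ones can be sent explicitly:

```python
# pip install sentry-sdk
import sentry_sdk

# Placeholder DSN; use the one from your own Sentry project.
sentry_sdk.init(dsn="https://examplePublicKey@o0.ingest.sentry.io/0")

def risky_operation() -> None:
    raise ValueError("something broke in production")

try:
    risky_operation()
except ValueError as exc:
    # Explicitly report an exception we chose to handle.
    sentry_sdk.capture_exception(exc)
```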
TL;DR
Monitoring and observability are both needed and serve different purposes. You need monitoring to handle the knowns and observability for the unknowns. Yes, it is still important to know about cases where you are regularly running low on disk, but it is the observability practice that will let you figure out what stupid thing the code is doing to cause that.