Service Monitoring

Application Monitoring and Observability
"If your client tells you that application is broken, its a f**kup" - Juan, ScaleConf

OVERVIEW

“Is my system up or down?” “Is it fast or slow as experienced by my end users?” “What KPIs and SLAs should we establish, and how do we know if they’re being met?” When you’re operating at cloud speed and scale, you can’t afford to fly blind: you need to be able to answer a wide range of operational and business questions like these. You also need to spot problems as they arise (ideally before they disrupt the customer experience), respond quickly, and resolve them with minimal delay. To achieve this, you need observability into your applications and the resources they depend on, across AWS and non-AWS services.

What is observability?

“Observability” describes how well you can understand what is happening in a system, often (but not only) by instrumenting it to collect metrics, logs, or traces. Several types of tools and activities make a system observable, including monitoring, tracing, profiling, logging, and AIOps. Observability enables you to detect, investigate, and remediate problems.

In the cloud, observability can be hard to achieve due to sheer system complexity. Legacy monolithic apps are distributed across instances and often geographic locations. They may also be re-architected, becoming many microservices that rely on thousands of resources to operate, especially if they run on containers or serverless technology. Microservices may be updated frequently, scale elastically, or be invoked on demand. Thousands of components generate billions of metrics, logs, and traces in a never-ending stream of data.

The Key Pillars

Metrics

Detect

Metrics are the atoms that compose into a single logical gauge, counter, or histogram over a span of time.
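To make the three metric shapes concrete, here is a minimal sketch in plain Python. It is not the API of any particular metrics library; the class names `Counter`, `Gauge`, and `Histogram`, the metric names, and the bucket bounds are all assumptions chosen for illustration.

```python
from __future__ import annotations

import bisect
from dataclasses import dataclass, field


@dataclass
class Counter:
    """Monotonically increasing total, e.g. requests served."""
    value: float = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("a counter can only go up")
        self.value += amount


@dataclass
class Gauge:
    """Point-in-time value that can rise or fall, e.g. queue depth."""
    value: float = 0.0

    def set(self, value: float) -> None:
        self.value = value


@dataclass
class Histogram:
    """Counts observations per bucket; bounds are inclusive upper limits."""
    bounds: list[float]
    counts: list[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        self.bounds = sorted(self.bounds)
        # One extra bucket catches everything above the largest bound.
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, value: float) -> None:
        self.counts[bisect.bisect_left(self.bounds, value)] += 1


# Hypothetical metrics for a web service.
requests_total = Counter()
queue_depth = Gauge()
latency_seconds = Histogram(bounds=[0.1, 0.5, 1.0])

for seconds in (0.05, 0.3, 0.3, 2.0):
    requests_total.inc()
    latency_seconds.observe(seconds)
queue_depth.set(7)

print(requests_total.value)    # 4.0
print(latency_seconds.counts)  # [1, 2, 0, 1]
```

Each individual observation is cheap and anonymous; only the aggregate (a total, a current value, a distribution) is kept, which is what makes metrics suitable for detection and alerting.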

Logs

Troubleshoot

Logs are records of discrete events: each entry describes one thing that happened at a specific time.
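As an illustration of a log as a discrete event, the following sketch emits one JSON object per event using only the Python standard library. The `JsonFormatter` class, the `build_logger` helper, the `fields` convention for extra attributes, and the sample order data are assumptions made for this example, not part of any tool listed in this document.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers pass extra={"fields": {...}}.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event)


def build_logger(name: str, stream) -> logging.Logger:
    """Return a logger that writes JSON events to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


# Write to an in-memory buffer so the example has no side effects.
buffer = io.StringIO()
log = build_logger("checkout", buffer)
log.error(
    "payment declined",
    extra={"fields": {"order_id": "A-1001", "status": 402}},
)
print(buffer.getvalue(), end="")
```

Structured events like this are easier to search and filter during troubleshooting than free-form text lines, because every field can be queried directly.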

Tracing

Pinpoint

Request-scoped. Traces are data or metadata bound to the lifecycle of a single transactional object in the system, such as one request.

Monitoring goals

Why would you spend time getting better monitoring?
To know about an issue before your customers or your boss do
To know how your systems & applications are performing
To minimize your stress level

Service Level Agreement

The level of service you commit to provide to users, with possible penalties if you fail to meet it.

Example: “99.5%” availability.
Keyword: contract

Service Level Objective

The target you have set internally, which drives your measurement thresholds (for example, on dashboards and in alerting). In general, it should be stricter than your SLA.
Example: “99.9%” availability (the so-called “three 9s”).
Keyword: thresholds
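To see why the SLO should be stricter than the SLA, it helps to translate each availability figure into allowed downtime. The sketch below does that arithmetic for the two example values; the 30-day window and the function name `allowed_downtime` are assumptions for illustration.

```python
from datetime import timedelta


def allowed_downtime(availability_percent: float, window: timedelta) -> timedelta:
    """Return how much downtime fits in the window at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return window * (1 - availability_percent / 100)


window = timedelta(days=30)  # assumed measurement window
sla_budget = allowed_downtime(99.5, window)  # roughly 3.6 hours
slo_budget = allowed_downtime(99.9, window)  # roughly 43 minutes

print("SLA 99.5%:", sla_budget)
print("SLO 99.9%:", slo_budget)
```

With these example figures, alerting on the internal 99.9% target leaves a margin of nearly three hours of downtime before the contractual 99.5% commitment is breached.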

Service Level Indicators

What you actually measure to determine whether your SLOs are on or off target.

Example: error ratios, latency
Keyword: metrics
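The two example indicators can be computed from raw request data and compared against a target. The following is a minimal sketch: the sample numbers, the 0.1% error budget, the 400 ms latency target, and the helper names are all assumptions, and the percentile uses the simple nearest-rank method rather than any specific monitoring tool's algorithm.

```python
import math


def error_ratio(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if total_requests == 0:
        return 0.0
    return failed_requests / total_requests


def percentile(samples, pct: int):
    """Nearest-rank percentile of the samples (pct between 1 and 100)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_slo(sli_value: float, slo_threshold: float) -> bool:
    """For 'lower is better' indicators such as error ratio or latency."""
    return sli_value <= slo_threshold


# Hypothetical measurements for one evaluation window.
latencies_ms = [120, 95, 310, 150, 101, 99, 480, 130, 110, 105]
ratio = error_ratio(total_requests=1000, failed_requests=3)
p90 = percentile(latencies_ms, 90)

print("error ratio on target:", meets_slo(ratio, 0.001))  # False
print("p90 latency on target:", meets_slo(p90, 400))      # True
```

Here the latency indicator is on target while the error ratio has exceeded its budget, which is the kind of signal that should trigger an alert before the SLA itself is at risk.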

Logging: Loki, Open Distro for Elasticsearch, Elasticsearch, DynamoDB Streams

Tracing: Jaeger, Zipkin, X-Ray, Tempo

Metrics Collector: Prometheus, Graphite, InfluxDB, Thanos, Amazon Timestream

Visualization: Grafana
