Kubernetes and Cloud Native Associate (KCNA) #6: Cloud Native Observability (8%) — Telemetry, Prometheus, Cost Management

In a distributed system, dozens of Pods are scattered across multiple nodes and call one another, so when a single request slows down, it is not obvious at a glance where things went wrong or what happened. On a single server you could just inspect one log file, but in a cloud native environment where containers die, restart, and scale up and down, you can’t trace the cause of a failure unless the system’s state has been turned into signals observable from the outside. The domain that covers this capability is Cloud Native Observability.

In KCNA this domain carries a small weight of 8%, but its question patterns are formulaic, making it an easy place to lock in points. In this post we’ll walk through the three pillars of telemetry, metric collection centered on Prometheus, OpenTelemetry and distributed tracing, the reliability metrics (SLI/SLO/SLA), and cost management (FinOps).

The difference between observability and monitoring #

The exam includes a question that distinguishes the two terms, so let’s settle it first. Monitoring is the activity of watching predefined metrics and thresholds. It handles situations where you already know what to look at, like “alert when CPU usage exceeds 80%.” Observability, on the other hand, refers to the property of being able to answer even questions you did not anticipate in advance, using only the signals the system emits. You can say a system has observability when you can investigate a question after the fact — like “why does the checkout take 3 seconds only for this particular user” — that you never built a dashboard for ahead of time.

To sum up, monitoring is the act of watching known problems, while observability is the property of a system that makes even unknown problems explorable. It helps to understand monitoring as something that operates on top of a system that is well equipped with observability.

The three pillars of telemetry #

Observability is built on top of the telemetry data a system emits. This data is traditionally divided into three kinds, each answering a different question. This is the classification that appears most often on the exam, so let’s lay it out in a table.

| Pillar | Form | Question it answers | Representative tools |
| --- | --- | --- | --- |
| Metrics | Numeric time series over time | How many, how fast | Prometheus |
| Logs | Records of individual events | What exactly happened at that moment | Fluentd, Loki |
| Traces | The service path a single request took | Where did this request slow down | Jaeger, OpenTelemetry |

Metrics are time series that stack up numbers measured at regular intervals in chronological order, like “requests per second,” “error rate,” or “memory usage.” They are cheap to store and quick to aggregate, but they cannot capture the details of individual events. Logs are text events an application leaves at a specific point in time, carrying concrete context like which request failed with which message. The tradeoff is volume, which makes storage and search expensive. Traces stitch together the entire path a single request takes across multiple services, pinpointing which segment is the bottleneck between microservices.

The three pillars are complementary rather than competing. The typical flow is to detect the fact that “the error rate went up” with metrics, narrow down “which service segment is stuck” with traces, and then confirm “exactly which message it failed with” using logs.

Prometheus: the standard for metric collection #

Prometheus is the de facto standard in the metrics space and a CNCF Graduated project. The fact that it was the second project to graduate from the CNCF, right after Kubernetes, is also a frequent exam item.

Pull-based collection #

Prometheus’s biggest characteristic is its pull-based collection model. Rather than applications pushing metrics somewhere, the Prometheus server itself sends an HTTP request to each target’s /metrics endpoint at fixed intervals. Each such fetch is called a scrape. Which targets to scrape is determined by configuration and service discovery, and in Kubernetes Prometheus automatically discovers Pods and Services to keep the target list up to date.

The fact that it is pull and not push is a perennial trap on the exam. (There is a separate auxiliary component called the Pushgateway for cases like short-lived jobs that live too briefly to be scraped, but the base model is pull through and through.)
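To make the direction of the pull model concrete, here is a minimal sketch using only the Python standard library. The target only answers when asked, and the collector is the side that initiates the request. The metric name `demo_requests_total`, the helper names, and the ephemeral port are assumptions made for illustration; a real Prometheus server does this through its scrape configuration, not through code like this.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter that the "application" maintains.
REQUESTS_SERVED = {"value": 42}


def render_metrics() -> str:
    """Render the counter in the Prometheus text exposition format."""
    return (
        "# HELP demo_requests_total Requests served (illustrative metric).\n"
        "# TYPE demo_requests_total counter\n"
        f"demo_requests_total {REQUESTS_SERVED['value']}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    """The target side: it only answers when someone asks for /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def parse_samples(text: str) -> dict:
    """Parse label-free samples, skipping comment and blank lines."""
    samples = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name, value = line.split()
        samples[name] = float(value)
    return samples


def scrape(url: str) -> dict:
    """The collector side: it initiates the HTTP GET (the pull)."""
    with urllib.request.urlopen(url, timeout=2) as resp:
        return parse_samples(resp.read().decode("utf-8"))


def demo() -> dict:
    """Start a target on an ephemeral port, scrape it once, shut down."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        return scrape(f"http://127.0.0.1:{server.server_port}/metrics")
    finally:
        server.shutdown()
        server.server_close()
```

Calling `demo()` performs one scrape against the local target. Note that the application never needs to know where the collector lives, which is the practical consequence of pull over push.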

Exporters and the time-series DB #

When an application doesn’t expose Prometheus-format metrics directly, you place an exporter. An exporter is an adapter that reads the state of an external system — a database, a message queue, a node’s hardware status — and converts it into a format Prometheus can scrape. The Node Exporter, which exposes a node’s CPU, memory, and disk, is the representative example.

The scraped metrics are stored in Prometheus’s built-in time-series database (TSDB). Each time series is identified by a combination of the metric name and labels, and these labels create the dimensions of queries downstream.
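The idea that a series is identified by metric name plus labels can be sketched with a small in-memory store. This is only an illustration of the identity rule, not how the Prometheus TSDB is implemented; `TinySeriesStore`, `series_key`, and the sample values are hypothetical.

```python
from collections import defaultdict


def series_key(name: str, labels: dict) -> tuple:
    """A series is identified by metric name plus its full label set."""
    return (name, tuple(sorted(labels.items())))


class TinySeriesStore:
    """Illustrative in-memory store keyed by (name, labels)."""

    def __init__(self):
        self._series = defaultdict(list)  # key -> [(timestamp, value), ...]

    def append(self, name, labels, timestamp, value):
        self._series[series_key(name, labels)].append((timestamp, value))

    def select(self, name, **matchers):
        """Return the series whose labels equal every given matcher."""
        result = {}
        for (series_name, label_items), samples in self._series.items():
            labels = dict(label_items)
            if series_name == name and all(
                labels.get(k) == v for k, v in matchers.items()
            ):
                result[(series_name, label_items)] = samples
        return result


store = TinySeriesStore()
store.append("http_requests_total", {"method": "GET", "code": "200"}, 0, 10)
store.append("http_requests_total", {"method": "GET", "code": "500"}, 0, 1)
# Same name and same labels (in a different order): same series.
store.append("http_requests_total", {"code": "200", "method": "GET"}, 15, 14)

ok_series = store.select("http_requests_total", code="200")
```

Changing any label value creates a different series, which is why labels act as query dimensions: here `code="200"` and `code="500"` are stored and selected separately.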

PromQL and metric types #

Stored data is queried with PromQL (Prometheus Query Language). For example, the single line below computes the per-second average rate of HTTP requests over the last 5 minutes.

rate(http_requests_total[5m])
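As a rough picture of what the rate() expression above computes, the following sketch derives a per-second rate from raw counter samples and treats a drop in value as a counter reset. It is a simplification under stated assumptions: the real rate() also extrapolates to the window boundaries, which this sketch does not, and `simple_rate` is a hypothetical name.

```python
def simple_rate(samples):
    """Approximate per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    A drop in value is treated as a counter reset (process restart).
    """
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            increase += value  # counter restarted from zero
        else:
            increase += value - previous
        previous = value
    return increase / elapsed


# Scraped every 60 s; the counter grows by 60 per interval.
steady = [(0, 100), (60, 160), (120, 220), (180, 280)]
# Same traffic, but the process restarted between the 2nd and 3rd scrape.
with_reset = [(0, 100), (60, 160), (120, 60), (180, 120)]
```

Both sample sets describe one request per second, and the reset handling is what keeps the second one from producing a negative or misleading rate.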

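Prometheus defines four metric types: counter, gauge, histogram, and summary. The three below are the ones to know first; a summary resembles a histogram but computes quantiles on the client side.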

| Type | Meaning | Example |
| --- | --- | --- |
| Counter | A cumulative value that only increases | Total requests, total errors |
| Gauge | An instantaneous value that goes up and down | Current memory usage, queue length |
| Histogram | Aggregates the distribution of values into buckets | Distribution of request latency (p95, etc.) |

A counter is cumulative and only resets to zero when the process restarts, so the raw value means little on its own; you apply rate() as in the example above to derive the per-second rate of increase. A gauge is read directly as the current state. A histogram counts observations into buckets for values where the distribution matters, like latency, so that percentiles such as p95 and p99 can be estimated from the bucket counts.
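How a percentile comes out of histogram buckets can be sketched as follows. Prometheus histograms expose cumulative counts per upper bound, and a quantile is estimated by linear interpolation inside the bucket that contains the target rank. The function name `estimate_quantile` and the bucket values are assumptions for illustration; the sketch mirrors the general approach, not the exact implementation of any PromQL function.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound,
    with math.inf as the last upper bound.
    """
    total = buckets[-1][1]
    if total == 0:
        return None
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Cannot interpolate into the open-ended bucket.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Latency in seconds: 50 requests <= 0.1 s, 90 <= 0.5 s, 98 <= 1 s, 100 total.
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = estimate_quantile(0.95, latency_buckets)
```

The result is an estimate whose accuracy depends on how the bucket boundaries were chosen, which is the tradeoff a histogram makes for cheap aggregation.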

Alertmanager and Grafana #

Prometheus handles collection and querying, and two tools sit around it. Prometheus evaluates the alerting rules and sends the alerts that fire to Alertmanager, which deduplicates and groups them and then routes them to channels like email, Slack, and PagerDuty. The role distinction is an exam point: the entity that evaluates the alerting rules is Prometheus, while the entity responsible for dispatch and routing is Alertmanager.
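The three Alertmanager responsibilities named above (deduplication, grouping, routing) can be pictured with a small sketch. It is a toy model under stated assumptions: real Alertmanager uses a routing tree with timing controls, and the function names, label values, and receivers here are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Drop alerts with an identical label set, then group the rest."""
    seen = set()
    groups = defaultdict(list)
    for labels in alerts:
        fingerprint = tuple(sorted(labels.items()))
        if fingerprint in seen:
            continue  # same label set already received: deduplicate
        seen.add(fingerprint)
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


def pick_receiver(labels, routes, default="email"):
    """Return the receiver of the first route whose matchers all match."""
    for matchers, receiver in routes:
        if all(labels.get(k) == v for k, v in matchers.items()):
            return receiver
    return default


alerts = [
    {"alertname": "HighErrorRate", "service": "checkout", "severity": "critical"},
    {"alertname": "HighErrorRate", "service": "checkout", "severity": "critical"},
    {"alertname": "HighErrorRate", "service": "payment", "severity": "critical"},
    {"alertname": "DiskFilling", "service": "checkout", "severity": "warning"},
]
routes = [
    ({"severity": "critical"}, "pagerduty"),
    ({"severity": "warning"}, "slack"),
]

groups = group_alerts(alerts, group_by=["alertname"])
```

Grouping by `alertname` turns four incoming alerts into two notifications, and the severity label decides which channel each one goes to.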

Grafana is a visualization tool that shows metrics as graphs and dashboards. It most often uses Prometheus as a data source, but Grafana itself isn’t tied to Prometheus and can combine multiple data sources in a single dashboard. In other words, understand it as a division of labor where Prometheus handles storage and querying, and Grafana handles visualization.

Distributed tracing and OpenTelemetry #

In a microservices environment, a single request passes through multiple services such as the gateway, authentication, ordering, and payment. Tracing this entire path is distributed tracing.

  • Trace: the entire journey a single request makes through the system.
  • Span: the segment that a single service or unit of work occupies within that journey. It has start and end timestamps, and multiple spans are linked as parent-child to form a single trace.

Following the duration of each span immediately reveals which service segment is the cause of the overall latency.
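The trace and span relationship can be sketched as a small data structure. The example below models four spans linked by parent IDs and finds the span with the largest self time, meaning its duration minus the time spent in its direct children. The `Span` fields, service names, and timings are hypothetical, and the sketch assumes child spans run one after another without overlapping.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def self_time_ms(span, spans):
    """Time spent in the span itself, excluding its direct children."""
    children = [s for s in spans if s.parent_id == span.span_id]
    return span.duration_ms - sum(c.duration_ms for c in children)


def slowest_by_self_time(spans):
    """Return the span that contributes the most time of its own."""
    return max(spans, key=lambda s: self_time_ms(s, spans))


trace = [
    Span("a", None, "gateway", 0, 1000),
    Span("b", "a", "auth", 10, 60),
    Span("c", "a", "order", 70, 980),
    Span("d", "c", "payment", 100, 950),
]

bottleneck = slowest_by_self_time(trace)
```

The gateway span is the longest overall, but almost all of its time is spent waiting on children, so self time is what points at the payment service.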

OpenTelemetry #

OpenTelemetry (OTel) is the standard for the way telemetry data is generated and collected, and it’s a CNCF project. Its goal is to unify the three signals — metrics, logs, and traces — under a single SDK and protocol (OTLP), so that application instrumentation isn’t bound to a specific backend. In other words, if you instrument with OpenTelemetry, you don’t need to change your code whether you send trace data to Jaeger or to another backend.

The key here is the division of labor where instrumentation is OpenTelemetry, and storage and querying are the backend. Jaeger is a representative distributed tracing system among those backends — a CNCF project that stores collected traces and visualizes request paths.

Log collection #

Logging in a container environment starts from one principle. The application inside the container does not manage log files directly but emits to standard output (stdout) and standard error (stderr), and the platform collects that output. This is because containers can die and restart at any time, so logs would be lost if kept inside the container.

The tool that gathers this output and sends it to a central store is a log collector. Fluentd and the lighter Fluent Bit are widely used CNCF projects; they read container logs on each node, process them, and forward them to a storage backend. On the storage and querying side, Grafana’s Loki is frequently mentioned: it indexes only labels, much as Prometheus does for metrics, rather than the full log text, which keeps operations lightweight.
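On the application side, the stdout principle can be sketched with the Python standard `logging` module. The logger writes one JSON object per line to standard output and leaves files, rotation, and shipping to the platform. The logger name, field names, and `build_logger` helper are assumptions for illustration; structured JSON is a common choice, not a requirement.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each record as one JSON object per line."""

    def format(self, record):
        return json.dumps(
            {
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            }
        )


def build_logger(name="checkout", stream=None):
    """Return a logger that writes JSON lines to stdout by default."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger
```

A call such as `build_logger().info("order %s failed", "A-1")` prints a single line to stdout, which a node-level collector can then pick up and parse without the application knowing where logs end up.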

Reliability metrics: SLI, SLO, SLA #

SLI, SLO, and SLA are the vocabulary for treating “is the service working well enough” as numbers. A question that precisely distinguishes the three acronyms appears almost every time, so you must keep them separated.

| Acronym | Full name | Meaning | Example |
| --- | --- | --- | --- |
| SLI | Service Level Indicator | The value actually measured | 99.95% availability over the last 30 days |
| SLO | Service Level Objective | The target set internally | Maintain availability at or above 99.9% |
| SLA | Service Level Agreement | The contract made with the customer (compensation if breached) | Fee refund if below 99.5% |

Memorizing them in order keeps you from getting confused. The measured value (SLI) sits innermost, the target (SLO) the organization set for itself against that measurement sits on top of it, and the contract (SLA) legally promised to the customer sits outermost. Usually you set the SLO stricter than the SLA, so that an alarm goes off at the internal target before the contract is breached.
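The layering of the three can be sketched as a small check: compute the SLI from request counts, then compare it against the stricter SLO first and the looser SLA second. The thresholds reuse the example values from the table; the function names and request counts are hypothetical.

```python
def availability_sli(good_requests, total_requests):
    """SLI: the measured percentage of successful requests."""
    if total_requests == 0:
        return 100.0
    return 100.0 * good_requests / total_requests


def evaluate(sli, slo=99.9, sla=99.5):
    """Compare a measured SLI with the internal SLO and the external SLA."""
    if sli < sla:
        return "SLA breached: contractual compensation applies"
    if sli < slo:
        return "SLO missed: internal alarm, SLA still intact"
    return "SLO met"


healthy = evaluate(availability_sli(999_500, 1_000_000))
degraded = evaluate(availability_sli(997_000, 1_000_000))
breached = evaluate(availability_sli(990_000, 1_000_000))
```

The middle case is the point of keeping the SLO stricter than the SLA: the team is alerted at 99.7% while the contract is still being honored.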

Error budget #

If you set the SLO at 99.9% rather than 100%, the remaining 0.1% becomes headroom in which failure is allowed. This headroom is called the error budget. Having error budget left means there’s room to deploy new features aggressively, while having spent it all is a signal to focus on stabilization. It is a mechanism for managing the balance between stability and deployment speed as a concrete number.
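As a worked example, a 30-day window has 43,200 minutes, so a 99.9% SLO leaves 0.1% of that, about 43.2 minutes, as error budget. The sketch below does this arithmetic; the 30-day window, the function names, and the downtime figure are assumptions for illustration.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Minutes of allowed unavailability for an SLO over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent, downtime_minutes, window_days=30):
    """Budget left after subtracting downtime already incurred."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


budget = error_budget_minutes(99.9)       # roughly 43.2 minutes
remaining = budget_remaining(99.9, 30.0)  # roughly 13.2 minutes left
```

A negative result from `budget_remaining` would mean the budget is spent, which is the signal to pause risky releases and focus on stabilization.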

Golden signals #

When you’re at a loss for what to measure, the four key metrics that serve as a baseline are the golden signals.

| Signal | Meaning |
| --- | --- |
| Latency | The time it takes to process a request |
| Traffic | The volume of requests the system receives |
| Errors | The proportion of failed requests |
| Saturation | How close resources are to their limit |

Memorize the four as a set (L, T, E, S by their initials), and you can quickly spot an answer choice that omits one or slips in a different metric on a question about the golden signals.
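To see how the four signals map onto raw data, here is a sketch that derives all of them from a window of request records plus one resource reading. The `Request` fields, the choice of p95 for latency, HTTP 5xx as the error definition, and CPU as the saturation resource are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int


def golden_signals(requests, window_seconds, cpu_used_cores, cpu_limit_cores):
    """Compute latency, traffic, errors, and saturation for one window."""
    total = len(requests)
    latencies = sorted(r.latency_ms for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    # Nearest-rank p95 using integer arithmetic to avoid float rounding.
    p95_index = max(0, (95 * total + 99) // 100 - 1)
    return {
        "latency_p95_ms": latencies[p95_index] if latencies else None,
        "traffic_rps": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "saturation": cpu_used_cores / cpu_limit_cores,
    }


# 20 requests in a 10-second window; two of them failed with HTTP 500.
sample = [
    Request(latency_ms=10.0 * i, status=500 if i in (7, 13) else 200)
    for i in range(1, 21)
]
signals = golden_signals(sample, window_seconds=10,
                         cpu_used_cores=0.8, cpu_limit_cores=1.0)
```

Three of the signals come from the request stream itself, while saturation needs a resource measurement, which is why it is often the one left out.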

Cost management: FinOps #

In a cloud native environment, resources are easy to add, and costs accumulate just as easily. Cost visibility is therefore treated as one branch of observability, and the operating culture built around it is FinOps (Financial Operations). It refers to the ongoing practice in which engineering, finance, and operations teams together make cloud spend visible, measure it, and then optimize it.

Requests and limits decide cost #

The setting that acts most directly on cost in Kubernetes is a container’s resource requests and limits. Requests are the amount the scheduler reserves when placing a workload, so if you set requests larger than actual usage, that much node capacity sits reserved but idle, which is over-provisioning: you pay for the node while actual utilization stays low. Setting values too low causes the opposite problem. A container that exceeds its memory limit is terminated (OOMKilled), one that exceeds its CPU limit is throttled, and requests set below real usage let the scheduler pack too many Pods onto a node. Rightsizing, adjusting requests and limits to match actual usage, is the starting point of cost management.
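Rightsizing can be sketched as a simple calculation over observed usage. The example below measures how much of a CPU request sits idle and suggests a new request from the 95th percentile of usage plus headroom. The p95 basis, the 20% headroom, the function names, and the usage samples are assumptions for illustration, not a recommended policy.

```python
def reserved_but_idle(request_millicores, used_millicores):
    """Capacity the scheduler reserved that the workload did not use."""
    return max(0, request_millicores - used_millicores)


def suggest_request(usage_millicores, headroom_percent=20):
    """Suggest a CPU request: nearest-rank p95 of usage plus headroom."""
    if not usage_millicores:
        raise ValueError("need at least one usage sample")
    ordered = sorted(usage_millicores)
    index = (95 * len(ordered) + 99) // 100 - 1  # nearest-rank p95
    scaled = ordered[index] * (100 + headroom_percent)
    return -(-scaled // 100)  # integer ceiling division


# Hypothetical CPU usage samples (millicores) for one container.
usage = [100, 120, 110, 130, 150, 140, 125, 115, 135, 145]
current_request = 500

idle = reserved_but_idle(current_request, max(usage))
suggested = suggest_request(usage)
```

With these numbers the container reserves 500m but never uses more than 150m, so 350m per replica is paid for and left idle, and a request near 180m would cover observed usage with margin.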

OpenCost and KubeCost #

Kubernetes itself doesn’t tell you “how much did this namespace spend this month.” The tools that fill this gap are cost-visibility solutions. OpenCost is the open-source standard for Kubernetes cost measurement (a CNCF project that entered at the Sandbox level and has since moved to Incubating), allocating and displaying cost by Pod, namespace, and label. KubeCost is a product built on top of OpenCost, adding optimization recommendations and dashboards to the same cost data. Both tools convert resource usage into cost to reveal who is spending how much.
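The core idea of converting resource usage into cost can be sketched by splitting one node’s price across namespaces in proportion to CPU requests. This is a deliberately simplified model and not the allocation method of OpenCost or KubeCost: it ignores memory, storage, and network, and the node price, function name, and Pod list are hypothetical.

```python
from collections import defaultdict


def allocate_cost(node_monthly_cost, node_cpu_millicores, pods):
    """Split a node's monthly cost across namespaces by CPU requests.

    pods: list of (namespace, cpu_request_millicores).
    Capacity that nobody requested is reported under "__idle__".
    """
    cost_per_millicore = node_monthly_cost / node_cpu_millicores
    costs = defaultdict(float)
    requested = 0
    for namespace, cpu_request in pods:
        costs[namespace] += cpu_request * cost_per_millicore
        requested += cpu_request
    costs["__idle__"] = (node_cpu_millicores - requested) * cost_per_millicore
    return dict(costs)


# One 4-core node at a hypothetical 100 currency units per month.
pods = [("checkout", 1000), ("checkout", 500), ("payment", 1500)]
costs = allocate_cost(100.0, 4000, pods)
```

Reporting idle capacity as its own line makes over-provisioning visible as money rather than as an abstract utilization percentage.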

Exam points summary #

Collecting the frequently appearing exam points in this domain gives the following.

  • Distinguishing the three pillars. What metrics (numeric time series), logs (event records), and traces (the distributed path of a request) each answer. You have to spot a choice that scrambles the three roles.
  • Prometheus is a pull model. It scrapes the target’s /metrics directly. A choice that says push is a trap.
  • Distinguishing SLI/SLO/SLA. Don’t confuse the order and definitions of measured value (SLI), target (SLO), and contract (SLA).
  • The four golden signals. Latency, traffic, errors, saturation.
  • Role matching. You need to keep Prometheus (collection/querying), Grafana (visualization), Alertmanager (alert routing), OpenTelemetry (instrumentation standard), Jaeger (distributed tracing backend), and Fluentd/Fluent Bit (log collection) distinguished line by line.
  • FinOps and requests/limits. Over-setting requests leaks cost through over-provisioning.

Wrap-up #

What this post locked in:

  • Observability is a property of the system; monitoring is the activity of watching known metrics. The two are complementary
  • The three pillars of telemetry. Metrics (numeric time series), logs (events), traces (distributed path)
  • Prometheus. CNCF Graduated project, pull-based scrape, exporters, time-series DB, PromQL, counter/gauge/histogram. Around it, Alertmanager (alerting) and Grafana (visualization)
  • OpenTelemetry. The instrumentation standard, the trace/span concepts. Jaeger as a backend
  • Logging. Containers emit to stdout/stderr, Fluentd/Fluent Bit collect, Loki stores
  • Reliability metrics. SLI (measurement), SLO (target), SLA (contract), error budget, golden signals (latency/traffic/errors/saturation)
  • FinOps. Requests/limits decide cost, and OpenCost/KubeCost make cost visible

If you want a deeper look at the practical implementation of observability, K8s Advanced #5 Observability covers the flow of actually wiring up and operating Prometheus, Grafana, Loki, and OpenTelemetry together.

Next: Cloud Native Application Delivery #

Once observation is in place, the final domain is the flow of delivering applications to the cluster.

#7 Cloud Native Application Delivery (8%): GitOps, CI/CD walks through the principles of GitOps (declarative deployment with Git as the single source of truth), ArgoCD and Flux, the distinction between CI and CD pipelines, and the deployment concepts that show up often on the exam.
