AWS Certified Developer - Associate (DVA-C02) #12 Domain 4-1 Troubleshooting and Optimization — Observability

Sunday, May 31, 2026

We wrapped up deployment in #11 Deployment Strategies. The last domain, Troubleshooting and Optimization, accounts for 18% of the exam. This first post covers observability, that is, “the tools to see what is happening.” To trace a failure, you first have to be able to read logs, metrics, and traces.

CloudWatch Logs #

Collects, searches, and retains logs from applications and services.

Concept	Meaning
Log Group	A bundle of logs from the same application or function. The retention period is set here
Log Stream	A sequence of logs from one source (an instance or execution environment)
Logs Insights	Searches and aggregates logs with a query language
Subscription filter	Streams logs in real time to Lambda/Kinesis

Lambda automatically writes logs to CloudWatch Logs (the execution role needs log permissions).
EC2 instances and on-premises servers must install the CloudWatch Agent to send logs as well as memory and disk metrics.

Exam trap: EC2 memory and disk utilization are not default metrics. You must install the CloudWatch Agent, which publishes them as custom metrics. CPU and network metrics are provided by default.
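
To make the Logs Insights row in the table concrete, here is a minimal sketch of how a query could be prepared for the StartQuery API with boto3. The log group name and the ERROR pattern are hypothetical, and the submit helper assumes boto3 is installed and credentials are configured.

```python
import time
from typing import Optional


def build_error_query(log_group: str, minutes: int = 60,
                      now: Optional[float] = None) -> dict:
    """Build keyword arguments for the CloudWatch Logs StartQuery API."""
    end = int(time.time() if now is None else now)
    # Logs Insights query: count lines containing ERROR in 5-minute buckets.
    query = (
        "fields @timestamp, @message "
        "| filter @message like /ERROR/ "
        "| stats count(*) as errors by bin(5m)"
    )
    return {
        "logGroupName": log_group,
        "startTime": end - minutes * 60,  # epoch seconds
        "endTime": end,
        "queryString": query,
    }


def start_query(params: dict) -> str:
    """Submit the query and return its ID. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above runs without AWS access

    return boto3.client("logs").start_query(**params)["queryId"]
```

Queries run asynchronously, so the returned ID would then be polled with get_query_results until the query completes.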

CloudWatch Metrics #

Collects numeric values from systems and applications as time series.

Standard metrics: provided automatically by AWS services (EC2 CPU; Lambda invocation count, errors, and duration; etc.).
Custom metrics: pushed directly with PutMetricData (business metrics such as order count).
High-resolution: custom metrics can be stored at down to 1-second granularity. The default is 1 minute (EC2 basic monitoring reports every 5 minutes).
Dimension: a key-value pair that classifies a metric by function name, environment, etc.
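
As a rough illustration of a custom metric with dimensions, the sketch below builds a PutMetricData request for an order count. The namespace, metric name, and dimension values are made up for the example; StorageResolution is 1 for high resolution and 60 for standard resolution.

```python
def build_order_metric(order_count: int, function_name: str, environment: str,
                       high_resolution: bool = False) -> dict:
    """Build keyword arguments for the CloudWatch PutMetricData API."""
    return {
        "Namespace": "Shop/Orders",  # hypothetical custom namespace
        "MetricData": [{
            "MetricName": "OrderCount",  # hypothetical business metric
            "Dimensions": [
                {"Name": "FunctionName", "Value": function_name},
                {"Name": "Environment", "Value": environment},
            ],
            "Value": float(order_count),
            "Unit": "Count",
            # 1 = high resolution (1-second), 60 = standard resolution
            "StorageResolution": 1 if high_resolution else 60,
        }],
    }


def publish(params: dict) -> None:
    """Send the metric. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above runs without AWS access

    boto3.client("cloudwatch").put_metric_data(**params)
```

Each distinct combination of dimension values is tracked as its own metric, so it is usually worth keeping dimensions low-cardinality.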

CloudWatch Alarms #

Acts when a metric goes outside a threshold.

States: OK / ALARM / INSUFFICIENT_DATA.
Actions: SNS notification, Auto Scaling, EC2 actions, and automatic deployment rollback.
A composite alarm combines multiple alarms with logical conditions (AND, OR, NOT).
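
A possible shape for an alarm that notifies through SNS is sketched below as a PutMetricAlarm request. It watches the standard Lambda Errors metric (an error count rather than a true error rate); the alarm name, threshold, periods, and topic ARN are assumptions chosen for illustration.

```python
def build_error_alarm(function_name: str, sns_topic_arn: str,
                      threshold: int = 5) -> dict:
    """Build keyword arguments for the CloudWatch PutMetricAlarm API."""
    return {
        "AlarmName": f"{function_name}-errors",  # hypothetical naming scheme
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,             # evaluate in 1-minute buckets
        "EvaluationPeriods": 3,   # 3 consecutive breaching periods -> ALARM
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        # Periods with no data points count as healthy instead of
        # pushing the alarm into INSUFFICIENT_DATA.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],  # notified on transition to ALARM
    }


def create_alarm(params: dict) -> None:
    """Create or update the alarm. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above runs without AWS access

    boto3.client("cloudwatch").put_metric_alarm(**params)
```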

X-Ray — Distributed Tracing #

In microservice and serverless architectures, X-Ray traces a request as it passes through multiple services, showing where it slows down and where it fails.

Concept	Meaning
Segment	A unit of work handled by one service
Subsegment	A detailed call within a segment (DB query, external API, etc.)
Trace	The full set of segments produced by one request
Service Map	Visualizes inter-service call relationships, latency, and errors
Sampling	Traces only a portion of requests to reduce cost and load

On Lambda, enable X-Ray by turning on active tracing and granting the execution role X-Ray permissions. Add the X-Ray SDK to your code and downstream calls are captured as subsegments as well.
On EC2, ECS, and on-premises hosts, run the X-Ray daemon to send trace data.
The answer to “I want to find the bottleneck in a request that passes through multiple services” is X-Ray.
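
To see how the trace, segment, and subsegment concepts fit together, here is a toy model in plain Python. It is not the X-Ray SDK or its data format: the class names, services, and durations are invented, and segments are treated as independent (real traces nest downstream time inside upstream calls).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Subsegment:
    name: str           # e.g. a DB query or an external API call
    duration_ms: float


@dataclass
class Segment:
    service: str        # the service that handled this unit of work
    duration_ms: float
    subsegments: List[Subsegment] = field(default_factory=list)


@dataclass
class Trace:
    trace_id: str
    segments: List[Segment]  # every segment produced by one request


def find_bottleneck(trace: Trace) -> Optional[Tuple[str, str, float]]:
    """Return (service, subsegment, duration_ms) for the slowest subsegment."""
    candidates = [
        (seg.service, sub.name, sub.duration_ms)
        for seg in trace.segments
        for sub in seg.subsegments
    ]
    # No subsegments recorded means there is nothing to compare.
    return max(candidates, key=lambda c: c[2]) if candidates else None


EXAMPLE = Trace("trace-1", [
    Segment("frontend", 40.0, [Subsegment("render", 35.0)]),
    Segment("orders", 620.0, [Subsegment("DynamoDB query", 550.0),
                              Subsegment("SNS publish", 30.0)]),
    Segment("inventory", 80.0, [Subsegment("cache lookup", 10.0)]),
])

if __name__ == "__main__":
    print(find_bottleneck(EXAMPLE))
```

In this made-up trace the slow spot is the database query inside the orders service, which is the kind of answer the service map and trace timeline surface visually.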

EMF — Embedded Metric Format #

When you record metrics alongside logs as structured JSON, CloudWatch automatically extracts custom metrics from those logs.

Just writing logs generates metrics, without a separate PutMetricData API call.
High-cardinality context (request ID, etc.) stays in the logs while you still get aggregated metrics.
It’s the recommended way to create custom metrics in serverless without the cost and latency of API calls.
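
A minimal sketch of what an EMF log line could look like when written by hand, assuming a Lambda function whose stdout goes to CloudWatch Logs. The namespace, metric, dimension, and requestId values are hypothetical; the "_aws" metadata block tells CloudWatch which top-level fields to extract as metrics and which to use as dimensions.

```python
import json
import time
from typing import Dict, Optional


def emf_record(namespace: str, metric_name: str, value: float,
               dimensions: Dict[str, str], unit: str = "Count",
               context: Optional[Dict[str, str]] = None,
               timestamp_ms: Optional[int] = None) -> str:
    """Return one JSON log line in Embedded Metric Format."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,  # epoch milliseconds
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                # One dimension set, listing the keys of the dimension fields.
                "Dimensions": [list(dimensions.keys())],
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        metric_name: value,  # the metric value lives at the top level
    }
    record.update(dimensions)     # dimension values, also top level
    record.update(context or {})  # extra context stays in the log only
    return json.dumps(record)


if __name__ == "__main__":
    # Printing the line is the whole "API call" in this approach.
    print(emf_record("Shop/Orders", "OrderCount", 1,
                     dimensions={"FunctionName": "checkout"},
                     context={"requestId": "abc-123"}))
```

Because requestId is not listed under Dimensions or Metrics, it remains searchable in the log without multiplying the number of metrics.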

Distinguishing from CloudTrail #

CloudWatch and CloudTrail are often confused in questions about choosing an observability tool.

Service	What it shows
CloudWatch	Performance and logs (“how fast / what was logged”)
X-Ray	Request path and bottlenecks (“where it’s slow”)
CloudTrail	API call audit (“who called which API”)

“Who deleted this resource” is CloudTrail; “why is it slow” is X-Ray/CloudWatch.
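
For the “who deleted this resource” side, a sketch of a CloudTrail LookupEvents request is shown below, filtering on the DeleteBucket event name. The 7-day window and the helper names are assumptions; the summary helper only reads fields that LookupEvents returns for each event.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Tuple


def build_delete_bucket_lookup(days: int = 7,
                               now: Optional[datetime] = None) -> dict:
    """Build keyword arguments for the CloudTrail LookupEvents API."""
    end = now or datetime.now(timezone.utc)
    return {
        "LookupAttributes": [
            {"AttributeKey": "EventName", "AttributeValue": "DeleteBucket"},
        ],
        "StartTime": end - timedelta(days=days),
        "EndTime": end,
    }


def who_did_it(events: List[dict]) -> List[Tuple[str, str, datetime]]:
    """Reduce LookupEvents results to (user, API call, time) tuples."""
    return [
        (e.get("Username", "unknown"), e["EventName"], e["EventTime"])
        for e in events
    ]


def lookup(params: dict) -> List[dict]:
    """Run the lookup. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above run without AWS access

    return boto3.client("cloudtrail").lookup_events(**params)["Events"]
```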

Exam question patterns #

“Find the bottleneck in a request passing through several microservices.” → X-Ray (service map).
“Monitor EC2 memory utilization.” → CloudWatch Agent (custom metric).
“Search and aggregate logs with a query.” → CloudWatch Logs Insights.
“Create serverless custom metrics without an API call.” → EMF.
“Who deleted the S3 bucket.” → CloudTrail.
“Notify when the error rate crosses a threshold.” → CloudWatch Alarm + SNS.
“Record a business metric (order count) directly.” → Custom metric (PutMetricData) or EMF.

Common traps #

1) The misconception that EC2 memory is a default metric #

CPU and network are default; memory and disk require the Agent.

2) Confusing CloudWatch and CloudTrail #

Performance and logs are CloudWatch; API audit is CloudTrail.

3) Thinking X-Ray is enabled by code alone #

Lambda needs active tracing enabled plus X-Ray permissions on the execution role.

Wrap-up #

What this post locked in:

CloudWatch Logs (groups, streams, Logs Insights, subscription filters); EC2 memory requires the Agent
Metrics (standard, custom, high-resolution, dimensions) and Alarms (SNS, rollback integration)
X-Ray: trace bottlenecks via segments, subsegments, the service map, and sampling
EMF: extract custom metrics from logs alone (recommended for serverless)
Distinguishing CloudWatch (performance), X-Ray (path), and CloudTrail (API audit)

Next — Domain 4-2 Optimization and Problem Solving #

Now that we’ve covered how to see what’s happening, the last topic is how to improve it and resolve common errors. In #13 Optimization and Problem Solving, I’ll cover caching layers, Lambda performance tuning, and the error codes and troubleshooting patterns that frequently appear on the exam.