AWS Certified Developer - Associate (DVA-C02) #12 Domain 4-1 Troubleshooting and Optimization — Observability
We wrapped up deployment in #11 Deployment Strategies. The last domain, Troubleshooting and Optimization, accounts for 18% of the exam. This first post covers observability, that is, “the tools to see what is happening.” To trace a failure, you first have to be able to read logs, metrics, and traces.
CloudWatch Logs #
Collects, searches, and retains logs from applications and services.
| Concept | Meaning |
|---|---|
| Log Group | A bundle of logs from the same application or function. The retention period is set here |
| Log Stream | A sequence of logs from one source (an instance or execution environment) |
| Logs Insights | Searches and aggregates logs with a query language |
| Subscription filter | Streams logs in real time to Lambda/Kinesis |
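As a rough illustration of Logs Insights, the sketch below builds the arguments for the CloudWatch Logs `StartQuery` API. The helper name `build_insights_query` and the log group `/aws/lambda/order-service` are hypothetical, chosen only for this example. The query counts lines containing `ERROR` in 5-minute buckets.

```python
import time


def build_insights_query(log_group, minutes=60, now=None):
    """Build keyword arguments for the CloudWatch Logs StartQuery API.

    `log_group` is a hypothetical log group name supplied by the caller.
    """
    end = int(now if now is not None else time.time())
    query = (
        "fields @timestamp, @message "
        "| filter @message like /ERROR/ "
        "| stats count(*) as errors by bin(5m)"
    )
    return {
        "logGroupName": log_group,
        "startTime": end - minutes * 60,  # epoch seconds
        "endTime": end,
        "queryString": query,
    }


params = build_insights_query("/aws/lambda/order-service", minutes=30, now=1_700_000_000)
print(params["queryString"])

# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("logs").start_query(**params)
```

The call is left as a comment so the sketch runs without AWS credentials; results would then be fetched with `get_query_results` using the returned query ID.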
- Lambda automatically writes logs to CloudWatch Logs (the execution role needs log permissions).
- EC2 instances and on-premises servers must install the CloudWatch Agent to send logs and memory and disk metrics.
Exam trap: EC2 memory and disk utilization are not default metrics. You must install the CloudWatch Agent, which publishes them as custom metrics. CPU and network metrics are provided by default.
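To make the Agent requirement concrete, here is a minimal sketch that generates a CloudWatch Agent configuration collecting memory and disk utilization plus one log file. It assumes a Linux host; the file path `/var/log/app/app.log`, the log group `my-app-logs`, and the helper `agent_config` are hypothetical placeholders.

```python
import json


def agent_config(log_file, log_group):
    """Return a minimal CloudWatch Agent configuration as a dict (Linux assumed)."""
    return {
        "metrics": {
            "metrics_collected": {
                # Memory and disk are not default EC2 metrics; the Agent collects them.
                "mem": {"measurement": ["mem_used_percent"]},
                "disk": {"measurement": ["used_percent"], "resources": ["/"]},
            }
        },
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [
                        {
                            "file_path": log_file,
                            "log_group_name": log_group,
                            # The Agent substitutes the instance ID at runtime.
                            "log_stream_name": "{instance_id}",
                        }
                    ]
                }
            }
        },
    }


config = agent_config("/var/log/app/app.log", "my-app-logs")
print(json.dumps(config, indent=2))
```

The printed JSON is the kind of file the Agent reads at startup; a real deployment would likely add more measurements and settings.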
CloudWatch Metrics #
Collects numeric values from systems and applications as time series.
- Standard metrics: Provided automatically by AWS services (EC2 CPU; Lambda invocation count, errors, and duration; etc.).
- Custom metrics: Pushed directly with `PutMetricData` (business metrics such as order count).
- High-resolution: Down to 1-second granularity. The default is 1 minute (5 minutes for EC2 basic monitoring).
- Dimension: A key-value pair that classifies a metric by function name, environment, etc.
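The pieces above (custom metric, dimension, high resolution) map onto the fields of a `PutMetricData` request. The sketch below builds one datum as a plain dict and shows how it could be sent. The helper names, the namespace `Shop/Orders`, and the metric `OrderCount` are hypothetical; `client` is assumed to be a boto3 CloudWatch client passed in by the caller.

```python
def build_metric_datum(name, value, unit="Count", dimensions=None, high_resolution=False):
    """Build one MetricData entry for the CloudWatch PutMetricData API."""
    return {
        "MetricName": name,
        "Value": float(value),
        "Unit": unit,
        # Dimensions classify the metric, e.g. by environment or function name.
        "Dimensions": [{"Name": k, "Value": v} for k, v in (dimensions or {}).items()],
        # 1 = high resolution (1-second), 60 = standard resolution (1-minute).
        "StorageResolution": 1 if high_resolution else 60,
    }


def publish(client, namespace, datum):
    """Send the datum; `client` is assumed to be boto3.client("cloudwatch")."""
    client.put_metric_data(Namespace=namespace, MetricData=[datum])


datum = build_metric_datum("OrderCount", 3, dimensions={"Environment": "prod"})
print(datum)
```

Keeping the payload construction separate from the API call makes it easy to inspect or unit-test the datum without network access.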
CloudWatch Alarms #
Triggers an action when a metric crosses a threshold.
- States: `OK` / `ALARM` / `INSUFFICIENT_DATA`.
- Actions: SNS notification, Auto Scaling, EC2 action, automatic deployment rollback.
- You can group multiple alarms with logical conditions using a composite alarm.
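As one possible shape for an alarm with an SNS action, the sketch below builds the arguments for the CloudWatch `PutMetricAlarm` API against the Lambda `Errors` metric. The function name `order-service`, the topic ARN, the threshold, and the helper `build_error_alarm` are hypothetical values picked for illustration.

```python
def build_error_alarm(function_name, topic_arn, threshold=5):
    """Build keyword arguments for the CloudWatch PutMetricAlarm API."""
    return {
        "AlarmName": f"{function_name}-errors",
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,            # seconds per evaluation period
        "EvaluationPeriods": 3,  # consecutive periods that must breach
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # Periods with no data count as healthy rather than leading to INSUFFICIENT_DATA.
        "TreatMissingData": "notBreaching",
        # Notified when the alarm enters the ALARM state.
        "AlarmActions": [topic_arn],
    }


alarm = build_error_alarm(
    "order-service",
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",  # hypothetical topic ARN
)
print(alarm["AlarmName"])

# With boto3 installed and credentials configured:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```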
X-Ray — Distributed Tracing #
In microservice and serverless architectures, X-Ray traces where a request slows down or fails as it passes through multiple services.
| Concept | Meaning |
|---|---|
| Segment | A unit of work handled by one service |
| Subsegment | A detailed call within a segment (DB query, external API, etc.) |
| Trace | The full set of segments produced by one request |
| Service Map | Visualizes inter-service call relationships, latency, and errors |
| Sampling | Traces only a portion of requests to reduce cost and load |
- For Lambda, enable X-Ray by turning on active tracing and granting the execution role X-Ray permissions. Add the X-Ray SDK to your code and external calls are captured as subsegments too.
- On EC2, ECS, and on-premises, run the X-Ray daemon to send trace data.
- The answer to “I want to find the bottleneck in a request that passes through multiple services” is X-Ray.
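To show how segments and subsegments point to a bottleneck, here is a toy model in plain Python. It is not the X-Ray SDK or its data format; the classes, service names, and durations are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Subsegment:
    name: str           # a detailed call, e.g. a DB query or external API
    duration_ms: float


@dataclass
class Segment:
    service: str        # the service that handled this unit of work
    duration_ms: float
    subsegments: list = field(default_factory=list)


def slowest_segment(trace):
    """Return the segment that took the longest within one trace."""
    return max(trace, key=lambda s: s.duration_ms)


def slowest_subsegment(segment):
    """Return the slowest call inside a segment, or None if it has none."""
    return max(segment.subsegments, key=lambda s: s.duration_ms, default=None)


# One trace = all segments produced by a single request (hypothetical data).
trace = [
    Segment("api-gateway", 12.0),
    Segment("order-lambda", 480.0, [
        Subsegment("DynamoDB.GetItem", 35.0),
        Subsegment("payment-api", 410.0),
    ]),
    Segment("notify-lambda", 60.0),
]

seg = slowest_segment(trace)
sub = slowest_subsegment(seg)
print(seg.service, sub.name)  # prints: order-lambda payment-api
```

The service map performs this kind of drill-down visually: first the slow service, then the slow call inside it.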
EMF — Embedded Metric Format #
When you record metrics alongside logs as structured JSON, CloudWatch automatically extracts custom metrics from those logs.
- Just writing logs generates metrics, without a separate `PutMetricData` API call.
- You keep high-cardinality context (request ID, etc.) in logs while also getting aggregated metrics.
- It’s the recommended way to create custom metrics in serverless without the cost and latency of API calls.
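A minimal sketch of what an EMF log line can look like follows. The `_aws` metadata block tells CloudWatch which top-level keys are metrics and which are dimensions; everything else stays in the log as searchable context. The helper `emf_record`, the namespace `Shop/Orders`, the metric `OrderCount`, and the request ID are hypothetical.

```python
import json
import time


def emf_record(namespace, metric_name, value, unit, dimensions, **context):
    """Build one Embedded Metric Format record as a dict."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set, listing the keys that hold dimension values.
                    "Dimensions": [list(dimensions.keys())],
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        metric_name: value,  # the metric value lives at the top level
    }
    record.update(dimensions)  # dimension values, also top level
    record.update(context)     # high-cardinality context stays log-only
    return record


line = json.dumps(
    emf_record(
        "Shop/Orders", "OrderCount", 1, "Count",
        {"FunctionName": "order-service"},
        requestId="abc-123",
    )
)
print(line)  # in Lambda, stdout is delivered to CloudWatch Logs
```

Here `requestId` is not declared in `_aws`, so it remains in the log entry without becoming a metric dimension.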
Distinguishing from CloudTrail #
This is a pair often confused in questions about choosing an observability tool.
| Service | What it shows |
|---|---|
| CloudWatch | Performance and logs (“how fast / what logs occurred”) |
| X-Ray | Request path and bottlenecks (“where it’s slow”) |
| CloudTrail | API call audit (“who called which API”) |
“Who deleted this resource” is CloudTrail; “why is it slow” is X-Ray/CloudWatch.
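For the “who did it” side, a small sketch of querying CloudTrail event history is shown below. It assumes `client` is a boto3 CloudTrail client supplied by the caller; the helper name `who_called` is hypothetical. Event history lookups cover recent management events only, so treat this as an illustration rather than a complete audit procedure.

```python
def who_called(client, event_name, max_results=10):
    """Return (username, event time) pairs for a given API event name.

    `client` is assumed to be boto3.client("cloudtrail").
    """
    response = client.lookup_events(
        LookupAttributes=[
            {"AttributeKey": "EventName", "AttributeValue": event_name}
        ],
        MaxResults=max_results,
    )
    return [(e.get("Username"), e["EventTime"]) for e in response["Events"]]


# Example usage, with boto3 installed and credentials configured:
#   import boto3
#   for user, when in who_called(boto3.client("cloudtrail"), "DeleteBucket"):
#       print(user, when)
```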
Exam question patterns #
- “Find the bottleneck in a request passing through several microservices.” → X-Ray (service map).
- “Monitor EC2 memory utilization.” → CloudWatch Agent (custom metric).
- “Search and aggregate logs with a query.” → CloudWatch Logs Insights.
- “Create serverless custom metrics without an API call.” → EMF.
- “Who deleted the S3 bucket.” → CloudTrail.
- “Notify when the error rate crosses a threshold.” → CloudWatch Alarm + SNS.
- “Record a business metric (order count) directly.” → Custom metric (`PutMetricData`) or EMF.
Common traps #
1) The misconception that EC2 memory is a default metric #
CPU and network metrics are default; memory and disk require the Agent.
2) Confusing CloudWatch and CloudTrail #
Performance and logs are CloudWatch; API audit is CloudTrail.
3) Thinking X-Ray is enabled by code alone #
Lambda needs active tracing enabled plus X-Ray permissions on the execution role.
Wrap-up #
What this post locked in:
- CloudWatch Logs (groups, streams, Logs Insights, subscription filters); EC2 memory requires the Agent
- Metrics (standard, custom, high-resolution, dimensions) and Alarms (SNS and rollback integration)
- X-Ray: trace bottlenecks via segments, subsegments, the service map, and sampling
- EMF: extract custom metrics from logs alone (recommended for serverless)
- Distinguishing CloudWatch (performance), X-Ray (path), and CloudTrail (API audit)
Next — Domain 4-2 Optimization and Problem Solving #
Now that we’ve covered how to see what’s happening, the last topic is how to improve it and resolve common errors. In #13 Optimization and Problem Solving, I’ll cover caching layers, Lambda performance tuning, and the error codes and troubleshooting patterns that frequently appear on the exam.