AWS Certified Developer - Associate (DVA-C02) #12 Domain 4-1 Troubleshooting and Optimization — Observability


We wrapped up deployment in #11 Deployment Strategies. The last domain, troubleshooting and optimization, accounts for 18% of the exam. This first post covers observability, that is, “the tools to see what is happening.” To trace a failure, you first have to be able to read logs, metrics, and traces.

CloudWatch Logs #

CloudWatch Logs collects, searches, and retains logs from applications and services.

Concept | Meaning
Log Group | A bundle of logs from the same application or function. The retention period is set here.
Log Stream | A sequence of logs from one source (an instance or an execution environment)
Logs Insights | Searches and aggregates logs with a query language
Subscription filter | Streams logs in real time to Lambda or Kinesis

  • Lambda automatically writes logs to CloudWatch Logs (the execution role needs log permissions).
  • EC2 instances and on-premises servers need the CloudWatch Agent installed to send logs and memory and disk metrics.

Exam trap: EC2’s memory and disk utilization are not default metrics. You must install the CloudWatch Agent so they show up as custom metrics. CPU and network metrics are provided by default.
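To make the Logs Insights row in the table above concrete, here is a minimal sketch of a query that counts error lines in 5-minute buckets. The helper names and the log group are hypothetical, and `logs_client` is assumed to be a boto3 CloudWatch Logs client (`boto3.client("logs")`) or anything exposing the same `start_query` method.

```python
import time

# Logs Insights query: count log lines containing "ERROR" in 5-minute buckets.
ERROR_COUNT_QUERY = (
    "fields @timestamp, @message "
    "| filter @message like /ERROR/ "
    "| stats count(*) as error_count by bin(5m)"
)


def build_start_query_params(log_group, minutes_back=60, now=None):
    """Build keyword arguments for the CloudWatch Logs StartQuery API."""
    end_time = int(now if now is not None else time.time())
    return {
        "logGroupName": log_group,
        "startTime": end_time - minutes_back * 60,  # epoch seconds
        "endTime": end_time,
        "queryString": ERROR_COUNT_QUERY,
    }


def start_error_count_query(logs_client, log_group):
    """Start the query and return its ID; results are fetched separately."""
    response = logs_client.start_query(**build_start_query_params(log_group))
    return response["queryId"]
```

Logs Insights queries run asynchronously, so the returned query ID would then be passed to `get_query_results` until the query reports completion.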

CloudWatch Metrics #

CloudWatch Metrics collects numeric values from systems and applications as time series.

  • Standard metrics: provided automatically by AWS services (EC2 CPU; Lambda invocation count, errors, and duration; and so on).
  • Custom metrics: pushed directly with PutMetricData (business metrics such as order count).
  • High-resolution: down to 1-second granularity. Standard resolution is 1 minute, and some service metrics, such as EC2 basic monitoring, arrive at 5-minute intervals.
  • Dimension: a key-value pair that classifies a metric, for example by function name or environment.
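As a rough illustration of the custom metric and dimension bullets above, the sketch below builds a PutMetricData request for an order-count metric. The namespace `DemoShop/Orders`, the metric name, and the function names are all made up for this example; `cloudwatch_client` is assumed to be `boto3.client("cloudwatch")` or an object with the same `put_metric_data` method.

```python
def build_order_metric(order_count, environment):
    """Build PutMetricData arguments for a hypothetical OrderCount metric."""
    return {
        "Namespace": "DemoShop/Orders",  # hypothetical namespace
        "MetricData": [
            {
                "MetricName": "OrderCount",
                # The dimension classifies the metric by environment.
                "Dimensions": [{"Name": "Environment", "Value": environment}],
                "Value": float(order_count),
                "Unit": "Count",
                "StorageResolution": 60,  # 60 = standard, 1 = high-resolution
            }
        ],
    }


def publish_order_count(cloudwatch_client, order_count, environment="prod"):
    """Push one data point through the PutMetricData API."""
    cloudwatch_client.put_metric_data(**build_order_metric(order_count, environment))
```

Setting `StorageResolution` to 1 instead of 60 is what turns the same data point into a high-resolution metric.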

CloudWatch Alarms #

An alarm takes action when a metric crosses a threshold.

  • States: OK / ALARM / INSUFFICIENT_DATA.
  • Actions: SNS notification, Auto Scaling, EC2 actions, automatic deployment rollback.
  • A composite alarm combines multiple alarms with logical conditions (AND, OR, NOT).
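Here is a hedged sketch of what an alarm definition might look like for a Lambda function's `Errors` metric, with an SNS topic as the action. The alarm name, threshold, evaluation settings, and helper names are assumptions chosen for illustration; `cloudwatch_client` is again assumed to be `boto3.client("cloudwatch")` or a compatible object.

```python
def build_error_alarm(function_name, sns_topic_arn, threshold=5):
    """Build PutMetricAlarm arguments for a Lambda error-count alarm."""
    return {
        "AlarmName": f"{function_name}-errors",  # hypothetical naming scheme
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,            # evaluate one-minute sums
        "EvaluationPeriods": 3,  # ALARM after three breaching periods
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        # Periods with no data count as healthy, not INSUFFICIENT_DATA.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],  # notify this SNS topic on ALARM
    }


def create_error_alarm(cloudwatch_client, function_name, sns_topic_arn):
    """Create or update the alarm through the PutMetricAlarm API."""
    cloudwatch_client.put_metric_alarm(
        **build_error_alarm(function_name, sns_topic_arn)
    )
```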

X-Ray — Distributed Tracing #

In microservice and serverless architectures, X-Ray traces where a request slows down and where it fails as it passes through multiple services.

Concept | Meaning
Segment | A unit of work handled by one service
Subsegment | A detailed call within a segment (DB query, external API, etc.)
Trace | The full set of segments produced by one request
Service Map | Visualizes inter-service call relationships, latency, and errors
Sampling | Traces only a portion of requests to reduce cost and load

  • On Lambda, X-Ray is enabled by turning on active tracing and granting X-Ray permissions to the execution role. Adding the X-Ray SDK to your code also captures downstream calls as subsegments.
  • EC2, ECS, and on-premises hosts run the X-Ray daemon to send trace data.
  • The answer to “I want to find the bottleneck in a request that passes through multiple services” is X-Ray.
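To show how segments and subsegments point at a bottleneck, here is a toy model in plain Python. It does not use the X-Ray SDK; it only borrows the field names of X-Ray segment documents (`name`, `start_time`, `end_time`, `subsegments`). The function `slowest_unit` and the sample trace are invented for this illustration.

```python
def slowest_unit(segments):
    """Return (name, seconds) for the slowest leaf unit of work in a trace.

    Only leaves are compared, because a parent's duration already includes
    the time spent in its subsegments.
    """
    slowest = None

    def visit(node):
        nonlocal slowest
        children = node.get("subsegments", [])
        if children:
            for child in children:
                visit(child)
            return
        duration = node["end_time"] - node["start_time"]
        if slowest is None or duration > slowest[1]:
            slowest = (node["name"], duration)

    for segment in segments:
        visit(segment)
    return slowest


# Hypothetical trace: one request handled by two services.
SAMPLE_TRACE = [
    {
        "name": "order-api",
        "start_time": 0.0,
        "end_time": 1.2,
        "subsegments": [
            {"name": "DynamoDB.GetItem", "start_time": 0.1, "end_time": 0.2},
            {"name": "payments-api", "start_time": 0.25, "end_time": 1.15},
        ],
    },
    {"name": "order-worker", "start_time": 1.2, "end_time": 1.4},
]

name, seconds = slowest_unit(SAMPLE_TRACE)
print(f"slowest: {name} ({seconds:.2f}s)")
```

In this made-up trace the external payments call dominates the request, which is the kind of answer the service map gives you visually.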

EMF — Embedded Metric Format #

When you record metrics alongside logs as structured JSON, CloudWatch automatically extracts custom metrics from those logs.

  • Just writing logs generates metrics, without a separate PutMetricData API call.
  • You leave high-cardinality context (request ID, etc.) in logs while simultaneously getting aggregated metrics.
  • It’s the recommended way to create custom metrics in serverless applications without the cost and latency of API calls.
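The sketch below shows roughly what an EMF log line looks like when built by hand with the standard library. The `_aws` metadata block follows the Embedded Metric Format; the namespace, metric name, dimension, and the `emf_log_line` helper are assumptions for this example.

```python
import json
import time


def emf_log_line(function_name, order_count, request_id, now_ms=None):
    """Return one JSON log line in CloudWatch Embedded Metric Format."""
    timestamp = now_ms if now_ms is not None else int(time.time() * 1000)
    record = {
        "_aws": {
            "Timestamp": timestamp,  # epoch milliseconds
            "CloudWatchMetrics": [
                {
                    "Namespace": "DemoShop/Orders",  # hypothetical namespace
                    "Dimensions": [["FunctionName"]],
                    "Metrics": [{"Name": "OrderCount", "Unit": "Count"}],
                }
            ],
        },
        "FunctionName": function_name,  # value of the dimension named above
        "OrderCount": order_count,      # value of the metric named above
        "RequestId": request_id,        # high-cardinality context, log only
    }
    return json.dumps(record)


# In a Lambda function, printing the line sends it to CloudWatch Logs.
print(emf_log_line("checkout", 3, "req-123"))
```

Only the keys listed under `Metrics` and `Dimensions` become metric data; `RequestId` stays searchable in the log without creating a new metric per request.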

Distinguishing from CloudTrail #

CloudWatch, X-Ray, and CloudTrail are often confused in questions about choosing an observability tool.

Service | What it shows
CloudWatch | Performance and logs (“how fast / what logs occurred”)
X-Ray | Request path and bottlenecks (“where it’s slow”)
CloudTrail | API call audit (“who called which API”)

“Who deleted this resource” is CloudTrail; “why is it slow” is X-Ray/CloudWatch.
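For the “who deleted this resource” case, a minimal sketch of a CloudTrail event-history lookup might look like this. The helper name is hypothetical, and `cloudtrail_client` is assumed to be `boto3.client("cloudtrail")` or an object with the same `lookup_events` method.

```python
def find_bucket_deleters(cloudtrail_client, max_results=10):
    """Return (username, event_time) pairs for recent DeleteBucket calls."""
    response = cloudtrail_client.lookup_events(
        LookupAttributes=[
            {"AttributeKey": "EventName", "AttributeValue": "DeleteBucket"}
        ],
        MaxResults=max_results,
    )
    return [
        (event.get("Username"), event.get("EventTime"))
        for event in response.get("Events", [])
    ]
```

Note that this answers an audit question only. It says nothing about latency or errors, which is exactly the split between CloudTrail and the other two services.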

Exam question patterns #

  • “Find the bottleneck in a request passing through several microservices.” → X-Ray (service map).
  • “Monitor EC2 memory utilization.” → CloudWatch Agent (custom metric).
  • “Search and aggregate logs with a query.” → CloudWatch Logs Insights.
  • “Create serverless custom metrics without an API call.” → EMF.
  • “Who deleted the S3 bucket.” → CloudTrail.
  • “Notify when the error rate crosses a threshold.” → CloudWatch Alarm + SNS.
  • “Record a business metric (order count) directly.” → Custom metric (PutMetricData) or EMF.

Common traps #

1) The misconception that EC2 memory is a default metric #

CPU and network metrics are default; memory and disk metrics require the Agent.

2) Confusing CloudWatch and CloudTrail #

Performance and logs are CloudWatch; API auditing is CloudTrail.

3) Thinking X-Ray is enabled by code alone #

Lambda needs both active tracing enabled and X-Ray permissions on the execution role.

Wrap-up #

What this post locked in:

  • CloudWatch Logs (groups, streams, Logs Insights, subscription filters); EC2 memory requires the Agent
  • Metrics (standard, custom, high-resolution, dimensions) and Alarms (SNS and rollback integration)
  • X-Ray: trace bottlenecks via segments, subsegments, the service map, and sampling
  • EMF: extract custom metrics from logs alone (recommended for serverless)
  • Distinguishing CloudWatch (performance), X-Ray (path), and CloudTrail (API audit)

Next — Domain 4-2 Optimization and Problem Solving #

Now that we’ve covered how to see what’s happening, the last topic is how to improve it and resolve common errors. In #13 Optimization and Problem Solving, I’ll cover caching layers, Lambda performance tuning, and the error codes and troubleshooting patterns that frequently appear on the exam.
