Chapter 9

EC2 Operations — security group, key pair, SSM

The everyday tools of EC2 operations: Security Group rule design, how SGs differ from NACLs, the limits of key pairs, SSM Session Manager, IMDSv2, and how to harden an instance’s skeleton with an AMI.

In Chapter 8 EC2 and VPC Basics we walked through launching a single EC2 instance. This chapter is about operating that instance: how to design its security rules, how to connect to it, and what to harden so you can launch the same instance many times over.

Operating an EC2 instance mostly boils down to three things. First, you control who can come in with Security Groups. Second, you choose how to connect: SSH and key pairs in the old days, SSM Session Manager these days. Third, you harden the skeleton with an AMI so you can quickly recreate the same instance. Thread these three into one operational picture and the daily work of EC2 operations becomes simple.

The SG patterns we lay out here carry into the load balancer security rules of Chapter 13 ALB / NLB and ACM, and AMIs and ASGs carry into the container operations of Chapter 15 ECS and Fargate.

The structure of a Security Group #

A Security Group (SG) is a stateful firewall attached to an instance (more precisely, to an ENI). You can attach multiple SGs to one instance, and many instances can share one SG.

Inbound vs Outbound #

An SG’s rules go in two directions.

	Inbound	Outbound
Controls what	Incoming traffic	Outgoing traffic
Default	All blocked	All allowed
Touched often	Often	Almost never

Remember the defaults. Inbound is blocked by default, Outbound is allowed by default. So most SG work is adding Inbound rules.

The structure of a rule #

Example inbound rules for a web server SG

Protocol  Port      Source              Description
TCP       80        0.0.0.0/0           HTTP from anywhere
TCP       443       0.0.0.0/0           HTTPS from anywhere
TCP       22        198.51.100.10/32    SSH from my home IP

Each rule is a combination of protocol, port, and source. The Source field can hold two kinds of things.

A CIDR block — 0.0.0.0/0 (all IPs), 10.0.0.0/16 (inside the VPC), 198.51.100.10/32 (a single IP)
Another SG’s ID — sg-0abc..., and this is the truly powerful one.

Pointing an SG at an SG #

The core pattern in operations is an SG pointing at another SG.

The ALB → app EC2 pattern

ALB SG  (sg-alb)
  Inbound:  TCP 443 from 0.0.0.0/0

App SG  (sg-app)
  Inbound:  TCP 8080 from sg-alb     ← the SG itself, not an IP

Even when the ALB’s IPs change (and they do, because an ALB has several IPs that are assigned dynamically), the SG reference follows automatically. Rule maintenance becomes far simpler.
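As a minimal sketch, the sg-app rule above could be added from the CLI as follows. The helper name allow_from_sg is made up for illustration, and the group IDs are whatever your own SGs are called; the part that matters is the --source-group option of aws ec2 authorize-security-group-ingress, which makes the source an SG rather than a CIDR.

```shell
# Hypothetical helper: allow TCP on one port from another security group.
# Usage: allow_from_sg <target-sg-id> <source-sg-id> <port>
allow_from_sg() {
  local target_sg="$1" source_sg="$2" port="$3"
  aws ec2 authorize-security-group-ingress \
    --group-id "$target_sg" \
    --protocol tcp \
    --port "$port" \
    --source-group "$source_sg"
}
```

For the pattern above you would call it as allow_from_sg <sg-app id> <sg-alb id> 8080. No IP address appears anywhere in the rule, which is why nothing needs updating when the ALB’s IPs change.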

Commonly used SG patterns #

3-tier web app SG design

ALB SG (sg-alb)
  in:  443 ← 0.0.0.0/0
  out: all

App SG (sg-app)
  in:  8080 ← sg-alb              ← only the ALB can come in
       22   ← sg-bastion          ← SSH from the Bastion (the old way)
  out: all

DB SG  (sg-db)
  in:  5432 ← sg-app              ← only the app server reaches the DB
  out: all (or closed)

Bastion SG (sg-bastion)
  in:  22 ← 198.51.100.10/32      ← your IP only
  out: all

The key is that rules between tiers flow from SG to SG, not from IP addresses.

When you restrict Outbound #

By default outbound is fully allowed, but you can narrow outbound too, so that a compromised instance cannot send data to arbitrary destinations. This is usually applied first to production DBs and internal resources.

Example of a narrowed outbound

App SG outbound:
  TCP 5432 → sg-db                 ← DB only
  TCP 443  → 0.0.0.0/0             ← for calling external APIs
  TCP 53   → 0.0.0.0/0             ← DNS
  UDP 53   → 0.0.0.0/0             ← DNS
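One way to get to a narrowed outbound like this from the CLI is sketched below. The two helper names are hypothetical. The sketch assumes the SG still has its default allow-all egress rule, and it adds the narrow rule before removing the wide one so the app is never left without a path to the DB.

```shell
# Hypothetical helper: allow outbound TCP on one port to another security group.
# Usage: allow_egress_to_sg <sg-id> <destination-sg-id> <port>
allow_egress_to_sg() {
  local sg="$1" dest_sg="$2" port="$3"
  aws ec2 authorize-security-group-egress \
    --group-id "$sg" \
    --ip-permissions "IpProtocol=tcp,FromPort=$port,ToPort=$port,UserIdGroupPairs=[{GroupId=$dest_sg}]"
}

# Hypothetical helper: remove the default "all traffic to 0.0.0.0/0" egress rule.
# Usage: revoke_default_egress <sg-id>
revoke_default_egress() {
  aws ec2 revoke-security-group-egress \
    --group-id "$1" \
    --ip-permissions '[{"IpProtocol":"-1","IpRanges":[{"CidrIp":"0.0.0.0/0"}]}]'
}
```

The 443 and DNS rules from the example would be added with the same authorize call, using IpRanges with a CIDR instead of UserIdGroupPairs, before calling revoke_default_egress.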

NACL — another layer #

A VPC’s second firewall is the NACL (Network Access Control List). It operates at the subnet level.

	Security Group	NACL
Applies at	Instance (ENI)	Subnet
Stateful	Yes	No (responses must be explicitly allowed too)
Rule kinds	Allow only	Allow + Deny
Evaluation order	All rules evaluated together	By rule number, lowest first (first match wins)
In daily work	Touched daily	Rarely touched

NACLs aren’t used often. The default NACL allows all traffic, and SGs are granular enough. The cases where you touch a NACL are as follows.

When blocking a specific IP range (when you need a Deny — SGs have no Deny).
When temporarily blocking during an attack.
When a compliance requirement needs explicit subnet-level blocking.

The NACL stateless trap #

A NACL is stateless, so you must explicitly allow response traffic too.

NACL rule example — for a TCP 80 outbound to receive a response

Inbound  Allow  TCP  1024-65535  0.0.0.0/0   ← ephemeral port responses
Outbound Allow  TCP  80          0.0.0.0/0

1024-65535 is a range wide enough to cover the ephemeral ports that common clients use (the exact range depends on the client OS). Miss it and the response won’t come back. In an SG this is automatic because it’s stateful, but a NACL needs it explicit.
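The pair of rules above could be created on a custom NACL roughly like this. The helper name and the choice to reuse one rule number for both directions are assumptions for illustration; on the default NACL, rule 100 is already taken by the allow-all entry, so pick a free number.

```shell
# Hypothetical helper: let a subnet call out on TCP 80 and receive the replies.
# Usage: allow_http_out_with_replies <nacl-id> <rule-number>
allow_http_out_with_replies() {
  local nacl_id="$1" rule_number="$2"
  # Outbound: the request itself, to port 80 anywhere.
  aws ec2 create-network-acl-entry \
    --network-acl-id "$nacl_id" --egress \
    --rule-number "$rule_number" --protocol tcp \
    --port-range From=80,To=80 \
    --cidr-block 0.0.0.0/0 --rule-action allow
  # Inbound: the reply, which arrives on an ephemeral port.
  aws ec2 create-network-acl-entry \
    --network-acl-id "$nacl_id" --ingress \
    --rule-number "$rule_number" --protocol tcp \
    --port-range From=1024,To=65535 \
    --cidr-block 0.0.0.0/0 --rule-action allow
}
```

Two separate entries are needed because the NACL keeps no connection state; the second one is the rule people tend to forget.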

The limits of key pairs #

Traditionally, SSH access to EC2 has been done with a key pair.

Creating a key pair + SSH access

# create the key pair
aws ec2 create-key-pair --key-name my-key --query 'KeyMaterial' --output text > my-key.pem
chmod 400 my-key.pem

# specify the key when launching the instance
aws ec2 run-instances --key-name my-key ...

# connect
ssh -i my-key.pem ec2-user@<public-ip>

When the EC2 launches, the public key is automatically added to the default user’s ~/.ssh/authorized_keys on the instance, enabling SSH.

The limits of the key pair model #

The key pair model breaks down as operations scale up.

Losing the key: AWS does not keep the private key, so if you lose it you can’t download it again. You have to recreate the instance, or mount the EBS volume elsewhere and add a new key manually.
The danger of sharing keys: You have to hand the key to teammates, and once it leaks there is no central way to revoke it; you have to remove it from every instance.
Hard to audit: You need separate logging to know who came in and when.
Port 22 exposed to the internet: It becomes an attack surface.
No MFA: Having just the key gets you through.

EC2 Instance Connect #

EC2 Instance Connect pushes a temporary SSH public key to the instance for a single connection, so there is no long-lived key to manage. You still need to allow port 22 in the SG. It is one of the options behind the console’s “Connect” button.

SSM Session Manager — keyless access #

The Session Manager of SSM (AWS Systems Manager) is the new standard for EC2 access. You enter the shell inside the EC2 without opening port 22 and without a key.

The flow of Session Manager

[my computer] ──HTTPS──▶ [SSM Endpoint] ◀──HTTPS──[SSM Agent inside EC2]
                          │
                          ▼
                   IAM permission check

The SSM Agent running inside the EC2 makes an outbound connection over the AWS API, and the console’s shell input flows through that channel. Because the direction is reversed (EC2 goes outbound), no SG inbound port 22 is needed.

Session Manager setup #

An AMI with the SSM Agent installed — recent Amazon Linux 2023 / Ubuntu include it by default.
The AmazonSSMManagedInstanceCore policy attached to the EC2’s IAM Role.
Outbound internet or a VPC Endpoint — an EC2 in a private subnet can also use SSM.
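The checklist above can be sketched as two CLI calls: one that attaches the managed policy to the instance’s role, and one that asks Systems Manager whether it can see the agent. The helper names are hypothetical, and the role name is whatever role your instance profile uses.

```shell
# Hypothetical helper: attach the SSM managed policy to the instance's IAM role.
# Usage: attach_ssm_policy <role-name>
attach_ssm_policy() {
  aws iam attach-role-policy \
    --role-name "$1" \
    --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
}

# Hypothetical helper: print the agent's ping status as Systems Manager sees it.
# Usage: ssm_ping_status <instance-id>
ssm_ping_status() {
  aws ssm describe-instance-information \
    --filters "Key=InstanceIds,Values=$1" \
    --query 'InstanceInformationList[0].PingStatus' \
    --output text
}
```

If ssm_ping_status prints Online, start-session should work. If it prints None, the instance has not registered at all, which usually points back to one of the three items above.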

Connecting via CLI instead of the console

aws ssm start-session --target i-0abc1234def567890

# port forwarding is possible too
aws ssm start-session --target i-0abc... \
  --document-name AWS-StartPortForwardingSession \
  --parameters '{"portNumber":["80"],"localPortNumber":["8080"]}'

key pair vs Session Manager #

	key pair (SSH)	Session Manager
Port 22	Must open	No need to open
Key management	Yourself	None
Authentication	SSH key	IAM (MFA possible)
Audit log	Separate	CloudTrail automatic; session logs to S3 / CloudWatch Logs once configured
Private subnet	Bastion needed	Directly via VPC Endpoint
Port forwarding	`ssh -L`	Possible via `start-session`

In operations, Session Manager is almost always the right answer. For detailed IAM setup see Chapter 2 IAM, and for security see Chapter 6 Security Basics.

Don’t confuse it with CloudShell. The CloudShell from Chapter 5 CloudShell and IAM Identity Center is a browser terminal inside the AWS console where you run the aws cli with your own IAM credentials. Session Manager is a shell inside an EC2 instance.

EC2’s metadata service (IMDS) #

The IMDS (Instance Metadata Service) is the service that, from inside an EC2, returns information about the instance itself (instance ID, region, IAM role credentials, etc.).

IMDSv2 — get a token, then query metadata

TOKEN=$(curl -X PUT http://169.254.169.254/latest/api/token \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

curl -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id

The link-local address 169.254.169.254 is the metadata endpoint that only responds inside an EC2. It’s also where you receive the IAM Role’s temporary credentials. That’s why the aws cli automatically uses those credentials inside an EC2.

IMDSv1 vs IMDSv2 #

In the old days you fetched metadata with a plain GET and no token (IMDSv1). Because there were many incidents where SSRF attacks stole IAM role credentials this way, IMDSv2 was introduced: you first get a session token via PUT, then send it with each GET. For new instances, enabling IMDSv2 only is recommended.

Forcing IMDSv2

aws ec2 modify-instance-metadata-options \
  --instance-id i-0abc... \
  --http-tokens required \
  --http-endpoint enabled
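To apply the same setting across a region, one possible approach is to list the instances that still accept token-less requests and loop over them. This is a sketch with hypothetical helper names; it relies on the metadata-options.http-tokens filter of describe-instances, where the value optional means IMDSv1 is still allowed.

```shell
# Hypothetical helper: list instance IDs that still allow IMDSv1.
list_imdsv1_instances() {
  aws ec2 describe-instances \
    --filters "Name=metadata-options.http-tokens,Values=optional" \
    --query 'Reservations[].Instances[].InstanceId' \
    --output text
}

# Hypothetical helper: require IMDSv2 tokens on every instance found above.
require_imdsv2_everywhere() {
  local id
  for id in $(list_imdsv1_instances); do
    aws ec2 modify-instance-metadata-options \
      --instance-id "$id" \
      --http-tokens required \
      --http-endpoint enabled
  done
}
```

Running list_imdsv1_instances on its own first is a reasonable dry run: anything on the list may have software that still makes token-less calls and should be checked before you flip the switch.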

Creating an AMI — hardening the skeleton #

To launch instances with the same setup many times, fast, there are two routes. One is to create an AMI; the other is the User data approach, where you launch from a base OS AMI and auto-run a setup script.

Creating an AMI #

Right-click the instance in the console and “Create image”, or create one with the CLI.

Creating an AMI

aws ec2 create-image \
  --instance-id i-0abc... \
  --name "my-app-2026-05-24" \
  --description "Node 20 + nginx + my-app v1.2.3" \
  --no-reboot     # option — no reboot (disk consistency may drop slightly)

The created AMI is the instance’s EBS snapshot plus metadata. Launching a new instance from that AMI starts with the same disk state. AMIs are per-region, so for another region you use copy-image.
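A cross-region copy might look like the sketch below. The helper name is hypothetical; note that copy-image is called against the destination region (--region), with the origin given by --source-region.

```shell
# Hypothetical helper: copy an AMI to another region and print the new AMI ID.
# Usage: copy_ami_to_region <ami-id> <source-region> <destination-region> <name>
copy_ami_to_region() {
  local ami_id="$1" src_region="$2" dest_region="$3" name="$4"
  aws ec2 copy-image \
    --source-image-id "$ami_id" \
    --source-region "$src_region" \
    --region "$dest_region" \
    --name "$name" \
    --query 'ImageId' --output text
}
```

The copy gets a new AMI ID in the destination region, so anything that references the AMI (such as a Launch Template) needs the per-region ID.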

User data — the boot script #

This pattern launches from a base OS image and sets the instance up with a boot script, instead of baking an AMI. It’s more flexible than an AMI, and changes are easier to track.

User data example — Amazon Linux 2023

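#!/bin/bash
# Amazon Linux 2023 uses dnf (yum still works as an alias)
dnf update -y
dnf install -y nginx
systemctl enable --now nginx

# fetch the app code (the instance role needs read access to the bucket)
mkdir -p /opt/myapp   # tar -C fails if the target directory is missing
aws s3 cp s3://my-bucket/app.tar.gz /tmp/
tar -xzf /tmp/app.tar.gz -C /opt/myapp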

User data runs once at the instance’s first boot. The log is left in /var/log/cloud-init-output.log.

Golden AMI vs User data #

	Golden AMI	User data
Boot speed	Fast	Slow (script run time)
Change management	Build a new AMI	Edit the script
Reproducibility	Very high	External dependencies (yum repo, S3) may change
Suited for	Fast ASG scale / stable	Dev / fast change

In operations you use both together. You bake the OS and dependencies into the golden AMI, and slot in just the app version with User data.

Auto Scaling Group — automatic recovery #

The ASG (Auto Scaling Group) is the feature that, when an instance dies, launches a new one and automatically registers it with the ALB.

The structure of an ASG

Launch Template (instance template: AMI, type, SG, key, user data)
        │
        ▼
   ┌─────────┐
   │   ASG   │  desired=3  min=2  max=10
   └─────────┘
        │
        ├─── EC2 (AZ a)  ← health check fails → terminate + launch new
        ├─── EC2 (AZ b)
        └─── EC2 (AZ b)

The basic configuration is as follows.

Launch Template — defines which EC2 to launch (AMI, type, SG, IAM, user data).
Desired / Min / Max — the count to always maintain, the minimum, and the maximum.
Health Check — based on the EC2 itself (EC2) or the ALB target group (ELB).
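Put together, the three settings above map onto a single CLI call. The sketch below assumes a Launch Template and an ALB target group already exist; the helper name is hypothetical and the 300-second health check grace period is an illustrative value, not something this chapter prescribes.

```shell
# Hypothetical helper: create an ASG from a Launch Template behind an ALB target group.
# Usage: create_web_asg <asg-name> <template-name> <subnet-ids-csv> <target-group-arn> <min> <max> <desired>
create_web_asg() {
  local name="$1" template="$2" subnets="$3" tg_arn="$4"
  local min="$5" max="$6" desired="$7"
  aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name "$name" \
    --launch-template "LaunchTemplateName=$template,Version=\$Latest" \
    --min-size "$min" --max-size "$max" --desired-capacity "$desired" \
    --vpc-zone-identifier "$subnets" \
    --target-group-arns "$tg_arn" \
    --health-check-type ELB \
    --health-check-grace-period 300   # assumed value: time allowed for boot before checks count
}
```

Passing subnets from more than one AZ is what spreads the instances across AZs as in the diagram, and --health-check-type ELB is what makes a failing target group health check trigger a replacement.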

For container workloads, Chapter 15 ECS and Fargate is a smoother alternative. ECS takes over, at the container level, the job an ASG does for instances.

Common pitfalls #

“Why can’t the ALB reach the EC2?”: Check from top to bottom. (1) Does the ALB SG’s outbound allow the port that the EC2 SG’s inbound opens? (2) Is the ALB SG present as the source in the EC2 SG inbound? (3) Does the OS-level firewall on the EC2 (firewalld, ufw) also allow that port? (4) Does the ALB target group’s health check path respond 200? (5) Is the EC2 listening on that port (ss -tlnp)? It’s mostly (1) or (2). If you wrote the SG source as an IP, change it to the ALB’s SG itself.
You have no key but need to get into the EC2: If Session Manager is on, use aws ssm start-session. If it’s not, stop the instance, detach the EBS volume and mount it on another EC2, edit ~/.ssh/authorized_keys, and reattach it; or snapshot the EBS volume and launch a new instance with a new key.
Leaving Outbound all open and leaking data: When an EC2 is compromised and outbound is wide open, the attacker can send data to arbitrary IPs. For DB servers and internal systems, narrowing outbound too is the safest.
Blocking with a NACL and getting no response back: If you forget that a NACL is stateless and allow only the outbound request, the inbound response is blocked. It’s almost always safest to leave NACLs at the default and touch only SGs.
Operating with IMDSv1 left on: If an old AMI or old setup is still running with IMDSv1, it becomes an SSRF attack surface. Apply --http-tokens required to all instances.
AMIs too large, so boot becomes slow: If you snapshot a long-running instance straight into an AMI and it grows past 5GB, boot time increases. Before creating the AMI, clean up logs / cache / temp files (yum clean all, etc.), run cloud-init clean so init runs again on the next boot, and empty swap and the journal.
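For step (4) of the first pitfall, the target group itself will tell you what the ALB thinks of each instance. A small sketch, with a hypothetical helper name:

```shell
# Hypothetical helper: print each target's ID, health state, and failure reason.
# Usage: target_health <target-group-arn>
target_health() {
  aws elbv2 describe-target-health \
    --target-group-arn "$1" \
    --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State,TargetHealth.Reason]' \
    --output text
}
```

A healthy target shows None in the reason column. For an unhealthy one, the reason code helps separate a timeout (often an SG or firewall problem, steps 1 to 3) from a bad response code (step 4).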

Exercises #

Redraw the four SGs (alb · app · db · bastion) from “3-tier web app SG design” without looking, marking for each inbound rule whether the source is an IP or an SG. Then, based on §“Pointing an SG at an SG”, explain in one sentence why you don’t need to change the rules even when the ALB’s IP changes.
Looking at the key pair vs Session Manager comparison table, connect how Session Manager resolves each of the five items in §“The limits of the key pair model”. Also write one sentence on how this dovetails with the least privilege of Chapter 6 Security Basics.
In a situation where you want to auto-recover instances of the same setup using both a Golden AMI and User data, write down separately what you’d put in each (OS · dependencies · app version). This division reappears as the container image and task definition in Chapter 15 ECS and Fargate.

In short: An SG is an instance-level stateful firewall, and the pattern of an SG pointing at an SG is more powerful than IP-based rules. A NACL is subnet-level and stateless, and it is rarely touched. For access, SSM Session Manager, which needs neither port 22 nor a key, is the new standard, and forcing IMDSv2 shuts down the common SSRF route to instance credentials. You harden the skeleton by using a Golden AMI and User data together, and an ASG handles automatic recovery.

Next chapter #

The basics of EC2 are in place. Next, Chapter 10 S3 moves to the object storage that’s often handled alongside EC2. We’ll lay out everyday patterns like the shape of a bucket, policies and Public Access Block, static site hosting, and presigned URLs.