If you want to automate EC2 snapshots, the first thing to internalize is that an EC2 instance is not a backup, and a snapshot you take by hand "when you remember" is not a strategy. I learned this when an EBS volume on a client's production box went read-only after a botched kernel update, and the last manual snapshot was 23 days old. The fix is boring and reliable: tag your volumes, let AWS Data Lifecycle Manager (DLM) create EBS snapshots on a schedule with a retention count, copy them to a second region for disaster recovery, and then actually test a restore. This post walks the exact CLI commands I run in production.
Why isn't a running EC2 instance already a backup?
Because everything that can corrupt your data lives on the same instance. A bad migration, an `rm -rf` with a stray space in the path, ransomware, or a silently failing disk all hit the live volume. The instance staying up does nothing for you. What protects you is a point-in-time copy stored independently. On AWS that copy is an EBS snapshot: an incremental, block-level backup of the volume kept in S3-backed storage that you never see directly. The question is never whether to snapshot. It is how to make snapshots happen without a human in the loop, and how to stop them piling up forever and quietly inflating your bill.
- A snapshot is incremental: the first is a full copy, each later one stores only changed blocks, so cost is far lower than the raw volume size.
- Snapshots are regional by default. If your whole region has a bad day, a snapshot that never left that region is no DR plan.
- Without retention, snapshots accumulate. I have inherited accounts with 4,000+ orphaned snapshots from a cron job nobody owned.
- Deleting the source volume does not delete its snapshots, and deleting one snapshot in an incremental chain does not corrupt the others.
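To check whether an account already has this kind of pile-up, a quick age audit helps. The following is a minimal sketch, assuming the AWS CLI is installed and configured for the region you care about. The function name `list_old_snapshots` is made up for illustration and is not an AWS command. It filters on the snapshot start time in `awk`, relying on the fact that ISO 8601 timestamps sort lexicographically.

```shell
# list_old_snapshots CUTOFF  (hypothetical helper, not part of the AWS CLI)
# Print snapshots owned by this account that started before CUTOFF,
# where CUTOFF is an ISO 8601 date such as 2024-01-01.
list_old_snapshots() {
  aws ec2 describe-snapshots \
    --owner-ids self \
    --query 'Snapshots[].[SnapshotId,VolumeId,StartTime,VolumeSize]' \
    --output text |
    # Columns: 1=SnapshotId 2=VolumeId 3=StartTime 4=VolumeSize (GiB).
    # ISO timestamps compare correctly as plain strings.
    awk -v cutoff="$1" '$3 < cutoff { print $1, $2, $3, $4 "GiB" }'
}
```

Calling `list_old_snapshots 2024-01-01` prints one line per matching snapshot. It only reports; it deletes nothing, so it is safe to run against an inherited account before deciding what to clean up.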
DLM or AWS Backup: which one should I use?
Two AWS-native options cover this. If you only need scheduled EBS snapshots driven by tags, use Data Lifecycle Manager. It is purpose-built for EBS, free (you pay only for the snapshot storage), and the policy is a small JSON document. If you need one backup policy spanning EBS, RDS, DynamoDB, EFS, FSx, and more, with a central backup vault and compliance reporting, use AWS Backup instead. For a single fleet of Linux web servers, DLM is the right tool and what I will show here. For a mixed estate where auditors want one pane of glass, AWS Backup earns its slightly heavier setup.
A backup you have never restored is not a backup. It is a hope with a storage bill attached.
How do I tag volumes and create the DLM execution role?
DLM finds what to snapshot by matching tags, so the policy is only as good as your tagging discipline. Tag the volumes (not just the instances) you want backed up. I use a single deliberate key like `Backup=daily` so nothing gets swept in by accident. DLM also needs an IAM service role it can assume to create and delete snapshots on your behalf. Create the default role once per account with the CLI; note the ARN it produces drops the `service-role/` path that the console version uses.
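Because the policy only sees tagged volumes, it can be worth auditing for volumes that would be skipped. This is a rough sketch under stated assumptions: the AWS CLI is configured for the right region, the tag is exactly `Backup=daily` as in this post, and `list_unprotected_volumes` is a hypothetical helper name, not an AWS command.

```shell
# list_unprotected_volumes  (hypothetical helper, not part of the AWS CLI)
# Print the IDs of EBS volumes in the current region whose Backup tag
# is missing or set to something other than "daily".
list_unprotected_volumes() {
  aws ec2 describe-volumes \
    --query 'Volumes[].[VolumeId, Tags[?Key==`Backup`] | [0].Value]' \
    --output text |
    # Column 2 is the Backup tag value, or "None" when the tag is absent.
    awk '$2 != "daily" { print $1 }'
}
```

An empty result means every volume in the region carries the tag. Anything it prints is a volume the policy would not snapshot.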
# Tag a specific EBS volume so DLM will pick it up
aws ec2 create-tags \
  --resources vol-0a1b2c3d4e5f67890 \
  --tags Key=Backup,Value=daily
# Create the DLM snapshot service role once per account (CLI form, no service-role/ path)
aws dlm create-default-role --resource-type snapshot
# Confirm the role ARN you will reference in the policy
aws iam get-role \
  --role-name AWSDataLifecycleManagerDefaultRole \
  --query 'Role.Arn' --output text
# -> arn:aws:iam::123456789012:role/AWSDataLifecycleManagerDefaultRole
What does a DLM policy with retention and cross-region copy look like?
Here is the policy I actually deploy. It targets volumes tagged `Backup=daily`, snapshots once every 24 hours at 03:00 UTC (pick a low-traffic window), keeps the 7 most recent locally, and copies each snapshot to a second region where it is kept for 14 days. `CopyTags: true` carries your tags onto the snapshot so cost allocation and cleanup stay sane. Save this as `policy-details.json`.
{
  "ResourceTypes": ["VOLUME"],
  "TargetTags": [
    { "Key": "Backup", "Value": "daily" }
  ],
  "Schedules": [
    {
      "Name": "DailySnapshots",
      "CopyTags": true,
      "CreateRule": {
        "Interval": 24,
        "IntervalUnit": "HOURS",
        "Times": ["03:00"]
      },
      "RetainRule": {
        "Count": 7
      },
      "CrossRegionCopyRules": [
        {
          "TargetRegion": "us-west-2",
          "Encrypted": true,
          "CopyTags": true,
          "RetainRule": {
            "Interval": 14,
            "IntervalUnit": "DAYS"
          }
        }
      ]
    }
  ]
}
Create the policy with the role ARN from the previous step. `--state ENABLED` makes it live immediately; use `DISABLED` if you want to inspect it first. DLM returns a `PolicyId` you can store in your infra notes.
aws dlm create-lifecycle-policy \
  --description "Daily EBS snapshots, 7 local, 14d in us-west-2" \
  --state ENABLED \
  --execution-role-arn arn:aws:iam::123456789012:role/AWSDataLifecycleManagerDefaultRole \
  --policy-details file://policy-details.json
# Returns:
# { "PolicyId": "policy-0123456789abcdef0" }
How do I actually test a restore?
This is the step everyone skips, and it is the only one that proves the whole thing works. Restoring from a snapshot does not overwrite anything in place: you create a brand-new volume from the snapshot, attach it to an instance, and mount it to verify the data is intact. Do this in the same Availability Zone as the target instance, because a volume can only attach to an instance in its own AZ. Run this drill on a schedule, not once. I put a reminder on the calendar quarterly.
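The AZ constraint is easy to get wrong when zone names are typed by hand, so it can help to look the zone up from the instance itself. This is a small sketch, assuming a configured AWS CLI; `instance_az` is a hypothetical helper name used only for illustration.

```shell
# instance_az INSTANCE_ID  (hypothetical helper, not part of the AWS CLI)
# Print the Availability Zone the given instance is running in.
instance_az() {
  aws ec2 describe-instances \
    --instance-ids "$1" \
    --query 'Reservations[0].Instances[0].Placement.AvailabilityZone' \
    --output text
}
```

The result can be captured with `az="$(instance_az i-0abc123def456789)"` and passed to `aws ec2 create-volume` as `--availability-zone "$az"`, so the restored volume always lands next to the instance it will attach to.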
# 1. Find the most recent snapshot for the source volume
aws ec2 describe-snapshots \
  --owner-ids self \
  --filters "Name=volume-id,Values=vol-0a1b2c3d4e5f67890" \
  --query 'reverse(sort_by(Snapshots,&StartTime))[0].SnapshotId' \
  --output text
# -> snap-0fedcba9876543210
# 2. Create a new volume from that snapshot, in the test instance's AZ
aws ec2 create-volume \
  --snapshot-id snap-0fedcba9876543210 \
  --availability-zone us-east-1a \
  --volume-type gp3 \
  --tag-specifications 'ResourceType=volume,Tags=[{Key=Name,Value=restore-test}]'
# -> returns vol-0newrestore12345
# 3. Attach it to a running test instance as /dev/sdf
aws ec2 attach-volume \
  --volume-id vol-0newrestore12345 \
  --instance-id i-0abc123def456789 \
  --device /dev/sdf
# 4. On the instance: mount read-only and verify the files exist
sudo mkdir -p /mnt/restore-test
sudo mount -o ro /dev/xvdf1 /mnt/restore-test
ls -la /mnt/restore-test
If `mount` complains, run `lsblk` to confirm the kernel device name (Nitro instances often present the volume as `/dev/nvme1n1` even though you attached it as `/dev/sdf`). When you have eyeballed the data, unmount, detach, and delete the test volume so it does not sit there costing money.
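The cleanup is the part most likely to be forgotten, so here is one way it could be scripted. Treat it as a sketch under assumptions: `delete_restore_volume` is a hypothetical helper name, the volume has already been unmounted on the instance (for example with `sudo umount /mnt/restore-test`), and the AWS CLI is configured for the same region as the test volume.

```shell
# delete_restore_volume VOLUME_ID  (hypothetical helper, not part of the AWS CLI)
# Detach the test volume, wait until EC2 reports it as available,
# then delete it. Each step runs only if the previous one succeeded,
# so a failed detach never leads to a delete attempt.
# Unmount the filesystem on the instance BEFORE calling this.
delete_restore_volume() {
  aws ec2 detach-volume --volume-id "$1" &&
    aws ec2 wait volume-available --volume-ids "$1" &&
    aws ec2 delete-volume --volume-id "$1"
}
```

With the IDs used in this post, the call would be `delete_restore_volume vol-0newrestore12345`. Double-check the ID first: the function deletes whatever volume it is given.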
What about the cost of all these snapshots?
Snapshots are cheap per GB but not free, and a forgotten policy is a slow leak. The retention `Count` is your cost ceiling: 7 daily snapshots of a volume with low daily churn might total little more than one full copy of the data rather than seven, because only changed blocks are stored after the first. The cross-region copy at least doubles the storage for whatever you replicate (here the copies are kept for 14 days) and adds inter-region data transfer charges, so only copy what genuinely needs DR. I dig into this trade-off, plus tools to find orphaned snapshots, in reducing your AWS bill. If you are still standing up the instances these volumes belong to, start with AWS EC2 for beginners, and when you eventually move that workload, snapshots are also your safety net during a live site migration to a new server.
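A back-of-the-envelope estimate makes the retention trade-off concrete. The sketch below is a deliberate simplification: it assumes stored data is roughly one full copy of the used blocks plus one day of changed blocks per additional retained snapshot, and it ignores compression and blocks that change more than once. Both function names are hypothetical, and the price passed in is an assumption you should replace with the current rate for your region.

```shell
# estimate_snapshot_gib USED_GIB DAILY_CHANGE_GIB RETAIN_COUNT
# Rough stored size: one full copy plus (count - 1) daily deltas.
estimate_snapshot_gib() {
  awk -v used="$1" -v churn="$2" -v count="$3" \
    'BEGIN { printf "%.1f\n", used + (count - 1) * churn }'
}

# estimate_monthly_cost STORED_GIB PRICE_PER_GIB_MONTH
# Multiply stored size by a per-GiB monthly price (an assumed input).
estimate_monthly_cost() {
  awk -v gib="$1" -v price="$2" 'BEGIN { printf "%.2f\n", gib * price }'
}

# Example: 40 GiB of used data, about 2 GiB changed per day, 7 retained.
# estimate_snapshot_gib 40 2 7     # -> 52.0
# estimate_monthly_cost 52 0.05    # -> 2.60 (0.05 is an assumed price)
```

In that example the seven snapshots come to about 52 GiB, against 280 GiB if each were a full copy. Running the same numbers for the second region, with its own retention, shows what the DR copy adds.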
Set this up once and your backups stop depending on you remembering. Tag the volumes, ship the DLM policy from version control, set a retention count you can defend on cost, replicate to a second region, and then prove a restore works before you need it. The day a volume dies, the difference between a five-minute restore and a resume-writing incident is entirely the work you did on a quiet afternoon months earlier. Do the boring thing now.

