Red Hat Certified Engineer (RHCE) #8 Error handling: block/rescue/always, failed_when, ignore_errors
In #7 Jinja2 templates you learned how to generate per-host config files dynamically. But a playbook stops the moment one task fails on a host, skipping that host’s remaining tasks. When the exam asks for something like “run the cleanup work no matter what, even if a specific step fails” or “judge success from the result of a command,” the default behavior alone won’t earn you the points. This post covers error handling — taking direct control of a playbook’s failure flow.
Error handling splits two ways. One is deciding how to react when a failure occurs with block/rescue/always and ignore_errors. The other is redefining what counts as a failure with failed_when and changed_when. Whether you can handle changed_when — especially when working with non-idempotent modules like command and shell — is a recurring exam point.
Default behavior: it stops on failure #
To understand error handling, you first need to know Ansible’s default failure behavior. When you run a playbook, Ansible executes tasks top to bottom for each host. When one task fails on some host, Ansible drops that host from the rest of the run and does not execute its remaining tasks. Other hosts are not affected: they continue through the remaining tasks and stop only if they hit a failure of their own.
This default makes sense because forcing later steps after an earlier one has failed can cause greater damage. So error handling means deliberately changing this default behavior. You explicitly decide whether to ignore a failure, whether to run cleanup work even on failure, and what to count as a failure.
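A minimal sketch for observing this default yourself. It assumes an inventory group named webservers, and it uses /usr/bin/false only because that command always exits non-zero.

```yaml
---
# Sketch: observe the default stop-on-failure behavior.
# "webservers" is an assumed inventory group.
- name: default failure behavior
  hosts: webservers
  gather_facts: false
  tasks:
    - name: this task always fails (exit code 1)
      ansible.builtin.command: /usr/bin/false

    - name: not reached on a host where the task above failed
      ansible.builtin.debug:
        msg: "you will not see this message"
```

Every sample later in this post is a variation on this playbook: each key changes what happens after the first task fails.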
ignore_errors: let the failure slide #
Adding ignore_errors: yes makes Ansible merely mark the task as failed and move on to the next task even if it fails. It does not stop the host.
- name: ignore_errors example
  hosts: webservers
  become: true
  tasks:
    - name: try to remove a package that may or may not exist
      ansible.builtin.dnf:
        name: legacy-tool
        state: absent
      ignore_errors: yes

    - name: this task runs even if the one above fails
      ansible.builtin.debug:
        msg: "moved on to the cleanup step"

There are two things to watch out for. First, ignore_errors acts after the task has already ended in failure; it does not make the failure go away. The task output still shows the error, followed by "...ignoring", and the play recap counts it under ignored rather than failed. Second, ignore_errors cannot ignore an unreachable situation where the host couldn’t even be connected to. unreachable is a separate category; to ignore that too, use ignore_unreachable: yes.
ignore_errors is a tool you apply locally to “one or two tasks that are fine to fail.” To handle the failure flow structurally across multiple tasks, use block.
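Because an ignored failure is still recorded, a later task can react to it. The sketch below assumes a hypothetical service named legacy-tool that may be missing on some hosts; the variable name stop_result is also just an illustration.

```yaml
---
# Sketch: react to an ignored failure, and tolerate an unreachable host.
# "webservers", "legacy-tool", and "stop_result" are assumed names.
- name: ignore_errors with a follow-up check
  hosts: webservers
  become: true
  tasks:
    - name: stop a service that may not exist on every host
      ansible.builtin.service:
        name: legacy-tool
        state: stopped
      register: stop_result
      ignore_errors: true

    - name: report only on hosts where the stop attempt failed
      ansible.builtin.debug:
        msg: "could not stop legacy-tool on {{ inventory_hostname }}"
      when: stop_result is failed

    - name: keep the host in the play even if this connection attempt fails
      ansible.builtin.ping:
      ignore_unreachable: true
```

The `is failed` test reads the registered result, so it works even though ignore_errors let the play continue.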
block/rescue/always: structuring exceptions #
block groups multiple tasks into a single unit. Add rescue and always to it and you get a structure equivalent to try/catch/finally in other languages.
- block. Groups the tasks to run in the normal flow.
- rescue. Runs only when a task inside block fails.
- always. Always runs last, regardless of success or failure.
- name: block/rescue/always example
  hosts: webservers
  become: true
  tasks:
    - name: attempt app deployment and recover
      block:
        - name: place the new config file
          ansible.builtin.template:
            src: app.conf.j2
            dest: /etc/myapp/app.conf

        - name: restart the service
          ansible.builtin.service:
            name: myapp
            state: restarted
      rescue:
        - name: roll back to the backup config on failure
          ansible.builtin.copy:
            src: /etc/myapp/app.conf.bak
            dest: /etc/myapp/app.conf
            remote_src: true

        - name: restart the service after rollback
          ansible.builtin.service:
            name: myapp
            state: restarted
      always:
        - name: record the outcome
          ansible.builtin.debug:
            msg: "the deployment attempt is done. I'll check the logs"

Let me lay out the core behavior. When any task inside block fails, the remaining tasks in block are skipped from that point and control moves to rescue. If all of rescue’s tasks succeed, that host is treated as a success, not a failure. If block finished without failure, rescue does not run. always runs last in both cases.
The fact that a host is treated as a success when rescue succeeds is especially useful on the exam. A requirement like “attempt X, but if it fails recover with Y and keep the playbook going” is expressed exactly with this structure.
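To watch this flow without touching a real service, a forced failure is enough. This is a sketch under the same assumption of a webservers group; the final task sits outside the block to show that the host keeps going once rescue has succeeded.

```yaml
---
# Sketch: trace the block/rescue/always flow with a forced failure.
# "webservers" is an assumed inventory group.
- name: observe the block flow
  hosts: webservers
  gather_facts: false
  tasks:
    - name: demo block
      block:
        - name: forced failure (always exits non-zero)
          ansible.builtin.command: /usr/bin/false

        - name: skipped because the task above failed
          ansible.builtin.debug:
            msg: "not reached"
      rescue:
        - name: runs because a block task failed
          ansible.builtin.debug:
            msg: "recovering"
      always:
        - name: runs whether or not the block failed
          ansible.builtin.debug:
            msg: "cleanup"

    - name: still runs, because rescue succeeded
      ansible.builtin.debug:
        msg: "play continues on this host"
```

In the play recap the host should show up under rescued rather than failed, which is how you can confirm the recovery path was taken.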
Magic variables for checking the cause inside rescue #
In a rescue block you can use a few variables that tell you the cause of the failure.
- ansible_failed_task. Holds information about the failed task. Get its name with ansible_failed_task.name.
- ansible_failed_result. Holds the result (return value) of the failed task.
rescue:
  - name: print which task failed
    ansible.builtin.debug:
      msg: "failed task: {{ ansible_failed_task.name }}"

failed_when: define your own failure condition #
By default, Ansible judges success or failure from the status a module returns. But the command and shell modules look only at the command’s exit code, so they can’t catch a case where “the exit code is 0 but the output contains an error message.” For this, you decide what counts as a failure yourself with failed_when.
The standard pattern is to first capture the command result into a variable with register, then judge using that variable as the condition.
- name: failed_when example
  hosts: appservers
  tasks:
    - name: run the health check command
      ansible.builtin.command: /usr/local/bin/healthcheck
      register: health
      failed_when: "'ERROR' in health.stdout"

    - name: judge by exit code and output content together
      ansible.builtin.command: /usr/local/bin/deploy
      register: deploy
      failed_when: deploy.rc != 0 or 'FAILED' in deploy.stdout

failed_when marks the task as failed when the condition is true. Conversely, setting failed_when: false means it is never treated as a failure no matter what happens. This looks similar to ignore_errors but means something different. ignore_errors is “failed but ignored,” whereas failed_when: false is “not counted as a failure in the first place.” The latter is not counted as a failure in the run summary either.
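failed_when also works in the other direction: it can accept a non-zero exit code that is a normal answer rather than an error. The sketch below relies on grep exiting with 1 when nothing matches. The log path and the DEGRADED keyword are assumptions made up for illustration.

```yaml
---
# Sketch: accept an expected non-zero exit code, and AND two conditions.
# The log path and the 'DEGRADED' keyword are hypothetical.
- name: failed_when with expected exit codes
  hosts: appservers
  tasks:
    - name: count ERROR lines in the application log
      ansible.builtin.command: grep -c ERROR /var/log/myapp/app.log
      register: errcount
      changed_when: false
      # grep exits 0 on a match, 1 on no match, greater than 1 on a real error
      failed_when: errcount.rc not in [0, 1]

    - name: fail only when both conditions hold (list items are ANDed)
      ansible.builtin.command: /usr/local/bin/healthcheck
      register: health
      changed_when: false
      failed_when:
        - health.rc != 0
        - "'DEGRADED' not in health.stdout"
```

Writing failed_when as a list is shorthand for joining the conditions with and; for or, keep them in a single expression as in the deploy task above.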
The structure of a register result #
To use failed_when and changed_when properly, you need to know the structure of the result captured with register. The keys you commonly use with the command/shell family are these.
- rc. The command’s exit code (return code).
- stdout. Holds the entire standard output as a single string.
- stdout_lines. Holds the standard output as a list of lines.
- stderr. The standard error output.
When looking for a specific message in a condition, use the form 'message' in result.stdout, and use stdout_lines when you need to work line by line.
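The quickest way to get familiar with these keys is to print a registered result. A minimal sketch, assuming the appservers group used earlier; hosts_file is just an illustrative variable name.

```yaml
---
# Sketch: inspect the keys of a registered command result.
# "hosts_file" is an illustrative variable name.
- name: explore a registered result
  hosts: appservers
  gather_facts: false
  tasks:
    - name: read a file with a read-only command
      ansible.builtin.command: cat /etc/hosts
      register: hosts_file
      changed_when: false

    - name: show the exit code and the number of output lines
      ansible.builtin.debug:
        msg: "rc={{ hosts_file.rc }}, lines={{ hosts_file.stdout_lines | length }}"

    - name: show the output as a list of lines
      ansible.builtin.debug:
        var: hosts_file.stdout_lines
```

Printing the whole variable with var: hosts_file is also worth doing once, since it shows every key the module returned.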
changed_when: define your own change condition #
changed_when decides what counts as a change. It pairs with failed_when in that it handles the changed-state judgment rather than the failure judgment. The reason this key matters in RHCE is the idempotency problem of the command and shell modules.
The command and shell modules report changed unconditionally whenever they run. Even a read-only command that doesn’t actually change the system is counted as changed, so changed doesn’t drop to 0 on the second run. On an exam where idempotency is central to the grading, this costs you points.
So you add changed_when: false to a read-only command to declare no change.
- name: shore up idempotency with changed_when
  hosts: appservers
  tasks:
    - name: check the current version (read-only, not a change)
      ansible.builtin.command: /usr/local/bin/myapp --version
      register: appver
      changed_when: false

    - name: judge whether it changed based on the result
      ansible.builtin.command: /usr/local/bin/sync-config
      register: sync
      changed_when: "'updated' in sync.stdout"
      failed_when: sync.rc != 0

As in the second task, it’s common to use failed_when and changed_when together in one task. The pattern is “fail if the exit code isn’t 0, and count as changed if the output contains updated.” When you have no choice but to use the command module, adding these two keys lets you mimic an idempotency close to that of a dedicated module.
Consider a module before reaching for command #
Before shoring up idempotency with changed_when, the principle is to replace command/shell with a dedicated module. Handle package installs with dnf, file placement with copy/template, and service control with service, and the idempotency lives inside the module. Use command/shell only when there’s no suitable module, and shore up idempotency and failure judgment with changed_when/failed_when when you do.
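The difference is easiest to see side by side. In this sketch httpd is only an example package, and the first task is there purely for contrast; you would not keep both in a real playbook.

```yaml
---
# Sketch: the same intent written with command and with dedicated modules.
# "httpd" is just an example package and service name.
- name: prefer dedicated modules
  hosts: webservers
  become: true
  tasks:
    # Reports changed on every run, even when httpd is already installed.
    - name: install httpd with command (not idempotent)
      ansible.builtin.command: dnf install -y httpd

    # Reports ok when httpd is already installed, changed only on install.
    - name: install httpd with the dnf module (idempotent)
      ansible.builtin.dnf:
        name: httpd
        state: present

    - name: make sure httpd is running and enabled
      ansible.builtin.service:
        name: httpd
        state: started
        enabled: true
```

On a second run the two module-based tasks should report ok, while the command task reports changed again.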
Keys that handle failure across the whole play #
So far this has been task- and block-level handling. There are also keys that set the failure strategy for the whole play across multiple hosts, so let me lay them out one line each.
- any_errors_fatal: true. If even one host fails a task, Ansible finishes that task on the other hosts in the current batch and then stops the play for all hosts. Use it for work where the whole thing must stop if even one machine goes off.
- max_fail_percentage: 30. Aborts the play if the percentage of failed hosts exceeds the given number. It expresses a tolerance limit like “stop if more than 30% fail.”
- force_handlers: true. Even if the play fails partway through, it force-runs handlers already queued by notify before finishing. Normally handlers don’t run on failure, so use this when you have cleanup handlers that absolutely must run.
These keys go at the play level, not on a task. They aren’t asked often, but when you spot wording like “abort everything if even one machine fails” in the requirements, reach for any_errors_fatal.
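Here is roughly where these keys sit in a playbook. This is a sketch: the dbservers group, the schema-check command, and the serial value of 10 are assumptions chosen for illustration.

```yaml
---
# Sketch: play-level failure strategy keys.
# "dbservers", "schema-check", and "serial: 10" are hypothetical.
- name: stop everything if any database host fails
  hosts: dbservers
  any_errors_fatal: true
  tasks:
    - name: run the schema check
      ansible.builtin.command: /usr/local/bin/schema-check
      changed_when: false

- name: rolling update that tolerates a limited failure rate
  hosts: webservers
  become: true
  serial: 10
  max_fail_percentage: 30
  force_handlers: true
  tasks:
    - name: place the config file
      ansible.builtin.template:
        src: app.conf.j2
        dest: /etc/myapp/app.conf
      notify: restart myapp

    - name: run a post-deploy check that may fail
      ansible.builtin.command: /usr/local/bin/healthcheck
      changed_when: false

  handlers:
    - name: restart myapp
      ansible.builtin.service:
        name: myapp
        state: restarted
```

In the second play, force_handlers means a host that changed its config and then failed the health check still gets its restart handler run.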
Exam points #
Let me lay out the spots where error handling makes or breaks your score.
- Judging failure from a command result. Capturing command/shell with register and then judging failure from the output content or exit code with failed_when is an exam regular. Practice the 'message' in result.stdout expression until you can write it from memory.
- Shoring up command idempotency. Add changed_when: false to read-only commands so changed drops to 0 on the second run. It’s a frequent point-loser in idempotency checks.
- Writing a recovery flow with block/rescue. “Attempt, but recover and keep going if it fails” is expressed with block/rescue. Remember that the host is treated as a success when rescue succeeds.
- Guaranteeing cleanup work with always. Cleanup work that must run regardless of success or failure goes in always.
- The difference between failed_when: false and ignore_errors. The former doesn’t count it as a failure at all; the latter still records the failure (the play recap lists it under ignored) and only lets the host continue. Read the wording of the requirement and use the two distinctly.
- Run it twice to verify. Build the habit of running the playbook you wrote twice and checking that changed/failed come out as intended on the second run.
Wrap-up #
What this post locked in:
- Default behavior. When a task fails, that host stops immediately and skips its remaining tasks.
- ignore_errors. Ignores the failure and moves to the next task, but the failure is still recorded: the play recap counts it under ignored. It cannot ignore unreachable.
- block/rescue/always. The try/catch/finally structure. The host is treated as a success when rescue succeeds, and always always runs last.
- failed_when. You decide what counts as a failure yourself. Capture the command result with register and judge by output/exit code.
- changed_when. You decide what counts as a change. Shore up idempotency on read-only commands with changed_when: false.
- Play-level keys. Set the whole play’s failure strategy with any_errors_fatal, max_fail_percentage, and force_handlers.
Next: tags and conditionals #
You’ve got control of the failure flow down. Now it’s time for the conditions that decide when to run a task, and the flow that runs the same task repeatedly.
In #9 Tags and conditionals: when, loop, until, we’ll write our way through conditional execution with when, iterating over a list with loop, retrying until a condition is met with until, and the exam-regular pattern of running only part of a playbook by selecting it with tags.