job-retry: terraform apply always fails creating retry SQS event source mappings #5097

@phergoualch

Summary

Enabling job_retry causes terraform apply to fail when creating the SQS event source mapping for the retry Lambda.

The failure is:

InvalidParameterValueException: The function execution role does not have permissions to call ReceiveMessage on SQS

What went wrong

The job-retry submodule creates these resources in the same apply:

  • the retry SQS queue
  • the retry Lambda
  • the retry Lambda IAM role
  • the inline IAM policy that grants the retry Lambda access to the retry queue
  • the Lambda event source mapping from the retry queue to the retry Lambda

The retry policy is defined correctly and already includes the required permissions:

  • sqs:ReceiveMessage
  • sqs:GetQueueAttributes
  • sqs:DeleteMessage
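For context, a policy of that shape can be sketched as below. This is an illustration under stated assumptions, not the submodule's actual source: only the resource addresses aws_iam_role_policy.job_retry and aws_sqs_queue.job_retry_check_queue come from this issue, while the role, its name, and the queue and policy names are hypothetical placeholders.

```hcl
# Illustrative sketch only; the real submodule wires these up differently.
resource "aws_sqs_queue" "job_retry_check_queue" {
  name = "example-job-retry" # hypothetical queue name
}

# Hypothetical execution role standing in for the retry Lambda's role.
resource "aws_iam_role" "retry_lambda" {
  name = "example-job-retry-lambda"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "lambda.amazonaws.com" }
    }]
  })
}

# Inline policy granting the three SQS actions that an SQS event source needs.
resource "aws_iam_role_policy" "job_retry" {
  name = "example-job-retry-sqs" # hypothetical policy name
  role = aws_iam_role.retry_lambda.name

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "sqs:ReceiveMessage",
        "sqs:GetQueueAttributes",
        "sqs:DeleteMessage",
      ]
      Resource = aws_sqs_queue.job_retry_check_queue.arn
    }]
  })
}
```

Note that nothing in a policy like this is referenced by the event source mapping, so Terraform has no implicit ordering between the two resources.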

However, the event source mapping does not explicitly depend on that IAM policy resource.

Because of that, Terraform can create the event source mapping before the retry Lambda role has the queue permissions attached. AWS validates the execution role during CreateEventSourceMapping, does not see ReceiveMessage yet, and rejects the mapping.

Observed behavior

In my case this was not intermittent. With job_retry enabled, apply failed consistently. It did not work even once before the dependency fix.

After adding an explicit dependency from the event source mapping to the retry IAM policy, the same apply succeeded cleanly.

Reproduction

  1. Use the multi-runner module
  2. Enable job_retry on one or more runner configs
  3. Run terraform apply

The apply then fails while creating one or more *-job-retry event source mappings, with the InvalidParameterValueException shown above.
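The part of the configuration that matters for step 2 looks roughly like the fragment below. It is a trimmed sketch, not a complete configuration: the runner key "linux-x64" and the local name are made up, all other runner and matcher settings are omitted, and it assumes the job_retry object exposes an enable flag, which should be checked against the module's variables.

```hcl
locals {
  # Hypothetical local; in practice this map is passed to the multi-runner
  # module's multi_runner_config input together with the other settings.
  multi_runner_config = {
    "linux-x64" = { # hypothetical runner key
      runner_config = {
        # Assumed field name: turns on the job-retry submodule for this runner.
        job_retry = {
          enable = true
        }
        # ... remaining runner settings omitted
      }
      # ... matcher_config omitted
    }
  }
}
```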

Expected behavior

The retry Lambda IAM policy should be attached before the SQS event source mapping is created.

Proposed fix

Add an explicit dependency in modules/runners/job-retry/main.tf:

resource "aws_lambda_event_source_mapping" "job_retry" {
  event_source_arn                   = aws_sqs_queue.job_retry_check_queue.arn
  function_name                      = module.job_retry.lambda.function.arn
  batch_size                         = var.config.lambda_event_source_mapping_batch_size
  maximum_batching_window_in_seconds = var.config.lambda_event_source_mapping_maximum_batching_window_in_seconds

  # Ensure the role has its SQS permissions before AWS validates the mapping.
  depends_on = [aws_iam_role_policy.job_retry]
}

Notes

This looks like a deterministic apply-ordering problem, not a missing-permission definition. The retry IAM policy itself already grants the correct SQS actions.
