Github Actions Software Development Lifecycle
Github Actions are pretty cool π as far as CI/CD pipelines go. But evolving and testing non-trivial workflows can go south real fast β οΈ. The core issue is that for any other code, the CI/CD system (if designed and implemented correctly) keeps you honest β , runs your tests π§ͺ, and only accepts successful PRs π¦.
But when you need to make changes to your CI/CD workflows, the situation is different π¬. I learned it the hard way, so you donβt have to π‘. Let me share with you our current software development lifecycle for Github Actions workflows π. The same principles can be applied to any CI/CD system π οΈ.
βQuis custodiet ipsos custodes?β ~ Juvenal
π€ What’s Github Actions anyway? π€
GitHub Actions is a powerful CI/CD platform built directly into GitHub that allows you to automate software workflows based on events in your repository. It lets you define jobs using YAML files that specify when and how to build, test, or deploy your code. These workflows are triggered by events like pushes, pull requests, or scheduled timers, and they run in isolated containers or VMs provided by GitHub-hosted runners or self-hosted environments.
While itβs often used for simple tasks like running unit tests or linting code, GitHub Actions scales to handle highly complex systems. You can compose workflows from other reusable workflows, manage dependencies, coordinate multi-service deployments, and integrate with external systems like Docker, AWS, GCP, or Slack. With the ability to run custom shell scripts, share workflows across repos, and control execution environments, it has become a full-blown automation framework, not just a basic CI tool.
Because of this power and flexibility, maintaining a non-trivial GitHub Actions setup can quickly become a software engineering problem of its own. When your CI/CD system spans dozens of workflows with conditional logic, secrets management, and complex coordination between jobs, it needs a software development lifecycle (SDLC) of its own, including testing, versioning, and rollback capabilities.
βπ€·ββοΈ Do you really need an SDLC for a CI/CD system? π€·ββοΈ
Umm… yes! Probably anyone that has a non-trivial CI/CD system needs a proper process. Our CI/CD system is responsible for detecting changes (duh!), running different checks, unit tests, integration tests, smoke tests, building container images, tagging images, coordinated deployments of frontend and backend services, manual deploys and much more. It is a complex system that involves multiple top-level workflows invoking multiple reusable workflows, each with its own set of jobs and steps. These workflows sometimes invokes yet lower-level reusable workflows and some of the steps involves launching non-trivial shell scripts.
Here is a diagram of our current CI/CD system. The blue circles π΅ are top-level workflows, the green circles π’ are reusable workflows, and the red circles π΄ are shell scripts.
The “Manual Deploy to Staging” workflow is an example of a hierarchical workflow. It is a graph of workflows and shell
scripts calling each other. It starts with a top-level workflow that can be run manually (with the workflow_dispatch
event) or automatically on a push to the main branch. It contains multiple jobs:
- resolve-sha
- fetch-secrets
- deploy-backend
- deploy-frontend
Each one of these workflows is a reusable workflow that is also used by other top-level workflows and may itself invoke nested reusable workflows or shell scripts. Each one is also expects multiple inputs and secrets.
Here is the complete graph of this workflow:
π manual-deploy-staging8.yaml
βββ deploy-backend6.yaml
β βββ find_release_sha3.sh
β βββ push-bundle-images2.yaml
β βββ tag-backend-image4.yaml
β βββ find_backend_image.sh
β βββ tag_existing_image2.sh
βββ deploy-frontend2.yaml
βββ fetch-aws-secrets.yaml
βββ resolve-sha.yaml
Let’s take a closer look at this workflow and its first-level reusable workflows. If YAML and shell scare you (they should π±), feel free to skip this.
π Deep dive into the “Manual Deploy to Staging” workflow π
Alright, you’re still here, brave soul π .Let’s look at the code of “Manual Deploy to Staging” workflow (if you can call it code) and later at the reusable workflows it uses. It is a top-level workflow that can be triggered manually.
Last warning… β οΈ I’m serious! there is a massive amount of YAML and shell commands ahead! You are totally within your rights to skip this at any point and scroll down to the cool stuff later. The main point is just to impart on you what a misery it is to maintain a non-trivial CI/CD system without a proper SDLC.
name: Manual Deploy to Staging
on:
workflow_dispatch:
inputs:
sha:
description: SHA
required: false
default: HEAD
concurrency:
group: deploy-to-staging
cancel-in-progress: false
jobs:
resolve-sha:
uses: ./.github/workflows/resolve-sha.yaml
with:
sha: ${{ inputs.sha }}
fetch-secrets:
uses: ./.github/workflows/fetch-aws-secrets.yaml
with:
secret-id: <REDACTED>
aws-region: us-east-1
secrets:
secrets-manager-access-key-id: ${{ secrets.DEVELOPMENT_SECRETS_MANAGER_USER_ACCESS_KEY_ID }}
secrets-manager-access-key-secret: ${{ secrets.DEVELOPMENT_SECRETS_MANAGER_USER_SECRET_ACCESS_KEY }}
deploy-backend:
needs: [resolve-sha, fetch-secrets]
uses: ./.github/workflows/deploy-backend7.yaml
with:
sha: ${{ needs.resolve-sha.outputs.resolved-sha }}
aws-region: us-east-1
eks-cluster-name: <REDACTED>
aws-account-id: <REDACTED>
environment: staging
fetch-depth: 1
ecr-repositories: "ecr-repo-1,ecr-repo-2"
dry-run: false
additional-tags: "manual"
suffix: "manual"
secrets:
ecr-aws-access-key-id: ${{ needs.fetch-secrets.outputs.ecr-access-key-id }}
ecr-aws-secret-access-key: ${{ needs.fetch-secrets.outputs.ecr-access-key-secret }}
eks-aws-access-key-id: ${{ needs.fetch-secrets.outputs.eks-access-key-id }}
eks-aws-secret-access-key: ${{ needs.fetch-secrets.outputs.eks-access-key-secret }}
deploy-frontend:
needs: [resolve-sha, fetch-secrets]
uses: ./.github/workflows/deploy-frontend4.yaml
with:
sha: ${{ needs.resolve-sha.outputs.resolved-sha }}
aws-region: us-east-1
deploy-env: 'staging'
env-dns-suffix: -staging
repos: '["app-1", "app-2", "app-3", "app-4"]'
repo-settings: '{"app-1":{"nx-project-name":"app", "framework": "vite"}, "app-2":{"framework": "vite", "datadog-service-name": "app-2-web"}, "app-3":{"framework": "vite", "datadog-service-name": "app-3-web"}, "octopus": {"framework": "vite", "datadog-service-name": "octopus-web"}}'
aws-sourcemaps: true
datadog-sourcemaps: true
dry-run: false
secrets:
s3-aws-access-key-id: ${{ needs.fetch-secrets.outputs.s3-access-key-id }}
s3-aws-secret-access-key: ${{ needs.fetch-secrets.outputs.s3-access-key-secret }}
development-datadog-api-key: ${{ secrets.DEVELOPMENT_DATADOG_API_KEY }}
development-datadog-app-key: ${{ secrets.DEVELOPMENT_DATADOG_SOURCE_MAP_UPLOAD_APP_KEY }}
Yep, it is a lot of moving parts. And this is just one top-level workflow. There are many more that do different things. Let’s checkout the reusable workflows it’s calling.
The resolve-sha.yaml
workflow is responsible for resolving the Git SHA to its full form. It checks out the repository
and uses git rev-parse
to get the full SHA. This is useful because sometimes you want to deploy a specific commit or
tag, and you need to ensure you have the full SHA. That’s a LOT of YAML and shuffling environment variables around for a
very simple task, which can be summed up as: if a SHA is provided, resolve it to its full form, otherwise use HEAD
.
But, it turns out resolving shas is not that simple. Sometimes we want to resolve deploy an older commit. The way Github
Actions works is that you have to check out the repo in order to be able to perform git operations. By default, it
checks the head only. So, this workflow makes sure to checkout the last 100 commits.
name: Resolve Git SHA
on:
workflow_call:
inputs:
sha:
required: false
type: string
outputs:
resolved-sha:
description: The full SHA
value: ${{ jobs.resolve-sha.outputs.resolved-sha }}
jobs:
resolve-sha:
runs-on: ubuntu-latest
outputs:
resolved-sha: ${{ steps.set-resolved-sha.outputs.resolved-sha }}
steps:
- name: Checkout repository
uses: actions/checkout@v4.2.2
with:
fetch-depth: 100
- name: Resolve Git SHA
id: set-resolved-sha
run: |
# Default to HEAD if no SHA is provided
SHA="${{ inputs.sha }}"
if [ -z "$SHA" ]; then
SHA="HEAD"
fi
# Resolve the SHA to its full form
RESOLVED_SHA=$(git rev-parse "$SHA")
echo "resolved-sha=$RESOLVED_SHA" >> "$GITHUB_OUTPUT"
The next reusable workflow is fetch-secrets.yaml
. It is responsible for fetching secrets dynamically from AWS Secrets
Manager. It has just one job to fetch a bunch of secrets from AWS secret manager. The idea is that we don’t have to
configure Github with individual secrets, but only with the credentials necessary to acess AWS secret manager, which is
the source of truth of all relevant secrets and then specific workflows can access the secrets they need. Again, note
how a seemingly simple task of fetching secrets requires an inordinate amount of YAML and shell commands.
name: Fetch AWS Secrets
on:
workflow_call:
inputs:
secret-id:
description: 'The name of the AWS Secrets Manager secret'
required: true
type: string
aws-region:
description: AWS region
required: true
type: string
secrets:
secrets-manager-access-key-id:
required: true
secrets-manager-access-key-secret:
required: true
outputs:
s3-access-key-id:
description: "The S3 Access Key ID"
value: ${{ jobs.fetch-secrets.outputs.s3-access-key-id }}
s3-access-key-secret:
description: "The S3 Access Key Secret"
value: ${{ jobs.fetch-secrets.outputs.s3-access-key-secret }}
ecr-access-key-id:
description: "The ECR Access Key ID"
value: ${{ jobs.fetch-secrets.outputs.ecr-access-key-id }}
ecr-access-key-secret:
description: "The ECR Access Key Secret"
value: ${{ jobs.fetch-secrets.outputs.ecr-access-key-secret }}
eks-access-key-id:
description: "The EKS Access Key ID"
value: ${{ jobs.fetch-secrets.outputs.eks-access-key-id }}
eks-access-key-secret:
description: "The EKS Access Key Secret"
value: ${{ jobs.fetch-secrets.outputs.eks-access-key-secret }}
snapshots-access-key-id:
description: "The Snapshots Access Key ID"
value: ${{ jobs.fetch-secrets.outputs.snapshots-access-key-id }}
snapshots-access-key-secret:
description: "The Snapshots Access Key Secret"
value: ${{ jobs.fetch-secrets.outputs.snapshots-access-key-secret }}
bedrock-access-key-id:
description: "The Bedrock Access Key ID"
value: ${{ jobs.fetch-secrets.outputs.bedrock-access-key-id }}
bedrock-access-key-secret:
description: "The Bedrock Access Key Secret"
value: ${{ jobs.fetch-secrets.outputs.bedrock-access-key-secret }}
jobs:
fetch-secrets:
runs-on: ubuntu-latest
outputs:
s3-access-key-id: ${{ steps.extract-values.outputs.s3-access-key-id }}
s3-access-key-secret: ${{ steps.extract-values.outputs.s3-access-key-secret }}
ecr-access-key-id: ${{ steps.extract-values.outputs.ecr-access-key-id }}
ecr-access-key-secret: ${{ steps.extract-values.outputs.ecr-access-key-secret }}
eks-access-key-id: ${{ steps.extract-values.outputs.eks-access-key-id }}
eks-access-key-secret: ${{ steps.extract-values.outputs.eks-access-key-secret }}
snapshots-access-key-id: ${{ steps.extract-values.outputs.snapshots-access-key-id }}
snapshots-access-key-secret: ${{ steps.extract-values.outputs.snapshots-access-key-secret }}
bedrock-access-key-id: ${{ steps.extract-values.outputs.bedrock-access-key-id }}
bedrock-access-key-secret: ${{ steps.extract-values.outputs.bedrock-access-key-secret }}
steps:
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4.2.1
with:
aws-access-key-id: ${{ secrets.secrets-manager-access-key-id }}
aws-secret-access-key: ${{ secrets.secrets-manager-access-key-secret }}
aws-region: ${{ inputs.aws-region }}
- name: Fetch Secret from AWS Secrets Manager
id: fetch-secret
run: |
SECRET_VALUE=$(aws secretsmanager get-secret-value \
--secret-id "${{ inputs.secret-id }}" \
--query SecretString \
--output text)
echo "SECRET_VALUE=$SECRET_VALUE" >> "$GITHUB_ENV"
- name: Extract Requested Keys
id: extract-values
run: |
KEYS="ecrAccessKeyId,ecrAccessKeySecret,eksAccessKeyId,eksAccessKeySecret,s3AccessKeyId,s3AccessKeySecret,snapshotsAccessKeyId,snapshotsAccessKeySecret,bedrockAccessKeyId,bedrockAccessKeySecret"
# Convert comma-separated keys to an array
IFS=',' read -ra KEY_ARRAY <<< "$KEYS"
for KEY in "${KEY_ARRAY[@]}"; do
VALUE=$(echo "$SECRET_VALUE" | jq -r --arg key "$KEY" '.[$key]')
# Convert camelCase or PascalCase to kebab-case
OUTPUT_NAME=$(echo "$KEY" | \
sed 's/\([^A-Z]\)\([A-Z]\)/\1-\2/g' | \
sed 's/\([A-Z]\)/\L\1/g')
echo "$OUTPUT_NAME=$VALUE" >> "$GITHUB_OUTPUT"
done
The deploy-backend7.yaml
reusable workflow is one of the most complicated workflows. It is responsible for deploying a
backend service, which involves the following jobs:
build-images
: This job builds the Docker images for the backend services and pushes them to ECR.check-images
: This job checks if the images for the backend services are available in ECR. If not, it tags them.tag-backend-image
: This job tags the backend images with the provided suffix if they are missing.deploy-backend
: This job deploys the backend services to EKS using the images built in the previous jobs.deploy-frontend
: This job deploys the frontend services to S3 and CloudFront.
It takes a large number of inputs and some of the jobs are themselves reusable workflows like push-bundle-images2.yaml
and tag-backend-image5.yaml
. The check-images
job has a bunch of shell commands and one of the steps invokes a
separate shell script - ./.github/workflows/scripts/find_release_sha3.sh
. Note that the dry-run input, which when true
is used to test workflow without actually performing any actions.
name: Deploy Backend 7
on:
workflow_call:
inputs:
sha:
description: SHA
required: true
type: string
aws-region:
description: AWS region
required: true
type: string
eks-cluster-name:
description: AWS EKS cluster name
required: true
type: string
aws-account-id:
description: AWS account id
required: true
type: string
environment:
description: Environment
required: true
type: string
fetch-depth:
description: Number of commits to fetch when checking out
required: false
type: number
default: 1
ecr-repositories:
description: Comma-separated list of ECR repositories
required: true
type: string
dry-run:
description: Dry run
required: true
type: boolean
additional-tags:
description: Comma separated list of additional tags
required: true
type: string
suffix:
description: Suffix to add to tag (e.g., build, test-success)
required: false
type: string
default: ""
secrets:
eks-aws-access-key-id:
required: true
eks-aws-secret-access-key:
required: true
ecr-aws-access-key-id:
required: true
ecr-aws-secret-access-key:
required: true
jobs:
build-images:
uses: ./.github/workflows/push-bundle-images2.yaml
with:
sha: ${{ inputs.sha }}
aws-region: ${{ inputs.aws-region }}
aws-account-id: ${{ inputs.aws-account-id }}
environment: ${{ inputs.environment }}
additional-tags: ${{ inputs.additional-tags }}
backends: '["invisible", "browser"]'
dry-run: ${{ inputs.dry-run }}
secrets:
ecr-aws-access-key-id: ${{ secrets.ecr-aws-access-key-id }}
ecr-aws-secret-access-key: ${{ secrets.ecr-aws-secret-access-key }}
check-images:
needs: build-images
runs-on: ci-runner
outputs:
any_missing: ${{ steps.check.outputs.any_missing }}
steps:
- name: Checkout repository
uses: actions/checkout@v4.2.2
with:
fetch-depth: 10
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4.2.1
with:
aws-access-key-id: ${{ secrets.eks-aws-access-key-id }}
aws-secret-access-key: ${{ secrets.eks-aws-secret-access-key }}
aws-region: ${{ inputs.aws-region }}
- name: Check images for target repositories
id: check
run: |
ANY_MISSING=false
IFS=',' read -ra REPOSITORIES <<< "${{ inputs.ecr-repositories}}"
for REPOSITORY in "${REPOSITORIES[@]}"; do
echo "Checking image for repository: $REPOSITORY"
if ! REPOSITORY_NAME="$REPOSITORY" SUBDIRECTORY="backend" ./.github/workflows/scripts/find_release_sha3.sh 0; then
ANY_MISSING=true
break # Optional: bail out immediately on first failure
fi
done
echo "any_missing=$ANY_MISSING" >> "$GITHUB_OUTPUT"
tag-backend-image:
needs: check-images
if: needs.check-images.outputs.any_missing == 'true'
uses: ./.github/workflows/tag-backend-image5.yaml
with:
ecr-repositories: '["repo-1","repo-1"]'
dry-run: ${{ inputs.dry-run }}
suffix: ${{ inputs.suffix }}
secrets:
ecr-aws-access-key-id: ${{ secrets.ecr-aws-access-key-id }}
ecr-aws-secret-access-key: ${{ secrets.ecr-aws-secret-access-key }}
deploy-backend:
needs: [check-images, tag-backend-image]
runs-on: ci-runner
timeout-minutes: 30
steps:
- name: Set BROWSER_RELEASE_TAG Environment Variable
run: echo "BROWSER_RELEASE_TAG=${{ inputs.sha }}" >> "$GITHUB_ENV"
- name: Set DOCKER_TAG Environment Variable
run: echo "DOCKER_TAG=${{ inputs.sha }}" >> "$GITHUB_ENV"
- name: Show GitHub context
env:
GITHUB_CONTEXT: ${{ toJson(github) }}
run: echo "$GITHUB_CONTEXT"
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4.2.1
with:
aws-access-key-id: ${{ secrets.eks-aws-access-key-id }}
aws-secret-access-key: ${{ secrets.eks-aws-secret-access-key }}
aws-region: ${{ inputs.aws-region }}
- name: Create kubeconfig for EKS
run: aws eks update-kubeconfig --region "${{ inputs.aws-region }}" --name "${{ inputs.eks-cluster-name }}"
- name: Checkout
uses: actions/checkout@v4.2.2
- name: Set up JDK 21
uses: actions/setup-java@v4.7.1
with:
java-version: 21
distribution: 'temurin'
cache: gradle
- name: Deploy Kubernetes resources
if: ${{ !inputs.dry-run }}
working-directory: backend
run: ./gradlew the-bundle:the-deployer:deploy
env:
REGISTRY_URL: ${{ inputs.aws-account-id }}.dkr.ecr.${{ inputs.aws-region }}.amazonaws.com
ENVIRONMENT: ${{ inputs.environment }}
- name: Show command (dry-run)
if: inputs.dry-run
run: echo "[DRY-RUN] ./gradlew the-bundle:the-deployer:deploy"
Ok. That was a lot of YAML and there is no need to go through the rest of the reusable workflows and the shell scripts. The point is that it’s a lot to take in, modify safely and review. So, how do we make sure that we can evolve this system safely? That’s where the process comes in.
π The process
The process we use to evolve our GitHub Actions workflows is based on the principle of cloning existing workflows and shell scripts 𧬠(with a versioned filename), modifying the cloned versions (never the original) βοΈ, and incrementally updating other workflowsβagain by cloningβto call the modified ones π. The modified workflows can be tested in isolation π§ͺ. This approach extends all the way up to top-level workflows. If a new top-level workflow that invokes a modified reusable workflow turns out to be broken β, we simply revert to the original and troubleshoot the issue π.
Until a new top-level workflow is introduced, all the modified reusable workflows remain inertβthey donβt affect the CI/CD system at all π« since nothing triggers them.
When we do switch to a new top-level workflow, we disable the old one in the GitHub Actions dashboard βοΈ. This is crucial because most of our top-level workflows are not idempotent βοΈβwe donβt want both old and new to be triggered accidentally. This holds true even for manual workflows triggered by engineers π¨βπ», where running the wrong version could have side effects.
As you might expect, when modifying a deeply nested reusable workflow πͺ, youβll need to clone all of its ancestors, test each layer thoroughly, and ensure everything works end-to-end β . Once confirmed, the obsolete versions can be safely removed π§Ή.
Letβs walk through a real-world example to illustrate how this works.
π The process in action
Suppose we want to modify the push-bundle-images2.yaml
reusable workflow π οΈ. To do that safely, we must find all its
ancestors π§. This can be tedious, but I built a tool β github-workflow-analyzer
β to automate the process. It scans
all workflows, generates a CSV with metadata π, and visualizes the workflow graph πΌοΈ. In fact, the diagrams in this
article were created using this tool.
It captures the following fields for every workflow:
workflow
url
triggers
calls
called_by
(direct callers only)
For our purposes, weβre particularly interested in the called_by
column for push-bundle-images2.yaml
. Two workflows
directly call it:
push-backend5.yaml
deploy-backend6.yaml
The deploy-backend6.yaml
workflow belongs to the manual-deploy-staging
subgraph πΈοΈ. The push-backend5.yaml
workflow is independent.
To proceed, we clone push-bundle-images2.yaml
β push-bundle-images3.yaml
βοΈ. Then we clone and update the two
callers:
push-backend5.yaml
β‘οΈpush-backend6.yaml
deploy-backend6.yaml
β‘οΈdeploy-backend7.yaml
Next, we need to handle the workflows that call those. Focusing on deploy-backend6.yaml
, we find that it is invoked by
manual-deploy-staging8.yaml
. So, we clone that as well:
manual-deploy-staging8.yaml
βmanual-deploy-staging9.yaml
, which will calldeploy-backend7.yaml
π§βπ» Managing PRs and reviews
Hereβs the catch: since we never edit the original workflows directly, all changes involve new files π. That makes diffs hard to track in the GitHub UI π. One of my teammates proposed an elegant solution π‘: split the change into two phases β copy and modify.
- In the first PR, copy all the necessary files without any changes π.
- In the second PR, make your modifications βοΈ.
This makes reviews focused and efficient π.
π§ͺ Dry-run safety net
Before enabling new workflows, we leverage another safeguard β dry-run mode. Many of our workflows support a
dry-run
input. Initially, we set this to true
so that nothing destructive happens. This helps us catch syntax issues
and validate behavior safely π«.
Once we’re confident, we flip dry-run
to false
and commit π. Now we have two workflows doing the same thing:
- Old:
manual-deploy-staging8.yaml
- New:
manual-deploy-staging9.yaml
We disable the old one to avoid conflicts β οΈ and carefully test the new one π§. If it fails, we just re-enable the old one π. If it works, we can retire the obsolete workflows and shell scripts π§Ή.
Remember, this process must be repeated for every top-level workflow affected by the change, including
push-backend6.yaml
πͺ.
π Take home points π
- Treat your workflows like production code β GitHub Actions can grow into a sprawling system. Apply the same rigor as you would to application logic π§ .
- Use versioned cloning for safe changes β Never modify workflows directly. Clone, modify, test, and promote incrementally without risking regressions π§¬.
- Leverage dry runs to test workflows safely β Many workflows can simulate behavior without performing destructive actions. Use this to your advantage before going live π.
- Disable outdated workflows after promotion β Avoid accidental runs by disabling older versions once a new one is in place. Manual doesnβt mean safe by default βοΈ.
- Split copy and modify into separate PRs β This makes reviews more effective. π.
π»π¦ Manete alieni, amici mei