Continuous integration playbook

  • Maintainers: DevX Team.
  • Audience: any software engineer, no prior infrastructure knowledge required.
  • TL;DR: This document sums up what to do in various scenarios that can block the CI.

Sourcegraph’s continuous integration (CI) is what enables us to feel confident when delivering our changes to our users, and is one of the key components enabling Sourcegraph to deliver quality software. While the DevX team is in charge of managing the CI as a tool, it is essential for every engineer to be able to unblock themselves if there is a problem, in order to stay autonomous.

This page lists common failure scenarios and provides a step-by-step guide to get the CI back in an operational state.

Prerequisites

In order to handle problems with the CI, the following elements are necessary:

  1. Have access to the sourcegraph-ci project on Google Cloud Platform.
  2. If you do not have access, ask for it in #it-tech-ops.
  3. Have the gcloud CLI installed.
  4. Have the kubectl CLI installed.
  5. Gain access to the CI cluster by authenticating against it with gcloud and kubectl.
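
As a reference, here is a minimal sketch of the authentication flow, assuming a GKE cluster in the sourcegraph-ci project; the cluster name and zone below are placeholders, so look up the real values in the GCP console:

  # Authenticate the gcloud CLI with your Sourcegraph account.
  gcloud auth login

  # Fetch kubectl credentials for the CI cluster (placeholders: <CLUSTER_NAME>, <ZONE>).
  gcloud container clusters get-credentials <CLUSTER_NAME> --zone <ZONE> --project sourcegraph-ci

  # Sanity check: list the Buildkite agent pods.
  kubectl get pods -n buildkite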

Scenarios

buildchecker has locked the main branch

  • Severity: major
  • Impact:
    • No pull requests may be merged, except by the authors of the last few failed builds and the @dev-experience team (see Actions below).
    • Pull request builds may be failing as well.
  • Possible causes:
    • buildchecker will lock/restrict push access to the main branch if a series of failed builds is detected - this can indicate that a regression has been merged into main or that critical build infrastructure is failing.

Actions

buildchecker will still allow the authors of the last few failed builds, as well as the @dev-experience team, to push to the main branch so as to make any changes necessary to restore the pipeline to a healthy state.

  1. Follow the “Build has failed on the main branch” guide.
  2. Once the issue has been resolved, wait for buildchecker to unlock the branch, or manually trigger a buildchecker run (click “Run workflow”).

Build has failed on the main branch

  • Severity: minor
  • Impact: that commit won’t be deployed to k8s.sgdev.org and sourcegraph.com until a subsequent build passes.
  • Possible causes:
    • The main branch runs additional checks compared to Pull Request builds, so it’s possible that one of those checks failed.
      • 💡 The checks are dynamically generated by our pipeline generation tool. The main branch notably runs much more exhaustive checks than other branches.
    • The main branch has changes that weren’t in the Pull Request branch, and those changes are causing a failure.
    • The main branch is failing due to a previous build.

Actions

  1. Check your build on Buildkite.
    • Find its link directly in the #buildkite-main channel.
    • 💡 Or run sg ci status in your shell, with the main branch checked out.
  2. Search for the failing steps and browse the logs (💡 run sg ci logs in your shell, with the main branch checked out; the sg commands used in this scenario are collected in a sketch after this list).
    • Look for a failure explanation: it can be a test that failed or a command that returned a non-zero exit code.
  3. Check the previous builds on the main branch on Buildkite
    1. Are they failing with the same exact error?
  4. Is that a real failure or a flake?
    1. Restart that step. Maybe it will fail again, but if it doesn’t it’ll save you time.
      • 💡 You can go to 3. while it runs.
    2. See the “Is this a failure or a flake?” scenario.
    3. Did restarting it fix the problem?
    4. Does the failure point to a problem with the code that was shipped in that commit?
      1. Yes, and it’s a very quick fix that can get merged promptly:
        1. Write a short message on #buildkite-main and tell others that you’re fixing it.
        2. Submit the fix with another PR and get it merged as soon as possible.
      2. Yes, but it’s not an easy and/or quick fix:
        1. Revert the offending Pull Request.
        2. Open a GitHub issue mentioning the build and the context to explain to the team owning that test what happened.
        3. Check out the PR branch.
        4. Rebase it so it includes the changes that broke it when merged into the main branch.
        5. Create a build using sg ci build main-dry-run in order to get the CI to run the same exact checks it does on the main branch.
      3. No, but it seems to fail in a step or code owned by another team:
        1. Reach out to a member of the team responsible for that test.
        2. Go with option 1 or 2 from the previous points.
    5. No, and a flake is suspected: see the “Spotted a flake” scenario.
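
For convenience, here are the sg commands referenced in this scenario, to be run from a checkout of the relevant branch:

  # With the main branch (or your PR branch) checked out:
  sg ci status               # check the status of the build for the current branch
  sg ci logs                 # fetch the logs of the failing steps
  sg ci build main-dry-run   # trigger a build running the exact same checks as the main branch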

Builds are all failing on the main branch with the same error

  • Severity: major
  • Impact: no commits are being deployed on DogFood and sourcegraph.com until the problem is resolved. Cutting a release is impossible.
  • Possible causes:
    • A previous Pull Request introduced a change that causes a test to fail.
    • A previous Pull Request introduced a change that modified state in an unexpected way and broke the CI.
    • An external dependency is not available anymore and is causing builds to fail.
    • Some rate limiting API is throttling us and causing builds to fail.

Actions

  1. Identify the error in common with the recent builds on Buildkite.
  2. Find the build where the problem appeared for the first time.
    • 💡 Often it’s the first build that became red, but check that the error is the same to be sure.
  3. Is this an external failure or an internal one?
    • 💡 External failures are related to downloading a dependency, such as a package in a script or in a Dockerfile. They often manifest in the form of an HTTP error.
    • 💡 If unsure, ask for help on #dev-chat.
    • Yes, it’s an external failure:
      1. See the “SSH into an agent” scenario.
      2. Try to reproduce the faulty HTTP request so you can observe what the problem is (a sketch follows this list). Is it the same failure?
        • Yes: do you know how to fix it? If not, escalate by creating an incident (/incident on Slack).
        • No: escalate by creating an incident (/incident on Slack).
    • No, it’s an internal failure:
      1. Does it involve faulty state on the agents (e.g. a given tool is not found where it should be, or has an incorrect version)?
      2. Try to find an agent that recently successfully ran the faulty step (look for a green build on the main branch)
        1. Can you see a difference? If yes, take note.
      3. Do you know how to fix it?
        • Yes: apply the fix.
        • No: restart the agents to see if it fixes the problem. See the “Restarting the agents” scenario.
          • Does it fix the problem? If not, escalate by creating an incident (/incident on Slack).
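
As a sketch of reproducing a faulty HTTP request from inside an agent; the URL below is hypothetical, so take the real one from the failing step’s logs:

  # From a shell on the agent (see the “SSH into an agent” scenario):
  # fetch only the response headers of the download that is failing.
  curl -sSL -I https://releases.example.com/some/dependency.tar.gz
  # Compare the HTTP status (403? 429? 5xx?) with the error shown in the build logs.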

Builds are failing on the main branch with different errors

  • Severity: major
  • Impact: no commits are being deployed on DogFood and sourcegraph.com until the problem is resolved. Cutting a release is impossible.
  • Possible causes:
    • A previous Pull Request introduced a change that causes a test to fail.
    • A previous Pull Request introduced a change that modified state in an unexpected way and broke the CI.
    • An external dependency is not available anymore and is causing builds to fail under certain conditions.
    • Some rate limiting API is throttling us and causing builds to fail.

Actions

  1. Escalate by creating an incident (/incident on Slack).
  2. Get some help.
  3. Downscale the agents to be able to observe exactly what’s going on.
    1. Update the autoscaler manifest to run a single agent (a consolidated command sketch follows this list):
      1. cd sourcegraph/infrastructure
      2. Edit the manifest to set both the maximum and minimum agent counts to 1.
      3. cd buildkite/kubernetes/buildkite-autoscaler
      4. Run kubectl apply -n buildkite -f buildkite-autoscaler.Deployment.yaml
      5. Use kubectl get pods -n buildkite -w to observe the currently running agents (k9s works here too).
    2. From there, any change and build will run on a single agent, allowing you to observe the behaviour live.
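
A minimal sketch of the sequence above, assuming the agent counts are defined in buildkite-autoscaler.Deployment.yaml (check the manifest itself for the exact field names):

  cd sourcegraph/infrastructure/buildkite/kubernetes/buildkite-autoscaler
  # Set both the minimum and maximum agent counts to 1.
  $EDITOR buildkite-autoscaler.Deployment.yaml
  kubectl apply -n buildkite -f buildkite-autoscaler.Deployment.yaml
  # Watch the agents scale down to a single pod (k9s works here too).
  kubectl get pods -n buildkite -w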

Spotted a flake

  • Severity: minor
  • Impact: Some builds will fail randomly, creating noise and slowing down the engineering team
  • Possible causes:
    • Tests relying on timing.
    • Race conditions.
    • End to end tests are delicate by nature and can fail randomly due to the complexity of all involved components.
    • A state-dependent test is not properly torn down and fails.

Actions

  1. What kind of step is failing?
  • Is this an end-to-end test?
    • 💡 E2E tests are fragile by nature; there is no way around it.
    • Take note.
  • Is this a Docker image build step?
    • 💡 This should really not be happening.
    • Is the error about the Docker daemon?
      • Yes, this is a CI infrastructure flake. Ping @dev-experience-support on Slack in the #buildkite-main or #dev-experience channels.
      • No: reach out to the team owning that Docker image immediately.
  • Anything else
    • Take note of the failing step and go to next point.
  2. Is that flake related to the CI infrastructure?
  • The CI infrastructure often involves:
    • Docker daemon not being reachable.
    • Missing tools that we use to run the steps, such as go, node, comby, …
    • Errors from asdf, which is used to manage the above tools.
  • Yes: ping @dev-experience-support on Slack in the #buildkite-main or #dev-experience channels.
  3. Is that flake related to the code? Reach out to the team owning that test.

Is this a failure or a flake?

  • Severity: minor
  • Impact: Some builds will fail randomly, creating noise and slowing down the engineering team
  • Possible causes:
    • Tests relying on timing.
    • Race conditions.
    • End to end tests are delicate by nature and can fail randomly due to the complexity of all involved components.
    • A state-dependent test is not properly torn down and fails.

Actions

  1. Immediately restart the faulty step.
    • 💡 It will save you time while you’re looking at the logs.
    • Is the step passing now? If yes, it was probably a flake; confirm with the next steps.
  2. Check on Grafana if there are any occurrences of the failures that were previously observed:
    1. Go to the “Explore” section.
    2. Make sure to select grafanacloud-sourcegraph-logs in the dropdown at the top of the page.
    3. Scope the time window to 7 days to make sure to find previous occurrences, if there are any.
    4. Enter a query such as {app="buildkite"} |= "your error message", where “your error message” is a string that approximately identifies the failure cause observed in the failing step (example queries follow this list).
  3. Is there a build that failed exactly like this?
    • Yes:
      1. 💡 Double-check that you’re looking at the same step by inspecting the labels of the message (click on the line to make them visible).
      2. Yes, that’s a flake. See the “Spotted a flake” scenario.
    • No: it’s not a flake; reach out to the team owning those tests.
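
For example, assuming the failing step logged a Go test failure or a network timeout (both messages below are hypothetical), the queries could look like:

  {app="buildkite"} |= "FAIL: TestGitserverClient"
  {app="buildkite"} |= "i/o timeout" |= "gitserver"

Additional |= filters can be chained to narrow down a message that would otherwise match too many unrelated builds.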

Restarting the agents

  • Severity: minor
  • Impact: May fail ongoing builds, but that’s fine.
  • Possible causes:
    • Manual restart by an engineer.
    • Newer version of the agents needs to be deployed.

Actions

  1. Use kubectl get pods -n buildkite -w to observe the currently running agents (k9s works here too).
  2. In a different terminal, run kubectl -n buildkite rollout restart deployment buildkite-agent.
  3. Wait for the agents to restart completely (a command sketch follows this list).
  4. Restart the faulty build and observe whether the problem is fixed.
    • If necessary: escalate by creating an incident (/incident on Slack).
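
A minimal sketch of the restart, using kubectl rollout status to wait for the new agents instead of guessing:

  # In one terminal, watch the agent pods (k9s works here too).
  kubectl get pods -n buildkite -w

  # In another terminal, restart the deployment and wait for the rollout to complete.
  kubectl -n buildkite rollout restart deployment buildkite-agent
  kubectl -n buildkite rollout status deployment buildkite-agent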

SSH into an agent

  • Severity: none
  • Impact: none (unless a destructive action is performed)
  • Possible cause:
    • Need to investigate a problem and suspect the agent is at fault

Actions

  1. Find the pod you want to SSH into with one of the following methods:
    1. Use kubectl get pods -n buildkite -w to observe the currently running agents and get the pod name (k9s works here too).
    2. From a Buildkite build page, click the “Timeline” tab of a job and see the entry for “Accepted Job”. The “Host name” in the entry is also the name of the pod that the job was assigned to.
  2. Use kubectl exec -n buildkite -it buildkite-agent-xxxxxxxxxx-yyyyy -- bash to open a shell on the Buildkite agent.
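
Once a shell is open on the agent, a few example checks worth running when faulty agent state is suspected (tool availability and Docker daemon reachability, as mentioned in the scenarios above):

  # Inside the agent:
  asdf current    # which tool versions asdf resolves (go, nodejs, ...)
  docker info     # is the Docker daemon reachable?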