Upgrading a managed instance
This page documents how Sourcegraph upgrades and machine upgrades are done for managed instances.
Managed instances configuration is tracked in deploy-sourcegraph-managed.
For basic operations like accessing an instance for these steps, see managed instances operations. To create a managed instance, see managed instances creation process.
- General setup
- Sourcegraph upgrade
- 0) Sourcegraph upgrade setup
- 1) Add a banner indicating upgrade is in progress
- 2) Mark the database as read-only
- 3) Create a snapshot of the current deployment
- 4) Initialize the new production deployment
- 5) Make the database on the new deployment writable
- 6) Upgrade the new deployment
- 7) Recreate the deployment
- 8) Confirm instance health
- 9) Switch the load balancer target
- 10) Take down the old deployment
- 11) Remove the banner indicating maintenance is in progress
- 12) Open a pull request to commit your changes
- Machine upgrade
- In-place updates
General setup
Managed instances configuration is tracked in deploy-sourcegraph-managed; make sure you have the latest revision of this repository checked out. For basic operations like accessing an instance for these steps, see managed instances operations.
First, ensure you have the prerequisites installed and up-to-date:
- Terraform CLI
- Deno CLI (used for scripting upgrades)
- Comby CLI (used for rewriting configuration)
- jq
Many of the following commands in this guide, as well as the commands in operations, use environment variables. Export the appropriate values for the upgrade so you don't lose track:
# name of customer deployment (should match folder)
export CUSTOMER=<customer_or_instance_name>
# for API access to port-forwarded frontend
export SRC_ENDPOINT=http://localhost:4444
# see 1password "$CUSTOMER admin account", under "token" field
export SRC_ACCESS_TOKEN=$TOKEN
# should match GCP project prefix - typically the default is correct
export PROJECT_PREFIX=sourcegraph-managed
# Found in the [Managed Instances vault](https://my.1password.com/vaults/nwbckdjmg4p7y4ntestrtopkuu/allitems/d64bhllfw4wyybqnd4c3wvca2m)
export TF_VAR_opsgenie_webhook=<OpsGenie Webhook value>
# currently live instance
export OLD_DEPLOYMENT=$(gcloud compute instances list --project "$PROJECT_PREFIX-$CUSTOMER" | grep -v "executors" | awk 'NR>1 { if ($1 ~ "-red-") print "red"; else print "black"; }')
# the instance we will create
export NEW_DEPLOYMENT=$([ "$OLD_DEPLOYMENT" = "red" ] && echo "black" || echo "red")
# cf origin cert
export TF_VAR_cf_origin_cert_base64=$(gcloud secrets versions access latest --project=sourcegraph-dev --secret="SOURCEGRAPH_WILDCARD_CERT" | base64)
export TF_VAR_cf_origin_private_key_base64=$(gcloud secrets versions access latest --project=sourcegraph-dev --secret="SOURCEGRAPH_WILDCARD_KEY" | base64)
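As a quick sanity check, you can echo the derived values and make sure they look right (optional; not part of the official process):
# confirm the derived deployment values look sensible
echo "customer=$CUSTOMER old_deployment=$OLD_DEPLOYMENT new_deployment=$NEW_DEPLOYMENT"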
Make sure your copy of the deploy-sourcegraph-managed repository is up to date:
git checkout main
git pull origin main
Sourcegraph upgrade
This is the upgrade process for Sourcegraph releases.
There is a ~1h two-part screencast by @slimsag that walks through this whole upgrade process end-to-end, talks about some of the intricacies, future improvements to this process, and more. It is rough / on-the-spot, so please disregard the brief and occasional 🐱 meows and crashing noises in the background. These videos reference some customer names, and as such can only be shared with the Sourcegraph team at this time.
0) Sourcegraph upgrade setup
Follow the general setup guide. Then, set up the appropriate version variables:
# version to upgrade to - MUST be in format 'v$MAJOR.$MINOR.$PATCH'
export NEW_VERSION=v<sourcegraph_version>
# old version used to verify upgrade
export OLD_VERSION=$(cat $CUSTOMER/$OLD_DEPLOYMENT/VERSION)
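Optionally, confirm the version variables resolved correctly, in particular that OLD_VERSION was read from the VERSION file:
# confirm the version variables before proceeding (aborts if either is unset)
echo "upgrading from ${OLD_VERSION:?missing} to ${NEW_VERSION:?missing}"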
Validate all variables are set:
./util/validate-env.ts
Make sure to use the same shell for all the commands in this guide unless otherwise stated.
Now start a branch for your upgrade:
git checkout -b $CUSTOMER/upgrade-to-$NEW_VERSION
# all the below steps are documented assuming you are in the customer deployment directory
cd $CUSTOMER
Also refer to the upgrade notes for any additional steps that should be conducted during the update.
1) Add a banner indicating upgrade is in progress
Set up access to the frontend by copying this output and running it in another shell:
echo "gcloud compute start-iap-tunnel default-$OLD_DEPLOYMENT-instance 80 --local-host-port=localhost:4444 --zone us-central1-f --project $PROJECT_PREFIX-$CUSTOMER"
Note that an upgrade is being performed:
../util/set-notice.sh upgrade
2) Mark the database as read-only
../util/set-db-readonly.sh $OLD_DEPLOYMENT true
During this time searching will still work, but the site will refuse all database writes, e.g. this will show up in the web UI as:
[…]: pq: cannot execute INSERT in a read-only transaction
During this time:
- Repositories will not update
- Authentication permissions will not synchronize
- LSIF precise code intel cannot be uploaded
- User settings and site configuration cannot be updated
- Extensions cannot be installed
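If you want to double-check that the database really is read-only, you can query Postgres directly. This is only a sketch; the container name (pgsql) and database user (sg) are assumptions and may differ on your deployment:
# sketch: confirm read-only mode directly in Postgres (container/user names are assumptions)
../util/ssh-exec.sh "docker exec pgsql psql -U sg -c 'SHOW default_transaction_read_only;'"
# expected output: on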
3) Create a snapshot of the current deployment
../util/create-snapshot.ts $OLD_DEPLOYMENT
git add . && git commit -m "$CUSTOMER: snapshot deployment"
This can take anywhere from a minute to several minutes, depending on how large the disk is.
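If you want to confirm the snapshot completed before moving on, you can list the project's snapshots (optional check):
# optional: the new snapshot should appear with STATUS READY
gcloud compute snapshots list --project "$PROJECT_PREFIX-$CUSTOMER"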
Note the number of snapshots in terraform.tfvars, and prune snapshots more than 5 upgrades old where appropriate. Make sure to terraform apply any additional changes you make.
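To see which snapshot entries are currently referenced (and spot candidates for pruning), a simple grep works; the exact variable name in terraform.tfvars may differ:
# optional: list snapshot references in the Terraform variables
grep -n "snapshot" terraform.tfvars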
4) Initialize the new production deployment
Copy the old deployment’s Docker Compose configuration:
cp -R $OLD_DEPLOYMENT/ $NEW_DEPLOYMENT/
git add $NEW_DEPLOYMENT/ && git commit -m "$CUSTOMER: set up $NEW_DEPLOYMENT configuration"
Initialize the new production deployment (NEW_DEPLOYMENT) using the snapshot created in the previous step.
This should only modify the deployment—not recreate it.
../util/init-deployment.ts $NEW_DEPLOYMENT
git add . && git commit -m "$CUSTOMER: init $NEW_DEPLOYMENT deployment"
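Before moving on, you can optionally confirm that Terraform has no destructive changes pending for this deployment. This is a sketch that relies on the wording of terraform plan output:
# optional: look for pending destroy/replace actions
terraform plan -no-color | grep -E "must be replaced|will be destroyed" || echo "no destructive changes planned"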
5) Make the database on the new deployment writable
Make the database on the new deployment writable:
../util/set-db-readonly.sh $NEW_DEPLOYMENT false
If you run into errors like:
bash: docker: command not found
ERROR: (gcloud.beta.compute.start-iap-tunnel) Error while connecting [4003: 'failed to connect to backend'].
This might indicate that the instance is not fully set up yet—try again in a minute.
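If you prefer to wait automatically rather than retrying by hand, a small polling loop works (a sketch, reusing the ssh-exec.sh helper):
# optional: poll until Docker responds on the new instance
until ../util/ssh-exec.sh "docker info" >/dev/null 2>&1; do
  echo "instance not ready yet; retrying in 30s..."
  sleep 30
done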
6) Upgrade the new deployment
First, check that the new version requires no manual migration steps in the docker-compose upgrade guide.
Then, to upgrade the new $NEW_DEPLOYMENT deployment to $NEW_VERSION:
Upgrading to a release candidate build? Run this instead:
VERSION=master ../util/update-docker-compose.sh $NEW_DEPLOYMENT/
go run ../util/enforce-tags.go $NEW_VERSION $NEW_DEPLOYMENT/docker-compose/.
git --no-pager diff $NEW_DEPLOYMENT
VERSION=$NEW_VERSION ../util/update-docker-compose.sh $NEW_DEPLOYMENT/
git --no-pager diff $NEW_DEPLOYMENT
Address any merge conflicts in the $NEW_DEPLOYMENT/ directory if needed.
Also verify that no references remain for the old version—the script does not automatically apply changes to replicas. For each reference, ensure that the entire service entry is up to date (i.e. not just the version). You can list references like so (if nothing shows up, you should be good to go):
cat $NEW_DEPLOYMENT/docker-compose/docker-compose.yaml | grep "$OLD_VERSION#v"
cat $NEW_DEPLOYMENT/docker-compose/docker-compose.yaml | grep upstream
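If you want this check to fail loudly instead of relying on eyeballing the output, something like the following works (optional sketch, using the same pattern as above):
# optional: make the old-version check explicit
if grep -q "$OLD_VERSION#v" $NEW_DEPLOYMENT/docker-compose/docker-compose.yaml; then
  echo "old version references still present - fix before applying"
else
  echo "no old version references found"
fi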
Commit and apply the upgrade:
git add $NEW_DEPLOYMENT/ && git commit -m "$CUSTOMER: upgrade to $NEW_VERSION"
terraform apply
7) Recreate the deployment
Take down the new $NEW_DEPLOYMENT deployment and recreate it (so the startup script runs on a clean OS disk):
../util/drop-deployment.ts $NEW_DEPLOYMENT
git add . && git commit -m "$CUSTOMER: take down $NEW_DEPLOYMENT"
../util/deploy-deployment.ts $NEW_DEPLOYMENT
git add . && git commit -m "$CUSTOMER: restart $NEW_DEPLOYMENT"
8) Confirm instance health
- Wait until the instance has fully started with the new versions:
../util/ssh-exec.sh "docker ps --format {{.Image}} | grep $NEW_VERSION#v"
You’ll receive errors or no results for several minutes while the instance finishes running the startup script.
- Ensure that no containers with the wrong version are still running:
../util/ssh-exec.sh "docker ps --format {{.Image}} | grep $OLD_VERSION#v"
- Access Grafana and confirm the instance is healthy by verifying no critical alerts are firing, and there has been no large increase in warning alerts:
gcloud compute start-iap-tunnel default-$NEW_DEPLOYMENT-instance 3370 --local-host-port=localhost:4445 --zone us-central1-f --project $PROJECT_PREFIX-$CUSTOMER
If you run into an error like:
ERROR: (gcloud.beta.compute.start-iap-tunnel) Error while connecting [4003: 'failed to connect to backend'].
This might indicate that the instance is not fully set up yet—try again in a minute.
- Ensure that all containers are healthy (in particular, look for anything that says Restarting):
../util/ssh-exec.sh "docker ps"
- Check frontend container logs and confirm there are no recent errors:
../util/ssh-exec.sh "docker logs sourcegraph-frontend-0"
9) Switch the load balancer target
Connect to the new instance using the SOCKS5 proxy and confirm you can access it and view the old version at https://company.sourcegraph.com/site-admin/updates. Switch over the load balancer:
../util/retarget-load-balancer.ts $NEW_DEPLOYMENT
git add . && git commit -m "$CUSTOMER: switch load balancer to new target"
During this time users will see 500 errors:
Error: Server Error The server encountered a temporary error and could not complete your request. Please try again in 30 seconds.
This lasts about 1m35s, while the google_compute_network_endpoint gets destroyed.
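Once the network endpoint has been recreated, the site recovers on its own. If you want to watch for that, you can poll the external URL; note that the URL shown here is only an example and depends on the customer:
# optional: poll the external URL until it responds again (URL is an example, adjust as needed)
until curl -sf -o /dev/null "https://$CUSTOMER.sourcegraph.com"; do sleep 10; done
echo "instance is serving traffic again"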
10) Take down the old deployment
Remove the old $OLD_DEPLOYMENT deployment and its data disk:
../util/drop-deployment.ts $OLD_DEPLOYMENT drop-disk
rm -rf $OLD_DEPLOYMENT/
git add . && git commit -m "$CUSTOMER: remove $OLD_DEPLOYMENT deployment"
11) Remove the banner indicating maintenance is in progress
Set up access to new frontend by copying this output and running it in another shell:
echo "gcloud compute start-iap-tunnel default-$NEW_DEPLOYMENT-instance 80 --local-host-port=localhost:4444 --zone us-central1-f --project $PROJECT_PREFIX-$CUSTOMER"
Remove the notice previously added to the global user settings:
../util/set-notice.sh none
12) Open a pull request to commit your changes
git push origin HEAD
Then click the provided link to open a pull request in deploy-sourcegraph-managed.
IMPORTANT: DO NOT FORGET TO GET YOUR PR APPROVED AND MERGED. If you forget, the next person upgrading the instance will have a very bad time.
Machine upgrade
This is the upgrade process for changing the underlying machine for a managed instance.
0) Machine upgrade setup
Follow the general setup guide. Then set up a date to identify your upgrade:
export DATE=$(date +%m-%d-%Y)
Validate all variables are set:
./util/validate-env.ts
Then set up a branch for your changes:
git checkout -b $CUSTOMER/machine-upgrade-$DATE
# all the below steps are documented assuming you are in the customer deployment directory
cd $CUSTOMER
1) Prepare for maintenance
Set up access to the frontend by copying this output and running it in another shell:
echo "gcloud compute start-iap-tunnel default-$OLD_DEPLOYMENT-instance 80 --local-host-port=localhost:4444 --zone us-central1-f --project $PROJECT_PREFIX-$CUSTOMER"
Note that maintenance is being performed:
../util/set-notice.sh maintainence
Then, mark the database as read-only.
../util/set-db-readonly.sh $OLD_DEPLOYMENT true
Create a snapshot:
../util/create-snapshot.ts $OLD_DEPLOYMENT upgrade-machine-$DATE
git add . && git commit -m "$CUSTOMER: snapshot deployment"
This can take anywhere from a minute to several minutes, depending on how large the disk is.
2) Prepare resource changes in Terraform
For example, in terraform.tfvars:
- machine_type = "n1-standard-8"
+ machine_type = "n1-standard-32"
- data_disk_size = 250 # GB
+ data_disk_size = 500 # GB
Notes:
- See Cost estimation: working out the VM type required for details about what you can choose for machine_type.
- data_disk_size can only increase; it can never be decreased.
3) Spin up an upgraded machine
Set up configuration for the new machine:
cp -R $OLD_DEPLOYMENT/ $NEW_DEPLOYMENT/
git add $NEW_DEPLOYMENT/ && git commit -m "$CUSTOMER: set up $NEW_DEPLOYMENT configuration"
Initialize the new production deployment (NEW_DEPLOYMENT) using the snapshot created in the previous step.
⚠️ When the tool prompts you to apply, exit the command (DO NOT APPLY THE CHANGES). ⚠️
../util/init-deployment.ts $NEW_DEPLOYMENT upgrade-machine-$DATE
We want to make sure our resource changes only apply to our new machine, not both machines. Run the following to spin up the new machine instead:
terraform apply \
-target="google_compute_disk.primary[\"$NEW_DEPLOYMENT\"]" \
-target="google_compute_instance.primary[\"$NEW_DEPLOYMENT\"]"
Make sure the plan indicates 0 to change before applying. Commit your changes:
git add . && git commit -m "$CUSTOMER: init targeted $NEW_DEPLOYMENT deployment"
4) Prepare the new machine
Make the database on the new deployment writable.
../util/set-db-readonly.sh $NEW_DEPLOYMENT false
If you changed the disk size, make sure to resize the disk.
4.a) Resize the disk
Special steps are needed to perform a disk upgrade (i.e. if you changed the data_disk_size variable).
Shut down the deployment:
../util/ssh-exec.sh "cd /deployment/docker-compose && docker-compose down"
Take a look at our mounted disk, /dev/sdb:
../util/ssh-exec.sh "df -Th /dev/sdb"
It should have Type: ext4 and Mounted on: /mnt/docker-data. Notice that the Size is not yet your upgraded disk size:
Filesystem Type Size Used Avail Use% Mounted on
/dev/sdb ext4 246G 37G 209G 15% /mnt/docker-data
To make use of the new disk space, we need to extend the filesystem on the /dev/sdb disk:
../util/ssh-exec.sh "resize2fs /dev/sdb"
Verify the filesystem is extended; you should now see that the Size matches your upgraded disk size:
../util/ssh-exec.sh "df -Th /dev/sdb"
Restart the instance:
../util/ssh-exec.sh "cd /deployment/docker-compose && docker-compose up -d"
For reference, the above steps were adapted from Working with persistent disks.
5) Wrap up the upgrade
- Confirm instance health.
- Switch the load balancer target - ⚠️ when prompted to apply, exit the command (do not apply the changes) and continue to the next step. ⚠️ This will prevent Terraform from attempting to apply resource changes to the old machine - instead, we just update the load balancer and remove the old instance at the same time.
../util/retarget-load-balancer.ts $NEW_DEPLOYMENT
git add . && git commit -m "$CUSTOMER: switch load balancer to new target"
../util/drop-deployment.ts $OLD_DEPLOYMENT drop-disk
rm -rf $OLD_DEPLOYMENT/
git add . && git commit -m "$CUSTOMER: remove $OLD_DEPLOYMENT deployment"
../util/set-notice.sh none
In-place updates
This is the upgrade process for running in-place updates. This approach is riskier than the above processes, because rolling back changes takes significantly longer.
Only use this approach for low-risk patch upgrades or docker-compose container resource changes. Do not use if the patch includes a database change.
0) Upgrade setup
Follow the general setup guide and set the version variables as in the Sourcegraph upgrade setup. Then redefine the NEW_DEPLOYMENT variable to match the existing deployment:
export NEW_DEPLOYMENT=$OLD_DEPLOYMENT
Validate all variables are set:
Are you upgrading to a release candidate build? It is expected that this script will report some errors; use your best judgement to decide whether anything is actually missing.
./util/validate-env.ts
Then set up a branch for your changes:
git checkout -b $CUSTOMER/upgrade-$NEW_VERSION-in-place
# all the below steps are documented assuming you are in the customer deployment directory
cd $CUSTOMER
1) Create a snapshot
Create a snapshot:
../util/create-snapshot.ts $OLD_DEPLOYMENT
git add . && git commit -m "$CUSTOMER: snapshot deployment"
This can take anywhere from a minute to several minutes, depending on how large the disk is.
Note that since we are not marking the database as read-only, this snapshot could become out of date with the live deployment. It is only prepared as an emergency measure - rolling back to this snapshot risks data loss.
2) Prepare upgraded docker-compose
Update version references:
VERSION=$NEW_VERSION ../util/update-docker-compose.sh $NEW_DEPLOYMENT/
git --no-pager diff $NEW_DEPLOYMENT
Check for old version references or merge conflicts:
cat $NEW_DEPLOYMENT/docker-compose/docker-compose.yaml | grep "$OLD_VERSION#v"
cat $NEW_DEPLOYMENT/docker-compose/docker-compose.yaml | grep upstream
Resolve any merge conflicts that have arisen.
3) Apply changes to instance
git add . && git commit -m "$CUSTOMER: upgrade to $NEW_VERSION in-place"
terraform apply
This will only update the instance metadata and not affect the running deployment.
4) Upgrade running deployment
Apply changes to the running deployment by copying the docker-compose file to the instance and re-running docker-compose:
gcloud compute scp --project "$PROJECT_PREFIX-$CUSTOMER" --tunnel-through-iap $NEW_DEPLOYMENT/docker-compose/docker-compose.yaml root@default-$NEW_DEPLOYMENT-instance:/deployment/docker-compose/docker-compose.yaml
../util/ssh-exec.sh "cd /deployment/docker-compose && docker-compose up -d"