🗞 Developer experience newsletter

Welcome to the developer experience newsletter! This is a newsletter prepared by the DevX team to highlight contributions and updates to Sourcegraph’s developer experience, which is an area the DevX team focuses on but is owned by everyone.

To have your updates highlighted here, please tag your PR or issue with the dx-announce label! If you have questions or feedback, feel free to reach out in #dev-experience or in our discussions as well.

To learn more about components of Sourcegraph’s developer experience, check out the developer documentation.

Feb 24, 2022

Welcome to another iteration of the Developer Experience newsletter, covering notable changes since the Jan 10th issue! As a reminder, you can check out previous iterations of the newsletter in the newsletter archive.

To have your updates highlighted here, please tag your PR or issue with the dx-announce label! If you have questions or feedback, feel free to reach out in #dev-experience or in our discussions as well.

SOC2 compliance processes

A new bot, pr-auditor, is now live in sourcegraph/sourcegraph and is rolling out to a number of other repositories that house code that reaches customers. pr-auditor will add status checks on your pull requests when you edit descriptions to indicate whether or not it has detected a “test plan” within your pull request description. If a “test plan” is not provided by the time a PR is merged, an issue will be created in the sec-pr-audit-trail repository requesting that the PR author document a test plan, or provide a reason for the exception. This serves as an audit log to help us achieve these two SOC2 control points:

GN-104 Code changes are systematically required to be peer-reviewed and approved prior to merging code into the main branch.

GN-105 Application and infrastructure changes are required to undergo functional, security, unit, integration, smoke, regression, and SAST testing prior to release to production.

What is a test plan? A test plan is denoted by content following # Test plan, Test plan:, ### Test Plan:, etc. within a pull request description. All pull requests must provide test plans that indicate what has been done to test the changes being introduced. Testing methodologies could include:

  • Automated testing, such as unit tests or integration tests
  • Other testing strategies, such as manual testing, providing observability measures, or implementing a feature flag that can easily be toggled to limit impact

Pull request reviews are now also required by default. Branch protections have been enabled in sourcegraph/sourcegraph. In other repositories with pr-auditor, review checks can only be opted out of by including No review required: ... within a pull request’s test plan.
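
To make the detection rules above concrete, here is a minimal sketch in Go of how a description could be checked for a test plan and a review opt-out. This is not pr-auditor's actual implementation: the regular expressions, the auditResult type, and the auditDescription function are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// testPlanHeader matches headings such as "# Test plan", "Test plan:" or
// "### Test Plan:" at the start of a line (case-insensitive).
// Hypothetical pattern, not taken from pr-auditor.
var testPlanHeader = regexp.MustCompile(`(?im)^\s*#*\s*test plan:?`)

// noReviewMarker matches the "No review required: <reason>" opt-out.
var noReviewMarker = regexp.MustCompile(`(?i)no review required:\s*\S+`)

// auditResult is a hypothetical summary of what was found in a description.
type auditResult struct {
	HasTestPlan    bool
	ReviewOptedOut bool
}

// auditDescription inspects a pull request description. The test plan is
// taken to be everything that follows the first test plan heading.
func auditDescription(body string) auditResult {
	loc := testPlanHeader.FindStringIndex(body)
	if loc == nil {
		return auditResult{}
	}
	plan := strings.TrimSpace(body[loc[1]:])
	return auditResult{
		HasTestPlan:    plan != "",
		ReviewOptedOut: noReviewMarker.MatchString(plan),
	}
}

func main() {
	body := "Fix typo in docs.\n\n## Test plan\n\nNo review required: docs-only change."
	fmt.Printf("%+v\n", auditDescription(body))
}
```

An empty heading with nothing after it is treated as a missing test plan in this sketch, since a heading alone does not say how the change was tested.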

To learn more, refer to our updated testing guidance. You can find DevX SOC compliance documentation by control point in this search notebook. If you have any questions or feedback, please do not hesitate to reach out in #dev-experience or in our GitHub discussions!

Internal tools and libraries

Database migrations update

We have now eradicated two classes of errors related to database migrations:

  1. On the site-administrator and ops side, we no longer spuriously mark the database as dirty and give up any attempt at migrations at the first sign of trouble. We no longer immediately fail an upgrade because of the mere presence of an empty table or a concurrently created index. Now we only fail for actual reasons.
  2. On the development side, we no longer have to worry about two independently created migrations clashing only after both are merged into main. That was very annoying to me and now it will never, ever happen again. See the help page for the new sg migration command to explore the new tooling.

See the migrator docs for additional info.

New lib/errors package and MultiErrors type

All errors in Sourcegraph backend services should now use the new github.com/sourcegraph/sourcegraph/lib/errors package. This consolidation helps us restrict and control the ways that we can create, consume, and compare errors, and will allow us to control library behavior clashes more easily in the future. #30558

Additionally, all usages of the old MultiError type have been replaced with a new, custom multi-error implementation (#31466, #698). This new error type is an interface that behaves much more like regular errors, prevents errors from disappearing due to library conflicts as was previously the case, and supports introspection with errors.Is, errors.As, and friends much more consistently.

var err errors.MultiError
for _, fn := range thingsToDo {
  err = errors.Append(err, fn())
}
return err

Check out the source code in lib/errors.

Actor propagation reminder

Unified actor propagation was introduced a few months ago as part of an effort to enable the implementation of sub-repository permissions across all Sourcegraph features. There have been gradual efforts to roll out this actor propagation to more services, which may cause behavioural changes that impact how permissions are handled if, for example, internal actors are not set explicitly. When implementing new features please ensure that actors are correctly set and read from contexts.

To learn more, check out the intro to actor propagation search notebook.

New teams package

There is now a unified library for interacting with Sourcegraph teammates for whatever fun integrations you want to build! It leverages team.yaml data as well as additional GitHub and Slack metadata:

import "github.com/sourcegraph/sourcegraph/dev/internal/team"

func main() {
  // Neither a GitHub client nor a Slack client is required, but each enables more ways
  // to query for users and/or get additional metadata about a user.
  teammates := team.NewTeammateResolver(githubClient, slackClient)
  tm, _ := teammates.ResolveByName(ctx, "Robert")
  println(tm.SlackID)
  println(tm.HandbookLink)
  println(tm.Role)
  // etc.
}

sg teammate, branch lock notifications, and Buildkite failure mentions are all powered by this API.

Continuous integration

Slack mention notifications

We now generate notifications for failed builds based on the author of each commit (using the new teams package). Make sure to set up your teams.yaml entry with your GitHub handle to get notified when your changes fail in main!

Pipeline readability improvements

Pipeline operations can now be configured into groups with operations.NewNamedSet (#30381). The result looks like this:

[Image: grouped operations]

sg ci preview also leverages this grouping to improve readability of pipeline steps, as well as now leveraging a terminal Markdown renderer to generate nicer output! (#30724)

Build traces are now uploaded to Honeycomb

Build traces are now uploaded to Honeycomb to dive into the performance of each command that gets run in a pipeline! A link to the uploaded build trace is added as an annotation on the results of each Buildkite build.

[Image: trace example]

To learn more, check out the Pipeline command tracing docs.

Test analytics preview

We have started rolling out Buildkite test analytics support for Go tests and a subset of frontend tests that get run in continuous integration. This is still an experimental Buildkite feature, but you can learn more about it in our Test analytics docs.

Pipeline documentation

A new command, sg ci docs, can now render a full, up-to-date reference of various run types that our pipeline can generate as well as example pipelines of each, such as what gets run with various diff types. You can also see a web version of this in the Pipeline types reference.

Our pipeline development guide has also been refreshed with updated content, featuring a series of embedded search notebooks! This includes new guidance on:

Generate builds using run types

sg ci build now supports an additional argument to automatically generate a Buildkite build using a specified run type (#30932). For example, to create a main dry run build:

sg ci build main-dry-run

This now also supports run types that require arguments, such as docker-images-patch - learn more in #31193.

[Image: sg ci build]

Coming soon: stateless Buildkite agents

We will soon be rolling out stateless Buildkite agents to all pipeline builds. These should improve the stability and reliability of all pipelines by removing any issues that might be caused by lingering state from other builds. Learn more in this Loom demo! (#31003)

Optimizations

  • Faster server and gitserver Docker image builds: after the addition of p4-fusion artifacts, the gitserver Docker image build time increased to 4 minutes, which also impacted the server image. This has been fixed by caching the resulting binary, which brought the build time for gitserver down to about 40 seconds, thanks to #31317.
  • go-mockgen is now much faster: a misconfiguration was causing go-mockgen to be downloaded multiple times throughout a go generate run. This has been fixed, and go generate runs are now much faster (#31597).

Local development

Each log entry now prints an iTerm link that opens the log statement’s source file:line in VS Code (#30439).

Workaround for macOS firewalls

A new -add-to-macos-firewall flag, enabled by default on macOS, is now available on sg start and sg run to avoid all those pop-up prompts you get in macOS when firewalls are enabled. #30747

If this causes issues for you, the behaviour can be disabled with -add-to-macos-firewall=false.

sg highlights

You can now see what has changed as part of your fresh sg installation with the sg version changelog command! You can also use it to see what’s coming up next with sg version changelog -next. #30697

sg start now waits for all commands to install before starting them (#29760).

M1 Macs no longer require any additional workarounds (#29815).

sg checks docker now features a custom Dockerfile parser to enable more powerful checks, such as validating apk add arguments, and also runs more of the existing checks. It now powers the Docker check in CI as well! (#31217)

sg setup now features an overhauled checks system to make sure your dev environment is ready to go (#29849).

sg setup now supports Ubuntu as a first-class citizen and provides automated installation (#31312).

Jan 10, 2022

Happy new year, and welcome to another iteration of the Developer Experience newsletter! It’s been a little while since the last issue, so this is going to be a long one 😄 As a reminder, you can check out previous iterations of the newsletter in the newsletter archive.

To have your updates highlighted here, please tag your PR or issue with the dx-announce label! If you have questions or feedback, feel free to reach out in #dev-experience or in our discussions as well.

Internal tools and libraries

Backward-compatible database migrations are now enforced

Backward-compatible database migrations are now enforced in the CI pipeline for sourcegraph/sourcegraph - see the PR to re-enable the check at #28872. This PR contains some initial documentation on writing backwards-compatible migrations, but it is still a work in progress.

What is a backwards-compatible migration?: A migration is backwards-compatible with a particular Sourcegraph version if the changes it makes can be applied to an instance running that version without ill effect.

What has already changed? (TL;DR): We’ve removed our use of golang-migrate that ran database migrations on startup of the frontend service and added a migrator service that runs database migrations separately from and prior to instance upgrades. This puts us well on our way to removing the entire class of frequent “dirty database” bugs that plague many site-administrators on every upgrade.

What else is changing?: We will soon be enforcing that the unit tests of the previous minor release continue to pass with the newest database schema. This gives high confidence that any changes to the database will not negatively affect a running instance (behind at most one minor version). This allows site-administrators to upgrade an instance without requiring downtime to run the migrations.

Of course, this check will come with escape hatches in the event of flake or test failures that are locked in the past. We’re currently fleshing out the documentation on the subject, so keep an eye out for updates!

For the full announcement or to leave comments, check out the Slack discussion!

Actor propagation

Actors (used to identify a request in the context of a user or internal actor) are now propagated across all internal requests when using the httpcli library, and the various approaches for propagating actors across services have been standardized with the new actor.HTTPMiddleware. This makes it easier to enforce permissions across services. For more details, see #28117.
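
For readers unfamiliar with the pattern, here is a minimal sketch of how an actor can travel from one service's context, through an HTTP header, into another service's context. It uses only the standard library and is not the actual actor or httpcli code: the header name X-Example-Actor-UID and the withActor, actorFrom, actorTransport, actorMiddleware, and roundTripActor names are all hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// headerActorUID is a hypothetical header name chosen for this sketch.
const headerActorUID = "X-Example-Actor-UID"

type actorKey struct{}

// withActor returns a context carrying the given actor identifier.
func withActor(ctx context.Context, uid string) context.Context {
	return context.WithValue(ctx, actorKey{}, uid)
}

// actorFrom reads the actor identifier from the context, or "" if unset.
func actorFrom(ctx context.Context) string {
	uid, _ := ctx.Value(actorKey{}).(string)
	return uid
}

// actorTransport copies the actor from an outgoing request's context into a header.
type actorTransport struct{ base http.RoundTripper }

func (t actorTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	req = req.Clone(req.Context()) // a RoundTripper must not modify the caller's request
	if uid := actorFrom(req.Context()); uid != "" {
		req.Header.Set(headerActorUID, uid)
	}
	return t.base.RoundTrip(req)
}

// actorMiddleware restores the actor from the header into the request context.
func actorMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx := withActor(r.Context(), r.Header.Get(headerActorUID))
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// roundTripActor sends a request as uid to a local test server and returns
// the actor that the receiving handler observed.
func roundTripActor(uid string) (string, error) {
	seen := make(chan string, 1)
	srv := httptest.NewServer(actorMiddleware(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) { seen <- actorFrom(r.Context()) },
	)))
	defer srv.Close()

	ctx := withActor(context.Background(), uid)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, srv.URL, nil)
	if err != nil {
		return "", err
	}
	client := &http.Client{Transport: actorTransport{base: http.DefaultTransport}}
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	return <-seen, nil
}

func main() {
	seen, err := roundTripActor("user-42")
	fmt.Println(seen, err)
}
```

The takeaway for feature work is the same as in the real packages: if the calling code never sets an actor on its context, the receiving service sees an empty one, and permission checks will behave accordingly.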

Database connections

dbconn.Global has been removed! This is a huge step towards bringing better database mocking to the entire codebase (check out the code insights dashboard tracking relevant migrations!)

[Image: migration from global database mocks]

Tracking issues

Tracking issues now support a new marker, <!-- OPTIONAL LABEL: my-label -->, that allows you to add labels on a tracking issue that do not need to be present on child issues for them to be considered part of this tracking issue. This is useful for making tracking issues easier to find without adding labels to every single issue within the tracking issue. For more details, see #28665.
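
As a rough illustration of the marker semantics described above, the sketch below extracts optional labels from a tracking issue body and ignores them when deciding whether a child issue belongs to the tracking issue. It is not the tracking issue tool's real code: the regular expression and the optionalLabels and belongsToTracking functions are assumptions for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// optionalLabelMarker matches markers such as <!-- OPTIONAL LABEL: my-label -->.
var optionalLabelMarker = regexp.MustCompile(`<!--\s*OPTIONAL LABEL:\s*(\S+)\s*-->`)

// optionalLabels returns the labels named by OPTIONAL LABEL markers in a
// tracking issue body.
func optionalLabels(body string) map[string]bool {
	labels := map[string]bool{}
	for _, m := range optionalLabelMarker.FindAllStringSubmatch(body, -1) {
		labels[m[1]] = true
	}
	return labels
}

// belongsToTracking reports whether a child issue carries every label of the
// tracking issue, ignoring the labels marked as optional.
func belongsToTracking(trackingLabels []string, optional map[string]bool, childLabels []string) bool {
	child := map[string]bool{}
	for _, l := range childLabels {
		child[l] = true
	}
	for _, l := range trackingLabels {
		if optional[l] {
			continue
		}
		if !child[l] {
			return false
		}
	}
	return true
}

func main() {
	body := "Plan for this iteration.\n<!-- OPTIONAL LABEL: roadmap -->\n"
	optional := optionalLabels(body)
	tracking := []string{"team/devx", "roadmap"}
	fmt.Println(belongsToTracking(tracking, optional, []string{"team/devx"}))
}
```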

Continuous integration

Subsequent main pipeline failures will now result in a branch lock

In response to a variety of CI incidents (including INC-21 at the end of September) we have introduced automated branch locks via a tool called buildchecker. When buildchecker detects a series of CI failures, it will now automatically restrict push access to main to authors of recent failed builds and the DevX team until it sees a passed build, at which point it will unlock the branch. A notification will also be posted in Slack to #buildkite-main, mentioning the relevant teammates.
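
The locking rule can be sketched as a small pure function. This is only an approximation of the behaviour described above, not buildchecker's implementation: the build and lockDecision types, the checkBuilds function, and the idea of a fixed failure threshold are assumptions.

```go
package main

import "fmt"

// build is a simplified, hypothetical view of a CI build on main.
type build struct {
	Author string
	Passed bool
}

// lockDecision describes whether main should be locked and who may still push.
type lockDecision struct {
	Locked  bool
	Allowed []string
}

// checkBuilds inspects builds ordered newest first. If the most recent
// `threshold` builds all failed, the branch is locked and only the authors of
// those failed builds, plus the always-allowed users, may push. Any passed
// build within that window leaves the branch unlocked.
func checkBuilds(builds []build, threshold int, alwaysAllowed []string) lockDecision {
	if threshold <= 0 || len(builds) < threshold {
		return lockDecision{}
	}
	allowed := append([]string{}, alwaysAllowed...)
	seen := map[string]bool{}
	for _, u := range alwaysAllowed {
		seen[u] = true
	}
	for _, b := range builds[:threshold] {
		if b.Passed {
			return lockDecision{}
		}
		if !seen[b.Author] {
			seen[b.Author] = true
			allowed = append(allowed, b.Author)
		}
	}
	return lockDecision{Locked: true, Allowed: allowed}
}

func main() {
	recent := []build{
		{Author: "alice", Passed: false},
		{Author: "bob", Passed: false},
		{Author: "alice", Passed: false},
	}
	fmt.Printf("%+v\n", checkBuilds(recent, 3, []string{"devx"}))
}
```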

It is the responsibility of authors of recently failed builds to investigate what might have gone wrong, seek help if needed, and help get the pipeline back to green. We hope this will prevent long periods of time where many commits to main go untested due to failing jobs. To learn more, check out the branch lock playbook.

We’ve also made significant investments towards improving and streamlining the pipeline for better stability and observability - most recently, a large number of E2E/QA tests were dropped - which will hopefully help with minimizing locks triggered by test and infrastructure flakes.

Specifying tool and language versions run by any CI pipeline

In response to INC-59 we have reworked how the tool and language versions used in a given CI job are specified. Previously, the agents were running a mix of asdf and natively installed versions, which created trouble when diagnosing build failures that weren’t caused by the tests themselves.

It is now the responsibility of each repository to provide an adequate .tool-versions file that defines the versions it needs. For example, there is no longer a pre-installed Go version. Presently, this approach is limited by the need to have the plugin for each tool installed beforehand on the agent images (we are working on removing this limitation). The overarching goal is to make the agents reasonably independent from what they are actually building.
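
For context, an asdf-style versions file is a plain list of "<tool> <version>" lines. The sketch below parses that shape in Go. The parseToolVersions function is hypothetical, the tool names and version numbers are purely illustrative, and only the first version on each line is kept (asdf itself allows fallback versions after it).

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseToolVersions parses asdf-style content with one "<tool> <version>"
// pair per line. Blank lines and "#" comments are ignored, and only the
// first version listed for a tool is kept.
func parseToolVersions(content string) (map[string]string, error) {
	versions := map[string]string{}
	sc := bufio.NewScanner(strings.NewReader(content))
	for line := 1; sc.Scan(); line++ {
		text := sc.Text()
		if i := strings.Index(text, "#"); i >= 0 {
			text = text[:i] // strip trailing comments
		}
		fields := strings.Fields(text)
		if len(fields) == 0 {
			continue
		}
		if len(fields) < 2 {
			return nil, fmt.Errorf("line %d: expected \"<tool> <version>\", got %q", line, sc.Text())
		}
		versions[fields[0]] = fields[1]
	}
	return versions, sc.Err()
}

func main() {
	// Illustrative content only; these are not the versions any repository pins.
	content := "# pinned tools\ngolang 1.17.5\nnodejs 16.7.0 # example\n"
	versions, err := parseToolVersions(content)
	fmt.Println(versions, err)
}
```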

E2E and QA Tests survey results

RFC 544 explored the results of the E2E and QA tests survey. Thanks to the efforts of every team that took part in that survey, a large number of irrelevant tests have been removed. As a result, those tests are about seven minutes faster than before and the average build time on the main branch is hovering around the 20-minute mark instead of 25 minutes.

There is more to come on that topic: the Frontend Platform team has plans to rework those tests and to provide guidance on how to write them reliably.

Buildkite agent selection

Buildkite pipeline steps should now explicitly declare queue: standard to avoid experimental or temporary agents. For more details, see infrastructure#2939.

Terraform vulnerability scanning

The security team has introduced Checkov checks to the infrastructure repository and performed a cleanup to fix or suppress all high and critical issues!

Going forward, the Checkov step of the infrastructure pipeline will be set to fail in the event it finds a Terraform security issue. If the pipeline fails, a warning block will be displayed in the pipeline output - a link will take you to the handbook with guidance on how to continue, and additional output will help point you towards how to correct the issue. For more details, see Checkov Terraform vulnerability scanning.

If anyone has any questions or issues, please post in the #security channel!

Sentry integration to monitor internal pipeline scripts and hooks

There are scripts and components of the CI pipeline that should never fail, independently of the test results. These have proved to be hard to monitor, especially when the scripts are called from build hooks. Being notified when these failures happen enables faster reaction time. Here is an example of how to monitor a command so that a Sentry issue is created in the Buildkite project on a non-zero exit code.

Observability

The previous raw Grafana configuration used to add template variables to dashboards has been replaced with Container::Variables that abstracts away a lot of the behind-the-scenes dashboard config and potential gotchas to make it easier to define template variables on dashboards! Dashboard template variables are used to filter individual panels down by substituting variables in panel queries. Learn more in the ContainerVariable API docs.

[Image: template variables]

Local development

Revamped introductory documentation

The local development docs homepage has been revamped! Check it out at docs.sourcegraph.com/dev. The quickstart docs have also been overhauled with a streamlined setup experience featuring sg setup, which has been greatly improved!

sg improvements

sg now ships a command that can reset databases as well as create a site-admin: sg db (early adopters may have seen it under the name of sg reset). You can read more about the sg db [reset-pg|reset-redis|add-user] commands in the documentation.

If you have ideas of other features that would be great, don’t hesitate to join the sg hack hour on Fridays at 4PM UTC!

Nov 23, 2021

Hello everyone, and welcome to another iteration of the Developer Experience newsletter!

To have your updates highlighted here, please tag your PR or issue with the dx-announce label! If you have questions or feedback, feel free to reach out in #dev-experience or in our discussions as well.

Onboarding

Significant progress has been made with sg setup, a new command that is slated to replace all the manual finagling that must be done today to set up a Sourcegraph development environment. See a sneak peek of the upcoming iteration of the tool here!

Continuous integration

The Dev Experience team is proposing a “build sheriff” rotation in RFC 515, with the goal of distributing knowledge and responsibilities around our CI infrastructure to all of engineering through regular rotations of “build sheriffs”.

You may have noticed a daily update in #dev-experience providing an overview of how CI has behaved that day—this will be helping us track our progress towards a flake-free pipeline! If you need more details, a dashboard is now available in Grafana Cloud that features an overview of recently failed builds, steps, and potentially relevant logs. You can use this to see if lots of builds are failing on similar steps, which steps are the most problematic, and whether the issues are potentially related. A link can also be found in the Slack summaries. Let us know what you think on #26118!

This dashboard is powered by build logs that are now parsed from Buildkite output and uploaded to Loki, a log database available for query in Grafana Cloud using LogQL. Try it out here! This can be especially useful when seeing if a build issue is a common recurrence.

We are also trialing a number of additional annotations for build failures that should serve to help surface actionable errors more easily, and are working towards exporting an API for it that will enable more checks to easily add digestible output to builds. Let us know in #dev-experience if you have any ideas for how this could be improved!

Observability

A revamp of how Honey events are created has been proposed in #27964, furthering work on turning internal/observation into the go-to package for all application observability needs.

Distributed tracing is now available on worker jobs, enabling Jaeger traces to be collected for worker job processing. This is currently only enabled for precise-code-intel-worker in Cloud, and enabling this for other workers is in the works.

RFC 501 REVIEW: Runtime error monitoring implementation is also progressing, which will allow errors to be more easily surfaced in Sentry to complement alerting.

Code health

Work on reducing usages of globals has continued with improvements to how site configuration is accessed that allow site configuration clients to be injected into places that require them. This makes site configuration easier to mock out and test without replacing a global variable in mocks.

On a similar note, tests have been undergoing incremental updates to leverage the more ergonomic and self-contained database mocks—a brief guide is available if you know an area of the codebase that could use a similar update!

Nov 2, 2021

Hello everyone! Welcome back to the Developer Experience newsletter. It is a compilation of announcements related to development experience at Sourcegraph. DevX is a global effort.

To be mentioned here in the next iteration, please tag your PR or issue with dx-announce!

DevX team mission statement

Published Developer Experience team mission and strategy: handbook.sourcegraph.com/company/strategy/enablement/dev-experience

Buildkite incident post-mortem(s)

On Sep 19th, for about two hours, it wasn’t possible to interact with any container registries from Google Cloud Platform, which interrupted the process for release 3.33. You can find the detailed report here: Postmortem Review: INC-25 Buildkite pipelines are not able to interact with container registry.

On October 26th, for another two hours, the pipeline agents were down. You can find the detailed report here: REVIEW: INC-30 Buildkite pipelines are failing due to pipeline generator failing to run.

CI Pipeline highlights!

  • All-in-one pipeline - check all your build jobs in one place! #26051
  • Cross-build search for Buildkite failures: #26259
  • We are now measuring how long the pipeline stays red per day. It captures both how reliable the pipeline is and how fast it gets back to green.
    • 22nd: red for 1h8m
    • 21st: red for 1h34m
    • 20th: red for 21m
    • 19th: red for 2h4m
    • 18th: red for 54m
  • Contractors are now able to access CI builds, as long as they prefix their PR with contractors/ and they have been manually added to the Buildkite contractors team.
  • SQL queries are now displayed on failure in Go tests, both locally and in the CI. #26020
  • One less papercut: remember the warning sign at the beginning of every step’s logs? It’s not there anymore. #26233
  • RFC 497 WIP: Restructuring CI Experience is now open for feedback!

SG Highlights

sg is a CLI tool that wraps commands to run the local environment and interact with various Sourcegraph resources such as CI builds or RFCs.

A new home for sg documentation: https://docs.sourcegraph.com/dev/background-information/sg

  • sg ci logs - browse, grep, or save Buildkite output
    • Try the Loki integration locally for advanced search! #25835
  • sg ci status --wait - get notified as soon as your Buildkite build completes
  • sg version: displays what version of sg you’re currently running. Adding it to your message when requesting support for sg will really help!
  • Our first bug report that came from the community has been fixed! sg: include original err in install err

From the wider Sourcegraph community

Oct 8, 2021

Hello everyone! This is the first iteration of the Developer Experience newsletter. It is a compilation of announcements related to development experience at Sourcegraph.

To be mentioned here in the next iteration, please tag your PR or issue with dx-announce!

A team has been created

The Developer Experience team was created in mid-September! Our first goal is to improve the CI experience.

Buildkite incident post-mortem

Between Sep 21st and Sep 24th, our main branch builds were failing. Due to the difficulties we were having getting them to pass reliably, we escalated it to an incident. To prevent new failures from piling up on the already broken branch, we made a decision to lock the main branch, which is a pretty unusual event.

You can find the detailed report in Postmortem REVIEW: INC-21 Builds failing on main, which is now in a reviewable state and open to feedback and any inputs.

We’d like to dearly thank all of those who helped to fix this: Patrick Dubroy, Robert Lin, Eric Fritz, Valery Bugakov, Tomás Senart, Thorsten Ball, Dax McDonald, Geoffrey Gilmore, Erik Seliger, Dave Try and JH. With the actions we’ve proposed in the postmortem, we don’t expect such an event to happen in the future.

Pipeline improvements

The CI is what enables us to feel confident when delivering our changes to our users, and is one of the key components enabling Sourcegraph to deliver quality software.

Previously, it was really hard to find time to improve the CI because it was competing with infrastructure work in terms of prioritization, making it a frustrating but rational choice. With the recent team reorganization, making that hard choice is not a problem anymore as this component is now owned by the DX team.

Following up on the above incidents, it became absolutely clear that the CI is a big contender in the list of pains faced by everyone. The good news is that it’s a pretty actionable one!

Let’s start with some numbers:

  • August average build time on the main branch: 19m57s, on PRs 20m24s
  • September average build time on the main branch: 27m47s, on PRs 22m32s (1)
  • October average build time on the main branch: 17m48s, on PRs 9m34s

  • Pull requests now run a smaller set of checks on average, and it is easier to add additional PR checks of your own that run over subsets of code that you care about within the pipeline generator. See the Introductory documentation to help you get started with hacking on the pipeline generator.
  • Puppeteer tests are now run in parallel across multiple smaller steps, netting almost a 50% improvement :fire:
  • (1) spiked because that’s when the executor pipeline was introduced.

What’s next?

Observability is crucial to knowing when and on what to act. This led to the creation of RFC 496 REVIEW: Continuous integration observability, which is now in a reviewable state for everyone.

More speed improvements on the builds are being worked on, stay tuned!

sg is officially entering our daily workflows

What if we had a tool that would be the entry point to interact with our development environment? That’s the idea behind sg! Thorsten Ball has been driving this, with contributions from many other engineers. After a few months in a beta state, it’s now becoming an integral part of our workflow.

sg is now the default way to run the Sourcegraph development environment locally.

  • After half a year of working on sg, the PR to remove what we once knew as dev/start.sh and enterprise/dev/start.sh has been merged. Adios, 993 lines of shell script!
  • The docs have also been updated: the Getting Started guide now uses sg.

But wait, there’s more! A new group of commands has been added, the “ci” commands.

  • sg ci preview: You can now preview which steps your branch is going to run on the CI with the sg ci preview command. See something that shouldn’t be running in there? Open a PR on the pipeline generator!
  • sg ci status: No more clicking around to find the current build in Buildkite!
  • sg ci build: will trigger a manual build, useful if working from a sourcegraph fork.

And additional goodies: the sg teammate time and handbook commands, which tell you the current local time of a teammate who lives very far from you, without having to leave your terminal.

What’s next?

This is just the beginning. Work on sg setup has begun. The idea is that we can reduce the Getting Started guide from 8 pages down to “install sg and run sg setup”.

Grafana Cloud is now available to all!

Just sign up via GSuite SSO on https://sourcegraph.grafana.net. This Grafana instance currently has logs for Sourcegraph Cloud, available for search with LogQL via Loki. It has support for querying inferred fields from log messages, filtering for substring matches, and more. Try it out!

Metrics and parity with /-/debug/grafana is on the roadmap—follow #25407 for updates on that!

Shoutouts to teammates that improved our dev experience in September: Robert Lin, Valery Bugakov, Thorsten Ball, JH, Camden Cheek, Erik Seliger, Coury Clark and Quinn Slack.