Search Core strategy

This page outlines the strategy and goals of the Search Core team over the next year.

For context on the mission, vision, and guiding principles of Search, see the top-level search strategy page.

Where we are now

In , we grew the Sourcegraph Cloud global index to 2.1M repositories, including all repositories with 5 stars or more. Importantly, the changes we made to reach this state were not cloud-specific, and they benefited all Sourcegraph deployments (for instance, through significant reductions in the memory usage of Zoekt, our trigram-based indexed search backend).

In , we conducted discovery work to better understand the bottlenecks of our search infrastructure on large monorepos (>6GB working directory). We also started growing our search index to include repositories from code hosts other than GitHub.com and GitLab.com: for instance, you can now search 34k repositories from src.fedoraproject.org on sourcegraph.com.

What’s next and why

Goals

  • Monorepo performance: At a P75 level, gigarepo will index in less than 30 minutes, indexed searches will complete in under 2s, and unindexed searches will complete in under 10s. Gigarepo is a representative monorepo with a HEAD working copy size of 15GB.
  • Ranking: We start tracking ranking quality, using selected search results as a proxy.
  • Code host coverage: Sourcegraph Cloud indexes public repositories globally from the most popular package hosts.
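As an illustration of how the monorepo performance goal could be checked, here is a minimal sketch that computes a P75 over recorded durations and compares it with the three thresholds above. The nearest-rank percentile definition, the function names, and the shape of the input are assumptions made for this example, not a description of how Sourcegraph measures these numbers.

```python
import math

# Thresholds taken from the goal above, expressed in seconds.
TARGETS_SECONDS = {
    "index": 30 * 60,          # gigarepo indexes in less than 30 minutes
    "indexed_search": 2.0,     # indexed searches complete in under 2s
    "unindexed_search": 10.0,  # unindexed searches complete in under 10s
}


def percentile(samples, p):
    """Nearest-rank percentile (an assumed definition): the smallest sample
    such that at least p percent of all samples are at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def meets_p75_targets(durations):
    """Map each measurement name to True if its P75 is under the target.

    `durations` is a hypothetical dict of measurement name -> list of seconds.
    """
    return {
        name: percentile(durations[name], 75) < limit
        for name, limit in TARGETS_SECONDS.items()
    }


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the input.
    sample = {
        "index": [1500, 1700, 1750, 2000],
        "indexed_search": [0.5, 1.0, 1.5, 3.0],
        "unindexed_search": [4, 8, 12, 20],
    }
    print(meets_p75_targets(sample))
```

Using P75 rather than an average means a handful of very slow outliers does not fail the goal, while a slow quarter of all requests does.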

Details

Monorepo performance: Search performance on large monorepos is a recurrent pain point for large enterprise customers. Having replicated large monorepo setups, we identified that unindexed search performance on monorepos is still poor and that several facets of search on large monorepos cause significant load on gitserver.

Ranking: As a first step towards improving the ranking of our search results, we will start tracking the quality of search results using the index (the position in the result list) of user-selected results as a proxy metric. Having this tracking in place will help us measure the success of future improvements and decide which areas of ranking to focus on.
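To make the proxy metric concrete, here is a minimal sketch of how selected-result indexes could be summarized. It assumes each event records the 0-based position of the result the user selected; the summary statistics and all names are hypothetical choices for illustration, since this page does not specify how the metric is aggregated.

```python
from statistics import mean, median


def ranking_proxy(selected_indexes):
    """Summarize the 0-based positions of the results users selected.

    Lower mean and median values suggest that the results users want
    appear closer to the top of the list.
    """
    if not selected_indexes:
        raise ValueError("ranking_proxy() needs at least one selection")
    top_hits = sum(1 for i in selected_indexes if i == 0)
    return {
        "mean_index": mean(selected_indexes),
        "median_index": median(selected_indexes),
        # Share of selections that landed on the first result.
        "top_result_share": top_hits / len(selected_indexes),
    }


if __name__ == "__main__":
    # Made-up selections: two on the first result, one on the third,
    # one on the seventh.
    print(ranking_proxy([0, 0, 2, 6]))
```

Tracking a summary like this over time would give a baseline against which later ranking changes could be compared.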

Code host coverage: As we keep expanding the code we index to include more code hosts other than GitHub.com and GitLab.com, we will add support for package host integrations (PyPI, RubyGems, NuGet, Crates, proxy.golang.org). This will not only increase our code host coverage, but also be a stepping stone towards unblocking use cases based on the dependency graph of repositories.

What we’re not working on and why

Scaling sourcegraph.com to repositories with 2 stars or more

We had a previous goal to scale the Sourcegraph Cloud global index to include every GitHub.com + GitLab.com repository with 2 stars or more (“more than one star”). While working to achieve this goal, we realized that growing the global index past our current scale wouldn’t have made a meaningful difference to the usability, universality or completeness of Sourcegraph Cloud:

  • Most OSS code search use cases were already well-serviced at our current scale.
  • The repositories in the 2 to 3 star bucket included a large number of low-quality repositories, with content only tangentially related to code, and indexing them would have negatively affected the relevance of search results and performance.
  • Indexing these 2 to 3 star repositories would not have fulfilled the promise of letting users search their own code. That would also require indexing all 0 to 1 star repositories, a much larger set that we are currently not ready to support.

We’ve chosen instead to make progress towards indexing the entire OSS universe by indexing repositories from different code hosts as well as adding support for package host integrations. By focusing on this, we strive to make Sourcegraph.com a truly universal code search engine.

This section lists use cases that are related to this product team, along with the specific relevant features.

  • None