Rebuilding Search for High Availability in GitHub Enterprise Server: A Step-by-Step Guide

Introduction

Search is the backbone of GitHub Enterprise Server. From issue filters and pull request counts to the releases and projects pages, nearly every interaction depends on reliable, fast search indexes. For years, administrators managing High Availability (HA) setups faced a tricky balancing act: keeping Elasticsearch clusters healthy across primary and replica nodes while avoiding locked states and index corruption. This guide walks through the exact process GitHub engineering used to rebuild the search architecture, eliminating fragility and making HA maintenance far less risky. Whether you're planning a migration or just want to understand the logic behind the changes, these steps cover the core challenges, attempted fixes, and the final architecture shift that delivered durability.

Source: github.blog

What You Need

  • A GitHub Enterprise Server HA environment (primary and replica nodes)
  • Elasticsearch (version 7.x or earlier, as used in the legacy setup)
  • Administrative access to both primary and replica nodes
  • Familiarity with sharding, replication, and HA leader/follower patterns
  • A test or staging environment to validate changes
  • Backup of existing search indexes
  • Time for careful planning and rollback procedures

Step 1: Understand the Problem with Elasticsearch Clustering Across HA Nodes

Before any rebuild, you must grasp why the old approach failed. In GitHub Enterprise Server HA, the primary node handles all writes and traffic, while replicas stay in sync as read-only followers. Elasticsearch was originally integrated as a cluster spanning both primary and replica nodes. This allowed each node to serve search requests locally, but it introduced a critical flaw: Elasticsearch could elect a primary shard on a replica node. If that replica went down for maintenance, the entire search subsystem could lock up. The replica would wait for Elasticsearch to become healthy, but Elasticsearch couldn’t recover until the replica rejoined—a classic deadlock.

Additionally, upgrades had to follow a precise order; any deviation could corrupt indexes or leave them locked. The clustering across servers gave performance benefits but made the system brittle. Your first job is to document this topology in your environment and map every dependency.
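As a starting point for that mapping, you can check whether any primary shards currently live on a node you plan to take offline. This is a minimal sketch using the standard `_cat/shards` API; the node name `ghes-replica-1` and the `ES_URL` placeholder are assumptions to adapt to your topology.

```shell
# check_primary_on_node: read `_cat/shards?h=index,shard,prirep,state,node`
# output on stdin and print any primary ("p") shards held by the named node.
check_primary_on_node() {
  awk -v node="$1" '$3 == "p" && $5 == node {print}'
}

# Typical use against a live cluster (ES_URL and node name are placeholders):
#   curl -s "$ES_URL/_cat/shards?h=index,shard,prirep,state,node" \
#     | check_primary_on_node "ghes-replica-1"
```

Any output from this check means taking that node down risks exactly the deadlock described above.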

Step 2: Implement Health Checks and Drift Correction

Based on the initial discovery, GitHub engineers added layers of monitoring. They implemented checks to verify Elasticsearch cluster health before allowing replica promotions or maintenance windows. These checks confirmed that all shards were allocated and that no primary shards resided on a replica that might be taken offline.

If drifting state occurred (e.g., a replica became unhealthy while still holding a primary shard), automated scripts tried to reallocate shards or restart Elasticsearch in a safe sequence. However, these were reactive measures, not a cure. They reduced incidents but did not eliminate the root cause. In your rebuild, you should replicate these checks as a temporary safeguard while you design the permanent fix.
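A temporary safeguard along these lines can be as simple as gating maintenance on the cluster health endpoint. The sketch below parses the `status` field from `GET /_cluster/health` and refuses to proceed unless the cluster is green (all primary and replica shards allocated); `ES_URL` is a placeholder for your cluster address.

```shell
# es_health: extract the "status" field from the JSON body of
# GET /_cluster/health supplied on stdin (green | yellow | red).
es_health() {
  sed -n 's/.*"status":"\([a-z]*\)".*/\1/p'
}

# Pre-maintenance gate (ES_URL is a placeholder):
#   status=$(curl -s "$ES_URL/_cluster/health" | es_health)
#   [ "$status" = "green" ] || { echo "abort: cluster is $status"; exit 1; }
```

Remember that this is a reactive guard, not the permanent fix; it only stops you from starting maintenance at a bad moment.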

Step 3: Attempt a Search Mirroring System

The next major attempt was to build a “search mirroring” system—an alternative that would decouple search data from the HA replication layer. The idea was to have a separate, dedicated Elasticsearch cluster that synchronizes with the primary node independently, avoiding the cross-node shard movement problem.

Database replication is notoriously hard, and GitHub discovered that mirroring without consistency guarantees caused data mismatches. The effort required strong consistency, which conflicted with the eventual consistency model Elasticsearch uses by default. After extensive prototyping, this approach was shelved because it couldn’t meet the reliability bar for production.

Key takeaway from this step: do not rely on naive mirroring. Any solution must preserve consistency and be testable in your staging environment.

Step 4: Adopt a Decoupled Elasticsearch Architecture

The final solution came from stepping back and questioning the assumption that Elasticsearch must be part of the HA cluster. GitHub engineering moved Elasticsearch to an independent cluster that runs separately from the primary/replica topology. This cluster has its own replication and failover, managed entirely by Elasticsearch’s native mechanisms. The primary and replica nodes of GitHub Enterprise Server still exist, but they no longer participate in the Elasticsearch cluster.


Instead, search requests are routed to the Elasticsearch cluster, which is designed to be resilient independently. If a GitHub replica goes down for maintenance, Elasticsearch continues serving without deadlock. Upgrades become simpler: you can upgrade Elasticsearch nodes in rolling fashion without affecting GitHub server operations.

Implementation steps for this architecture:

  1. Provision a new, dedicated Elasticsearch cluster (e.g., 3 nodes for production).
  2. Configure the GitHub Enterprise Server primary to index data into this new cluster (instead of the old local instance).
  3. Set up replication within the Elasticsearch cluster (e.g., ensure at least one replica shard per index).
  4. Update all search endpoints (issues, releases, projects, counts) to point to the new cluster URL.
  5. Perform a full reindex from GitHub to the new Elasticsearch cluster.
  6. Test all search features in a non-production environment.
  7. Switch traffic to the new cluster and decommission the old cross-node setup.
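Steps 3 and 5 can be sketched with the standard `_settings` and `_reindex` APIs. The cluster URL and index names below are placeholders, and the script defaults to a dry run that prints the requests rather than sending them, which is useful while validating in staging.

```shell
NEW_ES="${NEW_ES:-http://es-cluster.internal:9200}"  # placeholder address
DRY_RUN="${DRY_RUN:-1}"   # default to dry-run; set DRY_RUN=0 to send for real

# run: echo the curl invocation in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "curl $*"; else curl "$@"; fi
}

# Step 3: ensure at least one replica shard for every index.
replicas_req=$(run -s -X PUT "$NEW_ES/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index":{"number_of_replicas":1}}')

# Step 5: server-side reindex into the new cluster's target index
# ("issues-old" and "issues" are illustrative names).
reindex_req=$(run -s -X POST "$NEW_ES/_reindex" \
  -H 'Content-Type: application/json' \
  -d '{"source":{"index":"issues-old"},"dest":{"index":"issues"}}')

echo "$replicas_req"
echo "$reindex_req"
```

The dry-run default is deliberate: you can diff the emitted requests against your runbook before pointing the script at production.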

This change eliminated the primary-shard-on-replica problem entirely. It also simplified maintenance: administrators no longer need to follow a specific order when applying updates.

Tips for a Successful Search Architecture Rebuild

  • Start in a staging environment: Never apply these changes directly to production. Replicate your HA setup with identical hardware and software versions to validate every step.
  • Monitor shard allocation: Use Elasticsearch APIs (e.g., _cat/shards) to ensure no shards are unassigned or relocated to unexpected nodes after failover tests.
  • Test upgrade sequences: Simulate the exact upgrade path you’ll use in production. Verify that you can upgrade Elasticsearch nodes without causing downtime to GitHub search.
  • Prepare rollback scripts: In case the new architecture introduces performance issues, have a plan to revert to the old clustered setup quickly (including snapshot restoration of indexes).
  • Educate your team: Document the new architecture and train administrators on the differences. The mental model changes from “primary/replica with shared Elasticsearch” to “independent Elasticsearch cluster.”
  • Measure latency: After migration, compare search response times. A dedicated cluster can often improve performance because it removes cross-node coordination overhead.
  • Consider Elasticsearch version upgrades: Newer versions of Elasticsearch have better support for cross-cluster replication and can further simplify HA without GitHub-specific hacks.
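For the shard-allocation monitoring tip above, a small helper over `_cat/shards` output can flag trouble after a failover test. This is a sketch; wire it into whatever alerting you already run.

```shell
# count_unassigned: read `_cat/shards` output on stdin and print how many
# shards are in the UNASSIGNED state (the fourth column).
count_unassigned() {
  awk '$4 == "UNASSIGNED" {n++} END {print n+0}'
}

# Usage after a failover test (ES_URL is a placeholder):
#   curl -s "$ES_URL/_cat/shards?h=index,shard,prirep,state,node" \
#     | count_unassigned
```

A nonzero count after the cluster has settled means replication did not recover cleanly and the test should be treated as a failure.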

By following these steps, you can rebuild your GitHub Enterprise Server search architecture to be truly highly available, reducing maintenance risks and allowing your team to focus on what matters—improving the user experience.
