7 Key Lessons from Rebuilding GitHub Enterprise Server’s Search for High Availability

Published 2026-05-03 13:15:09 · Technology

1. Search Is Everywhere in GitHub Enterprise Server

Search isn’t just the search bar. On GitHub Enterprise Server, it powers the Issues page filters, the Releases page, Projects boards, and even the counters showing how many issues or pull requests are open. If search goes down, these features stop working or become unreliable. That’s why the engineering team made durability a top priority. The goal was to reduce the time administrators spend maintaining search indexes and let them focus on what matters: serving their users. Understanding this central role of search is the first step in grasping why the architectural changes were so critical.

Source: github.blog

2. The Old Architecture: Clustered Elasticsearch

For years, GitHub Enterprise Server used Elasticsearch as its search database. In a High Availability (HA) setup, the standard pattern is a leader/follower arrangement: one primary node handles all writes and user traffic, while replicas stay in sync as read-only backups. To make Elasticsearch work in this environment, the engineering team created a single Elasticsearch cluster that spanned both the primary and replica nodes. This made data replication easy and allowed each node to process search queries locally. However, it also introduced deep coupling between the nodes.
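
To make the old topology concrete, here is a minimal sketch (not GitHub's own tooling) of inspecting such a cluster with Elasticsearch's cat APIs; the endpoint and hostnames are illustrative assumptions.

```python
# Minimal sketch: inspect a single Elasticsearch cluster that spans the HA
# primary and its replicas. The endpoint is an illustrative assumption.
import requests

ES_URL = "http://primary.example.internal:9200"

# In the old architecture this listing would show the primary appliance and
# every replica appliance as members of one and the same cluster.
nodes = requests.get(f"{ES_URL}/_cat/nodes?v&h=name,ip,node.role,master").text
print(nodes)
```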

3. Why High Availability Matters

High Availability (HA) is designed to keep GitHub Enterprise Server running even when a part of the system fails. The primary node does all the work, and replica nodes constantly stay synchronized. If the primary goes down, a replica can take over with minimal disruption. But this setup requires careful coordination of all components, including search. When Elasticsearch was clustered across these nodes, the coordination became fragile. Administrators had to follow exact maintenance and upgrade steps; any deviation could corrupt search indexes or cause them to lock, making the entire HA system unreliable.

4. Elasticsearch’s Incompatibility with Leader/Follower

Elasticsearch wasn’t built for a leader/follower pattern where one node is the writer and the others are read-only. In a standard Elasticsearch cluster, any node can accept writes, and shards move dynamically. To fit into GitHub’s HA model, the engineers had to force Elasticsearch into an unnatural state. They created a cluster that spanned the primary and replicas, but Elasticsearch could still decide to move a primary shard (the shard responsible for writes) from the primary node to a replica. This caused problems because replica nodes were not supposed to handle write traffic in the HA architecture.
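
As an illustration of that mismatch, a check like the following (hypothetical node name and endpoint) would reveal any write-bearing primary shard that Elasticsearch has placed on a replica appliance:

```python
# Hedged sketch: find any primary ("p") shard that Elasticsearch has placed on
# a node other than the HA primary. Node name and endpoint are assumptions.
import requests

ES_URL = "http://primary.example.internal:9200"
HA_PRIMARY_NODE = "ghes-primary"  # assumed Elasticsearch node name of the HA primary

resp = requests.get(f"{ES_URL}/_cat/shards?format=json&h=index,shard,prirep,node")
resp.raise_for_status()

misplaced = [
    s for s in resp.json()
    if s["prirep"] == "p" and s["node"] != HA_PRIMARY_NODE
]
for shard in misplaced:
    # Each hit is a write-bearing shard living on a read-only replica appliance.
    print(f"primary shard {shard['index']}[{shard['shard']}] is on {shard['node']}")
```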

5. The Locked State Nightmare

The worst-case scenario happened during maintenance. If Elasticsearch relocated a primary shard to a replica node, and an administrator then took that replica offline for maintenance, GitHub Enterprise Server would enter a locked state. The replica would wait for Elasticsearch to become healthy before starting up, but Elasticsearch couldn’t become healthy until the replica rejoined. This circular dependency meant the system was stuck. Administrators had to manually intervene, often by restoring backups or rebuilding indexes. This was a time-consuming and stressful process that defeated the purpose of HA.
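
A simplified sketch of why the lock was circular, assuming a startup gate that waits on Elasticsearch cluster health (the timeout and endpoint are illustrative):

```python
# Simplified sketch of the circular dependency: on boot, the replica waits for
# Elasticsearch to report usable health, but the cluster cannot become healthy
# until the primary shards stranded on the offline node come back.
import time
import requests

ES_URL = "http://localhost:9200"

def wait_for_search(timeout_s: int = 600) -> bool:
    """Block until the local Elasticsearch reports at least yellow health."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            health = requests.get(f"{ES_URL}/_cluster/health", timeout=5).json()
            if health.get("status") in ("yellow", "green"):
                return True
        except requests.RequestException:
            pass  # Elasticsearch not reachable yet
        time.sleep(10)
    # With a primary shard stranded on the node under maintenance, this branch
    # fires: startup and cluster health are each waiting on the other.
    return False
```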

6. Failed Attempts at Stabilization

Over several releases, GitHub engineers tried to make the clustered mode more stable. They added checks to monitor Elasticsearch’s health and processes to correct drifting states. They also introduced mechanisms to prevent the shard-movement problem, but none were fully reliable. The underlying issue was that Elasticsearch assumed all nodes were equal, while GitHub’s HA setup required a strict primary/replica hierarchy. Every patch only addressed symptoms, not the root cause. The team realized that incrementally improving the existing architecture would never achieve the desired level of durability.
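
For illustration only, one mitigation in this class is to disable Elasticsearch's automatic shard rebalancing through the cluster settings API; this is not GitHub's actual patch, but it shows why such fixes fall short:

```python
# Hedged illustration of a symptom-level mitigation (not GitHub's actual fix):
# turn off automatic shard rebalancing so Elasticsearch stops relocating shards
# between appliances on its own.
import requests

ES_URL = "http://primary.example.internal:9200"

settings = {
    "persistent": {
        "cluster.routing.rebalance.enable": "none",
    }
}
requests.put(f"{ES_URL}/_cluster/settings", json=settings).raise_for_status()

# Caveat: this only stops voluntary relocation. If the primary appliance's node
# drops out of the cluster, Elasticsearch still promotes a replica shard copy on
# another node to primary, so the strict hierarchy can still be violated -- the
# root-cause mismatch these patches could not remove.
```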

7. The Search Mirroring Experiment

One ambitious attempt was building a “search mirroring” system. The idea was to move away from the distributed Elasticsearch cluster entirely and instead replicate search data from the primary node to each replica independently. This would avoid the shard-movement issue because each node would have its own complete copy of the search index. The challenge was that database-level replication is extremely complex. Keeping the indexes consistent required solving difficult distributed-systems problems. After significant effort, the team had to abandon the approach because it couldn’t guarantee the consistency needed for production.
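
A minimal sketch of the mirroring idea, assuming each appliance runs its own independent Elasticsearch and the primary fans writes out to all of them (endpoints and index names are hypothetical):

```python
# Minimal sketch of "search mirroring": write the same document to every
# appliance's local, independent Elasticsearch. Endpoints are assumptions.
import requests

NODES = [
    "http://primary.example.internal:9200",
    "http://replica-1.example.internal:9200",
    "http://replica-2.example.internal:9200",
]

def index_document(index: str, doc_id: str, body: dict) -> None:
    """Write the same document to every node's local copy of the index."""
    for node in NODES:
        # If any single call fails or arrives out of order, the copies silently
        # diverge -- the consistency problem that made this approach unworkable.
        requests.put(f"{node}/{index}/_doc/{doc_id}", json=body).raise_for_status()

index_document("issues", "42", {"title": "Fix flaky test", "state": "open"})
```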

8. The Turning Point: Rebuilding from Scratch

After years of work, the engineers decided to start fresh. They re-architected search to fit the HA pattern instead of forcing a general-purpose tool into an incompatible design. The new design treats search as a service that runs independently on each node. Data is replicated through a custom mechanism that respects the leader/follower hierarchy. This means that the primary node’s search index is authoritative, and replicas simply copy the stable, consistent state. The dynamic shard movement that caused locked states is eliminated entirely.

9. How the New Architecture Works

In the rebuilt system, each GitHub Enterprise Server node runs its own local Elasticsearch instance. The primary node accepts all search-write operations and periodically produces a consistent snapshot of its index. These snapshots are then transferred to replica nodes using a file-based replication protocol. Replicas load the snapshot and serve read-only search queries. Because the replication happens at the file level and on a schedule, there is no risk of Elasticsearch moving shards between nodes. Administrators can now take replicas offline for maintenance without causing any lock state. The entire HA setup becomes more predictable and easier to manage.
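
The flow above maps closely onto Elasticsearch's standard snapshot and restore APIs. The sketch below shows that flow under assumed repository names and paths; the actual transfer protocol inside GitHub Enterprise Server is an internal detail.

```python
# Rough sketch of snapshot-based replication using Elasticsearch's filesystem
# snapshot repository. Repository name, paths, and the transfer step are
# illustrative assumptions, not GitHub's internal implementation.
import requests

PRIMARY = "http://primary.example.internal:9200"
REPLICA = "http://replica-1.example.internal:9200"
REPO = "ha-snapshots"

# 1. On the primary: register a filesystem repository and take a snapshot.
requests.put(f"{PRIMARY}/_snapshot/{REPO}", json={
    "type": "fs",
    "settings": {"location": "/var/lib/elasticsearch/ha-snapshots"},
}).raise_for_status()
requests.put(
    f"{PRIMARY}/_snapshot/{REPO}/search-20260503?wait_for_completion=true"
).raise_for_status()

# 2. Out of band: copy the repository directory to the replica (e.g. rsync).
#    This is the file-level transfer; no Elasticsearch coordination is needed.

# 3. On the replica: register the same repository path and restore from it,
#    then serve read-only queries from the restored indexes. Any index with the
#    same name must be closed or deleted on the replica before restoring.
requests.put(f"{REPLICA}/_snapshot/{REPO}", json={
    "type": "fs",
    "settings": {"location": "/var/lib/elasticsearch/ha-snapshots"},
}).raise_for_status()
requests.post(
    f"{REPLICA}/_snapshot/{REPO}/search-20260503/_restore"
).raise_for_status()
```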

10. What This Means for Administrators

The architectural change dramatically reduces the operational burden. Administrators no longer need to follow delicate upgrade procedures or worry about search indexes becoming corrupted. They can perform maintenance on any node without fear of locking the system. Search remains highly available: if the primary fails, a replica can quickly take over after loading the latest snapshot. The improvements mean less downtime, simpler troubleshooting, and more confidence in the overall platform. For organizations relying on GitHub Enterprise Server, this translates to a more resilient tool that requires less hands-on management.

Conclusion: Rebuilding the search architecture for high availability was a multi-year journey full of dead ends and hard lessons. By stepping back and designing a system that matched the natural HA pattern of leader/follower, the GitHub team eliminated the fragile coupling that caused outages. The result is a search foundation that is both durable and easy to maintain. Administrators can now focus on enabling their developers rather than wrestling with Elasticsearch.