On Sat May 16, Agents experienced severe server latency, game actions failing to register, and general Scanner instability. Crucially, this was not isolated to the Orion Sydney and Orion Prague Anomaly live events: the ripple effects were felt by players worldwide.
As you may know, an XM Anomaly generates a hyper-concentrated spike in database actions at a small number of Portals. Hundreds of Agents deploy, attack, and mod the same Portals at the same millisecond. Ingress has been running on a new database since before the +Gamma Hyderabad and +Gamma Buenos Aires Anomalies, but Orion was the first Anomaly Series with this increased level of Agent activity.
This past weekend, we experienced a perfect storm of three major bottlenecks:
- Write contention gridlock: Our new database handles simultaneous update attempts differently than our previous database. Above a certain level of traffic, the retry attempts start to outnumber the first-try attempts and the ability to make updates slows down dramatically.
- Tens of thousands of ghost records: Game objects like Portals and Resonators change constantly, so our database uses an index to keep track of each change. However, every change leaves behind a ghost, a temporary tombstone row that stays in the system for at least one hour. Our database became clogged with ghosts, and it had to sift through tens of thousands of ghost records to find real game actions, which caused delays. Because our database handles global data, this slowed the global Portal Network, creating the ripple-effect latency experienced by players outside of the Anomaly Zone.
- Slow automatic scaling: We use autoscalers, systems that automatically add more computers to the cluster when there are more requests. The start of an Anomaly comes with a sudden rush of traffic, which the autoscalers cannot respond to quickly enough, so we preemptively force more capacity into the system before the start of each Anomaly. However, in combination with the issues above, the autoscalers still could not keep up.
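To make the write contention point concrete, here is a toy model in Python. It assumes that when several writers update the same row at once, exactly one commits per round and the rest retry. That is a simplification for illustration only, not a description of how our database resolves conflicts, and the function name `contention_attempts` is hypothetical.

```python
def contention_attempts(writers: int) -> tuple[int, int]:
    """Return (first_try_attempts, retry_attempts) for one hot row.

    Toy assumption: in each round every pending writer attempts the
    update, exactly one commits, and the losers retry next round.
    """
    first_tries = writers
    retries = 0
    pending = writers
    while pending > 0:
        # One writer commits in this round.
        pending -= 1
        # Everyone still pending comes back as a retry.
        retries += pending
    return first_tries, retries


if __name__ == "__main__":
    for n in (2, 4, 10, 100):
        first, retry = contention_attempts(n)
        print(f"{n} writers: {first} first tries, {retry} retries")
```

Under this assumption, retries grow roughly with the square of the number of simultaneous writers, while first-try attempts grow only linearly. Past a handful of writers per row, most of the work is retries.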
Our team live-monitors and scores every Anomaly, and we have additional team members on-call for when emergencies happen. When our automated alerts identified issues and paged team members, we manually bypassed our automated systems and forced servers onto higher-end machines to try to handle the processing load, over-scaled our database instances far beyond our automated limits, and forced background systems to stay at max capacity so they would not drop further connections.
These brute-force updates kept Ingress from crashing, but they did not fix the architectural issues we were seeing. To help prevent these traffic jams and ghosts from disrupting future events, we are currently working on the following:
- Move to in-memory Portal updates: We are completing a new game entity system that should allow game actions to process in faster, temporary server memory first.
- Bust the ghosts: We ain’t afraid of no ghosts, but we are changing how our database indexes time. This should eliminate major bottlenecks we identified last weekend during the Anomalies.
- Pre-split Anomaly Zones: Since we know which geographic rows will receive extra traffic during an Anomaly, we will split those rows into their own shards so that the database can allocate resources to those sections of data instead of having to detect the active areas and then reorganize the data after the Anomaly has already started.
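As a rough illustration of the in-memory idea, the sketch below merges many actions on the same Portal in server memory and writes only the merged state per Portal in one batch. Everything here is hypothetical (the `PortalWriteBuffer` class, the field names, the `flush_fn` callback), and it does not reflect the actual entity system, which also has to handle durability and ordering that this sketch ignores.

```python
from typing import Callable


class PortalWriteBuffer:
    """Hypothetical buffer that coalesces Portal updates in memory."""

    def __init__(self, flush_fn: Callable[[dict[str, dict]], None]) -> None:
        self._pending: dict[str, dict] = {}
        self._flush_fn = flush_fn
        self.actions_seen = 0

    def apply(self, portal_id: str, changes: dict) -> None:
        # Merge the change into the in-memory state for this Portal.
        self.actions_seen += 1
        self._pending.setdefault(portal_id, {}).update(changes)

    def flush(self) -> int:
        # Hand one merged record per Portal to the database layer.
        if not self._pending:
            return 0
        batch, self._pending = self._pending, {}
        self._flush_fn(batch)
        return len(batch)


# Usage: three game actions become two database rows in one batch.
written: list[dict[str, dict]] = []
buf = PortalWriteBuffer(written.append)
buf.apply("portal-a", {"resonator_1": 80})
buf.apply("portal-a", {"resonator_1": 55})
buf.apply("portal-b", {"mod_slot_0": "shield"})
rows = buf.flush()
```

The point of the design is that the database sees fewer, larger writes instead of one contended write per action.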
As an additional mitigation step, we are going to temporarily reduce the XMP attack radius, because firing an XMP triggers simultaneous updates that add to the write contention we are seeing. We will increase XMP attack strength, but reduce the attack radius so that fewer Portals are attacked by each XMP fired. This temporary adjustment will go live Tue May 26 17:00 UTC and revert to the current attack radius on Mon June 1 17:00 UTC.
We apologize for these issues, and we hope this gives you a clearer understanding of what happened and what we are doing to address it. We are working with Orion Anomaly Points of Contact to distribute Orion Anomaly medal passcodes to Agents who contributed to the Orion Sydney and Orion Prague Anomalies and need a passcode. We are also discussing a make good specifically for Orion Sydney and Orion Prague Anomaly participants.
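A back-of-the-envelope way to see why a smaller radius helps: the area covered by an XMP grows with the square of its radius, so if Portals were spread evenly, the number of Portals touched per XMP would shrink by the same factor. The sketch below assumes uniform Portal density, and the radius and density values are made up for illustration; they are not actual game values.

```python
import math


def expected_portals_hit(radius_m: float, portals_per_km2: float) -> float:
    """Expected Portals inside a circle, assuming uniform density."""
    area_km2 = math.pi * (radius_m / 1000.0) ** 2
    return portals_per_km2 * area_km2


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the scaling.
    density = 400.0
    before = expected_portals_hit(100.0, density)
    after = expected_portals_hit(50.0, density)
    print(f"ratio after/before: {after / before:.2f}")
```

Under these assumptions, halving the radius cuts the expected number of Portals updated per XMP to about a quarter.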
Additionally, we are adding a make good to the Store’s Free Items category available to all Agents, which contains the following:
- 1 Aegis Shield
- 4 Rare Multi-Hack
- 20 Power Cube
- 20 XMP
- 20 Resonator
- 8 Rare Heat Sink
- 1 ADA Refactor
- 1 JARVIS Virus
- 1 Kinetic Capsule
- 4 Very Rare Shield
- 8 Rare Shield
- 1 SoftBank Ultra Link
This free Load Out Kit will be available for 7 days, from Fri May 22 17:00 UTC to Fri May 29 17:00 UTC.
Thank you for your patience, your detailed bug reports, and your continued dedication to Ingress. As always, we are committed to continuing to improve Ingress.
—Brian, on behalf of the Ingress team