Incident Statement: Tracking API Service Disruption
Date: July 25th through 26th
Duration: 32 hours, plus 6 hours of cleanup
Impact: Tracking API and Tracking Pages request timeouts due to database performance issues
What Happened
On July 25th, our tracking subsystem experienced a service disruption caused by the unintentional deletion of a critical unique index on our main database. This index is essential for processing tracking-related queries.
Impact on Services
The loss of this index caused database queries to take much longer than normal, resulting in requests failing with timeouts.
Affected services
- Tracking API
- Tracking Pages
- Parcel Finder
Resolution Steps Taken
Our engineering team responded immediately to restore service:
- Initial Response: Initiated rebuilding the deleted unique index
- Load Management: Stopped unindexed queries
- Infrastructure Scaling: Scaled up the database cluster to speed up indexing and reduce database load
- Interim Solution: When the rebuild failed because duplicate entries had been created in the meantime, created a non-unique index so that queries could run properly again
- Data Cleanup: Cleaned up duplicate entries
- Full Recovery: Successfully recreated the unique index, restoring normal service
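For readers who want to see the sequence concretely, the recovery steps above can be sketched roughly as follows. This is an illustration only, not the procedure our team ran: it uses an in-memory SQLite database through Python's sqlite3 module, and the table, column, and index names (tracking_events, tracking_number, carrier, ux_tracking, ix_tracking_tmp) are hypothetical. Keeping the row with the lowest id among duplicates is likewise an assumption made for the example.

```python
import sqlite3

# Hypothetical schema and data: two rows collide on (tracking_number, carrier).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tracking_events ("
    "id INTEGER PRIMARY KEY, tracking_number TEXT, carrier TEXT)"
)
conn.executemany(
    "INSERT INTO tracking_events (tracking_number, carrier) VALUES (?, ?)",
    [("A1", "x"), ("A1", "x"), ("B2", "y")],
)
conn.commit()

UNIQUE_INDEX_SQL = (
    "CREATE UNIQUE INDEX ux_tracking "
    "ON tracking_events (tracking_number, carrier)"
)


def recover_unique_index(conn):
    """Rebuild the unique index, falling back to an interim index if needed."""
    steps = []
    try:
        conn.execute(UNIQUE_INDEX_SQL)
        steps.append("unique index created")
        return steps
    except sqlite3.IntegrityError:
        # Duplicates written while the index was missing block the rebuild.
        steps.append("unique rebuild failed: duplicates present")

    # Interim non-unique index so lookups are fast again during cleanup.
    conn.execute(
        "CREATE INDEX ix_tracking_tmp "
        "ON tracking_events (tracking_number, carrier)"
    )
    steps.append("interim non-unique index created")

    # Assumed policy: keep the oldest row (lowest id) in each duplicate group.
    cur = conn.execute(
        "DELETE FROM tracking_events WHERE id NOT IN ("
        "SELECT MIN(id) FROM tracking_events "
        "GROUP BY tracking_number, carrier)"
    )
    steps.append(f"removed {cur.rowcount} duplicate rows")

    # With duplicates gone, the unique index can be recreated.
    conn.execute(UNIQUE_INDEX_SQL)
    conn.execute("DROP INDEX ix_tracking_tmp")
    conn.commit()
    steps.append("unique index recreated")
    return steps
```

Calling recover_unique_index(conn) on the sample data walks through the failed rebuild, the interim index, the cleanup, and the final unique index. On a production database each of these steps can take hours, which is why the interim index matters.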
Preventive Measures
To prevent similar incidents in the future, we are implementing the following measures:
- Enhanced database change management procedures with mandatory peer review for all schema modifications
- Implementation of automated database integrity checks to detect missing or corrupted indexes before production deployment
- Enhanced code review processes specifically focused on database schema changes and on queries that do not use an index
- Improved standard operating procedures for common database incident scenarios
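As a rough idea of what an automated check for missing indexes might look like, the sketch below compares the indexes present in a database against a list of required ones. It is a minimal example under stated assumptions, not a description of our actual tooling: it targets SQLite through Python's sqlite3 module, the REQUIRED_INDEXES manifest and all names in it are hypothetical, and it covers only missing indexes, not corrupted ones.

```python
import sqlite3

# Hypothetical manifest: table name -> index names that must exist.
REQUIRED_INDEXES = {"tracking_events": {"ux_tracking"}}


def find_missing_indexes(conn, required):
    """Return {table: set of missing index names} for the given manifest."""
    missing = {}
    for table, expected in required.items():
        rows = conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'index' AND tbl_name = ?",
            (table,),
        ).fetchall()
        present = {name for (name,) in rows}
        absent = expected - present
        if absent:
            missing[table] = absent
    return missing
```

A deployment pipeline could run such a check against a staging database and stop the rollout whenever the returned dictionary is not empty.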
Our Commitment
We sincerely apologize for the disruption this incident caused. We take full responsibility for this issue and have conducted a thorough post-incident review. We are committed to implementing the preventive measures outlined above to ensure the reliability and stability of our services.
If you have any questions or concerns regarding this incident, please don't hesitate to contact our support team.
Thank you for your patience.