Analysis and Resolution of a Failover Failure in an API System

Overview

In the API system of a company providing e-commerce services, the failover mechanism did not work correctly when the cache system went down. The root cause was difficult to identify, and I was responsible for analyzing and resolving the issue. By pinpointing the cause and implementing a fix, I helped improve the overall stability of the service.

  • Root cause analysis: I hypothesized that a bug in an external library was the source of the issue, and validated this by investigating the library’s changelog.
  • Risk mitigation: Rather than simply updating the library to the latest version, I assessed the impact of the change and selected a version that resolved the issue safely.
  • Solution verification: To confirm the fix, I built an environment in which the failure could be reproduced and tested that the problem no longer occurred.

Details

  • February 2021 - March 2021
  • Responsibilities: Root cause analysis, risk mitigation, solution verification, etc.
  • Related technologies: Redis, Java, Spring Boot, Linux
  • Worked as a developer on a two-person team consisting of one project leader and one engineer.