Optiver maintains a large and complex infrastructure spread across locations around the world. This infrastructure contains thousands of components that together make up our trading system. Maintaining these systems is a constant effort, and confidence in the quality of the components must always be high. Given the constrained resources available for maintenance and the behavior of computer systems, we want to find the optimal strategy for patch management, soft reboots (no power cycle) and hard reboots (with power cycle). The goal is that, when major updates to the infrastructure are performed, system failures and the resulting need for replacement are most likely to occur close to a physical maintenance window, while the trading infrastructure keeps running at the desired reliability level.
This research will analyze historical failure rates to build a reliability picture of the infrastructure components and then, through modeling and analysis, propose an optimal reboot strategy, such that the impact on the maintenance team is matched to their availability and capacity.