Read in Feb 2020
Book by Brendan Gregg published in 2013
Great book on debugging production systems. It offers a comprehensive yet simple mental model of how systems work, plus solid methodologies for examining each component. The standout is the USE method: for every resource (network, disk, CPU, memory, mutexes, ...), check utilization, saturation, and errors. Most of the time people instead use the 'streetlight' method, illustrated by the parable of the drunk man looking for his keys under the streetlight: they look where it is easiest to look, not where the problem is most likely to be.
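To make the USE method concrete, here is a minimal sketch of it as a checklist loop. It is my own illustration, not code from the book: the `ResourceSample` type, the `use_findings` helper, the 0.7 utilization threshold, and the sample numbers are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """One observation of a resource over some interval (hypothetical type)."""
    name: str
    utilization: float  # fraction of capacity in use, 0.0 to 1.0
    saturation: int     # queued work the resource could not service yet
    errors: int         # error events counted in the interval


def use_findings(samples, util_threshold=0.7):
    """Walk every resource and flag errors, high utilization, and saturation.

    The threshold is an arbitrary assumption for this sketch.
    """
    findings = []
    for s in samples:
        if s.errors > 0:
            findings.append((s.name, "errors"))
        if s.utilization >= util_threshold:
            findings.append((s.name, "utilization"))
        if s.saturation > 0:
            findings.append((s.name, "saturation"))
    return findings


# Invented sample values, purely for illustration.
samples = [
    ResourceSample("cpu", utilization=0.95, saturation=4, errors=0),
    ResourceSample("disk", utilization=0.30, saturation=0, errors=2),
    ResourceSample("network", utilization=0.10, saturation=0, errors=0),
]

for name, finding in use_findings(samples):
    print(f"{name}: check {finding}")
```

The point of the loop is that every resource gets all three questions, so nothing is skipped just because it is harder to measure.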
I can't give it the fifth star because it focuses on problems that are observable right now. Many of the gnarliest systems performance problems I've encountered last only a short time and occur under some hard-to-reproduce condition, when the focus is on recovery, not understanding. You then have to dig through metrics _after_ the fact to find out what might have happened. This is often easy for errors, but not for saturation and utilization.