Systems Performance: Enterprise and the Cloud
Great book on debugging production systems. It offers a comprehensive but simple mental model of how systems work, and solid methodologies for examining each component. The standout is the USE method: for every system resource (network, disk, CPU, memory, mutexes, …), check utilization, saturation, and errors. Most of the time people use the ‘streetlight’ method instead, going through whatever tools they happen to know. Its absurdity is best illustrated by the parable of the drunk man looking for his keys under the streetlight, not because he lost them there, but because that is where the light is.
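To make the USE idea concrete, here is a rough sketch of how I picture it as a checklist over resources. This is my own illustration, not code from the book: the `UseReading` type, the `use_findings` function, the 0.9 utilization threshold, and the sample numbers are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class UseReading:
    """One snapshot of a resource (hypothetical type for this sketch)."""
    resource: str
    utilization: float  # fraction of time the resource was busy, 0.0 to 1.0
    saturation: float   # amount of queued or waiting work, e.g. queue length
    errors: int         # error events counted in the interval


def use_findings(readings, util_threshold=0.9):
    """Return (resource, signal) pairs worth a closer look.

    The threshold is an assumed value; pick one that suits the resource.
    """
    findings = []
    for r in readings:
        if r.errors > 0:
            findings.append((r.resource, "errors"))
        if r.saturation > 0:
            findings.append((r.resource, "saturation"))
        if r.utilization >= util_threshold:
            findings.append((r.resource, "utilization"))
    return findings


# Made-up sample values, only to show the shape of the checklist.
readings = [
    UseReading("cpu", utilization=0.45, saturation=0.0, errors=0),
    UseReading("disk", utilization=0.97, saturation=12.0, errors=0),
    UseReading("network", utilization=0.30, saturation=0.0, errors=3),
]

for resource, signal in use_findings(readings):
    print(f"{resource}: check {signal}")
```

The point is that every resource gets asked the same three questions, rather than only the ones your favorite tool happens to answer.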
I can’t give it the fifth star because it focuses on problems that are observable right now. Many of the gnarliest systems performance problems I’ve encountered last only a short time and occur under some hard-to-reproduce condition, when the focus is on recovery, not understanding. You then have to dig through metrics after the fact to find out what might have happened. This is often easy for errors, but not for saturation and utilization. Why is there nothing in the book about this?