Site Reliability Engineering: How Google Runs Production Systems
Much of what Google has shared about running production systems effectively has shaped how my thinking about the SRE role has changed over the years. Finally, there’s one book that collects what you previously had to hunt for across various talks, papers and abstracts: error budgets, the SRE role definition, scaling, etc. That said, this book suffers from the classic problem of having too many authors write independent chapters. Much is repeated, and each chapter stands too much on its own, building from first principles each time instead of leveraging the rest of the book. This makes the book much longer than it needs to be. Furthermore, it tries to be both technical and non-technical, which confuses the narrative, and it ends up excelling at neither. I would love to see two books: SRE the technical parts, and SRE the non-technical parts. Overall, the information in this book is still a goldmine worth a 5/5, but it is exactly that: a goldmine you’ll have to put a fair amount of effort into digging through to get the most value from, because the book’s structure doesn’t hand it to you. That’s why we land at a 3/5. When I recommend this book to coworkers, which I will, it will be individual chapters, not the book at large.