I had seen Site Reliability Engineering: How Google Runs Production Systems by Betsy Beyer (Editor) and Chris Jones (Editor) discussed on tech forums for a while before I bought it. The posts all said how thought-provoking the book was, and how it had led developers to change their approach to building applications. I could totally relate to this after reading the book.
The book is a blueprint for how Google manages services, and is structured as a collection of essays covering subjects including –
The advantage of this approach is that you can focus on the chapters which are most relevant to you. If you are supporting 1bn users you will have different needs from someone who wants a side project to manage itself.
The basic idea of a Site Reliability Engineer is that developers build the systems that support the deployed application, specifically the structures that monitor, scale and self-repair it. This means that the application can scale itself. Obviously it's not that easy. Google stops investing in extra reliability once the agreed service level has been reached. Interestingly, if a service has had no real outage that brings it down to that level in a given time period, they will deliberately take the service down themselves to flush out other systems that rely on it too heavily.
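To make the arithmetic behind that trade-off concrete, here is a rough sketch of how an availability target turns into a downtime allowance. The function names, the 30-day window and the three suggested actions are my own assumptions for illustration, not Google's actual policy or tooling.

```python
def error_budget_minutes(target: float, period_days: int = 30) -> float:
    """Downtime (in minutes) that an availability target allows over a period."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be a fraction such as 0.999")
    total_minutes = period_days * 24 * 60
    return (1.0 - target) * total_minutes


def next_action(target: float, downtime_minutes: float, period_days: int = 30) -> str:
    """Suggest a (hypothetical) next step based on how much budget is left."""
    remaining = error_budget_minutes(target, period_days) - downtime_minutes
    if remaining < 0:
        # Budget overspent: reliability work takes priority.
        return "pause releases and work on reliability"
    if downtime_minutes == 0:
        # No downtime at all: other systems may be assuming the service never fails.
        return "consider a planned outage to expose hidden dependencies"
    # Some budget spent, some left: no need to chase extra reliability.
    return "keep shipping features"


if __name__ == "__main__":
    # A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    print(next_action(0.999, downtime_minutes=10))
```

The point of the sketch is that the target cuts both ways: overspending the allowance shifts effort to reliability, while never touching it is a hint that the service is more reliable than anyone has agreed it needs to be.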
I did feel that a lot of the material was aimed at large teams, and not every workplace has the skills for a dedicated SRE team. In these circumstances it is important for each layer to accept part of the responsibility. As a developer, the most important thing I can do is keep it simple and make metrics available.
My advice is to get this book, and choose the principles that best suit your organisation.