SimpSocial’s Path to Observability
SimpSocial prioritizes the customer experience above all else; the availability and performance of our service come first. To achieve this, our teams and systems need a strong culture of observability.
We therefore invest heavily in the reliability of our application. Unpredictable failures are inevitable, however, and when they occur, it is people who step in to fix them.
We operate a socio-technical system, and resilience is the capacity of that system to recover from adversity. Observability, the work we do to let people “look” inside the systems they operate, is one of the essential ingredients of resilience.
This article examines the steps we have taken to build a stronger culture of observability and the lessons we have learned along the way.
What does SimpSocial mean by observability?
We ship to learn at SimpSocial. The only place to understand and verify the effects of our work is our production environment, where our code, infrastructure, third-party dependencies, and customers come together to create an objective reality. We define observability as an ongoing process in which people ask questions about production and get answers*.
Let’s examine that in more detail:
An ongoing process: successful observability requires frequent observation.
About production: we wanted our definition to be broad, all-encompassing, and reflective of the diverse range of workflows we support.
Answers*: note the asterisk. No tool can give you answers; it can only give you leads that you follow to uncover the real answers, relying on your own mental models and understanding of the systems you operate.
Stage 1: Problem and solution
With our definition of observability in hand, we evaluated our existing practices and arrived at a problem statement. Until recently, our observability tooling relied mostly on metrics. A typical workflow meant looking at a dashboard full of charts, with metrics broken down by various combinations of attributes. People would hunt for correlations but often gave up without a satisfying answer.
“Metrics are easy to add and easy to understand, but they lack high-cardinality attributes, such as customer ID, which makes it hard to complete an investigation.”
Metrics are easy to add and easy to understand, but because they lack high-cardinality attributes (such as customer ID), it is hard to complete an investigation with metrics alone. A small group of observability champions would continue the workflow in auxiliary tools (logs, exceptions, and so on), trying to reach the high-cardinality data and build a fuller picture. For the majority of product engineers, who are focused on shipping new products, that skill requires regular practice.
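To make the gap concrete, here is a minimal sketch using the OpenTelemetry Python API as a stand-in (the post does not name the libraries we actually use): a metric counter can only carry a handful of low-cardinality labels before its cost explodes, while a span can safely carry a high-cardinality attribute such as a customer ID that an investigation can filter on.

```python
# Sketch only: OpenTelemetry is an illustrative stand-in, not necessarily the
# library we use. Without an SDK configured, these API calls are no-ops.
from opentelemetry import metrics, trace

meter = metrics.get_meter("web")
tracer = trace.get_tracer("web")

# A metric: cheap and easy, but every distinct label combination creates a new
# time series, so high-cardinality labels such as customer IDs are off the table.
request_counter = meter.create_counter("web.requests")
request_counter.add(1, {"endpoint": "/conversations", "status_code": "500"})

# A trace span: a single event that can carry high-cardinality attributes, so an
# investigation can slice by the exact customer affected.
def handle_request(customer_id: str) -> None:
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("customer.id", customer_id)  # high cardinality is fine here
        span.set_attribute("endpoint", "/conversations")

handle_request("customer-12345")
```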
We recognized this gap in a consolidated observability experience as our challenge. We wanted to make it easy for anyone to ask an arbitrary question about production and get answers, without having to become proficient in several disjointed, expensive, and under-equipped tools. To close the gap, we decided to invest much more heavily in tracing telemetry.
We used a standard operational dashboard before focusing more on traces.
Why traces?
Every observability tool has a human operating it, and humans need effective visualizations. Whatever kind of data backs a visualization, what matters is that the tool lets you switch quickly between different visualizations and gain different perspectives on the problem.
Traces have a major advantage over other telemetry: they carry enough transactional information to power almost any visualization. Building observability workflows on top of traces gives a seamless, consolidated experience without having to change the underlying data or the tool.
Some of the many visualizations that traces can power
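To illustrate why one trace dataset can back so many views, the sketch below uses hypothetical span records and plain Python to derive an error rate, a latency percentile, and a per-customer breakdown from the same data; a tracing tool does the equivalent to render counts, heatmaps, and attribute breakdowns without a separate pipeline.

```python
# Hypothetical span records; in practice these come from the tracing pipeline.
spans = [
    {"name": "web.request", "duration_ms": 42,  "error": False, "customer_id": "a1"},
    {"name": "web.request", "duration_ms": 910, "error": True,  "customer_id": "b2"},
    {"name": "web.request", "duration_ms": 130, "error": False, "customer_id": "a1"},
    {"name": "web.request", "duration_ms": 57,  "error": False, "customer_id": "c3"},
]

# The same records can back an error-rate chart...
error_rate = sum(s["error"] for s in spans) / len(spans)

# ...a latency distribution view (crude percentile for illustration)...
durations = sorted(s["duration_ms"] for s in spans)
p95 = durations[int(0.95 * (len(durations) - 1))]

# ...and a high-cardinality breakdown by customer.
by_customer = {}
for s in spans:
    by_customer.setdefault(s["customer_id"], []).append(s["duration_ms"])

print(f"error rate: {error_rate:.0%}, p95: {p95}ms, customers: {len(by_customer)}")
```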
Stage 2: Putting traces in place
At SimpSocial, we decide what success looks like and then make small, incremental progress toward it. Our main goal was to prove that traces would speed up observability workflows, and for that we needed to get traces into engineers’ hands as quickly as possible.
“Rather than instrumenting our application with traces from scratch, we leveraged an existing tracing library that happened to already be among our dependencies.”
To save time, we used an existing vendor, Honeycomb, for our proof of concept. We already had a great relationship with them from using their platform for scheduled events in the past.
Rather than instrumenting our application with traces from scratch, we used an existing tracing library that happened to already be in our dependencies and made a small tweak to transform the trace data into Honeycomb’s native format. To start, we used a simple deterministic sampling strategy and kept just 1% of all the transactions we served.
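Head-based deterministic sampling of this kind can be expressed in most tracing libraries; the sketch below uses the OpenTelemetry SDK as a stand-in for the unnamed library we built on. The decision is made once per trace from the trace ID, so roughly 1 in 100 transactions is kept end to end.

```python
# Sketch only: OpenTelemetry SDK as a stand-in for the tracing library we used.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep ~1% of traces, decided deterministically from the trace ID so that all
# spans belonging to a sampled transaction are kept together.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.01)))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("web")
with tracer.start_as_current_span("handle_request"):
    pass  # roughly 99% of these transactions produce no exported trace
```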
Enabling colleagues to use traces
Moving a company toward traces takes a lot of effort. Traces are more complex than metrics or logs, and the learning curve is steep. Instrumentation, data pipelines, and tooling all matter, but the biggest challenge is enabling your teams to get the most out of traces. As soon as our proof of concept was running in production, we turned our attention to building an observability culture.
“We spoke not only with engineers but also with directors, technical program managers, security team members, and customer support agents to show how traces could help them solve their specific challenges.”
The key to success was finding allies. We assembled a group of champions who were already skilled at observability; they helped validate our hypotheses and educate their teams about traces. But we didn’t speak only with engineers: we also met with directors, technical program managers, security team members, and customer support agents to show how traces could help them solve their particular problems.
Tailoring our messaging helped us secure support. Adopting new tooling always carries some risk, but by demonstrating its promise and generating excitement about it, we improved our chances of success.
Stage 3: Selecting the right vendor
Once the enablement initiative was underway, we began researching modern tracing-centric vendors and developed a set of criteria to evaluate candidates against:
Exploratory workflow: We considered the exploratory workflow the most important, since it lets engineers arbitrarily slice and dice production data and gain insights through visualizations and high-cardinality attributes. Recognizing that there is a problem is essential to diagnosing it, and that requires a clear sense of what “normal” looks like. We wanted to make it easy for engineers to explore production by asking questions regularly, not just when problems appear.
“We wanted full control over how data is sampled and retained.”
Controls over data sampling and retention: We wanted full control over how data is sampled and retained. Deterministic sampling helped us get started quickly, but we wanted to be more selective and use intelligent dynamic sampling to keep more of the “interesting” traces (such as errors and slow requests) while staying within our contract limits.
Accurate data visualizations: Whatever sampling technique we used, the observability tooling had to handle it transparently by exposing “true” approximations in its visualizations. Each vendor approached this differently; some, for example, require all data to be sent to a global aggregator in order to derive metrics for key indicators like error rate and volume. Given the vast amount of data produced by our rich instrumentation, that option was not viable for us.
Pricing: We wanted a simple, predictable pricing model aligned with the value we would get from the tool. Charging for the amount of data retained and made available to us seemed fair.
Engagement metrics: We wanted the vendor to be a good partner and help us identify key usage and engagement indicators so we could monitor the tool’s adoption and effectiveness.
There is no perfect vendor out there, so be prepared to make trade-offs. In the end, we concluded that Honeycomb not only performed better for the primary workflow we had identified, but also checked the boxes for sampling, pricing, and usage metrics, sparing us the cost of switching vendors.
After a challenging year of work, we had completed the technical portion of the observability program. Here is what we had accomplished:
Our primary monolith application was automatically instrumented with high-quality, attribute-rich traces.
Engineers had a few simple ways to add custom instrumentation to their code.
We had deployed Honeycomb Refinery to sample data dynamically and retain more of the “interesting” traces. For finer control, we encouraged engineers to set up custom retention rules, and where it was economically viable we offered 100% retention for the most valuable transactions so people had the data they needed; the sketch below illustrates the shape of such a sampling policy.
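The real rules live in Refinery’s configuration; the Python sketch below only illustrates the shape of such a tail-sampling policy, with hypothetical thresholds: traces containing an error or a slow request are always kept, and routine traces are sampled heavily.

```python
import random

# Hypothetical thresholds; the actual values live in our Refinery rules.
SLOW_MS = 1000
BASELINE_SAMPLE_RATE = 100  # keep roughly 1 in 100 routine traces

def keep_trace(spans: list[dict]) -> bool:
    """Tail-sampling decision made once the whole trace has been collected."""
    has_error = any(s.get("error") for s in spans)
    is_slow = any(s.get("duration_ms", 0) > SLOW_MS for s in spans)
    if has_error or is_slow:
        return True  # "interesting" traces are kept at 100%
    return random.randrange(BASELINE_SAMPLE_RATE) == 0

# A routine trace is usually dropped; a failing one is always kept.
print(keep_trace([{"duration_ms": 80}]))
print(keep_trace([{"duration_ms": 80, "error": True}]))
```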
Stage 4: Growing adoption
Once we had committed to Honeycomb and finished the data pipeline work, we returned to enablement. To create a culture of observability, you have to make observability easy to adopt. Here are some of the approaches we used to help teams make use of the new observability tooling:
Tracing in the development environment
To familiarize engineers with tracing instrumentation and encourage them to add it to their code, we offered optional tracing from the local development environment, with the resulting traces exposed in Honeycomb. This made it easy for people to see new custom instrumentation exactly as they would once the code shipped to production.
Trace views are significantly more structured and ordered than logs, which can be challenging to read and interpret.
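As an example of the kind of custom instrumentation an engineer might try locally, here is a sketch (again using OpenTelemetry as a stand-in, with a console exporter in place of the development export to Honeycomb): a business operation wrapped in a child span carrying the attributes you would later want to slice by in production.

```python
# Sketch only: OpenTelemetry as a stand-in; the console exporter substitutes for
# the development export to Honeycomb described above.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("conversations")

def import_conversations(customer_id: str, batch: list[str]) -> None:
    # A custom child span around a business operation, with the attributes an
    # engineer would want to filter on later in production.
    with tracer.start_as_current_span("conversations.import") as span:
        span.set_attribute("customer.id", customer_id)
        span.set_attribute("batch.size", len(batch))
        for _item in batch:
            pass  # ...do the actual work...

import_conversations("customer-12345", ["c1", "c2", "c3"])
```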
Slackbot query shortcuts
When production is having issues, the last thing you want to do is hunt for the right query. We added a custom bot response to a “show me web performance” message: clicking the link the Slackbot posts shows a web endpoint’s performance broken down by service.
A Slackbot that provides a shortcut to a common query in our observability tooling helps streamline our observability workflow.
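Our bot is part of our internal tooling, but a minimal sketch of the idea using the slack_bolt library looks like the following; the Honeycomb query URL is a hypothetical placeholder for a saved query.

```python
# Sketch only: slack_bolt used for illustration; the query URL is a placeholder.
import os

from slack_bolt import App

app = App(
    token=os.environ["SLACK_BOT_TOKEN"],
    signing_secret=os.environ["SLACK_SIGNING_SECRET"],
)

# Hypothetical saved query: web endpoint performance broken down by service.
WEB_PERF_QUERY_URL = "https://ui.honeycomb.io/example-team/datasets/web/result/abc123"

@app.message("show me web performance")
def web_performance_shortcut(message, say):
    # Reply in-channel with a one-click link to the prepared query.
    say(f"Web endpoint performance by service: {WEB_PERF_QUERY_URL}")

if __name__ == "__main__":
    app.start(port=3000)
```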
Stage 5: Reflections and next steps
Measuring adoption
Calculating the ROI of observability tooling can be difficult. Honeycomb’s usage statistics helped us a great deal, because counting the number of active users is a good way to gauge how often engineers use the tool.
A graph showing the growth in the number of active Honeycomb users since we began observability enablement.
We took it a step further and assessed the value of those interactions. Our hypothesis was that if the findings from the observability tooling were useful, people would share them with their peers. Because our engineering processes rely heavily on GitHub issues, we chose the number of issues and pull requests that mention or link to Honeycomb (a trace, a query result, and so on) as a proxy adoption metric. As we ramped up enablement at the end of 2021, the number of issues mentioning Honeycomb took off, a sign that we were on the right track.
A bar graph displaying the number of GitHub issues where the word “Honeycomb” appears in the issue’s name or description
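A rough way to compute this proxy metric is sketched below using the GitHub search API and the requests library; the organization name is a placeholder, and a real setup would authenticate and handle rate limits.

```python
# Sketch only: counts issues and PRs mentioning "Honeycomb" as a crude adoption
# proxy. "example-org" is a placeholder; add an auth token for real-world use.
import requests

def honeycomb_mentions(org: str, year: int) -> int:
    query = f'"Honeycomb" org:{org} created:{year}-01-01..{year}-12-31'
    response = requests.get(
        "https://api.github.com/search/issues",
        params={"q": query},
        headers={"Accept": "application/vnd.github+json"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["total_count"]

if __name__ == "__main__":
    for year in (2020, 2021, 2022):
        print(year, honeycomb_mentions("example-org", year))
```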
Unexpected workflows
Laying a strong foundation for observability allowed us to build workflows we could never have envisioned. Here are some of our favorites:
Informing the cost program: Because we trace all traffic and have spans for SQL queries, Elasticsearch requests, and more, we can investigate spikes in the usage of shared infrastructure components (such as a database cluster) and attribute them to a specific customer. By combining this data with the cost of the various infrastructure components, we can estimate the cost of every transaction we serve. Unexpectedly, observability has become a crucial part of our infrastructure cost program (see the sketch after this list).
Strengthening security audits: Our ability to keep 100% of selected transactions lets us track every interaction with our production data console, which has helped security establish stronger oversight of access to our customer data.
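To illustrate the cost attribution idea above, here is a rough, hypothetical sketch: a shared component’s monthly cost is apportioned across customers in proportion to the database time their traced transactions consumed. The figures and field names are made up; in practice the aggregates come from queries over our SQL spans.

```python
# Hypothetical sketch of per-customer cost attribution from trace data.
from collections import defaultdict

DB_CLUSTER_MONTHLY_COST = 10_000.0  # made-up figure

# In practice these aggregates come from a query over SQL spans.
sql_spans = [
    {"customer_id": "a1", "duration_ms": 120.0},
    {"customer_id": "a1", "duration_ms": 340.0},
    {"customer_id": "b2", "duration_ms": 2200.0},
    {"customer_id": "c3", "duration_ms": 90.0},
]

db_time_by_customer = defaultdict(float)
for span in sql_spans:
    db_time_by_customer[span["customer_id"]] += span["duration_ms"]

total_db_time = sum(db_time_by_customer.values())

# Apportion the cluster's cost in proportion to each customer's share of DB time.
for customer, db_time in sorted(db_time_by_customer.items()):
    share = db_time / total_db_time
    print(f"{customer}: {share:.1%} of DB time, ~${share * DB_CLUSTER_MONTHLY_COST:,.2f}/month")
```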
What’s next?
Building a culture of observability will remain part of our technical program. We will focus on improving our onboarding materials, integrating observability via traces more deeply into our R&D processes, and exploring front-end instrumentation.