With Pear Deck’s specific challenges—traffic through Firebase as well as government firewalls—Prometheus was a game-changer. "We didn’t even realize how stressed out we were about our lack of insight into what was happening with the app," Eynon-Lynch says. Before, when a customer reported that the app wasn’t working, the team had to investigate manually, with no way of knowing whether customers were affected all over the world, whether Firebase was down, or where the failure was occurring.
To help solve that problem, the team wrote a script that pings Firebase from several different geographical locations, and then reports the responses to Prometheus in a histogram. "A huge impact that Prometheus had on us was just an amazing sigh of relief, of feeling like we knew what was happening," he says. "It took 45 minutes to implement [the Firebase alarm] because we knew that we had this trustworthy metrics platform in Prometheus. We weren’t going to have to figure out, ‘Where do we send these metrics? How do we aggregate the metrics? How do we understand them?’"
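The case study doesn’t include the script itself, but a probe like the one described might look something like this minimal Python sketch: it times a request to a Firebase endpoint and pushes the observation to a Prometheus Pushgateway as a histogram sample. The Pushgateway address, Firebase URL, metric name, and region label are all illustrative assumptions, not Pear Deck’s actual values.

```python
# Minimal sketch of a per-location Firebase probe (illustrative values only).
import time
import requests
from prometheus_client import CollectorRegistry, Histogram, push_to_gateway

PUSHGATEWAY_ADDR = "pushgateway:9091"                   # assumed gateway address
FIREBASE_URL = "https://example.firebaseio.com/.json"   # placeholder endpoint
REGION = "us-east"                                      # set per probe location

registry = CollectorRegistry()
latency = Histogram(
    "firebase_ping_seconds",
    "Round-trip time of a ping to Firebase, by probe region",
    ["region"],
    registry=registry,
)

def probe():
    start = time.monotonic()
    try:
        requests.get(FIREBASE_URL, timeout=10)
    except requests.RequestException:
        pass  # a timeout or error still yields a (slow) observation
    latency.labels(region=REGION).observe(time.monotonic() - start)

if __name__ == "__main__":
    probe()
    # Push the observation so Prometheus can scrape it from the gateway.
    push_to_gateway(PUSHGATEWAY_ADDR, job="firebase_probe", registry=registry)
```

Run from several geographical locations, each probe contributes samples labeled by region, so a single histogram answers both "is Firebase slow?" and "slow from where?"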
Now, when a customer complains and none of the alarms have gone off, the team can feel confident that it’s not a widespread problem. "Just to be sure, we can go and double check the graphs and say, ‘Yep, there’s currently 10,000 people connected to that Firebase node. It’s definitely working. Let’s investigate your network settings, customer,’" he says. "And we can pass that back off to our support reps instead of the whole development team freaking out that Firebase is down."
Pear Deck is also giving back to the community, building and open-sourcing a metrics aggregator that enables end-user monitoring in Prometheus. "We can measure, for example, the time to interactive-dom on the web clients," he says. "The users all report that to our aggregator, then the aggregator reports to Prometheus. So we can set an alarm for some client-side errors."
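The aggregator itself isn’t shown in the case study, but the idea can be sketched in a few lines of Python with prometheus_client: clients POST their timings to the aggregator, which folds them into a histogram that Prometheus scrapes from a /metrics endpoint. The port, request format, and metric name here are assumptions for illustration, not the open-sourced project’s actual interface.

```python
# Minimal sketch of a client-metrics aggregator (illustrative interface only).
from http.server import BaseHTTPRequestHandler, HTTPServer
from prometheus_client import Histogram, generate_latest, CONTENT_TYPE_LATEST

tti = Histogram(
    "client_time_to_interactive_seconds",
    "Time to interactive DOM as reported by web clients",
)

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Each client POSTs its own measurement (assumed here to be a
        # plain-text millisecond value); the aggregator folds them all
        # into one histogram.
        body = self.rfile.read(int(self.headers["Content-Length"]))
        tti.observe(float(body) / 1000.0)  # ms -> seconds
        self.send_response(204)
        self.end_headers()

    def do_GET(self):
        # Prometheus scrapes the aggregated histogram from this endpoint.
        self.send_response(200)
        self.send_header("Content-Type", CONTENT_TYPE_LATEST)
        self.end_headers()
        self.wfile.write(generate_latest())

if __name__ == "__main__":
    HTTPServer(("", 9100), Handler).serve_forever()
```

Because the aggregation happens server-side, Prometheus never has to scrape thousands of individual browsers; it scrapes one endpoint, and alert rules on the resulting histogram cover the whole client population.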
Most of Pear Deck’s services have now been moved onto Kubernetes, and all of the team’s new code is deployed there. "Kubernetes lets us experiment with service configurations and stage them on a staging cluster all at once, and test different scenarios and talk about them as a development team looking at code, not just talking about the steps we would eventually take as humans," says Eynon-Lynch.