321: Unexpected Downtime: Stress as Enhancement vs. Stress as Panic

Primary Topic

This episode explores how stress can be harnessed as a tool for enhancement rather than a trigger for panic, through the lens of a real-life SaaS crisis.

Episode Summary

Arvid Kahl discusses a recent incident in which his service, Podscan, experienced unexpected downtime, and how he transformed a potentially panic-inducing situation into a learning and improvement opportunity. He details the troubleshooting process, his decision-making under stress, and how a strategic approach led to a strengthened product and service. The episode emphasizes the power of controlled stress to drive positive change and the technical and emotional strategies employed to turn a crisis into a constructive scenario.

Main Takeaways

  1. Stress can be channeled positively to enhance focus and drive improvements.
  2. Effective crisis management involves calm, step-by-step troubleshooting rather than reactive panic.
  3. Technical problems often require considering multiple factors and external services.
  4. Sharing issues publicly can lead to community support and unexpected solutions.
  5. Post-crisis analysis and documentation can prepare one for future challenges.

Episode Chapters

1: Introduction to the Crisis

Arvid introduces the episode by sharing his recent encounter with unexpected downtime, emphasizing the stress it induced and his approach to managing it. He outlines his initial response and the steps taken to diagnose the problem. Arvid Kahl: "This episode is about turning a nightmare scenario into a learning experience."

2: Deep Dive into Troubleshooting

The chapter details the technical troubleshooting process, from initial checks to deeper system analysis, highlighting how Arvid used various tools and strategies to identify the issue. Arvid Kahl: "I started with my emergency reaction... make sure it's a real problem."

3: Resolving the Crisis

Arvid discusses how community feedback helped pinpoint a potential cause, leading to a resolution. He reflects on the importance of external perspectives in solving problems. Arvid Kahl: "One tweet can bring in a flood of ideas that may just solve your problem."

4: Lessons Learned and Future Preparations

The final chapter focuses on the aftermath, including strategic changes made to prevent future issues and the insights gained from the experience. Arvid Kahl: "Writing a postmortem helps solidify the learnings and prepare for future issues."

Actionable Advice

  1. Embrace controlled stress as a focus enhancer.
  2. Establish a calm, systematic approach to troubleshooting.
  3. Leverage community knowledge during crises.
  4. Document issues and solutions for future reference.
  5. Regularly review and test system components to prevent issues.

About This Episode

It started as a normal day — but then things started going wrong. And my stress levels rose.
But I recently learned how to use that to my advantage. Here's how I wrangled with a SaaS founder's worst nightmare, what happened, and how I turned the tables.

People

Arvid Kahl

Companies

Podscan

Books

None

Guest Name(s):

None

Content Warnings:

None

Transcript

Arvid Kahl
My therapist recently introduced me to the concept of stress as enhancement in exposure therapy. The basic idea is that a little bit of stress can focus you to a point where you make new neural connections and realize new and beneficial insights that help you change for the better. Well, I was able to put this to the test earlier this week, because all of a sudden, without warning or any indication, really, Podscan slowed down to a crawl and then went down completely. And that's every SaaS founder's absolute nightmare. And yet, as frustrating as that was, I embraced the situation.

I got through it and I came out the other side with a better product, happier customers, and a feeling of having learned something new and valuable. And that's what I will share with you today. I'll walk you through the event, what I did and what came out of it. This episode is sponsored by acquire.com. More on that later.

Let's jump to Wednesday morning. I recently talked about how I have pushed all my calls and face-to-face engagements to Thursday and Friday. So Wednesday was going to be one of those days where I could fully focus on product work. And a high-profile customer of mine had recently sent me a DM and asked for a specific feature. And since serving my pilot customers, who also have the ability to tell all their successful founder friends about Podscan, is one of my growth channels, I got to work.

I really wanted to build this, so I started the morning quite relaxed and ready to make the product better. I had nothing else planned but work on the product, and it was a fairly simple feature: being able to ignore specific podcasts when setting up an alert for keyword mentions. The founder in question got too many alerts for their own name when it was mentioned on their own show. So, you know, it was a clear value add for people like them to allow them to ignore certain podcasts. And it took maybe an hour or so to build this, including making the feature available on the API, which I want to be able to do whenever I make any changes to my alerting system, because I already know a lot of people are using my API to set up alerts for their own clients. It's really cool.
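
For a sense of what that API surface might look like, here is a purely hypothetical sketch of creating an alert that ignores a specific podcast. The endpoint path, field names, IDs, and token are invented for illustration, not Podscan's documented API, and the `Http` facade assumes a Laravel context:

```php
<?php
// Hypothetical illustration only: the endpoint, payload fields, and token
// variable are made up and may not match Podscan's real API.

use Illuminate\Support\Facades\Http;

$response = Http::withToken(env('PODSCAN_API_TOKEN'))
    ->post('https://podscan.fm/api/v1/alerts', [
        'keyword'             => 'Jane Founder',
        // Podcasts whose episodes should never trigger this alert,
        // e.g. the founder's own show.
        'ignored_podcast_ids' => ['pd_123456'],
    ]);

$response->throw(); // surface an exception if the alert could not be created
```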

And when I was done with this, after an hour or so, I was doing my usual pre-deployment checks, which I tend to do before deployment just to see if everything is okay. And one of my admin endpoints, one that I usually load in the browser just to look at a couple of metrics, was a bit slower than usual. And that was quite odd, because it had reliably taken around two seconds, just the time to query the database and get stuff. But now it took ten. And this occasionally happens when the database is a bit busy with some massive import.

But I wasn't running anything, and I didn't really know what was going on. And even after a few refreshes, this didn't let up. It only got worse. So I went to the homepage of Podscan to check, and it didn't load for 15 seconds this time. Or rather, it clearly didn't even connect for 15 seconds, because once it did, the site loaded immediately. And that left me slightly confused.

So the server itself responds as it should, but something on the network causes a massive delay. What could that be? I didn't do anything. And at this point, I could see where this was going: 10 seconds, 15 seconds, 30 seconds.
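
To make that "connects slowly, then loads instantly" observation measurable, cURL can split a request into connect time versus time to first byte; a long connect with a fast response points at the network rather than the application. A minimal probe, not from the episode, might look like this in plain PHP:

```php
<?php
// Generic timing probe: compare "time to establish the connection" with
// "time until the first byte of the response arrives".

$ch = curl_init('https://podscan.fm/');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_CONNECTTIMEOUT => 60,   // allow very slow connects so we can measure them
    CURLOPT_TIMEOUT        => 90,
]);
curl_exec($ch);

printf(
    "connect: %.2fs  first byte: %.2fs  total: %.2fs\n",
    curl_getinfo($ch, CURLINFO_CONNECT_TIME),       // TCP connection established
    curl_getinfo($ch, CURLINFO_STARTTRANSFER_TIME), // first byte of the response arrived
    curl_getinfo($ch, CURLINFO_TOTAL_TIME)
);
curl_close($ch);
```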

This was going to end up with minute-long delays, essentially timing out the page. This was going to be downtime, and my stress levels started to rise. I didn't panic. I knew that I had not made any changes to the system, but I felt like this needed my immediate attention for the next hour or so. So I took a few seconds to reflect on my situation.

I just tried to sit calmly and think about it, because I didn't need to type away immediately. It would be better to just think about it. In the past, I probably would have easily spiraled into this kind of active panic where I would try to do something just to do something. But I was thinking, what if the server is completely messed up? Or what if this is the end for Podscan?

That would have been my thinking in the past, at times when these things happened to other services of mine. I did not think that way this time. This time, like my therapist said, I was leaning into the stress, because you have a choice between stress as suffering and stress as leverage, and I chose to use it. So I stepped into a place of calm determination. I said to myself: this is a technical issue with a technical reason and a technical solution.

The Internet is a complicated place, a weird network of tubes, as they say. And my product is itself a complicated machine. Something is wrong. I will figure it out. I will attempt to solve this calmly and without skipping steps.

And I started with my emergency reaction. Step number one: make sure it's a real problem. There's this old adage that just because things work for you, they might not work for others, right? And I think that's also true in reverse: if something is broken for you, it might still work for others.

So that's when I went to webpagetest.org and ran a test on the main homepage from somebody else's server, and when I saw connectivity tanking on their end as well, I knew it was not just an error on my end. My internet here in the Canadian countryside is occasionally spotty, but this time it wasn't at fault. So whose error was it? I had a few candidates. It could have been my servers, my programs, my PHP application itself.

It could have been my Cloudflare settings, something with the encryption, something with the caching. Cloudflare itself could have issues. My database could have issues that caused a delay. My server hosting provider Hetzner, or maybe the Internet itself could have issues at this point. So to check which one was responsible, I went down this list in order of how proximate the service was to me.

So, like, how close to me being able to do anything about it would that be? I started with my server, which was hosted on the Hetzner cloud. I logged into my server. I checked its vitals, like the CPU load, the RAM usage, and the disk space available, and it all looked pretty good. I quickly restarted the Nginx process and the PHP-FPM supervisor, just to make sure the application was not caught up in something.

But that yielded no results. Same problem. And as huge timeouts like this can be caused by connectivity issues inside an application, I also restarted my local Redis instance. No effect. I quickly restarted the whole server.

Nothing happened. Same problem. So I then checked what would happen if I accessed my website from within the server it was hosted on, locally. The website responded immediately: if I just hit localhost, that worked. But using its public URL, podscan.fm, the connection issue persisted.

It still took like 20 seconds at this point for me to get a response back. And I knew then that this was probably an issue beyond my reach. That was not a common software as a service deployment problem. There was something else, and funny enough, my stress levels went down a little bit. I knew that there were still a few avenues I could go down to check, but I had the feeling that this was something that was done by somebody else.

It was someone else's doing, and it likely wasn't intentional or malicious. There was no sign of an attack, no DDoS or anything, nor were there resource issues. There was something else going on, and it was, like, beyond my pay grade, as they would say. But to make sure it was not on my end, I still logged into my RDS, where I keep my database. That's Amazon.

I don't know what RDS is short for. It's database services or something, managed database hosting. And I checked the metrics there, and if anything, they had gone down. And that kind of makes sense, right? Fewer web requests make it to the server, and that means fewer database calls. So it wasn't a database issue either.

So with my own software stack working, for better or worse, perfectly, I went one step up the ladder and looked into Cloudflare itself. Or rather, I looked at Cloudflare, because it was at that point that I took a breather. I just went and had a coffee. I told my partner that I was in firefighting mode, so that everybody knew what was happening and wouldn't interrupt me, and then went back into my office. I told my dog too. She didn't care, but, you know, that's what it is.

I'm going to have a little office buddy. And I considered that this might be part of a larger issue. Like I said, it probably wasn't me; it was something somebody else was doing. So I checked the status pages for all the services that I use: Cloudflare, Hetzner, even AWS, who knows? I even looked at Twitter and Hacker News to see if there was a widespread issue.

Sometimes, if there's a large outage somewhere, you can find it first on Hacker News or Twitter, and then it slowly gets onto the status pages of the services that are actually affected. But there was no mention of any issues. And my experience with running SaaS businesses for ten-ish years now told me that if something changes with the stack, it's because some configuration was changed somewhere within the stack. That's how this happens. And since I didn't change anything, I looked into the upstream partners that serve the Podscan website. Cloudflare and Hetzner both have their dedicated networks and connectivity rules.

They might have changed something. And in the Cloudflare dashboard I looked for notifications, warnings, or "you are over some kind of limit" messages, but there was nothing there. On Hetzner, same story. The server metrics looked good. There was no warning anywhere.

And I was kind of stumped, because if something had happened, if some configuration had been changed, like as a response to me going over a limit or doing something they didn't want, they would have at least sent me a message. I checked my email too. Nothing there. And I was stumped. Yeah, I didn't know what to do.

Either someone up the chain had some network issues that they didn't want to tell anybody about, or the traffic to my server, which was one of several Hetzner servers and the only one actually experiencing the issues (I checked the other ones; they were perfectly fine), might have been artificially slowed down in its origin network. So I did what I always do when I have no idea what to do.

I went to Twitter and I shared that I had this issue and I didn't know what to do. And within minutes a lot of ideas came in, and one caught my eye. Somebody explained that this might be silent connection throttling from the hosting provider, something they had experienced before when using Hetzner's service for scraping operations. And now, Podscan isn't technically a scraping tool, even though, kind of, it does pull in a lot of RSS feeds for podcast analysis, so you could call it scraping. It does download a lot.

But I wouldn't put it past Hetzner to have some kind of automated detection system that silently makes it harder for scraping operations to succeed on their platform. But, you know, I don't think I am one. Here's the thing: I will never know if this was what happened, because I haven't found anything out and they haven't told me anything about it. And ten minutes after I started looking into this particular possibility, things changed a little bit.

I ran one more test before things changed, using a company called SpeedVitals, to check the time to first byte from several locations around the world. Like, how long does it take for the first response to come back from the server? And the first time I ran it, there were timeouts and 70-second delays from all over the world. My website was effectively down for most of the world, and I felt surprisingly calm about it at this point. Of course, I was agitated and frustrated.

Something that I care a lot about wasn't working, and soon my paying customers would notice. They hadn't yet, really, because Podscan is an alerting tool and the alerts were still being sent out, right? It was the incoming feed traffic that had slowed down a lot, but this wouldn't work forever. So again, I tried to be calm, tried to not go over a manageable stress level. I went upstairs.

I grabbed another hot beverage, not a coffee, because I'm addicted to that stuff, I guess. And I told Danielle about it, and she, with the experience of having run a SaaS business with me, calmly told me that she knew I'd figure it out eventually. And I love her for that. Like, how lucky am I to have a calm and measured partner to keep me from spiraling out of control in these situations? Because I care about the stuff I do, and I take it very seriously. But she was like, you can do it.

You'll do it, don't worry. We had been through this before; we had issues like this with FeedbackPanda back in the day, and sometimes it took us days to fix them and customers got super upset. Yet we still got to sell the business for millions of dollars. So, you know, we know that this can work. So back to the office I went, ready to keep working on this.

I ran the time-to-first-byte test again, just to make sure that it was still a problem. And it was back. Everything was fine. Every location reported sub-second response times. You thought I was stumped earlier, when I didn't know what was going on?

I was equally stumped when everything just fell back into place. It was super weird, it was super strange. It was like someone had pulled the plug from the bathtub and the water was just flowing as if nothing had ever happened. I went back into my browser and it was just like before, like under a second in response time. I went into the logs, and my transcription servers had just started to grab new candidates again to transcribe, while finally being able to deposit the finished transcripts that they had completed but couldn't report back.

So I breathed a very big sigh of relief and immediately started working on moving away from Hetzner. Because, mind you, I had no proof of this being some kind of shadow banning or silent throttling. I still don't know to this day. It might just have been a network congestion issue in the part of the data center that my VPS was in. But I knew I was relying on something that was not trustworthy, because it was just one server on one cloud provider, and I knew I had to diversify. I would keep my Hetzner server in its current configuration and always keep it up to date: if I deploy something to the Podscan repository, it would also go to my Hetzner server. But it would now be a backup server, just in case I need to switch back to it at some point.

But my main server would move into the cloud that I had known and trusted for a while, the one that already had my data: AWS. After all, I've been running my database there for the whole time that Podscan has been around, and it has been super reliable, and I think there are other benefits too; I'll get to those in a second. But with Laravel Forge, which I use for provisioning and orchestrating servers, and Laravel Envoyer, from where I deploy new versions, spinning up an instance on AWS was extremely simple. I just needed to adjust a security group or two for Forge to be able to connect via SSH. But that was quickly done: within 20 minutes I had a fully functional copy of my main server running on AWS.

I tested it under a subdomain of podscan.fm, just to see if it could work as an alternative. And this morning, the morning after, I guess, after it had been running idle for half a day, I finally made the switch through Cloudflare. Pretty easy: remapping IPs from one server to the other and just hitting save. That was a lot of fun. It was an absolute joy to see traffic slowly shifting from the old server to the new.
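
That switch was done in the Cloudflare dashboard; for reference, the same remap can also be scripted against Cloudflare's v4 DNS records API, roughly like this. The zone ID, record ID, token name, and IP below are placeholders, and the `Http` facade assumes a Laravel context:

```php
<?php
// Sketch of repointing an A record at the new server's IP via Cloudflare's API.
// Zone ID, record ID, token, and the IP address are placeholders.

use Illuminate\Support\Facades\Http;

$zoneId   = 'YOUR_ZONE_ID';
$recordId = 'YOUR_DNS_RECORD_ID';

Http::withToken(env('CLOUDFLARE_API_TOKEN'))
    ->put("https://api.cloudflare.com/client/v4/zones/{$zoneId}/dns_records/{$recordId}", [
        'type'    => 'A',
        'name'    => 'podscan.fm',
        'content' => '203.0.113.10', // the new server's public IP (placeholder)
        'proxied' => true,           // keep Cloudflare in front of the origin
    ])
    ->throw();
```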

And through AWS's routing magic, it was also much, much faster. With my database being in the same location as my application, the round-trip time dropped significantly. Some of my queries were actually cut down to like 20% of their prior duration. It was really cool. You can feel it on the website too.

It's extremely snappy. So I'm coming out of this quite horrible incident with renewed confidence in what I have built. Because, first off, the service never broke. It was unavailable, sure, but not because it was overloaded or because it was in some error state. It was underloaded, really, and it was just working.

It didn't get any requests, but it was working. One massive insight that I got from all this was that I made a pretty good choice, one that some people may call premature optimization, but it really paid off this time around: making every single request between my main server and my 20-ish transcription servers queue-based. That was a really good idea. So when a transcription job gets created, a transcription server pulls something from an API, just a URL to an audio file, which it then downloads.

That's what my transcription servers do. They get the URL to the audio file, they download the file, they transcribe it, and they send back the text. That's their main purpose. So whenever a transcription is finally created, I don't just send off an HTTP request back to my API to save it to the database. That HTTP request itself is wrapped in a job which runs on a queue, and it will be rerun multiple times if it fails.

And using Laravel Horizon and the queue retry and backoff parameters that come with it, every request will be tried more than ten times, and the server will wait for up to ten minutes between these attempts, with the final attempt waiting for a full day. That way, things can crash all over the place, or slow down, or whatever, but the valuable transcription data that I pay a lot of money to produce on these GPU-based servers is safe in a queue, ready to eventually be consumed. That was a big learning: queue-based communication between microservices, you might call these, or internal APIs, has been really, really helpful. Look into this kind of system if you run PHP, or into RabbitMQ if you use anything else, or maybe even a Kafka queue or some AWS offering; they have their own thing.
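
As a minimal sketch of what such a queued callback can look like in Laravel, the HTTP report is wrapped in a job with multiple tries and growing backoff delays. The class name, endpoint URL, and config key here are illustrative, not Podscan's actual code:

```php
<?php
// Sketch of a queued "report the finished transcript" job, assuming a Laravel
// app with Horizon. Names, URL, and config key are made up for illustration.

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Support\Facades\Http;

class ReportTranscript implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    // Try many times, waiting longer between attempts (seconds),
    // with the last retry delayed by a full day.
    public $tries = 12;
    public $backoff = [60, 120, 300, 600, 600, 600, 600, 600, 600, 600, 86400];

    public function __construct(
        public string $episodeId,
        public string $transcript,
    ) {}

    public function handle(): void
    {
        // Send the finished transcript back to the main API. Any non-2xx
        // response throws, which fails the job and lets the queue retry it.
        Http::timeout(30)
            ->withToken(config('services.podscan.internal_token'))
            ->post('https://podscan.fm/api/internal/transcripts', [
                'episode_id' => $this->episodeId,
                'transcript' => $this->transcript,
            ])
            ->throw();
    }
}

// Dispatching it from the transcription worker once the text is ready:
// ReportTranscript::dispatch($episodeId, $text);
```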

It's really useful to have a message queue to communicate between things, because then data is at least persisted somehow, right? There are other problems that come with it. Like, you get duplicates sometimes; it's either zero-to-one or one-to-infinity. You have this kind of choice between making sure that things get delivered at least once or at most once. That's an issue with message queues.
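
One common way to live with at-least-once delivery is to make the consumer idempotent, so a duplicate message simply overwrites the same record instead of creating a second one. A minimal sketch, assuming an illustrative Eloquent `Transcript` model keyed by an episode ID (both names are made up, not Podscan's schema):

```php
<?php
// Idempotent consumer sketch: if the same "transcript finished" message arrives
// twice, the second delivery updates the existing row instead of duplicating it.

use App\Models\Transcript;

function storeTranscript(string $episodeId, string $text): Transcript
{
    return Transcript::updateOrCreate(
        ['episode_id' => $episodeId], // natural key: one row per episode
        ['text'       => $text]       // payload that may be delivered more than once
    );
}
```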

But hey, if you have a system that can handle that internal complexity, and you can test for consistency between these complex states, you might want to go down the message queue route. It's been really, really helpful. I also enjoyed how easy it was to move away from Hetzner as a cloud provider. And I'm not moving away completely. I still run several parts of the business there, like my search engine; that will stay there.

But it was really fun to just move the main server to AWS within a couple of minutes. I made the absolute right choice entrusting the Laravel ecosystem here, with Forge and Envoyer; deploying stuff and making on-the-fly changes to switch servers over was comfortable, reliable, and functional. It was really good, I really enjoyed it. And ultimately, maybe the most important part here was that I was glad I kept my stress levels under control throughout all of this, and that allowed me to stay level-headed when facing a problem I couldn't solve immediately myself. I grew from this experience. Like, I feel more confident in my product and my ability to deal with it, and the slightly elevated but controlled stress of it all helped me focus on working through the steps I knew I needed to take, calmly and without losing sight of the larger issue.

One thing that I recommend is writing a postmortem, kind of like this podcast, right? The story of it. Just writing it down, taking all your learnings and persisting them onto paper or into a document, or writing something into your own business documentation. I wrote a few more emergency standard operating procedures right after the problem was resolved, so they will help me, or future me, I guess, when dealing with similar issues, hopefully in an equally calm state of mind. And that's the important part, right?

You cannot control these externalities. That's the nature of the term, externality: they are things out of your control. Cloudflare, Hetzner, Forge, AWS: they all could do something, intentionally or unintentionally, that creates an issue for you, or more work, or a challenge. But running around in a panic won't solve that.

That's the stress that gets to you. What you want to do is have a cup of tea, tell yourself that you got this, and then tackle it like the professional that you are. And that's it for today. I want to briefly thank my sponsor, acquire.com. Imagine this.

You are a founder who built a solid and reliable SaaS product that is working all the time, or most of the time. You've acquired customers and you're generating really consistent monthly recurring revenue. But there's something, there's a problem: you feel you're not growing, personally or as a business, for whatever reason, maybe a lack of focus or a lack of skill or a lack of interest. You just don't know what to do.

Well, the story that people would like to hear is that you buckled down and reignited the fire. But realistically, the situation that you might be in might just be stressful and complicated. And too many times the story here ends up being one of inaction and stagnation, until the business itself becomes less and less valuable over time, eventually completely worthless. So if you find yourself here already, or you think your story is likely headed down a similar road, I would consider a third option, and that is selling your business on acquire.com.

Because if you capitalize on the value of your time here today, that is a smart move for you as a founder. And somebody else gets to benefit from it too, because they get a business that already does something really cool. Acquire.com is free to list. They've helped hundreds of founders already. Go to try.acquire.com/arvid and see for yourself if this is the right option for you, right now or in the future.

Thank you for listening to The Bootstrapped Founder today. You can find me on Twitter at @arvidkahl, that's A-R-V-I-D-K-A-H-L, and you'll find my books there too. If you want to support me and the show, please subscribe to my YouTube channel, get the podcast in your podcast player of choice, and leave a rating and a review by going to ratethispodcast.com/founder. It makes a massive difference if you show up there, because then the podcast will show up in other people's feeds. Any of this will truly help the show and me, so that would be great.

Thank you so much for listening. Have a wonderful day and bye.