319: My SaaS Server Exploded (& How I Salvaged It)

Primary Topic

This episode provides an in-depth look into how Arvid Kahl managed a major setback when upgrading his SaaS system, Podscan, highlighting lessons learned from the failure.

Episode Summary

In this episode of "The Bootstrap Founder," Arvid Kahl shares a detailed account of attempting to refactor the core operational system of his software, Podscan. Initially, the process seemed smooth, but soon he encountered significant issues leading to a drastic drop in system performance. Kahl's narrative dives into the technical aspects of building and rolling back system changes, emphasizing the importance of reliable development and deployment practices. Despite the project's failure, he reflects on the valuable insights gained regarding his product's complexity and the importance of system observability.

Main Takeaways

  1. Detailed planning and scheduling are crucial for focused software development.
  2. Implementing a new queue system in a complex SaaS can introduce unforeseen issues.
  3. Effective rollback systems are essential for recovering from failed deployments.
  4. Technical setbacks can provide deep insights into system complexities and personal development processes.
  5. Learning from failures is as important as celebrating successes in software development.

Episode Chapters

1: Introduction to the Refactor

Arvid discusses his decision to refactor Podscan’s transcription system, aiming for improved extensibility and reduced complexity. Arvid Kahl: "I chose to completely refactor the heart of my operation."

2: Implementation Challenges

Kahl describes the challenges faced during implementation and the initial success followed by critical failures. Arvid Kahl: "I tweeted about how well it worked, and I was kind of nervous because it was a pretty central functional part of Podscan."

3: Rollback and Reflection

After noticing significant performance drops, Arvid decides to roll back the changes, leading to reflections on the lessons learned from the experience. Arvid Kahl: "I rolled back what I'd worked on for 6 hours to the version that I had in the morning."

Actionable Advice

  1. Schedule Dedicated Time: Block out dedicated time for significant tasks to avoid interruptions.
  2. Test Thoroughly: Always run extensive tests simulating the production environment as closely as possible.
  3. Prepare for Rollbacks: Ensure your deployment process includes a reliable method for quickly reverting changes.
  4. Learn from Setbacks: Use failures as learning opportunities to improve both the product and your methods.
  5. Monitor System Health: Implement system monitoring to catch issues before they affect users.

About This Episode

Earlier this week, I finally found time to work on a large code change. When I deployed it, things got worse quickly. Here's what happened, why it happened, and how this changed my approach to working on a complicated software project.
One thing became apparent: while it may look like a waste of time to build something and revert back to "yesterday's version", I got massively interesting insights into my product and my process along the way.

So today, we'll talk about setbacks, getting back up again, and how we can judo a failure into making progress.

People

Arvid Kahl, Danielle (partner)

Companies

Podscan

Books

None

Guest Name(s):

None

Content Warnings:

None

Transcript

Arvid Kahl
Welcome to The Bootstrap Founder. Earlier this week, I finally had a day, a full day, to work on a massive refactor of Podscan's transcription queueing system. In case this is too technical for you, consider today's episode a nerdy deep dive into the heart of where my business builds its value. I'll explore the safeguards and mental exercises that I needed and developed to deal with sizable setbacks. And there will be a setback.

And if you're interested in what a solopreneur's tech complexity can look like, stick around, too. I'll spare no details. This episode is sponsored by acquire.com. More on that later. So what happened on Monday this week? For the very first time in almost three weeks, I had a full day to myself to just spend on building software, the stuff that I like.

I'm a software engineer. I love building. Didn't get to for a while. I was at MicroConf two weeks ago. That was a full week that I needed just for the conference.

And then the following week was full of calls that I had scheduled so I could catch up on the week that I missed at the conference. There were just a lot of interruptions happening over the last couple of weeks, which meant that I didn't have any meaningful long block of time to spend on my software business, Podscan. Every now and then, I could build a little thing or respond to a customer message and change a little API field here or there. But I really didn't have time to just really think and really spend a couple of hours building something, testing it, deploying it. So after these two weeks had passed, I chose a day where I would have a full day of opportunity to work on software.

But that actually took some work to get there. I had to reconfigure my schedule for this, and I tell you what, not just my schedule, my whole scheduling approach, because for the last couple of months, I've been very open with my schedule. People could book calls almost every day of my week through the calendar link that I put into every email that I sent to my customers, mostly because I really wanted the early Podscan users and early customers to find the perfect time for them to talk to me.

Right? Just a 15-minute call. Whenever you find time, talk to me. I want to hear from you. Because the idea here was that I could hop on a call with a potential or a paying customer whenever it suited them.

I'm in a super early stage. I need lots of feedback, and every little piece of feedback really helps. It's really insightful. So every early customer interaction counts; whether it had positive feedback or negative feedback doesn't matter. That is something I can build on later, because if I put my time in, people will see that and they will build a relationship with me.

And I wanted to open my schedule to everyone as much as I could, and I did. But this also meant that I was constantly interrupted, because people scheduled their conversations with me whenever it best suited them, but never me. And in the meantime, I also had a podcast to record, interviews to organize and research, guests to invite to the show, and Twitter to be on all day. All of that still had to happen. I have to manage my editing and my transcribing for the podcast and my business, and a lot of extra stuff on certain days just adds to the overhead and cuts into the time that I can spend on building software.

I made a choice that I'd had enough of an open schedule, and I reduced my availability for any kind of call to two days a week max. All of my customer service conversations, my customer exploration conversations, discovery, that kind of stuff, now happen on Fridays. All my podcast recording now happens on Thursdays. I used to have one interview call a day on Tuesday, Wednesday, Thursday, whatever. Now I have as many people as I can fit into that day, with some time in between.

So I stack my calls on Thursdays and Fridays. That frees up the rest of the week to write code, do marketing, sales, all of that, without constant interruptions. Obviously, there might still be the odd call every now and then, but I want to keep the first three days of the week free to be able to build stuff. So after setting up this new schedule, I set a day to work fully on my software project. And I finally had that day on Monday this week.

So I chose the biggest issue that I had out of all the issues that I could work on. It's kind of an eat the frog moment. I chose to completely refactor the heart of my operation. Now, I will explain to you what that is, because you may or may not know. Podscan is the business that I'm currently building.

It's a media monitoring and data platform for podcast transcripts, and therefore it ingests thousands of hours of audio data every single hour. It transcribes them into text and then scans them for keywords to send to an API. It's a lot of data coming in, and it's not a very flexible system. And whenever any of my now 24 backend servers that do the transcriptions has capacity, it asks my main server through an internal API if there's anything available.

Then my main server checks the database for an available project and sends it over. That's how it works. And this has grown over time, and it feels like it can be made more extensible and kind of less convoluted. I thought on Monday morning, let's tackle this. And I had been thinking about this for a couple of weeks, taking a couple of notes every now and then, thinking about what I could do, what the structure of the new system would be.
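To make the current dispatch flow a bit more concrete, here is a minimal sketch of what such a pull-based internal endpoint could look like in Laravel, the framework Arvid mentions later in the episode. The route, model, and column names here are illustrative assumptions, not Podscan's actual code.

```php
use App\Models\Episode;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Route;

// Hypothetical endpoint: a backend server with spare capacity asks the main
// server for the next episode to transcribe.
Route::post('/internal/next-episode', function () {
    $episode = DB::transaction(function () {
        // Claim the oldest unassigned episode and lock the row so two
        // transcription servers never receive the same one.
        $episode = Episode::whereNull('assigned_at')
            ->oldest()
            ->lockForUpdate()
            ->first();

        $episode?->update(['assigned_at' => now()]);

        return $episode;
    });

    return response()->json([
        'episode' => $episode?->only(['id', 'audio_url']),
    ]);
});
```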

And I came up with an internal queue system, a system of many queues that makes it easier for each individual podcast episode to move through all these different stages of analysis. Because Podscan works like this: I get an audio file from some feed somewhere, like some feed tells me, hey, this new audio file has just been uploaded, and this is the new podcast episode. I download it, I transcribe it. That's step one.

Then I run inference on it to extract certain information, maybe build a summary and ask questions of it through an AI system, a couple of local LLMs that I run on all my 24 servers. That's step two. And then I scan for keywords and I send alerts to everybody who is a paying customer or trial customer of Podscan and has alerts set up. That's step three. So: transcription, inference, and scanning.

I have a couple more steps planned in my document there, but I was going to start with these three. Even without expanding the system yet, just build a better system that can be expanded on later. And right now, they all kind of kick each other off. When an audio file is available for transcription, it's presented on the API. A backend server fetches it and then responds with the full transcript a couple of minutes later.

And that kicks off the inference step. The server handles inference, sends it back, and then scanning starts on the main server. And that's how it works. It's a back-and-forth kind of system between two fleets of servers. And it works, but there's always room for improvement, right?

Well, that's what I thought on Monday. Monday morning, I thought about creating three queues in which these candidates live: one for transcription, one for inference, one for scanning, according to their stage of completeness. Instead of looking in my database, I would just look into the queue, pick one, and move it from one queue to the other depending on the stage that it's in. Simple enough.
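As a rough illustration of that idea, the three-queue design could be sketched with simple Redis lists, one per stage. The queue names, the helper function, and the choice of Redis lists are assumptions made for this example, not what Podscan actually ran.

```php
use Illuminate\Support\Facades\Redis;

// Hypothetical helper: pop an episode id off the queue for its current stage
// and push it onto the queue for the next stage.
function advanceEpisode(string $fromStage, string $toStage): ?string
{
    $episodeId = Redis::lpop("queue:{$fromStage}");

    if ($episodeId) {
        Redis::rpush("queue:{$toStage}", $episodeId);
    }

    return $episodeId ?: null;
}

// e.g. once a transcript has been stored:
// advanceEpisode('transcription', 'inference');
```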

And I started building this, because I finally had some time to do it right. I built the system over what must have been 4 hours and used it locally. And I kind of tried to replicate the production system as much as I could, but you know, it's still not the exact same thing. It worked pretty well. I spent some time on edge cases, because I knew that production is a pretty complicated system.

What if there's an error while the transcript is being created? What if one of those servers goes down? What if I have to retry it, but there's another step that needs to be done first? That kind of stuff. And fortunately, my work has some structure.

I'm not completely randomly building these things. I use the Git Flow model. If you're not aware of this, the idea in the Git Flow model is that you only commit code to the main branch of your repository, or the master branch, as it used to be called back when Git Flow was invented, when it's actually usable. Anything that is kind of a longer project sits on its own branch, called develop, or a feature branch for something specific, and you can build it there, and at the same time you can still fix bugs on the main branch should you need to. That's the idea.

I have a development branch for these more complicated experiments, so I can still fix things as I need to. And after about 6 hours of diving into the code and testing intensely, I was ready to pull it from development, that's the name of the branch, to the main server through deployment. And it worked. There were a couple of bugs when I started it. There was something about a particular data point that I didn't think about that was slightly different on the production database, but it was fine and I could quickly fix that.

I tweeted about how well it worked, and I was kind of nervous because it was a pretty central functional part of Podscan, but it did its job, it worked. It was still consuming stuff from the API, it was still sending out transcripts, still getting transcript data back, still sending alerts. So that was all working. And it's really critical, because if the data ingestion and the transcript engine stuff doesn't work for my business, then nothing works. The API and notifications would break down, and that's the stuff people pay for.

So I need that to function well. So I let it run for a bit, and I started checking my metrics, and gradually fewer episodes landed in the database. Usually I have around 2,500 episodes an hour. Now it dropped to 2,200, then 1,900, and then like half an hour later it was at like 300. From 2,500 to 300, that's a significant drop.

That's barely a sixth of the usual amount of episodes. That's when I stopped. Something was wrong in the system. Some queue item didn't move the way it should have and got lost somewhere.

And I couldn't see what the problem was. And I couldn't just play around and delete jobs in my production system. They all needed to work. It was live. It was a live system that interacted with live data.

So I rolled back what I'd worked on for 6 hours to the version that I had in the morning, before I started working on this. And I tweeted about that too, because I think sharing setbacks like this is part of my approach to building in public. And I was devastated. I was super frustrated. Not just by the system failing and the errors that happened, but also by feeling like I had wasted a day, a day that I had fought so hard to get.

And I went upstairs because I live in a basement like any good software engineer. And I talked to Danielle, my partner, and as I explained my frustration, I realized something. I may have spent a few hours writing code that I won't use, sure, and that's frustrating. But I also had learned a lot about my product and my approach to making improvements for the product. Both the product and my approach had properties that I was not aware of before.

Yes, the queue system didn't work, but I learned how complicated my existing system is, how delicately it's balanced, and how many interference points exist, where if you just change something a little bit, a lot of things break. So that was a lot of insight into the technical debt and technical complexity of my actual product here. And I also learned that my feature specification process was incomplete: adding another abstraction layer was not required, the queuing system wasn't necessary. My existing state machine, which I wasn't even aware was a state machine until that point, was efficient enough, as long as I ensured reliable state transitions between these states.

And what I currently have has reliable state transitions; the queue system I built did not. Hence it failed. And if I ever add more steps to the state machine, I will have to focus on improving the current state machine instead of replacing it, working with what I already have. So despite all the wasted hours that I was so frustrated about in the beginning, I learned so much from this attempt. I think it was absolutely worth it.
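The state machine he is describing can be made explicit in a few lines. A minimal sketch, assuming the stage names from this episode and modern PHP enums; the class and method names are illustrative, not Podscan's code.

```php
// Hypothetical, explicit version of the implicit pipeline state machine:
// transcription -> inference -> scanning -> done.
enum EpisodeStage: string
{
    case Transcription = 'transcription';
    case Inference     = 'inference';
    case Scanning      = 'scanning';
    case Done          = 'done';

    public function next(): ?self
    {
        return match ($this) {
            self::Transcription => self::Inference,
            self::Inference     => self::Scanning,
            self::Scanning      => self::Done,
            self::Done          => null,
        };
    }

    // A transition is only considered reliable if it moves exactly one step
    // forward; skipping a stage or moving backwards is rejected.
    public function canTransitionTo(self $target): bool
    {
        return $this->next() === $target;
    }
}
```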

And next time, I won't dive in with just a few notes in a Notion document. I will look into what I currently have more properly and monitor the state machine. And honestly, this is something I realized: monitoring, this observability, having observability into my system, that's a feature I need to build before I can replace systems like this.
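What that observability might look like in practice is, for example, a scheduled check on the ingestion rate, so a drop like the one above triggers an alert instead of being discovered by eyeballing metrics. A rough sketch, assuming an episodes table with a created_at column; the table name and threshold are made up.

```php
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Log;
use Illuminate\Support\Facades\Schedule;

// Hypothetical hourly check (Laravel 11 routes/console.php style; older
// versions would register this in the console Kernel's schedule() method).
Schedule::call(function () {
    $ingestedLastHour = DB::table('episodes')
        ->where('created_at', '>=', now()->subHour())
        ->count();

    if ($ingestedLastHour < 1000) { // far below the usual ~2,500 episodes per hour
        Log::warning("Ingestion dropped to {$ingestedLastHour} episodes in the last hour");
    }
})->hourly();
```

And one more remark about this whole process.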

I'm extremely lucky to have systems in place already to deal with a botched deployment that allow me to actually revert it. Like rolling it back was easy and it was possible. And that's maybe the most important part. If you build software, build software that you can easily, or at least without struggling too much, roll back one or two deployments. That's sometimes all you need.

And I use Laravel as my framework of choice, and my programming language is PHP, and all that stuff that is underneath Laravel. I use Laravel Forge for hosting, or for orchestration, I guess, and Laravel Envoyer for deployments. And these tools in concert allow me to very, very easily and quickly roll back to a prior version of the application. For PHP, it's really just files in a directory that you symlink from one place to another. It's really simple.

You just point the server back at the old directory, at where it looks for what it should show to the people out there that use the product, and it immediately goes back to working like that. And Envoyer in particular lets me revert to a prior deployment within seconds, because I keep several dozen backup versions, because I'm prepared. I have like 50 steps back should I ever need them. And reversibility is so important. So if you build software, if you build a software business and you have these experiments that you want to run, these features that you want to build, you absolutely need to think about reversibility.
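For the curious, the symlink trick described above, which deployment tools like Envoyer handle for you, boils down to something like this stripped-down sketch. The paths are hypothetical, error handling is omitted, and it assumes timestamped release directories that sort chronologically.

```php
// Hypothetical rollback: point the "current" symlink at the previous release.
$releases = glob('/var/www/podscan/releases/*', GLOB_ONLYDIR);
rsort($releases);                    // newest release first (timestamped dir names)
$previous = $releases[1] ?? null;    // the release right before the current one

if ($previous !== null) {
    // Build the new symlink next to the old one, then rename it into place,
    // so the switch is effectively instant for the web server.
    symlink($previous, '/var/www/podscan/current-new');
    rename('/var/www/podscan/current-new', '/var/www/podscan/current');
}
```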

And often the database changes with these things in particular, and in my case it did; there were a lot of changes in the database. I built new queues, and they needed a table to reflect them in the database. Doesn't matter if it's in a Redis store or an SQL server like MySQL or Postgres, it doesn't really matter. Database changes happen as you build. So you need to have non-destructive database migrations. In particular, if you start a new feature, if you want to build something new and your database changes, I highly, highly recommend not switching things inside the tables, but adding more tables, and not changing types, but adding new types and then migrating data from one type to the other.

That's my personal experience. You might think of it differently, but don't modify tables, add new ones. That allows you to roll back to the old data, and there might be conflicts, but it's still better than having to wait on a migration over millions and millions of rows. And I'm at that point right now: I have over 3 million rows in the podcast database and over five or so million rows in my episodes database. It's going to be hundreds of millions in the end.

So any small change that goes through your development system within like 2 seconds is going to take 2 hours on the actual full production system. And that impacts other things: that impacts read speed, that impacts, what is it, the sync time between your replicas, if you have them. There are so many little things that can happen if you change even a tiny thing in your database. Add on top of it instead; that is always going to be easier than changing things as you go.

So this is my personal learning over, what, 15 years now of building software with migrations and databases: make sure you don't change things as you build. Most migration systems have this forward-backward mentality built in. Where you add a new table, there is a step back where the table gets deleted; or you make a change, and in the rollback feature you can kind of switch it back, but that often destroys data.

Just don't do it. Build another table, add another property, do something like that. It makes things easier. I've experienced this very, very clearly over the last couple of days, which is why I'm kind of hammering on the point here.
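In Laravel migration form, the additive approach he recommends might look roughly like this. The table and column names are illustrative, and it assumes the anonymous-class migrations of recent Laravel versions.

```php
use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

// Hypothetical additive migration: introduce a new table for queue items
// instead of altering the existing episodes table.
return new class extends Migration
{
    public function up(): void
    {
        Schema::create('episode_queue_items', function (Blueprint $table) {
            $table->id();
            $table->foreignId('episode_id')->constrained();
            $table->string('stage')->default('transcription'); // transcription | inference | scanning
            $table->timestamps();
        });
    }

    public function down(): void
    {
        // Rolling back only drops the new table; the original data stays intact.
        Schema::dropIfExists('episode_queue_items');
    }
};
```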

But let's just look at this as a lesson in reframing, right? It was a frustrating loss of time, and I thought about it for a minute, removed myself from my office, talked to my partner, and all of a sudden I noticed that it was a massive learning opportunity to understand my product better and to understand my process better, which is, for a solopreneur like myself, very important. If you don't have people that constantly critique your stuff, because you're the only person you work with, you need to step out of your own way and look at your process to see if there is room for optimization. So next time something like this happens to you, you build something and it doesn't work, you deploy something and everything explodes, look at it through the lens of emergent insights.

If I hadn't tried to implement this, I wouldn't have learned about the existing complexity in my business. And now I know more about myself and the business, and I have more experience with what I tried, what worked and what didn't. Do this a few hundred times and you have all the moat you'll ever need in your business. You will know so much about how to build a business or a product like yours that every copycat out there is just going to fail within their first 20 or 30 of these little problems. And then they're going to move on to better things. That's the moat: going through these experiments, learning from them, and never seeing them as a failure, but always as an opportunity to get more insight.

That is what makes all the difference. And that's it for today. I want to briefly thank my sponsor, acquire.com. Imagine you built this perfect software product that never has any bugs. Yeah, it's a dream, right?

But you know, you built a great SaaS. You have customers, and you're generating consistent MRR. You're living the SaaS dream, pretty much. Problem is, it's just not working for you. You're not growing, for whatever reason.

And that might be your personal growth, might be the business's growth. It's just that there's something lacking: focus, skill, interest. You feel stuck and you don't know what to do. The story here, unfortunately, is that too often people think they should just keep working on it.

But what happens? They just pay less attention. They stop doing things. Inaction happens, and then the business becomes less and less valuable over time, or at worst, completely worthless. So if you find yourself at this point, or if you think your story is likely headed down this road where it's just not working for you anymore, I would consider a third option, and that's selling your business on acquire.com to people who would like to take it and bring it to the next level.

Capitalizing on the value of your time and just the skillset that you have right now is a smart move. And if your business doesn't fit anymore, well, you can exchange it for money. That's kind of how acquisitions work. So acquire.com is free to list. You can always just check it out, see what you can do to make it more acquirable.

The people over there have helped hundreds of founders already. So go to try.acquire.com/arvid and see for yourself if this is the right option for you. Thank you for listening to The Bootstrap Founder today. I really appreciate it. You can find me on Twitter @arvidkahl.

That's A-R-V-I-D-K-A-H-L. You'll find my books and my Twitter course there, too. And if you want to support me and this show, please subscribe to my YouTube channel, get the podcast that you're currently listening to in your podcast player of choice, and then leave a rating and a review by going to ratethispodcast.com/founder, because that makes a massive difference.

If you show up there and you support the show, then the podcast will show up in other people's feeds and support them on that journey. Any of this will help the show. Thank you so much for listening. Have a wonderful day and bye.