313: Challenges of Offering an API

Primary Topic

This episode delves into the complexities and security challenges of providing an API for Arvid Kahl's podcast data platform, Podscan.

Episode Summary

Arvid Kahl shares insights on safeguarding a data-centric business that revolves around an API, emphasizing the threats of data scraping and the protective strategies needed. He discusses the balance between user accessibility and rigorous data protection, including techniques like encoded IDs and stringent rate limiting. Kahl also reflects on the dual nature of APIs as both a business asset and a potential liability, pointing out the necessity of terms and conditions that restrict competitive use of the API. The episode is a deep dive into the operational and strategic considerations of running a data platform in the competitive podcasting space.

Main Takeaways

The ease of API access can increase user engagement but also raises the risk of data misuse.
Encoded IDs and complex access patterns can deter potential data thieves.
Rate limiting is crucial to prevent abuse while allowing genuine users sufficient access.
A no-freemium policy helps protect valuable data from non-paying users.
Legal and operational frameworks must evolve with the business to adequately protect its assets.

Episode Chapters

1: Introduction to API Challenges

Arvid discusses the transition from identifying his customer profile to addressing the direct challenges of offering an API. He outlines the core concerns around data accessibility and protection. Arvid Kahl: "The thing that I want to give freely to my paying customers is also the thing that I have to protect at all costs."

2: Defensive Strategies

Details on implementing encoded IDs and rate limiting to safeguard the API. Arvid explains how these measures help prevent scraping and unauthorized data collection. Arvid Kahl: "Encoded IDs help obfuscate the underlying ID to make it harder to iterate and make the record more recognizable."

3: Legal and Business Considerations

Kahl elaborates on the legal terms set up to prevent competitive use of the API and discusses the rationale behind a no-freemium policy. Arvid Kahl: "You cannot use the Podscan API to create an application or service that competes directly with Podscan's core products."

Actionable Advice

Use encoded IDs to enhance API security.
Implement strict rate limiting on API access.
Establish clear terms of use that prevent competitive threats.
Consider the business implications of offering freemium access.
Regularly review and update security measures to stay ahead of potential threats.

About This Episode

When you sell access to an API, what should you keep in mind? How would you protect your data but still make it easy to use? What product and legal repercussions does this all have?
Those are questions I asked myself this week. And today, you get to listen to the process of exploring them and the answers I found.

People

Arvid Kahl

Companies

Podscan

Books

None

Guest Name(s):

None

Content Warnings:

None

Transcript

Arvid Kahl
Welcome to The Bootstrapped Founder. Last week I dove into my ideal customer and the exploration of finding them. Now that I've chosen that elusive ideal customer profile, that ICP, there are consequences. And I have found that the right people to build for are there now, but what do they actually need? I'll walk you through my product challenges today.

This episode is sponsored by acquire.com. More on that later. Let's recap from last week. I decided to turn Podscan into the most comprehensive podcast data platform it can be. It's very focused on transcription, but there's more to this, and I'll get to that in today's episode.

My ideal customer and the customer profile around them is anyone who wants to build a product or service or business on top of such a data platform in the podcasting space, or utilizing podcasts for whatever they want to do. Might be marketing, might be placing people on podcasts, might be anything, as long as they need the data. That is my customer profile. And that means that I'm selling something that is extremely easy to copy and to clone and to abuse. Right?

That's what data is. The thing that I want to give freely to my paying customers, transcripts, rankings, metadata, all other kinds of things related to podcasts, is also the thing that I have to protect at all costs. That's the bizarre thing about businesses that are based on APIs. The easier it is to grab the data from the business or from the API, the more people want to actually use it, because it's simple, it's enjoyable. Yet the easier it is to grab a lot of data, the more risky it gets for the actual business offering it, because people can quickly abuse it.

There are a few problematic kinds of behaviors that software businesses have to contend with if they are in the API space, in the data platform space, and they are exacerbated, particularly if you have all your value locked up in your API. And that is mostly scraping. It is the biggest threat to a business. Just someone grabbing the whole database in one go. Every single podcast, every single transcript, all connections between them, all ratings, the whole thing. If somebody were to get this, that would be a problem.

And the problem is, beyond that, duplicating a valuable treasure trove of data, that's pretty much what the Internet was built for, right? Every time you go to a website, a small copy of that website is made on our computers. And most of the time, website owners actually want that. That's how it works.

That's how we see content. And don't get me started on actually downloading files. We have BitTorrent and we have distributed file systems and all that kind of stuff, IPFS, where people very willingly host copies of things on their computer so that other people have access to them. But a fully fledged database that costs hundreds of hours in work and tens of thousands of dollars to create, or has at least until now.

Yeah, I don't want that to be copied, really not. So I need to prevent it. And from the start I have been. And I need to keep staying ahead of those who want to siphon this treasure trove of data into their own systems. And with that in mind, I think I need to think defensively in a couple of ways about my product.

First, I need to make it hard to iterate over my database entries. I need to make that very complicated. The example here is, I guess, if you're downloading record 4287 of my database, you know that there's probably a 4288 and a 4289 and so on, right? That way anybody who's intent on scraping the website could just build an automated system to grab every single record in a row, iterating over these numbers. And that's why I created encoded IDs in my API, just like Stripe is doing.

I think they both obfuscate the underlying ID to make it harder to iterate and make the record more recognizable, so that 4287, which is just a seemingly random number, turns into something like pod_ and then some kind of string, 865 whatever. There's still randomness in there, but it starts with pod and it has a certain length. It's not just a number, and that looks more like a podcast because it has the word in it and less like a random number. And I think that is also more usable, and usable in the sense that it can be copied more easily, it can be shared, it's something more semantically holistic. And if somebody were to get their hands on a list of all of these IDs, well, obviously they could still scrape them.

But all this really needs to do is to deter people from seeing an easy opportunity. And a lot of security in the API space is about hard security. But there is also that kind of stuff, right? It's not security by obscurity, because there is an encoding and a secret and a hashing process involved for this, but it also is just a signal that, okay, this is not going to be one of those easy APIs to scrape. And in many ways that can deter at least the less interested parties. You still have to fight the other ones, but we'll see.
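
A minimal sketch of the encoded-ID idea described here, in Python. The episode only says that an encoding, a secret and a hashing process are involved, so everything concrete below is an assumption for illustration: a 32-bit integer primary key, a small keyed Feistel permutation built on HMAC-SHA256, base-36 output, and the hypothetical names `SECRET`, `encode_id` and `decode_id`. It is not Podscan's or Stripe's actual scheme.

```python
import hashlib
import hmac

SECRET = b"replace-with-a-real-secret"  # hypothetical key, kept server-side
ROUNDS = 4
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"


def _round(half: int, rnd: int) -> int:
    """Keyed round function: maps a 16-bit half to 16 pseudo-random bits."""
    msg = f"{rnd}:{half}".encode()
    digest = hmac.new(SECRET, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:2], "big")


def _scramble(n: int) -> int:
    """Feistel network: a reversible permutation of 32-bit integers."""
    left, right = n >> 16, n & 0xFFFF
    for rnd in range(ROUNDS):
        left, right = right, left ^ _round(right, rnd)
    return (left << 16) | right


def _unscramble(n: int) -> int:
    """Inverse of _scramble: undo the rounds in reverse order."""
    left, right = n >> 16, n & 0xFFFF
    for rnd in reversed(range(ROUNDS)):
        left, right = right ^ _round(left, rnd), left
    return (left << 16) | right


def _to_base36(n: int) -> str:
    digits = ""
    while True:
        n, rem = divmod(n, 36)
        digits = ALPHABET[rem] + digits
        if n == 0:
            return digits.rjust(7, "0")  # 7 base-36 digits cover 32 bits


def encode_id(record_id: int, prefix: str = "pod") -> str:
    """Turn a sequential database ID into a prefixed, non-sequential public ID."""
    if not 0 <= record_id < 2**32:
        raise ValueError("record_id must fit in 32 bits")
    return f"{prefix}_{_to_base36(_scramble(record_id))}"


def decode_id(public_id: str, prefix: str = "pod") -> int:
    """Recover the database ID from a public ID, rejecting malformed input."""
    head, sep, body = public_id.partition("_")
    if head != prefix or not sep or len(body) != 7:
        raise ValueError("malformed ID")
    scrambled = int(body, 36)
    if scrambled >= 2**32:
        raise ValueError("malformed ID")
    return _unscramble(scrambled)
```

With this shape, neighbouring records such as 4287 and 4288 get unrelated-looking public IDs of the same length, while the server can still map a public ID back to the row it refers to.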

Let's keep thinking about what else I need to think about. Any API that I offer needs to be severely rate limited, particularly in my space, because podcast information, historical data that goes back to like 2012 or whatever, when people started really podcasting, well, that doesn't change after the fact. Once it's scraped, it is true. Once somebody's downloaded this, they have something valuable and it's not changing. It's not like altering its nature all the time.

That is something that they can then build something upon. And even with mild scraping, somebody could eventually explore the whole API within a few months just using search or whatever if they were to use it all the time. Right? And that's where rate limits come in. Rate limiting means really on my trial plan, for example, that only 100 requests per day can be made to the API.

That's not a lot for a scraper. That's actually used up within a couple of seconds and then it can't do anything anymore. But for somebody evaluating the API product, playing with it, checking it out, 100 manual requests, that's quite a lot. And it's more than enough to see if it works, how it works, what the data is, what the speed is, that kind of stuff. And paid plans on Podscan get very liberal, but still very sensible limits, like a couple thousand, maybe tens of thousands for the bigger plans. And if somebody needs more, they can buy an enterprise plan and get in touch.

Like, I have the capacity to set these limits on a per account kind of level, right? That's, that's kind of the idea of running this API. I can control how much people get to see how often they get to use it and all that. And for anyone else, I think these limits are quite sufficient. And if they're not, they can just ask me.
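
A minimal sketch of per-plan daily limits with per-account overrides, in Python. Only the trial figure of 100 requests per day comes from the episode; the other plan numbers, the in-memory storage and the names `PLAN_DAILY_LIMITS` and `DailyRateLimiter` are assumptions for illustration. A production system would more likely keep these counters in a shared store.

```python
from dataclasses import dataclass, field
from datetime import date

# Trial limit is from the episode; the paid-plan numbers are hypothetical.
PLAN_DAILY_LIMITS = {"trial": 100, "essentials": 2_000, "premium": 20_000}


@dataclass
class DailyRateLimiter:
    """Fixed-window limiter: one counter per account per calendar day."""

    overrides: dict = field(default_factory=dict)  # account_id -> custom limit
    _usage: dict = field(default_factory=dict)     # (account_id, day) -> count

    def limit_for(self, account_id: str, plan: str) -> int:
        # A per-account override wins over the plan default.
        return self.overrides.get(account_id, PLAN_DAILY_LIMITS[plan])

    def allow(self, account_id: str, plan: str, today: date) -> bool:
        """Count the request and return True, or return False if over quota."""
        key = (account_id, today)
        used = self._usage.get(key, 0)
        if used >= self.limit_for(account_id, plan):
            return False
        self._usage[key] = used + 1
        return True
```

A scraper on a trial account burns through its 100 calls almost immediately and is then refused until the next day, while raising the limit for one trusted customer is a single entry in `overrides`.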

I can modify them as I learn more, right? They can tell me, hey, this is, I think I'm paying $50 a month, but I think I need more. Well, maybe I can adjust the limit. I'm very flexible. This is my business.

I can do whatever I want. So, you know, that is a part of communication happening here. But I had to set limits initially to protect the API from being scraped or from people being able to do this in the first place. And finally, I guess another choice that has to do with money in this case, I guess, is there's no freemium. There's no freemium plan.

I cannot and I will not allow non-paying customers to access this data, because if they cannot afford the $19 a month plan, the Essentials, that still has significant access to the API, they can't have it. They just won't. People go to great lengths automating account creation and data extraction in freemium products. There are a lot of examples where people with the freemium plan have even just free trials to begin with.

But freemium in particular, that has certain limits set. And then people create, like, 20 accounts in one go, like all, like their name plus one and plus two and up to plus 20 and then some just to be able to abuse the system. And I don't want that. That's not going to happen here. Podscan is pay to play.

Like, for real. It's a business. Come on. Like, if you want this kind of data to build something with, you can shell out $20 for the most basic version. And if you can't, well, maybe you shouldn't build a business.

And I do all of this mostly because the easiest part of Podscan, if you were a copycat trying to clone it, is the interface and the kind of web-facing stuff. The complicated and expensive stuff is all in the backend, in a database. That's what I have to protect because that's what people are after, and that is hard to build. And product limitations like what I've been talking about aren't the only barriers that I can throw into people's path here. Of course, as a German, I like rules and regulations, and I drafted the terms and conditions for the API even before I had built it. Like, maybe not the best idea for an indie hacker, but it just comes very naturally to me.

So it happened. I had that in place before I even activated the first user on the API. The first sentence of these terms should make it absolutely clear what's okay and what is not. That sentence goes like, you cannot use the Podscan API to create an application or service that competes directly with Podscan's core products. That's the sentence, right?

You cannot build Podscan clonescan.com from the Podscan API, because if you do, I will turn off the API and I might take other steps because that is a breach of these agreements. And I added a few sentences in there about storing the data. Also not allowed. If it's not meant for immediately serving your customers. If that is what it's for, that's fine.

But, like, storing it and using it for your own purposes, no. That is a limitation that every API user agrees to upon connecting to the Podscan APIs. That's also part of this. The moment you start using it you agree to these terms, it's very clearly outlined, and there is an intentional act to connect to an API, so I assume consent. And of course all of this is flexible for sure.

But the point is, people cannot use it to build a clone. That is legally not okay. And they cannot use my data without re-requesting it from the servers after a certain while, which also helps with stale data floating around. And it kind of protects the integrity of the data that I present. And when you limit access like this, you also limit opportunity. I'm quite aware of this, and that's the kind of hard balance to strike here.

I want my users to feel like they can build anything they want on top of these APIs, but I also very much still want to be in control of the data that powers these other products. It's funny, because I've been looking into my competitors, like the Podchasers and the Listen Notes or the data platforms for podcasts, and they have very, very similar terms and just rules. And I've talked to several of those founders, and they are super highly protective of that data because they know how valuable it is, and they know how easy it is for people to copy large amounts of data and then do something with it and sell it for cheap. So they are usually charging a lot of money even for the most basic access. And they are highly, highly restrictive in terms of how much data you can get at any given point and any kind of amount of data that you can grab.

It's really noticeable just how protective people are of podcast data. I find it very interesting because it's so funny to think that podcasting itself is a highly open ecosystem, but the aggregation of this, the aggregation of not just podcast information itself, but metadata, like transcripts, what I'm doing, or like audience information, what James Potter is doing with Rephonic, that kind of stuff. Like there is so much additional data being aggregated and it's so expensive to do it, that people protect that data very, very, very strongly. And I got a message earlier this week on my help desk chat, the widget that I have, the little bubble, from a founder who wondered just how much they could cache the data, not even store it, cache it, that they receive from my API. Is a few seconds fine?

Can it go, like, into their own cache to be sent out in an email later? And it got quite specific, and it just reminded me how much just-in-time decision making running a software business really is about, both on their end, because they needed to figure out, is this a tool that we can use for our purposes, but also on my end, because I needed to kind of think about, well, how much do I allow people to do here? Right? Is it fine to cache it for an hour? Is 24 hours okay?

Can they write it into like a Redis, or can they write it into like an SQL database somewhere, but then delete it? It was an interesting consideration, like how much am I willing to bend my own rules to facilitate things by people that I have kind of validated and can trust? That is something, I guess, that I will find even more often, this particular challenge, the more I go into kind of deals with bigger and bigger businesses. But right now, already with these founder kind of people that I talk to that are on my level, that built businesses, built small businesses to try things, it's already happening here.

I found that a very interesting observation. I need to remind myself that all of this is just stuff we make up and that these rules are there to be flexible and sometimes broken for an opportunity that comes our way, but they're also there to protect us from other things that other people see as opportunity. That's kind of the theme of this whole thought for me. How can I build a defensive business that can also be offensive, that can just go for an opportunity without risking too much? That's where I'm at.

And the more data I ingest with Podscan, the more data is kind of pulled into the system and transcribed and analyzed, the more critical these choices and partnership agreements will become, right? Right now, my users have personal access to me. They can write to me, they can DM me, and I often have a personal history with them from prior Twitter conversations or email exchanges. But someday these will be bigger and bigger businesses trying to get their hands on as much as they can, because there is no personal connection, there is no mutual trust. So that kind of brings me to another conundrum that I have here.

And there are some kinds of data that I collect from a wide variety of sources that I might not want to share on the API at all. I have them, but I don't want to give them out. I'm thinking about this, and let me explore my thoughts with you here. Audience size, I mentioned this earlier, is one of the best kept secrets in the podcasting world, right? It's really hard-to-come-by data.

No hosting provider, no podcast player or the creator of such software gives away even a glimpse at the actual numbers behind the podcast that they work with or on. The only people who know how many listeners they have are the owners of the podcast themselves. Sometimes they don't even know, because their podcast is run by a network and the network knows, but they don't. They only get like rough numbers and even those they don't share. It's a really, really tough thing to come by in such a situation.

What would one do? Well, I guess guesstimates are the thing I can present. There are a lot of tools out there that check the Apple Podcasts charts and look for review counts, and then check the size of the social media profiles that are mentioned in the podcast descriptions, and then compile them into some kind of score. Podchaser has a score, Listen Notes has a score. And I think I'm working on something similar as well, because I'm effectively building the same kind of system in the background.
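
A rough sketch of what compiling those signals into a single score could look like, in Python. The episode names the inputs (chart position, review counts, social profile sizes) but not how any platform combines them, so the weights, the log scaling, the caps and the name `reach_score` are all invented for illustration and do not reflect how Podscan, Podchaser or Listen Notes compute their scores.

```python
import math
from typing import Optional

# Hypothetical weights; they only need to sum to 1 for a 0-10 scale.
W_REVIEWS, W_SOCIAL, W_CHART = 0.5, 0.3, 0.2


def reach_score(review_count: int, social_followers: int,
                chart_rank: Optional[int] = None) -> float:
    """Combine public signals into a rough 0-10 reach estimate."""
    # Log scaling keeps very large shows from dwarfing everything else.
    # The caps (100k reviews, 10M followers) are arbitrary assumptions.
    reviews = min(math.log10(1 + review_count) / 5, 1.0)
    social = min(math.log10(1 + social_followers) / 7, 1.0)
    # Rank 1 scores 1.0, rank 201 and below (or unranked) scores 0.0.
    chart = 0.0 if chart_rank is None else max(0.0, 1 - (chart_rank - 1) / 200)
    raw = W_REVIEWS * reviews + W_SOCIAL * social + W_CHART * chart
    return round(raw * 10, 1)
```

The point of a score like this is that the output can be shared while the underlying review counts and follower numbers stay private.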

But I could share all these metrics that kind of go into the score in my API, right? I could share very specifically how many reviews this podcast has in the United States, or at least it had over the last couple of weeks, because I have a full history of review counts on Apple. That's what I've been building. So why not add it to the API? I'm thinking about this, and I'm struggling with this quite a bit.

I want my users to be able to get as much as they can from the platform. Obviously, this data can be useful to someone. I don't know what it's going to be like. Maybe somebody wants to build a review system for themselves, and I don't know what the data can do. I just know that it's there and people could use it.

But I also want to keep some secret sauce to myself, because if I create a score from that data, well, the score is valuable all in itself, and it doesn't need the data. And I've been looking at how other platforms solve this, and most of them just really don't. They don't give you the data, if anything at all. They share this rough score, a simple ranking, like four out of ten star rating or something, or top 10%, that's all you get. And even that tends to be only available in the more expensive tiers, like you have to pay for it.

And only then do you even get these weird little ranked scores. And honestly, it feels odd to limit data like this. But I think that's what I will do with Podscan as well. It's just like, realistically, audience information is probably the most expensive non-AI work that Podscan could do, right? AI work is like GPU, like transcription and inference, and that stuff is expensive because it runs on hardware that is just really, really rare right now.

It's hard to access and people charge a premium for it. But even computationally, on a CPU, or the many, the 48 CPUs that my measly PHP server has, there's a lot of computation, because when you scan for audience information, scan for social media profiles and that stuff, it involves constantly scanning the web, parsing websites. Occasionally I need proxies to reliably get results, and all of that has a cost. For this reason, I think I'll make anything indicating reach and audience and listener data a premium and higher plan feature for Podscan.

Right now, Podscan Essentials is the starting plan for $20. Premium is 50 and the higher is like 100 plus, right? That's like, my enterprise plan is at 500. Hey, if somebody wants it for a little bit cheaper and has a good reason and they reach out to me, they're probably going to get it. I'm at this point in my business, but you know, that is kind of where it is: 20, 50, 500.

So at 50 plus this data will exist on the API, and the API will not return these fields for Essentials customers and will only return example data or rounded numbers for trial accounts. I think I do not want abuse for this because that is so relevant. The data is so hard that I need to protect it. Hard to get and hard in itself. Very true.
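
A minimal sketch of that tiering in Python: full audience fields for Premium and above, no audience fields for Essentials, rounded numbers for trial accounts. The field names, plan identifiers and the choice to round down to one significant digit are assumptions for illustration, not Podscan's actual response format.

```python
# Hypothetical field and plan names.
AUDIENCE_FIELDS = {"audience_size", "review_count"}
FULL_ACCESS_PLANS = {"premium", "enterprise"}


def round_down(n: int) -> int:
    """Keep only the leading digit, e.g. 48213 -> 40000, 312 -> 300."""
    if n <= 0:
        return 0
    magnitude = 10 ** (len(str(n)) - 1)
    return (n // magnitude) * magnitude


def shape_response(record: dict, plan: str) -> dict:
    """Return the view of a podcast record that a given plan may see."""
    if plan in FULL_ACCESS_PLANS:
        return dict(record)
    if plan == "trial":
        # Trial accounts see that the data exists, but only roughly.
        return {
            key: round_down(value) if key in AUDIENCE_FIELDS else value
            for key, value in record.items()
        }
    # Every other plan (e.g. Essentials) gets no audience fields at all.
    return {k: v for k, v in record.items() if k not in AUDIENCE_FIELDS}
```

Filtering at the response layer like this means the same stored record can serve every plan, and moving a field between tiers is a one-line change.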

So I'll have to figure out how I can actually communicate this. Like for people who are on those lower plans, that there is more there, but they can't have it unless they actually pay for it. I probably have to put it in the documentation and maybe inside the product as well. But I think that's the way forward for this. It costs me to create, so it should also cost people to consume.

And of course I'll have to make sure that all these limitations and protections are also present in the user-facing website. Obviously scraping happens right at that level, not always on the API, but right on the website. And I can already feel that my eagerness to present all kinds of interesting data there might lead to data extraction that is not easily fought with rate limits and IP blocks on the API side. I really need to make sure that my website is protected too. It is, like there are rate limits in there as well, but I need to make sure that it's actually working and people can't circumvent it easily.

No doubt I'll run into other API and data related issues in the future here as well. And you might even think of one right now that I missed, that is very clear to you and that I have not thought about at all. In that case, please just reach out to me, tell me. This is an important topic and I would really, really like input here. Feel free to send me a Twitter DM or email me at arvid@podscan.fm.

Like I am very responsive either way. Twitter DM maybe not as much because Twitter's interface is horrible, but yeah, send me an email and I really appreciate all the wonderful feedback that I've been getting over the last couple of weeks as I've shared the Podscan journey in public. It's been a lot of fun. Got a lot of really cool shoutouts on Twitter for it. And in the community, people have been responding way more to my emails and my newsletters than before.

So I mean, obviously building in public is interesting. I follow several podcasts and people who are building their own businesses in public religiously, on my dog walks. I listen to that kind of stuff. So I get it and I quite enjoy it and I will never stop. So, you know, just enjoy it and send me what you think.

That's really what this is. I've been having so many cool conversations with my users and my customers, both of them highlighting things that they need, highlighting things like that they are interested in, how they approach their work, what they want, what they would like to see, what they use somewhere else. It's been really, really enlightening. Any feedback is welcome. And maybe this also allows you, as you are building your own thing, to allow for more feedback from others.

Right? I know we have our ideas, we have our plans, and we have our goals. And we have this conception as solopreneurs and indie founders that we are the only arbiters of choice. We make the decisions. So feedback is often something we struggle with because it can throw something in the path that we are not willing to deal with.

But you can and should get some feedback. And if you want to try what it feels like from the other side, send me a message. Tell me what you think. I really appreciate it. And that's it for today.

I want to briefly thank my sponsor, acquire.com, whom I intend to use to eventually sell Podscan for many, many millions of dollars. But we'll see, because that is not always the outcome. In some situations, you might have a really cool software business. You have customers, you have thousands of customers, you have good MRR, you're living the dream, but you're stuck in some kind of equilibrium. You can't really go anywhere.

It's working, but it's not growing. It's working well, but it's just not going anywhere. Or maybe it's even just going slowly, slowly down and you don't want to lose the value of the thing that you have created. And I know the situation is different for every single founder out there because we've all been building very different businesses. I always say that entrepreneurship is building something that nobody else has ever built before in this exact way.

So obviously the outcome is always going to be slightly different and unique between all of us. But if we are in a situation like this where we hit a skill ceiling or a time ceiling or just an attention ceiling, whatever it might be, the outcome is often the same: we lose interest, the business suffers and in the end, it becomes less and less valuable, maybe at worst completely worthless. And that doesn't need to happen. You create value. You should be able to convert it into cash for real, right?

You should be able to sell it to somebody who does not have the ceiling or has a different ceiling that can be useful for your business right now. So in that case, think about selling your business. You don't have to do it today or this week, but you should always consider that this is an option for you to take your business in the future. You can go to try.acquire.com/arvid to just see for yourself if this is something for you. Because the people over at Acquire, they've been helping hundreds, at this point, I guess thousands of customers or people, founders like us, sell their business for a solid price that other people are willing to pay and that will change your life.

So they know how to make a business more sellable, more presentable. That is something you should look into from the beginning anyway. So yeah, go to acquire.com, check it out, free to list. And I think this might be the right option for you at some point. So why not check it out today?

Thank you so much for listening to The Bootstrapped Founder today. You can find me on Twitter at @arvidkahl. Send me a message there if you're interested. You'll find my books, Zero to Sold and The Embedded Entrepreneur, and my Twitter course, Find Your Following, there too.

All of this is still very valid and useful. I occasionally look into my own books to just figure out where my next steps should be and if I'm not forgetting anything. It's quite useful, really. Good for my journey and good for your journey as well. If you want to support me and this show, please subscribe to my YouTube channel, get the podcast in your podcast player of choice, and leave a rating and a review by going to ratethispodcast.com/founder. I mentioned James Potter earlier, who built Rephonic, like a podcast audience size kind of API platform.

He also built ratethispodcast.com. So I've been kind of suggesting his product to any listener of the show for years at this point and never knew. We've been chatting, right? Like, all the people building businesses in the podcasting space are kind of connected with each other. So that was really fun to have a chat and find out that we've been, you know, Internet nerd friends for a while and just didn't know.

Well, any, any rating, any review will really make a massive difference if you show up on those rating platforms because then the podcast will show up in people's feeds. They will learn more about my journey, the products, and they will also just be more exposed to the knowledge that I'm trying to share. And I think that is really important to me and I appreciate if you help me with this. Thank you so much for listening. Have a wonderful day and bye.