Things That Don't Scale, The Software Edition

Episode Summary

The podcast discusses how software companies have used hacks and shortcuts, or "things that don't scale", to get their products to work when they didn't have time to build things the right way initially. The hosts discuss how Paul Buchheit, the creator of Gmail, used an existing Google Groups product to view his own email before building out features like writing emails and inviting other users. This "dirty hack" allowed him to launch and spread Gmail within Google before the team had fully built out the infrastructure and features people expected from email. They also discuss how early Facebook launched each university on completely separate code, databases, and servers to avoid having to scale one giant database. Users would have to go to harvard.thefacebook.com versus stanford.thefacebook.com. It took Facebook years after launch to build one unified database for all users globally. Justin.tv, the precursor to Twitch, also used hacks to deal with scaling live video. They built the ability to turn high-traffic pages into static pages to avoid crashing servers. They also pre-populated video streams across their video servers based on website traffic before officially launching the stream to viewers. Other examples are shared from imeem, Friendster, Google, and more. The hosts conclude that the best product decisions are often made quickly, when you just "turn on the water" and see what breaks, instead of over-engineering a perfect system. The imperfect hacks allow startups to build something users want so they eventually earn the privilege to build more scalable systems later.

Episode Show Notes

Dalton Caldwell and Michael Seibel on software hacks that don't scale. Companies discussed include Google, Facebook, Twitch, and imeem. Watch the first video on doing things that don't scale here: https://youtu.be/4RMjQal_c4U

Apply to Y Combinator: https://www.ycombinator.com/apply/

Episode Transcript

SPEAKER_00: We'll get a founder that's like, oh, how do I test my product before I launch to make sure it's going to work? I always come back and tell the founders the same thing. If you have a house and it's full of pipes, and you know that some of the pipes are broken and they're going to leak, you can spend a lot of time trying to check every pipe, guess whether it's broken or not, and repair it, or you can turn the water on. You'll know. You'll know exactly the work to be done. Hey, this is Michael Seibel with Dalton Caldwell. Today we're going to talk about what it means to do things that don't scale, the software edition. In this episode, we're going to go through a number of great software and product hacks that software companies used to figure out how to make their product work when perhaps they didn't have time to really build the right thing. Now, Dalton, probably the master of this is a person we work with, a guy named Paul Buchheit, who invented this term, the 90-10 solution. SPEAKER_01: He always says something like, how can you get 90% of the benefit for 10% of the work? Always. This is what he always pushes on people when they tell him it's really hard to build something and it'll take too long to code it. He'll just always push on this point. And you know, founders don't love it. Right? Would you say that's a fair assessment, Michael? SPEAKER_00: That's a fair assessment. Yes. Founders hate it. SPEAKER_01: But tell the audience why it's worth listening to the guy. Why does he have the credibility to say that to people? SPEAKER_00: Well, PB is the inventor of Gmail, and as kind of a side project at Google, he invented something that 1.5 billion people on Earth actively use. And he literally did it doing things that don't scale. So I'll start the story and then please take it over. As I remember it, PB was pissed about the email product he was using. And Google had this newsletter product. For the first version of Gmail, he basically figured out how to put his own email into this Google Groups UI. And as he tells the story, his eureka moment was when he could start reading his own email in this UI. And from that point on, he stopped using his old email client. And what I loved about this is that, as he tells the story, every email feature that any human would want to use, he just started building from that point. And so, you know, he would talk to the YC batch and he's like, and then I wanted to write an email. So I built writing emails. And if you know PB, he could have gone a couple of days reading emails without replying at all. So he didn't need writing emails to start. I remember him telling the story of the first time he got his coworker, literally his desk mate or something, to try to use it. And his desk mate is like, this thing's pretty good. It loads really fast. It's really great. The only problem is, PB, it has your email in it and I want it to have my email. And he was like, oh shit. Okay. I got to build that. I forgot about that. Perfect 90-10 solution. And so then it started spreading throughout Google. SPEAKER_00: And do you remember when it broke? No. What happened? Oh, so he told this story where one day PB came in late to work, which, knowing PB, was, you know, every day, and everyone was looking at him really weird. And they were all a little pissed. And he got to his desk and someone came over to him and was like, don't you realize that Gmail has been down all morning?
And PB was like, no, I just got to work. And so he's trying to fix it, trying to fix it. And then his coworkers see him grab a screwdriver and go to the server room. SPEAKER_00: And they were like, oh God, why do we trust PB with our email? Like, we're totally screwed. And I think he figured out there was a corrupted hard drive. And I remember at that point in the story he says, and that day I learned that people really think email is important and it's got to always work. SPEAKER_01: And I think the reason he did it, man, is because he liked to run Linux on the desktop and he didn't want to run Outlook. Like the Google suits were trying to get him to run Outlook on Windows. And he was like, I don't really want to run Windows. But yeah, it was the dirtiest hack. As I recall, in the final part of the story, it was hard for him to get Google to release it because they were afraid it was going to take up too much hardware. And so there were all these issues where there was a decent chance, I think, that it never would have been released. SPEAKER_00: Well, the other part was that everyone thought Gmail's invite system was some cool growth hack. Virality hack. Like virality hack. It's like, oh, you got access to Gmail. You got, I think, four invites to give someone else. And these were, like, precious commodities. And it was just another version of things that don't scale. They didn't have enough server space for everyone. They physically did not have enough servers. SPEAKER_01: So they had to build an invite system. Yes. There was basically no option other than building an invite system. It was not like genius PM growth hacking. It was like, yeah, well, the hard drives are full. So I guess we can't invite anyone else to Gmail today. SPEAKER_00: That's it. That's it. So let's do another story, about Facebook's early days, that is similar in this light. SPEAKER_01: So let me paint the picture. Back when you started a startup a long time ago, you had to buy servers and put them in a data center, which is a special air-conditioned room that just has other servers in it. You plug them in and they have fast internet access. And so, being a startup founder until AWS took off, part of the job was to drive to the suburbs or wherever, to some data center, which is an anonymous warehouse building somewhere, go in there and plug things in. And what was funny is when your site crashed, it wasn't just depressing that your site crashed. It actually entailed getting in your car. Part of being a startup founder was waking up at 2 AM and getting in your car and driving to, like, Santa Clara because your code wedged, and you had to physically reboot the server, and your site was down until you did. So anyway, I'm just trying to set the stage for people. This was what our life was like. Okay. And so my company had a data center in Santa Clara and there were a bunch of other startups there as well. And so something I liked to do was look at who my neighbors were, so to speak. There were never people there, it was just their servers. And there'd be a label at the top of the rack and you could see their servers and you could see the lights blinking on the switch. Okay. So this is what it was like.
And so our company was in this data center in Santa Clara. SPEAKER_01: And then one day there's a new tenant. Oh, new neighbors. So I look, and the label at the top of the cage next to ours, you know, three feet away, said thefacebook.com. SPEAKER_01: And I remember being like, oh yeah, I've heard of this. Cool. Sounds good. And they had these super janky servers. I think there were maybe eight of them when they first moved in. And they were super cheap, like Supermicro servers. The wires were hanging out, you know, and I'm like, cool. But the lights were blinking really fast. Okay. And so what I remember was that there were labels on every server, and the labels were the names of universities. And so at the time, one of the servers was named Stanford, one of them was named Harvard, you know, and it made sense because I was familiar with the Facebook product at the time, which was a college social network that was at, like, eight colleges. Okay. So then I watched: every time we would go back to the data center, they would have more servers in the rack with more colleges. And it became increasingly obvious to me that the way they scaled Facebook was to have a completely separate PHP instance running for every school that they copy and pasted the code to. They would have a separate MySQL server for every school, and they would have a memcache instance for every school. And so you would see, like, the University of Oklahoma, you know, you'd see the three servers next to each other. And the way that they managed to scale Facebook was to just keep buying these crappy servers. They would launch each school and it would only talk to a single school database. And they never had to worry about scaling a database across all the schools at once. Because again, at the time, hardware was bad. MySQL was bad. The technology was not great. If they had to scale a single database, a single users table, to hundreds of millions of people, it would have been impossible. And so their hack was the 90-10 solution, like PB used for Gmail, which is: just don't do it. And so at the time, if you were a Harvard student and you wanted to log in, it was hard-coded: the URL was harvard.thefacebook.com, right? And if you tried to go to stanford.thefacebook.com, it'd be like, you know, error. That was just a separate database. And then they wrote code so you could bounce between schools. And it actually took them years to build a global users table, as I recall, and get rid of this hack. And so anyway, the thing they did that didn't scale was to copy and paste their code a lot and have completely separate database instances that didn't talk to each other. And I'm sure a lot of people that work at Facebook today don't even know the story. SPEAKER_01: But that's what it took. That's the real story behind how you start something big like that, versus what it looks like today.
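To make the per-school setup described above concrete, here is a minimal sketch of subdomain-based routing to fully isolated per-school backends. Facebook's real stack was copy-pasted PHP with a MySQL server and a memcache instance per school; the Python below is purely illustrative, and every hostname and helper in it is invented.

```python
# Hypothetical sketch of per-school sharding: every school gets its own
# database host, and the subdomain in the request decides which one to use.
# Nothing here is Facebook's real code; names and hosts are made up.

SCHOOL_DATABASES = {
    "harvard":  {"db_host": "db-harvard.internal",  "memcache_host": "mc-harvard.internal"},
    "stanford": {"db_host": "db-stanford.internal", "memcache_host": "mc-stanford.internal"},
    # Launching a new school = buy a server, copy the code, add a line here.
}

def backend_for_request(hostname: str) -> dict:
    """Map e.g. 'harvard.thefacebook.com' to that school's isolated backend."""
    school = hostname.split(".")[0]
    if school not in SCHOOL_DATABASES:
        # A Harvard user hitting stanford.thefacebook.com just gets an error;
        # the per-school databases never talk to each other.
        raise LookupError(f"No such network: {school}")
    return SCHOOL_DATABASES[school]

if __name__ == "__main__":
    print(backend_for_request("harvard.thefacebook.com"))
```

The point of the hack shows up in the lookup table: launching a new school is one more entry and a few more boxes in the rack, and no query ever has to span schools.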
SPEAKER_00: So in the case of Twitch, most, if not all, of the examples of this came from one core problem. And it's why I tell people not to create a live video site. A normal website, even a video site, on a normal day will basically have peaks and troughs of traffic. And the largest peaks will be 2 to 4x the steady-state traffic. So you can engineer your whole product such that if we can support 2 to 4x the steady-state traffic and our site doesn't go down, we're good. On a live video product, our peaks were 20x. Now, you can't even really test 20x peaks. You just experience them and fix what happens when 20x more people than normal show up on your website because some pop star is streaming something. And so two things happened that were really fun about this. The first hack we had was: if suddenly some famous person was streaming, on their channel there'd be a bunch of dynamic things that could load. Your username would load up on the channel page, and the view count would load up, and a whole bunch of other things that would basically hit our application servers and destroy them if 100,000 people were trying to request the page at the same time. So we actually had a button that could make any page on Justin.tv a static page. All those features would stop working. Your name wouldn't appear, the view count wouldn't update. Literally a static page that loaded our video player, and you couldn't touch us. We could just cache that static page and serve it to as many people as wanted to look at it. Now, to them, certain things might not work right. But they were watching the video, and the chat worked because that was a different system. The video worked, that was a different system. And we didn't have to figure out the harder problems until later. Later, Kyle and Emmett actually worked together to figure out how to cache parts of the page while keeping other parts of the page dynamic. But that happened way, way later. Dude, that reminds me, let me give you a quick anecdote. SPEAKER_01: Yes. Remember Friendster before MySpace? Yeah, of course. Every time you logged in, it would calculate how many people were two degrees of separation from you, and it would fire off a MySQL query: when you logged in, it would look at your friends, and it would calculate your friends of friends, and show you a live number of how big your extended network was. And the founder, you know, Jonathan Abrams, thought this was a really important feature. I remember talking to him about it. Guess what MySpace's do-things-that-don't-scale solution was? SPEAKER_00: They made it out for one. SPEAKER_01: If they were in your friends list, it would say, you know, so-and-so is in your friends list. And if they weren't, it would say so-and-so is in your extended network. SPEAKER_00: There it is. That was it. That was the feature. SPEAKER_01: And so Friendster was trying to hire engineers and scale MySQL, and they're running into too-many-threads-on-Linux issues and updating the kernels. And MySpace was like, so-and-so is in your extended network. That's our solution. Anyway, carry on. But it's the same deal.
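Going back to the static-page button Michael describes above: here is a rough sketch of the shape of that hack, assuming a hypothetical render_channel_page() that does the expensive dynamic work. The real system was not Python and lived at the caching layer; everything here is invented purely to illustrate serving one cached, feature-free snapshot to everyone once a channel is flagged.

```python
import time

# Illustrative sketch of the "make this page static" button: when a channel is
# flagged, serve a cached, feature-free snapshot and skip every dynamic lookup
# (usernames, view counts) that would otherwise hammer the application servers.

STATIC_CHANNELS: set[str] = set()               # channels an admin flipped to static
_page_cache: dict[str, tuple[float, str]] = {}  # channel -> (cached_at, html)
CACHE_TTL_SECONDS = 60

def render_channel_page(channel: str) -> str:
    """Stand-in for the expensive dynamic render (DB queries, view count, etc.)."""
    return f"<html><body>{channel}: dynamic page with username and view count</body></html>"

def render_static_snapshot(channel: str) -> str:
    """Bare page that only embeds the video player; no per-user features."""
    return f"<html><body>{channel}: static page, player only</body></html>"

def serve_channel(channel: str) -> str:
    if channel in STATIC_CHANNELS:
        cached = _page_cache.get(channel)
        if cached and time.time() - cached[0] < CACHE_TTL_SECONDS:
            return cached[1]                    # everyone gets the same bytes
        html = render_static_snapshot(channel)
        _page_cache[channel] = (time.time(), html)
        return html
    return render_channel_page(channel)         # normal traffic: full dynamic page

# "Pressing the button" when a pop star goes live:
STATIC_CHANNELS.add("some_pop_star")
print(serve_channel("some_pop_star"))
```

Chat and video lived in separate systems, so the snapshot only had to keep the player embed alive.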
SPEAKER_00: So our second one also always happened with popular streamers. If you imagine someone is really popular and there's a hundred thousand people who want to watch their stream, we actually need multiple video servers to serve all of those viewers. So we'd basically propagate the original stream coming from the person streaming across multiple video servers until there were enough video servers to serve everyone who was viewing. The challenge is that we never had a good way of figuring out how many video servers we should propagate the stream to. If a stream slowly grew in traffic over time, we had a little algorithm that worked: spin up more video servers and be fine. But what actually happened was that a major celebrity would announce they were going on, and all their fans would descend on that page. And so the second they started streaming, a hundred thousand people would be requesting the live stream, and bam, video server dies. And so we were trying to figure out solutions, solutions, solutions, like, how do we model this? There were all kinds of overly complicated solutions we came up with. And then, once again, Kyle and Emmett got together and they said, well, the video system doesn't know how many people are sitting on the website waiting before it starts trying to serve video. But the website does. All the website has to do is communicate that information to the video system, and then it could pre-populate the stream to as many video servers as it would need, and then turn the stream on for users. So what happened now in this setup is that some celebrity would start streaming. They would think they were live. No one was seeing their stream while we were propagating it to all the video servers that were needed. And then suddenly the stream would appear for everyone, and it would look like it worked well. And the delay was a couple seconds. It wasn't that bad, right? It was, you know, dirty, super dirty, but it worked. And honestly, that's going to be kind of the theme of this whole episode, right? Super dirty, but it worked.
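As a rough sketch of the pre-population idea just described: the website reports how many people are already sitting on the page, the video system fans the origin stream out to enough relay servers for that audience, and only then is the stream switched on for viewers. The capacity figure, hostnames, and function names below are all invented; this is an illustration, not the actual Justin.tv code.

```python
import math

VIEWERS_PER_VIDEO_SERVER = 2000   # made-up capacity figure

def servers_needed(waiting_viewers: int) -> int:
    """How many relay servers to pre-populate before flipping the stream live."""
    return max(1, math.ceil(waiting_viewers / VIEWERS_PER_VIDEO_SERVER))

def start_stream(channel: str, waiting_viewers: int) -> None:
    # The website tells the video system how many people are already waiting.
    n = servers_needed(waiting_viewers)
    relays = [f"video-{i:02d}.example.internal" for i in range(n)]  # hypothetical hosts
    for relay in relays:
        print(f"propagating {channel} origin stream to {relay}")
    # Only after every relay has the stream do viewers get switched on;
    # the streamer thinks they are live during these couple of seconds.
    print(f"{channel} is now visible to {waiting_viewers} viewers across {n} servers")

# 100,000 fans are already on the page when the celebrity hits "go live":
start_stream("celebrity_channel", waiting_viewers=100_000)
```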
SPEAKER_00: You had a couple of these at imeem, right? SPEAKER_01: Yeah, there were a couple that we had at imeem. So one of them, at the time, again, to set the stage: the innovation was showing video in a browser without launching RealPlayer. No one here probably knows what that is, but it used to be that to play a video, the browser would launch another application, and it sucked, and it would crash your browser, and you hated your life. Okay. So one of the cool innovations that YouTube, the startup YouTube, had before it was acquired by Google was to play video in Flash in the browser, which required no external dependencies. It would just play right in the browser. At the time that was awesome. It was a major product innovation to do that. SPEAKER_01: And so we wanted to do that for music at imeem. And we were looking at the tools available to do it, and we saw all this great tooling to do it for video. And so rather than rolling our own music-specific tools, we just took all of the open-source video stuff and hacked the video code we had so that every music file played on imeem was actually a video file. It was a .flv back in the day. And it was actually a Flash video player. Basically, we were playing video files that had, like, a zero bit in the video field, and it was just audio. And we were actually transcoding uploads into video files. You know what I'm saying? The entire thing was a video site with no video. I don't know how else to explain it. And I do think this is a recurring theme: a lot of the best product decisions are ones made kind of fast and kind of under duress. I don't know what that means exactly, but when it's, like, 8 p.m. in the office and the site's down, you tend to come up with good decisions on this stuff. SPEAKER_00: So we had two more at Twitch that were really funny. The first one, talking about duress, was our free peering hack. So streaming live video is really expensive. Back then it was really expensive, and we were very bad fundraisers. That was mostly my fault. And so we were always in a situation where we didn't have enough money to stream as much video, and we had this global audience of people who wanted to watch content. And so we actually hired one of the network ops guys from YouTube who had figured out how to scale a lot of YouTube's early usage. And he taught us that you could have free peering relationships with different ISPs around the world, so that you wouldn't have to pay a middleman to, say, serve video to folks in Sweden. You could connect your servers directly at, I forgot what they're called. Yeah, it saves you money and it saves them money. SPEAKER_01: That's what they want. SPEAKER_00: Yeah. And there were these massive switches where you could basically run some wires to the switch and, bam, you could connect to the Swedish ISP. Now, the problem is that some ISPs wanted to do this free peering relationship, where basically you can send them traffic for free and they can send you traffic for free. Others didn't. They didn't want to do that, or they weren't, kind of, with it. And so, I think it was Sweden but I don't remember, some ISP was basically not allowing us to do free peering, and we were spending so much money sending video to this country and generating no revenue from it. We couldn't make a dollar on advertising. And so what we did is that after 10 minutes of people watching free live video, we just put up a big thing that blocked the video and said, your ISP is not doing a free peering relationship with us, so we can no longer serve you video. If you'd like to call to complain, here's a phone number and email address. And that worked. How fast did it take for that to work? I don't remember how fast. I just remember it worked, and I remember thinking to myself, it's almost unbelievable. That ISP was a real company. We were, like, a website in San Francisco, and hey, that worked. And then the second one was translation. So we had this global audience, and we would call these translation companies and ask them how much it would cost to translate our site into these 40 different languages, and they were like, infinite money. We don't have infinite money. SPEAKER_00: And so I think we stole the solution from Reddit. We were like, what happens if we just build a little website where our community translates everything? And so basically it would serve up every string in English, and it was served to anyone who came to the site who wasn't from an English-speaking country, and it was like, do you want to volunteer to translate this string into your local language? And of course people were like, well, what if they do a bad job translating? I was like, well, the alternative is it's not in their language at all. Let's not make the perfect the enemy of the good. And I think we had something where we would get three different people to translate it and match, but that happened later. But we basically got translation for a whole product for free.
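Here is a toy sketch of that community-translation site (everything in it is invented for illustration, not the real tool): serve each still-untranslated English string to volunteers, collect submissions, and only accept a translation once three independent volunteers agree, roughly the matching rule Michael says they added later.

```python
from collections import Counter, defaultdict

REQUIRED_MATCHES = 3   # accept a translation once three volunteers agree

# submissions[(string_id, language)] -> list of volunteer translations
submissions: dict[tuple[str, str], list[str]] = defaultdict(list)
accepted: dict[tuple[str, str], str] = {}

ENGLISH_STRINGS = {"watch_live": "Watch live", "follow": "Follow"}

def next_untranslated(language: str) -> str | None:
    """Pick an English string that still needs a translation in this language."""
    for string_id in ENGLISH_STRINGS:
        if (string_id, language) not in accepted:
            return string_id
    return None

def submit_translation(string_id: str, language: str, text: str) -> None:
    key = (string_id, language)
    submissions[key].append(text.strip())
    best, count = Counter(submissions[key]).most_common(1)[0]
    if count >= REQUIRED_MATCHES:
        accepted[key] = best   # enough volunteers agreed; ship it

# A Swedish-speaking visitor is offered a string that still needs translating:
print(next_untranslated("sv"))              # -> 'watch_live'
# Three volunteers submit the same translation, so it gets accepted:
for _ in range(3):
    submit_translation("watch_live", "sv", "Titta live")
print(accepted[("watch_live", "sv")])       # -> 'Titta live'
```

The agreement check is the cheap guard against bad translations, without ever blocking on a translation agency.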
SPEAKER_00: Maybe to end, because I think this might be the funniest of them all, tell the Google story. SPEAKER_01: So look, the Facebook story was firsthand; I personally witnessed the servers with my own eyes. I'm 100 percent confident that is what happened, because it was me. This Google story is secondhand, and so I may get some of the details wrong. I apologize in advance, but I'll tell you, this was relayed to me by someone that was there. All right. You ready? So look, the original Google algorithm was based on a paper that they wrote, which you can go read: PageRank. It worked really well. It was a different way to do search. They never had enough hardware to scale it, because remember, there was no cloud back then; you had to run your own servers. And so as the Internet grew, it was harder and harder to scale Google. You still with me? There were just more Web pages on the Internet. So it worked great when the Web was small, but then they kept getting more Web pages really fast, and so Google had to run as fast as they could just to stay in the same place. Just to run a crawl and reindex the Web was a lot of work. And the way it worked at the time is they weren't reindexing the Web in real time, constantly. They had to do it in one big batch process back in the day. And so there was some critical point. SPEAKER_01: This is probably in the 2001 era; again, this is secondhand, I don't know exactly when it was, but there was some critical point where this big batch process to index the Web started failing, and it would take three weeks to run the batch process. It was like reindex-the-web.sh, one script that was like, do Google. And it started failing. And so they tried to fix the bug and they restarted it, and then it failed again. And so the story that I heard is that there was some point where, for maybe three months, maybe four months, I don't remember the exact details, there was no new index of Google. They had stale results. And any user of Google, they didn't know; the users didn't notice. Any user of Google was seeing stale results, and no new Web sites were in the index for quite some time. And so obviously they were freaking out inside of Google. And this was the genesis for them to create MapReduce, which they wrote a paper about, which was a way to parallelize and break into pieces all the little bits of crawling and reindexing the Web. And Hadoop was created off of MapReduce. There's a bunch of different software descended from it. And I would argue every big Internet company now uses the descendants of this particular piece of software. And it was created under duress, when Google secretly was completely broken for an extended period of time because the Web grew too fast. SPEAKER_00: But I think this is the most fun part about this story. When the index started getting stale, did Google shut down the search engine? That's the coolest part. People just didn't realize. They didn't know. And did they build this first? SPEAKER_01: Again, in terms of do things that don't scale, did they build MapReduce before they had any users? No. They basically made it this far by just building a monolithic product, and they only dealt with this issue when they had to.
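To give a flavor of the MapReduce idea (a toy, in-process illustration, nothing like Google's actual system): break the crawl into small map tasks that each emit (word, url) pairs, then reduce by key into an index, so the work can be spread across many workers and a failed piece can be retried instead of rerunning one giant multi-week script. The pages and names below are made up.

```python
from collections import defaultdict
from multiprocessing import Pool

# Toy, in-process MapReduce over a tiny "crawl"; purely illustrative.
PAGES = {
    "http://a.example": "startup servers data center",
    "http://b.example": "startup video servers",
}

def map_page(item):
    """Map task: one page in, a list of (word, url) pairs out."""
    url, text = item
    return [(word, url) for word in text.split()]

def reduce_pairs(all_pairs):
    """Reduce: group by word to build an inverted index."""
    index = defaultdict(set)
    for word, url in all_pairs:
        index[word].add(url)
    return index

if __name__ == "__main__":
    with Pool() as pool:                        # map tasks run in parallel
        mapped = pool.map(map_page, PAGES.items())
    pairs = [pair for chunk in mapped for pair in chunk]
    index = reduce_pairs(pairs)
    print(sorted(index["servers"]))             # -> both URLs
```

The value is in the decomposition: each map task is small and independent, which is what makes it practical to parallelize the crawl and reindex across many machines.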
SPEAKER_00: You know, I think this is such a common thing that comes up when we give startup advice. We'll get a founder that's like, oh, how do I test my product before I launch to make sure it's going to work? And I always come back and tell the founders the same thing. If you have a house and it's full of pipes, and you know some of the pipes are broken and they're going to leak, you can spend a lot of time trying to check every pipe, guess whether it's broken or not, and repair it, or you can turn the water on. And you'll know. You'll know exactly the work to be done when you turn the water on. I think people are always surprised that that's basically all startups do: just turn the water on, fix what's broken, rinse and repeat. And that's how big companies get built. It's never taught that way though, right? It's always taught like, oh, somebody had a plan and they wrote it all down. It's like, never. SPEAKER_01: Never. You earn the privilege to work on scalable things by making something people want first. You know what I think about sometimes with Apple: picture Wozniak hand-soldering the original Apple computer, and those techniques, compared to whoever it is at Apple that designed AirPods. It's the same company, but Wozniak hand-soldering is not scalable. But because that worked, they earned the privilege to be able to make AirPods now. SPEAKER_01: And because Google search was so good, they earned the privilege to be able to create super scalable stuff like MapReduce and all these other awesome internal tools they built. But if they had built that stuff first, it wouldn't be Google, man. SPEAKER_00: And so to wrap up, what I love about things that don't scale is that it works in the real world, right? The Airbnb founders taking photos, the DoorDash folks doing deliveries. It also works in the software world. Don't make the perfect the enemy of the good. Just try to figure out any kind of way to give somebody something that they really want, and then solve all the problems that happen afterwards. And you'll be doing way better. All right, thanks so much for watching the video.