Wednesday, January 17, 2007

From Hot Concept to Hot Site in Eight Days

By James Hong

Our site evolved from an idea into one of the biggest sites on the Web in less than two months. Such rapid growth meant the site had to scale quickly, especially in the first eight days.

It all started the evening of October 3, 2000, when I was sitting in my living room sharing a few beers with my roommate, Jim Young, and my brother, Tony. Jim had just mentioned that he thought a girl we had met at a party was a perfect "10," when the idea suddenly came to me: "Wouldn't it be funny to have a Web site where you could rate random pictures of people from 1 to 10?"

For lack of anything better to do, we kept talking about it. We built the site in our heads, arguing over what kind of functionality the site would have, designing the user interface, and deciding on the details. After three hours, we had a whiteboard with a Web-site layout drawn on it and a burning desire to build the site.

We also had the time. I was unemployed and working with my brother on an online resource for Simple Object Access Protocol (SOAP) developers. Jim was an electrical engineering grad student at the University of California at Berkeley. Because the concept was so simple, the site took only a couple of days to build.

Keep It Simple

After hours of arguing over the Web-site design, the whiteboard had only three pages on it: one page where people voted on the appearance of others; another where people submitted their own pictures; and a final page where people viewed their own rating. We thought that simple was better.

The simplicity of the site's user interface contributed to its addictive nature. Ultimately, we built the site using Apache, PHP, and MySQL.

Initially, we considered building the site with only three CGI scripts. While that would have been easier, we eventually decided we wanted the site to be more scalable in case we made changes or additions later, so we used a state-machine architecture instead.

Every page view on our site can be thought of as a certain state. For instance, when a user is looking at a picture, he or she is in the "Vote" state. After a user votes, our machine's initial step is to process the "Exit" tasks of the Vote state, such as tallying the vote in the database.

Based on switching variables, the machine then determines what the next state should be. In the simplest case, the next state should be a return to the Vote state.

The state machine's final task is to perform the "Enter" tasks of the next state. This can include selecting a suitable photo to show the user. The Enter task also renders the HTML for the user, which, in this case, would be a page with another photo on which to vote (see Figure 1).

Using this structure for such a simple site might seem like overkill, but it has definitely paid off. Changing the site is extremely easy because all of the various tasks, including the routing between states, are centralized.

Another advantage of using the state-machine architecture is that it forced us to create distinct interfaces between states. By load balancing across a farm of identical servers, each running an instance of the state machine, the architecture made it easy to scale the site by just adding more Linux machines.
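The Exit/Enter flow described above can be sketched roughly as follows. This is a minimal illustration in Python; the state names and handler details are assumptions for the example, and the real site was written in PHP:

```python
# Minimal sketch of the state machine described above. State names and
# handler details are illustrative; the real site was written in PHP.

class VoteState:
    def exit_tasks(self, request):
        # "Exit" task: tally the vote the user just submitted.
        request["votes"].append(request["score"])

    def enter_tasks(self, request):
        # "Enter" task: pick the next photo and render the page's HTML.
        next_photo = len(request["votes"]) + 1
        return "<html>vote on photo #%d</html>" % next_photo

STATES = {"vote": VoteState()}

def handle(request):
    current = STATES[request["state"]]
    current.exit_tasks(request)                   # 1. run Exit tasks of current state
    next_state = request.get("next", "vote")      # 2. switching variables pick next state
    return STATES[next_state].enter_tasks(request)  # 3. run Enter tasks, render HTML
```

Because each request is handled by one self-contained dispatch function, any identical server in the farm can process any request, which is what makes adding machines straightforward.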

The Deluge

Not long after building our site, we noticed that starting at around 10 a.m. Pacific Standard Time, our machines' performance degraded to the point that the servers were effectively shut down. Upon further inspection, we noticed that the rate at which SYN connections were being made was overwhelming our servers. When one computer wants to initiate a connection with another, it sends a SYN packet that basically says, "Hello, I want to talk to you, can you hear me?" Many Denial of Service (DoS) attacks involve flooding a Web site with SYN packets, so we immediately thought we were under a DoS attack.

Of course, this problem disappeared as soon as we solved the real problem—our system didn't have enough capacity. It turns out that we weren't under attack, but rather the demand for our Web site was more than our system could handle.

Getting to a position in which we could add machines was itself a problem. When we started the site and were immediately flooded with hits, we considered obtaining some colocation space and setting things up ourselves. Then we realized a few things:

  • We didn't have the money to buy servers, firewalls, and load balancers.
  • Even if we had the money, it would take a long time to get them.
  • We didn't have the experience to set these up and maintain them.
  • We didn't have the resources to handle this side of things.
  • Hosting Web servers wasn't our core competency.

At that point, we had never heard of managed hosting. We learned about it when searching online for potential Web-hosting services. With managed hosting, customers lease machines that are already racked, instead of renting space in a data center. The managed host guarantees the uptime, handles the server maintenance and monitoring, and sells bandwidth based on actual usage instead of pipe width. Even more importantly, the host has extra machines on hand and can add servers at a moment's notice.

This option let us lease our machines without having to arrange for bank financing (no bank would have lent us money, anyway). With managed hosting, we could outsource our entire network operations department. Thus, this decision was a no-brainer.

We chose to use Rackspace Managed Hosting because it was top ranked by a couple of informational Web sites we consulted. This ended up being a great choice. That first week, I called Rackspace nearly every night around 3 a.m. to request another server. Each time, the new machine would be up and running by the time I awoke the next morning. By the end of the week, we had gone from one Web server to seven.

Database Overload

Once we had all the machines we needed to handle the massive amount of HTTP requests we were receiving, the database started bottlenecking. Our system architecture consisted of seven Web servers running Linux, and a Sun E220 that stored our database. One thing we learned through testing was that the open-source tools performed significantly better on a single-processor 700-MHz Pentium III machine running Linux than they did on a quad-processor Sun machine. MySQL is probably optimized for Linux because the open-source community develops it.

We found a way to help our database keep up with the traffic for our particular application. First, nearly every query made is a SELECT call. Second, there's no reason why all votes must be counted in real time. Given these circumstances, we decided to replicate the active portion of the database on each Web server so that SELECT calls could be made locally. We then started caching votes on each machine, and configured the master database (now a Linux box) to poll each server periodically to collect votes and maintain replication integrity. This method shifts much of the database load to the individual servers, significantly reducing the load on our primary database machine. If the primary database ever becomes overloaded, we can simply add two more server machines and another layer of caching, as illustrated in Figure 2.
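The vote-caching scheme might look roughly like this. The sketch uses SQLite in place of the local MySQL replica and the master database, and the table names and schema are assumptions made for the example:

```python
import sqlite3

# Sketch of the vote-caching scheme: each web server records votes in a
# fast local table; the master periodically drains them into the totals.
# SQLite stands in for MySQL here, and the schema is invented for the example.

def init(conn):
    conn.execute("CREATE TABLE pending_votes (photo_id INTEGER, score INTEGER)")
    conn.execute("CREATE TABLE totals (photo_id INTEGER PRIMARY KEY, "
                 "votes INTEGER, score_sum INTEGER)")

def record_vote(local, photo_id, score):
    # Fast local write; no round trip to the master database.
    local.execute("INSERT INTO pending_votes VALUES (?, ?)", (photo_id, score))

def collect(master, local):
    # Master polls a web server, folds its cached votes into the totals,
    # then clears the server's cache.
    rows = local.execute("SELECT photo_id, score FROM pending_votes").fetchall()
    for photo_id, score in rows:
        master.execute(
            "INSERT INTO totals VALUES (?, 1, ?) "
            "ON CONFLICT(photo_id) DO UPDATE SET "
            "votes = votes + 1, score_sum = score_sum + excluded.score_sum",
            (photo_id, score))
    local.execute("DELETE FROM pending_votes")
```

The key property is that votes need not be counted in real time, so a periodic `collect` pass is good enough, and almost all database traffic stays local to each web server.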

Economics 101

Hosting our users' pictures—something we had originally done on our own—was another scaling issue we faced. This was such a big issue—due to the costs involved—that we almost decided to shut the site down. On its second night of operation, shortly after an article about our site was published, we were forced to take the site down at 10 p.m.

We had already been operating under an incredible load for the two hours following 8 p.m., when the article went up. I estimate that we served more than 3GB worth of pictures in those two hours, and the number was accelerating. Because we weren't generating any revenue, it was clear that the economics of this plan just didn't scale. Not only did serving pictures incur bandwidth charges, but it also bottlenecked our CPUs.

After stressing out for a couple more hours, I remembered that Yahoo Geocities gives its users FTP access, meaning that we could quickly upload the pictures to a Geocities account. As soon as I realized that we didn't have to host photos ourselves, I called Jim. As an interim measure, we sent new users to Geocities to set up their own accounts and we let them submit the URLs for their pictures, instead of the pictures themselves.

As Jim began working on the solution, it occurred to me that some companies might actually want to host peoples' photos and pay us a bounty for sending them users. By directing our users to these companies, we turned one of our major costs into a revenue stream.

Scaling the Human Element

We had another problem with some users submitting pornography and other inappropriate photos. Initially, we decided to solve this by adding a link under each photo that said, "Click here if this picture is inappropriate." If a photo received enough clicks, based on a formula we had derived, the picture was removed.
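Our actual removal formula isn't shown here, but a hypothetical stand-in illustrates the idea: require a minimum number of views before judging, then remove a photo once the fraction of viewers who flagged it crosses a threshold. All of the names and numbers below are assumptions:

```python
def should_remove(flags, views, min_views=20, flag_ratio=0.05):
    # Hypothetical stand-in for the removal formula: wait until a photo
    # has been seen enough times to judge, then remove it once the share
    # of viewers who clicked "inappropriate" crosses a threshold.
    return views >= min_views and flags / views >= flag_ratio
```

Scaling the click count by views matters because popular photos collect more clicks of every kind; a raw count would unfairly punish heavily viewed pictures.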

This worked pretty well, but not well enough. I sent the chairman of a large advertising network a link to our site with a note proclaiming that: "The odds of getting an inappropriate picture are extremely low." Ten minutes later, I received his reply: "Unfortunately, the first picture I saw was that of a topless woman."

He informed us that if we wanted companies to advertise on our site, we'd have to filter each picture as it came in. Jim built an interface for us to do so. However, we soon realized that we couldn't spend all day screening pictures. The system's human component wasn't scalable. That's when we arrived at the moderator idea.

We decided to build a system in which moderators could vote on whether to approve or reject a picture before it passed on to the main site. If a picture got enough votes either way, it was approved or rejected. By making the decision collective, no single moderator could approve or reject a picture independently.

To help detect any rogue moderators, the system tracks each moderator's accuracy. A vote is counted as wrong when the moderator's vote goes against the final outcome of the picture. For instance, if one person votes to approve, but all others vote to reject, the one person is wrong. Moderators whose accuracy ratings drop below our threshold are kicked out.
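The collective-decision and accuracy-tracking rules above might be sketched like this. The vote threshold and accuracy cutoff are assumptions, not the site's actual values:

```python
APPROVE_THRESHOLD = 5   # assumed: votes needed on either side to decide a photo
ACCURACY_CUTOFF = 0.8   # assumed: moderators below this accuracy are kicked out

def decide(votes):
    # votes: list of (moderator_id, True to approve / False to reject).
    # The decision is collective; no single moderator decides alone.
    approvals = sum(1 for _, v in votes if v)
    rejections = len(votes) - approvals
    if approvals >= APPROVE_THRESHOLD:
        return "approved"
    if rejections >= APPROVE_THRESHOLD:
        return "rejected"
    return "pending"

def update_accuracy(stats, votes, outcome):
    # A vote counts as wrong when it goes against the photo's final outcome.
    for mod, vote in votes:
        right = (vote and outcome == "approved") or (not vote and outcome == "rejected")
        correct, total = stats.get(mod, (0, 0))
        stats[mod] = (correct + right, total + 1)

def rogue_moderators(stats):
    # Moderators whose accuracy falls below the cutoff get kicked out.
    return [m for m, (c, t) in stats.items() if t and c / t < ACCURACY_CUTOFF]
```

A moderator who consistently votes against the crowd (the "rogue" case) accumulates wrong votes and falls below the cutoff, while honest disagreement on a few borderline photos barely dents an otherwise high accuracy.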

We decided to take the moderator system one step further by adding security levels. The higher a moderator's security level, the more his or her votes counted. We also gave higher-ranking moderators special privileges, like an expert mode in which they could judge pictures much faster. We gave the highest-ranking moderators the ability to reject or accept moderator applications, and the ability to kick out rogue moderators. Today, these top-level moderators essentially run the moderator section of the site and decide on the specific guidelines for what makes a photo inappropriate. More than 1000 moderators are currently active, and they form our strongest community.

A Full Night's Sleep

I got about 15 hours of sleep over the site's first eight days—the time during which we addressed most of our scalability issues. Eight days after launching, we broke the one million page view barrier, reaching more than 1.8 million page views that day. By the end of November, we made Nielsen//NetRatings' list of the top 25 advertising domains.

The site now runs smoothly, and has handled as many as 14.8 million page views in a single day without even yawning. Looking back, I think that week of scaling easily wins distinction as the most stressful, most exhausting, most rewarding week I've ever had in my life. In this trial by fire, we certainly learned an incredible amount about building and scaling a Web application.

James is a cofounder of Eight Days, the company that runs the Web site. [Editor's Note: Since publication, the site's URL has changed.] He has a bachelor's degree in electrical engineering and computer science and an MBA, both from U.C. Berkeley.
