Airbnb: An Inside Look at Its Data Science Journey

Pranav Anand Joshi
Mar 2, 2019

“Airbnb is a community marketplace that brings hosts, people with a room or a house to rent, together with renters, people who are looking for a unique accommodation experience.”

Overview

The company was founded in August 2008 by Brian Chesky, an industrial designer; Nathan Blecharczyk, a computer scientist; and Joe Gebbia, who has a background in graphic and industrial design. It is based in San Francisco. The idea arose when a prominent design conference was about to take place in the city. Chesky and Gebbia, both graduates of the Rhode Island School of Design, predicted that the hotels near the venue would be full, so they offered air-bed accommodation at their loft. The uptake was overwhelmingly positive.

Since its founding, the company has grown rapidly, opening 11 new offices in 2012 and offering accommodations in more than 33,000 cities in 192 countries. In 2012 alone, Airbnb's listings grew by more than 100 percent, from 120,000 at the beginning of the year to more than 300,000. Guest uptake has also been staggering, with over 4 million guests since the company started -- 3 million of those having used the company's services since 2012.

Data Science at Airbnb

Airbnb is a community marketplace that provides access to millions of unique accommodations in more than 65,000 cities and 191 countries. In addition to accommodations, Airbnb offers access to local communities and interests through Experiences, while Places lets people discover recommendations from the people who live there.

A huge part of the company’s success has been due to its data science team. Data science has played a crucial role at Airbnb since the company’s beginning: the first member of the Airbnb data science team was among the company’s first 10 hires. Today, the team includes data scientists, data engineers, business analysts, machine learning engineers, and several product- and infrastructure-focused data teams that touch almost every part of the Airbnb platform.

To start, what do we mean by data science? Put simply, the mandate of Airbnb’s data science team is to use data to inform decisions. When guests and hosts interact with Airbnb, their interactions with the website and with each other are represented by data. By sifting through that data for trends and patterns, Airbnb is better able to listen to the voice of its customers. Data scientists at Airbnb are responsible for many tasks that help the company become more data informed: they instrument logs, build data pipelines, define metrics, develop education resources, build internal data tools, and create reports and dashboards. The core of their work, however, falls into three main focus areas: Product Insights, Experimentation, and Predictive Modeling.

Machine Learning & Predictive Modeling at Airbnb

One well-known application of machine learning at Airbnb is the Smart Pricing feature. If a host would like help choosing how to price a listing, Airbnb’s data science team provides a suggested price that it thinks will work well. The suggestions are generated by a machine learning algorithm that takes into account a variety of information, including the date for which the price is being set, the listing’s location, its amenities, the host’s booking history, and many others. Smart Pricing lets hosts set their prices to go up or down automatically based on changes in demand for similar listings. Hosts are always responsible for setting prices and are free to accept or reject any suggested price.

Pursuing an end-to-end machine learning project can often be costly and time consuming, so validating ideas before fully investing in engineering and implementation is a high-leverage activity. One way data scientists validate whether a modeling-based solution would work is to build a prototype. Python and R turn out to be extremely powerful tools for this purpose because data wrangling and feature engineering are easy once the training data is loaded. Furthermore, with the training data prepared, data scientists can try out a wide variety of models to understand how much Airbnb would gain versus a naive, non-modeling solution. For example, when data scientists tried to predict revenue at the listing and guest level, they built a wide variety of prototypes in Python and R to validate that a model-based prediction was worth pursuing, comparing the RMSE of the challenger models with that of the incumbent model. After the prototyping step, they work with engineers to take the prototype into production, using open-source technology like Aerosolve or Python’s scikit-learn.
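The prototyping step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Airbnb’s actual workflow: the data is synthetic, and the names `make_listings`, `fit_line`, and `rmse`, along with the single `nights` feature, are hypothetical. It compares a naive incumbent (always predict the training mean) with a one-feature least-squares challenger by RMSE on held-out rows.

```python
import math
import random


def make_listings(n, seed=0):
    """Generate synthetic (nights, revenue) pairs; purely illustrative data."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        nights = rng.randint(1, 30)
        # Assumption: revenue is roughly linear in nights booked, plus noise.
        revenue = 80.0 * nights + rng.gauss(0, 50)
        rows.append((nights, revenue))
    return rows


def fit_line(rows):
    """Ordinary least squares for y = intercept + slope * x (one feature)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


rows = make_listings(500)
train, test = rows[:400], rows[400:]
actual = [y for _, y in test]

# Incumbent: a naive, non-modeling baseline that predicts the training mean.
train_mean = sum(y for _, y in train) / len(train)
incumbent_rmse = rmse(actual, [train_mean] * len(test))

# Challenger: a simple linear model fit on the training rows only.
intercept, slope = fit_line(train)
challenger_rmse = rmse(actual, [intercept + slope * x for x, _ in test])

print(f"incumbent RMSE:  {incumbent_rmse:.1f}")
print(f"challenger RMSE: {challenger_rmse:.1f}")
```

If the challenger’s held-out RMSE is not clearly lower than the incumbent’s, the modeling effort is probably not worth taking to production; a real prototype would typically swap in richer features and models, for example from scikit-learn.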

Data team personalization challenges

From a data perspective, Airbnb could be seen as a cross between a travel marketplace, such as Orbitz, and an entertainment exploration space, such as Netflix. At Netflix you’re frequently not sure exactly what movie you’re looking for. You just know the genre, for example, a sci-fi movie. In Airbnb’s case you may not know exactly what kind of accommodation you’re looking for beyond perhaps a bed in South Beach. Whether using Netflix or Airbnb, individuals look at reviews, but the value is in the quality of the system-made recommendations.

Following this logic, the next step in Airbnb’s evolution has to be in the personalization space. Personalization delivers value, cuts down on search time for the renter, and improves the yield for hosts, while promoting higher satisfaction scores. To achieve this, Airbnb recruited Mike Curtis as vice president of engineering at the beginning of 2013. Curtis was previously at Facebook, where he was director of engineering and focused on promoting user growth, following eight years at Yahoo.

In an interview with TechCrunch, Nathan Blecharczyk, Airbnb’s CTO, commented that while their engineering team was composed of only 50 engineers, Curtis had been brought in to create a world-class collaborative team of "folks from different disciplines." Airbnb uses a data scientist engineering model similar to that used at companies like Netflix.

Curtis will face several personalization challenges: matching hosts with guests, setting price levels based on demand, screening hosts and guests, overseeing ranking and review mechanisms, as well as monitoring feedback systems. With over 4 million guests and the company’s ever-changing host/guest interaction, this is clearly a company that is awash in data. It recently released some numbers on the most hospitable cities, with Tampa, Florida, appearing at the top, followed by Mendocino, California, and Eugene, Oregon.

To identify these cities, the company used metrics from their reviews, labeled as "cleanliness," "check in," "communication," and "accuracy." Managers were intrigued to learn what determined the distribution of scores, and so they drilled down into the data, examining seven more metrics: guest age, host age, guest gender, host gender, group size, length of stay, and booking lead time. Results indicated that older hosts (aged more than 50 years), younger guests (from 30 to 39), stays of three to six days, and smaller group sizes (of one or two guests) were the demographics that drove the best reviews. The question for the Airbnb team, then, is what it can create to address the other, less positive, review demographics. Is it better personalization? Better prices? Perhaps more variety?
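A drill-down of this kind amounts to grouping reviews by an attribute and comparing the average score per group. The sketch below assumes a hypothetical record layout and uses made-up ratings; it does not reproduce Airbnb’s data or results, and `mean_score_by` is an illustrative helper rather than an Airbnb API.

```python
from collections import defaultdict

# Made-up review records; field names and values are assumptions for illustration.
reviews = [
    {"host_age_band": "50+", "group_size": 2, "score": 5},
    {"host_age_band": "50+", "group_size": 4, "score": 4},
    {"host_age_band": "30-39", "group_size": 1, "score": 4},
    {"host_age_band": "30-39", "group_size": 5, "score": 3},
    {"host_age_band": "under 30", "group_size": 2, "score": 4},
]


def mean_score_by(records, key):
    """Return {segment: mean review score} for the given attribute."""
    totals = defaultdict(lambda: [0.0, 0])  # segment -> [score sum, count]
    for record in records:
        bucket = totals[record[key]]
        bucket[0] += record["score"]
        bucket[1] += 1
    return {segment: total / count for segment, (total, count) in totals.items()}


by_host_age = mean_score_by(reviews, "host_age_band")
by_group_size = mean_score_by(reviews, "group_size")

# List segments from best to worst average score.
for segment, avg in sorted(by_host_age.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{segment}: {avg:.2f}")
```

Running the same helper over each of the seven attributes gives one ranked table per attribute, which is enough to spot which segments lag and deserve a closer look.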

To analyze the approximately 20 terabytes of new data created daily and the approximately 1.2 petabytes of archived data, Airbnb has employed a variety of technologies. Brenden Mathews and Henry Cai, engineers at Airbnb, have detailed these in a presentation to a Meetup group. The systems they have used include Hadoop running on Apache Mesos, a cluster manager known for its ability to provide efficient resource isolation across distributed applications. Mathews and Cai note that it is advantageous for its "formalized scaling capabilities, as well as its familiarity to many engineers working in Big Data." They describe it as the gold standard.

Additionally, they run Chronos alongside their Mesos implementation. Chronos is a fault-tolerant job scheduler, developed at Airbnb as a replacement for cron, that runs recurring and dependent jobs across the cluster. They also use Storm, an open-source (EPL license) software solution that enables them to reliably process unbounded streams of data. Storm also enables users to develop real-time analytics and online machine learning algorithms. It is extremely fast, having been benchmarked at over a million tuples processed per second per node. Storm is also scalable and fault tolerant, which makes it highly reliable.

Airbnb’s engineers also use Jenkins CI (Continuous Integration), an open-source build and testing server written in Java with a very large number of plugins (approximately 784), to support complex upgrades and frequent code merging. The engineering team also utilizes Hive, Pig, Cascading, and other tools on their Hadoop implementation.

Beyond crunching Big Data, the engineering teams at Airbnb also face other business-related technology challenges, and they have avoided the temptation to become so engrossed in big data projects that they ignore what drives revenue. One of these key areas was mobile. The company had initially developed an iPhone app, but it recognized that, since the firm operates in over 192 countries and Android is the world's most popular smartphone operating system, it would need an in-house development team to address this opportunity. That team built a first-class UI that presents property descriptions, supports search filters, lists property details and amenities, and provides booking facilities and contact mechanisms for connecting with the host.

“Airbnb has great lessons for start-ups and established firms alike. First, building a business on data is an imperative in order to create a competitive position. Second, building a great engineering team to develop that architecture is essential -- half-measures in big data don’t work. Third, never lose track of who is really important -- the customer. The data is there to support business decision-making and that is always focused on driving value in the marketplace for the consumer.”