Machine Learning Project

Summary

For this project, I used machine learning techniques to generate business value from a data set of hotel bookings. I applied supervised learning algorithms to two problems: the regression problem of predicting the cost of a hotel booking and the classification problem of predicting whether a hotel booking will be canceled. I also used clustering, an unsupervised learning technique, to perform customer segmentation.

A complete description of the project can be found below. The hotel bookings data set can be accessed in the project's GitHub repository.

Motivation

My motivation for this project was to practice using the skills and tools listed below. In particular, I wanted to clearly document the entire machine learning workflow and apply that workflow to solve a variety of business problems.

Skills and Tools Used

Skills

  • Machine learning
    • Regression
    • Classification
    • Clustering
  • Exploratory data analysis
  • Data cleaning
  • Data visualization
  • Project documentation

Tools

  • Python
    • scikit-learn
    • pandas
    • Matplotlib
    • NumPy
  • Jupyter Notebook

Data Collection

Problem Specification

I started by determining the types of machine learning problems to focus on for this project. These problem types are listed below.

  • Supervised learning
    • Regression
    • Classification
  • Unsupervised learning
    • Clustering

There are numerous other types of machine learning problems, but regression, classification, and clustering are among the most common business use cases for machine learning. I therefore focused on these three problem types so that the techniques used in this project would apply to a wide variety of practical business problems.
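As a rough illustration of how the three problem types map onto scikit-learn, the sketch below fits one estimator per problem type on small synthetic data. The estimators (LinearRegression, LogisticRegression, KMeans) and the generated data are assumptions chosen for brevity; they are not necessarily the models or the data used later in this project.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: learn to predict a continuous target from labeled examples
X_reg, y_reg = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)
regressor = LinearRegression().fit(X_reg, y_reg)
reg_predictions = regressor.predict(X_reg[:5])

# Classification: learn to predict a discrete label from labeled examples
X_clf, y_clf = make_classification(n_samples=200, n_features=4, random_state=0)
classifier = LogisticRegression(max_iter=1000).fit(X_clf, y_clf)
clf_predictions = classifier.predict(X_clf[:5])

# Clustering: group unlabeled observations by similarity (no target is used)
X_clu, _ = make_blobs(n_samples=200, centers=3, random_state=0)
clusterer = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_clu)
cluster_labels = clusterer.labels_
```

The two supervised estimators receive both features and a target in `fit`, while the clustering estimator receives features only; that difference is what separates supervised from unsupervised learning in the list above.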

Data Requirements Specification

I then specified the requirements for the data set I would work with in this project. These requirements are listed below.

The data set must:

  • contain data relevant to the operations of a business
  • contain at least 10,000 observations (this is an arbitrary benchmark that approximates the data set size needed to solve a non-trivial machine learning problem)
  • contain both continuous, numerical attributes and discrete, categorical attributes (so as to be relevant for both regression and classification problems)
  • contain observations that can be grouped in a meaningful and interpretable way (so as to be relevant for clustering problems)
  • be relatively clean, easy to work with, and well-documented
  • be free and publicly accessible
  • be in a standard format (such as a CSV file)
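Some of these requirements can be checked mechanically once a candidate data set is loaded. The sketch below is one possible way to do so with pandas; the helper name `check_requirements`, the example column names, and the decision to treat every non-numeric column as categorical are assumptions made for illustration. Requirements such as documentation quality or business relevance still need manual review.

```python
import pandas


def check_requirements(df, min_rows=10_000):
    """Return a dict of pass/fail flags for the machine-checkable requirements."""
    # Columns with a numeric dtype (integers and floats)
    numeric_columns = df.select_dtypes(include="number").columns
    # Simplification: every non-numeric column is treated as categorical
    categorical_columns = df.select_dtypes(exclude="number").columns
    return {
        "enough_rows": len(df) >= min_rows,
        "has_numerical": len(numeric_columns) > 0,
        "has_categorical": len(categorical_columns) > 0,
    }


# Tiny made-up table (hypothetical columns), used only to demonstrate the helper
example = pandas.DataFrame({
    "price": [100.0, 85.5, 120.0],
    "meal_plan": ["BB", "HB", "BB"],
})
result = check_requirements(example, min_rows=3)
```

With the default `min_rows`, the same three-row table would fail the size check, which is the intended behavior for a data set that is too small.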

Data Set Selection

Next, I began looking for a data set that met my requirements. Eventually, I settled on a data set containing hotel booking information that was uploaded to Kaggle, an online community of data scientists, by user Jesse Mostipak. The data set was originally created and documented by Nuno Antonio, Ana Almeida, and Luis Nunes for the article "Hotel booking demand datasets" published in Data in Brief (Volume 22, February 2019).

Exploratory Data Analysis

Data Profile Report

I then created a data profile report to explore the contents of the hotel bookings data set.

In [27]:
# Import package used for data manipulation in Python
import pandas

# Import package used for exploratory data analysis
# (pandas-profiling has since been renamed ydata-profiling; newer installs
# use `from ydata_profiling import ProfileReport` instead)
from pandas_profiling import ProfileReport

# Import the data set (as a CSV file) into a pandas DataFrame
raw_data = pandas.read_csv('hotel_bookings.csv', index_col=False)

# Create and display a report summarizing the data in the hotel bookings data set
profile = ProfileReport(
    raw_data,
    title='Hotel Bookings Data Profile Report',
    html={'style': {'full_width': True}},
)
profile.to_notebook_iframe()