Airline Customer Review — Web Scraping
Last week I started working on a new project proposed by British Airways on the site The Forage. The project consisted of collecting customer reviews from the web and performing data analysis on the collected information.
I decided to take a step further and collect data from 10 airlines, each airline is the biggest in its founding country. In this post, we’re going over key topics to the development of this project, in the next post we’re going to build a dashboard in Power BI with the collected data to get useful insights.
All posts from this series:
- Airline Customer Review — Insights
- Airline Customer Review — Web Scraping [This post]
- Airline Customer Review — Power BI
Web Scraping
Web scraping consists of collecting data from websites by making requests to pages and parsing the returned HTML, this is particularly useful when the website doesn’t provide an API. In this project, we’re going to use the suggested website SkyTrax.
Luckily this website can be paginated to access one review at a time, so let’s take a look at one review to understand the page structure: https://www.airlinequality.com/airline-reviews/british-airways/page/1/?sortby=post_date%3AAsc&pagesize=1.
Now let’s take a look at its HTML to understand the used tags, note that I manually removed part of the content to fit the screen.
In Python, we’re going to use requests to get this HTML and Beautiful Soup to parse it to collect and save the information we need in a dictionary.
Let’s start by making the request, parsing the response, and getting the review. Note that the review is in an <article/> tag and has a class starting with “review-”, so we can collect it by doing the following:
import requests
from bs4 import BeautifulSoup
response = requests.get(page_url)
content = BeautifulSoup(response.content, "html.parser")
content_reviews = content.find_all("article", class_ = lambda value: value and value.startswith("review-"))
review = content_reviews[0]
Now we can use the parsed review to get the inside tags, this will be done in a similar way. Let’s start with the easier tags:
review_dict = {}
try:
review_dict["date"] = review.find("meta").get("content")
except:
pass
review_dict["header"] = review.find("h2", { "class": "text_header" }).get_text()
review_dict["rating"] = review.find("div", { "class": "rating-10" }).find("span").get_text()
review_dict["content"] = review.find("div", { "class": "text_content" })).get_text()
The date isn’t always present, so we locate it inside a try block. Now that we got the basic information, we need to get the ratings per category stored inside the table:
table = review.find("table", { "class": "review-ratings" })
data = table.find_all("td")
keys = data[::2]
values = data[1::2]
for key, value in zip(keys, values):
key = key.get_text()
star_value = None
try:
star_value = value.find_all("span", { "class": "star fill" })[-1]
except:
pass
if star_value:
value = self.get_text(star_value)
else:
value = value.get_text()
review_dict[key] = value
This can seem complex to those not used to Python, so let’s dive into it:
- First, we find the review’s table and get all the <td/> tags, these tags define the cells of the table
- Next, we use the prior knowledge that the header we want always precedes its value on the table, creating the keys array by accessing the even indexes of the cells array and the values array by accessing the odd indexes
- The value for a cell can be a number (rating of category) or text (type of category), considering that, we use a try block to get the review if it’s a number, if this fails we get the value text like we did previously
Now that we’re able to create a dictionary containing the information we want to store, let’s create a method to get all the review from a given airline and store it on a table.
import pandas as pd
table = None
try_max = 2
try_count = 0
last_review_id = 0
while try_count < try_max:
try:
current_id = last_review_id + 1
print(f"Will get review id {current_id}")
url = f"{base_url}/page/{current_id}/?sortby=post_date%3AAsc&pagesize=1"
review = get_review(url)
review["id"] = current_id
reviews_df = pd.DataFrame([review])
try_count = 0 # Resets to get the next review
last_review_id += 1 # Increments because will try to get next
if table is not None:
table = pd.concat([table, reviews_df])
else:
table = reviews_df
except:
try_count += 1
time.sleep(0.25) # Waits 0.25 secs to avoid overload on the website
Some important points:
- The base_url variable corresponds to the base URL for an airline, for example: https://www.airlinequality.com/airline-reviews/british-airways/
- The code described previously was encapsulated inside a function called get_review()
- Note that in the url variable, we’re going to get the reviews from oldest to newest since the pagination is ordered in ascending order by date
- Once we try to get a page twice and it doesn’t return anything, the loop stops, and the table is ready to be used
Data Cleaning
Now that we have collected our dataset from an airline, we need to clean the data to make sure that we store only formatted data. Let’s take a look at how our raw dataset is:
Let’s start by formatting the headers so they all follow the same pattern — snake case, this is important to be used during merge on the next step:
dataframe = pd.read_csv(path_to_csv)
new_columns = { col: re.sub(r"\s+", '_', col.lower()) for col in dataframe.columns }
dataframe = dataframe.rename(columns=new_columns)
Since we don’t want to process duplicate reviews, next we’re going to delete them by the id:
dataframe = dataframe.drop_duplicates("id")
Note that our content column has two useful pieces of information in it: if the trip was verified and the content of the customer’s review. Let’s split this information into two separate columns:
split_series = self.dataframe["content"].str.split(pat=separator, n=1, expand=True)
dataframe["trip_verified"] = split_series[0]
dataframe["content"] = split_series[1]
Now, the header column generally starts and ends with double quotes. Let’s remove this information to make the dataset look cleaner:
dataframe["header"] = dataframe["header"].str.strip('"')
We have two columns that can be considered as boolean, but have text in it: trip_verified and recommended. Let’s change them to boolean:
dataframe["trip_verified"] = dataframe["trip_verified"].str.lower()
true_array = self.dataframe["trip_verified"].str.contains("trip verified")
dataframe["trip_verified"] = true_array
dataframe["recommended"] = dataframe["recommended"].str.lower()
true_array = self.dataframe["recommended"].str.contains("yes")
dataframe["recommended"] = true_array
Now we get to one of the most important steps, handling missing data. In our dataframe, we have two types of values: numeric and categorical. Since the numerical value represent the ratings for business points they must be greater than zero, so let’s replace the missing numerical values with -1 so we know in the future that this data is not valid. For the categorical values let’s just signalize that the value was not informed:
for col in dataframe:
column = dataframe[col]
if column.dtype.name in ["object"]:
dataframe[col] = dataframe[col].fillna("Not informed")
elif column.dtype.name in ["int64", "float64"]:
dataframe[col] = dataframe[col].fillna(-1)
Finally, let’s strip the categorical columns to remove empty spaces from the values and keep a better format of our dataset:
for col in dataframe:
column = dataframe[col]
if column.dtype.name == "object":
dataframe[col] = dataframe[col].str.strip()
It’s important to note that all the raw datasets collected need to be cleaned, so ideally, the code provided above would be encapsulated in functions to be used in a for loop.
Dataset Merging
After cleaning all the collected datasets, we need to merge them to be used as a source for our Power BI dashboard. Let’s start by dropping columns that we won’t use in the dashboard:
table = None
dataframe = pd.read_csv(source_csv_path)
dataframe = dataframe.drop(["id", "header", "content",
"date_flown", "route"], axis="columns")
Since we’re going to merge the dataframe by column name, we can ignore the column's original type and verify if both dataframes have the same columns:
# Defines match function
def verify_columns_match(self, main: pd.DataFrame, dataframe: pd.DataFrame):
main_columns = set(main.columns)
dataframe_columns = set(dataframe.columns)
difference = [*main_columns.symmetric_difference(dataframe_columns)]
return len(difference) == 0
dataframe = dataframe.astype("object") # Ignores orignal types
if table is not None:
if not verify_columns_match(table, dataframe):
print(f"Error: Dataframes do not match columns")
exit(1)
dataframe = sort_dataframe_columns(dataframe)
table = pd.merge(table, dataframe, on=None, how="outer")
else:
table = sort_dataframe_columns(dataframe)
Note that since we’re going to use pd.merge() first we need to sort the dataframe by columns before assigning it to the final variable. Let’s define the sort_dataframe_columns() function used previously:
def sort_dataframe_columns(self, dataframe: pd.DataFrame):
columns = dataframe.columns
dataframe = dataframe.reindex(sorted(columns), axis="columns")
return dataframe
Note that all these steps need to be applied to every dataframe that we want to merge, so this code must be wrapped inside a for loop.
Conclusion
Now that we have collected the airline’s reviews using web scrapping, cleaned and merged the dataframes. In the next post, we’re going to use Power BI to get useful insights from the collected data.
The proposed, more structured code, is available on my GitHub.