House Prices — Part II

João Pedro Picolo
4 min read · Feb 21, 2023

Introduction

Now that we have our dataset ready, it's time to clean the data we've just collected. Let's start by creating some more directories in our project tree:

project/
├── data/
│   ├── raw/
│   │   └── kc_house_data.csv
│   └── interim/
├── src/
│   ├── data/
│   │   ├── utils.py
│   │   ├── cleaning.py
│   │   └── __init__.py
│   ├── clean.py
│   └── __init__.py
├── requirements.txt
└── README.md
  • The data/interim directory will contain the intermediate datasets produced by our cleaning and transformation steps.
  • The src/data directory is a custom Python package that we're creating to store our data cleaning code.
  • The src/clean.py file will contain the main logic of our cleaning code.

Loading the Dataset

The first thing we need to do is load our dataset. In order to keep our code clean, we're going to create a src/data/utils.py file containing the following code:

import pandas as pd


def load_data_to_dataframe(data_path: str) -> pd.DataFrame:
    """ Loads the .csv file into a pandas dataframe

    Parameters:
        data_path: Path to the .csv file

    Returns:
        dataframe: Returns the created pandas dataframe
    """
    dataframe = pd.read_csv(data_path)
    return dataframe

Cleaning the Dataset

Next, we're going to follow multiple data-cleaning steps. There's a vast range of techniques available for this task, so we won't cover all of them here. For this particular dataset, we're going to use the data report from our previous post to guide us.

Missing values

After selecting the features that are relevant to our problem, we need to deal with missing values. Since our dataframe doesn't have any missing values, this operation wasn't needed, but it's still worth verifying before moving on, as shown below.
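A minimal sketch of that check (assuming we run it from the src/ directory, where our data package lives):

from data.utils import load_data_to_dataframe

dataframe = load_data_to_dataframe(data_path="../data/raw/kc_house_data.csv")

# Count missing values per column; all zeros means there is nothing to fill or drop
print(dataframe.isna().sum())

# If any column did contain missing values, one option would be to drop those rows:
# dataframe = dataframe.dropna()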

Duplicates

Since duplicates can bias our prediction model and hurt its accuracy in the future, any duplicate records need to be removed. Our dataset also didn't have duplicate values, so this operation wasn't needed; the sketch below shows how to check.
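Pandas has this check built in (a quick sketch on the same dataframe):

# Count fully duplicated rows; zero means there is nothing to remove
print(dataframe.duplicated().sum())

# If duplicates did exist, they could be dropped while keeping the first occurrence:
# dataframe = dataframe.drop_duplicates(keep="first")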

Outliers

Since we're dealing with real-world data, we're going to consider that the outliers present in our dataset might represent valid input values, such as the sale of luxury homes. From a scientific point of view, however, these values might also be misleading, so we'll keep a separate dataset without them to compare against during the training of our model.

Based on our report, we know that our data doesn't follow a normal distribution, so we're going to use the common Interquartile Range (IQR) technique to remove outliers:

import pandas as pd


def drop_outliers_iqr(dataframe: pd.DataFrame, ignore_columns: list[str] = []) -> pd.DataFrame:
    """ Returns a dataframe without outliers

    Uses the IQR approach to remove outliers.

    Parameters:
        dataframe: Dataframe on which the operation will be performed
        ignore_columns: Columns to ignore when looking for outliers

    Returns:
        dataframe: Returns the new dataframe without outliers
    """
    cols = [c for c in dataframe.columns if c not in ignore_columns]
    q1 = dataframe[cols].quantile(0.25)
    q3 = dataframe[cols].quantile(0.75)
    iqr = q3 - q1

    # Keep only the rows where no column falls outside [q1 - 1.5*iqr, q3 + 1.5*iqr]
    condition = ~(
        (dataframe[cols] < (q1 - 1.5 * iqr)) | (dataframe[cols] > (q3 + 1.5 * iqr))
    ).any(axis=1)

    cleaned_df = dataframe[condition]
    return cleaned_df

With this method, 5893 records were dropped from our dataset.
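For reference, this count is just the difference in length before and after the filter (a quick sketch; the ignore list shown here is the one our clean.py script will use later):

ignore_cols = ["id", "date", "waterfront",
               "lat", "long", "zipcode", "yr_renovated"]
df_no_outlier = drop_outliers_iqr(dataframe, ignore_columns=ignore_cols)

# Number of records removed by the IQR filter
print(len(dataframe) - len(df_no_outlier))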

Data formatting

When we analyze the variables presented in the first post, we can see that the grade variable is the only one representing ranges, and its values fall into three categories. So we will replace each range with a number:

  • Zero will replace the range 1–4, which indicates a low-quality level of construction and design.
  • One will replace the range 5–9, which indicates an average level of construction and design.
  • Two will replace the range 10–13, which indicates a high-quality level of construction and design.

This can be done by adding the following code to the src/data/cleaning.py file:

import numpy as np


def grade_formatting(dataframe: pd.DataFrame) -> pd.DataFrame:
    """ Returns a formatted dataframe

    This formats the grade to be an integer instead of a range.
    If the grade is less than or equal to 4 it will become zero, if it is
    between 5 and 9 (included) it will become one, and if it is greater
    than or equal to 10 it will become two.

    Parameters:
        dataframe: Dataframe on which the operation will be performed

    Returns:
        dataframe: Returns the formatted dataframe
    """
    cleaned_df = dataframe.copy()

    # Map the original 1-13 grade ranges onto the three categories described above
    cleaned_df["grade"] = np.where(cleaned_df["grade"] <= 4, 0, cleaned_df["grade"])
    cleaned_df["grade"] = np.where((cleaned_df["grade"] >= 5) & (cleaned_df["grade"] <= 9), 1, cleaned_df["grade"])
    cleaned_df["grade"] = np.where(cleaned_df["grade"] >= 10, 2, cleaned_df["grade"])
    return cleaned_df
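As a side note, the same mapping could also be expressed with pandas' pd.cut, which makes the bin edges explicit (an equivalent sketch, assuming integer grades between 1 and 13):

import pandas as pd

# Bins are right-inclusive: (0, 4] -> 0, (4, 9] -> 1, (9, 13] -> 2
cleaned_df["grade"] = pd.cut(
    cleaned_df["grade"], bins=[0, 4, 9, 13], labels=[0, 1, 2]
).astype(int)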

Saving the Dataset

After performing the necessary steps to clean our data, we can save the interim dataset so we keep a record of the changes that were made. In src/data/utils.py we can add the following:

def save_data_to_csv(dataframe: pd.DataFrame, data_path: str):
    """ Saves the dataframe in the provided path as a .csv file

    Parameters:
        dataframe: Dataframe to be saved
        data_path: Path to the .csv file
    """
    dataframe.to_csv(data_path, index=False)

Conclusion

Since all the described steps were created inside our own custom package, they have to be imported and called by our src/clean.py program. Because the data paths are relative, the script is meant to be run from inside the src/ directory:

import sys

from data.cleaning import drop_outliers_iqr, grade_formatting
from data.utils import load_data_to_dataframe, save_data_to_csv


def main():
    try:
        dataframe = load_data_to_dataframe(
            data_path="../data/raw/kc_house_data.csv")
    except Exception:
        print("It was not possible to read the provided .csv file")
        sys.exit(1)

    dataframe = grade_formatting(dataframe)

    ignore_cols = ["id", "date", "waterfront",
                   "lat", "long", "zipcode", "yr_renovated"]
    df_no_outlier = drop_outliers_iqr(dataframe, ignore_columns=ignore_cols)

    # Save interim dataframes
    save_data_to_csv(dataframe, data_path="../data/interim/kc_house_data.csv")
    save_data_to_csv(dataframe=df_no_outlier, data_path="../data/interim/kc_house_data_no_outlier.csv")


if __name__ == "__main__":
    main()
