House Prices — Part IV

João Pedro Picolo
7 min read · Apr 10, 2023

Introduction

Now that we have the processed data that we need to build our model, let’s jump right into it. Let’s start by creating more directories in our project tree:

project/
├── data/
│ ├── raw/
│ │ ├── kc_house_data.csv
│ ├── interim/
│ │ ├── kc_house_data.csv
│ │ ├── kc_house_data_no_outlier.csv
│ ├── processed/
│ │ ├── kc_house_data.csv
│ │ ├── kc_house_data_no_outlier.csv
├── src/
│ ├── data/
│ │ ├── utils.py
│ │ ├── cleaning.py
│ │ ├── modeling.py
│ │ ├── __init__.py
│ ├── models/
│ │ ├── utils.py
│ │ ├── model_build.py
│ │ ├── __init__.py
│ ├── clean.py
│ ├── modeling.py
│ ├── model_build.py
│ ├── __init__.py
├── requirements.txt
└── README.md
  • The models/utils.py file will contain helper functions used by the model-building code.
  • The models/model_build.py file will contain the code to build the model.
  • The src/model_build.py file will contain the main logic code to build our model.

Price Forecast

The first thing we’re going to do is forecast the price of a house, but there’s a small problem: our dataset doesn’t track the price of each individual house over time, so there is no historical series to learn from and extrapolate. To work around this we’re going to use a little trick: we’ll split the houses into clusters and treat the prices of the houses inside each cluster as one continuous price series over time. In the models/utils.py file let’s create the code for it:

from typing import List, Tuple
import pandas as pd
from sklearn.cluster import KMeans

def get_clusters(dataframe: pd.DataFrame, n_clusters: int = 10, ignore_cols: List[str] = []) -> Tuple[pd.DataFrame, KMeans]:
    """ Assigns each house in the dataset to a cluster

    Parameters:
        dataframe: Dataframe to be labeled
        n_clusters: Number of clusters to split the houses into
        ignore_cols: List of columns to ignore while creating the clusters

    Returns:
        dataframe: Returns the labeled dataframe
        kmeans: Returns the trained KMeans model to be used in future predictions
    """
    # Columns such as identifiers should not influence the clustering
    data = dataframe.drop(ignore_cols, axis="columns")

    kmeans = KMeans(n_clusters=n_clusters)
    labels = kmeans.fit_predict(data)

    dataframe["cluster"] = labels
    return dataframe, kmeans
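
As a quick illustration of how this helper might be called from the main script, here is a minimal sketch; the file path and the ignored columns are assumptions of this example:

import pandas as pd

from models.utils import get_clusters  # import path assumed from the project layout

df = pd.read_csv("data/processed/kc_house_data.csv")  # path assumed
df, kmeans = get_clusters(df, n_clusters=10, ignore_cols=["id", "date"])

print(df["cluster"].value_counts())  # how many houses were assigned to each cluster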

Now that each house belongs to a cluster, let’s create the necessary code to forecast future values following this guide provided by Peixeiro. In the models/model_build.py file we can add the following code:

from typing import Dict, List

import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_percentage_error as mape

# Helper functions defined in models/utils.py
from .utils import normalize_clusters_by_date, window_input_output

def get_forecast_by_cluster(clusters: Dict[str, pd.DataFrame]) -> Dict[str, List]:
    """ Gets the forecast model for each cluster

    Parameters:
        clusters: Dict containing the cluster as key to dataframe

    Returns:
        results: Dict containing the cluster as key to array of test values
        and predicted values
    """
    results = {}
    for key in clusters:
        cluster_df = clusters[key]
        cluster_df = normalize_clusters_by_date(cluster_df)
        seq_df = window_input_output(cluster_df, 5, 5)

        # Builds the train and test sets
        X_cols = [col for col in seq_df.columns if col.startswith("x")]
        X_cols.insert(0, "price")
        y_cols = [col for col in seq_df.columns if col.startswith("y")]

        # Uses all but the last two rows for training
        X_train = seq_df[X_cols][:-2].values
        y_train = seq_df[y_cols][:-2].values

        # Uses the last two rows for testing
        X_test = seq_df[X_cols][-2:].values
        y_test = seq_df[y_cols][-2:].values

        dt_seq = DecisionTreeRegressor(random_state=42)
        dt_seq.fit(X_train, y_train)
        dt_seq_preds = dt_seq.predict(X_test)

        results[key] = []
        results[key].append(y_test[1])
        results[key].append(dt_seq_preds[1])
        print(f"MAPE for cluster {key} is {mape(y_test.reshape(1, -1), dt_seq_preds.reshape(1, -1))}")

    return results
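
Before highlighting the details, here is a rough sketch of how the clusters dict this function expects can be built from the dataframe labeled by get_clusters in the previous sketch:

# df is the labeled dataframe returned by get_clusters above
# One dataframe per cluster, keyed by the cluster label
clusters = {label: group for label, group in df.groupby("cluster")}

results = get_forecast_by_cluster(clusters)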

In get_forecast_by_cluster, we have three particular points to highlight:

  1. Since we grouped houses into clusters so that each cluster can be treated as a single price series changing over time, we need to group each cluster’s data by the date provided in the dataset. In the models/utils.py file we can add:
def normalize_clusters_by_date(dataframe: pd.DataFrame) -> pd.DataFrame:
    """ Groups each cluster by the date

    Parameters:
        dataframe: Dataframe to be manipulated

    Returns:
        dataframe: Dataframe with the mean price grouped by date
    """
    # Keeps only the date and price columns, then averages the price per date
    dataframe = dataframe[["date", "price"]]
    dataframe = dataframe.groupby("date").mean()

    return dataframe

2. Following the guide provided by Peixeiro, to frame this as a multi-output prediction problem we need to turn each series into windows of past observations (inputs) and future observations (targets). This is done in the following function (a toy example of its output is shown after this list):

def window_input_output(dataframe: pd.DataFrame, input_length: int, output_length: int) -> pd.DataFrame:
    """ Creates a sequence of observations to train the model

    Parameters:
        dataframe: Dataframe to be manipulated
        input_length: Number of training observations
        output_length: Number of labels

    Returns:
        dataframe: Dataframe with the artificial observations
    """
    df = dataframe.copy()

    # The original price column acts as the first input; each x_i is the price i steps ahead
    for i in range(1, input_length):
        df[f"x_{i}"] = df["price"].shift(-i)

    # Each y_j is a future price that the model will learn to predict
    for j in range(output_length):
        df[f"y_{j}"] = df["price"].shift(-output_length - j)

    # Drops the rows at the end of the series that have no future values left
    df = df.dropna()
    return df

3. We’re using the Mean Absolute Percentage Error as our metric to evaluate the quality of our model.
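
To make the windowing and the metric concrete, here is a small sketch; the toy price series and the window sizes are made up for illustration:

import pandas as pd
from sklearn.metrics import mean_absolute_percentage_error as mape

from models.utils import window_input_output  # import path assumed

# Toy series: one average price per date (values are made up)
toy = pd.DataFrame({"price": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]})

seq = window_input_output(toy, input_length=3, output_length=3)
print(seq)
# Each surviving row holds 3 consecutive prices as inputs (price, x_1, x_2)
# and the 3 following prices as targets (y_0, y_1, y_2)

print(mape([100.0], [110.0]))  # 0.1, i.e. a 10% average absolute percentage error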

Now that we’ve considered these points, we can visualize the results obtained with the training data.

Evaluation of the dataset with outliers
Evaluation of the dataset without outliers

In these images, we can see the evaluation by cluster for each of the datasets created in Part II of this series. In general, our model didn’t achieve good results, which is probably related to the fact that we didn’t have a long continuous price history for each individual house and had to group the houses into clusters instead. Because of this, unfortunately, we won’t be able to answer the questions proposed in Part I.

Price Prediction

Well, now that we’ve done all this work, we’ll at least use it to build a second model: if we can’t help our company by predicting how prices change over time, we can help them by estimating the value of a house given its characteristics.

Let’s start by loading our dataset again and splitting it into test and training data. In the data/utils.py file let’s add the necessary code for it:

from typing import List, Tuple

import pandas as pd
from sklearn.model_selection import train_test_split

def get_train_test_data(
    dataframe: pd.DataFrame, target_variables: List[str],
    test_size: float = 0.2
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """ Splits the dataset for training and testing

    Parameters:
        dataframe: Dataframe to be split
        target_variables: Target variables to be used by the model
        test_size: Fraction of the data that will go to the test set

    Returns:
        X_train, X_test, y_train, y_test: The split features and targets
    """
    # Every column that is not a target variable is used as a feature
    features = list(dataframe.columns)
    for var in target_variables:
        features.remove(var)

    X = dataframe[features]
    y = dataframe[target_variables]

    return train_test_split(X, y, test_size=test_size, random_state=42)
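
A quick sketch of the call; the file path and the choice of price as the only target are assumptions of this example, and any remaining non-numeric columns would need to be handled before training:

import pandas as pd

from data.utils import get_train_test_data  # import path assumed

df = pd.read_csv("data/processed/kc_house_data.csv")  # path assumed
X_train, X_test, y_train, y_test = get_train_test_data(df, target_variables=["price"])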

Now that we have our data loaded, we’re going to build a model based on the XGBoost library. This algorithm has a well-established reputation for strong predictive performance on tabular data. We’ll also use the RandomizedSearchCV method from Scikit-Learn to tune our model and choose the best parameters for it. In the models/model_build.py file let’s add the following code:

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV

def xgboost_train(X_train: pd.DataFrame, y_train: pd.DataFrame, n_iter: int = 5) -> RandomizedSearchCV:
    """ Trains the XGBoost model against the data

    Parameters:
        X_train: The data to be used during training
        y_train: The labels for each train data
        n_iter: Number of parameter combinations to try during tuning

    Returns:
        search: The randomized search object containing the tuned model
    """
    # Parameter distributions explored by the randomized search
    params = {
        'max_depth': [3, 5, 6, 10, 15, 20],
        'learning_rate': [0.01, 0.1, 0.2, 0.3],
        'subsample': np.arange(0.5, 1.0, 0.1),
        'colsample_bytree': np.arange(0.4, 1.0, 0.1),
        'colsample_bylevel': np.arange(0.4, 1.0, 0.1),
        'n_estimators': [100, 500, 1000]
    }

    xgb_model = xgb.XGBRegressor(random_state=42)
    search = RandomizedSearchCV(
        estimator=xgb_model,
        param_distributions=params,
        scoring='neg_mean_squared_error',
        random_state=42,
        n_iter=n_iter, verbose=2)

    search.fit(X_train, y_train)

    return search
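
After training, the returned search object exposes the winning configuration, which is worth logging before moving on to testing:

search = xgboost_train(X_train, y_train, n_iter=5)

print(search.best_params_)  # parameter combination with the best cross-validation score
print(search.best_score_)   # best (negative) mean squared error found during the search
best_model = search.best_estimator_  # the refitted XGBRegressor, also used internally by predict()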

And now we can test the trained model on the test data. In the models/utils.py file, let’s add the following code:

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV

def xgboost_test(xgb_search: RandomizedSearchCV, X_test: pd.DataFrame, y_test: pd.DataFrame) -> np.ndarray:
    """ Tests the trained XGBoost model against the data

    Parameters:
        xgb_search: The randomized search object containing the tuned model
        X_test: The data to be tested
        y_test: The labels for each test data

    Returns:
        y_pred: The predicted values for each test data
    """
    # Predicts with the best estimator found by the randomized search
    y_pred = xgb_search.predict(X_test)
    print("Predicted RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))

    return y_pred
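
To tie everything together, here is a minimal sketch of how the main logic in src/model_build.py could wire these pieces; the file path, the import paths, and the n_iter value are assumptions:

import pandas as pd

from data.utils import get_train_test_data  # import paths assumed
from models.model_build import xgboost_train
from models.utils import xgboost_test

def main():
    # Path assumed; the same flow applies to the no-outlier dataset
    df = pd.read_csv("data/processed/kc_house_data.csv")

    X_train, X_test, y_train, y_test = get_train_test_data(df, target_variables=["price"])

    search = xgboost_train(X_train, y_train, n_iter=5)
    y_pred = xgboost_test(search, X_test, y_test)

    return y_pred

if __name__ == "__main__":
    main()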

When plotting the obtained results we can see:

Price predictions on the dataset with outliers
Price predictions on the dataset without outliers

The proposed model performs better on the dataset where the outliers weren’t removed, which supports what we had initially supposed: these outliers aren’t actually a problem, but valuable real-world data.

Business Analysis

Based on the obtained results, it’s fair to say that it would be risky to use our model in a real-world scenario, especially for houses priced above US$1,000,000.00.

Given these limitations, it would be worthwhile in future work to explore alternatives to the XGBoost model, or even to improve the feature selection for this process.

Conclusion

Throughout this series, we were able to learn multiple concepts related to data manipulation and machine learning models. Unfortunately, the questions initially proposed couldn’t be answered, but we were able to use all of our hard work to build a predictive model!

More of these posts will come soon, probably in a more condensed format, since I’ve found that this one takes far more time than I initially thought. Thank you for following these posts, and remember that all the code is available on GitHub.
