House Prices — Part III

João Pedro Picolo
4 min read · Mar 16, 2023

Introduction

Now that the dataset is clean, we need to decide which variables are related to our target variable: the price. Let’s start by creating more directories in our project tree:

project/
├── data/
│ ├── raw/
│ │ ├── kc_house_data.csv
│ ├── interim/
│ │ ├── kc_house_data.csv
│ │ ├── kc_house_data_no_outlier.csv
│ ├── processed/
├── src/
│ ├── data/
│ │ ├── utils.py
│ │ ├── cleaning.py
│ │ ├── modeling.py
│ │ ├── __init__.py
│ ├── clean.py
│ ├── modeling.py
│ ├── __init__.py
├── requirements.txt
└── README.md
  • The data/processed directory will contain the dataset that we’re going to use to train and test our models.
  • The src/modeling.py file will contain the main logic to model our data.
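
The new directory can be created by hand. As a minimal sketch, it can also be created from Python with pathlib; ensure_processed_dir is a hypothetical helper used only for illustration and is not part of the original project:

```python
from pathlib import Path


def ensure_processed_dir(project_root: str) -> Path:
    """Create <project_root>/data/processed if it does not exist yet."""
    processed_dir = Path(project_root) / "data" / "processed"
    # parents=True also creates data/ when missing; exist_ok=True makes reruns safe
    processed_dir.mkdir(parents=True, exist_ok=True)
    return processed_dir
```

Calling ensure_processed_dir("project") twice in a row is harmless, since an existing directory is left untouched.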

Features Manipulation

When building a model, we can use prior knowledge to better understand the problem and to create features that we believe will be useful. One feature that is hidden in our dataset and that, from prior knowledge, directly affects the price of a house is its age. Let’s create a src/data/modeling.py file to calculate it:

import pandas as pd


def calculate_house_age(year_of_contruction: pd.Series, current_year: int) -> pd.Series:
    """ Returns the age of the house

    Parameters:
        year_of_contruction: Series containing the construction year for each house
        current_year: Year on which the dataset was built

    Returns:
        age: Returns the age series
    """
    age = current_year - year_of_contruction

    return age
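
To make the arithmetic concrete, here is a small sketch of the same subtraction on a few made-up construction years. The values are illustrative only and are not taken from the dataset:

```python
import pandas as pd

# Made-up construction years, for illustration only
yr_built = pd.Series([1955, 1987, 2014])

# Same arithmetic as calculate_house_age: a scalar minus a Series is applied element-wise
age = 2015 - yr_built
print(age.tolist())  # [60, 28, 1]
```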

Another relevant feature is hidden in our dataset: the number of years since the house was last renovated. Let’s add another function to the file we just created:

def calculate_last_renovation(year_of_renovation: pd.Series, house_age: pd.Series, current_year: int) -> pd.Series:
    """ Returns the years since the last renovation

    Parameters:
        year_of_renovation: Series containing the year of the last renovation. Zero if it was not renovated
        house_age: Series containing the age of each house
        current_year: Year on which the dataset was built

    Returns:
        last_renovation: Returns the Series containing the years since last renovation
    """

    last_renovation = []
    for idx, year in year_of_renovation.items():
        if year > 0:
            last_renovation.append(current_year - year)
        else:  # If there was no renovation, fall back to the house age
            last_renovation.append(house_age[idx])

    # Reuse the input index so the result aligns with the source dataframe
    return pd.Series(last_renovation, index=year_of_renovation.index)
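
The loop above can also be written without explicit iteration. The following is a sketch of one possible alternative using pandas’ Series.where; the function name calculate_last_renovation_vectorized and the sample values are assumptions made for illustration, not part of the original project:

```python
import pandas as pd


def calculate_last_renovation_vectorized(year_of_renovation: pd.Series,
                                         house_age: pd.Series,
                                         current_year: int) -> pd.Series:
    """Hypothetical vectorized variant of calculate_last_renovation."""
    years_since = current_year - year_of_renovation
    # Keep years_since where a renovation year exists, otherwise use the house age
    return years_since.where(year_of_renovation > 0, house_age)


# Made-up values with a non-default index, for illustration only
yr_renovated = pd.Series([0, 1991, 0, 2010], index=[10, 11, 12, 13])
age = pd.Series([60, 90, 5, 40], index=[10, 11, 12, 13])

result = calculate_last_renovation_vectorized(yr_renovated, age, current_year=2015)
print(result)
```

Because Series.where works on aligned Series, the result keeps the index of the input, which matters when the value is assigned back to a dataframe column.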

Features Selection

From our Data Report, it’s already possible to know which features have a negative correlation with the target variable. Combined with prior knowledge of the problem, this leads us to remove the following features from our dataset: id, yr_built, yr_renovated, zipcode, lat, long, sqft_living15, sqft_lot15.

These features are removed because they do not necessarily affect the price that a customer would pay for a house. Note that yr_built and yr_renovated are removed because the features created in the previous step represent this information better.

Conclusion

This was a short post, but a crucial one for building our model, since we only want to use high-quality features. The manipulation process is done by our modeling.py program:

import sys

import pandas as pd

from data.utils import load_data_to_dataframe, save_data_to_csv
from data.modeling import calculate_house_age, calculate_last_renovation


def main():
    datasets = ["../data/interim/kc_house_data.csv",
                "../data/interim/kc_house_data_no_outlier.csv"]

    for dataset in datasets:
        try:
            dataframe = load_data_to_dataframe(data_path=dataset)
        except Exception:
            print("It was not possible to read the provided .csv file")
            sys.exit(1)  # Non-zero exit code signals the failure

        # This dataframe is from 2015
        df_date = 2015

        # Calculates new features
        dataframe["age"] = calculate_house_age(
            year_of_contruction=dataframe["yr_built"], current_year=df_date)
        dataframe["last_renovation"] = calculate_last_renovation(
            year_of_renovation=dataframe["yr_renovated"], house_age=dataframe["age"], current_year=df_date)

        # Remove columns that do not affect the target variable
        ignore_columns = ["id", "yr_built", "yr_renovated",
                          "zipcode", "lat", "long", "sqft_living15", "sqft_lot15"]
        dataframe = dataframe.drop(labels=ignore_columns, axis="columns")

        # Formats the date column to contain month and year only
        dataframe["date"] = pd.to_datetime(dataframe["date"], format='%Y%m%dT%H%M%S')
        dataframe["date"] = dataframe["date"].dt.strftime('%m-%Y')

        # Save processed dataframe
        save_data_to_csv(dataframe, data_path=dataset.replace("interim", "processed"))


if __name__ == "__main__":
    main()
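
The date handling in the script can be checked in isolation. This sketch applies the same format string to two example values written in the layout that the format expects (YYYYMMDDTHHMMSS); the sample strings are assumptions chosen for illustration:

```python
import pandas as pd

# Example values in the layout the format string expects (YYYYMMDDTHHMMSS)
raw_dates = pd.Series(["20141013T000000", "20150225T000000"])

# Parse the strings into datetimes, then keep month and year only
parsed = pd.to_datetime(raw_dates, format="%Y%m%dT%H%M%S")
month_year = parsed.dt.strftime("%m-%Y")
print(month_year.tolist())  # ['10-2014', '02-2015']
```

Note that after strftime the column holds plain strings again, so it is no longer a datetime column.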

To have a better understanding of the impact of our manipulation, let’s take a look at the correlation between the remaining features:

Correlation Matrix

Note that all the variables have a positive correlation with the price variable, except the ones we created. This makes sense: would you pay less or more for an older house? Probably less. And what about a house that has not been renovated in years? Probably less as well.
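
A correlation matrix like this one can be computed with pandas. The sketch below uses a tiny made-up sample rather than the real dataset, so the numbers it prints are not the ones discussed above; it only shows the mechanics and the expected signs:

```python
import pandas as pd

# Tiny made-up sample, for illustration only; not taken from the real dataset
sample = pd.DataFrame({
    "price": [500000, 420000, 350000, 300000, 250000],
    "sqft_living": [2400, 2100, 1800, 1500, 1300],
    "age": [2, 10, 25, 40, 60],
    "last_renovation": [2, 10, 5, 40, 20],
})

# Pearson correlation between every pair of numeric columns
correlation_matrix = sample.select_dtypes(include="number").corr()

# Correlation of each feature with the target, from most positive to most negative
price_correlation = correlation_matrix["price"].drop("price").sort_values(ascending=False)
print(price_correlation)
```

Selecting the numeric columns first keeps text columns, such as the formatted date, out of the calculation.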
