House Prices — Part I

João Pedro Picolo
3 min read · Feb 16, 2023


The Purpose

This is the first chapter of a new Data Science project that I’ll be working on over the next few days, weeks, months, or forever. I’m building it as part of my portfolio while learning new things every day, so the code base may change after I finish this documentation, but I’ll try to add new posts as those updates happen.

The Project

Imagine you’re a data scientist working for a company that buys houses and resells them later for as much profit as possible. If you can make good predictions, you’re probably going to get promoted soon, or you could even start your own business. During this project, we’re going to answer questions such as:

  • Which house should I buy, and how long should I wait before selling it?
  • Given how long I can wait to sell a house, which house should I buy? What happens if I add a budget constraint?

Of course, we’ll probably discover other questions along the way, but these are the main ones we’re going to answer.

[Edit] Note: Unfortunately, due to the dataset’s limitations, we weren’t able to build a forecast model to answer these questions in Part IV of this series. Instead, we answer a simpler question to help our company: given a set of characteristics of a house, for how much should I buy or sell it?

The Dataset

For this particular project, we’re going to use the House Sales in King County dataset available at Kaggle. This dataset contains multiple columns referring to the characteristics of a house such as the price, number of bedrooms, bathrooms, and more.

To explain each variable presented in the dataset I’ll reference this post from Murillo. I’ll just copy and paste the descriptions below in case his page gets removed for any reason in the future:

id - Unique ID for each home sold
date - Date of the home sale
price - Price of each home sold
bedrooms - Number of bedrooms
bathrooms - Number of bathrooms, where .5 accounts for a room with a toilet but no shower
sqft_living - Square footage of the apartment’s interior living space
sqft_lot - Square footage of the land space
floors - Number of floors
waterfront - A dummy variable for whether the apartment was overlooking the waterfront or not
view - An index from 0 to 4 of how good the view of the property was
condition - An index from 1 to 5 on the condition of the apartment
grade - An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design.
sqft_above - The square footage of the interior housing space that is above ground level
sqft_basement - The square footage of the interior housing space that is below ground level
yr_built - The year the house was initially built
yr_renovated - The year of the house’s last renovation
zipcode - What zipcode area the house is in
lat - Latitude
long - Longitude
sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors
sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors
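
Once the file is downloaded, a quick sanity check is to confirm that its header really contains the columns listed above. The snippet below is only a minimal sketch using Python’s standard library, so it runs before any dependency is installed. The helper name missing_columns and the constant EXPECTED_COLUMNS are my own placeholders, and the column names are simply the ones from the list above.

```python
import csv
from pathlib import Path

# Column names taken from the description list above.
EXPECTED_COLUMNS = [
    "id", "date", "price", "bedrooms", "bathrooms", "sqft_living",
    "sqft_lot", "floors", "waterfront", "view", "condition", "grade",
    "sqft_above", "sqft_basement", "yr_built", "yr_renovated", "zipcode",
    "lat", "long", "sqft_living15", "sqft_lot15",
]


def missing_columns(csv_path):
    """Return the expected columns that are absent from the CSV header."""
    # utf-8-sig silently drops a byte-order mark if the file has one.
    with Path(csv_path).open(newline="", encoding="utf-8-sig") as handle:
        header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return [name for name in EXPECTED_COLUMNS if name not in present]
```

Calling missing_columns("data/raw/kc_house_data.csv") should return an empty list if the downloaded file matches the descriptions; any name it returns is a column worth checking before going further.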

The Environment

To develop this project, I’ll use a Python virtual environment, created with some tips given by Guthrie in this post. The project will be available on my GitHub.
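
As a rough sketch of that setup, the environment can also be created from Python itself with the standard library’s venv module. The directory name .venv and the helper create_env are my own placeholders here, not something taken from Guthrie’s post.

```python
import venv
from pathlib import Path


def create_env(env_dir=".venv", with_pip=True):
    """Create a virtual environment in env_dir and return its path."""
    env_path = Path(env_dir)
    # with_pip=True bootstraps pip so requirements.txt can be installed later.
    venv.create(env_path, with_pip=with_pip)
    return env_path
```

Calling create_env() from the project root does roughly what running python -m venv .venv does on the command line. After activating the environment, the dependencies can be installed with pip install -r requirements.txt.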

Initially, we’re going to have the following project tree:

project/
├── data/
│   └── raw/
│       └── kc_house_data.csv
├── requirements.txt
└── README.md

  • In the data/raw directory, we’ll be saving the original, unprocessed dataset.
  • The requirements.txt file contains some Python dependencies that will be used.
  • The README.md file contains the project description.
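
If you’d rather not create these folders by hand, the tree above can be sketched with pathlib. This is just one possible way to do it, and the function name create_skeleton is a placeholder of mine. It only creates the directories and the two empty files; kc_house_data.csv still has to be downloaded from Kaggle and placed in data/raw.

```python
from pathlib import Path


def create_skeleton(root="project"):
    """Create the initial project tree under root and return its path."""
    root_path = Path(root)
    # data/raw holds the original, unprocessed dataset.
    (root_path / "data" / "raw").mkdir(parents=True, exist_ok=True)
    # Empty placeholders; existing files are left untouched.
    for name in ("requirements.txt", "README.md"):
        (root_path / name).touch(exist_ok=True)
    return root_path
```

The function is safe to run more than once, since existing directories and files are kept as they are.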
