House Prices — Data Report

João Pedro Picolo
3 min read · Feb 20, 2023


In this post I’ll analyze the data in the House Sales in King County dataset. My intention is to clarify some of the motivations behind my decisions during the development of this project.

Data Validation

All the methods described below are part of the pandas library; a consolidated sketch of all these checks appears right after the list.

  • The original dataset contains 21,613 rows and 21 columns; a description of each column can be found in my first post. This information can be obtained with the .info() method.
  • There are no null values in the dataset; this can be verified by chaining the .isnull().sum() methods.
  • Running .duplicated().value_counts() over the dataframe also shows that there is no duplicate data in the dataset.
  • We can validate that all the values of the view, condition, and grade columns fall within their expected ranges using the .unique() method.
  • Since we’re dealing with numeric columns representing real-world measurements, we need to ensure that no numerical column has a negative value; otherwise it’s invalid data. In our case, we won’t consider the lat and long columns, since we know beforehand that geographic coordinates can be negative. We can count the negative values per numeric column with the following code:
# Numeric columns only, excluding lat/long, which are allowed to be negative
numerical_columns = [c for c in dataframe.columns
                     if dataframe[c].dtype.name != 'object' and c not in ('lat', 'long')]
# Count negative values per column (all counts should be zero)
print((dataframe[numerical_columns] < 0).sum())
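
Putting these checks together, here is a minimal sketch of the whole validation step. It assumes the dataset is loaded from a CSV file; the kc_house_data.csv file name is an assumption and should be adjusted to your copy.

import pandas as pd

# Load the dataset (file name is an assumption)
dataframe = pd.read_csv('kc_house_data.csv')

# Row/column counts, column names, and dtypes
dataframe.info()

# Null values per column (all counts should be zero)
print(dataframe.isnull().sum())

# Duplicate rows (we expect only a False entry)
print(dataframe.duplicated().value_counts())

# Columns whose values must fall within a fixed range
for column in ['view', 'condition', 'grade']:
    print(column, sorted(dataframe[column].unique()))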

Data Exploration

After validating our data, we can start exploring the dataset to understand the behavior of the variables. We will use the Matplotlib and Seaborn libraries to plot the data needed for the analysis.

Correlation

Initially, it’s important to understand how the variables interact with one another. We can do this by analyzing Pearson’s correlation coefficient between each pair of variables:

Correlation Matrix

We can observe that id and zipcode are the only columns with a negative linear correlation with price, our target variable.
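
As a reference, here is a minimal sketch of how such a correlation matrix can be computed and plotted, reusing the dataframe from the validation step:

import matplotlib.pyplot as plt
import seaborn as sns

# Pearson correlation between all numeric columns
correlation = dataframe.select_dtypes(include='number').corr()

# A heatmap makes strong positive/negative relationships easy to spot
plt.figure(figsize=(12, 10))
sns.heatmap(correlation, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()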

Outliers

Outliers are atypical values in our dataset, and their presence makes the data harder to interpret. We can use boxplots to visualize these values:

Boxplots from the main variables

We can observe that our dataset has multiple outliers. As explained by Frost, however, their presence does not necessarily indicate an error that requires removing them.
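
Here is a minimal sketch of how such boxplots can be produced; the main_columns subset is an assumption and should be adjusted to the variables of interest:

import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical subset of the main variables; adjust as needed
main_columns = ['price', 'bedrooms', 'bathrooms', 'sqft_living']

# One boxplot per column, side by side
fig, axes = plt.subplots(1, len(main_columns), figsize=(16, 4))
for ax, column in zip(axes, main_columns):
    sns.boxplot(y=dataframe[column], ax=ax)
    ax.set_title(column)
plt.tight_layout()
plt.show()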

Distribution

Just like outliers, the data distribution is an important topic during analysis, since a skewed distribution can add biases to our model. For the same columns seen above, we can plot the distributions:

Histograms from the main variables
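
A minimal sketch for the histograms, reusing the same hypothetical main_columns subset from the boxplot sketch:

import matplotlib.pyplot as plt

# Same hypothetical subset of the main variables as before
main_columns = ['price', 'bedrooms', 'bathrooms', 'sqft_living']

# bins=50 gives enough resolution to reveal skewness in each distribution
dataframe[main_columns].hist(bins=50, figsize=(16, 8))
plt.tight_layout()
plt.show()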

Conclusions

Since we have already validated the data in our dataset, we can finally use it to train a machine learning model. The decision of whether to remove any outliers or transform the distributions will be made during the project’s implementation, since removing these values can actually degrade the model’s performance.
