Customer Churn Prediction

João Pedro Picolo
7 min read · May 7, 2023

The Project

The customer churn rate is an essential metric for companies: it indicates how many clients stop using a product over time. Once this metric is known, companies want to predict which customers are going to churn and why, since this information can be used to win those customers back.

This project will use a set of machine learning algorithms to predict which customers will churn. The data used in this post is the Telco Customer Churn Dataset, which contains 21 features for 7043 users; the description of each feature can be read in the link provided.

Data validation

Before starting to build our models, it’s important to make sure that the data in the dataset is valid, i.e., that the data types are correct, there are no missing values, there are no unexpected values, etc.

Data types

Let’s start by checking whether each feature corresponds to the expected data type:
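A minimal sketch of this check with pandas, assuming the CSV was downloaded from Kaggle (the file name below is an assumption):

```python
import pandas as pd

# Load the Telco Customer Churn dataset (file name is an assumption)
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

# Inspect the data type pandas inferred for each feature
print(df.dtypes)
```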

Original data types

When comparing with the descriptions provided on Kaggle, we can observe that the TotalCharges feature has type object instead of float. Digging into this feature, we find that this behavior is caused by the presence of empty string values in the column. To fix this, let’s replace the empty strings with NaN values and convert the column to type float.
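One way to do this, as a sketch assuming the dataframe is named df:

```python
# Blank strings keep the column as `object`; coercing turns them into
# NaN while every valid entry is converted to float
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
```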

TotalCharges manipulation to the correct data type

Missing values

Now that all the columns correspond to the expected data types, we need to check for any missing values.
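A sketch of the check:

```python
# Count missing values per column; after the conversion above,
# TotalCharges is expected to be the only feature with NaNs
print(df.isna().sum())
```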

Missing data count

We observe that the column we converted in the last step is the only one with missing values. Instead of dropping these rows, we will fill in the missing values by interpolating the column.
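A sketch of the interpolation (the spline order here is an assumption, and the spline method requires scipy to be installed):

```python
# Fill the missing TotalCharges values with a spline interpolation
df["TotalCharges"] = df["TotalCharges"].interpolate(method="spline", order=3)
```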

Uses spline as the interpolation method

Data exploration

Since the dataset page provides the possible values for each feature, we can check whether all the columns in the dataset contain only the expected values.

This visualization allows us to verify two things:

  • Whether the values are as expected (and they are).
  • How many occurrences of each feature value are present in the dataset, grouped by customer churn.

The second item is also the start of the next process: the exploratory analysis, i.e., understanding how the data behaves across the dataset. We can observe, for example, that people with no dependents tend to have a higher churn rate.
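Both checks can be sketched with pandas and seaborn, illustrated here for the Dependents feature (the loop over all categorical columns is omitted for brevity):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Verify that the column only contains the values documented on Kaggle
print(df["Dependents"].unique())

# Count the occurrences of each value, split by customer churn
sns.countplot(data=df, x="Dependents", hue="Churn")
plt.title("Dependents count by churn")
plt.show()
```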

Exploratory Analysis

In this step, we try to understand the behavior of our dataset. The first example is the set of count plots presented previously.

I’m sure some readers noticed that we didn’t count the features tenure, MonthlyCharges, and TotalCharges. The reason for not counting them is that they hold numerical values, and counting them doesn’t add much to our analysis at this point.

Useful information can still be obtained from these columns with other methods. For example, let’s visualize clients’ charges over the months they used the service (the tenure feature).
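A sketch of this plot with seaborn; the helper function is hypothetical and is reused below for the monthly charges:

```python
# Scatter a charges column against tenure, coloring points by churn status
def plot_vs_tenure(df, column):
    sns.scatterplot(data=df, x="tenure", y=column, hue="Churn", alpha=0.5)
    plt.title(f"{column} over tenure")
    plt.show()

plot_vs_tenure(df, "TotalCharges")
```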

Total charges over tenure

This plot shows that most clients churn between 1 and 30 months of use. Naturally, the charge value increases over time, since this variable indicates the total amount spent.

Let’s check whether the monthly charges have more information to offer.
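Using the same hypothetical helper from the previous plot:

```python
# Same view as before, now for the monthly charges
plot_vs_tenure(df, "MonthlyCharges")
```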

Monthly charges over tenure

If we look at the same time range with the highest churn rates (1–30 months), this plot tells us that most clients who churned were paying higher monthly charges. This could indicate that offering these clients lower prices would reduce the churn rate.

Data Processing

Now that we understand the behavior of our data, we can start pre-processing the dataset for the models we will build.

Dealing with categorical data

Machine learning models work with numeric data, so it’s necessary to convert all the categorical features into numerical ones. This will be done with the One Hot Encoder method: this approach replaces a categorical feature that has N possible values with N new binary features.

The same result can be obtained with the get_dummies() method from Pandas, but Nair provided some reasons to prefer the method from sklearn over the one from pandas.

Due to the behavior of the one-hot encoder provided by the sklearn library, let’s start by separating the Churn column and converting it to numerical values using pandas. Then let’s drop this column and customerID, since neither will be one-hot encoded by sklearn.
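A minimal sketch of these two steps, assuming the dataframe is named df:

```python
# Convert the target to numerical values with pandas
target = df["Churn"].map({"Yes": 1, "No": 0})

# Drop the columns that will not be one-hot encoded
features = df.drop(columns=["Churn", "customerID"])
```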

Drops unused columns and converts the target variable

Now let’s convert the remaining categorical features by using the technique previously described:
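A sketch with sklearn (the sparse_output flag assumes sklearn >= 1.2; older versions spell it sparse=False):

```python
from sklearn.preprocessing import OneHotEncoder

# Encode only the categorical columns; numerical features stay untouched
categorical_cols = features.select_dtypes(include="object").columns
encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
encoded = encoder.fit_transform(features[categorical_cols])
```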

Converts remaining categorical values

Since this method creates new columns, it’s necessary to build a new dataframe whose headers correspond to the new information. Fortunately, the library provides the generated column names for us.

Let’s add the separated target column back to the dataframe so we can preserve the integrity of the data.
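A sketch of both steps, using the column names the encoder generates:

```python
# Rebuild a dataframe with the headers produced by the encoder
encoded_df = pd.DataFrame(
    encoded,
    columns=encoder.get_feature_names_out(categorical_cols),
    index=features.index,
)

# Keep the numerical columns and re-attach the target column
processed = pd.concat([features.drop(columns=categorical_cols), encoded_df], axis=1)
processed["Churn"] = target
```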

Joins data to preserve the integrity

It’s important to notice that the numerical features were not transformed in this step: only the categorical columns go through the encoder, so the features that were already numeric keep their original values.

Dataset split

Now that the dataset is in the format expected by our models, we need to split the data into training and testing sets. This step is important so the model is not tested on previously seen data, which better represents real-world scenarios and keeps us from being misled by overfitting.
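A sketch of the split (stratifying on the target, which keeps the churn ratio similar in both sets, is an assumption on my part):

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    processed.drop(columns=["Churn"]),
    processed["Churn"],
    test_size=0.2,
    random_state=42,
    stratify=processed["Churn"],
)
```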

Splits the data into training and testing sets

The split is done with a method provided by sklearn, configured so that 80% of the data is used for training and 20% for testing.

Model creation

Since we’re dealing with a classification problem, we’re going to try a few different methods and pick the one with the best ability to correctly classify clients who churned. The methods tested are:

  • XGBoost: implements gradient boosting over decision trees under the hood.
  • SVM: maps the data into a high-dimensional space and finds the hyperplane that best separates the classes.
  • KNN: classifies each sample based on the majority class among its k nearest neighbors.
  • Logistic regression: models the probability of each class with a logistic function, yielding a linear decision boundary between them.

Each model has a set of parameters that can be studied in the documentation provided. We won’t explore these parameters deeply, since that is not the purpose of this post, but it’s important to tune them in order to get the best out of each model.
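As a concrete illustration, tuning one of the models might look like the sketch below; the search space is hypothetical, since the post does not list the actual grids:

```python
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression

# Hypothetical search space for logistic regression
param_dist = {
    "C": [0.01, 0.1, 1, 10, 100],
    "solver": ["lbfgs", "liblinear"],
}

# Randomly sample candidates and cross-validate each one
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions=param_dist,
    n_iter=10,
    cv=5,
    scoring="accuracy",
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```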

The parameter tuning is done with sklearn’s Randomized Search Cross-Validation, as in the sketch above. The accuracy of each model can be seen in the following plot.

Results for each method

We can observe that the best classifier for this problem and dataset is the Logistic Regression classifier, which predicts correctly in almost 82% of cases.

Business Analysis

After all the hard work of developing this model, a useful question arises: when should we use it, and how can a business benefit from it?

Since we know the profile of customers who have churned, based on the previous exploration of the data, this model could be run every month to predict which customers are about to churn.

Once we know who these customers are, it’s possible to craft personalized offers. For example, if I predict that my client Michael will churn, I can use his information to retain him:

  • If Michael has a month-to-month contract, I can offer a longer contract with a discount on the monthly charges for the same service. This would reduce the profit margin but would keep the client.
  • If Michael has no tech support, maybe he’s having problems using the provided services. The company could include free periodic technical support that fits Michael’s needs, or even give him a discount on this service as well.
  • Etc.

Note that it’s also possible to automate the creation of these personalized offers, but that’s a bigger topic for a future post. Naturally, the exploratory analysis used to understand customer behavior would need to be refreshed periodically to reflect the current profile of our customers.

Conclusion

This post provided an overview of how each variable behaves in the provided customer dataset and introduced some of the classifiers that can be used to predict whether or not a customer will churn.

The developed code can be found in this GitHub repository. A Python notebook is provided for data visualization, alongside the code that builds the whole pipeline. Unit tests for each function are also provided.

Feel free to send me feedback on how I can improve my posts, both on the technical side and on how to communicate better.
