Multiclass Classification With Mobile Price Range Dataset

Che-Jui Huang
4 min read · Oct 11, 2020

Data Source

Mobile Price Classification:
https://www.kaggle.com/iabhishekofficial/mobile-price-classification
Author:
Abhishek Sharma

Project Summary

With 20 features covering different mobile phone specifications, my goal is to figure out which algorithm best helps “Imaginary Manager Bob” make pricing decisions. FYI, Bob is a made-up person, HA.


Analysis Pipeline

  1. Exploratory Data Analysis
  2. Data Normalization
  3. Optional: Multicollinearity Test
  4. Create Model Pool
  5. Fine-Tuning

Exploratory Data Analysis, Data Normalization, Multicollinearity Test

4 Classes [0, 1, 2, 3]

As a data analyst/scientist, you should be grateful to have such evenly spread target classes: there are exactly 500 records in each category, which is not common in practice.
Since the target classes are perfectly balanced, I suggest devoting your time to feature engineering, to figure out the best combination of features to feed into an algorithm.
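If you want to verify the balance yourself, a quick pandas check does the job. This is a minimal sketch assuming the Kaggle training file is saved as train.csv (the file name is my assumption; the target column really is named “price_range” in this dataset):

```python
import pandas as pd

# Load the Kaggle training file (file name assumed)
df = pd.read_csv("train.csv")

# Count records per target class; a balanced set shows 500 for each of 0-3
print(df["price_range"].value_counts().sort_index())
```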

In my case, I removed any feature that had a high correlation with another. As a result, I eliminated the {“pc”, “px_width”, “sc_h”, “three_g”} columns based on the multicollinearity test.
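I don’t show the test itself here, but one common way to run it is with variance inflation factors from statsmodels. The sketch below follows that approach; the exact cutoff for “high” correlation is my assumption:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

df = pd.read_csv("train.csv")
X = add_constant(df.drop(columns=["price_range"]))

# One VIF score per feature; values above roughly 5-10 are the usual
# warning sign of multicollinearity (the threshold is a rule of thumb)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
).drop("const")
print(vif.sort_values(ascending=False))

# Drop the columns flagged above
df = df.drop(columns=["pc", "px_width", "sc_h", "three_g"])
```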

Additionally, I noticed that “ram” and “battery_power” each have a strong correlation with “price_range”. Therefore, I assumed a scatter plot of these two columns might offer some insight.
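A minimal matplotlib sketch of that scatter plot (the column names come straight from the dataset):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("train.csv")

# Plot "ram" against "battery_power", colored by the target class
scatter = plt.scatter(df["ram"], df["battery_power"],
                      c=df["price_range"], cmap="viridis", alpha=0.6)
plt.xlabel("ram")
plt.ylabel("battery_power")
plt.legend(*scatter.legend_elements(), title="price_range")
plt.show()
```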

What do you think?
Do you find something interesting?
I will leave these questions to you…

Create Model Pool

Some commonly used algorithms for multiclass classification (a minimal setup sketch follows the list):
— Support Vector Machine
— Decision Tree Classifier
— Gradient Boosting Classifier
— XGBoost Classifier
— CatBoost Classifier
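Here is a minimal sketch of how such a pool could be set up and compared with scikit-learn, XGBoost, and CatBoost. The split ratio and near-default hyperparameters are my assumptions, not my exact settings; note the SVM gets a scaling step, which covers the data-normalization stage of the pipeline:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# Reload and drop the multicollinear columns identified earlier
df = pd.read_csv("train.csv").drop(columns=["pc", "px_width", "sc_h", "three_g"])
X, y = df.drop(columns=["price_range"]), df["price_range"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# SVM needs normalized inputs, so wrap it in a scaling pipeline;
# the tree-based models work fine on raw features
pool = {
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "XGBoost": XGBClassifier(random_state=42),
    "CatBoost": CatBoostClassifier(verbose=0, random_state=42),
}

# Fit each candidate and report held-out accuracy
for name, model in pool.items():
    model.fit(X_train, y_train)
    print(f"{name}: {model.score(X_test, y_test):.3f}")
```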

For the Support Vector Machine and the Decision Tree Classifier, a quick Google search will cover the mechanics behind the algorithms. These two are relatively easy to understand, and you should have no trouble reviewing them on your own. On the other hand, Gradient Boosting, XGBoost, and CatBoost may be harder to grasp. Here are links to study if you are eager to find out the “HOW”.

Gradient Boosting Classifier

“Gradient Boost Part 3: Classification” YouTube, uploaded by Josh Starmer, 8 April 2019, https://www.youtube.com/watch?v=jxuNLH5dXCs

XGBoost Classifier

“XGBoost Part 2: Classification” YouTube, uploaded by Josh Starmer, 10 February 2020, https://www.youtube.com/watch?v=jxuNLH5dXCs

Finally, the CatBoost Algorithm

“Yandex Catboost: Open-source Gradient Boosting Library” YouTube, uploaded by Яндекс, 17 July 2017, https://www.youtube.com/watch?time_continue=81&v=s8Q_orF4tcI&feature=emb_logo

Next, let’s check out the results…

Although all three boosting methods produce similar results, CatBoost keeps the upper hand in performance. And as expected, “ram” and “battery_power” are the top factors the models rely on to predict “price_range”.

The Test Set Results (Left) // SHAP for Feature Importance (Right)
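For the right-hand plot, a SHAP feature-importance chart for the CatBoost model can be sketched like this, continuing from the model-pool variables above (newer versions of the shap library may return the values in a slightly different shape):

```python
import shap
from catboost import CatBoostClassifier

# Refit CatBoost on the training split from the model-pool sketch
model = CatBoostClassifier(verbose=0, random_state=42).fit(X_train, y_train)

# TreeExplainer works natively with CatBoost's tree ensemble
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Bar summary ranks features by mean |SHAP|; "ram" and "battery_power"
# should come out on top
shap.summary_plot(shap_values, X_test, plot_type="bar")
```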

Fine-Tune the Model

In most cases, you will want to fine-tune your top model. In this particular case, however, fine-tuning CatBoost is very likely to overfit the training data. There is evidence supporting my viewpoint, and of course, if you have different thoughts, please share them in the comments.

Original Distribution of Price Range
Seems the Outliers Are Captured
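If you still want to try, one cautious way is a cross-validated grid search, where overfitting shows up as a gap between the CV score and the held-out test score. The grid below is purely illustrative and reuses the earlier train/test split:

```python
from sklearn.model_selection import GridSearchCV
from catboost import CatBoostClassifier

# Illustrative search space; not a recommended or exhaustive grid
param_grid = {
    "depth": [4, 6, 8],
    "learning_rate": [0.03, 0.1],
    "iterations": [200, 500],
}

search = GridSearchCV(
    CatBoostClassifier(verbose=0, random_state=42),
    param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# A large gap between these two scores is the overfitting signal
print("CV accuracy:  ", search.best_score_)
print("Test accuracy:", search.score(X_test, y_test))
```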

Conclusion

You may notice that most of the misclassified instances sit in the overlapping areas. Nonetheless, the original data has twenty dimensions, and the actual boundaries probably do not look like the ones above; the true decision boundary for CatBoost should be far more complex. For visualization, however, I used the two most important features to see whether I could draw insights from my data. What do you think?
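If you want to reproduce a picture like the ones above, here is a sketch that refits CatBoost on the two strongest features and shades a prediction grid behind the scatter (the grid resolution is my choice):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from catboost import CatBoostClassifier

df = pd.read_csv("train.csv")
X2, y = df[["ram", "battery_power"]], df["price_range"]

# Refit on just the two strongest features so the boundary is drawable in 2-D
clf2 = CatBoostClassifier(verbose=0, random_state=42).fit(X2, y)

# Predict over a grid spanning both feature ranges, then shade by class
xx, yy = np.meshgrid(
    np.linspace(X2["ram"].min(), X2["ram"].max(), 300),
    np.linspace(X2["battery_power"].min(), X2["battery_power"].max(), 300),
)
zz = clf2.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.3, cmap="viridis")
plt.scatter(X2["ram"], X2["battery_power"], c=y, cmap="viridis", s=8)
plt.xlabel("ram")
plt.ylabel("battery_power")
plt.show()
```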
