Multiclass Classification With Mobile Price Range Dataset
Data Source
Mobile Price Classification:
https://www.kaggle.com/iabhishekofficial/mobile-price-classification
Author:
Abhishek Sharma
Project Summary
With 20 features describing different mobile phone specifications, my goal is to figure out which algorithm best supports pricing decisions for the “Imaginary Manager Bob”. (Bob is, of course, a made-up person.)
Pipeline Analysis
- Exploratory Data Analysis
- Data Normalization
- Optional: Multicollinearity Test
- Create Model Pool
- Fine Tuning
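Before diving in, here is a minimal sketch of the normalization step. I use `MinMaxScaler` here as one common choice, and the small synthetic frame below is only a stand-in for the real Kaggle columns:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Stand-in for the Kaggle dataset: two features on very different scales,
# mimicking "battery_power" (hundreds to thousands) and "clock_speed" (0.5-3.0).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "battery_power": rng.integers(500, 2000, size=100),
    "clock_speed": rng.uniform(0.5, 3.0, size=100),
})

# Scale every feature into [0, 1] so distance-based models (e.g. SVM)
# are not dominated by the large-valued columns.
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(scaled.min().round(2).tolist())  # each column now starts at 0.0
print(scaled.max().round(2).tolist())  # and ends at 1.0
```

Normalization matters most for the SVM in the model pool below; tree-based models are largely insensitive to feature scale.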
Exploratory Data Analysis, Data Normalization, Multicollinearity Test
As a data analyst or scientist, you should be grateful for such evenly distributed target classes: each price category has exactly 500 records, which is not common in practice.
Since the target classes are perfectly balanced, I suggest devoting your time to feature engineering, figuring out the best combination of features to feed into an algorithm.
In my case, I removed features that were highly correlated with one another. Based on the multicollinearity test, I eliminated the {“pc”, “px_width”, “sc_h”, “three_g”} columns.
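A correlation-threshold filter is one simple way to run this kind of test. The sketch below uses synthetic stand-in columns (the real analysis would run on the full Kaggle frame, and the 0.8 threshold is my own assumption):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
# Synthetic stand-ins: "px_width" is built to correlate strongly with
# "px_height", the way the real columns do in the Kaggle data.
px_height = rng.uniform(0, 1960, size=n)
df = pd.DataFrame({
    "px_height": px_height,
    "px_width": px_height * 0.8 + rng.normal(0, 50, size=n),
    "ram": rng.uniform(256, 4000, size=n),
})

# Flag any feature whose absolute correlation with an earlier column
# exceeds the threshold, then drop the flagged columns.
threshold = 0.8
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

print(to_drop)  # with this synthetic data, "px_width" is flagged
```

Dropping one column from each highly correlated pair keeps the information while removing the redundancy.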
Additionally, I noticed that “ram” and “battery_power” correlate strongly with “price_range”, so a scatter plot of these two columns may offer some insight.
What do you think?
Do you find something interesting?
I will leave these questions to you.
Create Model Pool
Some commonly used algorithms for multiclass classification:
— Support Vector Machine
— Decision Tree Classifier
— Gradient Boosting Classifier
— XGBoost Classifier
— CatBoost Classifier
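A model pool can be as simple as a dictionary of estimators scored on the same split. This sketch uses only scikit-learn and synthetic four-class data as a stand-in; `XGBClassifier` (from the `xgboost` package) and `CatBoostClassifier` (from `catboost`) plug into the same loop via the same fit/score interface:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the mobile dataset: 4 balanced price classes.
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# A simple model pool; each entry is fit on the training split and
# scored (accuracy) on the held-out split.
pool = {
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in pool.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```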
For the Support Vector Machine and the Decision Tree Classifier, a quick search will turn up the mechanics behind the algorithms; these two are relatively easy to understand, and you should have no trouble reviewing them on your own. Gradient Boosting, XGBoost, and CatBoost, on the other hand, may be harder to grasp. Here are some links to study if you are eager to find out the “how”.
Gradient Boosting Classifier
“Gradient Boost Part 3: Classification” YouTube, uploaded by Josh Starmer, 8 April 2019, https://www.youtube.com/watch?v=jxuNLH5dXCs
XGBoost Classifier
“XGBoost Part 2: Classification” YouTube, uploaded by Josh Starmer, 10 February 2020, https://www.youtube.com/watch?v=jxuNLH5dXCs
Finally, the CatBoost Classifier
“Yandex Catboost: Open-source Gradient Boosting Library” YouTube, uploaded by Яндекс, 17 July 2017, https://www.youtube.com/watch?time_continue=81&v=s8Q_orF4tcI&feature=emb_logo
Next, let’s check out the results.
Although all three boosting methods produce similar results, CatBoost keeps a slight edge in performance. And as expected, “ram” and “battery_power” are the top factors the models use to predict “price_range”.
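Ranking features this way comes straight from the fitted model's `feature_importances_` attribute. A sketch with a scikit-learn booster and placeholder feature names (in the real data these would be “ram”, “battery_power”, and so on):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic 4-class data; "feat_0" ... "feat_9" are placeholder names.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           n_classes=4, random_state=0)
feature_names = [f"feat_{i}" for i in range(X.shape[1])]

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Rank features by impurity-based importance and keep the top two --
# these would be the axes for a 2-D decision-boundary plot.
order = np.argsort(model.feature_importances_)[::-1]
top_two = [feature_names[i] for i in order[:2]]
print(top_two)
```

CatBoost exposes the same idea through its `get_feature_importance` method.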
Fine Tune the Model
In most cases, you will want to fine-tune your top model. However, in this particular case, fine-tuning CatBoost is very likely to overfit the training data. There is evidence supporting this viewpoint, and of course, if you have different thoughts, please share them in the comments.
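If you do fine-tune, cross-validated search is the standard guard against exactly the overfitting risk mentioned above, since every candidate is scored on held-out folds rather than on the data it was fit to. A sketch with `GridSearchCV` on a scikit-learn booster and a hypothetical two-parameter grid (CatBoost's own `grid_search` method works similarly):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 4-class stand-in for the mobile data.
X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)

# Small illustrative grid; real tuning would cover more parameters.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)          # best combination found
print(round(search.best_score_, 3))  # mean cross-validated accuracy
```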
Conclusion
You may find that most misclassified instances fall in the overlapping areas. Keep in mind, however, that the original data has twenty dimensions, so the actual decision boundaries probably do not look like the ones above; the true boundary for CatBoost should be far more complex. For visualization, I used the two most important features to see whether I could draw insights from the data. What do you think?