What Should Become an Anime?

Agongora
5 min read · Jul 30, 2020

In this article I’m going to walk you through a classification problem I tried to solve using scikit-learn.

As the title suggests, I built a model that takes inputs about a candidate for animation and returns whether it’s a really good, ok, questionable, or bad choice. Here is the dataset we’ll be working with: https://raw.githubusercontent.com/allan-gon/Unit-2-Build/master/unit_2_build/DATA/anime.csv

We’ll start with loading and inspecting the data:
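In code, loading and taking a first look might look like this (a minimal sketch using pandas):

```python
import pandas as pd

# Load the raw MyAnimeList dataset straight from the GitHub URL above
url = ('https://raw.githubusercontent.com/allan-gon/Unit-2-Build/'
       'master/unit_2_build/DATA/anime.csv')
df = pd.read_csv(url)

print(df.shape)  # (14478, 31)
df.head()        # preview the first few rows (in a notebook)
```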

We can see we have a dataset of 14,478 rows and 31 columns, and some of these columns have A LOT of NaN values. Now let’s see what kind of data is in each column and whether there are issues with it.
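One quick way to inspect both the dtypes and the missing values (a sketch):

```python
# Column dtypes plus non-null counts in one view
df.info()

# NaN counts per column, worst offenders first
print(df.isna().sum().sort_values(ascending=False).head(10))
```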

Clearly the data needs some cleaning.

A few things to note: I dropped a lot of columns because they were either leakage or useless, and I binned the high-cardinality features into their top 10 values, with everything else lumped into an ‘other’ category.

This data is from MyAnimeList, which has two ranking systems. ‘Rank’ is a position (lower is better) based on a show’s average rating, and ‘Popularity’ is also a position (lower is better) based on members, i.e. how many people have watched the anime. My target was a show’s average of rank and popularity, which I then binned into four categories (Really Good, Ok, Questionable, Bad). These categories represent 25% chunks of the target, so ‘Really Good’ means my model thought the anime was in the top 25% of anime, and ‘Bad’ means it thought the anime was in the bottom 25%. I use this as a proxy for how well a show did, since I don’t have data on how much all 14k anime made.
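A sketch of how that wrangling could look. The leakage list and the high-cardinality column names (‘studio’, ‘source’) are illustrative guesses based on the description above, not the exact columns from my notebook:

```python
def wrangle(df):
    df = df.copy()

    # Drop leakage columns (the article names members, favourites and score)
    leakage = ['score', 'members', 'favourites']
    df = df.drop(columns=[c for c in leakage if c in df.columns])

    # Keep each high-cardinality feature's top 10 values, lump the rest into 'other'
    for col in ['studio', 'source']:  # hypothetical column names
        top10 = df[col].value_counts().nlargest(10).index
        df[col] = df[col].where(df[col].isin(top10), 'other')

    # Target: average of the two positions (lower is better), cut into
    # quartiles; the lowest (best) quartile gets the 'Really Good' label
    avg_position = (df['Rank'] + df['Popularity']) / 2
    df['target'] = pd.qcut(avg_position, q=4,
                           labels=['Really Good', 'Ok', 'Questionable', 'Bad'])
    return df

df = wrangle(df)
```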

So after some cleaning we get this:

Now the fun part: modeling. Since this is a classification problem, we’ll be trying out these models: Logistic Regression, Random Forest Classifier, and XGBoost Classifier.
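Since the features mix categoricals and numerics, each candidate can be wrapped in a pipeline that imputes and one-hot encodes. This is a sketch of one reasonable setup, not necessarily the exact preprocessing I used:

```python
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier

def build(model):
    """Wrap a classifier in a fresh impute-and-encode pipeline."""
    categorical = make_pipeline(
        SimpleImputer(strategy='most_frequent'),
        OneHotEncoder(handle_unknown='ignore'),
    )
    numeric = SimpleImputer(strategy='median')
    return make_pipeline(
        make_column_transformer(
            (categorical, make_column_selector(dtype_include=object)),
            (numeric, make_column_selector(dtype_include='number')),
            sparse_threshold=0,  # keep the output dense for the SHAP step later
        ),
        model,
    )

models = {
    'Logistic Regression': build(LogisticRegression(max_iter=1000)),
    'Random Forest': build(RandomForestClassifier(random_state=42, n_jobs=-1)),
    'XGBoost': build(XGBClassifier(random_state=42)),
}
```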

We’ll start with the baseline, which for classification problems is the percentage of the target made up by the majority class. For us that’s about 43.68%; this is the accuracy our models must beat to be worthwhile. First we’ll have to split the data into training and validation sets (plus a held-out test set for the final scores). This is done because we want a model that works just as well on data it hasn’t seen (validation data) as on data it has seen (training data). I’ve fit and tuned all the models, along the lines of the sketch below. Now let’s see how they performed.
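Putting the baseline, the split, and the fit/tune steps together (the split sizes and the tuning grid here are just reasonable defaults, not the exact ones I used):

```python
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder

# Baseline: accuracy of always predicting the majority class
baseline = df['target'].value_counts(normalize=True).max()
print(f'Baseline: {baseline:.2%}')  # ~43.68%

# Rank and Popularity define the target, so they can't be features either
X = df.drop(columns=['target', 'Rank', 'Popularity'])
y = LabelEncoder().fit_transform(df['target'])  # XGBoost wants integer classes

# Train / validation / test split (sizes are an assumption)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.2, random_state=42,
    stratify=y_trainval)

# Fit each base model and score it on data it has and hasn't seen
for name, pipe in models.items():
    pipe.fit(X_train, y_train)
    print(name, pipe.score(X_train, y_train), pipe.score(X_val, y_val))

# Hyperparameter tuning, e.g. for the Random Forest (grid is illustrative)
search = RandomizedSearchCV(
    models['Random Forest'],
    param_distributions={
        'randomforestclassifier__n_estimators': [100, 200, 300, 400],
        'randomforestclassifier__max_depth': [10, 20, None],
    },
    n_iter=8, cv=3, n_jobs=-1, random_state=42)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```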

If you look at the legend, the blue bars represent the base models (a model used out of the box), the orange bars are the tuned models (after hyperparameter tuning), and the green bar is the baseline. All models beat the baseline, with Logistic Regression having the lowest accuracy of the models and the XGBoost Classifier having the highest.
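If you want to recreate a chart like that, something along these lines works, using the validation accuracies reported below:

```python
import matplotlib.pyplot as plt
import numpy as np

names = ['Logistic Regression', 'Random Forest', 'XGBoost']
base  = [64.25, 67.05, 68.89]  # validation accuracy, out of the box
tuned = [64.44, 68.31, 68.02]  # validation accuracy after tuning

x = np.arange(len(names))
plt.bar(x - 0.2, base, width=0.4, color='tab:blue', label='Base model')
plt.bar(x + 0.2, tuned, width=0.4, color='tab:orange', label='Tuned model')
plt.axhline(43.68, color='green', label='Baseline')
plt.xticks(x, names)
plt.ylabel('Validation accuracy (%)')
plt.legend()
plt.show()
```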

Here are the actual numbers for clarity’s sake.

So the Logistic Regression had a validation accuracy of 64.25% and a test accuracy of 65.23%. After hyperparameter tuning, the test score surprisingly got (marginally) worse: validation is 64.44% and test is 64.84%. The Random Forest’s validation accuracy was 67.05% and its test accuracy was 66.32%; after hyperparameter tuning, validation was 68.31% and test was 67.42%. Finally, the XGBoost validation accuracy was 68.89% and test was 68.13%; after hyperparameter tuning, validation was 68.02% and test was 68.52%. The XGBoost Classifier had the highest test accuracy.

If you want to visualize what the model did, look at this Shapley force plot. For our model, lower is better (remember, rank and popularity are positions). You read the plot by thinking of it as two forces pushing against each other: the stronger force dictates where the prediction goes. There’s a positive (blue) force that shows us what contributed to the model giving it this high rating. In the example below, having 3 related materials (prequels, sequels, spinoffs, etc.), not being a hentai, and having 52 episodes were the largest positive contributing forces, whereas the red force, in this case the show being of the fantasy/sci-fi genre, contributed negatively.
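Generating a force plot like that with the shap library looks roughly like this (a sketch assuming the pipeline setup from earlier; for a multiclass model, TreeExplainer returns one set of values per class, with exact shapes varying by shap version):

```python
import shap

pipe = models['XGBoost']                    # fitted pipeline from earlier
model = pipe.named_steps['xgbclassifier']   # the underlying XGBoost model
row = pipe[:-1].transform(X_val.iloc[[0]])  # one encoded observation

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(row)    # one array per class

# Force plot for one class: blue forces push the prediction toward that
# class, red forces push away; the stronger side wins
shap.initjs()
shap.force_plot(explainer.expected_value[0], shap_values[0], row)
```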

Takeaways

All models beat the baseline. The best model (XGBoost) beat the baseline by about 25 percentage points, but I was hoping to reach the mid 70s in accuracy. If I were to come back to this project, I would try to scrape data about the source material (manga).

Currently the model is trained on already-existing anime. So what it tries to do is: given what we know about the manga plus our hopes for it (what studio will produce it, how many episodes we can milk out of it), where would it rank if it existed as an anime? Because of this, there were features I couldn’t use to train on, since they would be leakage: members, favourites, score. Now, if I scraped the source material, I could use those features to train on. Think about it like this: I can’t train on score because it comes from already-existing anime, and you would never know the anime’s score for a manga you’re hoping to animate (it doesn’t have one, because it hasn’t been animated yet). You would, however, know the score of the manga itself.

After a few days of failing to scrape MyAnimeList’s manga pages I gave up, but this data might drastically increase accuracy (knowing a manga is popular makes it more likely the anime will be popular too).
