Cluster Analysis | Reading list

mellishamallikage
May 17, 2022
4 min read

Updated: Aug 29, 2022

Understanding patterns in the Goodreads reading list using K-Means clustering.

Introduction

In a world where there are countless books and it is near impossible for any one individual to read all of them, reading lists are a common feature. Such lists can help a reader find books that they may enjoy or, in an academic context, acquire the relevant knowledge.

Some such lists can be viewed on sites such as Goodreads, "the world’s largest site for readers and book recommendations" (Goodreads, 2022). These lists can be extensive, so it is intriguing to examine whether there are patterns in a list and, if the list is particularly long, how an individual should approach it.

Overview

Some lists on Goodreads are self-explanatory, classifying books by theme or genre. However, lists such as "Books That Everyone Should Read At Least Once" can be broader and therefore more mysterious. The list has over 26,000 books with a simple description: "Books that encourage thought".

As these lists are online, the data needs to be scraped. The relevant data was scraped using a modified version of the code from mee-kell (see modified code).

The scraped data requires some processing. It can then be evaluated to locate any patterns in the data. As for how to approach the list, clustering may offer some insights.

Data Pre-processing

Assessing the data indicates that there are some issues. Some of the numerical values include text, which prevents them from being treated as numbers. Whilst these errors could be fixed manually, given that the data consists of approx. 4900 books, dropping the affected rows may be the most efficient approach. The dataset is large enough that the drop should not undermine it significantly.

This enables the numerical variables to be converted. As avg_ratings can contain decimal places, it can be converted to a float, whilst total_ratings must be an integer as decimal places should not be possible.
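The drop-and-convert step described above could be sketched as follows. This is a minimal illustration rather than the project's actual code: it assumes pandas, the sample rows are made up, and the helper name clean_numeric is hypothetical. Only the column names avg_ratings and total_ratings come from the text.

```python
import pandas as pd

# Made-up sample rows; two of them contain text in a numeric field.
raw = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "author": ["Author 1", "Author 2", "Author 3", "Author 4"],
    "avg_ratings": ["4.25", "3.90", "really liked it 4.10", "4.02"],
    "total_ratings": ["1523", "88", "402", "n/a"],
})


def clean_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows whose rating fields are not purely numeric, then fix dtypes."""
    out = df.copy()
    # errors="coerce" turns any value that cannot be parsed into NaN.
    out["avg_ratings"] = pd.to_numeric(out["avg_ratings"], errors="coerce")
    out["total_ratings"] = pd.to_numeric(out["total_ratings"], errors="coerce")
    # Drop the rows that failed to parse instead of repairing them by hand.
    out = out.dropna(subset=["avg_ratings", "total_ratings"])
    out["avg_ratings"] = out["avg_ratings"].astype(float)
    out["total_ratings"] = out["total_ratings"].astype(int)
    return out.reset_index(drop=True)


cleaned = clean_numeric(raw)
print(cleaned.dtypes)
print(len(raw) - len(cleaned), "rows dropped")
```

Coercing and dropping trades a small amount of data for a much simpler pipeline, which matches the reasoning above that the dataset is large enough to absorb the loss.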

As the lists are maintained by users, there may be other hidden errors. One such issue is duplicates. It is unlikely that any contributor can recall over 5000 titles and ensure that a new book added to the list is not already included. Authors can publish more than one book and titles can overlap. However, the likelihood of an author publishing two books under the same title is relatively low, so entries sharing both title and author can be treated as duplicates. There may be variations in the edition but that may not be a significant issue. For any duplicates, the avg_ratings and total_ratings can differ, so different approaches can be taken in deciding which entry to retain.

For the purpose of this project the first entry will be retained to favour the user who first added the book to the list.
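A minimal sketch of this rule, assuming pandas and treating rows with the same title and author as duplicates. The rows are made up and the column names title and author are assumptions about the scraped data.

```python
import pandas as pd

# Made-up rows: "Title A" by "Author 1" was added to the list twice.
books = pd.DataFrame({
    "title": ["Title A", "Title B", "Title A", "Title A"],
    "author": ["Author 1", "Author 2", "Author 1", "Author 3"],
    "avg_ratings": [4.10, 3.80, 4.20, 3.50],
    "total_ratings": [1000, 250, 40, 75],
})

# keep="first" retains the earliest entry for each (title, author) pair.
deduped = books.drop_duplicates(subset=["title", "author"], keep="first")
deduped = deduped.reset_index(drop=True)
print(deduped)
```

Note that "Title A" by "Author 3" survives, since a shared title alone is not treated as a duplicate.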

The end result is a dataset as follows:

The dataset is still large with over 4700 entries.

Patterns in the data

As the data is limited, the patterns which can be identified may also be limited. That being said, it appears that the most popular authors are Stephen King, William Shakespeare and Terry Pratchett. There are also 18 books where the author is listed as unknown (Anonymous). This is a relatively large number, given that Anonymous appears 5th in the list of popular authors.

Goodreads' rating system enables users to rate a book between 1 and 5. As such, the average ratings also fall between 1 and 5. Given the nature of the list, the majority of the data is grouped around 3-5. Any book with an extremely low rating is likely to be an outlier.

As for total ratings, due to the nature of the data the distribution is skewed to the right. However, there are suspicious entries, as at least one book has no ratings.
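An author ranking of this kind can be produced by counting rows per author. The sketch below assumes pandas and an author column. The values are placeholders, not the real list.

```python
import pandas as pd

# Placeholder authors; the real dataset has thousands of rows.
books = pd.DataFrame({
    "author": ["Author A", "Anonymous", "Author B",
               "Author A", "Anonymous", "Author A"],
})

# value_counts sorts authors from most to least frequent.
author_counts = books["author"].value_counts()
top_authors = author_counts.head(3)

# 1-based position of "Anonymous" in the ranking.
anonymous_rank = author_counts.index.get_loc("Anonymous") + 1
print(top_authors)
print("Anonymous rank:", anonymous_rank)
```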

Looking for clusters

Before clusters in the data can be identified, the anomalies identified in the previous section should be resolved. Therefore:

Avg_ratings - extreme outliers should be dropped as these may affect the clustering.
Total_ratings - the suspicious entries (such as books with no ratings) should likewise be dropped.
Anonymous - open to interpretation, as there are books published without an author name but they are not the norm.

Therefore, in the first instance, avg_ratings and total_ratings should be revised.
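The text does not state which thresholds were used, so the following is only one possible way to revise the two variables. It assumes pandas, uses the common 1.5 × IQR rule for avg_ratings, and drops books with zero ratings. The function name drop_rating_anomalies, the parameter k and the sample values are all hypothetical.

```python
import pandas as pd

# Made-up values: two books have implausibly low ratings.
books = pd.DataFrame({
    "avg_ratings": [4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.4, 4.0, 1.2, 0.0],
    "total_ratings": [5000, 120, 98000, 40, 770, 310, 15, 2200, 3, 0],
})


def drop_rating_anomalies(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Keep rows inside the k * IQR fences for avg_ratings with at least one rating."""
    q1 = df["avg_ratings"].quantile(0.25)
    q3 = df["avg_ratings"].quantile(0.75)
    iqr = q3 - q1
    in_range = df["avg_ratings"].between(q1 - k * iqr, q3 + k * iqr)
    has_ratings = df["total_ratings"] > 0
    return df[in_range & has_ratings].reset_index(drop=True)


revised = drop_rating_anomalies(books)
print(len(books), "->", len(revised), "rows")
```

A stricter or looser fence (a different k) changes how many books are removed, so the choice is worth checking against a histogram of the ratings.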

Examining the relationship between the number of ratings and the average rating does not, at face value, reveal clear groups. As the total number of ratings increases, the avg_ratings centres around 4.

Clustering

A simple clustering with 4 clusters appears to divide the data by the total number of ratings.

This is not particularly enlightening.

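This may be due to the difference in scale between total_ratings and avg_ratings. K-means assigns points by Euclidean distance, so a variable measured in the thousands dominates one that only ranges between 1 and 5.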
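The effect can be reproduced on synthetic data. The sketch below assumes scikit-learn's KMeans and generates made-up ratings (roughly normal average ratings and right-skewed total ratings), so it illustrates the scale problem rather than the project's real results.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the book data.
rng = np.random.default_rng(0)
n_books = 300
avg_ratings = rng.normal(4.0, 0.3, n_books).clip(1, 5)
total_ratings = rng.lognormal(9, 1.5, n_books)

# Cluster on the raw, unscaled values.
X = np.column_stack([total_ratings, avg_ratings])
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Each cluster covers its own band of total_ratings, while the mean
# avg_ratings barely differs from cluster to cluster.
for c in range(4):
    members = X[labels == c]
    print(c, len(members),
          round(members[:, 0].min()), round(members[:, 0].max()),
          round(members[:, 1].mean(), 2))
```

Because the clusters are effectively slices along the total_ratings axis, avg_ratings contributes almost nothing to the grouping.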

Modifying clustering

To gain a better understanding of the clusters, total_ratings and avg_ratings should be standardised. This should remove the scale difference between the two variables.

Subsequently, the elbow method suggests that there are three clusters in the data.
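Standardising here means rescaling each variable to a mean of 0 and a standard deviation of 1. A minimal sketch, assuming scikit-learn's StandardScaler and synthetic data in place of the real list:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the book data.
rng = np.random.default_rng(0)
n_books = 300
avg_ratings = rng.normal(4.0, 0.3, n_books).clip(1, 5)
total_ratings = rng.lognormal(9, 1.5, n_books)

X = np.column_stack([avg_ratings, total_ratings])
print("raw std:", X.std(axis=0))

# Subtract each column's mean and divide by its standard deviation.
scaled = StandardScaler().fit_transform(X)
print("scaled mean:", scaled.mean(axis=0).round(6))
print("scaled std:", scaled.std(axis=0).round(6))
```

After scaling, a one-unit difference means "one standard deviation" for both variables, so neither dominates the distance calculation.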
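The elbow method fits K-means for a range of cluster counts and looks for the point where the within-cluster sum of squares (inertia) stops falling sharply. The sketch below assumes scikit-learn and synthetic data, so the elbow it shows need not fall at three as it did for the real list.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the book data.
rng = np.random.default_rng(0)
n_books = 300
avg_ratings = rng.normal(4.0, 0.3, n_books).clip(1, 5)
total_ratings = rng.lognormal(9, 1.5, n_books)
scaled = StandardScaler().fit_transform(
    np.column_stack([avg_ratings, total_ratings])
)

# Record the inertia for k = 1..8.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

Plotting inertia against k makes the bend easier to spot than reading the numbers, and the choice of k remains a judgment call.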

Looking at the results visually, it appears that the resulting clusters are:

Relatively low average rating and total rating books.
High average rating but low to average total rating books.
High total rating and average average rating.
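Cluster descriptions like these can be read off a per-cluster summary as well as a plot. The following sketch assumes scikit-learn and pandas, fits three clusters on standardised synthetic data, and then reports the cluster means in the original units. The resulting groups will not match the real list.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned book data.
rng = np.random.default_rng(1)
books = pd.DataFrame({
    "avg_ratings": rng.normal(4.0, 0.3, 300).clip(1, 5),
    "total_ratings": rng.lognormal(9, 1.5, 300).round().astype(int),
})

features = ["avg_ratings", "total_ratings"]
scaled = StandardScaler().fit_transform(books[features])

# Fit on the standardised values, but keep the labels next to the raw data.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
books["cluster"] = kmeans.fit_predict(scaled)

# Mean rating and mean number of ratings per cluster, in original units.
summary = books.groupby("cluster")[features].mean()
sizes = books["cluster"].value_counts().sort_index()
print(summary.round(2))
print(sizes)
```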

For a reader, these clusters have the following implications:

High average rating but low to average total rating books

Books in this group may be ideal for those who are new to reading, as they are more likely to prove enjoyable.

High total rating and average average rating

These may be hyped books and, whilst there is a risk that the book may not be enjoyable, there may be a robust community/discussion centred on these books.

Relatively low average rating and total rating books

These may be riskier books as there is a higher chance of the book being a flop. As such, this group may be better suited to experienced/avid readers.

Conclusion

This was a relatively short project examining the Goodreads reading list "Books That Everyone Should Read At Least Once". Although the list is extensive (over 4700 books), its length as well as some questionable entries (books with extremely low average and total ratings) may reduce the usability of the list for a reader.

Although clusters within the dataset were identified, they are not extremely robust. As such, whilst the reader may find value in the clusters, they should conduct additional research before committing to a book if they are unsure.