Classifications | Favourite Books

Modelling to identify if a book will be loved by Art Garfunkel.

Introduction

Recently an interesting dataset was uploaded to Kaggle. The Art Garfunkel's Library dataset contains information on over 1300 books that Garfunkel has read since 1968. As someone who also enjoys reading, I was intrigued by Garfunkel's reading habits and particularly curious about whether his enjoyment of a book can be predicted.

Overview

To understand Garfunkel's reading habits and to create a model that can predict whether a specific book read by Garfunkel will be among his favourites, the following steps are required:

Understanding the data
Creating a logistic regression model and a random forest model

Understanding the data

The dataset has 6 columns, which include information on date read, author, books, year published, pages and favourite.

At face value, there appear to be no null values, but two columns, date read and year published, will need to be formatted. However, it is best to examine the dataset in more depth to check for further issues and to gain a better understanding of the data.

Books

There are over 1300 unique titles, and the list includes some rereads, such as "The Confessions". There are also different books that share a title, as exemplified by "Autobiography", one by W. B. Yeats and the other by Benvenuto Cellini.

Furthermore, rereads appear to have been done with different editions. For example, Othello is listed with two different page counts. This is common, as some editions include introductions and notes. Font size also differs between editions, which affects the page count.

This extract of Othello raises some concerns. The year published differs between the two entries. This could mean one of two things.

There is an error or typo in the year published.
Year published refers to the year in which Garfunkel's edition of the book was first published. Different editions of a book can and will have different first publication dates.

Given that the dates listed are 1602 and 1604, it is somewhat unlikely that Garfunkel read these specific editions, even if he owned them. Therefore, these are likely to be errors.

However, rectifying these errors would be extremely complex. Sites such as Amazon typically list the publication date of a specific edition. Likewise, whilst Wikipedia and other sites usually give the initial publication date, there is no guarantee that every book read has a corresponding page.

Consequently, this project will have to proceed with the information as is.

Finally, for rereads, it appears that the books were not marked as favourites on either the initial or the second read. Therefore no alterations are required to the favourites with regard to rereads.
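One way to surface rereads such as Othello, while keeping same-title books by different authors apart, is to group entries on the title and author together. The snippet below is only a minimal sketch using the standard library. The key names (`book`, `author`, `year_published`, `pages`) and the sample rows are assumptions for illustration, and the page counts are invented.

```python
from collections import defaultdict


def find_rereads(rows):
    """Group rows by (title, author) and keep groups that appear more than once."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["book"], row["author"])].append(row)
    return {key: entries for key, entries in groups.items() if len(entries) > 1}


# Illustrative rows only: titles and years echo the text, page counts are made up.
rows = [
    {"book": "Othello", "author": "William Shakespeare", "year_published": 1602, "pages": 150},
    {"book": "Othello", "author": "William Shakespeare", "year_published": 1604, "pages": 310},
    {"book": "Autobiography", "author": "W. B. Yeats", "year_published": None, "pages": None},
    {"book": "Autobiography", "author": "Benvenuto Cellini", "year_published": None, "pages": None},
]

rereads = find_rereads(rows)
for (title, author), entries in rereads.items():
    # Differing years or page counts within a group hint at different editions or typos.
    years = sorted({entry["year_published"] for entry in entries})
    pages = sorted({entry["pages"] for entry in entries})
    print(title, author, years, pages)
```

Grouping on the title alone would wrongly treat the two "Autobiography" entries as a reread, which is why the author is part of the key here.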

Favourites

The favourites variable is a binary column, with 1 for a favourite and 0 for a non-favourite. The majority of the books are not marked as favourites; only just over 10% are singled out.

Although this is reasonable, it will cause issues with the models. A trivial model that labels every book as a non-favourite would achieve an accuracy of just under 90%, so accuracy alone says little about how well favourites are identified.
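The baseline figure follows directly from the class balance. As a rough sketch, the accuracy of always predicting the most common class can be computed as below. The function name and the example labels are illustrative assumptions, not the real data.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Illustrative labels only: 1 favourite in every 10 books.
example_labels = [1] + [0] * 9
print(majority_baseline_accuracy(example_labels))
```

Any model worth keeping has to beat this number, or do noticeably better on the favourites themselves.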

Authors

Garfunkel has read more books by some authors than by others, and for some authors he is more likely to find a favourite. Marcel Proust and L.N. Tolstoy appear to be particularly favoured.

Some books have multiple authors, and this dataset is no exception. A single book can have as many as 25 authors. There are different ways to handle such data. The authors could be split into individuals and modelled accordingly, or treated as a single unit. For the purpose of this project, the variable will be kept in its current form, as the combination of authors is likely to have more of an impact on a book and its enjoyability than any sole author.

Pages

The page counts are right-skewed, as the majority of the books have fewer than 500 pages. Mathematically, books with over 750 pages are considered outliers. However, as books can legitimately have more than 750 pages, such longer books should not be dropped from the dataset. That being said, Garfunkel is more likely to enjoy books of around 250 pages than longer books, which he rarely reads.
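The text does not say which rule produced the 750-page cut-off. A common choice, and the one boxplots use, is the 1.5 x IQR rule, so the sketch below assumes that rule. The page counts are invented for illustration and do not reproduce the 750 figure.

```python
import statistics


def iqr_upper_fence(values, k=1.5):
    """Upper outlier fence: Q3 + k * (Q3 - Q1)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


# Invented page counts with one very long book.
page_counts = [120, 180, 220, 250, 260, 300, 340, 410, 480, 1200]

fence = iqr_upper_fence(page_counts)
outliers = [pages for pages in page_counts if pages > fence]
print(fence, outliers)
```

Flagging is not the same as dropping: the long books stay in the data, and the fence only helps to describe the skew.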

Year Published

Leaving aside the possible errors in the accuracy of the data mentioned previously, there are other issues which need to be resolved before the data can be modelled.

Firstly, two entries contain incorrect values that should be replaced. In addition, some publication dates are BC. There may be various ways to resolve this; the method used here is to convert all BC publication years to negative numbers. This should enable the data to be formatted correctly.
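The BC conversion can be sketched as a small parsing function. The exact way BC years are written in the dataset is not shown here, so the pattern below assumes strings such as "429 BC" and should be adjusted to the real format. Values that do not match come back as `None` so that they can be inspected and replaced by hand.

```python
import re

# Assumed format: up to four digits, optionally followed by "BC" or "BCE".
_YEAR_PATTERN = re.compile(r"^\s*(\d{1,4})\s*(BC|BCE)?\s*$", re.IGNORECASE)


def parse_year(raw):
    """Return the year as an int (negative for BC), or None if it cannot be parsed."""
    match = _YEAR_PATTERN.match(str(raw))
    if match is None:
        return None
    year = int(match.group(1))
    return -year if match.group(2) else year


for raw in ["1869", "429 BC", "n/a"]:
    print(repr(raw), "->", parse_year(raw))
```

Negative years keep the column numeric and preserve chronological order, which is all the models need.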

Examining the distribution, it is clear that the majority of the books Garfunkel has read are centred around the 1800s and 1900s.

The favourite books, relative to publication year, also seem to follow the overall pattern of this variable.

Date Read

Date Read has the most issues. A bird's-eye view indicates that, in general, the variable consists of a month and a year. However, there are several entries where only a year has been included.

Firstly, it is best to split the variable into its key components, year and month, using a dummy value where the month is missing.
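A minimal sketch of this split is shown below. It assumes entries look like either "Jun 1968" or "1968", which may not match every row in the real data, and it uses `None` as the dummy for a missing month.

```python
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
MISSING_MONTH = None  # dummy value for entries that only contain a year


def split_date_read(raw):
    """Split a value such as 'Jun 1968' or '1968' into (year, month)."""
    parts = str(raw).split()
    year = int(parts[-1])  # the year is assumed to be the last token
    if len(parts) < 2:
        return year, MISSING_MONTH
    month = parts[0][:3].title()  # 'September' and 'sep' both become 'Sep'
    return year, month if month in MONTHS else MISSING_MONTH


for raw in ["Jun 1968", "1972", "September 1980"]:
    print(raw, "->", split_date_read(raw))
```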

Year can now be processed easily.

Books read before 2005 were more likely to be listed as favourites.

In contrast, there are still some issues to resolve for month. There are null values in this variable. One option would be to list such books as being read in Dec, the last possible month in which to read a book in a given year. However, this would distort the counts for Dec. With 125 entries lacking a month, which constitutes 9% of the data, this is not ideal.

Consequently, it may be best to build one model that excludes the month column completely and an alternative model that excludes the entries with a missing month.

Modelling Favourites

Before models can be created, the author variable will need to be encoded.
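The encoding method is not specified here, so the following is only one possible approach: a simple label encoding that maps each distinct author string to an integer. The function names and the sample authors are assumptions for illustration.

```python
def fit_label_encoding(values):
    """Map each distinct value to an integer code, assigned in sorted order."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


def encode(values, mapping, unknown=-1):
    """Replace each value by its code; unseen values get the `unknown` code."""
    return [mapping.get(value, unknown) for value in values]


# Illustrative author strings; multi-author entries would stay as one string.
authors = ["Marcel Proust", "L.N. Tolstoy", "Marcel Proust", "William Shakespeare"]

mapping = fit_label_encoding(authors)
codes = encode(authors, mapping)
print(mapping)
print(codes)
```

Integer codes imply an ordering between authors that does not really exist. That matters more for a logistic regression than for a tree-based model, so one-hot encoding is a common alternative for the former.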

Models excluding Month

Following this, the first model which could be made is a logistic regression as the favourite variable is binary.

In the first instance, all variables will be used in the model except the book name, which is too unique, and the month, which still contains null values.

Sadly, the results of this model appear to be no better than the initial baseline of simply classifying all books as non-favourites.

Swapping the logistic regression for a random forest improves some measures, but overall it performs no better than the logistic regression model.
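To see how a model can improve on some measures without beating the baseline on accuracy, it helps to compute precision and recall alongside accuracy. The sketch below uses invented labels and predictions, not the actual model outputs.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Invented example: 2 favourites among 20 books.
y_true = [1, 1] + [0] * 18
baseline_pred = [0] * 20               # always predicts "not a favourite"
model_pred = [1, 0, 1] + [0] * 17      # finds one favourite, raises one false alarm

baseline = binary_metrics(y_true, baseline_pred)
model = binary_metrics(y_true, model_pred)
print(baseline)
print(model)
```

In this made-up case both sets of predictions have the same accuracy, but only the second one ever identifies a favourite.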

Models including Month

In order for the month variable to be used, the entries with a null month will need to be dropped.

Doing so and running the two models shows a decrease in the accuracy of the models. This may be due to the change in the composition of the data: the entries with missing months also included some favourite books, so dropping them exacerbates the skew in the dataset towards non-favourite books.
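A quick way to check this explanation is to compare the share of favourites before and after dropping the entries without a month. The rows below are invented, and the key names `favourite` and `month` are assumptions.

```python
def favourite_share(rows):
    """Fraction of rows marked as a favourite."""
    return sum(row["favourite"] for row in rows) / len(rows)


# Invented rows: 2 favourites in 10 books, one favourite has no month recorded.
rows = (
    [{"favourite": 1, "month": "Jan"}]
    + [{"favourite": 1, "month": None}]
    + [{"favourite": 0, "month": "Mar"}] * 7
    + [{"favourite": 0, "month": None}]
)

with_month = [row for row in rows if row["month"] is not None]

share_before = favourite_share(rows)
share_after = favourite_share(with_month)
print(share_before, share_after)
```

If the share of favourites falls after the drop, as it does in this toy example, the class imbalance has become worse for the second set of models.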

Conclusion

It was fascinating to understand Garfunkel's reading habits and how widely read he is. I hope that one day I will have read as many books as Garfunkel.

Unfortunately, modelling whether a book will be regarded as a favourite did not lead to the results sought. None of the models considered performed better than the base model of classifying all books as non-favourites.

It may be that more complex models, such as deep learning, would yield better results. Books also have more key attributes than those included in this dataset. For instance, Garfunkel may have a preference for a specific genre.

Leaving aside the success of the models, this project does highlight the importance of understanding the possible limitations of models.