EDA & Linear Regression | YouTube API

mellishamallikage
Jun 5, 2022
8 min read

Updated: Feb 18, 2023

Exploring data from the YouTube API and assessing the relationship between the likes and views of videos.

Introduction

For several years, I have been a fan of the YouTube channel, 2BRO (兄者弟者). The channel consists of three Japanese men hosting gameplays of popular games, particularly FPS and horror games. They currently have over 3 million subscribers and have been active since May 2010.

Recently, after discovering a method for extracting data from the YouTube API, shared by the YouTuber and data scientist Thu Vu, I was intrigued to explore the data related to one of my favourite YouTube channels.

One aspect that particularly caught my attention was the like count and how it relates to other variables such as view count.

Overview

Using the method outlined by Thu Vu, a relatively clean dataset can be obtained. For 2BRO, this entails a dataset of just under 8000 videos as of 28th May 2022. The variables include titles and descriptions as well as length of videos and publication date.

The focus of this project will be on the relationship between likes and other variables. However, prior to modelling the data, an understanding of the data is required. During the EDA, any outliers that may affect the model may also be identified.

As for the relationship, to ensure that the analysis does not simply examine whether the two variables are correlated, the following hypothesis is evaluated.

Null hypothesis - There is no linear relationship between view count (independent variable) and likes (dependent variable).

EDA

Amending variables

As there are no values in favourite counts, this column can be dropped. In addition, as Thu Vu's method includes code to convert the duration from the API's ISO 8601 format (for example, PT47M) to seconds, the original duration column can likewise be dropped. Furthermore, the publication variable includes both date and time. As such, it may be best to separate the data into two variables, with the date being appropriately formatted.
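The two conversions described above could be sketched as follows. This is only an illustration using the standard library, not Thu Vu's code; the function names are hypothetical, and the timestamp format is assumed to be the API's usual UTC form ending in "Z".

```python
import re
from datetime import datetime

# ISO 8601 durations as returned by the API, e.g. "PT1H2M3S" or "P1DT2H".
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_seconds(value):
    """Convert an ISO 8601 duration string into a number of seconds."""
    match = _DURATION.match(value)
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    # Missing components (e.g. no hours in "PT47M") count as zero.
    parts = {name: int(num or 0) for name, num in match.groupdict().items()}
    return (parts["days"] * 86400 + parts["hours"] * 3600
            + parts["minutes"] * 60 + parts["seconds"])


def split_published_at(value):
    """Split a timestamp such as '2022-05-28T09:00:12Z' into (date, time)."""
    stamp = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return stamp.date(), stamp.time()
```

For example, duration_to_seconds("PT47M") gives 2820, matching a 47 minute video.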

Duration

There are a myriad of places to begin examining the variables. However, one aspect that is particularly intriguing to me as a subscriber to the channel is the duration of the videos. 2BRO frequently host extremely long videos and, as a general observer, it is unclear if these are outliers. In addition, YouTube has recently introduced Shorts and these operate differently from typical videos. 2BRO have also recently begun to use this feature.

Shorts

Shorts are clips that are 60 seconds or under. These clips operate in a different manner to typical videos. Consequently, the dataset can be divided into clips of 60 seconds or under and those above 60 seconds. For 2BRO, there are 116 shorts and 7881 videos.
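A minimal sketch of this split, assuming each video is held as a dictionary with a hypothetical duration_secs key (the real column name may differ):

```python
SHORTS_MAX_SECONDS = 60  # threshold stated in the text


def split_shorts(rows):
    """Return (shorts, videos) from rows that carry a 'duration_secs' value."""
    shorts = [r for r in rows if r["duration_secs"] <= SHORTS_MAX_SECONDS]
    videos = [r for r in rows if r["duration_secs"] > SHORTS_MAX_SECONDS]
    return shorts, videos
```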

Video

Examining the videos further, the average video is 47 minutes long. There are also extremely long videos. The maximum runtime for one video is approx. 8 hours.

Radio

In addition to games, 2BRO also has a monthly wrap-up in the form of a radio show. This series currently has 154 episodes, based on the title of the most recent episode.

However, after extracting the data, it appears that the episode numbering does not include special episodes. Including such episodes, there are 157 episodes in total. Interestingly, a few radio episodes do not have tags.

Although the channel does have other one-off special episodes such as holiday specials and reports from gaming shows, as radio shows are a common feature, this is the only series that is excluded. To remove the other types of videos, an in-depth review of the title or tag columns will be needed.

Publication Day Name

The majority of YouTube channels release videos in a consistent manner. For 2BRO, this entails uploading one or more videos every day. That being said, Monday, Tuesday and Thursday are the days on which the channel is most likely not to upload a video.

In its current form, this variable cannot be used in the model. Consequently, it needs to be encoded as a numerical value.
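The post does not say which encoding was used, so the following is only a sketch of two common options: an integer code (Monday = 0) and a one-hot vector. All names here are hypothetical.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def day_name_for(published):
    """Return the English day name for a datetime.date."""
    return DAY_NAMES[published.weekday()]  # weekday(): Monday is 0


def encode_day(name):
    """Integer code for a day name, Monday = 0 ... Sunday = 6."""
    return DAY_NAMES.index(name)


def one_hot_day(name):
    """Seven 0/1 flags with a single 1 at the position of the day."""
    return [1 if day == name else 0 for day in DAY_NAMES]
```

An integer code implies an ordering between days, whereas the one-hot form treats each day as a separate category, which is usually the safer choice for a linear model.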

Definition

Since the creation of the channel in 2010, the technology around videos has transformed significantly. According to this dataset, the majority of the videos are now in HD.

However, there are 72 videos that are in standard definition (SD). These may be dropped as they are now outliers within a dataset where the majority are in HD.

Interestingly, whilst the majority of these videos originate from 2012 and 2013, there is one clip from 2018 which was recorded in SD.

This video is:

Publication date

Following some of the changes, the majority of the videos are now from 2013 onwards. However, there are still videos from 2010 in this dataset. As the number of videos between 2010 and 2013 is limited, it may be best to cap the data to post 2013.

Captions

The majority of the videos do not have captions - only 295 videos appear to do so. However, it is unclear why these videos specifically contain captions. To avoid any unintended distortions, the videos with and without captions will be left as is. That being said, as this binary field may be used in the model, a dummy variable is created for it.
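As a small illustration, the dummy variable could be built as below. The API reports the caption field as the text "true" or "false"; the function name is hypothetical.

```python
def caption_dummy(value):
    """Map the caption field to 1 (has captions) or 0 (no captions)."""
    # str() lets the function accept real booleans as well as text.
    return 1 if str(value).strip().lower() == "true" else 0
```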

Likes

As the graph highlights, likes have a right skew. The majority of the videos have under 10,000 likes but there are some highly viral videos. These extremely high values should be excluded to prevent any unexpected issues when modelling.

Similar issues can be identified with views and comments and should be likewise evaluated and outliers excluded.

After these changes, the dataset now has just under 6500 entries.
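The post does not state which rule was used to exclude the extreme values. One common choice, shown here purely as an assumption, is to drop anything above the upper quartile plus 1.5 times the interquartile range:

```python
import statistics


def upper_fence(values, k=1.5):
    """Upper outlier limit: third quartile + k * interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 + k * (q3 - q1)


def drop_high_outliers(values, k=1.5):
    """Keep only the values at or below the upper fence."""
    limit = upper_fence(values, k)
    return [v for v in values if v <= limit]
```

The same function could be applied in turn to likes, views and comments. Only the high end is trimmed because the skew described above is to the right.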

2BRO is a highly active channel which uploads videos nearly every day, and clips are generally over 30 minutes long. They also favour around 10 tags per video. A typical video secures approx. 2,500 likes and 150 comments from around 360,000 views.

Data Modelling

Assessment of the relationships

As the intention of the project is to examine whether the likes follow a linear relationship with the views, the correlations between variables should be examined.

For likes, the most correlated variable is view count, followed by comments and, to a lesser extent, year. Therefore, there is a relationship between view count and likes.
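The correlations referred to here are assumed to be Pearson coefficients, which is what a typical correlation heatmap shows by default. A minimal sketch of the calculation for one pair of columns:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-movement of the two variables around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 means likes rise with views in a roughly straight line, a value near 0 means no linear association, and a value near -1 means one falls as the other rises.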

Creating a scatter plot of the two variables, likes and view count, presents the following.

As the view count increases, the number of likes increases. However, the variation in the number of likes also increases. Additionally, the scales of the two counts differ significantly and, as such, the data should be scaled. Further alterations to the data may also aid the model and its accuracy.

Feature engineering

The views and likes should already exclude outliers. To handle the right skew which remains in the data, the log transformation of the two variables will be needed. This transforms the relationship to the following:

As the scale of the views continues to differ greatly from that of the likes, the views will need to be scaled. This is to ensure that the scaled feature has a mean of 0.
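The two steps in this section, a log transformation followed by scaling, could be sketched as below. The view counts are made up for illustration, and standardisation (mean 0, standard deviation 1) is assumed to be the form of scaling used.

```python
import math
import statistics


def log_transform(values):
    """Natural log of each value; every value must be greater than zero."""
    return [math.log(v) for v in values]


def standardise(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    return [(v - mean) / spread for v in values]


views = [1_200, 45_000, 360_000, 2_500_000]  # made-up view counts
scaled_log_views = standardise(log_transform(views))
```

A count of zero has no logarithm, so any video with zero likes or views would need to be handled before this step.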

Predicting likes from views

Using the resulting variables, the data can be split into 80% for training and 20% for testing. The predicted values appear to have some variation from the actual values. Examining the regression score, the model has an accuracy rate of 68.5% and an RMSE of 0.543. This suggests that whilst there is a relationship (log-log linear model) between the views and likes, views are not the only factor. Examining the worst predictions, at times the model can predict over 120% of the actual value. The predicted values were also generally higher than the actual values.
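The workflow above can be sketched with the standard library alone. This is an illustration rather than the code behind the figures quoted; it assumes the "regression score" is the coefficient of determination (R²), which is what scikit-learn's LinearRegression.score returns, and all function names are hypothetical.

```python
import math
import random


def split_80_20(pairs, seed=42):
    """Shuffle (x, y) pairs and return (train, test) in an 80/20 split."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


def fit_line(pairs):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def evaluate(pairs, slope, intercept):
    """Return (R squared, RMSE) of the fitted line on the given pairs."""
    actual = [y for _, y in pairs]
    predicted = [slope * x + intercept for x, _ in pairs]
    mean_y = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_y) ** 2 for a in actual)
    return 1 - ss_res / ss_tot, math.sqrt(ss_res / len(actual))
```

Here x would be the scaled log of views and y the log of likes. The line is fitted on the training pairs only and scored on the held-out test pairs.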

Predicting likes from views, year of publication and comments

As the number of comments and, to a lesser extent, the year of publication had a correlation with likes, it may be best to explore adding these variables to the model. As the comment count has a similar distribution to likes and views, it too will need to be transformed.

Under this model, the accuracy rises to as high as 90% and the RMSE is 0.4003.
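With more than one predictor, the least-squares coefficients can be found by solving the normal equations. The sketch below does so with plain Gaussian elimination; it illustrates the technique and is not the code used for the results above. In practice a library such as scikit-learn would normally be used.

```python
def solve(a, b):
    """Solve the linear system a * x = b by Gaussian elimination."""
    n = len(b)
    # Augmented matrix so that row operations act on both sides.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column up.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution from the last row upwards.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - known) / m[row][row]
    return x


def fit_ols(features, target):
    """Return [intercept, coef_1, coef_2, ...] minimising squared error."""
    rows = [[1.0] + list(f) for f in features]  # leading 1 is the intercept
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(p)]
    return solve(xtx, xty)


def predict(coefs, feature_row):
    """Predicted target for one row of feature values."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], feature_row))
```

Each feature row would hold the transformed views, comments and year for one video. If two features are almost perfectly correlated the system becomes ill-conditioned and the coefficients become unstable.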

Evaluating the model

Although this model has been built and appears to be relatively robust with a score of 90% and an RMSE of 0.4, it should be acknowledged that the Ordinary Least Squares (OLS) assumptions may not be fully applicable.

Working through the OLS assumptions, the following can be stated about the model:

Linearity - this assumption holds after the log transformation.

No endogeneity of regressors - as the accuracy is only 90%, there may be missing variables. Other data within the dataset that were not included may improve the model further. There is also a risk that variables which are not accessible affect the number of likes. For instance, whether a YouTuber asks for likes and subscriptions in videos may impact the number of likes a video receives.

Normality and homoscedasticity - the sample is relatively large and there is an intercept. Therefore, the normality assumption should apply. As for homoscedasticity, whilst the log transformation reduced the impact, there was still variation in the spread of the results. This may be the noise/randomness in the number of likes.

No Autocorrelation - As the data is cross-sectional, this should not be an issue.

No Multicollinearity - The inclusion of both comments and views in the model violates this assumption. Views and comments are correlated with one another. More views will lead to more comments, though no user is required to comment. Therefore, the impact of a change in views bleeds through via the inclusion of comments.
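One way to put a number on the multicollinearity concern is the variance inflation factor (VIF). The sketch below covers the simplest case of two predictors, where the VIF reduces to 1 / (1 - r²) with r the correlation between them. This check is not part of the original analysis, and a model with three predictors would need the general form, which regresses each predictor on all of the others.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def vif_two_predictors(xs, ys):
    """VIF shared by both predictors in a two-predictor model."""
    r = pearson(xs, ys)
    return 1 / (1 - r ** 2)
```

A VIF of 1 means the predictors are uncorrelated. Common rules of thumb treat values above roughly 5 or 10 as a sign of problematic multicollinearity, though the cut-off is a convention rather than a strict test.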

Model conforming to OLS

Dropping comments should ensure that the OLS requirements are met, although the full scope of variables may be lacking in the model. However, doing so with no other changes will reduce the effectiveness of the model. Therefore, perhaps another variable could be added.

Revisiting the heatmap, the next variable with the highest correlation to likes is the duration. As such, this variable may be added in place of comments. This leads to a model with a score of 89.6%.

The accuracy has decreased but the model is more robust. Under this model, the predicted values remain higher than the actual values and, in the worst case, the model predicted a like count over 100% higher than the actual figure.

It should be acknowledged that this model still faces issues with multicollinearity, though the effects should be less prominent. Year of publication appears to have a correlation with views. The older the video, the more views it garners. However, under this model the impact of multicollinearity should be more limited than in the previous model, which used comments and views to predict the likes.

Conclusion

It was extremely interesting to discover how consistent 2BRO has been over the past 12 years. It is clear that they have a passion for their job, uploading every day and the majority of the clips lasting approx. 30mins. As technology has shifted throughout the 12 years, the video quality too has moved to HD from SD.

As for the modelling, there is a relationship between the view count and the likes. However, it is not a straightforward relationship. Both the independent and dependent variables required log transformation before modelling. In addition, the model requires the inclusion of other variables to secure a reasonable accuracy rate.

Caution should be exercised during this process as some variables are correlated with one another. Comments alongside years offered the best model. However, the inclusion of comments and views violated one of the key assumptions of OLS - no multicollinearity. The next best predictor of likes is year. This too, however, is correlated with views as the older the video, the more likely it is to have more views, although the correlation between views and years is far weaker than that between comments and views. Therefore, whilst the best model was accurate just under 90% of the time and has an RMSE of 0.406, it still has some issues concerning OLS.

Additionally, the models may be improved through further changes to the data. For instance, could videos such as wrap-ups be excluded? Did the data on specific games differ from others, i.e. does Dead by Daylight gameplay secure more likes than other games? However, another aspect to consider may be that 2BRO fails to convert views to likes for reasons such as not reminding viewers to like the video.

The null hypothesis could not be rejected as without the log transformation the relationship was not linear and other variables were required to adequately model the relationship.