EDA | Nobel Prize Winners

mellishamallikage
Nov 7, 2022
Updated: Dec 10, 2022

Understanding the data behind the winners of the Nobel prize.

Introduction

The Nobel Prizes are awarded to individuals who have made landmark contributions in the fields of physics, chemistry, medicine, literature, peace and, added later, economics. The awards are issued each year by committee decision. The data behind the awards holds some interesting information, such as the ratio of women to men who have received an award, as well as the organisations that have won (organisations are awarded only the peace prize). Moreover, reports claim that in recent years the winners have become more diverse and include people from around the globe.

Consequently, this project will aim to explore the following questions:

The ratio of female to male winners
The countries with the highest number of awards
The average age of the winners

Overview

Before the data can be evaluated, its structure and any issues need to be examined. For instance, some column names require renaming to improve readability (e.g. from "bornCountry" to "Country_of_birth").

# rename columns
df.rename(columns={
    "name": "University",
    "born": "date_of_birth",
    "died": "date_of_death",
    "bornCountry": "Country_of_birth",
    "bornCountryCode": "Country_code_of_birth",
    "bornCity": "City_of_birth",
    "diedCountry": "Country_of_death",
    "diedCountryCode": "Country_code_of_death",
    "diedCity": "City_of_death",
}, inplace=True)

This dataset consists of 972 entries of Nobel prize winners. Its 20 columns include the winners' names, place of birth and the category under which they won the prize. Some additional code is required to ensure that the date columns are correctly formatted.

In[4]:# shape of dataframe
      df.shape
Out[4]:(972, 20)

In[5]: # information on dataframe 
        df.info()

Out[5]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 972 entries, 0 to 971
Data columns (total 20 columns):
#   Column                 Non-Null Count  Dtype   
---  ------                 --------------  -----    
0   id                     972 non-null    int64    
1   firstname              966 non-null    object   
2   surname                936 non-null    object   
3   date_of_birth          946 non-null    object   
4   date_of_death          946 non-null    object   
5   Country_of_birth       940 non-null    object   
6   Country_code_of_birth  940 non-null    object   
7   City_of_birth          938 non-null    object   
8   Country_of_death       601 non-null    object   
9   Country_code_of_death  600 non-null    object   
10  City_of_death          595 non-null    object   
11  gender                 972 non-null    object   
12  year                   966 non-null    float64  
13  category               966 non-null    object   
14  overallMotivation      16 non-null     object   
15  share                  966 non-null    float64  
16  motivation             878 non-null    object   
17  University             720 non-null    object   
18  city                   714 non-null    object   
19  country                714 non-null    object  
dtypes: float64(2), int64(1), object(17)
memory usage: 159.5+ KB

In[6]:# change format of columns
df["date_of_birth"] = pd.to_datetime(df["date_of_birth"], errors='coerce')
df["date_of_death"] = pd.to_datetime(df["date_of_death"], errors='coerce')

df["year"] = df["year"].fillna(0).astype(int)
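
To illustrate what the conversion above does to problematic values, here is a minimal sketch on a small made-up frame. The sample values are hypothetical and not taken from the Nobel dataset; only the column names and the pandas calls mirror the code above.

```python
import pandas as pd

# Hypothetical sample values for illustration only (not from the dataset).
sample = pd.DataFrame({
    "date_of_birth": ["1867-11-07", "0000-00-00", None],
    "year": [1903.0, 1911.0, None],
})

# errors="coerce" turns unparseable or missing dates into NaT
# instead of raising an exception.
sample["date_of_birth"] = pd.to_datetime(sample["date_of_birth"], errors="coerce")

# Missing years are replaced by 0 so the column can be stored as integers.
sample["year"] = sample["year"].fillna(0).astype(int)

print(sample.dtypes)
print(sample)
```

One consequence worth keeping in mind is that a year of 0 is only a placeholder, so such rows should be excluded before any arithmetic on the year column.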

However, an overview highlights that there is a high number of missing values in some columns, as per the graph below. The overallMotivation column in particular has an excessive number of missing values (only 16 of its 972 entries are populated). As such, this column will be dropped rather than used to draw conclusions.
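
The share of missing values per column can also be checked numerically. The following is a minimal sketch on a toy frame with made-up values; the 50% cut-off used to pick columns to drop is an assumption for illustration, not a threshold stated in this project.

```python
import pandas as pd

# Toy data for illustration only (not from the dataset).
toy = pd.DataFrame({
    "firstname": ["A", "B", None, "D"],
    "overallMotivation": [None, None, None, "text"],
    "category": ["physics", "peace", None, "chemistry"],
})

# Percentage of missing values in each column, worst first.
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)

# Assumed threshold: drop columns with more than 50% missing values.
sparse_cols = missing_pct[missing_pct > 50].index.tolist()
trimmed = toy.drop(columns=sparse_cols)
print(trimmed.columns.tolist())
```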

In[8]:# drop columns (commented-out names are excluded)
df = df[[
    # 'id',
    'firstname', 'surname', 'date_of_birth', 'date_of_death',
    'Country_of_birth', 'Country_code_of_birth', 'City_of_birth',
    'Country_of_death', 'Country_code_of_death', 'City_of_death',
    'gender', 'year', 'category',
    # 'overallMotivation',
    'share', 'motivation', 'University', 'city', 'country',
]].copy()

Among the remaining columns, the worst-affected column is missing approx. 39% of its values. Missing values also affect the names of the winners. Looking at the entries where both the first name and the surname are missing shows that information is missing for all other fields as well. These entries can therefore be removed without undermining the quality of the dataset. The revised dataset consists of 936 entries, though null values remain in fields such as Country_of_death.

In[9]:# max % of null values
 (df.isnull().sum().max()/df.shape[0])*100
Out[9]:
38.78600823045267
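
The removal of nameless entries described above could be sketched as follows. This is an illustration on made-up rows, not the exact code used for this project, and the helper names (both_missing, other_cols, cleaned) are hypothetical.

```python
import pandas as pd

# Toy data for illustration only (not from the dataset).
toy = pd.DataFrame({
    "firstname": ["A", None, "Org X", None],
    "surname": ["One", None, None, None],
    "category": ["physics", None, "peace", None],
})

# Rows where both the first name and the surname are missing.
both_missing = toy["firstname"].isna() & toy["surname"].isna()

# Check that those rows carry no information in the other columns either.
other_cols = toy.columns.difference(["firstname", "surname"])
all_empty = bool(toy.loc[both_missing, other_cols].isna().all().all())
print(all_empty)

# Keep only rows with at least one name field populated.
cleaned = toy.loc[~both_missing].copy()
print(len(cleaned))
```

Note that a row with a first name but no surname (as in the "Org X" row here) is kept by this condition.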

At this stage the dataset contains no rows that are duplicated across all fields. However, fields such as the names indicate that there are duplicated values. Examining the list of duplicated names, there appear to be two groups of individuals: those who received multiple awards and those for whom information has been listed incorrectly. To separate these groups, the entries duplicated on both name and category can be singled out. This gives 60 entries where the information appears to be duplicated. To err on the side of caution, these values will be dropped.
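
The separation of the two groups can be sketched with DataFrame.duplicated. This is a minimal illustration on made-up rows, assuming that "name" means the firstname and surname columns together; it is not necessarily the exact code used here.

```python
import pandas as pd

# Toy data for illustration only (not from the dataset).
toy = pd.DataFrame({
    "firstname": ["A", "A", "B", "B", "C"],
    "surname": ["One", "One", "Two", "Two", "Three"],
    "category": ["physics", "chemistry", "peace", "peace", "medicine"],
    "year": [1903, 1911, 1950, 1950, 1960],
})

name_cols = ["firstname", "surname"]

# Every row whose name appears more than once, whatever the category.
repeated_names = toy[toy.duplicated(subset=name_cols, keep=False)]

# Rows repeated on name AND category are treated as suspect entries.
suspect = toy.duplicated(subset=name_cols + ["category"], keep=False)

# Drop the suspect rows; same name in different categories is kept.
deduped = toy.loc[~suspect].copy()
print(len(repeated_names), int(suspect.sum()), len(deduped))
```

A winner who genuinely received two awards in the same category would also be flagged by this rule, which is the price of the cautious approach.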

Gender Ratio

Overall, men dominate the awards significantly.

Looking deeper at the categories, this pattern remains, but the imbalance appears to be greater in some categories than others. For example, physics is extremely dominated by men, as is economics. Peace, medicine and literature, on the other hand, perform better relative to the other categories, with more female winners.
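
The overall and per-category ratios can be computed along the following lines. The rows are made up, and the gender labels "male" and "female" are an assumption about how the column is coded.

```python
import pandas as pd

# Toy data for illustration only (not from the dataset).
toy = pd.DataFrame({
    "gender": ["male", "male", "female", "male", "female", "male"],
    "category": ["physics", "physics", "peace", "peace", "literature", "economics"],
})

# Overall share of each gender among the winners.
overall_share = toy["gender"].value_counts(normalize=True)
print(overall_share)

# Share of each gender within each category (each row sums to 1).
by_category = pd.crosstab(toy["category"], toy["gender"], normalize="index")
print(by_category)
```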

Examining how the gender ratio changed between the pre-1950s and post-2000s periods indicates that the gender balance may not have improved in recent years in every category. In some categories there have been no female winners since the 2000s, whilst before the 1950s there were some. The economics prize was introduced after the 1950s and it too is dominated by men.

However, the gender balance in areas such as peace and literature appears to have improved greatly. Medicine is also performing better in recent years.
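
A period comparison of this kind can be sketched by labelling each award year and counting female winners per period and category. The rows are made up, and the exact boundaries (before 1950, from 2000 onwards) are an assumption about how the periods were defined.

```python
import pandas as pd

# Toy data for illustration only (not from the dataset).
toy = pd.DataFrame({
    "year": [1903, 1911, 1935, 1975, 2004, 2009, 2011],
    "category": ["physics", "chemistry", "chemistry", "peace",
                 "peace", "literature", "physics"],
    "gender": ["female", "female", "female", "male",
               "female", "female", "male"],
})

def label_period(year):
    """Assign an award year to one of three assumed periods."""
    if year < 1950:
        return "pre-1950"
    if year >= 2000:
        return "2000 onwards"
    return "1950-1999"

toy["period"] = toy["year"].apply(label_period)

# Number of female winners per period and category (0 where none).
female_counts = (
    toy[toy["gender"] == "female"]
    .groupby(["period", "category"])
    .size()
    .unstack(fill_value=0)
)
print(female_counts)
```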

As multiple parties can win an award at any one time, a swarm plot is used to highlight where the female winners sit within the dataset as a whole. This indicates that female winners have increased in some categories, namely peace and medicine. Others, especially physics, remain very male focused.

Diversity is affected by a plethora of social factors. For instance, the perception of a subject area may discourage females from entering a specific field. This has been a long-running issue for STEM subjects, which include physics and chemistry. Alas, this is not a simple issue that the committees behind the prize can rectify.

Furthermore, to be awarded a prize, individuals must be nominated. Gender biases may have an impact if parties consider the Nobel prize to be a male-centric award and thus fail to nominate leading female figures.

In the worst cases, ideas originated by females may have been developed by males who excluded them, leading to the awards being given to the males. Whilst this may seem unlikely or even harsh, it should be noted that in the UK women have only had the vote on equal terms with men since 1928, and it was not until 1975 that they were guaranteed the right to open a bank account in their own name.

Lastly, there is also the issue of biases within the committee. This may lead to unconscious biases favouring males or requiring females to meet a higher bar before the award can be given to them.

In all cases, further research will be required and even then, a conclusive factor may not be identified.

Country

The nationality of the winners can highlight the diversity of the Nobel prizes. However, nationality is extremely complex. It can change over the lifetime of an individual as they emigrate or change nationality. In addition, borders are fluid and change over time and with new administrations, for example Macedonia/North Macedonia (CNN, 2019). Therefore the assessment of diversity may be limited.

With this in mind, nationalities can be examined in three ways: by country of birth, by country of death and by the separate country column. These yield the following results:

The USA dominates the rankings in all three graphs, followed by the United Kingdom, Germany and France. One assumption may be that these are the countries with bright individuals who can make landmark discoveries. However, a more realistic assumption is that these nations are better placed to provide the resources required for discoveries, for example the funding, education and freedom to pursue them. In poorer nations, issues such as poverty, war or restrictive laws may prevent individuals from achieving their maximum potential.
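
The three rankings can be produced with value_counts on each country column. This is a minimal sketch on made-up rows; the column names follow the renamed dataset, while the choice of the top 3 is only for illustration.

```python
import pandas as pd

# Toy data for illustration only (not from the dataset).
toy = pd.DataFrame({
    "Country_of_birth": ["USA", "USA", "Germany", "France", None],
    "Country_of_death": ["USA", None, "USA", "France", None],
    "country": ["USA", "USA", "United Kingdom", None, None],
})

# Top countries per column; value_counts ignores missing values by default.
top_by_column = {
    col: toy[col].value_counts().head(3)
    for col in ["Country_of_birth", "Country_of_death", "country"]
}

for col, counts in top_by_column.items():
    print(col)
    print(counts)
```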

Average Age of Winners

A new column must be created to calculate the age of the winners, taken here as the award year minus the year of birth. The data is fairly symmetrical, with a skew() score between -0.5 and 0.5. Graphically the distribution is as follows:

In[26]:# calculate skew
df["age"] = df["year"] - df["date_of_birth"].dt.year
df["age"].skew()
Out[26]:
-0.03358652965359777

Investigating the data further highlights that the distribution varies depending on the category. The peace awards have the widest range of ages. In contrast, economics has a more concentrated age group centred around 70. There are different possible reasons for this distribution, including the academic nature of economics compared to peace and therefore how it is awarded.
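
The spread of ages per category can be summarised with a groupby. The sketch below uses made-up rows and the same age definition as above (award year minus birth year); the summary statistics chosen are an assumption for illustration.

```python
import pandas as pd

# Toy data for illustration only (not from the dataset).
toy = pd.DataFrame({
    "year": [1903, 1950, 2014, 1990, 2005],
    "date_of_birth": pd.to_datetime(
        ["1867-11-07", "1900-01-01", "1997-07-12", "1920-03-01", "1935-06-15"]
    ),
    "category": ["physics", "peace", "peace", "economics", "economics"],
})

# Approximate age at the time of the award.
toy["age"] = toy["year"] - toy["date_of_birth"].dt.year

# Minimum, median and maximum age per category, plus the range.
age_summary = toy.groupby("category")["age"].agg(["min", "median", "max"])
age_summary["range"] = age_summary["max"] - age_summary["min"]
print(age_summary)
```

Because only the years are subtracted, the age can be off by one depending on whether the birthday falls before or after the award date.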

Conclusion

This project examined the Nobel prize awards. As aspects of the project highlight, whilst the data is insightful and can answer specific questions, wider research is required to understand the causes.

That said, the Nobel committees may wish to examine how they award the prizes and whether the winners could be more diverse without undermining the quality of the award, for example by examining a broader array of nominations.