Understanding the data behind the winners of the Nobel Prize.
![](https://static.wixstatic.com/media/nsplsh_29998af9b11c42b6bf32aae2e29f462d~mv2.jpg/v1/fill/w_980,h_653,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/nsplsh_29998af9b11c42b6bf32aae2e29f462d~mv2.jpg)
Introduction
The Nobel Prizes are awarded to individuals who have made landmark contributions in the fields of physics, chemistry, medicine, literature, peace and, added later, economics. The awards are issued each year by committee decision. The data behind them holds some interesting information, such as the ratio of women to men who have received an award, as well as the organisations that have won (organisations are awarded only the peace prize). Moreover, reports claim that in recent years the winners have become more diverse and include people from around the globe.
Consequently, this project will aim to explore the following questions:
Ratio of female to male winners
Countries with the highest number of awards
Average age of the winners
Overview
Before the data can be evaluated, the data and any issues with it need to be examined. For instance, some column names require renaming to improve readability (e.g. from "bornCountry" to "Country_of_birth").
# rename columns
df.rename(columns = {"name": "University", "born": "date_of_birth", "died": "date_of_death", "bornCountry": "Country_of_birth", "bornCountryCode": "Country_code_of_birth", "bornCity": "City_of_birth", "diedCountry": "Country_of_death", "diedCountryCode": "Country_code_of_death", "diedCity": "City_of_death"}, inplace = True)
This dataset consists of 972 entries of Nobel prize winners. The 20 columns pertaining to these individuals include their names, place of birth and the category under which they won the prize. Additional code is required to ensure that the date columns are correctly formatted.
In[4]:# shape of dataframe
df.shape
Out[4]:(972, 20)
In[5]: # information on dataframe
df.info()
Out[5]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 972 entries, 0 to 971
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 972 non-null int64
1 firstname 966 non-null object
2 surname 936 non-null object
3 date_of_birth 946 non-null object
4 date_of_death 946 non-null object
5 Country_of_birth 940 non-null object
6 Country_code_of_birth 940 non-null object
7 City_of_birth 938 non-null object
8 Country_of_death 601 non-null object
9 Country_code_of_death 600 non-null object
10 City_of_death 595 non-null object
11 gender 972 non-null object
12 year 966 non-null float64
13 category 966 non-null object
14 overallMotivation 16 non-null object
15 share 966 non-null float64
16 motivation 878 non-null object
17 University 720 non-null object
18 city 714 non-null object
19 country 714 non-null object
dtypes: float64(2), int64(1), object(17)
memory usage: 159.5+ KB
In[6]:# change format of columns
df["date_of_birth"] = pd.to_datetime(df["date_of_birth"], errors='coerce')
df["date_of_death"] = pd.to_datetime(df["date_of_death"], errors='coerce')
df["year"] = df["year"].fillna(0).astype(int)
However, an overview highlights that some columns have a high number of missing values, as per the graph below. The overall motivation column in particular has an excessive number of missing values, with only 16 non-null entries. As such, this column will be dropped rather than used to draw conclusions.
![](https://static.wixstatic.com/media/9d0c5c_95da5627d9f549a2b6a02547635f636c~mv2.png/v1/fill/w_980,h_430,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/9d0c5c_95da5627d9f549a2b6a02547635f636c~mv2.png)
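The figures behind a missing-values chart like the one above can be obtained in a couple of lines. This is a minimal sketch rather than the project's actual plotting code; the small dataframe is invented purely for illustration, and only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy stand-in for the Nobel dataframe; the values are invented for illustration.
df = pd.DataFrame({
    "firstname": ["Marie", "Albert", None, "Malala"],
    "overallMotivation": [None, None, None, "example text"],
    "category": ["physics", "physics", None, "peace"],
})

# Percentage of missing values per column, highest first.
missing_pct = (df.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Calling `missing_pct.plot(kind="bar")` on the result would give a bar chart of the same information, assuming matplotlib is installed.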
In[8]:# drop column
df = df[[
    #'id',
    'firstname', 'surname', 'date_of_birth', 'date_of_death',
    'Country_of_birth', 'Country_code_of_birth', 'City_of_birth',
    'Country_of_death', 'Country_code_of_death', 'City_of_death',
    'gender', 'year', 'category',
    #'overallMotivation',
    'share', 'motivation', 'University', 'city', 'country'
]].copy()
Among the remaining columns, the worst-affected column has approx. 39% missing values. Missing values also affect the names of the winners. Looking at the entries where both the first name and surname are missing shows that information is missing for all other fields too. Consequently, these entries can be removed without undermining the quality of the dataset. The revised dataset consists of 936 entries, though there still appear to be null values in fields such as country of death.
In[9]:# max % of null values
(df.isnull().sum().max()/df.shape[0])*100
Out[9]:
38.78600823045267
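The removal of entries with no name, described above, might look something like the following. The exact filter used in the project is not shown, so this is only a sketch on invented rows: it checks that the nameless rows are empty everywhere else and then drops them with `dropna`.

```python
import pandas as pd

# Invented rows for illustration; only the column names come from the dataset.
df = pd.DataFrame({
    "firstname": ["Marie", "Red Cross", None, None],
    "surname": ["Curie", None, None, None],
    "category": ["physics", "peace", None, None],
})

# Rows where both name fields are missing.
no_name = df[df["firstname"].isnull() & df["surname"].isnull()]

# Confirm that every other field is also empty for those rows.
all_empty = no_name.drop(columns=["firstname", "surname"]).isnull().all().all()
print(all_empty)

# Drop only the rows where BOTH name fields are missing.
cleaned = df.dropna(subset=["firstname", "surname"], how="all")
print(len(cleaned))
```

With `how="all"`, rows that have a first name but no surname are retained; switching to `how="any"` would remove those as well, so the resulting row count depends on that choice.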
The dataset at this stage includes no rows that are duplicated across all fields. However, the name fields indicate that some names are repeated. Examining the list of duplicated names, there appear to be two groups of individuals: those who received multiple awards and those for whom information has been listed incorrectly. To separate these groups, entries that share both name and category can be singled out. This gives 60 entries where the information appears to be duplicated. To err on the side of caution, these entries will be dropped.
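One way to separate the two groups is pandas' `duplicated` with a `subset` of columns. The sketch below is an assumption about how this could be done, not the project's own code, and the "John Doe" rows are hypothetical.

```python
import pandas as pd

# Illustrative rows; "John Doe" is a hypothetical duplicated listing.
df = pd.DataFrame({
    "firstname": ["Marie", "Marie", "John", "John", "Linus"],
    "surname": ["Curie", "Curie", "Doe", "Doe", "Pauling"],
    "category": ["physics", "chemistry", "medicine", "medicine", "chemistry"],
    "year": [1903, 1911, 1950, 1950, 1954],
})

name_cols = ["firstname", "surname"]

# Every row whose name appears more than once (both groups together).
dup_names = df[df.duplicated(subset=name_cols, keep=False)]

# Same name AND same category: treated here as a suspected duplicate listing.
suspect = df.duplicated(subset=name_cols + ["category"], keep=False)

# keep=False flags every copy, so all suspected rows are dropped.
cleaned = df[~suspect]
print(cleaned)
```

Using `keep=False` rather than the default `keep="first"` removes every copy of a suspected duplicate, which matches the cautious approach described above.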
Gender Ratio
Overall, men dominate the awards significantly.
![](https://static.wixstatic.com/media/9d0c5c_4d8ad86937734d2a8de886a8291ca198~mv2.png/v1/fill/w_980,h_374,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/9d0c5c_4d8ad86937734d2a8de886a8291ca198~mv2.png)
Looking deeper at the categories, this pattern remains, but the imbalance is greater in some categories than in others. For example, physics is overwhelmingly dominated by men, as is economics. Peace, medicine and literature, on the other hand, perform better relative to the other categories, with more female winners.
![](https://static.wixstatic.com/media/9d0c5c_c1e68d75fa8243e6afcbaac05d25d922~mv2.png/v1/fill/w_980,h_374,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/9d0c5c_c1e68d75fa8243e6afcbaac05d25d922~mv2.png)
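The per-category breakdown shown above can be tabulated with `pd.crosstab`. This is a minimal sketch on invented rows, assuming the `category` and `gender` columns from the dataset.

```python
import pandas as pd

# Invented rows for illustration only.
df = pd.DataFrame({
    "category": ["physics", "physics", "physics", "peace", "peace", "literature"],
    "gender": ["male", "male", "male", "female", "male", "female"],
})

# Raw counts of winners by category and gender.
counts = pd.crosstab(df["category"], df["gender"])

# Same table as row percentages, so categories of different sizes are comparable.
share = pd.crosstab(df["category"], df["gender"], normalize="index") * 100
print(counts)
print(share)
```

Normalising by row (`normalize="index"`) is what makes the imbalance comparable across categories, since the categories differ in the number of winners.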
Comparing the gender ratio after 2000 with the ratio before 1950 indicates that the gender balance may not have improved across the board in recent years. In some categories there have been no female winners since 2000, whilst before 1950 there were some. Economics was introduced after the 1950s and it too is dominated by men.
However, the gender balance in areas such as peace and literature appears to have improved greatly. Medicine is also performing better in recent years.
![](https://static.wixstatic.com/media/9d0c5c_01b7546bd39f455795dc7feea315044f~mv2.png/v1/fill/w_980,h_376,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/9d0c5c_01b7546bd39f455795dc7feea315044f~mv2.png)
![](https://static.wixstatic.com/media/9d0c5c_85ae21f74afe4e978ddd58cf76734e8a~mv2.png/v1/fill/w_980,h_348,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/9d0c5c_85ae21f74afe4e978ddd58cf76734e8a~mv2.png)
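A comparison of the two periods could be sketched as below. The period boundaries follow the text (before 1950 and from 2000 onwards); the helper name `period` and the rows themselves are invented for illustration.

```python
import pandas as pd

# Invented rows for illustration only.
df = pd.DataFrame({
    "year": [1903, 1921, 1945, 1975, 2004, 2011, 2010],
    "category": ["physics", "physics", "peace", "peace", "peace", "peace", "physics"],
    "gender": ["female", "male", "male", "male", "female", "female", "male"],
})

def period(year):
    """Label a year with the period it falls in (hypothetical helper)."""
    if year < 1950:
        return "pre-1950"
    if year >= 2000:
        return "2000 onwards"
    return "1950-1999"

df["period"] = df["year"].apply(period)

# Keep only the two periods being compared.
subset = df[df["period"] != "1950-1999"]

# Percentage of female winners per period and category.
female_share = subset.groupby(["period", "category"])["gender"].apply(
    lambda g: (g == "female").mean() * 100
)
print(female_share)
```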
As multiple parties can win an award in any one year, the swarm plot is used to show where the female winners sit within the wider dataset. It indicates that female winners have increased in some categories, namely peace and medicine, whereas others, especially physics, remain very male-dominated.
![](https://static.wixstatic.com/media/9d0c5c_5da5a8eb01cf4914bf2de74f23170d99~mv2.png/v1/fill/w_980,h_367,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/9d0c5c_5da5a8eb01cf4914bf2de74f23170d99~mv2.png)
Diversity is affected by a plethora of social factors. For instance, perceptions of a subject area can discourage women from entering a specific field. This has been a long-running issue for STEM subjects, which include physics and chemistry. Alas, this is not a simple issue that the committees behind the prize can rectify.
Furthermore, to be awarded a prize, individuals must be nominated. Gender biases may have an impact if parties consider the Nobel prize to be a male-centric award and thus fail to nominate leading female figures.
In the worst cases, ideas originated by women may have been developed by men who excluded them, leading to the awards being given to the men. Whilst this may seem unlikely or even harsh, it should be noted that in the UK women have only had the right to vote on equal terms with men since 1928, and not until 1975 were they given the right to open a bank account in their own name.
Lastly, there is also the issue of biases within the committee. These may lead to unconscious biases favouring men, or to women having to meet a higher bar before the award can be given to them.
In all cases, further research will be required and even then, a conclusive factor may not be identified.
Country
The nationality of the winners can highlight the diversity of the Nobel prizes. However, nationality is extremely complex. It can change over the lifetime of an individual as they emigrate or change nationality. In addition, borders are fluid and change over the course of time and with new administrations, for example Macedonia/North Macedonia (CNN, 2019). Therefore the assessment of diversity may be limited.
With this in mind, nationalities can be examined in three ways: by country of birth, by country of death and by the "country" column. These yield the following results:
![](https://static.wixstatic.com/media/9d0c5c_1a337fe468d049218dbaa87cf4bb6323~mv2.png/v1/fill/w_980,h_361,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/9d0c5c_1a337fe468d049218dbaa87cf4bb6323~mv2.png)
![](https://static.wixstatic.com/media/9d0c5c_2e0d2ae8e27749838df4e230865926c0~mv2.png/v1/fill/w_980,h_349,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/9d0c5c_2e0d2ae8e27749838df4e230865926c0~mv2.png)
![](https://static.wixstatic.com/media/9d0c5c_fc2d1d448b024f269e86e2d131e813f1~mv2.png/v1/fill/w_980,h_347,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/9d0c5c_fc2d1d448b024f269e86e2d131e813f1~mv2.png)
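The counts behind three charts of this kind can be produced with `value_counts` on each of the three columns. The sketch below uses invented rows and a hypothetical `top_n` cut-off; it is not the project's plotting code.

```python
import pandas as pd

# Invented rows for illustration; column names follow the renamed dataset.
df = pd.DataFrame({
    "Country_of_birth": ["USA", "Germany", "USA", "France", None],
    "Country_of_death": ["USA", "USA", None, "France", None],
    "country": ["USA", "USA", "United Kingdom", "France", "USA"],
})

top_n = 3  # hypothetical cut-off for how many countries to show

# Most frequent countries per column; value_counts ignores missing values by default.
top = {col: df[col].value_counts().head(top_n) for col in df.columns}

for col, counts in top.items():
    print(col)
    print(counts)
```

Because missing values are skipped, the three rankings are based on different numbers of winners, which is worth keeping in mind when comparing them.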
The USA dominates the rankings in all three graphs, followed by the United Kingdom, Germany and France. One assumption may be that these are the countries with bright individuals who can make landmark discoveries. However, a more realistic assumption would be that these nations are better placed to provide the resources required for discoveries, for example the funding, education and freedom to pursue them. In poorer nations, issues such as poverty, war or restrictive laws may prevent individuals from achieving their maximum potential.
Average Age of winners
A new column must be created to hold the age of each winner at the time of the award, calculated as the award year minus the year of birth. The data is fairly symmetrical, with a skew() score between -0.5 and 0.5. The calculation and the resulting distribution are as follows:
In[26]:# calculate skew
df["age"] = df["year"] - df["date_of_birth"].dt.year
df["age"].skew()
Out[26]:
-0.03358652965359777
![](https://static.wixstatic.com/media/9d0c5c_d6a139a730e744b5941b1de307a6fb32~mv2.png/v1/fill/w_980,h_366,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/9d0c5c_d6a139a730e744b5941b1de307a6fb32~mv2.png)
![](https://static.wixstatic.com/media/9d0c5c_3617002c97e040bcadc8ce1dec944bfa~mv2.png/v1/fill/w_980,h_537,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/9d0c5c_3617002c97e040bcadc8ce1dec944bfa~mv2.png)
Investigating the data further highlights that the distribution varies depending on the category. The peace awards have the widest range of ages. In contrast, economics has a more concentrated age group centred around 70. There are different possible reasons for this, including the academic nature of economics compared to peace and therefore how it is awarded.
![](https://static.wixstatic.com/media/9d0c5c_eff8febf8f4d4e4a8e3be679e97a7f9f~mv2.png/v1/fill/w_980,h_374,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/9d0c5c_eff8febf8f4d4e4a8e3be679e97a7f9f~mv2.png)
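The spread of ages per category can also be summarised numerically. The following sketch reuses the age calculation from In[26] on invented rows; the dates are made up and do not correspond to real winners.

```python
import pandas as pd

# Invented rows for illustration only; dates do not correspond to real winners.
df = pd.DataFrame({
    "category": ["peace", "peace", "peace", "economics", "economics", "economics"],
    "year": [2014, 1964, 1995, 2002, 2007, 2012],
    "date_of_birth": pd.to_datetime([
        "1997-01-01", "1929-01-01", "1908-01-01",
        "1934-01-01", "1937-01-01", "1940-01-01",
    ]),
})

# Age at the time of the award, as in In[26].
df["age"] = df["year"] - df["date_of_birth"].dt.year

# Minimum, median and maximum age per category, plus the range between them.
summary = df.groupby("category")["age"].agg(["min", "median", "max"])
summary["range"] = summary["max"] - summary["min"]
print(summary)
```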
Conclusion
This project examined the Nobel prize awards. As aspects of the project highlight, whilst the data is insightful and can answer specific questions, wider research is required to understand the causes.
That said, the Nobel committees may wish to examine how they award the prizes and whether the winners could be more diverse without undermining the quality of the award, for example by considering a broader array of nominations.
Author's notes
This was a simple and short project. However, after watching Rob Mulla's tutorial on EDA, I wanted to try some of his methods.
For those interested in the tutorial, it can be found here: https://www.youtube.com/watch?v=xi0vhXFPegw&list=WL&index=43
“For the things we have to learn before we can do them, we learn by doing them.” ― Aristotle, The Nicomachean Ethics