Study into the use of unsupervised learning techniques to identify key students of interest.
![](https://static.wixstatic.com/media/nsplsh_51694c5051655153584430~mv2.jpg/v1/fill/w_980,h_653,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/nsplsh_51694c5051655153584430~mv2.jpg)
Introduction
Initially introduced to assess the abilities of students (Cambridge, 2008), exams have come to serve a wider purpose in recent years, encompassing the assessment of the teaching quality provided by teachers and schools. From the perspective of students, exams are a frequent point of discussion, raising issues that range from their utility in assessing students' abilities (OpenDoor, 2021) to grade inflation (The Observer, 2021).
In addition, as Charles Wheelan highlights in his book Naked Statistics, the use of exams to assess the performance of teachers and schools has led to conflicts of interest and collusion as teachers strive to ensure students achieve high exam scores. This is perhaps best exemplified by the investigation into the Louisiana prep school (The New York Times, 2018).
Leaving aside the other issues concerning exams and their effectiveness in assessing a student's capability, such collusion by teachers must be identified and investigated, with action taken if it is proven. There are a myriad of ways this can be done, including relying on whistle-blowers to notify authorities of such conduct. However, machine learning can also be used to flag questionable exam results, for example through unsupervised learning or anomaly detection, specifically KMeans and Isolation Forest.
Using an exam results dataset available on Kaggle (Seshapanpu, 2019), this project aims to identify anomalies. It should be emphasised that these methods only highlight candidates whose results appear anomalous. An anomaly does not automatically correspond to improper conduct; rather, the aim is to highlight groups that require further investigation, cutting a large population down to a manageable sample size.
Overview
![](https://static.wixstatic.com/media/9d0c5c_10894352418a4730b9b84f99218797b1~mv2.png/v1/fill/w_320,h_254,al_c,q_85,enc_auto/9d0c5c_10894352418a4730b9b84f99218797b1~mv2.png)
The dataset for this project involves test results for 1000 students with information on how they scored on reading, writing and maths, along with key information about the student. It has no null values and is relatively clean.
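A check of this kind can be sketched with pandas. The snippet below is a minimal illustration, not the project's actual code: the column names are assumed to match the Kaggle file, the three rows are made up, and `summarise` is a hypothetical helper.

```python
import io

import pandas as pd

# Made-up three-row extract for illustration; column names are an assumption
# about how the Kaggle CSV is laid out.
SAMPLE_CSV = """gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
female,group B,bachelor's degree,standard,none,70,75,73
female,group C,some college,standard,completed,65,88,85
male,group A,high school,free/reduced,none,48,55,46
"""


def summarise(df: pd.DataFrame) -> dict:
    """Return the row count, total number of nulls and number of duplicated rows."""
    return {
        "rows": len(df),
        "nulls": int(df.isnull().sum().sum()),
        "duplicates": int(df.duplicated().sum()),
    }


df = pd.read_csv(io.StringIO(SAMPLE_CSV))
summary = summarise(df)
print(summary)
```

With the real file, `pd.read_csv` would be pointed at the downloaded CSV instead of the inline sample.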
The distribution of the data is as follows:
Gender
The gender balance in the dataset is relatively even, with a 1.7% skew in favour of males, which is similar to global estimates of gender ratios (The World Bank, 2019).
The combination of the three exam results is as follows:
![](https://static.wixstatic.com/media/9d0c5c_cbd06fc9c3054669a1e2a9a0791afabc~mv2.png/v1/fill/w_980,h_371,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/9d0c5c_cbd06fc9c3054669a1e2a9a0791afabc~mv2.png)
It highlights that female students typically outperformed male students in exams. This trend is also seen in the wider world, as one source highlights: "the A-grade attainment for girls at Higher was 52.1 per cent, against 42.2 per cent for boys, resulting in a gap of 9.9 percentage points." (Tes magazine, 2021)
![](https://static.wixstatic.com/media/9d0c5c_f7b9bb7648c7400e9bae8e7e2e3f0861~mv2.png/v1/fill/w_516,h_358,al_c,q_85,enc_auto/9d0c5c_f7b9bb7648c7400e9bae8e7e2e3f0861~mv2.png)
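The group comparisons in this overview amount to aggregating the three scores by a categorical column. The following is a rough sketch of that idea, assuming pandas and the column names shown; the scores are made up and `mean_scores_by` is a hypothetical helper, not part of the original analysis.

```python
import pandas as pd

SCORE_COLS = ["math score", "reading score", "writing score"]


def mean_scores_by(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Average each exam score within every category of `column`."""
    return df.groupby(column)[SCORE_COLS].mean()


# Made-up scores for illustration only.
df = pd.DataFrame({
    "gender": ["female", "female", "male", "male"],
    "math score": [60, 70, 66, 72],
    "reading score": [75, 85, 62, 70],
    "writing score": [74, 86, 60, 68],
})

means = mean_scores_by(df, "gender")
print(means)
```

The same helper could be called with any other categorical column, such as lunch or parental level of education.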
Race/Ethnicity
Drawing insights and assumptions based on ethnicity is limited as the data is anonymized and uses groups A to E. Group A appears to be the minority, with the smallest number of students, while group C appears to be the largest group, with over 300 students identifying with it.
![](https://static.wixstatic.com/media/9d0c5c_4649ab2696e54b77bfb9c95705e470ba~mv2.png/v1/fill/w_553,h_221,al_c,q_85,enc_auto/9d0c5c_4649ab2696e54b77bfb9c95705e470ba~mv2.png)
Looking at how each exam score is distributed by race highlights that specific groups outperformed others. Groups B and C in particular appear to have scored low. This may be due to disadvantages faced by such groups, and social dynamics may need to be examined to understand the underlying factors behind this distribution.
![](https://static.wixstatic.com/media/9d0c5c_8d0795e0ebb241208342bba714592943~mv2.png/v1/fill/w_980,h_370,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/9d0c5c_8d0795e0ebb241208342bba714592943~mv2.png)
![](https://static.wixstatic.com/media/9d0c5c_b1da583376e244eba6e0780af17ca641~mv2.png/v1/fill/w_980,h_370,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/9d0c5c_b1da583376e244eba6e0780af17ca641~mv2.png)
![](https://static.wixstatic.com/media/9d0c5c_830940dd5f2d4f92a0e7b4bdf330d787~mv2.png/v1/fill/w_980,h_370,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/9d0c5c_830940dd5f2d4f92a0e7b4bdf330d787~mv2.png)
![](https://static.wixstatic.com/media/9d0c5c_6a96be901b0a412d89183e198bcb316f~mv2.png/v1/fill/w_980,h_370,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/9d0c5c_6a96be901b0a412d89183e198bcb316f~mv2.png)
![](https://static.wixstatic.com/media/9d0c5c_f226aa49353946e082b76f8a6f3aec5e~mv2.png/v1/fill/w_512,h_370,al_c,q_85,enc_auto/9d0c5c_f226aa49353946e082b76f8a6f3aec5e~mv2.png)
Parental level of education
There is significant variation in the level of parental education: some students have parents whose education was limited to some or all of high school, whilst others have parents with higher education, including master's degrees.
![](https://static.wixstatic.com/media/9d0c5c_4cdfae35144447d4a0f63ca86e663c89~mv2.png/v1/fill/w_980,h_359,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/9d0c5c_4cdfae35144447d4a0f63ca86e663c89~mv2.png)
These differences are reflected in the students' exam results, as those whose parents had master's or bachelor's degrees consistently outperformed other groups. Students whose parents did not complete high school education were, on average, likely to underperform compared to their peers in all three areas of examination. It should be noted that for reading and writing, students whose parents had some college education were likely to have the greatest range of test results. There is a broad array of possible reasons for this, including attitudes towards education as well as poverty limiting the parents' ability to support their children in their academic endeavours.
![](https://static.wixstatic.com/media/9d0c5c_431ebba5e302464bad41e4e83adb342a~mv2.png/v1/fill/w_980,h_359,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/9d0c5c_431ebba5e302464bad41e4e83adb342a~mv2.png)
![](https://static.wixstatic.com/media/9d0c5c_e62a71ae8c984ea5996db3eb5ccefaf8~mv2.png/v1/fill/w_980,h_359,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/9d0c5c_e62a71ae8c984ea5996db3eb5ccefaf8~mv2.png)
![](https://static.wixstatic.com/media/9d0c5c_a87abc6197c14f858c3eee3d624a8fb1~mv2.png/v1/fill/w_980,h_359,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/9d0c5c_a87abc6197c14f858c3eee3d624a8fb1~mv2.png)
![](https://static.wixstatic.com/media/9d0c5c_8ed55796bd234a25b1a782348d30ce47~mv2.png/v1/fill/w_541,h_371,al_c,q_85,enc_auto/9d0c5c_8ed55796bd234a25b1a782348d30ce47~mv2.png)
Lunch
Students from poorer households are typically offered free or reduced-price lunches to ensure that children are fed adequately regardless of their circumstances. In this dataset, such students account for 34.8% (348 students) of the total.
![](https://static.wixstatic.com/media/9d0c5c_e903b93c03dd47198d0b85ec1681053a~mv2.png/v1/fill/w_980,h_365,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/9d0c5c_e903b93c03dd47198d0b85ec1681053a~mv2.png)
The impact of this factor on the three variables is as follows:
![](https://static.wixstatic.com/media/9d0c5c_8b81128cebcf4733b275d3b01b7755fd~mv2.png/v1/fill/w_554,h_365,al_c,q_85,enc_auto/9d0c5c_8b81128cebcf4733b275d3b01b7755fd~mv2.png)
This graphic highlights that those who received free or reduced-cost lunches achieved lower exam scores. As with the other factors, this is commonly seen in the wider world, as one report states: "Those who rarely ate breakfast scored on average 10.25 points lower than those who frequently ate breakfast, a difference of nearly two grades, after accounting for other important factors including socio-economic status, ethnicity, age, sex and BMI." (University of Leeds, 2019)
Test preparations
Test preparation courses are frequently viewed as a means to improve exam results. In this dataset, 33.5% of students (335 students) attended such courses. It is unclear whether these were offered by the schools or were undertaken as extracurricular activities instigated by the students and their parents. In the case of the latter, it should be acknowledged that this undermines the effectiveness of using exam results to evaluate a teacher's capability, as external teaching obscures the capability, or lack thereof, of that teacher.
![](https://static.wixstatic.com/media/9d0c5c_54bc22a843ed4c5eae5dfbb0db0246a6~mv2.png/v1/fill/w_980,h_367,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/9d0c5c_54bc22a843ed4c5eae5dfbb0db0246a6~mv2.png)
In terms of the exam results, the difference is limited, but in areas such as writing there appears to be some benefit. One possible reason is that, unlike maths or reading where the right answer is pre-established, writing involves creativity as well as set rules which the student needs to understand. In this case, extra courses can aid the performance of students.
![](https://static.wixstatic.com/media/9d0c5c_fca55b249f774b4e8ea0a344e4b09791~mv2.png/v1/fill/w_537,h_346,al_c,q_85,enc_auto/9d0c5c_fca55b249f774b4e8ea0a344e4b09791~mv2.png)
Anomaly Detection
Anomalies can be detected in several ways. One simple method is to explore outliers through the use of boxplots. These review each factor individually and flag values that are particularly high or low relative to the rest of the distribution. Here, they indicate a few particularly low values.
Whilst this method is easy to enact, it takes into account no information about other variables or how the exam results interact with one another.
![](https://static.wixstatic.com/media/9d0c5c_305c9c4f5b974c0c92fd5adc7c055b30~mv2.png/v1/fill/w_980,h_374,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/9d0c5c_305c9c4f5b974c0c92fd5adc7c055b30~mv2.png)
![](https://static.wixstatic.com/media/9d0c5c_597263687d54439784cf3af88647cfe8~mv2.png/v1/fill/w_980,h_383,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/9d0c5c_597263687d54439784cf3af88647cfe8~mv2.png)
![](https://static.wixstatic.com/media/9d0c5c_0620cee27918470eb8dd81c91432f8b9~mv2.png/v1/fill/w_980,h_374,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/9d0c5c_0620cee27918470eb8dd81c91432f8b9~mv2.png)
![](https://static.wixstatic.com/media/9d0c5c_6a56c582980448b4a04e3c6648db3cb0~mv2.png/v1/fill/w_980,h_383,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/9d0c5c_6a56c582980448b4a04e3c6648db3cb0~mv2.png)
![](https://static.wixstatic.com/media/9d0c5c_95036592a2924fee96c55eb49a84239a~mv2.png/v1/fill/w_980,h_374,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/9d0c5c_95036592a2924fee96c55eb49a84239a~mv2.png)
![](https://static.wixstatic.com/media/9d0c5c_c5eb018c36e8488bb7ac987622629a7c~mv2.png/v1/fill/w_980,h_383,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/9d0c5c_c5eb018c36e8488bb7ac987622629a7c~mv2.png)
![](https://static.wixstatic.com/media/9d0c5c_cbdcfbb3042e43bba425c4ff969e6954~mv2.png/v1/fill/w_388,h_282,al_c,q_85,enc_auto/9d0c5c_cbdcfbb3042e43bba425c4ff969e6954~mv2.png)
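The boxplot approach can be expressed numerically as a rule on the interquartile range (IQR). The sketch below assumes the common convention of whiskers at 1.5 × IQR beyond the quartiles; the scores are made up and `iqr_outliers` is a hypothetical helper rather than the code behind the plots above.

```python
import numpy as np


def iqr_outliers(values, k=1.5):
    """Return a boolean mask marking values that fall outside the boxplot whiskers."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)


# Made-up maths scores with one very low result.
math_scores = [8, 55, 60, 62, 65, 66, 68, 70, 72, 75]
mask = iqr_outliers(math_scores)
print(np.asarray(math_scores)[mask])
```

Because the rule is applied to one column at a time, it shares the limitation noted above: it cannot see how the three scores relate to one another.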
KMeans
Given the linear relationships between the reading, writing and maths scores, anomaly detection would be more effective if it took all three variables into account. For this, methods such as KMeans could be utilised.
For the purpose of this project, the method highlighted by DecisionForest on YouTube (2020) will be used.
Clustering the data on all three variables suggests that 6 is the ideal number of clusters. The smallest of these clusters contains 85 individuals and may therefore be a point of interest. Graphically, it corresponds to individuals who performed poorly in the exams.
This could be used to identify students who require additional support but may be limited in identifying anomalies caused by teachers manipulating the data.
![](https://static.wixstatic.com/media/9d0c5c_1af3c0fa26a24358b2d33f32c9030005~mv2.png/v1/fill/w_494,h_355,al_c,q_85,enc_auto/9d0c5c_1af3c0fa26a24358b2d33f32c9030005~mv2.png)
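The clustering step can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not a reproduction of the referenced method: the scores are synthetic stand-ins for the real dataset, standardising the scores before clustering is an assumption, and only the choice of 6 clusters comes from the text above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 1000 students whose three scores
# (maths, reading, writing) share a common "ability" component.
ability = rng.normal(66, 15, size=1000)
scores = np.clip(
    np.column_stack([ability + rng.normal(0, 6, size=1000) for _ in range(3)]),
    0,
    100,
)

# Assumption: standardise so that each exam contributes equally to distances.
scaled = StandardScaler().fit_transform(scores)

kmeans = KMeans(n_clusters=6, n_init=10, random_state=0)
labels = kmeans.fit_predict(scaled)

# Count members per cluster and pick out the smallest one for inspection.
sizes = np.bincount(labels, minlength=6)
smallest = int(np.argmin(sizes))
print("cluster sizes:", sizes)
print("mean scores in smallest cluster:", scores[labels == smallest].mean(axis=0).round(1))
```

Inspecting the mean scores of the smallest cluster is one way to see which kind of student it represents before deciding whether it merits follow-up.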
Isolation Forest
An alternative method is to use Isolation Forest to identify anomalies in the dataset. Due to the underlying tree-based algorithm, it does not require scaling and can utilise categorical data such as race once it has been encoded numerically. Once again, the method by DecisionForest on YouTube (2020) will be used.
The method requires the level of contamination (the expected proportion of anomalies in the data) to be specified in advance, which can be estimated with the support of an individual with domain knowledge. However, assuming that there is 10% contamination, the model highlights possible anomalies as follows.
As with the other methods, underachieving students are highlighted prominently. Equally, overachievers are also highlighted. However, there is also a handful of individuals in the centre of the dataset that are flagged as anomalies. These may be students who over- or underperform regardless of circumstance, or who received additional aid. In each case, additional investigation is required to understand the unique dynamics affecting the performance of such students.
![](https://static.wixstatic.com/media/9d0c5c_7108f76e9fe74ab48fbcf99b8d08ab13~mv2.png/v1/fill/w_540,h_360,al_c,q_85,enc_auto/9d0c5c_7108f76e9fe74ab48fbcf99b8d08ab13~mv2.png)
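A comparable Isolation Forest run can be sketched with scikit-learn. The data below is synthetic, the column names are assumed, and one-hot encoding the categorical column is an assumption made here because scikit-learn's `IsolationForest` expects numeric input; only the 10% contamination figure is taken from the text.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in for the real data; column names are assumptions.
ability = rng.normal(66, 15, size=n)
df = pd.DataFrame({
    "race/ethnicity": rng.choice(
        ["group A", "group B", "group C", "group D", "group E"], size=n
    ),
    "math score": np.clip(ability + rng.normal(0, 6, size=n), 0, 100),
    "reading score": np.clip(ability + rng.normal(0, 6, size=n), 0, 100),
    "writing score": np.clip(ability + rng.normal(0, 6, size=n), 0, 100),
})

# Assumption: one-hot encode the categorical column so every feature is numeric.
features = pd.get_dummies(df, columns=["race/ethnicity"], dtype=float)

# contamination=0.10 encodes the assumed 10% share of anomalies.
model = IsolationForest(contamination=0.10, random_state=0)
df["anomaly"] = model.fit_predict(features)  # -1 = anomaly, 1 = normal

n_flagged = int((df["anomaly"] == -1).sum())
print("flagged as anomalies:", n_flagged)
```

The flagged rows are a shortlist for review rather than a verdict, and the size of that shortlist is driven directly by the contamination value chosen.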
Conclusion
The use of unsupervised learning algorithms can yield key insights into data that may not have been apparent before. In terms of exams, they can highlight not only students who over- or underperform but also students whose data is skewed, perhaps by unethical acts by those in authority.
This can have wider implications for the student as well as society. Those who attend university following teacher interference may struggle to keep up with its demands. Likewise, students who lose their chance to attend university as a result may be hampered in their contributions to society. Machine learning and data science therefore play a key role in ensuring that students are not exploited in such a manner. Society and authorities must ensure that students and their wellbeing, now and in the future, are at the centre of education and, in turn, exams.
“Education is the most powerful weapon which you can use to change the world.”― Nelson Mandela