
NLP | Discovering Northanger Abbey

mellishamallikage

Updated: Nov 27, 2022

An exploration of Jane Austen's novel Northanger Abbey using spaCy.

Introduction

As a bibliophile and a former English literature student, I have long been fascinated by the use of language. Language is likewise of interest to those in data science and AI, through the field of Natural Language Processing (NLP). NLP can be used for tasks ranging from breaking a text down into its main components, to understanding the sentiment behind a text, to grouping texts by related topic. Such developments have facilitated the growth of products such as chatbots.


In this project, two fundamentals of NLP, extracting key features from text and tokenisation, are examined using the spaCy library.


Overview

At the centre of this project is the novel Northanger Abbey by Jane Austen, first published in 1817 (Sparknotes, 2022). It is a coming-of-age tale about Catherine Morland, who is fascinated by gothic novels, and how she navigates her world (Sparknotes, 2022).


A copyright-free edition of the tale is available via Project Gutenberg (2022) and consists of 31 chapters. This copy has additional text before and after the tale, including notes on the text. These sections may be dropped from the review.
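The extra text before and after the tale can be trimmed before any analysis. Below is a minimal sketch, assuming the plain-text download uses marker lines of the form "*** START OF ... ***" and "*** END OF ... ***". The function name strip_gutenberg_boilerplate is hypothetical, and the edition's notes on the text may still need trimming by hand.

```python
import re

# Assumed marker lines; check them against the actual download.
START_RE = re.compile(r"^\*\*\* ?START OF .*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\* ?END OF .*$", re.MULTILINE)


def strip_gutenberg_boilerplate(raw: str) -> str:
    """Return the text between the START and END marker lines.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    begin = start.end() if start else 0
    finish = end.start() if end else len(raw)
    return raw[begin:finish].strip()


sample = (
    "Licence text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK NORTHANGER ABBEY ***\n"
    "The tale itself goes here.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK NORTHANGER ABBEY ***\n"
    "More licence text\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```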


The novel has approximately 98,737 words and 3,542 sentences.
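Counts like these can be produced along the following lines. This is a sketch only: it assumes a blank English pipeline with spaCy's rule-based sentencizer, so that it runs without a model download, and the helper name count_words_and_sentences is hypothetical. A trained pipeline segments sentences differently, so its counts may not match the figures above.

```python
import spacy

# Blank pipeline plus the rule-based sentencizer (an assumption; the
# original counts may have come from a trained pipeline instead).
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")


def count_words_and_sentences(text: str) -> tuple[int, int]:
    """Count non-punctuation, non-space tokens and sentences."""
    doc = nlp(text)
    words = sum(1 for t in doc if not t.is_punct and not t.is_space)
    sentences = sum(1 for _ in doc.sents)
    return words, sentences


sample = "No one came to call. Her situation in life was against her."
print(count_words_and_sentences(sample))
```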


The second sentence of the novel is as follows.

Her situation in life, the
character of her father and mother, her own person and disposition,
were all equally against her.

Key features of the text

The spaCy library comes with prebuilt pipelines that identify key features in a sentence, known as named entities. Applying this to the novel picks out aspects such as names, references to numbers and so forth.


The scope of this can be extended to include references to family, as in this context they refer to characters. Note, however, that the extension applies only to text that has not already been categorised in another way.
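One way to add family references is spaCy's rule-based entity ruler. The sketch below is an assumption about how this could be done, not a record of the code used for this post. It uses a blank pipeline so it runs without a model download; the FAMILY label and the word list are hypothetical. In a trained pipeline the ruler would be added after the statistical "ner" component, where by default it does not overwrite entities that have already been found.

```python
import spacy

# Blank pipeline so the sketch runs without a model download. With a
# trained pipeline, the ruler would be added after its "ner" component.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Hypothetical label and word list; extend to suit the text.
FAMILY_TERMS = ["father", "mother", "brother", "sister", "son", "daughter"]
ruler.add_patterns(
    [{"label": "FAMILY", "pattern": [{"LOWER": {"IN": FAMILY_TERMS}}]}]
)

doc = nlp("The character of her father and mother were against her.")
family_ents = [(ent.text, ent.label_) for ent in doc.ents]
print(family_ents)
```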


This leads to text as follows:


Tokenisation

An additional feature of spaCy is tokenisation: splitting a text into words and categorising each one. When applied to the second sentence of the novel, this leads to the following.

This enables the user to assess how various words and grammatical structures are used in a text. However, as chapter breaks are currently included in the text, these markers will need to be removed first. Tokenising the remaining words leads to the following table, which consists of 102915 tokens.
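Chapter markers can be stripped with a regular expression before tokenising. The sketch below assumes each heading sits on its own line as "CHAPTER" followed by an Arabic or Roman number; the pattern and the function name remove_chapter_markers are hypothetical and should be checked against the edition in use.

```python
import re

# Assumed heading format: a line holding only "CHAPTER" and a number.
CHAPTER_RE = re.compile(
    r"^[ \t]*CHAPTER[ \t]+[0-9IVXLC]+\.?[ \t]*$\n?", re.MULTILINE
)


def remove_chapter_markers(text: str) -> str:
    """Delete chapter heading lines, leaving the prose untouched."""
    return CHAPTER_RE.sub("", text)


sample = (
    "CHAPTER 1\n"
    "First chapter prose.\n"
    "CHAPTER 2\n"
    "Second chapter prose.\n"
)
cleaned = remove_chapter_markers(sample)
print(cleaned)
```

The pattern is case-sensitive and anchored to whole lines, so the word "chapter" inside a sentence is left alone.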


Evaluating Tokens

There are 98673 tokens. However, the table contains some data that can perhaps be dropped, for example spaces. The resulting table looks as follows:
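A token table of this kind, with whitespace tokens dropped, could be built roughly as follows. This is a sketch under assumptions: it uses a blank English pipeline (tokeniser only), plain dictionaries rather than a dataframe, and a hypothetical helper named token_table with made-up column names.

```python
import spacy

nlp = spacy.blank("en")  # tokeniser only; no model download needed


def token_table(text: str, keep_spaces: bool = False) -> list[dict]:
    """One row per token, optionally dropping whitespace tokens."""
    doc = nlp(text)
    return [
        {"text": t.text, "is_punct": t.is_punct, "is_space": t.is_space}
        for t in doc
        if keep_spaces or not t.is_space
    ]


# Line breaks in the source text show up as whitespace tokens.
sample = "Her situation in life,\n\nwas against her."
all_rows = token_table(sample, keep_spaces=True)
rows = token_table(sample)
print(len(all_rows), len(rows))
```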


The first aspect that may be examined is the use of punctuation.

This highlights that key markers such as commas (",") and full stops (".") appear most frequently.


However, ellipses are also popular. It should be noted that the use of such punctuation is reported to bewilder many people (University of Sussex, 1997).
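Punctuation frequencies can be tallied from the tokens themselves. The sketch below assumes a blank English pipeline and relies on the is_punct flag that spaCy sets on each token; the helper name punctuation_counts is hypothetical.

```python
from collections import Counter

import spacy

nlp = spacy.blank("en")  # tokeniser only; no model download needed


def punctuation_counts(text: str) -> Counter:
    """Count how often each punctuation token appears."""
    doc = nlp(text)
    return Counter(t.text for t in doc if t.is_punct)


sample = "Yes, she said, yes, and no."
counts = punctuation_counts(sample)
print(counts.most_common())
```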

The remaining tokens fall into various categories such as numbers, yes/no and determiners. It should be noted that the surrounding text influences the information recorded for each token. For example, examining the nouns shows "one" tagged as a noun. Reviewing the first sentence of the text shows that this "one" comes from the phrase "no one who..." and is therefore not a numerical figure.

No one who had ever seen Catherine Morland in her infancy would have
supposed her born to be an heroine.

The popular tokens from other categories can be reviewed as follows:

Proper nouns can indicate some of the most frequently mentioned characters. That said, if a character is referred to as "she", that reference is counted under pronouns instead. As a result, manual examination will be required for this aspect.
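Ranking the most frequent tokens within one category is a small counting exercise once each token carries a part-of-speech tag. The sketch below uses hand-made rows for illustration; in practice the "pos" value would come from token.pos_ in a trained spaCy pipeline, and the helper name top_tokens and the column names are hypothetical.

```python
from collections import Counter


def top_tokens(rows: list[dict], pos: str, n: int = 5) -> list[tuple[str, int]]:
    """Most frequent token texts for one part-of-speech tag."""
    counts = Counter(r["text"] for r in rows if r["pos"] == pos)
    return counts.most_common(n)


# Hand-made rows for illustration only, not output from the novel.
rows = [
    {"text": "Catherine", "pos": "PROPN"},
    {"text": "she", "pos": "PRON"},
    {"text": "Catherine", "pos": "PROPN"},
    {"text": "Tilney", "pos": "PROPN"},
    {"text": "she", "pos": "PRON"},
    {"text": "abbey", "pos": "NOUN"},
]
top_proper_nouns = top_tokens(rows, "PROPN")
print(top_proper_nouns)
```

Note how the two "she" rows sit under PRON rather than adding to any character's total, which is why a manual check is still needed.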



Conclusion

This was a mini-project examining some aspects of the spaCy library. Whilst this project examines its uses in a playful manner, by applying it to a novel, it can have wider uses. One key use is document checking. For instance, in aviation, items must have a Certificate of Conformity. Such documents must be correct and accurate as they travel up the supply chain, and therefore require frequent checking. Features such as those highlighted here could be used to check whether key requirements have been satisfied.

