Author Identification using Text Snippets — A Machine Learning Approach
A Machine Learning system to identify authors on the basis of their writing style.
This article gives the big picture of this Machine Learning and Natural Language Processing project and discusses the results obtained.
The objectives of this project are:
- Create a dataset of authors and their works by web scraping.
- Extract features from the dataset and visualize the data.
- Apply machine learning models and discuss the results.
With each keystroke, an author imparts something of themselves unto their work, and most of this is subconscious. Each author has a distinct writing style and vocabulary. Characterizing an author requires extracting features from the author’s text; these are called stylometric features, and they identify an author uniquely.
We propose to train a machine learning model on short text snippets to leverage these properties and identify the author.
Importance of the Project
The importance of the project can be derived from the kind of application areas that this work can cater to:
- Simulating/Mimicking author behavior by machines.
- Identifying plagiarism, authorship changes, and authorship claims in written works.
- Better information retrieval systems.
- Tone, delivery, and message consistency guidance by automated systems like Grammarly.
Data covering 415 authors and 9,416 documents was web scraped; the next task was to decide which sentences to include and which to discard. To achieve this, the following strategy was used:
- Pick three authors on the basis of the number of sentences contributed by each. This gives us a small, focused dataset.
- Preprocess the corpus in terms of tokenization, lemmatization, punctuation removal, and case folding. Sentences consisting of fewer than 5 words were removed.
- Removing unnecessary sentences collected while web scraping. This was done by creating a list of triggers that were generally seen after scraping.
- This pre-processed data was converted to features using a count vectorizer which was then passed through a Multinomial Naive Bayes Model. An author identification accuracy of 85.96% was observed.
- Through this, we get class probabilities for each sentence. The top 90% of sentences (ranked by class probability) were then kept to remove outliers, and a 70:30 ratio of common to unique sentences was maintained.
- This gives us the sentence — author pair for each author.
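The filtering steps above can be sketched as follows. The function name and the toy sentences are illustrative, but the mechanism follows the strategy described: a count-vectorized Multinomial Naive Bayes scores every sentence, and only the most confidently classified fraction is kept.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def filter_confident_sentences(sentences, labels, keep_fraction=0.9):
    """Score each sentence with a Naive Bayes model trained on counts and
    keep the most confidently classified fraction, dropping likely outliers."""
    vec = CountVectorizer()
    X = vec.fit_transform(sentences)
    clf = MultinomialNB().fit(X, labels)
    proba = clf.predict_proba(X)
    classes = list(clf.classes_)
    # Probability the model assigns to each sentence's own author label.
    own = np.array([proba[i, classes.index(y)] for i, y in enumerate(labels)])
    cutoff = np.quantile(own, 1 - keep_fraction)
    return [(s, y) for s, y, p in zip(sentences, labels, own) if p >= cutoff]
```

The output is exactly the sentence-author pair list the next section starts from.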
From the previous step, we arrive at a dataset with three columns: id, text, and author.
- The id column indicates the document id. This column is not useful for machine learning purposes.
- The text column is a sentence from the work of the author indicated in the corresponding column.
- The author column indicates the abbreviated name of popular authors — SW is Shakespeare William, WV is Woolf Virginia, and WO is Wilde Oscar.
Out of these three columns, we will make use of the text and author columns. The author column is the class label column, and since we need to identify three authors, this is a multiclass classification problem.
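A minimal sketch of how this structure is split into inputs and labels; the rows here are illustrative stand-ins, not actual dataset rows.

```python
import pandas as pd

# Illustrative stand-in rows; the real dataset has thousands of sentences.
df = pd.DataFrame({
    "id": ["d1", "d2", "d3"],
    "text": ["To be, or not to be.",
             "The waves broke on the shore.",
             "We are all in the gutter."],
    "author": ["SW", "WV", "WO"],
})
X = df["text"]      # input sentences
y = df["author"]    # class labels for the multiclass problem; id is dropped
```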
Before going further to the machine learning models, we need to preprocess the data and extract its features.
Exploratory Data Analysis
Exploratory data analysis forms an important part of analyzing the data we have and helps in identifying the type of machine learning techniques to be used.
The following table shows the document length statistics for the data we have:
We can see that Woolf has the largest minimum document length, which suggests that this author prefers writing long works compared to the other two authors.
The following table shows the sentence length statistics for the data we have:
The sentence-length statistics show that the scraped data contains some blank sentences (indicated by the minimum sentence length), while the maximum sentence length suggests that Wilde writes longer sentences than Shakespeare and Woolf.
The following table shows the word length statistics for the data we have:
We can infer that Shakespeare tends to write longer words, though this could also be an artifact of multiple words being joined without spaces during scraping.
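Statistics of this kind can be reproduced with a pandas groupby; the rows below are illustrative stand-ins, not actual corpus values.

```python
import pandas as pd

# Toy rows standing in for the scraped corpus.
df = pd.DataFrame({
    "author": ["SW", "SW", "WV", "WO"],
    "text": ["To be, or not to be.", "Brevity is the soul of wit.",
             "The waves broke on the shore.", "We are all in the gutter."],
})
df["sentence_len"] = df["text"].str.len()            # characters per sentence
df["word_count"] = df["text"].str.split().str.len()  # words per sentence
df["avg_word_len"] = df["text"].str.split().map(
    lambda ws: sum(len(w) for w in ws) / len(ws))    # mean word length
# min/max/mean/std per author, as shown in the tables above
stats = df.groupby("author")[["sentence_len", "word_count", "avg_word_len"]].describe()
```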
The text data obtained is in raw format, which needs to be preprocessed. These techniques include:
- Tokenization — The sentences present in the author's text are tokenized to generate a stream of tokens.
- Lemmatization — Lemmatization is a process of producing the root word out of the word present in the text. After the tokens are produced, each word is then brought to the lemmatized form.
- Stopword Removal — Stopwords need to be removed to generate meaningful features.
- Contraction Expansion — Various contractions present in the author’s text need to be expanded.
- Punctuation Removal — Punctuation needs to be removed to assess the text data better.
- Lowercase conversion — Words present in different cases need to be brought to a standard case.
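A minimal sketch of this preprocessing chain. The contraction table is abbreviated, and lemmatization and stopword removal, which in practice would use something like NLTK's WordNetLemmatizer and stopword list, are noted but not performed here.

```python
import string

# A small contraction map; the full pipeline would use a larger table.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "i'm": "i am", "can't": "cannot"}

def preprocess(sentence, min_words=5):
    """Expand contractions, case-fold, strip punctuation, and tokenize.
    Returns None for sentences shorter than min_words. Lemmatization and
    stopword removal would be applied to the tokens after this step."""
    text = sentence.lower()                                           # case folding
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)                   # contraction expansion
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation removal
    tokens = text.split()                                             # tokenization
    return tokens if len(tokens) >= min_words else None
```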
After obtaining the preprocessed data, we can further visualize the author’s habits as indicated below:
We can see that each author tends to use 1–20 words per sentence in general (as indicated by the wide plots at the bottom). The following plot of punctuation per author indicates that Oscar Wilde uses the fewest punctuation marks while William Shakespeare tends to use the most.
After the preprocessing, a data frame containing a list of tokens for each sentence is obtained for further processing.
Several features that can depict the characteristics of an author were implemented. Some of these features are:
- Syllable count
- Word count
- Character count
- Average syllable count
- Average word count
- Unique word count
- Punctuation frequency
- Stopword frequency
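A few of these features can be computed with plain Python. The syllable heuristic (counting vowel groups) and the tiny stopword set here are simplifications of what the full pipeline would use.

```python
import string

VOWELS = "aeiouy"

def syllable_count(word):
    """Crude syllable estimate: count runs of consecutive vowels."""
    word = word.lower().strip(string.punctuation)
    count, prev_vowel = 0, False
    for ch in word:
        is_vowel = ch in VOWELS
        if is_vowel and not prev_vowel:
            count += 1
        prev_vowel = is_vowel
    return max(count, 1)

def stylometric_features(sentence, stopwords=frozenset({"the", "a", "is", "of"})):
    """A few of the per-sentence stylometric features listed above."""
    words = sentence.split()
    syllables = sum(syllable_count(w) for w in words)
    return {
        "word_count": len(words),
        "char_count": len(sentence),
        "syllable_count": syllables,
        "avg_syllables_per_word": syllables / max(len(words), 1),
        "unique_word_count": len({w.lower().strip(string.punctuation) for w in words}),
        "punctuation_count": sum(ch in string.punctuation for ch in sentence),
        "stopword_count": sum(w.lower() in stopwords for w in words),
    }
```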
The above-mentioned features are stylometric in nature. In addition, we made use of sentiment-analysis features such as VADER intensity scores, along with bulk features that capture vocabulary richness and word patterns:
- Tf-Idf Vectorizer
- POS Tag Tf-idf Vectorizer
- Count Vectorizer
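These vectorizers can be stacked into one sparse feature matrix. The POS sequences below are hypothetical hand-written tags standing in for a real tagger's output (e.g. nltk.pos_tag), and the VADER sentiment scores mentioned above would be appended as extra dense columns.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentences = ["to be or not to be",
             "the waves broke on the shore",
             "we are all in the gutter"]
# Hypothetical POS tag sequences, one per sentence.
pos_sequences = ["TO VB CC RB TO VB", "DT NNS VBD IN DT NN", "PRP VBP DT IN DT NN"]

word_tfidf = TfidfVectorizer()                # Tf-Idf over words
pos_tfidf = TfidfVectorizer(lowercase=False)  # Tf-Idf over POS tag streams
counts = CountVectorizer()                    # raw term counts

X = hstack([
    word_tfidf.fit_transform(sentences),
    pos_tfidf.fit_transform(pos_sequences),
    counts.fit_transform(sentences),
])  # one row per sentence, one block of columns per vectorizer
```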
Visualizing the stylometric and Tf-Idf Vectorizer features using TSNE yields the following results:
Following is the TSNE plot using all the features:
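The TSNE embedding itself is a straightforward scikit-learn call; random features stand in here for the real stylometric and Tf-Idf matrix.

```python
import numpy as np
from sklearn.manifold import TSNE

# Random features standing in for the stylometric + Tf-Idf matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))  # 60 sentences, 20 features

# Project to 2-D; the result is scatter-plotted coloured by author.
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
```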
The evaluation metric that we used was multi-class log loss:

logloss = −(1/N) Σᵢ₌₁ᴺ Σⱼ₌₁ᴹ yᵢⱼ log(pᵢⱼ)

where N is the number of observations in the test set, M is the number of class labels (3 classes), log is the natural logarithm, yᵢⱼ is 1 if observation i belongs to class j and 0 otherwise, and pᵢⱼ is the predicted probability that observation i belongs to class j.
Along with the multiclass logloss, we also computed accuracy for each machine learning model.
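The metric can be computed directly from its definition as a sanity check (scikit-learn's sklearn.metrics.log_loss implements the same formula):

```python
import math

def multiclass_log_loss(y_true, y_proba, eps=1e-15):
    """Multi-class log loss: the mean negative log of the probability
    assigned to each observation's true class."""
    total = 0.0
    for i, j in enumerate(y_true):                 # j = index of the true class
        p = min(max(y_proba[i][j], eps), 1 - eps)  # clip to avoid log(0)
        total += math.log(p)
    return -total / len(y_true)
```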
Machine Learning Models
Various machine learning models that have been applied are:
- Passive Aggressive Classifier
- Logistic Regression
- Multinomial Naive Bayes
Now, let us see these machine learning models one by one.
- Passive Aggressive Classifier — The passive aggressive classifier works well on short text data and was trained on the feature set containing stylometric and Tf-Idf Vectorizer features. Passive aggressive classifiers do not generate probabilities, and hence we computed only accuracy for this model.
- Logistic Regression — A pipeline that handles the imbalance in the number of samples per class and performs feature selection, followed by a Logistic Regression model with balanced class weights.
- Multinomial Naive Bayes — The Multinomial Naive Bayes model (from sklearn) is trained on all of the generated features and turns out to be the best-performing model.
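The three models can be sketched with a common scikit-learn pipeline. The toy sentences and the plain Tf-Idf front end stand in for the full feature set described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy sentence-author pairs; the real run uses the full dataset.
X_text = ["to be or not to be", "the waves broke on the shore",
          "we are all in the gutter", "to thine own self be true"]
y = ["SW", "WV", "WO", "SW"]

models = {
    "passive_aggressive": make_pipeline(
        TfidfVectorizer(), PassiveAggressiveClassifier(random_state=0)),
    "logistic_regression": make_pipeline(
        TfidfVectorizer(), LogisticRegression(class_weight="balanced", max_iter=1000)),
    "multinomial_nb": make_pipeline(TfidfVectorizer(), MultinomialNB()),
}
for name, model in models.items():
    model.fit(X_text, y)  # score with accuracy; log loss needs predict_proba
```

Note that only the latter two pipelines expose predict_proba, which is why log loss is reported only for them.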
The web-scraped data of the authors’ various works was transformed into structured sentences. These sentences were then fed into the above-mentioned machine learning models, and accuracy and multiclass log loss values were obtained. These results were obtained with the 70:30 ratio of common to unique sentences for the authors specified in the dataset section.
Following are the classification reports of the models which were run on the dataset obtained.
The following table denotes the log loss values of Logistic Regression and Multinomial Naive Bayes models.
Note: As mentioned above, the Passive Aggressive classifier does not provide probability values, and hence log loss cannot be computed for this model.
Following is the summarization of the accuracy values from the classification report results:
These results indicate that the Multinomial Naive Bayes model performed the best, with the lowest log loss value of 0.42 and the highest accuracy of 83%.
The results surpass human performance on the task at hand, with an overall accuracy of 83%. The best-performing model was the Multinomial Naive Bayes model. The performance of the baseline model and our methodology is comparable; however, the baseline model computes log loss on validation data while we compute it on much larger test data.
Various new stylometric features can also be derived, and advanced stylometric coefficients such as John Burrows’ Delta can be computed. These features could help characterize the authors more accurately.
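As an illustration of this direction, here is a rough sketch of Burrows' Delta: z-score the relative frequencies of the most common words across the candidate authors, then score an unseen text by its mean absolute z-difference from each author's profile (a lower Delta suggests a likelier author). The toy token lists are illustrative only.

```python
from collections import Counter
from statistics import mean, pstdev

def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in a token list."""
    counts = Counter(tokens)
    return {w: counts[w] / len(tokens) for w in vocab}

def burrows_delta(candidate_tokens, author_tokens, n_words=30):
    """Rough sketch of Burrows' Delta over the n_words most common words."""
    pooled = [t for toks in author_tokens.values() for t in toks]
    vocab = [w for w, _ in Counter(pooled).most_common(n_words)]
    freqs = {a: relative_freqs(toks, vocab) for a, toks in author_tokens.items()}
    mu = {w: mean(freqs[a][w] for a in freqs) for w in vocab}
    sd = {w: pstdev([freqs[a][w] for a in freqs]) or 1e-9 for w in vocab}

    def zscores(f):
        return {w: (f[w] - mu[w]) / sd[w] for w in vocab}

    cand_z = zscores(relative_freqs(candidate_tokens, vocab))
    return {a: mean(abs(cand_z[w] - zscores(freqs[a])[w]) for w in vocab)
            for a in freqs}
```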
The development of this project has been a joint effort. The team has collaborated closely on the majority of this project. While the cohesive structure of the project is known to all, the work distribution breakdown is as follows.
Prateek Agarwal: Exploratory Data Analysis, Data Statistics, Data Preprocessing, Feature Extraction, Documentation
Suryank Tiwari: Data scraping, Exploratory Data Analysis, Dataset structure generation, Machine Learning Models, Documentation
A special thanks to the instructor Dr. Tanmoy Chakraborty for the guidance throughout the course.
And all the TAs: Shiv Kumar Gehlot, Shikha Singh, Nirav Diwan, Chhavi Jain, Pragya Srivastava, Vivek Reddy, Ishita Bajaj