Phishing Email Sentiment Analysis
Company Mentor
Jeffrey Archer
Senior Staff Cybersecurity Researcher
GE Aviation
Cincinnati, Ohio
Course Instructor
Dr. Phu Phung
Associate Professor in Computer Science at University of Dayton
Director of Intelligent Systems Security Laboratory
Project Context and Scope
The GE Aviation Cyber Intelligence, Active Defense, and Enterprise Vulnerability Management Team currently uses a data engineering framework, dubbed "Magnus", to extract features from phishing emails sent to GE employees and to automatically score how "suspicious" each email appears.
The Magnus framework calculates its suspiciousness score automatically by translating common patterns that human analysts look for in emails into automated "signatures" or "feature patterns".
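To make the idea of signature-based scoring concrete, here is a minimal sketch. It is not Magnus's actual design, which is not described here: the signature names, header fields, regular expressions, and weights are all hypothetical, and the score is assumed to be a simple sum of the weights of matching signatures.

```python
import re

# Hypothetical signatures for illustration only: (name, field, regex, weight).
# None of these names, fields, or weights come from the Magnus framework.
SIGNATURES = [
    ("freemail_reply_to", "reply_to", r"@(gmail|outlook)\.com$", 2.0),
    ("urgent_subject", "subject", r"\b(urgent|asap|immediately)\b", 1.5),
    ("executable_attachment", "attachment_name", r"\.(exe|scr|js)$", 3.0),
]


def suspiciousness_score(email_fields):
    """Return (score, matched_names) for a dict of email header fields.

    The score is assumed to be the sum of the weights of every
    signature whose regex matches its field (case-insensitive).
    """
    matched = []
    score = 0.0
    for name, field, pattern, weight in SIGNATURES:
        value = email_fields.get(field, "")
        if re.search(pattern, value, flags=re.IGNORECASE):
            matched.append(name)
            score += weight
    return score, matched
```

Returning the matched signature names alongside the score keeps the result explainable, so an analyst can see which patterns contributed to it.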
While this kind of automation works well with email metadata and header data, it would be better to also include information about the message contents, which the Magnus framework does not examine today.
One particularly valuable feature set that can be extracted from email message contents is the sentiment of the message: the overall opinions or emotions underlying the phrases used in the text. Sentiment analysis of email contents is the topic of this project and will augment the current Magnus framework by adding features based on the text and sentiment of each email.
This project will analyze the sentiment of an email to raise awareness of the email's possible intent (e.g., phishing).
Impacts
Sentiment analysis is the computational study of people's opinions, sentiments, emotions, and attitudes. It often involves employing Natural Language Processing (NLP) techniques to tokenize a document's text and analyze its content and context to determine, at a basic level, the polarity of the text (whether the expressed opinion is positive, negative, or neutral) and, at an advanced level, emotional states such as enjoyment, anger, disgust, sadness, fear, and surprise.
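The basic level described above, tokenizing text and assigning a polarity, can be sketched with the Python standard library alone. This is only an illustration under stated assumptions: the tiny lexicon and its scores are hypothetical, and a library such as TextBlob would replace it in practice.

```python
import re

# Hypothetical mini-lexicon for illustration; real tools use far larger ones.
POLARITY_LEXICON = {
    "good": 1.0, "great": 1.0, "thanks": 0.5, "happy": 1.0,
    "bad": -1.0, "deleted": -0.5, "embarrassing": -1.0, "problem": -0.5,
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def polarity(text):
    """Return (score, label), where score is the mean lexicon score
    of the tokens found in the lexicon (0.0 if none are found)."""
    scores = [POLARITY_LEXICON[t] for t in tokenize(text) if t in POLARITY_LEXICON]
    score = sum(scores) / len(scores) if scores else 0.0
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return score, label
```

Averaging word scores ignores context such as negation ("not good"), which is one reason the advanced, emotion-level analysis needs more than a word list.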
The goal of this project is to develop a sentiment analysis tool that, given the contents of an email, analyzes the text and determines the emotional states present in it.
The output of the project should include the detected emotional states and, if possible, the tokens or phrases that contributed to the detection of each state. See the sample input and output provided below.
The emotional states detected by the project can include any number of emotions (standard emotions used in sentiment analysis are Happiness, Sadness, Fear, Anger, Surprise, Scare, Shame, and Disgust). Of particular interest for this project are emotions having to do with shaming, intimidation, or harassment, as this category of emotions is often used in phishing emails to influence a victim to complete an action, e.g.:
- "Please click on this link immediately to view your banking statement."
- "This request requires your urgent attention."
- "I need you to change the banking information ASAP."
- "If you do not respond in 24 hours, we will release this embarrassing information about you"
- "Change your password here before your account is deleted!"
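One simple way to report both a detected emotional state and the phrases that triggered it is a cue-phrase lookup, sketched below. The emotion labels and cue phrases are hypothetical and were chosen only to cover the example sentences above; this is not the project's final method, which may rely on a trained model instead.

```python
import re

# Hypothetical cue phrases, picked to cover the example sentences only.
EMOTION_CUES = {
    "urgency": ["immediately", "urgent", "asap", "in 24 hours"],
    "intimidation": ["will release", "account is deleted", "if you do not"],
    "shame": ["embarrassing"],
}


def detect_emotions(text):
    """Map each detected emotion label to the cue phrases found in text."""
    # Lowercase and collapse whitespace so multi-word cues match reliably.
    normalized = re.sub(r"\s+", " ", text.lower())
    found = {}
    for emotion, cues in EMOTION_CUES.items():
        hits = [
            cue for cue in cues
            if re.search(r"\b" + re.escape(cue) + r"\b", normalized)
        ]
        if hits:
            found[emotion] = hits
    return found
```

For the fourth example above, this returns the labels "urgency", "intimidation", and "shame", each paired with its matching phrases, which is the kind of explainable output the project aims for.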
The technologies we will use (Jupyter Notebooks, Python Natural Language Processing, Kaggle, and TextBlob) will allow us to complete this project and, we hope, exceed expectations.
Video Demo