In my last post, I introduced sentiment analysis, the Naïve Bayes classification technique and why you or your business might be interested in this.
In this post I’ll delve into the Bayes rule behind it in more detail, walk through an example, and show how it’s connected to sentiment analysis.
Bayes’ rule itself is written like this (Boone):
p(A|B) = p(B|A) p(A) / p(B)
Now let’s break this down and explain each component:
p(A|B): ‘The probability of A given B’. This is the probability of outcome A, given that evidence B has been observed. This is what we want to find out. (Boone)
p(B|A): This is the probability of the evidence turning up, given that the outcome holds.
p(A): This is the probability of the outcome occurring, without the knowledge of the new evidence.
p(B): This is the probability of the evidence arising, without regard to the outcome.
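To make the four components concrete, here is a minimal Python sketch of the rule. The function name `bayes_rule` and the numbers passed to it are illustrative assumptions, not taken from Boone or from the email data set used later in this post.

```python
def bayes_rule(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Return p(A|B) = p(B|A) * p(A) / p(B)."""
    if p_b <= 0:
        # The evidence must be possible, otherwise p(A|B) is undefined
        raise ValueError("p(B) must be greater than zero")
    return p_b_given_a * p_a / p_b


# Made-up inputs, purely to show the call:
# p(B|A) = 0.5, p(A) = 0.2, p(B) = 0.25
posterior = bayes_rule(p_b_given_a=0.5, p_a=0.2, p_b=0.25)
print(f"p(A|B) = {posterior:.2f}")
```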
The sample data set discussed by (Amiune) illustrates how the theorem can be applied when trying to decide whether or not an email is spam if it has the word “buy” in the mail body.
P(spam|words) = P(words|spam) P(spam) / P(words)
We have a database of 100 emails.
· 60 of those 100 emails are spam
· 48 of those 60 emails that are spam have the word “buy”
· 12 of those 60 emails that are spam don’t have the word “buy”
· 40 of those 100 emails aren’t spam
· 4 of those 40 emails that aren’t spam have the word “buy”
· 36 of those 40 emails that aren’t spam don’t have the word “buy”
What is the probability that an email is spam if it has the word “buy” in the content?
The answer follows from counting: 48 spam emails contain “buy” and 4 non-spam emails contain it, so 52 emails contain “buy” in total.
So the probability that an email is spam if it has the word “buy” is 48/52 ≈ 0.92, and we should probably put this email in the spam folder.
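The counting argument can be checked in a few lines of Python. This is only a sketch using the counts from the email database above; the variable names are my own.

```python
# Counts taken from the 100-email database described above
spam_with_buy = 48
not_spam_with_buy = 4

# Every email containing "buy", regardless of whether it is spam
emails_with_buy = spam_with_buy + not_spam_with_buy

# Fraction of the "buy" emails that are spam
p_spam_given_buy = spam_with_buy / emails_with_buy
print(f"{spam_with_buy}/{emails_with_buy} = {p_spam_given_buy:.4f}")
```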
As mentioned previously, the rule and notation are based on probabilities, so we can redefine the problem to use probabilities rather than quantities, using the same database of emails.
What is the probability that an email is spam if it has the word “buy”?
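The notation to arrive at the answer looks like this:
P(spam) = 60/100 = 0.6
P(not spam) = 40/100 = 0.4
P(“buy”|spam) = 48/60 = 0.8
P(“buy”|not spam) = 4/40 = 0.1
P(“buy”|spam) * P(spam) is the fraction of all emails that are spam and contain “buy”, and P(“buy”|not spam) * P(not spam) is the fraction that aren’t spam and contain “buy”.
Summing the previous two, P(“buy”|spam) * P(spam) + P(“buy”|not spam) * P(not spam), we count all the emails that have the word “buy”, which gives us P(“buy”).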
Meaning the resulting equation looks like this:
P(spam|“buy”) = P(“buy”|spam) * P(spam) / (P(“buy”|spam) * P(spam) + P(“buy”|not spam) * P(not spam))
This is Bayes’ Theorem.
Or, to inject the numbers: 0.8 * 0.6 / (0.8 * 0.6 + 0.1 * 0.4) = 0.48 / 0.52 ≈ 0.923
The result of this simulation was: 0.9222485960747988, which is close to the exact value.
Or in plain English, based on our existing data set, there is roughly a 92% chance that an email containing the word “buy” is spam.
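The simulation code itself isn’t shown in this post, so the following is only a sketch of how such a check might look in Python. It computes the exact value from the formula and then estimates the same probability by generating random emails. The function names, the sample size and the seed are my own assumptions. A simulated estimate will land near 0.923 rather than exactly on it, and it changes with the seed and the number of emails generated.

```python
import random

# Probabilities derived from the 100-email database
P_SPAM = 0.6
P_BUY_GIVEN_SPAM = 0.8
P_BUY_GIVEN_NOT_SPAM = 0.1


def exact_posterior() -> float:
    """P(spam|"buy") computed directly from the formula."""
    numerator = P_BUY_GIVEN_SPAM * P_SPAM
    evidence = numerator + P_BUY_GIVEN_NOT_SPAM * (1 - P_SPAM)
    return numerator / evidence


def simulated_posterior(n_emails: int, seed: int = 0) -> float:
    """Estimate P(spam|"buy") by generating random emails."""
    rng = random.Random(seed)
    buy_total = 0      # emails that contain "buy"
    buy_and_spam = 0   # emails that contain "buy" and are spam
    for _ in range(n_emails):
        is_spam = rng.random() < P_SPAM
        p_buy = P_BUY_GIVEN_SPAM if is_spam else P_BUY_GIVEN_NOT_SPAM
        if rng.random() < p_buy:
            buy_total += 1
            if is_spam:
                buy_and_spam += 1
    return buy_and_spam / buy_total


print(f"exact:     {exact_posterior():.4f}")
print(f"simulated: {simulated_posterior(100_000):.4f}")
```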
So how do we use this theorem to perform sentiment analysis? Read on!
Performing sentiment analysis using Bayes’ Theorem involves writing a Naïve Bayes classifier, which is based on the Bayes rule that we’ve just discussed. The rule gives us a way to work out the conditional probability of an outcome from a set of known probabilities. As we’ve just seen, the rule is often used in email systems to detect whether an email is legitimate or spam based on the presence of a certain set of keywords.
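To show how the same rule carries over from spam to sentiment, here is a minimal from-scratch sketch of a Naïve Bayes classifier in Python. It is not the classifier referenced elsewhere in this post. The training sentences, the labels and the function names `train` and `classify` are made up for illustration, and it assumes whitespace tokenisation and add-one (Laplace) smoothing. The “naïve” part is the assumption that each word contributes independently to the score for a label.

```python
import math
from collections import Counter, defaultdict

# Tiny made-up training set, for illustration only
TRAINING = [
    ("i love this phone", "positive"),
    ("what a great film", "positive"),
    ("great service and lovely staff", "positive"),
    ("i hate this phone", "negative"),
    ("what a terrible film", "negative"),
    ("awful service and rude staff", "negative"),
]


def train(examples):
    """Count how often each label and each (label, word) pair occurs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability score."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, doc_count in label_counts.items():
        # log P(label): the prior
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # log P(word|label) with add-one smoothing
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAINING)
print(classify("great phone", model))       # positive
print(classify("terrible service", model))  # negative
```

Working in log-probabilities avoids multiplying many small numbers together, and the shared denominator P(words) is dropped because it is the same for every label and doesn’t change which label scores highest.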
You can find a sample classifier on GitHub. Have a play around with it and see how you get on. In my next post, I’ll talk a little bit more about the difficulties of sentiment analysis and how some of these can be alleviated.
In the meantime, feel free to reach out if you have any questions or comments.