Text Classification
Introduction¶
Let's load the basic Python modules we are going to use:
import numpy as np
import scipy as sc
import sklearn as sk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import altair as alt
matplotlib.rcParams["figure.figsize"] = (8, 8)
alt.renderers.set_embed_options(theme='light');
We first need to create an easier dataset with string categories instead of the integer encoding. The mapping is available in the classes.txt file.
df = pd.read_csv('data/train.csv',
                 header=None,
                 names=['i_category', 'title', 'content'])
# classes.txt maps the 1-based integer labels to category names.
cat_map = open('data/classes.txt').read().splitlines()
df['category'] = df['i_category'].apply(lambda x: cat_map[x - 1])
df = df[['title', 'category', 'content']]
data_url = 'data/normalized_train.csv'
df.to_csv(data_url, index=False)
On subsequent runs, you can skip the process above and simply run the following cell:
data_url = 'data/normalized_train.csv'
df = pd.read_csv(data_url)
df.sample(10, random_state=5010)
Data Analysis¶
How many documents do we have?
# df.size counts cells (rows * columns); len(df) counts documents.
len(df)
The categories are well-balanced among the population.
df['category'].value_counts().plot.pie()
Let's recall some basic quantities used in NLP (a short sketch follows the definitions):
- document frequency: the proportion of documents containing a given term; with a binary document-term matrix X, this is X.mean(0).
- term frequency: within a single document, the proportion of its terms that are a given term.
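A minimal sketch of both quantities on a made-up toy corpus (the documents and variable names below are illustrative only):
docs = ["the cat sat on the mat", "the dog ran"]  # hypothetical toy corpus
X_toy = CountVectorizer().fit_transform(docs).toarray()
# Document frequency: proportion of documents containing each term.
doc_freq = (X_toy > 0).mean(axis=0)
# Term frequency: within each document, the share of its terms that are a given term.
term_freq = X_toy / X_toy.sum(axis=1, keepdims=True)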
Let's check how well Zipf's law holds. We include n-grams up to trigrams to make the effect even more obvious.
cv = CountVectorizer(ngram_range=(1, 3))
count = cv.fit_transform(df['content'])
# Total number of occurrences of each n-gram across the corpus.
freq = count.sum(0).getA().flatten()
# Sort in descending order (sorting the reversed view in place).
freq[::-1].sort()
plt.plot(freq)
plt.xscale('log')
plt.yscale('log')
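For comparison, we can overlay the ideal Zipf curve, where frequency is proportional to 1/rank (a minimal sketch; anchoring the constant at the most frequent term is an arbitrary choice):
ranks = np.arange(1, len(freq) + 1)
plt.plot(ranks, freq, label='observed')
# Ideal Zipf curve anchored at rank 1.
plt.plot(ranks, freq[0] / ranks, '--', label='freq[0] / rank')
plt.xscale('log')
plt.yscale('log')
plt.xlabel('rank')
plt.ylabel('frequency')
plt.legend()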
Let's display the 225 most frequent words per category, expecting to see a lot of English stop words. We also see many words that are very specific to their category and, more broadly, a distinct lexicon and set of references in each category's word cloud.
cv_feature_names = cv.get_feature_names()  # get_feature_names_out() on scikit-learn >= 1.0
map_token_to_word = np.vectorize(lambda i: cv_feature_names[i])
# 15 x 15 grid coordinates to lay out the 225 words of each cloud.
coord_x, coord_y = [_.flatten() for _ in np.meshgrid(np.arange(15), np.arange(15))]
top_225_words = {}
charts = {}
for category in df['category'].unique():
    # Total count of each n-gram within the category.
    cc = count[(df['category'] == category).values, :].sum(0).getA().ravel()
    # Indices of the 225 most frequent n-grams (argsort is ascending).
    rank = cc.argsort()[-225:]
    top_225_words[category] = (map_token_to_word(rank), cc[rank])
    t = pd.DataFrame({'x': coord_x,
                      'y': coord_y,
                      'text': top_225_words[category][0],
                      'count': top_225_words[category][1],
                      # Square root keeps the largest counts from dominating the text size.
                      'count_trans': top_225_words[category][1]**(1/2)})
    charts[category] = alt.Chart(t).mark_text(
    ).encode(x='x:O',
             y='y:Q',
             size='count_trans:Q',
             color=alt.value('steelblue'),
             text='text:N',
             tooltip=['count', 'text']
    ).properties(title=category, width=512)
((charts['Business'] & charts['World'])
& (charts['Sci/Tech'] & charts['Sports'])
).configure(autosize='fit'
).configure_axis(grid=False
).configure_view(strokeOpacity=0)
Models¶
Normalization¶
Tokenizing, Stemming, Vectorizing
We need to normalize the data using a stemmer, which reduces the dimension of the word counts and lets us better leverage the presence of a given lemma. We implement a slightly modified CountVectorizer with a stemming layer.
import snowballstemmer
stemmer_en = snowballstemmer.stemmer('english')
class SnowBallCountVectorizer(CountVectorizer):
    def build_tokenizer(self):
        # Wrap the default tokenizer with a Snowball stemming pass.
        tokenize = super().build_tokenizer()
        return lambda doc: stemmer_en.stemWords(tokenize(doc))
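A quick sanity check on a made-up sentence (the exact stems depend on the Snowball English algorithm):
toy_cv = SnowBallCountVectorizer()
toy_cv.fit(["running runners run"])
# Expect stemmed features such as 'run' and 'runner' instead of the raw surface forms.
print(sorted(toy_cv.vocabulary_))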
We need to create the design matrix. There are many ways to get it from a textual content:
- Title: Existence, Count
- Content: Count, TF-IDF
The way we encode the words directs the choice of model. A discrete encoding (i.e. a count) is better suited to a Bayesian model; a continuous one (i.e. TF-IDF) is better suited to a standard model such as an SVM or logistic regression.
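As a hedged illustration of the continuous alternative (TF-IDF is not used in the rest of this notebook):
from sklearn.feature_extraction.text import TfidfVectorizer
# Illustrative only: TF-IDF encoding of the content, suited to e.g. an SVM.
tfidf = TfidfVectorizer(ngram_range=(1, 3))
X_tfidf = tfidf.fit_transform(df['content'])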
Warning: we again use n-grams up to trigrams.
Naive Bayes¶
The Naive Bayes model is a very simple one with very strong independence assumptions. Because of this very conservative way of modeling interactions between words, the only simple way to express that a word appearing in the title matters more than one appearing in the content is to apply a weight to the title counts.
y = df['category']
count_vectorizer_content = SnowBallCountVectorizer(ngram_range=(1, 3))
X_content = count_vectorizer_content.fit_transform(df['content'])
count_features = count_vectorizer_content.get_feature_names()
X_title_from_content = count_vectorizer_content.transform(df['title'])
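As a quick sanity check, both matrices must share the same vocabulary dimension for the weighted sum below:
# Same number of columns, since both use count_vectorizer_content's vocabulary.
print(X_content.shape, X_title_from_content.shape)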
We simply consider a count from the title to be twice as important as one from the content.
X = X_content + 2*X_title_from_content
Let's split the generated dataset now.
# Default split: 75% train / 25% test.
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=5010)
Training and Score¶
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(Xtrain, ytrain)
# score() returns accuracy, so this prints the test error rate.
print(f"{round(100 - 100*nb.score(Xtest, ytest), 3)}%")
This loss is not a bad one compared to the best result from LeCun's paper: 7.64%. The main difference with LeCun's pipeline (which uses much more computing power, since it relies on neural networks) is the stemming step. LeCun's main purpose was to show that neural networks can achieve the same kind of preprocessing without a human picking an arbitrary one, as we did. Nevertheless, in this particular case it remains a fairly weak result (we could have expected the loss to be below 5%).
Mispredictions¶
nb_predict = nb.predict(X)
nb_mispredict = (nb_predict != y)
nb_mis = df[['title', 'category', 'content']][nb_mispredict]
nb_mis['predicted'] = nb_predict[nb_mispredict]
nb_mis.sample(5, random_state=5010)
Globally, the classifier fails on ambiguous entries. India Moon Program may be classified as World instead of Sci/Tech because of the presence of India. Categorizing a document is a subjective task. Let's check some misclassified CNN articles. The first article, about one of Bin Laden's associates, clearly has a wrong label, and it is not the only one. When the label is not simply wrong, we find that the article's topic lies in the gray zone between categories. How should we classify an article about IBM quitting the PC business: is it Sci/Tech, which sounds right, or is it Business?
nb_mis[(nb_mis['content'].str.find('CNN') != -1)]
We find the same phenomenon in some New York Times articles.
nb_mis[(nb_mis['content'].str.find('New York Times') != -1)]
Multinomial Naive Bayes Classifier¶
Let's take a text with $W$ words. Since we treat it as a bag of words, we can generate this "bag" by randomly drawing words from the vocabulary until we have exactly $W$ of them. Since we do this conditionally on the category, the probability of drawing a given word depends on the category. That is where the modeling assumption lies: this probability depends only on the category, not on the number of occurrences of any other word. That is why this model is called naive. But it is now tractable, since there are "only" as many parameters to estimate as there are words in the vocabulary (compared to a complexity of at least $2^V$ if we modeled the joint presence of words in a text without any simplifying hypothesis).
We name here $X$ the feature vector and $Y$ the category to predict.
$$\mathbb{P}(X, Y) = \mathbb{P}(X \mid Y)\, \mathbb{P}(Y)$$
Let's write out the multinomial distribution model:
$$\mathbb{P}(X \mid Y) = \Big(\sum_{i \leq V} X_i\Big)! \prod_{i \leq V} \frac{p_{i,Y}^{X_i}}{X_i!}$$
where the probabilities $p_{i,Y}$ depend on the category $Y$. We then estimate each quantity by its natural estimator given an i.i.d. sequence of variables (or by a smoothed version of it; the Sklearn documentation explains the purpose of Laplace smoothing).
To get the classifier, we simply maximize $\mathbb{P}(X \mid Y)\,\mathbb{P}(Y)$ over the categories.
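Concretely, taking logarithms and dropping the factorial terms (which do not depend on the category), the decision rule reads:
$$\hat{Y}(X) = \operatorname{argmax}_{y} \Big( \log \mathbb{P}(Y = y) + \sum_{i \leq V} X_i \log p_{i,y} \Big)$$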
In that way, the Multinomial Naive Bayes Classifier can be seen as the most natural way to classify a text, especially if we only have a Bag-of-Words representation of the given document.
Playing with the model¶
def predict_from_string(title, content):
    x_title = count_vectorizer_content.transform([title])
    x_content = count_vectorizer_content.transform([content])
    # Note: this weights the title more heavily (4x) than during training (2x).
    x = x_content + 4*x_title
    return nb.predict(x)
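A hypothetical example call (the headline and snippet below are made up):
predict_from_string("Market rallies as tech giant posts record earnings",
                    "Shares climbed on Wall Street after the quarterly results.")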
Bernoulli Naive Bayes¶
We can find many references (Sklearn, Jurafsky) that advise using a Bernoulli Naive Bayes for text classification. In our case, it does not yield better results.
from sklearn.naive_bayes import BernoulliNB
bb = BernoulliNB()
bb.fit(Xtrain, ytrain)
print(f"{round(100 - 100*bb.score(Xtest, ytest), 3)}%")
Pending Questions¶
- What about other models? Is Naive Bayes satisfactory? What can other models explain that NB cannot?
- Does MNB generalize well to unseen data?
References¶
- Scikit-Learn documentation: Naive Bayes
- Wikipedia: Naive Bayes classifier, section Multinomial naive Bayes
- Wikipedia: Multinomial distribution
- Jurafsky & Martin, Speech and Language Processing (online NLP book), especially Ch. 4