Text Classification

Introduction

Let's load the basic Python modules we are going to use:

In [21]:
import numpy as np
import scipy as sc
import sklearn as sk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import altair as alt
In [22]:
matplotlib.rcParams["figure.figsize"] = (8, 8)
alt.renderers.set_embed_options(theme='light');

We first need to build a more convenient dataset with string categories instead of the integer encoding. The mapping is available in the classes.txt file.

In [23]:
df = pd.read_csv('data/train.csv', 
                 header=None, 
                 names=['i_category', 'title', 'content'])
cat_map = open('data/classes.txt').read().splitlines()
df['category'] = df['i_category'].apply(lambda x : cat_map[x-1])
df = df[['title', 'category', 'content']]
data_url = 'data/normalized_train.csv'
df.to_csv(data_url, index=False)

On subsequent runs, you can skip the process above and simply run the following cell:

In [24]:
data_url = 'data/normalized_train.csv'
df = pd.read_csv(data_url)
df.sample(10, random_state=5010)
Out[24]:
title category content
72617 NCAA graduation-rate study Sports Michigan, Michigan State and Western Michigan ...
57575 Vodafone Software Enables SMS from Desktop Sci/Tech Vodafone UK has introduced a service that will...
73335 Dell lineups gain SuSE Linux, 17-inch notebook Sci/Tech Novell's SuSE software joins Red Hat as server...
14821 NYC Not Likely to Be Bush's Town Nov. 2 (AP) World AP - There is no "Fiddler on the Roof" at the ...
1967 Product Review: HP d530 Business Desktop (News... Sci/Tech NewsFactor - I.T. managers will appreciate the...
68171 Marino and Young Head Hall of Fame Nomination ... Sports CANTON, Ohio (Sports Network) - Former All-Pro...
69268 Greenberg son #39;to stand down at Marsh #39; Business Jeffrey Greenberg, the chief executive of Mars...
8360 China insists bird flu in pigs does not threat... World BEIJING : China Tuesday insisted that although...
79454 Poll leaves Ukraine facing run-off World UKRAINE was in agonising limbo yesterday after...
57040 Tories unveil pensions crisis plan Business The Conservatives today unveiled an eight-poin...

Data Analysis

How many documents do we have?

In [25]:
df.size
Out[25]:
360000

Note that df.size counts cells (rows × columns): with three columns, this corresponds to 120,000 documents. The categories are evenly balanced among them.

In [26]:
df['category'].value_counts().plot.pie()
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x7efbe07c3e48>

Let's recall some basic quantities used in NLP (a toy sketch follows the list):

  • document frequency: the proportion of documents that contain a given term (e.g. (X > 0).mean(0) on a document-term matrix X).
  • term frequency: within a single document, the proportion of tokens that are a given term.
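
A minimal toy sketch of both quantities (the corpus and variable names below are made up for illustration, not part of the dataset):

toy_corpus = ["the cat sat on the mat", "the dog ate the cat"]
toy_cv = CountVectorizer()
X_toy = toy_cv.fit_transform(toy_corpus)  # document-term count matrix

# Document frequency: proportion of documents containing each term.
doc_freq = (X_toy.toarray() > 0).mean(0)

# Term frequency: within each document, proportion of tokens equal to a given term.
term_freq = X_toy.toarray() / X_toy.sum(1).getA()

print(dict(zip(toy_cv.get_feature_names(), doc_freq)))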

Let's check how well Zipf's law holds. We include n-grams up to trigrams to make it even more obvious.

In [27]:
cv = CountVectorizer(ngram_range=(1,3))
count = cv.fit_transform(df['content'])
freq = count.sum(0).getA().flatten()
freq[::-1].sort()
In [28]:
plt.plot(freq)
plt.xscale('log')
plt.yscale('log')

Let's print the 225 most frequent terms of each category (a 15 × 15 grid), expecting to see a lot of English stop words.

We see many terms that are very specific to their category and, more broadly, a different lexicon and different references in each category's word cloud.

In [29]:
cv_feature_names = cv.get_feature_names()
map_token_to_word = np.vectorize(lambda i: cv_feature_names[i])

coord_x, coord_y = [ _.flatten() for _ in np.meshgrid(np.arange(15), np.arange(15)) ] 

top_225_words = {}
charts = {}

for category in df['category'].unique():
    cc = count[(df['category'] == category).values, :].sum(0).getA().ravel()
    rank = cc.argsort()[-225:]
    top_225_words[category] = (map_token_to_word(rank), cc[rank])
    t = pd.DataFrame({'x': coord_x,
                  'y': coord_y,
                  'text': top_225_words[category][0],
                  'count': top_225_words[category][1],
                  'count_trans': (top_225_words[category][1])**(1/2)})
    charts[category] = alt.Chart(t).mark_text(
                                  ).encode(x='x:O',
                                           y='y:Q',
                                           size='count_trans:Q',
                                           color=alt.value('steelblue'),
                                           text='text:N',
                                           tooltip=['count', 'text']
                                  ).properties(title=category, width=512)

((charts['Business'] & charts['World']) 
 & (charts['Sci/Tech'] & charts['Sports'])
).configure(autosize='fit'
).configure_axis(grid=False
).configure_view(strokeOpacity=0)
Out[29]:

Models

Normalization

Tokenizing, Stemming, Vectorizing

We need to normalize the data with a stemmer, which reduces the dimensionality of the word-count matrix and lets us better leverage the presence of a given lemma regardless of its inflections. To do so, we implement a slight variation of CountVectorizer with a stemming layer.

In [30]:
import snowballstemmer

stemmer_en = snowballstemmer.stemmer('english')

class SnowBallCountVectorizer(CountVectorizer):
    def build_tokenizer(self):
        tokenize = super().build_tokenizer()
        return lambda doc: stemmer_en.stemWords(tokenize(doc))
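
A quick sanity check of the stemming tokenizer (the sentence is a made-up example; the exact stems depend on the Snowball rules):

tok = SnowBallCountVectorizer().build_tokenizer()
print(tok("the runners were running happily"))
# likely something like ['the', 'runner', 'were', 'run', 'happili']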

We need to create the design matrix. There are several ways to derive it from the textual content:

  • Title: Existence, Count
  • Content: Count, TF-IDF

The way we encode the words directs the choice of model: a discrete encoding (i.e. a count) is better suited to a Bayesian model, while a continuous one (i.e. TF-IDF) is better suited to a standard discriminative model such as an SVM or logistic regression.
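
As a quick sketch of the continuous alternative mentioned above, a TF-IDF design matrix for the content could be built as follows (shown for reference only; it is not used in the rest of this notebook):

from sklearn.feature_extraction.text import TfidfVectorizer

# Continuous TF-IDF encoding of the content; better suited to discriminative
# models such as SVM or logistic regression than to Naive Bayes.
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 3))
X_tfidf = tfidf_vectorizer.fit_transform(df['content'])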

Warning:

We use n-grams up to trigrams, which makes the vocabulary (and thus the design matrix) very large.

Naive Bayes

The Naive Bayes model is a very simple one with very strong independence assumptions. Because of this crude way of modelling interactions between words, the only simple way to express that a word appearing in the title matters more than one appearing in the content is to apply a weight to the title counts.

In [31]:
y = df['category']

count_vectorizer_content = SnowBallCountVectorizer(ngram_range=(1, 3))
X_content = count_vectorizer_content.fit_transform(df['content'])
count_features = count_vectorizer_content.get_feature_names()

X_title_from_content = count_vectorizer_content.transform(df['title'])

We simply weight counts from the title as twice as important as counts from the content.

In [32]:
X = X_content + 2*X_title_from_content

Let's split the generated dataset now.

In [33]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=5010)
Training and Score
In [34]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(Xtrain, ytrain)
print(f"{round(100 - 100*nb.score(Xtest, ytest), 3)}%")
8.857%

This error rate is not bad if we compare it with the best one from LeCun's paper: 7.64%. The main difference with LeCun's pipeline (which uses far more computing power, since it relies on neural networks) is the stemming step. LeCun's main purpose was to show that neural networks can achieve the same kind of preprocessing without an arbitrary choice picked by a human, as we did here. Nevertheless, it remains in this particular case a rather weak result (we could have expected the error rate to be below 5%).

Mispredictions
In [35]:
nb_predict = nb.predict(X)
nb_mispredict = (nb_predict != y)
nb_mis = df[['title', 'category', 'content']][nb_mispredict]
nb_mis['predicted'] = nb_predict[nb_mispredict]
In [38]:
nb_mis.sample(5, random_state=5010)
Out[38]:
title category content predicted
101129 A Catastrophe Strikes, and the Cyberworld Resp... Sci/Tech Much of the initial information about what had... World
63815 Microsoft, Cisco Shake on Network Security Business In a deal sure to bring smiles to the faces of... Sci/Tech
20098 Softbank makes mobile complaint Business SOFTBANK, Japan #39;s largest internet provide... Sci/Tech
13943 Triumphant Olympics End With Pride, Relief World ATHENS, Greece - Efharisto! A nervous world le... Sports
4185 US Court Rejects Movie, Music Makers' Piracy C... Sci/Tech Reuters - A federal appeals court on Thursday\... Business

Overall, the classifier fails on ambiguous entries. An article such as "India Moon Program" may be classified as World instead of Sci/Tech because of the presence of "India". Categorizing a document is a subjective task. Let's check some misclassified CNN articles. The first article, about one of Bin Laden's associates, clearly has a wrong label, and it is not the only one. When the label is not wrong, the article's topic usually lies in the gray zone between categories: how should we classify an article about IBM quitting the PC business, as Sci/Tech (which sounds right) or as Business?

In [39]:
nb_mis[(nb_mis['content'].str.find('CNN') != -1)]
Out[39]:
title category content predicted
173 Saudis: Bin Laden associate surrenders Sci/Tech \\"(CNN) -- A longtime associate of al Qaeda l... World
176 Al Qaeda member surrenders Sci/Tech \\"RIYADH, Saudi Arabia (CNN) -- One of Saudi ... World
177 Mission Accomplished! Sci/Tech \\"BAGHDAD, Iraq (CNN) -- Members of Iraq's in... World
13962 FBI Probing Suspected Israeli Spy at Pentagon Sci/Tech Reuters, CNN, CBS news, and the Washington Pos... World
14682 Dress like Anna Nicole Smith Business Called #39;Anna Nicole, #39; the new clothing... Sports
41247 Palestinian gunmen kidnap CNN producer Sports GAZA CITY, Gaza Strip -- Palestinian gunmen ab... World
41307 CORRECTED-CNN producer seized in Gaza Sports In GAZA story headlined quot;CNN producer sei... World
49489 Dodge Charger image released Business NEW YORK (CNN/Money) - DaimlerChrysler #39;s D... Sci/Tech
51586 Playoff fever hits some on the job Business Executives around Boston ended business meetin... Sports
82060 A Tale of Two Osamas Sci/Tech There are some interesting, and telling, diffe... World
98894 CNN Hires New President for Its U.S. News Grou... Sci/Tech Reuters - CNN News Group on Monday said it has... Business

We find the same phenomenon in some New York Times articles.

In [40]:
nb_mis[(nb_mis['content'].str.find('New York Times') != -1)]
Out[40]:
title category content predicted
33559 Stocks Mixed on Oil Prices, Profit Outlook World NEW YORK - Higher oil prices and lowered outlo... Business
33843 Stocks Close Lower Amid Rising Oil Prices World NEW YORK - A grim combination of higher oil pr... Business
36209 Reports: CA may escape charges in DOJ deal Sci/Tech Accounting fraud charges against CA will be dr... Business
83863 Bush Election Causes Suicide Sci/Tech \\Wow... poor kid.\\"NEW YORK (AP) -- A 25-ye... World
93242 Report: SBC Sets TV Deal with Microsoft (Reuters) Sci/Tech Reuters - SBC Communications, the No. 2 U.S.\t... Business
107830 IBM Puts Its PC Business Up for Sale -NYT (Reu... Sci/Tech Reuters - International Business Machines Corp... Business
108191 Will IBM sell PC business? Business p2pnet.net News - IBMs PC business is up for g... Sci/Tech
Multinomial Naive Bayes Classifier

Let's take a text with $W$ words. Since we treat it as a bag of words, we can actually generate this "bag" by repeatedly drawing a word from the vocabulary until we get exactly $W$ words. As we do this conditionally on the category, the probability of drawing a given word depends on the category. That is where the model lies: this probability depends only on the category, not on the number of occurrences of any other word. That is why the model is called naive. But it is now tractable, since there are "only" as many parameters to estimate as words in the vocabulary (compared to a complexity of at least $2^V$ if we modelled the presence of words in a text without any simplifying assumption).

We name here $X$ the feature vector and $Y$ the category to predict.

$$\mathbb{P} ( X, Y ) = \mathbb{P} (X | Y) \mathbb{P} ( Y )$$

Let's write down the multinomial distribution model, where $p_i$ is the probability of drawing word $i$ given the category $Y$:

$$\mathbb{P} (X | Y) = (\sum_{i\leq V}{ X_i })! \prod_{i \leq V} \frac{p_i^{X_i}}{X_i!}$$

Now we estimate each $p_i$ by its natural estimator (the empirical frequency within each category) given an i.i.d. sequence of documents, or by a smoothed version of it; the scikit-learn documentation explains the purpose of Laplace smoothing.

To get the classifier, we simply pick the category maximizing the posterior probability $\mathbb{P}(Y \mid X) \propto \mathbb{P}(X \mid Y)\,\mathbb{P}(Y)$.

In that way, the Multinomial Naive Bayes Classifier can be seen as the most natural way to classify a text, especially if we only have a Bag-of-Words representation of the given document.
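
As a minimal sketch of this decision rule, we can recompute the (unnormalized) log-posterior directly from the fitted model's parameters and check that its argmax agrees with nb.predict (this assumes the nb and Xtest variables defined above, and uses the estimator's class_log_prior_ and feature_log_prob_ attributes):

# log P(Y) + sum_i X_i * log p_{i|Y}; the multinomial coefficient is dropped
# since it does not depend on the category.
joint_log_likelihood = Xtest @ nb.feature_log_prob_.T + nb.class_log_prior_
manual_predictions = nb.classes_[joint_log_likelihood.argmax(axis=1)]
print((manual_predictions == nb.predict(Xtest)).all())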

Playing with the model

In [41]:
def predict_from_string(title, content):
    x_title = count_vectorizer_content.transform([title])
    x_content = count_vectorizer_content.transform([content])
    x = x_content + 2*x_title  # same title weight (2) as used to build the training matrix
    return nb.predict(x)
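
For example (a made-up headline and summary; the returned category depends on the fitted model):

# Hypothetical input, not taken from the dataset.
predict_from_string("India launches its first Moon probe",
                    "The Indian space agency announced the launch of a lunar mission.")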

Bernoulli Naive Bayes

We can find many references (scikit-learn, Jurafsky) advising the use of a Bernoulli Naive Bayes for text classification. In our case, it does not yield better results.

In [42]:
from sklearn.naive_bayes import BernoulliNB

bb = BernoulliNB()
bb.fit(Xtrain, ytrain)
print(f"{round(100 - 100*bb.score(Xtest, ytest), 3)}%")
9.363%

Pending Questions

  • What about other models? Is Naive Bayes satisfactory? What can other models capture that NB can't?
  • Does MNB generalize well to unseen data?

References

  1. Scikit-learn documentation: Naive Bayes
  2. Wikipedia: Naive Bayes classifier, section "Multinomial naive Bayes"
  3. Wikipedia: Multinomial distribution
  4. Jurafsky & Martin, Speech and Language Processing (online draft on Jurafsky's personal website), especially Ch. 4