Text Classification
Introduction¶
Let's load the basic Python modules we are going to use:
import numpy as np
import scipy as sc
import sklearn as sk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import altair as alt
matplotlib.rcParams["figure.figsize"] = (8, 8)
alt.renderers.set_embed_options(theme='light');
We first need to create an easier dataset with string categories instead of the integer encoding. The mapping is available in the classes.txt file.
df = pd.read_csv('data/train.csv',
                 header=None,
                 names=['i_category', 'title', 'content'])
# classes.txt maps the 1-based integer labels to category names.
cat_map = open('data/classes.txt').read().splitlines()
df['category'] = df['i_category'].apply(lambda x: cat_map[x - 1])
df = df[['title', 'category', 'content']]
data_url = 'data/normalized_train.csv'
df.to_csv(data_url, index=False)
On subsequent runs, you can skip the process above and simply run the following cell:
data_url = 'data/normalized_train.csv'
df = pd.read_csv(data_url)
df.sample(10, random_state=5010)
Data Analysis¶
How many documents do we have?
# df.size counts cells (rows * columns); len(df) counts documents.
len(df)
The categories are well-balanced among the population.
df['category'].value_counts().plot.pie()
Let's recall some basic quantities used in NLP (a short sketch follows the definitions):
- document frequency: the proportion of documents containing a given term; with a binary document-term matrix X, this is X.mean(0).
- term frequency: within a single document, the proportion of its terms that are a given term.
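A minimal sketch of both quantities on a made-up toy corpus (the documents and variable names below are illustrative only):
docs = ["the cat sat on the mat", "the dog ran"]  # hypothetical toy corpus
X_toy = CountVectorizer().fit_transform(docs).toarray()
# Document frequency: proportion of documents containing each term.
doc_freq = (X_toy > 0).mean(axis=0)
# Term frequency: within each document, the share of its terms that are a given term.
term_freq = X_toy / X_toy.sum(axis=1, keepdims=True)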
Let's check how well Zipf's law holds. We include n-grams up to trigrams to make the effect even more obvious.
cv = CountVectorizer(ngram_range=(1, 3))
count = cv.fit_transform(df['content'])
# Total number of occurrences of each n-gram across the corpus.
freq = count.sum(0).getA().flatten()
# Sort in descending order (sorting the reversed view in place).
freq[::-1].sort()
plt.plot(freq)
plt.xscale('log')
plt.yscale('log')
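For comparison, we can overlay the ideal Zipf curve, where frequency is proportional to 1/rank (a minimal sketch; anchoring the constant at the most frequent term is an arbitrary choice):
ranks = np.arange(1, len(freq) + 1)
plt.plot(ranks, freq, label='observed')
# Ideal Zipf curve anchored at rank 1.
plt.plot(ranks, freq[0] / ranks, '--', label='freq[0] / rank')
plt.xscale('log')
plt.yscale('log')
plt.xlabel('rank')
plt.ylabel('frequency')
plt.legend()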
Let's display the 225 most frequent words per category, expecting to see a lot of English stop words. We also see many words that are very specific to their category and, more broadly, a distinct lexicon and set of references in each category's word cloud.
cv_feature_names = cv.get_feature_names()  # get_feature_names_out() on scikit-learn >= 1.0
map_token_to_word = np.vectorize(lambda i: cv_feature_names[i])
# 15 x 15 grid coordinates to lay out the 225 words of each cloud.
coord_x, coord_y = [_.flatten() for _ in np.meshgrid(np.arange(15), np.arange(15))]
top_225_words = {}
charts = {}
for category in df['category'].unique():
    # Total count of each n-gram within the category.
    cc = count[(df['category'] == category).values, :].sum(0).getA().ravel()
    # Indices of the 225 most frequent n-grams (argsort is ascending).
    rank = cc.argsort()[-225:]
    top_225_words[category] = (map_token_to_word(rank), cc[rank])
    t = pd.DataFrame({'x': coord_x,
                      'y': coord_y,
                      'text': top_225_words[category][0],
                      'count': top_225_words[category][1],
                      # Square root keeps the largest counts from dominating the text size.
                      'count_trans': top_225_words[category][1]**(1/2)})
    charts[category] = alt.Chart(t).mark_text(
    ).encode(x='x:O',
             y='y:Q',
             size='count_trans:Q',
             color=alt.value('steelblue'),
             text='text:N',
             tooltip=['count', 'text']
    ).properties(title=category, width=512)
((charts['Business'] & charts['World'])
& (charts['Sci/Tech'] & charts['Sports'])
).configure(autosize='fit'
).configure_axis(grid=False
).configure_view(strokeOpacity=0)
Models¶
Normalization¶
Tokenizing, Stemming, Vectorizing
We need to normalize the data using a stemmer, which reduces the dimension of the word counts and lets us better leverage the presence of a given lemma. We implement a slightly modified CountVectorizer with a stemming layer.
import snowballstemmer
stemmer_en = snowballstemmer.stemmer('english')
class SnowBallCountVectorizer(CountVectorizer):
    def build_tokenizer(self):
        # Wrap the default tokenizer with a Snowball stemming pass.
        tokenize = super().build_tokenizer()
        return lambda doc: stemmer_en.stemWords(tokenize(doc))
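A quick sanity check on a made-up sentence (the exact stems depend on the Snowball English algorithm):
toy_cv = SnowBallCountVectorizer()
toy_cv.fit(["running runners run"])
# Expect stemmed features such as 'run' and 'runner' instead of the raw surface forms.
print(sorted(toy_cv.vocabulary_))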
We need to create the design matrix. There are many ways to get it from a textual content:
- Title: Existence, Count
- Content: Count, TF-IDF
The way we encode the words directs the choice of model. A discrete encoding (i.e. a count) is better suited to a Bayesian model; a continuous one (i.e. TF-IDF) is better suited to a standard model such as an SVM or logistic regression.
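As a hedged illustration of the continuous alternative (TF-IDF is not used in the rest of this notebook):
from sklearn.feature_extraction.text import TfidfVectorizer
# Illustrative only: TF-IDF encoding of the content, suited to e.g. an SVM.
tfidf = TfidfVectorizer(ngram_range=(1, 3))
X_tfidf = tfidf.fit_transform(df['content'])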
Warning: we again use n-grams up to trigrams.
Naive Bayes¶
The Naive Bayes model is a very simple one with very strong independence assumptions. Because of this very conservative way of modeling interactions between words, the only simple way to express that a word appearing in the title matters more than one appearing in the content is to apply a weight to the title counts.
y = df['category']
count_vectorizer_content = SnowBallCountVectorizer(ngram_range=(1, 3))
X_content = count_vectorizer_content.fit_transform(df['content'])
count_features = count_vectorizer_content.get_feature_names()
X_title_from_content = count_vectorizer_content.transform(df['title'])
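As a quick sanity check, both matrices must share the same vocabulary dimension for the weighted sum below:
# Same number of columns, since both use count_vectorizer_content's vocabulary.
print(X_content.shape, X_title_from_content.shape)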
We simply consider a count from the title to be twice as important as one from the content.
X = X_content + 2*X_title_from_content
Let's split the generated dataset now.
# Default split: 75% train / 25% test.
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=5010)
Training and Score¶
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(Xtrain, ytrain)
# score() returns accuracy, so this prints the test error rate.
print(f"{round(100 - 100*nb.score(Xtest, ytest), 3)}%")
This loss is not a bad one compared to the best result from LeCun's paper: 7.64%. The main difference with LeCun's pipeline (which uses much more computing power, since it relies on neural networks) is the stemming step. LeCun's main purpose was to show that neural networks can achieve the same kind of preprocessing without a human picking an arbitrary one, as we did. Nevertheless, in this particular case it remains a fairly weak result (we could have expected the loss to be below 5%).
Mispredictions¶
nb_predict = nb.predict(X)
nb_mispredict = (nb_predict != y)
nb_mis = df[['title', 'category', 'content']][nb_mispredict]
nb_mis['predicted'] = nb_predict[nb_mispredict]
nb_mis.sample(5, random_state=5010)
Globally, the classifier fails on ambiguous entries. India Moon Program may be classified as World instead of Sci/Tech because of the presence of India. Categorizing a document is a subjective task. Let's check some misclassified CNN articles. The first article, about one of Bin Laden's associates, clearly has a wrong label, and it is not the only one. When the label is not simply wrong, we find that the article's topic lies in the gray zone between categories. How should we classify an article about IBM quitting the PC business: is it Sci/Tech, which sounds right, or is it Business?
nb_mis[(nb_mis['content'].str.find('CNN') != -1)]
We find the same phenomenon in some New York Times articles.
nb_mis[(nb_mis['content'].str.find('New York Times') != -1)]
Multinomial Naive Bayes Classifier¶
Let's take a text with $W$ words. Since we treat it as a bag of words, we can generate this "bag" by randomly drawing words from the vocabulary until we have exactly $W$ of them. Since we do this conditionally on the category, the probability of drawing a given word depends on the category. That is where the modeling assumption lies: this probability depends only on the category, not on the number of occurrences of any other word. That is why this model is called naive. But it is now tractable, since there are "only" as many parameters to estimate as there are words in the vocabulary (compared to a complexity of at least $2^V$ if we modeled the joint presence of words in a text without any simplifying hypothesis).
We name here $X$ the feature vector and $Y$ the category to predict.
$$\mathbb{P}(X, Y) = \mathbb{P}(X \mid Y)\, \mathbb{P}(Y)$$
Let's write out the multinomial distribution model:
$$\mathbb{P}(X \mid Y) = \Big(\sum_{i \leq V} X_i\Big)! \prod_{i \leq V} \frac{p_{i,Y}^{X_i}}{X_i!}$$
where the probabilities $p_{i,Y}$ depend on the category $Y$. We then estimate each quantity by its natural estimator given an i.i.d. sequence of variables (or by a smoothed version of it; the Sklearn documentation explains the purpose of Laplace smoothing).
To get the classifier, we simply maximize $\mathbb{P}(X \mid Y)\,\mathbb{P}(Y)$ over the categories.
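Concretely, taking logarithms and dropping the factorial terms (which do not depend on the category), the decision rule reads:
$$\hat{Y}(X) = \operatorname{argmax}_{y} \Big( \log \mathbb{P}(Y = y) + \sum_{i \leq V} X_i \log p_{i,y} \Big)$$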
In that way, the Multinomial Naive Bayes Classifier can be seen as the most natural way to classify a text, especially if we only have a Bag-of-Words representation of the given document.
Playing with the model¶
def predict_from_string(title, content):
    x_title = count_vectorizer_content.transform([title])
    x_content = count_vectorizer_content.transform([content])
    # Note: this weights the title more heavily (4x) than during training (2x).
    x = x_content + 4*x_title
    return nb.predict(x)
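A hypothetical example call (the headline and snippet below are made up):
predict_from_string("Market rallies as tech giant posts record earnings",
                    "Shares climbed on Wall Street after the quarterly results.")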
Bernoulli Naive Bayes¶
We can find many references (Sklearn, Jurafsky) that advise using a Bernoulli Naive Bayes for text classification. In our case, it does not yield better results.
from sklearn.naive_bayes import BernoulliNB
bb = BernoulliNB()
bb.fit(Xtrain, ytrain)
print(f"{round(100 - 100*bb.score(Xtest, ytest), 3)}%")
Pending Questions¶
- What about other models? Is Naive Bayes satisfactory? What can other models explain that NB cannot?
- Does MNB generalize well to unseen data?
References¶
- Scikit-Learn documentation: Naive Bayes
- Wikipedia: Naive Bayes classifier, section Multinomial naive Bayes
- Wikipedia: Multinomial distribution
- Jurafsky & Martin, Speech and Language Processing (online NLP book), especially Ch. 4