
Document Classification With Solr Streaming Expressions

Classification is one of the most popular tasks in Natural Language Processing and Machine Learning. Solr ships with a set of Streaming Expressions functions that allow building and deploying statistical classification models out of the box. With adequate preprocessing and indexing tweaks, these functions can be used to classify documents quickly and with high accuracy. This post illustrates how Solr streaming expressions and Zeppelin notebooks can be used to build a document classifier.

Dataset and Preprocessing

In this post, the BBC News dataset is used. It is split into training and testing subsets of 1,999 and 226 documents respectively, indexed into two collections, bbc-text-train and bbc-text-test. Because Solr's training feature, the train expression, expects documents with a binary (0 or 1) outcome label, the category column has to be binarized into one 0/1 column per class before indexing. The following Python code can be used to achieve that:

import pandas as pd
from sklearn import preprocessing

lb = preprocessing.LabelBinarizer()

df = pd.read_csv('~/workspace/data/bbc-text/bbc-text-train.csv')
category = df['category_s']  # column holding the category name of each document
# fit_transform learns the classes first, then encodes each label as a 0/1 row
encoded_category = lb.fit_transform(category)
df_encoded_category = pd.DataFrame(data=encoded_category,
                        columns=[c + '_i' for c in lb.classes_])
df_con = pd.concat([df, df_encoded_category], axis=1)

df_con.to_csv('~/workspace/data/bbc-text/bbc-text-preprocessed-train.csv', index=False)
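
To make the effect of the binarizer concrete, here is a minimal pure-Python sketch of the same one-vs-rest encoding. The binarize helper and the three sample labels are hypothetical and exist only for illustration; the category names and the _i suffix come from the post. It mirrors what a label binarizer produces when there are three or more classes.

```python
def binarize(labels):
    """Return (classes, rows) with one 0/1 column per class, suffixed with _i."""
    classes = sorted(set(labels))
    rows = [{c + '_i': int(label == c) for c in classes} for label in labels]
    return classes, rows

# Made-up sample labels, for illustration only
sample = ['tech', 'sport', 'business']
classes, rows = binarize(sample)

for label, row in zip(sample, rows):
    print(label, row)
```

Each document ends up with exactly one column set to 1, and each of these columns can then serve as the outcome field of one binary model.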

Training and Building Models

With the training set indexed [1], the following streaming expression can be used to train the model and store it in Solr:

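stream commit(bbc-text-model, update(bbc-text-model, batchSize=500,
                  train(bbc-text-train,
                        features(bbc-text-train, q="*:*", featureSet="featureSet", field="text_t", outcome="tech_i", numTerms=25),
                        q="*:*", name="bbc-text-tech-classification-model", field="text_t", outcome="tech_i")))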

The above expression extracts features from the training set, trains a model using these features, and commits the result to the bbc-text-model collection under the name bbc-text-tech-classification-model. Because the labels are binarized, the result is an ensemble of models with one binary classifier per class; in practice, the expression is run once per category, each time with the matching outcome field and model name.
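
Running the expression once per category can be scripted. The sketch below only builds the expression strings; it does not talk to Solr. It assumes that the train call wraps the features call in the shape documented in the Solr reference guide, and that the other four models follow the same naming pattern as the tech model. Both are assumptions, as are the TEMPLATE and training_expression names.

```python
CATEGORIES = ['tech', 'business', 'entertainment', 'sport', 'politics']

# Assumed expression shape: train(...) wrapped around features(...).
TEMPLATE = (
    'commit(bbc-text-model, update(bbc-text-model, batchSize=500, '
    'train(bbc-text-train, '
    'features(bbc-text-train, q="*:*", featureSet="featureSet", '
    'field="text_t", outcome="{outcome}", numTerms=25), '
    'q="*:*", name="{name}", field="text_t", outcome="{outcome}")))'
)

def training_expression(category):
    """Build the training expression for one binary (one-vs-rest) model."""
    # Hypothetical naming pattern, extrapolated from the tech model's name
    name = 'bbc-text-' + category + '-classification-model'
    return TEMPLATE.format(outcome=category + '_i', name=name)

expressions = {c: training_expression(c) for c in CATEGORIES}
print(expressions['sport'])
```

Each generated string can then be pasted into a notebook paragraph after the stream keyword.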

Document Classification

The next step is to use the models built to classify documents. The following streaming expression applies the technology model to the test collection:

stream classify(model(bbc-text-model, id="bbc-text-tech-classification-model"),
    search(bbc-text-test, q="*:*", fl="text_t, id, category_s", sort="id desc", rows=50),
    field="text_t")

The output of the above stream is one tuple per classified document, carrying two additional fields: probability_d, which is useful for classifying the document, and score_d, which is useful for ranking. In our case, every binary model is applied to each document and the assigned class is the one whose model returns the highest probability.
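
The highest-probability rule can be sketched in a few lines of plain Python. The probability values below are made up for a single hypothetical document, and assign_category is an illustrative helper, not part of Solr.

```python
def assign_category(probabilities):
    """Pick the category whose one-vs-rest classifier is most confident."""
    return max(probabilities, key=probabilities.get)

# Made-up probability_d values for one document, one per binary model
doc_probabilities = {
    'tech': 0.81,
    'business': 0.12,
    'entertainment': 0.03,
    'sport': 0.02,
    'politics': 0.07,
}

predicted = assign_category(doc_probabilities)
print(predicted)
```

Because the five models are trained independently, their probabilities need not sum to 1; only their relative order matters for the decision.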

Learning Curves

Learning curves can be plotted using the stored model as follows:

search q=*:*&collection=bbc-text-model&fl=name_s,trueNegative_i,truePositive_i,falseNegative_i,falsePositive_i,iteration_i&sort=iteration_i%20asc&fq=name_s:bbc-text-tech-classification-model&rows=10

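The query above returns the confusion matrix counts (true and false positives and negatives) stored for the technology model, one tuple per training iteration, for the first 10 iterations. Inside a Zeppelin notebook, the following line chart is obtained for these results: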

Technology Class Learning Curve
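
One way to condense the four counts into a single curve is to compute accuracy per iteration. The sketch below assumes the tuples have been exported as dictionaries keyed by the field names used in the query; the two sample tuples are made up, and accuracy_per_iteration is a hypothetical helper.

```python
def accuracy_per_iteration(tuples):
    """Return (iteration, accuracy) pairs ordered by iteration."""
    points = []
    for t in sorted(tuples, key=lambda t: t['iteration_i']):
        correct = t['truePositive_i'] + t['trueNegative_i']
        total = correct + t['falsePositive_i'] + t['falseNegative_i']
        points.append((t['iteration_i'], correct / total))
    return points

# Made-up tuples, deliberately out of order
sample = [
    {'iteration_i': 2, 'truePositive_i': 30, 'trueNegative_i': 150,
     'falsePositive_i': 10, 'falseNegative_i': 10},
    {'iteration_i': 1, 'truePositive_i': 20, 'trueNegative_i': 140,
     'falsePositive_i': 20, 'falseNegative_i': 20},
]

curve = accuracy_per_iteration(sample)
print(curve)
```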

Error Decay

Here, we plot error versus iterations. The following code can be used to plot error curves:

search q=*:*&collection=bbc-text-model&fl=name_s,error_d,iteration_i&sort=iteration_i%20asc&rows=400&fq=iteration_i:[2%20TO%20*]

Note: the error at iteration 1 is very high compared to later iterations, which squashes the rest of the curve and makes it hard to read. So, in the above query, the error curves start from iteration 2.

Error Decay Curve


Many metrics can be used to evaluate classifiers; the most popular ones are covered below.

Confusion Matrix

A confusion matrix tabulates, for each true class, how many documents were assigned to each predicted class. Solr stores the elements of the matrix along with the generated model. However, due to a glitch at the time of writing this post, the stored values are always lower than the actual ones. As a workaround, the classification results can be exported and the calculations done in Python; once the issue is addressed, the Solr interpreter can be used directly.
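
As a small illustration of what the matrix contains, the sketch below counts (true, predicted) pairs in plain Python. The labels and predictions are made up, and confusion_counts is a hypothetical helper that follows the usual convention of rows for true classes and columns for predicted classes.

```python
def confusion_counts(true_labels, predicted_labels, labels):
    """Return a matrix where cell [i][j] counts documents of true class
    labels[i] that were predicted as labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix

# Made-up labels and predictions, for illustration only
labels = ['tech', 'sport']
y_true = ['tech', 'tech', 'sport', 'sport', 'sport']
y_pred = ['tech', 'sport', 'sport', 'sport', 'tech']

matrix = confusion_counts(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

Correct predictions land on the diagonal, so a good classifier produces a matrix that is heavy on the diagonal and sparse elsewhere.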

To export the classification results, use Zeppelin's export functionality, located at the top right corner of table views. Once the results for all categories are exported, they can be moved into the Zeppelin container for processing using docker cp or a mounted volume.

Zeppelin Table Export and View Options

The Python code below loads the classification results into Pandas data frames, combines them and calculates the maximum probability and the corresponding category:

import pandas as pd
from sklearn.metrics import confusion_matrix

categories = ['tech', 'business', 'entertainment', 'sport', 'politics']

df = {}
probabilities = ['probability_tech_d']
df['combined'] = pd.read_csv('~/classification-results/tech.csv')
df['combined'] = df['combined'].drop(['score_d'], axis=1)
df['combined'] = df['combined'].rename({'probability_d': 'probability_tech_d'}, axis='columns')
for c in categories:
    if c == 'tech':
        continue  # tech.csv is already loaded as the base of df['combined']
    df[c] = pd.read_csv('~/classification-results/' + c + '.csv')
    df[c] = df[c].drop(['text_t', 'category_s', 'score_d'], axis=1)
    df[c] = df[c].rename({'probability_d': 'probability_' + c + '_d'}, axis='columns')
    df['combined'] = df['combined'].merge(df[c], on='id')
    probabilities.append('probability_' + c + '_d')

df['combined']['probability_max_d'] = df['combined'][probabilities].max(axis=1)
df['combined']['probability_max_s'] = df['combined'][probabilities].idxmax(axis=1)
df['combined']['predicted_category_s'] = df['combined']['probability_max_s'].apply(lambda x: x.replace('probability_', '').replace('_d', ''))

print('Probabilities combined')

Based on the preceding data frame, the following paragraph calculates and visualizes the confusion matrix inside Zeppelin:

import numpy as np
import matplotlib.pyplot as plt

cm = confusion_matrix(df['combined']['category_s'], df['combined']['predicted_category_s'], labels=categories)

fig, ax = plt.subplots()
# Blues colormap: darker cells hold higher counts
im = ax.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
ax.figure.colorbar(im, ax=ax)
# We want to show all ticks...
ax.set(xticks=np.arange(cm.shape[1]),
       yticks=np.arange(cm.shape[0]),
       # ... and label them with the category names
       xticklabels=categories, yticklabels=categories,
       title='BBC Text Classification Confusion Matrix',
       ylabel='True category',
       xlabel='Predicted category')

ax.set_xticks(np.arange(cm.shape[1]+1)-.5, minor=True)
ax.set_yticks(np.arange(cm.shape[0]+1)-.5, minor=True)
# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")

# Loop over data dimensions and create text annotations.
fmt = 'd'
thresh = cm.max() / 2.
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(j, i, format(cm[i, j], fmt),
                ha="center", va="center",
                color="white" if cm[i, j] > thresh else "black")

The result plotted inside the notebook should be a bluish heat map with the document count written in each cell.


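Several metrics can be derived from the confusion matrix. The most common ones are accuracy, precision, recall, and f1-score, defined below, where tp, tn, fp, and fn are the numbers of true positives, true negatives, false positives, and false negatives: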

    \[ accuracy=\frac{tp + tn}{tp + fp + tn + fn} \]

    \[ precision=\frac{tp}{tp + fp} \]

    \[ recall=\frac{tp}{tp + fn} \]

    \[ f1=2\cdot\frac{precision\cdot recall}{precision+recall} \]
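
The four formulas translate directly into code. The sketch below applies them to made-up counts for a single binary (one-vs-rest) classifier; binary_metrics is a hypothetical helper and does not guard against zero denominators.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and f1 from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Made-up counts, for illustration only
accuracy, precision, recall, f1 = binary_metrics(tp=40, fp=10, tn=140, fn=10)
print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
```

Note how accuracy is higher than precision and recall here: the large number of true negatives inflates it, which is why the other three metrics are usually reported alongside it.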

The code snippet below computes these metrics using scikit-learn's classification_report function:

from sklearn.metrics import classification_report

print(classification_report(df['combined']['category_s'], df['combined']['predicted_category_s']))

Here’s an instance of the above metrics for the classification problem at hand:

               precision    recall  f1-score   support      
business            0.94      0.91      0.93        55 
entertainment       0.97      0.94      0.95        33      
politics            0.87      0.95      0.91        41         
sport               0.96      0.98      0.97        44          
tech                0.96      0.92      0.94        53      

accuracy                                0.94       226     
macro avg           0.94      0.94      0.94       226  
weighted avg        0.94      0.94      0.94       226
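
The two summary rows can be checked by hand from the per-class rows. The sketch below recomputes the macro average (plain mean over classes) and the weighted average (mean weighted by support) from the figures in the report above. Since those figures are already rounded to two decimals, the results only approximate what scikit-learn computes internally; the variable and function names are illustrative.

```python
# (precision, recall, f1-score, support) per class, copied from the report above
report = {
    'business':      (0.94, 0.91, 0.93, 55),
    'entertainment': (0.97, 0.94, 0.95, 33),
    'politics':      (0.87, 0.95, 0.91, 41),
    'sport':         (0.96, 0.98, 0.97, 44),
    'tech':          (0.96, 0.92, 0.94, 53),
}
PRECISION, RECALL, F1, SUPPORT = range(4)

total_support = sum(row[SUPPORT] for row in report.values())

def macro_avg(column):
    """Unweighted mean of a metric over the classes."""
    return sum(row[column] for row in report.values()) / len(report)

def weighted_avg(column):
    """Mean of a metric over the classes, weighted by class support."""
    return sum(row[column] * row[SUPPORT] for row in report.values()) / total_support

for name, column in (('precision', PRECISION), ('recall', RECALL), ('f1-score', F1)):
    print(name, round(macro_avg(column), 2), round(weighted_avg(column), 2))
```

The supports add up to the 226 test documents, and all six averages round to the 0.94 shown in the report.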

1. Solr's dynamic field naming convention is assumed (suffixes such as _t, _s, _i, and _d select the field type).
