Classification is one of the most popular tasks in Natural Language Processing and Machine Learning. Solr ships with features, a subset of Streaming Expressions features, that allows building and deploying statistical classification models out-of-the-box. With adequate preprocessing and indexing tweaks, these features can be used to classify documents quickly and with high accuracy. This post illustrates how Solr streaming expressions and Zeppelin notebooks can be used to build a document classifier.
The concept of data science notebooks has been around for a while. Notebooks are web interfaces that allow creating and sharing live code, equations, visualizations and narrative text. They exist somewhere in data science workflows to serve data cleaning, transformation, numerical simulation, statistical modeling, data visualization and even machine learning. In a Python environment, Jupyter is prominent. In Java or Scala environment, Apache Zeppelin fits seamlessly. Though Jupyter can be used with a Java kernel and Zeppelin can be used with a Python interpreter, each one natively belongs to its own stack.