Categories
data science novice

Zeppelin Notebooks and Solr

The concept of data science notebooks has been around for a while. Notebooks are web interfaces that allow creating and sharing live code, equations, visualizations and narrative text. They exist somewhere in data science workflows to serve data cleaning, transformation, numerical simulation, statistical modeling, data visualization and even machine learning. In a Python environment, Jupyter is prominent. In Java or Scala environment, Apache Zeppelin fits seamlessly. Though Jupyter can be used with a Java kernel and Zeppelin can be used with a Python interpreter, each one natively belongs to its own stack.

Apache Zeppelin

Installing Zeppelin and Solr Interpreter

Zeppelin interpreters allow languages or data processing backends to be plugged into Zeppelin. However, Zeppelin extensibility is designed through Helium which is a plugin system that can extend Zeppelin with components including interpreters. Other components that can be plugged are spells, visualizations and even whole applications.

In this post, we are going to explore creating computational notebooks for data stored in Solr which can be done in a few steps. If you are running Solr, all what is required to start creating notebooks for existing collections is obtaining Zeppelin1 and installing the Solr interpreter:

$ docker run -p 8080:8080 -d --name zeppelin apache/zeppelin:0.8.1
$ docker exec zeppelin ./bin/install-interpreter.sh --name solr --artifact com.lucidworks.zeppelin:zeppelin-solr:0.1.5
$ docker restart zeppelin # to load the interpreter

Navigate to http://<zeppelin-host>:8080 and Zeppelin welcome page should appear:

To create a Solr interpreter, click “anonymous” menu on the top right then click “Interpreter”. Click “Create” button on the top right and enter parameters as follows:

Solr Interpreter Parameters

Note: the default value of solr.baseUrl is http://localhost:8983/solr

Creating the Notebook

For the purpose of this post, the Amazon Cell Phones Reviews dataset will be used. Trivial preprocessing, such as renaming fields and splitting prices field, is required before indexing. Once indexed into two collections, items and reviews, a notebook can be created.

Each cell inside the notebook should start with %solr spell in order to bind to the notebook to Solr interpreter and run commands. Currently, there are 5 commands that are supported: use, search, facet, stream and sql. The last two commands invoke the powerful Solr Streaming Expressions and Parallel SQL features that enable creating computations for Zeppelin.

The first command that should be used is use which generally sets the default collection to be used throughout the notebook where no collection is mentioned:

%solr
use reviews

Adding Computational Paragraphs

To visualize rating distribution:

%solr
facet q=*:*&facet=true&facet.field=rating_f
Rating Distribution

To calculate average rating by brand:

%solr
SELECT avg(reviews.rating_f) as rating, brand_s FROM reviews INNER JOIN items ON reviews.asin_s = items.asin_s GROUP BY items.brand_s ORDER BY rating DESC LIMIT 20
Average Rating By Brand

Conducting a reviews Significant Terms query for the the search keyword “battery”:

%solr
stream significantTerms(reviews, q="body_t:battery", field="body_t")
“Battery” Significant Terms

It’s noted that several significant terms suggest that many reviews touch the point that phone batteries drain quickly or die before they usually should.

Top products with highest ratings:

%solr
SELECT avg(reviews.rating_f) as rating, items.title_s FROM reviews INNER JOIN items ON reviews.asin_s = items.asin_s GROUP BY items.title_s ORDER BY rating DESC LIMIT 20

Total reviews count by brand:

%solr
SELECT sum(totalReviews_i) AS s, brand_s FROM items GROUP BY brand_s ORDER BY s DESC
Total Reviews Count By Brand

Products count by brand:

%solr
facet q=*:*&facet=true&facet.field=brand_s&collection=items
Products Count By Brand

Total helpful reviews by brand:

%solr
SELECT sum(helpfulVotes_i) as s, brand_s FROM reviews INNER JOIN items ON reviews.asin_s = items.asin_s GROUP BY brand_s
Total Helpful Reviews By Brand

Verified field distribution and average ratings:

%solr
SELECT count(*) as v, verified_s FROM reviews GROUP BY verified_s
Verified Field Distribution

Average rating for each verified state:

%solr
SELECT verified_s, avg(rating_f) as rating FROM reviews GROUP BY verified_s
Average Rating By Verified State

Search for reviews that mention keyword “screen”:

%solr
search q=body_t:screen&fl=asin_s,name_s,title_s,body_t,verified_s,rating_f,helpfulVotes_i,date_s&rows=20
Reviews That Mentioned the Keyword “Screen”

Product base price and rating:

%solr
SELECT price1_f, brand_s, rating_f, totalReviews_i FROM items WHERE price1_f > 0
Product Base Price and Rating

1. As of writing this post, Zeppelin version 0.8.2 docker image does not start or run as expected

Leave a Reply

Your email address will not be published. Required fields are marked *