The concept of data science notebooks has been around for a while. Notebooks are web interfaces for creating and sharing live code, equations, visualizations and narrative text. They appear throughout data science workflows, serving data cleaning, transformation, numerical simulation, statistical modeling, data visualization and even machine learning. In a Python environment, Jupyter is prominent. In a Java or Scala environment, Apache Zeppelin fits seamlessly. Though Jupyter can be used with a Java kernel and Zeppelin can be used with a Python interpreter, each one natively belongs to its own stack.
Installing Zeppelin and Solr Interpreter
Zeppelin interpreters allow languages or data processing backends to be plugged into Zeppelin. More broadly, Zeppelin's extensibility is provided through Helium, a plugin system that can extend Zeppelin with components including interpreters. Other components that can be plugged in are spells, visualizations and even whole applications.
In this post, we are going to explore creating computational notebooks for data stored in Solr, which can be done in a few steps. If you are running Solr, all that is required to start creating notebooks for existing collections is obtaining Zeppelin [1] and installing the Solr interpreter:
$ docker run -p 8080:8080 -d --name zeppelin apache/zeppelin:0.8.1
$ docker exec zeppelin ./bin/install-interpreter.sh --name solr --artifact com.lucidworks.zeppelin:zeppelin-solr:0.1.5
$ docker restart zeppelin # to load the interpreter
Navigate to http://<zeppelin-host>:8080 and the Zeppelin welcome page should appear.
To create a Solr interpreter, click the “anonymous” menu on the top right, then click “Interpreter”. Click the “Create” button on the top right and enter parameters as follows:
Note: the default value of
Creating the Notebook
For the purpose of this post, the Amazon Cell Phones Reviews dataset will be used. Trivial preprocessing, such as renaming fields and splitting the prices field, is required before indexing. Once the data is indexed into two collections, items and reviews, a notebook can be created.
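The preprocessing mentioned above is not shown in this post, so the following is only a minimal sketch of what it could look like for one row of the items data. The raw column names (asin, brand, title, rating, totalReviews, prices), the "$"-prefixed price format and the helper names are assumptions for illustration; the target field names follow the ones used in the queries later in this post.

```python
import re

# Assumed mapping from raw column names to the Solr field names used in this post.
ITEM_FIELDS = {
    "asin": "asin_s",
    "brand": "brand_s",
    "title": "title_s",
    "rating": "rating_f",
    "totalReviews": "totalReviews_i",
}

# Convert values according to the dynamic-field suffix; everything else stays a string.
CASTS = {"_f": float, "_i": int}


def split_prices(raw):
    """Extract prices from a string such as "$99.99,$149.99" (assumed format)."""
    matches = re.findall(r"\$\s*([\d,]+(?:\.\d+)?)", raw or "")
    # Drop thousands separators (and any trailing list comma) before converting.
    return [float(m.replace(",", "")) for m in matches]


def to_solr_doc(row):
    """Rename the fields of one raw row and split its prices into price1_f, price2_f, ..."""
    doc = {}
    for src, dst in ITEM_FIELDS.items():
        value = row.get(src)
        if value in (None, ""):
            continue  # skip missing values instead of indexing empty fields
        cast = CASTS.get(dst[-2:], str)
        doc[dst] = cast(value)
    for n, price in enumerate(split_prices(row.get("prices")), start=1):
        doc[f"price{n}_f"] = price
    return doc
```

Each resulting dictionary can then be sent to Solr with whatever indexing client is already in use. Numbering the prices (price1_f, price2_f) keeps every field single-valued, which is what a filter such as price1_f > 0 relies on.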
Each cell inside the notebook should start with %solr in order to bind the cell to the Solr interpreter and run commands. Currently, there are 5 commands that are supported:
sql. The last two commands invoke the powerful Solr Streaming Expressions and Parallel SQL features that enable creating computations for Zeppelin.
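For readers who want to test a statement or expression outside Zeppelin, Streaming Expressions and Parallel SQL are also reachable over HTTP through Solr's /stream and /sql request handlers. The sketch below only builds and sends such requests; it does not claim to mirror what the interpreter does internally. The base URL and the helper names are assumptions.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SOLR_URL = "http://localhost:8983/solr"  # assumption: a local Solr on the default port


def build_request(collection, handler, param, value, base_url=SOLR_URL):
    """Build a form-encoded POST request for one of Solr's request handlers."""
    body = urlencode({param: value}).encode("utf-8")
    return Request(f"{base_url}/{collection}/{handler}", data=body, method="POST")


def stream_request(collection, expr, base_url=SOLR_URL):
    """Streaming Expressions: the expression goes in the 'expr' parameter of /stream."""
    return build_request(collection, "stream", "expr", expr, base_url)


def sql_request(collection, stmt, base_url=SOLR_URL):
    """Parallel SQL: the statement goes in the 'stmt' parameter of /sql."""
    return build_request(collection, "sql", "stmt", stmt, base_url)


def run(request):
    """Send the request and return the tuples from the 'result-set' envelope."""
    with urlopen(request) as resp:
        return json.load(resp)["result-set"]["docs"]
```

For example, run(stream_request("reviews", 'significantTerms(reviews, q="body_t:battery", field="body_t")')) would send the significant terms expression used later in this post to a running Solr instance. The last tuple of a response is an EOF marker rather than a data row.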
The first command that should be used is use, which sets the default collection for any later command in the notebook that does not mention a collection:
Adding Computational Paragraphs
To visualize the rating distribution:
To calculate average rating by brand:
SELECT avg(reviews.rating_f) as rating, brand_s FROM reviews INNER JOIN items ON reviews.asin_s = items.asin_s GROUP BY items.brand_s ORDER BY rating DESC LIMIT 20
Running a Significant Terms query on the reviews collection for the search keyword “battery”:
stream significantTerms(reviews, q="body_t:battery", field="body_t")
Several of the significant terms suggest that many reviews mention phone batteries draining quickly or dying sooner than they should.
Top products with highest ratings:
SELECT avg(reviews.rating_f) as rating, items.title_s FROM reviews INNER JOIN items ON reviews.asin_s = items.asin_s GROUP BY items.title_s ORDER BY rating DESC LIMIT 20
Total reviews count by brand:
SELECT sum(totalReviews_i) AS s, brand_s FROM items GROUP BY brand_s ORDER BY s DESC
Products count by brand:
Total helpful reviews by brand:
SELECT sum(helpfulVotes_i) as s, brand_s FROM reviews INNER JOIN items ON reviews.asin_s = items.asin_s GROUP BY brand_s
Distribution of the verified field:
SELECT count(*) as v, verified_s FROM reviews GROUP BY verified_s
Average rating for each verified state:
SELECT verified_s, avg(rating_f) as rating FROM reviews GROUP BY verified_s
Search for reviews that mention the keyword “screen”:
Product base price and rating:
SELECT price1_f, brand_s, rating_f, totalReviews_i FROM items WHERE price1_f > 0
1. At the time of writing, the Docker image for Zeppelin 0.8.2 does not start or run as expected.