The concept of data science notebooks has been around for a while. Notebooks are web interfaces that allow creating and sharing live code, equations, visualizations and narrative text. They exist somewhere in data science workflows to serve data cleaning, transformation, numerical simulation, statistical modeling, data visualization and even machine learning. In a Python environment, Jupyter is prominent. In Java or Scala environment, Apache Zeppelin fits seamlessly. Though Jupyter can be used with a Java kernel and Zeppelin can be used with a Python interpreter, each one natively belongs to its own stack.
Installing Zeppelin and Solr Interpreter
Zeppelin interpreters allow languages or data processing backends to be plugged into Zeppelin. However, Zeppelin extensibility is designed through Helium which is a plugin system that can extend Zeppelin with components including interpreters. Other components that can be plugged are spells, visualizations and even whole applications.
In this post, we are going to explore creating computational notebooks for data stored in Solr which can be done in a few steps. If you are running Solr, all what is required to start creating notebooks for existing collections is obtaining Zeppelin1 and installing the Solr interpreter:
$ docker run -p 8080:8080 -d --name zeppelin apache/zeppelin:0.8.1
$ docker exec zeppelin ./bin/install-interpreter.sh --name solr --artifact com.lucidworks.zeppelin:zeppelin-solr:0.1.5
$ docker restart zeppelin # to load the interpreter
Navigate to http://<zeppelin-host>:8080 and Zeppelin welcome page should appear:
To create a Solr interpreter, click “anonymous” menu on the top right then click “Interpreter”. Click “Create” button on the top right and enter parameters as follows:
Note: the default value of solr.baseUrl
is http://localhost:8983/solr
Creating the Notebook
For the purpose of this post, the Amazon Cell Phones Reviews dataset will be used. Trivial preprocessing, such as renaming fields and splitting prices field, is required before indexing. Once indexed into two collections, items
and reviews
, a notebook can be created.
Each cell inside the notebook should start with %solr
spell in order to bind to the notebook to Solr interpreter and run commands. Currently, there are 5 commands that are supported: use
, search
, facet
, stream
and sql
. The last two commands invoke the powerful Solr Streaming Expressions and Parallel SQL features that enable creating computations for Zeppelin.
The first command that should be used is use
which generally sets the default collection to be used throughout the notebook where no collection is mentioned:
%solr
use reviews
Adding Computational Paragraphs
To visualize rating distribution:
%solr
facet q=*:*&facet=true&facet.field=rating_f
To calculate average rating by brand:
%solr
SELECT avg(reviews.rating_f) as rating, brand_s FROM reviews INNER JOIN items ON reviews.asin_s = items.asin_s GROUP BY items.brand_s ORDER BY rating DESC LIMIT 20
Conducting a reviews Significant Terms query for the the search keyword “battery”:
%solr
stream significantTerms(reviews, q="body_t:battery", field="body_t")
It’s noted that several significant terms suggest that many reviews touch the point that phone batteries drain quickly or die before they usually should.
Top products with highest ratings:
%solr
SELECT avg(reviews.rating_f) as rating, items.title_s FROM reviews INNER JOIN items ON reviews.asin_s = items.asin_s GROUP BY items.title_s ORDER BY rating DESC LIMIT 20
Total reviews count by brand:
%solr
SELECT sum(totalReviews_i) AS s, brand_s FROM items GROUP BY brand_s ORDER BY s DESC
Products count by brand:
%solr
facet q=*:*&facet=true&facet.field=brand_s&collection=items
Total helpful reviews by brand:
%solr
SELECT sum(helpfulVotes_i) as s, brand_s FROM reviews INNER JOIN items ON reviews.asin_s = items.asin_s GROUP BY brand_s
Verified field distribution and average ratings:
%solr
SELECT count(*) as v, verified_s FROM reviews GROUP BY verified_s
Average rating for each verified state:
%solr
SELECT verified_s, avg(rating_f) as rating FROM reviews GROUP BY verified_s
Search for reviews that mention keyword “screen”:
%solr
search q=body_t:screen&fl=asin_s,name_s,title_s,body_t,verified_s,rating_f,helpfulVotes_i,date_s&rows=20
Product base price and rating:
%solr
SELECT price1_f, brand_s, rating_f, totalReviews_i FROM items WHERE price1_f > 0
1. As of writing this post, Zeppelin version 0.8.2 docker image does not start or run as expected
One reply on “Zeppelin Notebooks and Solr”
Thank for your article! I had problems with docker setup because of maven repo http disabling, so I had to set env var for https repo:
“`
docker run -e ZEPPELIN_INTERPRETER_DEP_MVNREPO=https://repo1.maven.org/maven2 –name zeppelin -d apache/zeppelin:0.8.2
docker exec zeppelin ./bin/install-interpreter.sh –name solr –artifact com.lucidworks.zeppelin:zeppelin-solr:0.1.6
“`