Building a Dynamic Analytics Dashboard with Apache Solr and Banana in 10 Minutes

TL;DR if you have a raw dataset or a data indexed into Apache Solr, a meaningful analytics dashboard that gives insights and useful graphical and tabular information can be built in minutes.

Steps

1. Install Solr

2. Start Solr and create a new collection

3. Download and modify the dataset

4. Use NLTK stop words

5. Index the dataset

6. Clone and install Banana

7. Configure Banana

8. Create the dashboard and panels

1. Installing Apache Solr

Download the latest version of Apache Solr from Apache mirrors (https://lucene.apache.org/solr/downloads.html) and extract it to a local directory or to a directory on a server if you plan to run this tutorial on a server. We will assume that Solr is extracted to $SOLR_HOME directory.

2. Updating Stopwords File

Out of the box, Solr comes with a stop words lists for several languages, however a more comprehensive stop words list, for English in this tutorial, is included with NLTK Natural Language Processing library. For this tutorial, download the stop words list from https://gist.github.com/sebleier/554280 and copy the file to _default/stopwords.txt under configsets directory:

$ cd $SOLR_HOME/server/solr/configsets/_default/conf
$ curl https://gist.githubusercontent.com/sebleier/554280/raw/7e0e4a1ce04c2bb7bd41089c9821dbcf6d0c786c/NLTK\'s%2520list%2520of%2520english%2520stopwords > nltk_stopwords.txt
$ cp nltk_stopwords.txt stopwords.txt

3. Starting Solr and Creating a New Collection

Start Solr in cloud mode:

$ cd $SOLR_HOME
$ bin/solr start -c

Next, create a new collection to be used for indexing the data:

$ bin/solr create -c greads

4. Indexing the Dataset

For this tutorial, Goodreads books dataset will be used. Download the dataset from Kaggle in CSV format and update column names to match dynamic fields schema settings in Solr. Dynamic fields is a feature that allows inference of fields data types based on a postfix naming convention. The most common / basic postfixes are:

_i and _l: integer and long respectively
_f and _d: float and double respectively
_s and _t: string and text respectively

Note: adding s to any of these postfixes creates a multi-value field of the same data type instead of a single-value field.

So, update column names in the CSV file as follows:

bookID_i
title_t
authors_s
average_rating_f
isbn_s isbn13_s
language_code_s
num_pages_i
ratings_count_i
text_reviews_count_i

5. Indexing the Dataset Using Out-of-the-box Post Command Line

To index the dataset into the collection created in step 3, the post command can be used as follows:

$ bin/post -c greads /path/to/data/books.csv

Note: the dataset contains 5 records that have invalid average_rating_f values. Removing these records is necessary for the indexing to complete successfully.

6. Installing Banana

Now we have the data indexed into Solr, we are ready to build the dashboard. Clone Banana from GitHub repository and follow the installation instructions in the Readme file.

7. Configuring The Dashboard

Navigate to http://localhost:8983/solr/banana/index.html. If the installation is successful, the default dashboard should be displayed inside the browser window. Click the top right gear and select Solr tab then change collection name value to greads. Since the data set is not a timeseries, remove the timepicker panel from the second row and the histogram from the fourth row. The timepicker filter also needs to be removed as well from the third row.

8. Creating Panels

We will add several panels to the dashboard that give some insights of Goodreads dataset, such as:

who the top authors are based on reviews count and ratings
how languages rank in terms of the number of pages authored and ratings
what the longest and shortest books are, in what language and their ISBN’s and titles.

Many other panels panels can be created to obtain more insights as well!

Let’s start by creating a faceting panel for authors and languages so that analytics for a specific author or language can be obtained. Add a new row to the dashboard then click the plus sign to add a new panel and select “facet” from the drop down menu. In “Add Facet Field” text box, add authors_s and language_code_s fields.

Next, let’s create a terms panels to provide some insights about ratings, reviews and number of pages per author or language. In an empty block on the dashboard click the plus sign to add a new panel then select “terms” from the dropdown menu that appears. Enter the following values for “Field”, “Mode”, and “Stats field” respectively in each terms panel:

authors_s, mean, average_rating_f: average ratings for authors
authors_s, sum, and text_reviews_count_i: total reviews counts for authors
language_code_s, mean, __num_pages_i: average number of pages per book per language
language_code_s, mean, ratings_count_i: the average ratings count per language
authors_s, mean, __num_pages_i: the average number of pages per author

Getting some descriptive statistic measures can be useful. These measures can be calculated and displayed by adding a Hits panel to the dashboard. Click the plus sign on the dashboard to add a new panel and select “hits” from the dropdown then add the following metrics:

Stats Type	Field	Decimal Digits	Label
count	id	0	Total Number of Books
mean	average_rating_f	2	Average Rating
stddev	average_rating_f	2	Rating Stan- dard Deviation

The next step is creating a word cloud panel that displays books titles most frequent words. Straightforward enough, add a new panel and select “tag cloud” from the available panels and select title_s in the field.

The final step is adding a table panel that displays some of the dataset fields. Table panel allows sorting and inspecting records, features that we will use to identify the longest and shortest books in the dataset based on the number of pages. Click the plus sign to add a new panel, select “table” and add these columns to the table:

author_s
isbn_s
__num_pages_i
title_t
text_reviews_count_i
language_code_s

Voilà! All panels have been added and the dashboard is ready to be used for exploring the dataset.

To get the longest book in the dataset, click __num_of_pages_i column to sort rows in a descending order with respect to the number of pages. The longest book turns out to be “The Complete Aubrey/Maturin Novels (5 Volumes)” by Patrick O’Brian consisting of 6,756 pages. To get the shortest book, click again on the same column. We find that there are many books with 0, 1, 2, … pages in the dataset. Most of these books could apparently either audio books, digital books or the number of pages was entered incorrectly.

The dashboard is dynamically interactive which mean filtering or faceting reflects on the whole dashboard. This enables the user to get real-time insights for the selected criteria. For example, clicking on “Terry Brooks“ bar on any of the terms panels adapts the dashboard to filter and display all panels for Terry Brooks, in other words it creates a Terry Brooks dashboard on the fly. For instance, by inspecting the “Language” facet or “Language Average Pages” panels, it turns out that Terry Brooks wrote books in English, German and Dutch.

Using Banana on top of Solr installation gives users a powerful analytics layer that they can use to create dashboards to explore datasets, slice-and-dice, and achieve many data analytics tasks.

Leave a Reply Cancel reply