TL;DR if you have a raw dataset or a data indexed into Apache Solr, a meaningful analytics dashboard that gives insights and useful graphical and tabular information can be built in minutes.
Steps
1. Install Solr
2. Start Solr and create a new collection
3. Download and modify the dataset
4. Use NLTK stop words
5. Index the dataset
6. Clone and install Banana
7. Configure Banana
8. Create the dashboard and panels
1. Installing Apache Solr
Download the latest version of Apache Solr from Apache mirrors (https://lucene.apache.org/solr/downloads.html) and extract it to a local directory or to a directory on a server if you plan to run this tutorial on a server. We will assume that Solr is extracted to $SOLR_HOME
directory.
2. Updating Stopwords File
Out of the box, Solr comes with a stop words lists for several languages, however a more comprehensive stop words list, for English in this tutorial, is included with NLTK Natural Language Processing library. For this tutorial, download the stop words list from https://gist.github.com/sebleier/554280 and copy the file to _default/stopwords.txt under configsets directory:
$ cd $SOLR_HOME/server/solr/configsets/_default/conf
$ curl https://gist.githubusercontent.com/sebleier/554280/raw/7e0e4a1ce04c2bb7bd41089c9821dbcf6d0c786c/NLTK\'s%2520list%2520of%2520english%2520stopwords > nltk_stopwords.txt
$ cp nltk_stopwords.txt stopwords.txt
3. Starting Solr and Creating a New Collection
Start Solr in cloud mode:
$ cd $SOLR_HOME
$ bin/solr start -c
Next, create a new collection to be used for indexing the data:
$ bin/solr create -c greads
4. Indexing the Dataset
For this tutorial, Goodreads books dataset will be used. Download the dataset from Kaggle in CSV format and update column names to match dynamic fields schema settings in Solr. Dynamic fields is a feature that allows inference of fields data types based on a postfix naming convention. The most common / basic postfixes are:
- _i and _l: integer and long respectively
- _f and _d: float and double respectively
- _s and _t: string and text respectively
Note: adding s
to any of these postfixes creates a multi-value field of the same data type instead of a single-value field.
So, update column names in the CSV file as follows:
- bookID_i
- title_t
- authors_s
- average_rating_f
- isbn_s isbn13_s
- language_code_s
- num_pages_i
- ratings_count_i
- text_reviews_count_i
5. Indexing the Dataset Using Out-of-the-box Post Command Line
To index the dataset into the collection created in step 3, the post
command can be used as follows:
$ bin/post -c greads /path/to/data/books.csv
Note: the dataset contains 5 records that have invalid average_rating_f
values. Removing these records is necessary for the indexing to complete successfully.
6. Installing Banana
Now we have the data indexed into Solr, we are ready to build the dashboard. Clone Banana from GitHub repository and follow the installation instructions in the Readme file.
7. Configuring The Dashboard
Navigate to http://localhost:8983/solr/banana/index.html. If the installation is successful, the default dashboard should be displayed inside the browser window. Click the top right gear and select Solr tab then change collection name value to greads. Since the data set is not a timeseries, remove the timepicker panel from the second row and the histogram from the fourth row. The timepicker filter also needs to be removed as well from the third row.
8. Creating Panels
We will add several panels to the dashboard that give some insights of Goodreads dataset, such as:
- who the top authors are based on reviews count and ratings
- how languages rank in terms of the number of pages authored and ratings
- what the longest and shortest books are, in what language and their ISBN’s and titles.
Many other panels panels can be created to obtain more insights as well!
Let’s start by creating a faceting panel for authors and languages so that analytics for a specific author or language can be obtained. Add a new row to the dashboard then click the plus sign to add a new panel and select “facet” from the drop down menu. In “Add Facet Field” text box, add authors_s
and language_code_s
fields.
Next, let’s create a terms panels to provide some insights about ratings, reviews and number of pages per author or language. In an empty block on the dashboard click the plus sign to add a new panel then select “terms” from the dropdown menu that appears. Enter the following values for “Field”, “Mode”, and “Stats field” respectively in each terms panel:
- authors_s, mean, average_rating_f: average ratings for authors
- authors_s, sum, and text_reviews_count_i: total reviews counts for authors
- language_code_s, mean, __num_pages_i: average number of pages per book per language
- language_code_s, mean, ratings_count_i: the average ratings count per language
- authors_s, mean, __num_pages_i: the average number of pages per author
Getting some descriptive statistic measures can be useful. These measures can be calculated and displayed by adding a Hits panel to the dashboard. Click the plus sign on the dashboard to add a new panel and select “hits” from the dropdown then add the following metrics:
Stats Type | Field | Decimal Digits | Label |
count | id | 0 | Total Number of Books |
mean | average_rating_f | 2 | Average Rating |
stddev | average_rating_f | 2 | Rating Stan- dard Deviation |
The next step is creating a word cloud panel that displays books titles most frequent words. Straightforward enough, add a new panel and select “tag cloud” from the available panels and select title_s
in the field.
The final step is adding a table panel that displays some of the dataset fields. Table panel allows sorting and inspecting records, features that we will use to identify the longest and shortest books in the dataset based on the number of pages. Click the plus sign to add a new panel, select “table” and add these columns to the table:
- author_s
- isbn_s
- __num_pages_i
- title_t
- text_reviews_count_i
- language_code_s
VoilĂ ! All panels have been added and the dashboard is ready to be used for exploring the dataset.
To get the longest book in the dataset, click __num_of_pages_i
column to sort rows in a descending order with respect to the number of pages. The longest book turns out to be “The Complete Aubrey/Maturin Novels (5 Volumes)” by Patrick O’Brian consisting of 6,756 pages. To get the shortest book, click again on the same column. We find that there are many books with 0, 1, 2, … pages in the dataset. Most of these books could apparently either audio books, digital books or the number of pages was entered incorrectly.
The dashboard is dynamically interactive which mean filtering or faceting reflects on the whole dashboard. This enables the user to get real-time insights for the selected criteria. For example, clicking on “Terry Brooks“ bar on any of the terms panels adapts the dashboard to filter and display all panels for Terry Brooks, in other words it creates a Terry Brooks dashboard on the fly. For instance, by inspecting the “Language” facet or “Language Average Pages” panels, it turns out that Terry Brooks wrote books in English, German and Dutch.
Using Banana on top of Solr installation gives users a powerful analytics layer that they can use to create dashboards to explore datasets, slice-and-dice, and achieve many data analytics tasks.