Categories
natural language processing novice

Lightweight Text Clustering with Solr

Clustering is one of the most common unsupervised Machine Learning tasks. Solr is shipped with a clustering module based on Carrot2 built-in algorithms. Carrot2 comes with 4 algorithms: Lingo, STC, kMeans and Lingo3D each one mapped to a clustering engine. The first three are open-source whereas the last one is commercial. When this approach is used, clustering takes place in memory. Other frameworks, such as Mahout, can be used to do the clustering “off-line.”

Categories
data science novice

Zeppelin Notebooks and Solr

The concept of data science notebooks has been around for a while. Notebooks are web interfaces that allow creating and sharing live code, equations, visualizations and narrative text. They exist somewhere in data science workflows to serve data cleaning, transformation, numerical simulation, statistical modeling, data visualization and even machine learning. In a Python environment, Jupyter is prominent. In Java or Scala environment, Apache Zeppelin fits seamlessly. Though Jupyter can be used with a Java kernel and Zeppelin can be used with a Python interpreter, each one natively belongs to its own stack.

Apache Zeppelin
Categories
data analytics devops novice

Realtime Log Analytics with Solr, Logstash, Banana and Beats

Logs are everywhere and usually generated in large sizes and high velocities. These logs can be used to obtain useful information and insights about the domain or the process related to these logs, such as platforms, transactions, system users, etc. In this post, a realtime web (Apache2) log analytics pipeline will be built using Apache Solr, Banana, Logstash and Beats containers.

However, in order to get the pipeline running, several integration aspects related to streaming data need to be addressed through settings and patches supplied through mounted volumes. The structure of these volumes can be as below:

Categories
data analytics novice

Introducing Graph Visualization in Banana v1.7

Graph traversal features have been introduced in Solr 6 releases. These powerful features enables Solr users to run expressions that traverses graph structures in order to introduce or extract useful information. These graph traversal features are particularly useful when data is already indexed into Solr and light graph operations are required especially on top of text search. Before proceeding, a basic knowledge of Solr and graph structures is required.

Solr traversal implementation uses Breadth First Search (BFS) to perform graph traversal which is more suitable for solving search problems than its counterpart Depth First Search (DFS). It is also possible to combine graph traversal with other search or streaming operations.

In this post, we are going to explore basic graph visualization introduced in Banana v1.7. To visualize a graph in Banana, there must be at least a collection indexed into Solr with two fields: from and to that represent the adjacency matrix or, in other words, the edges of the graph. Alternatively, two collections can be used to visualize the graph: a main collection which is configured in the dashboard settings and an additional graph collection that stores the graph matrix. The main collection will be joined with the graph collection to retrieve node labels.