Under the Hood: Architectural Overview of Netmera Search

One of the most important features of the Netmera Platform is full-text search. This feature sets Netmera apart from other backend cloud services, in that you can serve unstructured data very efficiently in your app via Netmera. With Netmera Search, our backend services are particularly useful for media and content-based applications.

In this post, I will talk about the search feature of Netmera and the technologies that we used to develop it.

Netmera stores data in MongoDB, a NoSQL database that offers scalable, high-performance, schema-free data storage. MongoDB provides a reliable data store and fast queries. It also provides several useful querying options; however, its search functionality is limited. You can't create a MongoDB index that allows efficient searching on every field, and indexing all fields causes memory problems. Because of this limitation, we decided to use a dedicated search engine. We evaluated several search engines and chose Solr because of its maturity and the large community behind it.

Solr is an open source search platform based on the Lucene search engine. It provides full-text search, faceted search, content analysis, stemming, boosting and some other useful features. It can perform complex queries, handle millions of documents and scale horizontally. Since Solr can store data and return it with search results, we first decided to store content in Solr instead of MongoDB. However, we observed that Solr's query performance decreased as the index size grew. We realized that the best solution was to use Solr and MongoDB together. We therefore integrated the two by storing content in MongoDB and building the full-text index in Solr. We store only the unique ID of each document in the Solr index and retrieve the actual content from MongoDB after searching Solr. Fetching documents from MongoDB is faster than fetching them from Solr because no analyzers, scoring, etc. are involved. With this hybrid approach we get the benefits of both technologies.
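To make the hybrid flow concrete, here is a minimal sketch in Python. It does not talk to Solr or MongoDB; `IdOnlyIndex` and `ContentStore` are hypothetical in-memory stand-ins, and the field names (`_id`, `title`, `body`) are assumptions made for illustration. The point is only the division of labour: the index knows terms and IDs, the store knows the content.

```python
from collections import defaultdict


class IdOnlyIndex:
    """Hypothetical stand-in for the Solr index: maps terms to document IDs only."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        # Index every whitespace-separated term, but keep only the ID.
        for term in text.lower().split():
            self._postings[term].add(doc_id)

    def search(self, query):
        # Return IDs of documents that contain every query term.
        terms = query.lower().split()
        if not terms:
            return []
        matches = set.intersection(*(self._postings.get(t, set()) for t in terms))
        return sorted(matches)


class ContentStore:
    """Hypothetical stand-in for MongoDB: holds the full documents by ID."""

    def __init__(self):
        self._docs = {}

    def save(self, doc):
        self._docs[doc["_id"]] = doc

    def find_by_ids(self, ids):
        return [self._docs[i] for i in ids if i in self._docs]


def add_content(index, store, doc):
    # Write path: full content goes to the store, searchable text to the index.
    store.save(doc)
    index.add(doc["_id"], doc["title"] + " " + doc["body"])


def search(index, store, query):
    # Read path: ask the index for IDs, then fetch the content from the store.
    return store.find_by_ids(index.search(query))


index, store = IdOnlyIndex(), ContentStore()
add_content(index, store, {"_id": 1, "title": "Solr", "body": "full-text search platform"})
add_content(index, store, {"_id": 2, "title": "MongoDB", "body": "schema-free document store"})
results = search(index, store, "search platform")
```

In a real deployment the second step would typically be a single lookup by ID list, which is the kind of query MongoDB answers quickly from its primary key index.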

Netmera search is composed of three layers described below.

1. Content Indexing
When content is added to the cloud, it is also added to the search index. Before text (the metadata of media or the text content itself) is indexed, it is tokenized and analyzed. For this purpose, we developed our own FilterFactory to analyze Turkish data. Solr has a built-in stemmer, but it doesn't provide precise results for the Turkish language. To stem terms we use Zemberek, an open source Natural Language Processing library for Turkic languages. With this process we achieve more accurate search results. We also maintain a stop word list (common words used in the language) for Turkish and remove those words from documents during indexing. This improves indexing time.
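The analysis steps above (tokenize, lowercase, drop stop words, stem) can be sketched in a few lines of Python. This is an illustration only, not our FilterFactory: the stop word list is a tiny sample, and `toy_stem` is a hypothetical suffix stripper that handles just the plural endings "lar" and "ler". It stands in for Zemberek, which does real morphological analysis.

```python
import re

# Tiny sample stop word list; a production list would be much longer.
STOP_WORDS = {"ve", "bir", "bu", "ile", "de", "da"}

# Plural suffixes handled by the toy stemmer (assumption for illustration).
PLURAL_SUFFIXES = ("lar", "ler")


def turkish_lower(text):
    # Turkish has dotted and dotless i: "İ" lowers to "i" and "I" lowers to "ı".
    return text.replace("İ", "i").replace("I", "ı").lower()


def toy_stem(token):
    # Strip a plural suffix if at least two characters of stem remain.
    for suffix in PLURAL_SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 2:
            return token[: -len(suffix)]
    return token


def analyze(text):
    # Tokenize on word characters, drop stop words, then stem what is left.
    tokens = re.findall(r"\w+", turkish_lower(text))
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


terms = analyze("Kitaplar ve evler")
```

The same `analyze` function has to run on the query text as well, so that a search for "kitap" and a document containing "kitaplar" end up with the same indexed term.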

At the moment we have extended Solr for Turkish only, but we plan to optimize our search engine for other languages as well. We would like to hear your recommendations on libraries for other languages. You can tell us about available language extensions in the comments.

2. Searching
This layer provides the ability to search content inside the search index. Besides full-text search, Netmera can also perform geo-location search. To make location search possible, latitude and longitude values are indexed for all location-related content. Our search engine can then perform two kinds of geo-location searches, as described below:

Box Search: Given two corner points, a box is created and the contents inside the box are listed. This feature can be used on a map to find locations or contents inside a map area.

Circle Search: Given a point (latitude, longitude) and a distance (the radius of the circle), a circle is created and the contents inside this circle are listed. This search method is best suited to finding content near a user.
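The semantics of the two searches can be shown with a small Python sketch. It filters an in-memory list rather than querying the search index, so it illustrates what the results mean, not how the engine computes them. The function names, the `(name, lat, lon)` tuple layout and the use of the haversine formula with a mean Earth radius of 6371 km are assumptions for this example. The box test is deliberately simple and does not handle boxes that cross the 180° meridian.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def in_box(lat, lon, corner_a, corner_b):
    # The two corners may be given in any order; sort each axis first.
    lat_lo, lat_hi = sorted((corner_a[0], corner_b[0]))
    lon_lo, lon_hi = sorted((corner_a[1], corner_b[1]))
    return lat_lo <= lat <= lat_hi and lon_lo <= lon <= lon_hi


def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two points given in degrees.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def box_search(items, corner_a, corner_b):
    # items is a list of (name, lat, lon) tuples.
    return [name for name, lat, lon in items if in_box(lat, lon, corner_a, corner_b)]


def circle_search(items, center, radius_km):
    return [
        name
        for name, lat, lon in items
        if haversine_km(center[0], center[1], lat, lon) <= radius_km
    ]


places = [
    ("Istanbul", 41.0082, 28.9784),
    ("Ankara", 39.9334, 32.8597),
    ("Izmir", 38.4237, 27.1428),
]
inside_map_area = box_search(places, (40.0, 28.0), (42.0, 30.0))
near_user = circle_search(places, (41.0, 29.0), 10.0)
```

With this sample data, both searches return only Istanbul: the box covers latitudes 40 to 42 and longitudes 28 to 30, and the circle has a 10 km radius around a point in the city.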

3. Ranking
Contents retrieved from the search index are ranked by Lucene's scoring algorithm. This is a complex algorithm, but in general it is based on the frequency of the search term in individual documents and in the overall index. Our current R&D efforts are focused on customizing search scores (ranking) by adding new factors such as social context, popularity, location, etc. to the search. In this way, different types of applications will be able to find and show the most relevant content to their users. We are still working on this feature and will publish a detailed blog post when it is released.
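To give a feel for frequency-based ranking, here is a simplified TF-IDF scorer in Python. It is a textbook-style sketch, not Lucene's actual formula, which also includes length normalization, boosts and other factors. The function name `tf_idf_rank` and the particular term frequency and inverse document frequency weights are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tf_idf_rank(docs, query):
    """Rank docs (a dict of id -> text) against a query; highest score first."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if counts[term]:
                tf = math.sqrt(counts[term])  # more occurrences, higher score
                idf = 1.0 + math.log(n_docs / (doc_freq[term] + 1))  # rarer terms weigh more
                score += tf * idf
        if score > 0:
            scores[doc_id] = score

    # Sort by descending score, breaking ties by ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


docs = {
    "a": "solr search search engine",
    "b": "mongodb document store",
    "c": "search with mongodb",
}
ranking = tf_idf_rank(docs, "search")
```

Document "a" ranks above "c" because it mentions the term twice, and "b" is left out because it does not contain the term at all. Extra factors such as popularity or distance to the user could be combined with a score like this, for example by multiplying it by a per-document weight.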

This is a general overview of our search feature. Feel free to contact me with any questions or feedback.