Problem Statement: Despite being first-class citizens in Spark and holding the key corporate asset, i.e., data, Datasets do not get the attention they need when it comes to making them searchable.
In this blog, we'll look at the following:
- How we can make datasets searchable, especially without using any search engine
- SQL-like queries for search
- Same data, different representations: rearranging the same data into different schemas
- The impact of data schema on search time
- Conclusion: performance testing, evaluating each of these approaches to see which is the more appropriate way to structure datasets when it comes to making them searchable
I’ve implemented a working sample application using World Bank open data that does the following:
- Provides a UI dashboard built on top of Spark to browse knowledge (a.k.a. data)
- Queries Spark in real time and visualises the results as a graph
- Supports SQL query syntax
- This is just a sample application to give an idea of how to build any kind of analytics dashboard on top of data that Spark has processed. You can customise it according to your needs.
Please refer to my git repository here for further details.
I've used ~4 million rows of country profile information from World Bank open data as the knowledge base in this project. The following table displays some sample rows:
I tried different ways to structure this data and evaluated their performance using a simple search-by-CountryId query.
Let's jump in and take a look at the different schemas I tried and how they performed when queried by CountryId...
The very first attempt at structuring the data was naturally the simplest of all, i.e., a DataFrame, where all the information for a country sits in one row.
Schema:
Query by CountryId response time: 100ms
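A minimal plain-Python sketch of this layout (standing in for a Spark DataFrame; the field names like `countryId` are illustrative, not the actual World Bank schema):

```python
# Flat, one-row-per-record layout: every attribute of a country lives in the
# same record, so a search by id is a single filter over one collection.
rows = [
    {"countryId": "IND", "name": "India", "region": "South Asia"},
    {"countryId": "BRA", "name": "Brazil", "region": "Latin America"},
]

def search_by_id(rows, country_id):
    # The Spark equivalent would be a single filter on the DataFrame,
    # e.g. df.filter(df.countryId == country_id)
    return [r for r in rows if r["countryId"] == country_id]

print(search_by_id(rows, "IND"))
```

One filter pass over one table is why this layout gives the fastest response time below.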
Next, I represented the same data as RDF triplets. In this approach, we take each row and convert it into triplets of Subject, Predicate and Object, as shown in the table below:
Schema:
Query by CountryId response time: 6501ms
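A sketch of the row-to-triplet conversion in plain Python (field names are illustrative):

```python
def to_triplets(row, id_field="countryId"):
    # Each non-id column becomes one (Subject, Predicate, Object) triplet,
    # so a single country row fans out into many triplet rows.
    subject = row[id_field]
    return [(subject, predicate, obj)
            for predicate, obj in row.items()
            if predicate != id_field]

row = {"countryId": "IND", "name": "India", "region": "South Asia"}
triplets = to_triplets(row)

# A search by CountryId now has to scan a much larger triplet table
# and match on the Subject column:
matches = [t for t in triplets if t[0] == "IND"]
```

The fan-out (one row becomes many triplets) is a plausible reason the same filter takes longer here than on the flat DataFrame.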
Next, I represented the same data as RDF triplets of linked data. The only difference from the earlier approach is that the Subject here is linked to a unique id, which in turn holds the actual info, as shown below:
Schema:
Query by CountryId response time: 25014ms
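A sketch of the linked-data variant, again in plain Python (the `hasNode` predicate and `node:` prefix are made-up names for illustration):

```python
def to_linked_triplets(row, node_id, id_field="countryId"):
    # The Subject is linked to a unique node id, and that node id
    # holds the actual facts.
    link = (row[id_field], "hasNode", node_id)
    facts = [(node_id, p, o) for p, o in row.items() if p != id_field]
    return [link] + facts

def search(triplets, country_id):
    # The search now needs two passes (a self-join in Spark):
    # first resolve CountryId -> node id, then fetch that node's triplets.
    nodes = {o for s, p, o in triplets if s == country_id and p == "hasNode"}
    return [t for t in triplets if t[0] in nodes]

data = to_linked_triplets({"countryId": "IND", "name": "India"}, "node:1")
```

The extra level of indirection (an additional join on the same large triplet table) would explain why this representation is the slowest of the four.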
The last approach I tried was to structure the data as a graph with vertices, edges, and attributes. The picture below gives an idea of how country info looks in this approach:
Schema:
- Number of Vertices: 4,278,235
- Number of Edges: 15,357,957
- Query by CountryId response time: 7637ms
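A plain-Python sketch of the graph layout (vertex ids, attribute names, and the `inRegion` relationship are illustrative, not the actual schema):

```python
# Vertices carry the attributes; edges connect a country to related entities.
vertices = {
    "IND": {"type": "country", "name": "India"},
    "south-asia": {"type": "region", "name": "South Asia"},
}
edges = [("IND", "south-asia", "inRegion")]

def neighbours(vertex_id):
    # Comparable to filtering the edge DataFrame in GraphFrames:
    # a query by CountryId filters edges whose source is that vertex.
    return [dst for src, dst, rel in edges if src == vertex_id]
```

In GraphFrames the vertices and edges are themselves DataFrames, so a query by CountryId is again a filter, this time over the edge table, which is why its response time lands close to the plain-triplet case.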
I wrote this blog to demonstrate:
- How to make datasets searchable and
- The impact of data schema on search response time.
For this, I tried different ways to structure the data and evaluated their performance. I hope this gives you a better perspective on structuring the data for your Spark application.
- When the nature of your data is homogeneous, capturing all the information of a record in a single row gives the best performance in terms of search time.
- If you are dealing with heterogeneous data, where all the entities cannot share a generic schema, then simple RDF triplets yield better response times than the LinkedData RDF triplets representation.
- Though higher response time is a downside of LinkedData, it is the recommended approach for connecting, exposing, and making your data available for semantic search.
- GraphFrames is a very intuitive way for users to structure data in many cases. Its response times are comparable to RDF triplets search, and it also opens the door to graph algorithms such as triangle count, connected components, BFS, etc.
The search query here essentially filters the datasets and returns the results, i.e., it is a filter() transformation applied to the data. So the observed per-schema response times apply not only to search but to any transformation we apply to Spark data. This experiment definitely shows how big an impact the data schema has on the performance of your Spark application. Happy data structuring!!