Apache Solr for distributed indexing #27

maestre3d · 2020-06-19T20:31:25Z

Is your feature request related to a problem? Please describe.
Yes. Right now distributed data indexing is not possible, and we still have to implement complex query-building patterns to meet our indexing requirements.

Describe the solution you'd like
Currently, we use Redis as a cache and PostgreSQL as our main RDBMS, but for full-text search we need something more powerful and easier to implement. Apache Solr as a distributed indexing database is one of the best choices we have, even more so since we are already using Apache ZooKeeper (which Solr requires in its distributed SolrCloud mode) and Apache Kafka.

Describe alternatives you've considered
Since we might already use Elasticsearch for other purposes, we should also consider the Elastic Stack.

Additional context
No

maestre3d · 2020-06-21T21:40:37Z

Consider using Apache Solr (+ Apache Lucene + Apache ZooKeeper). It's widely used, and it makes full-text search easier, even in documents such as XML files and PDFs.

Since we're already using Apache Kafka, and most likely Apache Hadoop and Hive, we must run Apache ZooKeeper for sharded cluster management anyway.

maestre3d · 2020-07-25T04:58:35Z

Now reconsidering using Elasticsearch.

Here is the full explanation:
Currently, we plan to deploy almost everything on AWS. While AWS offers a managed service for open-source Elasticsearch, it has no managed service for Apache Solr. Our platform does use Kafka, but we would run it on Amazon MSK (Managed Streaming for Apache Kafka), so we wouldn't need to manage a ZooKeeper cluster ourselves; AWS takes care of it. The same goes for Apache Spark and Hadoop + MapReduce.

In addition, if we chose Solr anyway, we would have to manage all the EC2 instances ourselves, along with Auto Scaling groups, a service discovery registry, and load balancing.

We know this choice will eventually come at a price, and we are still considering a future migration to Apache Solr for certain services that could benefit from the trade-offs it offers (such as fine-grained search over the files of the media/blob service).

maestre3d · 2020-07-25T05:29:17Z

We should use the CQRS pattern whenever robust querying is needed from now on.

Why?
By segregating queries from writes in code, we gain the option to use a different database for each kind of operation.
For example, write commands could target a PostgreSQL database with more structured/normalized data, or, if we wanted to store our aggregates as JSON, we could take advantage of Apache Cassandra or AWS DynamoDB.

On the other hand, for the query side, we could use a fine-grained indexing/querying database system like the ones discussed above and get the benefits of full-text search in near real time (NRT).

Thus, CQRS gives us a way to model complex business logic, avoiding the use of simple CRUD operations.
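
As a rough illustration of the split described above, here is a minimal in-memory sketch in Python. Everything in it is hypothetical: `AuthorWriteStore` stands in for the write database (e.g. PostgreSQL), `AuthorSearchIndex` stands in for the query-side index (e.g. Elasticsearch or Solr), and the projection runs synchronously, whereas in our platform the event would presumably travel through Kafka.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CreateAuthor:
    """Command: expresses the intent to change state (hypothetical name)."""
    author_id: str
    display_name: str


class AuthorWriteStore:
    """Stand-in for the normalized write database (assumption: PostgreSQL)."""

    def __init__(self):
        self.rows = {}

    def insert(self, author_id, display_name):
        if author_id in self.rows:
            raise ValueError(f"author {author_id} already exists")
        self.rows[author_id] = {"display_name": display_name}


class AuthorSearchIndex:
    """Stand-in for the query-side index (assumption: Elasticsearch or Solr)."""

    def __init__(self):
        self.docs = {}

    def upsert(self, author_id, display_name):
        # Store lowercased tokens so lookups are case-insensitive.
        self.docs[author_id] = set(display_name.lower().split())

    def search(self, term):
        term = term.lower()
        return sorted(doc_id for doc_id, tokens in self.docs.items() if term in tokens)


class CommandHandler:
    """Write side: persists the change, then publishes an event about it."""

    def __init__(self, store, publish):
        self.store = store
        self.publish = publish

    def handle(self, cmd):
        self.store.insert(cmd.author_id, cmd.display_name)
        self.publish({
            "type": "AuthorCreated",
            "author_id": cmd.author_id,
            "display_name": cmd.display_name,
        })


class QueryHandler:
    """Read side: only ever talks to the index, never to the write store."""

    def __init__(self, index):
        self.index = index

    def search_authors(self, term):
        return self.index.search(term)


def build_app():
    """Wire both sides together with a synchronous, in-process projection."""
    store = AuthorWriteStore()
    index = AuthorSearchIndex()

    def project(event):
        # Projection: keeps the read model in sync with published events.
        if event["type"] == "AuthorCreated":
            index.upsert(event["author_id"], event["display_name"])

    return CommandHandler(store, project), QueryHandler(index), store
```

The point of the sketch is that the two stores only meet through the published event, so each side can be swapped or scaled without touching the other.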

Here is a diagram that also explains this pattern, using a single data source. This could be used whenever complex domain modeling is needed (avoiding simple CRUD operations).

Now, this diagram shows how we could use this pattern to segregate our data sources into separate data sets that can be scaled independently. This could be used whenever complex querying is needed, as in the Media and Author services.

Finally, there is another way to handle our data using CQRS: Event Sourcing (ES), which appends every change as an event instead of overwriting the previous state as conventional approaches do. Its events are immutable, just as real-life events are. It is widely used when data changes constantly and you must keep track of those changes for different reasons. Moreover, ES makes rollbacks easier, because we can trace the previous states of an aggregate through the event store and rebuild the aggregate from them.
Here is a diagram that shows how CQRS + ES can be combined.
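
To make the rebuild/rollback idea more concrete, here is a minimal sketch in Python, assuming an in-memory, append-only store. The event kinds (`MediaCreated`, `MediaRenamed`, `MediaPublished`) and all class and function names are made up for illustration; they are not taken from our code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Immutable fact about an aggregate; it is never updated or deleted."""
    aggregate_id: str
    kind: str
    data: dict


class EventStore:
    """Append-only, in-memory store with one ordered stream per aggregate."""

    def __init__(self):
        self._streams = {}

    def append(self, event):
        self._streams.setdefault(event.aggregate_id, []).append(event)

    def load(self, aggregate_id, up_to=None):
        # up_to=N returns only the first N events, i.e. an earlier version.
        events = self._streams.get(aggregate_id, [])
        return list(events if up_to is None else events[:up_to])


def apply(state, event):
    """Pure function: previous state + event -> next state."""
    if event.kind == "MediaCreated":
        return {"title": event.data["title"], "published": False}
    if event.kind == "MediaRenamed":
        return {**state, "title": event.data["title"]}
    if event.kind == "MediaPublished":
        return {**state, "published": True}
    raise ValueError(f"unknown event kind: {event.kind}")


def rebuild(store, aggregate_id, up_to=None):
    """Replay the stream from the beginning to reconstruct the aggregate."""
    state = {}
    for event in store.load(aggregate_id, up_to):
        state = apply(state, event)
    return state
```

Calling `rebuild(store, "m1")` replays the whole stream, while `rebuild(store, "m1", up_to=1)` yields the state after the first event only, which is the kind of rollback described above.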

For the time being, we are ruling out full Event Sourcing (ES), since we already have a way to handle distributed transactions successfully (the Saga pattern). However, we know we must still tune our transaction system to make it fully resilient and simple: for example, using orchestration instead of choreography, and tuning the Event Server from Alexandria Core to provide distributed tracing, metrics, circuit breakers, retry policies and logging by default, using the chain of responsibility pattern.
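
One possible shape for the "by default" behaviors mentioned above is a chain of wrappers around each event handler, where every link either deals with a concern or passes the event on to the next link. The sketch below is in Python and is only an assumption about how this could look; `with_logging`, `with_retry` and `chain` are hypothetical names, not the actual Event Server API, and the retry link has no backoff or circuit breaker to keep it short.

```python
import logging

logger = logging.getLogger("event_server")


def with_logging(next_handler):
    """Link that logs every event before passing it down the chain."""
    def handler(event):
        logger.info("handling %s", event["type"])
        return next_handler(event)
    return handler


def with_retry(max_attempts):
    """Link factory: retries the rest of the chain on ConnectionError."""
    def link(next_handler):
        def handler(event):
            for attempt in range(1, max_attempts + 1):
                try:
                    return next_handler(event)
                except ConnectionError:
                    if attempt == max_attempts:
                        raise  # Retry budget exhausted: surface the error.
                    logger.warning("attempt %d failed, retrying", attempt)
        return handler
    return link


def chain(handler, *links):
    """Wrap handler so that the first link listed runs first (outermost)."""
    for link in reversed(links):
        handler = link(handler)
    return handler
```

With this shape, `chain(on_author_created, with_logging, with_retry(3))` would give a handler that logs once and then attempts the hypothetical `on_author_created` up to three times.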

Although we take advantage of an event-driven ecosystem to communicate with other services asynchronously, we are still evaluating the need for a dedicated event store, mostly for rollbacks/compensating operations.
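
For reference, this is roughly what orchestration with compensating operations means in practice. It is a minimal Python sketch under the assumption that each saga step is a plain function paired with an undo function; `run_saga` and `SagaFailed` are hypothetical names, and a real orchestrator would also need to persist its progress.

```python
class SagaFailed(Exception):
    """Raised after a step fails and earlier steps have been compensated."""


def run_saga(steps, context):
    """Run (name, action, compensation) steps in order.

    If a step fails, the compensations of the steps that already
    completed are executed in reverse order, then SagaFailed is raised.
    """
    completed = []
    for name, action, compensate in steps:
        try:
            action(context)
        except Exception as exc:
            for _, undo in reversed(completed):
                undo(context)
            raise SagaFailed(f"step {name!r} failed: {exc}") from exc
        completed.append((name, compensate))
    return context
```

The failed step itself is not compensated, since it never completed; only the steps before it are undone, newest first.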

More information can be found here.

maestre3d added the enhancement New feature or request label Jun 19, 2020

maestre3d self-assigned this Jun 19, 2020

maestre3d changed the title ~~Elasticsearch for query commands~~ Apache Solr for distributed indexing Jul 9, 2020

maestre3d added this to the Alexandria Beta milestone Jul 9, 2020
