Skip to content

Reverse Index Search

ajesler edited this page May 9, 2012 · 2 revisions

One of the pain points of using cassandra is the inability to do an efficient range search. For example, SELECT * FROM CF_USERS WHERE AGE >= 10 AND AGE <= 40. To do this with secondary indexes you need to throw in an equality clause and then have cassandra read all rows matching the equality and evaluate the range linearly. This can be very inefficient. An alternative is to read all rows and execute the range filter in java but that is even more inefficient.

This can be solved with custom reverse indexes. These indexes have to be built manually but can be very powerful. The idea behind a secondary index is to create a separate column family where the key name is the index (usually sharded over multiple keys) and the columns are a composite of the value + foreign key (Yes, foreign key!!!). The benefit here is that each index shard is sorted by a value (the value from the data column family). So for the above example the data and index column families would be

CF_USERS
  RowKey:  username (UTF8Type)
  ComparatorType: UTF8Type
 
Example: 
  "elandau" : {
      "FIRST" : "Eran",
      "LAST" : "Landau",
      "AGE" : 34,
  }
CF_INDEX
  RowKey:  AGE_[0..10]  (11 shards here)
  CompartorType:  CompositeType(LongType, UTF8Type)
 
Example:
   "AGE_0" : {
        5:"ariellelandau" : null,
        34:"elandau" : null,
        60:"lindalandau" : null
   }
 
   "AGE_1" : {
        12:"amitlandau":null
        33:"nettalandau":null
   }

Notice that the first part of the composite has type equal to the value type and the second part has type equal to the row key (in the other column family)

To use the reverse index you need to first search CF_INDEX by reading all rows in the shard and applying a column range from 10 to 40. Once you have all the index entries you extract the foreign key from the column names and do a get from the CF_USERS column family. The complexity here is how to construct the query to be most efficient. To do so you may need to break up the index shard query into small key slices that can be executed in parallel. Once you start reading each shard you may need to paginate through it if it holds a lot of columns.

Astyanax provides a recipe to do all of this in one call.

ReverseIndexQuery.newQuery(keyspace, CF_USERS, INDEX_CF_NAME, LongSerializer.get())
    .withIndexShards(new Shards.StringShardBuilder().setPrefix("AGE_").setShardCount(11).build())
    .fromIndexValue(10L)
    .toIndexValue(40L)
    .withColumnSlice(Arrays.asList("FIRST", "LAST", "AGE"))
    .forEach(new Callback<Row<String, String>>() {
        @Override
        public void handle(Row<String, String> row) {
            // Process your data rows here:  The row should have columns "FIRST", "LAST", "AGE"
        })
    .execute();

What's missing? You currently need to build the custom index on your own when you update your data. In the future Astyanax will have a simple DAO that will do this transparently. Also, a future version of cassandra will provide trigger functionality which could be used by cassandra to built the reverse index for you.

Uniqueness constraint

Cassandra does not provide built in uniqueness constraint on row keys. Astyanax provides a recipe which follows a write/read sequence that guarantees uniqueness constraint. This is a suboptimal use case for cassandra and should be used sparingly where guaranteeing uniqueness cannot be avoided.

The following example shows how to set up a uniqueness constraint on a column family that has long key type and String columns name type. The assumption here is that the same CF is used to store data as well as the columns used to guarantee uniqueness.

UniquenessConstraintWithPrefix<Long> unique = new UniquenessConstraintWithPrefix<Long>(keyspace, CF_DATA)
    .setPrefix("unique_") // This is optional and can be used to distinguish between the unique column name and real columns
    .setTtl(60);  // This is optional
         
 try {
    String column = unique.isUnique(someRowKey);
    if (column == null) {
        // Not unique
    }
    else {
        // Is unique.
        // Make sure to store the returned column with your data otherwise it will TTL and uniquess will be lost
    }
} catch (ConnectionException e) {
}

QUORUM consistency level is used by default but can be changed by calling .setConsistencyLevel(ConsistencyLevel.CL_QUORUM_EACH)