
Tick Schema and Data Types #1

Open
durple opened this issue Nov 4, 2014 · 15 comments
@durple (Contributor) commented Nov 4, 2014

Cassandra requires us to specify a SQL-like schema. Currently I have two thoughts around this.

Here is what I am currently thinking in terms of the schema. If the inbound JSON is

{"a": 1, "b": "x", "c": true}

it would be represented by the following schema:

Table a:

value int primary key, ts timestamp, count counter

Data:

(value of a) -> (ts, count), (ts, count), (ts, count)

Table b:

value string primary key, ts timestamp, count counter

Data:

(value of b) -> (ts, count), (ts, count), (ts, count)

Table c:

value boolean primary key, ts timestamp, count counter

Data:

(value of c) -> (ts, count), (ts, count), (ts, count)

The key in the column is the actual value of the JSON element

I can listen to a stream for a specified number of messages and determine the type for each of the inbound attributes.

Drawbacks to this method are:

  • I have to listen to a stream for a while, so time series might take a while to show up.
  • Attributes that do not conform to the data type will be rejected by Cassandra.
  • It makes querying for data from an API harder, because now I need to know what type to convert to before I query Cassandra.
  • I have to have some way of determining the data type, and it may not always work.

The other idea would be to just convert everything to a string representation, and I am not too sure how I feel about that either. It would ensure that:

  • I store everything
  • the queries become easy

I just don't like the idea of manipulating the data types as I store the data.
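The sampling approach described above (listen for a specified number of messages, then determine each attribute's type) could be sketched roughly like this. This is a hedged illustration in Python; `infer_schema` and the type mapping are hypothetical, not Tick's actual code:

```python
import json

# Map Python types observed in sampled messages to Cassandra-style types.
# This mapping is an assumption for illustration only.
TYPE_MAP = {bool: "boolean", int: "int", float: "double", str: "text"}

def infer_schema(messages, sample_size=100):
    """Infer a column type per attribute from the first N JSON messages.

    Returns {attribute: cassandra_type}. Attributes whose type varies
    across the sample fall back to "text".
    """
    seen = {}
    for raw in messages[:sample_size]:
        for key, value in json.loads(raw).items():
            # bool must be checked before int: isinstance(True, int) is True
            t = bool if isinstance(value, bool) else type(value)
            seen.setdefault(key, set()).add(t)
    return {
        key: TYPE_MAP[next(iter(types))] if len(types) == 1 else "text"
        for key, types in seen.items()
    }

sample = ['{"a": 1, "b": "x", "c": true}', '{"a": 2, "b": "y", "c": false}']
print(infer_schema(sample))  # {'a': 'int', 'b': 'text', 'c': 'boolean'}
```

The fallback-to-text branch also illustrates the "may not always work" drawback: a mixed-type attribute forces a string column anyway.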

@durple durple added the question label Nov 4, 2014
@durple durple self-assigned this Nov 4, 2014
@durple (Contributor, Author) commented Nov 4, 2014

In the second case,

The schema just happens to look like

value string primary key, ts timestamp, count counter

For all keys.

I could also avoid creating a separate table for each key, but I like the separation because it allows me to index more efficiently.
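A minimal sketch of the convert-everything-to-strings option, assuming the uniform `value string primary key, ts timestamp, count counter` schema above (`normalize` is a hypothetical helper, not part of Tick):

```python
import json

def normalize(message):
    """Convert every attribute value to its JSON string form so that all
    tables can share one schema with a string-typed primary key."""
    return {key: json.dumps(value) for key, value in json.loads(message).items()}

print(normalize('{"a": 1, "b": "x", "c": true}'))
# {'a': '1', 'b': '"x"', 'c': 'true'}
```

Using `json.dumps` rather than `str` keeps the original type recoverable from the stored string (`"1"` vs `'"1"'`), which would partially mitigate the type-loss drawback discussed above.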

@nikhan commented Nov 4, 2014

How do I query across tables? How do I get all keys for stream X?

@nikhan commented Nov 4, 2014

> I have to listen to a stream for a while so time series might take a while to show up.

what is a while? why?

@durple (Contributor, Author) commented Nov 4, 2014

> How do I query across tables? How do I get all keys for stream X?

Every stream will have its own keyspace (a database in traditional terms). It is just a matter of using the right keyspace and listing all of its tables.

You never query across tables. The idea is for each key or combinations thereof to have their own tables. e.g.

{k1:v1, k2:v2}

would ideally have the following tables

k1 -> containing time series for all distinct values of k1
k2 -> containing time series for all distinct values of k2
k1_k2 -> containing time series for all distinct (k1, k2) pairs

@nikhan commented Nov 4, 2014

I don't understand k1_k2:
This is a time series for values in k1 and k2?

@durple (Contributor, Author) commented Nov 4, 2014

> I have to listen to a stream for a while so time series might take a while to show up.
>
> what is a while? why?

For now:
A predetermined number of messages or a predetermined interval of time (may be configured as a flag when you start listening to a stream). The reason for this would be to determine the type of each attribute and create appropriate tables in Cassandra.

@durple (Contributor, Author) commented Nov 4, 2014

> This is a time series for values in k1 and k2?

Yes. E.g., if a stream had user and location, I would have a time series of

  • users over time
  • locations over time
  • and (user, location) over time, allowing me to query by both attributes: users in a given location

@nikhan commented Nov 4, 2014

do i need to know order or anything when querying compound keys? are all variations available?
stream has A, B, C
i want A & C and A & B

@durple (Contributor, Author) commented Nov 4, 2014

> do i need to know order or anything when querying compound keys? are all variations available?
> stream has A, B, C
> i want A & C and A & B

Did you mean A&B and A&C?

If so, yes. There is a caveat though and it is one of the experiments being undertaken with Tick. The thought is for Tick to be "smart" in some way to tell you the keys that make sense in a combination or even by themselves. The specifics are yet to be determined though and it is one of the more advanced thoughts that we've discussed. For now, assume that it will create a table for

A
B
C
AB
AC
BC
ABC

The order is not important (it should not be).
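Enumerating those tables is just taking every non-empty subset of the stream's key set. A minimal sketch (Python; `key_combinations` is a hypothetical helper, not part of Tick):

```python
from itertools import combinations

def key_combinations(keys):
    """All non-empty, order-insensitive combinations of stream keys,
    one table name per combination (parts joined with '_')."""
    keys = sorted(keys)  # canonical order, so A_C and C_A collapse to one table
    return ["_".join(combo)
            for r in range(1, len(keys) + 1)
            for combo in combinations(keys, r)]

print(key_combinations(["A", "B", "C"]))
# ['A', 'B', 'C', 'A_B', 'A_C', 'B_C', 'A_B_C']
```

Note that this is 2^n - 1 tables for n keys, which grows quickly; that is one motivation for the cardinality-based pruning idea discussed elsewhere in this thread.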

@durple (Contributor, Author) commented Nov 4, 2014

The thought though is to know if A&C makes sense prior to building the time series for it. The current thought is to measure the cardinality of an attribute, or a combination of attributes, and compare it with the cardinality of the stream itself over a specified interval of time. If the cardinality is approximately the same, it is probably not a useful time series to build. E.g., tweet_id over time may not be a useful metric, but retweeted_tweet_id is probably a more useful one. Tick should be smart about that.
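That heuristic could be sketched as a simple distinct-to-total ratio check. A hedged illustration only; the function name and the 0.9 threshold are assumptions, not Tick's actual logic:

```python
def is_useful(values, threshold=0.9):
    """An attribute is probably not worth a time series if nearly every
    message carries a distinct value, i.e. its cardinality over the window
    is approximately the cardinality of the stream itself."""
    if not values:
        return False
    return len(set(values)) / len(values) < threshold

# tweet_id: unique per message -> cardinality ratio 1.0 -> not useful
print(is_useful([1, 2, 3, 4, 5]))            # False
# retweeted_tweet_id: heavily repeated -> ratio 0.25 -> useful
print(is_useful([7, 7, 7, 8, 7, 8, 7, 7]))   # True
```

For high-volume streams, an exact `set` would be replaced by an approximate counter such as HyperLogLog, but the comparison is the same.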

@nikhan commented Nov 4, 2014

I'd strongly debate what "useful" means, as my questions would likely be along the lines of:
What does A&C look like? What does A&B look like? Not having a time series because the cardinality is supposedly "similar" doesn't really tell me anything ... other than that Tick thinks I don't need this time series.

@durple (Contributor, Author) commented Nov 4, 2014

I agree that it is a debatable point, and Mike and I might be wrong here (or right, who knows). I will open up a separate issue because this is a whole different thought, but I'd like feedback on the current issue of schema and types. Figuring out what is "useful" is a crazy problem and a more advanced issue than the current one. For now I'd like to get the first step right, wherein for A, B, C I build a time series for each of A, B, and C and have appropriate schemas for those. If we can figure that part out (correctly), building a time series for all the combinations, or a subset of them, is essentially an extension of building it for a single attribute.

@nikhan commented Nov 4, 2014

Back on track: I am not sure what option 1 gives you over option 2 (aside from types, which could be... useful in some capacity, I guess?). It sounds like it makes things needlessly complicated.

If types need to be stored, I wouldn't actually store the value as a typed variable, but make a new table that literally describes the type as a string.

@durple (Contributor, Author) commented Nov 4, 2014

Ability to have the right types and have the right kind of queries, I suppose.

So if I were tracking an attribute that is always an int, say session length over time or scroll depth, I could run a range query on the keys.

If I convert everything to a string I might lose that ability.
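A quick illustration of what is lost: string comparison is lexicographic, which mirrors how string-typed keys would compare in the store, so numeric range semantics break once everything is text:

```python
lengths = [9, 10, 120, 30]

# As ints, a range query like "session_length >= 30" behaves as expected.
print(sorted(lengths))                      # [9, 10, 30, 120]
print([n for n in lengths if n >= 30])      # [120, 30]

# As strings, ordering and comparison are lexicographic, and wrong:
# "9" sorts after "120", and "9" >= "30" is True.
strings = [str(n) for n in lengths]
print(sorted(strings))                      # ['10', '120', '30', '9']
print([s for s in strings if s >= "30"])    # ['9', '30']
```

Workarounds exist (zero-padding, or storing the declared type in a side table as suggested above), but they push the type problem onto every querier.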

@durple (Contributor, Author) commented Nov 4, 2014

Or do we not care about having such abilities altogether? That is probably the question to be asking. Do we just care to pass it a key and a value and expect a time series for that value? That seems a bit restrictive but I could be wrong.
