
Tick Schema and Data Types #1

Open
durple opened this issue Nov 4, 2014 · 15 comments
@durple (Contributor) commented Nov 4, 2014

Cassandra requires us to specify a SQL-like schema. Currently I have two thoughts around this.

Here is what I am currently thinking in terms of the schema. If the inbound JSON is

{"a": 1, "b": "x", "c": true}

it would be represented by the following schema:

Table a:

value int primary key, ts timestamp, count counter

Data:

(value of a) -> (ts, count), (ts, count), (ts, count)

Table b:

value string primary key, ts timestamp, count counter

Data:

(value of b) -> (ts, count), (ts, count), (ts, count)

Table c:

value boolean primary key, ts timestamp, count counter

Data:

(value of c) -> (ts, count), (ts, count), (ts, count)

The key in the column is the actual value of the JSON element

I can listen to a stream for a specified number of messages and determine the type for each of the inbound attributes.

Drawbacks to this method are:

  • I have to listen to a stream for a while, so time series might take a while to show up.
  • Attributes that do not conform to the data type will be rejected by Cassandra.
  • It makes querying for data from an API harder, because now I need to know what type to convert to before I query Cassandra.
  • I have to have some way of determining the data type, and it may not always work.

The other idea would be to just convert everything to a string representation, and I am not too sure how I feel about that either. It would ensure that:

  • I store everything
  • the queries become easy

I just don't like the idea of manipulating the data types as I store the data.
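The sampling approach described above (listen for a specified number of messages, then determine each attribute's type) could be sketched roughly like this. This is a hedged illustration in Python; `infer_schema` and the type mapping are hypothetical, not Tick's actual code:

```python
import json

# Map Python types observed in sampled messages to Cassandra-style types.
# This mapping is an assumption for illustration only.
TYPE_MAP = {bool: "boolean", int: "int", float: "double", str: "text"}

def infer_schema(messages, sample_size=100):
    """Infer a column type per attribute from the first N JSON messages.

    Returns {attribute: cassandra_type}. Attributes whose type varies
    across the sample fall back to "text".
    """
    seen = {}
    for raw in messages[:sample_size]:
        for key, value in json.loads(raw).items():
            # bool must be checked before int: isinstance(True, int) is True
            t = bool if isinstance(value, bool) else type(value)
            seen.setdefault(key, set()).add(t)
    return {
        key: TYPE_MAP[next(iter(types))] if len(types) == 1 else "text"
        for key, types in seen.items()
    }

sample = ['{"a": 1, "b": "x", "c": true}', '{"a": 2, "b": "y", "c": false}']
print(infer_schema(sample))  # {'a': 'int', 'b': 'text', 'c': 'boolean'}
```

The fallback-to-text branch also illustrates the "may not always work" drawback: a mixed-type attribute forces a string column anyway.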

@durple durple added the question label Nov 4, 2014
@durple durple self-assigned this Nov 4, 2014
@durple (Contributor, Author) commented Nov 4, 2014

In the second case,

The schema just happens to look like

value string primary key, ts timestamp, count counter

For all keys.

I could also avoid creating a separate table for each key, but I like the separation because it allows me to index more efficiently.
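A minimal sketch of the convert-everything-to-strings option, assuming the uniform `value string primary key, ts timestamp, count counter` schema above (`normalize` is a hypothetical helper, not part of Tick):

```python
import json

def normalize(message):
    """Convert every attribute value to its JSON string form so that all
    tables can share one schema with a string-typed primary key."""
    return {key: json.dumps(value) for key, value in json.loads(message).items()}

print(normalize('{"a": 1, "b": "x", "c": true}'))
# {'a': '1', 'b': '"x"', 'c': 'true'}
```

Using `json.dumps` rather than `str` keeps the original type recoverable from the stored string (`"1"` vs `'"1"'`), which would partially mitigate the type-loss drawback discussed above.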

@nikhan commented Nov 4, 2014

How do I query across tables? How do I get all keys for stream X?

@nikhan commented Nov 4, 2014

> I have to listen to a stream for a while so time series might take a while to show up.

what is a while? why?

@durple (Contributor, Author) commented Nov 4, 2014

> How do I query across tables? How do I get all keys for stream X?

Every stream will have its own keyspace (a database in traditional terms). It is just a matter of using the right keyspace and listing all of its tables.

You never query across tables. The idea is for each key or combinations thereof to have their own tables. e.g.

{k1:v1, k2:v2}

would ideally have the following tables

k1 -> containing time series for all distinct values of k1
k2 -> containing time series for all distinct values of k2
k1_k2 -> containing time series for all distinct (k1, k2) pairs

@nikhan commented Nov 4, 2014

I don't understand k1_k2:
This is a time series for values in k1 and k2?

@durple (Contributor, Author) commented Nov 4, 2014

> I have to listen to a stream for a while so time series might take a while to show up.
>
> what is a while? why?

For now:
A predetermined number of messages or a predetermined interval of time (may be configured as a flag when you start listening to a stream). The reason for this would be to determine the type of each attribute and create appropriate tables in Cassandra.

@durple (Contributor, Author) commented Nov 4, 2014

> This is a time series for values in k1 and k2?

Yes. E.g., if a stream had user and location, I would have a time series of

  • users over time
  • locations over time
  • and (user, location) over time, allowing me to query by both attributes: users in a given location

@nikhan commented Nov 4, 2014

do i need to know order or anything when querying compound keys? are all variations available?
stream has A, B, C
i want A & C and A & B

@durple (Contributor, Author) commented Nov 4, 2014

> do i need to know order or anything when querying compound keys? are all variations available?
> stream has A, B, C
> i want A & C and A & B

Did you mean A&B and A&C?

If so, yes. There is a caveat though and it is one of the experiments being undertaken with Tick. The thought is for Tick to be "smart" in some way to tell you the keys that make sense in a combination or even by themselves. The specifics are yet to be determined though and it is one of the more advanced thoughts that we've discussed. For now, assume that it will create a table for

A
B
C
AB
AC
BC
ABC

The order is not important (it should not be).
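Enumerating those tables is just taking every non-empty subset of the stream's key set. A minimal sketch (Python; `key_combinations` is a hypothetical helper, not part of Tick):

```python
from itertools import combinations

def key_combinations(keys):
    """All non-empty, order-insensitive combinations of stream keys,
    one table name per combination (parts joined with '_')."""
    keys = sorted(keys)  # canonical order, so A_C and C_A collapse to one table
    return ["_".join(combo)
            for r in range(1, len(keys) + 1)
            for combo in combinations(keys, r)]

print(key_combinations(["A", "B", "C"]))
# ['A', 'B', 'C', 'A_B', 'A_C', 'B_C', 'A_B_C']
```

Note that this is 2^n - 1 tables for n keys, which grows quickly; that is one motivation for the cardinality-based pruning idea discussed elsewhere in this thread.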

@durple (Contributor, Author) commented Nov 4, 2014

The thought though is to know if A&C makes sense prior to building the time series for it. The current thought is to measure the cardinality of an attribute, or a combination of attributes, and compare it with the cardinality of the stream itself over a specified interval of time. If the cardinality is approximately the same, it is probably not a useful time series to build. E.g., tweet_id over time may not be a useful metric, but retweeted_tweet_id is probably a more useful one. Tick should be smart about that.
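That heuristic could be sketched as a simple distinct-to-total ratio check. A hedged illustration only; the function name and the 0.9 threshold are assumptions, not Tick's actual logic:

```python
def is_useful(values, threshold=0.9):
    """An attribute is probably not worth a time series if nearly every
    message carries a distinct value, i.e. its cardinality over the window
    is approximately the cardinality of the stream itself."""
    if not values:
        return False
    return len(set(values)) / len(values) < threshold

# tweet_id: unique per message -> cardinality ratio 1.0 -> not useful
print(is_useful([1, 2, 3, 4, 5]))            # False
# retweeted_tweet_id: heavily repeated -> ratio 0.25 -> useful
print(is_useful([7, 7, 7, 8, 7, 8, 7, 7]))   # True
```

For high-volume streams, an exact `set` would be replaced by an approximate counter such as HyperLogLog, but the comparison is the same.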

@nikhan commented Nov 4, 2014

I'd strongly debate what "useful" means, as my questions would likely be along the lines of:
What does A&C look like? What does A&B look like? Not having a time series because the cardinality is supposedly "similar" doesn't really tell me anything ... other than that Tick thinks I don't need this time series.

@durple (Contributor, Author) commented Nov 4, 2014

I agree that it is a debatable point, and Mike and I might be wrong here (or right, who knows). I will open up a separate issue because this is a whole different thought, but I'd like feedback on the current issue of schema and types. Figuring out what is "useful" is a crazy problem and a more advanced issue than the current one. For now I'd like to get the first step right, wherein for A, B, C I build a time series for each of A, B, and C and have appropriate schemas for those. If we can figure that part out (correctly), building a time series for all the combinations, or a subset of them, is essentially an extension of building it for a single attribute.

@nikhan commented Nov 4, 2014

Back on track: I am not sure what option 1 gives you over option 2 (aside from types, which could be... useful in some capacity, I guess?). It sounds like it makes things needlessly complicated.

If types need to be stored, I wouldn't actually store the value as a typed variable, but make a new table that literally describes the type as a string.

@durple (Contributor, Author) commented Nov 4, 2014

Ability to have the right types and have the right kind of queries, I suppose.

So if I were tracking an attribute that is always an int, say session length over time or scroll depth, I could run a range query on the keys.

If I convert everything to a string I might lose that ability.
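A quick illustration of what is lost: string comparison is lexicographic, which mirrors how string-typed keys would compare in the store, so numeric range semantics break once everything is text:

```python
lengths = [9, 10, 120, 30]

# As ints, a range query like "session_length >= 30" behaves as expected.
print(sorted(lengths))                      # [9, 10, 30, 120]
print([n for n in lengths if n >= 30])      # [120, 30]

# As strings, ordering and comparison are lexicographic, and wrong:
# "9" sorts after "120", and "9" >= "30" is True.
strings = [str(n) for n in lengths]
print(sorted(strings))                      # ['10', '120', '30', '9']
print([s for s in strings if s >= "30"])    # ['9', '30']
```

Workarounds exist (zero-padding, or storing the declared type in a side table as suggested above), but they push the type problem onto every querier.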

@durple (Contributor, Author) commented Nov 4, 2014

Or do we not care about having such abilities altogether? That is probably the question to be asking. Do we just care to pass it a key and a value and expect a time series for that value? That seems a bit restrictive but I could be wrong.
