Tick Schema and Data Types #1
In the second case, the schema just happens to look the same for all keys. I could also avoid creating a separate table for each key, but I like the separation because it allows me to index more efficiently.
How do I query across tables? How do I get all keys for stream X?
What is "a while"? Why?
Every stream will have its own keyspace (a database, in traditional terms). It is just a matter of using the right database and asking it to "show me all the tables". You never query across tables. The idea is for each key, or combination thereof, to have its own table; e.g. {k1:v1, k2:v2} would ideally have the tables k1, k2, and k1_k2.
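If it helps make the fan-out concrete, here is a minimal Python sketch; the function name and the tuple row keys are my invention, and only the underscore-joined table naming (k1_k2) comes from the discussion:

```python
from itertools import combinations

def table_rows(event: dict):
    """Yield (table_name, row_key) for every non-empty key combination.

    Key names are sorted first so that k1_k2 and k2_k1 collapse into a
    single table, which is one way the order could be made irrelevant.
    """
    keys = sorted(event)
    for r in range(1, len(keys) + 1):
        for combo in combinations(keys, r):
            yield "_".join(combo), tuple(event[k] for k in combo)

# {"k1": "v1", "k2": "v2"} fans out to the tables k1, k2, and k1_k2:
print(list(table_rows({"k1": "v1", "k2": "v2"})))
# [('k1', ('v1',)), ('k2', ('v2',)), ('k1_k2', ('v1', 'v2'))]
```

Whether every combination actually gets materialized is the caveat discussed below.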
I don't understand k1_k2.
For now:
Yes, e.g. if a stream had user and location, I would have a time series of user, location, and user_location.
Do I need to know the order or anything when querying compound keys? Are all variations available?
Did you mean A&B and A&C? If so, yes. There is a caveat, though, and it is one of the experiments being undertaken with Tick. The thought is for Tick to be "smart" in some way, to tell you the keys that make sense in a combination or even by themselves. The specifics are yet to be determined, though; it is one of the more advanced thoughts we've discussed. For now, assume that it will create a table for A. The order is not important (it should not be).
The thought, though, is to know whether A&C makes sense prior to building the time series for it. The current idea is to measure the cardinality of an attribute (or a combination of attributes) and compare it with the cardinality of the stream itself over a specified interval of time. If the cardinality is approximately the same, it is probably not a useful time series to build; e.g. tweet_id over time may not be a useful metric, but retweeted_tweet_id probably is. Tick should be smart about that.
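A rough sketch of that heuristic over a window of events; the function name, the 0.9 cutoff, and the toy data are all invented, and only the cardinality comparison itself comes from the paragraph above:

```python
def worth_building(events, attr, threshold=0.9):
    """Guess whether a per-value time series for `attr` is worth building.

    If almost every event carries a distinct value for the attribute, its
    cardinality approaches the cardinality of the stream itself and the
    resulting series is probably noise (the tweet_id case).
    """
    values = [e[attr] for e in events if attr in e]
    if not values:
        return False
    return len(set(values)) / len(values) < threshold

# Toy window: every tweet_id is unique, but only two languages appear.
window = [{"tweet_id": i, "lang": "en" if i % 3 else "de"} for i in range(1000)]
print(worth_building(window, "tweet_id"))  # False
print(worth_building(window, "lang"))      # True
```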
I'd strongly debate what "useful" means, as my questions would likely be along the lines of:
I agree that it is a debatable point, and Mike and I might be wrong here (or right, who knows). I will open a separate issue, because this is a whole different thought, but I'd like feedback on the current issue of schema and types. Figuring out what is "useful" is a crazy problem and a more advanced issue than the current one. For now I'd like to get the first step right, wherein for A, B, C I build a time series for each of A, B, and C and have appropriate schemas for those. If we can figure that part out (correctly), building a time series for all the combinations, or a subset of them, is just an extension of building it for a single attribute.
Back on track: I am not sure what option 1 gives you over option 2 (aside from types, which could be... useful in some capacity, I guess?). It sounds like it makes things needlessly complicated. If types need to be stored, I wouldn't actually store the type as a typed variable, but make a new table that literally describes the type as a string.
The ability to have the right types and run the right kind of queries, I suppose. If I were tracking an attribute that is always an int - say, session length over time, or scroll depth - I could run a range query on the keys. If I convert everything to a string, I might lose that ability.
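To make that concern concrete, a toy Python illustration (made-up numbers): the same range predicate turns lexicographic once the values are strings and quietly stops matching:

```python
# Session lengths stored as real ints: the range query means what it says.
lengths = [5, 40, 120, 900]
print([v for v in lengths if 30 <= v <= 200])                # [40, 120]

# The same values stored as strings: comparison is lexicographic,
# so the "range" silently matches nothing here.
print([s for s in map(str, lengths) if "30" <= s <= "200"])  # []
```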
Or do we not care about having such abilities altogether? That is probably the question to be asking. Do we just want to pass it a key and a value and expect a time series for that value? That seems a bit restrictive, but I could be wrong.
Cassandra requires us to specify a SQL-like schema. Currently I have two thoughts around this:
Here is what I am currently thinking in terms of the schema: if the inbound JSON is
{"a": 1, "b": "x", "c": true}
it would be represented by the following schema:
And the data would look like:
Table b:
Data:
Table c:
Data:
The key in the column is the actual value of the JSON element.
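For a concrete picture, here is a sketch (not necessarily the actual layout) of what such per-attribute tables could look like as generated CQL, rendered from Python. The keyspace name, column names, and the Python-to-CQL type mapping are all invented; the one idea taken from the text is that the JSON value itself is the key:

```python
# Hypothetical mapping from sampled Python types to CQL column types.
CQL_TYPES = {int: "bigint", float: "double", bool: "boolean", str: "text"}

def table_ddl(keyspace: str, attr: str, sample_value) -> str:
    """Render a per-attribute table: the value is the partition key,
    and the clustering timestamp gives the time series for each value."""
    value_type = CQL_TYPES[type(sample_value)]
    return (f"CREATE TABLE {keyspace}.{attr} ("
            f"value {value_type}, ts timestamp, "
            f"PRIMARY KEY (value, ts));")

for attr, value in {"a": 1, "b": "x", "c": True}.items():
    print(table_ddl("stream_x", attr, value))
# CREATE TABLE stream_x.a (value bigint, ts timestamp, PRIMARY KEY (value, ts));
# CREATE TABLE stream_x.b (value text, ts timestamp, PRIMARY KEY (value, ts));
# CREATE TABLE stream_x.c (value boolean, ts timestamp, PRIMARY KEY (value, ts));
```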
I can listen to a stream for a specified number of messages and determine the type for each of the inbound attributes.
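That sampling step might look something like this sketch (the function name, sample size, and the fallback to strings on conflict are my assumptions):

```python
import json

def infer_types(messages, sample_size=100):
    """Sample the first N messages and record the JSON type per attribute.

    If two messages disagree on an attribute's type, fall back to str,
    mirroring the "convert everything to a string" escape hatch.
    """
    seen = {}
    for raw in messages[:sample_size]:
        for key, value in json.loads(raw).items():
            if key in seen and seen[key] is not type(value):
                seen[key] = str
            else:
                seen.setdefault(key, type(value))
    return seen

msgs = ['{"a": 1, "b": "x", "c": true}', '{"a": 2, "b": "y", "c": false}']
print(infer_types(msgs))
# {'a': <class 'int'>, 'b': <class 'str'>, 'c': <class 'bool'>}
```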
Drawbacks to this method are:
The other idea would be to just convert everything to a string representation, and I am not too sure how I feel about that either. It will ensure that:
I just don't like the idea of manipulating the data types as I store the data.