
MTI Format Analysis

Jerome Athias edited this page Nov 14, 2015 · 25 revisions

Background

This page captures a trade-off analysis for the CTI TC's consensus MTI format, which is intended to be adopted by the STIX SC, the CybOX SC, and potentially the TAXII SC. Per the mailing list discussion, the proposal is:

  • Each specification will be defined as a high-level data model
  • Additionally, one and only one binding specification to a serialization format will be identified and standardized as "mandatory to implement"
  • Zero or more alternative specifications could be identified and standardized as alternative bindings if there is sufficient interest

Evaluation Criteria

1. Language Support (MUST)

The MTI format MUST be supported across a wide variety of languages, minimally Python, Java, and C#. The more the better, though.

Discussion

Jordan: Must-have languages also include C, C++, PHP, Go, Ruby, and Perl. The nice-to-have list is a lot larger.

Davidson: I think we should consider dependencies. For instance, in most (all?) languages, XML introduces a dependency on an XML library. My preference (not requirement) is that the MTI choice does not introduce a dependency.

Kirillov: I don't think native language support is a must. Honestly, I don't think it's a huge consideration, as whatever we end up picking will more than likely be well supported across a wide variety of languages, whether natively or through additional libraries.

2. Schema Validation (MUST)

The MTI format MUST allow for instance documents/messages to be validated against a pre-defined schema that encodes rules for the format.

Discussion

Wunder: It's probably heretical to say this, but is it actually a requirement? I could imagine not having a standard validation capability and leaving it to the tools to make a best effort. That said, as a way to encode the format rules it would certainly be desirable, so given that IMO this item should be a SHOULD, not a MUST.

Barnum: This seems like a pretty important criterion if we are striving for verifiable interoperability.

Jordan: I am not sure this is really a requirement. And some serialization libraries handle extra or missing data better than others. XML and JSON both have schema validation, and I believe binary solutions do as well. So regardless, we should be covered here.

Davidson: I'm not sure this is a requirement either (I lean toward nice-to-have). I guess I need to be convinced that schema validation provides value in a production setting (vs just being a performance penalty).

Kirillov: For schema validation to really be useful, our data model has to be much more constrained. Given that almost all of STIX/CybOX is optional, and many of the fields are free-form strings, our current XML schema validation doesn't give us much beyond verifying general document structure. Accordingly, I'd say this should be a SHOULD and not a MUST.

Casanave: Over-constrained "schema validation" can work against interoperability. For example, is a piece of data that is well formed against a schema I don't understand invalid? Does it require information I may not know or may not be able to share? Consider three levels: if you say it, this is how; you must say this; you can't say anything else.
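To make the trade-off concrete, here is a minimal, stdlib-only sketch of what instance validation buys for a mostly-optional data model. The field names and schema shape are invented for illustration and are not drawn from any actual STIX specification.

```python
# Hypothetical, much-simplified schema for an indicator object; the field
# names are illustrative, not taken from any real STIX schema.
SCHEMA = {
    "required": ["id", "type"],
    "types": {"id": str, "type": str, "confidence": int},
}

def validate(obj, schema):
    """Minimal sketch of instance validation: it catches missing required
    fields and wrong types, but says nothing about the content of free-form
    strings (Kirillov's point above)."""
    errors = []
    for field in schema["required"]:
        if field not in obj:
            errors.append(f"missing required field: {field}")
    for field, expected in schema["types"].items():
        if field in obj and not isinstance(obj[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    return errors

good = {"id": "indicator-1", "type": "indicator"}
bad = {"type": "indicator", "confidence": "high"}
print(validate(good, SCHEMA))  # []
print(validate(bad, SCHEMA))
```

The point of the sketch: with almost everything optional, validation degenerates to checking general structure, which supports treating this criterion as a SHOULD rather than a MUST.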

3. Speed and Scalability (MUST)

The MTI format MUST be scalable to the level identified by the TC.

Discussion

Wunder: We need to define what level of scale we care about supporting in the MTI specification, but I haven't seen any numbers other than what TAXII put together.

Barnum: I would argue that this factor is completely use case dependent. Different use cases would have different requirements for speed and scalability. For choosing an MTI, I think the key question for this factor would be determining the threshold (from a speed and scalability perspective) for use cases that an MTI would support and which ones would need alternative forms to support.

Jordan: If we are successful, then we are going to have an enormous amount of STIX data flowing around. While some use cases may only deal with a small number of items a day, those are not the use-cases we need to be concerned with as it is easy for them to deal with bloat. I am thinking that people need to plan for the processing, serialization and transmission of 50,000 STIX objects a day at the low end, and 10,000,000 a day on the high end. This metric is based on a standard 20,000 FTE company where every device in the network can consume, produce, or enrich STIX data.

MacDonald: I know of an organization that has TBs of threat information they are able to share daily, but they have no efficient way of sharing that using STIX/TAXII. It just won't cope at present. We need to concentrate on minimizing the bandwidth usage across the network as much as possible, as businesses won't want to be spending money on their WAN links just to get threat information. They would much rather spend their money on things that make money.

Davidson: I'm not sure this has any bearing on the MTI selection. However, I think it does matter in terms of the design of the MTI selection. I'll posit that every option we've discussed so far (XML, JSON, Cap'n Proto, etc.) can meet or fail to meet (depending on design) any scalability requirements we might have. I'd vote to move this to a design requirement.

Kirillov: I agree with John - speed and scalability are relative, so I think we need some harder metrics on this one.
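As a rough illustration of Jordan's figures above, the back-of-the-envelope below converts objects per day into bandwidth. The per-object sizes are assumptions for illustration, not measured STIX payloads.

```python
# Back-of-the-envelope bandwidth for the volumes Jordan cites above.
# Per-object sizes are illustrative assumptions, not measured payloads.
SIZES_KB = {"compact JSON": 1, "verbose XML": 8}
VOLUMES = (50_000, 10_000_000)  # objects/day, low and high end

gb_per_day = {}
for label, size_kb in SIZES_KB.items():
    for n in VOLUMES:
        gb_per_day[(label, n)] = n * size_kb / 1_000_000
        print(f"{label}: {n:>10,} objects/day -> "
              f"{gb_per_day[(label, n)]:8,.1f} GB/day")
```

Even under these assumed sizes, the high-end volume spans roughly 10 to 80 GB/day depending on format verbosity, which is why MacDonald's bandwidth concern matters for the design regardless of which format is chosen.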

4. Human-Readability (SHOULD)

The MTI format SHOULD be a human-readable text format. This allows for easy debugging and usage of the format outside of tools when necessary.

Discussion

Wunder: IMO this is a nice-to-have but not a requirement. Human readable formats are much easier to debug, but IMO usage of the binding outside of tools is not something we should aim for.

Barnum: Aligned to but not exactly the same as "human readability" is the potential advantage of a structured text-based format to support very simple transformations and filtering (e.g. markings-based tear lines) of content. Maybe this is a separate factor but seems heavily related here so I did not create a new factor.

Jordan: I agree with Wunder, this is a nice to have. At the end of the day, it will be hundreds of applications that consume and process the STIX data and make it pretty for the end user. Even in the debugging use case the documents can be so large that it is painful to even make sense of them. I think we would really need to create command line tools that make sense of the data for debugging.

MacDonald: I agree with Jordan and Wunder. If we have debugging and testing applications that make it easy to interact with the protocol, then there is no need to have the protocol itself human readable. We just need better tools. Who likes reading raw XML anyway?

Davidson: I put this as "very nice to have" but not at the "requirement" level. If I consider how most security teams operate, they have people who can hack around and script a little bit, but aren't full-blown software engineers. As a result, I think we have to aim simple if we ever want a typical security person to "pop open the hood" and do anything useful with STIX/TAXII, but that's just my opinion. This gets into a tangent, but I see a world where a security person can subscribe to a TAXII channel, print messages to stdout, and see something they can work with.

Kirillov: I was originally in the MUST camp for this one, but I think it should really be a SHOULD. The reality is that, given the decentralized (i.e., reference-heavy) structure of STIX and CybOX, it will be difficult for humans to consume no matter which MTI we choose.

Casanave: Agree with "nice to have"

5. Query Capability (SHOULD)

The MTI format SHOULD support some kind of standard document query capability (e.g. XPath).

Discussion

Wunder: I'm not sure whether this is actually a requirement or even a nice-to-have? Currently the only place it's used is in profiles and in data markings, two of the most complicated parts of STIX/CybOX. Maybe this is one of those things we should toss aside for simplicity's sake?

Barnum: I think this should stay here as a SHOULD. The whole point of the STIX language and any serialization format is to support the real world CTI use cases. The need to query STIX content is a key part of the real-world use cases not an artificial requirement added by the language itself. As such I don't think the factor can be discarded for simplicity's sake.

Jordan: I am okay with Query, as long as it is implementable. I would argue that what we have today is not, and we might as well not even have it. I am thinking that this might be a pending item for a 2.1 or 2.2 version of STIX.

MacDonald: Querying should be part of the embedded protocols that TAXII transports. TAXII and STIX should operate on a similar principle to TCP/IP. TAXII should only transport from one end to the other, and STIX should contain and understand the threat information and support query and response. Each embedded protocol should have the ability to query and respond if it makes sense for that protocol. Keeping this separation of purpose has multiple benefits:

  • TAXII is simpler to implement
  • TAXII doesn't need to be updated when a new embedded protocol is added or changed
  • Each embedded protocol can choose if it needs the ability to query and respond.
    • It makes TAXII loosely coupled, and highly cohesive - making it easier to create, maintain and enhance for the future

Davidson: I would personally vote to make this a non-requirement. A format-based query mechanism makes a huge assumption about the underlying software implementation and IMO locks implementations to a very specific technology stack (all drawbacks). I agree wholeheartedly that we need to be able to query the information contained in STIX; we just don't need to be able to query the wire format.

Kirillov: I'm in the camp that this shouldn't be a requirement. At best it's a nice-to-have; most query capabilities can be achieved with other methods, especially if the format is simple enough.

Casanave: I think this requirement is at the wrong level. The issue is not what you can do in one document - that is an internal processing issue. The issue is accessing large scale distributed data - you may not get that in a "message", you need to be able to query repositories at "web scale". Thus I would suggest distributed query a "must".
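A small sketch of the asymmetry discussed above: XML ships with a standard query language (XPath, a subset of which Python's standard library supports), while JSON content is typically queried by plain traversal (JSONPath exists but is not universally standardized). The document shapes here are invented for illustration and do not reflect real STIX structure.

```python
import json
import xml.etree.ElementTree as ET

# Illustrative documents only; neither reflects real STIX structure.
xml_doc = "<package><indicator id='a'/><indicator id='b'/></package>"
json_doc = '{"indicators": [{"id": "a"}, {"id": "b"}]}'

# XML: a standard path query language; ElementTree supports a subset of XPath.
xml_ids = [e.get("id") for e in ET.fromstring(xml_doc).findall(".//indicator")]

# JSON: no single standard query language; plain traversal is the norm.
json_ids = [i["id"] for i in json.loads(json_doc)["indicators"]]

print(xml_ids, json_ids)  # ['a', 'b'] ['a', 'b']
```

As Davidson notes, the practical question is whether standardizing the wire-format query (left) is worth the coupling, given that the traversal approach (right) works in any stack.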

6. Extension Capability (SHOULD)

The MTI format SHOULD support replaceable types (i.e. extensions). For example, the test mechanism construct would need to support Snort, YARA, OpenIOC, etc.

Discussion

Wunder: I don't know whether any formats wouldn't support this, but it's worth explicitly listing it if we expect the high-level model to leverage this capability.

Barnum: I would assert that this should be a MUST rather than a SHOULD.

Jordan: I think everything should support this. But it is good to call it out.

MacDonald: This is a problem for STIX, not TAXII. As I've said before, there should be loose coupling between STIX and TAXII, in the same way that there is between TCP/IP. There is a reason that TCP/IP has worked for decades, and a reason that IP can be used with UDP, ICMP, IGMP, etc... We should reflect those same principles in the development of TAXII/STIX.

Davidson: I think all formats support this (nominally with a Content-Type=&lt;MIME-Type&gt;; Content=&lt;...&gt; structure), and therefore it doesn't need to be considered as an MTI criterion.

Kirillov: This seems to be more of a data model issue rather than one that impacts the MTI directly. There are likely ways to implement extension points in any MTI, even if they're not explicitly defined.

Athias: I would assert that this should be a MUST rather than a SHOULD. NB: IODEF, CAPEC, CVE, CWE, CPE, etc. are currently XML based
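A sketch of the Content-Type-style extension point Davidson describes above: a type discriminator plus opaque content lets any MTI carry Snort, YARA, OpenIOC, etc. without the format itself knowing about them. The field names and MIME types below are hypothetical, chosen only to illustrate the dispatch pattern.

```python
# Hypothetical test-mechanism wrapper: a type discriminator plus opaque body.
test_mechanism = {
    "content_type": "application/x-yara",   # made-up MIME type
    "content": "rule demo { condition: true }",
}

# Consumers register handlers per content type; unknown types pass through.
HANDLERS = {
    "application/x-yara": lambda body: f"compile YARA rule ({len(body)} bytes)",
    "application/x-snort": lambda body: f"load Snort rule ({len(body)} bytes)",
}

def dispatch(block):
    handler = HANDLERS.get(block["content_type"])
    if handler is None:
        return "unknown extension type: pass through or ignore"
    return handler(block["content"])

print(dispatch(test_mechanism))
```

This supports Kirillov's point: the pattern is a data-model decision and is implementable in essentially any serialization.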

7. Backwards-Compatibility (MAY)

The MTI format MAY be backwards-compatible with previous versions of STIX, CybOX, and TAXII. This would allow for easier upgrade paths.

Discussion

Wunder: Since the previous versions were XML, this is an argument for XML. Continuity is a good thing.

Jordan: I think STIX 2.0 is going to be a big enough change that people will need to refactor and rewrite code. So this is not really a requirement at this point. STIX 1.x was about getting our feet wet and figuring out what we did not know or what we did wrong.

MacDonald: STIX v2.0 should be a change. Large amounts of code will need change to support the new standard, so I'm not sure that backwards compatibility is key. We already need to run multiple python-stix libraries in order to generate multiple versions of STIX, so making the same requirement for STIX v2.0 doesn't concern me.

Davidson: I don't understand. How can a MAY be a requirement? I think we should just pick the thing that meets our needs the best and go with it.

Kirillov: Agree with Terry and Bret - this shouldn't be a requirement for STIX 2.0/CybOX 3.0. Going forward, we may want to think about it more, but it shouldn't have any impact on our decisions for the next major version.

8. No required dominant decomposition (MUST)

The MTI format MUST support a wide variety of use cases and viewpoints. This implies that the information MUST NOT require one specific hierarchy in a communication, but must allow the consumer of the information to select and structure the information they need in the decomposition applicable to their application and stakeholders. Most such solutions are "graph" structures, but other approaches can be considered.

Discussion

Casanave: Dominant decompositions work very well in coupled systems built for one particular purpose under single management. CTI is none of that. The transition to thinking about what best supports a community is different from building a system. Instead of saying one viewpoint is "wrong" or "less important", let's not encode the viewpoint into the data structure, but allow viewpoints where they make sense. In essence, this means moving from tabular or hierarchical data to graph data. Current STIX tries to do this in XML; it can be done, but it is cumbersome. There is more than one way to skin this cat (I have suggested RDF), but the fundamental requirement is about flexibility and broad applicability. The consideration is that more general data will be harder to program for, sorry.

Jordan: This is not a MUST, and if I understand what is being stated, then I disagree with it. We need one way of doing things for the vast majority, 90+%, of the use cases. Products and solutions need to be able to just work. We need DLNA for CTI: a single format that everyone uses. Those people that need something different and are only going to communicate within their niche ecosystem can do whatever they want. But for widespread communication of CTI, we need one way, and it should be native JSON, something everyone already fully understands and can use.

9. Well-formed and typed internal and external (web) references (MUST)

No one information "package" is complete; additional detail, supporting information, or other sources need to be accessible and linkable (e.g., linked data). Within a document, links are required between entities, or else everything must be in a very strict value-only hierarchy. The MTI MUST support internal and external references to entities in a standards-based format. Such references should correspond to the types defined in the data model.

Discussion

Casanave: It should be noted that, based on my understanding, current STIX does not use "REF" in an XML-valid way. REF is only defined within a document, but it is being used across documents. The XML standard for external references is XPointer. The URI is the standard way to reference things and should be used.

Jordan: I am not sure this is a MUST, more like a nice to have and I think we can do this with simple string values.
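A minimal sketch of the URI-based referencing Casanave suggests, distinguishing same-document lookups from cross-document (web) references. The object layout and the example URI are invented for illustration.

```python
# Hypothetical in-memory document store keyed by local id.
documents = {
    "indicator-1": {"id": "indicator-1", "related": "indicator-2"},
    "indicator-2": {"id": "indicator-2",
                    "related": "https://cti.example.org/objects/indicator-9"},
}

def resolve(ref):
    """Local ids resolve within the document set; URIs point outside it."""
    if ref.startswith(("http://", "https://")):
        return f"external reference: fetch from {ref}"  # cross-document
    return documents[ref]                               # same-document lookup

print(resolve("indicator-2"))
print(resolve(documents["indicator-2"]["related"]))
```

This also illustrates Jordan's counterpoint: at the wire level both kinds of reference can indeed be carried as simple string values; the MUST/SHOULD question is really about standardizing their interpretation.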

10. Support of digital signatures and/or encryption (SHOULD)

Integrity, confidentiality, traceability, and non-repudiation are important security principles. If this is identified as a requirement: while there are known mechanisms for XML (for objects and content, without speaking about ease of implementation), are there good ones for JSON?

Discussion

Athias: This MUST be evaluated as a requirement.
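One existing answer to the question above is JWS (RFC 7515), which defines signatures over JSON payloads. The stdlib-only stand-in below is not JWS; it only illustrates the core issue any JSON signature scheme must handle: JSON has no canonical byte form, so the bytes must be fixed (here via sorted keys and fixed separators) before signing.

```python
import hashlib
import hmac
import json

def sign(obj, key):
    """Sign a canonical serialization of obj with an HMAC (illustrative
    stand-in for a real scheme such as JWS, RFC 7515)."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hmac.new(key, canonical.encode(), hashlib.sha256).hexdigest()

key = b"shared-secret"  # illustrative key, not a recommendation
doc = {"type": "indicator", "id": "indicator-1"}
sig = sign(doc, key)

# Reordering keys does not change the signature, because we canonicalize.
reordered = {"id": "indicator-1", "type": "indicator"}
print(sign(reordered, key) == sig)  # True
```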

Format Analysis

XML

PRO

  • Strong cross-language support
  • Strong tooling support (e.g. JAXB)
  • Strong ecosystem of supporting standards/capabilities (e.g. XSLT, XPATH, XPOINTER, XQUERY)
  • Strong/powerful schema validation language(s)
  • Human-readable
  • More natural to work with for many developers (enterprise, government). (Jordan: I think this is wrong.)

CON

  • Large size (unless you use EXI, which is not human-readable and has much worse tool/language support)
  • Powerful schema allows for complicated schema (elements vs. attributes, namespaces, xsi:type, etc.)
  • Somewhat difficult to work with in its raw form
  • Not very friendly for open-source, web application, and app developers
  • XSI types and optionality
  • XML types do not map well to structs and types in programming languages
  • XML serialization can be very slow relative to other options
  • Can be overly complex and hard to use and understand
  • Most APIs in use today do not use XML
  • Namespaces

Discussion

Wunder: One option we should consider is greatly simplifying the set of XML Schema constructs we actually use. For example, we could consider developing a binding to XML that only uses elements (no attributes), doesn't use namespaces, doesn't use xsi:type, etc. This would greatly simplify the XML that we use while maintaining the ability to use XML's great tool support and providing (some) backwards compatibility (though most everything would need to be rewritten, at least it's the same base technology).

Jordan: Everything is going to need to be rewritten anyway. But doing as Wunder has called out would make it easier to use, for sure. If XML is chosen, then I would argue we MUST do the things Wunder calls out.

MacDonald: I believe that the two most important things for TAXII are:

  • smallest size possible
  • ability to carry any protocol within it, to maintain upper-level protocol flexibility

I am not a fan of XML as it is too verbose.

Davidson: Ok, a few opinions here

  • If we go XML, we need to simplify the schemas we have
  • I disagree with XML being more natural for many developers to work with. While there are plenty of people who write code that works with XML, I can't say I've ever heard a preference for working with XML (this is anecdotal, but it's my experience).
  • As a con, I would add that XML generally introduces a dependency on a 3rd party library.
  • Personally, I just haven't seen a strong argument for XML.

Kirillov: IMO, we don't really take advantage of the things that XML does well, such as schema validation. I don't see a strong argument to keeping it as the MTI.

JSON

PRO

  • Strong cross-language support (equal to XML)
  • Medium-strong tooling support (somewhat worse than XML, but by no means bad)
  • Schema validation language. It has some overlap with XML schema but supports less in some ways and more in other ways.
  • Human-readable
  • Simpler to work with than XML or many other formats (maps easier to many programming language constructs)
  • More natural to work with for many developers (web developers, startups, etc)
  • Less verbose than XML
  • Easy to understand
  • Requires some level of simplification, similar to what Wunder called out for what would be good for XML.
  • Open-source, web application, and app developers prefer this format. This will eliminate an entry barrier we have today.
  • Groups like Soltra and others will no longer need to transcode STIX to JSON as soon as they receive it, as it will already be JSON. Soltra and others store all of their STIX data today in JSON on the backend.

CON

  • JSON Schema is not a finalized standard and therefore tool support may be inconsistent. (Jordan: It is currently in Draft 4 at the IETF.)
  • Would require a tech stack refresh from legacy STIX/CybOX/TAXII XML. (Jordan: This will need to be done anyway, as STIX 2.0 and TAXII 2.0 will be so different.)
  • Not as compressed as binary format

Discussion

Wunder: IMO it would be nice to see some examples of equivalent constructs in XML and JSON (and XML Schema and JSON schema). I think Intelworks or someone else had converted some of STIX 1.2 over, can we link to that as an example? Did it have JSON schema?

Jordan: IMO, JSON is the best option between XML and binary. It is also a solution that will give us the ability to recruit lots of developers to write tools. Further, from everything I hear, JSON is a lot easier to work with, especially in web applications.

MacDonald: I am not convinced that JSON is the answer, as binary protocols will have smaller bandwidth requirements, but I am also aware of the difficulty in developing using binary protocols and the lack of libraries on all end devices. So JSON may be a good 'halfway house' on the road to binary protocols. A valid scenario may be the use of JSON in most environments (e.g. client to endpoint), but the use of binary in high-bandwidth situations (e.g. sharing between Google, Facebook, and Microsoft).

Davidson: I am generally for JSON. I recognize that other formats may be more performant, but I think JSON can be performant enough.

Kirillov: JSON seems to be on the path of eclipsing XML as the defacto serialization for web applications and message transport, has less overhead, and has wide language support. I feel that its downsides of less powerful schema support and potential lack of support in certain domains (enterprise/government products) are outweighed by these benefits.
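Pending the real converted examples Wunder asks for above, here is a rough side-by-side of the same hypothetical indicator in both formats (neither reflects actual STIX 1.2 structure), illustrating why JSON maps more directly onto native language constructs.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical indicator, expressed both ways; not real STIX structure.
xml_text = (
    "<indicator id='indicator-1'>"
    "<title>Suspicious IP</title><confidence>High</confidence>"
    "</indicator>"
)
json_text = json.dumps(
    {"id": "indicator-1", "title": "Suspicious IP", "confidence": "High"}
)

# JSON parses straight into a native dict; XML needs element navigation
# plus an element-vs-attribute decision for every field.
obj = json.loads(json_text)
root = ET.fromstring(xml_text)
assert obj["title"] == root.find("title").text == "Suspicious IP"
print(len(xml_text), len(json_text))  # the XML form is also longer here
```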

Binary Format

Wunder: Please break apart into Thrift, protobuf, etc. if you have the expertise. I don't.

Options:

  • cap'n proto
  • thrift
  • avro
  • protobuf

PRO

  • Much smaller and faster than XML or JSON (citation needed)
  • A better solution for large data sets.
  • Varying cross-language support
  • Varying tool support
  • Most have strong validation of message types

CON

  • Not as natural for developers to work with
  • Somewhat of a holy war between different formats
  • Not human readable
  • Support for various programming languages depends on the binary format chosen.
  • Can be complex to understand and use in code.
  • Binary protocol performance is dependent on the data ingested; we need to test to find the best solution for the type of data we handle.

Discussion

Wunder: I find binary protocols more compelling as alternative bindings than as an MTI binding. I think using a binary protocol is unnecessary for a lot of CTI traffic and would be a barrier to adoption because developers aren't as familiar with it.

Jordan: I agree with Wunder here.

MacDonald: I would agree that at present using binary as an alternative MTI makes sense. Initially I see this as only being used where there is a huge volume of threat information being shared, but slowly over time as threat sharing becomes more popular and the tools get more refined I see this becoming the main MTI.

Davidson: +1 to Wunder
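A rough, stdlib-only illustration of the size argument above: fixed-width binary packing versus JSON for a hypothetical record (an IPv4 address plus a one-byte confidence score). Real binary formats (protobuf, Thrift, Cap'n Proto, etc.) add framing and field tags, so treat the binary figure as a lower bound rather than what any particular format would produce.

```python
import json
import struct

# Hypothetical record layout: IPv4 address (4 bytes) + confidence (1 byte).
records = [((10, 0, 0, i % 256), i % 100) for i in range(1000)]

# JSON carries field names and text encoding on the wire.
json_bytes = json.dumps(
    [{"ip": ".".join(map(str, ip)), "confidence": c} for ip, c in records]
).encode()

# Fixed-width binary: 5 bytes per record, no field names on the wire.
binary_bytes = b"".join(struct.pack("4BB", *ip, c) for ip, c in records)

print(len(json_bytes), len(binary_bytes))
print(f"binary is ~{len(json_bytes) / len(binary_bytes):.0f}x smaller here")
```

The gap narrows once JSON is compressed in transit, which is part of why the discussion above treats binary as an alternative binding for high-volume cases rather than the MTI.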

RDF (JSON Serialization)

PRO

  • Strong cross-language support (equal to XML)
  • Medium-strong tooling support (somewhat worse than XML, but by no means bad)
  • Multiple open source implementations in multiple languages, including repositories
  • Can map to internal SQL data
  • Strong schema and semantic validation
  • Human-readable
  • Based on a simple data model
  • Less verbose than unconstrained XML
  • Inherently distributes based on use of URI
  • Standardized by W3C: JSON-LD
  • Has full query language (SPARQL)
  • Easy to generate from logical models such as in UML
  • Provides for a very flexible distributed data graph (linked data)
  • Serialization can optionally be hierarchical.
  • Natural progression to use of deeper semantics within application
  • Can be used in REST, query, request/reply or pub/sub interactions

CON

  • Less well known
  • Graph structure a leap for some programmers
  • Individual "triples" (the fundamental element) are not identified and so can't have metadata (e.g. source); this can be overcome but requires a specific style of use
  • Can be more verbose than some other formats

Discussion

Casanave: If there is buy-in to the no-dominant-decomposition and distribution requirements, this is about the only game in town. If there is not buy-in to those requirements, the alternative is a LOT of smaller, more purpose-specific "exchange formats" mapped to a broader reference model, so no one MTI. A single encapsulated decomposition will NEVER meet the CTI community's needs.

Jordan: Not sure how this differs from native JSON, other than adding complexity and being something that the majority of developers will not understand. We need an MTI that is super easy to implement and that the average open-source, web app, and app developer can use today. Things that scare off developers, like XML, are not good options.
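For concreteness, here is a minimal JSON-LD-style fragment: to a plain JSON consumer it is just a dict (Jordan's point), while an RDF-aware consumer additionally reads @context and @id as edges in a distributed graph (Casanave's point). The vocabulary, property names, and URIs below are made up for illustration.

```python
import json

# Minimal JSON-LD-style fragment; the "related" term and example.org URIs
# are hypothetical, invented only to show the linked-data shape.
doc = {
    "@context": {"related": {"@type": "@id"}},
    "@id": "https://cti.example.org/indicator-1",
    "@type": "Indicator",
    "related": "https://cti.example.org/indicator-2",
}

# A plain JSON consumer just sees ordinary keys and string values...
plain = json.loads(json.dumps(doc))
print(plain["related"])

# ...while an RDF-aware consumer would treat "related" as a typed link
# (an @id) from one graph node to another, resolvable across documents.
```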

Conclusions

Wunder: If you were to ask me the format that I prefer to use personally, knowing what I know now, it would be JSON: it's MUCH easier to work with in the languages I use, and because it's so much less powerful than XML, it's easier to stay simple and focus on what you want to say rather than encoding things perfectly to support validation. The two big downsides of JSON are that JSON Schema is less reliable than XML Schema and that tooling support in enterprise/government products is much better for XML. So I would not be upset if XML were chosen.

Jordan: From what I have heard from the developers I have talked to, there is nearly unanimous desire for JSON. I also believe JSON is easier to work with, and it would be my choice. If we were to choose XML, we would need to drastically clean it up and make it more JSON-like. Further, most groups use JSON on the backend to store their STIX data, so we should just give it to them in the format they are already using.

MacDonald: I would prefer binary from a purist's perspective, due to it having the smallest bandwidth consumption. But I also recognize there is a chance of alienating current users with such a large jump. I would be fine with JSON as the primary MTI, with a binary protocol selected as an alternative (after a bakeoff involving real data).

Davidson: I generally prefer JSON, and I'd like to move forward with JSON as a strawman MTI.

Kirillov: +1 for JSON as the MTI. I feel that a binary format may be suitable as a secondary implementation. However, I do worry about the potential for fragmentation if we allow any number of serializations. I think we should seriously consider limiting serializations to one MTI and one alternative format.

Anderson: JSON is very popular and arguably more approachable than XML for the average developer. Moving to JSON may improve adoption of STIX. However, I'm not convinced that this standard needs an MTI format at all. If this standard can clearly define unambiguous models at a high level, then implementing various formats will be (nearly?) trivial and can be accomplished by others. IMHO, our time may be best spent removing ambiguity from the STIX models.

Casanave: First, is one MTI a good idea? Will it stand the test of time and very different needs? Would MLD (mandatory logical data) work better? If developers code to the logical data, they will be happier over time and support more data formats and use cases as they come. If one MTI is required, it MUST be a distributed graph model; this is the only thing that has a chance of wide-scale success beyond sending around an arbitrary list of indicators. The XML format CAN be made general enough to be a graph and distributed, and the current format starts in that direction. But XML Schema is not great for this and it becomes complex; RDF is built for it and is standard. The JSON representation reads well and is not hard to utilize. Note: I have no vested interest in RDF; I use it when it is the right choice.

Jordan: We need an MTI, and it should be JSON. Is RDF or whatever better? Maybe; I do not know enough about it. But what I do know is that JSON is highly favored by the development community and by product managers. Think of it this way: Token Ring, FDDI, and ATM were much better solutions; however, Ethernet won. JSON is the de facto solution for sharing data between applications and products today and looks to be the long-term solution. Let's not fight against the grain.

Casanave: Response to Jordan... If what is needed is a full representation of CTI for all use cases, it will not be simple, small, or something a developer can use quickly, regardless of the syntax. If what is wanted is small, targeted exchange schemas for very specific purposes, that is another thing altogether. For example, an exchange of suspicious IP addresses from a specific entity. Sure, do a JSON schema for that and then map it to full CTI as a reference model. This would be more like very fine-grained "profiles" than a universal MTI. There would then need to be a bunch of these targeted exchange schemas. To answer your specific question, the significant elements that JSON-LD adds are specific references to (RDF) schemas that define elements, plus the ability to do linked web references and query.

Burger: I cannot resist pointing out that both the wheel has been invented and no one likes what they got. Do we want a single specification with multiple transport encodings, ranging from almost human-readable to highly efficient, bit-wise binary? Look at ASN.1. Yuck, but it's out there and almost works. (I think it sucks, but I point it out because unicorns sometimes do exist, and often you don't want one.)
