
Use Case 6: discussion #7

Open · msdemlei opened this issue Jan 12, 2015 · 8 comments

@msdemlei

Unless I misunderstand this use case, it proposes to allow embedding some sort
of executable code into the data format.

If that is true, I believe the use case should be dropped, for at least
the following reasons:

(1) Security concerns: Even if you "sandbox" whatever code is executing
(which, of course, makes it more likely that the format's execution
facilities in the end will be too slow or restricted to be generally
useful), it's still going to be hard to control what apparently
innocuous files actually do (see Adobe's pain with Javascript in PDF).

(2) Ease of implementation: If we allow something like this, all
conforming implementations will have to include an interpreter for
whatever code this turns out to be. This will typically be a major
effort (or at least dependency) that's going to hurt adoption (not to
mention security concerns again). On the other hand, I've always wanted
to write a FORTH machine...

(3) Complexity considerations: As file formats are always at the "edges"
of computer systems, it's great if they are "verifiable" in some
sense (e.g., checking validity with a context free grammar). This
feature is deep, deep within the land of Turing complete languages with
all the related problems ("will this image halt?"). That's a fundamental
flaw for something that sounds like a fairly exotic application that
would probably be better solved by local convention (a pipeline manual
might state: "look for the chunk labeled 'foo-execute', check for foo's
signature via the foo-signature chunk, and then just do it"; a sketch of
that convention follows below).
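
A minimal sketch of that local-convention approach, kept entirely outside the format spec; the `chunks` mapping, the chunk labels, the signing key, and the external `foo-interpreter` are all hypothetical:

```python
# Hypothetical sketch: execution stays outside the file-format spec and is
# governed by the pipeline's own convention. Nothing here is standardized.
import hashlib
import hmac
import subprocess

def run_foo_chunk(chunks: dict, key: bytes) -> None:
    """`chunks` maps chunk labels to raw bytes, as read by the pipeline's own I/O layer."""
    code = chunks["foo-execute"]          # the payload the pipeline chose to trust
    signature = chunks["foo-signature"]   # detached signature stored alongside it
    expected = hmac.new(key, code, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("foo-execute chunk failed its signature check")
    # "and then just do it" -- hand the verified payload to an external interpreter
    subprocess.run(["foo-interpreter"], input=code, check=True)
```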

@embray commented Jan 12, 2015

I agree; if nothing else, this use case needs clarification and/or narrowing. There is a narrow sense in which this might be useful: for WCS, or possibly other data reduction uses, it would be possible to embed simple instructions for sequences of transformations and arithmetic functions to perform on some data in the file. But as currently written, this use case reads to me like stored procedures, as in a database, and that is something I think we want to avoid.
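
As a purely illustrative sketch of what such an embedded sequence of arithmetic steps might look like when a reader library applies it with NumPy; the operation names and structure are invented here, not taken from the draft use case:

```python
# Purely illustrative: a declarative list of arithmetic steps that a reader
# library could apply to an array; the operation vocabulary is invented.
import numpy as np

steps = [
    {"op": "scale", "factor": 1.7},    # multiply every pixel
    {"op": "offset", "value": -32.0},  # then subtract a pedestal
]

def apply_steps(data, steps):
    out = np.asarray(data, dtype=float)
    for step in steps:
        if step["op"] == "scale":
            out = out * step["factor"]
        elif step["op"] == "offset":
            out = out + step["value"]
        else:
            raise ValueError(f"unknown operation: {step['op']}")
    return out

print(apply_steps(np.arange(4), steps))  # [-32.  -30.3 -28.6 -26.9]
```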

@brianthomas (Member)

No, the intention here was not to embed any executable or compiled code; it's basically what Erik wrote above. The idea was that some restricted set of notation/instructions would be adopted in the standard so that some parts of the data could be algorithmically described (and generated). Libraries, regardless of their actual implementation language, would have to support parsing and executing the instruction set.
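
For example, a sketch under the assumption that the notation is plain structured text with a closed, spec-defined operation set; the vocabulary and field names are invented:

```python
# Sketch of the "restricted instruction set" idea: the notation is plain
# structured text that a library in any language can parse and validate
# against a closed whitelist of operations. The vocabulary is invented.
import json

ALLOWED_OPS = {"scale", "offset", "log10", "clip"}  # closed, spec-defined set

def parse_instructions(text):
    steps = json.loads(text)
    for step in steps:
        if step.get("op") not in ALLOWED_OPS:
            raise ValueError(f"operation not in the standard: {step.get('op')!r}")
    return steps

parse_instructions('[{"op": "scale", "factor": 2.0}, {"op": "log10"}]')
```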

brianthomas self-assigned this Jan 12, 2015
@embray commented Jan 12, 2015

This one does need to be handled with care though. If we allow mathematical transformations on image data, why not allow, say, virtual tables created from joins of other tables, or other such database-like operations? I don't think we should have such a requirement, but why do we privilege one type of embedded data transformation over another? I'm not sure how to write this use case in such a way that addresses that slippery slope.

@msdemlei (Author)

Hi,

On Mon, Jan 12, 2015 at 02:20:47PM -0800, Erik Bray wrote:

> This one does need to be handled with care though. If we allow
> mathematical transformations on image data, why not allow, say,
> virtual tables created from joins of other tables, or other such
> database-like operations? I don't think we should have such a

I'd maintain it's a question of the type of machine required to
execute the embedded specifications, and I'd say we should be "below
Turing" in some sense.

Now, for use cases like the specification of generalised transforms,
common mathematical expressions would evidently be required, and such
expressions are, in themselves, probably not computable by pushdown
automata -- but I'd not be worried about these; accepting "normal math"
as elementary operations doesn't look dangerous to me.

Loops and function definitions are an entirely different beast. The
difference essentially is that it's easy to reason about what the
expressions do, whereas with loops and recursion it's at least hard
and in general impossible. Conditionals are a bit in between, but
something like SQL's CASE should be useful for many important
expressions (splines, say) while probably not poisoning the language
with computability problems.

The bottom line is that if we come up with a spec on this, we should
find some computability experts and ask them for their opinions...

Cheers,

     Markus
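
A small sketch of the CASE-style point above: a piecewise (spline-like) function written as a pure expression, with no loops or recursion, so its behaviour stays easy to reason about. NumPy's `piecewise` is used here purely for illustration:

```python
# CASE WHEN x < 0 THEN 0 WHEN x < 1 THEN x ELSE 1 END, as a loop-free expression
import numpy as np

def piecewise_linear(x):
    return np.piecewise(
        x,
        [x < 0, (x >= 0) & (x < 1), x >= 1],
        [0.0, lambda v: v, 1.0],
    )

print(piecewise_linear(np.array([-0.5, 0.25, 2.0])))  # [0.   0.25 1.  ]
```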

@nxg commented Jan 13, 2015

This approach, as Brian glosses it, is similar to the approach used by the AST library. That library provides general WCS support not by listing a number of algorithms and parameters, in the style of FITS-WCS, but by implementing a collection of general transformations which can be composed to provide complicated transformations on the data (and several of which are precomposed to provide the standard WCS mappings). That manifestly works in that case, and it's easy to see how it might work for a more general case of data transformations.

The transformations are specified within NDF files in (if I recall correctly; it's been a while) a not terribly readable form. One could imagine a little language which articulated them in a more naturally editable form.
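
Not the AST library's actual API, but a toy sketch of the composition idea it embodies: simple transforms combined into a single mapping (the transforms and numbers below are invented):

```python
# Toy illustration of composable transformations; not the AST library's API.
def shift(offset):
    return lambda x: x + offset

def scale(factor):
    return lambda x: x * factor

def compose(*transforms):
    def combined(x):
        for t in transforms:
            x = t(x)
        return x
    return combined

# e.g. pixel index -> wavelength, built from elementary pieces
pixel_to_wavelength = compose(shift(-1024.0), scale(0.05), shift(656.3))
print(pixel_to_wavelength(1024.0))  # 656.3
```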

@brianthomas (Member)

I've tried to rewrite this use case a little based on the discussion here. I've dropped the idea of generating theory datasets from the use case and focused it more on tabular and image transformation/value generation. Please take a look and give feedback. I expect we'll still need to iterate.

@mdboom commented Jan 13, 2015

In my view, use case #6 still reads more like a feature in search of a use case than a use case. It would be helpful to understand the reasons why such a feature would be important, and why it must be part of a storage file format.

I understand why descriptions of coordinate transformations are essential: they allow for mapping between logical and physical coordinates without the problems that come with resampling the data. It could be done with a fixed lookup table (and HST had a history of that in some cases), but being able to tweak the knobs of the transformation has proven very useful.

I'm not as sold on the reasons why algorithmically-generated data must be specified in the file format, rather than as an adjunct tool or extension for that purpose. Particularly given that the file format will support the storage of structured metadata, one could store a procedure in the file that could be understood by some domain-specific tool in the future. I don't think the file format should require anything like this, as it adds significantly to the implementation burden and has the potential to create many more security holes where otherwise there would be few.
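
A sketch of that alternative, under the assumption that the file only records a procedure as structured metadata and never executes it; every name and parameter here is hypothetical:

```python
# The file merely records which procedure applies; the format itself never
# executes anything. All names and parameters here are hypothetical.
procedure_record = {
    "tool": "mypipeline.detrend",   # external, domain-specific tool
    "version": "2.1",
    "parameters": {"order": 3, "sigma_clip": 4.0},
}
# A conforming reader only needs to round-trip this mapping; an optional,
# separate tool may choose to interpret and run it.
```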

@brianthomas (Member)

There appears to be some confusion here still. Another attempt to explain this: allowing simple mathematical formulae to describe the data is a good thing, if only from the standpoint of compression of large datasets. It also promotes long-term understanding of the data, since you can see succinctly what the underlying formula (and perhaps scientific principle, as applicable) is behind that portion of the dataset.

> I'm not as sold on the reasons why algorithmically-generated data must be specified in the file format, rather than as an adjunct tool or extension for that purpose.

I'd be all for more complex generation of data sitting in an (optional, outside-the-spec) plugin with compiled code. Where the line falls between "simple mathematical formulae" and "complex generation" is another matter.
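
A sketch of the compression argument, with an invented formula notation: a one-line description can stand in for a large, regularly structured array.

```python
# Illustration only: the formula syntax and the realize() reader are invented.
import numpy as np

# Instead of storing 1024 x 1024 pixel values, the file could store only:
description = {"shape": [1024, 1024], "formula": "0.5 * row + 2.0 * col"}

def realize(desc):
    rows, cols = np.indices(desc["shape"])
    # Hard-coded evaluation for clarity; a real reader would parse the
    # restricted formula notation instead.
    return 0.5 * rows + 2.0 * cols

data = realize(description)
print(data.nbytes)  # 8388608 bytes recovered from a description a few dozen bytes long
```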
