Add datasource value to records. #28

apotheon · 2017-04-09T15:48:55Z

@lbmn proposed new metadata:

The proposal involved a works_databases_references key to provide references to sources of information about works included in the list, as exemplified in a web paste for a YAML work submission @lbmn provided. This kind of thing could prove useful in the transition to a proper works database, with some automated population of new works entries and perhaps automated alerts of changes to projects behind existing entries based on those data sources.

For now, datasources seems like a better metadata key name, though the specific format of the values in the datasources array must still be considered. Please share any ideas and suggestions in comments here, or in discussion in the #copyfree and/or ##copyfree channels on freenode.


lbmn · 2017-04-14T00:36:57Z

(0) Brainstorming The Reference Format Name

The temporary column name works_databases_references in the web paste was modeled on existing CI conventions. CI uses the word "works" to refer to the various things in this data set (programs, scripting libraries, plug-ins, fonts, videos, books, etc.). Spelling out the word "references" and using underscores mimics the license_reference field. Taken together, the name (although a bit verbose) is descriptive of the concept I am introducing.

Using datasources is ok, but it would be better if we were to think this through and come up with a unique name / initialism to define this concept, initiating a reusable and potentially standard way to reference downloadable works across various works databases. Since it would reference data sources that also contain non-copyfree works, we shouldn't use "copyfree" in the name.

I think the terms "content" and "package" combine to a good general term for what we are talking about. It is more descriptive than the term "work", which can be mistaken for other definitions of this word. Also a work can be unfinished, unpublished, not information-based, etc.

The term "content package" differs from using the word "package" alone, indicating that we can be talking about things other than software: ebooks, structured data packages, Web-site snapshots (ex. ZIM), videos, etc. It differs from using the word "content" alone to indicate something that isn't a free-flowing scrap of content (like this post) but an organized versioned aggregate of all the pieces needed for some end (like a source repository complete with issues discussions).

What we are defining here is a way to link to, or reference, content package metadata held in another database / authority / index. We are connecting different ecosystems: FreeBSD packages link to other FreeBSD packages (ex. as dependencies), and NPM packages link to other NPM packages, but we want to link to both. I think the term interlink is fitting.

And so, until I get better ideas, my suggested name for this standard is: Content Package Interlink Format, or C.P.I.F.! 😃

It's also a play on "copy if": you may or may not want to copy this content package depending on the metadata you find in CPIF links.

But I hope someone suggests something better, so the name is of course subject to change.

(1) Brainstorming The Reference Format Structure

The CPIF link format has to be multi-part, with the first part identifying the package database, and at least one more to uniquely identify the specific package.

Since BSD ports datasources (likely our most significant data source for software) and Gentoo Portage use a slash-delimited path to identify records, I also used a slash following the prefix. Other content package databases may use a deeper hierarchy. This also maps easily to URLs and the Unix filesystem, including some new ideas for the latter (ex. mv /usr/ports /cpif/freebsd).

It is an open question whether CPIF should organize the data sources by category (ex. /cpif/video/youtube, /cpif/software/cabal, etc). I currently think this is a bad idea, because some databases could fall into multiple categories, and it would be best to deal with that further down the path (ex. /cpif/facebook/video/$ID, /cpif/facebook/photo/$ID). Also, some data sources defy easy categorization.
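As a minimal sketch of the multi-part structure described above, a reference string could be split into its database prefix and the remaining path. The names `CpifRef` and `parse_cpif` are hypothetical and exist only for illustration; nothing here is an agreed format or an existing tool.

```python
from typing import NamedTuple, Tuple


class CpifRef(NamedTuple):
    """Hypothetical container for one parsed reference."""
    database: str          # first part: identifies the package database
    path: Tuple[str, ...]  # remaining parts: identify the specific package


def parse_cpif(ref: str) -> CpifRef:
    """Split a slash-delimited reference such as 'freebsd/www/h2o'."""
    parts = ref.strip("/").split("/")
    # The format needs a database prefix plus at least one more part,
    # and no part may be empty (e.g. 'github//h2o' is rejected).
    if len(parts) < 2 or any(part == "" for part in parts):
        raise ValueError(f"not a valid reference: {ref!r}")
    return CpifRef(database=parts[0], path=tuple(parts[1:]))


print(parse_cpif("freebsd/www/h2o"))
print(parse_cpif("homebrew/h2o"))
```

Keeping the path as a tuple of arbitrary length leaves room for databases with a deeper hierarchy, without forcing a category tier onto databases that have none.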

(2) Database Identifiers

I think my web paste example covers most foreseeable scenarios. (Note that it contained an error: I forgot to edit out ".se" from the pkgsrc prefix when pasting.) In light of the above brainstorming, it should now read:

h2o:
  uri:
  - https://h2o.examp1e.net
  tags:
  - server
  - software
  - web
  license:
  - MIT/X11 License
  license_reference:
  - https://h2o.examp1e.net/faq.html
  cpif:
  - github/h2o/h2o
  - freebsd/www/h2o
  - pkgsrc/www/h2o
  - opensuse/h2o
  - homebrew/h2o

It is an open question whether we should use domain names for the projects (ex. brew.sh) rather than a simplified ID string (ex. homebrew). I think that the latter is the way to go. This way we can maintain consistency even if domain names change (ex. a gTLD to Namecoin exodus). Also, sometimes there are multiple sites for a package database: some are the project's more formal sites (ex. freebsd.org, pkgsrc.org), while other third-party sites contain the actual metadata (ex. freshports.org, pkgsrc.SE).
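To make the simplified-ID option concrete, here is a rough sketch of how a record's references might be resolved through one central table. `PREFIX_TABLE` and `resolve` are hypothetical names, and the example.org URL templates are placeholders, not the real endpoints of any package database.

```python
# Hypothetical central lookup table: simplified ID -> URL template.
# The example.org hosts are placeholders chosen for illustration only.
PREFIX_TABLE = {
    "github": "https://github.com/{path}",
    "freebsd": "https://ports.example.org/{path}",
    "homebrew": "https://formulae.example.org/{path}",
}


def resolve(ref: str, table=PREFIX_TABLE) -> str:
    """Turn 'database/rest/of/path' into a URL using the lookup table."""
    database, _, path = ref.partition("/")
    if not path:
        raise ValueError(f"reference has no package path: {ref!r}")
    if database not in table:
        raise KeyError(f"unknown database prefix: {database!r}")
    return table[database].format(path=path)


for ref in ["github/h2o/h2o", "freebsd/www/h2o", "homebrew/h2o"]:
    print(ref, "->", resolve(ref))
```

Under this assumption, a change of domain for one database means editing a single table entry, while the references stored in each record stay the same. The cost is that the table itself becomes a second piece of data someone has to maintain.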

(To be continued...)

apotheon · 2017-04-14T03:21:46Z

Format Name

I'm not a big fan of the term "content" for this purpose. You say this to
justify it:

> The term "content package" differs from using the word "package" alone,
> indicating that we can be talking about things other than software

This usage seems to imply the common use of the term "content" on the internet,
which actually implies it's not software. Broadening it enough to
incorporate software, though, turns the word back into its generalized default
meaning: stuff inside something else. That is so broad as to be meaningless,
and coupled with "package" it becomes even less meaningful, because a "content
package" then just becomes a "package containing contents". Duh, of course --
that's what packages do (contain contents). As such, I think "content package"
is a largely pointless term that does not actually describe what we mean in any
useful fashion. It is, in fact, likely misleading. I would be more inclined
to use "package" by itself than "content package", which does not have to
refer specifically to software. In fact, many package management systems
(originally designed for software) deliver non-software packages as well as
software packages. Consider the existence of documentation packages in, e.g.,
Debian and FreeBSD package archives.

I find your objection to the word "work" unconvincing. The term has some
relevant meaning in law, as well as ample precedent for exactly the sort of
meaning we want. It is aptly descriptive of what the copyfree works database
would address: copyrightable works (and, thus, copyfree-able works).

Of course, I also think that the "interlink format" you propose is naturally
more general than just some medium or protocol for sharing data about content,
software, works, or whatever else you might describe in terms no less specific.
I don't really have any objections to the terms "interlink" and "format", but
think it needs a different name. Perhaps "metadata interlink format" works
better, and lends itself easily to a typical filename extension profile (.mif)
if such is needed.

I think all I really like about the name as a whole that you invented is
the implicit reference to the standard Unix file copy utility, cp. All this
is roughly irrelevant to the matter of figuring out how to actually format the
metadata in YAML, though.

Format Structure / Database Identifiers

I think I'm happier (but not fully happy -- more on that in a moment) with something like:

freebsd/ports/h2o
github/repository/h2o
opensuse/yast/h2o

. . . or something like that. On the other hand, URIs might be appropriate
instead, because both my above alteration of your suggestion and your
suggestion itself run afoul of the problem of needing to maintain some kind of
separate concordance metadata to help resolve that information to a machine
readable set of directions to the original. Maintaining consistency in the
first tier metadata despite changes in the location of the dataset itself is
like putting lipstick on a pig, because one still has to maintain the
concordance as a second tier of metadata -- that is, one must still have the
pig behind the lipstick -- resulting in extra data having to be maintained for
the sole purpose of trying to make things look pretty, trading away performance
and (relatively) easy reliability of data maintenance to get it.

Those are my sixteen cents. Two cents ain't what they used to be.

lbmn · 2017-04-15T20:10:45Z

I'm fine with mif.

I'll just make a few nitpicks, but leave the final decision to CI.

It would be interesting to have a lengthy debate on codifying a reexamined computing terminology — like how the word "content" means "digital stuff you can download and consume" (now including apps) rather than package "contents", etc, etc, etc — but this isn't the place.

I haven't thought of this being a new file format (*.mif), but a string syntax format to be used in other file formats - like how URI / Href syntax is used in HTML / etc. (Major differences from Href would include: *1* It can be a list of reference strings instead of just one. *2* Having a centrally-defined prefix lookup table instead of arbitrary server. *3* No protocol, port, etc; only path.)
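The three differences listed above could be checked mechanically. The sketch below assumes a simple grammar (a lowercase prefix followed by one or more non-empty path parts); that grammar, the pattern `_REF_PATTERN`, and the function `invalid_refs` are all assumptions made for illustration, since no syntax has been agreed on.

```python
import re
from typing import Iterable, List

# Assumed grammar: a lowercase prefix, then one or more non-empty path parts.
# ':' is not an allowed character, which rules out protocols and ports.
_REF_PATTERN = re.compile(r"[a-z0-9][a-z0-9-]*(/[A-Za-z0-9._+-]+)+")


def invalid_refs(refs: Iterable[str]) -> List[str]:
    """Return the entries of a reference list that break the assumed grammar."""
    return [ref for ref in refs if not _REF_PATTERN.fullmatch(ref)]


refs = ["github/h2o/h2o", "freebsd/www/h2o", "https://github.com/h2o/h2o"]
print(invalid_refs(refs))
```

Whether a prefix is actually known would still have to be checked against the central lookup table; this only covers the shape of each string in the list.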

Some of your mif reference path examples seem unnecessarily verbose. What could possibly go under /mif/github/ if not repositories? With FreeBSD there are indeed things outside of ports (base source tree, docs / handbook, Web-site source tree, mailing list, etc), but those would be very rarely used. Maybe it would be better to reference them as /mif/freebsd-base/blah instead?

But, again, I leave the details up to CI. "Don't let perfect be the enemy of good."

I just wanted to emphasize the importance of coming up with a reusable standard for referencing works metadata sources, ideally with a memorable name. This syntax can then be used for a number of my future projects, like a package manager for installing copyfree software, fonts, Nim libraries, ebooks, offline website snapshots, etc.

apotheon · 2017-04-22T16:37:56Z

> What could possibly go under /mif/github/ if not repositories?

gists, GitHub Pages, and wikis

> With FreeBSD there are indeed things outside of ports (base source tree, docs / handbook, Web-site source tree, mailing list, etc), but those would be very rarely used. Maybe it would be better to reference them as /mif/freebsd-base/blah instead?

I'm not entirely sure what you're suggesting.

> I just wanted to emphasize the importance of coming up with a reusable standard for referencing works metadata sources, ideally with a memorable name.

That's a good idea, of course, and I don't object in principle. Getting into the practical details, though, I still think that just providing literal directions to the source information is probably more useful.
