Metadata extraction at SIP creation time #319

Open
beepsoft opened this issue Sep 20, 2017 · 4 comments

Comments

@beepsoft

Hi,

Would it be possible/feasible to add metadata extraction to roda-in at SIP creation time?

For users who are not archivists and know little about metadata but would like to create meaningful SIPs, I would find it quite useful if roda-in itself provided Tika, ExifTool, etc. plugins (just like RODA) to extract metadata from the files to be included in the SIP. So, when someone adds files and directories to the SIP, there could be an option to run automatic metadata extraction for those files. If the user opts in, roda-in would extract as much metadata as possible and generate the appropriate EAD 2002, Dublin Core, etc. metadata description for each file where possible.

What do you think about it?
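
To make the request a bit more concrete, here is a minimal sketch of the "extract, then generate a description" step. It is only an illustration under stated assumptions: it reads basic filesystem facts with the Python standard library as a stand-in for Tika or ExifTool, the function names `extract_basic_metadata` and `to_simple_dc` are hypothetical, and the `metadata` wrapper element is a placeholder rather than the schema roda-in or RODA actually uses.

```python
import mimetypes
import os
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace of the Dublin Core Metadata Element Set 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"


def extract_basic_metadata(path):
    """Collect a few facts about a file (stand-in for Tika/ExifTool output)."""
    stat = os.stat(path)
    mime, _ = mimetypes.guess_type(path)
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "title": os.path.basename(path),
        "format": mime or "application/octet-stream",
        "date": modified.date().isoformat(),
    }


def to_simple_dc(metadata):
    """Serialize a dict of Dublin Core element names to a small XML record."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in metadata.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")
```

Calling `to_simple_dc(extract_basic_metadata(path))` for each file added to a SIP would give a non-archivist a pre-filled record to review instead of an empty form.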

@hsilva-keep
Member

Dear @beepsoft, it would be possible, and we had a very similar strategy in roda-in version 1.x for file characterization, but we dropped that approach because:

  1. we found that integrating those tools was very heavy in terms of app size (without them, the app went from 200/300MB down to 20MB in version 2.x)
  2. we have changed the app's objectives: in version 2.x the app must be able to produce massive amounts of SIPs, in different package formats, with high versatility in terms of metadata schemas (not prescriptive), with a few clicks.

And this approach marries very well with the RODA repository, because all the other tasks will be done on the repository side, e.g. preservation metadata creation with technical metadata.

So, as far as I'm aware, we have no plans to add that type of functionality. Nevertheless, thanks for the suggestion.

@beepsoft
Author

Dear @hsilva-keep,

thanks for the clarification! Do you have a roadmap or an estimate of when 2.x will be released?

Thanks!

@hsilva-keep
Member

At the moment, although there is no final/official release yet, the latest release is stable and fully functional. And we don't have any urgent/needed functionality to develop in the next couple of weeks/months.

@luis100
Member

luis100 commented Sep 21, 2017

@beepsoft the idea of extra features in RODA-in is interesting and, as @hsilva-keep said, it has been tried in the past but had drawbacks. To do it right, I foresee we need the following:

  • A way to add the new functionality at runtime (i.e. as plugins; it can reuse the RODA mechanism, but new interfaces need to be created to hook the new functionality into several parts of the system, at logic level and interface level).
  • RODA-in base needs to remain small (20MB), so plugins need to be downloaded and installed at runtime. We will require a market, as in applications like Atom.io or SublimeText.
  • Plugins need to be sandboxed (i.e. a crashing plugin must not take down the app or affect its functionality) and their performance needs to be ensured (e.g. by killing a plugin that takes too much time).
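
The sandboxing and time-limit requirements could be approached roughly as in the sketch below. This is an illustration under assumptions, not a design for RODA-in: it is written in Python for brevity, `run_plugin` is a hypothetical helper, and the plugin contract (a script that receives a file path as its first argument and prints a JSON object on standard output) is invented for the example. Each plugin runs in its own child process, so a crash cannot take down the host, and `subprocess.run` kills the child when the timeout expires.

```python
import json
import subprocess
import sys


def run_plugin(plugin_source, file_path, timeout_s=10.0):
    """Run a plugin script in a child process and report what happened.

    Hypothetical contract: the plugin reads the file path from sys.argv[1]
    and prints a JSON object with the extracted metadata.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", plugin_source, file_path],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed if it runs longer
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "metadata": {}}

    if proc.returncode != 0:
        # The plugin crashed; the host application keeps running.
        return {"status": "crashed", "metadata": {}}

    try:
        return {"status": "ok", "metadata": json.loads(proc.stdout)}
    except json.JSONDecodeError:
        return {"status": "bad_output", "metadata": {}}


# Three toy plugins: one well-behaved, one that crashes, one that hangs.
GOOD_PLUGIN = 'import json, sys; print(json.dumps({"path": sys.argv[1]}))'
CRASHING_PLUGIN = 'raise RuntimeError("boom")'
SLOW_PLUGIN = "import time; time.sleep(60)"
```

Process isolation only covers crashes and runaway execution time; restricting what a plugin may read or write would need an additional mechanism.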

Having this, I imagine several types of plugins:

  • Descriptive metadata extraction including file format identification
  • Embedded (pre)viewers for files of a certain format
  • Digital signatures
  • New SIP formats
  • New descriptive metadata formats
  • New association methods
  • New descriptive metadata selection methods
  • Harvesters:
    • Web harvest (from a URL and some options create a WARC file)
    • Database harvest (using DBPTK to extract a SIARD2 from a DBMS)
    • ERMS harvest (i.e. extract information from a Document Management System) (e.g. Sharepoint, Alfresco, CMIS, etc.)
