Crawling metadata files which reference external content files #50

Open
hpollock opened this issue May 28, 2019 · 3 comments

Comments

@hpollock

hpollock commented May 28, 2019

We would like to use the Filesystem Collector to crawl a collection of textual metadata files on the file system (one file per document). We can use taggers in the pre-parse handlers to extract this text as document metadata. However, each record can (though does not always) reference an external file path to the actual document file, which we would want the document parser to parse.

Is there an easy way through configuration to route this external document file to the parser for parsing so that the metadata record and document content are effectively combined?

@essiembre
Contributor

I do not think there is an out-of-the-box way to do this. If you know your Java, here is a suggestion:

Implement an IFileDocumentProcessor and add it as an entry under <postImportProcessors>.

Your document processor receives a FileDocument argument that contains the file's metadata and content. Read the path of the child document you want to merge from it, use the FileSystemManager argument to fetch that file, then call the Importer module explicitly to parse the target document and merge the result yourself. It is not trivial, but it is the only option that comes to mind right now.
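To make the merge step more concrete, here is a minimal sketch of the logic using only the JDK. It is not written against the Norconex API: the class name MetadataContentMerger, the "key=value" metadata format, and the contentPath key are all hypothetical, and plain file reading stands in for both the FileSystemManager fetch and the Importer parse. In a real IFileDocumentProcessor, those two steps would go through the arguments described above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

final class MetadataContentMerger {

    // Hypothetical metadata key holding the path of the external content file.
    static final String CONTENT_PATH_KEY = "contentPath";

    // Combined result: metadata fields plus extracted text (empty if no file is referenced).
    static final class MergedDocument {
        final Map<String, String> metadata;
        final String content;

        MergedDocument(Map<String, String> metadata, String content) {
            this.metadata = metadata;
            this.content = content;
        }
    }

    // Parses "key=value" lines (an assumed format); lines without a key or '=' are skipped.
    static Map<String, String> parseMetadata(Path metadataFile) throws IOException {
        Map<String, String> metadata = new LinkedHashMap<>();
        for (String line : Files.readAllLines(metadataFile, StandardCharsets.UTF_8)) {
            int sep = line.indexOf('=');
            if (sep > 0) {
                metadata.put(line.substring(0, sep).trim(), line.substring(sep + 1).trim());
            }
        }
        return metadata;
    }

    // Stand-in for the parsing step: a real processor would hand the file
    // to the Importer instead of reading it as plain text.
    static String extractText(Path contentFile) throws IOException {
        return new String(Files.readAllBytes(contentFile), StandardCharsets.UTF_8);
    }

    // Reads the metadata record and, when it references a content file, merges that file's text in.
    static MergedDocument merge(Path metadataFile) throws IOException {
        Map<String, String> metadata = parseMetadata(metadataFile);
        String reference = metadata.get(CONTENT_PATH_KEY);
        if (reference == null || reference.isEmpty()) {
            // Not every record references an external file.
            return new MergedDocument(metadata, "");
        }
        // Relative references are resolved against the metadata file's directory.
        Path contentFile = metadataFile.toAbsolutePath().getParent().resolve(reference);
        return new MergedDocument(metadata, extractText(contentFile));
    }
}
```

Records without a reference simply come back with empty content, which matches the case where a metadata file has no external document.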

@hpollock
Author

Thanks for the quick response and the suggested approach, Pascal. We'll try that out.

@essiembre
Contributor

I am marking this as a feature request to be able to merge content with another file.
