Is there any processing method to exclude <rPh> Tag from sharedStrings.xml in Crawled xlsx File #74

Open
ki-suzuki opened this issue Sep 21, 2023 · 1 comment

ki-suzuki commented Sep 21, 2023

The contents of the sharedStrings.xml file in the target xlsx file for crawling are as follows.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" count="8" uniqueCount="8"><si><t>月日</t><rPh sb="0" eb="2"><t>ガッピ</t></rPh><phoneticPr fontId="2"/></si><si><t>会社名</t><rPh sb="0" eb="3"><t>カイシャメイ</t></rPh><phoneticPr fontId="2"/></si><si><t>金額</t><rPh sb="0" eb="2"><t>キンガク</t></rPh><phoneticPr fontId="2"/></si><si><t>支払日</t><rPh sb="0" eb="3"><t>シハライビ</t></rPh><phoneticPr fontId="2"/></si><si><t>締日</t><rPh sb="0" eb="2"><t>シメビ</t></rPh><phoneticPr fontId="2"/></si><si><t>S社</t><rPh sb="1" eb="2"><t>シャ</t></rPh><phoneticPr fontId="2"/></si><si><t>A社</t><rPh sb="1" eb="2"><t>シャ</t></rPh><phoneticPr fontId="2"/></si><si><t>B社</t><rPh sb="1" eb="2"><t>シャ</t></rPh><phoneticPr fontId="2"/></si></sst>

What I ultimately want to obtain is the content without the <rPh> tags.
(That is, I want to remove elements such as <rPh sb="0" eb="2"><t>キンガク</t></rPh>.)
Is there a processing method available for this? I would very much appreciate your help.

@sakanaosama

After import and content extraction, all tags have already been removed. If you want to exclude specific content based on these tags, you must do it at the 'preParseHandlers' level under "Importer," where the tags are still present because parsing has not happened yet. This configuration is documented at https://opensource.norconex.com/importer/v2/configuration#tbl-transformer. You can achieve this with the 'ReduceConsecutivesTransformer' or by writing a custom script with the 'ScriptTransformer.'
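
To make the intended result concrete, here is a minimal standalone Python sketch of the transformation itself. It is not Norconex configuration and is not taken from the Importer; it only illustrates the logic that a pre-parse script would need to reproduce. The function names `strip_phonetic` and `visible_strings` and the shortened `SAMPLE` (two entries from the file above) are assumptions made for illustration. It also removes `<phoneticPr>`, on the assumption that those elements are unwanted as well.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <sst> root of sharedStrings.xml.
NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"

# Shortened sample built from two <si> entries of the posted file.
SAMPLE = (
    f'<sst xmlns="{NS}" count="2" uniqueCount="2">'
    '<si><t>月日</t><rPh sb="0" eb="2"><t>ガッピ</t></rPh><phoneticPr fontId="2"/></si>'
    '<si><t>S社</t><rPh sb="1" eb="2"><t>シャ</t></rPh><phoneticPr fontId="2"/></si>'
    '</sst>'
)


def strip_phonetic(xml_data):
    """Return the XML with <rPh> and <phoneticPr> removed from each <si>."""
    ET.register_namespace("", NS)  # keep the default namespace on output
    root = ET.fromstring(xml_data)
    unwanted = {f"{{{NS}}}rPh", f"{{{NS}}}phoneticPr"}
    for si in root.iter(f"{{{NS}}}si"):
        # Copy the child list first, since we remove while looping.
        for child in list(si):
            if child.tag in unwanted:
                si.remove(child)
    return ET.tostring(root, encoding="unicode")


def visible_strings(xml_data):
    """Return the text of each <si> entry, ignoring phonetic readings."""
    root = ET.fromstring(strip_phonetic(xml_data))
    return [
        "".join(t.text or "" for t in si.iter(f"{{{NS}}}t"))
        for si in root.iter(f"{{{NS}}}si")
    ]


if __name__ == "__main__":
    print(visible_strings(SAMPLE))
```

In an .xlsx package the shared strings are stored at `xl/sharedStrings.xml`, so the same functions could be fed the bytes returned by `zipfile.ZipFile(path).read("xl/sharedStrings.xml")`, where `path` is whatever file you are inspecting.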
