Chunking

code2k13 edited this page Jul 14, 2021 · 2 revisions

The 'chunk.py' script extracts chunks from an nlphose-compliant JSON record using NLTK's regex-based chunking feature. You must supply two parameters: a name for the regex pattern, followed by the pattern itself. The script appends a 'chunks' attribute to the JSON record; each entry pairs the pattern name with an extracted chunk. The script can be used multiple times in a pipeline, in which case each invocation should use a different pattern name.
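
The behaviour described above can be approximated in a few lines of Python. This is a minimal sketch, not the actual source of chunk.py: the helper name `append_chunks` is hypothetical, and the tokens are tagged by hand so the example runs without downloading a POS tagger model (the real script presumably tags the record's text itself). It relies only on NLTK's `RegexpParser`.

```python
from nltk import RegexpParser


def append_chunks(record, name, pattern, tagged_tokens):
    """Append [name, chunk_text] pairs to record["chunks"] (hypothetical helper)."""
    # Build a one-rule chunk grammar, e.g. "adj_noun: {<JJ><NN>}".
    parser = RegexpParser(f"{name}: {pattern}")
    tree = parser.parse(tagged_tokens)
    # Reuse an existing "chunks" list so repeated calls accumulate results.
    chunks = record.setdefault("chunks", [])
    for subtree in tree.subtrees(filter=lambda t: t.label() == name):
        chunks.append([name, " ".join(word for word, _tag in subtree.leaves())])
    return record


# Hand-tagged tokens: an assumption made only to keep the sketch self-contained.
tagged = [("the", "DT"), ("happy", "JJ"), ("man", "NN"),
          ("must", "MD"), ("escape", "VB"), ("danger", "NN")]
record = {"text": "the happy man must escape danger"}

# Two passes with different pattern names, mirroring repeated use in a pipeline.
append_chunks(record, "adj_noun", "{<JJ><NN>}", tagged)
append_chunks(record, "vb_noun", "{<VB><NN>}", tagged)
print(record["chunks"])
```

Because every entry carries its pattern name, results from several passes can share one list and still be told apart, which is why each invocation needs a distinct name.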

Sample usage:

./file2json.py -n 3 data/1342-0.txt |\
./chunk.py  observation '{<JJ>|<NN?>*<NN>}'

Here is a sample output for the above command:

{
  "file_name": "1342-0.txt",
  "id": "b3656be6-e2e7-11eb-93c4-42b45ace4426",
  "text": " Mr. Wickham was the happy man towards whom almost every female eye was turned, and Elizabeth was the happy woman by whom he finally seated himself; and the agreeable manner in which he immediately fell into conversation, though it was only on its being a wet night, made her feel that the commonest, dullest, most threadbare topic might be rendered interesting by the skill of the speaker.",
  "chunks": [
    [
      "observation",
      "happy man"
    ],
    [
      "observation",
      "happy woman"
    ],
    [
      "observation",
      "agreeable manner"
    ],
    [
      "observation",
      "threadbare topic"
    ]
  ]
}

Example where chunk.py is used twice:

./file2json.py -n 3 data/1342-0.txt |\
./chunk.py  adj_noun '{<JJ><NN>}' |\
./chunk.py  vb_noun '{<VB><NN>}' 

Here is an output record generated by the above command:

{
  "file_name": "1342-0.txt",
  "id": "3c34a00e-e2e8-11eb-aeff-42b45ace4426",
  "text": " To Mr. Darcy it was welcome intelligence—Elizabeth had been at Netherfield long enough. She attracted him more than he liked—and Miss Bingley was uncivil to _her_, and more teasing than usual to himself. He wisely resolved to be particularly careful that no sign of admiration should _now_ escape him, nothing that could elevate her with the hope of influencing his felicity; sensible that if such an idea had been suggested, his behaviour during the last day must have material weight in confirming or crushing it.",
  "chunks": [
    [
      "adj_noun",
      "welcome intelligence—Elizabeth"
    ],
    [
      "adj_noun",
      "last day"
    ],
    [
      "adj_noun",
      "material weight"
    ],
    [
      "vb_noun",
      "_now_ escape"
    ]
  ]
}
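
When several patterns have been applied, a downstream consumer will usually want the chunks grouped by pattern name. The following sketch uses only the Python standard library; the function name `group_chunks` is hypothetical, and the record is an abridged copy of the output above.

```python
import json
from collections import defaultdict


def group_chunks(record):
    """Group a record's [name, chunk] pairs by pattern name (hypothetical helper)."""
    grouped = defaultdict(list)
    # A record that has not passed through chunk.py has no "chunks" key.
    for name, chunk in record.get("chunks", []):
        grouped[name].append(chunk)
    return dict(grouped)


# Abridged version of the record shown above, serialised as a JSON string.
raw = json.dumps({
    "file_name": "1342-0.txt",
    "chunks": [
        ["adj_noun", "last day"],
        ["adj_noun", "material weight"],
        ["vb_noun", "_now_ escape"],
    ],
})

grouped = group_chunks(json.loads(raw))
print(grouped)
# {'adj_noun': ['last day', 'material weight'], 'vb_noun': ['_now_ escape']}
```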