update information on pretrained model and user instructions
selenasong committed Jul 19, 2024
1 parent 61f799a commit cdeb761
Showing 1 changed file with 45 additions and 14 deletions.
59 changes: 45 additions & 14 deletions README.md
@@ -6,27 +6,43 @@
This app extracts keywords in a text document according to tokens' TF-IDF scores. The IDF scores are generated from
a given list of text files in a directory.

## Information on the available model
The current available model for keyword extraction is trained with 22 out of 24 NewsHour transcripts listed in
[batch2.txt](https://github.com/clamsproject/aapb-annotations/blob/9cbe41aa124da73a0158bfc0b4dbf8bafe6d460d/batches/batch2.txt).
The excluded files and the reasons for their exclusion are:
* `cpb-aacip-525-028pc2v94s`: File not found in the dataset
* `cpb-aacip_507-r785h7cp0z`: Contains no transcript but an error message

This model is trained with English stopwords removed.
Tokens that appear in more than 85% of these 22 documents are also removed (i.e., `max_df=0.85`).
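
To illustrate what the `max_df=0.85` cutoff does, here is a minimal standard-library sketch. It is not the app's code (which presumably relies on `scikit-learn`); the function name `idf_with_max_df` and the toy documents are hypothetical, the smoothed IDF formula is an assumed choice, and the documents are assumed to be tokenized with stopwords already removed.

```python
import math
from collections import Counter


def idf_with_max_df(docs, max_df=0.85):
    """Return {token: idf} for tokens found in at most max_df of the documents."""
    n_docs = len(docs)
    # Document frequency: the number of documents each token occurs in.
    df = Counter(token for doc in docs for token in set(doc))
    return {
        # Smoothed IDF (an assumption): ln((1 + n) / (1 + df)) + 1
        token: math.log((1 + n_docs) / (1 + count)) + 1
        for token, count in df.items()
        # Drop tokens whose document frequency exceeds the cutoff.
        if count <= max_df * n_docs
    }


# Hypothetical toy corpus: "news" occurs in every document.
docs = [
    ["senate", "budget", "news"],
    ["weather", "storm", "news"],
    ["election", "budget", "news"],
]
idf = idf_with_max_df(docs)
```

In this toy corpus, `news` appears in all three documents, which is above the 85% cutoff, so it gets no IDF entry. Rarer tokens such as `storm` receive a higher IDF than `budget`, which appears in two documents.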

## User instruction
### System requirements
* Requires Python3 with `clams-python`, and `scikit-learn` to run the app locally.
* Requires Python3 with `clams-python`, `clams-utils` and `scikit-learn` to run the app locally.
* Requires an HTTP client utility (such as `curl`) to invoke and execute analysis.
* Requires Docker to run the app in a Docker container.

Run `pip install -r requirements.txt` to install the requirements.

### Generate IDF scores for tokens in text documents in a directory
### Train a model with NewsHour transcripts using `tfidf.py`
> **NOTE:**
> If you only look to use the keyword extractor app instead of training your own model,
> please skip this section and follow instructions in the next section.
After getting into the working directory, run the following line on the target dataset:

`python tfidf.py --dataPath path/to/target/dataset/directory`

By running this line, tfidf.py generates a pickle file named `idf_feature_file.pkl` by default that stores the IDF values
and the corresponding feature dictionary for the use of later keyword extraction.
By running this line, `tfidf.py` does 2 things:
* cleans all transcripts in a given directory.
* generates a pickle file named `idf_feature_file.pkl` by default that stores the IDF values and the corresponding
feature dictionary. Currently, this file must keep its default name; renaming it affects running `cli.py` later on.
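
As a rough picture of the second step, the sketch below writes IDF values and a feature dictionary to a pickle file with the default name and reads them back. The tuple layout and the sample values are assumptions made for illustration only; the structure `tfidf.py` actually writes may differ.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical contents: the real layout of idf_feature_file.pkl may differ.
idf_values = [1.69, 1.29]
feature_dict = {"storm": 0, "budget": 1}  # token -> column index

# Write to a temporary directory, using the default file name.
path = Path(tempfile.mkdtemp()) / "idf_feature_file.pkl"
with path.open("wb") as f:
    pickle.dump((idf_values, feature_dict), f)

# Later (e.g., at keyword extraction time), load the same objects back.
with path.open("rb") as f:
    loaded_idf, loaded_features = pickle.load(f)
```

Pickle files can execute arbitrary code when loaded, so only load an `idf_feature_file.pkl` that comes from a trusted source.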

If these files need to be named differently from the default, then add `--idfFeatureFile` and the expected
file name to the command above to change the names of the generated files.
~~If the pickle file needs to be named differently from the default, then add
`--idfFeatureFile` and the expected name to the command above to change the names.~~

> **warning:**
> renaming files at this step will affect the command for running the keyword extractor in the later step
> **~~warning:~~**
> ~~renaming the pickle file at this step will affect the command for running the keyword extractor in the later step~~
The default value for max document frequency is 0.85. If a different value is required, then add `--maxDf`
and the expected float value (max value is 1.0) to the command above.
@@ -35,13 +51,28 @@ and the expected float value (max value is 1.0) to the command above.

General user instructions for CLAMS apps are available at [CLAMS Apps documentation](https://apps.clams.ai/clamsapp).

> **note:**
> If the file storing the IDF values and the feature dict does not have the default file name as listed above,
> then when running `cli.py` according to the Apps documentation, before entering input and output `mmif` files,
> add `--idfFeatureFile` and the corresponding file name
To run this app in CLI:

`python cli.py --optional_params <input_mmif_file_path> <output_mmif_file_path>`

2 types of input `MMIF` files are acceptable here:
* The ones that are generated through `clams source text:/path/to/the/target/txt/file` to extract keywords for a single
text document.
* The ones whose last view containing TextDocument(s) is the view to extract keywords from.
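
The second case can be sketched over a MMIF file loaded as a plain dictionary. This is a simplified illustration, not the app's implementation: the helper `last_text_view`, the sample views, and the `@type` URIs are assumptions, and the first case (where the text document comes from `clams source`) is not covered.

```python
def last_text_view(mmif):
    """Return the last view holding at least one TextDocument, else None."""
    for view in reversed(mmif.get("views", [])):
        for annotation in view.get("annotations", []):
            # Match on the type name only; the full URI here is illustrative.
            if "TextDocument" in annotation.get("@type", ""):
                return view
    return None


# Hypothetical, heavily trimmed MMIF content.
mmif = {
    "views": [
        {
            "id": "v_0",
            "annotations": [
                {
                    "@type": "http://mmif.clams.ai/vocabulary/TextDocument/v1",
                    "properties": {"id": "td_1"},
                }
            ],
        },
        {
            "id": "v_1",
            "annotations": [
                {
                    "@type": "http://mmif.clams.ai/vocabulary/TimeFrame/v1",
                    "properties": {"id": "tf_1"},
                }
            ],
        },
    ]
}
view = last_text_view(mmif)
```

Here `v_1` is the last view overall but holds no TextDocument, so the helper returns `v_0`.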

> **~~note:~~**
> ~~If the file storing the IDF values and the feature dict does not have the default file name as listed above,
> then when running `cli.py` according to the Apps documentation, before entering input and output `MMIF` files,
> add `--idfFeatureFile` and the corresponding file name~~
Default number of keywords extracted from a given text document is 10. If the number of extracted keywords is required
to be different from 10, when running `cli.py`, add `--topN` and a corresponding integer value.

Default number of keywords extracted from a given text document is 10. If this number
is required to be different, when running `cli.py`, add `--topN` and a corresponding integer value
Two scenarios may be seen if the input text document is too short:
1. If the number of tokens in a text document is smaller than the value of `topN`,
then no keywords will be extracted.
2. If the text contains lots of stopwords, then the number of extracted keywords can be less than the value of `topN`,
because the app ignores all stopwords when finding keywords.
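
Both scenarios can be reproduced with a small sketch. The function `top_keywords`, the length-normalized term frequency, and the sample IDF values are assumptions for illustration and are not taken from `cli.py`; stopwords are modeled as tokens that have no IDF entry.

```python
from collections import Counter


def top_keywords(tokens, idf, top_n=10):
    """Return up to top_n tokens ranked by TF-IDF score."""
    # Scenario 1: fewer tokens than top_n, so nothing is extracted.
    if len(tokens) < top_n:
        return []
    # Scenario 2: stopwords have no IDF entry and are skipped,
    # so fewer than top_n candidates may remain.
    counts = Counter(t for t in tokens if t in idf)
    scores = {t: (c / len(tokens)) * idf[t] for t, c in counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical IDF values; "the" is a stopword and therefore absent.
idf = {"storm": 2.0, "coast": 1.5, "hit": 1.0}
tokens = ["the", "storm", "hit", "the", "coast"]

few_keywords = top_keywords(tokens, idf, top_n=4)
no_keywords = top_keywords(tokens[:2], idf, top_n=4)
```

With `top_n=4`, the five-token text yields only three keywords because `the` is ignored, and the two-token text yields none.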

### Configurable runtime parameter
