initial implementation with clams-python 1.2.4 #3

Merged
merged 11 commits into from
Jul 19, 2024
update information on pretrained model and user instructions
selenasong committed Jul 19, 2024

commit cdeb761a44acfe9b1392a17d902f278160f4cf3e
59 changes: 45 additions & 14 deletions README.md
@@ -6,27 +6,43 @@
This app extracts keywords from a text document based on tokens' TF-IDF scores. The IDF scores are generated from
a given list of text files in a directory.

## Information on the available model
The currently available model for keyword extraction is trained on 22 of the 24 NewsHour transcripts listed in
[batch2.txt](https://github.com/clamsproject/aapb-annotations/blob/9cbe41aa124da73a0158bfc0b4dbf8bafe6d460d/batches/batch2.txt).
The excluded files and the reasons for their exclusion are:
* `cpb-aacip-525-028pc2v94s`: File not found in the dataset
* `cpb-aacip_507-r785h7cp0z`: Contains no transcript but an error message

This model is trained with English stopwords removed.
Tokens that appear in more than 85% of these 22 documents are also removed (i.e., `max_df=0.85`).

## User instructions
### System requirements
* Requires Python3 with `clams-python`, `clams-utils` and `scikit-learn` to run the app locally.
* Requires an HTTP client utility (such as `curl`) to invoke and execute analysis.
* Requires Docker to run the app in a Docker container (see the sketch below).

Run `pip install -r requirements.txt` to install the requirements.
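
To run the app in a container and invoke it over HTTP, the workflow is roughly as follows. This is a sketch only: the image name, the exposed port 5000, and the MMIF file name are assumptions based on common `clams-python` conventions, not taken from this repository; see the CLAMS Apps documentation linked below for the authoritative steps.

```bash
# Build and start the app in a container; the image name and port 5000 are
# assumptions based on typical clams-python Flask apps.
docker build -t keyword-extractor .
docker run --rm -d -p 5000:5000 keyword-extractor

# Invoke the analysis by POSTing an input MMIF file (hypothetical path) with curl.
curl -X POST -d @input.mmif http://localhost:5000
```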

### Train a model with NewsHour transcripts using `tfidf.py`
> **NOTE:**
> If you only want to use the keyword extractor app rather than train your own model,
> please skip this section and follow the instructions in the next section.

From the working directory, run the following command on the target dataset:

`python tfidf.py --dataPath path/to/target/dataset/directory`

By running this line, `tfidf.py` does two things:
* cleans all transcripts in the given directory, and
* generates a pickle file, named `idf_feature_file.pkl` by default, that stores the IDF values and the corresponding
feature dictionary. Currently, this file cannot be renamed, as renaming it affects running `cli.py` later on (a quick sanity check of the output is sketched below).
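
As a quick sanity check after training, the command below simply confirms that the default pickle file exists and can be deserialized; it is a sketch that assumes the default file name and the current working directory, and it does not inspect the internal structure of the stored object.

```bash
# Confirm the default output file was written and loads cleanly.
python -c "import pickle; print(type(pickle.load(open('idf_feature_file.pkl', 'rb'))))"
```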

~~If the pickle file needs to be named differently from the default, then add
`--idfFeatureFile` and the expected name to the command above to change the names.~~

> **~~warning:~~**
> ~~renaming the pickle file at this step will affect the command for running the keyword extractor in the later step~~

The default value for the maximum document frequency is 0.85. If a different value is required, add `--maxDf`
and the desired float value (maximum 1.0) to the command above.
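
For example, a training run that makes the document-frequency cutoff explicit might look like the following; the dataset path is hypothetical and `0.85` simply restates the default value.

```bash
# Train on a directory of NewsHour transcripts (hypothetical path), keeping the
# default maximum document frequency explicit.
python tfidf.py --dataPath /path/to/newshour/transcripts --maxDf 0.85
```
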
@@ -35,13 +51,28 @@ and the expected float value (max value is 1.0) to the command above.

General user instructions for CLAMS apps are available at [CLAMS Apps documentation](https://apps.clams.ai/clamsapp).

To run this app in CLI:

`python cli.py --optional_params <input_mmif_file_path> <output_mmif_file_path>`

Two types of input `MMIF` files are accepted:
* MMIF files generated with `clams source text:/path/to/the/target/txt/file`, used to extract keywords from a single
text document (see the example below).
* MMIF files whose last view containing `TextDocument`(s) is the view from which keywords are extracted.
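
A minimal end-to-end sketch for the single-document case is shown below; the file paths are hypothetical, and it assumes `clams source` writes the generated MMIF to stdout.

```bash
# Wrap a plain-text transcript in a MMIF document (hypothetical path).
clams source text:/path/to/transcript.txt > input.mmif

# Extract keywords; the annotated MMIF is written to output.mmif.
python cli.py input.mmif output.mmif
```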

> **~~note:~~**
> ~~If the file storing the IDF values and the feature dict does not have the default file name as listed above,
> then when running `cli.py` according to the Apps documentation, before entering input and output `MMIF` files,
> add `--idfFeatureFile` and the corresponding file name~~

The default number of keywords extracted from a given text document is 10. If a different number is required,
add `--topN` and the corresponding integer value when running `cli.py` (see the example after the list below).

Two scenarios may occur if the input text document is too short:
1. If the number of tokens in the text document is smaller than the value of `topN`,
then no keywords will be extracted.
2. If the text contains many stopwords, the number of extracted keywords can be smaller than `topN`,
because the app ignores all stopwords when finding keywords.
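
For example, to request 15 keywords instead of the default 10 (the file names are hypothetical):

```bash
# Ask for the top 15 keywords per text document.
python cli.py --topN 15 input.mmif output.mmif
```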

### Configurable runtime parameter