This is a package for tagging sentences with their concreteness. The measure is the average concreteness of the words in a sentence. Words are matched to their root form, and person, place, or organization entities are assigned the maximum concreteness of 5. The word concreteness ratings that this package relies on were provided by Brysbaert, Warriner & Kuperman (2013).
This method has been empirically validated in our paper. If you find it helpful, please consider using the following citation:
Aubin Le Quéré, M., Matias, J.N. When curiosity gaps backfire: effects of headline concreteness on information selection decisions. Sci Rep 15, 994 (2025). https://doi.org/10.1038/s41598-024-81575-9
```
pip install sentence_concreteness
```
You will also need to download the spaCy model:

```
python -m spacy download en_core_web_sm
```
This package relies on the following dependencies:
- csv
- string
- inflect
- spacy
- truecase
- nltk
See `demo.py` for an example of how to run `sentence_concreteness`.
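As a minimal sketch (assuming the package exposes `get_sentence_concreteness` at the top level, consistent with the API documentation below):

```python
from sentence_concreteness import get_sentence_concreteness

# Average word concreteness for a single sentence, on the 1-5 Brysbaert scale.
score = get_sentence_concreteness("The elephant lounged by the river")
print(score)
```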
Note: The Python package is still experimental; please contact Marianne if you encounter any issues.
Returns the matched concreteness rating for an individual word. This method will try to match the word to a root form if no direct match is found.
Name | Type | Description |
---|---|---|
`word` | string | Word that you wish to retrieve the concreteness for. |
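A hedged example (assuming the function is exposed as `get_word_concreteness`, matching the parameter table above):

```python
from sentence_concreteness import get_word_concreteness

# Direct match against the Brysbaert et al. ratings.
print(get_word_concreteness("elephant"))

# Falls back to the singular root form ("elephants" -> "elephant").
print(get_word_concreteness("elephants"))
```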
Returns the matched concreteness for a sentence. For each word, this method tries to calculate a concreteness rating, then takes the average of all retrieved ratings. If a word is recognized as an entity, it is automatically assigned a concreteness of 5.
Name | Type | Description |
---|---|---|
`sentence` | string | Sentence that you wish to retrieve the concreteness for. |
`verbose` | boolean | Whether to output additional information about how words were matched. |
`num_unmatched_words_allowed` | int | Number of unmatched words allowed before an error is returned. |
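A usage sketch (again assuming a top-level `get_sentence_concreteness` export; see `demo.py` for the authoritative example):

```python
from sentence_concreteness import get_sentence_concreteness

score = get_sentence_concreteness(
    "Scientists discover ancient shipwreck off the coast of Greece",
    verbose=True,                   # output details on how each word was matched
    num_unmatched_words_allowed=2,  # tolerate up to two unmatched words
)
print(score)
```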
To calculate concreteness ratings, we first identify any person, place, or organization entities in a headline using the spaCy package and encode these entities with the highest concreteness score of 5. We then split the headline into a list of tokens and remove standardized stopwords. We ignore punctuation and cardinal numbers. From the remaining tokens, we take an iterative approach to mapping each token to its concreteness rating, checking after each step whether the token now maps to a rating. At each step, if we cannot yet retrieve a concreteness rating for a token, we attempt, in order, to retrieve a singular version of the token (e.g. "elephants" → "elephant"), a present tense version (e.g. "lounged" → "lounge"), or a base adjective (e.g. "greatest" → "great"). If these steps all fail and the word is hyphenated, we take the average of both parts (e.g. "super-spectacular" → "super", "spectacular").
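The sketch below illustrates this pipeline under stated assumptions: `ratings` stands in for the Brysbaert et al. lexicon as a dict of word → rating, the function names are hypothetical, and the real package's internals (including its use of truecase before NER) may differ.

```python
# Requires: python -m spacy download en_core_web_sm,
# plus nltk's "stopwords" and "wordnet" corpora.
import inflect
import spacy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nlp = spacy.load("en_core_web_sm")
engine = inflect.engine()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def sentence_concreteness(sentence, ratings):
    doc = nlp(sentence)
    # Tokens inside person/place/organization entities score the maximum of 5.
    entity_tokens = {t.i for ent in doc.ents
                     if ent.label_ in {"PERSON", "GPE", "LOC", "ORG"}
                     for t in ent}
    scores = []
    for token in doc:
        # Ignore punctuation and cardinal numbers (approximated via like_num).
        if token.is_punct or token.like_num:
            continue
        word = token.text.lower()
        if word in stop_words:                    # drop standardized stopwords
            continue
        if token.i in entity_tokens:
            scores.append(5.0)
            continue
        score = lookup(word, ratings)
        if score is not None:
            scores.append(score)
    return sum(scores) / len(scores) if scores else None

def lookup(word, ratings):
    if word in ratings:
        return ratings[word]
    singular = engine.singular_noun(word)         # "elephants" -> "elephant"
    if singular and singular in ratings:
        return ratings[singular]
    verb = lemmatizer.lemmatize(word, pos="v")    # "lounged" -> "lounge"
    if verb in ratings:
        return ratings[verb]
    adj = lemmatizer.lemmatize(word, pos="a")     # "greatest" -> "great"
    if adj in ratings:
        return ratings[adj]
    if "-" in word:                               # average hyphenated parts
        parts = [lookup(p, ratings) for p in word.split("-")]
        parts = [p for p in parts if p is not None]
        if parts:
            return sum(parts) / len(parts)
    return None
```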
This scale was validated in the context of news headlines from the publisher Upworthy. Additionally, the validation used headlines that were between 14 and 16 words long. While the measure can be applied more generally to sentences of other lengths, scholars may want to conduct additional validation to ensure the scale works for their specific context.
In very rare instances, `truecase` behaves non-deterministically, which can affect the NER results and therefore make the final concreteness score non-deterministic. In my experience, this happens only about once every 10,000 sentences, but it is something to be aware of nonetheless. This issue can be solved by removing `truecase`, which may or may not be appropriate for your use case.
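For context, `truecase` restores capitalization before NER, so a different casing guess can flip whether a token is tagged as an entity. A hypothetical illustration (not the package's exact pipeline):

```python
import spacy
import truecase

nlp = spacy.load("en_core_web_sm")

headline = "apple hires a new head of design"
restored = truecase.get_true_case(headline)  # e.g. "Apple hires a new head of design"

# Whether "apple" gets capitalized determines whether NER tags it as an entity,
# which in turn decides whether it receives the maximum concreteness of 5.
print([(ent.text, ent.label_) for ent in nlp(restored).ents])
```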
Helpful resources for publishing pip packages:
- https://maria-antoniak.github.io/2020/03/25/pip.html
- https://realpython.com/pypi-publish-python-package/#prepare-your-package-for-publication