
sentence_concreteness

This is a package for tagging sentences with their concreteness. The measure is the average of the word-level concreteness ratings in a sentence. Words are matched to their root form, and person, place, or organization entities are assigned the maximum concreteness of 5. The word concreteness ratings that this package relies upon were provided by Brysbaert, Warriner & Kuperman (2013).

This method has been empirically validated in our paper. If you find it helpful, please consider using the following citation:

Aubin Le Quéré, M., Matias, J.N. When curiosity gaps backfire: effects of headline concreteness on information selection decisions. Sci Rep 15, 994 (2025). https://doi.org/10.1038/s41598-024-81575-9

Installation

pip install sentence_concreteness

You will also need to download the spaCy model:

python -m spacy download en_core_web_sm

Requirements

  • csv
  • string
  • inflect
  • spacy
  • truecase
  • nltk

Usage

See demo.py for an example of how to run sentence_concreteness.
Note: The Python package is still experimental; please contact Marianne if you encounter any issues.

Documentation

get_concreteness(word)

Returns the matched concreteness for an individual word. This method will try to match the word to its root form if possible.

| Name | Type | Description |
| ---- | ---- | ----------- |
| word | string | Word that you wish to retrieve the concreteness for. |

get_sentence_concreteness(sentence, verbose=False, num_unmatched_words_allowed=3)

Returns the matched concreteness for a sentence. For each word, this method will try to calculate a concreteness rating and then take the average of all retrieved ratings. If a word is recognized as an entity, it will automatically be assigned a concreteness of 5.

| Name | Type | Description |
| ---- | ---- | ----------- |
| sentence | string | Sentence that you wish to retrieve the concreteness for. |
| verbose | boolean | Whether to output additional information. |
| num_unmatched_words_allowed | int | Number of allowable non-matched words before an error is returned. |
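As a rough illustration of the averaging and unmatched-word behavior described above, here is a minimal sketch. The ratings table, function name, and entity handling below are hypothetical stand-ins, not the package's actual implementation (the real package performs root-form matching and spaCy-based NER).

```python
# Hypothetical sketch of sentence-level concreteness scoring.
# RATINGS is a toy stand-in for the Brysbaert et al. word norms.
RATINGS = {"elephant": 4.7, "lounge": 4.2, "idea": 1.6}
ENTITY_SCORE = 5.0  # person/place/organization entities get the maximum score

def sentence_concreteness_sketch(tokens, entities=(), num_unmatched_words_allowed=3):
    scores, unmatched = [], 0
    for tok in tokens:
        if tok in entities:
            scores.append(ENTITY_SCORE)   # entities are fixed at 5
        elif tok in RATINGS:
            scores.append(RATINGS[tok])   # direct rating lookup
        else:
            unmatched += 1                # word could not be matched
    if unmatched > num_unmatched_words_allowed:
        raise ValueError(f"{unmatched} unmatched words exceeds the allowed limit")
    return sum(scores) / len(scores)      # average over matched words

print(sentence_concreteness_sketch(["elephant", "idea", "Paris"], entities={"Paris"}))
```

The real function additionally normalizes tokens (plurals, tense, comparatives) before declaring a word unmatched.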

Details

To calculate concreteness ratings, we first identify any person, place, or organization entities in a headline using the spaCy package, and encode these entities with the highest concreteness score of 5. We then split the headline into a list of tokens and remove standardized stopwords, ignoring punctuation and cardinal numbers. For each remaining token, we take an iterative approach to mapping it to its concreteness rating, checking after each step whether the word now maps to a rating. If a token cannot yet be matched, we successively attempt to retrieve a singular version of the token (e.g. "elephants" → "elephant"), a present tense version (e.g. "lounged" → "lounge"), or a base adjective (e.g. "greatest" → "great"). If these steps all fail and the word is hyphenated, we take the average of both parts (e.g. "super-spectacular" → "super", "spectacular").
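The fallback chain above can be sketched as follows. This is a hypothetical illustration: it uses a toy ratings dictionary and crude suffix rules in place of the package's actual normalization (which relies on inflect and nltk).

```python
# Sketch of the iterative lookup chain described above (hypothetical).
RATINGS = {"elephant": 4.7, "great": 3.0, "super": 2.5, "spectacular": 3.2}

def lookup_sketch(token):
    # 1. direct match against the ratings table
    if token in RATINGS:
        return RATINGS[token]
    # 2. naive singularization (the package uses inflect for this step)
    if token.endswith("s") and token[:-1] in RATINGS:
        return RATINGS[token[:-1]]
    # 3. crude suffix stripping as a stand-in for tense/adjective normalization
    for suffix in ("ed", "est"):
        if token.endswith(suffix) and token[: -len(suffix)] in RATINGS:
            return RATINGS[token[: -len(suffix)]]
    # 4. hyphenated words: average the parts that do match
    if "-" in token:
        parts = [lookup_sketch(p) for p in token.split("-")]
        parts = [p for p in parts if p is not None]
        if parts:
            return sum(parts) / len(parts)
    return None  # unmatched

print(lookup_sketch("elephants"))          # singular fallback
print(lookup_sketch("super-spectacular"))  # average of the hyphen parts
```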

Limitations

This scale was validated in the context of news headlines by the publisher Upworthy. Additionally, we selected headlines that were between 14 and 16 words. While the measure can be used more generally to tag sentences and for different sentence lengths, scholars may want to conduct additional validation to ensure the scale works for their specific context.

In very rare instances, truecase behaves non-deterministically, which can impact the NER results and therefore make the final concreteness score non-deterministic. In my experience, this only happens about once every 10,000 sentences, but it is something to be aware of nonetheless. This issue can be solved by removing truecase, which may or may not be appropriate for your use case.

Resources used

https://maria-antoniak.github.io/2020/03/25/pip.html
https://realpython.com/pypi-publish-python-package/#prepare-your-package-for-publication
