wrapper for input/output, one demo notebook (#146)
Showing 5 changed files with 305 additions and 18 deletions.
@@ -0,0 +1,277 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# AMMICO: AI Media and Misinformation Content Analysis Tool\n", | ||
"With ammico, you can analyze text on images and image content at the same time. This is a demonstration notebook to showcase the capabilities of ammico.\n", | ||
"You can run this notebook on google colab or locally / on your own HPC resource. The first cell only runs on google colab; on all other machines, you need to create a conda environment first and install ammico from the Python Package Index using `pip install ammico`." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# if running on google colab\n", | ||
"# flake8-noqa-cell\n", | ||
"import os\n", | ||
"\n", | ||
"if \"google.colab\" in str(get_ipython()):\n", | ||
" # update python version\n", | ||
" # install setuptools\n", | ||
" # %pip install setuptools==61 -qqq\n", | ||
" # install ammico\n", | ||
" %pip install git+https://github.com/ssciwr/ammico.git -qqq\n", | ||
" # mount google drive for data and API key\n", | ||
" from google.colab import drive\n", | ||
"\n", | ||
" drive.mount(\"/content/drive\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Import the ammico package." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import ammico" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Step 1: Read your data into AMMICO\n", | ||
"The ammico package reads in one or several input files given in a folder for processing. The user can select to read in all image files in a folder, to include subfolders via the `recursive` option, and can select the file extension that should be considered (for example, only \"jpg\" files, or both \"jpg\" and \"png\" files). For reading in the files, the ammico function `find_files` is used, with optional keywords:\n", | ||
"- `path` - the directory containing the image files (defaults to the location set by environment variable `AMMICO_DATA_HOME`); \n", | ||
"- `pattern` - the file extensions that should be considered when reading in the data (defaults to all allowed file extensions, that is \"png\", \"jpg\", \"jpeg\", \"gif\", \"webp\", \"avif\", \"tiff\")\n", | ||
"- `recursive` - include subdirectories recursively (defaults to `True`)\n", | ||
"- `limit` - the maximum number of files that should be read. This is useful for testing, if you want to run the analysis on only a few input files (defaults to 20, to return all images that are found set to `None` or `-1`)\n", | ||
"- `random_seed` - the random seed used for shuffling the images. This is useful if you select only a few images to be read and want to preserve the order and selection (defaults to `None`)\n", | ||
"The `find_files` function returns a nested dict that contains the file ids and the paths to the files and is empty otherwise. This dict is filled step by step with more data as each detector class is run on the data (see below)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"image_dict = ammico.find_files(\n", | ||
" path=\"/content/drive/MyDrive/misinformation-data/\",\n", | ||
" limit=10,\n", | ||
")" | ||
] | ||
}, | ||
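{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"As a sketch only, the optional keywords described above could be combined as follows; the value passed to `pattern` (a list of extensions) is an assumption for illustration, not taken verbatim from the ammico documentation:\n", | ||
"```\n", | ||
"image_dict = ammico.find_files(\n", | ||
"    path=\"/content/drive/MyDrive/misinformation-data/\",\n", | ||
"    pattern=[\"jpg\", \"png\"],\n", | ||
"    recursive=True,\n", | ||
"    limit=-1,\n", | ||
"    random_seed=42,\n", | ||
")\n", | ||
"```\n", | ||
"Setting `limit=-1` reads all images that are found, and fixing `random_seed` keeps the shuffled selection reproducible between runs." | ||
] | ||
}, | ||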
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Step 2: Inspect the input files using the graphical user interface\n", | ||
"A Dash user interface is to select the most suitable options for the analysis, before running a complete analysis on the whole data set. The options for each detector module are explained below in the corresponding sections; for example, different models can be selected that will provide slightly different results. This way, the user can interactively explore which settings provide the most accurate results. In the interface, the nested `image_dict` is passed through the `AnalysisExplorer` class. The interface is run on a specific port which is passed using the `port` keyword; if a port is already in use, it will return an error message, in which case the user should select a different port number. \n", | ||
"The interface opens a dash app inside the Jupyter Notebook and allows selection of the input file in the top left dropdown menu, as well as selection of the detector type in the top right, with options for each detector type as explained below. The output of the detector is shown directly on the right next to the image. This way, the user can directly inspect how updating the options for each detector changes the computed results, and find the best settings for a production run.\n", | ||
"\n", | ||
"Please note that for the Google Cloud Vision API (the TextDetector class) you need to set a key in order to process the images. This key is ideally set as an environment variable using for example\n", | ||
"```\n", | ||
"os.environ[\n", | ||
" \"GOOGLE_APPLICATION_CREDENTIALS\"\n", | ||
"] = \"/content/drive/MyDrive/misinformation-data/misinformation-campaign-981aa55a3b13.json\"\n", | ||
"```\n", | ||
"where you place the key on your Google Drive if running on colab, or place it in a local folder on your machine." | ||
] | ||
}, | ||
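{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Before launching the explorer, a minimal check that the key is actually set could look like this sketch (this check is not part of ammico):\n", | ||
"```\n", | ||
"import os\n", | ||
"\n", | ||
"# verify that the credentials file set above exists and is readable\n", | ||
"key_file = os.environ.get(\"GOOGLE_APPLICATION_CREDENTIALS\")\n", | ||
"print(key_file, os.path.exists(key_file) if key_file else \"not set\")\n", | ||
"```" | ||
] | ||
}, | ||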
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"analysis_explorer = ammico.AnalysisExplorer(image_dict)\n", | ||
"analysis_explorer.run_server(port=8055)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Step 3: Analyze all images\n", | ||
"After having selected the best options for each detector module from the interactive GUI, the analysis can now be run in production on all images in the data set. Depending on the size of the data set and the computing resources available, this can take some time. Please note that you need to have set your Google Cloud Vision API key for the TextDetector to run.\n", | ||
"The desired detector modules are called sequentially in any order, for example:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"for key in image_dict.keys():\n", | ||
" image_dict[key] = ammico.TextDetector(image_dict[key], analyse_text=True).analyse_image()\n", | ||
" image_dict[key] = ammico.EmotionDetector(image_dict[key]).analyse_image()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"For the computationally demanding `SummaryDetector`, it is best to initialize the model first and then analyze each image while passing the model explicitly. This can be done in a separate loop or in the same loop as for text and emotion detection." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# initialize the models\n", | ||
"summary_model, summary_vis_processors = ammico.SummaryDetector(image_dict).load_model(model_type=\"base\")\n", | ||
"# run the analysis without having to re-iniatialize the model\n", | ||
"for key in image_dict:\n", | ||
" image_dict[key] = ammico.SummaryDetector(image_dict[key], analysis_type=\"summary\", summary_model=summary_model, summary_vis_processors=summary_vis_processors).analyse_image()" | ||
] | ||
}, | ||
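{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Alternatively, all three detectors can be combined in a single loop; a sketch using the same calls as above (with the pre-loaded `summary_model` and `summary_vis_processors`) could look like this:\n", | ||
"```\n", | ||
"for key in image_dict:\n", | ||
"    image_dict[key] = ammico.TextDetector(image_dict[key], analyse_text=True).analyse_image()\n", | ||
"    image_dict[key] = ammico.EmotionDetector(image_dict[key]).analyse_image()\n", | ||
"    image_dict[key] = ammico.SummaryDetector(\n", | ||
"        image_dict[key],\n", | ||
"        analysis_type=\"summary\",\n", | ||
"        summary_model=summary_model,\n", | ||
"        summary_vis_processors=summary_vis_processors,\n", | ||
"    ).analyse_image()\n", | ||
"```" | ||
] | ||
}, | ||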
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Step 4: Convert analysis output to pandas dataframe and write csv\n", | ||
"The content of the nested dictionary can then conveniently be converted into a pandas dataframe for further analysis in Python, or be written as a csv file:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"image_df = ammico.get_dataframe(image_dict)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Inspect the dataframe:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"image_df.head(3)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Or write to a csv file:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"image_df.to_csv(\"/content/drive/MyDrive/misinformation-data/data_out.csv\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# The detector modules\n", | ||
"The different detector modules with their options are explained in more detail in this section.\n", | ||
"## Text detector\n", | ||
"Text on the images can be extracted using the `TextDetector` class (`text` module). The text is initally extracted using the Google Cloud Vision API and then translated into English with googletrans. The translated text is cleaned of whitespace, linebreaks, and numbers using Python syntax and spaCy. The user can set if the text should be further summarized, and analyzed for sentiment and named entity recognition, by setting the keyword `analyse_text` to `True` (the default is `False`). If set, the transformers pipeline is used for each of these tasks, with the default models as of 03/2023. Other models can be selected by setting the optional keyword `model_names` to a list of selected models, on for each task: `model_names=[\"sshleifer/distilbart-cnn-12-6\", \"distilbert-base-uncased-finetuned-sst-2-english\", \"dbmdz/bert-large-cased-finetuned-conll03-english\"]` for summary, sentiment, and ner. To be even more specific, revision numbers can also be selected by specifying the optional keyword `revision_numbers` to a list of revision numbers for each model, for example `revision_numbers=[\"a4f8f3e\", \"af0f99b\", \"f2482bf\"]`.\n", | ||
"\n", | ||
"Summarizing, the text detection is carried out using the following method call and keywords, where `analyse_text`, `model_names`, and `revision_numbers` are optional:\n", | ||
"```\n", | ||
"image_dict[\"image_id\"] = ammico.TextDetector(image_dict[\"image_id\"], \n", | ||
" analyse_text=True, model_names=[\"sshleifer/distilbart-cnn-12-6\", \n", | ||
" \"distilbert-base-uncased-finetuned-sst-2-english\", \n", | ||
" \"dbmdz/bert-large-cased-finetuned-conll03-english\"], \n", | ||
" revision_numbers=[\"a4f8f3e\", \"af0f99b\", \"f2482bf\"]).analyse_image()\n", | ||
"```\n", | ||
"The models can be adapted interactively in the notebook interface and the best models can then be used in a subsequent analysis of the whole data set.\n", | ||
"\n", | ||
"## Image summary and query\n", | ||
"\n", | ||
"## Detection of faces and facial expression analysis\n", | ||
"Faces and facial expressions are detected and analyzed using the `EmotionDetector` class from the `faces` module. Initially, it is detected if faces are present on the image using RetinaFace, followed by analysis if face masks are worn (Face-Mask-Detection). The detection of age, gender, race, and emotions is carried out with deepface.\n", | ||
"\n", | ||
"Depending on the features found on the image, the face detection module returns a different analysis content: If no faces are found on the image, all further steps are skipped and the result `\"face\": \"No\", \"multiple_faces\": \"No\", \"no_faces\": 0, \"wears_mask\": [\"No\"], \"age\": [None], \"gender\": [None], \"race\": [None], \"emotion\": [None], \"emotion (category)\": [None]` is returned. If one or several faces are found, up to three faces are analyzed if they are partially concealed by a face mask. If yes, only age and gender are detected; if no, also race, emotion, and dominant emotion are detected. In case of the latter, the output could look like this: `\"face\": \"Yes\", \"multiple_faces\": \"Yes\", \"no_faces\": 2, \"wears_mask\": [\"No\", \"No\"], \"age\": [27, 28], \"gender\": [\"Man\", \"Man\"], \"race\": [\"asian\", None], \"emotion\": [\"angry\", \"neutral\"], \"emotion (category)\": [\"Negative\", \"Neutral\"]`, where for the two faces that are detected (given by `no_faces`), some of the values are returned as a list with the first item for the first (largest) face and the second item for the second (smaller) face (for example, `\"emotion\"` returns a list `[\"angry\", \"neutral\"]` signifying the first face expressing anger, and the second face having a neutral expression).\n", | ||
"\n", | ||
"The emotion detection reports the seven facial expressions angry, fear, neutral, sad, disgust, happy and surprise. These emotions are assigned based on the returned confidence of the model (between 0 and 1), with a high confidence signifying a high likelihood of the detected emotion being correct. Emotion recognition is not an easy task, even for a human; therefore, we have added a keyword `emotion_threshold` signifying the % value above which an emotion is counted as being detected. The default is set to 50%, so that a confidence above 0.5 results in an emotion being assigned. If the confidence is lower, no emotion is assigned. \n", | ||
"\n", | ||
"From the seven facial expressions, an overall dominating emotion category is identified: negative, positive, or neutral emotion. These are defined with the facial expressions angry, disgust, fear and sad for the negative category, happy for the positive category, and surprise and neutral for the neutral category.\n", | ||
"\n", | ||
"A similar threshold as for the emotion recognition is set for the race detection, `race_threshold`, with the default set to 50% so that a confidence for the race above 0.5 only will return a value in the analysis. \n", | ||
"\n", | ||
"Summarizing, the face detection is carried out using the following method call and keywords, where `emotion_threshold` and \n", | ||
"`race_threshold` are optional:\n", | ||
"```\n", | ||
"image_dict[\"image_id\"] = ammico.EmotionDetector(image_dict[\"image_id\"], emotion_threshold=50, race_threshold=50).analyse_image()\n", | ||
"```\n", | ||
"The thresholds can be adapted interactively in the notebook interface and the optimal value can then be used in a subsequent analysis of the whole data set.\n", | ||
"\n", | ||
"## Object detection\n", | ||
"Certain specified objects on the image are detected with the cvlib library and YOLOv4 model. Of the 80 objects from the [yolov library](https://github.com/AlexeyAB/darknet/blob/master/data/coco.names), the detection is restricted to person, bicycle, car, motorcycle, airplane, bus, train, truck, boat, traffic light.\n", | ||
"\n", | ||
"The analysis returns a dictionary with \"yes\" or \"no\" answers for the detected items: `\"person\": \"yes\", \"bicycle\": \"no\", \"car\": \"no\", \"motorcycle\": \"no\", \"airplane\": \"no\", \"bus\": \"no\", \"train\": \"no\", \"truck\": \"no\", \"boat\": \"no\", \"traffic light\": \"no\", \"cell phone\": \"no\"`.\n", | ||
"\n", | ||
"The object detection is very straightforward with no further parameters, and is called using\n", | ||
"```\n", | ||
"image_dict[\"image_id\"] = ammico.ObjectDetector(image_dict[\"image_id\"]).analyse_image()\n", | ||
"```" | ||
] | ||
}, | ||
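{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"To quickly list which of the tracked objects were found on a given image, the \"yes\"/\"no\" entries described above could be filtered, for example (a sketch, with `\"image_id\"` as a placeholder):\n", | ||
"```\n", | ||
"detected_objects = [key for key, value in image_dict[\"image_id\"].items() if value == \"yes\"]\n", | ||
"print(detected_objects)\n", | ||
"```" | ||
] | ||
}, | ||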
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Further detector modules\n", | ||
"Further detector modules exist, such as `ColorDetector` and `MultimodalSearch`, also it is possible to carry out a topic analysis on the text data, as well as crop social media posts automatically. These are more experimental features and have their own demonstration notebooks." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "ammico", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.10.12" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |