Commit

Added more info to readme

rahulnyk committed Oct 22, 2023
1 parent 5eb0075 commit b67927f

Showing 4 changed files with 108 additions and 48 deletions.
46 changes: 45 additions & 1 deletion README.md
@@ -1,4 +1,48 @@
# Convert any corpus of text into a Graph of Knowledge
# Convert any Corpus of Text into a *Graph of Knowledge*

![Knowledge Graph Banner](./assets/KG_banner.png)
*A knowledge graph generated using this code*


## What is a knowledge graph?
A knowledge graph, also known as a semantic network, represents a network of real-world entities—i.e. objects, events, situations, or concepts—and illustrates the relationship between them. This information is usually stored in a graph database and visualized as a graph structure, prompting the term knowledge “graph.”

Source: https://www.ibm.com/topics/knowledge-graph

## How to create a simple knowledge graph from a body of work?
1. Clean the text corpus (the body of work).
2. Extract concepts and entities from the body of work.
3. Extract relations between the entities.
4. Convert into a graph schema.
5. Populate nodes (concepts) and edges (relations).
6. Visualise and query.

Step 6 is purely optional, but it has a certain artistic gratification associated with it. Network graphs are beautiful objects (just look at the banner image above; isn't it beautiful?). Fortunately, there are a good number of Python libraries available for generating graph visualisations.
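
To make the steps concrete, here is a minimal sketch of steps 1 through 5. The `extract_concept_pairs` helper is hypothetical; in this project that role is played by prompting the LLM (more on that below).

```python
import pandas as pd
import networkx as nx

def extract_concept_pairs(chunk: str) -> list[tuple[str, str]]:
    # Hypothetical stand-in for the LLM call: pair consecutive words.
    # The real pipeline asks the LLM for related concept pairs instead.
    words = chunk.split()
    return list(zip(words, words[1:]))

chunks = ["pandemic preparedness fund", "global pandemic response"]  # step 1
rows = [
    {"entity_L": left, "entity_R": right, "weight": 1.0}
    for chunk in chunks
    for left, right in extract_concept_pairs(chunk)  # steps 2 and 3
]
df_graph = pd.DataFrame(rows)  # step 4: the graph schema as a dataframe

G = nx.Graph()  # step 5: populate nodes (concepts) and edges (relations)
for _, row in df_graph.iterrows():
    G.add_edge(row["entity_L"], row["entity_R"], weight=row["weight"])
```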

## Why Graph?
Once the Knowledge Graph (KG) is built, we can use it for many purposes. We can run graph algorithms and calculate the centrality of any node, to understand how important a concept (node) is to this body of work. We can detect communities to bunch concepts together and analyse the text better. We can understand the connectedness between seemingly disconnected concepts.
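
For instance, with NetworkX (used later in the notebook) centrality and community detection are one-liners. This sketch uses a built-in toy graph standing in for the KG:

```python
import networkx as nx

G = nx.karate_club_graph()  # toy graph standing in for the KG

# Degree centrality: how important is each concept (node)?
centrality = nx.degree_centrality(G)
top_nodes = sorted(centrality, key=centrality.get, reverse=True)[:5]
print("Most central nodes:", top_nodes)

# Girvan-Newman communities: bunch related concepts together.
top_level_communities = next(nx.community.girvan_newman(G))
print([sorted(c) for c in top_level_communities])
```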

Best of all, we can achieve **Graph Retrieval Augmented Generation (GRAG)** and chat with our text in a much more profound way, using the graph as a retriever. This is a new and improved version of **Retrieval Augmented Generation (RAG)**, in which a vector DB is used as the retriever to chat with our documents.
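
Here is a minimal sketch of what graph-based retrieval could look like (an illustration of the idea, not an implementation from this repo): find the nodes matching the query, walk out to their neighbours, and hand the collected concepts (or their source chunks) to the LLM as context.

```python
import networkx as nx

def graph_retrieve(G: nx.Graph, query_terms: list[str], hops: int = 1) -> set:
    # Seed the context with nodes whose names mention a query term.
    context = {
        node
        for node in G.nodes
        if any(term.lower() in str(node).lower() for term in query_terms)
    }
    # Expand the context by walking `hops` steps out along the edges.
    for _ in range(hops):
        context |= {
            neighbor for node in list(context) for neighbor in G.neighbors(node)
        }
    return context

# The retrieved concepts would then seed the LLM prompt, in place of
# the chunks a vector DB would return in plain RAG.
```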

---

## This project
Here I have created a simple knowledge graph from a PDF document. All the components used here are set up locally, so this project can be run very easily on a personal machine.
I have adopted a no-GPT approach to keep things economical. I am using the fantastic *Mistral 7B OpenOrca instruct* as the LLM, which handles this use case wonderfully. The model can be set up locally using Ollama, so generating the KG is basically free (no calls to GPT).
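
As a sketch, a concept-extraction call against a local Ollama server might look like this (the endpoint and response shape are Ollama's standard generate API; the prompt and the `mistral-openorca` model tag are assumptions here, and the repo's actual prompt differs):

```python
import requests  # assumes the requests package is installed

def extract_concepts(chunk: str) -> str:
    # Illustrative prompt; the real one asks for structured concept pairs.
    prompt = (
        "List the key concepts in the following text as pairs of "
        f"related terms:\n\n{chunk}"
    )
    response = requests.post(
        "http://localhost:11434/api/generate",  # default Ollama endpoint
        json={"model": "mistral-openorca", "prompt": prompt, "stream": False},
    )
    return response.json()["response"]

print(extract_concepts("WHO member states adopted a pandemic accord."))
```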

Here is a list of the libraries I am using in this project:


### Mistral 7B with Ollama
The amazing Mistral 7B model, used for extracting concepts out of text chunks.

### Python Pandas
Dataframes for the graph schema (a graph DB can be used at a later stage).
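
Concretely, the schema is just an edge list, one row per concept pair (column names as in the notebook; the example rows and weights are made up):

```python
import pandas as pd

# One row per pair of related concepts; `weight` encodes relation strength.
df_graph = pd.DataFrame(
    [
        {"entity_L": "Member states", "entity_R": "100 Days Mission", "weight": 2.0},
        {"entity_L": "Member states", "entity_R": "191 countries", "weight": 1.0},
    ]
)
print(df_graph)
```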

### NetworkX
A Python library that makes dealing with graphs super easy.

### Pyvis
The Pyvis Python library for visualisation. It generates amazing web visualisations using VueJS, so the final graphs can be hosted on the web, for example on GitHub Pages.

// Still to complete this README //
Binary file modified assets/KG_banner.png
102 changes: 58 additions & 44 deletions concept_graph.ipynb
@@ -16,15 +16,16 @@
},
{
"cell_type": "code",
"execution_count": 24,
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from pyvis.network import Network\n",
"import networkx as nx\n",
"import seaborn as sns\n"
"import seaborn as sns\n",
"import random\n"
]
},
{
@@ -36,7 +37,7 @@
},
{
"cell_type": "code",
"execution_count": 25,
"execution_count": 12,
"metadata": {},
"outputs": [
{
@@ -129,7 +130,7 @@
"4 33e4526e998e4865bd5b0dde036c2a20 "
]
},
"execution_count": 25,
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
@@ -159,7 +160,7 @@
},
{
"cell_type": "code",
"execution_count": 26,
"execution_count": 13,
"metadata": {},
"outputs": [
{
@@ -194,7 +195,7 @@
},
{
"cell_type": "code",
"execution_count": 27,
"execution_count": 14,
"metadata": {},
"outputs": [
{
@@ -346,7 +347,7 @@
"4 4.5 "
]
},
"execution_count": 27,
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
@@ -389,7 +390,7 @@
},
{
"cell_type": "code",
"execution_count": 36,
"execution_count": 15,
"metadata": {},
"outputs": [
{
@@ -456,7 +457,7 @@
"4 191 countries 2.0"
]
},
"execution_count": 36,
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
@@ -476,18 +477,16 @@
},
{
"cell_type": "code",
"execution_count": 37,
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"G = nx.Graph()\n",
"for index, row in nodes.iterrows():\n",
" G.add_node(row['entity_L'])\n",
" \n",
" G.add_node(row[\"entity_L\"])\n",
"\n",
"for index, row in df_graph.iterrows():\n",
" G.add_weighted_edges_from(\n",
" [(str(row[\"entity_L\"]), str(row[\"entity_R\"]), row[\"weight\"])]\n",
" )"
" G.add_edge(str(row[\"entity_L\"]), str(row[\"entity_R\"]))"
]
},
{
@@ -506,15 +505,14 @@
},
{
"cell_type": "code",
"execution_count": 38,
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"communities_generator = nx.community.girvan_newman(G)\n",
"top_level_communities = next(communities_generator)\n",
"next_level_communities = next(communities_generator)\n",
"communities = sorted(map(sorted, next_level_communities))\n",
"\n"
"communities = sorted(map(sorted, next_level_communities))\n"
]
},
{
@@ -526,7 +524,7 @@
},
{
"cell_type": "code",
"execution_count": 74,
"execution_count": 57,
"metadata": {},
"outputs": [
{
@@ -560,56 +558,56 @@
" <th>0</th>\n",
" <td>Member states</td>\n",
" <td>4.0</td>\n",
" <td>#db5769</td>\n",
" <td>#db57d3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>10 million deaths</td>\n",
" <td>3.0</td>\n",
" <td>#db5784</td>\n",
" <td>#5f57db</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>100 Days Mission</td>\n",
" <td>3.0</td>\n",
" <td>#db5784</td>\n",
" <td>#5f57db</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>150,000 Health and Wellness Centres (HWC)</td>\n",
" <td>3.0</td>\n",
" <td>#db5784</td>\n",
" <td>#5f57db</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>191 countries</td>\n",
" <td>2.0</td>\n",
" <td>#db579e</td>\n",
" <td>#57b9db</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" entity_L importance_L color\n",
"0 Member states 4.0 #db5769\n",
"1 10 million deaths 3.0 #db5784\n",
"2 100 Days Mission 3.0 #db5784\n",
"3 150,000 Health and Wellness Centres (HWC) 3.0 #db5784\n",
"4 191 countries 2.0 #db579e"
"0 Member states 4.0 #db57d3\n",
"1 10 million deaths 3.0 #5f57db\n",
"2 100 Days Mission 3.0 #5f57db\n",
"3 150,000 Health and Wellness Centres (HWC) 3.0 #5f57db\n",
"4 191 countries 2.0 #57b9db"
]
},
"execution_count": 74,
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\n",
"palette = 'hls'\n",
"## Now add these colors to communities and make another dataframe\n",
"def colors2Community(communities) -> pd.DataFrame:\n",
" ## Define a color palette\n",
" p = sns.color_palette(\"hls\", len(communities)).as_hex()\n",
" p = sns.color_palette(palette, len(communities)).as_hex()\n",
" rows = []\n",
" for community in communities:\n",
" color = p.pop()\n",
@@ -636,18 +634,22 @@
},
{
"cell_type": "code",
"execution_count": 75,
"execution_count": 58,
"metadata": {},
"outputs": [],
"source": [
"G = nx.Graph()\n",
"node_size_multiple = 6\n",
"for index, row in df_nodes_colors.iterrows():\n",
" G.add_node(row['entity_L'], size=row['importance_L']*8, title=row['entity_L'], color=row['color'])\n",
" \n",
" G.add_node(\n",
" row[\"entity_L\"],\n",
" size=row[\"importance_L\"] * 6,\n",
" title=row[\"entity_L\"],\n",
" color=row[\"color\"],\n",
" )\n",
"\n",
"for index, row in df_graph.iterrows():\n",
" G.add_weighted_edges_from(\n",
" [(str(row[\"entity_L\"]), str(row[\"entity_R\"]), row[\"weight\"])]\n",
" )"
" G.add_edge(str(row[\"entity_L\"]), str(row[\"entity_R\"]), weight=row[\"weight\"])"
]
},
{
@@ -659,36 +661,48 @@
},
{
"cell_type": "code",
"execution_count": 76,
"execution_count": 59,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"./graph/nodes.html\n"
"./docs/index.html\n"
]
}
],
"source": [
"graph_output_directory = './graph/index.html'\n",
"graph_output_directory = './docs/index.html'\n",
"\n",
"net = Network(\n",
" notebook=False,\n",
" bgcolor=\"#1a1a1a\",\n",
" cdn_resources=\"remote\",\n",
" bgcolor=\"#111212\",\n",
" height=\"900px\",\n",
" width=\"100%\",\n",
" select_menu=True,\n",
" font_color='#dbdbdb',\n",
" font_color= \"#cccccc\",\n",
" # filter_menu=True,\n",
")\n",
"\n",
"net.from_nx(G)\n",
"net.repulsion(node_distance=150, spring_length=400)\n",
"# net.barnes_hut(gravity=-18100, central_gravity=5.05, spring_length=380)\n",
"net.show_buttons(filter_='physics')\n",
"net.show_buttons(filter_=['physics'])\n",
"\n",
"net.show(graph_output_directory, notebook=False)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
8 changes: 5 additions & 3 deletions docs/index.html

