
Minor docs revamp for paraphrase module
MagusWyvern committed Apr 26, 2024
1 parent 3de8894 commit 6e0930c
Showing 1 changed file with 62 additions and 57 deletions.
119 changes: 62 additions & 57 deletions docs/load-paraphrase.ipynb
@@ -11,20 +11,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"\n",
"This tutorial is available as an IPython notebook at [Malaya/example/paraphrase](https://github.com/huseinzol05/Malaya/tree/master/example/paraphrase).\n",
" \n",
"</div>"
"Paraphrasing refers to the transformation of textual content into an equivalent form using different wording, while preserving the original intent and meaning. In this notebook, you'll see how it can help us summarize long text input and pick out the important facts."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-warning\">\n",
"<div class=\"alert alert-info\">\n",
"\n",
"This module only trained on standard language structure, so it is not save to use it for local language structure.\n",
"This tutorial is available as an IPython notebook at [Malaya/example/paraphrase](https://github.com/huseinzol05/Malaya/tree/master/example/paraphrase).\n",
" \n",
"</div>"
]
@@ -34,8 +30,10 @@
"metadata": {},
"source": [
"<div class=\"alert alert-warning\">\n",
" \n",
"This module was only trained on standard language structure, so it is not safe to use it for local language structure that contains slang.\n",
"\n",
"Results generated using stochastic methods.\n",
"The results you see here are generated using stochastic methods. Learn more about stochastic processes on <a href=\"https://en.wikipedia.org/wiki/Stochastic_process\" target=\"_blank\">Wikipedia</a>.\n",
" \n",
"</div>"
]
@@ -69,44 +67,21 @@
"scrolled": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.\n",
" warn(\"The installed version of bitsandbytes was compiled without GPU support. \"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmpe_sjkt19\n",
"INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmpe_sjkt19/_remote_module_non_scriptable.py\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 2.8 s, sys: 3.91 s, total: 6.71 s\n",
"Wall time: 1.94 s\n"
"CPU times: user 3.27 s, sys: 343 ms, total: 3.61 s\n",
"Wall time: 3.62 s\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397\n",
"/home/maguswyvern/PythonVenvs/dev-malaya/lib/python3.10/site-packages/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397\n",
" self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n",
"/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927\n",
"/home/maguswyvern/PythonVenvs/dev-malaya/lib/python3.10/site-packages/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927\n",
" self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n"
]
}
@@ -122,7 +97,23 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### List available HuggingFace model"
"### List all available HuggingFace transformers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `malaya` library has a built-in function to list all available transformers for this task. As of writing, there are three:\n",
"\n",
"1. mesolitica/finetune-paraphrase-t5-base-standard-bahasa-cased <br>\n",
"https://huggingface.co/mesolitica/finetune-paraphrase-t5-base-standard-bahasa-cased\n",
" \n",
"2. mesolitica/finetune-paraphrase-t5-small-standard-bahasa-cased <br>\n",
"https://huggingface.co/mesolitica/finetune-paraphrase-t5-small-standard-bahasa-cased\n",
"\n",
"3. mesolitica/finetune-paraphrase-t5-tiny-standard-bahasa-cased <br>\n",
"https://huggingface.co/mesolitica/finetune-paraphrase-t5-tiny-standard-bahasa-cased"
]
},
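The three checkpoint names above differ only in their size tag. As a minimal sketch (plain Python, no Malaya call; the list is copied verbatim from the cell above), they can be kept in a list and selected programmatically:

```python
# Checkpoint names copied from the list above.
MODELS = [
    "mesolitica/finetune-paraphrase-t5-base-standard-bahasa-cased",
    "mesolitica/finetune-paraphrase-t5-small-standard-bahasa-cased",
    "mesolitica/finetune-paraphrase-t5-tiny-standard-bahasa-cased",
]

def pick_model(size: str) -> str:
    """Return the checkpoint whose name contains the given size tag."""
    for name in MODELS:
        if f"-{size}-" in name:
            return name
    raise ValueError(f"no checkpoint with size {size!r}")

print(pick_model("tiny"))
# → mesolitica/finetune-paraphrase-t5-tiny-standard-bahasa-cased
```

The `pick_model` helper is hypothetical, for illustration only; the notebook itself passes a checkpoint name directly when loading the model.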
{
@@ -174,6 +165,13 @@
"print(malaya.paraphrase.info)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -218,8 +216,7 @@
"name": "stderr",
"output_type": "stream",
"text": [
"Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`, it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again. You will see the new `added_tokens_decoder` attribute that will store the relevant information.\n",
"You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565\n"
"You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565\n"
]
}
],
@@ -231,8 +228,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Paraphrase\n",
"\n",
"Here is the signature of the `generate` method and the parameters it expects."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```python\n",
"def generate(\n",
" self,\n",
@@ -258,6 +260,20 @@
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's give a string to pass into the `generate` method."
]
},
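The later cells pass a list named `splitted` into `model.generate`. As a hedged, self-contained sketch of how such a list could be produced from one long string — a naive rule-based splitter, not Malaya's own tokenizer:

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naively split a paragraph into sentences at ., ! or ?
    followed by whitespace. A simplified stand-in: the notebook's
    `splitted` list could come from any sentence tokenizer."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

splitted = split_sentences(
    "Ayat pertama. Ayat kedua! Ayat ketiga?"
)
print(splitted)
# → ['Ayat pertama.', 'Ayat kedua!', 'Ayat ketiga?']
```

Each element can then be paraphrased independently, which is what batching a list into `generate` amounts to.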
{
"cell_type": "code",
"execution_count": 7,
@@ -286,9 +302,8 @@
"name": "stderr",
"output_type": "stream",
"text": [
"/home/husein/.local/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:362: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.5` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.\n",
" warnings.warn(\n",
"spaces_between_special_tokens is deprecated and will be removed in transformers v5. It was adding spaces between `added_tokens`, not special tokens, and does not exist in our fast implementation. Future tokenizers will handle the decoding process on a per-model rule.\n"
"/home/maguswyvern/PythonVenvs/dev-malaya/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:492: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.5` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.\n",
" warnings.warn(\n"
]
},
{
@@ -427,24 +442,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Good thing about HuggingFace\n",
"### Benefits of HuggingFace\n",
"\n",
"In `generate` method, you can do greedy, beam, sampling, nucleus decoder and so much more, read it at https://huggingface.co/blog/how-to-generate"
"With the `generate` method you can use greedy, beam, sampling, nucleus decoding and more; read about them in the HuggingFace article [How to Generate](https://huggingface.co/blog/how-to-generate). HuggingFace has also published a newer article, [Introducing Contrastive Search](https://huggingface.co/blog/introducing-csearch)."
]
},
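To make the "nucleus" option above concrete, here is a minimal pure-Python sketch of nucleus (top-p) filtering — the idea behind the `top_p` parameter passed to `generate` later in the notebook. This is an illustration of the technique only, not HuggingFace's implementation:

```python
import random

def top_p_candidates(probs: list[float], p: float) -> list[int]:
    """Indices of the smallest set of highest-probability tokens
    whose cumulative probability reaches p (nucleus filtering)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    return kept

def sample_top_p(probs: list[float], p: float, rng=random) -> int:
    """Sample one token index from the renormalized nucleus."""
    kept = top_p_candidates(probs, p)
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

# With p=0.8, only the two most likely tokens survive filtering.
print(top_p_candidates([0.5, 0.3, 0.15, 0.05], 0.8))  # → [0, 1]
```

Sampling from this truncated set keeps generation diverse while cutting off the long low-probability tail, which is why `top_p` pairs with `do_sample=True` in the calls below.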
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 7.63 s, sys: 0 ns, total: 7.63 s\n",
"Wall time: 721 ms\n"
]
},
{
"data": {
"text/plain": [
@@ -473,8 +480,6 @@
}
],
"source": [
"%%time\n",
"\n",
"model.generate(splitted, \n",
" do_sample=True, \n",
" max_length=256, \n",
@@ -507,7 +512,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
"version": "3.10.12"
},
"varInspector": {
"cols": {
@@ -540,5 +545,5 @@
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}
