
Minor docs revamp for paraphrase module
MagusWyvern committed Apr 26, 2024
1 parent 3de8894 commit 6e0930c
Showing 1 changed file with 62 additions and 57 deletions.
119 changes: 62 additions & 57 deletions docs/load-paraphrase.ipynb
@@ -11,20 +11,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"\n",
"This tutorial is available as an IPython notebook at [Malaya/example/paraphrase](https://github.com/huseinzol05/Malaya/tree/master/example/paraphrase).\n",
" \n",
"</div>"
"Paraphrasing refers to the transformation of textual content into an equivalent form using different wording, while preserving the original intent and meaning. In this notebook, you'll see how it can help us summarize long text input and pick out the important facts."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-warning\">\n",
"<div class=\"alert alert-info\">\n",
"\n",
"This module only trained on standard language structure, so it is not save to use it for local language structure.\n",
"This tutorial is available as an IPython notebook at [Malaya/example/paraphrase](https://github.com/huseinzol05/Malaya/tree/master/example/paraphrase).\n",
" \n",
"</div>"
]
@@ -34,8 +30,10 @@
"metadata": {},
"source": [
"<div class=\"alert alert-warning\">\n",
" \n",
"This module was only trained on standard language structure, so it is not safe to use it for local language structure that contains slang.\n",
"\n",
"Results generated using stochastic methods.\n",
"The results you see here are generated using stochastic methods. Learn more about stochastic processes on <a href=\"https://en.wikipedia.org/wiki/Stochastic_process\" target=\"_blank\">Wikipedia</a>.\n",
" \n",
"</div>"
]
@@ -69,44 +67,21 @@
"scrolled": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.\n",
" warn(\"The installed version of bitsandbytes was compiled without GPU support. \"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmpe_sjkt19\n",
"INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmpe_sjkt19/_remote_module_non_scriptable.py\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 2.8 s, sys: 3.91 s, total: 6.71 s\n",
"Wall time: 1.94 s\n"
"CPU times: user 3.27 s, sys: 343 ms, total: 3.61 s\n",
"Wall time: 3.62 s\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397\n",
"/home/maguswyvern/PythonVenvs/dev-malaya/lib/python3.10/site-packages/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397\n",
" self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n",
"/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927\n",
"/home/maguswyvern/PythonVenvs/dev-malaya/lib/python3.10/site-packages/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927\n",
" self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n"
]
}
@@ -122,7 +97,23 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### List available HuggingFace model"
"### List all available HuggingFace transformers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `malaya` library has a built-in function to list all available transformers for this task. As of writing, there are three:\n",
"\n",
"1. mesolitica/finetune-paraphrase-t5-base-standard-bahasa-cased <br>\n",
"https://huggingface.co/mesolitica/finetune-paraphrase-t5-base-standard-bahasa-cased\n",
" \n",
"2. mesolitica/finetune-paraphrase-t5-small-standard-bahasa-cased <br>\n",
"https://huggingface.co/mesolitica/finetune-paraphrase-t5-small-standard-bahasa-cased\n",
"\n",
"3. mesolitica/finetune-paraphrase-t5-tiny-standard-bahasa-cased <br>\n",
"https://huggingface.co/mesolitica/finetune-paraphrase-t5-tiny-standard-bahasa-cased"
]
},
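The three checkpoint names above differ only in their size tag. As a minimal sketch (plain Python, no Malaya call; the list is copied verbatim from the cell above), they can be kept in a list and selected programmatically:

```python
# Checkpoint names copied from the list above.
MODELS = [
    "mesolitica/finetune-paraphrase-t5-base-standard-bahasa-cased",
    "mesolitica/finetune-paraphrase-t5-small-standard-bahasa-cased",
    "mesolitica/finetune-paraphrase-t5-tiny-standard-bahasa-cased",
]

def pick_model(size: str) -> str:
    """Return the checkpoint whose name contains the given size tag."""
    for name in MODELS:
        if f"-{size}-" in name:
            return name
    raise ValueError(f"no checkpoint with size {size!r}")

print(pick_model("tiny"))
# → mesolitica/finetune-paraphrase-t5-tiny-standard-bahasa-cased
```

The `pick_model` helper is hypothetical, for illustration only; the notebook itself passes a checkpoint name directly when loading the model.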
{
@@ -174,6 +165,13 @@
"print(malaya.paraphrase.info)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -218,8 +216,7 @@
"name": "stderr",
"output_type": "stream",
"text": [
"Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`, it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again. You will see the new `added_tokens_decoder` attribute that will store the relevant information.\n",
"You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565\n"
"You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565\n"
]
}
],
@@ -231,8 +228,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Paraphrase\n",
"\n",
"Here is the signature of the `generate` method and the parameters it expects."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```python\n",
"def generate(\n",
" self,\n",
@@ -258,6 +260,20 @@
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's give a string to pass into the `generate` method."
]
},
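The later cells pass a list named `splitted` into `model.generate`. As a hedged, self-contained sketch of how such a list could be produced from one long string — a naive rule-based splitter, not Malaya's own tokenizer:

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naively split a paragraph into sentences at ., ! or ?
    followed by whitespace. A simplified stand-in: the notebook's
    `splitted` list could come from any sentence tokenizer."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

splitted = split_sentences(
    "Ayat pertama. Ayat kedua! Ayat ketiga?"
)
print(splitted)
# → ['Ayat pertama.', 'Ayat kedua!', 'Ayat ketiga?']
```

Each element can then be paraphrased independently, which is what batching a list into `generate` amounts to.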
{
"cell_type": "code",
"execution_count": 7,
@@ -286,9 +302,8 @@
"name": "stderr",
"output_type": "stream",
"text": [
"/home/husein/.local/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:362: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.5` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.\n",
" warnings.warn(\n",
"spaces_between_special_tokens is deprecated and will be removed in transformers v5. It was adding spaces between `added_tokens`, not special tokens, and does not exist in our fast implementation. Future tokenizers will handle the decoding process on a per-model rule.\n"
"/home/maguswyvern/PythonVenvs/dev-malaya/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:492: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.5` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.\n",
" warnings.warn(\n"
]
},
{
@@ -427,24 +442,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Good thing about HuggingFace\n",
"### Benefits of HuggingFace\n",
"\n",
"In `generate` method, you can do greedy, beam, sampling, nucleus decoder and so much more, read it at https://huggingface.co/blog/how-to-generate"
"With the `generate` method you can use greedy, beam, sampling, nucleus decoding and more; read about them in the HuggingFace article [How to Generate](https://huggingface.co/blog/how-to-generate). HuggingFace has also published a newer article, [Introducing Contrastive Search](https://huggingface.co/blog/introducing-csearch)."
]
},
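To make the "nucleus" option above concrete, here is a minimal pure-Python sketch of nucleus (top-p) filtering — the idea behind the `top_p` parameter passed to `generate` later in the notebook. This is an illustration of the technique only, not HuggingFace's implementation:

```python
import random

def top_p_candidates(probs: list[float], p: float) -> list[int]:
    """Indices of the smallest set of highest-probability tokens
    whose cumulative probability reaches p (nucleus filtering)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    return kept

def sample_top_p(probs: list[float], p: float, rng=random) -> int:
    """Sample one token index from the renormalized nucleus."""
    kept = top_p_candidates(probs, p)
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

# With p=0.8, only the two most likely tokens survive filtering.
print(top_p_candidates([0.5, 0.3, 0.15, 0.05], 0.8))  # → [0, 1]
```

Sampling from this truncated set keeps generation diverse while cutting off the long low-probability tail, which is why `top_p` pairs with `do_sample=True` in the calls below.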
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 7.63 s, sys: 0 ns, total: 7.63 s\n",
"Wall time: 721 ms\n"
]
},
{
"data": {
"text/plain": [
@@ -473,8 +480,6 @@
}
],
"source": [
"%%time\n",
"\n",
"model.generate(splitted, \n",
" do_sample=True, \n",
" max_length=256, \n",
@@ -507,7 +512,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
"version": "3.10.12"
},
"varInspector": {
"cols": {
@@ -540,5 +545,5 @@
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}
