Merge pull request #97 from minimaxir/v1.5.0

V1.5.0
minimaxir · Jan 9, 2019 · f65413b
2 parents b72afe1 + 61e3514
commit f65413b

Showing 5 changed files with 372 additions and 16 deletions.
diff --git a/LICENSE b/LICENSE
@@ -1,6 +1,6 @@
 MIT License
 
-Copyright (c) 2017-2018 Max Woolf
+Copyright (c) 2017-2019 Max Woolf
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

diff --git a/docs/textgenrnn-synthesize.ipynb b/docs/textgenrnn-synthesize.ipynb
@@ -0,0 +1,294 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# textgenrnn 1.5 Model Synthesis\n",
+    "\n",
+    "by [Max Woolf](http://minimaxir.com)\n",
+    "\n",
+    "*Max's open-source projects are supported by his [Patreon](https://www.patreon.com/minimaxir). If you found this project helpful, any monetary contributions to the Patreon are appreciated and will be put to good creative use.*"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Intro\n",
+    "\n",
+    "You can predict texts from multiple models simultaneously using the `synthesize` function, allowing the creation of texts which incorporate multiple styles without \"locking\" into a given style.\n",
+    "\n",
+    "You will get better results if the input models are trained with high `dropout` (0.8-0.9)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Using TensorFlow backend.\n"
+     ]
+    }
+   ],
+   "source": [
+    "from textgenrnn import textgenrnn\n",
+    "from textgenrnn.utils import synthesize, synthesize_to_file\n",
+    "\n",
+    "m1 = \"gaming\"\n",
+    "m2 = \"Programmerhumor\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def create_textgen(model_name):\n",
+    "    return textgenrnn(weights_path='{}_weights.hdf5'.format(model_name),\n",
+    "                     vocab_path='{}_vocab.json'.format(model_name),\n",
+    "                     config_path='{}_config.json'.format(model_name),\n",
+    "                     name=model_name)\n",
+    "\n",
+    "model1 = create_textgen(m1)\n",
+    "model2 = create_textgen(m2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You can pass a `list` of models to `synthesize`, which generates text from all of them. The remaining input parameters are the same as for `generate`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "I wonder why the first thing I do not use the secret of the game of all the games of all the games of all the games of all the games of all the games of all the games of all the games of all the games of the same console interview with a game that they said...\n",
+      "\n",
+      "When you play a game that you don't think we are doing the requirements\n",
+      "\n",
+      "The story of my childhood\n",
+      "\n",
+      "Playing a game for the first time and it still works\n",
+      "\n",
+      "When you finally finish your friends.\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "models_list = [model1, model2]\n",
+    "\n",
+    "synthesize(models_list, n=5, progress=False)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The order in which the models generate is randomized for each generated text. It may be worthwhile to double or triple up on models so that the text can generate from the same \"model\" for multiple tokens.\n",
+    "\n",
+    "For example, `models_list*3` triples the number of input models, allowing generation orders such as `[model1, model1, model2, model1, model2, model2]`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "When you have to see this on my favorite class\n",
+      "\n",
+      "This is what happens when you can use the day off the game of the new code for my company for this post\n",
+      "\n",
+      "The struggle of the real world but you like the same gun while we're talking about the Star Trate Internship and Mortal Stack of Debugger - PlayStation 4 has some commit history of the real time so many skyrim the struggle\n",
+      "\n",
+      "When you get a new project to a console game for a sequel to the post, because you don't know what you see well because it's the first time are the second of the \"Source code\" in the world\n",
+      "\n",
+      "How to properly delete the final statement of code and still has been announced to program\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "synthesize(models_list*3, n=5, progress=False)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You can also write the generated texts to a file with `synthesize_to_file`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "100%|██████████| 10/10 [00:10<00:00,  1.04it/s]\n"
+     ]
+    }
+   ],
+   "source": [
+    "synthesize_to_file(models_list*3, \"synthesized.txt\", n=10)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You can also use more than 2 models. One approach is to create a weighted average: for example, a blend that is 1/2 `model1`, 1/4 `model2`, and 1/4 `model3`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "m3 = \"PrequelMemes\"\n",
+    "model3 = create_textgen(m3)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "So I was playing Skyrim and you could find a startup game I made a complete controller at work today. Here's a feature\n",
+      "\n",
+      "I have a complete face of a game on a short classic today, I have a second one\n",
+      "\n",
+      "When you finally go to the same time to get the game of all time.\n",
+      "\n",
+      "The only man who game developers start a series space because I'm looking through the comments and I still have a lot of powerful up when the game was wrong...\n",
+      "\n",
+      "When you continue but you can play in the world when I have a good day at the parents for me and my company wants to sell your ass about the rest of the console assignment and got into a meme with a bug in the game\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "models_list2 = [model1, model1, model2, model3]\n",
+    "\n",
+    "synthesize(models_list2, n=5, progress=False)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For character-level models, the models \"switch\" after any of the `stop_tokens`, which by default are a space character or a newline. You can override this behavior by passing `stop_tokens=[]` to either synthesize function, which will cause the model to switch after each character (note: this may lead to *creative* results!).\n",
+    "\n",
+    "Word-level models will always switch after each generated token."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "The first time an all\n",
+      "\n",
+      "Complete Discord Secrets After Made a Couple Front Even So I Love A Search to Buy the Story Scorping character set on the star of the End of the Game of The Year To Post Out of Shenm\n",
+      "\n",
+      "I hope the bottom release was a good church and a promil of the party and save your part of my childhootg.\n",
+      "\n",
+      "When you see a sequerter in the comments\n",
+      "\n",
+      "It all makes me see a prequel problem with a monster from the past!\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "synthesize(models_list2, n=5, progress=False, stop_tokens=[])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# LICENSE\n",
+    "\n",
+    "MIT License\n",
+    "\n",
+    "Copyright (c) 2019 Max Woolf\n",
+    "\n",
+    "Permission is hereby granted, free of charge, to any person obtaining a copy\n",
+    "of this software and associated documentation files (the \"Software\"), to deal\n",
+    "in the Software without restriction, including without limitation the rights\n",
+    "to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n",
+    "copies of the Software, and to permit persons to whom the Software is\n",
+    "furnished to do so, subject to the following conditions:\n",
+    "\n",
+    "The above copyright notice and this permission notice shall be included in all\n",
+    "copies or substantial portions of the Software.\n",
+    "\n",
+    "THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n",
+    "IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n",
+    "FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n",
+    "AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n",
+    "LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n",
+    "OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n",
+    "SOFTWARE."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
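
Note on the notebook above: as a rough illustration of the switching behaviour it describes (a randomized model order per text, switching on `stop_tokens`, and switching after every character when `stop_tokens=[]`), here is a toy sketch. It is not textgenrnn's implementation. `toy_synthesize`, `upper_model`, and `lower_model` are hypothetical names, the "models" are plain callables rather than `textgenrnn` objects, and cycling through the shuffled order is an assumption made for the example.

```python
import random
from itertools import cycle


def toy_synthesize(models, max_len=12, stop_tokens=(" ", "\n"), seed=0):
    """Build one text by alternating between toy character "models".

    Each model is a callable that takes the text so far and returns the
    next character. Illustrative only; not textgenrnn's code.
    """
    rng = random.Random(seed)
    order = list(models)
    rng.shuffle(order)            # the model order is randomized per text
    turns = cycle(order)          # assumption: walk the shuffled order in a loop
    current = next(turns)
    text = ""
    used = []                     # name of the model that produced each character
    while len(text) < max_len:
        char = current(text)
        text += char
        used.append(current.__name__)
        # An empty stop_tokens list means "switch after every character".
        if not stop_tokens or char in stop_tokens:
            current = next(turns)
    return text, used


def upper_model(text):
    # Emits "A", with a space as every fourth character.
    return " " if len(text) % 4 == 3 else "A"


def lower_model(text):
    # Emits "b", with a space as every fourth character.
    return " " if len(text) % 4 == 3 else "b"


if __name__ == "__main__":
    print(toy_synthesize([upper_model, lower_model], max_len=8))
    print(toy_synthesize([upper_model, lower_model], max_len=8, stop_tokens=[]))
```

With the default stop tokens each toy model contributes a whole "word" before handing over; with `stop_tokens=[]` the two models alternate character by character, which mirrors why the last notebook cell produces more scrambled output.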
diff --git a/setup.py b/setup.py
@@ -26,7 +26,7 @@
 setup(
     name='textgenrnn',
     packages=['textgenrnn'],  # this must be the same as the name above
-    version='1.4.1',
+    version='1.5.0',
     description='Easily train your own text-generating neural network ' \
     'of any size and complexity',
     long_description=long_description,
@@ -39,5 +39,5 @@
     license='MIT',
     python_requires='>=3',
     include_package_data=True,
-    install_requires=['keras>=2.1.5', 'h5py', 'scikit-learn']
+    install_requires=['keras>=2.1.5', 'h5py', 'scikit-learn', 'tqdm']
 )
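
Note on the new `tqdm` dependency: the `generate` change in `textgenrnn/textgenrnn.py` picks a progress-bar iterable with `trange(n) if progress and n > 1 else range(n)`. A minimal standalone sketch of that selection follows. `pick_iterable` is a hypothetical helper name, and the `ImportError` fallback is only there so the sketch runs without `tqdm` installed; it is not part of this commit.

```python
try:
    from tqdm import trange
except ImportError:
    # Fallback for this sketch only: behave like a plain range without tqdm.
    trange = range


def pick_iterable(n, progress=True):
    """Return a progress-bar iterable only when it is useful.

    A bar is shown when progress is requested and more than one text
    is generated; otherwise a plain range is used.
    """
    return trange(n) if progress and n > 1 else range(n)


if __name__ == "__main__":
    for i in pick_iterable(3):
        pass                              # a bar is drawn if tqdm is installed
    for i in pick_iterable(3, progress=False):
        pass                              # no bar
    for i in pick_iterable(1):
        pass                              # no bar for a single text
```

This is also why `generate_samples` passes `progress=False` in the diff: it prints a header per temperature, and a bar per batch would clutter that output.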
diff --git a/textgenrnn/textgenrnn.py b/textgenrnn/textgenrnn.py
@@ -68,14 +68,15 @@ def __init__(self, weights_path=None,
         self.indices_char = dict((self.vocab[c], c) for c in self.vocab)
 
     def generate(self, n=1, return_as_list=False, prefix=None,
-                 temperature=0.5, max_gen_length=300, interactive=False,
-                 top_n=3):
+                 temperature=[1.0, 0.5, 0.2, 0.2],
+                 max_gen_length=300, interactive=False,
+                 top_n=3, progress=True):
         gen_texts = []
-        for _ in range(n):
-            gen_text = textgenrnn_generate(self.model,
+        iterable = trange(n) if progress and n > 1 else range(n)
+        for _ in iterable:
+            gen_text, _ = textgenrnn_generate(self.model,
                                            self.vocab,
                                            self.indices_char,
-                                           prefix,
                                            temperature,
                                            self.config['max_length'],
                                            self.META_TOKEN,
@@ -84,7 +85,8 @@ def generate(self, n=1, return_as_list=False, prefix=None,
                                                'single_text', False),
                                            max_gen_length,
                                            interactive,
-                                           top_n)
+                                           top_n,
+                                           prefix)
             if not return_as_list:
                 print("{}\n".format(gen_text))
             gen_texts.append(gen_text)
@@ -95,7 +97,7 @@ def generate_samples(self, n=3, temperatures=[0.2, 0.5, 1.0], **kwargs):
         for temperature in temperatures:
             print('#'*20 + '\nTemperature: {}\n'.format(temperature) +
                   '#'*20)
-            self.generate(n, temperature=temperature, **kwargs)
+            self.generate(n, temperature=temperature, progress=False, **kwargs)
 
     def train_on_texts(self, texts, context_labels=None,
                        batch_size=128,
@@ -117,6 +119,7 @@ def train_on_texts(self, texts, context_labels=None,
                                  context_labels=context_labels,
                                  num_epochs=num_epochs,
                                  gen_epochs=gen_epochs,
+                                 train_size=train_size,
                                  batch_size=batch_size,
                                  dropout=dropout,
                                  validation=validation,
@@ -228,6 +231,7 @@ def lr_linear_decay(epoch):
 
     def train_new_model(self, texts, context_labels=None, num_epochs=50,
                         gen_epochs=1, batch_size=128, dropout=0.0,
+                        train_size=1.0,
                         validation=True, save_epochs=0,
                         multi_gpu=False, **kwargs):
         self.config = self.default_config.copy()
@@ -285,6 +289,7 @@ def train_new_model(self, texts, context_labels=None, num_epochs=50,
                             context_labels=context_labels,
                             num_epochs=num_epochs,
                             gen_epochs=gen_epochs,
+                            train_size=train_size,
                             batch_size=batch_size,
                             dropout=dropout,
                             validation=validation,
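
Note on the new `temperature` default: `generate` now defaults to the list `[1.0, 0.5, 0.2, 0.2]`, while `generate_samples` still passes a single number. The diff does not show how `textgenrnn_generate` consumes the list, so the sketch below rests on an assumption: that a scalar is treated as a one-element list and that the list is cycled once per generated token. `temperature_at` is a hypothetical helper, not a textgenrnn function.

```python
def temperature_at(step, temperature):
    """Return the temperature for a given generation step.

    Assumption for illustration: scalars are wrapped in a list, and the
    list is cycled token by token.
    """
    if not isinstance(temperature, (list, tuple)):
        temperature = [temperature]
    return temperature[step % len(temperature)]


if __name__ == "__main__":
    schedule = [1.0, 0.5, 0.2, 0.2]
    print([temperature_at(i, schedule) for i in range(6)])
    print([temperature_at(i, 0.5) for i in range(3)])
```

Under that assumption, a schedule like this mixes an occasional high-temperature (more random) token with several low-temperature (more conservative) ones.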