Fix moderation batching #191

mruwnik · 2023-10-15T23:18:00Z

No description provided.

mruwnik · 2023-10-15T23:18:51Z

align_data/embeddings/embedding_utils.py

+    :param Callable[[str], int] tokens_counter: the function used to count tokens
+    """
+    # A very ugly loop that will split the `texts` into smaller batches so that the
+    # total sum of tokens in each batch will not exceed `max_batch_size`


As the comment says, this is very ugly. I couldn't think of anything better that wouldn't be a lot more complicated.

no idea what part of this loop is ugly, but we have https://github.com/StampyAI/stampy/blob/853e28b2e002f50d5861583cf09254093fd4e397/utilities/utilities.py#L307 in stampy bot if you want some inspiration (but this one looks better TBH) :D

yup, both are terrible :P I wanted something that would do it in say around 2-3 lines. This offends my sensibilities. But it's better than any alternatives :/
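
For readers without the full diff, the kind of loop being discussed can be sketched as a small generator. This is only an illustration under stated assumptions, not the code merged in this PR: `batch_by_tokens` is a hypothetical name, the parameters mirror `max_batch_size` and `tokens_counter` from the new signature, and letting an oversized text occupy a batch of its own is a choice made for this sketch.

```python
from typing import Callable, Iterator, List


def batch_by_tokens(
    texts: List[str],
    max_batch_size: int = 4096,
    tokens_counter: Callable[[str], int] = len,
) -> Iterator[List[str]]:
    """Yield consecutive batches whose summed token counts fit in max_batch_size.

    Hypothetical helper for illustration; not the function from the PR.
    """
    batch: List[str] = []
    batch_tokens = 0
    for text in texts:
        num_tokens = tokens_counter(text)
        # Close the current batch if adding this text would exceed the budget.
        if batch and batch_tokens + num_tokens > max_batch_size:
            yield batch
            batch, batch_tokens = [], 0
        # Assumption: a single text larger than the budget still gets a batch
        # of its own; real code might truncate or reject it instead.
        batch.append(text)
        batch_tokens += num_tokens
    # Emit whatever is left over after the last text.
    if batch:
        yield batch


# With the default counter (len), every character counts as one "token".
batches = list(batch_by_tokens(["aaaa", "bb", "cccccc", "d"], max_batch_size=6))
# batches == [["aaaa", "bb"], ["cccccc"], ["d"]]
```

It is still longer than the 2-3 lines hoped for above, but keeping the running total next to the batch keeps the loop to a single pass over `texts`.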

mruwnik · 2023-10-15T23:19:58Z

align_data/embeddings/embedding_utils.py

@@ -93,11 +93,31 @@ def _single_batch_moderation_check(batch: List[str]) -> List[ModerationInfoType]
    return openai.Moderation.create(input=batch)["results"]


-def moderation_check(texts: List[str], max_texts_num: int = 32) -> List[ModerationInfoType]:
-    """Batch moderation checks on list of texts."""
+def moderation_check(texts: List[str], max_batch_size: int = 4096, tokens_counter: Callable[[str], int] = len) -> List[ModerationInfoType]:


This uses len (the character count) to calculate token usage, which will overestimate it by something like 2-3 times. len is also used for this elsewhere here, from what I can see, so I left it for now.
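
Because `tokens_counter` is just a `Callable[[str], int]`, a less pessimistic counter could be swapped in later without touching the batching logic. The sketch below compares `len` with a rough character-based estimate. `approx_token_count` and its `chars_per_token` default are assumptions made for illustration (a rule of thumb, not a measured value); a real tokenizer would be more accurate.

```python
import math
from typing import Callable, List


def approx_token_count(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate: character count divided by an assumed ratio.

    Hypothetical helper; chars_per_token is a rule-of-thumb assumption.
    Rounds up so the estimate errs on the side of overcounting.
    """
    return math.ceil(len(text) / chars_per_token)


def total_tokens(texts: List[str], tokens_counter: Callable[[str], int]) -> int:
    """Sum the counter over all texts, as a batch-size check would."""
    return sum(tokens_counter(text) for text in texts)


texts = ["hello world!", "moderate this text"]
char_total = total_tokens(texts, len)                   # 12 + 18 = 30
approx_total = total_tokens(texts, approx_token_count)  # 3 + 5 = 8
```

Overestimating only makes batches smaller than they need to be, so keeping `len` as the default is safe, just less efficient.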

Aprillion · 2023-10-18T09:10:18Z

align_data/embeddings/embedding_utils.py

+    :param Callable[[str], int] tokens_counter: the function used to count tokens
+    """
+    # A very ugly loop that will split the `texts` into smaller batches so that the
+    # total sum of tokens in each batch will not exceed `max_batch_size`


no idea what part of this loop is ugly, but we have https://github.com/StampyAI/stampy/blob/853e28b2e002f50d5861583cf09254093fd4e397/utilities/utilities.py#L307 in stampy bot if you want some inspiration (but this one looks better TBH) :D

Aprillion · 2023-10-18T09:14:16Z

align_data/embeddings/embedding_utils.py

@@ -1,5 +1,5 @@
 import logging
-from typing import List, Tuple, Dict, Any, Optional
+from typing import List, Tuple, Dict, Any, Optional, Callable


I'm starting to reconsider the wisdom against wildcard imports: for the specific case of from typing import *, it might not be as evil as in the general case 🤔

Aprillion · 2023-10-18T09:16:48Z

align_data/embeddings/embedding_utils.py

-def moderation_check(texts: List[str], max_texts_num: int = 32) -> List[ModerationInfoType]:
-    """Batch moderation checks on list of texts."""
+def moderation_check(texts: List[str], max_batch_size: int = 4096, tokens_counter: Callable[[str], int] = len) -> List[ModerationInfoType]:
+    """Batch moderation checks on list of texts.


I would prefer to document the part that explains "what is a moderation check", not the part that explains that code with a batch variable is doing some batching 😅

Fix moderation batching

def9b4e

mruwnik requested review from chriscanal, ccstan99, henri123lemoine and Thomas-Lemoine October 15, 2023 23:18

mruwnik commented Oct 15, 2023

mruwnik requested a review from Aprillion October 17, 2023 18:05

Aprillion approved these changes Oct 18, 2023

mruwnik merged commit 714b252 into main Oct 18, 2023
2 of 3 checks passed

mruwnik deleted the fix-moderation-issue branch October 18, 2023 14:55
