[BUG] Categorify can't process vocabs correctly when num_buckets>1 #1857

Labels: bug (Something isn't working)
Describe the bug

nvt.ops.Categorify doesn't process vocabs correctly when num_buckets>1 is given at the same time.

Steps/Code to reproduce bug
I tried to use the categorify transform with pre-defined vocabs. I also have to handle multiple OOV buckets, so I pass num_buckets>1 as a parameter as well. For the above code, the expected index for each value is shown below. But I get the following result, with a wrong category dictionary:
df_out
pd.read_parquet("./categories/meta.Authors.parquet")
pd.read_parquet("./categories/unique.Authors.parquet")
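To make the expectation concrete, here is a minimal pure-Python sketch of the index layout being described: low indices reserved for nulls and the OOV hash buckets, with the user-supplied vocab starting after them. The helper names and the exact slot order are illustrative assumptions, not NVTabular's internals, and the vocab values are made up.

```python
# Hypothetical sketch of the layout Categorify is expected to produce when a
# pre-defined vocab is combined with num_buckets OOV hash buckets.
# Assumed layout (illustrative only):
#   index 0                      -> nulls
#   indices 1 .. num_buckets     -> OOV hash buckets
#   indices num_buckets+1 .. N   -> the user-supplied vocab, in order

def build_expected_mapping(vocab, num_buckets):
    """Map each vocab entry to its final encoded index."""
    offset = 1 + num_buckets  # null slot + OOV buckets come first
    return {value: offset + i for i, value in enumerate(vocab)}

def encode(value, mapping, num_buckets):
    """Encode one value: null -> 0, in-vocab -> its index, OOV -> a hash bucket."""
    if value is None:
        return 0
    if value in mapping:
        return mapping[value]
    return 1 + (hash(value) % num_buckets)  # OOV lands in one of the buckets

vocab = ["User_A", "User_B", "User_C"]  # stand-in for the Authors vocab
mapping = build_expected_mapping(vocab, num_buckets=2)
# mapping == {"User_A": 3, "User_B": 4, "User_C": 5};
# unseen values encode to bucket 1 or 2, nulls to 0
```

With num_buckets=2, every in-vocab value should therefore sit at least two slots above the OOV buckets; the buggy output reported above does not respect that shift.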
I checked inside the Categorify.process_vocabs function, and oov_count picks up num_buckets correctly. But when process_vocabs calls Categorify._save_encodings(), it doesn't build the vocabulary dictionary correctly.

Expected behavior
From
NVTabular/nvtabular/ops/categorify.py
Lines 432 to 438 in 77b94a4
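The suspected defect can be illustrated with a toy version of the encoding-table write. The function names below are hypothetical stand-ins, not NVTabular code: if the saved table does not shift vocab entries by oov_count, every category collides with the reserved OOV buckets and is off by that amount.

```python
# Toy illustration of the suspected off-by-oov_count bug when saving encodings.

def save_encodings_buggy(vocab):
    # Ignores the reserved OOV buckets, so vocab indices overlap them.
    return {v: 1 + i for i, v in enumerate(vocab)}

def save_encodings_fixed(vocab, oov_count):
    # Start the vocab after the null slot and all oov_count buckets.
    return {v: 1 + oov_count + i for i, v in enumerate(vocab)}

vocab = ["User_A", "User_B", "User_C"]
save_encodings_buggy(vocab)               # {"User_A": 1, ...} -- overlaps OOV buckets
save_encodings_fixed(vocab, oov_count=2)  # {"User_A": 3, ...} -- as expected
```

This matches the fix described next: passing oov_count through to the save step so the dictionary is written with the correct offset.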
I fixed the code so that process_vocabs calls Categorify._save_encodings with oov_count, and I got the result of df_out I expected.

Environment details (please complete the following information):
pip
Additional context
None