Performance is bad #189

Closed
nickyi1990 opened this issue Sep 18, 2020 · 7 comments
@nickyi1990

I tested this method on many tabular datasets, all much larger than the dataset used in the notebook example, and the performance is really bad... has anyone else run into the same situation?

@nickyi1990 added the "bug" (Something isn't working) label Sep 18, 2020
@Optimox removed the "bug" (Something isn't working) label Sep 18, 2020
@eduardocarvp
Collaborator

Could you share the dataset and the parameters you have used for training?

@Optimox
Collaborator

Optimox commented Sep 18, 2020

Hello @nickyi1990,

I'm sad to hear that you are not able to get the performance you'd like, but how do you expect to receive help with such a complaint?

We'd be happy to have a new dataset to try TabNet on and to improve based on the (poor) results. But please be constructive and share enough information for us to reproduce the problem and help.

Also, there is a benchmark issue #127 here; if you took the time to benchmark TabNet on various datasets, please share the results with us! If you can share your code, that's even better!

Cheers

@nickyi1990
Author

nickyi1990 commented Sep 21, 2020


The code is a little messy and cannot run directly, but from the code below you can see how I split the dataset and the parameters I used for training. The validation and test performance of TabNet is not as good as my vanilla fully-connected network, as shown below~

  • vanilla dnn
    [screenshot of metrics]

  • tabnet
    [screenshot of metrics]

import os
import random

import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler


def seed_reproducer(seed=2020):
    """Reproducer for pytorch experiment.

    Parameters
    ----------
    seed: int, optional (default = 2020)
        Random seed.

    Example
    -------
    seed_reproducer(seed=2020)
    """
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        torch.backends.cudnn.enabled = True



def preprocess_salary_data():
    """ dataset : http://archive.ics.uci.edu/ml/datasets/Adult
    """
    print("process salary dataset")
    df = pd.read_csv("/home/kunzhong/.fastai/data/adult_sample/adult.csv")
    df["y"] = df["salary"].map({">=50k": 1, "<50k": 0})
    for feature in df.columns:
        if df[feature].dtype == "O":
            df[feature] = df[feature].fillna("Unknown")
        else:
            df[feature] = df[feature].fillna(-1)

    all_feature_names = df.columns
    category_feature_names = [
        "workclass",
        "education",
        "marital-status",
        "occupation",
        "relationship",
        "race",
        "sex",
        "native-country",
    ]
    target_name = "y"
    feature_names_tobe_dropped = ["y", "salary"]
    # safe_list and PROCESSED_FOLDER_PATH are helpers/constants defined elsewhere in my project
    feature_names = all_feature_names.difference(feature_names_tobe_dropped + safe_list(target_name)).tolist()
    continuous_feature_names = [f for f in feature_names if f not in category_feature_names]

    for feature in category_feature_names:
        le = LabelEncoder()
        df[feature] = le.fit_transform(df[feature])

    df_train, df_valid_test = train_test_split(df, test_size=0.3, random_state=2020)
    df_valid, df_test = train_test_split(df_valid_test, test_size=0.5, random_state=2020)
    df_train = df_train.reset_index(drop=True)
    df_valid = df_valid.reset_index(drop=True)
    df_test = df_test.reset_index(drop=True)
    df_train.to_pickle(os.path.join(PROCESSED_FOLDER_PATH, "df_train_salary.p"))
    df_valid.to_pickle(os.path.join(PROCESSED_FOLDER_PATH, "df_valid_salary.p"))
    df_test.to_pickle(os.path.join(PROCESSED_FOLDER_PATH, "df_test_salary.p"))

preprocess_salary_data()

df_train_salary = pd.read_pickle(os.path.join(PROCESSED_FOLDER_PATH, "df_train_salary.p"))
df_valid_salary = pd.read_pickle(os.path.join(PROCESSED_FOLDER_PATH, "df_valid_salary.p"))
df_test_salary = pd.read_pickle(os.path.join(PROCESSED_FOLDER_PATH, "df_test_salary.p"))


all_feature_names = df_train_salary.columns
cat_features = [
    "workclass",
    "education",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "native-country",
]
target_name = "y"
feature_names_tobe_dropped = ["y", "salary"]
feature_names = all_feature_names.difference(feature_names_tobe_dropped + safe_list(target_name)).tolist()
cat_dim_dict = {}
for cat_feature in cat_features:
    le = LabelEncoder()
    l = df_train_salary[cat_feature].append(df_valid_salary[cat_feature]).append(df_test_salary[cat_feature]).fillna(0)
    le.fit(l)
    cat_dim_dict[cat_feature] = l.nunique()
    df_train_salary[cat_feature] = le.transform(df_train_salary[cat_feature].fillna(0))
    df_valid_salary[cat_feature] = le.transform(df_valid_salary[cat_feature].fillna(0))
    df_test_salary[cat_feature] = le.transform(df_test_salary[cat_feature].fillna(0))
    print(cat_feature, df_train_salary[cat_feature].unique())
    
# Written this way mainly to preserve the order of the variables
cat_idxs = [feature_names.index(f) for f in cat_features]
cat_dims = [cat_dim_dict[f] for f in cat_features]

for feature_name in list(set(feature_names).difference(cat_features)):
    ss = StandardScaler()
    l = df_train_salary[feature_name].append(df_valid_salary[feature_name]).append(df_test_salary[feature_name]).fillna(0)
    ss.fit(l.values.reshape(-1, 1))
    df_train_salary[feature_name] = ss.transform(df_train_salary[feature_name].fillna(0).values.reshape(-1, 1)).squeeze()
    df_valid_salary[feature_name] = ss.transform(df_valid_salary[feature_name].fillna(0).values.reshape(-1, 1)).squeeze()
    df_test_salary[feature_name] = ss.transform(df_test_salary[feature_name].fillna(0).values.reshape(-1, 1)).squeeze()
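
As a side note on the preprocessing above: pytorch-tabnet expects `cat_idxs`/`cat_dims` to be aligned with the column order actually fed to the model, which is easy to get wrong. A stdlib-only sketch of that alignment check (feature names taken from the Adult columns; the cardinalities here are illustrative, not computed from the data):

```python
# cat_idxs: positions of the categorical columns within the ordered feature list.
# cat_dims: cardinalities of those columns, in the same order as cat_idxs.
feature_names = ["age", "education", "hours-per-week", "sex"]  # order fed to the model
cat_features = ["education", "sex"]

# Illustrative cardinalities (in the code above: nunique() over train+valid+test)
cat_dim_dict = {"education": 16, "sex": 2}

cat_idxs = [feature_names.index(f) for f in cat_features]
cat_dims = [cat_dim_dict[f] for f in cat_features]

print(cat_idxs)  # → [1, 3]
print(cat_dims)  # → [16, 2]
```

If the column order of the matrix passed to `.fit()` differs from `feature_names`, the embeddings get attached to the wrong columns, which can silently hurt performance.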

@Optimox
Collaborator

Optimox commented Sep 21, 2020

  • Thanks, so this is the census-income dataset, right?

We have an example notebook showing how TabNet works: https://github.com/dreamquark-ai/tabnet/blob/develop/census_example.ipynb.
The results look more than OK to me; they are on par with an XGBoost model, which I would guess performs better than a "vanilla dnn", whatever that means.

  • The code is a little messy and cannot run directly

Sorry to hear that, but maybe that is one reason you can't get competitive results (as shown for census-income).

  • You also talked about many tabular datasets; maybe you could share some clean code for those?

  • I'll close the issue for now since your feedback can't be reproduced; feel free to reopen as soon as you have reproducible results we can discuss.

@Optimox Optimox closed this as completed Sep 21, 2020
@nickyi1990
Copy link
Author

nickyi1990 commented Sep 22, 2020


I created a Colab notebook; you can check the performance of LightGBM and TabNet in the last cell, and there's a big difference.

@Optimox
Collaborator

Optimox commented Sep 22, 2020

maybe try a larger patience
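
For context, `patience` here refers to pytorch-tabnet's early-stopping parameter: training halts after that many epochs without improvement on the eval metric, so a small patience can stop on a temporary plateau before a later, better minimum. A minimal stdlib sketch of the mechanism (the loss values are made up for illustration):

```python
def early_stop_epoch(val_losses, patience):
    """Return the index of the last epoch run under patience-based early stopping.

    Training stops once `patience` consecutive epochs pass without
    improving on the best validation loss seen so far; if stopping
    never triggers, the final epoch index is returned.
    """
    best = float("inf")
    since_best = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            since_best = 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return len(val_losses) - 1

# Hypothetical curve: improves, plateaus, then improves again.
losses = [0.50, 0.45, 0.44, 0.46, 0.47, 0.43, 0.42]
print(early_stop_epoch(losses, patience=2))  # → 4 (stops on the plateau)
print(early_stop_epoch(losses, patience=5))  # → 6 (survives it and reaches the later minimum)
```

With a noisy validation curve, the larger patience is what lets training reach the second dip.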

@nickyi1990
Author

maybe try a larger patience

I really appreciate the work you did and the idea behind the paper. I tried many times and could not reach a similar performance. Could you try to tune TabNet so that it gets performance similar to LightGBM? And could you reopen the issue so that more people can join in and help find where the problem is? Thanks!
