Performance is bad #189

Closed
nickyi1990 opened this issue Sep 18, 2020 · 7 comments
@nickyi1990

I tested this method on many tabular datasets, all much larger than the dataset used in the notebook example, and the performance is really bad... has anyone else run into the same situation?

@nickyi1990 added the "bug" (Something isn't working) label Sep 18, 2020
@Optimox removed the "bug" (Something isn't working) label Sep 18, 2020
@eduardocarvp
Collaborator

Could you share the dataset and the parameters you have used for training?

@Optimox
Collaborator

Optimox commented Sep 18, 2020

Hello @nickyi1990,

I'm sad to hear that you are not able to get the performance you'd like, but how do you expect to receive help with such a complaint?

We'd be happy to have a new dataset to try TabNet on and to improve based on the (poor) results. But please be constructive and share enough information for us to reproduce the problem and help.

Also, there is a benchmark issue #127 here; if you took the time to benchmark TabNet on various datasets, please share the results with us! If you can share your code, that's even better!

Cheers

@nickyi1990
Author

nickyi1990 commented Sep 21, 2020


The code is a little messy and cannot run directly, but from the code below you can see how I split the dataset and the parameters I used for training. The validation and test performance of TabNet is not as good as my vanilla fully-connected network, as shown below~

  • vanilla dnn
    [screenshot of metrics]

  • tabnet
    [screenshot of metrics]

import os
import random

import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler


def seed_reproducer(seed=2020):
    """Reproducer for pytorch experiment.

    Parameters
    ----------
    seed: int, optional (default = 2020)
        Random seed.

    Example
    -------
    seed_reproducer(seed=2020)
    """
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        torch.backends.cudnn.enabled = True



def preprocess_salary_data():
    """ dataset : http://archive.ics.uci.edu/ml/datasets/Adult
    """
    print("process salary dataset")
    df = pd.read_csv("/home/kunzhong/.fastai/data/adult_sample/adult.csv")
    df["y"] = df["salary"].map({">=50k": 1, "<50k": 0})
    for feature in df.columns:
        if df[feature].dtype == "O":
            df[feature] = df[feature].fillna("Unknown")
        else:
            df[feature] = df[feature].fillna(-1)

    all_feature_names = df.columns
    category_feature_names = [
        "workclass",
        "education",
        "marital-status",
        "occupation",
        "relationship",
        "race",
        "sex",
        "native-country",
    ]
    target_name = "y"
    feature_names_tobe_dropped = ["y", "salary"]
    # safe_list and PROCESSED_FOLDER_PATH are helpers/constants defined elsewhere in my project
    feature_names = all_feature_names.difference(feature_names_tobe_dropped + safe_list(target_name)).tolist()
    continuous_feature_names = [f for f in feature_names if f not in category_feature_names]

    for feature in category_feature_names:
        le = LabelEncoder()
        df[feature] = le.fit_transform(df[feature])

    df_train, df_valid_test = train_test_split(df, test_size=0.3, random_state=2020)
    df_valid, df_test = train_test_split(df_valid_test, test_size=0.5, random_state=2020)
    df_train = df_train.reset_index(drop=True)
    df_valid = df_valid.reset_index(drop=True)
    df_test = df_test.reset_index(drop=True)
    df_train.to_pickle(os.path.join(PROCESSED_FOLDER_PATH, "df_train_salary.p"))
    df_valid.to_pickle(os.path.join(PROCESSED_FOLDER_PATH, "df_valid_salary.p"))
    df_test.to_pickle(os.path.join(PROCESSED_FOLDER_PATH, "df_test_salary.p"))

preprocess_salary_data()

df_train_salary = pd.read_pickle(os.path.join(PROCESSED_FOLDER_PATH, "df_train_salary.p"))
df_valid_salary = pd.read_pickle(os.path.join(PROCESSED_FOLDER_PATH, "df_valid_salary.p"))
df_test_salary = pd.read_pickle(os.path.join(PROCESSED_FOLDER_PATH, "df_test_salary.p"))


all_feature_names = df_train_salary.columns
cat_features = [
    "workclass",
    "education",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "native-country",
]
target_name = "y"
feature_names_tobe_dropped = ["y", "salary"]
feature_names = all_feature_names.difference(feature_names_tobe_dropped + safe_list(target_name)).tolist()
cat_dim_dict = {}
for cat_feature in cat_features:
    le = LabelEncoder()
    l = df_train_salary[cat_feature].append(df_valid_salary[cat_feature]).append(df_test_salary[cat_feature]).fillna(0)
    le.fit(l)
    cat_dim_dict[cat_feature] = l.nunique()
    df_train_salary[cat_feature] = le.transform(df_train_salary[cat_feature].fillna(0))
    df_valid_salary[cat_feature] = le.transform(df_valid_salary[cat_feature].fillna(0))
    df_test_salary[cat_feature] = le.transform(df_test_salary[cat_feature].fillna(0))
    print(cat_feature, df_train_salary[cat_feature].unique())
    
# Written this way mainly to preserve the order of the variables
cat_idxs = [feature_names.index(f) for f in cat_features]
cat_dims = [cat_dim_dict[f] for f in cat_features]

for feature_name in list(set(feature_names).difference(cat_features)):
    ss = StandardScaler()
    l = df_train_salary[feature_name].append(df_valid_salary[feature_name]).append(df_test_salary[feature_name]).fillna(0)
    ss.fit(l.values.reshape(-1, 1))
    df_train_salary[feature_name] = ss.transform(df_train_salary[feature_name].fillna(0).values.reshape(-1, 1)).squeeze()
    df_valid_salary[feature_name] = ss.transform(df_valid_salary[feature_name].fillna(0).values.reshape(-1, 1)).squeeze()
    df_test_salary[feature_name] = ss.transform(df_test_salary[feature_name].fillna(0).values.reshape(-1, 1)).squeeze()
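
As a side note on the preprocessing above: pytorch-tabnet expects `cat_idxs`/`cat_dims` to be aligned with the column order actually fed to the model, which is easy to get wrong. A stdlib-only sketch of that alignment check (feature names taken from the Adult columns; the cardinalities here are illustrative, not computed from the data):

```python
# cat_idxs: positions of the categorical columns within the ordered feature list.
# cat_dims: cardinalities of those columns, in the same order as cat_idxs.
feature_names = ["age", "education", "hours-per-week", "sex"]  # order fed to the model
cat_features = ["education", "sex"]

# Illustrative cardinalities (in the code above: nunique() over train+valid+test)
cat_dim_dict = {"education": 16, "sex": 2}

cat_idxs = [feature_names.index(f) for f in cat_features]
cat_dims = [cat_dim_dict[f] for f in cat_features]

print(cat_idxs)  # → [1, 3]
print(cat_dims)  # → [16, 2]
```

If the column order of the matrix passed to `.fit()` differs from `feature_names`, the embeddings get attached to the wrong columns, which can silently hurt performance.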

@Optimox
Collaborator

Optimox commented Sep 21, 2020

  • Thanks, so this is the census-income dataset, right?

We have an example notebook showing how TabNet works: https://github.com/dreamquark-ai/tabnet/blob/develop/census_example.ipynb.
The results look more than OK to me; they are on par with an XGBoost model, which I would guess performs better than a "vanilla dnn", whatever that means.

  • The code is a little messy and cannot run directly

Sorry to hear that, but maybe that is one reason you can't get competitive results (as shown for census-income).

  • You also talked about many tabular datasets; maybe you could share some clean code for those?

  • I'll close the issue for now since your feedback can't be reproduced; feel free to reopen as soon as you have reproducible results we can discuss.

@Optimox Optimox closed this as completed Sep 21, 2020
@nickyi1990
Copy link
Author

nickyi1990 commented Sep 22, 2020


I created a Colab notebook; you can check the performance of LightGBM and TabNet in the last cell, and there's a big difference.

@Optimox
Collaborator

Optimox commented Sep 22, 2020

maybe try a larger patience
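
For context, `patience` here refers to pytorch-tabnet's early-stopping parameter: training halts after that many epochs without improvement on the eval metric, so a small patience can stop on a temporary plateau before a later, better minimum. A minimal stdlib sketch of the mechanism (the loss values are made up for illustration):

```python
def early_stop_epoch(val_losses, patience):
    """Return the index of the last epoch run under patience-based early stopping.

    Training stops once `patience` consecutive epochs pass without
    improving on the best validation loss seen so far; if stopping
    never triggers, the final epoch index is returned.
    """
    best = float("inf")
    since_best = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            since_best = 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return len(val_losses) - 1

# Hypothetical curve: improves, plateaus, then improves again.
losses = [0.50, 0.45, 0.44, 0.46, 0.47, 0.43, 0.42]
print(early_stop_epoch(losses, patience=2))  # → 4 (stops on the plateau)
print(early_stop_epoch(losses, patience=5))  # → 6 (survives it and reaches the later minimum)
```

With a noisy validation curve, the larger patience is what lets training reach the second dip.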

@nickyi1990
Author

maybe try a larger patience

I really appreciate the work you did and the idea behind the paper. I tried many times and could not reach a similar performance. Could you try to tune TabNet so that it gets performance similar to LightGBM? And could you reopen the issue so that more people can join in and help find where the problem is? Thanks!
