Performance is bad #189
I tested this method on many tabular datasets, much larger than the dataset used in the notebook example, and the performance is really bad... Is anyone else in the same situation?
Comments
Could you share the dataset and the parameters you used for training?
Hello @nickyi1990, I'm sad to hear that you are not getting the performance you'd like. We'd be happy to have a new dataset to try TabNet on and improve based on the (poor) results, but please be constructive and share more information so we can reproduce and help. There is also a benchmark issue #127 here; if you took the time to benchmark TabNet on various datasets, please share the results with us! If you can share your code, that's even better! Cheers
The code is a little messy and cannot run directly, but you can see how I split the dataset and the parameters I used for training from the code below. The validation and test performance from TabNet is not as good as my vanilla fully connected network, which is shown below.

```python
import os
import random

import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Placeholder; point this at your own output folder.
PROCESSED_FOLDER_PATH = "./processed"


def safe_list(x):
    """Wrap a scalar into a list (helper assumed from the original codebase)."""
    return x if isinstance(x, list) else [x]
def seed_reproducer(seed=2020):
    """Reproducer for pytorch experiments.

    Parameters
    ----------
    seed: int, optional (default = 2020)
        Random seed.

    Example
    -------
    seed_reproducer(seed=2020)
    """
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        torch.backends.cudnn.enabled = True
def preprocess_salary_data():
    """Dataset: http://archive.ics.uci.edu/ml/datasets/Adult"""
    print("process salary dataset")
    df = pd.read_csv("/home/kunzhong/.fastai/data/adult_sample/adult.csv")
    df["y"] = df["salary"].map({">=50k": 1, "<50k": 0})
    for feature in df.columns:
        if df[feature].dtype == "O":
            df[feature] = df[feature].fillna("Unknown")
        else:
            df[feature] = df[feature].fillna(-1)
    all_feature_names = df.columns
    category_feature_names = [
        "workclass",
        "education",
        "marital-status",
        "occupation",
        "relationship",
        "race",
        "sex",
        "native-country",
    ]
    target_name = "y"
    feature_names_tobe_dropped = ["y", "salary"]
    feature_names = all_feature_names.difference(feature_names_tobe_dropped + safe_list(target_name)).tolist()
    continuous_feature_names = [f for f in feature_names if f not in category_feature_names]
    for feature in category_feature_names:
        le = LabelEncoder()
        df[feature] = le.fit_transform(df[feature])
    df_train, df_valid_test = train_test_split(df, test_size=0.3, random_state=2020)
    df_valid, df_test = train_test_split(df_valid_test, test_size=0.5, random_state=2020)
    df_train = df_train.reset_index(drop=True)
    df_valid = df_valid.reset_index(drop=True)
    df_test = df_test.reset_index(drop=True)
    df_train.to_pickle(os.path.join(PROCESSED_FOLDER_PATH, "df_train_salary.p"))
    df_valid.to_pickle(os.path.join(PROCESSED_FOLDER_PATH, "df_valid_salary.p"))
    df_test.to_pickle(os.path.join(PROCESSED_FOLDER_PATH, "df_test_salary.p"))


preprocess_salary_data()
df_train_salary = pd.read_pickle(os.path.join(PROCESSED_FOLDER_PATH, "df_train_salary.p"))
df_valid_salary = pd.read_pickle(os.path.join(PROCESSED_FOLDER_PATH, "df_valid_salary.p"))
df_test_salary = pd.read_pickle(os.path.join(PROCESSED_FOLDER_PATH, "df_test_salary.p"))

all_feature_names = df_train_salary.columns
cat_features = [
    "workclass",
    "education",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "native-country",
]
target_name = "y"
feature_names_tobe_dropped = ["y", "salary"]
feature_names = all_feature_names.difference(feature_names_tobe_dropped + safe_list(target_name)).tolist()

cat_dim_dict = {}
for cat_feature in cat_features:
    le = LabelEncoder()
    # Fit the encoder on train, valid and test together so every category is seen.
    l = pd.concat(
        [df_train_salary[cat_feature], df_valid_salary[cat_feature], df_test_salary[cat_feature]]
    ).fillna(0)
    le.fit(l)
    cat_dim_dict[cat_feature] = l.nunique()
    df_train_salary[cat_feature] = le.transform(df_train_salary[cat_feature].fillna(0))
    df_valid_salary[cat_feature] = le.transform(df_valid_salary[cat_feature].fillna(0))
    df_test_salary[cat_feature] = le.transform(df_test_salary[cat_feature].fillna(0))
    print(cat_feature, df_train_salary[cat_feature].unique())

# Written this way mainly to preserve the order of the variables.
cat_idxs = [feature_names.index(f) for f in cat_features]
cat_dims = [cat_dim_dict[f] for f in cat_features]

for feature_name in list(set(feature_names).difference(cat_features)):
    ss = StandardScaler()
    l = pd.concat(
        [df_train_salary[feature_name], df_valid_salary[feature_name], df_test_salary[feature_name]]
    ).fillna(0)
    ss.fit(l.values.reshape(-1, 1))
    df_train_salary[feature_name] = ss.transform(df_train_salary[feature_name].fillna(0).values.reshape(-1, 1)).squeeze()
    df_valid_salary[feature_name] = ss.transform(df_valid_salary[feature_name].fillna(0).values.reshape(-1, 1)).squeeze()
    df_test_salary[feature_name] = ss.transform(df_test_salary[feature_name].fillna(0).values.reshape(-1, 1)).squeeze()
```
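The snippet above stops at preprocessing, so for context here is a minimal sketch of the training call it could feed into, using the pytorch-tabnet API. Every hyperparameter value below is a placeholder (the values actually used were not shared), and the arrays are built in the same column order as `feature_names` so that `cat_idxs` points at the right columns:

```python
from pytorch_tabnet.tab_model import TabNetClassifier

# Build arrays in feature_names order so cat_idxs matches the columns.
X_train = df_train_salary[feature_names].values
y_train = df_train_salary[target_name].values
X_valid = df_valid_salary[feature_names].values
y_valid = df_valid_salary[target_name].values
X_test = df_test_salary[feature_names].values
y_test = df_test_salary[target_name].values

# Placeholder hyperparameters; the original values are not in the snippet above.
clf = TabNetClassifier(cat_idxs=cat_idxs, cat_dims=cat_dims, cat_emb_dim=1, seed=2020)
clf.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    eval_metric=["auc"],
    max_epochs=200,
    patience=20,
    batch_size=1024,
    virtual_batch_size=128,
)
test_preds = clf.predict_proba(X_test)[:, 1]
```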
We have an example notebook that shows how TabNet works: https://github.com/dreamquark-ai/tabnet/blob/develop/census_example.ipynb.
Sorry to hear that, but maybe that's one reason you can't get competitive results (as shown for census-income).
I created a colab notebook. You can check the performance of LightGBM and TabNet in the last cell; there is a big difference.
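For readers without access to the notebook, a LightGBM baseline on the frames prepared above might look like the following sketch; the parameters are library defaults, not necessarily what the notebook used:

```python
import lightgbm as lgb
from sklearn.metrics import roc_auc_score

# Minimal baseline with default parameters; the categorical columns are
# already label-encoded integers, so they are fed in as-is here.
lgb_model = lgb.LGBMClassifier(random_state=2020)
lgb_model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)])
print("LightGBM test AUC:", roc_auc_score(y_test, lgb_model.predict_proba(X_test)[:, 1]))
```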
Maybe try a larger patience.
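In pytorch-tabnet, `patience` is the early-stopping argument of `fit`, counting epochs without improvement on the eval metric before training stops. A sketch with a larger value (the numbers are illustrative only, not a recommendation from the maintainers):

```python
# A larger patience lets training run past early plateaus before stopping.
clf.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    eval_metric=["auc"],
    max_epochs=500,  # give the model room to keep improving
    patience=100,    # illustrative: wait much longer before early stopping
)
```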
I really appreciate the work you did and the idea of the paper. I tried many times and cannot reach similar performance. Could you try to tune TabNet so it gets performance similar to LightGBM? And could you reopen the issue so more people can join in to find where the problem is? Thanks!