- train.csv - the training set from the bidder dataset
- test.csv - the test set from the bidder dataset
- bids.csv - the bid dataset
- There are two datasets in this competition. One is a bidder dataset that includes a list of bidder information, including their id, payment account, and address. The other is a bid dataset that includes 7.6 million bids on different auctions. The bids in this dataset are all made by mobile devices.
- The online auction platform uses a fixed dollar increment for each bid, so the dataset does not include a bid amount. You are welcome to learn the bidding behavior from the time of the bids, the auction, or the device.
For the bidder dataset
- bidder_id - Unique identifier of a bidder.
- payment_account - Payment account associated with a bidder. These are obfuscated to protect privacy.
- address - Mailing address of a bidder. These are obfuscated to protect privacy.
- outcome - Label of a bidder indicating whether or not it is a robot. Value 1.0 indicates a robot, whereas value 0.0 indicates a human.
For the bid dataset
- bid_id - Unique id for this bid.
- bidder_id - Unique identifier of a bidder (same as the bidder_id used in train.csv and test.csv).
- auction - Unique identifier of an auction.
- merchandise - The category of the auction site campaign. A bidder might come to this site by searching for "home goods" but end up bidding on "sporting goods", in which case this field is "home goods". This categorical field could be a search term or an online advertisement.
- device - Phone model of a visitor.
- time - Time that the bid is made (transformed to protect privacy).
- country - The country that the IP belongs to.
- ip - IP address of a bidder (obfuscated to protect privacy).
- url - URL where the bidder was referred from (obfuscated to protect privacy).
- Scored on AUC, the area under the ROC (Receiver Operating Characteristic) curve, which aggregates the performance of the model across all threshold values
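To make the metric concrete, here is a minimal sketch of AUC computed through its pairwise-ranking interpretation: the probability that a randomly chosen robot gets a higher score than a randomly chosen human, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not part of the competition code; in practice a library routine would normally be used instead.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 (robot) or 0 (human) per bidder.
    scores: predicted robot score per bidder (higher means more robot-like).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy example (made-up values): 3 of the 4 robot/human pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Because only the ordering of the scores matters, no classification threshold has to be chosen before computing the metric.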
- Correlation: a measure of the linear relationship between pairs of variables
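The linear-relationship measure described above is most commonly the Pearson correlation coefficient. The sketch below assumes that is the measure meant here; `pearson_r` is a hypothetical helper name and the inputs are made-up numbers.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of products of deviations (the 1/n factors cancel in the ratio).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# A perfectly linear increasing relationship gives a value of (approximately) 1.
r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
```

Values near +1 or -1 between two features suggest they carry largely the same information, which is one reason to inspect correlations before selecting features.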
- Fisher's score: find a subset of features such that, in the data space spanned by the selected features, the distances between data points in different classes are as large as possible, while the distances between data points in the same class are as small as possible
Ranks of the variables based on Fisher’s score, in descending order
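As a rough illustration of how such a ranking can be produced, the sketch below scores each feature independently as between-class scatter divided by within-class scatter, then sorts in descending order. This per-feature form is a common simplification and is an assumption here, since the exact variant used is not stated; `fisher_scores`, `rank_features`, and the toy data are hypothetical.

```python
def fisher_scores(X, y):
    """Per-feature Fisher score: between-class over within-class scatter.

    X: list of rows (each a list of numeric feature values).
    y: list of class labels, one per row.
    """
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        overall_mean = sum(column) / len(column)
        between = 0.0
        within = 0.0
        for c in classes:
            values = [v for v, label in zip(column, y) if label == c]
            class_mean = sum(values) / len(values)
            class_var = sum((v - class_mean) ** 2 for v in values) / len(values)
            between += len(values) * (class_mean - overall_mean) ** 2
            within += len(values) * class_var
        if within > 0:
            scores.append(between / within)
        else:
            # No spread inside any class: perfectly separating or constant feature.
            scores.append(float("inf") if between > 0 else 0.0)
    return scores


def rank_features(names, X, y):
    """Return (name, score) pairs sorted by Fisher score, highest first."""
    return sorted(zip(names, fisher_scores(X, y)),
                  key=lambda pair: pair[1], reverse=True)


# Toy data: feature "a" separates the classes, feature "b" does not.
X = [[1.0, 5.0], [1.2, 1.0], [3.0, 4.0], [3.2, 2.0]]
y = [0, 0, 1, 1]
ranking = rank_features(["a", "b"], X, y)
```

A high score means the class means are far apart relative to the spread inside each class, which matches the intuition stated above.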
Used SMOTE (Synthetic Minority Over-sampling Technique) to balance the class distribution by generating synthetic examples of the minority class
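SMOTE does not duplicate minority rows. It creates new points by interpolating between a minority sample and one of its nearest minority-class neighbours. The sketch below shows only that core idea under simplifying assumptions (Euclidean distance, numeric features, a hypothetical function name `smote_sketch`); it is not the implementation used in this project, and a maintained library such as imbalanced-learn would normally be preferred.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points from minority-class rows.

    minority: list of numeric feature vectors from the minority class.
    k: number of nearest minority neighbours to choose from.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point, excluding itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic


# Toy minority class (made-up values) with five synthetic rows added.
robots = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_rows = smote_sketch(robots, n_new=5)
```

Oversampling of this kind should be applied to the training split only, so that synthetic rows never leak into validation data.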
- bids_per_url - mean number of bids made per URL
- bids_per_auction - mean number of bids made per auction
- ip_per_device - mean number of IPs used for every device a bidder used
- url_per_auction - mean number of URLs used per auction
- lg_total_bids - log transformation of the total number of bids made by a bidder
- lg_total_device - log transformation of the total number of devices used by a bidder
- lg_total_country - log transformation of the total number of countries a bidder's IPs belong to
- lg_total_ip - log transformation of the total number of IPs used
- lg_total_url - log transformation of the total number of URLs used
- lg_total_auction - log transformation of the total number of auctions a bidder joined
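The features above can be sketched as a per-bidder aggregation over the bid rows. Several details are assumptions, because the list does not pin them down: "mean number of bids per URL/auction" is taken as total bids divided by the number of distinct URLs/auctions, "per device" and "per auction" means are averages of distinct counts within each group, and the log is the natural log with no offset. The function names and the toy rows are hypothetical.

```python
import math
from collections import defaultdict


def n_unique(rows, column):
    """Number of distinct values of `column` among a bidder's bids."""
    return len({row[column] for row in rows})


def mean_unique_per(rows, group_col, value_col):
    """Mean number of distinct `value_col` values within each `group_col` group."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[group_col]].add(row[value_col])
    return sum(len(values) for values in groups.values()) / len(groups)


def bidder_features(bids):
    """bids: iterable of dicts keyed by the bid dataset's column names."""
    by_bidder = defaultdict(list)
    for bid in bids:
        by_bidder[bid["bidder_id"]].append(bid)

    features = {}
    for bidder_id, rows in by_bidder.items():
        total_bids = len(rows)
        features[bidder_id] = {
            "bids_per_url": total_bids / n_unique(rows, "url"),
            "bids_per_auction": total_bids / n_unique(rows, "auction"),
            "ip_per_device": mean_unique_per(rows, "device", "ip"),
            "url_per_auction": mean_unique_per(rows, "auction", "url"),
            # Natural log is an assumption; the base is not stated.
            "lg_total_bids": math.log(total_bids),
            "lg_total_device": math.log(n_unique(rows, "device")),
            "lg_total_country": math.log(n_unique(rows, "country")),
            "lg_total_ip": math.log(n_unique(rows, "ip")),
            "lg_total_url": math.log(n_unique(rows, "url")),
            "lg_total_auction": math.log(n_unique(rows, "auction")),
        }
    return features


# Made-up rows for illustration only.
toy_bids = [
    {"bidder_id": "b1", "auction": "a1", "device": "d1", "country": "us", "ip": "ip1", "url": "u1"},
    {"bidder_id": "b1", "auction": "a1", "device": "d1", "country": "us", "ip": "ip2", "url": "u1"},
    {"bidder_id": "b1", "auction": "a2", "device": "d2", "country": "us", "ip": "ip2", "url": "u2"},
    {"bidder_id": "b1", "auction": "a2", "device": "d2", "country": "in", "ip": "ip3", "url": "u1"},
    {"bidder_id": "b2", "auction": "a1", "device": "d3", "country": "uk", "ip": "ip9", "url": "u5"},
]
toy_features = bidder_features(toy_bids)
```

The log transforms compress the heavy right tail of the raw counts, where a few bidders place far more bids than the rest.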
Used an ensemble of
- Random Forest Classifier
- CatBoost Classifier
- Gradient Boosting Classifier
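How the three classifiers were combined is not stated. One common option is soft voting, which averages each model's predicted robot probability per bidder. The sketch below assumes that approach; `average_probabilities`, the optional weights, and the probability values are illustrative only.

```python
def average_probabilities(model_probs, weights=None):
    """Soft-voting ensemble: weighted mean of per-model probabilities.

    model_probs: one list of predicted robot probabilities per model,
                 all aligned to the same bidders.
    weights: optional per-model weights (defaults to equal weighting).
    """
    if not model_probs:
        raise ValueError("need predictions from at least one model")
    n_samples = len(model_probs[0])
    if any(len(p) != n_samples for p in model_probs):
        raise ValueError("all models must score the same number of bidders")
    if weights is None:
        weights = [1.0] * len(model_probs)
    if len(weights) != len(model_probs):
        raise ValueError("need exactly one weight per model")

    total = sum(weights)
    return [
        sum(w * probs[i] for w, probs in zip(weights, model_probs)) / total
        for i in range(n_samples)
    ]


# Made-up probabilities from three models for two bidders.
random_forest = [0.2, 0.8]
catboost = [0.4, 0.6]
gradient_boosting = [0.6, 0.1]
blended = average_probabilities([random_forest, catboost, gradient_boosting])
```

Since AUC depends only on how bidders are ranked, the averaged probabilities can be scored directly without choosing a threshold.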