
Vowpal Wabbit - Large cardinality #2341

Open
AllardJM opened this issue Jan 15, 2025 · 0 comments
Spark: 3.2
SynapseML: com.microsoft.azure:synapseml_2.12:0.11.3

I have a large dataset (~19 billion rows). If I run VW on the data using 8-10 columns (all but one non-numeric), the process completes in about 9 minutes, even with multiple quadratic terms (not shown) in the pass-through args.

from synapse.ml.vw import VowpalWabbitGeneric

model = VowpalWabbitGeneric(
    numPasses=1,
    numBits=18,  # 2^18 = 262,144 hash buckets for the feature table
    useBarrierExecutionMode=False,
    passThroughArgs="--loss_function logistic --link logistic --l1 0.000001",
).fit(sdf)

However, if I take the same data, hash the 8-10 columns into a single feature with ~5.5 million distinct values, and run the same job, it never finishes (I killed the process after 10 hours).
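For scale, here is a quick back-of-the-envelope check using the numbers above (this is just arithmetic on the reported values, not a diagnosis): with numBits=18 the hashed weight table has 2^18 = 262,144 slots, far fewer than the ~5.5 million distinct values, so heavy hash collisions would be expected.

# Collision arithmetic from the numbers in this report:
# 18-bit feature hashing vs. ~5.5M distinct hashed values.
buckets = 2 ** 18          # 262,144 hash buckets at numBits=18
distinct = 5_500_000       # approximate distinct values after hashing
print(distinct / buckets)  # ~21 distinct values per bucket on average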

Is there anything to know about running VW on Spark when a namespace has very large and potentially sparse cardinality?
