Please see the set of transform project conventions for details on general project conventions, transform configuration, testing, and IDE setup.
In addition to the common ededup parameters, the Ray implementation provides two more:
- hash_cpu - specifies the number of CPUs allocated to each hash actor
- num_hashes - specifies the number of hash actors
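As a rough illustration of how these two parameters combine, the sketch below multiplies them to get the total CPU requested by the hash actors. The values are hypothetical, the helper `hash_actor_cpus` is not part of the transform, and the fractional `hash_cpu` value assumes that fractional CPU requests are accepted.

```python
# Hypothetical values, chosen only for illustration.
ededup_params = {
    "hash_cpu": 0.5,   # CPUs requested by each hash actor (assumed fractional)
    "num_hashes": 4,   # number of hash actors
}


def hash_actor_cpus(params: dict) -> float:
    """Return the total CPUs requested by all hash actors together."""
    return params["hash_cpu"] * params["num_hashes"]


print(hash_actor_cpus(ededup_params))
```

This covers only the hash actors; the workers that read and process the data need CPUs of their own.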
We also provide an estimate for roughly determining the cluster size needed to run the transform.
When running the transform with the Ray launcher (i.e. TransformLauncher), the following command line arguments are available in addition to the options provided by the launcher.
--ededup_hash_cpu EDEDUP_HASH_CPU
    number of CPUs per hash
--ededup_num_hashes EDEDUP_NUM_HASHES
    number of hash actors to use
--ededup_doc_column EDEDUP_DOC_COLUMN
    name of the column containing document
--ededup_doc_id_column EDEDUP_DOC_ID_COLUMN
    name of the column containing document id
--ededup_use_snapshot EDEDUP_USE_SNAPSHOT
    flag to continue from snapshot
--ededup_snapshot_directory EDEDUP_SNAPSHOT_DIRECTORY
    location of snapshot files
These correspond to the configuration keys described above.
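The mapping from configuration keys to command line arguments can be sketched as follows: each key gets the `ededup_` prefix and a leading `--`. The helper `to_cli_args` is hypothetical and exists only for this illustration, and the column names and numeric values are placeholders rather than defaults of the transform.

```python
PREFIX = "ededup_"  # prefix seen on every argument in the list above


def to_cli_args(params: dict) -> list:
    """Turn configuration keys into '--ededup_<key> <value>' argument pairs."""
    args = []
    for key, value in params.items():
        args.append(f"--{PREFIX}{key}")
        args.append(str(value))
    return args


# Placeholder values; adjust to your data set and cluster.
params = {
    "hash_cpu": 0.5,
    "num_hashes": 2,
    "doc_column": "contents",
    "doc_id_column": "document_id",
}

print(" ".join(to_cli_args(params)))
```

The resulting list can be appended to the launcher's own arguments when composing a command line.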