SNAP configuration parameters

Name (prefix with spark.sparklinedata.spmd, unless specified)	Default	Description	Bytes Unit
local.segment.cache	{"storageLocations" : [{"path" : "/tmp/olapcache", "maxSize" : 10000000 }], "columnCacheSizeBytes" : 0, "avgSizePerCacheFile" : 524288000, "LocalUnzipFileSizeFactor" : 4, "shareCacheAcrossExecutors" : true}	Local folder(s) used to cache index files. The index files are copied to one of these locations and then memory-mapped into Spark Executors.
segment.query.cache	{"useCache" : false,"sizeInMBytes" : 1024,"expireAfterSeconds" : 60,"resultSizeMax" : 20000}	Controls query caching.
avgsizeperpartition	0	Used by subsequent Indexing Jobs as the avgSizePerPartition setting for the partitions being indexed. Usually this should be set in the Index parameters once during create olap index.	ByteUnit.BYTE
preferredsegmentsize	0	Used by subsequent Indexing Jobs as the preferredSegmentSize setting for the partitions being indexed. Usually this should be set in the Index parameters once during create olap index.	ByteUnit.BYTE
sizereductionpercent	0.25	An estimate of the size reduction of the SPMD format compared to the original data. Ideally this is set by indexing a representative sample and recording the size difference from the original data size. By default this is set to 0.25.
indexing.rowFlushBoundary		The row batch size used during indexing; this impacts the memory footprint of an indexing task. By default this is based on the value set in the Index Options, but it can be overridden using this session-level parameter. If this is set to a non-zero value, it takes precedence over the value in the Index Options.
select.query.buffersize		Preferred page size (in bytes) when running an Index Select Query; should be in the range of 1 MB to tens of MB for optimal performance.
num_partitions_indexed	0	Number of partitions being indexed in subsequent Indexing Jobs
gByEngine.offheapsize	1gb	Off Heap Pool used by each instance of the Index GroupBy Engine; there is 1 for every core assigned to an Executor.	ByteUnit.MiB
selectquery.pagesize	10000	Number of rows fetched on each invocation of a SNAP Index Select Query.
enable.segmentcachemanager	true	If true, SegmentCacheManager is used to track segment locations, and influence olap Query locations
indexing.memory.percore	1gb	Heap space to use for indexing. The number of concurrent Indexing Merge operations is restricted such that the total memory needed doesn't exceed this value * the number of Spark cores.	ByteUnit.MiB
spark.sparklinedata.use.snapwritercontainer	true	Replace dynamiccontainerwriter with snapwritercontainer. For SNAP-generated insert plans, sorting of data in the dynamicwriter is not needed, as this is done by the Repartition/repartitionExpression operators added to the Plan. Only turn this off if you are writing directly to the SNAP Index.
spark.sparklinedata.indexing.default.rowbatch.memory	200mb	If the memory footprint for a rowbatch cannot be inferred, then this value is used.	ByteUnit.MiB
spark.sparklinedata.cache.metadata	false	Enable caching of metadata at the session level. Requires a thriftserver restart. Use with caution; see Invalidate Session Metadata Cache for details.
groupingsetrewrite.maxgroupings	3	Maximum Number of Grouping Sets in a Query that will trigger rewrite to Union of Aggregates.
groupingsetrewrite.nostats	false	Whether to rewrite to Union of Aggregates even when operator stats cannot be computed.
groupingsetrewrite.size.reduction.ratio	0.5	Rewrite to Union of Aggregates only if the output size of the Union of Aggregates is reduced by this amount relative to the estimated non-aggregate (Expand Operator) size.
spark.sparklinedata.startup.script	None	SQL script to run before starting the ThriftServer Listener. This is an unsupported feature; please check with the SNAP team before using it.
spark.sparklinedata.startup.script.exec.delay	2 secs	Number of seconds to wait (to allow for cluster executors to register) before running the startup script.
spark.sparklinedata.enable.custom.filecommit.protocol	false	If true, SNAP classes are used for the FileCommit protocol during SNAP Indexing. This should not be turned off at a session level; it only works when all concurrent sessions set this flag to true.
spark.sparklinedata.spmd.source.lastupdatetimestamp	0	The update operation will only consider the source rows after the given timestamp (a long value representing milliseconds). If the value is 0, it considers all of the source rows. Please note: if snap_source_last_update_time is used in IndexUpdateInfo, it will be applicable to insert operations too. You may reset this before each individual insert and update operation.
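
The naming rule in the table header (short names take the spark.sparklinedata.spmd prefix, names already written out in full do not) can be sketched as a small helper. This is only an illustration: the helper names qualify and set_statement are hypothetical, joining the prefix with a "." is an assumption, and whether a given parameter can be changed with a session-level SET statement (rather than at startup) is not stated in the table.

```python
import json

# Prefix named in the table header; joining it with "." is an assumption.
PREFIX = "spark.sparklinedata.spmd"


def qualify(name: str) -> str:
    """Return the full key. Names already starting with 'spark.' are kept as-is."""
    return name if name.startswith("spark.") else f"{PREFIX}.{name}"


def set_statement(name: str, value) -> str:
    """Render a Spark SQL SET statement; dict values are written as compact JSON."""
    if isinstance(value, bool):
        text = "true" if value else "false"
    elif isinstance(value, dict):
        text = json.dumps(value, separators=(",", ":"))
    else:
        text = str(value)
    return f"SET {qualify(name)}={text}"


# Example values: the query cache JSON from the table with useCache switched on.
query_cache = {
    "useCache": True,
    "sizeInMBytes": 1024,
    "expireAfterSeconds": 60,
    "resultSizeMax": 20000,
}

statements = [
    set_statement("segment.query.cache", query_cache),
    set_statement("selectquery.pagesize", 10000),
    set_statement("spark.sparklinedata.cache.metadata", False),
]

for statement in statements:
    print(statement)
```

The third statement shows a name that is listed with its full key in the table and therefore is not prefixed again.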

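The indexing.memory.percore description bounds the memory for concurrent Indexing Merge operations by the per-core value times the number of Spark cores. The arithmetic can be sketched as below. The function names, the binary interpretation of "gb" and "mb", and the per-merge memory figure are all assumptions made for illustration; SNAP's own estimate of the memory a merge needs is not described here.

```python
def parse_size_mib(text: str) -> int:
    """Parse a size such as '1gb' or '200mb' into MiB (binary units assumed)."""
    units = {"mb": 1, "gb": 1024}
    text = text.strip().lower()
    for suffix, factor in units.items():
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * factor
    # Bare numbers are treated as MiB, following the Bytes Unit column.
    return int(text)


def indexing_memory_budget_mib(per_core: str, spark_cores: int) -> int:
    """Total budget = indexing.memory.percore * number of Spark cores."""
    return parse_size_mib(per_core) * spark_cores


def max_concurrent_merges(budget_mib: int, per_merge_mib: int) -> int:
    """Largest merge count whose combined memory stays within the budget."""
    return budget_mib // per_merge_mib


# Default of 1gb per core on a hypothetical 8-core allocation.
budget = indexing_memory_budget_mib("1gb", 8)

# 1536 MiB per merge is a made-up figure, not a SNAP default.
merges = max_concurrent_merges(budget, 1536)

print(budget, merges)
```

Raising indexing.memory.percore or the core count enlarges the budget in this model, while a larger per-merge footprint lowers the number of merges that fit.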