Skip to content
This repository has been archived by the owner on Jun 29, 2021. It is now read-only.

SNAP configuration parameters

sandeep akinapelli edited this page May 21, 2019 · 7 revisions
Name(prefix with spark.sparklinedata.spmd, unless specified) Description Default Bytes Unit
local.segment.cache Local Folder(s) to use to cache index files. The index files will be copied to one of these locations and then memory-mapped into Spark Executors. {"storageLocations" : [{"path" : "/tmp/olapcache", "maxSize" : 10000000 }], "columnCacheSizeBytes" : 0, "avgSizePerCacheFile" : 524288000, "LocalUnzipFileSizeFactor" : 4, "shareCacheAcrossExecutors" : true}
segment.query.cache Control Query Caching {"useCache" : false,"sizeInMBytes" : 1024,"expireAfterSeconds" : 60,"resultSizeMax" : 20000}
avgsizeperpartition 0 Used by subsequent Indexing Jobs as the avgSizePerPartition setting for the partitions being indexed. Usually this should be set in the Index parameters once during create olap index. ByteUnit.BYTE
preferredsegmentsize 0 Used by subsequent Indexing Jobs as the preferredSegmentSize setting for the partitions being indexed. Usually this should be set in the Index parameters once during create olap index. ByteUnit.BYTE
sizereductionpercent 0.25 An estimate in size reduction of the SPMD format compared to the orginal data. Ideally this is set by indexing a representatve sample and recording the size difference from the original datasize. By default this is set to 0.25
indexing.rowFlushBoundary The row batchsize used during indexing, this impacts the memory footprint of an indexing task; by default this is based on the value set in the Index Options, but can be override using this session level parameter. If this is set to a non-zero value, this value takes precedence over the value in the Index Options.
select.query.buffersize Preferred size( in bytes) of the Pagesize when running an Index Select Query, should be 1-10s of MB for optimal performance
num_partitions_indexed 0 Number of partitions being indexed in subsequent Indexing Jobs
gByEngine.offheapsize 1gb Off Heap Pool used by each instance of the Index GroupBy Engine; there is 1 for every core assigned to an Executor. ByteUnit.MiB
selectquery.pagesize 10000 Num. of rows fetched on each invocation of SNAP Index Select Query
enable.segmentcachemanager true If true, SegmentCacheManager is used to track segment locations, and influence olap Query locations
indexing.memory.percore 1gb Heap Space to Use for Indexing. Number of Concurrent Indexing Merge operations "is restricted such that the total memory needed doesn't exceed this value * the number of spark cores ByteUnit.MiB
spark.sparklinedata.use.snapwritercontainer true replace dynamiccontainerwriter with snapwritercontainer; for snap generated insert plans sorting of data in the dynamicwriter is not need, as this is done by Repartition/repartitionExpression operators added to the Plan. Only turn this off if you are directly writing to the SNAP Index.
spark.sparklinedata.indexing.default.rowbatch.memory 200mb If the memory footprint for a rowbatch cannot be inferred, than this value is used. ByteUnit.MiB
spark.sparklinedata.cache.metadata false Enable Caching Metadata at the Session Level. Requires thriftserver restart. Use with caution see Invalidate Session Metadata Cache for details
groupingsetrewrite.maxgroupings 3 Maximum Number of Grouping Sets in a Query that will trigger rewrite to Union of Aggregates.
groupingsetrewrite.nostats false should we rewrite to Union of Aggregates even when we cannot compute operator stats.
groupingsetrewrite.size.reduction.ratio 0.5 rewrite to Union of Aggregates only if output size of Union of Aggregates is reduced by this amount relative to estimate non aggregate(Expand Operator) size.
spark.sparklinedata.startup.script None SQL Script to run before staring ThriftServer Listener. This is an unsupported feature, please check with SNAP team before using.
spark.sparklinedata.startup.script.exec.delay 2 secs Number of seconds to wait(to allow for cluster executors to register) before running the startup script.
spark.sparklinedata.enable.custom.filecommit.protocol false If true, we use SNAP classes for FileCommit protocol during SNAP Indexing. This should not be turned off at a session level; only works when all concurrent sessions set this flag to true
spark.sparklinedata.spmd.source.lastupdatetimestamp 0 Update operation will only consider the source rows after the given timestamp (in long, representation of milliseconds). If value is 0, it would consider all of the source rows. Please note, if snap_source_last_update_time is used in IndexUpdateInfo, It will be applicable for insert operation too. You may reset this before each individual insert and update operation.
Clone this wiki locally