Performance tips
To get the best performance from the CosmosDB Spark Connector, consider the following settings and practices.
If the scenario only needs a subset of the data in the CosmosDB collection, you can set query_custom to specify a custom query that is used to fetch the data from each CosmosDB collection partition. By default, the query is constructed from the predicates of the SparkSQL query. A custom query, if provided, overrides this default.
"query_custom" : "SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c"
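As a minimal sketch of how such a projection query could be assembled from a list of field names, the helper below builds the string shown above. The function name `build_projection_query` and its `alias` parameter are hypothetical and not part of the connector; the connector only sees the resulting string.

```python
def build_projection_query(fields, alias="c"):
    """Build a SELECT that projects only the given top-level fields.

    Hypothetical helper for illustration; the connector itself just
    receives the finished query string through query_custom.
    """
    if not fields:
        raise ValueError("at least one field is required")
    columns = ", ".join(f"{alias}.{name}" for name in fields)
    return f"SELECT {columns} FROM {alias}"


# Reproduce the query_custom value used in this page.
query = build_projection_query(["date", "delay", "distance", "origin", "destination"])
custom_query_option = {"query_custom": query}
```

Projecting only the needed fields keeps each returned document small, so less data is transferred from each collection partition.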
When fetching data from a CosmosDB collection, the connector maps each CosmosDB collection partition to an executor on a Spark worker node and uses the CosmosDB Java SDK to send a query to the target partition. With the query_pagesize query option, you can specify the number of documents that each query page should contain. The larger the page size, the fewer network round trips are required to fetch the data, and hence the better the throughput. The backend returns as many documents as the specified page size allows while keeping the response size within a certain threshold.
# Configuration for PySpark
# Connection
flightsConfig = {
    "Endpoint" : "https://$account$.documents.azure.com:443/",
    "Masterkey" : "$masterkey$",
    "Database" : "$database$",
    "preferredRegions" : "$region1$, $region2$",
    "Collection" : "$collection$",
    "SamplingRatio" : "1.0",
    "schema_samplesize" : "1000",
    "query_pagesize" : "2147483647",
    "query_custom" : "SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c"
}
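The sketch below illustrates the page-size reasoning and how a configuration like `flightsConfig` is typically passed to the connector. It assumes the connector is on the classpath and registered under the data source name `com.microsoft.azure.cosmosdb.spark`. The helper names `min_round_trips` and `load_collection` are hypothetical. The round-trip count is only a lower bound, because the backend may return fewer documents per page to stay within its response size threshold.

```python
def min_round_trips(document_count, page_size):
    """Lower bound on query round trips for one collection partition.

    Real page counts can be higher, since the backend caps each
    response by size as well as by document count.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return (document_count + page_size - 1) // page_size  # ceiling division


def load_collection(spark, config):
    """Read a collection into a DataFrame using the connector options.

    `spark` is an existing SparkSession; the format name is assumed
    to match the connector version in use.
    """
    return (
        spark.read.format("com.microsoft.azure.cosmosdb.spark")
        .options(**config)
        .load()
    )


# 1,000,000 documents in a partition: small pages vs. large pages.
small_pages = min_round_trips(1_000_000, 100)
large_pages = min_round_trips(1_000_000, 10_000)
```

With an existing session, `load_collection(spark, flightsConfig)` would return the DataFrame for the configured collection.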
When launching the SparkContext, the executor parameters can be fine-tuned so that all cluster resources are utilized. The notable parameters are --num-executors, --executor-cores, and --executor-memory, which specify the number of executors and their compute and memory capacity.
For applications that do more analytics, you can use more cores per executor. If that is not needed, you can instead increase the number of executors while using fewer cores for each of them, which increases parallelism and throughput.
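To make the trade-off concrete, the sketch below divides a fixed core budget into two executor layouts and produces the matching launch flags. The helper `executor_flags` and the fixed memory-per-core figure are assumptions for illustration only. Real sizing also needs to leave headroom for the driver, the operating system, and memory overhead.

```python
def executor_flags(total_cores, cores_per_executor, memory_per_core_gb):
    """Split a fixed core budget into executors and return launch flags.

    Hypothetical helper: assumes memory scales linearly with cores and
    ignores driver and overhead reservations.
    """
    if total_cores % cores_per_executor != 0:
        raise ValueError("total_cores must be a multiple of cores_per_executor")
    num_executors = total_cores // cores_per_executor
    executor_memory_gb = cores_per_executor * memory_per_core_gb
    return [
        "--num-executors", str(num_executors),
        "--executor-cores", str(cores_per_executor),
        "--executor-memory", f"{executor_memory_gb}G",
    ]


# Same 32-core budget, two layouts.
analytics_heavy = executor_flags(32, cores_per_executor=8, memory_per_core_gb=4)
throughput_heavy = executor_flags(32, cores_per_executor=2, memory_per_core_gb=4)
```

The first layout gives each executor more compute for analytics, while the second spreads the same cores across more executors for higher read and write parallelism.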
In write scenarios, if the Spark cluster is provisioned so that the number of cores is at least the number of tasks, all tasks run at once in parallel, giving the best throughput. If the number of Spark partitions exceeds the number of executors, it usually helps to coalesce the data by executor or worker node before writing, to avoid scheduling overhead.
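One way to apply the coalescing advice is sketched below. It assumes a PySpark DataFrame `df`, a write configuration dictionary in the same style as the read configuration, and the data source name `com.microsoft.azure.cosmosdb.spark`. The helper names and the choice of `append` mode are illustrative assumptions, not the connector's prescribed method.

```python
def coalesce_target(num_partitions, num_executors):
    """Pick a partition count no larger than the number of executors.

    coalesce() can only reduce the partition count, so the current
    count is kept when it is already at or below the executor count.
    """
    return max(1, min(num_partitions, num_executors))


def write_coalesced(df, write_config, num_executors):
    """Coalesce to at most one partition per executor, then write.

    `df` is a PySpark DataFrame; `write_config` holds connector options.
    """
    target = coalesce_target(df.rdd.getNumPartitions(), num_executors)
    (
        df.coalesce(target)
        .write.format("com.microsoft.azure.cosmosdb.spark")
        .mode("append")
        .options(**write_config)
        .save()
    )


# 200 Spark partitions on a cluster with 16 executors.
target_partitions = coalesce_target(200, 16)
```

Using `coalesce` rather than `repartition` avoids a full shuffle, at the cost of possibly uneven partition sizes.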