Description
Describe the feature you'd like
Currently, everything for PySpark (Processors, training, etc.) is logged at the INFO level, including basic cluster setup, for example:
```xml
<!-- Site specific YARN configuration properties -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>10.0.215.164</value>
    <description>The hostname of the RM.</description>
  </property>
  <property>
    <name>yarn.nodemanager.hostname</name>
    <value>algo-1</value>
    <description>The hostname of the NM.</description>
  </property>
  <property>
    <name>yarn.nodemanager.webapp.address</name>
    <value>algo-1:8042</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>5</value>
    <description>Ratio between virtual memory to physical memory.</description>
  </property>
  <property>
    <name>yarn.resourcemanager.am.max-attempts</name>
    <value>1</value>
    <description>The maximum number of application attempts.</description>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,YARN_HOME,AWS_CONTAINER_CREDENTIALS_RELATIVE_URI,AWS_REGION</value>
    <description>Environment variable whitelist</description>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>32768</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>8</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>32768</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>8</value>
  </property>
```
This results in a lot of CloudWatch logs, with major downsides:
- most of the logs are completely useless, printing internal details of infrastructure
- searching logs for anything actually important is very hard, which is very problematic for training and monitoring mission-critical models
- for short jobs it is quite costly: the CloudWatch logging bill can be high compared to the actual compute costs
Adding options to configure the log4j logger, or at least some way to limit this (e.g. a minimum logging level), would get rid of this. It would also be very simple to implement.
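For illustration, a minimal sketch of what a user-supplied log configuration could look like, assuming Spark's standard `log4j.properties` mechanism (the property names below are Spark's usual log4j 1.x settings, not anything SageMaker-specific, and `my.application` is a placeholder logger name):

```properties
# Raise the root logger threshold so INFO-level infrastructure setup is suppressed
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Keep application loggers at INFO so job output stays searchable
log4j.logger.my.application=INFO
```

Even just exposing the root level as a single option would cover the common case.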
How would this feature be used? Please describe.
Additional argument(s) passed to e.g. PySparkProcessor.
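As a sketch of the shape such an argument could take: EMR-style application configuration (the format SageMaker Spark processors already accept via the `configuration` parameter of `run()`) has a `spark-log4j` classification that maps onto `log4j.properties`. Treat the exact classification support as an assumption to verify against the SDK docs; the helper below only builds the dictionary.

```python
# Hedged sketch: build an EMR-style configuration entry that caps the Spark
# root log level. The "spark-log4j" classification name is borrowed from
# EMR's application-configuration format; verify it against the SDK docs.
def log_level_configuration(level="WARN"):
    """Return a configuration list setting Spark's root log4j category."""
    return [
        {
            "Classification": "spark-log4j",
            "Properties": {"log4j.rootCategory": f"{level}, console"},
        }
    ]


configuration = log_level_configuration("WARN")
print(configuration[0]["Properties"]["log4j.rootCategory"])  # WARN, console
```

This could then be passed along as e.g. `processor.run(..., configuration=configuration)`, or wrapped behind a dedicated `PySparkProcessor` argument so users never have to write the raw dictionary.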
Describe alternatives you've considered
Full customizability is not necessarily required, but being able to set a minimum log level is very important.