
Add configuration of loggers for PySpark #4082

Open
@j-adamczyk

Description


Describe the feature you'd like
Currently, everything for PySpark (Processor, training etc.) is logged at INFO level, including basic setup of cluster, for example:

```xml
<!-- Site specific YARN configuration properties -->
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>10.0.215.164</value>
        <description>The hostname of the RM.</description>
    </property>
    <property>
        <name>yarn.nodemanager.hostname</name>
        <value>algo-1</value>
        <description>The hostname of the NM.</description>
    </property>
    <property>
        <name>yarn.nodemanager.webapp.address</name>
        <value>algo-1:8042</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-pmem-ratio</name>
        <value>5</value>
        <description>Ratio between virtual memory to physical memory.</description>
    </property>
    <property>
        <name>yarn.resourcemanager.am.max-attempts</name>
        <value>1</value>
        <description>The maximum number of application attempts.</description>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,YARN_HOME,AWS_CONTAINER_CREDENTIALS_RELATIVE_URI,AWS_REGION</value>
        <description>Environment variable whitelist</description>
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>1</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>32768</value>
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-vcores</name>
        <value>1</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-vcores</name>
        <value>8</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>32768</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>8</value>
    </property>
</configuration>
```

This results in a lot of CloudWatch logs, with major downsides:

  • most of the logs are completely useless, printing internal infrastructure details
  • searching the logs for anything actually important is very hard, which is a serious problem when training and monitoring mission-critical models
  • for short jobs it is quite expensive, making CloudWatch costs high relative to the actual compute costs

Adding options to configure the log4j logger, or at least some way to limit this (e.g. a minimum logging level), would get rid of the noise. It should also be very simple to implement.
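For context, limiting Spark's verbosity via log4j usually comes down to a small properties file. A sketch of what such a file could look like, using the standard log4j 1.x keys from Spark's own `conf/log4j.properties.template` (this is not anything SageMaker currently exposes):

```properties
# Raise the root logger threshold so INFO-level infrastructure chatter is dropped
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Silence Hadoop/YARN and Spark internals specifically
log4j.logger.org.apache.hadoop=WARN
log4j.logger.org.apache.spark=WARN
```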

How would this feature be used? Please describe.
Additional argument(s) passed to e.g. PySparkProcessor.
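As one possible shape for this, `PySparkProcessor.run()` already takes a `configuration` argument with EMR-style classifications; whether the `spark-log4j` classification actually suppresses the cluster-setup logs shown above is an assumption here, and the job/role/script names below are placeholders:

```python
# Sketch: pass a log4j override through PySparkProcessor.run()'s
# `configuration` argument as an EMR-style classification. Only the
# dict structure is shown/verified here; the AWS call is commented out.
spark_log4j_config = [
    {
        "Classification": "spark-log4j",
        "Properties": {
            # Only WARN and above should reach CloudWatch
            "log4j.rootCategory": "WARN, console",
            "log4j.logger.org.apache.hadoop": "WARN",
        },
    }
]

# Usage (not run here; requires an AWS session and IAM role):
# from sagemaker.spark.processing import PySparkProcessor
# processor = PySparkProcessor(
#     base_job_name="example-spark-job",   # hypothetical names/values
#     framework_version="3.1",
#     role="arn:aws:iam::123456789012:role/ExampleRole",
#     instance_count=2,
#     instance_type="ml.m5.xlarge",
# )
# processor.run(
#     submit_app="preprocess.py",
#     configuration=spark_log4j_config,
# )
```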

Describe alternatives you've considered
Full customizability is not necessarily required, but being able to set a minimum log level is very important.
