Description
Describe the feature you'd like
Currently, everything for PySpark (Processors, training, etc.) is logged at the INFO level, including basic cluster setup, for example:
```xml
<!-- Site specific YARN configuration properties -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>10.0.215.164</value>
    <description>The hostname of the RM.</description>
  </property>
  <property>
    <name>yarn.nodemanager.hostname</name>
    <value>algo-1</value>
    <description>The hostname of the NM.</description>
  </property>
  <property>
    <name>yarn.nodemanager.webapp.address</name>
    <value>algo-1:8042</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>5</value>
    <description>Ratio between virtual memory to physical memory.</description>
  </property>
  <property>
    <name>yarn.resourcemanager.am.max-attempts</name>
    <value>1</value>
    <description>The maximum number of application attempts.</description>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,YARN_HOME,AWS_CONTAINER_CREDENTIALS_RELATIVE_URI,AWS_REGION</value>
    <description>Environment variable whitelist</description>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>32768</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>8</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>32768</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>8</value>
  </property>
```
This results in a lot of CloudWatch logs, with major downsides:
- most of the logs are completely useless, printing internal details of infrastructure
- searching logs for anything actually important is very hard, which is very problematic for training and monitoring mission-critical models
- for short jobs it is quite costly: the CloudWatch logging bill can be high compared to the actual compute costs
Adding options to configure the log4j logger, or at least some way to limit this (e.g. a minimum logging level), would get rid of this. It would also be very simple to implement.
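For illustration, a minimal sketch of what a user-supplied log configuration could look like, assuming Spark's standard `log4j.properties` mechanism (the property names below are Spark's usual log4j 1.x settings, not anything SageMaker-specific, and `my.application` is a placeholder logger name):

```properties
# Raise the root logger threshold so INFO-level infrastructure setup is suppressed
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Keep application loggers at INFO so job output stays searchable
log4j.logger.my.application=INFO
```

Even just exposing the root level as a single option would cover the common case.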
How would this feature be used? Please describe.
Additional argument(s) passed to e.g. PySparkProcessor.
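As a sketch of the shape such an argument could take: EMR-style application configuration (the format SageMaker Spark processors already accept via the `configuration` parameter of `run()`) has a `spark-log4j` classification that maps onto `log4j.properties`. Treat the exact classification support as an assumption to verify against the SDK docs; the helper below only builds the dictionary.

```python
# Hedged sketch: build an EMR-style configuration entry that caps the Spark
# root log level. The "spark-log4j" classification name is borrowed from
# EMR's application-configuration format; verify it against the SDK docs.
def log_level_configuration(level="WARN"):
    """Return a configuration list setting Spark's root log4j category."""
    return [
        {
            "Classification": "spark-log4j",
            "Properties": {"log4j.rootCategory": f"{level}, console"},
        }
    ]


configuration = log_level_configuration("WARN")
print(configuration[0]["Properties"]["log4j.rootCategory"])  # WARN, console
```

This could then be passed along as e.g. `processor.run(..., configuration=configuration)`, or wrapped behind a dedicated `PySparkProcessor` argument so users never have to write the raw dictionary.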
Describe alternatives you've considered
Full customizability is not necessarily required, but being able to set a minimum log level is very important.