From f17cf14d4f2272a4e247585a2fd20eadede3ec46 Mon Sep 17 00:00:00 2001
From: Jiaming Yuan
Date: Fri, 22 Nov 2024 14:54:02 +0800
Subject: [PATCH] Update document.

---
 demo/dask/dask_learning_to_rank.py |  3 +++
 doc/tutorials/dask.rst             | 39 +++++++++++++++++++++---------
 doc/tutorials/learning_to_rank.rst |  6 +++--
 3 files changed, 35 insertions(+), 13 deletions(-)

diff --git a/demo/dask/dask_learning_to_rank.py b/demo/dask/dask_learning_to_rank.py
index 3567176ba0bf..65df5a84eae7 100644
--- a/demo/dask/dask_learning_to_rank.py
+++ b/demo/dask/dask_learning_to_rank.py
@@ -6,6 +6,9 @@
 MSLR_10k_letor dataset. For more infomation about the dataset, please visit its
 `description page `_.
 
+See :ref:`ltr-dist` for a general description of distributed learning to rank and
+:ref:`ltr-dask` for Dask-specific features.
+
 """
 
 from __future__ import annotations
diff --git a/doc/tutorials/dask.rst b/doc/tutorials/dask.rst
index e039e8e5aa11..805c28120c21 100644
--- a/doc/tutorials/dask.rst
+++ b/doc/tutorials/dask.rst
@@ -529,6 +529,8 @@ See https://github.com/coiled/dask-xgboost-nyctaxi for a set of examples of usin
 with dask and optuna.
 
+.. _ltr-dask:
+
 ****************
 Learning to Rank
 ****************
@@ -539,17 +541,32 @@ Learning to Rank
 
   Position debiasing is not yet supported.
 
-Similar to the (Py)Spark interface, the XGBoost Dask interface can automatically sort and
-group the samples based on input query ID since version 3.0. However, the automatic
-grouping in the Dask interface has some caveats that one needs to be aware of, namely it
-increases memory usage and it groups only worker-local data. This should be similar with
-other interfaces.
-
-For the memory usage part, XGBoost first checks whether the query ID is sorted before
-actually sorting the samples. If you don't want XGBoost to sort and group the data during
-training, one solution is to sort it beforehand. See :ref:`ltr-dist` for more info about
-the implication of worker-local grouping,
-:ref:`sphx_glr_python_dask-examples_dask_learning_to_rank.py` for a worked example.
+For performance reasons, there are two operation modes in the Dask learning-to-rank
+implementation. The difference is whether a distributed global sort is needed. Please see
+:ref:`ltr-dist` for how ranking works with distributed training in general. Below we
+discuss some of the Dask-specific features.
+
+First, if you use the :py:class:`~xgboost.dask.DaskQuantileDMatrix` interface or the
+:py:class:`~xgboost.dask.DaskXGBRanker` with ``allow_group_split`` set to ``True``,
+XGBoost tries to sort and group the samples for each worker based on the query ID. This
+mode skips the global sort and sorts only worker-local data, and hence requires no
+inter-worker data shuffle. Please note that even the worker-local sort is costly,
+particularly in terms of memory usage, as there is no spilling when
+:py:meth:`~pandas.DataFrame.sort_values` is used. XGBoost first checks whether the QID is
+already sorted before actually performing the sort. Choose this mode if the query groups
+are relatively consecutive, meaning most of the samples within a query group are close to
+each other and are likely to reside on the same worker. Don't use it if you have
+performed a random shuffle on your data.
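A rough sketch of this worker-local mode follows. The toy data, the ``qid`` and ``label``
column names, and the parameter values are made up for illustration; each partition is
sorted by query ID up front so that the sortedness check passes and the worker-local sort
can be skipped.

.. code-block:: python

    from dask import dataframe as dd
    from dask.distributed import Client, LocalCluster
    from sklearn.datasets import make_classification
    import numpy as np
    import pandas as pd
    import xgboost as xgb


    def fit_ranker(client: Client) -> xgb.Booster:
        # Toy data: 4096 rows, 16 features, and 64 relatively consecutive query groups.
        X, y = make_classification(n_samples=4096, n_features=16)
        df = pd.DataFrame(X, columns=[f"f{i}" for i in range(16)])
        df["label"] = y
        df["qid"] = np.sort(np.random.default_rng(2024).integers(0, 64, size=df.shape[0]))
        ddf = dd.from_pandas(df, npartitions=8)

        # Sort each partition by the query ID beforehand; XGBoost then detects that the
        # QID is already sorted and skips its own worker-local sort.
        ddf = ddf.map_partitions(lambda part: part.sort_values("qid"))

        Xy = xgb.dask.DaskQuantileDMatrix(
            client, ddf.drop(columns=["label", "qid"]), ddf["label"], qid=ddf["qid"]
        )
        result = xgb.dask.train(client, {"objective": "rank:ndcg"}, Xy, num_boost_round=8)
        return result["booster"]


    if __name__ == "__main__":
        with LocalCluster() as cluster, Client(cluster) as client:
            fit_ranker(client)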
+
+If the input data is randomly distributed, then there is no way to guarantee that most of
+the data within the same group stays on the same worker. For large query groups, this
+might not be an issue, but for small query groups, every worker may get only one or two
+samples from each query group, which can lead to disastrous performance. In that case, we
+can partition the data by query group, which is the default behavior of the
+:py:class:`~xgboost.dask.DaskXGBRanker` unless ``allow_group_split`` is set to ``True``.
+This mode performs a sort and a groupby on the entire dataset, which can be slow. See
+:ref:`sphx_glr_python_dask-examples_dask_learning_to_rank.py` for a worked example.
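For comparison, here is a minimal sketch of the default mode, again with made-up toy data
and column names. The query IDs are randomly shuffled, so ``allow_group_split`` is left at
its default and the ranker sorts and groups the entire dataset by query ID itself.

.. code-block:: python

    from dask import dataframe as dd
    from dask.distributed import Client, LocalCluster
    from sklearn.datasets import make_classification
    import numpy as np
    import pandas as pd
    import xgboost as xgb


    def fit_ranker(client: Client) -> xgb.dask.DaskXGBRanker:
        # Toy data with randomly shuffled query IDs, so samples from the same
        # query group are scattered across partitions and workers.
        X, y = make_classification(n_samples=4096, n_features=16)
        df = pd.DataFrame(X, columns=[f"f{i}" for i in range(16)])
        df["label"] = y
        df["qid"] = np.random.default_rng(2024).integers(0, 64, size=df.shape[0])
        ddf = dd.from_pandas(df, npartitions=8)

        # With ``allow_group_split`` left at its default, the ranker performs the
        # global sort and groupby so that no query group is split among workers.
        ranker = xgb.dask.DaskXGBRanker(objective="rank:ndcg")
        ranker.client = client
        ranker.fit(ddf.drop(columns=["label", "qid"]), ddf["label"], qid=ddf["qid"])
        return ranker


    if __name__ == "__main__":
        with LocalCluster() as cluster, Client(cluster) as client:
            fit_ranker(client)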
 
 .. _tracker-ip:
 
diff --git a/doc/tutorials/learning_to_rank.rst b/doc/tutorials/learning_to_rank.rst
index 756f5089df23..8743a672d219 100644
--- a/doc/tutorials/learning_to_rank.rst
+++ b/doc/tutorials/learning_to_rank.rst
@@ -180,9 +180,11 @@ counterpart. Please refer to document of the respective XGBoost interface for de
 
 Position-debiasing is not yet supported for existing distributed interfaces.
 
-XGBoost works with collective operations, which means data is scattered to multiple workers. We can divide the data partitions by query group and ensure no query group is split among workers. However, this requires a costly sort and groupby operation and might only be necessary for selected use cases. Splitting and scattering a query group to multiple workers is theoretically sound but can affect the model's accuracy. For most use cases, the small discrepancy is not an issue, as the amount of training data is usually large when distributed training is used. For a longer explanation, assuming the pairwise ranking method is used, we calculate the gradient based on relevance degree by constructing pairs within a query group. If a single query group is split among workers and we use worker-local data for gradient calculation, then we are simply sampling pairs from a smaller group for each worker to calculate the gradient and the evaluation metric. The comparison between each pair doesn't change because a group is split into sub-groups, what changes is the number of total and effective pairs and normalizers like `IDCG`. One can generate more pairs from a large group than it's from two smaller subgroups. As a result, the obtained gradient is still valid from a theoretical standpoint but might not be optimal.
+XGBoost works with collective operations, which means data is scattered to multiple workers. We can divide the data partitions by query group and ensure no query group is split among workers. However, this requires a costly sort and groupby operation and might only be necessary for selected use cases. Splitting and scattering a query group to multiple workers is theoretically sound but can affect the model's accuracy. If only a small number of groups sit at the boundaries between workers, the small discrepancy is not an issue, as the amount of training data is usually large when distributed training is used.
 
-Unless there's a very small number of query groups, we don't need to partition the data based on query groups. As long as each data partitions within a worker are correctly sorted by query IDs, XGBoost can aggregate sample gradients accordingly. And both the (Py)Spark interface and the Dask interface can sort the data according to query ID, please see respected tutorials for more information.
+For a longer explanation, assume the pairwise ranking method is used: we calculate the gradient based on relevance degree by constructing pairs within a query group. If a single query group is split among workers and we use worker-local data for gradient calculation, then we are simply sampling pairs from a smaller group for each worker to calculate the gradient and the evaluation metric. The comparison between each pair doesn't change when a group is split into sub-groups; what changes is the number of total and effective pairs and normalizers like `IDCG`. One can generate more pairs from a large group than from two smaller subgroups. As a result, the obtained gradient is still valid from a theoretical standpoint but might not be optimal. As long as the data partitions within a worker are correctly sorted by query ID, XGBoost can aggregate sample gradients accordingly. Both the (Py)Spark interface and the Dask interface can sort the data according to query ID; please see the respective tutorials for more information.
+
+However, it's possible that a distributed framework shuffles the data during a map-reduce step and splits every query group across multiple workers. In that case, the performance would be disastrous. As a result, whether a sorted groupby is needed depends on the data and the framework.
 
 *******************
 Reproducible Result