Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Selector Upgrade #7101

Open
wants to merge 11 commits into
base: main
Choose a base branch
from
Open

Selector Upgrade #7101

wants to merge 11 commits into from

Conversation

acurtis-evi
Copy link

@acurtis-evi acurtis-evi commented Mar 1, 2023

resolves #5009

Description

There are some really cool features in DBT such as the ability to choose models and their descendents, graph ordered builds, and much more. However, I found that some of the syntax seemed to be lacking (particularly around the selectors yml syntax).

The definition part of the selector yaml supports some model selection operators, but requires that the end user use a different approach to describe what can be done in short-hand on the command line. For example, on the command line, I can select the children of a model using

dbt ls --select some_model+

In the yaml,

- name: some_selector
  description:
    - some_model+

does not work and instead

- name: some_selector
  description:
    - method: fqn
       value: some_model
       children: true    

must be used.

The yaml is very useful in defining routine selections. This inconsistency with the command line forces the end user to learn two approaches.

Furthermore, there is support from the command line to union and intersect using the --select operator. However, if one wishes to exclude, the separate --exclude operator is needed. The syntax is very straight forward for the command line approach, but again the yaml approach is different. I see value in the yaml approach. I just believe that the two should be compatible. In this PR, I propose that the --select can include NOT (all caps - but easy to change) as a way of embedding excluded models in a single string.

dbt ls --select 'selector:some_selector+1 NOT selector:some_selector'

In addition, dbt doesn't allow selectors to be expanded using the children/parent/ancestor operators. This PR addresses that and makes selectors just like any other method.

Syntax in the yaml is also supported

- definition: selector:some_other_selector+1 NOT some_model+2 some_other_model
    name: some_selector

Most of the code removes the selectors dictionary from being passed around and instead opts to pass the project to the graph/cli get_selected and get_graph_queue methods.

Checklist

@acurtis-evi acurtis-evi requested a review from a team March 1, 2023 20:32
@acurtis-evi acurtis-evi requested a review from a team as a code owner March 1, 2023 20:32
@acurtis-evi acurtis-evi requested review from stu-k and gshank March 1, 2023 20:32
@cla-bot cla-bot bot added the cla:yes label Mar 1, 2023
return self.get_selected(self.project.get_selector(spec.value))
else:
method = self.get_method(spec.method, spec.method_arguments)
return set(method.search(included_nodes, spec.value))

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The self.project.get_selector is the only place that checks if selectors actually exist

sel_def = definition.get("value")
if sel_def not in selector_dict:
raise DbtSelectorsError(f"Existing selector definition for {sel_def} not found.")
return selector_dict[definition["value"]]["definition"]
return definition
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only checked in graph/selector.py (or config/project.py)

definition = cls.parse_from_definition(
selector["definition"], selector_dict=deepcopy(selector_dict)
selector_dict[sel_name]["definition"] = cls.parse_from_definition(
selector["definition"]
)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The selector_dict doesn't initially resolve the selectors. This is left until the end to allow for it to be more dynamic

@@ -33,28 +33,64 @@ def parse_union(
# turn ['a b', 'c'] -> ['a', 'b', 'c']
raw_specs = itertools.chain.from_iterable(r.split(" ") for r in components)
union_components: List[SelectionSpec] = []
exclude_components: List[SelectionSpec] = []
in_exclude = False

flags = get_flags()
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allow for --select 'some_expression --exclude some_other_expression', having it be one string allows for the yaml syntax to support union/intersection/exclude in a single string

sel_def = definition.get("value")
if sel_def not in result:
raise DbtValidationError(f"Existing selector definition for {sel_def} not found.")
return result[definition["value"]]["definition"]
elif "method" in definition and "value" in definition:
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to graph/selector.py / config/project.py

@@ -45,6 +45,7 @@ class MethodName(StrEnum):
Metric = "metric"
Result = "result"
SourceStatus = "source_status"
Selector = "selector"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "selector" is added without a corresponding class

@@ -58,7 +58,7 @@ def __init__(self, args, config, manifest):
def _iterate_selected_nodes(self):
selector = self.get_node_selector()
spec = self.get_selection_spec()
nodes = sorted(selector.get_selected(spec))
nodes = sorted(selector.get_selected(spec, self.config))
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_selected needs the project to call get_selector

@@ -125,7 +125,7 @@ def defer_to_manifest(self, adapter, selected_uids: AbstractSet[str]):
def get_graph_queue(self) -> GraphQueue:
selector = self.get_node_selector()
spec = self.get_selection_spec()
return selector.get_graph_queue(spec)
return selector.get_graph_queue(spec, self.config)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_graph_queue needs the project to call get_selector

@acurtis-evi acurtis-evi requested a review from a team as a code owner March 1, 2023 21:26
@acurtis-evi
Copy link
Author

acurtis-evi commented Mar 2, 2023

The --exclude operator was meant to work like a NOT operator like ...

dbt ls --select 'selector:some_selector+1 NOT selector:some_selector,... something_else'

where everything to the right of NOT would be used to exclude from the left.

The reason to choose --exclude was that it was very unlikely that it would conflict with another model and break anything. Also, the cli could just append --exclude to the --select.

With the selector being added, it effectively allows for parenthetical support and very advanced expressions.

The exact syntax could easily change if it is desired.

@acurtis-evi
Copy link
Author

@gshank and @stu-k - I believe this is ready from a functional perspective. I've never actually contributed anything to DBT, so not sure what else should be done

@jtcohen6
Copy link
Contributor

jtcohen6 commented Mar 3, 2023

@acurtis-evi Very cool! :)

I definitely want to take a look at this, and play around with some of the proposed additions from a UX perspective, before moving into more-detailed code review with the dev team. (@dbeatty10 too, if you're able & interested!)

@dbeatty10
Copy link
Contributor

The --exclude operator was meant to work like a NOT operator like ...

dbt ls --select 'selector:some_selector+1 NOT selector:some_selector,... something_else'
where everything to the right of NOT would be used to exclude from the left.

Just to make sure I'm following, this is a set difference, right?

A quick brainstorm of ideas relating to the syntax:

  • NOT
  • not
  • exclude
  • minus
  • except
  • - (minus sign)
  • (Set Minus Unicode U+2216)
  • \ (backslash)

@acurtis-evi
Copy link
Author

It is a set difference. In this PR, the set difference part is isolated to one function parse_union in dbt/core/graph/cli.py.

The idea is that it would work like

dbt ls --select 'selector:some_selector+1' --exclude 'selector:some_selector,... something_else'

The other ideas in this PR are more powerful, particularly the ability to allow the selector yaml syntax to be identical to command line syntax and to be able to expand upon selectors themselves (selector:some_selector+1 for example)

@acurtis-evi
Copy link
Author

I think EXCEPT might be a better name than NOT

@acurtis-evi
Copy link
Author

Changed NOT => EXCEPT

Support using --selector + --select / --exclude at the same time. A new selector named arg_selector is introduced which is the combination of --select/--exclude and if --selector is specified, that remains the primary action.

dbt ls --selector arg_selector --select some_model

@acurtis-evi
Copy link
Author

There is also support for the arg_selector in the selectors.yml. This PR moves the selectors.yml validation to runtime, allowing for selectors to be out of order and to refer to the arg_selector. It is possible to have cycles with this approach, but I tend to think that the benefit of the selection flexibility significantly outweighs the downside of it.

@acurtis-evi acurtis-evi mentioned this pull request Mar 8, 2023
6 tasks
@jtcohen6 jtcohen6 self-assigned this Mar 22, 2023
@jtcohen6 jtcohen6 added the Refinement Maintainer input needed label Jun 25, 2023
@gshank gshank removed request for gshank and stu-k July 19, 2023 14:37
Copy link
Contributor

This PR has been marked as Stale because it has been open with no activity as of late. If you would like the PR to remain open, please comment on the PR or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Jan 20, 2024
@mroy-seedbox
Copy link

It would be nice to be able to run selectors minus one or two models.

Right now, the only way to do so is to define another selector, which is a cumbersome process in some environment (but would be very simple on the command line).

@github-actions github-actions bot removed the stale Issues that have gone stale label Jan 22, 2024
@dbeatty10 dbeatty10 added the community This PR is from a community member label Mar 22, 2024
@mroy-seedbox
Copy link

Alternative (and perhaps simpler) implementation proposed here: #10992.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla:yes community This PR is from a community member Refinement Maintainer input needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CT-469] Combine --select/--exclude with --selector when used together?
4 participants