This repository has been archived by the owner on Jul 2, 2024. It is now read-only.

Obrok/fuzzer redux #3062

Merged · 22 commits into master · Sep 6, 2018

Conversation

@obrok (Contributor) commented Sep 5, 2018

I want to make it so that most fuzzer queries can be run against a data source, as opposed to failing at compile time. I think some large changes are needed to that end:

  1. Generation of some structures must go bottom-up. In particular, the subqueries for a given query need to be figured out before that query can be generated.
  2. Generation of some other structures must go top-down. For example, some expressions are forbidden in subqueries, so we would like to know whether we're in the context of a top-level query or a subquery when generating those.
  3. Currently the fuzzer generates either incredibly complex queries with many subqueries and very complex expressions, or very simple queries. I want to take direct control of the complexity of what's being generated, such that the amount of complexity "allowed" is "invested" in various ways: either in building more complex expressions or in a more complex overall query structure. This should make the fuzzer generate queries of medium complexity, making it more likely that a query will be both useful and able to run.
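The complexity-budget idea in point 3 can be sketched as follows (a minimal standalone sketch with hypothetical names, not this PR's actual implementation): each generator spends part of its budget at the current level and passes the remainder down to its children, so the total amount of structure is bounded by the initial budget.

```elixir
defmodule BudgetSketch do
  # Hypothetical sketch: not the PR's code, just the budget-splitting idea.
  # Base case: the budget is exhausted, emit the simplest possible query.
  def query(budget) when budget <= 1, do: %{select: [:user_id], from: :users}

  def query(budget) do
    # Spend one unit on this level, then split the remainder between
    # the subquery and the WHERE expression, so depth stays bounded.
    sub = div(budget - 1, 2)
    %{select: [:user_id], from: query(sub), where: expr(budget - 1 - sub)}
  end

  # Expressions consume budget the same way: each node costs one unit.
  defp expr(budget) when budget <= 1, do: {:constant, 1}
  defp expr(budget), do: {:+, [expr(budget - 1), {:constant, 1}]}
end
```

A budget of 1 yields a flat query, while larger budgets are invested partly in nesting and partly in expression size, which is exactly the "medium complexity" trade-off described above.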

Achieving all of this with stream_data is somewhat tricky. The size parameter, which is supposed to guide generator complexity, is difficult to manage properly. It's also somewhat cumbersome to generate data when both a top-down and a bottom-up approach are needed. Because of this I'm trying a hand-crafted approach.

As a further note on stream_data: I was hoping we'd be able to use its minimization features to identify a small failing example with much less human intervention. However, because of a complex-looking issue in the lib (whatyouhide/stream_data#97), this didn't pan out. It seems it should be possible to write a hand-crafted example-minimization algorithm, but, of course, we don't need stream_data for that.

The version in this PR generates queries that don't really reference data, except for selecting user ids. However, it already uses most of the high-level query features (WHERE, HAVING, etc.) and retains an over-95% successful query rate (only querying Postgres for now), despite generating deeply nested subqueries. Even this simple version has already uncovered three issues (#3049, #3061, #3059), so I think it can be called a small success.

obrok added 19 commits August 31, 2018 18:11

@aircloak-robot (Collaborator):

Standard tests have passed 🎉

@obrok obrok requested a review from sasa1977 September 5, 2018 17:14
sasa1977
sasa1977 previously approved these changes Sep 6, 2018
|> Function.type_specs()
|> Enum.map(fn {argument_types, return_type} ->
{function, argument_types, return_type}
defmacrop generate(complexity, {:%{}, line, options}) do
sasa1977 (Contributor):

line doesn't seem to be used. You could also use a list of tuples instead of a map, and then you wouldn't need to depend on the quoted format (because a quoted list of pairs is a list of pairs).
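The parenthetical point can be illustrated with a standalone sketch (hypothetical `foo`/`bar` calls, unrelated to the PR's code): a quoted list of 2-tuples already has the shape of a list of pairs, while a quoted map arrives wrapped in an AST node.

```elixir
# A quoted 2-tuple is represented as a 2-tuple of quoted elements, so a
# quoted list of pairs can be pattern matched like an ordinary list:
quoted_list = quote do: [{1, foo()}, {2, bar()}]
[{1, _}, {2, _}] = quoted_list

# A quoted map, by contrast, arrives as an {:%{}, meta, pairs} AST node,
# which is the quoted format the macro has to depend on:
quoted_map = quote do: %{1 => foo(), 2 => bar()}
{:%{}, _meta, [{1, _}, {2, _}]} = quoted_map
```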

obrok (Author):

Fixed later. I didn't like the extra parens required in case of a list of tuples, so I settled on this format.

sasa1977 (Contributor):

Perhaps a more important question is why this needs to be a macro. Macros have all kinds of weird semantics, and I can't immediately tell what the gains are here.

obrok (Author):

I wouldn't make it a macro, except for the need for lazy evaluation of the options. The alternative would be to make it a function with calling syntax something like:

frequency(complexity, [
  {1, fn -> first_option(complexity) end},
  {1, fn -> second_option(complexity / 2, complexity / 2) end}
])

I had that in the beginning, but it looks rather messy with the extra fns.
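A minimal sketch of the macro approach being discussed (hypothetical names, not this PR's actual code): the macro rewrites each map value into a fn at expansion time, so the call site keeps the clean map syntax while still getting lazy evaluation. Weighting and the complexity cap are elided to keep the sketch short.

```elixir
defmodule LazyFrequency do
  # Hypothetical sketch of the lazy-evaluation macro. Each option
  # expression is wrapped in a fn at expansion time, so only the
  # branch that actually gets picked is evaluated at runtime.
  defmacro generate(complexity, {:%{}, _line, options}) do
    lazy_options =
      Enum.map(options, fn {weight, expr} ->
        {weight, quote(do: fn -> unquote(expr) end)}
      end)

    quote do
      LazyFrequency.pick(unquote(complexity), unquote(lazy_options))
    end
  end

  # Picks one of the weighted thunks and forces it. A real version
  # would respect the weights and the complexity budget.
  def pick(_complexity, options) do
    {_weight, thunk} = Enum.random(options)
    thunk.()
  end
end
```

The trade-off is exactly the one debated here: the call site avoids the extra fns, at the cost of depending on the quoted map format.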

end
defp generate_scaffold(tables, complexity) do
generate(complexity, %{
3 => %Scaffold{from: Enum.random(tables), complexity: complexity},
sasa1977 (Contributor):

What are these 1,2,3 keys?

obrok (Author):

These are weights associated with the options. So in this case the first option is 3 times more likely than the other two. Furthermore if complexity is low, only the earlier options will be chosen. Here, if complexity is 4, it will only ever choose the first or second option.
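One plausible reading of that rule as standalone code (a hypothetical helper, not the PR's implementation): options are kept, in order, while their cumulative weight still fits within the complexity budget, and the pick among the surviving options is proportional to their weights. With weights 3, 1, 1 and complexity 4, only the first two options survive, matching the description above.

```elixir
defmodule WeightedPick do
  # Hypothetical sketch of weighted selection gated by complexity.
  # Assumes the budget covers at least the first option's weight.
  def pick(complexity, weighted_options) do
    # Keep options, in order, while their cumulative weight fits the budget.
    {allowed, _sum} =
      Enum.flat_map_reduce(weighted_options, 0, fn {weight, option}, acc ->
        if acc + weight <= complexity,
          do: {[{weight, option}], acc + weight},
          else: {[], acc + weight}
      end)

    # Standard weighted random pick among the surviving options.
    total = allowed |> Enum.map(&elem(&1, 0)) |> Enum.sum()
    roll = :rand.uniform(total)

    Enum.reduce_while(allowed, roll, fn {weight, option}, left ->
      if left <= weight, do: {:halt, option}, else: {:cont, left - weight}
    end)
  end
end
```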

sasa1977 (Contributor):

Maybe this could be made more explicit if you used a list of tuples instead of a map? Something like:

[
  {%Scaffold{...}, weight: 3},
  # ...
]

sasa1977 (Contributor):

The current map representation means we can't specify multiple options with the same likelihood. Is that intended?

obrok (Author):

We can, because it goes through the macro.

sasa1977 (Contributor):

Hahaha, you're right, but I gotta say that's quite hacky and confusing. If I ever saw code like

%{
  1 => foo(),
  1 => bar()
}

I'd be very puzzled.

obrok (Author):

OK, I'll change it. WDYT about using the macro to make them evaluate lazily?

sasa1977 (Contributor):

> WDYT about using the macro to make them evaluate lazily?

I'm not completely convinced whether it's more helpful or distracting, but it's fairly simple, so I'll leave that decision up to you.

obrok (Author):

In that case I'll leave it as-is for now. It's just a mechanical task to change this later if it proves problematic for some reason.


defp group_by_elements({:select, _, items}) do
items
|> Enum.with_index()
sasa1977 (Contributor):

You could also do |> Enum.with_index(1), and then you wouldn't need + 1 two lines below.

obrok (Author):

Smart!
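For reference, the second argument of `Enum.with_index/2` is the starting offset:

```elixir
# Starting the index at 1 removes the need for a later + 1.
Enum.with_index([:a, :b, :c], 1)
#=> [a: 1, b: 2, c: 3]
```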

})
end

defp simple_condtion(), do: {:=, nil, [constant(), constant()]}
sasa1977 (Contributor):

small typo in condtion

@aircloak-robot (Collaborator):

Pull request can be merged 🎉

@aircloak-robot (Collaborator):

Standard tests have passed 🎉

@obrok (Author) commented Sep 6, 2018:

Ready for review

@aircloak-robot (Collaborator):

Pull request can be merged 👍

@obrok obrok merged commit 46f700f into master Sep 6, 2018
@obrok obrok deleted the obrok/fuzzer-redux branch September 6, 2018 11:50

3 participants