This repository has been archived by the owner on Jul 2, 2024. It is now read-only.

Obrok/fuzzer redux #3062

Merged · 22 commits into master · Sep 6, 2018

Conversation

@obrok (Contributor) commented Sep 5, 2018

I want to make it so that most fuzzer queries can be run against a data source, as opposed to failing at compile time. I think some large changes are needed to that end:

  1. Generation of some structures must go bottom-up. In particular, the subqueries for a given query need to be figured out before that query can be generated.
  2. Generation of some other structures must go top-down. For example, some expressions are forbidden in subqueries, so we would like to know whether we're in the context of a top-level query or a subquery when generating those.
  3. Currently the fuzzer generates either incredibly complex queries with many subqueries and very complex expressions, or very simple queries. I want to take direct control of the complexity of what's being generated, such that the amount of complexity "allowed" is "invested" in various ways: either in building more complex expressions or in a more complex overall query structure. This should make the fuzzer generate queries of medium complexity, making it more likely that a query will be both useful and able to run.
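The complexity-budget idea in point 3 can be sketched as follows (a minimal standalone sketch with hypothetical names, not this PR's actual implementation): each generator spends part of its budget at the current level and passes the remainder down to its children, so the total amount of structure is bounded by the initial budget.

```elixir
defmodule BudgetSketch do
  # Hypothetical sketch: not the PR's code, just the budget-splitting idea.
  # Base case: the budget is exhausted, emit the simplest possible query.
  def query(budget) when budget <= 1, do: %{select: [:user_id], from: :users}

  def query(budget) do
    # Spend one unit on this level, then split the remainder between
    # the subquery and the WHERE expression, so depth stays bounded.
    sub = div(budget - 1, 2)
    %{select: [:user_id], from: query(sub), where: expr(budget - 1 - sub)}
  end

  # Expressions consume budget the same way: each node costs one unit.
  defp expr(budget) when budget <= 1, do: {:constant, 1}
  defp expr(budget), do: {:+, [expr(budget - 1), {:constant, 1}]}
end
```

A budget of 1 yields a flat query, while larger budgets are invested partly in nesting and partly in expression size, which is exactly the "medium complexity" trade-off described above.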

Achieving all of this with stream_data is somewhat tricky. The size parameter, which is supposed to guide generator complexity, is difficult to manage properly. It's also somewhat cumbersome to generate data when both a top-down and a bottom-up approach are needed. Because of this I'm trying a hand-crafted approach.

As a further note on stream_data: I was hoping we'd be able to use its minimization features to identify a small failing example with much less human intervention. However, because of a complex-looking issue in the lib (whatyouhide/stream_data#97), this didn't pan out. It seems it should be possible to write a hand-crafted example-minimization algorithm, but, of course, we don't need stream_data for that.

The version in this PR generates queries that don't really reference data, except for selecting user ids. However, it already uses most of the high-level query features (WHERE, HAVING, etc.) and retains an over-95% successful query rate (only querying Postgres for now), despite generating deeply nested subqueries. Even this simple version has already uncovered three issues (#3049, #3061, #3059), so I think it can be called a small success.

obrok added 19 commits August 31, 2018 18:11

@aircloak-robot (Collaborator):

Standard tests have passed 🎉

@obrok obrok requested a review from sasa1977 September 5, 2018 17:14
sasa1977
sasa1977 previously approved these changes Sep 6, 2018
|> Function.type_specs()
|> Enum.map(fn {argument_types, return_type} ->
{function, argument_types, return_type}
defmacrop generate(complexity, {:%{}, line, options}) do
sasa1977 (Contributor):

line doesn't seem to be used. You could also use a list of tuples instead of a map, and then you wouldn't need to depend on the quoted format (because a quoted list of pairs is a list of pairs).
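The parenthetical point can be illustrated with a standalone sketch (hypothetical `foo`/`bar` calls, unrelated to the PR's code): a quoted list of 2-tuples already has the shape of a list of pairs, while a quoted map arrives wrapped in an AST node.

```elixir
# A quoted 2-tuple is represented as a 2-tuple of quoted elements, so a
# quoted list of pairs can be pattern matched like an ordinary list:
quoted_list = quote do: [{1, foo()}, {2, bar()}]
[{1, _}, {2, _}] = quoted_list

# A quoted map, by contrast, arrives as an {:%{}, meta, pairs} AST node,
# which is the quoted format the macro has to depend on:
quoted_map = quote do: %{1 => foo(), 2 => bar()}
{:%{}, _meta, [{1, _}, {2, _}]} = quoted_map
```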

obrok (Author):

Fixed later. I didn't like the extra parens required in case of a list of tuples, so I settled on this format.

sasa1977 (Contributor):

Perhaps a more important question is why this needs to be a macro. Macros have all kinds of weird semantics, and I can't immediately tell what the gains are here.

obrok (Author):

I wouldn't make it a macro, except for the need for lazy evaluation of the options. The alternative would be to make it a function with calling syntax something like:

frequency(complexity, [
  {1, fn -> first_option(complexity) end},
  {1, fn -> second_option(complexity / 2, complexity / 2) end}
])

I had that in the beginning, but it looks rather messy with the extra fns.
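A minimal sketch of the macro approach being discussed (hypothetical names, not this PR's actual code): the macro rewrites each map value into a fn at expansion time, so the call site keeps the clean map syntax while still getting lazy evaluation. Weighting and the complexity cap are elided to keep the sketch short.

```elixir
defmodule LazyFrequency do
  # Hypothetical sketch of the lazy-evaluation macro. Each option
  # expression is wrapped in a fn at expansion time, so only the
  # branch that actually gets picked is evaluated at runtime.
  defmacro generate(complexity, {:%{}, _line, options}) do
    lazy_options =
      Enum.map(options, fn {weight, expr} ->
        {weight, quote(do: fn -> unquote(expr) end)}
      end)

    quote do
      LazyFrequency.pick(unquote(complexity), unquote(lazy_options))
    end
  end

  # Picks one of the weighted thunks and forces it. A real version
  # would respect the weights and the complexity budget.
  def pick(_complexity, options) do
    {_weight, thunk} = Enum.random(options)
    thunk.()
  end
end
```

The trade-off is exactly the one debated here: the call site avoids the extra fns, at the cost of depending on the quoted map format.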

end
defp generate_scaffold(tables, complexity) do
generate(complexity, %{
3 => %Scaffold{from: Enum.random(tables), complexity: complexity},
sasa1977 (Contributor):

What are these 1,2,3 keys?

obrok (Author):

These are weights associated with the options. So in this case the first option is 3 times more likely than the other two. Furthermore if complexity is low, only the earlier options will be chosen. Here, if complexity is 4, it will only ever choose the first or second option.
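One plausible reading of that rule as standalone code (a hypothetical helper, not the PR's implementation): options are kept, in order, while their cumulative weight still fits within the complexity budget, and the pick among the surviving options is proportional to their weights. With weights 3, 1, 1 and complexity 4, only the first two options survive, matching the description above.

```elixir
defmodule WeightedPick do
  # Hypothetical sketch of weighted selection gated by complexity.
  # Assumes the budget covers at least the first option's weight.
  def pick(complexity, weighted_options) do
    # Keep options, in order, while their cumulative weight fits the budget.
    {allowed, _sum} =
      Enum.flat_map_reduce(weighted_options, 0, fn {weight, option}, acc ->
        if acc + weight <= complexity,
          do: {[{weight, option}], acc + weight},
          else: {[], acc + weight}
      end)

    # Standard weighted random pick among the surviving options.
    total = allowed |> Enum.map(&elem(&1, 0)) |> Enum.sum()
    roll = :rand.uniform(total)

    Enum.reduce_while(allowed, roll, fn {weight, option}, left ->
      if left <= weight, do: {:halt, option}, else: {:cont, left - weight}
    end)
  end
end
```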

sasa1977 (Contributor):

Maybe this could be made more explicit if you used a list of tuples instead of a map? Something like:

[
  {%Scaffold{...}, weight: 3},
  # ...
]

sasa1977 (Contributor):

The current map representation means we can't specify multiple options with the same likelihood. Is that intended?

obrok (Author):

We can, because it goes through the macro.

sasa1977 (Contributor):

Hahaha, you're right, but I gotta say that's quite hacky and confusing. If I ever saw code like

%{
  1 => foo(),
  1 => bar()
}

I'd be very puzzled.

obrok (Author):

OK, I'll change it. WDYT about using the macro to make them evaluate lazily?

sasa1977 (Contributor):

> WDYT about using the macro to make them evaluate lazily?

I'm not completely convinced whether it's more helpful or distracting, but it's fairly simple, so I'll leave that decision up to you.

obrok (Author):

In that case I'll leave it as-is for now. It's just a mechanical task to change this later if it proves problematic for some reason.


defp group_by_elements({:select, _, items}) do
items
|> Enum.with_index()
sasa1977 (Contributor):

You could also do |> Enum.with_index(1), and then you wouldn't need + 1 two lines below.

obrok (Author):

Smart!
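For reference, the second argument of `Enum.with_index/2` is the starting offset:

```elixir
# Starting the index at 1 removes the need for a later + 1.
Enum.with_index([:a, :b, :c], 1)
#=> [a: 1, b: 2, c: 3]
```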

})
end

defp simple_condtion(), do: {:=, nil, [constant(), constant()]}
sasa1977 (Contributor):

small typo in condtion

@aircloak-robot (Collaborator):

Pull request can be merged 🎉

@aircloak-robot (Collaborator):

Standard tests have passed 🎉

@obrok (Author) commented Sep 6, 2018:

Ready for review

@aircloak-robot (Collaborator):

Pull request can be merged 👍

@obrok obrok merged commit 46f700f into master Sep 6, 2018
@obrok obrok deleted the obrok/fuzzer-redux branch September 6, 2018 11:50

3 participants