- Prepare two binaries
tidb-master
andtidb-4.0
in thebin
directory. - Run
make abtest
.
Print 200 SQL statements randomly after setting @@global.tidb_enable_clustered_index
to true:
func main() {
state := NewState()
state.InjectTodoSQL("set @@global.tidb_enable_clustered_index=true")
gen := NewGenerator(state)
for i := 0; i < 200; i++ {
fmt.Printf("%s;\n", gen())
}
}
Print 200 SQL statements randomly:
make gen count=200
To check/modify the generation rules, see the file sqlgen/start.go
.
This project provides a flexible way to generate SQL strings. Comparing with randgen/go-randgen, it has the following advantages:
-
Good readability. It uses Yacc-style code to describe the grammar. Here is the comparison for a simple 'or' branch:
dmlStmt: query | commonDelete | commonInsert | commonUpdate
dmlStmt = NewFn("dmlStmt", func() Fn { return Or( query, commonDelete, commonInsert, commonUpdate, ) })
-
Flexible state management. It provides full functions of Golang.
Why do we need state management? Because the correctness of the generating SQL is guaranteed by the state.
For example, the index columns in a SQL like
ALTER TABLE t ADD INDEX idx(a, b);
are chosen randomly. This requires the embedded language to store these metadata(including all available table names, all column names in each table, etc.).However, for scripting language like Perl/Lua, this can mess things up if the there are a lot of metadata to track:
T = { cols = {}, col_types = {}, cur_col = nil, indices = {}, } T.next_idx = function() return G.c_index_num.seq:next() end T.next_col = function() return G.c_column_num.seq:next() end T.cols[#T.cols+1] = util.col('c_int', G.c_int.rand) T.cols[#T.cols+1] = util.col('c_str', G.c_str.rand) T.cols[#T.cols+1] = util.col('c_datetime', G.c_datetime.rand) ... add_column: alter table t add column { local new_col_type = T.new_rand_col_types() local col_name = sprintf('col_%d', T.next_col()); T.cols[#T.cols+1] = util.col(col_name, new_col_type.rand) printf('%s %s', col_name, new_col_type.name) }
We need to maintain many arrays/maps carefully. What's more, if there is a syntax error or semantic/logic error, it is hard to debug due to the lack of syntax highlight support and the debug information.
-
Ability to interact with foreign data. For randgen/go-randgen, the generating process is isolated.
Sometimes the users may need to generate SQL dynamically according to the database status or some given complex conditions. It is not convenient to achieve them in a scripting language. On the other hand, managing information in a modular way with Golang is easier.
-
Production: a rule that defines how to generate a part of string. All the productions are described in this way:
prod_name = NewFn("prod_name", func() Fn { /* initialization or preprocess */ /* production body */ return Or( sub_prod1, sub_prod2, ... ) })
Fn
is a representation of production.When it is being evaluated, the sub-productions are also evaluated. Each production builds its own string by using both information about itself and sub-productions. Finally, a string is concatenated in a bottom-up way and returned.
-
Production combinator: A function that accepts zero or more
Fn
s and returns exactly oneFn
. It is used to keep productions readable, there are many combinators that have corresponding notations in Yacc. Here are a few combinators:Or(...prod)
: Chooses one branch in the sub-productions, like notation|
in Yacc.And(...prod)
: Concatenates all the sub-productions.Str(prod)
: Indicates a terminal/leaf production.If(cond, ...prod)
: Only generates the sub-production if the condition is satisfied.OptIf(cond, ...prod)
: Similar toIf()
, the difference is thatOptIf()
will generate a empty string if the condition is not satisfied. As a dummy rule, please useIf()
insideOr()
and useOptIf()
insideAnd
.
-
Production weight: the probability of this branch being selected. It is only meaningful inside the
Or()
combinator. EachFn
has an attributeweight
, one can set it throughSetW()
. For example,return Or( dmlStmt.SetW(12), ddlStmt.SetW(3), splitRegion.SetW(2), commonAnalyze.SetW(1), ... ),
The ratio of selecting probability of
[dmlStmt : ddlStmt : splitRegion : commonAnalyze]
is[12 : 3 : 2 : 1]
. -
State: the place where stores all of the meta information. The database entities are abstracted to the corresponding structs:
Table
,Column
andIndex
.Each of them also exposes convenient functions to retrieve/mutate the data. Here is an example of the most common usage:
createTable = NewFn("createTable", func() Fn { tbl := GenNewTable(state.AllocGlobalID(ScopeKeyTableUniqID)) // ... state.AppendTable(tbl) // <--------------------- mutate // ... return And( Str("create table"), Str(tbl.name), Str("("), definitions, Str(")"), OptIf(rand.Intn(5) == 0, partitionDef), ) })
commonAnalyze = NewFn("commonAnalyze", func() Fn { tbl := state.GetRandTable() // <--------------------- retrieve return And(Str("analyze table"), Str(tbl.name)) })
When we generating a
CREATE TABLE
SQL, a table entity*Table
is also put to the state. The functionGetRandTable
is used to get a random table from the state. -
Scope management: for most productions, the requirement of message passing can be satisfied with nested definitions. For example, we have the production dependency
createTable -> definitions -> idxDefs -> idxDef
. We need to modify the state during generatingidxDef
by appending a new index to the current table(currentTable.Append(newIndex)
). To reference the entitycurrentTable
, we can define them in a nested block:createTable = NewFn("createTable", func() Fn { tbl := GenNewTable(state.AllocGlobalID(ScopeKeyTableUniqID)) // <------ `tbl` referenced state.AppendTable(tbl) // ... definitions = NewFn("definitions", func() Fn { // ... idxDefs = NewFn("idxDefs", func() Fn { return Or( idxDef, And(idxDef, Str(","), idxDefs).SetW(2), ) }) idxDef = NewFn("idxDef", func() Fn { idx := GenNewIndex(state.AllocGlobalID(ScopeKeyIndexUniqID), tbl) tbl.AppendIndex(idx) // -------------------------------------> referencing `tbl` // ...
Advanced usage: one drawback of this approach is the disruption to the readability and reusability. If we really need to reuse a production, the
scoping
functions of state come into rescue. They are:(s *State) CreateScope()
: Create a new scope. Similar to entering the blocks ({}
) in Golang.(s *State) DestroyScope()
: Destroy the innermost scope. Similar to leaving the blocks in Golang.(s *State) Store(key ScopeKeyType, val ScopeObj)
: Attach an entity to the scope.(s *State) Search(key ScopeKeyType) ScopeObj
: Search an entity with a given key.
The
CreateScope
andDestroyScope
can be automated through theProductionListener
provided bygenerater_lib.go
(which is a hook being called before and after theFn
evaluation). Thus, we only need toStore()
the entity in the parent production, thenSearch()
the entity in children productions.
sqlgen
├── db_constant.go
├── db_generator.go # A collection of functions that generate random values, like column types, column values and others
├── db_mutator.go # A collection of functions that **modify** the state
├── db_printer.go # A collection of functions that help printing the target string
├── db_retriever.go # A collection of functions that **read** the state
├── db_transformer.go
├── db_type.go
├── db_util.go
├── declarations.go # Production declarations
├── example_test.go
├── generator_lib.go # Production combinators
├── generator_types.go # Basic struct types
└── start.go # Yacc-style grammar file
Here are some suggestions to keep the code clean:
- Name each function regularly.
- the state-mutate functions start with
Set
/Append
/Remove
; - the state-retrieve functions start with
Get
; - the generating functions have the pattern
Gen...()
; - the format functions have the pattern
Print...()
; - the random entity/value producing functions have the pattern
Rand...()
.
- the state-mutate functions start with
- Put the functions to different files according to their intend.
Assert
the assumptions explicitly.- DRY(Do not Repeat Yourself).
- Avoid passing information through global variables.
For the productions with easy syntax, only 4 steps are required normally:
- Put the declaration to
declarations.go
. - Write the production body. Combinators like
And()
,Or()
andStrs()
may be helpful. - Insert the new production to another existing & used production body.
- Test it in
example_test.go
. One may need to adjust the weight throughSetW()
to increase the probability to generate.
For the productions with a lot of constraints or complex conditions, answering the following questions and coding step by step should help:
- Does this production involve a new entity? If yes, define it in
db_types.go
and provide corresponding retrieve/mutate functions. - Do you expect to control the generating strategy with some arguments? If yes, put it in the config
ControlOption
and don't forget to updateDefaultControlOption()
. - What are the constraints of this production? Consider using
If()
orOptIf()
. If necessary, the Golang keywordif
with multiplereturn
statements may be a good replacement. - What is the message passing flow? Consider using nested definitions over scoping mechanism.
- Does it require complex transformation? If yes, do the transformation work in
db_transformer.go
. - Does it need to be printed in a special way? If yes, do the format work in
db_printer.go
. - Need to make it generate multiple SQLs at a time? Use
InjectTodoSQL()
. It can append custom strings to the queue in the generator. - Does it require the cleanup work? Here is a signature of a hook:
(p *PostListener) Register(fnName string, fn func())
. A custom function can be registered to a production, it will be called after the production is generated.
-
Why use struct instead of function to represent production?
The functions are evaluated immediately when they are called. So it is not possible to reference each other in the block. However, this pattern is not uncommon in Yacc grammar. What's more, it is also inconvenient to intercept the function call.