autodb 2.0.0
The general theme for this version is classes for intermediate results: functional dependencies, schemas, and databases now have fleshed-out classes, with methods to keep them self-consistent. They all have their own constructors, for users to create their own, instead of having to generate them from a given data frame.
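As a rough illustration of building one of these objects by hand, the sketch below constructs a small set of functional dependencies directly. The attribute names are made up, and it assumes each dependency is supplied as a two-element list (determinant set, then dependant) alongside an attrs_order argument; check the package reference for the exact constructor arguments.

```r
library(autodb)

# Hypothetical dependencies over made-up attributes a, b, c.
# Assumed input format: list(determinant set, dependant) per dependency.
fds <- functional_dependency(
  list(
    list(c("a", "b"), "c"),
    list("a", "b")
  ),
  attrs_order = c("a", "b", "c")
)

detset(fds)      # determinant sets, one per dependency
dependant(fds)   # dependants, one per dependency
attrs_order(fds) # the attribute order stored with the object
```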
Breaking changes
- Renamed dfd to discover, to reflect the generalisation to allow the use of other methods. At the moment, this just includes DFD.
- Removed flatten from exported functions, in favour of flattening the functional dependencies in dfd/discover instead. Since flatten was usually called anyway, and its output is more readable since adding a print method for it, there was little reason to keep the old dfd/discover output format, where functional dependencies were grouped by dependant.
- Renamed cross_reference to autoref, to better reflect its purpose as generating foreign key references.
- Renamed normalise to synthesise, to reflect its only creating relation schemas, not foreign key references. The new function named normalise is a wrapper that calls both synthesise and autoref, since in most cases we don't need to do these steps separately. Additionally, ensure_lossless is now an argument for synthesise rather than autoref: this is a more natural place to put it, since synthesise creates relations, and autoref adds foreign key references.
- As noted in improvements, functional dependency objects now have their own subsetting methods. In particular, they have a [[ method, so code that used [[ to extract determinant sets or dependants from functional dependencies will no longer work. These should be extracted with the new detset and dependant functions instead.
- Similarly, the database class has its own subsetting methods, so components must be extracted with records, keys, and so on.
- The database class no longer assigns a parents attribute to each relation, since this duplicates the foreign key reference information given in references.
- The database class no longer has a name attribute. This was only used to name the graph when using the gv function, so it is now an argument for the database method of gv instead, bringing its arguments into line with those of the other methods.
- relationships in database_schema and database objects are now called references, to better reflect their being foreign key constraints, and they are stored in a format that better reflects this: instead of an element for each pair of attributes in a foreign key, there is one element for the whole foreign key, containing all of the involved attributes. Similarly, they are now printed in the format "child.{c1, c2, ...} -> parent.{p1, p2, ...}" instead of "child.c1 -> parent.p1; child.c2 -> parent.p2; ...".
- cross_reference/autoref now defaults to generating more than one foreign key reference per parent-child relation pair, rather than keeping only the one with the first child key by priority order. This can result in some confusion on plots, since references are still plotted one attribute pair at a time.
Improvements
- Added classes and methods for important data structures:
  - Added a functional_dependency class for flattened functional dependency sets. The attributes vector is now stored as an attribute, so that the dependencies can be accessed as a simple list without list subsetting operators. There are also detset, dependant, and attrs_order generic functions for extracting the relevant parts. detset and dependant, in particular, should be useful for the purposes of filtering predicates.
  - Added a relation_schema class for relational schema sets, as returned by synthesise. The attributes and keys are now stored together in a named list, with the attrs_order vector stored as an attribute. As with functional_dependency, this lets the schemas be accessed like a vector. There is also merge_empty_keys for combining schemas with an empty key, and attrs, keys, and attrs_order generic functions for extracting the relevant parts.
  - Added a database_schema class for database schemas, as returned by normalise. This inherits from relation_schema, and has foreign key references as an additional references attribute. There is a merge_empty_keys method that conserves validity of the foreign key references. Additionally, when the names of the contained relation schemas are changed using names<-, the references are changed to use the new names.
  - Added a relation class for vectors of relations containing data. Since a database_schema is just a relation_schema vector with foreign key references added, the relation class was added as the equivalent underlying vector for the database class. A user of the package probably won't need to use it.
  - database is now a wrapper class around relation that adds foreign key references and handles them separately in its methods.
  - All of the above have their own methods for the [, [[, and (except for functional_dependency) $ subsetting operators, along with their replacement equivalents, [<- etc., to allow treating them as vectors of relation schemas or relations. Subsetting also removes any foreign key references in database_schema and database objects that are no longer relevant. These methods prevent the subsetting operators from being used to access the object's internal components, so many of the generic functions mentioned above were written to allow access in a more principled manner, not requiring knowledge of how the structure is implemented.
  - All of the above have a c method for vector-like concatenation. There are two non-trivial aspects to this. Firstly, when concatenating objects with different attrs_order attributes, c merges the orders to keep them consistent, if possible. Secondly, for database_schema and database, foreign key references are changed to reflect any changes made to relation names to keep them unique.
  - All of the above have a unique method for vector-like removal of duplicate schemas / relations. This conserves validity of foreign key references for database_schema and database objects. For relation and database objects, relations are treated as duplicates regardless of the order of their records.
- Added some basic database creation/manipulation generic functions for the above data structures:
  - All of the above have a names<- method for consistently changing relation (schema) names. In particular, for databases and database schemas, this ensures the names are also changed in references.
  - All of the above, except functional_dependency, have a rename_attrs method for renaming the attributes across the whole object. This renames them in all schemas, relations, references, and so on.
  - Added a create generic function, for creating relation and database objects from relation_schema and database_schema objects, respectively. The created objects contain no data. This function is roughly equivalent to CREATE TABLE in SQL, but the vectorised nature of the relation classes means that several tables are created at once.
  - Added an insert generic function for relation and database objects, which takes a data frame of new data, and inserts it into any relation in the object whose attributes are all present in the new data. This is roughly equivalent to SQL's INSERT, but works over multiple relations at once, and means there's now a way to put data into a database outside of decompose. Indeed, decompose is now equivalent to calling create, then calling insert with all the relations.
- Adjusted normalise to prefer to remove dependencies with dependants and determinant sets later in table order, and with larger dependant sets. This brings it more in line with similar decisions made in other package functions.
- Simplified some internals of dfd/discover to improve computation time.
- Added a skip_bijections option to dfd/discover, to speed up functional dependency searches where there are pairwise-equivalent attributes present.
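The create and insert generics described above could be combined along these lines. The data frame is invented, and the sketch assumes insert accepts the database followed by the data frame of new records; it is an illustration of the described workflow, not a reference for the exact signatures.

```r
library(autodb)

# Made-up data frame, used purely for illustration.
df <- data.frame(
  id = 1:4,
  group = c("x", "x", "y", "y"),
  label = c("X", "X", "Y", "Y")
)

schema <- normalise(discover(df, accuracy = 1))

# Roughly CREATE TABLE for every relation at once: a database with no records.
empty_db <- create(schema)

# Roughly INSERT across every relation whose attributes appear in df.
filled_db <- insert(empty_db, df)

# The one-step route that the notes describe as equivalent.
direct_db <- decompose(df, schema)

records(filled_db)
```

Keeping create and insert separate means further data frames can be inserted into the same database later, which decompose alone does not allow.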
Fixes
- Corrected vignette re: when to remove spurious dependencies before.
- Corrected autodb documentation link to page with database format information.
- Corrected df_equiv to work with data.frame columns that are lists.
- Fixed several issues related to doubles / floating-point:
  - Fixed dfd/discover treating similar numeric values as equal, resulting in data frames not being insertable into their own schema.
  - Fixed database checks not handling doubles correctly. Specifically, foreign key reference checks involve merging tables together, and merge operates on doubles with a tolerance that's set within an internal method, so merges can create duplicates that need to be removed afterwards.
  - Similarly, fixed rejoin in the case where merges are based on doubles, sometimes resulting in duplicates.
- Fixed normalise's return output to be invariant to the given order of the functional_dependency input.
- Fixed normalise returning relations with attributes in the wrong order in certain cases where remove_avoidable = TRUE.
- Fixed gv giving Graphviz code that could result in incorrect diagrams: relation and attribute names were converted to lower case, and not checked for uniqueness afterwards. This could result in incorrect foreign key references being drawn. The fix also accounts for a current bug in Graphviz, where edges between HTML-style node ports ignore case for the port labels.
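The floating-point fixes above concern values that are distinct as doubles but equal under a tolerance. The base R snippet below is a generic illustration of that distinction, not the package's internal logic: the variable names are made up, and it shows only that exact and tolerant comparison can disagree on the same pair of values.

```r
# Two doubles that differ only by floating-point rounding error.
x <- 0.1 + 0.2
y <- 0.3

# Exact comparison treats them as different values.
exactly_equal <- x == y

# Tolerance-based comparison treats them as the same value.
tolerantly_equal <- isTRUE(all.equal(x, y))

exactly_equal    # FALSE
tolerantly_equal # TRUE
```

If one step treats such values as equal and another does not, the two steps disagree about which records are duplicates, which is the kind of inconsistency these fixes address.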