Suggestions for simple modifications

ghalliday edited this page Dec 9, 2011 · 4 revisions

New to the system, and looking for something small and/or fairly self-contained to work on?

The following is a list of potential changes which will hopefully fit the bill. Please feel free to ask for more details.

##Code generator

  • Clean up execution of the eclcc code generator regression suite on Linux.

  • gh-467 A new dataset operator for extracting subsets of a stream

  • New operator: DATASET(count, transform(COUNTER)).

  • Optimize NORMALIZE(dataset(row), , transform) to use DATASET(count, transform)

  • Add a user flag to keyed joins to indicate order does not need to be preserved.

  • Better BETWEEN code. Sometimes the test expression is calculated twice - e.g., EXP(SUM(values, LN(prob))) between 0.1 and 0.9

  • Optimize generation of return exists(...).
    Simplest way would be to have a special target which when you assigned to it generated a return - but you need to be careful about assigning twice then.

  • Constant fold SORT(constant-inline-dataset)

  • Optimize comparisons of utf8/unicode against blank strings, and use the rtlCompareStrBlank() in more situations.

  • Optimize if (count(ds)>0, ds[1].field, ) to ds[1].field

  • Allow link counted child rows as well as child datasets.

  • Special case child datasets with a maxcount of 1 and just store a pointer.

  • Optimize code generated for EXISTS(JOIN(a,b,cond,all)) when evaluated inline on child datasets inside a transform.

  • Optimize COUNT(DATASET(myset, rec)) to COUNT(myset)

  • Combine aggregations. E.g.,

cnt1 := COUNT(ds(filter1));
cnt2 := COUNT(ds(filter2));
becomes
agg := TABLE(ds, {cnt1 := COUNT(GROUP, filter1); cnt2 := COUNT(GROUP, filter2); });
cnt1 := agg[1].cnt1;
cnt2 := agg[1].cnt2;

The advantage is that the counts will be done directly on the disk buffer without creating records and splitting - and it will also reduce multithreading overhead.
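As a rough illustration of why the combined form is cheaper, the C++ sketch below computes both counts in a single pass over the input instead of scanning it once per filter. It is not what eclcc generates: the `Row` layout, the two filter functions and `countBoth` are hypothetical names invented for this example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical record layout, for illustration only.
struct Row {
    int a;
};

// Holds the results that the ECL above calls cnt1 and cnt2.
struct Counts {
    std::size_t cnt1 = 0;
    std::size_t cnt2 = 0;
};

// Hypothetical stand-ins for filter1 and filter2.
static bool filter1(const Row& r) { return r.a > 10; }
static bool filter2(const Row& r) { return r.a % 2 == 0; }

// Single pass over the input: each row is read once and both
// counters are updated, rather than scanning the data twice.
Counts countBoth(const std::vector<Row>& ds) {
    Counts c;
    for (const Row& r : ds) {
        if (filter1(r)) ++c.cnt1;
        if (filter2(r)) ++c.cnt2;
    }
    return c;
}
```

The same shape extends to any number of filtered aggregates, as long as they all read the same input.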

  • Use a thread variable for the bcd stack and remove the bcd critical block.

  • Finish support for utf8 fields.

  • Introduce a compressed archive format (with an appropriate magic header).

  • Implement an IEclSourceCollection that links to libarchive to allow building directly from compressed tar files etc.

  • Restrict hqlfold to only fold registered plugins.

  • Optimize AGGREGATE(ds, SELF.x := RIGHT.x & LEFT.x) to directly modify the target record.

  • Allow main module to be split over multiple C++ files. (Could be useful for compile times on very large queries.)

  • Cache the row allocators in the xml transformation classes. (Will require onCreate() to be added.)

  • Add a PROJECT (?) option on a JOIN to indicate it is worth pre-projecting any complex join fields. (Note projecting a guarded join condition could slow it down significantly, so may only apply to first, or need to be configurable.)
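To make the BETWEEN item above concrete, here is a minimal C++ sketch of the difference between expanding the range test naively and evaluating the test expression once into a temporary. It only illustrates the idea and is not the code eclcc emits; `productViaLogs`, `betweenNaive` and `betweenOnce` are hypothetical names.

```cpp
#include <cmath>
#include <vector>

// Hypothetical stand-in for an expensive test expression such as
// EXP(SUM(values, LN(prob))).
double productViaLogs(const std::vector<double>& probs) {
    double sumLogs = 0.0;
    for (double p : probs) sumLogs += std::log(p);
    return std::exp(sumLogs);
}

// Naive expansion of "x BETWEEN lo AND hi": the test expression
// appears in both comparisons, so it may be evaluated twice.
bool betweenNaive(const std::vector<double>& probs, double lo, double hi) {
    return productViaLogs(probs) >= lo && productViaLogs(probs) <= hi;
}

// Preferred shape: evaluate the expression once, then compare the
// temporary against both bounds.
bool betweenOnce(const std::vector<double>& probs, double lo, double hi) {
    const double value = productViaLogs(probs);
    return value >= lo && value <= hi;
}
```

Both functions return the same result; the second simply guarantees a single evaluation of the expensive expression.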

More complex:

  • Better support for C++ definitions - allow dependencies on other attributes, and on other C++ files/libraries.

  • Finish work on allowing conditional statements in graphs, and enable.

  • Support an ecl through-pipe activity, with streaming input(s) and output(s?)

  • Revisit the packing option, and the option to auto-pack fields. (Always add pack in implicit project if the field order isn't fixed.)

  • Better processing of the dataset format. E.g., don't deserialize on slaves or for disk-read->output.

  • Make dataset size-field configurable.

  • Option to only use link counted child rows on datasets with elements above a certain size (e.g., sizeof(void *) bytes)?

  • Add a transform to track which sort orders/distributions are actually used, and then tag activities (e.g., keyed joins) or remove them if the sort/distribution isn't required.

  • Allow datasets and strings etc to configure whether they are prefixed with a count/length or size.
    Main complication is the number of places that would need to be changed.

  • Allow length/size for a dataset/string to be separated from the data.
    Would improve packing and alignment, but the representation is tricky.

  • Add an attribute to all activities (especially piperead, random, user-functions) which allows them to reference an expression that indicates the scope they should be evaluated in. It may need to be an operator (in addition?).

  • Optimize multiple aggregates on the same dataset - e.g., count(ds(a=x)), count(ds(a=y)) into a single loop when done inline.

  • Better resourcing of inline dataset operations

  • Minimize the data sent to roxie slaves, e.g., for indexread/keyed join, by generating a separate slave helper.
    It would also have the benefit of making the master helper "colocal" (in the same memory space as the owner activity).

  • Implement link counted strings, and switch all temporary strings over to using them.
    Will cause incompatibilities with existing plugins.

  • Implement a costing algorithm for IHqlExpressions

  • Implement some kind of MAP type, and use a hash table lookup for "x in MAP(...)"

  • Optimize order of filters. Costing is a prerequisite.

  • Use the expression costing to implicitly add ,PROJECT to a JOIN

  • Expand implicit project code to work on child records.

  • Lightweight grouped self-join which performs an all self-join on each input group. (May require minimal work in engines.)

  • Allow much more flexibility in out-of-line user-functions.
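For the MAP item above, the sketch below contrasts a linear membership test with a hash table lookup in C++. It assumes the set of values is fixed when the table is built; `inLinear` and `MembershipTable` are hypothetical names, and nothing here reflects an existing ECL type or runtime helper.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Linear scan: cost grows with the number of values in the list.
bool inLinear(const std::vector<std::string>& values, const std::string& x) {
    for (const std::string& v : values) {
        if (v == x) return true;
    }
    return false;
}

// Hash table lookup: the table is built once, after which each
// membership test takes constant time on average.
class MembershipTable {
public:
    explicit MembershipTable(const std::vector<std::string>& values)
        : table_(values.begin(), values.end()) {}

    bool contains(const std::string& x) const {
        return table_.find(x) != table_.end();
    }

private:
    std::unordered_set<std::string> table_;
};
```

The hash table only pays off when the same set of values is tested many times, since building it costs one pass over the values.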

##PARSE

  • Allow UTF8 strings to be processed efficiently (creating DFAs etc.)

  • Revisit the tomita parser and allow it to use a more general lexer.

  • Rethink the pattern/token/rule approach of the tomita parser and allow it to be used interchangeably with the regex parser.

  • Better unicode pattern matching - look at how flex handles character classes.

  • Provide the option of using a strictly conforming xml parser for xml reads (using a third-party library).

##FileView2

  • Revisit the field mapping transformations and allow them to define an arbitrary ecl function.

  • Allow any dataset (including alien datatypes) to be displayed without generating a helper function. Probably requires an ECL interpreter. Extending the scope (to cover hqlfold expressions) would help a lot.

##Windows

  • Port eclcc to 64-bit Windows (disabling boost/ssl would leave hqlfold to be implemented)

##Mac

  • Finish the port of eclcc to Mac