Suggestions for simple modifications

New to the system, and looking for something small and/or fairly self-contained to work on?

The following is a list of potential changes which will hopefully fit the bill. Please feel free to ask for more details.

##Code generator

Clean up execution of the eclcc code generator regression suite on Linux.
gh-467 A new dataset operator for extracting subsets of a stream
New operator: DATASET(count, transform(COUNTER)).
Optimize NORMALIZE(dataset(row), , transform) to use DATASET(count, transform)
Add a user flag to keyed joins to indicate order does not need to be preserved.
Better BETWEEN code. Sometimes the test expression is calculated twice - e.g., EXP(SUM(values, LN(prob))) between 0.1 and 0.9
Optimize generation of return exists(...).
The simplest way would be a special target that generates a return when it is assigned to - but care is then needed not to assign to it twice.
Constant fold SORT(constant-inline-dataset)
Optimize comparisons of utf8/unicode against blank strings, and use the rtlCompareStrBlank() in more situations.
Optimize if (count(ds)>0, ds[1].field, ) to ds[1].field
Allow link counted child rows as well as child datasets.
Special case child datasets with a maxcount of 1 and just store a pointer.
Optimize code generated for EXISTS(JOIN(a,b,cond,all)) when evaluated inline on child datasets inside a transform.
Optimize COUNT(DATASET(myset, rec)) to COUNT(myset)
Combine aggregations. E.g.,

cnt1 := COUNT(ds(filter1));
cnt2 := COUNT(ds(filter2));
becomes
agg := TABLE(ds, {cnt1 := COUNT(GROUP, filter1); cnt2 := COUNT(GROUP, filter2); });
cnt1 := agg[1].cnt1;
cnt2 := agg[1].cnt2;

The advantage is that the counts will be done directly on the disk buffer without creating records and splitting - and it will also reduce multithreading overhead.
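As a rough illustration of why the combined form is cheaper, the sketch below counts against both filters in one pass over the input rather than scanning once per count. It is not what eclcc generates; `Row`, `Counts` and `countBoth` are hypothetical names chosen for this example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical row type, for illustration only.
struct Row {
    int a;
};

// Holds the two results that the separate COUNT() calls would have produced.
struct Counts {
    std::size_t cnt1 = 0;
    std::size_t cnt2 = 0;
};

// Counts the rows matching each filter in a single pass over the input,
// instead of scanning (or splitting) the dataset once per COUNT.
template <typename Filter1, typename Filter2>
Counts countBoth(const std::vector<Row>& ds, Filter1 filter1, Filter2 filter2) {
    Counts result;
    for (const Row& row : ds) {
        if (filter1(row)) ++result.cnt1;
        if (filter2(row)) ++result.cnt2;
    }
    return result;
}
```

Calling `countBoth(ds, f1, f2)` with two predicates returns both counts after reading each row once, which mirrors the single TABLE aggregate in the ECL above.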

Use a thread variable for the bcd stack and remove the bcd critical block.
Finish support for utf8 fields.
Introduce a compressed archive format (with an appropriate magic header).
Implement an IEclSourceCollection that links to libarchive to allow building directly from compressed tar files etc.
Restrict hqlfold to only fold registered plugins.
Optimize AGGREGATE(ds, SELF.x := RIGHT.x & LEFT.x) to directly modify the target record.
Allow main module to be split over multiple C++ files. (Could be useful for compile times on very large queries.)
Cache the row allocators in the xml transformation classes. (Will require onCreate() to be added.)
Add a PROJECT (?) option on a JOIN to indicate it is worth pre-projecting any complex join fields. (Note projecting a guarded join condition could slow it down significantly, so may only apply to first, or need to be configurable.)
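For the suggestion above about using a thread variable for the bcd stack, the general technique might look like the following minimal sketch. It assumes a C++11 `thread_local` variable is acceptable; `DecimalStack`, `threadStack` and `addTopTwo` are hypothetical stand-ins and not the real runtime types or functions.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a decimal working stack; not the real runtime type.
class DecimalStack {
public:
    void push(long long value) { values_.push_back(value); }

    // Removes and returns the top value; the caller must ensure the stack is not empty.
    long long pop() {
        long long top = values_.back();
        values_.pop_back();
        return top;
    }

    std::size_t depth() const { return values_.size(); }

private:
    std::vector<long long> values_;
};

// One stack per thread: each thread sees its own instance, so no lock is needed.
inline DecimalStack& threadStack() {
    thread_local DecimalStack stack;
    return stack;
}

// Example operation: replaces the two topmost values with their sum,
// without entering any critical section.
inline void addTopTwo() {
    DecimalStack& s = threadStack();
    long long rhs = s.pop();
    long long lhs = s.pop();
    s.push(lhs + rhs);
}
```

Because each thread owns its own stack, operations such as `addTopTwo()` no longer need to serialize on a shared critical block.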

More complex:

Better support for C++ definitions - allow dependencies on other attributes, and on other C++ files/libraries.
Finish work on allowing conditional statements in graphs, and enable.
Support an ecl through-pipe activity, with streaming input(s) and output(s?)
Revisit the packing option, and the option to auto-pack fields. (Always add pack in implicit project if the field order isn't fixed.)
Better processing of the dataset format. E.g., don't deserialize on slaves or for disk-read->output.
Make dataset size-field configurable.
Option to only use link counted child rows on datasets with elements above a certain size (e.g., sizeof(void *) bytes)?
Add a transform to track which sort orders/distributions are actually used, and then tag activities (e.g., keyed joins) or remove them if the sort/distribution isn't required.
Allow datasets and strings etc to configure whether they are prefixed with a count/length or size.
Main complication is the number of places that would need to be changed.
Allow length/size for a dataset/string to be separated from the data.
Would improve packing and alignment, but the representation is tricky.
Add an attribute to all activities (especially piperead, random, user-functions) that allows them to reference an expression indicating the scope they should be evaluated in. It may need to be an operator (in addition?).
Optimize multiple aggregates on the same dataset - e.g., count(ds(a=x)), count(ds(a=y)) into a single loop when done inline.
Better resourcing of inline dataset operations
Minimize the data sent to roxie slaves, e.g., for indexread/keyed join, by generating a separate slave helper.
It would also have the benefit of making the master helper "colocal" (in the same memory space as the owner activity).
Implement link counted strings, and switch all temporary strings over to using them.
Will cause incompatibilities with existing plugins.
Implement a costing algorithm for IHqlExpressions
Implement some kind of MAP type, and use a hash table lookup for "x in MAP(...)"
Optimize order of filters. Costing is a prerequisite.
Use the expression costing to implicitly add ,PROJECT to a JOIN
Expand implicit project code to work on child records.
Lightweight grouped self-join which performs an all self-join on each input group. (May require minimal work in engines.)
Allow much more flexibility in out-of-line user-functions.
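To illustrate the hash table lookup mentioned above for "x in MAP(...)", here is a minimal sketch of a membership test backed by a hash set, so each test is an average constant-time lookup rather than a comparison against every value in turn. `MembershipSet` is a hypothetical name for this example and does not correspond to an existing class.

```cpp
#include <initializer_list>
#include <string>
#include <unordered_set>

// Hypothetical helper: the values are hashed once when the set is built,
// and every later membership test is a single hash lookup.
class MembershipSet {
public:
    MembershipSet(std::initializer_list<std::string> values) : values_(values) {}

    // Returns true if x is one of the stored values.
    bool contains(const std::string& x) const {
        return values_.find(x) != values_.end();
    }

private:
    std::unordered_set<std::string> values_;
};
```

The set would be built once, e.g. `MembershipSet colours{"red", "green", "blue"};`, and then queried with `colours.contains(x)` for each row.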

##PARSE

Allow UTF8 strings to be processed efficiently (creating DFAs etc.)
Revisit the tomita parser and allow it to use a more general lexer.
Rethink the pattern/token/rule approach of the tomita parser and allow it to be used interchangeably with the regex parser.
Better unicode pattern matching - look at how flex handles character classes.
Provide the option of using a strictly conforming xml parser for xml reads (using a third-party library).

##FileView2

Revisit the field mapping transformations and allow them to define an arbitrary ecl function.
Allow any dataset (including alien datatypes) to be displayed without generating a helper function. Probably requires an ECL interpreter. Extending the scope (to cover hqlfold expressions) would help a lot.

##Windows

Port eclcc to 64-bit Windows (disabling boost/ssl would leave hqlfold to be implemented).

##Mac

Finish the port of eclcc to Mac.
