-
Notifications
You must be signed in to change notification settings - Fork 641
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor Lucene.Net.Grouping to eliminate usage of LINQ, covariant interfaces, and non-generic collection interfaces #1059
Comments
This is awesome analysis and well-written, thanks @NightOwl888. I might take this one after I finish up a few others I'm working on, if no one else jumps on it in the meantime. |
I spent some time looking at this, and playing around with the APIs in both Lucene and Lucene.NET beta 17, and I've got a pretty good grasp on it now. The problem boils down to not as much that the Lucene Java code uses wildcard generics, but that we're trying to force a reified generics design on this API, when in Java the generics are type-erased. The generic methods where only the generic type parameter is used in the return type lets the caller pretend like it's any type they want, but it will fail unless it's the exact type that is hydrated at runtime. For example, Lucene/Java happily let you write unintelligible code like the following: TopGroups<System> topGroups = groupingSearch.search(searcher, query, 0, topNGroups); (Yes, that System.) Since these types are erased, there's nothing telling I've got the GroupingSearch class broken down into those three classes with a common abstract base class. The only failing test I'm chasing down now is that the Function tests are failing to cast from MutableValue to MutableValueStr (or Int32), so I have to figure out a solution to that. |
It sounds like we probably need a |
This still needs XML doc comments, but this breaks the GroupingSearch "god class" into three subclasses that implement a common abstract class. This should allow us to not need covariant interfaces for the return types. In order to randomly switch between these classes with incompatible generic type arguments, the test shows how you can use a delegate to work around this limitation.
An update: I have the GroupingSearch class split out, and many of the interfaces and covariance removed, I just have one failing unit test that I am having difficulty tracking down. You can track my progress at my branch: https://github.com/paulirwin/lucene.net/tree/issue/1059 - or I welcome if anyone can pull that down and see what the problem is. The tests are a bit wonky as I have to create types that use The GroupingSearch class now has three helper factory methods, but you also could use the split-out classes directly if you wanted to. I'll add XML doc comments before I publish the PR, for now I'm focused on getting it working. Once I get the failing test working, I'll finish removing the interfaces, then focus on converting IEnumerable back to ICollection and the other items on the list. |
Re: "Restore generic collections where we are currently using ICollection and IDictionary", ICollection is no longer there, but IDictionary still is, because that requires Lucene.Net.Queries.Function.ValueSource to have a generic dictionary context argument to GetValues, and that would be a much larger change. I think that can be broken out as its own issue/PR as this is big enough as it is, and this being non-generic only affects the Function types, so the scope is pretty limited. I think we can safely replace IDictionary with |
Also note that I had to add an extra public abstract class AbstractDistinctValuesCollector<GC, TGroupValue> : ICollector
where GC : AbstractDistinctValuesCollector.GroupCount<TGroupValue> (Note that I haven't yet looked at the Collector changes, still using ICollector for now.) This is required because we can't make abstract types covariant, and the Function implementation uses MutableValue while Term uses BytesRef, and if you try to do something like Since most people will probably be using the Function or Term implementations and not creating their own, this likely won't affect much of anyone except for those that are needing to create their own grouping search classes. |
I took a look and this seems to be a common problem that we have had when porting generic types from Java. C# sees public abstract class FirstPassGroupingCollector<T> : FirstPassGroupingCollector
{
// implementation
}
public abstract class FirstPassGroupingCollector // Note this could be an interface or an abstract class
: ICollector
{
private protected FirstPassGroupingCollector() {} // Only allow internal use
// implementation
} This would allow the type to be referenced without being aware of its generic type. private AbstractFirstPassGroupingCollector CreateFirstPassCollector<T>(string groupField, Sort groupSort, int topDocs, AbstractFirstPassGroupingCollector firstPassGroupingCollector)
{
// LUCENENET specific - we need to look for our wrapper types
if (typeof(ObjectFirstPassGroupingCollector<BytesRef>).IsAssignableFrom(firstPassGroupingCollector.GetType()))
{
ValueSource vs = new BytesRefFieldSource(groupField);
return new FunctionFirstPassGroupingCollector<MutableValue>(vs, new Hashtable(), groupSort, topDocs);
}
else if (typeof(ObjectFirstPassGroupingCollector<MutableValue>).IsAssignableFrom(firstPassGroupingCollector.GetType()))
{
return new TermFirstPassGroupingCollector(groupField, groupSort, topDocs);
}
fail();
return null;
} If you think about it, since we are having trouble testing it this way, it is very likely that users will need to be able to store these distinct types in a common variable type (after all, Java allows this), so making this a first class feature rather than a test mock is necessary. This would eliminate the need to create an object wrapper type. There may still be the need to cast somewhere, but there may be ways to mitigate that cast (perhaps by calling a method or property on a class that already is aware of the generic type and can handle the cast internally). |
Would it make sense to add the closing types of |
Wait, it looks like that is only for |
I'm not so sure it's necessary, this is only a problem with using the low-level Abstract*Collector types. I don't think many people are going to be using those low-level types (instead using the types in the GroupingSearch file), and those that are further still are probably not going to need to be switching between them in the same variable like the test does, and further still if they really need to do that, they can do a workaround like we're doing here. What I don't want to have to do is expose object-based properties that are shadowed in the generic classes (or worse, i.e.
No, because ValueSource implementations treat it as |
Also note that a value source might even store/retrieve different key and value types in the context within the same method. |
But it is how we do it elsewhere (for example, in FieldComparer). Having the ability to store a collection of That said, there doesn't seem to be a use case where it makes sense to do this for
Gotcha. Perhaps we are better off sticking with Need to think about this some more. |
I actually already removed that in lieu of having this do the same approach as the other collectors in the tests, so that we're not adding an unnecessary interface just for testing purposes. It's unlikely that someone would need just the group counts in the real world without also accessing the groups, so I figured it would be best to not pollute the public API with that interface when it was only used (read: not needed, just used as one approach) for test purposes. |
I just had an idea, based on that last comment, if you're open to it. We could create non-generic interfaces (not abstract classes) that do not expose collections, but give you APIs to retrieve items. By using an interface instead of an abstract class, we wouldn't be forced to convert to/from object except for in the explicitly implemented methods. It also would not change the type hierarchy. For example: public interface IAllGroupsCollector : ICollector // NOTE: we'd need to keep this interface
{
int GroupCount { get; }
object GetGroup(int index);
} We would then explicitly implement the interface for the GetGroup method. I think that would be cleaner than exposing a non-generic ICollection, although I could be swayed on explicitly implementing that too. Thoughts? |
That is the reason why I said "abstraction" rather than "abstract class", because either would probably be fine. I just used abstract classes in most other places because it meant keeping the same syntax as the upstream code. For example: JavaList<AbstractAllGroupsCollector> collectors = new ArrayList();
ValueSource vs = new BytesRefFieldSource(groupField);
collectors.add(new FunctionAllGroupHeadsCollector(vs, new Map(), sortWithinGroup));
collectors.add(TermAllGroupHeadsCollector.Create(groupField, sortWithinGroup)); C#List<AbstractAllGroupsCollector> collectors = new List<AbstractAllGroupsCollector>();
ValueSource vs = new BytesRefFieldSource(groupField);
collectors.Add(new FunctionAllGroupHeadsCollector(vs, new Hashtable(), sortWithinGroup));
collectors.Add(TermAllGroupHeadsCollector.Create(groupField, sortWithinGroup)); Where if we used an interface, it would be declared a bit different: List<IAllGroupsCollector> collectors = new List<IAllGroupsCollector>(); The main reason for doing this in other places was for being able to access constants and fields without having to specify the non-generic type (in which case the syntax stays the same as upstream). It would also allow the use of the That being said, using an interface does have an advantage we may be able to take advantage of: it can be declared internal and used just for testing purposes. If we use an abstract class, it will have to be declared public in order for a public type to inherit it. Although, I am not entirely convinced that getting the After mulling this over a bit more, if the user needed to create a collection of mixed collectors, they could just declare it //List<AbstractGroupingSearch> collectors = new List<AbstractGroupingSearch>();
List<object> collectors = new List<object>();
collectors.Add(GroupingSearch.ByField(...));
collectors.Add(GroupingSearch.ByFunction<MutableValueStr>(...));
collectors.Add(GroupingSearch.ByFunction<MutableValueInt32>(...));
collectors.Add(GroupingSearch.ByDocBlock<string>(...));
collectors.Add(GroupingSearch.ByDocBlock<int>(...));
foreach (object collector in collectors)
{
if (collector is FieldGroupingSearch fieldGroupingSearch)
// do something with the collector
else if (collector is FunctionGroupingSearch<MutableValueStr> functionOnString)
// do something with the collector
else if (collector is FunctionGroupingSearch<MutableValueInt32> functionOnInt32)
// do something with the collector
else if (collector is DocBlockGroupingSearch<string> docBlockOnString)
// do something with the collector
else if (collector is DocBlockGroupingSearch<int> docBlockOnInt32)
// do something with the collector
} This would be very cumbersome if all they want to do is read a field that is common among all of these types and even moreso if they wanted to do that for any custom subclass (which would probably require reflection). I am not opposed to keeping |
I think if you're saying that users have a valid use case of needing to store multiple of these different collectors in a variable or collection with different type parameters, we might as well make the interface public then. But as I noted above, I think interfaces instead of abstract classes have another advantage: being able to explicitly implement them in the generic abstract classes. If we take the abstract class approach, you can't explicitly implement an abstract class. It will require shadowing the non-generic properties if they have the same name. I think an interface lets us have the best of both worlds: we can allow for this use case, while simplifying our testing code. I'll give it a try and see how it comes out. |
Yeah, I definitely would stay away from a non-generic
Agreed. I often use interfaces to separate members of the same name that only differ by return type. In fact, I made a fluent API generator that creates a builder class that implements interfaces to enforce complex business rules, such as only allowing a specific setting to be applied up to 3 times (after which it will return the builder instance cast to an interface that no longer declares the method for the setting). I would also like to see how that approach turns out. And I agree that there is more utility for this than just testing. |
This is going well so far, I'll have a commit soon. I'm creating adapter classes to adapt i.e. an |
@NightOwl888 This is pushed up to my branch, let me know what you think. This allowed me to remove the wrapping types so the test code is closer to Lucene. It required adding those "Casting" types to Lucene.Net.Support (as internal classes) to do the casting for the explicit interface implementations without having to allocate new collections (and perform O(n) operations to do so) or use LINQ. Still has one failing test, but I think it'll be easier to figure out what's wrong from the diff now that the test code has less changes. |
Of course, as always happens, I realized what the problem was as soon as I posted that. Fixed, and the unit test passes now! |
Is there an existing issue for this?
Task description
There were several workarounds done during the initial port of Lucene.Net.Grouping that make it less usable and even less performant than the grouping module in Lucene 4.8.1.
IEnumerable<T>
in places where Lucene used the Java equivalent ofICollection<T>
IEnumeable<T>
, we have to use LINQCount()
andAny()
in a few placesCollector
abstract class in Lucene.Net.Queries was replaced with anICollector
interface to make it covariantThe fundamental problem with Lucene.Net.Grouping is that C# doesn't support wildcard generics and the Lucene design requires them. When I ported it, it took several tries just to find something that would compile and this is essentially our first (failed) attempt. It functions, but it has problems that rippled throughout the search namespace. It has bottlenecks that didn't exist in Java where we use LINQ and extra collection allocations. These bottlenecks will not be fixable after the Lucene.NET 4.8.0 release because they require breaking API changes, both to Lucene.Net.Queries and Lucene.Net.Grouping.
Through the high-level analysis work that @rclabo did, we learned that the crux of the issue is with the
GroupingSearch
class. It is a God Object that the user can see all of the grouping options on. But since it has fields that need to be multiple different generic types, it doesn't work in C#. Ron refactored the search method into different types of search, but I think it still requires covariance to exist. That is where we are today.I suspect we can fix all of the incompatibles by correcting the God Object design and following the Single Responsibility Principle. There are 3 different types of grouping searches (each defined by a separate constructor):
FieldCache
ValueSource
My thought is to start by refactoring the
GroupingSearch
class intoFieldGroupingSearch
,FunctionGroupingSearch
, andDocBlockGroupingSearch
, separating all of the fields into their respective classes. If there is any shared functionality remaining, create a new class namedAbstractGroupingSearch
to put this shared state and members in. This removes the need to have anything resembling Java wildcard generics because we will have separate classes with their own generic types.This will ripple changes through the Lucene.Net.Grouping project, but we should aim to remove the extra interfaces we created just to be able to declare types as covariant (which was a workaround we used for C#'s lack of wildcard generics).
IAbstractAllGroupsCollector<out TGroupValue>
AbstractDistinctValuesCollector.IGroupCount<out TGroupValue>
IAbstractDistinctValuesCollector<out GC>
IAbstractFirstPassGroupingCollector<out TGroupValue>
IAbstractSecondPassGroupingCollector<out TGroupValue>
IGroupDocs<out TGroupValue>
ISearchGroup<out TGroupValue>
ITopGroups<out TGroupValue>
ICollector
(in Lucene.Net.Queries). Note that we could keep this interface, but we should use the originalCollector
abstract class design. It has a constant declaration that cannot be put into an interface, so we have to have a class, anyway.Once we remove the interfaces, the violations to the SRP, and fix the 4 issues above, we can then use the class
GroupingSearch
to either delegate the call to the specializedGroupingSearch
classes (preferred - this would give us a similar UX as Lucene) or to add static factory methods to create instances of them. This is going to require some creativity. I am not sure the former is possible because of the different return types for each search method (which was the primary reason we needed covariance to port it), but separating the functionality will likely reveal more options than we previously had.Deliverables
Collector
back to its original abstract class design as it was in Lucene 4.8.1IEnumerable<T>
declarations toICollection<T>
as they were in Lucene 4.8.1IEnuerable<T>
and restoreICollection<T>.Count
as it was in Lucene 4.8.1ICollection
andIDictionary
LUCENENET TODO: Try to work out how to do this without an O(n) operation
lines)The text was updated successfully, but these errors were encountered: