Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimizing CachingParser #9

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

FurcyPin
Copy link

When using the method CachingParser.parse(), useless caching is done :

The CachingParser.parse() uses a cache, then calls its parent Parser.parse() method, which calls Parser.parseUserAgent(), which is overriden by CachingParser.parseUserAgent() wich uses a second cache.

The same holds for parseOS().

@mmarmol
Copy link

mmarmol commented Mar 1, 2016

Hey @FurcyPin I am currently working in a project that is bases on this lib but I am trying to put things up to date. For example I have a remote url load for the yaml and a scheduled load. Also on the cache issue you mention I have implemented Guava and I am doing differently. I would love to hear feedback from you.

https://github.com/mmarmol/uap-java

@FurcyPin
Copy link
Author

FurcyPin commented Mar 3, 2016

Hi @mmarmol. The cache issue I mentioned was minor, simply values are cached twice.
Being able to update the yaml on-the-file is a great idea.
However I'm not sure I understand why you needed to recode the caching implementation using Guava?
I'm not sure what kind of feedback you are asking for? Did you mean feedback about this specific caching issue or about your whole project? In both case I would be happy to help.

@mmarmol
Copy link

mmarmol commented Mar 3, 2016

@FurcyPin whole project. I know that the caching is a minor thing. The idea os Guava was to provide better control in case of needed. You believe is a bad idea? My main problem so far with the lib here is that those not look to active. I started a project that depends on User Agent parsing and though it would be a good idea to share it since as needed in a commercial project it will probably get updated frequently. Thanks for getting back.

@FurcyPin
Copy link
Author

FurcyPin commented Mar 3, 2016

@mmarmol
You need to create a
Parser parser = new CachingParser()
instead of a
Parser parser = new Parser()
for the cache to be enabled.

Internally, the CachingParser uses a org.apache.commons.collections.map.LRUMap, and the cache size is hardcoded, so I guess that indeed Guava might allow to have a better control on the cache.

Here are a few comments about your implementation of the Parser :

  • If you want to use your parser to get only one field, say the OS, and use parseOS(), you won't have any caching done. If you look at the CachingParser, 4 caches are created, one for each field and one for the triple.
  • When you update the yaml, have you thought about potential concurrency issues? Maybe you should use a lock in the initialize() and parse() functions.
  • You also probably want to empty your cache when you refresh the yaml.
  • If you do both points above, you will probably want to check if the yaml has changed, and re-initialize() only when needed, to avoid emptying your cache needlessly.
  • Finally, I believe it would be great if you could add this feature as a part of this project rather than starting your own. Perhaps you could extends this Parser class with an UpdatableParser and do a pull request?
    I think this would be best to contribute to this project (which will hopefully still be maintained, @Ironholds?), as it is part of the whole cross-language uap project (for instance the Pig implementation is using this Java implementation: https://github.com/ua-parser/uap-pig)

Anyway, updating the yaml on the fly would be an awesome feature, great idea!

Hope this help!

@mmarmol
Copy link

mmarmol commented Mar 3, 2016

@FurcyPin thanks so much for this feedback!!!

If you want to use your parser to get only one field, say the OS, and use parseOS(), you won't have any caching done. If you look at the CachingParser, 4 caches are created, one for each field and one for the triple.

It will actually have caching done, if somebody calls parseOS with agent string "A" I will have a cache entry for that OS with key "userAgent:A". The user user calls just parse to get the full client when calling internally parseOS I will hit cache and return that one, them I will add cache for the other two. But I removed the client cache since is a derived from the other 3. Makes sense?

When you update the yaml, have you thought about potential concurrency issues? Maybe you should use a lock in the initialize() and parse() functions.

Actually new parsing instance for os, userAgent and device is created on each yaml update. So they aren't used until the others ends. I don't think it will be unstable while updating, it should keep working normally with the current services.

You also probably want to empty your cache when you refresh the yaml.

I think is also valid not doing the cleaning cause the cache itself will eventually free itself if proper configured. So I have set it as option boolean to decide if cache should be cleaned on updates.

If you do both points above, you will probably want to check if the yaml has changed, and re-initialize() only when needed, to avoid emptying your cache needlessly.

I need to figure out how to do this, I am not sure if a md5 or something is available to do that check without having to download the file or save the whole content and do a md5. What would recommend?

Finally, I believe it would be great if you could add this feature as a part of this project rather than starting your own.

I have been following this lib for several months and haven seen any detailed follow up on the pull request, and there are a couple great like one for devices that I have added to my lib. I really need things moving faster since it will be used commercially but we want to keep it opensource. If I see movement here we may get back to this one.

Thanks,

@FurcyPin
Copy link
Author

FurcyPin commented Mar 3, 2016

About Caching:
this makes perfectly sense, sorry I didn't see you were using the same cache for all 3 entries

About Concurrency issue:
You may still have a concurrency issue if you update your yaml or empty your cache in
the middle of the parse() function:

public Client parse(String agentString) {
  UserAgent ua = parseUserAgent(agentString);
  OS os = parseOS(agentString);
  Device device = parseDevice(agentString);
  return new Client(ua, os, device);
}

For instance you might get a new UserAgent recognized in a yaml update, and your update
could happen after parseOS() and before parseDevice()
Then you will have an OS which is null and a Device with a parsed value.
You might have the same kind of issues if you don't store the 3 parsed values (ua, os, device)
together in your cache and don't clean your cache after an update: you could have one of the value expiring and being updated without the other being updated too, which could cause weird discrepancies during updates...

About checking if the yaml has changed:
You can compute a md5 in java using java.security.MessageDigest. It can process a ByteBuffer, but indeed I am not sure if you can do it without having to download the stream twice from the URL, or storing the whole yaml in RAM...

I understand your need to move fast, keeping your changes open-source is already a great thing,
let's hope this project will continue to evolve.

@mmarmol
Copy link

mmarmol commented Mar 10, 2016

@FurcyPin The yaml is check now by md5 before doing the replace.
The parser are all change at one, so no more inconsistentes.
The cache it may not be full clean while doing a replace since multi-threading, but I am against locking. I will try to figure out another way to have it done without that or leave it just for now. Giving a max time to live to objects should solve that issue.

@FurcyPin
Copy link
Author

Seems good.
About the caching issue, I agree locking the cache might not be ideal.

The only simple solution I see right now would be to modify your Parser.parse() method to
store the triple (ua,os,device) in the cache, which is exactly what this project's CachingParser does.

I think it is safe to assume that any application using the ua project will either always ask for the triple
or always ask for only one same field. In this case, you won't store your data twice.

Just make sure to avoid the bug my commit is fixing.

@mmarmol
Copy link

mmarmol commented Mar 11, 2016

@FurcyPin yes, I have included your code. On last version added also 3 parsers at once replacement. I have another fix to release since the MD5 check for the file right now is wrong. I will release it on version 0.3.1 probably today.

@tkaymak
Copy link

tkaymak commented Jul 17, 2017

Any update on this?

@FurcyPin
Copy link
Author

not on my side

@mmarmol
Copy link

mmarmol commented Jul 17, 2017

https://github.com/mmarmol/uap-java

I am using this one for personal projects if you want to try it.

mbarbon added a commit to mbarbon/uap-java that referenced this pull request Sep 2, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants