Any strategies for rendering the compression result? #35
Comments
Hi @ddssff, thanks for taking a look at this package! Just to make sure I am understanding correctly, you would like to be able to convert the result of One of the advantages to this implementation of RLE (or for that matter MTF or FM-Index) on an arbitrary There isn't a function in this package that turns the I'll look into adding this functionality. |
Yes, that's exactly what I'm looking for. I love the representation you are using, but I'd like some helper functions at the end for converting that representation to and from something string-like. In my particular case I'd like to use it as an HTML property, so that has its own constraints. This might turn into a whole thing of its own about escaping and un-escaping. |
Actually this might best be integrated with https://hackage.haskell.org/package/zenc - it looks like it only outputs alphanumeric characters, so one could use _ or $ for the eol. |
Great! Yeah this shouldn't be hard to do at all. Based on the intention of this package (compression algorithms surrounding textual data), I think it may be best to keep this functionality within this package? It appears that zenc (https://hackage.haskell.org/package/zenc) has a different focus (String -> C name) than pure textual compression. I'm thinking for this functionality, it could choose from an ordered list of "preferred" characters to use (to replace the To invert back to the @ddssff what do you think? |
Do we need to worry about distinguishing the RLE counts from digits in the string? |
@ddssff great catch, we certainly do. Maybe a custom parser can ensure appropriate positioning of counts vs. data and then these can allow us to differentiate when deciding upon the preferred virtual EOF character. |
In the case of zenc, there may be enough left over ASCII characters to create an alternative set of digits to be used for the run lengths. So more generally, a set of characters which represent themselves (the "preferred" characters above) and a second list of characters to be treated as digits? And an EOF character. |
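A minimal sketch of the "second list of characters to be treated as digits" idea, using only `base`. Everything here is a hypothetical stand-in rather than this package's API: `Run` is a simplified (count, character) pair instead of the real `RLET` type, and `altDigits` is an arbitrary choice of ten non-alphanumeric characters. It assumes the payload never contains a character from `altDigits` (for example because it was made alphanumeric beforehand).

```haskell
import Data.List (elemIndex, group)

-- Hypothetical simplified run: (run length, character). Not the package's RLET.
type Run = (Int, Char)

-- Assumed alternative digit alphabet: position i stands for decimal digit i.
-- The payload must not contain any of these characters.
altDigits :: String
altDigits = "!#$%()*+,-"

-- Plain run-length encoding of a string.
runs :: String -> [Run]
runs s = [(length g, c) | g@(c : _) <- group s]

-- Write a count in decimal, using altDigits instead of '0'..'9'.
encodeCount :: Int -> String
encodeCount n = map (\d -> altDigits !! (fromEnum d - fromEnum '0')) (show n)

-- Render each run as its count followed by exactly one payload character.
render :: [Run] -> String
render = concatMap (\(n, c) -> encodeCount n ++ [c])

-- Invert render: read one or more alternative digits, then one payload character.
parse :: String -> Maybe [Run]
parse [] = Just []
parse s =
  case span (`elem` altDigits) s of
    (ds@(_ : _), c : rest) -> do
      digits <- traverse (`elemIndex` altDigits) ds
      let n = foldl (\acc d -> acc * 10 + d) 0 digits
      ((n, c) :) <$> parse rest
    _ -> Nothing
```

Because counts and payload draw from disjoint alphabets, digits in the original string (such as the `11` in `"aaab11"`) cannot be confused with run lengths.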
Would the implicit structure of the RLE help here? For any ‘RLET’, all even (starting with zero) characters would be the run length and the odd (starting with one) would be the digit/character of the original string. Maybe it’s possible to figure out the set of characters in the run lengths and associated characters and then pick one that doesn’t exist in either as the EOF character? My apologies if I’m misunderstanding you. |
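The suggestion above, collecting every character that occurs in the run lengths and the data and then picking an unused one as the EOF marker, could be sketched as follows. This assumes a simplified (count, character) pair rather than the package's actual `RLET` type, and `preferredMarkers` is a made-up ordered list of candidates.

```haskell
import Data.List (find)

-- Hypothetical simplified run: (run length, character). Not the package's RLET.
type Run = (Int, Char)

-- Assumed ordered list of candidate EOF markers, most preferred first.
preferredMarkers :: String
preferredMarkers = "_$~^|"

-- Every character that appears in the rendered data: the payload characters
-- plus the decimal digits of each count.
usedChars :: [Run] -> String
usedChars = concatMap (\(n, c) -> c : show n)

-- The first preferred marker that occurs nowhere in the data, if there is one.
chooseMarker :: [Run] -> Maybe Char
chooseMarker rs = find (`notElem` usedChars rs) preferredMarkers
```

A `Nothing` result means every candidate is already in use, so the caller would need a longer candidate list or an escaping scheme.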
Whatever works and makes a short string! |
Sounds good! Let me work on this, and then I’ll test and release a new version to hackage. |
Poking around in the code, it looks to me like a better representation for the
A little test suite would help here. |
This is great, thank you! Let me take a crack at switching over to that representation. And I totally agree, a test suite is indeed needed. I’m on vacation now, so I probably won’t be able to work on this until next week. |
Actually, even better than |
I have looked over this package briefly, and I am wondering how you could take the result of, say,
textToBWTToRLET :: Text -> RLET
and convert it to a string in such a way that it can be uniquely converted back to an RLET, so it could then be decompressed? I read over some general RLE papers and they seem to always use examples that have no digits.