Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Binary instances for OsString, etc. #161

Open
hasufell opened this issue Jul 16, 2022 · 10 comments
Open

Binary instances for OsString, etc. #161

hasufell opened this issue Jul 16, 2022 · 10 comments

Comments

@hasufell
Copy link
Member

In GitLab by @phadej on Jul 17, 2022, 03:04

Where these should be?

  1. binary (then binary would depend on filepath)
  2. filepath (then filepath would depend on binary)
  3. somewhere else? (https://hackage.haskell.org/package/binary-instances or new filepath-binary-instances).

I don't know if how GHC serializes its stuff, but I bet there are filepaths in there. (Cabal does, and uses Generics). So that rules out third option, except if it's a new filepath-binary-instances package which is bundled with GHC.

Otherwise I'd prefer second option, so the type-class providing package stays as dependency-light as possible.

ping @bgamari

@hasufell
Copy link
Member Author

In GitLab by @maerwald on Jul 17, 2022, 04:12

To me it seems fine if binary depends on filepath.

What would the instance actually do? Just unwrap the underlying ShortByteString? Remember that we have two different constructors for unix and windows:

newtype WindowsString = WindowsString ShortByteString

newtype PosixString = PosixString ShortByteString

newtype OsString = OsString
#if defined(mingw32_HOST_OS)
  WindowsString
#else
  PosixString
#endif

Should you be able to encode on unix, send the bytestring over the wire and decode on windows? What are the semantics?

@hasufell
Copy link
Member Author

In GitLab by @phadej on Jul 17, 2022, 04:17

To me it seems fine if binary depends on filepath.

It's not. filepath is already dependency heavy: through exceptions it depends on mtl, transformers, and stm. IMO adding binary to the list is ok.

@doyougnu
Copy link

doyougnu commented Aug 22, 2024

I've recently been benchmarking GHC building Cabal and normalise shows up as hot because of ++ and is used all over the linker and for the interface files. In any case, I'm in the process of migrating the ubiquitous FilePath in the Loader and in the Interface file system with OsPath. So naturally I ran into this issue because there is no Binary instance for OsString and we serialize the interface files.

Is there anyway to make progress here? I don't mind spending some cycles to open the appropriate MR/PR

@hasufell
Copy link
Member Author

What would the semantics of the instance be?

@doyougnu
Copy link

What would the semantics of the instance be?

Ah, well I suppose I can just write the unpack-to-shortbytestring instances in GHC, its not like GHC will be sending binary instances over the wire. Although I'm unsure if this will mess with cross-compilation, at least I can test it.

Now to actually answer your question. My inclination is just to unpack to short bytestring. I'm suspicious that the solicited problem of Should you be able to encode on unix, send the bytestring over the wire and decode on windows? What are the semantics? does not occur frequently in the wild. But then again I haven't been working at the application level. From the GHC perspective though, if someone is sending an interface file from unix to windows over the wire then I think they would be on their own. In any case, I think its reasonable to do the unpack to shortbytestring instance and say that these instances are not portable across platforms. That at least serves some of the current users and leaves the exact problem you solicit to future work. This way we don't let the harder problem gate the simple solution.

@hasufell
Copy link
Member Author

So you're saying that:

  1. put (str :: OsString) always works on any platform
  2. r :: OsString <- get may fail when data is sent between platforms (Get has a MonadFail instance)
  3. WindowsString is gettable on any platform
  4. PosixString is gettable on any platform

Is that correct? How would the get implementation for OsString, WindowsString and PosixString look like? We need to ensure that:

  • a serialized WindowsString can't accidentially be deserialized as a PosixString or vice versa
  • OsString deserialization only works on the corresponding platform

@doyougnu
Copy link

a serialized WindowsString can't accidentially be deserialized as a PosixString or vice versa
OsString deserialization only works on the corresponding platform

Both of these could be solved with a sentinel value in the Binary instances and this would only require one bit as well.

So we:

  • define Binary instances for WindowsString and PosixString.
  • For windows we write a leading 0 on a put and then the rest of the payload using the ShortByteString instance.
  • For Posix we write a leading 1.
  • For a get we read the leading bit, we hard code the sentinel but use ifdefs so if we are on Windows and read a 1 then we can error out by checking 1 /= sentinel before reading the rest of the payload. Same for Posix, the headache will just be managing the conditional compilations.

So I have in mind something like this in os-string

#if defined(mingw32_HOST_OS) || defined(__MINGW32__)
sentinel :: Word8
sentinel = 0 -- not exported 
#else
sentinel :: Word8
sentinel = 1
#endif

and then:

instance Binary WindowsString where
  put ws = putWord8 sentinel >> put (unWindowsString ws)
  get ws = getWord8 >>= toOsString 
    where toOsString read_sentinel | read_sentinel == sentinel = WindowsString <$> getTheRest
                                                       | otherwise = fail "oops wrong format for this platform!"

of course we would need to also store the number of bytes to read on a get but that's fine since we can store this in the same sentinel value and just bit fiddle. So imagine we use Word32 then would reserve 1 bit for the platform and 31 for number of bytes to read.

Note that reading some of the binary instances in GHC I'm not sure if that's actually necessary since they don't do that much hand holding. But in any case that's the idea.

@hasufell
Copy link
Member Author

Both WindowsString and PosixString are types that may be used on any platform. As an example: the tar package uses PosixString on windows, because the tar specification is as such. That allows us to safely convert PosixString to OsString.

The only type that needs to check that it matches the current platform is OsString.

@Bodigrim
Copy link
Contributor

I would prefer binary depending on os-string / filepath, not vice versa, because it makes easier to move OsString into base at some point in future.

@doyougnu
Copy link

doyougnu commented Aug 26, 2024

Both WindowsString and PosixString are types that may be used on any platform. As an example: the tar package uses PosixString on windows, because the tar specification is as such. That allows us to safely convert PosixString to OsString.

Ah I see, in that case perhaps its better not to provide binary instances but rather functions that would do the serialization. That would allow binary to depend on os-string/filepath, but also force the user to know what they are doing. To be clear I mean write functions like getPosixString and getWindowsString and then live with the code duplication

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants