Add info about media type to Memento #56

Open
Mr0grog opened this issue Oct 21, 2020 · 2 comments
Labels
enhancement (New feature or request), question (Further information is requested)

Comments

@Mr0grog
Member

Mr0grog commented Oct 21, 2020

In EDGI’s web monitoring tools, we often look at the media type (also called the MIME type or content type) of a Memento. For example, we need to know how to parse the body in order to extract a title (you’d do very different things for HTML vs. PDF). It might be nice to expose some sort of media type information on the Memento class.

We originally planned to do this in #2, but it wasn’t critical, and there were enough open questions and options that it seemed worth waiting until we had a better design:

  • Should this be as simple as just the media type with no parameters?

    a_memento.headers['Content-Type'] == 'text/html; charset=utf-8; some-param=value'
    a_memento.media_type == 'text/html'
  • Should it be a detailed representation?

    class MediaType:
        # Has attributes for each part of a media type string,
        # Probably a __str__() implementation, etc.
        ...
    
    a_memento.headers['Content-Type'] == 'text/html; charset=utf-8; some-param=value'
    a_memento.media_type == MediaType(type='text',
                                      subtype='html',
                                      parameters={'charset': 'utf-8',
                                                  'some-param': 'value'})
  • Should it convert known non-canonical types into the canonical one?

    a_memento.headers['Content-Type'] == 'application/xhtml'
    a_memento.media_type == 'application/xhtml+xml'  # Means the same thing, and is more correct.
  • Should it sniff?

    a_memento.headers.get('Content-Type') is None
    a_memento.content == b'%PDF-blahblahblah...'
    a_memento.media_type == 'application/pdf'
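A rough sketch of how the simpler options above (bare media type, canonical aliases, sniffing) might fit together. None of these names exist in wayback: `CANONICAL_TYPES`, `MAGIC_PREFIXES`, and the helper functions are hypothetical, `headers` is assumed to be a plain dict, and the two lookup tables contain only the examples from this issue.

```python
# Hypothetical helpers for illustration; not part of wayback's API.

# Known non-canonical types mapped to their canonical form. Only the alias
# mentioned in this issue is listed; a real table would need more entries.
CANONICAL_TYPES = {
    'application/xhtml': 'application/xhtml+xml',
}

# Byte prefixes used for sniffing. PDF files begin with the bytes "%PDF-".
MAGIC_PREFIXES = (
    (b'%PDF-', 'application/pdf'),
)


def simple_media_type(content_type):
    """Return just ``type/subtype`` in lowercase, or None if there is none."""
    if not content_type:
        return None
    # Everything after the first `;` is parameters, which we drop here.
    main = content_type.partition(';')[0].strip().lower()
    return main or None


def canonical_media_type(media_type):
    """Map a known alias to its canonical type; pass anything else through."""
    return CANONICAL_TYPES.get(media_type, media_type)


def sniff_media_type(content):
    """Guess a media type from the first bytes of the body, or return None."""
    for prefix, media_type in MAGIC_PREFIXES:
        if content.startswith(prefix):
            return media_type
    return None


def guess_media_type(headers, content=b''):
    """Prefer the declared type; sniff only when no type was declared."""
    declared = simple_media_type(headers.get('Content-Type'))
    if declared is None:
        return sniff_media_type(content)
    return canonical_media_type(declared)


print(guess_media_type({'Content-Type': 'text/html; charset=utf-8'}))
print(guess_media_type({'Content-Type': 'application/xhtml'}))
print(guess_media_type({}, b'%PDF-blahblahblah...'))
```

Whether sniffing should ever override a declared type is one of the open design questions; this sketch only sniffs when the header is missing.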
@Mr0grog added the enhancement and question labels Oct 21, 2020
@Mr0grog
Member Author

Mr0grog commented Oct 21, 2020

Here’s the first-pass implementation of the detailed type that I had rigged up when first working on #2, before realizing all the other questions here. It may be useful in the future:

import re

ESCAPE_OR_QUOTE = re.compile(r'\\(.)|"')

class MediaType:
    """
    Represents a media type, like ``text/html``.

    For more information about media types, see `RFC 2045`_ and `RFC 2046`_.

    .. _RFC 2045: https://tools.ietf.org/html/rfc2045
    .. _RFC 2046: https://tools.ietf.org/html/rfc2046

    Attributes
    ----------
    type : str
        The top-level type, e.g. ``'text'`` in ``'text/html; charset=utf-8'``.
    subtype : str
        The subtype, e.g. ``'html'`` in ``'text/html; charset=utf-8'``.
    parameters : dict
        All the parameters that were specified, e.g. ``{'charset': 'utf-8'}``
        in ``'text/html; charset=utf-8'``.
    media : str
        The type and subtype without parameters, e.g. ``'text/html'``.
    parameter_string : str
        The parameters serialized as a string, e.g. ``'charset=utf-8'``.
    """
    type = ''
    subtype = ''
    parameters = None

    def __init__(self, type, subtype, parameters=None):
        if not type or not subtype:
            raise ValueError('Type and subtype must be non-empty strings')

        self.type = type
        self.subtype = subtype
        self.parameters = parameters or {}

    @property
    def media(self):
        return f'{self.type}/{self.subtype}'

    @property
    def parameter_string(self):
        # FIXME: parameter values need to be quoted if they contain special
        # characters. See https://tools.ietf.org/html/rfc2045#section-5.1
        return '; '.join(f'{key}={value}' for key, value in self.parameters.items())

    def __str__(self):
        if self.parameters:
            return f'{self.media}; {self.parameter_string}'
        else:
            return self.media

    @classmethod
    def parse(cls, text, strict=True):
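        """
        Build a :class:`wayback.MediaType` instance from a media type string,
        like ``'text/html; charset=utf-8'``.

        Parameters
        ----------
        text : str
            The media type string to parse.
        strict : bool, default: True
            If true, raise :class:`ValueError` for malformed input. Otherwise,
            fall back to ``application/octet-stream`` for a malformed type and
            read an unterminated quoted parameter value to the end of the text.

        Returns
        -------
        wayback.MediaType
        """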
        main, _, parameter_text = text.partition(';')

        types = [item.strip().lower() for item in main.split('/', 1)]
        if len(types) != 2:
            if strict:
                raise ValueError(f'Malformed media type: "{text}"')
            else:
                types = ['application', 'octet-stream']

        parameters = {}
        to_parse = parameter_text
        while to_parse:
            name, _, to_parse = to_parse.partition('=')
            name = name.strip().lower()
            to_parse = to_parse.lstrip()
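            # startswith() is safe when nothing follows the `=`, where
            # indexing with [0] would raise IndexError.
            if to_parse.startswith('"'):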
                value = ''
                position = 1
                while True:
                    match = ESCAPE_OR_QUOTE.search(to_parse, position)
                    if match:
                        if match.group(0) == '"':
                            value += to_parse[position:match.start()]
                            position = match.end()
                            break
                        else:
                            value += to_parse[position:match.start()] + match.group(1)
                            position = match.end()
                    elif strict:
                        raise ValueError(f'Media parameter "{name}" has no end')
                    else:
                        value += to_parse[position:]
                        position = len(to_parse)
                        break

                _, _, to_parse = to_parse[position:].partition(';')
            else:
                value, _, to_parse = to_parse.partition(';')
                value = value.strip()

            # Skip empty names, e.g. from a trailing `;` followed by whitespace.
            if name:
                parameters[name] = value

        return cls(types[0], types[1], parameters)



media = MediaType.parse('text/html; mad=" oh yeah \\"unbalanced embedded string; crazy end\\\\"; another-thing = yeah')
print(f'Type:    "{media.type}"')
print(f'Subtype: "{media.subtype}"')
print('Parameters:')
for name, value in media.parameters.items():
    print(f'  |{name}| = |{value}|')
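One way the FIXME in `parameter_string` might be addressed, as a minimal sketch rather than a finished implementation. It assumes the RFC 2045 rule that a value must be a quoted string unless it is a `token` (printable ASCII with no spaces or `tspecials`). The names `TOKEN`, `quote_parameter_value()`, and `format_parameters()` are hypothetical and not part of the class above.

```python
import re

# Characters allowed in an RFC 2045 `token`: printable ASCII other than
# space and the `tspecials` set ( ) < > @ , ; : \ " / [ ] ? =
TOKEN = re.compile(r"[!#$%&'*+\-.0-9A-Z^_`a-z{|}~]+")


def quote_parameter_value(value):
    """Return `value` as-is if it is a valid token, otherwise quote it."""
    value = str(value)
    if TOKEN.fullmatch(value):
        return value
    # Inside a quoted string, backslashes and double quotes are escaped
    # with a backslash. Escape backslashes first so that the escapes added
    # for quotes are not escaped a second time.
    escaped = value.replace('\\', '\\\\').replace('"', '\\"')
    return f'"{escaped}"'


def format_parameters(parameters):
    """Serialize a dict of parameters, quoting values only where needed."""
    return '; '.join(f'{key}={quote_parameter_value(value)}'
                     for key, value in parameters.items())


print(format_parameters({'charset': 'utf-8', 'mad': 'oh yeah; "crazy"'}))
```

This escaping mirrors what the `ESCAPE_OR_QUOTE` handling in `parse()` undoes, so a value should round-trip through both. It does not try to handle control characters such as a bare CR, which a stricter version would need to reject.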

@Mr0grog
Member Author

Mr0grog commented Oct 21, 2020

(Also: I now know more than I knew there was to know about the syntax rules for HTTP headers and for Media Types.)

@Mr0grog added this to the v0.5.x milestone Nov 10, 2022
@Mr0grog modified the milestones: v0.5.x, v0.4.x Dec 4, 2023
@Mr0grog moved this to Backlog in Wayback Roadmap Dec 13, 2023