Add info about media type to Memento
#56
Mr0grog added the enhancement and question labels on Oct 21, 2020
First-pass implementation of the complex type I had rigged up when first working on #2, before realizing all the other questions here. It may be useful in the future:

import re

# Matches either a backslash escape (capturing the escaped character) or a
# bare double quote, which ends a quoted parameter value.
ESCAPE_OR_QUOTE = re.compile(r'\\(.)|"')


class MediaType:
    """
    Represents a media type, like ``text/html``.

    For more information about media types, see `RFC 2045`_ and `RFC 2046`_.

    .. _RFC 2045: https://tools.ietf.org/html/rfc2045
    .. _RFC 2046: https://tools.ietf.org/html/rfc2046

    Attributes
    ----------
    type : str
        The top-level type, e.g. ``'text'`` in ``'text/html; charset=utf-8'``.
    subtype : str
        The subtype, e.g. ``'html'`` in ``'text/html; charset=utf-8'``.
    parameters : dict
        All the parameters that were specified, e.g. ``{'charset': 'utf-8'}``
        in ``'text/html; charset=utf-8'``.
    media : str
        The type and subtype without parameters, e.g. ``'text/html'``.
    parameter_string : str
        The parameters serialized as ``key=value`` pairs joined by ``'; '``.
    """
    type = ''
    subtype = ''
    parameters = None

    def __init__(self, type, subtype, parameters=None):
        if not type or not subtype:
            raise ValueError('Type and subtype must be non-empty strings')
        self.type = type
        self.subtype = subtype
        self.parameters = parameters or {}

    @property
    def media(self):
        return f'{self.type}/{self.subtype}'

    @property
    def parameter_string(self):
        # FIXME: parameter values need to be quoted if they contain special
        # characters. See https://tools.ietf.org/html/rfc2045#section-5.1
        return '; '.join(f'{key}={value}'
                         for key, value in self.parameters.items())

    def __str__(self):
        if self.parameters:
            return f'{self.media}; {self.parameter_string}'
        else:
            return self.media

    @classmethod
    def parse(cls, text, strict=True):
        """
        Build a :class:`wayback.MediaType` instance from a media type string,
        like ``'text/html; charset=utf-8'``.

        Returns
        -------
        wayback.MediaType
        """
        main, _, parameter_text = text.partition(';')
        types = [item.strip().lower() for item in main.split('/', 1)]
        if len(types) != 2:
            if strict:
                raise ValueError(f'Malformed media type: "{text}"')
            else:
                types = ['application', 'octet-stream']

        parameters = {}
        to_parse = parameter_text
        while to_parse:
            name, _, to_parse = to_parse.partition('=')
            name = name.strip().lower()
            to_parse = to_parse.lstrip()
            # startswith() avoids an IndexError when nothing follows the "=".
            if to_parse.startswith('"'):
                # Quoted value: scan for the closing quote, unescaping
                # backslash-escaped characters along the way.
                value = ''
                position = 1
                while True:
                    match = ESCAPE_OR_QUOTE.search(to_parse, position)
                    if match:
                        if match.group(0) == '"':
                            value += to_parse[position:match.start()]
                            position = match.end()
                            break
                        else:
                            value += to_parse[position:match.start()] + match.group(1)
                            position = match.end()
                    elif strict:
                        raise ValueError(f'Media parameter "{name}" has no end')
                    else:
                        value += to_parse[position:]
                        position = len(to_parse)
                        break
                _, _, to_parse = to_parse[position:].partition(';')
            else:
                value, _, to_parse = to_parse.partition(';')
                value = value.strip()
            parameters[name] = value

        return cls(types[0], types[1], parameters)


media = MediaType.parse('text/html; mad=" oh yeah \\"unbalanced embedded string; crazy end\\\\"; another-thing = yeah')
print(f'Type: "{media.type}"')
print(f'Subtype: "{media.subtype}"')
print('Parameters:')
for name, value in media.parameters.items():
    print(f'  |{name}| = |{value}|')

(Also: I now know more than I knew there was to know about the syntax rules for HTTP headers and for Media Types.)
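
The FIXME in parameter_string could be addressed along these lines. This is only a sketch, not part of the implementation above: quote_parameter_value and format_parameters are hypothetical helper names, and the token character set is my reading of the "tspecials" rule in RFC 2045 section 5.1.

```python
import re

# Characters allowed in an unquoted RFC 2045 "token": printable US-ASCII
# except SPACE and the tspecials ()<>@,;:\"/[]?=
TOKEN = re.compile(r"[!#$%&'*+\-.0-9A-Z^_`a-z{|}~]+")


def quote_parameter_value(value):
    """Return value unchanged if it is a plain token, else a quoted string."""
    if TOKEN.fullmatch(value):
        return value
    # Escape backslashes first so the escapes added for quotes stay intact.
    escaped = value.replace('\\', '\\\\').replace('"', '\\"')
    return f'"{escaped}"'


def format_parameters(parameters):
    """Serialize a dict of parameters as 'key=value; key=value'."""
    return '; '.join(f'{key}={quote_parameter_value(str(value))}'
                     for key, value in parameters.items())


print(format_parameters({'charset': 'utf-8', 'title': 'a; b'}))
```

Values that are empty or contain whitespace also fail the token check here, so they are quoted as well.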
In EDGI’s web monitoring tools, we often look at the media type (also referred to as MIME type or content type) of a Memento. For example, we need to know how to parse the body in order to extract a title (you’d do very different things for HTML vs. PDF). It might be nice to expose some sort of media type information on the Memento class.

We originally planned to do this in #2, but it wasn’t critical, and there were enough open questions and options that it seemed worth waiting until we had a better design. The open questions:
Should this be as simple as just the media type with no parameters?
Should it be a detailed representation?
Should it convert known non-canonical types into the canonical one?
Should it sniff (guess the type from the response body)?
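
To make the simpler options concrete, here is a rough sketch of what "just the media type with no parameters", canonicalization, and sniffing might look like. None of this is an agreed design: simple_media_type, sniff_media_type, the alias table, and the signature list are all hypothetical and deliberately tiny. It leans on the standard library's email.message.Message to split off parameters.

```python
from email.message import Message

# Hypothetical alias table; a real one would need to be curated.
CANONICAL_TYPES = {
    'application/x-pdf': 'application/pdf',
    'image/jpg': 'image/jpeg',
    'application/x-javascript': 'text/javascript',
}

# Hypothetical signatures for sniffing; just two well-known prefixes.
SIGNATURES = (
    (b'%pdf-', 'application/pdf'),
    (b'<!doctype html', 'text/html'),
)


def simple_media_type(header, canonicalize=True):
    """Return the lowercased type/subtype from a Content-Type header value."""
    message = Message()
    message['Content-Type'] = header
    # get_content_type() drops parameters and lowercases the result. It
    # falls back to 'text/plain' when the value is not of the form
    # type/subtype.
    media = message.get_content_type()
    if canonicalize:
        media = CANONICAL_TYPES.get(media, media)
    return media


def sniff_media_type(body):
    """Guess a media type from the first bytes of a body, or return None."""
    start = body[:512].lstrip().lower()
    for prefix, media in SIGNATURES:
        if start.startswith(prefix):
            return media
    return None


print(simple_media_type('Text/HTML; charset=UTF-8'))
print(simple_media_type('image/jpg'))
print(sniff_media_type(b'%PDF-1.7\n'))
```

A detailed representation like the MediaType class in the comments could sit on top of the same ideas; the sketch only shows how small the simplest version could be.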