i and b vs em and strong #652
Comments
Implementers (or plugin authors) could decide to use asterisk PS: I prefer the underscore for presentational elements, because it fits well with introducing underlined |
Inkwells have been drained and spools of paper emptied rehashing the arguments for whether The short version is that this is an HTML distinction that doesn't exist in Markdown, and use cases vary, so each application may make different choices on how to map the elements to another format. Markdown only has one pair of semantic elements (in spite of having two syntax options) while HTML has two. You can really only map to one of them at a time. Many rendering engines give you the option, or a way to filter tags and write them how you please. Doing both is not really an option at this point for legacy reasons. |
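A minimal sketch of the "filter tags and write them how you please" approach, assuming the renderer has already produced HTML with bare `<em>` and `<strong>` tags. The names `PRESENTATIONAL` and `remap_emphasis` are hypothetical and not part of any CommonMark implementation; a real renderer option or AST-level filter would be more robust than a regex over the output.

```python
import re

# Hypothetical mapping from the tags a renderer emits to the tags we want instead.
PRESENTATIONAL = {"em": "i", "strong": "b"}

# Matches only attribute-less opening or closing <em>/<strong> tags.
_TAG = re.compile(r"<(/?)(em|strong)>")

def remap_emphasis(html: str, mapping: dict = PRESENTATIONAL) -> str:
    """Rewrite bare <em>/<strong> tags in already-rendered HTML."""
    def swap(match: re.Match) -> str:
        slash, name = match.group(1), match.group(2)
        return f"<{slash}{mapping.get(name, name)}>"
    return _TAG.sub(swap, html)

rendered = "<p>My <strong>cats (<em>Cheddar</em> and <em>Whiskers</em>)</strong>.</p>"
print(remap_emphasis(rendered))
# prints: <p>My <b>cats (<i>Cheddar</i> and <i>Whiskers</i>)</b>.</p>
```

Passing an empty mapping leaves the output untouched, so the same hook can serve sites that want to keep `em` and `strong`.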
Personal attacks and assumptions about what I've researched are not appropriate. I'm not new to the topic of what elements Markdown should use. I'm just new to CommonMark specifically.
With one being a not-always-valid subset of the other. |
I wasn't making a personal attack. I did assume since you didn't even hint at knowing any of the background that you might not be aware of it. In any event if you are aware of some of the background then surely you know jumping in with a dogmatic assertion that one set of tags is better isn't going to resolve things.
No, one is not a subset of the other. It's more like two competing standards, one focusing more on structural semantics and the other on presentation and legacy. If one were a subset of the other, the superset would always be interchangeable with some loss of meaning. Such is not the case: one could intend a kind of emphasis that was not supposed to be styled with italics, and just as well italics may be used for things other than emphasis. |
I didn't mean to step right into a fight here. I love markdown and I know there is a lot of hard work behind it (and projects adjacent to it) over the years. I wouldn't care about this issue if I wasn't invested in the language and seeing it as the future of markup. And, I understand that I'm probably a decade too late (or I don't know when the CommonMark project started). It seems pretty locked in. But this language might still be in use fifty years from now. I hope Markdown's future is longer than its past. The W3C puts it like this in the HTML standard:
Emphasis is a subset of that. But yeah, their description of b isn't a superset of strong emphasis on the other hand, although historically b has been used that way. I tried to be careful when phrasing the original post in this issue thread. I wrote "Is there a way to get i" rather than "em is always wrong".
in source code is not correct. Seeing
has legacy precedent and is not wrong per se, even though I'd rather use em there. Markdown by its very nature as a plaintext-adjacent, "email tradition based" language does have its roots in presentation and legacy rather than structural semantics, which is why I feel that a way to easily make i or b is appropriate. Just as easily as I would write
in an email. But it's for the sake of structural semantics that this matters. A web-scrapin' robot going through web pages and trying to scan for what is emphatic text might go "Wow, people back in the early 21st century really got stoked about their French phrases!" Yes, i or b is sometimes a little blunt, sometimes a little non-specific, sometimes not optimal. But when I overuse em or strong that's sometimes flat out a falsehood, sometimes flat out what I don't mean. I've seen some semantical horror stories over the years, like people using h1 for centering images (because that particular forum had h1 headers centered) or h6 when semantically they should've used h1 "because they think the smaller letters look cuter". There's no way to stop all such misuse, but we don't have to build it into the language, either. |
You're talking to somebody who uses custom Pandoc filters to overload The way you get |
To go back to the first question:
I do think they often are the right thing, but indeed, they aren’t always. To get I do think it’s unfortunate that CommonMark does not say anything about semantics. And that its definition (“6.4 Emphasis and strong emphasis”) is not aligned with HTML (In HTML, nested emphasis is used for “strong” emphasis, whereas the |
What the CM reference implementation should do, though, is to retain information in its AST about which character has been used in the source. PS: https://talk.commonmark.org/t/em-strong-vs-i-b-or-cite-dfn-etc/1242 https://talk.commonmark.org/t/revisting-underline-healthcare-documents/3078/3 |
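To make the "retain the delimiter in the AST" idea concrete, here is a rough sketch under stated assumptions. The node classes and the `split_by_delimiter` option are hypothetical and do not mirror the reference implementation's actual data structures; the underscore-means-presentational mapping is just the preference voiced earlier in this thread.

```python
import html
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Text:
    value: str

@dataclass
class Emphasis:
    delimiter: str   # "*" or "_", exactly as typed in the source
    strong: bool     # True for doubled delimiters (** or __)
    children: List["Node"] = field(default_factory=list)

Node = Union[Text, Emphasis]

def tag_for(node: Emphasis, split_by_delimiter: bool) -> str:
    # Assumed convention: underscores become i/b, asterisks stay em/strong.
    if split_by_delimiter and node.delimiter == "_":
        return "b" if node.strong else "i"
    return "strong" if node.strong else "em"

def render(node: Node, split_by_delimiter: bool = False) -> str:
    if isinstance(node, Text):
        return html.escape(node.value)
    tag = tag_for(node, split_by_delimiter)
    inner = "".join(render(child, split_by_delimiter) for child in node.children)
    return f"<{tag}>{inner}</{tag}>"

tree = Emphasis("_", False, [Text("foo")])
print(render(tree))                           # prints: <em>foo</em>
print(render(tree, split_by_delimiter=True))  # prints: <i>foo</i>
```

With the option off, the output is identical to what renderers produce today, so keeping the delimiter in the tree costs nothing for existing users.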
@snan I agree with most everything if not everything you said. I agree with the following in spirit
but I don't think Markdown will last even another ten unless it evolves*. It's still mostly used by technical types, myself included, who are comfortable writing for machines -- that is to say, quite used to and adept at thinking in terms of, What do I need to do to get the machine to do what I want?. Markdown was definitely a step in the right direction away from HTML for authoring. But we need to make more steps. *I'm not sure it can. I think/hope something Markdown-like will replace it. A bit of a reboot is necessary. |
I was under the (mistaken?) impression that that was for HTML output only; i.o.w. it's more of a "passthrough" feature than a "breakout" feature. In pandoc 2.5, which I have at hand, when compiling the text to LaTeX, it just drops those
♥ I do that too, on my own system, (lua ftw) but the reason I just found out about CM is that Stack Exchange announced that they are going to adopt it and I was like "OK, so it's no longer Gruber that I have to go bug about this". |
I'm seeing a lot of sites switching over to wysiwyg or wysiwym but I'm not wholly on board with that. I love markup languages. |
Wow, so it is a subset of b after all! |
That impression is correct: though when going to LaTeX, it doesn’t really matter whether
The HTML spec also says on
I don’t think I agree that it’s good to see |
It just drops the tags. So if you want to publish to both TeX and HTML you're sol if you use
This language to me also implies fallback, catchall, default. When you can be more specific, you should. With a visual/presentation based markup like the email-derived |
As alluded to upthread, we know that through the life-changing magic of CSS, em and strong aren't strict subsets of i and b respectively. You can style it to use underlining or small caps to emphasize. So I don't mean a strict subset, I mean… kinda a subset. It's correct to say that one is semantics and the other is presentation. It's just that
I'm not disputing that we want semantics. I just don't want wrong semantics.♥ |
I'm also definitely not saying that the solution is that markdown's output for em and strong should instead always be i and b. I've tried avoiding taking that position in this thread. It's what I would do, but I realize that that's a compromise with some serious downsides, and I'm open to other solutions. |
That to me sounds like a Pandoc problem, which I was under the impression could turn HTML into TeX.
I do see that you never proposed that in posts; but to me the title of this issue, “i and b vs em and strong”, pits them against each other. I do think em and strong are better defaults than i and b, but I recognize they aren’t always. I would say that CommonMark talking about semantics is an acceptable solution, ushering users to care about semantics instead of presentation. And that i and b created according to @Crissov’s suggestion would be a welcome addition in userland. |
Pandoc can indeed convert HTML to LaTeX. However, here the input format is Markdown, and pandoc drops raw HTML when rendering to non-HTML formats. (This behavior is at least sometimes what you want.) However, you can always use a lua filter that converts these raw HTML nodes to something that makes sense in your target format. |
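The filter idea can be illustrated without pandoc itself. The sketch below is plain Python over a simplified stand-in for an inline list, not pandoc's API or a real Lua filter; the tuple layout, `HTML_TO_LATEX`, `convert_raw_html` and `write_latex` are all hypothetical names. It assumes, as in pandoc, that the opening and closing tags arrive as separate raw HTML nodes.

```python
from typing import List, Tuple

# (kind, format, text): a simplified stand-in for an inline node.
Inline = Tuple[str, str, str]

# Assumed mapping from raw HTML tags to LaTeX fragments.
HTML_TO_LATEX = {
    "<i>": r"\textit{", "</i>": "}",
    "<b>": r"\textbf{", "</b>": "}",
}

def convert_raw_html(inlines: List[Inline]) -> List[Inline]:
    """Turn known raw HTML tags into raw LaTeX; leave everything else alone."""
    out: List[Inline] = []
    for kind, fmt, text in inlines:
        if kind == "raw" and fmt == "html" and text in HTML_TO_LATEX:
            out.append(("raw", "latex", HTML_TO_LATEX[text]))
        else:
            out.append((kind, fmt, text))
    return out

def write_latex(inlines: List[Inline]) -> str:
    """Toy writer: keeps plain text and raw LaTeX, drops raw nodes in other formats."""
    return "".join(text for kind, fmt, text in inlines
                   if kind == "str" or fmt == "latex")

doc = [("str", "", "a "), ("raw", "html", "<i>"), ("str", "", "foo"),
       ("raw", "html", "</i>"), ("str", "", ".")]
print(write_latex(doc))                    # prints: a foo.
print(write_latex(convert_raw_html(doc)))  # prints: a \textit{foo}.
```

The first call shows the tags being dropped while the text survives; the second shows the same document after the conversion step. LaTeX escaping of the plain text is left out to keep the sketch short.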
I love lua♥ |
Here's just one idea (just green hat brainstorming for a solution here): What if |
Intuitively I feel the same way; I usually do think "emphasis" when I use the asterisks, and do usually think presentationally cursive when I use the underscore (sometimes that part of my brain is sloppy and thinks presentationally cursive when it should be thinking emphasis ← wow, I just did it in this sentence involuntarily, those were underlines just then). However, in some implementations asterisks work inside of words like this and underscores don't, like t_hi_s. Are people more likely to use presentational cursive in words or emphasis? I guess emphasis so this paragraph isn't much of a "however" and instead should be an "additionally" since I come down on the same divide as you do, Crissov. And that it might be too late to change, I wouldn't know if that was true. Crystal ball is on the fritz over here |
All proper implementations of CommonMark support asterisks inside words (but not underscores), while only some implementations of Markdown do. |
Which only strengthens what I was trying to say about that, rather than contradict it.♥ |
Oh, I just saw that this is getting downvotes. And I'd rather find a perfect solution than a compromise that no-one is truly happy with, but it's frustrating that we aren't getting anywhere nearer a solution here. For those who use pandoc for HTML only, it's not a big problem because they can do manual italics, but it's difficult when we want to use the same source documents for standards-compliant HTML and for ConTeXt or LaTeX. |
A month has passed and I find I'm OK with writing However, there are many times where the i, b or cite is getting lost. On Reddit, on Stack Exchange, and sometimes in Pandoc. That's why I wanted i, b and cite to become "part of the language" or at least some sort of recommendation that implementations don't throw away this information. |
Here is a thread where that has been an issue.
That doesn't help someone who is posting on Reddit or Stack Exchange or the hundreds of other sites where these render into em. CommonMark implementations make the web full of Example 393 in CommonMark's own spec is evidence of this. The call is coming from inside the house! 😱 |
If by “part of the language” you mean a new syntax, I’d probably be against it. I worry that the grammar will become too crowded. Depending on what design you come up with, it’s either likely easy to type, which will also mean that it would break lots of existing markdown. Or it’s complex to type, but then I’d prefer something like generic directives. For a recommendation, I dunno.
I’m not quite sure what you’re saying, to phrase it differently: I am in favor of implementations adding options to use I don’t think we need to add that everywhere in the spec though. I don’t think we need to describe that implementations are free to use Can you clarify what you don’t like about that example? |
Sure, thanks for the question, that's illustrative of the issue so it's good to dig in a li'l deeper: The example is:
That is not correct HTML, which should be:
or, depending on the context, maybe even:
Linnaean names, like that Latin name for balloon plant used in that example, are always marked italics, cursive, oblique, or otherwise text-decorated, but not emphasized;
Referring to a poodle as "a dog" is slightly weird but not that bad and it's technically correct (and that's what we're doing when we're using Referring to a collie as "a poodle" is, on the other hand, quite wack (but that's what we're doing when we're using And before someone asks: "But we should express semantics, not style. I heard someone back in the nineties say that a lot of the time people are wrongly using Yes, it's true that I backed off from this argument a few years ago because of this argument: "We support raw HTML so people can type out But two things are becoming clear to me. 1A. People are using CommonMark-derived converters in places where raw HTML is (and should be) turned off, like on public forums and comment 2. Not everyone is, wants to be, or needs to be a linguistics nerd. People shouldn't have to learn the specific minutiae of when to use em, cite, or i. They just want the text to look slanted so they jam stars or underscores around. Making That's why my recommendation is this: Sites where raw HTML is turned off (as it should be, for public text inputs) should emit Installations where markdown is used as a tool for writers, where it's a shortcut for HTML as opposed to a replacement for it, and raw HTML is allowed, may optionally continue to emit That's what I would use for my own blog where I can type out Not everyone should have to learn this stuff but that doesn't mean it's OK that the web is littered with wrong semantics like That's more wrong than
Yeah. It became clear upthread that that particular idea (elevating
I want to be able to use italics and bold on Reddit, StackExchange, here on GitHub, and dozens of other sites that pass the buck by saying "We're only doing what CommonMark says".
It's becoming clearer and clearer to me that we do need to be explicit about that. Summary: It should be i and b instead of em and strong (at least on most of the websites out there like GitHub, Reddit, StackExchange, wikis etc). I like that |
OK, so there is a typographic convention of how things should look.
I don’t see any reason to conclude that Latin names must never be marked as stress emphasis as described by the HTML spec: https://html.spec.whatwg.org/multipage/text-level-semantics.html#the-em-element. I can understand that it might be better to remove things with certain typographic conventions here. Perhaps: My **cats (*Cheddar* and *Whiskers*)**. <p>My <strong>cats (<em>Cheddar</em> and <em>Whiskers</em>)</strong>.</p> …might be an improvement. I am sure there’s an example we can think of that you accept that embeds
You can type
You answer this yourself: “But we should express semantics, not style.” HTML is about semantics, not about presentation.
Sure
Using your own terms around poodles and collies: just because 90% of people don’t give a hoot about semantics, doesn’t mean we need to remove all semantics and go with HTML 2 again.
I have supported this: #652 (comment)
I am not sure why you deem one more or less wrong than the other. Both can be right. Both can be wrong.
If you’re interested in a markdown-like language that does make separate tags a part of the language, you might enjoy https://mdxjs.com.
That’s not what your PR there does. You break CommonMark there by changing everything for everyone. |
Yes, that's a really good way to phrase what the semantics of
Right, offsetting from the normal prose is what we want, as opposed to a specific visual representation of that offset.
I mean, they can be part of stress, like you could say I've told this story before but I remember an old social media site (now defunct) in the early 00s and I saw someone had managed to center an image on their profile, something that the dinky markup of the time didn't allow. But when I looked under the hood, I saw to my shock & horror that they had marked the image as an h1, which had caused the site CSS to center it. That's not what h1 means. And that's as bad as using stress to mark Latin names. Using italics to mark them, sure. Because the point is to offset them from prose. (I wrote as much in my previous post, saying " italics, cursive, oblique, or otherwise text-decorated". I've seen them underlined in old type-written manuscripts and that's fine too, for example. And
That is not better. That'd be super weird, semantically, to stress their names that way.
Yeah, I can think of a few. The other examples use text like foo and bar and that'd be fine here. Using names is not good.
If that's true, GitHub is not an applicable example for this problem. But there are many other sites out there where that's not possible because they have turned off raw HTML, and, there are also users who don't understand (and shouldn't have to understand) when to use which.
Cite and em were added to HTML at the same time as i and b were, with HTML 2 (as you know, since you linked to the HTML 2 RFC which does mention em and strong).
Being overly broad is less wrong than being specific-but-wrong. Calling a poodle a "dog" is less wrong than calling a collie a "poodle".
The problem isn't my own websites where I have control over what Markdown implementations to use. I personally already have a setup that lets me write my choice of em, cite, i, strong, or b. The problem is sites like Reddit, StackExchange and many, many others where A: users have no way to type i or cite as distinct from em, and B: they shouldn't have to, they shouldn't have to learn to do that nor to learn to understand hyper nitty-gritty semantics perfectly. And there's no way to automatically detect when they mean cite or em or i so I propose we use i. Offsetting from prose is what they want, even though they might mean to do that offsetting for emphasis purposes 70% of the time. In hindsight it was a bad idea for HTML to create em and cite and strong tags because they presupposed every single formatted text online needs to go through an editor with enough linguistics chops to distinguish between which to properly use when. That's fine for institutions but an unreasonable requirement for a discussion site or other public-writable spaces. I'm a linguist—I can nerd out enough to know when to use em and when to use cite and when neither is applicable and I need to use the superset, i. And even then I make mistakes every now and then—but I'm not a biologist so, to reuse the poodle/collie example: if every website like Reddit or StackExchange required me to use one syntax when talking about poodles, one syntax when talking about collies, and another when talking about non-poodle non-collie dogs, I'd be in trouble. Letting
The maintainer hadn't written that response yet when I posted here. I think the default should be i and b, with em and strong being tucked away as an option (only to be turned on by people who know exactly what they are doing and who can emit cite and i and b by other means, such as raw HTML). Changing everything for everyone is the point. There's a lot of collies marked "poodle" out there on the web. If they can be turned into "dogs" that'd be a win for semantics. These sites look to CommonMark as an authority on this. They're like "we're emitting em and strong because that's what CommonMark tells us to do". That wasn't necessarily CommonMark's intent—which was more to clarify the specifics of nesting and overlapping and so on—but that's what has happened and that gives CommonMark a responsibility to clear this up. |
Whatever tool you give by default people are bound to conflate them. But if you give them a tool that is inherently presentational rather than semantic what you will end up with is a bunch of presentational markup and no structure. If you give them a semantic tool and they use it wrong they will produce the opposite problem. As a dude producing books the reason I force authors and editors to use Markdown and then convert the results is specifically to take away the formatting tools and make them concentrate on the content. I don't want them fiddling with what font size to use for a heading, I want them to think about whether this heading is a section or a subsection. Typesetting and will take care of the style. Markdown is especially useful in this context because it gives almost exclusively structural markup options. For the few cases where more is needed, divs or spans with classes can be used. For example I have authors use I would suggest passing out the semantic markup tooling by default and making everybody else adapt is better than giving people presentational markup in Markdown. Either way you will get mistakes and stuff marked up wrong, but one fits the pattern of other provided tooling in outputting semantic markup. |
Again, If Markdown had been written by a level-headed never-emphatic often-book-citing literature nerd,
Yes! (I also produce books, for what that's worth.) That's what I want too! Correct semantics. And these authors use (many are forced to use, even)
The solution is to give them very few things. We can't fully get away from this. I've seen Markdown end users use But one thing exists that can mean emphasis, citation, or other offset prose and that thing is If you write
Right. And 99% of people writing text in Markdown online don't understand semantics. They do understand that if they slam asterisks around a phrase, it looks like this, so that's what they do.
That is great for real Markdown, and CommonMark also specifies how raw HTML. But the problem is all those sites that turn off all that fancy stuff but still say "We emit (The secondary problem is that even if that stuff was left on, 99% of people writing Markdown into forum sites, question sites, chats like Slack or Discord or Matrix, they wouldn't know how to use it correctly.)
So you have authors who are brilliant enough to be able to use spans (which isn't necessarily correct, That's great for you and your publishing pipeline. That doesn't help me who has to wade through waist-high levels of misapplied I don't want to take away your toolset. You have editors and typesetters polishing the author's texts. All those sites who use "markdown-derived" markup in a wikilike-way open to the public. Those who don't have an editor team and a typesetting team.
Yes. That's exactly the problem. You've hit the nail on the head why this is an issue. Epubs and screenreaders getting forcefed In a printed book, or PDF, all semantics are replaced and rendered into style anyway.
Yes, but UX and conveyance and good defaults can make a big difference, and over-broad (marking poodles as "dogs") is less wrong than wrong-specific (marking collies as "poodles").
A kind of irrelevant response to the specific request at hand here, which is to have the libraries these sites and apps use emit i and b instead of em and strong. Tooling can still handle i and b. |
I’m kinda lost what you are proposing. You’ve before said that you don’t always want Many ideas have been thrown around over the years. Perhaps it’s better to open new issues for ideas that do have consensus? |
OK. I'll repeat it here. Thanks, it's good to be super clear. 1. On websites/chat apps/forums/question sites etc for the general public, The CommonMark spec is technically already ambiguous about this. It's mostly about how specifically nesting & parsing works. You can then use it to generate TeX or HTML or SVG or RTF or anything you want. It's already not a violation of CommonMark to emit b and i. However a lot of implementers out there who incorrectly emit em and strong say "We are only following what CommonMark says". 2. On installations used only by technical editors who know what the heck they are doing, it's OK that the shorthands This pull request is a very mild first step in the right direction. I would like to go much further.
Yes, but I then wrote three years later:
The solution has kept being really bad and wrong these past years; the argument that convinced me back then has not borne out in practice. It's been a mess of wrong ems out there.
Some of the people protesting have been misunderstanding what I ask, or at least have expressed their objections in a way that came across as if they hadn't understood what I was asking for and why.
That doesn't fix the problem that a lot of websites out there fill the world with wrong em and strong in the name of CommonMark. |
Re 1.: I think you’re free to tell people that the output you prefer is semantically better for user generated content.
Do you have links other than your comrak PR? I’ll try to summarize my personal opinions again:
|
Aw yes! Such an appendix would be awesome, to the extent that it reflects my position that i and b are a better default for many installations (admittedly not all installations). It seems like we are on the same page.
You're right. The PR doesn't solve it and if we find a solution that solves it better, the PR isn't necessary. So the PR is not good. I guess I was desperate and over-compromising in a way that didn't bring the issue all the way over to "solved".
I don't know about that but practical differences aren't the entirety of the issue. It's also that what's right is right.
I don't agree that, since i semantically indicates "offset from prose", which is the intent rather than a specific styling. (Yeah, that's a shift in my position compared to what I've been saying in some posts upthread, a shift that happened here. I think the new position is even stronger.)
It doesn't to the extent that they emit i and b, like all wiki formats and bbcode style formats I have ever seen up until Markdown broke the trend (as far as I know) and changed it to em and strong. In these other formats can semantically express "offset from prose" and "strongly offset from prose" in a way that covers emphasis, citation, and other uses such as Linnaean names. This isn't a slag on upstream Markdown since it, for its particular use case at the time, was just a tool for HTML writers, a shorthand, a complement that made it easier to write common things like paragraphs and emphasis and links and blockquotes, but still let you drop down to HTML for anything special and unusual, like tables and citations and other non-emphatic use of i.
Half of the problem exists.
Maybe. Header level issues and other issues with ugc are beyond the scope of this specific issue #652, which is that i and b, being "supersets" in a way of em/cite and strong respectively, do solve a lot of the "wrong semantics" out there. In an overly broad way—a lot of poodles will be relabeled from “poodles” to “dogs”, but with the upshot that a lot of collies will also be (correctly) relabeled from “poodles” to “dogs”. |
Our use of I agree that If I could be convinced that
There's also the "conservativeness" argument noted by @woorm: if we made this change, it could cause substantial breakage and inconvenience. (E.g. style sheets would no longer work as expected.) So we'd want to be sure that there's an equally substantial benefit. In short, I'm not yet convinced we should make this change, but I am sympathetic to the idea. Perhaps others can tell us new things that might bear on the issue. |
Right, and that was appropriate for Markdown which was made for technical bloggers who had access to fallback i, cite, and b when needed.
That makes sense to me. Thank you. |
I still think optional separation makes the most sense.
|
As far as the HTML standard goes, I'd also like to add that " There is however one reason why you might want to keep As a counter example, imagine rendering all italics with |
It certainly doesn't help that the quick reference and crash course on commonmark.org itself states that |
Em and strong aren't always the right thing. Is there a way to get i or b?
I know we love semantics over rendering but em and strong are sometimes hypercorrect. Looking at the html source of someone who wrote in Markdown and seeing em and strong when they clearly meant i or b, or sometimes cite.
For example when introducing a new name or term in another language.