Move slug generation into FedWiki.slug #248

harlantwood · 2012-06-06T19:43:06Z

We had a good discussion today on slug generation, and a scale of "permissivity". Our current slugs have a low permissivity, call it 10 out of 100.

Github's Gollum wiki has very permissive slug generation, call it 90 out of 100.

It's easy to imagine that we could have it both ways -- as long as everything we cut out of a more permissive slug is always cut out of the less permissive slug, we can always use the less permissive conversion to compare two slugs.

See also the slug discussion in #156.

To get us started, I've moved the slug generation into a module, which mostly mirrors the way slugs currently work. The difference is that multiple dashes in a row are removed, as well as leading and trailing dashes. THIS WILL BREAK EXISTING DATABASES, SO WE SHOULD THINK CAREFULLY BEFORE INTEGRATING IT.
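As a rough illustration of the change described above, here is a minimal sketch. It is not the code from this pull request: the module name `FedWikiSketch`, the method `legacy_slug`, and the exact regular expressions are assumptions, and "removed" is read here as "runs of dashes collapse to a single dash".

```ruby
# Hypothetical sketch; the real FedWiki.slug in this pull request may differ.
module FedWikiSketch
  # Assumed existing behaviour: whitespace becomes a dash, anything that is
  # not an ASCII letter, digit, or dash is dropped, and the result is lowercased.
  def self.legacy_slug(title)
    title.gsub(/\s/, '-').gsub(/[^A-Za-z0-9\-]/, '').downcase
  end

  # Assumed new behaviour: same as above, but runs of dashes collapse to one
  # and leading and trailing dashes are stripped.
  def self.slug(title)
    legacy_slug(title).squeeze('-').gsub(/\A-+|-+\z/, '')
  end
end

puts FedWikiSketch.legacy_slug('Pride & Prejudice')  # pride--prejudice
puts FedWikiSketch.slug('Pride & Prejudice')         # pride-prejudice
```

Under these assumptions, a page already stored under the first slug would no longer be found under the second, which is why the change can break existing databases.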

To run the specs:

bundle exec rspec spec/slug_reference.rb


harlantwood · 2012-06-07T04:24:00Z

In a separate commit (93ad271), I added multilingual support:

www.example.com/les-misérables
www.example.com/تماس-با-ما
www.example.com/ƒåø

This is optional, but I advocate strongly for it. Otherwise the examples above are damaged or disappear entirely.
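To make "damaged or disappear entirely" concrete, here is a hedged sketch that compares an ASCII-only slug with one that keeps letters and digits of any script. It is not the code from commit 93ad271: the function names `multilingual_slug` and `ascii_slug` are hypothetical, and lowercasing only ASCII letters is an assumption based on the title of commit babc03c.

```ruby
# Hypothetical sketch of a multilingual slug; not the code from commit 93ad271.
def multilingual_slug(title)
  title
    .gsub(/\s/, '-')              # whitespace becomes a dash
    .gsub(/[^\p{Alnum}\-]/, '')   # keep letters and digits of any script
    .tr('A-Z', 'a-z')             # lowercase ASCII letters only (assumption)
    .squeeze('-')                 # collapse runs of dashes
    .gsub(/\A-+|-+\z/, '')        # strip leading and trailing dashes
end

# Assumed ASCII-only behaviour, for comparison.
def ascii_slug(title)
  title.gsub(/\s/, '-').gsub(/[^A-Za-z0-9\-]/, '').downcase
end

['Les Misérables', 'تماس با ما', 'ƒåø'].each do |title|
  puts "#{multilingual_slug(title)} | #{ascii_slug(title)}"
end
```

With the ASCII-only version, 'Les Misérables' loses its accented letter and 'ƒåø' becomes an empty string.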

WardCunningham · 2012-06-13T20:04:08Z

I like the approach but agree that this code is not complete enough to be deployed. We have constraints that come from three desirable compatibilities:

1. A server's db must be compatible with the server itself.
2. A server (and its db) must be compatible with every client ever.
3. A client must be compatible with every server ever.

The second (every-client) and third (every-server) constraints arise because any random federated wiki client could request pages from any random federated wiki server.

A server can apply whatever search algorithms it wants so long as it delivers the proper page when requested and delivers a 404 when that is the correct response. A server stores a more complete version of the page title which can be used in this search.

It's also possible that we could reliably convert a slug from one algorithm to a slug from another. This is the basis of the permissivity discussion above. For example, if we could show for slug functions F and G that F(x) == F(G(x)) for all x, then we could say G is as permissive as F or more so. Intuitively, G permits more characters through than F.
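The property F(x) == F(G(x)) can be spot-checked on sample titles. The sketch below is only illustrative: the helper `permissivity_counterexamples` and the lambdas `f`, `g`, and `h` are hypothetical stand-ins, not the project's slug functions, and passing on a handful of samples proves nothing "for all x".

```ruby
# Returns the sample titles for which F(x) != F(G(x)). An empty result is
# consistent with G being as permissive as F, at least on these samples.
def permissivity_counterexamples(f, g, samples)
  samples.reject { |x| f.call(x) == f.call(g.call(x)) }
end

# Hypothetical F: ASCII-only slug.
f = ->(s) { s.gsub(/\s/, '-').gsub(/[^A-Za-z0-9\-]/, '').downcase }
# Hypothetical G: like F, but keeps letters and digits of any script.
g = ->(s) { s.gsub(/\s/, '-').gsub(/[^\p{Alnum}\-]/, '').tr('A-Z', 'a-z') }
# Hypothetical H: like F, but collapses runs of dashes.
h = ->(s) { f.call(s).squeeze('-') }

samples = ['Welcome Visitors', 'Pride & Prejudice', 'Les Misérables']
p permissivity_counterexamples(f, g, samples)  # => []
p permissivity_counterexamples(f, h, samples)  # => ["Pride & Prejudice"]
```

In this sketch, letting extra characters through keeps the property on the samples, while collapsing dashes loses information that F would have kept, so the property fails for 'Pride & Prejudice'.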

It has been suggested that we could allow more permissive slug functions into clients so long as servers that use (or might have used) a less permissive slug function try applying that function and repeating a query before issuing a 404.
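The repeated-query idea could look roughly like the following. This is a sketch under stated assumptions, not server code from the project: `PageStore`, its in-memory hash, and the `legacy` lambda are all hypothetical.

```ruby
# Hypothetical in-memory page store keyed by slug.
class PageStore
  def initialize(pages, legacy_slug)
    @pages = pages              # slug => page data
    @legacy_slug = legacy_slug  # the less permissive slug function
  end

  # Returns [status, page]. Tries the slug as requested, then retries with
  # the less permissive function applied before giving up with a 404.
  def fetch(requested_slug)
    return [200, @pages[requested_slug]] if @pages.key?(requested_slug)

    retry_slug = @legacy_slug.call(requested_slug)
    return [200, @pages[retry_slug]] if @pages.key?(retry_slug)

    [404, nil]
  end
end

# Assumed less permissive (ASCII-only) slug function.
legacy = ->(s) { s.gsub(/\s/, '-').gsub(/[^A-Za-z0-9\-]/, '').downcase }

store = PageStore.new({ 'les-misrables' => 'page JSON' }, legacy)
p store.fetch('les-misérables')  # => [200, "page JSON"]
p store.fetch('no-such-page')    # => [404, nil]
```

The retry only helps when the stored slug can be derived from the requested one, which is the F(x) == F(G(x)) condition.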

Now I am worried that the repeated-query approach is not sufficient to handle all three constraints enumerated above. Further, I am not sure that our new function is always more permissive. Specifically, the desire to eliminate some redundant hyphens makes it less permissive while permitting international alphabetic characters makes it more permissive. Yikes.

My feeling now is that we won't be able to meet every constraint all the time. However, if we could characterize the pages that will suffer, and under what circumstances they do so, well, that would be awesome. Then we can move ahead with confidence.

(Aside: Have we given up case insensitivity or is that handled properly for all alphabets that have case?)

harlantwood · 2012-06-20T06:44:08Z

I agree that the less+more permissive changes are an issue. Note that all of the changes are in the area of the slug specs that was marked as "problematic".

When I run the tests, these are the pages that would break if they had been stored already on existing servers:
'Welcome Visitors'
' Welcome Visitors'
'Welcome Visitors '
'Pride & Prejudice'
' - - - - '
' '

'Pride & Prejudice' is concerning because it is a legitimate title (old servers would have saved this as 'pride--prejudice').

Strings like ' Welcome Visitors' (--welcome-visitors) are unlikely, but could have been passed in as titles from converter scripts reading from other sources.

The ways forward as I see them:

1. We could write a converter script to upgrade the slugs on the filesystems (and possibly CouchDBs) on servers
2. We could accept that a few servers may break on a few pages
3. We could leave the slug generation allowing multiple hyphens, as it does now on existing servers

I vote for 2 or 3.
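For reference, the first option (a converter) might start from a dry-run plan like the sketch below. It assumes the only change is dash cleanup, so the new slug can be derived from the stored one; with multilingual slugs the converter would instead need the full title stored with each page. The names `tidy` and `plan_slug_upgrade` are hypothetical, and nothing here touches the filesystem or CouchDB.

```ruby
# Assumed new rule, applied to an already stored slug.
def tidy(slug)
  slug.squeeze('-').gsub(/\A-+|-+\z/, '')
end

# Returns [renames, conflicts]: a hash of old slug => new slug, and the old
# slugs that cannot be moved because the target is empty or already taken.
def plan_slug_upgrade(stored_slugs)
  renames = {}
  conflicts = []
  taken = stored_slugs.select { |s| tidy(s) == s }  # slugs that stay put
  stored_slugs.each do |old|
    target = tidy(old)
    next if target == old
    if target.empty? || taken.include?(target)
      conflicts << old
    else
      renames[old] = target
      taken << target
    end
  end
  [renames, conflicts]
end

renames, conflicts = plan_slug_upgrade(
  ['welcome-visitors', '--welcome-visitors', 'pride--prejudice', '---------']
)
renames.each { |old, target| puts "#{old} -> #{target}" }  # pride--prejudice -> pride-prejudice
p conflicts  # => ["--welcome-visitors", "---------"]
```

Reporting conflicts instead of overwriting leaves the decision about colliding pages to the server operator.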

(Aside answered in babc03c.)

harlantwood · 2012-06-20T06:50:12Z

Does anyone know offhand whether JS regexps support POSIX character classes such as [:alnum:]?

Does anyone want to take on the Coffee side of this upgrade, once we finalize the details of the way forward?

harlantwood added 4 commits June 6, 2012 12:21

Move slug generation into FedWiki.slug.

709041d

To run: bundle exec rspec spec/slug_reference.rb

Remove out of date specs -- see spec/slug_reference.rb for current slug specs

6b47177

Include slug_reference.rb in specs to run

ae467d9

Allow alphanumeric characters of any language in slugs; spec tweaks

93ad271

Clarify intention in multilingual spec

6fc8a23

Specify that alphanumeric extended chars are not lowercased

babc03c

harlantwood mentioned this pull request Dec 20, 2012

Do We Want some Common Name Processing? #156

Open
