[PATCH] changeset: add filename derived anchors to file diffs (and use them)

Mon Feb 22 17:04:56 UTC 2016

On Mon, Feb 22, 2016 at 3:16 PM, Søren Løvborg <sorenl at unity3d.com> wrote:
> Please change "path2anchor" to "path_to_anchor", we're not that pressed
> for space. ;-)

OK

> Defining anchors using <a name="XYZ"> is deprecated; the code should be
> using "id" attribute. (Since there can only be one id per element, it
> can be placed on the next div.)

OK

> As to why browsers choke on special characters in anchor names, it's
> probably related to this requirement in the HTML standard:
>
>     ID and NAME tokens must begin with a letter ([A-Za-z]) and may be
>     followed by any number of letters, digits ([0-9]), hyphens ("-"),
>     underscores ("_"), colons (":"), and periods (".").
>
> (When I follow the rule about starting with a letter, I observe no
> problems with periods across browsers. Can you elaborate on where
> you saw problems with periods?)

I will explain below.

> Besides introducing illegal characters into the attribute, the ad-hoc
> escaping can also cause problems if filenames contain actual ! or \
> characters. I'd prefer a standard solution that doesn't have weird
> corner cases.
>
> As for '/', that character only appears in the quoted output because
> urllib.quote considers it "safe" by standard. If we disable that, then
> replace '%' with a valid character and unconditionally add a letter in
> front of the filename, we get a guaranteed valid ID:
>
>     id = 'A' + urllib.quote(filename, '').replace('%', ':')
>
> Alternatively, if we don't care about making the string semi-readable,
> we could simply use the SHA-1 hash of the filename:
>
>     id = hashlib.sha1(filename).hexdigest()
>
> This eliminates all issues with special characters, and also guarantees
> a consistent id length independent of filename length.

Let me start by saying that this is exactly what I want to avoid.
There are already links that seems to be based on some sort of
hashing. These are basically impossible for an enterprising user to
construct or guess. What I want is to make it possible (and easy!) for
a user to correctly guess the link that it would need to manually
create to open Kallithea on the right changeset on the right file. I
plan to use this to let a user quickly start reviewing a file from
TortosieHg. That is why I want to make anchors based on the filenames.
Ideally the anchors would be the filenames themselves, but for various
reasons that does not work.

My first try was to just use urllib.quote. However I found that when
the files have periods and slashes in them the behavior of all the
browsers that I tried (Firefox, Chrome and IE11!) is really odd. It is
a bit hard to explain though. The problem is that if the anchor name
has a or a forward slash period, the link to the anchor works fine,
_but_ for some reason the URL that the browser displays once you click
on the anchor link is not the one that you put on the source anchor
href! For some reason the hash part of the URL is lost. Because of
that, if you reload the page after you navigate to a file using the
file link, the reload fails. As I said it is a little hard to explain
so let me give you an example:

- Let's say that I modify path_to_anchor() to:

    return "file:" + urllib.quote(path, '').replace('%', ':')

as you suggest, then:

- I have a file "src/downlink/MY_FILE.cpp" on a repository called "MY_REPO"
- I navigate to changeset 2e4374a993906c15df4646af0619dfda72d42462 on
my repository by going to the following URL:
http://localhost:5000/MY_REPO/changeset/2e4374a993906c15df4646af0619dfda72d42462
- In that page, at the top, there is a list of files. The one named
"src/downlink/MY_FILE.cpp" has the following link to an anchor:
href="#file:src:2Fdownlink:2FTRCH.cpp"
- A little below that page, on the diff of the
"src/downlink/MY_FILE.cpp" file kallithea created an anchor with the
following name: "file:src:2Fdownlink:2F/MY_FILE.cpp"
- The user clicks on that file link
    - The browser nagivates to the diff of the selected file correctly
    - In theory the browser should show the following URL:
http://localhost:5000/MY_REPO/changeset/2e4374a993906c15df4646af0619dfda72d42462#file:src:2Fdownlink:2F/MY_FILE.cpp
    - HOWEVER, the browser shows the following on its URL:
http://localhost:5000/MY_REPO/changeset/file:src:2Fdownlink:2F/MY_FILE.cpp
    - Notice the missing changeset hash
(2e4374a993906c15df4646af0619dfda72d42462).
- If the user click the reload page button at this point, it will get
a "404 Not Found" error (because of the missing hash).

It seems that the problem comes from the presence of periods and
slashes on the name.

Now, let's change path_to_anchor to do:

    return 'file:' + urllib.quote(path.replace('/',':')).replace('.', ':')

this is similar to what my patch proposed, but I have improved it
following your suggestions. We now prepend "file:" to the anchor name
to ensure that the anchor begins with a letter. We also replace '.'
with ':' rather than with '!' as you suggested to avoid adding
additional ilegal characters to the name. If we use this new version
and repeat the steps above, we see that the anchor that is created is:

    file:src%3Adownlink%3A/MY_FILE:cpp

and the link to it points to:

    http://localhost:5000/MY_REPO/changeset/2e4374a993906c15df4646af0619dfda72d42462#file:src%3Adownlink%3A/MY_FILE:cpp

Clicking on the file name takes to the file and puts the correct URL
on the browser address bar. You can reload the page after clicking on
a file and everything works fine.

Note that my solution has several nice attributes:

1. It fixes the problem above
2. The links are easy to construct by a user
3. The links are unique if you assume that they do not contain any colons.

#3 is important. I think it is unlikely that paths will have colons
(since these are always relative paths). Note that urlib.quote encodes
colons as "%3A". Because of that, you are guaranteed to have no colons
after calling urllib.quote. Hence the final replace('.', ':') creates
no ambiguity. However, even if we assume that files with colons are
rare, we can improve my patch by adding a simple "if ':' in path"
check. If the check is true we could just fall back to the FID, for
example. This would mean that on the particular corner case of a file
containing colons it would not be possible to guess the file URL, but
I think we could live with that.

I also note that my proposed solution results in names that contain
"%" which apparently are not legal. However I have not seen any
problems with any browsers...

What do you think? I know that the encoding is slightly more
complicated that I wished, but I find it is still quite easy to guess
and would let us easily create links to individual files within a
changeset, which is a very useful feature.

Cheers,

Angel