Skip to content

Update XMLStringifier.attEscape method #86

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

iterpugov
Copy link

@oozcitak
Copy link
Owner

oozcitak commented Sep 9, 2015

TY. Manually merged.

@oozcitak oozcitak closed this Sep 9, 2015
@jscheid
Copy link
Contributor

jscheid commented Jan 20, 2016

Ah, here's the issue. I've just added a comment on the commit: cf22542#commitcomment-15572748

It's true that these characters are valid in attribute values, but that doesn't mean it isn't a good idea to escape them. Only when they are escaped will they survive without getting converted to spaces.

Can we revert this change please?

@jscheid
Copy link
Contributor

jscheid commented Jan 20, 2016

Here's an example with JavaScript (libxmljs):

$ node
> require("libxmljs").parseXml("<test attr='abc\tdef'/>").root().attr('attr').value().charCodeAt(3)
32

Ruby (Nokogiri):

$ irb
irb(main):001:0> require 'nokogiri'
=> true
irb(main):002:0> Nokogiri::XML("<test attr='abc\tdef'/>").xpath("//test/@attr").text[3].ord
=> 32

Python (ElementTree):

$ python
Python 2.7.10 (default, Jul 14 2015, 19:46:27) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import xml.etree.ElementTree as ET
>>> ord(ET.fromstring("<test attr='abc\tdef'/>").attrib['attr'][3])
32

As you can see, the tab character doesn't survive. With some parsers (e.g. xmldoc, xml2js) it does survive, but only because those parsers don't conform to the spec.

oozcitak referenced this pull request Jan 21, 2016
@oozcitak
Copy link
Owner

I like the current behavior since whitespace chars are valid according to the spec. However, I am not against escaping whitespace if it is done explicitly, i.e. controlled with a flag which defaults to false. Would you like to send another pull request if that is OK with you?

@jscheid
Copy link
Contributor

jscheid commented Jan 21, 2016

Please reconsider:

  1. That's like saying that the text in the following code doesn't have to be escaped because it's valid XML:

    builder.create({ root: { '#text': '<foo/>' } }).toString()
    

    Without escaping it would be valid, but still wrong because what you put in (a text that happens to be a valid element) is not what you get out (an element, not the text). Same here: not escaping these characters gives you valid XML, but what you put in (a tab, CR or LF) is not what you get out (a space character).

  2. Escaping these characters is the canonical representation.

  3. Mature XML serializers all escape these characters, such as libxml2, Ruby's Nokogiri, Python's lxml, Java's DocumentBuilder, Perl's XML::Writer. (ElementTree doesn't but they are working on a fix). They do because it's the right thing to do. None of these serializers offer an option to disable escaping of these characters.

@oozcitak
Copy link
Owner

oozcitak commented Feb 5, 2016

Can you take a look at 272b53f and let me know what you think? This should be close to your previous PR #54 with one difference: Both the two char newline \r\n and the single \r is replaced by a single \n before escaping. This satisfies 2.11 of the spec.

@jscheid
Copy link
Contributor

jscheid commented Feb 6, 2016

Hi, on a mobile phone so being brief.

I don't think replacing CR like this is right. Firstly it can still leave newlines (as opposed to escaped newlines) in the result. Secondly the canonical representation spec (linked to above) says:

In attribute values, the character information items TAB (#x9), newline (#xA), and carriage-return (#xD) are represented by " ", " ", and " " respectively.

To me this seems pretty straightforward, just direct replacement.

The spec you linked to applies only at the entity level as far as I can tell.

I can't check right now how all the other implementations behave but I'd be surprised if they mangled CRLF like that. I can check early next week if it helps.

oozcitak added a commit that referenced this pull request Mar 5, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants