From ec4a1898916b9caaad230da8518e3fd65bd169a8 Mon Sep 17 00:00:00 2001
From: Aryeh Gregor
Date: Fri, 11 Mar 2011 20:50:17 +0000
Subject: [PATCH] Normalize named entities to numeric

We should never be outputting named entities other than the ones in XML,
&lt; &gt; &amp; &quot;, because that will break well-formedness unless we
have a DTD in the doctype, which we don't in HTML5 mode.

I stuck with outputting numeric entities here instead of UTF-8 because
some characters are hard to read in UTF-8 (e.g., &nbsp;). Maybe it would
be nicer if we decoded to UTF-8 except for whitespace and control
characters, or something like that, but it's a detail.

I'll backport to 1.17 and add RELEASE-NOTES there, which is why I added
the line to HISTORY instead of RELEASE-NOTES.
---
 HISTORY                      |  1 +
 includes/Sanitizer.php       | 15 ++++++++++-----
 tests/parser/parserTests.txt |  6 +++---
 3 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/HISTORY b/HISTORY
index e897e8860c..3772349c2b 100644
--- a/HISTORY
+++ b/HISTORY
@@ -455,6 +455,7 @@ general notes.
 * (bug 20244) Installer does not validate SQLite database directory for stable path
 * (bug 1379) Installer directory conflicts with some hosts' configuration panel.
 * (bug 12070) After Installation MySQL was blocked
+* Fix XML well-formedness on a few pages when $wgHtml5 is true (the default)
 
 === API changes in 1.17 ===
 * (bug 22738) Allow filtering by action type on query=logevent.
diff --git a/includes/Sanitizer.php b/includes/Sanitizer.php
index e26c86d861..4c99e82d55 100644
--- a/includes/Sanitizer.php
+++ b/includes/Sanitizer.php
@@ -1093,7 +1093,8 @@ class Sanitizer {
 	 * for XML and XHTML specifically. Any stray bits will be
 	 * &-escaped to result in a valid text fragment.
 	 *
-	 * a. any named char refs must be known in XHTML
+	 * a. named char refs can only be &lt; &gt; &amp; &quot;, others are
+	 *   numericized (this way we're well-formed even without a DTD)
 	 * b. any numeric char refs must be legal chars, not invalid or forbidden
 	 * c. use &#x, not &#X
 	 * d. fix or reject non-valid attributes
@@ -1130,9 +1131,10 @@ class Sanitizer {
 
 	/**
 	 * If the named entity is defined in the HTML 4.0/XHTML 1.0 DTD,
-	 * return the named entity reference as is. If the entity is a
-	 * MediaWiki-specific alias, returns the HTML equivalent. Otherwise,
-	 * returns HTML-escaped text of pseudo-entity source (eg &amp;foo;)
+	 * return the equivalent numeric entity reference (except for the core &lt;
+	 * &gt; &amp; &quot;). If the entity is a MediaWiki-specific alias, returns
+	 * the HTML equivalent. Otherwise, returns HTML-escaped text of
+	 * pseudo-entity source (eg &amp;foo;)
 	 *
 	 * @param $name String
 	 * @return String
@@ -1141,8 +1143,11 @@ class Sanitizer {
 		global $wgHtmlEntities, $wgHtmlEntityAliases;
 		if ( isset( $wgHtmlEntityAliases[$name] ) ) {
 			return "&{$wgHtmlEntityAliases[$name]};";
-		} elseif( isset( $wgHtmlEntities[$name] ) ) {
+		} elseif ( in_array( $name,
+		array( 'lt', 'gt', 'amp', 'quot' ) ) ) {
 			return "&$name;";
+		} elseif ( isset( $wgHtmlEntities[$name] ) ) {
+			return "&#{$wgHtmlEntities[$name]};";
 		} else {
 			return "&amp;$name;";
 		}
diff --git a/tests/parser/parserTests.txt b/tests/parser/parserTests.txt
index 32d5a3cbc5..6c2f9ed9aa 100644
--- a/tests/parser/parserTests.txt
+++ b/tests/parser/parserTests.txt
@@ -1264,7 +1264,7 @@ Multiplication table
 Multiplication table
-&times;
+&#215;
 1
 2
 3
@@ -1351,7 +1351,7 @@ Nested table
 !! result
- &alpha;
+ &#945;
@@ -1730,7 +1730,7 @@ Non-breaking spaces in title
 !! input
 [[&nbsp; Main &nbsp; Page &nbsp;]]
 !! result
-&nbsp; Main &nbsp; Page &nbsp;
+&#160; Main &#160; Page &#160;
 
 !!end
-- 
2.20.1
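
For reviewers who want to try the new branch order outside MediaWiki, here is a minimal standalone sketch of the patched entity lookup. It is not the Sanitizer code itself: the function name `normalizeEntitySketch` is hypothetical, the two tables are tiny stand-ins for `$wgHtmlEntities` and `$wgHtmlEntityAliases` passed as parameters instead of globals, and the alias entry is made up for illustration.

```php
<?php
// Hypothetical, trimmed-down stand-in for $wgHtmlEntities
// (entity name => code point); the real table is much larger.
$entities = array(
	'lt' => 60, 'gt' => 62, 'amp' => 38, 'quot' => 34,
	'nbsp' => 160, 'times' => 215, 'alpha' => 945,
);
// Hypothetical alias table (alias => canonical entity name).
$aliases = array( 'examplealias' => 'rlm' );

/**
 * Sketch of the patched lookup: aliases first, then the four XML
 * built-ins kept as named references, then every other known name
 * rewritten as a decimal numeric reference, and finally unknown
 * names escaped so they show up as literal text.
 */
function normalizeEntitySketch( $name, array $entities, array $aliases ) {
	if ( isset( $aliases[$name] ) ) {
		return "&{$aliases[$name]};";
	} elseif ( in_array( $name, array( 'lt', 'gt', 'amp', 'quot' ), true ) ) {
		return "&$name;";
	} elseif ( isset( $entities[$name] ) ) {
		return "&#{$entities[$name]};";
	} else {
		return "&amp;$name;";
	}
}

echo normalizeEntitySketch( 'times', $entities, $aliases ), "\n"; // &#215;
echo normalizeEntitySketch( 'amp', $entities, $aliases ), "\n";   // &amp;
echo normalizeEntitySketch( 'foo', $entities, $aliases ), "\n";   // &amp;foo;
```

The order of the branches matters: the check for the four XML built-ins has to come before the general table lookup, otherwise `&amp;` and friends would also be turned into numeric references.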