Currently, newlines in the DjVu text layer are stored as the literal
two-character string '\n' (a backslash followed by 'n'). It's up to
the consumer to unescape that into a real newline. Other formats,
such as PDF, return newlines as an actual newline character when
getPageText() is called. I think getPageText() should not require
callers to do this unescaping themselves.
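
As a rough illustration of the difference (the sample string here is
made up, not taken from a real DjVu file), stripcslashes() turns the
escaped form into a real newline:

```php
<?php
// Hypothetical sample: the line break is stored as backslash + 'n',
// which is how the escaped text layer looks before unescaping.
$escaped = 'first line\nsecond line';

// stripcslashes() interprets C-style escapes such as \n and \t.
$unescaped = stripcslashes( $escaped );

// The escaped form contains no real newline; the unescaped form does.
var_dump( strpos( $escaped, "\n" ) === false );
var_dump( strpos( $unescaped, "\n" ) !== false );
```

This only shows the unescaping step; how the result is then encoded
for the XML attribute is a separate concern.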
Change-Id: Ie1a438bbce5444c53ff6b7b3aaf2b5267ba3c8b4
function pageTextCallback( $matches ) {
# Get rid of invalid UTF-8, strip control characters
- return '<PAGE value="' . htmlspecialchars( UtfNormal::cleanUp( $matches[1] ) ) . '" />';
+ $val = htmlspecialchars( UtfNormal::cleanUp( stripcslashes( $matches[1] ) ) );
+ $val = str_replace( array( "\n", '�' ), array( '&#10;', '' ), $val );
+ return '<PAGE value="' . $val . '" />';
}
/**