3 Answers

In the process of looking for solutions to help sanitise some output, I came across code that does the following.

preg_replace('|[^a-z0-9-~+_.?#=!&;,/:%@$\|*\'()\\x80-\\xff]|i', '', $some_url)

Now, I think it's basically trying to remove anything other than the characters listed above. But doesn't \\x80-\\xff refer to some form of non-printable ASCII characters? If so, why would the code be trying NOT to remove them?

Any indications/pointers/help would be appreciated. Thanks.

Answer

Okay, all the answers given so far led me in the right direction and allowed me to find the following in the documentation.

After \x, up to two hexadecimal digits are read (letters can be in upper or lower case). In UTF-8 mode, \x{...} is allowed, where the contents of the braces is a string of hexadecimal digits. It is interpreted as a UTF-8 character whose code number is the given hexadecimal number. The original hexadecimal escape sequence, \xhh, matches a two-byte UTF-8 character if the value is greater than 127.

So, in summary:

i) '\x' introduces a hexadecimal escape sequence, after which up to two hexadecimal digits are read

ii) in '\xhh', the two 'hh' digits can be in upper or lower case

iii) '\xhh' specifies a value in the range 00-FF

iv) '\x80-\xFF' refers to a character range outside ASCII
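
To see why the pattern keeps that range, here is a minimal sketch. The input URL and the variable names $ascii_only_pattern, $clean and $ascii_only are made up for illustration; the first pattern is the one from the question. Without the /u modifier the regex works byte by byte, so \x80-\xff covers the bytes that make up multi-byte UTF-8 characters.

```php
<?php
// Hypothetical input: contains a UTF-8 "é" (bytes 0xC3 0xA9), a space and angle brackets.
$some_url = "http://example.com/caf\xC3\xA9?q=1 <b>";

// Pattern from the question. In a single-quoted string, \\ becomes \ and \' becomes ',
// so the regex engine receives \x80-\xff inside the character class.
$pattern = '|[^a-z0-9-~+_.?#=!&;,/:%@$\|*\'()\\x80-\\xff]|i';

// Same pattern without the \x80-\xff range (for comparison only).
$ascii_only_pattern = '|[^a-z0-9-~+_.?#=!&;,/:%@$\|*\'()]|i';

$clean      = preg_replace($pattern, '', $some_url);
$ascii_only = preg_replace($ascii_only_pattern, '', $some_url);

// The space, "<" and ">" are stripped in both cases.
// Only the first pattern keeps the two bytes of "é".
var_dump($clean === "http://example.com/caf\xC3\xA9?q=1b");  // bool(true)
var_dump($ascii_only === "http://example.com/caf?q=1b");      // bool(true)
```

In other words, dropping the range would silently strip every non-ASCII character from the URL.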


\x80-\xFF is the non-ASCII byte range. Those bytes still carry visible characters: in Latin-1 most of them are printable, and in UTF-8 they are the bytes used to encode code points above ASCII.

Writing \\x80 rather than \x80 is slightly more correct, because the backslash escapes itself in PHP strings. That holds in single-quoted strings too, although it makes no practical difference there: '\x80' and '\\x80' yield the same four characters.

In double-quoted strings, however, a bare \x80 would be interpreted by PHP itself, whereas \\x80 would reach the regex engine as \x80 and be interpreted there.
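
The difference is easy to check with strlen(). The following is a small sketch; the variable names are made up for illustration.

```php
<?php
// Single quotes: only \\ and \' are escapes, so both spellings give the 4 characters \ x 8 0.
$single_two = '\\x80';
$single_one = '\x80';

// Double quotes: PHP itself understands \x80 and produces one raw byte.
$double_two = "\\x80"; // 4 characters: \ x 8 0
$double_one = "\x80";  // 1 byte: 0x80

var_dump(strlen($single_two)); // int(4)
var_dump(strlen($single_one)); // int(4)
var_dump(strlen($double_two)); // int(4)
var_dump(strlen($double_one)); // int(1)

// Without the /u modifier both routes match the same bytes:
// either the regex engine decodes \x80-\xff, or PHP has already put the raw bytes in the class.
$subject = "\xC3\xA9"; // UTF-8 "é"
$via_regex = preg_match('/[\\x80-\\xff]/', $subject);
$via_php   = preg_match("/[\x80-\xff]/", $subject);
var_dump($via_regex, $via_php); // int(1) int(1)
```

Letting the regex engine see the escape keeps the PHP source file free of raw non-ASCII bytes, which is one reason to prefer the doubled form.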

    • After re-reading the answer, I would also like to ask whether '|x80 - xFF|i', without any backslashes, is valid syntax that means the same thing as the above.
    • Thank you for taking the time to answer. I understand the first part of your answer. But ... "In double quoted strings however using just \x80 would be interpreted by PHP, whereas \\x80 would be seen and interpreted by the regex engine." ... lost me. Besides, shouldn't the double backslash end up escaping the backslash itself, forcing it to be treated as a separate character, and leaving x80 and xFF to be treated individually without any backslashes?

You don't need a double backslash in a quoted PHP pattern string. If you do use one, PHP collapses it to a single backslash before the regex engine sees it, so it is still read as an escape (like a simple backslash).

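One exception: if you enclose the pattern in nowdoc syntax, PHP does no escape processing, so a double backslash reaches the regex engine intact and is seen as a literal backslash. (Heredoc processes escapes the way a double-quoted string does, so there the double backslash is collapsed as usual.)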
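
How heredoc and nowdoc treat a double backslash can be checked directly. This is a sketch with made-up names ($nowdoc, $heredoc, the RE label); as far as I know, heredoc handles escapes like a double-quoted string while nowdoc leaves the text untouched.

```php
<?php
// Nowdoc: no escape processing, both backslashes survive.
$nowdoc = <<<'RE'
/\\x80/
RE;

// Heredoc: \\ is collapsed to a single backslash, as in a double-quoted string.
$heredoc = <<<RE
/\\x80/
RE;

var_dump(strlen($nowdoc));  // int(7)  ->  /\\x80/
var_dump(strlen($heredoc)); // int(6)  ->  /\x80/

// The nowdoc pattern matches the literal text \x80 (backslash, x, 8, 0), not the byte 0x80.
$nowdoc_vs_text = preg_match($nowdoc, '\x80');
$nowdoc_vs_byte = preg_match($nowdoc, "\x80");
var_dump($nowdoc_vs_text, $nowdoc_vs_byte); // int(1) int(0)

// The heredoc pattern matches the byte 0x80, not the literal text.
$heredoc_vs_byte = preg_match($heredoc, "\x80");
$heredoc_vs_text = preg_match($heredoc, '\x80');
var_dump($heredoc_vs_byte, $heredoc_vs_text); // int(1) int(0)
```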


This article is reproduced from Stack Exchange / Stack Overflow.
