7Answers
  • 7
name

A PHP Error was encountered

Severity: Notice

Message: Undefined index: userid

Filename: views/question.php

Line Number: 191

Backtrace:

File: /home/prodcxja/public_html/questions/application/views/question.php
Line: 191
Function: _error_handler

File: /home/prodcxja/public_html/questions/application/controllers/Questions.php
Line: 433
Function: view

File: /home/prodcxja/public_html/questions/index.php
Line: 315
Function: require_once

name Punditsdkoslkdosdkoskdo

preg_match and UTF-8 in PHP

I'm trying to search a UTF8-encoded string using preg_match.

preg_match('/H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);
echo $a_matches[0][1];

This should print 1, since "H" is at index 1 in the string "¡Hola!". But it prints 2. So it seems like it's not treating the subject as a UTF8-encoded string, even though I'm passing the "u" modifier in the regular expression.

I have the following settings in my php.ini, and other UTF8 functions are working:

mbstring.func_overload = 7
mbstring.language = Neutral
mbstring.internal_encoding = UTF-8
mbstring.http_input = pass
mbstring.http_output = pass
mbstring.encoding_translation = Off

Any ideas?

Although the u modifier makes both the pattern and subject be interpreted as UTF-8, the captured offsets are still counted in bytes.

You can use mb_strlen to get the length in UTF-8 characters rather than bytes:

$str = "\xC2\xA1Hola!";
preg_match('/H/u', $str, $a_matches, PREG_OFFSET_CAPTURE);
echo mb_strlen(substr($str, 0, $a_matches[0][1]));
  • 40
Reply Report
      • 2
    • @tomalak and next ones. Of course, php doesn't manage unicode, because it works on bytes if you use old functions like substr, strlen, etc., but it is fully managed since a very long time via the extension mbstring, enabled by default in many distributions and servers. This is a choice to maintain backward compatibility.
    • "The u modifier is only to get the pattern interpreted as UTF-8, not the subject." This is not true. Compare e.g. preg_split('//', .) with preg_split('//u', .). Since this "x is interpret as UTF-8" is a bit vague, see this for the actual effects of the unicode mode.

Try adding this (*UTF8) before the regex:

preg_match('(*UTF8)/H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);

Magic, thanks to a comment in http://www.php.net/manual/es/function.preg-match.php#95828

  • 25
Reply Report
      • 1
    • Interesting, although I think you need the initial / before the (*UTF8). This doesn't work on my system, but it might on others. What does this output when you do echo $a_matches[0][1];?
      • 1
    • I used it like this on PHP 5.4.29, works like a charm: preg_match_all('/(*UTF8)[^A-Za-z0-9s]/', $txt, $matches);
      • 2
    • Doesn't work for me on either PHP 5.6 or PHP 7 on Ubuntu 16.04. (*UTF8) before delimiter is an error, after has no effect. I suspect that it depends on how/where you got your php, specifically the settings that libpcre* was compiled with.

Looks like this is a "feature", see http://bugs.php.net/bug.php?id=37391

'u' switch only makes sense for pcre, PHP itself is unaware of it.

From PHP's point of view, strings are byte sequences and returning byte offset seems logical (i don't say "correct").

  • 21
Reply Report

Excuse me for necroposting, but may be somebody will find it useful: code below can work both as replacement for preg_match and preg_match_all functions and returns correct matches with correct offset for UTF8-encoded strings.

     mb_internal_encoding('UTF-8');

     /**
     * Returns array of matches in same format as preg_match or preg_match_all
     * @param bool   $matchAll If true, execute preg_match_all, otherwise preg_match
     * @param string $pattern  The pattern to search for, as a string.
     * @param string $subject  The input string.
     * @param int    $offset   The place from which to start the search (in bytes).
     * @return array
     */
    function pregMatchCapture($matchAll, $pattern, $subject, $offset = 0)
    {
        $matchInfo = array();
        $method    = 'preg_match';
        $flag      = PREG_OFFSET_CAPTURE;
        if ($matchAll) {
            $method .= '_all';
        }
        $n = $method($pattern, $subject, $matchInfo, $flag, $offset);
        $result = array();
        if ($n !== 0 && !empty($matchInfo)) {
            if (!$matchAll) {
                $matchInfo = array($matchInfo);
            }
            foreach ($matchInfo as $matches) {
                $positions = array();
                foreach ($matches as $match) {
                    $matchedText   = $match[0];
                    $matchedLength = $match[1];
                    $positions[]   = array(
                        $matchedText,
                        mb_strlen(mb_strcut($subject, 0, $matchedLength))
                    );
                }
                $result[] = $positions;
            }
            if (!$matchAll) {
                $result = $result[0];
            }
        }
        return $result;
    }

    $s1 = '????????? ??????? ?????? ??? ?????';
    $s2 = 'Try english string for test';

    var_dump(pregMatchCapture(true, '/???/', $s1));
    var_dump(pregMatchCapture(false, '/???/', $s1));

    var_dump(pregMatchCapture(true, '/lish/', $s2));
    var_dump(pregMatchCapture(false, '/lish/', $s2));

Output of my example:

    array(1) {
      [0]=>
      array(1) {
        [0]=>
        array(2) {
          [0]=>
          string(6) "???"
          [1]=>
          int(4)
        }
      }
    }
    array(1) {
      [0]=>
      array(2) {
        [0]=>
        string(6) "???"
        [1]=>
        int(4)
      }
    }
    array(1) {
      [0]=>
      array(1) {
        [0]=>
        array(2) {
          [0]=>
          string(4) "lish"
          [1]=>
          int(7)
        }
      }
    }
    array(1) {
      [0]=>
      array(2) {
        [0]=>
        string(4) "lish"
        [1]=>
        int(7)
      }
    }
  • 7
Reply Report
    • Can you explain what your code does instead of just pasting a code dump? And how does this answer the question?
      • 2
    • It does exactly what described in comments and returns CORRECT string offsets. It is the subject of the question. No idea why I had -2 for my answer. It is working for me.
      • 2
    • Well, that's why you should include an explanation of what your code does. People don't get what you are trying to do here.
      • 2
    • To use $offset as (characters) instead of (bytes), you can add this near the top of the function: if ($offset) { $offset = strlen(mb_substr($subject, 0, $offset)); }

I wrote small class to convert offsets returned by preg_match to proper utf offsets:

final class NonUtfToUtfOffset
{
    /** @var int[] */
    private $utfMap = [];

    public function __construct(string $content)
    {
        $contentLength = mb_strlen($content);

        for ($offset = 0; $offset < $contentLength; $offset ++) {
            $char = mb_substr($content, $offset, 1);
            $nonUtfLength = strlen($char);

            for ($charOffset = 0; $charOffset < $nonUtfLength; $charOffset ++) {
                $this->utfMap[] = $offset;
            }
        }
    }

    public function convertOffset(int $nonUtfOffset): int
    {
        return $this->utfMap[$nonUtfOffset];
    }
}

You can use it like that:

$content = 'a? ba? d';
$offsetConverter = new NonUtfToUtfOffset($content);

preg_match_all('#(ba?)#ui', $content, $m, PREG_OFFSET_CAPTURE);

foreach ($m[1] as [$word, $offset]) {
    echo "bad: " . mb_substr($content, $offset, mb_strlen($word))."\n";
    echo "good: " . mb_substr($content, $offsetConverter->convertOffset($offset), mb_strlen($word))."\n";
}

https://3v4l.org/8Y32J

  • 1
Reply Report

You might want to look at T-Regx library.

pattern('/Hola/u')->match('\xC2\xA1Hola!')->first(function (Match $match) 
{
    echo $match->offset();     // characters
    echo $match->byteOffset(); // bytes
});

This $match->offset() is UTF-8 safe offset.

  • 0
Reply Report

Warm tip !!!

This article is reproduced from Stack Exchange / Stack Overflow, please click

Trending Tags

Related Questions