
Replace special characters with ASCII equivalent

Is there any library that can replace special characters with their ASCII equivalents, like:

"Cześć"

to:

"Czesc"

I can of course create map:

{'ś': 's', 'ć': 'c'}

and use a replace function. But I don't want to hardcode all the equivalents into my program if there is already a function that does this.
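For reference, the map-plus-replace approach described above can be done with `str.translate` (`ASCII_MAP` is a hypothetical name, and the table covers only the two letters from the example):

```python
# Hand-rolled mapping -- exactly what the question hopes to avoid.
# str.maketrans builds a translation table keyed by code point.
ASCII_MAP = str.maketrans({'ś': 's', 'ć': 'c'})

print('Cześć'.translate(ASCII_MAP))  # Czesc
```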

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import unicodedata

text = 'Cześć'
# Decompose accented letters, then drop the non-ASCII combining marks
print(unicodedata.normalize('NFD', text).encode('ascii', 'ignore').decode('ascii'))
    • it doesn't work for all cases, e.g. "(VW Polo) - Zapłon Jak sprawdzić czy działa pompa wspomagania?" converts to "(VW Polo) - Zapon Jak sprawdzic czy dziaa pompa wspomagania?"
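The failure the comment describes is easy to reproduce: ł has no canonical decomposition, so NFD leaves it intact and the `'ignore'` error handler silently drops it:

```python
import unicodedata

# 'ł' does not decompose into 'l' + a combining mark, so after NFD
# it is still a single non-ASCII character and encode(..., 'ignore')
# removes it entirely instead of transliterating it.
stripped = unicodedata.normalize('NFD', 'Zapłon').encode('ascii', 'ignore').decode('ascii')
print(stripped)  # Zapon
```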

You can get most of the way by doing:

import unicodedata

def strip_accents(text):
    return ''.join(c for c in unicodedata.normalize('NFKD', text)
                   if unicodedata.category(c) != 'Mn')

Unfortunately, there exist accented Latin letters that cannot be decomposed into an ASCII letter + combining marks. You'll have to handle them manually. These include:

  • Æ → AE
  • Ð → D
  • Ø → O
  • Þ → TH
  • ß → ss
  • æ → ae
  • ð → d
  • ø → o
  • þ → th
  • Œ → OE
  • œ → oe
  • ƒ → f
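A minimal sketch that combines the manual table above with the NFKD pass (`to_ascii` and `LATIN_FALLBACK` are hypothetical names):

```python
import unicodedata

# Fallback table for the letters listed above, which NFKD
# cannot decompose into an ASCII letter plus combining marks
LATIN_FALLBACK = {
    'Æ': 'AE', 'Ð': 'D', 'Ø': 'O', 'Þ': 'TH', 'ß': 'ss',
    'æ': 'ae', 'ð': 'd', 'ø': 'o', 'þ': 'th',
    'Œ': 'OE', 'œ': 'oe', 'ƒ': 'f',
}

def to_ascii(text):
    # Apply the manual replacements first, then strip combining marks
    text = ''.join(LATIN_FALLBACK.get(c, c) for c in text)
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')
```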

I did it this way:

import unicodedata

POLISH_CHARACTERS = {
    'ą':'a','ć':'c','ę':'e','ł':'l','ń':'n','ó':'o','ś':'s','ź':'z','ż':'z',
    'Ą':'A','Ć':'C','Ę':'E','Ł':'L','Ń':'N','Ó':'O','Ś':'S','Ź':'Z','Ż':'Z',}

def encodePL(text):
    nrmtxt = unicodedata.normalize('NFC', text)  # compose each letter into one code point
    ret_str = []
    for ch in nrmtxt:
        if ord(ch) > 127:  # non-ASCII character
            ret_str.append(POLISH_CHARACTERS.get(ch, ch))
        else:  # pure ASCII character
            ret_str.append(ch)
    return ''.join(ret_str)

when executed:

encodePL('ąćęłńóśźż ĄĆĘŁŃÓŚŹŻ')

it will produce output like this:

'acelnoszz ACELNOSZZ'

This works fine for me - ;D


The package unidecode worked best for me:

from unidecode import unidecode
text = "Björn, Łukasz and Σωκράτης."
print(unidecode(text))
# ==> Bjorn, Lukasz and Sokrates.

You might need to install the package:

pip install unidecode

The above solution is easier and more robust than encoding (and decoding) the output of unicodedata.normalize(), as suggested by other answers.

# This doesn't work as expected:
ret = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
print(ret)
# ==> b'Bjorn, ukasz and .'
# Besides not supporting all characters, the returned value is a
# bytes object in python3. To yield a str type:
ret = ret.decode("utf8") # (not required in python2)