• 4
name

A PHP Error was encountered

Severity: Notice

Message: Undefined index: userid

Filename: views/question.php

Line Number: 191

Backtrace:

File: /home/prodcxja/public_html/questions/application/views/question.php
Line: 191
Function: _error_handler

File: /home/prodcxja/public_html/questions/application/controllers/Questions.php
Line: 433
Function: view

File: /home/prodcxja/public_html/questions/index.php
Line: 315
Function: require_once

In case you're interested in a quick visual comparison of Levenshtein and Difflib similarity, I calculated both for ~2.3 million book titles:

import codecs, difflib, Levenshtein, distance

with codecs.open("titles.tsv","r","utf-8") as f:
    title_list = f.read().split("\n")[:-1]

    for row in title_list:

        sr      = row.lower().split("\t")

        diffl   = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio()
        lev     = Levenshtein.ratio(sr[3], sr[4]) 
        sor     = 1 - distance.sorensen(sr[3], sr[4])
        jac     = 1 - distance.jaccard(sr[3], sr[4])

        print diffl, lev, sor, jac

I then plotted the results with R:

enter image description here

Strictly for the curious, I also compared the Difflib, Levenshtein, Sørensen, and Jaccard similarity values:

library(ggplot2)
require(GGally)

difflib <- read.table("similarity_measures.txt", sep = " ")
colnames(difflib) <- c("difflib", "levenshtein", "sorensen", "jaccard")

ggpairs(difflib)

Result: enter image description here

The Difflib / Levenshtein similarity really is quite interesting.

2018 edit: If you're working on identifying similar strings, you could also check out minhashing--there's a great overview here. Minhashing is amazing at finding similarities in large text collections in linear time. My lab put together an app that detects and visualizes text reuse using minhashing here: https://github.com/YaleDHLab/intertext

  • 152
Reply Report
      • 1
    • I think some of the disagreement between difflib and levenshtein may be explained because of the autojunk heuristic used by difflib. What happens if you disable it?
      • 1
    • That's a good question. The autojunk filter only takes effect if the number of observations is >200, so I'm not sure if this particular dataset (book titles) would have been greatly affected, but it's worth investigation...
    • @duhaime, thank you for this detailed analysis. I'm new to these kinds of plots and have no idea how to interpret them. What are the plots called, so that I may look them up and learn about them?
  • difflib.SequenceMatcher uses the Ratcliff/Obershelp algorithm it computes the doubled number of matching characters divided by the total number of characters in the two strings.

  • Levenshtein uses Levenshtein algorithm it computes the minimum number of edits needed to transform one string into the other

Complexity

SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common. (from here)

Levenshtein is O(m*n), where n and m are the length of the two input strings.

Performance

According to the source code of the Levenshtein module : Levenshtein has a some overlap with difflib (SequenceMatcher). It supports only strings, not arbitrary sequence types, but on the other hand it's much faster.

  • 104
Reply Report
      • 1
    • Thanks alot for the info. I have added more details. here it is: I am doing clinical message normalization (spell check) in which I check each given word against 900,000 word medical dictionary. I am more concern about the time complexity/performance. Do you think both perform alike in this case.

Trending Tags