Wednesday, 17 October 2018

Sqlite with real "Full Text Search" and spelling mistakes (FTS+spellfix together)

Let's say we have 1 million of rows like this:

import sqlite3
db = sqlite3.connect(':memory:')
c = db.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute('INSERT INTO mytable VALUES (1, "Riemann")')
c.execute('INSERT INTO mytable VALUES (2, "All the Carmichael numbers")')

Background:

I know how to do this with Sqlite:

  • Find a row with a single-word query, up to a few spelling mistakes with the spellfix module and Levenshtein distance (I have posted a detailed answer here about how to compile it, how to use it, ...):

    db.enable_load_extension(True)
    db.load_extension('./spellfix')
    c.execute('SELECT * FROM mytable WHERE editdist3(description, "Riehmand") < 300'); print c.fetchall()
    
    #Query: 'Riehmand'
    #Answer: [(1, u'Riemann')]
    
    

    With 1M rows, this would be super slow! As detailed here, postgresql might have an optimization with this using trigrams. A fast solution, available with Sqlite, is to use a VIRTUAL TABLE USING spellfix:

    c.execute('CREATE VIRTUAL TABLE mytable3 USING spellfix1')
    c.execute('INSERT INTO mytable3(word) VALUES ("Riemann")')
    c.execute('SELECT * FROM mytable3 WHERE word MATCH "Riehmand"'); print c.fetchall()
    
    #Query: 'Riehmand'
    #Answer: [(u'Riemann', 1, 76, 0, 107, 7)], working!
    
    
  • Find an expression with a query matching one or multiple words with FTS ("Full Text Search"):

    c.execute('CREATE VIRTUAL TABLE mytable2 USING fts4(id integer, description text)')
    c.execute('INSERT INTO mytable2 VALUES (2, "All the Carmichael numbers")')
    c.execute('SELECT * FROM mytable2 WHERE description MATCH "NUMBERS carmichael"'); print c.fetchall()
    
    #Query: 'NUMBERS carmichael'
    #Answer: [(2, u'All the Carmichael numbers')]
    
    

    It is case insensitive and you can even use a query with two words in the wrong order, etc.: FTS is quite powerful indeed. But the drawback is that each of the query-keyword must be correctly spelled, i.e. FTS alone doesn't allow spelling mistakes.

Question:

How to do a Full Text Search (FTS) with Sqlite and also allow spelling mistakes? i.e. "FTS + spellfix" together

Example:

  • row in the DB: "All the Carmichael numbers"
  • query: "NUMMBER carmickaeel" should match it!

How to do this with Sqlite?

It is probably possible with Sqlite since this page states:

Or, it [spellfix] could be used with FTS4 to do full-text search using potentially misspelled words.

Linked question: String similarity with Python + Sqlite (Levenshtein distance / edit distance)



from Sqlite with real "Full Text Search" and spelling mistakes (FTS+spellfix together)

No comments:

Post a Comment