Hemant Vishwakarma: How to use BK-tree on a corpus with 10 billion words of DNA?

Thursday, 14 January 2021

I am trying to use the BK-tree data structure in python on a corpus with ~10 billion entries.

Once I add over ~10 million values to a single BK-tree, there is a serious degradation in the performance of querying.

I was thinking to store the corpus into a forest of a thousand BK-trees and to query them in parallel.

How unrealistic is it to query 1,000 BK-trees? What else can I do in order to use BK-tree for this corpus.

I use pybktree.py and my queries are intended to find all entries within an edit distance d.

Is there some architecture or database which will allow me to store those trees?

Note: I don’t run out of memory, rather the tree begins to be inefficient (maybe each node has too many children).

Hemant Vishwakarma