Hemant Vishwakarma: SQL(ite) fast retrieval of multiple substrings of larger string (genome)

Friday 10 January 2020

SQL(ite) fast retrieval of multiple substrings of larger string (genome)

I have a sql database consisting of a genome, its chromosomes, and "interesting" regions (in BED format). A genome (e.g. 4GB) consists of +/- 20 chromosomes, so each string is around 200MB large. The chromosome in a genome for instance consists of the string:

NNNNATCCAGGAGAATTACAT...ACCGGGAATTCCCGGNNNNN # 200 MB large

Let's say I have around 1.000.000 regions ATAC-seq peaks and I would like to get the sequences of 100 bp of only chromosome 3. My SQL query would look like this:

SELECT substr(Chr.Sequence, Bed.ChromStart + Bed.Peak - 50, 100) FROM Bed Bed
INNER JOIN Chromosome Chr ON Bed.ChromosomeId = Chr.ChromosomeId
WHERE Chr.Chromosome = 'chr3'

The problem with these kind of lookups is that the Chr.Sequence is loaded for every single hit, making the ram usage excessively large, and the lookup really slow. The way I 'fixed' this is only using the SQL database to store the positions of the interesting locations, and then using pyfaidx to quickly get the corresponding sequences out of the chromosome.

I am wondering if a fast lookup is possible in SQL(ite), since the current solution feels a bit ad-hoc to me.

from SQL(ite) fast retrieval of multiple substrings of larger string (genome)

Hemant Vishwakarma

Friday 10 January 2020

SQL(ite) fast retrieval of multiple substrings of larger string (genome)

No comments:

Post a Comment