Friday 10 January 2020

SQL(ite) fast retrieval of multiple substrings of larger string (genome)

I have a sql database consisting of a genome, its chromosomes, and "interesting" regions (in BED format). A genome (e.g. 4GB) consists of +/- 20 chromosomes, so each string is around 200MB large. The chromosome in a genome for instance consists of the string:

NNNNATCCAGGAGAATTACAT...ACCGGGAATTCCCGGNNNNN # 200 MB large

Let's say I have around 1.000.000 regions ATAC-seq peaks and I would like to get the sequences of 100 bp of only chromosome 3. My SQL query would look like this:

SELECT substr(Chr.Sequence, Bed.ChromStart + Bed.Peak - 50, 100) FROM Bed Bed
INNER JOIN Chromosome Chr ON Bed.ChromosomeId = Chr.ChromosomeId
WHERE Chr.Chromosome = 'chr3'

The problem with these kind of lookups is that the Chr.Sequence is loaded for every single hit, making the ram usage excessively large, and the lookup really slow. The way I 'fixed' this is only using the SQL database to store the positions of the interesting locations, and then using pyfaidx to quickly get the corresponding sequences out of the chromosome.

I am wondering if a fast lookup is possible in SQL(ite), since the current solution feels a bit ad-hoc to me.



from SQL(ite) fast retrieval of multiple substrings of larger string (genome)

No comments:

Post a Comment