Thursday 27 January 2022

Filter data from zipped databases in python

I have a following problem. I have two large databases. The first database is called an_visit_ab. It is large and zipped as an_visit_ab.sql.bz2. The second database is large too, but I am able to store it on my computer as an_visit.db.

Structure of databases:

an_visit_ab is ordered by visit_id in ascending order:

visit_id    ab_test_id  variant
594817698   1   0
594817704   1   1
594817705   1   1
594817706   1   0
594817709   1   0

an_visit is ordered by visit_id in ascending order, too:

visit_id    datetime_add    is_bot
1   2017-06-21 18:10:38 0
3   2017-06-21 18:10:39 1

I need to filter out from an_visit_ab all visit_id where is_bot is 1 in an_visit. Normally, I would to a simple JOIN in SQL. But because I am not able to even unzip an_visit_ab.sql.bz2 I need a different strategy.

What I came with until now:

  1. I need to read an_visit_ab.sql.bz2 line by line. Because both databases are ordered by visit_id, I can look to the an_visit to a corresponding visit_id and check if is_bot is 1.
  2. I do it in python. I have a following code that reads an_visit_ab.sql.bz2 line by line:
from bz2 import BZ2File as bzopen

AN_VISIT_AB = "an_visit_ab.sql.bz2"

with bzopen(AN_VISIT_AB, "r") as bzfile:
    for i, line in enumerate(bzfile):
        print(line)

The output looks like this:

b'-- MySQL dump 10.13  Distrib 5.7.33, for Linux (x86_64)\n'
b'--\n'
b'-- Host: localhost    Database: XXX\n'
b'-- ------------------------------------------------------\n'
b'-- Server version\t5.7.33-0ubuntu0.16.04.1\n'
b'\n'
b'/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;\n'
b'/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;\n'
b'/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;\n'
b'/*!40101 SET NAMES utf8 */;\n'
b'/*!40103 SET @OLD_TIME_ZONE=@@TIME_ZONE */;\n'
b"/*!40103 SET TIME_ZONE='+00:00' */;\n"
b'/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;\n'
b'/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;\n'
b"/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;\n"
b'/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;\n'
b'\n'
b'--\n'
b'-- Table structure for table `an_visit_ab`\n'
b'--\n'
b'\n'
b'DROP TABLE IF EXISTS `an_visit_ab`;\n'
b'/*!40101 SET @saved_cs_client     = @@character_set_client */;\n'
b'/*!40101 SET character_set_client = utf8 */;\n'
b'CREATE TABLE `an_visit_ab` (\n'
b'  `visit_id` int(11) unsigned NOT NULL,\n'
b'  `ab_test_id` int(10) unsigned NOT NULL,\n'
b'  `variant` tinyint(4) NOT NULL,\n'
b'  PRIMARY KEY (`visit_id`,`ab_test_id`,`variant`),\n'
b'  UNIQUE KEY `visit_id_ab_test_id` (`visit_id`,`ab_test_id`),\n'
b'  KEY `ab_test_id_ab_test_variant` (`ab_test_id`,`variant`),\n'
b'  KEY `ab_test_id` (`ab_test_id`),\n'
b'  CONSTRAINT `an_visit_ab_ibfk_1` FOREIGN KEY (`visit_id`) REFERENCES `an_visit` (`visit_id`),\n'
b'  CONSTRAINT `an_visit_ab_ibfk_4` FOREIGN KEY (`ab_test_id`, `variant`) REFERENCES `an_new_ab_test_variant` (`ab_test_id`, `variant`)\n'
b') ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;\n'
b'/*!40101 SET character_set_client = @saved_cs_client */;\n'
b'\n'
b'--\n'
b'-- Dumping data for table `an_visit_ab`\n'
b'--\n'
b'\n'
b'LOCK TABLES `an_visit_ab` WRITE;\n'
b'/*!40000 ALTER TABLE `an_visit_ab` DISABLE KEYS */;\n'

(594878944,1,1),(594878945,1,0),(594878946,1,0),(594878947,1,1),(594878948,1,0),(594878949,1,1),(594878950,1,0),(594878951,1,0),(594878952,1,1),(594878953,1,1),(594878954,1,1),(594878955,1,1),(594878956,1,0),(594878957,1,1),(594878958,1,1),(594878959,1,1),(594878960,1,0),(594878961,1,0),(594878962,1,0),(594878963,1,1),(594878964,1,1),(594878965,1,0),(594878966,1,1),(594878967,1,0),(594878968,1,1),(594878969,1,1),(594878970,1,1),(594878971,1,0),(594878972,1,1),(594878973,1,0),(594878974,1,0),(594878975,1,1),(594878976,1,1),(594878977,1,0),(594878978,1,1),(594878979,1,1),(594878980,1,0),(594878981,1,1),(594878982,1,1),(594878983,1,0),(594878984,1,1),(594878985,1,1),(594878986,1,0),(594878987,1,1),(594878988,1,1),(594878989,1,0),(594878990,1,1),(594878991,1,0),(594878992,1,0),(594878993,1,1),(594878994,1,0),(594878995,1,1),(594878996,1,1),(594878997,1,1),(594878998,1,1),(594878999,1,0),(594879000,1,0),(594879001,1,0),(594879002,1,0),(594879003,1,0),(594879004,1,1),(594879005,1,1),(594879006,1,0),(594879007,1,1),(594879008,1,1),(594879009,1,1),(594879010,1,0),(594879011,1,1),(594879012,1,1),(594879013,1,0),(594879014,1,0),(594879015,1,0),(594879016,1,1),(594879017,1,0),(594879018,1,0),(594879019,1,1),(594879020,1,1),(594879021,1,1),(594879022,1,1),(594879023,1,0),(594879024,1,1),(594879025,1,0),(594879026,1,0),(594879027,1,1),(594879028,1,0),(594879029,1,1),(594879030,1,0),(594879031,1,1),(594879032,1,0),(594879033,1,1),(594879034,1,1),(594879035,1,0),(594879036,1,1),(594879037,1,1),(594879038,1,0),(594879039,1,1),(594879040,1,1),(594879041,1,1),(594879042,1,1),(594879043,1,0),(594879044,1,0),(594879045,1,0),(594879046,1,0),(594879047,1,1),(594879048,1,0),(594879049,1,1),(594879050,1,0),(594879051,1,1),(594879052,1,1),(594879053,1,1),(594879054,1,0),(594879055,1,1),(594879056,1,0),(594879057,1,0),(594879058,1,0),(594879059,1,0),(594879060,1,1),(594879061,1,1),(594879062,1,0),(594879063,1,0),(594879064,1,0),(594879065,1,1),(594879066,1,1),(594879067,1,0),(594879068,1,1),(594879069,1,0),(594879070,1,0),(594879071,1,0),(594879072,1,0),(594879073,1,0),(594879074,1,0),(594879075,1,0),(594879076,1,0),(594879077,1,0),(594879078,1,1),(594879079,1,1),(594879080,1,1),(594879081,1,1),(594879082,1,1),(594879083,1,0),(594879084,1,0),(594879085,1,0),(594879086,1,0),(594879087,1,1),(594879088,1,1),(594879089,1,0),(594879090,1,1),(594879091,1,0),(594879092,1,1),(594879093,1,0),(594879094,1,0),(594879095,1,0),(594879096,1,1),(594879097,1,1),(594879098,1,0),(594879099,1,1),(594879100,1,0),(594879101,1,1),(594879102,1,1),(594879103,1,0),(594879104,1,0),(594879105,1,1),(594879106,1,1),(594879107,1,0),(594879108,1,1),(594879109,1,1),(594879110,1,0),(594879111,1,1),(594879112,1,1),(594879113,1,0),(594879114,1,0),(594879115,1,1),(594879116,1,1),(594879117,1,1),(594879118,1,1),(594879119,1,0),(594879120,1,1),(594879121,1,1),(594879122,1,0),(594879123,1,1),(594879124,1,1),(594879125,1,0),(594879126,1,1),(594879127,1,1),(594879128,1,1),(594879129,1,1),(594879130,1,0),(594879131,1,1),(594879132,1,0),(594879133,1,0),(594879134,1,0),(594879135,1,1),(594879136,1,0),(594879137,1,0),(594879138,1,0),(594879139,1,1),(594879140,1,1),(594879141,1,0),(594879142,1,0),(594879143,1,1),(594879144,1,0),(594879145,1,0),(594879146,1,1),(594879147,1,0),(594879148,1,0),(594879149,1,1),(594879150,1,0),(594879151,1,0),(594879152,1,0),(594879153,1,1),(594879154,1,1),(594879155,1,1),(594879156,1,1),(594879157,1,1),(594879158,1,1),(594879159,1,1),(594879160,1,1),(594879161,1,0),(594879162,1,1),(594879163,1,1),(594879164,1,0),(594879165,1,0),(594879166,1,0),(594879167,1,1),(594879168,1,1),(594879169,1,1),(594879170,1,0),(594879171,1,0),(594879172,1,1),(594879173,1,0),(594879174,1,1),(594879175,1,1),(594879176,1,1),(594879177,1,0),(594879178,1,1),(594879179,1,1),(594879180,1,0),(594879181,1,1),(594879182,1,0),(594879183,1,1),(594879184,1,0),(594879185,1,0),(594879186,1,0),(594879187,1,1),(594879188,1,0),(594879189,1,0),(594879190,1,1),(594879191,1,0),(594879192,1,0),(594879193,1,1),(594879194,1,1),(594879195,1,1),(594879196,1,1),(594879197,1,1),(594879198,1,0),(594879199,1,1),(594879200,1,1),(594879201,1,0),(594879202,1,0),(594879203,1,0),(594879204,1,1),(594879205,1,1),(594879206,1,1),(594879207,1,0),(594879208,1,1),(594879209,1,1),(594879210,1,1),(594879211,1,0),(594879212,1,0),(594879213,1,0),(594879214,1,0),(594879215,1,0),(594879216,1,1),(594879217,1,1),(594879218,1,0),(594879219,1,1),(594879220,1,0),(594879221,1,1),(594879222,1,0),(594879223,1,0),(594879224,1,1),(594879225,1,1),(594879226,1,0),(594879227,1,0),(594879228,1,0),(594879229,1,1),(594879230,1,1),(594879231,1,1),(594879232,1,1),(594879233,1,1),(594879234,1,0),(594879235,1,0),(594879236,1,0),(594879237,1,1),(594879238,1,0),(594879239,1,0),(594879240,1,0),(594879241,1,1),(594879242,1,0),(594879243,1,0),(594879244,1,1),(594879245,1,0),(594879246,1,1),(594879247,1,0),(594879248,1,1),(594879249,1,1),(594879250,1,0),(594879251,1,0),(594879252,1,1),(594879253,1,0),(594879254,1,1),(594879255,1,0),(594879256,1,0),(594879257,1,1),(594879258,1,1),(594879259,1,1),(594879260,1,0),(594879261,1,1),(594879262,1,0),(594879263,1,0),(594879264,1,0);\n'

But I got stuck here and I don`t know how to continue. I would like to ask you:

  1. Is my strategy good? Or is there something better than reading it line by line?
  2. Can you show me, how can I filter is_bot from an_visit_ab.sql.bz2 as I have tried above?

Please, if there is something unclear in my question, do not downgrade immediately and give my chance to explain it more preciously :-). Thanks



from Filter data from zipped databases in python

No comments:

Post a Comment