Sunday, 25 July 2021

Different behavior while reading DataFrame from parquet using CLI Versus executable on same environment

I have the following program:

import pandas as pd
import pyarrow
from pyarrow import parquet

def foo():
    print(pyarrow.__file__)
    print('version:',pyarrow.cpp_version)
    print('-----------------------------------------------------')
    df = pd.DataFrame({'A': [1,2,3], 'B':['dummy']*3})
    print('Orignal DataFrame:\n', df)
    print('-----------------------------------------------------')
    _table = pyarrow.Table.from_pandas(df)
    parquet.write_table(_table, 'foo')
    _table = parquet.read_table('foo', columns=[])    #passing empty list to columns arg
    df = _table.to_pandas()
    print('After reading from file with columns=[]:\n', df)
    print('-----------------------------------------------------')
    print('Not passing [] to columns parameter')
    _table = parquet.read_table('foo')                #Not passing any list
    df = _table.to_pandas()
    print(df)
    print('-----------------------------------------------------')
    x = input('press any key to exit: ')

if __name__=='__main__':
    foo()

When I run it from the console/IDE, it reads all the data for columns=[]:

(env) D:\foo>python foo.py
D:\foo\env\lib\site-packages\pyarrow\__init__.py
version: 3.0.0
-----------------------------------------------------
Orignal DataFrame:
    A      B
0  1  dummy
1  2  dummy
2  3  dummy
-----------------------------------------------------
After reading from file with columns=[]:
    A      B
0  1  dummy
1  2  dummy
2  3  dummy
-----------------------------------------------------
Not passing [] to columns parameter
   A      B
0  1  dummy
1  2  dummy
2  3  dummy
-----------------------------------------------------
press any key to exit:

But when I run it from the executable created using PyInstaller, it reads no data for columns=[]:

[screenshot of the executable's console output]

As you can see, passing columns=[] gives an empty DataFrame in the executable, but not when running the Python file directly. I'm not sure whether this is a bug and, if it is, whether it is in pyarrow or in PyInstaller.

Looking at the docstring of parquet.read_table in the source code on GitHub:

columns: list
If not None, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. 'a' will select 'a.b', 'a.c', and 'a.d.e'.

read_table in turn calls dataset.read, which calls _dataset.to_table, which returns the result of self.scanner, which in turn returns the result of the static method from_dataset of the Scanner class.

Everywhere, None is used as the default value of the columns parameter. If None and [] are converted directly to Boolean in Python, both are indeed False, whereas checking [] against None (for example, [] is None) also gives False, meaning the empty list is not treated as None. Nowhere is it stated whether columns=[] should fetch all the columns, because it evaluates to False as a Boolean, or read no columns at all, because the list is empty.
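To make the ambiguity concrete, here is a minimal sketch of the two possible interpretations. The two helper functions are hypothetical and written only for illustration; they are not pyarrow code, and I am not claiming that either one matches what pyarrow does internally.

```python
def select_by_truthiness(all_columns, columns=None):
    # Hypothetical: any falsy value (None or []) means "no filter, read everything".
    if columns:
        return [c for c in all_columns if c in columns]
    return list(all_columns)


def select_by_none_check(all_columns, columns=None):
    # Hypothetical: only None means "no filter"; an empty list selects nothing.
    if columns is not None:
        return [c for c in all_columns if c in columns]
    return list(all_columns)


all_columns = ["A", "B"]
truthy_result = select_by_truthiness(all_columns, columns=[])
none_check_result = select_by_none_check(all_columns, columns=[])

print(bool(None), bool([]), [] is None)
print("truthiness check:", truthy_result)
print("None check:", none_check_result)
```

With columns=[], the truthiness version returns every column while the None-check version returns none, which mirrors the two behaviors I am seeing.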

But why is the behavior different when running it from the command line/IDE compared with running it from the executable created using PyInstaller, for the same version of pyarrow?
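One way to narrow this down would be to run the same small probe in both environments and compare the output. The sketch below assumes the difference might come from which modules are importable inside the bundle. That is only a guess, and the list of module names is my assumption too. It uses only the standard library, plus the sys.frozen attribute that PyInstaller sets on bundled applications.

```python
import importlib
import sys


def probe_imports(module_names):
    """Try to import each module and record 'ok' or the error that was raised."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = "ok"
        except Exception as exc:  # e.g. ImportError if the bundle lacks the module
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


def describe_runtime():
    # PyInstaller sets sys.frozen in the bundled app; it is absent in a normal run.
    return {"frozen": bool(getattr(sys, "frozen", False)),
            "executable": sys.executable}


if __name__ == "__main__":
    print(describe_runtime())
    # Assumed module list; adjust it to whatever read_table imports in your version.
    modules = ["pyarrow", "pyarrow.parquet", "pyarrow.dataset"]
    for name, status in probe_imports(modules).items():
        print(name, "->", status)
```

If a module imports under python foo.py but fails inside the executable, that would at least show that the two runs are not executing the same code.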

The environment I'm on:

  • Python Version: 3.7.6
  • PyInstaller Version: 4.2
  • Pyarrow Version: 3.0.0
  • Windows 10 64 bit OS

Here is the spec file for your reference if you want to give it a try:

foo.spec

# -*- mode: python ; coding: utf-8 -*-
import sys ; sys.setrecursionlimit(sys.getrecursionlimit() * 5)
block_cipher = None


a = Analysis(['foo.py'],
             pathex=['D:\\foo'],
             binaries=[],
             datas=[],
             hiddenimports=[],
             hookspath=[],
             runtime_hooks=[],
             excludes=[],
             win_no_prefer_redirects=False,
             win_private_assemblies=False,
             cipher=block_cipher,
             noarchive=False)
pyz = PYZ(a.pure, a.zipped_data,
             cipher=block_cipher)
exe = EXE(pyz,
          a.scripts,
          [],
          exclude_binaries=True,
          name='foo',
          debug=False,
          bootloader_ignore_signals=False,
          strip=False,
          upx=True,
          console=True )
coll = COLLECT(exe,
               a.binaries,
               a.zipfiles,
               a.datas,
               strip=False,
               upx=True,
               upx_exclude=[],
               name='foo')

