Thursday, 29 September 2022

Why use importlib.resources over __file__?

I have a package laid out like this:

mypkg
    |-mypkg
        |- data
            |- data.csv
            |- __init__.py  # Required for importlib.resources 
        |- scripts
            |- module.py
        |- __init__.py

The module module.py requires data.csv to perform a certain task.

The first naive approach I used to access data.csv was

# module.py - Approach 1
from pathlib import Path

data_path = Path(Path.cwd().parent, 'data', 'data.csv')

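but this obviously breaks when module.py is imported via from mypkg.scripts import module or similar, because Path.cwd() is the directory the process was started from, not the directory that contains module.py. I need a way to access data.csv regardless of where mypkg is imported from.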
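To make the failure concrete, here is a minimal sketch of why Approach 1 depends on the caller. The helper name naive_data_path and the temporary directories a/scripts and b/elsewhere are hypothetical, made up only for this illustration.

```python
import os
import tempfile
from pathlib import Path


def naive_data_path() -> Path:
    # Approach 1: resolved against the process's current working directory
    return Path(Path.cwd().parent, "data", "data.csv")


with tempfile.TemporaryDirectory() as tmp:
    original_cwd = Path.cwd()
    first = Path(tmp, "a", "scripts")
    second = Path(tmp, "b", "elsewhere")
    first.mkdir(parents=True)
    second.mkdir(parents=True)
    try:
        # Same function, two different working directories
        os.chdir(first)
        path_from_first = naive_data_path()
        os.chdir(second)
        path_from_second = naive_data_path()
    finally:
        # Restore the working directory before the temp dir is removed
        os.chdir(original_cwd)

# The computed path follows the working directory, not the module
print(path_from_first != path_from_second)
```

The same code yields two different paths, so it only finds data.csv when the process happens to be started from the scripts directory.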

The next naive approach is to use the __file__ attribute to get the path of module.py, wherever it is located.

# module.py - Approach 2
from pathlib import Path

data_path = Path(Path(__file__).resolve().parents[1], 'data', 'data.csv')

However, while researching this problem I found that this approach is discouraged. See, for example, How to read a (static) file from inside a Python package?.

Though there doesn't seem to be total agreement as to the best solution to this problem, it looks like importlib.resources is maybe the most popular. I believe this would look like:

# module.py - Approach 3
from pathlib import Path
import importlib.resources

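# path() returns a context manager that yields a pathlib.Path
data_path_resource = importlib.resources.path('mypkg.data', 'data.csv')
with data_path_resource as resource:
    # for a zipped package this may be a temporary file, removed on exit
    data_path = resource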

Why is this final approach better than __file__? It seems like __file__ won't work if the source code is zipped. This is a case I'm not familiar with, and it also sounds a bit fringe. I don't think my code will ever be run zipped.
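For reference, the zipped case can be reproduced with a small sketch. It assumes Python 3.9+ (for importlib.resources.files). The package name zipdemo_pkg is hypothetical, and the sketch adds a scripts/__init__.py that the layout above does not have, so that the import from the archive works on every supported version.

```python
import importlib
import importlib.resources
import sys
import tempfile
import zipfile
from pathlib import Path

# Source of the module stored inside the archive (hypothetical package name)
MODULE_SOURCE = '''
from pathlib import Path
import importlib.resources

# Approach 2: a filesystem path derived from __file__
file_based_path = Path(Path(__file__).resolve().parents[1], "data", "data.csv")


def read_with_resources():
    # Approach 3 (files() variant): ask the import system for the resource
    resource = importlib.resources.files("zipdemo_pkg.data").joinpath("data.csv")
    return resource.read_text(encoding="utf-8")
'''

with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp, "zipdemo.zip")
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("zipdemo_pkg/__init__.py", "")
        zf.writestr("zipdemo_pkg/data/__init__.py", "")
        zf.writestr("zipdemo_pkg/data/data.csv", "a,b\n1,2\n")
        zf.writestr("zipdemo_pkg/scripts/__init__.py", "")
        zf.writestr("zipdemo_pkg/scripts/module.py", MODULE_SOURCE)

    # Python can import packages directly from a zip archive on sys.path
    sys.path.insert(0, str(archive))
    try:
        module = importlib.import_module("zipdemo_pkg.scripts.module")
        # The __file__-based path points "inside" the zip file, which is
        # not a real directory on disk
        file_path_exists = module.file_based_path.exists()
        # The resource API reads the member out of the archive
        resource_text = module.read_with_resources()
    finally:
        sys.path.remove(str(archive))

print(file_path_exists)
print(repr(resource_text))
```

In this setup the path built from __file__ does not exist on disk, while the importlib.resources call still returns the file contents.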

The added overhead from importlib seems a little ridiculous. I need to add an empty __init__.py in the data folder, I need to import importlib, and I need to use a context manager just to access a relative path.
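As a rough sketch of how much ceremony is actually involved, assuming Python 3.9+: reading the contents through importlib.resources.files needs no context manager, and importlib.resources.as_file is only needed when some other API insists on a real filesystem path. The package name diskdemo_pkg is hypothetical and is created in a temporary directory just so the example is self-contained.

```python
import importlib
import importlib.resources
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Build a throwaway on-disk package that mirrors the layout above
    data_dir = Path(tmp, "diskdemo_pkg", "data")
    data_dir.mkdir(parents=True)
    Path(tmp, "diskdemo_pkg", "__init__.py").write_text("")
    Path(data_dir, "__init__.py").write_text("")
    Path(data_dir, "data.csv").write_text("a,b\n1,2\n", encoding="utf-8")

    sys.path.insert(0, tmp)
    try:
        importlib.invalidate_caches()
        resource = importlib.resources.files("diskdemo_pkg.data").joinpath("data.csv")

        # Reading the contents needs no context manager
        text = resource.read_text(encoding="utf-8")

        # A context manager is only needed when a real path is required
        with importlib.resources.as_file(resource) as real_path:
            real_path_exists = real_path.exists()
            same_file = real_path.samefile(Path(data_dir, "data.csv"))
    finally:
        sys.path.remove(tmp)

print(repr(text))
print(real_path_exists, same_file)
```

For a package installed as ordinary files, as_file hands back the existing file rather than a copy, so the context manager costs little in the common case.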

What am I missing about the benefits of the importlib strategy? Why not just use __file__?

edit: One possible justification for the importlib approach is that it has slightly better semantics. That is, data.csv should be thought of as part of the package, so we should access it using something like from mypkg import data.csv, but of course this syntax only works for importing .py Python modules. importlib.resources in effect ports the "import something from some package" semantics to more general file types.

By contrast, the syntax of building a relative path from __file__ is sort of saying: this module is incidentally close to the data file in the file structure so let's take advantage of that to access it. The fact that the data file is part of the package isn't leveraged.


