Monday, 14 December 2020

Among the many Python file copy functions, which ones are safe if the copy is interrupted?

As seen in How do I copy a file in Python?, there are many file copy functions:

  • shutil.copy

  • shutil.copy2

  • shutil.copyfile (and also shutil.copyfileobj)

  • or even a naive method:

    with open('sourcefile', 'rb') as f, open('destfile', 'wb') as g:
        while True:
            block = f.read(16*1024*1024)  # work by blocks of 16 MB
            if not block:  # EOF
                break
            g.write(block)
    

Among all these methods, which ones are safe in the case of a copy interruption (example: kill the Python process)? The last one in the list looks ok.

By safe I mean: if a 1 GB file copy is not 100% finished (let's say it's interrupted in the middle of the copy, after 400MB), the file size should not be reported as 1GB in the filesystem, it should:

  • either report the size the file had when the last bytes were written (e.g. 400MB)
  • or be deleted

The worst would be that the final filesize is written first (internally with an fallocate or ftruncate?). This would be a problem if the copy is interrupted: by looking at the file-size, we would think the file is correctly written.

Many incremental backup programs (I'm coding one) use "filename+mtime+fsize" to check if a file has to be copied or if it's already there (of course a better solution is to SHA256 source and destination files but this is not done for every sync, too much time-consuming; off-topic here).

So I want to make sure that the "copy file" function does not store the final file size immediately (then it could fool the fsize comparison), before copying the actual file content.


Note: I'm asking the question because, while shutil.filecopy was rather straighforward on Python 3.7 and below, see source (which is more or less the naive method above), it seems much more complicated on Python 3.9, see source, with many different cases for Windows, Linux, MacOS, "fastcopy" tricks, etc.



from Among the many Python file copy functions, which ones are safe if the copy is interrupted?

No comments:

Post a Comment