Friday, 31 March 2023

Read a file line by line in Pyodide

The code below reads the entire user-selected input file into memory, which is impractical for very large (> 10 GB) files. I need to read the file line by line instead.

How can I read a file in Pyodide one line at a time?


<!doctype html>
<html>
  <head>
      <script src="https://cdn.jsdelivr.net/pyodide/v0.22.1/full/pyodide.js"></script>
  </head>
  <body>
    <button>Analyze input</button>
    <script type="text/javascript">
      async function main() {
        // Get the file contents into JS
        const [fileHandle] = await showOpenFilePicker();
        const fileData = await fileHandle.getFile();
        const contents = await fileData.text();

        // Create the Python convert toy function
        let pyodide = await loadPyodide();
        let convert = pyodide.runPython(`
from pyodide.ffi import to_js
def convert(contents):
    return to_js(contents.lower())
convert
      `);

        let result = convert(contents);
        console.log(result);

        const blob = new Blob([result], {type : 'application/text'});

        let url = window.URL.createObjectURL(blob);

        var downloadLink = document.createElement("a");
        downloadLink.href = url;
        downloadLink.text = "Download output";
        downloadLink.download = "out.txt";
        document.body.appendChild(downloadLink);

      }
      const button = document.querySelector('button');
      button.addEventListener('click', main);
    </script>
  </body>
</html>

The code is taken from an answer to the question "Select and read a file from user's filesystem".


Based on the answer by rth, I now use the code below, which reads the file in chunks. It still has two issues:

  • Chunk boundaries do not coincide with line boundaries. The example input file has exactly 100 characters per line, yet the console log (below) shows shorter lines at the start and end of most chunks, so lines are being split in the middle rather than at the newline.
  • I cannot write the variable result to the output file that is offered to the user for download (in the code below it is replaced by the dummy string 'result' for illustration).

<!doctype html>
<html>
  <head>
    <script src="https://cdn.jsdelivr.net/pyodide/v0.22.1/full/pyodide.js"></script>
  </head>
  <body>
    <button>Analyze input</button>
    <script type="text/javascript">
      async function main() {
          
          // Create the Python convert toy function
          let pyodide = await loadPyodide();
          let convert = pyodide.runPython(`
from pyodide.ffi import to_js
def convert(contents):
    for line in contents.split('\\n'):
        print(len(line))
    return to_js(contents.lower())
convert
      `);
          
          // Get the file contents into JS
          const bytes_func = pyodide.globals.get('bytes');                                               
          
          const [fileHandle] = await showOpenFilePicker();  
          let fh = await fileHandle.getFile()  
          const stream = fh.stream();  
          const reader = stream.getReader();
          // Do a loop until end of file


          while( true ) {
              const { done, value } = await reader.read();
              if( done ) { break; }
              handleChunk( value );
          }
          console.log( "all done" );


          function handleChunk( buf ) {
              console.log( "received a new buffer", buf.byteLength );
              let result = convert(bytes_func(buf).decode('utf-8'));
          }
          
          const blob = new Blob(['result'], {type : 'application/text'});
          
          let url = window.URL.createObjectURL(blob);
          
          var downloadLink = document.createElement("a");
          downloadLink.href = url;
          downloadLink.text = "Download output";
          downloadLink.download = "out.txt";
          document.body.appendChild(downloadLink);
          
      }
      const button = document.querySelector('button');
      button.addEventListener('click', main);
    </script>
  </body>
</html>

Given this input file with 100 characters per line:

perl -le 'for (1..1e5) { print "0" x 100 }' > test_100x1e5.txt

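I get the following console log output, which shows that lines are split at chunk boundaries rather than at newlines (a number fused to the front of a line, such as 648, appears to be the console's repeat count for identical consecutive messages):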

received a new buffer 65536
648pyodide.asm.js:10 100
pyodide.asm.js:10 88
read_write_bytes_func.html:41 received a new buffer 2031616
pyodide.asm.js:10 12
20114pyodide.asm.js:10 100
pyodide.asm.js:10 89
read_write_bytes_func.html:41 received a new buffer 2097152
pyodide.asm.js:10 11
20763pyodide.asm.js:10 100
pyodide.asm.js:10 77
read_write_bytes_func.html:41 received a new buffer 2097152
pyodide.asm.js:10 23
20763pyodide.asm.js:10 100
pyodide.asm.js:10 65
read_write_bytes_func.html:41 received a new buffer 2097152
pyodide.asm.js:10 35
20763pyodide.asm.js:10 100
pyodide.asm.js:10 53
read_write_bytes_func.html:41 received a new buffer 1711392
pyodide.asm.js:10 47
16944pyodide.asm.js:10 100
pyodide.asm.js:10 0
read_write_bytes_func.html:37 all done
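
The log above is consistent with the stream delivering chunks of fixed byte sizes regardless of content: each line occupies 101 bytes (100 characters plus the newline), and 65536 = 648 × 101 + 88, which matches the 648 full lines followed by an 88-character fragment. One possible way to handle this, sketched below, is to keep the trailing partial line of each chunk and prepend it to the next chunk before splitting. The helper name `createLineSplitter` and its `push`/`flush` methods are hypothetical and are not part of Pyodide or the Streams API. `TextDecoder` is called with `{ stream: true }` so that a multi-byte UTF-8 character cut by a chunk boundary is reassembled as well.

```javascript
// Hypothetical helper: turns arbitrary byte chunks into complete lines.
function createLineSplitter() {
  const decoder = new TextDecoder("utf-8");
  let remainder = ""; // partial last line carried over from the previous chunk

  return {
    // Accepts a Uint8Array chunk and returns the complete lines it finishes.
    push(chunk) {
      const text = remainder + decoder.decode(chunk, { stream: true });
      const lines = text.split("\n");
      remainder = lines.pop(); // text after the last newline (may be "")
      return lines;
    },
    // Call once after the last chunk; returns the final unterminated line, if any.
    flush() {
      const text = remainder + decoder.decode();
      remainder = "";
      return text === "" ? [] : [text];
    },
  };
}

// Simulate the example input: lines of 100 "0" characters, read in chunks
// whose size (65536 bytes) is not a multiple of the 101-byte line length.
const input = new TextEncoder().encode(("0".repeat(100) + "\n").repeat(2000));
const splitter = createLineSplitter();
const lineLengths = new Set();
let lineCount = 0;
for (let offset = 0; offset < input.length; offset += 65536) {
  for (const line of splitter.push(input.subarray(offset, offset + 65536))) {
    lineLengths.add(line.length);
    lineCount += 1;
  }
}
for (const line of splitter.flush()) {
  lineLengths.add(line.length);
  lineCount += 1;
}
console.log(lineCount, [...lineLengths]);
```

In this simulation every reconstructed line has 100 characters even though the chunk boundaries fall mid-line. In the page above, `handleChunk(value)` could call `splitter.push(value)` and pass the returned lines to `convert`, with `splitter.flush()` called once after the read loop ends.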

If I change this line:

const blob = new Blob(['result'], {type : 'application/text'});

to this:

const blob = new Blob([result], {type : 'application/text'});

then I get the error:

Uncaught (in promise) ReferenceError: result is not defined
    at HTMLButtonElement.main (read_write_bytes_func.html:45:34)
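
The ReferenceError is consistent with JavaScript block scoping: `result` is declared with `let` inside `handleChunk`, so it does not exist in `main` where the `Blob` is created, and each chunk's value is discarded when the function returns. A minimal sketch of one way around this is to collect every chunk's output in an array declared in the enclosing scope and pass that array to the `Blob` constructor. Here `convertChunk` is a hypothetical stand-in for the Pyodide `convert` function, and the names `collectOutput` and `outputParts` are likewise assumptions made for illustration.

```javascript
// Hypothetical stand-in for the Pyodide `convert` function in the question.
function convertChunk(text) {
  return text.toLowerCase();
}

// Processes all chunks and returns the converted pieces in order.
function collectOutput(chunks) {
  const decoder = new TextDecoder("utf-8");
  const outputParts = []; // declared in the outer scope so it outlives handleChunk

  function handleChunk(buf) {
    // `result` is only visible inside this function, so store it before returning.
    const result = convertChunk(decoder.decode(buf, { stream: true }));
    outputParts.push(result);
  }

  for (const chunk of chunks) {
    handleChunk(chunk);
  }
  return outputParts;
}

// Two in-memory chunks stand in for the values returned by reader.read().
const encoder = new TextEncoder();
const outputParts = collectOutput([
  encoder.encode("HELLO\nWOR"),
  encoder.encode("LD\n"),
]);

// A Blob accepts an array of strings and concatenates them in order.
const blob = new Blob(outputParts, { type: "text/plain" });
console.log(outputParts.join(""), blob.size);
```

The resulting `blob` can be passed to `window.URL.createObjectURL` as in the original code. Note that this keeps the whole converted output in memory until the blob is built, which may or may not be acceptable for multi-gigabyte inputs depending on the output size.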

