Friday, 19 March 2021

DOMParser for large html

I have a large amount of html clipboard data from Excel, about 250MB (though it contains a lot of formatting, so when actually pasting it in, the data is much, much smaller than that).

Currently I am using DOMParser, which is just one line of code, with everything happening behind the scenes:

const doc3 = new DOMParser().parseFromString(htmlString, "text/html");

However, parsing takes ~18s, and the page is completely blocked until it finishes. Offloading the work to a web worker would avoid the blocking, but the user would still get no progress indication and would simply wait 18s for something to happen. I would argue that is almost the same as freezing, even though the user can technically still interact with the page.

Is there an alternative way to parse a large HTML/XML file? Perhaps something that doesn't load everything at once and so can stay responsive? What would be a good solution here? I suppose the following might be in line with that, but I'm not really sure: https://github.com/isaacs/sax-js.
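To make the kind of approach I have in mind concrete, here is a minimal sketch of a time-sliced scan that reports progress. It is not a real HTML parser and it is not sax-js. The names scanCells and parseCellsInSlices are hypothetical, and it assumes the cells look like <td ...>...</td>, are not nested, and that no attribute value contains ">". I have not checked those assumptions against the actual Excel output.

```javascript
// Sketch only: scans for table cells instead of building a full DOM.
// Assumptions (hypothetical, not verified against Excel's clipboard HTML):
// cells are written as <td ...>...</td>, are never nested, and no
// attribute value contains ">".
function* scanCells(html) {
  const cellRe = /<td\b([^>]*)>([\s\S]*?)<\/td>/gi;
  let match;
  while ((match = cellRe.exec(html)) !== null) {
    yield {
      attrs: match[1].trim(),                          // raw attribute text, e.g. class/style
      text: match[2].replace(/<[^>]*>/g, "").trim(),   // cell content with inner tags stripped
      offset: cellRe.lastIndex,                        // how far into the input we are
    };
  }
}

// Consumes scanCells in short time slices, yielding to the event loop
// between slices so the page (or worker) can repaint or post progress.
async function parseCellsInSlices(html, onProgress = () => {}, sliceMs = 16) {
  const cells = [];
  let sliceStart = Date.now();
  for (const cell of scanCells(html)) {
    cells.push(cell);
    if (Date.now() - sliceStart >= sliceMs) {
      onProgress(cell.offset / html.length);           // fraction of input scanned so far
      await new Promise((resolve) => setTimeout(resolve, 0));
      sliceStart = Date.now();
    }
  }
  onProgress(1);
  return cells;
}
```

Usage would be something like parseCellsInSlices(htmlString, (p) => updateProgressBar(p)), where updateProgressBar is whatever the page uses to show progress. The trade-off is that this only recovers the cell attributes and text, not a document tree, so anything that depends on the <style> block or row structure would need extra handling.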


Update: here is a sample Excel file: https://drive.google.com/file/d/1GIK7q_aU5tLuDNBVtlsDput8Oo1Ocz01/view?usp=sharing. You can download the file, open it in Excel, press Cmd-A (select all) and Cmd-C (copy), and the data will be on your clipboard. For me, the copied data takes up 249MB in the clipboard's text/html format.

Yes, the data is also available as text/plain (which we use as a backup), but the point of grabbing text/html is to capture the formatting, both data formatting (for example, numberType=Percent with 3 decimals) and stylistic formatting (for example, background color=red). Please use that as a test for any sample code. The actual text/html content (in ASCII) as it appears in the clipboard is here: https://drive.google.com/file/d/1ZUL2A4Rlk3KPqO4vSSEEGBWuGXj7j5Vh/view?usp=sharing
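For reference, the text/html-first, text/plain-as-backup logic looks roughly like the sketch below. The function name readPastedData is hypothetical and this is a simplified version rather than our exact code. It relies on clipboardData.getData returning an empty string when a format is not present.

```javascript
// Sketch: prefer text/html from a paste event, fall back to text/plain.
// `event` is expected to be a ClipboardEvent (or any object that exposes
// clipboardData.getData, which is what the check below uses).
function readPastedData(event) {
  const data = event.clipboardData;
  const html = data.getData("text/html");   // "" when the format is absent
  if (html) {
    return { format: "text/html", content: html };
  }
  return { format: "text/plain", content: data.getData("text/plain") };
}
```

It would be wired up with something like document.addEventListener("paste", (e) => handle(readPastedData(e))), where handle is a placeholder for whatever does the parsing.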


