In other words, there can be no other occurrence of the pattern between the end of the match and the second pattern. This needs to be implemented in a single regular expression.
In my specific case I have a page of HTML and need to extract all the content between
<w-block-content><span><div>
and
</div></span></w-block-content>
where
- the elements might have attributes
- the HTML might be formatted or not - there might be extra white space and newlines
- there may be other content between any of the above tags, including inner div elements within the above outer div. But you can assume each <w-block-content> element
  - contains ONLY ONE direct <span> child (i.e. it may contain other non-span children)
  - which contains ONLY ONE direct <div> child
  - which wraps the content that must be extracted
- 🚩 the match must extend all the way to the last </div> within the <span> within the <w-block-content>, even if it is unmatched with an opening <div>.
- the solution must be pure ECMAScript-spec Regex. No JavaScript code can be used
Thus the problem stated in the question at the top.
The following regex successfully matches as long as there are NO internal </div> tags:
(?:<w-block-content.*[\s\S]*?<div>)([\s\S]*?)(?:<\/div>[\s\S]*?<\/span>[\s\S]*?<\/w-block-content>)
❌ But if there are additional </div> tags, the match ends prematurely, not including the entirety of the block.
I use [\s\S]*? to match against arbitrary content, including extra whitespace and newlines.
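To make the failure concrete, here is a small sketch that runs the regex above against a reduced block containing one nested div. The JavaScript is only a test harness (the requirement is still a pure regex), and the shortened attribute value and the names `original`, `sample` and `captured` are assumptions made for illustration.

```javascript
// The regex from the question, unchanged.
const original = /(?:<w-block-content.*[\s\S]*?<div>)([\s\S]*?)(?:<\/div>[\s\S]*?<\/span>[\s\S]*?<\/w-block-content>)/;

// Reduced block with one nested <div>...</div> (attribute value shortened; hypothetical).
const sample = [
  '<w-block-content data-block-content-id="x"><span class="source-block-tooltip">',
  '<div>',
  'Další master<br><div><b>Master č. 2</b> </div><br>',
  '</div>',
  '</span></w-block-content>',
].join('\n');

// Group 1 is the lazily captured content; it stops before the FIRST </div>.
const captured = original.exec(sample)[1];
console.log(JSON.stringify(captured));
```

The lazy capture group stops right before the inner closing tag, so the inner </div> and the trailing <br> are missing from the extracted content.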
Here is sample test data:
</tr>
<td>
<div draggable="true" class="source master draggable-box wysiwyg-block" data-block-type="master" data-block-id="96e80afb-afa0-4e46-bfb7-34b80da76112" style="opacity: 1;">
<w-block-content data-block-content-id="96e80afb-afa0-4e46-bfb7-34b80da76112"><span class="source-block-tooltip">
<div>
Další master<br><div><b>Master č. 2</b> </div><br>
</div>
</span></w-block-content>
</div>
</td>
</tr>
</tr>
<td>
<div draggable="true" class="source master draggable-box wysiwyg-block" data-block-type="master" data-block-id="96e80afb-afa0-4e46-bfb7-34b80da76112" style="opacity: 1;">
<w-block-content data-block-content-id="96e80afb-afa0-4e46-bfb7-34b80da76112"><span class="source-block-tooltip">
<div>
Další master<br><b>Master č. 2</b><br>
</div>
</span></w-block-content>
</div>
</td>
</tr>
which I've been testing here: https://regex101.com/r/jekZhr/3
The first extracted chunk should be:
Další master<br><div><b>Master č. 2</b> </div><br>
I know that regex is not the best tool for handling XML/HTML, but I need to know whether such a regex is possible or whether I need to change the structure of the data.
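One possible direction, offered as a sketch that has only been reasoned through on a reduced version of the sample and not as a definitive answer: keep the capture lazy, but after the closing </div> refuse to step over any further </div> before reaching </span>, using a negative lookahead on each consumed character. This assumes the extracted content contains no nested </span>, which the question does not guarantee. The JavaScript around the pattern is only a harness, and the names `candidate`, `html` and `chunks`, as well as the shortened attributes, are hypothetical.

```javascript
// Candidate pattern (an assumption, not taken from the question).
// After the closing </div>, (?:(?!<\/div>)[\s\S])*? may not cross another </div>,
// so the lazy capture has to grow until the LAST </div> before </span>.
// The \s* around the group trims whitespace next to the wrapping <div> tags.
const candidate = /<w-block-content\b[^>]*>[\s\S]*?<span\b[^>]*>[\s\S]*?<div\b[^>]*>\s*([\s\S]*?)\s*<\/div>(?:(?!<\/div>)[\s\S])*?<\/span>[\s\S]*?<\/w-block-content>/g;

// Two reduced blocks modelled on the sample data (attributes shortened; hypothetical).
const html = [
  '<div draggable="true">',
  '<w-block-content data-block-content-id="x"><span class="source-block-tooltip">',
  '<div>',
  'Další master<br><div><b>Master č. 2</b> </div><br>',
  '</div>',
  '</span></w-block-content>',
  '</div>',
  '<div draggable="true">',
  '<w-block-content data-block-content-id="y"><span class="source-block-tooltip">',
  '<div>',
  'Další master<br><b>Master č. 2</b><br>',
  '</div>',
  '</span></w-block-content>',
  '</div>',
].join('\n');

// Collect capture group 1 from every match.
const chunks = Array.from(html.matchAll(candidate), (m) => m[1]);
console.log(chunks);
```

On this reduced input the first chunk includes the inner </div> and the trailing <br>. Because the pattern never counts opening tags, it does not care whether a </div> inside the span has a matching opening <div>.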
from Regex match up to the LAST occurrence of a pattern (e.g. ) BEFORE another matching pattern (e.g. )