Tuesday 27 December 2022

How can I match a pattern, and then everything up to that pattern again? So, match all the words and acronyms in my example below


I have the following paragraph:

text = """
בביהכנ"ס - בבית הכנסת דו"ח - דין וחשבון הת"ד -  התיקוני דיקנא
בגו"ר  - בגשמיות ורוחניות ה"א - ה' אלוקיכם התמי' - התמיהה
בהנ"ל - בהנזכר לעיל ה"א - ה' אלקיך ואח"כ - ואחר כך
בהשי״ת - בהשם יתברך ה"ה - הרי הוא / הוא הדין ואת"ה - ואיגוד תלמידי 
"""

This paragraph is made up of Hebrew words and their acronyms.

A word is a token that contains a quotation mark (").

So, for example, some words would be:

    'בביהכנ"ס',
    'דו"ח',
    'הת"ד'

Now, I'm able to match all the words with this regex:


[Image: the regex matches, highlighted in green]


But how can I also match all the corresponding acronyms as a separate group? (The acronyms are whatever is not matched, i.e. the parts not highlighted in green in the picture.)

Example acronyms are:

    'בבית הכנסת',
    'דין וחשבון',
    'התיקוני דיקנא'

Expected output

The expected output is a dictionary with the words as keys and the acronyms as values:

    'בביהכנ"ס': 'בבית הכנסת',
    'דו"ח': 'דין וחשבון',
    'הת"ד': 'התיקוני דיקנא'

My attempt

What I tried was to match all the words (as in the picture above):


and then to match everything until the pattern appears again by appending .*\1, so the entire regex would be:


But as you can see, that doesn't work:

[Image: the result of the combined regex]
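
A likely reason the attempt fails: in Python's `re` module, `\1` is a backreference that matches the exact text captured by group 1, not the group's pattern a second time. The sketch below illustrates this with `WORD`, a hypothetical stand-in for the word regex (the original pattern is not reproduced here), and two short sample strings.

```python
import re

# Hypothetical stand-in for the word pattern: a token containing a double quote.
WORD = r'(\S+"\S+)'

# \1 must repeat the captured text exactly, so two *different* words never match.
different = re.search(WORD + r'.*\1', 'דו"ח - דין וחשבון הת"ד')
print(different)

# The *same* word appearing twice does match, from its first copy to its second.
repeated = re.search(WORD + r'.*\1', 'ה"א - ה\' אלוקיכם ה"א')
print(repeated.group(1))
```

So `.*\1` only helps when the identical word occurs again later, which is not what is needed here.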

  • How can I match the words and the acronyms so that I can build a dictionary mapping each word to its acronym?
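
One possible approach, sketched under assumptions: treat each entry as `word - acronym`, where the word is a single whitespace-free token followed by a hyphen, and the acronym runs until the next such word or the end of the line. The names `ENTRY` and `build_dictionary` are made up for illustration, and the sample is only the first line of the paragraph.

```python
import re

# Sample: the first line of the paragraph from the question.
sample = 'בביהכנ"ס - בבית הכנסת דו"ח - דין וחשבון הת"ד -  התיקוני דיקנא'

# Assumption: a key is one whitespace-free token followed by " - ", and its
# value runs until the next "key - " or the end of the line.
ENTRY = re.compile(
    r'(\S+)\s+-\s+'            # group 1: the key, then the hyphen separator
    r'(.+?)'                   # group 2: the value, as short as possible
    r'(?=\s+\S+\s+-\s|\s*$)',  # lookahead: next "key - " or end of line
    re.MULTILINE,
)

def build_dictionary(text):
    """Map each key to its value; a repeated key keeps its last value."""
    return {m.group(1): m.group(2) for m in ENTRY.finditer(text)}

result = build_dictionary(sample)
print(result)
```

Because this keys on the hyphen rather than on the quotation mark, it would also pick up an entry such as `התמי'`, which has no double quote; whether that is wanted depends on the data. A word that occurs more than once, such as `ה"א`, keeps only its last value in a plain dictionary.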


Note: when you print the output, it may be displayed in left-to-right order, although it should really read from right to left. If you want to print from right to left, see this answer:

right-to-left languages in Python

