Tuesday, 27 December 2022

How can I match a pattern, and then everything up to that pattern again? That is, match all the words and acronyms in my example below.

Context

I have the following paragraph:

text = """
בביהכנ"ס - בבית הכנסת דו"ח - דין וחשבון הת"ד -  התיקוני דיקנא
בגו"ר  - בגשמיות ורוחניות ה"א - ה' אלוקיכם התמי' - התמיהה
בהנ"ל - בהנזכר לעיל ה"א - ה' אלקיך ואח"כ - ואחר כך
בהשי״ת - בהשם יתברך ה"ה - הרי הוא / הוא הדין ואת"ה - ואיגוד תלמידי 
"""

This paragraph consists of Hebrew words and their acronyms.

Each word contains a quotation mark (").

So for example, some words would be:

[
    'בביהכנ"ס',
    'דו"ח',
    'הת"ד'
]

Now, I'm able to match all the words with this regex:

(\b[\u05D0-\u05EA]*\"\b[\u05D0-\u05EA]*\b)

[Image: the regex's matches highlighted in green]
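
As a quick check, the regex above can be run with Python's re module. This is only a minimal sketch that uses the first line of the paragraph; the names sample and WORD_RE are illustrative and not part of my original code.

```python
import re

# First line of the paragraph from the question.
sample = 'בביהכנ"ס - בבית הכנסת דו"ח - דין וחשבון הת"ד -  התיקוני דיקנא'

# The regex from the question: Hebrew letters, a double quote, more Hebrew letters.
WORD_RE = re.compile(r'(\b[\u05D0-\u05EA]*\"\b[\u05D0-\u05EA]*\b)')

# With a single capturing group, findall returns the captured strings.
words = WORD_RE.findall(sample)
print(words)
```

Note that this pattern only accepts the ASCII double quote, so an entry written with a different mark (such as ״ in בהשי״ת) would presumably need the character class extended.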

Question

But how can I also match all the corresponding acronyms as a separate group? (The acronyms are the parts that are not matched, i.e. everything that is not highlighted in green in the picture.)

Example acronyms are:

[
    'בבית הכנסת',
    'דין וחשבון',
    'התיקוני דיקנא'
]

Expected output

The expected output is a dictionary with the words as keys and the acronyms as values:

{
    'בביהכנ"ס': 'בבית הכנסת',
    'דו"ח': 'דין וחשבון',
    'הת"ד': 'התיקוני דיקנא'
}

My attempt

I first matched all the words (as in the picture above):

(\b[\u05D0-\u05EA]*\"\b[\u05D0-\u05EA]*\b)

and then tried to match everything until the pattern appears again with .*\1, so the entire regex becomes:

(\b[\u05D0-\u05EA]*\"\b[\u05D0-\u05EA]*\b).*\1

But as you can see, that doesn't work:

[Image: the result of the regex above]
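
One likely reason: in Python's re module, a backreference such as \1 matches the same text that group 1 captured, not the group's pattern a second time. Since no word occurs twice in the first line, the regex cannot capture a whole word and then find it again. A minimal sketch of that behaviour (sample and ATTEMPT_RE are illustrative names):

```python
import re

# Backreference semantics on a simple case: \1 must repeat the captured text.
same_text = re.fullmatch(r'(\d+)-\1', '12-12')   # matches: "12" is repeated
other_text = re.fullmatch(r'(\d+)-\1', '12-34')  # None: "34" is not "12"

# The attempt from the question, applied to the first line of the paragraph.
sample = 'בביהכנ"ס - בבית הכנסת דו"ח - דין וחשבון הת"ד -  התיקוני דיקנא'
ATTEMPT_RE = re.compile(r'(\b[\u05D0-\u05EA]*\"\b[\u05D0-\u05EA]*\b).*\1')

m = ATTEMPT_RE.search(sample)
# Whatever group 1 ends up holding, it cannot be the full first word,
# because that word never appears a second time in the sample.
captured = m.group(1) if m else None
print(repr(captured))
```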

  • How can I match the words and their acronyms so that I can build a dictionary mapping each word to its acronym?
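
To make the goal concrete, here is a sketch of one possible approach, not a definitive solution: capture a word, skip the dash, then lazily capture text until a lookahead sees the next word or the end of the line. It assumes every entry has the form "word - acronym" and uses only the first line of the paragraph; WORD, PAIR_RE and pairs are hypothetical names.

```python
import re

# The word pattern from the question, without the outer capturing group.
WORD = r'\b[\u05D0-\u05EA]*"\b[\u05D0-\u05EA]*\b'

# Group 1: the word. Group 2: everything after the dash, taken lazily,
# stopping just before the next word (lookahead) or at the end of the line.
PAIR_RE = re.compile(rf'({WORD})\s*-\s*(.*?)\s*(?={WORD}|$)', re.MULTILINE)

sample = 'בביהכנ"ס - בבית הכנסת דו"ח - דין וחשבון הת"ד -  התיקוני דיקנא'

# findall returns (word, acronym) tuples, which dict() accepts directly.
pairs = dict(PAIR_RE.findall(sample))
for word, acronym in pairs.items():
    print(word, '->', acronym)
```

On the full paragraph this sketch has known gaps: entries whose word is written with ״ (as in בהשי״ת) or with ' (as in התמי') are not recognised by the word pattern, and a word that appears twice, such as ה"א, keeps only its last value in a dictionary.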

Note

When you print the output, it might be displayed in left-to-right order, although it should read right to left. If you want to print from right to left, see this answer:

right-to-left languages in Python


