Context
I have the following paragraph:
text = """
בביהכנ"ס - בבית הכנסת דו"ח - דין וחשבון הת"ד - התיקוני דיקנא
בגו"ר - בגשמיות ורוחניות ה"א - ה' אלוקיכם התמי' - התמיהה
בהנ"ל - בהנזכר לעיל ה"א - ה' אלקיך ואח"כ - ואחר כך
בהשי״ת - בהשם יתברך ה"ה - הרי הוא / הוא הדין ואת"ה - ואיגוד תלמידי
"""
This paragraph consists of Hebrew words and their acronyms.
A word contains a quotation mark (").
So for example, some words would be:
[
'בביהכנ"ס',
'דו"ח',
'הת"ד'
]
Now, I'm able to match all the words with this regex:
(\b[\u05D0-\u05EA]*\"\b[\u05D0-\u05EA]*\b)
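As a quick check of what this pattern captures, here is a minimal sketch that runs it with Python's `re` module on the first line of the paragraph. The names `WORD`, `sample` and `words` are only illustrative. Note that the pattern looks for the ASCII double quote only, so an entry written with a different mark (בהשי״ת appears to use the Hebrew gershayim character, and התמי' ends in an apostrophe) would not be matched by it.

```python
import re

# The pattern from the question, unchanged: Hebrew letters, a double quote,
# then more Hebrew letters, all wrapped in one capturing group.
WORD = r'(\b[\u05D0-\u05EA]*\"\b[\u05D0-\u05EA]*\b)'

# First line of the paragraph, used here as a small sample.
sample = 'בביהכנ"ס - בבית הכנסת דו"ח - דין וחשבון הת"ד - התיקוני דיקנא'

# With a single capturing group, findall returns a list of that group's matches.
words = re.findall(WORD, sample)
# words == ['בביהכנ"ס', 'דו"ח', 'הת"ד']
```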
Question
But how can I also match all the corresponding acronyms as a separate group? (The acronyms are the parts of the text that the regex above does not match.)
Example acronyms are:
[
'בבית הכנסת',
'דין וחשבון',
'התיקוני דיקנא'
]
Expected output
The expected output should be a dictionary with the words as keys and the acronyms as values:
{
'בביהכנ"ס': 'בבית הכנסת',
'דו"ח': 'דין וחשבון',
'הת"ד': 'התיקוני דיקנא'
}
My attempt
What I tried was to match all the words, as above:
(\b[\u05D0-\u05EA]*\"\b[\u05D0-\u05EA]*\b)
and then match everything until the pattern appears again with .*\1, so the entire regex would be:
(\b[\u05D0-\u05EA]*\"\b[\u05D0-\u05EA]*\b).*\1
But that doesn't work.
How can I match the words and acronyms to compose a dictionary that maps each word to its acronym?
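One likely reason the attempt fails is that `\1` is a backreference: it matches the exact text that group 1 captured, not "whatever the pattern in group 1 would match again". A different technique, a lazy match bounded by a lookahead, may fit better. The following is a sketch under assumptions, not a definitive answer: it assumes every entry has the form `word - acronym`, that the word is a single token with no spaces, and that the separator is always space, hyphen, space. The names `PAIR` and `pairs` are hypothetical.

```python
import re

# Two lines from the paragraph in the question.
text = """
בביהכנ"ס - בבית הכנסת דו"ח - דין וחשבון הת"ד - התיקוני דיקנא
בגו"ר - בגשמיות ורוחניות ה"א - ה' אלוקיכם התמי' - התמיהה
"""

# Assumption: every entry is "<word> - <acronym>", and the word is a single
# token without spaces. PAIR is a hypothetical name, not from the question.
PAIR = re.compile(
    r'(\S+) - '          # group 1: the word, then the " - " separator
    r'(.*?)'             # group 2: the acronym, matched lazily
    r'(?= \S+ - |$)',    # stop before the next "<word> - " or at end of line
    re.MULTILINE,
)

# findall returns (word, acronym) tuples, which dict() accepts directly.
pairs = dict(PAIR.findall(text))
# For example, pairs['דו"ח'] is 'דין וחשבון'.
```

Because the word is matched with `\S+` instead of a fixed quote character, entries such as התמי' are picked up as well. One caveat: in the full paragraph ה"א occurs twice with different acronyms, and a plain dictionary keeps only the last one, so collecting the values into lists may be preferable if both are needed.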
Note
When you print the output, it might be displayed in left-to-right order, although it should really read from right to left. If you want to print from right to left, see this answer:
right-to-left languages in Python
from How can I match a pattern, and then everything upto that pattern again? So, match all the words and acronyms in my below example