
Regex: Match A Specific Pattern, Exclude If Match Is In A Specific Context

I am a beginner in regex and wanted to ask how you can solve this problem with regex. At the moment I am trying to preprocess German text. German has a few specific characters in i

Solution 1:

You may use

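import pandas as pd

# Map each ASCII digraph to the umlaut it stands for
dct = {'ae' : 'ä', 'Ae' : 'Ä', 'oe' : 'ö', 'Oe' : 'Ö', 'ue' : 'ü', 'Ue' : 'Ü'}
df = pd.DataFrame({"text": ["Uebergang", "euer"]})
# regex=True is needed in pandas 2.0+, where str.replace treats the pattern as a literal by default
df['text'] = df['text'].str.replace(r'[AaÄäEe]ue|([aouAOU]e)', lambda x: dct[x.group(1)] if x.group(1) else x.group(), regex=True)
print(df['text'])
# 0    Übergang
# 1        euer
# Name: text, dtype: object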

The [AaÄäEe]ue|([aouAOU]e) pattern matches:

  • [AaÄäEe]ue - A, a, Ä, ä, E or e followed by the ue substring (no capturing group, so Group 1 stays empty)
  • | - or
  • ([aouAOU]e) - Group 1: a, o, u, A, O or U followed by e

The lambda x: dct[x.group(1)] if x.group(1) else x.group() expression is called for every match. If Group 1 matched, dct[x.group(1)] returns the umlaut to substitute. Otherwise the first alternative matched (a ue that follows A, a, Ä, ä, E or e), and the whole match is put back unchanged.
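The same match-or-skip idea can be tried outside pandas with plain re.sub. This is only a minimal sketch: the names UMLAUTS, PATTERN and restore_umlauts are made up for illustration, and the extra sample words are assumptions, not part of the question.

```python
import re

# Same mapping as dct above (the name UMLAUTS is hypothetical)
UMLAUTS = {'ae': 'ä', 'Ae': 'Ä', 'oe': 'ö', 'Oe': 'Ö', 'ue': 'ü', 'Ue': 'Ü'}

# First alternative: context to skip (no group). Second: digraph to convert (Group 1).
PATTERN = re.compile(r'[AaÄäEe]ue|([aouAOU]e)')

def restore_umlauts(text):
    def repl(m):
        # Group 1 is None when the "skip" alternative matched
        return UMLAUTS[m.group(1)] if m.group(1) else m.group()
    return PATTERN.sub(repl, text)

for word in ["Uebergang", "euer", "Mauer", "Haeuser"]:
    print(word, "->", restore_umlauts(word))
# Uebergang -> Übergang
# euer -> euer
# Mauer -> Mauer
# Haeuser -> Häuser
```

Note that the pattern only protects a ue that follows A, a, Ä, ä, E or e. A word such as aktuell would still be converted, so real text may need additional exceptions.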

Solution 2:

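This should do the trick:

df["text"] = df["text"].str.replace(r"(?<![AaÄäEe])ue", "ü", regex=True)

(?<![AaÄäEe]) is a negative lookbehind: ue matches only when the character before it is not A, a, Ä, ä, E or e. Unlike a negated character class such as [^AaÄäEe] (where the leading ^ means "not"), the lookbehind does not consume the preceding character, so that character is not replaced, and a ue at the very start of a string is still converted. Only lowercase ue is handled here.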
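To see how a negated character class and a negative lookbehind behave differently in this context, here is a small comparison using plain re.sub. It is a sketch only; the sample words are assumptions chosen for illustration.

```python
import re

words = ["Muenchen", "ueber", "euer", "Mauer"]

# Negated class: the character before "ue" is part of the match and gets replaced too
consuming = [re.sub(r"[^AaÄäEe]ue", "ü", w) for w in words]

# Negative lookbehind: the preceding character is only checked, not consumed
lookbehind = [re.sub(r"(?<![AaÄäEe])ue", "ü", w) for w in words]

print(consuming)   # ['ünchen', 'ueber', 'euer', 'Mauer']
print(lookbehind)  # ['München', 'über', 'euer', 'Mauer']
```

The negated class drops the M in Muenchen and cannot match ueber at all, because it needs some character in front of ue. The lookbehind version handles both.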
