How To Extract University/school/college Name From String In Python Using Regular Expression?
Sample code:
import re
line = 'should we use regex more often, University of Pennsylvania. let me know at 321dsasdsa@dasdsa.com.lol'
match = re.search(r'/([A-Z][^\s,.]+[.]?\s[(]?)
Solution 1:
Instead of re.search, try re.sub to print your expected output:
import re
i = "should we use regex more often, University of Pennsylvania. let me know at 321dsasdsa@dasdsa.com.lol"
line = re.sub(r"[\w\W]* ((Hospital|University|Centre|Law School|School|Academy|Department)[\w -]*)[\w\W]*$", r"\1", i)
print(line)
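One caveat with re.sub: when none of the keywords is present, the pattern does not match and the input string comes back unchanged, so a miss looks the same as a hit. Below is a minimal sketch of the same keyword idea using re.search, which returns None on a miss. The names extract_institution, KEYWORDS and PATTERN are hypothetical, the keyword list is simply the one from the pattern above, and a word boundary (\b) is swapped in for the leading space. Note that it returns the first keyword-led name in the string, whereas the greedy re.sub version keeps the last one.

```python
import re

# Keywords copied from the pattern above; extend as needed.
KEYWORDS = r"Hospital|University|Centre|Law School|School|Academy|Department"

# A keyword at a word boundary, followed by word characters, spaces or hyphens.
PATTERN = re.compile(r"\b((?:" + KEYWORDS + r")[\w -]*)")


def extract_institution(text):
    """Return the first keyword-led name found in text, or None if there is none."""
    m = PATTERN.search(text)
    return m.group(1).strip() if m else None


i = "should we use regex more often, University of Pennsylvania. let me know at 321dsasdsa@dasdsa.com.lol"
print(extract_institution(i))
print(extract_institution("no institution mentioned here"))
```

Returning None makes it easy for the caller to skip lines that mention no institution instead of silently passing the whole line through.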
Solution 2:
The test string you've given appears to be made up: the university name is immediately followed by a period ('.'), whereas the other examples in your pastebin sample are followed by a comma.
line = 'should we use regex more often, University of Pennsylvania. let me know at 321dsasdsa@dasdsa.com.lol'
I managed to extract the names from the examples in your pastebin using a simple regex; you can see the details at regex101.com.
Logic
Since the institute name is separated by commas (except in the first case, where the string starts with the university name), the match will lie in either group1 or group2. You can then check group1 and group2 against a pre-defined list of keywords and return whichever one matches.
Code
I have used two examples to show it works.
import re

line1 = 'The George Washington University, Washington, DC, USA.'
line2 = 'Department of Pathology, University of Oklahoma Health Sciences Center, Oklahoma City, USA. adekunle-adesina@ouhsc.edu'
matchlist = ['Hospital', 'University', 'Institute', 'School', 'Academy']  # define all keywords that you need to look up
p = re.compile(r'^(.*?),\s+(.*?),(.*?)\.')  # regex pattern to match
# We use a list comprehension with the 'any' function to check whether any item in matchlist can be found in group1; if not, we fall back to group2
line1match = [m.group(1) if any(x in m.group(1) for x in matchlist) else m.group(2) for m in re.finditer(p, line1)]
line2match = [m.group(1) if any(x in m.group(1) for x in matchlist) else m.group(2) for m in re.finditer(p, line2)]
print (line1match)
[Out]: ['The George Washington University']
print (line2match)
[Out]: ['University of Oklahoma Health Sciences Center']
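The same logic can be wrapped in a small helper for use on many lines. This is only a sketch under the assumptions of this answer (comma-separated affiliation strings ending in a period); the names find_institution, MATCHLIST and PATTERN are hypothetical. Unlike the list comprehension above, it checks group 2 against the keywords as well and returns None when neither group matches, rather than falling back to group 2 unconditionally.

```python
import re

MATCHLIST = ['Hospital', 'University', 'Institute', 'School', 'Academy']
PATTERN = re.compile(r'^(.*?),\s+(.*?),(.*?)\.')


def find_institution(line):
    """Return group 1 or group 2 (in that order) if it contains a keyword, else None."""
    m = PATTERN.search(line)
    if m is None:
        return None  # line does not have the comma-separated shape
    for candidate in (m.group(1), m.group(2)):
        if any(keyword in candidate for keyword in MATCHLIST):
            return candidate.strip()
    return None  # neither group contains a keyword


print(find_institution('The George Washington University, Washington, DC, USA.'))
print(find_institution('should we use regex more often, University of Pennsylvania. let me know at 321dsasdsa@dasdsa.com.lol'))
```

The second call returns None because the question's test string contains only one comma, so the pattern never matches it; that is the mismatch with the pastebin examples noted at the start of this answer.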