How To Extract University/school/college Name From String In Python Using Regular Expression?

Sample code:

import re
line = 'should we use regex more often, University of Pennsylvania. let me know at 321dsasdsa@dasdsa.com.lol'
match = re.search(r'/([A-Z][^\s,.]+[.]?\s[(]?)

Solution 1:

Instead of re.search, try re.sub to print your expected output:

import re
i = "should we use regex more often, University of Pennsylvania. let me know at  321dsasdsa@dasdsa.com.lol"
line = re.sub(r"[\w\W]* ((Hospital|University|Centre|Law School|School|Academy|Department)[\w -]*)[\w\W]*$", r"\1", i)
print(line)
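
One thing to keep in mind with the re.sub call above: when the pattern does not match, re.sub returns the input string unchanged, and because the pattern requires a space before the keyword, a name at the very start of the string is not matched. If you would rather get None in the no-match case, a re.search variant is one option. The following is only a minimal sketch: the helper name find_institution is hypothetical, the keyword list is copied from the pattern above, and unlike the greedy re.sub pattern it returns the first keyword-led phrase rather than the last.

```python
import re

# Keywords copied from the re.sub pattern above.
KEYWORDS = r"Hospital|University|Centre|Law School|School|Academy|Department"

# Hypothetical helper (not part of the original answer).
def find_institution(text):
    """Return the first phrase that starts with a keyword, or None if there is none."""
    match = re.search(r"\b(?:" + KEYWORDS + r")\b[\w -]*", text)
    return match.group(0).strip() if match else None

i = "should we use regex more often, University of Pennsylvania. let me know at  321dsasdsa@dasdsa.com.lol"
print(find_institution(i))                      # University of Pennsylvania
print(find_institution("no institution here"))  # None
```

Like the re.sub version, this only captures text from the keyword onwards, so a name such as 'The George Washington University' would lose its leading words.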

Solution 2:

The test string you've given looks made up, since the university name is immediately followed by a full stop ('.'), while the other examples in your pastebin sample are not (they are followed by a comma).

line = 'should we use regex more often, University of Pennsylvania. let me know at 321dsasdsa@dasdsa.com.lol'

I managed to extract the names from the examples in your pastebin using a simple regex; you can see the details here: regex101.com

Logic

Since the institute name is delimited by commas (except in the first case, where the line starts with the university name), the matching string will lie in either group 1 or group 2.

Then you can check group 1 and group 2 to see whether either contains anything from the predefined match list, and return that value.

Code

I have used two examples to show it works.

line1 = 'The George Washington University, Washington, DC, USA.'
line2 = 'Department of Pathology, University of Oklahoma Health Sciences Center, Oklahoma City, USA. adekunle-adesina@ouhsc.edu'

import re

matchlist = ['Hospital', 'University', 'Institute', 'School', 'Academy']  # all the keywords you need to look up
p = re.compile(r'^(.*?),\s+(.*?),(.*?)\.')  # regex pattern to match
# Use a list comprehension with any() to check whether any item in matchlist
# appears in group 1; if so keep group 1, otherwise fall back to group 2.
line1match = [m.group(1) if any(x in m.group(1) for x in matchlist) else m.group(2) for m in re.finditer(p, line1)]
line2match = [m.group(1) if any(x in m.group(1) for x in matchlist) else m.group(2) for m in re.finditer(p, line2)]

print (line1match)
[Out]: ['The George Washington University']

print (line2match)
[Out]: ['University of Oklahoma Health Sciences Center']
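
Note that the list comprehension above falls back to group 2 without checking it against the match list. If you want both groups checked, as described under Logic, something like the following might work. It is a minimal sketch only: the helper name extract_institution and the choice to return None when neither group contains a keyword are my assumptions, not part of the original answer.

```python
import re

MATCHLIST = ['Hospital', 'University', 'Institute', 'School', 'Academy']
PATTERN = re.compile(r'^(.*?),\s+(.*?),(.*?)\.')

# Hypothetical helper; the name and the None fallback are assumptions.
def extract_institution(line):
    """Return group 1 or group 2 if either contains a keyword, else None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    # Check group 1 first, then group 2, against the keyword list.
    for candidate in (m.group(1), m.group(2)):
        if any(keyword in candidate for keyword in MATCHLIST):
            return candidate
    return None

line1 = 'The George Washington University, Washington, DC, USA.'
line2 = 'Department of Pathology, University of Oklahoma Health Sciences Center, Oklahoma City, USA. adekunle-adesina@ouhsc.edu'

print(extract_institution(line1))  # The George Washington University
print(extract_institution(line2))  # University of Oklahoma Health Sciences Center
print(extract_institution('Some Street, Some City, USA.'))  # None
```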
