Python - Positions of substrings in a string
I need to know the positions of a word in a text, i.e. of a substring in a string. My solution so far uses a regex, but I am not sure whether there is a better way, perhaps a built-in standard library strategy. Any ideas?
    import re

    text = "the quick brown fox jumps over the lazy dog. fox. redfox."
    links = {'fox': [], 'dog': []}
    # Raw string so that \w and \- reach the regex engine unchanged
    re_capture = r"(^|[^\w\-/])(%s)([^\w\-/]|$)" % "|".join(links.keys())
    iterator = re.finditer(re_capture, text)
    if iterator:
        for match in iterator:
            # Fix the position using the captured context,
            # e.g. groups are (' ', 'fox', ' ')
            m_groups = match.groups()
            start, end = match.span()
            start = start + len(m_groups[0])
            end = end - len(m_groups[2])
            key = m_groups[1]
            links[key].append((start, end))
    print(links)

    {'fox': [(16, 19), (45, 48)], 'dog': [(40, 43)]}
Edit: partial words must not match. Note that the fox in redfox is not in links.
Thanks.
If you want to match actual words and your strings contain only ASCII:
    from string import punctuation

    text = "fox quick brown fox jumps on fox! lazy dog. fox!."
    links = {'fox': [], 'dog': []}

    def yield_words(s, d):
        i = 0  # offset of the current token in s
        for ele in s.split(" "):
            tot = len(ele) + 1  # token length plus the separating space
            ele = ele.rstrip(punctuation)  # drop trailing punctuation only
            ln = len(ele)
            if ele in d:
                d[ele].append((i, ln + i))
            i += tot
        return d
Unlike the find solution, this won't match partial words, and it runs in O(n) time:
    In [2]: text = "the quick brown fox jumps over the lazy dog. fox. redfox."
    In [3]: links = {'fox': [], 'dog': []}
    In [4]: yield_words(text, links)
    Out[4]: {'dog': [(40, 43)], 'fox': [(16, 19), (45, 48)]}
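For contrast, here is a minimal sketch of what a plain `str.find` scan might look like. The find-based answer itself is not shown here, so the function name `find_spans` and its shape are assumptions made for illustration. It shows why such a scan also reports the fox inside redfox:

```python
def find_spans(text, words):
    """Return {word: [(start, end), ...]} for every substring occurrence.

    Hypothetical helper: it matches substrings, not whole words.
    """
    spans = {w: [] for w in words}
    for w in words:
        pos = text.find(w)
        while pos != -1:
            spans[w].append((pos, pos + len(w)))
            # Continue searching just past the start of the last hit
            pos = text.find(w, pos + 1)
    return spans


text = "the quick brown fox jumps over the lazy dog. fox. redfox."
result = find_spans(text, ["fox", "dog"])
print(result)
# {'fox': [(16, 19), (45, 48), (53, 56)], 'dog': [(40, 43)]}
```

The extra `(53, 56)` entry is the partial match inside redfox, which the split-based function avoids by comparing whole tokens.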
This is one case where a regex is a reasonable approach, and it can be simpler:
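    import re

    def reg_iter(s, d):
        # \b anchors each key at word boundaries, so partial words are skipped
        r = re.compile("|".join([r"\b{}\b".format(w) for w in d]))
        for match in r.finditer(s):
            # Append to the dict that was passed in, not to a global
            d[match.group()].append((match.start(), match.end()))
        return d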
Output:
    In [6]: links = {'fox': [], 'dog': []}
    In [7]: text = "the quick brown fox jumps over the lazy dog. fox. redfox."
    In [8]: reg_iter(text, links)
    Out[8]: {'dog': [(40, 43)], 'fox': [(16, 19), (45, 48)]}
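One caveat with joining the keys straight into a pattern is that a key containing regex metacharacters would be read as pattern syntax. The sketch below is one possible variant, not part of the answer above: the name `word_spans` is hypothetical, it escapes each key with `re.escape`, and it swaps `\b` for the lookarounds `(?<!\w)` and `(?!\w)` so that keys ending in punctuation still work.

```python
import re


def word_spans(text, words):
    """Return {word: [(start, end), ...]} for whole-word occurrences only."""
    # Escape keys so characters like '.' are matched literally; try longer
    # keys first so a key is not shadowed by a shorter prefix of itself.
    ordered = sorted(words, key=len, reverse=True)
    alternatives = "|".join(re.escape(w) for w in ordered)
    # Lookarounds reject matches that touch a word character on either side
    pattern = re.compile(r"(?<!\w)(?:%s)(?!\w)" % alternatives)
    spans = {w: [] for w in words}
    for m in pattern.finditer(text):
        spans[m.group()].append(m.span())
    return spans


text = "the quick brown fox jumps over the lazy dog. fox. redfox."
spans = word_spans(text, ["fox", "dog"])
print(spans)
# {'fox': [(16, 19), (45, 48)], 'dog': [(40, 43)]}

# Without escaping, the key 'a.b' would also match 'axb'
dotted = word_spans("axb a.b", ["a.b"])
print(dotted)
# {'a.b': [(4, 7)]}
```

For plain alphabetic keys such as fox and dog this behaves the same as the `\b` version, so the extra escaping only matters when keys may contain punctuation.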