python - Positions of substrings in string


I need to know the positions of a word in a text, i.e. of a substring in a string. My solution so far uses a regex, but I am not sure whether there are better strategies, maybe something built into the standard library. Any ideas?

    import re

    text = "the quick brown fox jumps over the lazy dog. fox. redfox."
    links = {'fox': [], 'dog': []}
    # a word must be delimited by start/end of string or by a character
    # that is not a word character, '-' or '/'
    re_capture = r"(^|[^\w\-/])(%s)([^\w\-/]|$)" % "|".join(links.keys())

    iterator = re.finditer(re_capture, text)

    if iterator:
        for match in iterator:
            # fix the position by removing the captured context
            # e.g. groups are (' ', 'fox', ' ')
            m_groups = match.groups()
            start, end = match.span()
            start = start + len(m_groups[0])
            end = end - len(m_groups[2])

            key = m_groups[1]
            links[key].append((start, end))

    print(links)

    {'fox': [(16, 19), (45, 48)], 'dog': [(40, 43)]}

Edit: partial words are not allowed to match. Note that the fox of redfox is not in links.

Thanks.

If you want to match actual words, and your strings contain only ASCII:

    from string import punctuation

    text = "fox quick brown fox jumps on fox! lazy dog. fox!."
    links = {'fox': [], 'dog': []}

    def yield_words(s, d):
        i = 0  # offset of the current token in s
        for ele in s.split(" "):
            tot = len(ele) + 1  # token length plus the separating space
            ele = ele.rstrip(punctuation)  # drop trailing punctuation only
            ln = len(ele)
            if ele in d:
                d[ele].append((i, ln + i))
            i += tot
        return d

Unlike a solution based on find, this won't match partial words, and it runs in O(n) time:

    In [2]: text = "the quick brown fox jumps over the lazy dog. fox. redfox."

    In [3]: links = {'fox': [], 'dog': []}

    In [4]: yield_words(text, links)
    Out[4]: {'dog': [(40, 43)], 'fox': [(16, 19), (45, 48)]}
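For contrast, here is a minimal sketch of what a find-based scan could look like. The helper name `find_all` is made up for illustration and is not taken from any answer here. It shows why plain substring search also reports the fox inside redfox:

```python
def find_all(text, word):
    """Return (start, end) spans of every occurrence of word, partial or not."""
    spans = []
    start = text.find(word)
    while start != -1:
        spans.append((start, start + len(word)))
        # Resume the search one character past the start of the previous hit.
        start = text.find(word, start + 1)
    return spans


text = "the quick brown fox jumps over the lazy dog. fox. redfox."
print(find_all(text, "fox"))  # [(16, 19), (45, 48), (53, 56)]
```

The last span, (53, 56), is the partial match inside redfox, which is exactly what the question wants to exclude.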

This is one case for a regex approach, and it can be simpler:

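    import re

    def reg_iter(s, d):
        # one alternation of whole-word patterns, e.g. \bfox\b|\bdog\b
        r = re.compile("|".join([r"\b{}\b".format(w) for w in d]))
        for match in r.finditer(s):
            # append to the dict that was passed in, not to a global
            d[match.group()].append((match.start(), match.end()))
        return d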

Output:

    In [6]: links = {'fox': [], 'dog': []}

    In [7]: text = "the quick brown fox jumps over the lazy dog. fox. redfox."

    In [8]: reg_iter(text, links)
    Out[8]: {'dog': [(40, 43)], 'fox': [(16, 19), (45, 48)]}
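Note that `\b` treats `-` and `/` as word boundaries, while the pattern in the question does not. If that stricter rule matters, one possible variation is to use lookarounds, which check the neighbouring character without consuming it. This is only a sketch: the function name `word_spans` is hypothetical, and `re.escape` is added on the assumption that the keys might contain regex metacharacters.

```python
import re


def word_spans(text, words):
    """Map each word to the (start, end) spans where it occurs as a whole word."""
    words = list(words)
    # Characters counted as part of a word, copied from the question's pattern.
    inner = r"[\w\-/]"
    alternation = "|".join(re.escape(w) for w in words)
    # Lookbehind/lookahead assert the context but leave it out of the match,
    # so match.span() is already the span of the word itself.
    pattern = re.compile(r"(?<!{0})(?:{1})(?!{0})".format(inner, alternation))
    spans = {w: [] for w in words}
    for m in pattern.finditer(text):
        spans[m.group()].append(m.span())
    return spans


text = "the quick brown fox jumps over the lazy dog. fox. redfox."
print(word_spans(text, ["fox", "dog"]))
```

Because the context is not consumed, two hits separated by a single space (such as "fox fox") are both found, and no start/end correction is needed afterwards.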
