python - Optimizing code to graph word counts
I finished a program that reads two text books and graphs word counts, with the x-axis being the count of a word in one book and the y-axis being the count of that word in the second book. It works, but it's surprisingly slow, and I'm hoping for tips on how to optimize it. I think my biggest concern is creating a dictionary of words the two books share and a dictionary of words that appear in one book but not the other. This implementation added a lot of runtime to the program, and I'd like to find a pythonic way to improve this. Below is the code:
```python
import re  # regular expressions
import io
import collections
from matplotlib import pyplot as plt

# xs=[x1,x2,...,xn]  # number of occurrences of a word in book 1
# ys=[y1,y2,...,yn]  # number of occurrences of the word in book 2
# use plt.plot(xs,ys)
# can save as svg or pdf files

word_pattern = re.compile(r'\w+')

# this version ensures closing of the file if there are failures
with io.open("swannsway.txt") as f:
    text = f.read()  # read as a single large string
book1 = word_pattern.findall(text)  # pull out the words
book1 = [w.lower() for w in book1 if len(w) >= 3]

with io.open("moby_dick.txt") as f:
    text = f.read()  # read as a single large string
book2 = word_pattern.findall(text)  # pull out the words
book2 = [w.lower() for w in book2 if len(w) >= 3]

# convert these into relative percentages/total book length
wordcount_book1 = {}
for word in book1:
    if word in wordcount_book1:
        wordcount_book1[word] += 1
    else:
        wordcount_book1[word] = 1

'''
for word in wordcount_book1:
    wordcount_book1[word] /= len(wordcount_book1)

for word in wordcount_book2:
    wordcount_book2[word] /= len(wordcount_book2)
'''

wordcount_book2 = {}
for word in book2:
    if word in wordcount_book2:
        wordcount_book2[word] += 1
    else:
        wordcount_book2[word] = 1

common_words = {}
for i in wordcount_book1:
    for j in wordcount_book2:
        if i == j:
            common_words[i] = [wordcount_book1[i], wordcount_book2[j]]
            break

book_singles = {}
for i in wordcount_book1:
    if i not in common_words:
        book_singles[i] = [wordcount_book1[i], 0]
for i in wordcount_book2:
    if i not in common_words:
        book_singles[i] = [0, wordcount_book2[i]]

wordcount_book1 = collections.Counter(book1)
wordcount_book2 = collections.Counter(book2)

# how many words of different lengths?
word_length_book1 = collections.Counter([len(word) for word in book1])
word_length_book2 = collections.Counter([len(word) for word in book2])

print(wordcount_book1)

#plt.plot(list(word_length_book1.keys()), list(word_length_book1.values()),
#         list(word_length_book2.keys()), list(word_length_book2.values()), 'bo')

for i in range(len(common_words)):
    plt.plot(list(common_words.values())[i][0], list(common_words.values())[i][1], 'bo', alpha=0.2)
for i in range(len(book_singles)):
    plt.plot(list(book_singles.values())[i][0], list(book_singles.values())[i][1], 'ro', alpha=0.2)

plt.ylabel('swannsway')
plt.xlabel('moby dick')
plt.show()

# key:value
```
The bulk of your code had minor inefficiencies that I've tried to address. The largest delay was in plotting `book_singles`, which I believe I've fixed. Details: I switched this:
```python
word_pattern = re.compile(r'\w+')
```
to:
```python
word_pattern = re.compile(r'[a-zA-Z]{3,}')
```
since `book_singles` is large enough without including numbers too! By including the minimum size in the pattern, I eliminate the need for this loop:
```python
book1 = [w.lower() for w in book1 if len(w) >= 3]
```
and the matching one for `book2`. Here:
```python
book1 = word_pattern.findall(text)  # pull out the words
book1 = [w.lower() for w in book1 if len(w) >= 3]
```
I moved the `.lower()` call so it's applied once to the entire text, rather than to every word:
```python
book1 = word_pattern.findall(text.lower())  # pull out the words
book1 = [w for w in book1 if len(w) >= 3]
```
Since it's implemented in C, this can be a win. This:
```python
wordcount_book1 = {}
for word in book1:
    if word in wordcount_book1:
        wordcount_book1[word] += 1
    else:
        wordcount_book1[word] = 1
```
I switched to use a `defaultdict`, since we have `collections` imported already:
```python
wordcount_book1 = collections.defaultdict(int)
for word in book1:
    wordcount_book1[word] += 1
```
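As an aside: since `collections` is already imported, `collections.Counter` would collapse the whole counting loop into a single call. A quick sketch on toy data (the word list here is a made-up stand-in for `book1`):

```python
import collections

# toy stand-in for the book1 word list
words = ["whale", "sea", "whale", "ship", "sea", "whale"]

# Counter does the tallying loop in one call
wordcount = collections.Counter(words)
print(wordcount.most_common(2))  # [('whale', 3), ('sea', 2)]
```

A `Counter` is a `dict` subclass, so the rest of the code works unchanged, and looking up a missing key returns 0 instead of raising `KeyError`.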
For these loops:
```python
common_words = {}
for i in wordcount_book1:
    for j in wordcount_book2:
        if i == j:
            common_words[i] = [wordcount_book1[i], wordcount_book2[j]]
            break

book_singles = {}
for i in wordcount_book1:
    if i not in common_words:
        book_singles[i] = [wordcount_book1[i], 0]
for i in wordcount_book2:
    if i not in common_words:
        book_singles[i] = [0, wordcount_book2[i]]
```
I rewrote the first loop, which was a disaster, and made it do double duty, since it had already done the work of the second loop:
```python
common_words = {}
book_singles = {}
for i in wordcount_book1:
    if i in wordcount_book2:
        common_words[i] = [wordcount_book1[i], wordcount_book2[i]]
    else:
        book_singles[i] = [wordcount_book1[i], 0]
for i in wordcount_book2:
    if i not in common_words:
        book_singles[i] = [0, wordcount_book2[i]]
```
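For what it's worth, dictionary key views support set operations, so the same split can be written without explicit loops. A sketch on hypothetical toy counts standing in for the two word-count dicts:

```python
# hypothetical toy counts standing in for wordcount_book1 / wordcount_book2
wc1 = {"whale": 3, "sea": 2, "ahab": 1}
wc2 = {"whale": 1, "sea": 4, "swann": 2}

# dict key views support & (intersection) and - (difference)
common_words = {w: [wc1[w], wc2[w]] for w in wc1.keys() & wc2.keys()}
book_singles = {w: [wc1[w], 0] for w in wc1.keys() - wc2.keys()}
book_singles.update({w: [0, wc2[w]] for w in wc2.keys() - wc1.keys()})

print(common_words == {"whale": [3, 1], "sea": [2, 4]})   # True
print(book_singles == {"ahab": [1, 0], "swann": [0, 2]})  # True
```

The set operations return unordered sets, so the insertion order of the result dicts may vary, but the contents are the same as the loop version's.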
Finally, these plotting loops were horribly inefficient, both in the way they walk `common_words.values()` and `book_singles.values()` over and over again, and in the way they plot one point at a time:
```python
for i in range(len(common_words)):
    plt.plot(list(common_words.values())[i][0], list(common_words.values())[i][1], 'bo', alpha=0.2)

for i in range(len(book_singles)):
    plt.plot(list(book_singles.values())[i][0], list(book_singles.values())[i][1], 'ro', alpha=0.2)
```
I changed them to simply:
```python
counts1, counts2 = zip(*common_words.values())
plt.plot(counts1, counts2, 'bo', alpha=0.2)

counts1, counts2 = zip(*book_singles.values())
plt.plot(counts1, counts2, 'ro', alpha=0.2)
```
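The `zip(*...)` idiom transposes a sequence of `[x, y]` pairs into one x-sequence and one y-sequence, which `plt.plot` can take all at once. A quick sketch on toy values:

```python
# toy stand-in for common_words: word -> [count_in_book1, count_in_book2]
common_words = {"whale": [3, 1], "sea": [2, 4], "ahab": [1, 5]}

# zip(*pairs) transposes [[3,1],[2,4],[1,5]] into (3,2,1) and (1,4,5)
counts1, counts2 = zip(*common_words.values())
print(counts1)  # (3, 2, 1)
print(counts2)  # (1, 4, 5)
```

One caveat: if the dict is empty, `zip(*...)` yields nothing and the two-name unpacking raises `ValueError`, so this assumes both dicts are non-empty (which they are for real books).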
The complete reworked code, leaving out the things you calculated but never used in your example:
```python
import re  # regular expressions
import collections
from matplotlib import pyplot as plt

# xs=[x1,x2,...,xn]  # number of occurrences of a word in book 1
# ys=[y1,y2,...,yn]  # number of occurrences of the word in book 2
# use plt.plot(xs,ys)
# can save as svg or pdf files

word_pattern = re.compile(r'[a-zA-Z]{3,}')

# ensures closing of the file if there are failures
with open("swannsway.txt") as f:
    text = f.read()  # read as a single large string
book1 = word_pattern.findall(text.lower())  # pull out the words

with open("moby_dick.txt") as f:
    text = f.read()  # read as a single large string
book2 = word_pattern.findall(text.lower())  # pull out the words

# convert these into relative percentages/total book length
wordcount_book1 = collections.defaultdict(int)
for word in book1:
    wordcount_book1[word] += 1

wordcount_book2 = collections.defaultdict(int)
for word in book2:
    wordcount_book2[word] += 1

common_words = {}
book_singles = {}
for i in wordcount_book1:
    if i in wordcount_book2:
        common_words[i] = [wordcount_book1[i], wordcount_book2[i]]
    else:
        book_singles[i] = [wordcount_book1[i], 0]
for i in wordcount_book2:
    if i not in common_words:
        book_singles[i] = [0, wordcount_book2[i]]

counts1, counts2 = zip(*common_words.values())
plt.plot(counts1, counts2, 'bo', alpha=0.2)

counts1, counts2 = zip(*book_singles.values())
plt.plot(counts1, counts2, 'ro', alpha=0.2)

plt.xlabel('moby dick')
plt.ylabel('swannsway')
plt.show()
```
Output: (scatter plot image omitted)
You might eliminate stop words to reduce the high-scoring words and bring out more interesting data.
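A minimal sketch of that idea, using a tiny hypothetical stop list (a real one, e.g. NLTK's English list, is much longer):

```python
# tiny hypothetical stop list; real lists (e.g. NLTK's) have 100+ entries
STOP_WORDS = {"the", "and", "that", "with", "was", "for", "not", "but"}

# toy stand-in for a book's word list
words = ["the", "whale", "and", "the", "sea", "was", "vast"]

# keep only words that are not in the stop list
filtered = [w for w in words if w not in STOP_WORDS]
print(filtered)  # ['whale', 'sea', 'vast']
```

Since `STOP_WORDS` is a set, each membership test is O(1), so this adds negligible cost to the existing list comprehensions.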