python - Optimizing code to graph word counts


I finished a program that reads two text books and graphs their word counts, with the x-axis being the count of a word in one book and the y-axis being the count of the same word in the second book. It works, but it's surprisingly slow, and I'm hoping for tips on how to optimize it. I think my biggest concern is creating the dictionary of words common to both books and the dictionary of words in one book but not the other. This implementation added a lot of runtime to the program, and I'd like to find a Pythonic way to improve it. Below is the code:

import re  # regular expressions
import io
import collections
from matplotlib import pyplot as plt

# xs=[x1,x2,...,xn] # number of occurrences of a word in book 1
# ys=[y1,y2,...,yn] # number of occurrences of the same word in book 2
# plt.plot(xs,ys)
# save to svg or pdf files

word_pattern = re.compile(r'\w+')

# this version ensures closing of the file if there are failures
with io.open("swannsway.txt") as f:
    text = f.read()  # read into a single large string
    book1 = word_pattern.findall(text)  # pull out the words
    book1 = [w.lower() for w in book1 if len(w) >= 3]

with io.open("moby_dick.txt") as f:
    text = f.read()  # read into a single large string
    book2 = word_pattern.findall(text)  # pull out the words
    book2 = [w.lower() for w in book2 if len(w) >= 3]

# convert these to relative percentages / total book length

wordcount_book1 = {}
for word in book1:
    if word in wordcount_book1:
        wordcount_book1[word] += 1
    else:
        wordcount_book1[word] = 1

'''
for word in wordcount_book1:
    wordcount_book1[word] /= len(wordcount_book1)

for word in wordcount_book2:
    wordcount_book2[word] /= len(wordcount_book2)
'''

wordcount_book2 = {}
for word in book2:
    if word in wordcount_book2:
        wordcount_book2[word] += 1
    else:
        wordcount_book2[word] = 1

common_words = {}
for i in wordcount_book1:
    for j in wordcount_book2:
        if i == j:
            common_words[i] = [wordcount_book1[i], wordcount_book2[j]]
            break

book_singles = {}
for i in wordcount_book1:
    if i not in common_words:
        book_singles[i] = [wordcount_book1[i], 0]
for i in wordcount_book2:
    if i not in common_words:
        book_singles[i] = [0, wordcount_book2[i]]

wordcount_book1 = collections.Counter(book1)
wordcount_book2 = collections.Counter(book2)

# how many words of different lengths?
word_length_book1 = collections.Counter([len(word) for word in book1])
word_length_book2 = collections.Counter([len(word) for word in book2])

print(wordcount_book1)

#plt.plot(list(word_length_book1.keys()), list(word_length_book1.values()),
#         list(word_length_book2.keys()), list(word_length_book2.values()), 'bo')
for i in range(len(common_words)):
    plt.plot(list(common_words.values())[i][0], list(common_words.values())[i][1], 'bo', alpha=0.2)
for i in range(len(book_singles)):
    plt.plot(list(book_singles.values())[i][0], list(book_singles.values())[i][1], 'ro', alpha=0.2)
plt.ylabel('swannsway')
plt.xlabel('moby dick')
plt.show()

The bulk of the code had minor inefficiencies that I've tried to address. The largest delay was in plotting book_singles, which I believe I've fixed. Details: I switched this:

word_pattern = re.compile(r'\w+') 

to:

word_pattern = re.compile(r'[a-zA-Z]{3,}')

as book_singles is large enough without including numbers too! By building the minimum size into the pattern, I eliminate the need for this loop:

book1 = [w.lower() for w in book1 if len(w) >= 3]

and the matching one for book2. Here:

book1 = word_pattern.findall(text)  # pull out the words
book1 = [w.lower() for w in book1 if len(w) >= 3]

I moved the .lower() so it is called once on the whole text, rather than on every word:

book1 = word_pattern.findall(text.lower())  # pull out the words
book1 = [w for w in book1 if len(w) >= 3]
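Both orderings produce identical word lists; a quick equivalence check, using an inline sample string in place of the book files:

```python
import re

word_pattern = re.compile(r'\w+')
text = "Call me Ishmael Some years ago"

# Original: lowercase each word individually.
v1 = [w.lower() for w in word_pattern.findall(text) if len(w) >= 3]

# Reworked: lowercase the whole string once, then extract words.
v2 = [w for w in word_pattern.findall(text.lower()) if len(w) >= 3]
```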

Since str.lower() is implemented in C, that can be a win. This:

wordcount_book1 = {}
for word in book1:
    if word in wordcount_book1:
        wordcount_book1[word] += 1
    else:
        wordcount_book1[word] = 1

I switched to use a defaultdict, since we have collections imported already:

wordcount_book1 = collections.defaultdict(int)
for word in book1:
    wordcount_book1[word] += 1
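collections.Counter, which the original code already imports and uses for the word-length counts, collapses this loop into a single call; a minimal sketch with a made-up word list:

```python
import collections

# Toy word list standing in for the parsed book.
book1 = ["whale", "sea", "whale", "ship"]

# One C-implemented pass replaces the explicit counting loop.
wordcount_book1 = collections.Counter(book1)
```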

For these loops:

common_words = {}
for i in wordcount_book1:
    for j in wordcount_book2:
        if i == j:
            common_words[i] = [wordcount_book1[i], wordcount_book2[j]]
            break

book_singles = {}
for i in wordcount_book1:
    if i not in common_words:
        book_singles[i] = [wordcount_book1[i], 0]
for i in wordcount_book2:
    if i not in common_words:
        book_singles[i] = [0, wordcount_book2[i]]

I rewrote the first loop, which was a disaster, and made it do double duty, since it had done most of the work of the second loop already:

common_words = {}
book_singles = {}

for i in wordcount_book1:
    if i in wordcount_book2:
        common_words[i] = [wordcount_book1[i], wordcount_book2[i]]
    else:
        book_singles[i] = [wordcount_book1[i], 0]

for i in wordcount_book2:
    if i not in common_words:
        book_singles[i] = [0, wordcount_book2[i]]
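The same split can also be written with dictionary-key set operations, which some may find clearer; a sketch with made-up counts:

```python
# Toy word counts standing in for the two books.
wordcount_book1 = {"whale": 5, "sea": 2}
wordcount_book2 = {"whale": 3, "harpoon": 1}

# dict.keys() views support set operations directly.
shared = wordcount_book1.keys() & wordcount_book2.keys()

common_words = {w: [wordcount_book1[w], wordcount_book2[w]] for w in shared}
book_singles = {w: [wordcount_book1[w], 0] for w in wordcount_book1.keys() - shared}
book_singles.update({w: [0, wordcount_book2[w]] for w in wordcount_book2.keys() - shared})
```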

Finally, these plotting loops were horribly inefficient, both in the way they walk common_words.values() and book_singles.values() over and over again, and in the way they plot one point at a time:

for i in range(len(common_words)):
    plt.plot(list(common_words.values())[i][0], list(common_words.values())[i][1], 'bo', alpha=0.2)
for i in range(len(book_singles)):
    plt.plot(list(book_singles.values())[i][0], list(book_singles.values())[i][1], 'ro', alpha=0.2)

I changed them to simply:

counts1, counts2 = zip(*common_words.values())
plt.plot(counts1, counts2, 'bo', alpha=0.2)

counts1, counts2 = zip(*book_singles.values())
plt.plot(counts1, counts2, 'ro', alpha=0.2)
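Here zip(*values) transposes the list of [x, y] pairs into one tuple of x-coordinates and one tuple of y-coordinates, so matplotlib receives all points in a single call; a toy example:

```python
# Each pair is [count_in_book1, count_in_book2] for one word.
pairs = [[5, 3], [2, 7], [1, 1]]

# Unpacking with * transposes: all first elements, then all second elements.
xs, ys = zip(*pairs)
```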

The complete reworked code, which leaves out things that were calculated but never used in this example:

import re  # regular expressions
import collections
from matplotlib import pyplot as plt

# xs=[x1,x2,...,xn] # number of occurrences of a word in book 1
# ys=[y1,y2,...,yn] # number of occurrences of the same word in book 2
# plt.plot(xs,ys)
# save to svg or pdf files

word_pattern = re.compile(r'[a-zA-Z]{3,}')

# "with" ensures closing of the file if there are failures
with open("swannsway.txt") as f:
    text = f.read()  # read into a single large string
    book1 = word_pattern.findall(text.lower())  # pull out the words

with open("moby_dick.txt") as f:
    text = f.read()  # read into a single large string
    book2 = word_pattern.findall(text.lower())  # pull out the words

# convert these to relative percentages / total book length

wordcount_book1 = collections.defaultdict(int)
for word in book1:
    wordcount_book1[word] += 1

wordcount_book2 = collections.defaultdict(int)
for word in book2:
    wordcount_book2[word] += 1

common_words = {}
book_singles = {}

for i in wordcount_book1:
    if i in wordcount_book2:
        common_words[i] = [wordcount_book1[i], wordcount_book2[i]]
    else:
        book_singles[i] = [wordcount_book1[i], 0]

for i in wordcount_book2:
    if i not in common_words:
        book_singles[i] = [0, wordcount_book2[i]]

counts1, counts2 = zip(*common_words.values())
plt.plot(counts1, counts2, 'bo', alpha=0.2)

counts1, counts2 = zip(*book_singles.values())
plt.plot(counts1, counts2, 'ro', alpha=0.2)

plt.xlabel('moby dick')
plt.ylabel('swannsway')
plt.show()

Output

[scatter plot: blue points for words shared by both books, red points for words appearing in only one]

You might also eliminate stop words to cut down on the high-scoring words and bring out more interesting data.
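A minimal sketch of that idea, assuming a small hand-picked stop-word set (a real list, such as NLTK's, would be much larger):

```python
# Illustrative subset only; extend with a real stop-word list in practice.
STOP_WORDS = {"the", "and", "that", "with", "was", "his"}

def filter_stop_words(words):
    # Drop very common function words so content words dominate the plot.
    return [w for w in words if w not in STOP_WORDS]

book1 = ["the", "whale", "and", "the", "sea"]
filtered = filter_stop_words(book1)
```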

