Python Tagger speed


I'm trying to extract firm names from sentences (millions of sentences).

For example, I have a bunch of pairs of firm name and sentence, as below (in Excel files):

            column 1     column 2
    row 1   firm a       sentence 1
    row 2   firm b       sentence 2
    row 3   firm c       sentence 3

Examples of the sentences are below:

  1. Verizon accounted for 12.6%, 17.8%, and 17.9% of our net sales in fiscal 2010, 2009, and 2008, respectively.

  2. SBC Communications, Inc., accounted for 11.2% of our sales in fiscal 2002.

  3. In fiscal 2006, AT&T, BellSouth, and Cingular (who combined in a merger) collectively represented approximately 14.9% of our net sales.

  4. Sales to Krone customers represented 21.8% of our net sales in fiscal 2004.

I want to extract "Verizon" from example 1, "SBC Communications, Inc." from example 2, "AT&T", "BellSouth", and "Cingular" from example 3, and "Krone" from example 4.

(Ideally, I would also extract the year and the percentage of sales accounted for by each firm. That would be best!)
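For the year and percentage part, a plain regular expression may be enough, without any tagger. The following is a minimal sketch, assuming that percentages are written like `12.6%` and fiscal years as four-digit numbers starting with 19 or 20. The function name `extract_year_percent_pairs` is hypothetical, and pairing the values in order of appearance is an assumption that only fits sentences of the "X%, Y% ... in fiscal A, B, respectively" kind.

```python
import re

# Assumption: percentages look like "12.6%" or "19 %".
PERCENT_RE = re.compile(r"\d+(?:\.\d+)?\s*%")
# Assumption: years are four digits starting with 19 or 20.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def extract_year_percent_pairs(sentence):
    """Return [(year, percent), ...] paired in order of appearance.

    Returns None when the counts differ, because the pairing
    would then be a guess.
    """
    percents = [float(p.rstrip("%").strip()) for p in PERCENT_RE.findall(sentence)]
    years = [int(y) for y in YEAR_RE.findall(sentence)]
    if percents and len(percents) == len(years):
        return list(zip(years, percents))
    return None
```

Sentences where the number of years and percentages differ come back as `None`, so they can be set aside for manual review instead of being paired incorrectly.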

However, there are many variations in the sentences.

Some of them contain region names:

  1. Our EMEA region (Europe, Middle East, and Africa) accounted for the largest percentage of sales outside of North America, and represented 20.6%, 19.0%, and 22.6% of our net sales in fiscal 2008, 2007, and 2006, respectively.

And some of them do not contain a proper noun at all:

  1. We estimate that products obtained from outsourced manufacturers accounted for approximately 19% of our net sales broadband

To achieve this goal, I'm using StanfordNERTagger (through NLTK) and extracting the words tagged "ORGANIZATION".

However, the recall and precision of the tagger are not quite satisfactory.

And the biggest problem is that it takes a very long time.

I think this is because StanfordNERTagger is loaded through Java, so I pay the startup cost every time I use it to analyze a single sentence.

So, my questions are:

Q1. Is there a better way to achieve this goal?

(I also tried the NLTK POS tagger; however, that tagger does not provide the exact type of an NNP (for example, organization or person), and its precision is worse.)

Q2. Is there a way to analyze all the sentences at once?

(I considered joining all the sentences into one string; however, the division between sentences (and with it the information about the extracted firm names) is then lost, so I cannot link the extracted data to the firm's name in column 1.)
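One way to batch the work without losing the sentence boundaries is to pass a list of token lists to the tagger in a single call and zip the results back to the rows. The sketch below assumes a batch tagger that takes a list of token lists and returns one list of `(word, tag)` pairs per sentence, in the same order. To my knowledge NLTK's `StanfordNERTagger` offers this as `tag_sents`, which starts Java once for the whole batch, but the sketch takes the tagger as a parameter so it does not depend on that. The names `group_organizations` and `extract_firms` are hypothetical.

```python
from itertools import groupby


def group_organizations(tagged_tokens):
    """Join each run of consecutive ORGANIZATION tokens into one name."""
    names = []
    runs = groupby(tagged_tokens, key=lambda pair: pair[1] == "ORGANIZATION")
    for is_org, run in runs:
        if is_org:
            name = " ".join(word for word, _ in run)
            names.append(name.replace(" ,", ","))  # re-attach commas
    return names


def extract_firms(rows, tag_sents, tokenize=str.split):
    """Tag every sentence in one batch call, keeping the link to column 1.

    rows      -- list of (firm, sentence) pairs, e.g. read from the Excel file
    tag_sents -- batch tagger: list of token lists -> list of tagged lists
    tokenize  -- tokenizer; whitespace split is only a placeholder
    """
    token_lists = [tokenize(sentence) for _, sentence in rows]
    tagged_lists = tag_sents(token_lists)  # single call for all sentences
    return [
        (firm, group_organizations(tagged))
        for (firm, _), tagged in zip(rows, tagged_lists)
    ]
```

With NLTK this would presumably be called as `extract_firms(rows, st.tag_sents, nltk.word_tokenize)`, where `st` is the `StanfordNERTagger` instance. For millions of sentences it is probably safer to feed the rows in chunks of a few thousand rather than in one call.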

Thanks for reading this long, stupid question! Thanks!!

My code is below:

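    import os
    import nltk
    from nltk.tag import StanfordNERTagger

    # Tell NLTK where the Java executable is.
    java_path = "c:/program files (x86)/java/jdk1.8.0_131/bin/java.exe"
    os.environ['JAVAHOME'] = java_path

    st = StanfordNERTagger('c:/python/stanford-ner-2017-06-09/classifiers/english.all.3class.distsim.crf.ser.gz',
                           'c:/python/stanford-ner-2017-06-09/stanford-ner.jar')

    # `sentence` is one sentence string from column 2.
    data2 = nltk.word_tokenize(sentence)
    tags = st.tag(data2)  # starts a Java process on every call

    # Chunk consecutive ORGANIZATION tokens and convert to IOB triples.
    cp = nltk.RegexpParser('ORGANIZATION: {<ORGANIZATION>+}')
    tree = cp.parse(tags)
    iob = nltk.chunk.tree2conlltags(tree)

    company = []
    a = ''  # organization name currently being assembled
    for (word, chunk, iob_tag) in iob:
        if iob_tag == "B-ORGANIZATION":
            if a:
                company.append(a)  # flush the previous name
            a = word
        elif iob_tag == "I-ORGANIZATION":
            if word == ',':
                a = a + word
            else:
                a = a + " " + word
        else:
            if a:
                company.append(a)  # name ended; store it once
                a = ''
    if a:
        company.append(a)  # name that runs to the end of the sentence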

