python - Issues in getting trigrams using Gensim -


I want to get bigrams and trigrams from the example sentences I have mentioned.

My code works fine for bigrams. However, it does not capture trigrams in the data (e.g., human computer interaction, which is mentioned in 5 places in my sentences).

Approach 1: below is my code using Phrases in Gensim.

    from gensim.models import Phrases

    documents = [
        "the mayor of new york was there",
        "human computer interaction and machine learning has now become a trending research area",
        "human computer interaction is interesting",
        "human computer interaction is a pretty interesting subject",
        "human computer interaction is a great and new subject",
        "machine learning can be useful sometimes",
        "new york mayor was present",
        "I love machine learning because it is a new subject area",
        "human computer interaction helps people to get user friendly applications",
    ]
    sentence_stream = [doc.split(" ") for doc in documents]

    # Learn bigrams, then learn trigrams on top of the bigram-merged sentences.
    bigram = Phrases(sentence_stream, min_count=1, threshold=1, delimiter=b' ')
    trigram = Phrases(bigram[sentence_stream])

    for sent in sentence_stream:
        bigrams_ = bigram[sent]
        trigrams_ = trigram[bigrams_]

        print(bigrams_)
        print(trigrams_)

Approach 2: I also tried to use both Phraser and Phrases, but it didn't work.

    from gensim.models import Phrases
    from gensim.models.phrases import Phraser

    documents = [
        "the mayor of new york was there",
        "human computer interaction and machine learning has now become a trending research area",
        "human computer interaction is interesting",
        "human computer interaction is a pretty interesting subject",
        "human computer interaction is a great and new subject",
        "machine learning can be useful sometimes",
        "new york mayor was present",
        "I love machine learning because it is a new subject area",
        "human computer interaction helps people to get user friendly applications",
    ]
    sentence_stream = [doc.split(" ") for doc in documents]

    # Wrap the bigram model in a Phraser, then train the trigram model on its output.
    bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')
    bigram_phraser = Phraser(bigram)
    trigram = Phrases(bigram_phraser[sentence_stream])

    for sent in sentence_stream:
        bigrams_ = bigram_phraser[sent]
        trigrams_ = trigram[bigrams_]

        print(bigrams_)
        print(trigrams_)

Please help me fix this issue of getting trigrams.

I am following the example in the Gensim documentation.

I was able to get bigrams and trigrams with a few modifications to your code:

    from gensim.models import Phrases

    documents = [
        "the mayor of new york was there",
        "human computer interaction and machine learning has now become a trending research area",
        "human computer interaction is interesting",
        "human computer interaction is a pretty interesting subject",
        "human computer interaction is a great and new subject",
        "machine learning can be useful sometimes",
        "new york mayor was present",
        "I love machine learning because it is a new subject area",
        "human computer interaction helps people to get user friendly applications",
    ]
    sentence_stream = [doc.split(" ") for doc in documents]

    # No explicit threshold for the bigrams; min_count=1 is set for both models.
    bigram = Phrases(sentence_stream, min_count=1, delimiter=b' ')
    trigram = Phrases(bigram[sentence_stream], min_count=1, delimiter=b' ')

    for sent in sentence_stream:
        # Keep only tokens made of exactly two words (one delimiter).
        bigrams_ = [b for b in bigram[sent] if b.count(' ') == 1]
        # Keep only tokens made of exactly three words (two delimiters).
        trigrams_ = [t for t in trigram[bigram[sent]] if t.count(' ') == 2]

        print(bigrams_)
        print(trigrams_)

I removed the threshold = 1 parameter from the bigram Phrases because otherwise it seems to form weird digrams that allow the construction of weird trigrams (notice that bigram is used to build the trigram Phrases); this parameter will probably become useful when you have more data. For the trigrams, the min_count parameter also needs to be specified because it defaults to 5 if not provided.

In order to retrieve the bigrams and trigrams of each document, you can use this list comprehension trick to filter out the elements that aren't formed by two or three words, respectively.
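To make the filtering step concrete without needing Gensim, here is a minimal sketch of the same trick on a hand-written token list. The helper name `ngrams_of_size` and the sample tokens are made up for illustration; the tokens only mimic the kind of output a phrase model returns when the delimiter is a space.

```python
def ngrams_of_size(tokens, n, delimiter=" "):
    """Keep only tokens made of exactly n words, i.e. containing n - 1 delimiters."""
    return [tok for tok in tokens if tok.count(delimiter) == n - 1]


# Hypothetical phrase-model output for one sentence (not produced by Gensim here).
tokens = ["human computer interaction", "is", "a", "great", "and", "new subject"]

print(ngrams_of_size(tokens, 2))  # ['new subject']
print(ngrams_of_size(tokens, 3))  # ['human computer interaction']
```

Counting delimiters only works if the delimiter never appears inside an individual word, which holds here because the sentences were split on spaces.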


Edit - a few details about the threshold parameter:

This parameter is used by the estimator to determine whether two words a and b form a phrase, which happens only if:

    (count(a followed by b) - min_count) * N / (count(a) * count(b)) > threshold

where N is the total vocabulary size. By default the parameter value is 10 (see docs). So, the higher the threshold, the harder the constraints for words to form phrases.
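As a rough illustration of that inequality, the sketch below computes the score by hand on a tiny toy corpus. It is not Gensim's code: the corpus is invented, the function names are mine, and N is assumed to be the number of distinct words, which may differ from how Gensim counts its vocabulary.

```python
from collections import Counter


def phrase_score(count_a, count_b, count_ab, min_count, vocab_size):
    """(count(a followed by b) - min_count) * N / (count(a) * count(b))"""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)


# Toy corpus, invented for this example.
sentences = [
    "human computer interaction is interesting".split(),
    "human computer interaction is fun".split(),
    "machine learning is fun".split(),
]

unigrams = Counter(word for sent in sentences for word in sent)
bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
vocab_size = len(unigrams)  # assumption: N = number of distinct words (8 here)


def score(a, b, min_count=1):
    """Score of the pair (a, b) using the toy counts above."""
    return phrase_score(unigrams[a], unigrams[b], bigrams[(a, b)], min_count, vocab_size)


print(score("human", "computer"))    # (2 - 1) * 8 / (2 * 2) = 2.0
print(score("interaction", "is"))    # (2 - 1) * 8 / (2 * 3), about 1.33
print(score("machine", "learning"))  # (1 - 1) * 8 / (1 * 1) = 0.0
```

With threshold = 1 both "human computer" and "interaction is" would pass in this toy setting, while a threshold of 1.5 would keep only "human computer". A frequent word like "is" lowers the score but not enough to block the pair when the threshold is very relaxed.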

For example, in your first approach you were trying to use threshold = 1, so you would get ['human computer','interaction is'] as digrams for 3 out of your 5 sentences that begin with "human computer interaction"; that weird second digram is a result of the more relaxed threshold.

Then, when you try to get trigrams with the default threshold = 10, you only get ['human computer interaction is'] for those 3 sentences, and nothing for the remaining two (filtered by the threshold); and because that was a 4-gram instead of a trigram, it would also be filtered by if t.count(' ') == 2. If, for example, you lower the trigram threshold to 1, you can get ['human computer interaction'] as a trigram for the two remaining sentences. It doesn't seem easy to get a good combination of parameters, here's more about it.

I'm not an expert, so take this conclusion with a grain of salt: I think it's better to first get good digram results (not like 'interaction is') before moving on, as weird digrams can add confusion to further trigrams, 4-grams...
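The cascade from a weird digram to a 4-gram can be reproduced with a small pure-Python stand-in for the two passes. This is only a sketch of the idea, not Gensim's implementation: `merge_pass` is a hypothetical helper that learns counts and greedily merges adjacent pairs using the score formula above, the corpus is a toy one, N is assumed to be the number of distinct tokens, and the thresholds 1.0 and 1.5 are picked just to show the effect.

```python
from collections import Counter


def merge_pass(sentences, min_count=1, threshold=1.0):
    """One simplified phrase-detection pass: learn counts, then merge adjacent pairs."""
    unigrams = Counter(tok for sent in sentences for tok in sent)
    pairs = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)  # assumption: N = number of distinct tokens

    merged = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent):
                a, b = sent[i], sent[i + 1]
                score = (pairs[(a, b)] - min_count) * vocab_size / (unigrams[a] * unigrams[b])
                if score > threshold:
                    out.append(a + " " + b)  # join the pair with a space delimiter
                    i += 2
                    continue
            out.append(sent[i])
            i += 1
        merged.append(out)
    return merged


# Toy corpus, invented for this example.
sentences = [
    "human computer interaction is interesting".split(),
    "human computer interaction is fun".split(),
    "machine learning is fun".split(),
]

# Relaxed first pass: "interaction is" gets merged, so the second pass builds a 4-gram.
loose = merge_pass(merge_pass(sentences, threshold=1.0), threshold=1.0)
# Stricter first pass: only "human computer" is merged, so the second pass builds a trigram.
strict = merge_pass(merge_pass(sentences, threshold=1.5), threshold=1.0)

print(loose[0])   # ['human computer interaction is', 'interesting']
print(strict[0])  # ['human computer interaction', 'is', 'interesting']
```

The second pass can only glue together what the first pass produced, so cleaning up the digrams first is what makes the trigram "human computer interaction" reachable at all.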

