python - Issues in getting trigrams using Gensim
I want to get bigrams and trigrams from the example sentences I have mentioned.
My code works fine for bigrams. However, it does not capture trigrams in the data (e.g., "human computer interaction", which is mentioned in 5 places in my sentences).
Approach 1: Mentioned below is my code using Phrases in Gensim.
    from gensim.models import Phrases

    documents = [
        "the mayor of new york there",
        "human computer interaction , machine learning has become trending research area",
        "human computer interaction interesting",
        "human computer interaction pretty interesting subject",
        "human computer interaction great , new subject",
        "machine learning can useful sometimes",
        "new york mayor present",
        "i love machine learning because new subject area",
        "human computer interaction helps people user friendly applications",
    ]

    # Tokenize each document on single spaces
    sentence_stream = [doc.split(" ") for doc in documents]

    # Bigram model, then a trigram model trained on the bigram-transformed sentences
    bigram = Phrases(sentence_stream, min_count=1, threshold=1, delimiter=b' ')
    trigram = Phrases(bigram[sentence_stream])

    for sent in sentence_stream:
        bigrams_ = bigram[sent]
        trigrams_ = trigram[bigrams_]
        print(bigrams_)
        print(trigrams_)
Approach 2: I even tried to use both Phraser and Phrases, but it didn't work.
    from gensim.models import Phrases
    from gensim.models.phrases import Phraser

    documents = [
        "the mayor of new york there",
        "human computer interaction , machine learning has become trending research area",
        "human computer interaction interesting",
        "human computer interaction pretty interesting subject",
        "human computer interaction great , new subject",
        "machine learning can useful sometimes",
        "new york mayor present",
        "i love machine learning because new subject area",
        "human computer interaction helps people user friendly applications",
    ]

    # Tokenize each document on single spaces
    sentence_stream = [doc.split(" ") for doc in documents]

    # Bigram model wrapped in a Phraser, then a trigram model trained on its output
    bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')
    bigram_phraser = Phraser(bigram)
    trigram = Phrases(bigram_phraser[sentence_stream])

    for sent in sentence_stream:
        bigrams_ = bigram_phraser[sent]
        trigrams_ = trigram[bigrams_]
        print(bigrams_)
        print(trigrams_)
Please help me fix this issue of getting trigrams.
I am following the example documentation of Gensim.
I was able to get bigrams and trigrams with a few modifications to your code:
    from gensim.models import Phrases

    documents = [
        "the mayor of new york there",
        "human computer interaction , machine learning has become trending research area",
        "human computer interaction interesting",
        "human computer interaction pretty interesting subject",
        "human computer interaction great , new subject",
        "machine learning can useful sometimes",
        "new york mayor present",
        "i love machine learning because new subject area",
        "human computer interaction helps people user friendly applications",
    ]

    # Tokenize each document on single spaces
    sentence_stream = [doc.split(" ") for doc in documents]

    # No explicit threshold on the bigram model; min_count=1 on both models
    bigram = Phrases(sentence_stream, min_count=1, delimiter=b' ')
    trigram = Phrases(bigram[sentence_stream], min_count=1, delimiter=b' ')

    for sent in sentence_stream:
        # Keep only tokens made of exactly two words (one space)
        bigrams_ = [b for b in bigram[sent] if b.count(' ') == 1]
        # Keep only tokens made of exactly three words (two spaces)
        trigrams_ = [t for t in trigram[bigram[sent]] if t.count(' ') == 2]
        print(bigrams_)
        print(trigrams_)
I removed the threshold = 1 parameter from the bigram Phrases because otherwise it seems to form weird digrams that allow the construction of weird trigrams (notice that bigram is used to build the trigram Phrases); this parameter will probably become useful when you have more data. For the trigrams, the min_count parameter also needs to be specified because it defaults to 5 if not provided.
In order to retrieve the bigrams and trigrams of each document, you can use this list comprehension trick to filter out the elements that aren't formed by two or three words, respectively.
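The filtering trick can be tried on its own, without Gensim. This is only a minimal sketch: the helper name ngrams_of_size and the sample token list are made up for illustration, and it assumes the phrased tokens are plain strings joined by a single space.

```python
def ngrams_of_size(tokens, n, delimiter=" "):
    """Keep only the tokens made of exactly n words joined by `delimiter`."""
    return [t for t in tokens if t.count(delimiter) == n - 1]


# Hypothetical phrased output for one sentence (not produced by Gensim here)
phrased = ["human computer interaction", "pretty", "interesting", "subject", "new york"]

bigrams = ngrams_of_size(phrased, 2)
trigrams = ngrams_of_size(phrased, 3)

print(bigrams)   # ['new york']
print(trigrams)  # ['human computer interaction']
```

Counting delimiters is a cheap check, but it relies on the delimiter never appearing inside a single original token.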
Edit - a few details about the threshold parameter:
This parameter is used by the estimator to determine whether two words a and b form a phrase, which happens only if:
(count(a followed by b) - min_count) * N / (count(a) * count(b)) > threshold
where N is the total vocabulary size. By default the parameter value is 10 (see docs). So, the higher the threshold, the harder the constraints for words to form phrases.
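To get a feel for how min_count and threshold interact, the formula above can be evaluated by hand. The sketch below only mirrors that formula as quoted; it is not Gensim's actual implementation, the function names are made up, and the counts are invented for illustration.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score of the pair (a, b) according to the formula quoted above."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)


def forms_phrase(count_a, count_b, count_ab, vocab_size, min_count=5, threshold=10.0):
    """True if the pair's score is strictly above the threshold."""
    return phrase_score(count_a, count_b, count_ab, vocab_size, min_count) > threshold


# Invented counts: a and b each occur 5 times, 4 of them as "a b",
# in a vocabulary of 30 distinct words.
score = phrase_score(count_a=5, count_b=5, count_ab=4, vocab_size=30, min_count=1)
print(score)  # (4 - 1) * 30 / (5 * 5) = 3.6

relaxed = forms_phrase(5, 5, 4, 30, min_count=1, threshold=1.0)
strict = forms_phrase(5, 5, 4, 30, min_count=1, threshold=10.0)
print(relaxed)  # True: 3.6 > 1
print(strict)   # False: 3.6 is not > 10
```

With these made-up numbers the same pair is accepted at threshold 1 and rejected at the default of 10, which is the kind of effect described in the example that follows.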
For example, in your first approach you were trying to use threshold = 1, so you would get ['human computer', 'interaction is'] as digrams for 3 out of the 5 sentences that begin with "human computer interaction"; that weird second digram is a result of the more relaxed threshold.
Then, when you try to get trigrams with the default threshold = 10, you only get ['human computer interaction is'] for those 3 sentences, and nothing for the remaining 2 (filtered out by the threshold); and because that is a 4-gram instead of a trigram, it would also be filtered out by if t.count(' ') == 2. If, for example, you lower the trigram threshold to 1, you can get ['human computer interaction'] as the trigram for the 2 remaining sentences. It doesn't seem easy to find a good combination of parameters.
I'm not an expert, so take this conclusion with a grain of salt: I think it's better to first get good digram results (not like 'interaction is') before moving on, as weird digrams can add confusion to further trigrams, 4-grams...