In the R text2vec package, can the LDA model show the topic distribution for each token in a document?
    library(text2vec)
    library(parallel)
    library(doParallel)

    # register a parallel backend using all available cores
    n <- parallel::detectCores()
    cl <- makeCluster(n)
    registerDoParallel(cl)

    ky_young <- read.csv("./ky_young.csv")

    it <- itoken_parallel(ky_young$textinfo,
                          ids = ky_young$id,
                          tokenizer = word_tokenizer,
                          progressbar = FALSE)

    ## stopwords
    stop_words <- readLines("./stopwrd1.txt", encoding = "UTF-8")

    vocab <- create_vocabulary(it, stopwords = stop_words, ngram = c(1, 1)) %>%
      prune_vocabulary(term_count_min = 5)

    vocab.order <- vocab[order(vocab$term_count, decreasing = TRUE), ]

    vectorizer <- vocab_vectorizer(vocab)
    dtm <- create_dtm(it, vectorizer, distributed = FALSE)

    lda_model <- LatentDirichletAllocation$new(n_topics = 200,
                                               # vocabulary = vocab,  <= error
                                               doc_topic_prior = 0.1,
                                               topic_word_prior = 0.01)

    ## topic-document distribution
    lda_fit <- lda_model$fit_transform(x = dtm,
                                       n_iter = 50,
                                       convergence_tol = -1,
                                       n_check_convergence = 10)

    ## topic-word distribution
    topic_word_prior <- lda_model$topic_word_distribution

I created this test LDA code in text2vec, and I can get the word-topic distribution and the document-topic distribution. (And it is crazy fast.)
By the way, I am wondering whether it is possible to get the topic distribution for each token in a document from text2vec's LDA model.
As I understand it, the result of the LDA analysis process is that each token in a document belongs to a specific topic, and so each document has a distribution over topics.
If I could get each token's topic distribution, I would like to check how each topic's top words change across classified documents (for example, by period). Is that possible?
If there is a way, I would be grateful if you could let me know.
Unfortunately, it is impossible to get the distribution of topics for each token in a given document. Document-topic counts are calculated/aggregated "on the fly", so the document-token-topic distribution is not stored anywhere.
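The sampler's per-token assignments cannot be recovered, but a rough stand-in can be computed from the two matrices the question already obtains. The following is a minimal sketch in base R, not part of the text2vec API: it approximates P(topic | token, document) as proportional to the document's topic weight times the topic's weight for that token. The helper name token_topic_posterior and the toy matrices are hypothetical. It assumes the document-topic matrix is documents x topics and the topic-word matrix is topics x terms with terms as column names; check both shapes against your own lda_fit and topic_word_distribution before using it.

```r
# Toy stand-ins for the fitted matrices (assumed shapes, see the note above).
# doc_topic: documents x topics, each row sums to 1.
doc_topic <- matrix(c(0.9, 0.1,
                      0.2, 0.8),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("doc1", "doc2"), c("topic1", "topic2")))

# topic_word: topics x terms, each row sums to 1.
topic_word <- matrix(c(0.7, 0.2, 0.1,
                       0.1, 0.3, 0.6),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("topic1", "topic2"),
                                     c("apple", "bank", "river")))

# Hypothetical helper: approximate topic distribution for each token of one document.
# This is a post-hoc approximation, not the sampler's actual token assignments.
token_topic_posterior <- function(doc_id, tokens, doc_topic, topic_word) {
  # keep only tokens that are in the vocabulary (duplicates are preserved)
  tokens <- tokens[tokens %in% colnames(topic_word)]
  # topics x tokens: each column is doc_topic[doc_id, ] * topic_word[, token]
  scores <- doc_topic[doc_id, ] * topic_word[, tokens, drop = FALSE]
  # normalize each token's column so it sums to 1 over topics
  scores <- sweep(scores, 2, colSums(scores), "/")
  # return tokens x topics
  t(scores)
}

res <- token_topic_posterior("doc1", c("apple", "river", "unknown"),
                             doc_topic, topic_word)
print(round(res, 3))
```

Each row of the result is one token of the document, so the same token can receive different topic weights in different documents. To compare top words per period, one could sum these rows over all documents in a period; that aggregation is left out here to keep the sketch short.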