@ululh
Last active February 1, 2023 09:32
LDA (Latent Dirichlet Allocation) predicting with python scikit-learn
# derived from http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html
# explanations are located there : https://www.linkedin.com/pulse/dissociating-training-predicting-latent-dirichlet-lucien-tardres
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import pickle
# create a blank model
lda = LatentDirichletAllocation()
# load parameters from file
with open('outfile', 'rb') as fd:
    (features, lda.components_, lda.exp_dirichlet_component_, lda.doc_topic_prior_) = pickle.load(fd)
# the dataset to predict on (first two samples were also in the training set so one can compare)
data_samples = ["I like to eat broccoli and bananas.",
"I ate a banana and spinach smoothie for breakfast.",
"kittens and dogs are boring"
]
# Vectorize the training set using the model features as vocabulary
tf_vectorizer = CountVectorizer(vocabulary=features)
tf = tf_vectorizer.fit_transform(data_samples)
# transform method returns a matrix with one line per document, columns being topics weight
predict = lda.transform(tf)
print(predict)

ghost commented Jun 8, 2018

I am building an analytics application to predict the theme of incoming customer support text data, but I have run into a few challenges on which I would like your input.

Challenges:
1. I am using Latent Dirichlet Allocation (LDA) from scikit-learn with mostly default hyper-parameters, apart from a few essential ones. But LDA gives inconsistent results: the topic distribution for the documents and the keywords assigned to each topic are jumbled up after every re-run. What parameters should I tune to stabilize the LDA? I tried two (number of topics and learning decay), but tweaking them did not give much improvement or consistency.
2. How should I handle a document that has a very low probability across all topics, or the same probability for every topic? Please correct me if my understanding is wrong, but I interpret this situation as the document containing new keywords the model has not seen before.

To overcome these issues, I came up with the solutions below:
1. Update the number of topics and re-run the script, or retrain the model on the new documents. The challenge I see is that this will jumble all the topics again, so I would have to go back to the domain experts to infer the topics/themes again and again. That does not seem a practical approach from a business perspective.
2. Alternatively, accumulate such documents over time and build another model, say Model 2. If Model 1 (the previously built model) comes across such documents, it passes them to Model 2. This way I do not disturb the topic inference on every iteration because of new topics or themes. I am sure there must be a better way to overcome this issue, but I am not aware of one, so I came up with this layman's approach to the problem at hand.
Kindly share your valuable inputs.

@ChristopherDaigle

But LDA is splitting inconsistent result i.e. topic distribution for the documents, jumbled up keywords across different topic after every re-run.

Something that stands out is that on line 9, you aren't setting a random state. That is likely the main reason you get different outcomes on each successive fit of the model. Below is an example.

You have:

lda = LatentDirichletAllocation()

Try:

lda = LatentDirichletAllocation(random_state=0)

With the random state set, you will get deterministic outcomes from scikit-learn's probabilistic models.
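To illustrate the point, fitting twice on the same data with the same `random_state` yields identical topic-word matrices (a minimal check on toy document-term counts; the data here is made up for the demonstration):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# toy document-term counts: 4 documents, 5 vocabulary terms
X = np.array([[3, 0, 1, 0, 0],
              [2, 1, 0, 0, 0],
              [0, 0, 0, 2, 3],
              [0, 1, 0, 3, 2]])

lda_a = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
lda_b = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# identical seeds -> identical topic-word matrices
print(np.allclose(lda_a.components_, lda_b.components_))  # True
```

Without `random_state`, each fit starts from a different random initialization, which is why the topics shuffle between runs.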

@ChristopherDaigle

I see now that you have another module where you do the fitting, and this module is meant for predicting alone.

Instead of using line 8 at all, you can load the pickled model object, as it seems you're trying to do on line 12, without assigning the attributes of a fitted model to an empty `lda` instance. Simply unpickle the fitted model from the file you reference (`fd`) instead of assigning those attributes. You don't even need to import `LatentDirichletAllocation`, only to load the fitted model.

here's an example:

import pickle

with open('model', 'rb') as model:
    ldia_model = pickle.load(model)
# assuming you pickled the vectorizer
with open('vectorizer', 'rb') as vectorizer:
    count_vec = pickle.load(vectorizer)

data_samples = [...]
vec_data = count_vec.transform(data_samples)

predict = ldia_model.transform(vec_data)
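The corresponding training script would then pickle the fitted objects whole, with no attribute surgery. A sketch, assuming the filenames `model` and `vectorizer` used in the loading example above (the training corpus here is just placeholder data):

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

train_samples = ["I like to eat broccoli and bananas.",
                 "I ate a banana and spinach smoothie for breakfast."]

count_vec = CountVectorizer()
tf = count_vec.fit_transform(train_samples)
ldia_model = LatentDirichletAllocation(n_components=2, random_state=0).fit(tf)

# pickle the fitted estimator and vectorizer as-is
with open('model', 'wb') as f:
    pickle.dump(ldia_model, f)
with open('vectorizer', 'wb') as f:
    pickle.dump(count_vec, f)
```

One caveat worth noting: pickled scikit-learn estimators are generally only safe to load with the same scikit-learn version that saved them.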

@ChristopherDaigle

I came across this gist while looking for input on my own LDiA work, so if it seems odd to get comments on it from a random person four years after you wrote it, my apologies!

@nikbpetrov

> I came across this gist while looking for input on my own LDiA work, so if it seems odd to get comments on it from a random person four years after you wrote it, my apologies!

For what it's worth, I found your reply quite helpful in my project!
