NLP - Topic Modeling - LDA - Latent Dirichlet Allocation

NLP - processing natural text (or speech) to find patterns

https://www.youtube.com/watch?v=xvqsFTUsOmc

Can be used for 

1. Sentiment analysis - determining whether a document/text is a positive response or a negative one.

2. Topic modeling - finding the topic(s) in a document, for example whether an email is financial, personal, or project-related. A document can be a mixture of topics - for example 80% financial, 15% project, 5% personal.

3. Text generation - example: autocomplete - Markov chains (only look at the previous state), LSTMs (look at many previous states). A minimal Markov chain sketch follows this list.
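Here is a minimal sketch of a first-order Markov chain autocomplete in Python (the tiny corpus and the start word are made up just for illustration):

import random
from collections import defaultdict

# First-order Markov chain: the next word depends only on the current word.
corpus = "the cat sat on the mat the cat ate the fish".split()

transitions = defaultdict(list)
for current_word, next_word in zip(corpus, corpus[1:]):
    transitions[current_word].append(next_word)

def generate(start, length=5):
    word = start
    output = [word]
    for _ in range(length):
        followers = transitions.get(word)
        if not followers:          # no observed continuation, stop
            break
        word = random.choice(followers)
        output.append(word)
    return " ".join(output)

print(generate("the"))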


2. Topic Modeling

Popular method - LDA (Latent Dirichlet Allocation)

Latent = Hidden

Dirichlet = a type of probability distribution (used here to model the mixture of topics)


A document or a text can be considered a distribution of topics.

Identifying the topics and their percentages is what LDA does.

A topic can be a distribution of words.

So, every document is a mix of topics and every topic is a mix of words

LDA doesn't tell you what each topic is about. It just gives the number of topics and the percentage of each topic in a document.

For each topic it also gives a distribution of words.
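A minimal sketch using gensim (one possible LDA library); the toy documents and the choice of 2 topics are just for illustration:

from gensim import corpora
from gensim.models import LdaModel

docs = [
    "stock market invoice payment budget".split(),
    "project deadline meeting milestone report".split(),
    "budget report payment project invoice".split(),
]

dictionary = corpora.Dictionary(docs)                    # word <-> id mapping
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]   # bag-of-words counts

lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary,
               passes=20, random_state=0)

# Each document is a mix of topics: (topic id, proportion) pairs...
print(lda.get_document_topics(bow_corpus[0]))
# ...and each topic is a mix of words: (word, weight) pairs.
# Note that LDA only numbers the topics, it never names them.
print(lda.show_topic(0, topn=5))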


How many topics? Who decides? - you do during initialization

So we choose the number of topics ourselves.

LDA then goes through each word in a document and randomly assigns each word a topic as a starting point; these assignments are then refined iteratively.
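A rough sketch of just that initialization step (the toy document and the 3 topics are made up; a real sampler then iteratively refines these random assignments):

import random

num_topics = 3   # chosen by us at initialization
document = "stock budget meeting payment deadline".split()

# Every word token starts with a randomly assigned topic id.
assignments = [(word, random.randrange(num_topics)) for word in document]
print(assignments)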



    
spaCy is like a better version of NLTK and might replace it one day.
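A small spaCy sketch, assuming the en_core_web_sm model is installed (python -m spacy download en_core_web_sm), showing the kind of preprocessing you would typically run before LDA:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cats were sitting on the mats.")

# Lemmatized, stop-word-free tokens, e.g. ['cat', 'sit', 'mat']
tokens = [t.lemma_ for t in doc if t.is_alpha and not t.is_stop]
print(tokens)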


So that's how NLP helps here.
We are trying to find out why you like something (for example, a movie).
We use these techniques to cluster the different movies into groups,
find out what you liked about that group,
then recommend some others from the same group (a rough sketch follows).
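A rough sketch of that idea with scikit-learn (the movie blurbs are invented): each movie is represented by its LDA topic mixture, and we recommend the movie whose mixture is most similar to one you liked.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

blurbs = [
    "space battle alien rebellion starship",
    "detective murder investigation clue suspect",
    "galaxy spaceship crew alien planet",
    "lawyer courtroom trial verdict jury",
]

counts = CountVectorizer().fit_transform(blurbs)
topic_mix = LatentDirichletAllocation(n_components=2,
                                      random_state=0).fit_transform(counts)

liked = 0                                  # suppose you liked movie 0
scores = cosine_similarity(topic_mix[liked:liked + 1], topic_mix)[0]
scores[liked] = -1                         # don't recommend the movie itself
print("Recommend movie", scores.argmax())  # most similar topic mixture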