LDA模型构建与可视化

正在学习人工智能自然语言处理，学校布置的作业分享出来

文章目录

1. 原理

（参考相关博客与教材）
隐含狄利克雷分布（Latent Dirichlet Allocation，LDA），是一种主题模型（topic model），典型的词袋模型，即它认为一篇文档是由一组词构成的一个集合，词与词之间没有顺序以及先后的关系。一篇文档可以包含多个主题，文档中每一个词都由其中的一个主题生成。它可以将文档集中每篇文档的主题按照概率分布的形式给出，对文章进行主题归纳，属于无监督学习。
需要区分的是，另外一种经典的降维方法线性判别分析（Linear Discriminant Analysis, 简称也为LDA）。此LDA在模式识别领域（比如人脸识别，舰艇识别等图形图像识别领域）中有非常广泛的应用
LDA在训练时不需要手工标注的训练集，需要的仅仅是文档集以及指定主题的数量k即可。此外LDA的另一个优点则是，对于每一个主题均可找出一些词语来描述它。选择模型中topic的数量——人为设置参数，之后输入的每篇文章都给一个topic的概率每个topic再给其下单词概率，topic的具体实现由自己来定。
LDA 假设文档的生成过程如下:
在这里插入图片描述

(1)对每个主题:
生成“主题-词项”分布参数：生成“主题-词项”分布参数筏筏 ;(2)对每个文档:生成“文档-主题”分布参数：在这里插入图片描述
;
(3)对当前文档的每个位置。
(a）生成当前位置的所属主题:
;
(b)根据当前位置的主题，以及“主题-词项”分布参数，生成当前位置对应的词项：

2. 代码实现

2.1.导入的包

from matplotlib import pyplot as plt
from sklearn.feature_extraction.text import TfidfTransformer,CountVectorizer
import pandas as pd
import jieba
from sklearn.decomposition import LatentDirichletAllocation

2.2.分词去停用词

with open('train.csv','r',encoding='utf-8') as f:
    data = pd.read_csv(f)# print(data.info())
print(data.content.head())#content列的前5行
with open('中文stopwords.txt','r',encoding='utf-8') as f:
    stopwords=[x.strip()forxin f.readlines()]# print(stopwords[:5])

def cut_stop(content):#分词和去停用词listx=[]forrowin content:
        row=row.replace(" ","")crow=jieba.lcut(row)# print(jieba.lcut(row))forwordsin crow:
            if words in stopwords:
                crow.remove(words)
        listx.append(crow)return listx
lcontent=cut_stop(data.content[:1000])#生成分词列表

Lcontent[0]:
在这里插入图片描述
生成语料

#语料corpus=[]forrowin lcontent:
    text=" ".join(row)
    corpus.append(text)

corpus[0]:
在这里插入图片描述

2.3.Tfidf

#tfidf
vectorizer = CountVectorizer()vector=vectorizer.fit_transform(corpus)#转化为词频矩阵
transformer = TfidfTransformer()#该类会统计每个词语的tf-idf权值
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
tfidf_weight = tfidf.toarray()
word = vectorizer.get_feature_names()#获取词袋模型中的所有词语
print(f'查看矩阵维度:{vector.shape}')#查看矩阵维度

在这里插入图片描述

2.4.计算困惑度

可以简单理解为指定不同聚类数，然后通过lda.perplexity()函数进行处理，或者通过lda.score()进行评分绘制。困惑度理论上选取最低点，或者第一个拐点。绘制可以使用最常用的matplotlib库。

plexs =[]
scores =[]
n_max_topics =10foriin range(1,n_max_topics):
    print('正在进行第',i,'轮计算')
    lda = LatentDirichletAllocation(n_components=i, max_iter=5,
                                    learning_method='online',
                                    learning_offset=50,random_state=0)
    lda.fit(tfidf_weight )
    plexs.append(lda.perplexity(tfidf_weight ))
    scores.append(lda.score(tfidf_weight ))n_t=9#区间最右侧的值。注意：不能大于等于n_max_topicsx=list(range(1,n_t))
plt.plot(x,plexs[1:n_t])
plt.xlabel("number of topics")
plt.ylabel("perplexity")
plt.show()

见下图，因此选择主题数为2
在这里插入图片描述

2.5.LDA模型构建

#LDA构建
n_topics =2lda=LatentDirichletAllocation(n_components=n_topics,#主题个数max_iter=5,#EM算法最大迭代次数learning_method='online',#只在fit方法中使用，总体说来，当数据尺寸特别大的时候，在线online更新会比批处理batch更新快得多learning_offset=50.,#一个（正）参数，可以减轻online在线学习中的早期迭代的负担。random_state=0)
lda.fit(tfidf_weight )#拟合topicc=lda.components_#主题-词项分布topics=lda.transform(tfidf_weight )#文档-主题分布

2.6.主题与分词

2.6.1.权重值

#每个单词的主题权重值id=0fortt_min topicc:
    tt_dict =[(name, tt)for name, tt in zip(word, tt_m)]
    tt_dict = sorted(tt_dict, key=lambda x: x[1], reverse=True)# 打印权重值大于0.6的主题词：# tt_dict = [tt_threshold for tt_threshold in tt_dict if tt_threshold[1] > 0.6]# 打印每个类别前5个主题词：
    tt_dict = tt_dict[:8]
    print('主题%d:' % (id), tt_dict)id+=1

在这里插入图片描述

2.6.2.每个主题前25个词

#输出每个主题对应词语
def print_top_words(model, feature_names, n_top_words):
    tword =[]for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        topic_w =" ".join([feature_names[i]foriin topic.argsort()[:-n_top_words - 1:-1]])
        tword.append(topic_w)
        print(topic_w)return tword
n_top_words =25#前几个自己指定
topic_word = print_top_words(lda, word, n_top_words)

在这里插入图片描述

3.可视化

import pyLDAvis.sklearn
pic = pyLDAvis.sklearn.prepare(lda, vector, vectorizer)
pyLDAvis.save_html(pic, 'lda_pass'+str(n_topics)+'.html')
pyLDAvis.show(pic)

在这里插入图片描述

标签：人工智能 python nlp

本文转载自: https://blog.csdn.net/weixin_55595295/article/details/127886718
版权归原作者 上午早起了吗225 所有，如有侵权，请联系我们删除。

LDA模型构建与可视化

文章目录

1. 原理

2. 代码实现

2.1.导入的包

2.2.分词去停用词

2.3.Tfidf

2.4.计算困惑度

2.5.LDA模型构建

2.6.主题与分词

2.6.1.权重值

2.6.2.每个主题前25个词

3.可视化

发表评论

“LDA模型构建与可视化”的评论:

关于作者

overfit同步小助手

相关阅读

文章导航