A Benchmark of 7 Major Keyword Extraction Algorithms in Python

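I have been looking for an effective algorithm for keyword extraction. The goal was to find one that extracts keywords efficiently and balances extraction quality against execution time, because my corpus is growing quickly and has already reached millions of rows. One of my main requirements is that the extracted keywords must always be meaningful on their own, conveying some sense even when taken out of context.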

This article tests and experiments with several well-known keyword extraction algorithms on a corpus of 2,000 documents.

Libraries used

I used the following Python libraries for this study:

  • NLTK, for the preprocessing stage and some helper functions
  • RAKE
  • YAKE
  • PKE
  • KeyBERT
  • spaCy

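along with pandas, Matplotlib, and other general-purpose libraries.

Experiment workflow

The benchmark works as follows.

We first import the dataset containing our text data. Then, for each algorithm, we create a separate function that holds its extraction logic: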

algorithm_name(str: text) → [keyword1, keyword2, ..., keywordn]
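
As a minimal sketch of this contract, the hypothetical `toy_extractor` below (not one of the seven algorithms benchmarked here) takes a string and returns a list of at most five keywords, ranking words by raw frequency. The stopword set is a tiny assumption made for illustration only.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (an assumption, not the NLTK list).
STOPWORDS = {"a", "an", "the", "is", "of", "to", "and", "i", "my", "in"}


def toy_extractor(text: str, top_n: int = 5) -> list[str]:
    """Hypothetical extractor: rank lowercase words by frequency."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _count in counts.most_common(top_n)]


keywords = toy_extractor("Mead judges love mead. The mead competition is in Europe.")
print(keywords)
```

Any function with this shape, text in and list of strings out, can be plugged into the rest of the benchmark.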

Next, we create a function that extracts keywords from the entire corpus:

extract_keywords_from_corpus(algorithm, corpus) → {algorithm, corpus_keywords, elapsed_time}

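In the next step, we use spaCy to define a matcher object that decides whether a keyword is meaningful for our task; it returns True or False.

Finally, we wrap everything in a function that outputs the final report.

Dataset

I am using a dataset of short texts taken from the internet. Here is a sample: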

['To follow up from my previous questions. . Here is the result!\n',
 'European mead competitions?\nI’d love some feedback on my mead, but entering the Mazer Cup isn’t an option for me, since shipping alcohol to the USA from Europe is illegal. (I know I probably wouldn’t get caught/prosecuted, but any kind of official record of an issue could screw up my upcoming citizenship application and I’m not willing to risk that).\n\nAre there any European mead comps out there? Or at least large beer comps that accept entries in the mead categories and are likely to have experienced mead judges?', 'Orange Rosemary Booch\n', 'Well folks, finally happened. Went on vacation and came home to mold.\n', 'I’m opening a gelato shop in London on Friday so we’ve been up non-stop practicing flavors - here’s one of our most recent attempts!\n', "Does anyone have resources for creating shelf stable hot sauce? Ferment and then water or pressure can?\nI have dozens of fresh peppers I want to use to make hot sauce, but the eventual goal is to customize a recipe and send it to my buddies across the States. I believe canning would be the best way to do this, but I'm not finding a lot of details on it. Any advice?", 'what is the practical difference between a wine filter and a water filter?\nwondering if you could use either', 'What is the best custard base?\nDoes someone have a recipe that tastes similar to Culver’s frozen custard?', 'Mold?\n'

Most of the texts are food-related. We will use a sample of 2,000 documents to test our algorithms.
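
Drawing such a sample could look like the sketch below. The `texts` list here is a small stand-in and the seed value is an arbitrary assumption; the article does not say how the 2,000 documents were selected, so treat this as one possible approach rather than the method used.

```python
import random

# Stand-in corpus; in the benchmark this would be the full list of documents.
texts = [f"document {i}" for i in range(10_000)]

# A dedicated Random instance with a fixed seed keeps the sample reproducible
# and leaves the original list untouched (random.shuffle would reorder it in place).
rng = random.Random(42)
sample = rng.sample(texts, k=2000)

print(len(sample), sample[:3])
```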

We do not preprocess the text at this point, because some of the algorithms rely on stopwords and punctuation to produce their results.
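
To see why stopwords and punctuation matter, consider how a RAKE-style algorithm builds candidate phrases: it cuts the text wherever a stopword or punctuation mark occurs. The sketch below is a simplified illustration with a made-up stopword set, not the `rake-nltk` implementation.

```python
import re

# Tiny illustrative stopword set (an assumption; RAKE implementations use a full list).
STOPWORDS = {"i", "a", "the", "to", "of", "for", "is", "and", "does", "have", "anyone"}


def candidate_phrases(text: str) -> list[str]:
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    # Punctuation marks end a phrase, so split on them first.
    for fragment in re.split(r"[.,!?;:()\n]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                # A stopword closes the phrase collected so far.
                if current:
                    phrases.append(" ".join(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(" ".join(current))
    return phrases


text = "Does anyone have resources for creating shelf stable hot sauce? Ferment and then water or pressure can?"
phrases = candidate_phrases(text)
print(phrases)
```

If stopwords and punctuation had been stripped beforehand, the delimiters would be gone and each sentence would collapse into one long, unusable phrase.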

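Algorithms

Let's define the keyword extraction functions.

from keybert import KeyBERT
from rake_nltk import Rake
import pke
import yake

# initiate BERT outside of functions
bert = KeyBERT()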

# 1. RAKE
def rake_extractor(text):
    """
    Uses Rake to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    r = Rake()
    r.extract_keywords_from_text(text)
    return r.get_ranked_phrases()[:5]

# 2. YAKE
def yake_extractor(text):
    """
    Uses YAKE to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    keywords = yake.KeywordExtractor(lan="en", n=3, windowsSize=3, top=5).extract_keywords(text)
    results = []
    for scored_keywords in keywords:
        for keyword in scored_keywords:
            if isinstance(keyword, str):
                results.append(keyword)
    return results

# 3. PositionRank
def position_rank_extractor(text):
    """
    Uses PositionRank to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    # define the valid Part-of-Speeches to occur in the graph
    pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}
    extractor = pke.unsupervised.PositionRank()
    extractor.load_document(text, language='en')
    extractor.candidate_selection(pos=pos, maximum_word_number=5)
    # 4. weight the candidates using the sum of their word's scores that are
    #    computed using random walk biased with the position of the words
    #    in the document. In the graph, nodes are words (nouns and
    #    adjectives only) that are connected if they occur in a window of
    #    3 words.
    extractor.candidate_weighting(window=3, pos=pos)
    # 5. get the 5-highest scored candidates as keyphrases
    keyphrases = extractor.get_n_best(n=5)
    results = []
    for scored_keywords in keyphrases:
        for keyword in scored_keywords:
            if isinstance(keyword, str):
                results.append(keyword)
    return results

# 4. SingleRank
def single_rank_extractor(text):
    """
    Uses SingleRank to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}
    extractor = pke.unsupervised.SingleRank()
    extractor.load_document(text, language='en')
    extractor.candidate_selection(pos=pos)
    extractor.candidate_weighting(window=3, pos=pos)
    keyphrases = extractor.get_n_best(n=5)
    results = []
    for scored_keywords in keyphrases:
        for keyword in scored_keywords:
            if isinstance(keyword, str):
                results.append(keyword)
    return results

# 5. MultipartiteRank
def multipartite_rank_extractor(text):
    """
    Uses MultipartiteRank to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    extractor = pke.unsupervised.MultipartiteRank()
    extractor.load_document(text, language='en')
    pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}
    extractor.candidate_selection(pos=pos)
    # 4. build the Multipartite graph and rank candidates using random walk,
    #    alpha controls the weight adjustment mechanism, see TopicRank for
    #    threshold/method parameters.
    extractor.candidate_weighting(alpha=1.1, threshold=0.74, method='average')
    keyphrases = extractor.get_n_best(n=5)
    results = []
    for scored_keywords in keyphrases:
        for keyword in scored_keywords:
            if isinstance(keyword, str):
                results.append(keyword)
    return results

# 6. TopicRank
def topic_rank_extractor(text):
    """
    Uses TopicRank to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    extractor = pke.unsupervised.TopicRank()
    extractor.load_document(text, language='en')
    pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}
    extractor.candidate_selection(pos=pos)
    extractor.candidate_weighting()
    keyphrases = extractor.get_n_best(n=5)
    results = []
    for scored_keywords in keyphrases:
        for keyword in scored_keywords:
            if isinstance(keyword, str):
                results.append(keyword)
    return results

# 7. KeyBERT
def keybert_extractor(text):
    """
    Uses KeyBERT to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    keywords = bert.extract_keywords(text, keyphrase_ngram_range=(3, 5), stop_words="english", top_n=5)
    results = []
    for scored_keywords in keywords:
        for keyword in scored_keywords:
            if isinstance(keyword, str):
                results.append(keyword)
    return results

Each extractor takes a text as its argument and returns a list of keywords, which makes them very simple to use.
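
YAKE, pke's `get_n_best`, and KeyBERT return their results as pairs of a keyword and a score, which is why the nested loops above keep only the string element. As far as I know the keyword comes first in current versions, but the `isinstance` check also copes with the order being swapped. The same logic can be written as a single comprehension; `keywords_only` is a hypothetical helper name and the pairs below are invented for illustration, not real extractor output.

```python
def keywords_only(scored_keywords):
    """Keep only the string element of each (keyword, score) pair."""
    return [item for pair in scored_keywords for item in pair if isinstance(item, str)]


# Invented example pairs (not real extractor output).
scored = [("hot sauce", 0.12), ("fresh peppers", 0.21), ("shelf stable", 0.35)]
swapped = [(score, keyword) for keyword, score in scored]

print(keywords_only(scored))
print(keywords_only(swapped))
```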

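Note: for some reason, I could not initialize all the extractor objects outside of the functions. Whenever I did, TopicRank and MultipartiteRank threw errors. This is not ideal for performance, but the benchmark can still be completed.

We have already restricted the accepted grammatical patterns by passing pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}. Together with spaCy, this ensures that almost all keywords are ones that make sense from a human-language perspective. We also want keywords to contain three words, so that they are more specific and not overly generic.

Extracting keywords from the entire corpus

Now let's define a function that applies a single extractor to the entire corpus while logging some information.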

import logging
import time

from tqdm import tqdm


def extract_keywords_from_corpus(extractor, corpus):
    """This function uses an extractor to retrieve keywords from a list of documents"""
    extractor_name = extractor.__name__.replace("_extractor", "")
    logging.info(f"Starting keyword extraction with {extractor_name}")
    corpus_kws = {}
    start = time.time()
    # logging.info(f"Timer initiated.") <-- uncomment this if you want to output start of timer
    for idx, text in tqdm(enumerate(corpus), desc="Extracting keywords from corpus..."):
        corpus_kws[idx] = extractor(text)
    end = time.time()
    # logging.info(f"Timer stopped.") <-- uncomment this if you want to output end of timer
    elapsed = time.strftime("%H:%M:%S", time.gmtime(end-start))
    logging.info(f"Time elapsed: {elapsed}")
    
    return {"algorithm": extractor.__name__, 
            "corpus_kws": corpus_kws, 
            "elapsed_time": elapsed}

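All this function does is combine the extractor's output with some useful information (such as how long the task took) into a dictionary, which makes it easier to generate the report later.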
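
One caveat of formatting the elapsed time as `HH:MM:SS` is that it truncates to whole seconds, which is coarse for an extractor that finishes in a couple of seconds. A variant that keeps fractional seconds could look like this sketch; `timed_run` and `word_splitter` are hypothetical names, not part of the original benchmark.

```python
import time


def word_splitter(text):
    """Stand-in extractor used only to exercise the timer."""
    return text.split()[:5]


def timed_run(extractor, corpus):
    """Apply an extractor to every document and record elapsed seconds as a float."""
    start = time.perf_counter()  # monotonic, high-resolution clock
    corpus_kws = {idx: extractor(text) for idx, text in enumerate(corpus)}
    elapsed_seconds = time.perf_counter() - start
    return {
        "algorithm": extractor.__name__,
        "corpus_kws": corpus_kws,
        "elapsed_seconds": elapsed_seconds,
    }


result = timed_run(word_splitter, ["Orange Rosemary Booch", "What is the best custard base?"])
print(result["algorithm"], f"{result['elapsed_seconds']:.6f}s")
```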

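Syntax matching function

This function makes sure that the keywords returned by an extractor are always (or almost always) meaningful. For example:

It is clear that the first three keywords can stand on their own and are perfectly meaningful. We need no further information to understand what they mean. The fourth, however, makes no sense at all, so we want to avoid such cases as much as possible.

spaCy's Matcher object helps us do this. We define a matching function that takes a keyword and returns True if any of the defined patterns matches, and False otherwise.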

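import spacy
from spacy.matcher import Matcher

# Assumption: the article does not name the spaCy model; any English pipeline with a tagger should work.
nlp = spacy.load("en_core_web_sm")


def match(keyword):
    """This function checks if a keyword matches one of the POS patterns"""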
    patterns = [
        [{'POS': 'PROPN'}, {'POS': 'VERB'}, {'POS': 'VERB'}],
        [{'POS': 'NOUN'}, {'POS': 'VERB'}, {'POS': 'NOUN'}],
        [{'POS': 'VERB'}, {'POS': 'NOUN'}],
        [{'POS': 'ADJ'}, {'POS': 'ADJ'}, {'POS': 'NOUN'}],  
        [{'POS': 'NOUN'}, {'POS': 'VERB'}],
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}],
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'NOUN'}],
        [{'POS': 'ADJ'}, {'POS': 'NOUN'}],
        [{'POS': 'ADJ'}, {'POS': 'NOUN'}, {'POS': 'NOUN'}, {'POS': 'NOUN'}],
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'ADV'}, {'POS': 'PROPN'}],
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'VERB'}],
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}],
        [{'POS': 'NOUN'}, {'POS': 'NOUN'}],
        [{'POS': 'ADJ'}, {'POS': 'PROPN'}],
        [{'POS': 'PROPN'}, {'POS': 'ADP'}, {'POS': 'PROPN'}],
        [{'POS': 'PROPN'}, {'POS': 'ADJ'}, {'POS': 'NOUN'}],
        [{'POS': 'PROPN'}, {'POS': 'VERB'}, {'POS': 'NOUN'}],
        [{'POS': 'NOUN'}, {'POS': 'ADP'}, {'POS': 'NOUN'}],
        [{'POS': 'PROPN'}, {'POS': 'NOUN'}, {'POS': 'PROPN'}],
        [{'POS': 'VERB'}, {'POS': 'ADV'}],
        [{'POS': 'PROPN'}, {'POS': 'NOUN'}],
        ]
    matcher = Matcher(nlp.vocab)
    matcher.add("pos-matcher", patterns)
    # create spacy object
    doc = nlp(keyword)
    # iterate through the matches
    matches = matcher(doc)
    # if matches is not empty, it means that it has found at least a match
    if len(matches) > 0:
        return True
    return False
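
Conceptually, the matcher slides each pattern over the keyword's part-of-speech tags and reports a hit if any pattern fits somewhere. The sketch below mimics that idea in plain Python on hand-tagged tokens. The tags are assigned by hand here as an assumption, whereas in the real function spaCy's tagger produces them, and only three of the patterns above are included.

```python
# A subset of the POS patterns used above, written as tuples of tags.
PATTERNS = [
    ("ADJ", "NOUN"),
    ("NOUN", "NOUN"),
    ("PROPN", "PROPN"),
]


def matches_any_pattern(tags):
    """Return True if any pattern occurs as a contiguous run inside tags."""
    for pattern in PATTERNS:
        width = len(pattern)
        for start in range(len(tags) - width + 1):
            if tuple(tags[start:start + width]) == pattern:
                return True
    return False


# Hand-tagged examples (the tags are assumptions, not spaCy output).
print(matches_any_pattern(["ADJ", "NOUN"]))         # tags for a phrase like "frozen custard"
print(matches_any_pattern(["ADV", "VERB", "ADP"]))  # tags for a phrase like "probably caught by"
```

Note that, like the Matcher-based function, this returns True as soon as a pattern fits any part of the keyword, so a longer keyword can pass even if only two of its words match.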

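Benchmark function

We are almost done. This is the last step before launching the script and collecting the results.

We define a benchmark function that takes our corpus and a Boolean flag for shuffling the data. For each extractor, it calls extract_keywords_from_corpus, which returns a dictionary containing that extractor's results. We store these dictionaries in a list.

For each algorithm in the list, we compute:

  • the average number of extracted keywords per document
  • the average number of matched keywords per document
  • a score equal to the average number of matches divided by the time the run took

We store all the data in a pandas DataFrame and then export it to a .csv file.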

import random

import numpy as np
import pandas as pd


def get_sec(time_str):
    """Get seconds from time."""
    h, m, s = time_str.split(':')
    return int(h) * 3600 + int(m) * 60 + int(s)


def benchmark(corpus, shuffle=True):
    """This function runs the benchmark for the keyword extraction algorithms"""
    logging.info("Starting benchmark...\n")

    # Shuffle the corpus (in place)
    if shuffle:
        random.shuffle(corpus)

    # extract keywords from corpus
    results = []
    extractors = [
        rake_extractor, 
        yake_extractor, 
        topic_rank_extractor, 
        position_rank_extractor,
        single_rank_extractor,
        multipartite_rank_extractor,
        keybert_extractor,
    ]
    for extractor in extractors:
        result = extract_keywords_from_corpus(extractor, corpus)
        results.append(result)

    # compute average number of extracted keywords
    for result in results:
        len_of_kw_list = []
        for kws in result["corpus_kws"].values():
            len_of_kw_list.append(len(kws))
        result["avg_keywords_per_document"] = np.mean(len_of_kw_list)

    # match keywords: replace each keyword list with a list of True/False flags
    for result in results:
        for idx, kws in result["corpus_kws"].items():
            match_results = []
            for kw in kws:
                match_results.append(match(kw))
            result["corpus_kws"][idx] = match_results

    # compute average number of matched keywords
    for result in results:
        len_of_matching_kws_list = []
        for idx, kws in result["corpus_kws"].items():
            len_of_matching_kws_list.append(len([kw for kw in kws if kw]))
        result["avg_matched_keywords_per_document"] = np.mean(len_of_matching_kws_list)
        # compute the average share of matching keywords (a ratio between 0 and 1), rounded to 2 decimals
        result["avg_percentage_matched_keywords"] = round(result["avg_matched_keywords_per_document"] / result["avg_keywords_per_document"], 2)

    # create score based on the avg number of matched keywords divided by time elapsed (in seconds)
    for result in results:
        # the 0.1 offset keeps the divisor non-zero when a run takes under a second
        elapsed_seconds = get_sec(result["elapsed_time"]) + 0.1
        # weigh the score based on the time elapsed
        result["performance_score"] = round(result["avg_matched_keywords_per_document"] / elapsed_seconds, 2)

    # delete corpus_kw
    for result in results:
        del result["corpus_kws"]

    # create results dataframe
    df = pd.DataFrame(results)
    df.to_csv("results.csv", index=False)
    logging.info("Benchmark finished. Results saved to results.csv")
    return df
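
To make the arithmetic concrete, here is a toy calculation of the three metrics on invented match results for three documents. All numbers are made up for illustration and the run time is an assumed value, so nothing here reflects the actual benchmark output.

```python
from statistics import mean

# Invented per-document match results: one boolean per extracted keyword.
match_results = {
    0: [True, True, False, True, False],
    1: [True, False, False],
    2: [True, True, True, True, False],
}
elapsed_seconds = 2.0 + 0.1  # assumed run time plus the 0.1 s offset used in benchmark()

# Average number of keywords extracted per document.
avg_keywords = mean(len(flags) for flags in match_results.values())
# Average number of keywords per document that matched a POS pattern.
avg_matched = mean(sum(flags) for flags in match_results.values())
# Share of extracted keywords that matched (the accuracy measure).
matched_ratio = round(avg_matched / avg_keywords, 2)
# Matched keywords per document, per second of run time.
performance_score = round(avg_matched / elapsed_seconds, 2)

print(avg_keywords, avg_matched, matched_ratio, performance_score)
```

Because the score divides by run time, a fast extractor with moderate accuracy can outrank a slow extractor with high accuracy.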

Results

results = benchmark(texts[:2000], shuffle=True)

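Here is the resulting report:

Let's visualize it:

According to the score formula we defined (avg_matched_keywords_per_document / time_elapsed_in_seconds), Rake processes the 2,000 documents in 2 seconds. Although it is less accurate than KeyBERT, the time factor makes it the winner.

If we consider accuracy alone, computed as the ratio of avg_matched_keywords_per_document to avg_keywords_per_document, we get these results:

From an accuracy standpoint, Rake also performs quite well. If we disregard time, KeyBERT is clearly the algorithm that extracts the most accurate and meaningful keywords. Rake comes second in accuracy, but by a wide margin.

If accuracy is what you need, KeyBERT is the clear first choice. If speed is the requirement, Rake is the clear first choice, because it is fast and its accuracy is still acceptable.

References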

Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C. and Jatowt, A. (2020). YAKE! Keyword Extraction from Single Documents using Multiple Local Features. In Information Sciences Journal. Elsevier, Vol 509, pp 257–289.

Campos R., Mangaravite V., Pasquali A., Jorge A.M., Nunes C., and Jatowt A. (2018). A Text Feature Based Automatic Keyword Extraction Method for Single Documents. In: Pasi G., Piwowarski B., Azzopardi L., Hanbury A. (eds). Advances in Information Retrieval. ECIR 2018 (Grenoble, France. March 26–29). Lecture Notes in Computer Science, vol 10772, pp. 684–691.

Campos R., Mangaravite V., Pasquali A., Jorge A.M., Nunes C., and Jatowt A. (2018). YAKE! Collection-independent Automatic Keyword Extractor. In: Pasi G., Piwowarski B., Azzopardi L., Hanbury A. (eds). Advances in Information Retrieval. ECIR 2018 (Grenoble, France. March 26–29). Lecture Notes in Computer Science, vol 10772, pp. 806–810.

Csurfer. (n.d.). CSURFER/Rake-nltk: Python implementation of the rapid automatic keyword extraction algorithm using NLTK. Retrieved November 25, 2021, from https://github.com/csurfer/rake-nltk

Liaad. (n.d.). Liaad/Yake: Single-document unsupervised keyword extraction. Retrieved November 25, 2021, from https://github.com/LIAAD/yake

Boudinfl. (n.d.). BOUDINFL/pke: Python keyphrase extraction module. Retrieved November 25, 2021, from https://github.com/boudinfl/pke

MaartenGr. (n.d.). MAARTENGR/Keybert: Minimal keyword extraction with bert. Retrieved November 25, 2021, from https://github.com/MaartenGr/KeyBERT

Explosion. (n.d.). Explosion/spacy: 💫 industrial-strength natural language processing (NLP) in Python. Retrieved November 25, 2021, from https://github.com/explosion/spaCy

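The author of the original article did not provide a link to the source code. If you are interested, you can ask the author on the original post: https://medium.com/@theDrewDag/keyword-extraction-a-benchmark-of-7-algorithms-in-python-8a905326d93f

Author: Andrea D'Agostino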
