[NLP: Named Entity Recognition] CRF++ Workflow

Key references

  • "Named Entity Recognition with CRF (Part 1)", Jianshu (jianshu.com): https://www.jianshu.com/p/12f2cdd86679
  • "Installing and Using CRF++ on Windows", feng_zhiyu's blog on CSDN: https://blog.csdn.net/feng_zhiyu/article/details/80793316

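Code practice

  • {B, M, E, S} format: B marks the first character of an entity, M a middle character, E the last character, and S a single-character entity
  • Note: pay attention to encoding and decoding details at every step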

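Generating the training/test data

  • Both the training data and the test data are generated in the format CRF++ expects: one token per line with its columns separated by whitespace, and a blank line between sequences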
    # -*- coding: utf-8 -*-
    import os

    # Directory that holds people-daily.txt. A raw string keeps the backslashes literal.
    home_dir = r"D:\Desktop\新Asian-Elephant\毕业\CRF\CRF++-0.58\YWP\199801"


    def splitWord(word):
        # Python 3 strings are Unicode, so this yields one character per item.
        return list(word)


    # 4-tag scheme: {B, M, E, S}
    def get4Tag(li):
        length = len(li)
        if length == 0:
            return []
        if length == 1:
            return ['S']
        if length == 2:
            return ['B', 'E']
        return ['B'] + ['M'] * (length - 2) + ['E']


    def saveDataFile(trainobj, testobj, isTest, word, handle):
        # Route the word to the test file or the training file.
        saveTrainFile(testobj if isTest else trainobj, word, handle)


    def saveTrainFile(fiobj, word, handle):
        if len(word) > 0:
            wordli = splitWord(word)
            tagli = get4Tag(wordli)
            # One row per character: character, POS tag, B/M/E/S tag.
            for w, t in zip(wordli, tagli):
                fiobj.write(w + '\t' + handle + '\t' + t + '\n')
        else:
            # An empty word means "end of sequence": write a blank line.
            fiobj.write('\n')


    # An alternative scheme (B, M, M1, M2, M3, E, S) is not implemented here.
    def convertTag():
        # Assumption: the corpus file is UTF-8; change the encoding (e.g. 'gbk') if yours differs.
        with open(os.path.join(home_dir, 'people-daily.txt'), 'r', encoding='UTF-8') as fiobj, \
             open(os.path.join(home_dir, 'train.data'), 'w', encoding='UTF-8') as trainobj, \
             open(os.path.join(home_dir, 'test.data'), 'w', encoding='UTF-8') as testobj:
            for i, a in enumerate(fiobj, start=1):
                a = a.strip('\r\n\t ')
                if a == "":
                    continue
                # Every 10th line of the corpus goes to the test set.
                test = (i % 10 == 0)
                for word in a.split(" "):
                    word = word.strip('\t ')
                    if len(word) == 0:
                        continue
                    # Strip the brackets of compound annotations such as "[w1/pos w2/pos]tag".
                    i1 = word.find('[')
                    if i1 >= 0:
                        word = word[i1 + 1:]
                    i2 = word.find(']')
                    if i2 > 0:
                        word = word[:i2]
                    word_hand = word.split('/')
                    if len(word_hand) < 2:
                        continue  # no POS tag attached
                    w, h = word_hand[0], word_hand[1]
                    if h == 'nr' and w.find('·') >= 0:
                        # Person name with a middle dot: tag each part separately.
                        # (The source indentation was lost; here other names fall through.)
                        for tmp in w.split('·'):
                            saveDataFile(trainobj, testobj, test, tmp, h)
                        continue
                    if h != 'm':  # numerals are skipped
                        saveDataFile(trainobj, testobj, test, w, h)
                    if h == 'w':  # punctuation closes the current sequence
                        saveDataFile(trainobj, testobj, test, "", "")


    # sys.argv is the list of arguments passed in from outside the program.
    # sys.argv[0] is the path of the script itself, so a user-supplied value would be sys.argv[1].
    if __name__ == '__main__':
        convertTag()
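
To make the target format concrete, here is a minimal sketch of the rows the script above produces for one short annotated sentence. The helper names `bmes` and `to_rows` and the sample sentence are illustrative assumptions, not part of the original script. Bracketed compounds and the special handling of `nr` and `m` are left out.

```python
def bmes(word):
    """Return one B/M/E/S tag per character of `word`."""
    n = len(word)
    if n == 0:
        return []
    if n == 1:
        return ["S"]
    return ["B"] + ["M"] * (n - 2) + ["E"]


def to_rows(sentence):
    """Turn 'word/pos word/pos ...' into CRF++ rows: char<TAB>pos<TAB>tag."""
    rows = []
    for token in sentence.split():
        word, pos = token.split("/", 1)
        for ch, tag in zip(word, bmes(word)):
            rows.append(f"{ch}\t{pos}\t{tag}")
        if pos == "w":
            rows.append("")  # blank line = sequence boundary for CRF++
    return rows


# Hypothetical sample written in the corpus's word/POS notation.
sample = "我/r 爱/v 北京/ns 天安门/ns 。/w"
rows = to_rows(sample)
print("\n".join(rows))
```

Each character becomes one row, a three-character word such as 天安门 is tagged B, M, E, and the punctuation mark is followed by an empty row that ends the sequence.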

Creating the feature template

  • Build a template from the chosen features and save it to a file named template

Simple template
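
The template itself is not reproduced in this text. As a stand-in, the sketch below generates a small template in CRF++'s `%x[row,col]` macro syntax. The window size, the choice of columns (0 for the character, 1 for the POS tag), and the helper names `build_template` and `write_template` are assumptions for illustration, not the author's actual template.

```python
def build_template(window=2, columns=(0, 1)):
    """Build unigram macros over a +/-window context, plus a bigram line."""
    lines = ["# Unigram"]
    idx = 0
    for col in columns:
        for offset in range(-window, window + 1):
            # %x[row,col]: row is relative to the current token, col is the column index.
            lines.append(f"U{idx:02d}:%x[{offset},{col}]")
            idx += 1
    # Combined feature: previous character together with the current character.
    lines.append(f"U{idx:02d}:%x[-1,0]/%x[0,0]")
    lines.append("")
    lines.append("# Bigram")
    lines.append("B")  # features over the previous and current output tags
    return "\n".join(lines) + "\n"


def write_template(path="template"):
    """Save the generated template so that crf_learn can read it."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(build_template())


template_text = build_template()
print(template_text)
```

Calling `write_template()` would write the file under the name `template` used by the training commands. Widening the window or adding more combined features is a common way to extend such a template.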

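Training and testing the model

Simple version

  • From the command line, train the model (output file: model); -a MIRA selects the MIRA training algorithm
    crf_learn -a MIRA template train.data model
  • From the command line, test the model; >> appends the result to output.txt
    crf_test -m model test.data >> output.txt

Full version

  • From the command line, train the model (output file: 4_model)
  • template: the template file; train.data: the generated training data; 4_model: the model file
  • -f 3 keeps only features that occur at least 3 times; -c 4.0 sets the hyperparameter C (larger values fit the training data more closely)
    crf_learn -f 3 -c 4.0 template train.data 4_model > 4_train.rst
  • From the command line, test the model; crf_test appends the predicted tag as the last column of each row
    crf_test -m 4_model test.data > 4_test.rst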
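
If you would rather drive the two commands from Python, the following is a minimal sketch using `subprocess`. It assumes `crf_learn` and `crf_test` are on `PATH`. The function names `crf_commands` and `run_pipeline` are hypothetical, and only flags that already appear in the commands above are used.

```python
import shutil
import subprocess


def crf_commands(template, train, test, model, freq=3, cost=4.0):
    """Return the argument lists for crf_learn and crf_test."""
    learn = ["crf_learn", "-f", str(freq), "-c", str(cost), template, train, model]
    tst = ["crf_test", "-m", model, test]
    return learn, tst


def run_pipeline(template="template", train="train.data", test="test.data",
                 model="4_model", train_log="4_train.rst", result="4_test.rst"):
    """Train, then test, redirecting stdout to files as the shell commands do."""
    learn, tst = crf_commands(template, train, test, model)
    if shutil.which(learn[0]) is None or shutil.which(tst[0]) is None:
        raise FileNotFoundError("crf_learn/crf_test not found on PATH")
    with open(train_log, "wb") as log:
        subprocess.run(learn, stdout=log, check=True)
    with open(result, "wb") as out:
        subprocess.run(tst, stdout=out, check=True)


learn_cmd, test_cmd = crf_commands("template", "train.data", "test.data", "4_model")
print(" ".join(learn_cmd))
print(" ".join(test_cmd))
```

Building the commands as lists avoids shell quoting problems with paths that contain spaces or non-ASCII characters. `run_pipeline()` is only defined here, not called, so the sketch can be read and tested without CRF++ installed.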

Evaluating the model

    import sys

    if __name__ == "__main__":
        try:
            file = open(sys.argv[1], "r", encoding='UTF-8')
        except (IndexError, OSError):
            print("result file is not specified, or open failed!")
            sys.exit(1)

        wc_of_test = 0     # words in the model's output
        wc_of_gold = 0     # words in the gold annotation
        wc_of_correct = 0  # predicted words whose tags all match the gold tags
        flag = True        # stays True while every tag of the current word matches

        with file:
            for l in file:
                if not l.strip():
                    continue  # blank line between sequences
                # Columns: character, POS tag, gold tag, predicted tag
                _, _, g, r = l.strip().split()
                if r != g:
                    flag = False
                if r in ('E', 'S'):
                    wc_of_test += 1
                    if flag:
                        wc_of_correct += 1
                    flag = True
                if g in ('E', 'S'):
                    wc_of_gold += 1

        print("WordCount from test result:", wc_of_test)
        print("WordCount from golden data:", wc_of_gold)
        print("WordCount of correct segs :", wc_of_correct)
        # Precision
        P = wc_of_correct / float(wc_of_test)
        # Recall
        R = wc_of_correct / float(wc_of_gold)
        print("P = %f, R = %f, F-score = %f" % (P, R, (2 * P * R) / (P + R)))

  • From the command line, run the evaluation script F-value.py
    python F-value.py 4_test.rst
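
As a worked example of the counting logic in the script above, the sketch below applies the same rule to a tiny in-memory tag sequence instead of a result file. The function name `word_prf` and the toy data are assumptions for illustration.

```python
def word_prf(pairs):
    """pairs: iterable of (gold_tag, predicted_tag), one pair per character."""
    n_pred = n_gold = n_correct = 0
    all_match = True  # True while every tag since the last predicted word end agrees
    for gold, pred in pairs:
        if gold != pred:
            all_match = False
        if pred in ("E", "S"):      # a predicted word ends here
            n_pred += 1
            if all_match:
                n_correct += 1
            all_match = True
        if gold in ("E", "S"):      # a gold word ends here
            n_gold += 1
    p = n_correct / n_pred if n_pred else 0.0
    r = n_correct / n_gold if n_gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Gold segmentation: 我 | 爱 | 北京 | 天安门      (4 words)
# Predicted:         我 | 爱 | 北京 | 天安 | 门   (5 words)
gold = ["S", "S", "B", "E", "B", "M", "E"]
pred = ["S", "S", "B", "E", "B", "E", "S"]
p, r, f = word_prf(zip(gold, pred))
print(f"P = {p:.3f}, R = {r:.3f}, F-score = {f:.3f}")
```

Three of the five predicted words are correct and the gold data has four words, so the sketch prints P = 0.600, R = 0.750, F-score = 0.667. Unlike the script, it returns 0.0 instead of raising an error when a denominator is zero.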


Simple evaluation results


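Reposted from: https://blog.csdn.net/YWP_2016/article/details/122087783
Copyright belongs to the original author, YWP_2016. If this repost infringes your rights, please contact us to have it removed.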
