1. 背景
json 字符串值数据示例:
{"score":1,"submitTime":1712491933,"answerFlag":1,"groupId":1755547960,"answers":[{"value":"[1, 2, 3]","ids":[4,5,6],"isPic":0,"duration":22314,"status":1,"tid":1},{"value":"aabbcc","lessons":[44,55,66],"isPic":0,"duration":22314,"status":2,"tid":2}],"questionType":65}
现在这个 json 字符串形式的字段值在 hive 表的某个字段中,我需要获取到 “answers” 这个 json 数组,然后将其按照数组长度,列转行到多行数据,每行数据一个子 json ,并且从中获取到每个子 json 的 “tid” 和 “status” 值,理想情况下,我需要这行数据处理完之后,结果如下表所示。
tidstatus1122
2. 常见方案
通过
ge_json_object
函数,先获取到 “answers” 对应的 json 数组字符串,然后通过正则替换掉
[
和
]
符号,之后将
},{
符号替换为
}我是分隔符{
,最后将
我是分隔符
作为
split
函数的分隔符号,将字符串分割,再通过
lateral view explode()
语法,将数组放到多行。多行数据,都是处理好的 json 对象,之后通过
get_json_object
函数后取需要的字段值即可。具体代码示例如下。
- 获取 “answers” 对应的 json 数组。
with table1 as(select'{"score":1,"submitTime":1712491933,"answerFlag":1,"groupId":1755547960,"answers":[{"value":"[1, 2, 3]","ids":[4,5,6],"isPic":0,"duration":22314,"status":1,"tid":1},{"value":"aabbcc","lessons":[44,55,66],"isPic":0,"duration":22314,"status":2,"tid":2}],"questionType":65}'as col)select get_json_object(col,'$.answers')as answersfrom table1
结果如下所示,正常获取到 “answers” 下的 json 数组,结果为字符串。answers[{“value”:“[1, 2, 3]”,“ids”:[4,5,6],“isPic”:0,“duration”:22314,“status”:1,“tid”:1},{“value”:“aabbcc”,“lessons”:[44,55,66],“isPic”:0,“duration”:22314,“status”:2,“tid”:2}] - 将最外层的
[
和]
符号去掉with table1 as(select'{"score":1,"submitTime":1712491933,"answerFlag":1,"groupId":1755547960,"answers":[{"value":"[1, 2, 3]","ids":[4,5,6],"isPic":0,"duration":22314,"status":1,"tid":1},{"value":"aabbcc","lessons":[44,55,66],"isPic":0,"duration":22314,"status":2,"tid":2}],"questionType":65}'as col)select regexp_replace(answers,'\\[|\\]','')as answersfrom(select get_json_object(col,'$.answers')as answers from table1 )as a
结果如下所示。answers{“value”:“1, 2, 3”,“ids”:4,5,6,“isPic”:0,“duration”:22314,“status”:1,“tid”:1},{“value”:“aabbcc”,“lessons”:44,55,66,“isPic”:0,“duration”:22314,“status”:2,“tid”:2}由于第一步处理的结果最外层是 json 数组,左右有[
和]
符号,但是由于内层还有子 json 数组,这种直接全局替换的方式,会将内层子 json 数组的[
、]
符号也一并去掉,可以查看下面的结果,"ids":4,5,6,
,就是因为全局替换,造成整个 json 结构被破坏,之后将无法使用get_json_object()
函数来获取想要的 key 对应的值了。
由此可见,这种方式只适合于 “answers” 下的 json 数组内的每个 json 对象中都只包含 json 对象才行,不能再包含 json 数组,否则就会造成处理错误,拿不到想要的数据。
3. 推荐方案
3.1 具体步骤
- 获取 “answers” 对应的 json 数组。
with table1 as(select'{"score":1,"submitTime":1712491933,"answerFlag":1,"groupId":1755547960,"answers":[{"value":"[1, 2, 3]","ids":[4,5,6],"isPic":0,"duration":22314,"status":1,"tid":1},{"value":"aabbcc","lessons":[44,55,66],"isPic":0,"duration":22314,"status":2,"tid":2}],"questionType":65}'as col)select get_json_object(col,'$.answers')as answersfrom table1
结果如下所示,正常获取到 “answers” 下的 json 数组,结果为字符串。answers[{“value”:“[1, 2, 3]”,“ids”:[4,5,6],“isPic”:0,“duration”:22314,“status”:1,“tid”:1},{“value”:“aabbcc”,“lessons”:[44,55,66],“isPic”:0,“duration”:22314,“status”:2,“tid”:2}] - 通过字符串截取的方式,将第一步的结果最前面和最后面的
[
、]
符号去掉with table1 as(select'{"score":1,"submitTime":1712491933,"answerFlag":1,"groupId":1755547960,"answers":[{"value":"[1, 2, 3]","ids":[4,5,6],"isPic":0,"duration":22314,"status":1,"tid":1},{"value":"aabbcc","lessons":[44,55,66],"isPic":0,"duration":22314,"status":2,"tid":2}],"questionType":65}'as col)select substring(answers,2, length(answers)-2)as answersfrom(select get_json_object(col,'$.answers')as answers from table1 )as a
结果如下所示,最前面和最后面的[
、]
符号已经被去掉。answers{“value”:“[1, 2, 3]”,“ids”:[4,5,6],“isPic”:0,“duration”:22314,“status”:1,“tid”:1},{“value”:“aabbcc”,“lessons”:[44,55,66],“isPic”:0,“duration”:22314,“status”:2,“tid”:2} - 最重要的一步:通过正则替换,只要匹配到
[.*]
内容,就直接替换为数字 0。with table1 as(select'{"score":1,"submitTime":1712491933,"answerFlag":1,"groupId":1755547960,"answers":[{"value":"[1, 2, 3]","ids":[4,5,6],"isPic":0,"duration":22314,"status":1,"tid":1},{"value":"aabbcc","lessons":[44,55,66],"isPic":0,"duration":22314,"status":2,"tid":2}],"questionType":65}'as col)select substring(answers,1, length(answers)-3)as answersfrom(select get_json_object(col,'$.answers')as answers from table1 )as a
结果如下所示,字符串中所有[.*]
的部分,都已经被替换为数字 0answers{“value”:“0”,“ids”:0,“isPic”:0,“duration”:22314,“status”:1,“tid”:1},{“value”:“aabbcc”,“lessons”:0,“isPic”:0,“duration”:22314,“status”:2,“tid”:2}将 json 数字替换为数字 0,是为了兼容"[.*]"
和[.*]
两种情况,不管 json 数组是不是被英文双引号包围,替换为数字 0,都是没问题的,都可以保证 json 的格式不被破坏。 - 由于已经去掉了所有的子 json 数组,之后就可以按照传统的方式,将
},{
替换为}我是分隔符号{
。with table1 as(select'{"score":1,"submitTime":1712491933,"answerFlag":1,"groupId":1755547960,"answers":[{"value":"[1, 2, 3]","ids":[4,5,6],"duration":22314,"status":1,"tid":1},{"value":"aabbcc","lessons":[44,55,66],"isPic":0,"duration":22314,"status":2,"tid":2}],"questionType":65}'as col)select regexp_replace(a.answers,'\\}\\,\\{','\\}我是分隔符号{')as answersfrom(select regexp_replace(answers,'\\[.*\\]','0')as answers from(select substring(answers,2, length(answers)-2)as answers from(select get_json_object(col,'$.answers')as answers from table1 )as a )as a )as a
answers{“value”:“0”,“ids”:0,“isPic”:0,“duration”:22314,“status”:1,“tid”:1}我是分隔符号{“value”:“aabbcc”,“lessons”:0,“isPic”:0,“duration”:22314,“status”:2,“tid”:2} - 通过
我是分隔符号
切换字符串,再通过lateral view explode()
语法将 json 数组展开,最后通过get_json_object()
函数,获取需要的值即可。with table1 as(select'{"score":1,"submitTime":1712491933,"answerFlag":1,"groupId":1755547960,"answers":[{"value":"[1, 2, 3]","ids":[4,5,6],"duration":22314,"status":1,"tid":1},{"value":"aabbcc","lessons":[44,55,66],"isPic":0,"duration":22314,"status":2,"tid":2}],"questionType":65}'as col)select get_json_object(b.answer,'$.tid')as tid, get_json_object(b.answer,'$.status')asstatusfrom(select regexp_replace(a.answers,'\\}\\,\\{','\\}我是分隔符号{')as answers from(select regexp_replace(answers,'\\[.*\\]','0')as answers from(select substring(answers,2, length(answers)-2)as answers from(select get_json_object(col,'$.answers')as answers from table1 )as a )as a )as a )as alateral view explode(split(regexp_replace(a.answers,'\\}\\,\\{','\\}我是分隔符号{'),'我是分隔符号')) b as answer
结果如下,可以看到,数据处理结果符合预期tidstatus1122
3.2 注意事项
上面的步骤 3,只适用于你想要展开最外层的这个 json 数组,并且完全不需要内部嵌套的子 json 数组才行,否则将内部的子 json 数组全部替换为数字 0 之后,你就获取不到子 json 数组数据了。
如果还想要内部的子 json 数组,单纯的 sql 应该是实现不了的,需要去自定义 udf,然后通过 java 代码一层一层解析了。
版权归原作者 第一片心意 所有, 如有侵权,请联系我们删除。