0


Presto: 是谁偷走了我的一天

Presto (including PrestoDB and PrestoSQL later re-branded to Trino) is a distributed query engine for big data using the SQL query language. Its architecture allows users to query data sources such as Hadoop, Cassandra, Kafka, AWS S3, Alluxio, MySQL, MongoDB and Teradata. One can even query data from multiple data sources within a single query. Presto is community driven open-source software released under the Apache License.

起因

最近在数据迁移的过程中遇到一个好玩的

bug,

当我们要计算两个日期的间隔的是时候需要调用一些数据分析组件内置的函数, 如下所示: 代码表示的含义很简单 就是计算8月9日和8月8日之间相隔天数, 但是这样的计算的结果竟然不相同: 竟然出现同样的时间段里

hive 

presto

多一天这种情况. 即在

hive

分析中得到是相隔一天, 但是在

presto

分析中得到数据只有0天 .

(组件版本号:

Hive:2.3.7
 Presto:0.240.1

)

---hive: 1select datediff('2022-08-09','2022-08-08 08:00:01.0')as b
---presto: 0select
  date_diff('day',
   cast('2022-08-08 08:00:01.0'astimestamp),
   cast('2022-08-09'astimestamp))as b

探究

遇事不決 量子力学; 问题是小问题, 但是假如运用到生活的例子是你发工资的时候计算的时长, 这不就少计算一天. 为了打工人, 一定要找到是谁偷走了presto的一天.
首先, 看了找了一下

https://github.com/apache/hive

datediff()

/**
 * UDFDateDiff.
 *
 * Calculate the difference in the number of days. The time part of the string
 * will be ignored. If dateString1 is earlier than dateString2, then the
 * result can be negative.
 *
 */@Description(name ="datediff",
    value ="_FUNC_(date1, date2) - Returns the number of days between date1 and date2",
    extended ="date1 and date2 are strings in the format "+"'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'. The time parts are ignored."+"If date1 is earlier than date2, the result is negative.\n"+"Example:\n "+"  > SELECT _FUNC_('2009-07-30', '2009-07-31') FROM src LIMIT 1;\n"+"  1")

官方文档说的很清楚,

yyyy-MM-dd HH:mm:ss

or

yyyy-MM-dd

. The time parts are ignored. 在计算的过程中, 会自动截取时间, 只用日期进行计算. 那么以此类推, 是不是

presto

没有进行截取呢? 可能吗

select cast('2022-08-09'astimestamp)as b

-- 2022-08-09 00:00:00.000-- 

果然,

2022-08-09

转成

timestamp

日期格式就是

2022-08-09 00:00:00.000

时间直接用午夜零点补齐了. 如果不转

timestamp

直接用日期可以吗?

select
  date_diff('day',
   cast('2022-08-08 08:00:01.0'astimestamp),'2022-08-09')as b

得报错了,貌似是格式不对,看来又得去官方源码找了, 从报错信息来看两者的日期类型需要对齐保持一样.

Unexpected parameters (varchar(3), timestamp, varchar(10)) for function date_diff. Expected:
date_diff(varchar(x), timestamp, timestamp) ,
date_diff(varchar(x), date, date) ,
date_diff(varchar(x), time, time) ,
date_diff(varchar(x), time with time zone, time with time zone) ,
date_diff(varchar(x), timestamp with time zone, timestamp with time zone)

https://github.com/prestodb/presto
......@Description("difference of the given times in the given unit")@ScalarFunction("date_diff")@LiteralParameters("x")@SqlType(StandardTypes.BIGINT)publicstaticlongdiffTimestamp(SqlFunctionProperties properties,@SqlType("varchar(x)")Slice unit,@SqlType(StandardTypes.TIMESTAMP)long timestamp1,@SqlType(StandardTypes.TIMESTAMP)long timestamp2){if(properties.isLegacyTimestamp()){returngetTimestampField(getChronology(properties.getTimeZoneKey()), unit).getDifferenceAsLong(timestamp2, timestamp1);}returngetTimestampField(UTC_CHRONOLOGY, unit).getDifferenceAsLong(timestamp2, timestamp1);}.....

从中定位到了

date_diff

函数, 这个其中当是

timestamp

的时候, 会调用这个

getDifferenceAsLong
/**
     * Computes the difference between two instants, as measured in the units
     * of this field. Any fractional units are dropped from the result. Calling
     * getDifference reverses the effect of calling add. In the following code:
     *
     * <pre>
     * long instant = ...
     * long v = ...
     * long age = getDifferenceAsLong(add(instant, v), instant);
     * </pre>
     *
     * The value 'age' is the same as the value 'v'.
     *
     * @param minuendInstant the milliseconds from 1970-01-01T00:00:00Z to
     * subtract from
     * @param subtrahendInstant the milliseconds from 1970-01-01T00:00:00Z to
     * subtract off the minuend
     * @return the difference in the units of this field
     */publicabstractlonggetDifferenceAsLong(long minuendInstant,long subtrahendInstant);

以上, 可以得出, 当不同

date 

类型 转成

timestamp

的时候 时间部分直接用

00:00:00.000

替代, 然后调用的

getDifferenceAsLong

是通过和

1970-01-01T00:00:00Z

进行比较 然后返回数值, 当其数值不大于24小时的部门, 自然不会计算当做一天. 所以当我们在计算时间间隔的时候 , 有时候精确度不需要那么苛刻的时候 , 建议

date_diff(varchar(x), date, date) 

参数进行计算.

由于时间仓促、水平有限,其中有许多不足之处在所难免,敬请各位大佬批评指正,给予多多留言指导。

[ 1 ] https://github.com/prestodb/presto
[ 2 ] https://github.com/apache/hive
[ 3 ] https://prestodb.io/

标签: java 开发语言

本文转载自: https://blog.csdn.net/jankin6/article/details/126193242
版权归原作者 孙中明 所有, 如有侵权,请联系我们删除。

“Presto: 是谁偷走了我的一天”的评论:

还没有评论