hive常用小技巧有哪些
这篇文章将为大家详细讲解有关hive常用小技巧有哪些,小编觉得挺实用的,因此分享给大家做个参考,希望大家阅读完这篇文章后可以有所收获。
成都创新互联是一家专注于网站建设、做网站与策划设计,武冈网站建设哪家好?成都创新互联做网站,专注于网站建设十载,网设计领域的专业建站公司;建站业务涵盖:武冈等地区。武冈做网站价格咨询:13518219792
1. parse_url如果查不到会返回 NULL
parse_url用来解析url中的数据。常用的就是解析HOST和QUERY。
string | parse_url(string urlString, string partToExtract [, string keyToExtract]) | Returns the specified part from the URL. Valid values for partToExtract include HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO. e.g. parse_url('http://facebook.com/path2/p.php?k1=v1&k2=v2#Ref1', 'HOST') returns 'facebook.com'. Also a value of a particular key in QUERY can be extracted by providing the key as the third argument, e.g. parse_url('http://facebook.com/path2/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1') returns 'v1'. |
select parse_url('http://www.meilishuo.com/guang/hot','QUERY','page') from class_method_map where parse_url('http://www.meilishuo.com/guang/hot','QUERY','page')<=0 limit 10;
Total MapReduce CPU Time Spent: 2 seconds 440 msec
OK
Time taken: 38.136 seconds
select parse_url('http://www.meilishuo.com/guang/hot','QUERY','page') from class_method_map where parse_url('http://www.meilishuo.com/guang/hot','QUERY','page') is NULL limit 10;
Total MapReduce CPU Time Spent: 2 seconds 660 msec
OK
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
Time taken: 40.133 seconds
select parse_url('http://www.meilishuo.com/guang/hot','QUERY','page') from class_method_map where parse_url('http://www.meilishuo.com/guang/hot','QUERY','page') is not NULL limit 10;
Total MapReduce CPU Time Spent: 2 seconds 530 msec
OK
Time taken: 34.121 seconds
2. 默认分区会导致查询出错,删掉时要确认表中(show partitions visitlogs)是否还有。尤其是使用外部表时要注意删除数据。
ALTER TABLE visitlogs DROP PARTITION (dt='2012-12-25', vhour='__HIVE_DEFAULT_PARTITION__');
hive>
> select sessidmodex(sessid,10), count(*), count(distinct sessid), count(distinct visitip) from visitlogs where ((dt='2012-11-10' and vhour>=13) or (dt='2012-11-11' and vhour<13) )
> AND ((class_name='goods' AND method_name='goods_poster' and uri like '%page=0%') OR (class_name='goods' AND method_name='hot' and parse_url(concat('http://www.meilishuo.com',uri),'QUERY','page') is NULL) ) AND not is_spam(dt,sessid,'SESSID') group by sessidmodex(sessid, 10);
FAILED: Hive Internal Error: java.lang.NullPointerException(null)
java.lang.NullPointerException
at org.apache.hadoop.hive.ql.optimizer.pcr.PcrExprProcFactory.opAnd(PcrExprProcFactory.java:128)
at org.apache.hadoop.hive.ql.optimizer.pcr.PcrExprProcFactory$GenericFuncExprProcessor.process(PcrExprProcFactory.java:267)
at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:89)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:88)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:125)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:102)
at org.apache.hadoop.hive.ql.optimizer.pcr.PcrExprProcFactory.walkExprTree(PcrExprProcFactory.java:450)
at org.apache.hadoop.hive.ql.optimizer.pcr.PcrOpProcFactory$FilterPCR.process(PcrOpProcFactory.java:149)
at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:89)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:88)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:125)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:102)
at org.apache.hadoop.hive.ql.optimizer.pcr.PartitionConditionRemover.transform(PartitionConditionRemover.java:78)
at org.apache.hadoop.hive.ql.optimizer.Optimizer.optimize(Optimizer.java:87)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7306)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:430)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:337)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:889)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:255)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:212)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:671)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:554)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
3. Array和Map的常用方法:
A[n] A is an Array and n is an int Returns the nth element in the array A. The first element has index 0 e.g. if A is an array comprising of ['foo', 'bar'] then A[0] returns 'foo' and A[1] returns 'bar' M[key] M is a Mapand key has type K Returns the value corresponding to the key in the map e.g. if M is a map comprising of {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'} then M['all'] returns 'foobar' array map_keys(Map ) Returns an unordered array containing the keys of the input map array map_values(Map ) Returns an unordered array containing the values of the input map boolean array_contains(Array , value) Returns TRUE if the array contains value
简单demo:查询qzone渠道来过的所有用户的记录
select * from user_session_stat where dt='2012-10-15' and array_contains(map_keys(market_from),'tx_qzone') limit 10;
visitips格式{"172.0.0.1":100,"172.0.0.1":20,"172.0.0.2":5} #IP:访问量, 按访问量排序
map_keys(visitips)[0]:获取访问量最多的IP
4. explode 打散数组和字典
select explode(map_keys(market_from)) as cc from user_session_stat where dt='2012-11-28' and size(map_keys(market_from))>1 limit 2;
select tmp.cc,count(*) from (select explode(map_keys(market_from)) as cc from user_session_stat where dt='2012-11-28' and size(map_keys(market_from))>1 limit 10) tmp group by cc;
详细:https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView
第一区间:9.8~9.18
第二区间:10.2~10.12
第三区间:10.21~11.01
第四区间:11.07~11.17
select tmp2.cc,count(*) from (
select explode(map_keys(uss.market_from)) as cc from (
select sessid from user_session_stat where dt='2012-09-18' and is_spam=0 and sub_channel='norefer') tmp join user_session_stat uss on tmp.sessid=uss.sessid where uss.dt>='2012-09-08' and uss.dt<='2012-09-17' and uss.is_spam=0 ) tmp2 group by tmp2.cc;
PS:
select explode(map_keys(uss.market_from)) as cc,uss.sessid from (
select sessid from user_session_stat where dt='2012-11-01' and is_spam=0 and sub_channel='norefer') tmp join user_session_stat uss on tmp.sessid=uss.sessid where uss.dt='2012-10-30' and uss.dt<='2012-10-31' and uss.is_spam=0
不支持:Error in semantic analysis: 1:51 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'sessid'
5. HIVE 的 LATERAL VIEW
===============================
http://stackoverflow.com/questions/11373543/explode-the-array-of-struct-in-hive
This is the below Hive Table
CREATE EXTERNAL TABLE IF NOT EXISTS SampleTable ( USER_ID BIGINT, NEW_ITEM ARRAY> ) And this is the data in the above table- 1015826235 [{"product_id":220003038067,"timestamps":"1340321132000"},{"product_id":300003861266,"timestamps":"1340271857000"}] Is there any way I can get the below output from the HiveQL after exploding the array? **USER_ID** | **PRODUCT_ID** | **TIMESTAMPS** ------------+------------------+---------------- 1015826235 220003038067 1340321132000 1015826235 300003861266 1340271857000 SELECT user_id, prod_and_ts.product_id as product_id, prod_and_ts.timestamps as timestamps FROM SampleTable LATERAL VIEW explode(new_item) exploded_table as prod_and_ts;======================
select * from user_session_stat LATERAL VIEW explode(map_keys(market_from)) channel_cc as prod_and_ts where dt='2012-12-23' limit 10;
000048693e1e5f7cd12113011144502a 2012-11-30 ["0"] ["","tq_shoppingbag1105d_121223","out_link"] {"123.171.149.104":21,"182.35.12.232":9} /paipaiguang/%E7%BE%BD%E7%BB%92%E6%9C%8D?PTAG=20036.9.213 Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER 8 organic norefer 0 SHANDONG JINAN 3079 {"norefer":1,"tx_weibo":1} 2012-12-23 norefer
000048693e1e5f7cd12113011144502a 2012-11-30 ["0"] ["","tq_shoppingbag1105d_121223","out_link"] {"123.171.149.104":21,"182.35.12.232":9} /paipaiguang/%E7%BE%BD%E7%BB%92%E6%9C%8D?PTAG=20036.9.213 Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER 8 organic norefer 0 SHANDONG JINAN 3079 {"norefer":1,"tx_weibo":1} 2012-12-23 tx_weibo
select sessid,sub_channel,market_from,prod_and_ts from user_session_stat LATERAL VIEW explode(map_keys(market_from)) channel_cc as prod_and_ts where dt='2012-12-23';
00002b681cb0e0ad81212232313359e3 norefer {"norefer":1} norefer
0000480e5b010d80c121223160448bdb norefer {"norefer":1} norefer
000048693e1e5f7cd12113011144502a norefer {"norefer":1,"tx_weibo":1} norefer
000048693e1e5f7cd12113011144502a norefer {"norefer":1,"tx_weibo":1} tx_weibo
00007a56dbbe2cb1e1212230031058e6 norefer {"norefer":1} norefer
00007be99d947ee54121223135026115 norefer {"norefer":1} norefer
6. 替换MySQL-->show tables like '%%’
show tables '*cpc*';
7. regexp_extract(uri, '/share/([0-9]+)', 1) 正则截取
官方UDF:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
8、hive本身强大的正则:
实例:表A,id= 110, 字段json
{"cnt":1,"value":[{"time":1390543558,"index":1,"level":190,"name":"A97M/AccessiV.B virus"},{"time":0,"index":0,"level":0,"name":" "},{"time":0,"index":0,"level":0,"name":" "}]}
查询, value第一个值里的 name :
select id, get_json_object(A.json, '$.value[0].name') from A id=110;
get_json_object
A limited version of JSONPath is supported:
$ : Root object
. : Child operator
[] : Subscript operator for array
* : Wildcard for []
Syntax not supported that's worth noticing:
: Zero length string as key
.. : Recursive descent
@ : Current object/element
() : Script expression
?() : Filter (script) expression.
[,] : Union operator
[start:end.step] : array slice operator
Example: src_json table is a single column (json), single row table:
+----+
json +----+ {
"store"
:
{
"fruit"
:\[{
"weight"
:
8
,
"type"
:
"apple"
},{
"weight"
:
9
,
"type"
:
"pear"
}],
"bicycle"
:{
"price"
:
19.95
,
"color"
:
"red"
}
},
"email"
:
"amy@only_for_json_udf_test.net"
,
"owner"
:
"amy" } +----+ |
The fields of the json object can be extracted using these queries:
hive> SELECT get_json_object(src_json.json,
'$.owner'
) FROM src_json; amy hive> SELECT get_json_object(src_json.json,
'$.store.fruit\[0]'
) FROM src_json; {
"weight"
:
8
,
"type"
:
"apple"
} hive> SELECT get_json_object(src_json.json,
'$.non_exist_key'
) FROM src_json; NULL |
关于“hive常用小技巧有哪些”这篇文章就分享到这里了,希望以上内容可以对大家有一定的帮助,使各位可以学到更多知识,如果觉得文章不错,请把它分享出去让更多的人看到。
文章标题:hive常用小技巧有哪些
地址分享:http://pcwzsj.com/article/gjsoeh.html