Handling newline characters in Hive
Problem description
I have created a table in Hive as:
Create table(id int, Description String)
My data looks something like this:
1|This will return corrupt data since there is a ',' in the first string. some text Change the data 2|There is prob in reading data sometext
After the data is loaded into Hive, the Description column cannot be read correctly because the default line terminator is '\n', so Hive displays a NULL value. Can anyone suggest how to handle the newlines before loading the data into Hive?
I know this question is old, but you have a couple of options. You can't control this with FIELDS TERMINATED BY, because that only controls what terminates the fields, not the records. Records in Hive are hard-coded to be terminated by the newline character (even though there is a LINES TERMINATED BY clause, it is not implemented).
- Write a custom InputFormat that uses a RecordReader that understands non-newline delimited records. Look at the code for LineReader/LineRecordReader and TextInputFormat.
- Use a format other than text/ASCII, like Parquet. I would recommend this regardless, as text is probably the worst format you can store data in anyway.
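Since the question asks about handling the newlines before loading, the record-grouping idea behind a custom RecordReader can also be sketched as a small preprocessing step outside Hive. The Python below is only an illustration, not Hive or Hadoop API code. It assumes that every real record starts with an integer id followed by '|' and that no continuation line does; the function name merge_records, the replacement character, and the sample lines are hypothetical.

```python
import re

# Assumption: a new record always starts with an integer id followed by '|'.
RECORD_START = re.compile(r"^\d+\|")


def merge_records(lines, replacement=" "):
    """Yield one logical record per item, joining continuation lines."""
    current = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if RECORD_START.match(line):
            # A new record begins, so emit the one collected so far.
            if current is not None:
                yield current
            current = line
        elif current is not None:
            # Continuation of the previous Description value.
            current += replacement + line
        # Lines that appear before the first record start are skipped.
    if current is not None:
        yield current


if __name__ == "__main__":
    # Hypothetical sample shaped like the data in the question.
    sample = [
        "1|This will return corrupt data.\n",
        "some text\n",
        "Change the data\n",
        "2|There is prob in reading data\n",
        "sometext\n",
    ]
    for record in merge_records(sample):
        print(record)
```

Each emitted record then occupies a single physical line, so a plain text table delimited by '|' can read it. The heuristic breaks if a continuation line itself begins with digits and a '|', which is one reason a format such as Parquet is the more robust choice.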