Emoji in R [UTF-8 encoding]
Question
I'm trying to do an emoji analysis in R. I have stored some tweets that contain emojis.
Here is one of the tweets that I want to analyze:
> tweetn2
[1] "Programme du week-end: \xed\xa0\xbd\xed\xb2\x83\xed\xa0\xbc \xed\xbe\xb6\xed\xa0\xbc
\xed\xbd\xbb\xed\xa0\xbc\xed\xbd\xbb\xed\xa0\xbc \xed\xbd\xbb\xed\xa0\xbc\xed\xbd\xbb"
To make sure that it is UTF-8:
> Encoding(tweetn2)
[1] "UTF-8"
Now when I try to match some of the characters, it does not work:
> grepl("\\xed",tweetn2)
[1] FALSE
> grepl("xed",tweetn2)
[1] FALSE
But it seems that the emoji bytes "\xed\xa0\xbd" are not valid "UTF-8", because I get an error message when I run:
> str(tweetn2)
Error in str.default(tweetn2) : invalid multibyte string, element 1
I found a kind of solution that uses the iconv() function and "ASCII" encoding here:
http://www.r-bloggers.com/emoticons-decoder-for-social-media-sentiment-analysis-in-r/
But I want to keep using "UTF-8" for my analysis because it works well with French special letters (à, é, è, ê, ë, û, etc.).
So do you have an idea how I can get around this?
Thanks
Recommended answer
The string is invalid UTF-8, as indicated. What you have there is UTF-16 that has been encoded as UTF-8: each surrogate was encoded as its own three-byte sequence. So \xED\xA0\xBD is the high surrogate U+D83D, and \xED\xB2\x83 is the low surrogate U+DC83.
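To see where those two values come from, here is a minimal R sketch that unpacks a three-byte UTF-8 sequence by hand. The helper name decode_3byte is made up for this illustration and is not a base R function; it assumes the input really is a three-byte sequence (lead byte 0xE0 to 0xEF followed by two continuation bytes) and does no validation.

```r
# decode_3byte is a hypothetical helper, not part of base R.
# A three-byte UTF-8 sequence has the bit layout 1110xxxx 10yyyyyy 10zzzzzz;
# the code point is the concatenation of the x, y and z bits.
decode_3byte <- function(bytes) {
  b <- as.integer(bytes)
  (b[1] %% 16) * 4096 +   # low 4 bits of the lead byte
    (b[2] %% 64) * 64 +   # low 6 bits of the first continuation byte
    (b[3] %% 64)          # low 6 bits of the second continuation byte
}

high <- decode_3byte(as.raw(c(0xED, 0xA0, 0xBD)))
low  <- decode_3byte(as.raw(c(0xED, 0xB2, 0x83)))

sprintf("U+%04X", as.integer(high))  # "U+D83D"
sprintf("U+%04X", as.integer(low))   # "U+DC83"
```

Both results fall in the surrogate ranges (U+D800 to U+DBFF for high, U+DC00 to U+DFFF for low), which are not allowed in well-formed UTF-8. That is why R reports an invalid multibyte string.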
If you apply the standard (high, low) -> code point formula for surrogate pairs, you end up with the actual code point:
(0xD83D - 0xD800) * 0x400 + 0xDC83 - 0xDC00 + 0x10000 = 0x1F483
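The same arithmetic can be checked in R. This is only a sketch: surrogate_to_codepoint is a name invented for this example, not an existing R function.

```r
# surrogate_to_codepoint is a hypothetical helper, not part of base R.
# It combines a UTF-16 surrogate pair into a single Unicode code point.
surrogate_to_codepoint <- function(high, low) {
  stopifnot(high >= 0xD800, high <= 0xDBFF,  # must be a high surrogate
            low  >= 0xDC00, low  <= 0xDFFF)  # must be a low surrogate
  (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000
}

cp <- surrogate_to_codepoint(0xD83D, 0xDC83)
sprintf("U+%X", as.integer(cp))  # "U+1F483"

# intToUtf8() turns the code point into a properly encoded UTF-8 string.
emoji <- intToUtf8(cp)
```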
You'll see this is the dancer emoji. Unfortunately I don't have a concrete suggestion for you, as I'm not that familiar with R. But I can say you'd certainly want to get yourself into a position where this data is not double encoded. I hope that helps point you in the right direction.
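For readers who do want to try this in R, one possible approach is to work on the raw bytes and rewrite every encoded surrogate pair as the real four-byte UTF-8 sequence. The sketch below is an illustration under assumptions, not a tested solution for the asker's data: fix_surrogates is a made-up name, it only handles a single string, and it copies any lone (unpaired) surrogate through unchanged, so such a string would still be invalid afterwards.

```r
# fix_surrogates is a hypothetical helper, not part of base R.
# It scans the bytes of one string and replaces each 6-byte sequence
# ED [A0-AF] xx ED [B0-BF] xx (a surrogate pair encoded as UTF-8)
# with the proper UTF-8 encoding of the combined code point.
fix_surrogates <- function(x) {
  b <- as.integer(charToRaw(x))
  n <- length(b)
  out <- raw(0)
  i <- 1L
  is_cont <- function(v) v >= 0x80 && v <= 0xBF  # UTF-8 continuation byte
  while (i <= n) {
    is_pair <- i + 5L <= n &&
      b[i] == 0xED && b[i + 1L] >= 0xA0 && b[i + 1L] <= 0xAF &&
      is_cont(b[i + 2L]) &&
      b[i + 3L] == 0xED && b[i + 4L] >= 0xB0 && b[i + 4L] <= 0xBF &&
      is_cont(b[i + 5L])
    if (is_pair) {
      high <- 0xD000 + (b[i + 1L] %% 64) * 64 + (b[i + 2L] %% 64)
      low  <- 0xD000 + (b[i + 4L] %% 64) * 64 + (b[i + 5L] %% 64)
      cp   <- (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000
      out  <- c(out, charToRaw(intToUtf8(cp)))
      i <- i + 6L
    } else {
      out <- c(out, as.raw(b[i]))  # ordinary byte: copy as-is
      i <- i + 1L
    }
  }
  res <- rawToChar(out)
  Encoding(res) <- "UTF-8"
  res
}

# Build the broken dancer emoji from its six bytes, then repair it.
broken <- rawToChar(as.raw(c(0xED, 0xA0, 0xBD, 0xED, 0xB2, 0x83)))
fixed  <- fix_surrogates(broken)
validUTF8(fixed)  # TRUE
```

Because every byte outside a surrogate pair is copied unchanged, accented letters such as é (already valid UTF-8) are preserved, so the rest of the analysis can stay in UTF-8.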