str_word_count for UTF8 text
Question
I have this text:
$text = "Başka, küskün otomobil kaçtı buraya küskün otomobil neden kaçtı
kaçtı buraya, oraya KISMEN @here #there J.J.Johanson hep.
Danny:Where is mom? I don't know! Café est weiß for 2 €uros.
My 2nd nickname is mike18.";
Currently I am using this:
$a1= array_count_values(str_word_count($text, 1, 'ÇçÖöŞşİIıĞğÜü@#é߀1234567890'));
arsort($a1);
You can check with this fiddle:
http://ideone.com/oVUGYa
But this solution doesn't solve all UTF8 problems. I can't pass the whole UTF8 character set to str_word_count as a parameter.
So I created this:
$wordsArray = explode(" ",$text);
foreach ($wordsArray as $k => $w) {
$wordsArray[$k] = str_replace(array(",","."),"",$w);
}
$wordsArray2 = array_count_values($wordsArray);
arsort($wordsArray2);
The output should look like this:
Array (
[kaçtı] => 3
[küskün] => 2
[buraya] => 2
[@here] => 1
[#there] => 1
[Danny] => 1
[mom] => 1
[don't] => 1
[know] => 1
...
...
)
This works well, but it doesn't cover all sentence and word problems. For example, I removed commas and dots with str_replace.
For example, this solution doesn't cover words like these: Hello Mike,how are you ?
Mike and how won't be treated as different words.
This isn't covered by the str_word_count solution: KISMEN @here #there. The at sign and the hash sign won't be taken into consideration.
This will not be covered: J.J.Johanson. Although it is one word, it will be treated as JJJohanson.
Question marks and exclamation marks should be removed from words.
Is there a better way to get str_word_count behaviour with UTF8 support? The $text at the top of this question is my reference.
(It would be better if you could provide a fiddle with your answer.)
Accepted answer
You will never have a perfect word-count solution, because the concept of a word count does not exist, or is too hard to define, in some languages. Whether the text is UTF8 or not does not matter.
Japanese and Chinese do not use spaces to separate words. They don't even have a static word list; you have to read the whole sentence before you can find the verbs and nouns.
If you want to support multiple languages, you will need a language-specific tokenizer engine. For more information, you can research full-text indexing, tokenizers, CJK tokenizers, and CJK analyzers.
If you only want to support a limited set of selected languages, just keep improving your regex patterns to handle more and more cases.
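As a rough illustration of the regex approach for the sample text in the question, here is a minimal sketch, not a definitive implementation. The function name utf8_word_counts is hypothetical, and the pattern is an assumption tuned to that sample: it uses Unicode property classes with the /u modifier, allows an optional leading @ or #, keeps an inner apostrophe or dot when more letters follow, and includes the euro sign only because the sample contains "€uros".

```php
<?php
// Hypothetical helper (not a built-in): counts words with a Unicode-aware regex.
function utf8_word_counts(string $text): array
{
    // [@#]?                        optional leading @ or #
    // [\p{L}\p{M}\p{N}\x{20AC}]+   letters, combining marks, digits, euro sign
    // (?:['.][\p{L}\p{M}\p{N}]+)*  inner apostrophe or dot followed by more
    //                              letters/digits (don't, J.J.Johanson)
    $pattern = '/[@#]?[\p{L}\p{M}\p{N}\x{20AC}]+(?:[\'.][\p{L}\p{M}\p{N}]+)*/u';

    preg_match_all($pattern, $text, $matches);

    // Count each distinct token, then sort by frequency, highest first.
    $counts = array_count_values($matches[0]);
    arsort($counts);

    return $counts;
}

print_r(utf8_word_counts("Hello Mike,how are you ? J.J.Johanson doesn't know @here"));
```

With this pattern, "Mike,how" is split into Mike and how, trailing punctuation such as "hep." or "know!" is dropped, and J.J.Johanson stays one token. It still makes arbitrary choices: an abbreviation like "e.g." loses its final dot, typographic apostrophes are not handled, and text in languages that do not separate words with spaces is not tokenized meaningfully.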