Quick Start Guide - jiebaR

jiebaR is an R package for Chinese text segmentation, keyword extraction, and part-of-speech tagging.

Example

Text Segmentation

Use worker() to initialize a segmentation engine (a "worker"), then use [] or segment() to segment text.
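
As a minimal sketch of that workflow, assuming jiebaR is installed and the default dictionaries are used, the two calling styles might look like this. The variable names are illustrative only.

```r
library(jiebaR)

# Initialize a segmentation worker with the default settings.
cutter <- worker()

# Two equivalent ways to segment a string with the same worker.
words_bracket <- cutter["This is a good day!"]
words_segment <- segment("This is a good day!", cutter)

print(words_segment)
```

Both calls go through the same engine, so they should return the same character vector of words.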

## Loading required package: jiebaRD
## [1] "This" "is"   "a"    "good" "day"

You can also pass a file path as input.
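
The sketch below passes a path to an existing temporary file; the file name and its contents are made up for illustration. Depending on the worker's write setting, the result may be written to an output file (with its path returned) rather than returned as words, so no particular output is assumed here. A string that does not name an existing file appears to be segmented as ordinary text, which would be consistent with the output shown below.

```r
library(jiebaR)

cutter <- worker()

# Create a small input file so that the path really exists.
input_path <- tempfile(fileext = ".txt")
writeLines("This is a good day!", input_path)
stopifnot(file.exists(input_path))

# Pass the path instead of the text itself.
result <- segment(input_path, cutter)
print(result)
```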

## [1] "temp" "dat"

You can initialize multiple engines simultaneously.
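
For instance, a sketch with two independent workers that use different segmentation types and settings could look like the following. The type names "mix" and "mp" and the symbol argument are options of worker(); the variable names are illustrative.

```r
library(jiebaR)

# Each call to worker() creates a separate engine with its own settings.
mix_cutter <- worker(type = "mix")
mp_cutter  <- worker(type = "mp", symbol = TRUE)  # this one keeps punctuation

text <- "This is a good day!"
mix_words <- segment(text, mix_cutter)
mp_words  <- segment(text, mp_cutter)

print(mix_words)
print(mp_words)
```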

The public settings of a worker can be modified with $, for example cutter$symbol = T. Private settings are fixed when the engine is initialized; you can inspect them with cutter$PrivateVarible.
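
A short sketch of reading and changing public settings follows. It assumes the default worker exposes encoding, detect, and symbol as public fields, which matches the kind of output shown below.

```r
library(jiebaR)

cutter <- worker()

# Read public settings.
print(cutter$encoding)
print(cutter$detect)

# Modify public settings on the existing worker.
cutter$detect <- FALSE
cutter$symbol <- TRUE   # keep punctuation in the segmentation result
print(cutter$detect)

# Private settings are read-only after initialization.
print(cutter$PrivateVarible)
```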

## [1] "UTF-8"
## [1] TRUE
## [1] FALSE

You can use a custom dictionary. jiebaR is able to identify new words, but adding your own words ensures higher accuracy. imewlconverter is a good tool for dictionary construction.
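
One possible way to supply your own words is the user argument of worker(), sketched below. The dictionary file, the word in it, and the sample sentence are made up for illustration, and the one-word-per-line layout is an assumption about the user dictionary format; check the bundled dictionaries in the directory reported by show_dictpath() for the exact layout.

```r
library(jiebaR)

# Show the directory that holds the default dictionaries.
show_dictpath()

# Write a small UTF-8 user dictionary (hypothetical content, one word per line).
user_dict <- tempfile(fileext = ".utf8")
con <- file(user_dict, open = "w", encoding = "UTF-8")
writeLines("云计算", con)
close(con)

# Initialize a worker that loads the custom dictionary.
cutter <- worker(user = user_dict)
words <- segment("我喜欢云计算", cutter)
print(words)
```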

## [1] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/jiebaRD/dict"

Speech Tagging

The part-of-speech tagging functions, [ applied to a tagger worker and tagging(), label each word in a sentence after segmentation, using tags compatible with ICTCLAS.
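
A minimal sketch, assuming a worker of type "tag" and illustrative variable names:

```r
library(jiebaR)

# A tagging worker segments the text and labels every word.
tagger <- worker("tag")

tags_bracket <- tagger["hello world"]
tags_fn      <- tagging("hello world", tagger)

# The result is a character vector of words; the names hold the tags.
print(tags_fn)
```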

##     eng     eng 
## "hello" "world"

Keyword Extraction

The keyword extraction worker uses the MixSegment model to segment the text and the TF-IDF algorithm to rank the keywords.
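
As a sketch, a worker of type "keywords" with topn = 1 returns only the highest-ranked keyword. The sample sentence is illustrative.

```r
library(jiebaR)

# topn controls how many keywords are returned.
keys <- worker("keywords", topn = 1)

top_keyword <- keywords("words of fun", keys)

# The names of the result hold the scores; the values are the keywords.
print(top_keyword)
```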

## 11.7392 
##   "fun"

Simhash Distance

A simhash worker extracts keywords from the input and computes a simhash value from them. Given two inputs, it can also compute the Hamming distance between their simhash values.
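
Both operations can be sketched as follows, assuming a worker of type "simhash" and two illustrative input strings:

```r
library(jiebaR)

simhasher <- worker("simhash", topn = 1)

# Simhash value and keywords of a single input.
hash_result <- simhash("hello world", simhasher)
print(hash_result)

# Hamming distance between the simhash values of two inputs.
dist_result <- distance("hello world", "hello world!", simhasher)
print(dist_result)
```

A distance of 0 means the two inputs produced identical simhash values.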

## $simhash
## [1] "3804341492420753273"
## 
## $keyword
## 11.7392 
## "hello"

## $distance
## [1] 0
## 
## $lhs
## 11.7392 
## "hello" 
## 
## $rhs
## 11.7392 
## "hello"

More Docs

See https://jiebaR.qinwf.com/

More Information and Issues

https://github.com/qinwf/jiebaR

https://github.com/yanyiwu/cppjieba