token_chars: the character classes included in the generated tokens; the default is all classes. Official parameter reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html

4. Conclusion

The essence of ngram tokenization is trading space for time: matching can only succeed because the text was already cut into tokens of min_gram to max_gram length at write time. When the data volume is very small and substring highlighting is not required...
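To make the space cost concrete, here is a rough count with illustrative numbers (not from the original article): a single token of length n expanded with min_gram = 1 and max_gram = 5 produces (n) + (n-1) + (n-2) + (n-3) + (n-4) grams. For the 13-letter word "elasticsearch" that is 13 + 12 + 11 + 10 + 9 = 55 index terms instead of one, and the same multiplication happens for every token of every document; that is the space being traded for fast, match-style substring queries at search time.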
token_chars: the character classes that should be included in tokens. Elasticsearch splits the text on characters that do not belong to the specified classes; the default is [] (keep all characters). The available classes are:
letter: letters, e.g. a, b, ï or 京;
digit: digits, e.g. 3 or 7;
whitespace: whitespace characters, e.g. a space or the newline \n;
punctuation: punctuation, e.g. ! or ";
symbol: symbols, e.g. $ or √.
min_gram and max_gram are usually set...
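A minimal sketch of how token_chars acts as a split rule (the index name, analyzer name and sample text below are illustrative, not from the original):

PUT token_chars_demo
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "bigram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 2,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "bigram_analyzer": {
          "tokenizer": "bigram_tokenizer"
        }
      }
    }
  }
}

POST token_chars_demo/_analyze
{
  "analyzer": "bigram_analyzer",
  "text": "foo-bar 12"
}

Because "-" and the space are not listed in token_chars, the text is first broken into foo, bar and 12, and only then cut into bigrams: fo, oo, ba, ar, 12.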
"tokenizer":"ngram_tokenizer" } }, "tokenizer":{ "ngram_tokenizer":{ "token_chars":[ "letter", "digit", "punctuation" ], "type":"ngram", "max_gram":"1" } } } } 添加别名或删除别名:POST /_aliases 或 PUT /index/_alias/name ,如:(remove表示删除别名,add表示添加别名) POST /_...
If the difference between max_gram and min_gram exceeds the default limit of 1, index.max_ngram_diff must be raised accordingly. Usage example (nGram is the legacy spelling of the tokenizer type; current versions use ngram):

{
  "settings": {
    "index.max_ngram_diff": 4,
    "analysis": {
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "nGram",
          "min_gram": 1,
          "max_gram": 5,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "ngram_...
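The analyzer block is truncated above; a sketch of how it is typically wired to the tokenizer (the analyzer and index names are assumptions, not from the original):

PUT ngram_diff_demo
{
  "settings": {
    "index.max_ngram_diff": 4,
    "analysis": {
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 5,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "ngram_analyzer": {
          "tokenizer": "ngram_tokenizer"
        }
      }
    }
  }
}

Without "index.max_ngram_diff": 4 this request is rejected, because max_gram - min_gram = 4 exceeds the default limit of 1.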
The N-gram tokenizer is a tokenization technique commonly used in text processing and search: it breaks text into fixed-length substrings (n-grams). This makes it particularly suitable for scenarios such as fuzzy search, spelling correction and auto-completion. Tokenizer settings:

"tokenizer": {
  "ngram_tokenizer": {
    "token_chars": ["letter", "digit"],
    "min_gram": "1",
    "type": "ngram",
    "max_...
"token_chars":[ "letter", "digit" ] } } } }, "mappings":{ "medicalrecord":{ "properties":{ "fullFieldName":{ "type":"keyword", "fields":{ "ngramFullFieldName":{ "type":"text", "analyzer":"ngram_analyzer" }, "ikFullFieldName":{ ...
If you do not find an analyzer suitable for your needs, you can create a custom analyzer which combines the appropriate character filters, tokenizer, and token filters.

1.3 Char Filter

Character filters are used to preprocess the stream of characters before it is passed to the tokenizer. ...
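A small sketch of such a custom analyzer combining a character filter, the ngram tokenizer and a token filter; html_strip and lowercase are standard built-ins chosen here for illustration, and this exact combination is an assumption rather than something taken from the original text:

PUT custom_analyzer_demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "my_ngram",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "my_ngram": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 2,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  }
}

The character filter cleans the raw text (here, stripping HTML tags), the tokenizer cuts it into ngrams, and the token filter post-processes each token (here, lowercasing).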
"tokenizer": "my_ngram", "filter": [ "pinyin_filter" ] } }, "tokenizer": { "my_ngram": { "type": "ngram", "min_gram": 1, "max_gram": 50, "token_chars": [ "letter", "digit", "punctuation", "symbol" ] } }, "filter": { "pinyin_filter": { "type": "pinyin", ...
PUT my_index{"settings": {"analysis": {"analyzer": {"my_analyzer": {"tokenizer": "my_tokenizer"}},"tokenizer": {"my_tokenizer": {"type": "ngram","min_gram": 3,"max_gram": 3,"token_chars": ["letter","digit"]}}}POST my_index/_analyze{"analyzer": "my_analyzer","text"...
"token_chars": ["letter", "digit", "punctuation", "symbol"], "min_gram": "1", "type": "nGram", "max_gram": "1" } } } } }, "mappings": { "doc": { "properties": { "id": { "type": "long" }, "pd_name": { ...