ES analyzer: the main building blocks

Anatomy of an analyzer

An analyzer — whether built-in or custom — is just a package which contains three lower-level building blocks: character filters, tokenizers, and token filters.

The built-in analyzers pre-package these building blocks into analyzers suitable for different languages and types of text. Elasticsearch also exposes the individual building blocks so that they can be combined to define new custom analyzers.
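As a quick illustration, a custom analyzer that wires together one of each building block could be defined in the index settings like this (a sketch: the index name my-index, the analyzer name my_custom_analyzer, and the particular filters chosen are placeholders, not part of the original text):

PUT my-index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {
                    "type": "custom",
                    "char_filter": ["html_strip"],
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding"]
                }
            }
        }
    }
}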

Character filters


A character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters.

For instance, a character filter could be used to convert Hindu-Arabic numerals (٠‎١٢٣٤٥٦٧٨‎٩‎) into their Arabic-Latin equivalents (0123456789), or to strip HTML elements like <b> from the stream.

An analyzer may have zero or more character filters, which are applied in order.
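A character filter can be tried out directly with the _analyze API; for instance, stripping HTML before tokenization (a sketch, the sample text is made up):

POST _analyze
{
    "tokenizer": "standard",
    "char_filter": ["html_strip"],
    "text": "<b>Quick</b> brown fox!"
}

The html_strip character filter removes the <b> tags before the tokenizer sees the text, so the returned tokens are [Quick, brown, fox].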

Tokenizer


A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens.

For instance, a whitespace tokenizer breaks text into tokens whenever it sees any whitespace. It would convert the text "Quick brown fox!" into the terms [Quick, brown, fox!].

The tokenizer is also responsible for recording the order or position of each term and the start and end character offsets of the original word which the term represents.

An analyzer must have exactly one tokenizer.
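The _analyze API shows this behaviour, including the positions and offsets the tokenizer records:

POST _analyze
{
    "tokenizer": "whitespace",
    "text": "Quick brown fox!"
}

The response lists the terms [Quick, brown, fox!], each with a position and with start_offset/end_offset pointing back into the original text.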

Token filters


A token filter receives the token stream and may add, remove, or change tokens.

For example, a lowercase token filter converts all tokens to lowercase, a stop token filter removes common words (stop words) like "the" from the token stream, and a synonym token filter introduces synonyms into the token stream.

Token filters are not allowed to change the position or character offsets of each token.

An analyzer may have zero or more token filters, which are applied in order.
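Token filters can be tested the same way; for example, chaining the lowercase and stop filters (a sketch, the sample text is made up):

POST _analyze
{
    "tokenizer": "standard",
    "filter": ["lowercase", "stop"],
    "text": "The QUICK Brown Foxes"
}

The standard tokenizer produces [The, QUICK, Brown, Foxes], the lowercase filter turns them into [the, quick, brown, foxes], and the stop filter then drops the stop word, leaving [quick, brown, foxes].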

Normalizers


Normalizers are similar to analyzers except that they may only emit a single token.

As a consequence, they do not have a tokenizer and only accept a subset of the available char filters and token filters.

Only the filters that work on a per-character basis are allowed.

For instance, a lowercasing filter would be allowed, but not a stemming filter, which needs to look at the keyword as a whole.

The current list of filters that can be used in a normalizer is as follows:

  • arabic_normalization
  • asciifolding
  • bengali_normalization
  • cjk_width
  • decimal_digit
  • elision
  • german_normalization
  • hindi_normalization
  • indic_normalization
  • lowercase
  • persian_normalization
  • scandinavian_folding
  • serbian_normalization
  • sorani_normalization
  • uppercase

Summary
  • Character filters transform the raw source text before tokenization
  • The tokenizer decides how the source text is split into tokens
  • Token filters transform the tokens produced by the tokenizer
  • Normalizers emit only a single token and are used with keyword fields (since keyword fields are not tokenized)



Making a keyword field case-insensitive

Define a custom normalizer in the index settings:

"analysis": {
    "normalizer": {
        "standard_lowercase": {
            "filter": [
                "lowercase"
            ],
            "type": "custom"
        }
    }
},

》Then reference it in the field mapping:

"application_name": {
    "type": "keyword",
    "normalizer": "standard_lowercase"
},
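
With this normalizer, the field value is lowercased at index time and term-level query values are lowercased at search time, so lookups become case-insensitive. A minimal check could look like this (the index name my-index, the document ID, and the field values are made-up examples):

PUT my-index/_doc/1
{
    "application_name": "MyApp"
}

GET my-index/_search
{
    "query": {
        "term": {
            "application_name": "MYAPP"
        }
    }
}

After a refresh, the term query value "MYAPP" is normalized to "myapp" and matches the stored value, which was normalized the same way when the document was indexed.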