
[ES] Token Filter Summary

What is a Token Filter?

tokenizer์—์„œ term ๋ถ„๋ฆฌ ๊ณผ์ • ์ดํ›„์—๋Š” ๋ถ„๋ฆฌ๋œ ๊ฐ๊ฐ์˜ term๋“ค์ด ์ง€์ •๋œ ๊ทœ์น™์— ๋”ฐ๋ผ์„œ ์ฒ˜๋ฆฌ๋˜๋Š”๋ฐ ์ด ์—ญํ™œ์„ token filter๊ฐ€ ์ง„ํ–‰ํ•œ๋‹ค. token filter๋Š” filterํ•ญ๋ชฉ์— ๋ฐฐ์—ด๋กœ ์ง€์ •ํ•ด์•ผ ํ•˜๊ณ , ์ง€์ •ํ•œ ๋ฐฐ์—ด ์ˆœ์„œ๋Œ€๋กœ ํ•„ํ„ฐ๊ฐ€ ๋™์ž‘ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ˆœ์„œ๋ฅผ ์ž˜ ๊ณ ๋ คํ•ด์•ผ ํ•œ๋‹ค.
์˜ˆ๋ฅผ ๋“ค์–ด
"I'm learning Elasticsearch"
Plain Text
๋ณต์‚ฌ
whitespace tokenizer๋กœ ๋‚˜๋ˆ„๋ฉด
["I'm", "learning", "Elasticsearch"]
JSON
๋ณต์‚ฌ
์ด๊ฑธ lowercase, stop ํ† ํฐ ํ•„ํ„ฐ๋ฅผ ์ ์šฉํ•˜๋ฉด
["learning", "elasticsearch"]
JSON
๋ณต์‚ฌ

์ฃผ์š” Token Filter ์ข…๋ฅ˜

1. lowercase

๋ชจ๋“  term๋“ค์„ ์†Œ๋ฌธ์ž๋กœ ๋ฐ”๊พผ๋‹ค
POST _analyze { "tokenizer": "whitespace", "filter": ["lowercase"], "text": "Elasticsearch IS Awesome" }
JSON
๋ณต์‚ฌ
["elasticsearch", "is", "awesome"]
JSON
๋ณต์‚ฌ

2. stop

๋ถˆ์šฉ์–ด(stopwords)๋ฅผ ์ œ๊ฑฐํ•œ๋‹ค (ex. is, the, a, an โ€ฆ)
POST _analyze { "tokenizer": "whitespace", "filter": ["stop"], "text": "this is a test" }
JSON
๋ณต์‚ฌ
["test"]
JSON
๋ณต์‚ฌ

3. stemmer

์–ด๊ทผ(stem)๋งŒ ๋‚จ๊ธด๋‹ค. (ex. running โ†’ run)
POST _analyze { "tokenizer": "standard", "filter": ["stemmer"], "text": "running runs runner" }
JSON
๋ณต์‚ฌ
["run", "run", "runner"]
JSON
๋ณต์‚ฌ

4. edge_ngram

term์„ ์•ž๋ถ€๋ถ„์—์„œ๋ถ€ํ„ฐ ์ž๋ฅธ๋‹ค. ๋ณดํ†ต ์ž๋™ ์™„์„ฑ์—์„œ ๋งŽ์ด ์‚ฌ์šฉ๋œ๋‹ค.
POST _analyze { "tokenizer": "standard", "filter": [ { "type": "edge_ngram", "min_gram": 2, "max_gram": 4 } ], "text": "search" }
JSON
๋ณต์‚ฌ
["se", "sea", "sear"]
JSON
๋ณต์‚ฌ