What is a Token Filter?
After the tokenizer has split the text into terms, each term is processed according to the rules you specify; this is the job of the token filter. Token filters are listed as an array in the filter field, and since they run in the order given in the array, that order must be chosen carefully.
For example, take
"I'm learning Elasticsearch"
Splitting it with the whitespace tokenizer gives
["I'm", "learning", "Elasticsearch"]
Applying the lowercase and stop token filters then yields (assuming "i'm" has been added to the stop filter's stopword list; it is not part of the default English set):
["learning", "elasticsearch"]
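Because order matters, reversing those two filters changes the result. A minimal sketch with the _analyze API (the text here is just for illustration); the default stop filter is case-sensitive, so running stop before lowercase misses capitalized stopwords:

POST _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop", "lowercase"],
  "text": "This is a Test"
}

["this", "test"]

With ["lowercase", "stop"] the same text would return just ["test"], because "This" gets lowercased before the stop filter sees it.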
Common Token Filter Types
1. lowercase
Converts every term to lowercase.
POST _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase"],
  "text": "Elasticsearch IS Awesome"
}
["elasticsearch", "is", "awesome"]
2. stop
Removes stopwords (e.g. is, the, a, an, ...).
POST _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop"],
  "text": "this is a test"
}
["test"]
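The word list is configurable through the stop filter's stopwords parameter; a minimal sketch with made-up words (this is also how the "i'm" case from the intro would be handled):

POST _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "stop",
      "stopwords": ["i'm", "this"]
    }
  ],
  "text": "i'm testing this filter"
}

["testing", "filter"]

Predefined language lists such as "_english_" can be passed as the stopwords value instead of an explicit array.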
3. stemmer
Keeps only the stem of each term (e.g. running → run).
POST _analyze
{
  "tokenizer": "standard",
  "filter": ["stemmer"],
  "text": "running runs runner"
}
["run", "run", "runner"]
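The algorithm is selectable through the stemmer filter's language parameter (the default is english, a Porter-based stemmer). A sketch switching to light_english, a lighter algorithm that stems less aggressively; since each algorithm trims differently, it is worth comparing their outputs with _analyze before settling on one:

POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "stemmer",
      "language": "light_english"
    }
  ],
  "text": "running runs runner"
}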
4. edge_ngram
Produces prefixes of each term, starting from the front. Most commonly used for autocomplete.
POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "edge_ngram",
      "min_gram": 2,
      "max_gram": 4
    }
  ],
  "text": "search"
}
["se", "sea", "sear"]
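In practice edge_ngram is usually baked into a custom analyzer when the index is created, rather than called ad hoc. A minimal sketch (the index, filter, and analyzer names are made up for illustration); note the separate search_analyzer, which keeps search terms from being chopped into n-grams themselves:

PUT autocomplete_demo
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}

With this mapping the prefix grams are generated once at index time, so a plain query for "sea" matches a document whose title contains "search".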