Okapi BM25

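Okapi BM25 (where BM stands for "best matching") is a ranking function used in information retrieval. The algorithm takes the user's query ( $Q$ ) as input and computes, for each document ( $D$ ), a score ( ${\text{score}}$ ) that reflects how relevant the document is to the query.^[1]^[2]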

Formula

The Okapi BM25 formula is:

{\text{score}}(D,Q)=\sum _{i=1}^{n}{\text{IDF}}(q_{i})\cdot {\frac {f(q_{i},D)\cdot (k_{1}+1)}{f(q_{i},D)+k_{1}\cdot \left(1-b+b\cdot {\frac {|D|}{\text{avgdl}}}\right)}}

where^{[note 1]}

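$q_{1},q_{2},\ldots ,q_{n}$ are the keywords in $Q$;
$f(q_{i},D)$ is the number of times $q_{i}$ occurs in $D$ (normalization by the length of $D$ is handled separately by the $|D|/{\text{avgdl}}$ term);
$|D|$ is the length of $D$ (in words);
${\text{avgdl}}$ is the average length of the documents in the collection being searched;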

and $k_{1}$ and $b$ are free parameters. When they have not been tuned, they are commonly set to $k_{1}\in [1.2,2.0]$ and $b=0.75$.^[3]
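To make the roles of $k_{1}$ and $b$ concrete, the fraction in the formula (everything except the IDF factor) can be sketched in Python. This is only an illustrative sketch: the function name `bm25_term_weight` and its argument names are made up for this example, and the defaults $k_{1}=1.2$, $b=0.75$ are just one choice from the commonly used range.

```python
def bm25_term_weight(tf: float, doc_len: int, avgdl: float,
                     k1: float = 1.2, b: float = 0.75) -> float:
    """Term-frequency factor of BM25 for one query term, without the IDF factor.

    tf      -- f(q_i, D), occurrences of the term in the document
    doc_len -- |D|, document length in words
    avgdl   -- average document length in the collection (assumed > 0)
    """
    # Length normalization: equals 1 for a document of average length,
    # larger for longer documents, smaller for shorter ones.
    norm = 1.0 - b + b * (doc_len / avgdl)
    return tf * (k1 + 1.0) / (tf + k1 * norm)


# For a document of average length the factor grows with tf
# but saturates below k1 + 1.
for tf in (1, 2, 5, 50):
    print(tf, bm25_term_weight(tf, doc_len=100, avgdl=100.0))
```

With $b=0$ the document length is ignored entirely, while $b=1$ applies full length normalization; a larger $k_{1}$ lets repeated occurrences of a term keep adding to the score for longer before it levels off.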

${\text{IDF}}(q_{i})$ is computed as follows:

{\text{IDF}}(q_{i})=\ln \left({\frac {N-n(q_{i})+0.5}{n(q_{i})+0.5}}+1\right)

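where $N$ is the number of documents searched,
and $n(q_{i})$ is the number of those documents that contain $q_{i}$.
If $q_{i}$ is a common word (for example "in" or "of" in English), its ${\text{IDF}}(q_{i})$ score should be low, because $N-n(q_{i})$ is small. The ${\text{IDF}}(q_{i})$ factor therefore exists to stop common words from distorting the search results.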
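The effect of the IDF factor can be checked with a small Python sketch of the formula above. The function name `bm25_idf` and the example counts (a collection of 1000 documents) are assumptions chosen purely for illustration.

```python
import math


def bm25_idf(n_docs: int, doc_freq: int) -> float:
    """IDF(q_i) as defined above.

    n_docs   -- N, number of documents searched
    doc_freq -- n(q_i), number of documents that contain the term
    """
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)


# Hypothetical collection of 1000 documents.
rare_term = bm25_idf(n_docs=1000, doc_freq=10)     # term found in 10 documents
common_term = bm25_idf(n_docs=1000, doc_freq=990)  # term found in 990 documents
print(rare_term, common_term)  # the rare term gets a much larger weight
```

Because of the $+1$ inside the logarithm, the argument is always greater than 1, so this form of IDF stays positive even for a term that appears in most of the documents.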

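Once the computation is done, each document has a score ${\text{score}}$ indicating how relevant it is to the query; the higher the score, the more relevant the document. The search engine can then list the retrieved documents in order of score, highest first. Okapi BM25 originated in the 1980s and by the early 21st century had been widely adopted by search engines.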
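The whole procedure, scoring every document and listing the highest score first, can be sketched in Python as below. This is a minimal sketch under stated assumptions rather than a reference implementation: the function name `bm25_rank` is hypothetical, documents are assumed to be already split into tokens (here by plain whitespace), the collection is assumed to be non-empty, and $k_{1}=1.2$, $b=0.75$ are just the common defaults.

```python
import math
from collections import Counter


def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Return (document index, score) pairs, highest score first.

    query -- list of query tokens
    docs  -- non-empty list of documents, each a non-empty list of tokens
    """
    n_docs = len(docs)
    avgdl = sum(len(doc) for doc in docs) / n_docs
    term_counts = [Counter(doc) for doc in docs]
    # n(q_i): number of documents that contain each query term.
    doc_freq = {q: sum(1 for tc in term_counts if q in tc) for q in set(query)}

    scored = []
    for idx, (doc, tc) in enumerate(zip(docs, term_counts)):
        norm = 1.0 - b + b * len(doc) / avgdl
        score = 0.0
        for q in query:
            f = tc[q]  # f(q_i, D); Counter returns 0 for absent terms
            idf = math.log((n_docs - doc_freq[q] + 0.5) / (doc_freq[q] + 0.5) + 1.0)
            score += idf * f * (k1 + 1.0) / (f + k1 * norm)
        scored.append((idx, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy collection, tokenized by whitespace (an assumption of this sketch).
docs = [
    "the okapi is a mammal native to the congo".split(),
    "bm25 is a ranking function used by search engines".split(),
    "search engines rank documents with a ranking function such as bm25".split(),
]
ranking = bm25_rank("bm25 ranking".split(), docs)
print(ranking)
```

In this toy collection, documents 1 and 2 each contain both query terms once, but document 1 is shorter and therefore ranks first; document 0 contains neither term and ends up last with a score of zero.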

Notes

[note 1] $\sum$ denotes summation.

See also

Natural language processing

References

[1] Spärck Jones, K.; Walker, S.; Robertson, S. E. (2000). "A probabilistic model of information retrieval: Development and comparative experiments: Part 1". Information Processing & Management. 36 (6): 779-808.
[2] Spärck Jones, K.; Walker, S.; Robertson, S. E. (2000). "A probabilistic model of information retrieval: Development and comparative experiments: Part 2". Information Processing & Management. 36 (6): 809-840.
[3] Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. An Introduction to Information Retrieval, Cambridge University Press, 2009, p. 233.
