Semantic distance

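Semantic distance (Cantonese: 語義距離; Jyutping: jyu5 ji6 keoi5 lei4) is a concept often used in natural language processing (NLP). It refers to how far apart two words (or other linguistic units) are in meaning. As a simplified example, the words "wolf" and "dog" are semantically close: each denotes a kind of animal, and the two animals are very similar. By contrast, "wolf" and "cloud" are semantically far apart, because the things they denote have little direct connection^[1].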

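In practical NLP, how semantic distance should be measured is a rather contested question. In the simplest case, it can be measured as the smallest number of edges that must be traversed in a semantic network to connect the two words: in theory, the fewer edges that have to be crossed, the shorter the semantic distance. To compute it in practice, a researcher can take a ready-made semantic network such as WordNet, which records the relations among the words of a language, take the two words to be compared (word 1 and word 2), and look at how many words lie between them; this gives an estimate of the semantic distance^[2]. Professional NLP also uses more advanced and more complex methods of computing semantic distance.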

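In NLP and related applications of the early 21st century, semantic distance is quite useful. It can be used for word-sense disambiguation (teaching a computer, when it meets an ambiguous case, to find the intended meaning of each ambiguous word)^[3], and some cognitive psychologists have proposed using the concept of semantic distance to gauge how creative a piece of writing is^[4]^[5].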

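Theoretical basis

Semantic distance rests on a simple fact: language and writing are used to express meaning, and every word necessarily carries some meaning. This idea also gave rise to the concept of the semantic network. A basic semantic network has^[6]

a number of nodes^{[e 1]}, each representing a concept, and
links^{[e 2]} between the nodes, where the link between two concepts expresses the relationship between them.

To illustrate, picture the semantic network shown at the top of the article. The diagram has nodes such as "mammal", "cat" and "fur", each standing for a concept, and links (the arrows) between the nodes that express the relationships among the concepts. The arrow from "cat" to "mammal" says "a cat is a kind of mammal", the arrow from "cat" to "fur" says "a cat has fur", and so on.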

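Concepts such as the semantic network naturally lead to the idea of semantic distance. Intuitively, words can be close to one another in meaning: "bee" and "butterfly" both denote insects, so they are relatively close, whereas "bee" and "statue" (the latter an inanimate object) feel far apart. Some further hold that how far apart two words are in meaning can be measured objectively; a simple method is to count how many links separate the two words in the semantic network^{[note 1]}. Measurements of this kind yield an estimate of semantic distance^[2]^[7].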
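The link-counting idea can be sketched as a breadth-first search over a small graph. The network below is a hypothetical toy example (its nodes and links are assumptions made for illustration, not data taken from WordNet or from the figure), and links are treated as undirected so that only the number of links matters.

```python
from collections import deque

# Hypothetical toy semantic network: each pair is one link between two concepts.
TOY_LINKS = [
    ("cat", "mammal"), ("cat", "fur"),
    ("bear", "mammal"), ("bear", "fur"),
    ("whale", "mammal"), ("whale", "water"),
    ("fish", "water"), ("fish", "animal"),
    ("mammal", "animal"),
]

def link_distance(links, start, goal):
    """Return the smallest number of links between two concepts, or None if unconnected."""
    # Build an undirected adjacency map from the list of links.
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    if start not in graph or goal not in graph:
        return None
    # Breadth-first search visits nodes in order of increasing link count,
    # so the first time the goal is reached the count is minimal.
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

print(link_distance(TOY_LINKS, "cat", "bear"))  # cat - mammal - bear
print(link_distance(TOY_LINKS, "cat", "fish"))  # cat - mammal - animal - fish
```

In this toy network "cat" is two links from "bear" and three links from "fish", so under this measure "cat" counts as semantically closer to "bear". The result depends entirely on which links the network happens to contain.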

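Note that discussions of semantic distance often use terms such as semantic similarity^{[e 3]} or semantic relatedness^{[e 4]}^{[note 2]}. Semantic similarity is the inverse of semantic distance, but both terms describe the same idea: if two words are semantically similar, the semantic distance between them is short and their semantic similarity is high; if two words are not semantically similar, the semantic distance between them is long and their semantic similarity is low^[8].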

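Basic methods of calculation

Computing semantic distance involves an operation of this form:

Input: the characters, words or sentences to be compared;
Output: a numerical value that estimates the semantic distance.

This process draws on the mathematical concepts below. Note that, for the same input, different methods can produce very different values; analysts choose a method according to the situation.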

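Jaccard index

The Jaccard index^{[e 5]} is a value commonly used in statistics to assess how close two sets are. Suppose there are two sets $A$ and $B$ with an intersection ( $A\cap B$ ; the elements that both sets contain^{[note 3]}). The Jaccard index $J(A,B)$ of the two sets can then be computed with the following formula: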

J(A,B)={{|A\cap B|} \over {|A\cup B|}}={{|A\cap B|} \over {|A|+|B|-|A\cap B|}}

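The larger $J(A,B)$ is, the more similar the two sets are^[9]. The same reasoning can be applied to sentences. Suppose $A$ and $B$ are the following two Cantonese sentences^[10]:

「個樽係空嘅。」 ("The bottle is empty.")
「個樽入面乜都冇。」 ("There is nothing inside the bottle.")

Reasoning in the simplest way, with each sentence treated as a set of characters: both sentences contain the characters 個 and 樽, so these two characters make up $A\cap B$ ( $|A\cap B|=2$ ), while the remaining eight characters are not in $A\cap B$ ( $|A\cup B|=2+8=10$ ). Computed this way, the Jaccard index of the two sentences (a similarity value, so the corresponding Jaccard distance is $1-J(A,B)$ ) is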

J(A,B)={2 \over 10}=0.2

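One advantage of the Jaccard index is that it is unaffected by repeated characters: for example, adding another 空 to sentence 1 above leaves the value unchanged. In practical NLP applications, its biggest weakness is that it can only measure the similarity between sentences or passages; it cannot be used to compare the similarity of individual words.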
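The worked example can be reproduced with a few lines of Python. This is a minimal sketch that treats each sentence as a set of characters, as in the example above; the function name `jaccard_index` and the handling of two empty inputs are assumptions made here for illustration, and the sentence-final punctuation is left out of the strings.

```python
def jaccard_index(a, b):
    """Jaccard index of the character sets of two strings."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumed convention for this sketch: two empty inputs count as identical.
        return 1.0
    return len(set_a & set_b) / len(union)

sentence_1 = "個樽係空嘅"      # "The bottle is empty."
sentence_2 = "個樽入面乜都冇"  # "There is nothing inside the bottle."

index = jaccard_index(sentence_1, sentence_2)
print(index)      # 2 shared characters out of 10 distinct ones
print(1 - index)  # the corresponding Jaccard distance

# Repeating a character does not change the set, so the index is unchanged.
print(jaccard_index("個樽係空空嘅", sentence_2) == index)
```

Because the comparison is done on sets, word order and repetition are ignored. A practical system would more likely compare sets of segmented words than sets of single characters.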

Euclidean distance

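[Figure: suppose $n=2$ , and ${\text{A}}$ and ${\text{B}}$ are the word embeddings of the two words. They can be drawn as two points in space, and the analyst can then compute the distance $[{\text{AB}}]$ between the two points.]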

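First, consider the concept of the word embedding^{[e 6]}. When a number of words are given to a computer for analysis, each word in the text can have an embedding. Each embedding is an $n$-dimensional vector of real numbers that represent the word's meaning, so embeddings with similar values represent words that are similar in meaning^[11].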

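As a more concrete example, suppose each word's embedding has 30 real numbers ( $n=30$ ^{[note 4]}), where the first number expresses how strongly the word is semantically associated with "felines", the second how strongly it is associated with "humans", the third how strongly it is associated with "insects", and so on; the more positive a value, the stronger the association. That is,

the embedding of "cat" is [0.9, 0.1, -0.8...];
the embedding of "tiger" is [0.7, -0.6, -0.75...];
the embedding of "nebula" is [-0.9, -0.95, -0.95...];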

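Set aside for now the question of how the values of a word's embedding are determined^{[note 5]}. Suppose every word already has an embedding. An embedding is a vector, so it can be treated as a point in space, as in the accompanying figure. Once two words are treated as two points in space, the analyst can compute the Euclidean distance^{[e 7]} between the two points, obtaining a value that reflects how far apart the two words are in meaning^{[note 6]}. In Python, for example, a single statement can give the Euclidean distance between two points^[10]: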

# ...
# Assume embeddings[0] is the embedding of "cat" and embeddings[1] is the embedding of "tiger".
from math import dist  # standard library, Python 3.8+
distance = dist(embeddings[0], embeddings[1])  # Euclidean distance between the two embeddings
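For two embeddings $\mathbf{A}$ and $\mathbf{B}$ with components $A_{i}$ and $B_{i}$ , the Euclidean distance is $\sqrt{\sum_{i=1}^{n}(A_{i}-B_{i})^{2}}$ . The snippet above leaves the embeddings themselves elided, so the following self-contained sketch fills them in under a loud assumption: it uses only the first three components shown in the example (the remaining components are not given and are simply ignored), which means the values are illustrative rather than real embeddings. The function name `euclidean_distance` is likewise just a label chosen for this sketch.

```python
import math

# Illustrative values only: the first three components of the example
# embeddings above; the elided components are ignored in this sketch.
EMBEDDINGS = {
    "cat": [0.9, 0.1, -0.8],
    "tiger": [0.7, -0.6, -0.75],
    "nebula": [-0.9, -0.95, -0.95],
}

def euclidean_distance(a, b):
    """Square root of the summed squared differences between two vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

cat_tiger = euclidean_distance(EMBEDDINGS["cat"], EMBEDDINGS["tiger"])
cat_nebula = euclidean_distance(EMBEDDINGS["cat"], EMBEDDINGS["nebula"])

print(f"cat to tiger:  {cat_tiger:.3f}")
print(f"cat to nebula: {cat_nebula:.3f}")
```

With these illustrative numbers, "cat" lies much closer to "tiger" than to "nebula", which matches the intuition the example is meant to convey. The hand-written function gives the same result as the standard-library `math.dist`.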

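Cosine similarity

Cosine similarity^{[e 8]} can be used to measure how similar two sequences of numbers are. The formula takes two vectors as input and outputs $\cos(\theta)$ , where $\theta$ is the angle between the two vectors; the smaller $\theta$ is (and hence the larger $\cos(\theta)$ is), the more similar the two vectors are^[12].

Let $\mathbf {A}$ and $\mathbf {B}$ be the two vectors to be analysed, and let $A_{i}$ and $B_{i}$ be the components of $\mathbf {A}$ and $\mathbf {B}$ ; cosine similarity can then be computed with this formula: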

{\text{cosine similarity}}=S_{C}(A,B):=\cos(\theta )={\mathbf {A} \cdot \mathbf {B}  \over \|\mathbf {A} \|\|\mathbf {B} \|}={\frac {\sum \limits _{i=1}^{n}{A_{i}B_{i}}}{{\sqrt {\sum \limits _{i=1}^{n}{A_{i}^{2}}}}{\sqrt {\sum \limits _{i=1}^{n}{B_{i}^{2}}}}}}

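The output of this formula ranges from a minimum of -1 to a maximum of 1; the higher the value, the more similar the two vectors are. If the two vectors in the formula are the embeddings of two words, cosine similarity indicates how far apart the two words are semantically^[10].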
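The formula translates directly into code. The sketch below is one plain-Python reading of it; as an assumption for illustration it reuses only the first three components of the example embeddings from the Euclidean distance section (the elided components are ignored), and it raises an error for zero vectors, for which the formula is undefined.

```python
import math

# Illustrative values only: the first three components of the example
# embeddings given earlier; the elided components are ignored.
cat = [0.9, 0.1, -0.8]
tiger = [0.7, -0.6, -0.75]
nebula = [-0.9, -0.95, -0.95]

def cosine_similarity(a, b):
    """Dot product of two vectors divided by the product of their lengths."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

sim_cat_tiger = cosine_similarity(cat, tiger)
sim_cat_nebula = cosine_similarity(cat, nebula)

print(f"cat vs tiger:  {sim_cat_tiger:.3f}")
print(f"cat vs nebula: {sim_cat_nebula:.3f}")
```

With these illustrative numbers the similarity between "cat" and "tiger" is clearly positive, while the one between "cat" and "nebula" is slightly negative. Unlike Euclidean distance, the value depends only on the direction of the vectors and not on their lengths.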

See also

Further reading

Dumais, S (2003). "Data-driven approaches to information access". Cognitive Science. 27 (3): 491-524. doi:10.1207/s15516709cog2703_7.
Lee, M. D., Pincombe, B., & Welsh, M. (2005). An empirical evaluation of models of text document similarity (PDF). In B. G. Bara & L. Barsalou & M. Bucciarelli (Eds.), 27th Annual Meeting of the Cognitive Science Society, CogSci2005 (pp. 1254-1259). Austin, Tx: The Cognitive Science Society, Inc.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological review, 104(2), 211.
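Olson, J. A., Nahas, J., Chmoulevitch, D., Cropper, S. J., & Webb, M. E. (2021). Naming unrelated words predicts creativity. Proceedings of the National Academy of Sciences, 118(25). Some cognitive psychologists point out that the concept of semantic distance can be used to measure creativity: creativity is by definition a person's ability to come up with ideas that differ from mainstream views, so a more creative person should be better able to (for example) write words and sentences that are semantically distant from mainstream opinion, or, by being able to come up with many different ideas, to produce pieces of writing that are semantically distant from one another^[13].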

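Notes

[note 1] Assuming the network describes reality accurately.
[note 2] Many people, however, find the term semantic relatedness imprecise: two words being closely related in meaning does not imply that their meanings are similar. See concepts such as antonyms, meronyms and holonyms.
[note 3] For the meaning of these mathematical symbols, see material on set theory.
[note 4] In practical applications, the value of $n$ easily runs to several hundred.
[note 5] In fact, how to compute word embeddings is itself a major topic in NLP.
[note 6] More technically, this distance value is usually normalised in practical applications.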

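References

The technical terms used in this article have the following English names:

[e 1] node
[e 2] edge
[e 3] similarity
[e 4] relatedness
[e 5] Jaccard index
[e 6] word embedding
[e 7] Euclidean distance
[e 8] cosine similarity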

This article cites the following literature and web pages:

[1] Agirre, E., & Rigau, G. (1997). A proposal for word sense disambiguation using conceptual distance. Amsterdam Studies in the Theory and History of Linguistic Science Series 4, 161-172.
[2] Budanitsky, Alexander and Hirst, Graeme. "Evaluating WordNet-based measures of lexical semantic relatedness." Computational Linguistics, 32(1), March 2006, 13-47.
[3] Diamantini, C.; Mircoli, A.; Potena, D.; Storti, E. (2015-06-01). "Semantic disambiguation in a social information discovery system". 2015 International Conference on Collaboration Technologies and Systems (CTS): 326-333.
[4] Kenett, Y. N. (2019). What can quantitative measures of semantic distance tell us about creativity?. Current Opinion in Behavioral Sciences, 27, 11-16.
[5] Olson, J. A., Nahas, J., Chmoulevitch, D., Cropper, S. J., & Webb, M. E. (2021). Naming unrelated words predicts creativity. Proceedings of the National Academy of Sciences, 118(25).
[6] John F. Sowa (1987). "Semantic Networks". In Stuart C Shapiro (ed.). Encyclopedia of Artificial Intelligence. Retrieved 29 April 2008.
[7] Harispe S.; Ranwez S.; Janaqi S.; Montmain J. (2015). "Semantic Similarity from Natural Language and Ontology Analysis". Synthesis Lectures on Human Language Technologies. 8:1: 1-254.
[8] Feng Y.; Bagheri E.; Ensan F.; Jovanovic J. (2017). "The state of the art in semantic relatedness: a framework for comparison". Knowledge Engineering Review. 32: 1-30.
[9] Murphy, Allan H. (1996). "The Finley Affair: A Signal Event in the History of Forecast Verification". Weather and Forecasting. 11 (1): 3.
[10] Ultimate Guide To Text Similarity With Python. NewsCatcher.
[11] Jurafsky, Daniel; Martin, James H. (2000). Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition. Upper Saddle River, N.J.: Prentice Hall.
[12] Cosine Distance, Cosine Similarity, Angular Cosine Distance, Angular Cosine Similarity.
[13] Kenett, Y. N. (2019). What can quantitative measures of semantic distance tell us about creativity?. Current Opinion in Behavioral Sciences, 27, 11-16.

External links

(in English) Text Similarity Measures Live Demo.
(in English) How to Rank Text Content by Semantic Similarity. Towards Data Science.
(in English) Ultimate Guide To Text Similarity With Python. NewsCatcher.
