Corpus linguistics

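Corpus linguistics (Cantonese: 語料庫語言學, jyu5 liu2 fu3 jyu5 jin4 hok6) refers broadly to linguistic research carried out with corpora, and it is an important part of linguistics^[1]. Linguistics is by definition the study of language, and studying anything requires a large sample of things of that kind. A corpus supplies exactly that: a large body of language material. With a corpus of a given language in hand, linguists can analyze, for example, the grammar of that language^[2].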

Corpora

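A text corpus is one of the most commonly used and most important sources of linguistic data. A corpus is a large, structured language resource: a corpus of a given language contains a large amount of recordings and text in that language. The International Corpus of English (ICE), for example, is a well-known English corpus. ICE covers more than 20 countries or regions around the world where English is an official language or one of the official languages (including Hong Kong). For each country or region, it holds recordings of local people speaking English, together with many kinds of written material produced locally in English, such as essays, letters, academic writing, and news reports. As of 2018, ICE held at least 1,500,000 words of material for each country or region it covers, which makes it a large resource^[3]^{[note 1]}.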

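Figure: The structure of the British National Corpus (BNC). 90% of the material is written text and 10% is speech; the written material includes newspapers, research journals, and fiction, among others.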

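Designing a corpus takes skill, and many techniques are used deliberately to make the corpus convenient for its users, for example: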

Preprocessing: The data have already been fully preprocessed.
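Size: A corpus essentially exists to supply language samples for linguistics and related research, and larger samples are generally better. The size of a corpus is usually measured by the total word count of its material: the higher the count, the larger the corpus, and (other factors being equal) the larger the corpus, the more useful it is^[4]. For example, as of 2020 the Oxford English Corpus (a corpus describing the English of the early twenty-first century^[5]) contained as many as 2.1 billion words^[6].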
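Recency: Language changes over time, as phenomena such as trendy slang show. If the data in a corpus are not recent enough, researchers cannot get up-to-date information. Imagine an AI researcher who wants to build a chatbot that can converse with people, but whose corpus lacks data from the last 10 years: the chatbot will not know the slang of the last 10 years and so cannot hold a natural conversation^[7]. In practice, corpora also label each item with the year it comes from, so that users can judge whether the item suits their purpose.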
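Breadth: It is generally held that a corpus should contain different kinds of material. The Oxford English Corpus and the International Corpus of English, for example, both include written English as well as recordings of people speaking. Beyond that, writing and speech differ from one situation to another: news reports, prose fiction, and academic writing differ considerably from one another, and "speaking on a formal occasion" is very different from "speaking with family at home in everyday life". Corpora such as the International Corpus of English take this into account, so the material they gather spans different kinds, including news reports, fiction, and academic writing. They also offer search-engine-like functions so that people working in natural language processing (NLP) can easily find the material they need^[8].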
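Metadata: Metadata are "data that describe data" and are an indispensable part of a corpus. Suppose a corpus item is an essay: the words in the essay are the data themselves, while the metadata are data describing that item, including the essay's author, word count, year of publication, source, and so on. Metadata are useful in many analyses; for example, metadata recording the year in which each item was produced help in analyzing how a language changes over time^[9].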

... and so on.
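Several of the points above (size measured by total word count, year labels on each item, and metadata-driven analysis) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Document` class, the helper functions, and the three-item toy corpus are hypothetical and do not reflect the format or tooling of ICE, the BNC, or the Oxford English Corpus. The tokenizer is deliberately naive; real corpora rely on language-specific tokenization.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional

PUNCTUATION = ".,;:!?\"'()"


@dataclass
class Document:
    """One corpus item: the text (the data) plus metadata describing it."""
    text: str
    author: str
    year: int
    genre: str


def tokenize(text: str) -> List[str]:
    """Naive tokenizer: split on whitespace, strip punctuation, lowercase."""
    stripped = (piece.strip(PUNCTUATION).lower() for piece in text.split())
    return [token for token in stripped if token]


def corpus_size(docs: List[Document]) -> int:
    """Corpus size measured as the total word count of all items."""
    return sum(len(tokenize(doc.text)) for doc in docs)


def select(docs: List[Document],
           since: Optional[int] = None,
           genre: Optional[str] = None) -> List[Document]:
    """Keep only the items whose metadata match the year and genre filters."""
    return [
        doc for doc in docs
        if (since is None or doc.year >= since)
        and (genre is None or doc.genre == genre)
    ]


def relative_frequency_by_year(docs: List[Document], word: str) -> Dict[int, float]:
    """Share of tokens equal to `word` in each year, based on the year metadata."""
    hits, totals = Counter(), Counter()
    target = word.lower()
    for doc in docs:
        tokens = tokenize(doc.text)
        totals[doc.year] += len(tokens)
        hits[doc.year] += tokens.count(target)
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


# Hypothetical toy corpus; the texts and metadata are invented for illustration.
toy_corpus = [
    Document("The council approved the new budget today.", "A. Writer", 2005, "news"),
    Document("We measured the effect of tone on meaning.", "B. Scholar", 2012, "academic"),
    Document("Okay so the selfie thing, the selfie was his idea.", "Speaker 1", 2019, "conversation"),
]

print("Total words:", corpus_size(toy_corpus))
print("Items since 2010:", [doc.author for doc in select(toy_corpus, since=2010)])
print("'selfie' by year:", relative_frequency_by_year(toy_corpus, "selfie"))
```

Keeping the metadata next to each text is what makes the year filter and the per-year frequency possible. A corpus without year labels could still report its total size, but it could not tell a user whether its material is recent enough for a given task.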

See also

Natural language processing

Notes

[note 1] Some have pointed out, however, that for a corpus of the early twenty-first century, 1,500,000 words is on the small side.

References

[1] Sinclair, J. (1992). The automatic analysis of corpora. In Svartvik, J. (ed.), Directions in Corpus Linguistics (Proceedings of Nobel Symposium 82). Berlin: Mouton de Gruyter.
[2] Gudkov, V., Mitrofanova, O., & Filippskikh, E. (2020). Automatically ranked Russian paraphrase corpus for text generation. arXiv preprint arXiv:2006.09719.
[3] Kirk, J., & Nelson, G. (2018). The International Corpus of English project: A progress report. World Englishes, 37(4), 697-716.
[4] Hanke, T., Schulder, M., Konrad, R., & Jahn, E. (2020, May). Extending the Public DGS Corpus in size and depth. In Proceedings of the LREC2020 9th Workshop on the Representation and Processing of Sign Languages: Sign Language Resources in the Service of the Language Community, Technological Challenges and Application Perspectives (pp. 75-82).
[5] Culpeper, J. (2009). The metalanguage of impoliteness: using Sketch Engine to explore the Oxford English Corpus.
[6] "The Oxford English Corpus". Sketch Engine. Lexical Computing CZ s.r.o. Retrieved 27 October 2016.
[7] Renouf, A., & Kehoe, A. (2013). Filling the gaps: Using the WebCorp Linguist's Search Engine to supplement existing text resources. International Journal of Corpus Linguistics, 18(2), 167-198.
[8] International Corpus of English (ICE) Homepage.
[9] What is metadata and why do you need it? Developing Linguistic Corpora: a Guide to Good Practice.
