1. Introduction
Word embedding techniques such as Word2Vec have transformed natural language processing by capturing semantic relationships between text words based on their context. Similarly, Audio Word2Vec was developed to extract phonetic structure from spoken word segments. However, conventional Audio Word2Vec focuses only on the phonetic information learned within each individual spoken word and ignores the semantic context that arises from the sequence of words in an utterance.
This paper proposes a new two-stage framework that bridges this gap. The goal is to create vector representations of spoken words that carry both their phonetic structure and their meaning. The task is difficult because, as the paper notes, phonetic similarity and semantic relatedness often diverge. For example, "brother" and "sister" are close in meaning but phonetically different, whereas "brother" and "bother" sound alike but are semantically unrelated. The proposed method aims to disentangle these two aspects and then embed them jointly, enabling more powerful applications such as semantic retrieval of spoken documents, where documents related to the meaning of a query can be found, not only those containing the exact query word.
2. Methodology
The core innovation is a two-stage sequential embedding framework designed to first isolate phonetic information and then layer semantic understanding on top of it.
2.1 Stage 1: Phonetic Embedding with Speaker Disentanglement
The first stage processes raw spoken word segments. Its main goal is to learn a robust phonetic embedding, a vector that represents the sequence of phonemes in the word, while removing or disentangling confounding factors such as speaker identity and recording conditions. This matters because speaker characteristics can dominate the signal and obscure the underlying phonetic content. Techniques inspired by normalization or adversarial training (similar in spirit to the disentanglement approaches in CycleGAN) can be used here to create a speaker-invariant phonetic space.
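As a rough illustration of the normalization idea mentioned above, the sketch below subtracts each speaker's mean vector from that speaker's word embeddings. This is only a simple stand-in, not the adversarial disentanglement the paper may use; the function name `speaker_mean_normalize` and the toy data are assumptions made for this example.

```python
from collections import defaultdict

def speaker_mean_normalize(embeddings, speaker_ids):
    """Subtract each speaker's mean vector from that speaker's embeddings.

    A crude normalization stand-in for speaker disentanglement (an assumption
    for illustration, not the paper's method).
    """
    dim = len(embeddings[0])
    sums = defaultdict(lambda: [0.0] * dim)
    counts = defaultdict(int)
    for vec, spk in zip(embeddings, speaker_ids):
        counts[spk] += 1
        for i, x in enumerate(vec):
            sums[spk][i] += x
    means = {spk: [s / counts[spk] for s in sums[spk]] for spk in sums}
    return [[x - m for x, m in zip(vec, means[spk])]
            for vec, spk in zip(embeddings, speaker_ids)]

# Toy data: speakers A and B say the same two words, but B's vectors
# carry a constant offset of +10 in every dimension.
embeddings = [[1.0, 0.0], [0.0, 1.0], [11.0, 10.0], [10.0, 11.0]]
speaker_ids = ["A", "A", "B", "B"]
normalized = speaker_mean_normalize(embeddings, speaker_ids)
```

After normalization, the two speakers' vectors for the same word coincide in this toy case, because the only speaker effect was a constant offset. Real speaker variation is not a simple offset, which is why learned approaches such as adversarial training are considered.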
2.2 Stage 2: Semantic Embedding
The second stage takes the speaker-disentangled phonetic embeddings from Stage 1 as input. These embeddings are then processed in light of the context of the spoken words within the utterance. By analyzing the sequence of these phonetic vectors (for example, with a recurrent neural network or a transformer architecture), the model learns semantic relationships, much like text-based Word2Vec. The output of this stage is the final "phonetic-and-semantic" embedding of each spoken word.
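To make the Word2Vec analogy concrete, the sketch below builds skip-gram style (center, context) training pairs over one utterance, where each position holds a Stage 1 phonetic vector instead of a text token. The function name `skipgram_pairs`, the `window` parameter, and the toy vectors are assumptions for illustration; the paper's actual training procedure may differ.

```python
def skipgram_pairs(utterance, window=2):
    """Return (center, context) index pairs within a symmetric window."""
    pairs = []
    for center in range(len(utterance)):
        lo = max(0, center - window)
        hi = min(len(utterance), center + window + 1)
        for ctx in range(lo, hi):
            if ctx != center:
                pairs.append((center, ctx))
    return pairs

# Hypothetical utterance of four spoken words; each entry stands for
# one Stage 1 phonetic vector.
utterance = [[0.1, 0.3], [0.7, 0.2], [0.4, 0.9], [0.5, 0.5]]
pairs = skipgram_pairs(utterance, window=1)
# pairs == [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
```

A Stage 2 model would then be trained to predict the vector at the context position from the vector at the center position, so that words appearing in similar contexts end up with similar embeddings.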
2.3 Evaluation Framework
To evaluate the dual nature of the embeddings, the authors propose a parallel evaluation strategy. Phonetic quality is assessed through tasks such as spoken term detection or clustering by phonetic similarity. Semantic quality is assessed by aligning the audio embeddings with pre-trained text word embeddings (for example, GloVe or BERT embeddings) and measuring the correlation between their vector spaces, or by performance on semantic tasks.
3. Technical Details
3.1 Mathematical Formulation
The learning objective combines several loss functions. For Stage 1, a reconstruction or contrastive loss ensures that the phonetic content is preserved, while an adversarial or correlation loss suppresses speaker information. For Stage 2, a context-based prediction loss is used, such as the skip-gram or CBOW objective from Word2Vec. The joint objective for the full model can be written as:
$L_{total} = \lambda_1 L_{phonetic} + \lambda_2 L_{speaker\_inv} + \lambda_3 L_{semantic}$
where $L_{phonetic}$ ensures phonetic fidelity, $L_{speaker\_inv}$ encourages disentanglement, $L_{semantic}$ captures contextual word relationships, and $\lambda_1$, $\lambda_2$, $\lambda_3$ are weights that balance the three terms.
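The weighted sum can be sketched in a few lines. The function name `total_loss` and the numeric loss and weight values below are placeholders chosen for illustration, not values reported in the paper.

```python
def total_loss(l_phonetic, l_speaker_inv, l_semantic, lambdas=(1.0, 1.0, 1.0)):
    """Combine the three loss terms as a weighted sum.

    lambdas holds the weights (lambda_1, lambda_2, lambda_3).
    """
    lam1, lam2, lam3 = lambdas
    return lam1 * l_phonetic + lam2 * l_speaker_inv + lam3 * l_semantic

# Placeholder loss values and weights, chosen only to show the arithmetic:
# 1.0 * 0.5 + 2.0 * 0.25 + 0.5 * 2.0 = 2.0
loss = total_loss(0.5, 0.25, 2.0, lambdas=(1.0, 2.0, 0.5))
```

In practice the weights would be tuned as hyperparameters: a larger weight on the speaker-invariance term pushes harder on disentanglement, possibly at some cost to phonetic fidelity.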
3.2 Model Architecture
The architecture is expected to be a deep neural network pipeline. Stage 1 may use a convolutional neural network (CNN) or an encoder to process spectrograms, followed by a bottleneck layer that produces the speaker-disentangled phonetic vector. Stage 2 likely uses a sequence model (RNN, LSTM, or Transformer) that takes the sequence of Stage 1 vectors and outputs context-aware embeddings. The model is trained end-to-end on a corpus of spoken utterances.
4. Experimental Results
4.1 Dataset and Setup
The experiments were conducted on a collection of spoken documents, possibly drawn from sources such as LibriSpeech or broadcast news. The setup involves training the two-stage model and comparing it with baselines such as standard Audio Word2Vec (phonetic only) and text-based embeddings.
4.2 Performance Metrics
Key metrics include:
- Phonetic Retrieval Precision / Recall: For finding exact spoken-term matches.
- Semantic Retrieval Mean Average Precision (MAP): For retrieving documents semantically related to a query.
- Embedding Correlation: Cosine similarity between the audio embeddings and the corresponding text word embeddings.
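As a reminder of how the MAP metric listed above is computed, here is a minimal sketch. It assumes that every relevant document for a query appears somewhere in that query's ranked list, so average precision is normalized by the number of relevant items found; the relevance judgments are made-up toy data.

```python
def average_precision(ranked_relevance):
    """Average precision for one query.

    ranked_relevance is a list of booleans in rank order (True = relevant).
    Assumes all relevant documents appear in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(queries):
    """Mean of the per-query average precision values."""
    return sum(average_precision(q) for q in queries) / len(queries)

# Toy relevance judgments for two queries.
queries = [[True, False, True], [False, True]]
map_score = mean_average_precision(queries)
# Query 1: (1/1 + 2/3) / 2 = 5/6; Query 2: 1/2; MAP = (5/6 + 1/2) / 2 = 2/3
```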
4.3 Results Analysis
The paper reports promising preliminary results. The proposed two-stage embeddings outperform phonetic-only Audio Word2Vec on semantic retrieval tasks, successfully retrieving documents that are topically related to the query but do not contain the query term. At the same time, they maintain strong performance on phonetic retrieval tasks, which indicates that phonetic information is retained. The parallel evaluation shows a higher correlation between the proposed audio embeddings and text embeddings than the baseline methods achieve.
Key Insights
- The two-stage approach effectively separates the learning of phonetic and semantic information.
- Speaker disentanglement in Stage 1 is essential for building a clean phonetic representation.
- The framework enables semantic search over audio archives, a major step beyond keyword spotting.
5. Analysis Framework Example
Case: Evaluating a Spoken Lecture Retrieval System
Scenario: A user queries a database of spoken lectures with the phrase "neural network optimization."
Analysis with the Proposed Embeddings:
- Phonetic Match: The system retrieves lectures in which the exact phrase "neural network optimization" is spoken (high phonetic similarity).
- Semantic Match: The system also retrieves lectures that discuss "gradient descent," "backpropagation," or the "Adam optimizer," because the embeddings of these terms lie close to the query in the semantic space.
Evaluation: Precision is computed for the phonetic matches. For the semantic matches, human annotators judge relevance, and Mean Average Precision (MAP) is computed. The system's ability to balance both kinds of results demonstrates the value of the joint embedding.
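The retrieval step in this case can be sketched as ranking lectures by cosine similarity to the query embedding. The two-dimensional vectors and lecture labels below are invented purely for illustration, and `rank_documents` is a hypothetical helper, not part of the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs):
    """Return (name, score) pairs sorted by descending similarity."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Invented 2-D embeddings: the first axis loosely stands for the
# "optimization" topic, the second for an unrelated topic.
query = [1.0, 0.0]
lectures = {
    "gradient descent": [0.9, 0.1],
    "backpropagation": [0.8, 0.3],
    "medieval history": [0.0, 1.0],
}
ranking = rank_documents(query, lectures)
top_names = [name for name, _ in ranking]
```

With these toy vectors the two optimization-related lectures rank above the unrelated one. The ranked list would then be passed to human annotators or scored against relevance labels to compute MAP.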
6. Application Outlook & Future Directions
Applications:
- Intelligent Voice Assistants: Understanding user intent beyond literal command matching.
- Multimedia Archive Search: Semantic search across video recordings, meetings, and historical audio recordings.
- Accessibility Tools: Improved content navigation for visually impaired users in audio media.
- Cross-Lingual Spoken Retrieval: Potentially finding content in one language based on a query in another, using semantics as a bridge.
Future Research Directions:
- Exploring more advanced disentanglement techniques (for example, based on Beta-VAE or FactorVAE) for cleaner phonetic features.
- Integrating large pre-trained speech models (for example, Wav2Vec 2.0, HuBERT) as a stronger front end.
- Extending the framework to model longer discourse and document-level semantics.
- Investigating few-shot or zero-shot learning for rare words.
7. References
- Mikolov, T., et al. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781.
- Chung, Y.-A., & Glass, J. (2018). Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech. Interspeech.
- Zhu, J.-Y., et al. (2017). Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. ICCV (CycleGAN).
- Baevski, A., et al. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. NeurIPS.
- Lee, H.-y., & Lee, L.-s. (2018). Audio Word2Vec: Unsupervised Learning of Audio Segment Representations using Sequence-to-sequence Autoencoder. IEEE/ACM TASLP.
- Chen, Y.-C., et al. (2019). Phonetic-and-Semantic Embedding of Spoken Words with Applications in Spoken Content Retrieval. arXiv:1807.08089v4.
8. Expert Analysis
Core Insight: This paper is not just an incremental improvement on Audio Word2Vec; it is a strategic pivot toward closing the representation gap between speech and text. The authors correctly identify the tension between the phonetic and semantic signals in audio as the central challenge, not merely a nuisance. Their two-stage approach is a pragmatic, engineering-minded answer to a problem that many in the field have glossed over by treating speech as mere "noisy text." The real insight is to treat speaker characteristics and other acoustic variation as adversarial noise to be removed before semantic learning begins, a move that borrows from the success of disentanglement research in computer vision (for example, the principles behind CycleGAN style transfer).
Logical Flow: The logic of the methodology is sound and defensible. Stage 1's focus on speaker-invariant phonetics is non-negotiable: trying to learn semantics from raw, speaker-dependent features is a fool's errand, as decades of speaker recognition research have shown. Stage 2 then reuses the established Word2Vec framework, but instead of operating on discrete text tokens it operates on continuous phonetic embeddings. This flow mirrors the human cognitive process of interpreting speech (acoustics → phonemes → meaning) more closely than end-to-end models that bypass the intermediate structure.
Strengths & Flaws: The main strength is practical applicability. The framework directly enables semantic search over audio archives, a capability with immediate commercial and research value. The parallel evaluation scheme is also a strength, providing clear, multi-faceted metrics. The flaw, however, lies in its potential fragility. The success of Stage 2 depends entirely on how well Stage 1 disentangles. Any residual speaker or channel information becomes confounding semantic noise. In addition, the model struggles with homophones ("write" and "right"), where the phonetic identity matches but the meaning differs, a problem that text embeddings do not have. The paper's preliminary experiments, while promising, need to be scaled to noisy, multi-speaker, real-world data to establish robustness.
Actionable Insights: For practitioners, this work is a blueprint. The immediate action is to implement and test this two-stage pipeline on proprietary audio data. Evaluation must go beyond academic metrics to include user studies of search satisfaction. For researchers, the path forward is clear: 1) Integrate modern self-supervised speech models (such as Wav2Vec 2.0 from Facebook AI Research) as a stronger front end for Stage 1. 2) Explore transformer architectures in Stage 2 to capture longer-range context than RNNs. 3) Investigate multilingual training to see whether the phonetic-semantic split yields a language-independent semantic space. This paper lays the foundation stone; the next step is to build the cathedral of genuine audio understanding on top of it.