J-MAC: Japanese Multi-speaker Audiobook Corpus for Speech Synthesis

An analysis of the J-MAC corpus construction method, its technical contributions, evaluation results, and future directions for expressive audiobook speech synthesis.

1. Introduction

The paper introduces J-MAC (Japanese Multi-speaker Audiobook Corpus), a new speech corpus designed to advance research on expressive, context-aware speech synthesis, particularly for audiobook applications. The authors argue that while reading-style TTS has reached near-human quality, the next frontier involves handling complex cross-sentence context, speaker-specific expressiveness, and narrative flow, all hallmarks of professional audiobook narration. The lack of high-quality multi-speaker audiobook corpora, especially for languages such as Japanese, is identified as a major bottleneck. J-MAC aims to fill this gap with a resource built from professionally narrated audiobooks using an automated, language-agnostic construction pipeline.

2. Corpus Construction

Building J-MAC involves a three-stage pipeline: data collection, cleansing, and precise text-audio alignment.

2.1 Data Collection

Audiobooks were selected on two main criteria: 1) availability of an accurate reference text (preferring out-of-copyright books, to avoid ASR transcription errors on named entities), and 2) the existence of multiple professional narrations of the same book by different speakers, to capture speaker-dependent expressiveness. This focus on parallel recordings (same book, different speakers) is a deliberate choice that enables controlled studies of speaker style.

2.2 Data Cleansing & Alignment

The raw audiobook audio goes through a multi-stage processing pipeline. First, vocal-instrumental separation (e.g., using tools such as Spleeter or Open-Unmix) isolates the narrator's voice from any background music or sound effects. Next, Connectionist Temporal Classification (CTC), typically from a pre-trained ASR model, provides an initial alignment between audio segments and the corresponding text. Finally, Voice Activity Detection (VAD) is applied to refine the boundaries of the speech segments, yielding clean utterances that match the text.

3. Technical Approach

The main innovation lies in the automated pipeline, which minimizes manual effort.

3.1 Vocal-Instrumental Separation

This step is essential for obtaining "clean" speech data. The paper indicates the use of a source separation model to extract the vocal track, removing non-speech components that could degrade TTS model training.

3.2 CTC-Based Alignment

CTC alignment is used because it can handle variable-length sequences without explicit segmentation. The CTC loss, $L_{CTC} = -\log P(\mathbf{l}|\mathbf{x})$, where $\mathbf{x}$ is the audio input and $\mathbf{l}$ is the target label sequence, lets the model learn an alignment between audio frames and characters/phonemes.
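As an illustration of how a frame-level alignment can be read out of CTC outputs, the sketch below runs a Viterbi search over the standard blank-extended label sequence. It is a minimal NumPy example and not the authors' implementation: the function names `ctc_forced_align` and `collapse`, the toy probabilities, and the use of index 0 for the blank symbol are all assumptions made for illustration.

```python
import numpy as np

def collapse(path, blank=0):
    """Merge repeated symbols, then drop blanks (the CTC collapsing rule)."""
    out, prev = [], None
    for p in path:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return out

def ctc_forced_align(log_probs, labels, blank=0):
    """Most likely per-frame path that collapses to `labels`.

    log_probs: (T, V) array of per-frame log-probabilities.
    labels:    list of target symbol ids, without blanks.
    """
    T = log_probs.shape[0]
    ext = [blank]                      # blank, l1, blank, l2, ..., blank
    for s in labels:
        ext += [s, blank]
    S = len(ext)

    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        score[0, 1] = log_probs[0, ext[1]]

    for t in range(1, T):
        for s in range(S):
            cands = [s]                # stay on the same extended position
            if s >= 1:
                cands.append(s - 1)    # advance by one position
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(s - 2)    # skip the blank between two different labels
            best = max(cands, key=lambda k: score[t - 1, k])
            score[t, s] = score[t - 1, best] + log_probs[t, ext[s]]
            back[t, s] = best

    ends = [S - 1] if S == 1 else [S - 1, S - 2]
    s = max(ends, key=lambda k: score[T - 1, k])
    if not np.isfinite(score[T - 1, s]):
        raise ValueError("too few frames to align the given labels")

    path = []
    for t in range(T - 1, -1, -1):     # follow back-pointers from the last frame
        path.append(ext[s])
        s = back[t, s]
    return path[::-1]

# Toy example: 6 frames, vocabulary {0: blank, 1: 'a', 2: 'b'}.
probs = np.full((6, 3), 0.1)
for t, k in enumerate([0, 0, 1, 1, 0, 2]):
    probs[t, k] = 0.8
path = ctc_forced_align(np.log(probs), labels=[1, 2])
print(path)            # [0, 0, 1, 1, 0, 2]
print(collapse(path))  # [1, 2]
```

Frames 2 and 3 carry label 1 and frame 5 carries label 2, so multiplying frame indices by the model's frame shift would give approximate timestamps for each symbol.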

3.3 VAD Refinement

After CTC alignment, VAD algorithms (e.g., energy-based or neural-network-based) are used to locate the precise start and end of speech within the roughly aligned segments, removing leading and trailing silence or noise.
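A simple energy-based variant of this refinement can be sketched as follows. This is an assumed, minimal approach, not the specific VAD used for J-MAC: the function name `trim_silence`, the 20 ms frame size, and the -40 dB threshold are illustrative choices, and the waveform is assumed to be floating point in the range [-1, 1].

```python
import numpy as np

def trim_silence(wave, sr, frame_ms=20, threshold_db=-40.0):
    """Return (start_sample, end_sample) of the active region of `wave`.

    A frame counts as speech when its RMS level, in dB relative to
    full scale, exceeds `threshold_db`.
    """
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(wave) // frame_len
    if n_frames == 0:
        return 0, len(wave)
    frames = wave[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    level_db = 20.0 * np.log10(np.maximum(rms, 1e-10))  # floor avoids log(0)
    active = np.flatnonzero(level_db > threshold_db)
    if active.size == 0:
        return 0, 0                                     # no speech detected
    start = int(active[0] * frame_len)
    end = int((active[-1] + 1) * frame_len)
    return start, end

# Synthetic segment: 0.5 s silence, 1 s tone, 0.5 s silence at 16 kHz.
sr = 16000
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
wave = np.concatenate([np.zeros(sr // 2), tone, np.zeros(sr // 2)])
start, end = trim_silence(wave, sr)
print(start, end)  # 8000 24000
```

Real audiobook audio would usually need a noise-adaptive threshold and some padding around the detected boundaries so that soft onsets are not clipped.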

4. Evaluation & Results

The authors conducted audiobook speech synthesis evaluations using models trained on J-MAC. Key findings include:

  • Method Generalization: Improving the underlying synthesis method (e.g., a better acoustic model) improved the naturalness of synthetic speech across all speakers in the corpus.
  • Entangled Factors: The naturalness of synthesized audiobook speech was strongly affected by a complex interaction among the synthesis method, the target speaker's voice characteristics, and the specific book/content being synthesized. Disentangling these factors remains a challenge.

Evaluation Insight

Core Result: Synthesis quality depends on a non-trivial Speaker x Method x Content interaction.
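To make the idea of an interaction concrete, the sketch below splits a table of mean scores into a grand mean, per-method and per-speaker main effects, and an interaction residual. The MOS values are invented purely for illustration and are not results from the paper; the function name `interaction_table` is likewise an assumption.

```python
import numpy as np

def interaction_table(cell_means):
    """Residual of each cell after removing the grand mean and both main effects."""
    grand = cell_means.mean()
    row_effect = cell_means.mean(axis=1, keepdims=True) - grand  # e.g. method
    col_effect = cell_means.mean(axis=0, keepdims=True) - grand  # e.g. speaker
    additive = grand + row_effect + col_effect
    return cell_means - additive

# Invented mean MOS values (rows: two methods, columns: three speakers).
mos = np.array([
    [3.9, 3.2, 3.6],
    [3.5, 3.8, 3.4],
])
interaction = interaction_table(mos)
print(np.round(interaction, 2))
```

In this made-up table both methods have the same average score, yet the residuals are far from zero: which method scores higher depends on the speaker. A purely additive model (residuals near zero) would mean the ranking of methods is the same for every speaker.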

5. Key Insights & Discussion

  • J-MAC addresses a critical data-scarcity problem for expressive TTS research in Japanese.
  • The automated construction pipeline is a significant contribution: it reduces the cost and time of creating such corpora and can potentially be applied to other languages.
  • The evaluation underscores that audiobook synthesis is not simply a scaled-up version of single-sentence TTS; it requires modeling higher-level narrative context and speaker identity.
  • The "entanglement" finding suggests that future evaluation metrics and models need to account for multi-faceted factors.

6. Original Analysis: An Industry Perspective

Core Insight: The J-MAC paper is not just about a new dataset; it is a strategic move to shift the TTS paradigm from isolated utterance generation toward narrative-level modeling. The authors correctly identify that the next frontier of value in speech synthesis lies in long-form, expressive content such as audiobooks, podcasts, and interactive storytelling, areas where current TTS still sounds robotic and context-unaware. By releasing a multi-speaker corpus, they are not merely supplying data; they are establishing a benchmark and a research agenda.

Logical Flow: Their logic is hard to fault: 1) High-quality data is the fuel for deep learning. 2) Professional audiobooks are the gold standard for expressive, contextually coherent speech. 3) Building a corpus by hand is prohibitively expensive. An automated pipeline (separation → CTC alignment → VAD) is therefore the only scalable option. This aligns with the data-centric AI movement championed by Andrew Ng, in which the quality of the data pipeline matters as much as the model architecture.

Strengths & Flaws: The main strength is the practicality of the pipeline and its language-agnostic design. Relying on off-the-shelf components, such as source separation models (e.g., U-Net-style architectures like the one used in Demucs) and CTC-based ASR, makes it reproducible. However, the paper's weakness is that it only skims the very "context" problem it highlights. It provides the data (J-MAC) but offers few new modeling methods for exploiting cross-sentence context or for disentangling speaker style from content. The evaluation results, while illuminating, are descriptive rather than prescriptive. How do we actually model the "entangled" factors? Techniques from style transfer and disentangled representation learning, such as those in CycleGAN or variational autoencoders, are hinted at but not explored in depth.

Actionable Insights: For industry practitioners, the takeaway is twofold. First, invest in building or acquiring this kind of long-form, multi-style speech corpus; it will be a key differentiator. Second, the research priority should be context-aware architectures. This could mean Transformer-based models with very long context windows, or hierarchical models that separately encode local prosody, speaker style, and the global narrative arc. Work such as Google's SoundStream or Microsoft's VALL-E points to neural-codec-based approaches that could be extended with the contextual cues J-MAC provides. The future is not just sentence synthesis; it is performance synthesis.

7. Technical Details & Mathematical Formulation

The alignment stage relies heavily on the CTC objective. For an input sequence $\mathbf{x}$ (audio features) of length $T$ and a target label sequence $\mathbf{l}$ (text characters) of length $U$, where $T > U$, CTC introduces a blank symbol $\epsilon$ and defines a many-to-one mapping $\mathcal{B}$ from a path $\pi$ (of length $T$) to $\mathbf{l}$, which merges repeated symbols and then removes blanks. The probability of a path is $P(\pi|\mathbf{x}) = \prod_{t=1}^{T} y_{\pi_t}^t$, where $y_{\pi_t}^t$ is the probability of symbol $\pi_t$ at time $t$. The conditional probability of the label sequence is the sum over all paths that $\mathcal{B}$ maps to it: $P(\mathbf{l}|\mathbf{x}) = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{l})} P(\pi|\mathbf{x})$. This formulation lets the model learn alignments without pre-segmented data. In the J-MAC pipeline, a pre-trained CTC model (e.g., one based on an architecture like DeepSpeech2) produces these alignments for segmenting the audio.
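The sum over paths can be checked numerically on a toy problem. The sketch below computes $P(\mathbf{l}|\mathbf{x})$ twice: once by enumerating every length-$T$ path and keeping those that collapse to $\mathbf{l}$, and once with the standard forward (dynamic programming) recursion over the blank-extended label sequence. The function names, the random toy probabilities, and the choice of index 0 for the blank are assumptions for illustration only.

```python
import itertools
import numpy as np

def collapse(path, blank=0):
    """The mapping B: merge repeated symbols, then remove blanks."""
    out, prev = [], None
    for p in path:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return out

def ctc_prob_bruteforce(probs, labels, blank=0):
    """Sum of path probabilities over every path that B maps to `labels`."""
    T, V = probs.shape
    total = 0.0
    for path in itertools.product(range(V), repeat=T):
        if collapse(path, blank) == list(labels):
            total += float(np.prod([probs[t, k] for t, k in enumerate(path)]))
    return total

def ctc_prob_forward(probs, labels, blank=0):
    """Same quantity via the forward recursion, in O(T * U) time."""
    T = probs.shape[0]
    ext = [blank]                          # blank, l1, blank, l2, ..., blank
    for s in labels:
        ext += [s, blank]
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, blank]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s >= 1:
                a += alpha[t - 1, s - 1]
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]   # skip a blank between different labels
            alpha[t, s] = a * probs[t, ext[s]]
    tail = alpha[T - 1, S - 2] if S > 1 else 0.0
    return float(alpha[T - 1, S - 1] + tail)

# Toy check: T = 4 frames, 3 symbols (0 is the blank), random per-frame distributions.
rng = np.random.default_rng(0)
probs = rng.random((4, 3))
probs /= probs.sum(axis=1, keepdims=True)
for labels in ([1, 2], [1, 1], [2]):
    print(labels, ctc_prob_bruteforce(probs, labels), ctc_prob_forward(probs, labels))
```

The brute-force version enumerates all $V^T$ paths and is only feasible for tiny inputs; the forward recursion is what makes the objective practical. Production code would also work in log space to avoid underflow on long sequences.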

8. Experimental Results & Chart Description

While the provided PDF excerpt contains no explicit charts, the results described imply a multi-factor evaluation design. A hypothetical results figure illustrating the main finding would be a 3D surface plot or a set of grouped bar charts.

Chart Description: The y-axis represents the Mean Opinion Score (MOS) for naturalness (e.g., on a 1-5 scale). The x-axis lists the synthesis methods (e.g., Tacotron2, FastSpeech2, a proposed model). The groups (or a z-axis) represent different J-MAC speakers (Speaker A, B, C) and/or different books (Book X, Book Y). The key visual finding would be that bar heights (MOS) do not follow a consistent pattern across groups. For example, Method 1 might be best for Speaker A on Book X but worst for Speaker B on Book Y, making the "strong entanglement of factors" visible. Error bars would likely show substantial overlap, underlining the difficulty of drawing simple conclusions.

9. Analysis Framework: Example Case

Case Study: Evaluating a New TTS Model for Audiobooks

Objective: Determine whether "Model-Z" improves on a baseline for audiobook synthesis using J-MAC.

Procedure:

  1. Data Split: Split J-MAC by book and speaker. Make sure the test set contains both unseen sentences from books seen in training (in-domain) and entirely unseen books (out-of-domain).
  2. Model Training: Train both the baseline (e.g., FastSpeech2) and Model-Z on the same training split, using J-MAC's text-audio pairs.
  3. Controlled Evaluation: Generate speech for the same set of texts across all test conditions (Speaker x Book combinations).
  4. Metrics:
    • Primary: MOS for naturalness and expressiveness.
    • Secondary: ASR Word Error Rate (WER) on the synthetic speech (intelligibility), and a speaker similarity score (e.g., using a speaker verification model such as ECAPA-TDNN).
    • Contextual: An A/B test in which raters listen to two consecutive synthesized sentences and rate their coherence.
  5. Analysis: Run an ANOVA or a similar statistical analysis to separate the effects of Model, Speaker, Book, and their interactions on MOS scores. The null hypothesis would be "Model-Z has no effect independent of Speaker and Book."
This procedure directly addresses the entanglement problem highlighted in the paper.
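For the secondary intelligibility metric, an error rate can be computed from the edit distance between the reference text and the ASR transcript of the synthesized audio. The sketch below is a plain-Python illustration; the function names are assumptions, and the ASR step that would produce the hypothesis is left out. Because written Japanese has no spaces between words, the same function is applied here to character lists, which yields a character error rate rather than a WER.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def error_rate(ref_tokens, hyp_tokens):
    """(substitutions + deletions + insertions) / reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)

# Word-level example: one substitution and one insertion over three reference words.
wer = error_rate("the cat sat".split(), "the bat sat down".split())

# Character-level example for Japanese: one substituted character out of seven.
cer = error_rate(list("吾輩は猫である"), list("吾輩は犬である"))
print(wer, cer)
```

Scores from this function can then be aggregated per Model, Speaker, and Book condition alongside MOS before running the statistical analysis in step 5.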

10. Future Applications & Research Directions

  • Personalized Audiobooks: Synthesizing books in a user's preferred narrator voice, or even a clone of the user's own voice.
  • Dynamic Narration for Games/XR: Generating context-aware, expressive dialogue and narration in real time for interactive media.
  • Accessibility: Substantially reducing the time and cost of producing audiobooks for blind and visually impaired readers, or for books in low-resource languages.
  • Research Directions:
    1. Disentangled Representation Learning: Developing models that explicitly separate content, speaker style, emotion, and narrative tone into distinct latent variables.
    2. Long-Context Modeling: Using efficient Transformer variants (e.g., Longformer, Performer) to condition synthesis on whole paragraphs or chapters.
    3. Prosody Transfer & Control: Enabling fine-grained control over pace, emphasis, and intonation across long passages, possibly using a reference audio clip as a style prompt.
    4. Cross-Lingual Extension: Applying the J-MAC construction pipeline to build similar corpora for other languages, enabling comparative studies.

11. References

  1. J. Shen et al., "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions," ICASSP 2018.
  2. A. Vaswani et al., "Attention Is All You Need," NeurIPS 2017.
  3. Y. Ren et al., "FastSpeech: Fast, Robust and Controllable Text to Speech," NeurIPS 2019.
  4. J.-Y. Zhu et al., "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks," ICCV 2017 (CycleGAN).
  5. A. Défossez et al., "Demucs: Deep Extractor for Music Sources with extra unlabeled data remixed," arXiv:1909.01174.
  6. A. van den Oord et al., "WaveNet: A Generative Model for Raw Audio," arXiv:1609.03499.
  7. J. Kong et al., "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis," NeurIPS 2020.
  8. N. Zeghidour et al., "SoundStream: An End-to-End Neural Audio Codec," arXiv:2107.03312.
  9. A. Graves et al., "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks," ICML 2006.
  10. Andrew Ng, "Data-Centric AI," DeepLearning.AI.