Audiobook-CC: A Controllable Long-Context Speech Generation Framework for Multicast Audiobooks

An analysis of Audiobook-CC, a new speech synthesis framework for generating coherent, emotionally expressive multicast audiobooks with fine-grained control and long-context modeling.

1. Introduction & Overview

Existing text-to-speech (TTS) systems are mostly optimized for single-sentence synthesis. They lack the architecture needed to model long-range dependencies and offer no fine-grained control over performance attributes such as emotion and character consistency. This leaves a significant gap in the automatic production of high-quality audiobooks, which require narrative coherence and distinct, emotionally engaging character voices across long chapters.

The paper "Audiobook-CC: Controllable Long-Context Speech Generation for Multicast Audiobook" addresses this gap. It proposes a new framework built on three core innovations: a context mechanism for coherence across sentences, a training paradigm that decouples style control from the speech prompt, and a distillation technique that improves emotional expressiveness and instruction following.

2. Methodology & Architecture

The Audiobook-CC framework is designed specifically for long-form, multi-character audiobooks. Its pipeline splits the long text into chapters, analyzes the text and character personas, extracts narration and dialogue, assigns voices through casting, and synthesizes speech with the proposed model architecture.
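The text-side stages of this pipeline can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Chapter N` heading convention, the `"...," said Name` dialogue pattern, the `Segment` type, and the voice names are hypothetical, and the paper's own persona analysis and casting are certainly more sophisticated than regular expressions. The synthesis stage is omitted.

```python
import re
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Segment:
    kind: str      # "narration" or "dialogue"
    speaker: str   # "narrator" or a character name
    text: str


# Assumed conventions for this toy example only.
CHAPTER_HEADING = re.compile(r"(?m)^Chapter[ \t]+\d+[ \t]*$")
DIALOGUE = re.compile(r'"([^"]+)"\s*(?:said|asked)\s+(\w+)')


def split_chapters(book):
    """Split the book on heading lines and drop empty parts."""
    return [part.strip() for part in CHAPTER_HEADING.split(book) if part.strip()]


def _clean(text):
    """Trim whitespace and a leading period left over after a dialogue tag."""
    return text.strip().lstrip(".").strip()


def extract_segments(chapter):
    """Separate attributed dialogue from the surrounding narration."""
    segments, pos = [], 0
    for match in DIALOGUE.finditer(chapter):
        narration = _clean(chapter[pos:match.start()])
        if narration:
            segments.append(Segment("narration", "narrator", narration))
        segments.append(Segment("dialogue", match.group(2), match.group(1)))
        pos = match.end()
    tail = _clean(chapter[pos:])
    if tail:
        segments.append(Segment("narration", "narrator", tail))
    return segments


def cast_voices(segments, voice_pool):
    """Give each speaker one voice and reuse it for all of their lines."""
    casting = {"narrator": "narrator_voice"}
    pool = cycle(voice_pool)  # voices repeat if speakers outnumber them
    for segment in segments:
        if segment.speaker not in casting:
            casting[segment.speaker] = next(pool)
    return casting


book = """Chapter 1
The rain had stopped. "We should leave," said Mira. "Where would we go?" asked Tomas. Nobody answered.
Chapter 2
Morning came slowly."""

chapters = split_chapters(book)
segments = extract_segments(chapters[0])
casting = cast_voices(segments, ["voice_a", "voice_b"])
```

The point of the sketch is the data flow: once every segment carries a speaker label and every speaker has a fixed voice, the synthesis model can be called segment by segment while character identity stays stable across the chapter.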

2.1 Context Modeling Mechanism

To overcome the "context blindness" of earlier TTS systems in long-form generation, Audiobook-CC incorporates an explicit context modeling mechanism. This component captures and uses semantic information from preceding sentences, so that the prosody, speaking rate, and emotional tone of the current utterance stay consistent with the unfolding narrative. This addresses a major weakness of systems such as AudioStory and MultiActor-Audiobook, which process sentences in isolation.

2.2 Disentanglement Training Paradigm

A key challenge in controllable TTS is the entanglement between the semantic content of the text and the style or emotion information carried by the speech prompt. Audiobook-CC uses a new disentanglement training paradigm that decouples the style of the generated speech from the acoustic characteristics of whatever speech prompt is supplied. As a result, the prosody and emotion of the output follow the semantic instructions and contextual cues faithfully, rather than being dominated by the acoustic properties of the prompt. The paradigm draws on representation learning techniques seen in fields such as image synthesis (for example, the disentanglement principles explored in CycleGAN), applied here to the speech domain.

2.3 Distillation for Emotional Expressiveness

To improve the model's capacity for nuanced emotional expression and its responsiveness to natural-language instructions (for example, "read this sadly"), the authors propose a distillation method. The technique likely involves training the model on its own refined outputs, or constructing a higher-quality training signal that emphasizes emotional variation and instruction adherence, thereby "distilling" stronger control into the final model.
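One way to picture "training the model on its own refined outputs" is a selection loop that keeps only the most expressive generations as new training pairs. The sketch below is an assumption about how such a loop could look, not the authors' procedure: `select_distillation_pairs`, `toy_generate`, and `toy_score` are hypothetical names, and the toy scorer stands in for whatever emotion or instruction-adherence measure a real system would use.

```python
def select_distillation_pairs(prompts, generate, score, n_candidates=4, min_score=0.5):
    """For each prompt, sample several candidates and keep the best one
    if it clears the quality threshold."""
    selected = []
    for prompt in prompts:
        candidates = [generate(prompt, seed) for seed in range(n_candidates)]
        best = max(candidates, key=lambda sample: score(prompt, sample))
        if score(prompt, best) >= min_score:
            selected.append((prompt, best))
    return selected


def toy_generate(prompt, seed):
    """Hypothetical generator: expresses emotion only when instructed,
    with intensity varying by random seed."""
    base = 0.25 if "sadly" in prompt else 0.0
    return {"prompt": prompt, "intensity": base * seed}


def toy_score(prompt, sample):
    """Hypothetical scorer: here simply the emotional intensity."""
    return sample["intensity"]


pairs = select_distillation_pairs(
    ["read this sadly", "read this"], toy_generate, toy_score
)
```

In this toy run only the instructed prompt yields a candidate above the threshold, so only that pair would be added to the fine-tuning set.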

3. Technical Details & Mathematical Formulation

Although the PDF does not give complete equations, the core technical contributions can be formalized conceptually. The context mechanism probably consists of a transformer encoder that processes a window of preceding text tokens $\mathbf{C} = \{x_{t-k}, \ldots, x_{t-1}\}$ together with the current token $x_t$ to produce a context-aware representation $\mathbf{h}_t^c = f_{\text{context}}(\mathbf{C}, x_t)$.
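The windowing in this formulation can be made concrete with a small sketch. It builds the pairs $(\mathbf{C}, x_t)$ for a window size $k$ and fuses them with a deliberately simple stand-in for $f_{context}$: a weighted blend of the current vector with the mean of the context vectors. The blend is an assumption chosen for readability, not a transformer encoder, and the `weight` parameter is hypothetical.

```python
def context_windows(items, k):
    """Yield (context, current) pairs, where context holds up to k preceding items."""
    for t, current in enumerate(items):
        yield items[max(0, t - k):t], current


def f_context(context_vecs, current_vec, weight=0.5):
    """Toy fusion (NOT a transformer): blend the current vector with the
    mean of the context vectors. With no context, return the current vector."""
    if not context_vecs:
        return list(current_vec)
    dim = len(current_vec)
    mean = [sum(vec[i] for vec in context_vecs) / len(context_vecs) for i in range(dim)]
    return [(1 - weight) * c + weight * m for c, m in zip(current_vec, mean)]


windows = list(context_windows(["s0", "s1", "s2", "s3"], k=2))
h = f_context([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])
```

The first items in a chapter simply receive a shorter context, which is why the window uses `max(0, t - k)` rather than padding.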

The disentanglement loss can be interpreted as the mutual information between the style code $\mathbf{s}$ extracted from the prompt and the semantic representation $\mathbf{z}$ of the target text. Minimizing it during training encourages the two to be independent: $\mathcal{L}_{\text{disentangle}} = I(\mathbf{s}; \mathbf{z})$.
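To see what driving $I(\mathbf{s}; \mathbf{z})$ toward zero means, it helps to compute mutual information on a tiny discrete example. The sketch below assumes that style and content each take two discrete values and that their joint distribution is known. Real style and content representations are continuous, so in practice the quantity has to be estimated or replaced by a proxy; this summary does not say which approach the paper takes.

```python
import math


def mutual_information(joint):
    """I(S; Z) in nats for a joint probability table joint[s][z]."""
    p_s = [sum(row) for row in joint]          # marginal over style
    p_z = [sum(col) for col in zip(*joint)]    # marginal over content
    total = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:                          # 0 * log(0) is treated as 0
                total += p * math.log(p / (p_s[i] * p_z[j]))
    return total


# Style tells us nothing about content: the ideal, disentangled case.
independent = [[0.25, 0.25],
               [0.25, 0.25]]
# Style fully determines content: the fully entangled case.
entangled = [[0.5, 0.0],
             [0.0, 0.5]]

mi_independent = mutual_information(independent)
mi_entangled = mutual_information(entangled)
```

The independent table gives zero mutual information, while the entangled table gives $\ln 2$, the maximum for two binary variables. A disentanglement objective pushes the model from the second situation toward the first.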

The distillation process may use a teacher-student framework, in which a teacher model (or an earlier checkpoint) generates expressive samples and the student model is trained to match that output while still following the original training objectives, formalized as: $\mathcal{L}_{\text{distill}} = \text{KL}(P_{\text{student}}(y|x) \,\|\, P_{\text{teacher}}(y|x))$.
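The behaviour of this loss can be checked on small discrete distributions. The sketch below keeps the argument order of the formula above, with the student first and the teacher second. The three-way distributions are invented numbers, for instance probabilities over three candidate output tokens, and are not taken from the paper.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions.
    Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [0.7, 0.2, 0.1]
student_far = [1 / 3, 1 / 3, 1 / 3]     # untrained student: uniform guess
student_close = [0.6, 0.25, 0.15]       # student after some training

loss_far = kl_divergence(student_far, teacher)
loss_close = kl_divergence(student_close, teacher)
loss_match = kl_divergence(teacher, teacher)
```

The loss shrinks as the student's distribution approaches the teacher's and reaches zero only when the two match, which is the property the distillation objective relies on.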

4. Experimental Results & Evaluation

The paper reports that Audiobook-CC outperforms existing baselines on key metrics for audiobook generation.

The evaluation includes ablation studies that validate the contribution of each proposed component (context modeling, disentanglement, distillation). The results presumably show that removing any one of these three pillars leads to a measurable drop in performance, confirming that each is necessary. Demo samples are available on the project website.

5. Analysis Framework: Core Insight & Critique

Core Insight: The Ximalaya team is not just building another TTS model; they are engineering a narrative intelligence engine. The real novelty of Audiobook-CC is that it treats an audiobook chapter not as a sequence of independent sentences but as a cohesive dramatic unit, in which context determines emotion and character identity is a persistent, controllable variable. This shifts the paradigm from speech synthesis to story synthesis.

Logical Flow: The paper correctly identifies the industry's pain point: cost and scale. Manual audiobook production is prohibitive for the long-tail content that dominates platforms such as Ximalaya. The solution logically chains three technical components: context (for coherence), disentanglement (for clean control), and distillation (for quality). The progression from problem to architectural response is coherent and commercially sensible.

Strengths & Flaws: The strength is clear: tackling long context and multi-character control in a single framework is a formidable engineering challenge. The proposed disentanglement approach is particularly elegant and plausibly resolves the "voice leakage" problem, in which the accent of a prompt contaminates the intended character. The paper's flaw, however, is its opacity about data. Audiobook-quality TTS lives and dies by its training data. Without details on the size, diversity, and labeling (emotion, character) of the proprietary dataset, it is impossible to judge how reproducible or generalizable this success is. Is this a fundamental algorithmic advance, or a victory of large, carefully curated data? The ablation studies validate the architecture, but the data engine remains a black box.

Actionable Insights: For competitors and researchers, the takeaway is clear: the next battleground in TTS is long-context controllability. It is critical to invest in research that moves beyond sentence-level metrics such as MOS (Mean Opinion Score) toward chapter-level metrics for narrative flow and character consistency. For content platforms, the implication is a rapid democratization of high-quality audio content production, which will lower the barrier for niche genres and independent authors.
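As an illustration of what a chapter-level character consistency metric might look like, the sketch below averages the pairwise cosine similarity of each character's utterance embeddings within one chapter. This is a hypothetical metric proposed here for illustration, not one reported in the paper. The two-dimensional embeddings are invented and would in practice come from some speaker encoder.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def character_consistency(utterances):
    """utterances: list of (character, embedding) for one chapter.
    Returns the mean pairwise cosine similarity per character;
    characters with a single utterance are skipped."""
    by_character = {}
    for character, embedding in utterances:
        by_character.setdefault(character, []).append(embedding)
    scores = {}
    for character, embeddings in by_character.items():
        pairs = list(combinations(embeddings, 2))
        if pairs:
            scores[character] = sum(cosine(a, b) for a, b in pairs) / len(pairs)
    return scores


chapter = [
    ("Mira", [1.0, 0.0]),
    ("Tomas", [0.0, 1.0]),
    ("Mira", [1.0, 0.0]),
    ("Tomas", [1.0, 0.0]),   # Tomas drifts to a different voice
    ("Mira", [0.8, 0.6]),
]
scores = character_consistency(chapter)
```

A score near 1 means a character sounds like the same speaker throughout the chapter, while a low score flags drift that a sentence-level MOS rating would not reveal.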

6. Application Outlook & Future Directions

The implications of Audiobook-CC extend beyond traditional audiobooks.

Future Research Directions:

  1. Cross-Lingual and Cross-Cultural Voice Consistency: Preserving a character's vocal identity when the same story is synthesized in different languages.
  2. Real-Time, Interactive Story Generation: Adapting narrative tone and character emotion in real time based on listener feedback or choices.
  3. Integration with Multimodal LLMs: Coupling the synthesis framework with large language models that can generate the story script, character descriptions, and emotion instructions in an end-to-end story creation pipeline.
  4. Ethical Voice Cloning and Attribution: Developing robust safeguards and attribution mechanisms as the technology makes high-quality voice synthesis more accessible.

7. References

  1. MultiActor-Audiobook: [presumed cited work; exact citation as given in the PDF].
  2. AudioStory: [citation as given in the PDF].
  3. Dopamine Audiobook: [citation as given in the PDF].
  4. MM-StoryAgent: [citation as given in the PDF].
  5. Shaja et al. (Spatial Audio for TTS): [citation as given in the PDF].
  6. CosyVoice & CosyVoice 2: [citation as given in the PDF].
  7. MoonCast: [citation as given in the PDF].
  8. MOSS-TTSD: [citation as given in the PDF].
  9. CoVoMix: [citation as given in the PDF].
  10. koel-TTS: [citation as given in the PDF].
  11. Zhu, J.-Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In ICCV. (External reference for disentanglement concepts.)
  12. OpenAI. (2023). GPT-4 Technical Report. (External reference for LLM capabilities in story generation.)
  13. Google AI. (2023). AudioLM: A Language Modeling Approach to Audio Generation. (External reference for audio generation frameworks.)