MultiActor-Audiobook: Zero-Shot Audiobook Generation with Faces and Voices

A technical analysis of MultiActor-Audiobook, a novel zero-shot framework for generating expressive audiobooks using multimodal speaker personas and LLM-based script instructions.

1. Introduction & Overview

MultiActor-Audiobook presents a zero-shot framework for generating expressive audiobooks that feature multiple distinct speakers. It addresses key limitations of prior systems: the high cost of large voice-actor datasets, the domain specificity of trained models, and the labor-intensive nature of manual prosody annotation. The core innovation lies in its two automated, zero-shot processes: Multimodal Speaker Persona Generation (MSP) and LLM-based Script Instruction Generation (LSI). By synthesizing character-specific voices from generated visual personas and inferring emotional and prosodic cues from the textual context, the system aims to produce audiobooks with consistent, appropriate, and expressive narration without any task-specific training data.

2. Core Methodology

The system's effectiveness rests on two novel, interlinked processes that automate the most challenging parts of audiobook production: character voice creation and expressive reading.

2.1 Multimodal Speaker Persona Generation (MSP)

This process creates a unique, consistent voice for each character in a story from textual descriptions alone.

  1. Entity Identification & Text Persona Extraction: An LLM (e.g., GPT-4) parses the story script to identify all speaking entities (characters and the narrator). For each one, it extracts descriptive features (personality, age, role, physical traits) from the narrative text.
  2. Visual Persona Generation: A text-to-image model (e.g., Stable Diffusion) uses the extracted textual description to generate a face image that visually embodies the character.
  3. Face-to-Voice Synthesis: A pretrained Face-to-Voice system (referencing work such as [14]) takes the generated face image and its caption and synthesizes a short voice sample. This sample captures the character's distinctive vocal characteristics (timbre, pitch level, speaking style) and serves as the reference voice for all of that character's subsequent dialogue.
This pipeline is fully zero-shot for new characters and requires no prior recordings.
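
The three MSP steps can be sketched as a small orchestration function. This is a minimal illustration, not the authors' implementation: the names `SpeakerPersona`, `build_personas`, and the three callables are hypothetical, and the stand-in functions at the bottom only mimic the shape of real LLM, text-to-image, and face-to-voice models so that the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SpeakerPersona:
    """One speaking entity with its text, visual, and vocal persona."""
    name: str
    description: str     # text persona extracted by the LLM
    face_image: bytes    # face generated from the description
    voice_sample: bytes  # reference voice derived from the face and caption


def build_personas(
    story: str,
    extract_personas: Callable[[str], Dict[str, str]],
    text_to_image: Callable[[str], bytes],
    face_to_voice: Callable[[bytes, str], bytes],
) -> List[SpeakerPersona]:
    """Run the three MSP steps once per speaking entity found in the story."""
    personas = []
    for name, description in extract_personas(story).items():
        face = text_to_image(description)          # step 2: visual persona
        voice = face_to_voice(face, description)   # step 3: face + caption -> voice
        personas.append(SpeakerPersona(name, description, face, voice))
    return personas


# Hypothetical stand-ins for the frozen models (assumptions, not real APIs).
def fake_extract(story: str) -> Dict[str, str]:
    return {
        "narrator": "calm adult narrator",
        "wizard": "old wizard, long grey beard, gravelly voice",
    }


def fake_text_to_image(description: str) -> bytes:
    return ("face:" + description).encode()


def fake_face_to_voice(face: bytes, caption: str) -> bytes:
    return b"voice:" + face


personas = build_personas(
    "The old wizard muttered a warning.",
    fake_extract,
    fake_text_to_image,
    fake_face_to_voice,
)
for persona in personas:
    print(persona.name, "->", persona.voice_sample[:24])
```

Passing the models in as callables keeps every component frozen and swappable, which is the property the zero-shot claim depends on.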

2.2 LLM-based Script Instruction Generation (LSI)

To avoid monotonous reading, this process generates dynamic, sentence-level prosody instructions.

  1. Context-Aware Analysis: For each sentence to be synthesized, the LLM is given the target sentence, the surrounding context (preceding and following sentences), and the current speaker's persona description.
  2. Instruction Generation: The LLM outputs a structured set of instructions specifying the emotional state (e.g., "joyful," "somber"), tone (e.g., "sarcastic," "authoritative"), pitch variation, and speaking rate appropriate to the context and character.
  3. Prompting for TTS: These instructions are formatted into a natural-language prompt (e.g., "Say this in a [emotion] tone with [pitch] variation") that guides a pretrained, promptable Text-to-Speech (TTS) model to generate the final audio.
This replaces manual prosody annotation with automated, context-sensitive inference.
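
A rough sketch of the three LSI steps follows: build the LLM prompt, parse its structured reply, and format a natural-language prompt for the TTS model. The prompt wording, the `key: value` reply format, and the function names are assumptions made for illustration; the source does not specify them, and the LLM reply here is a hard-coded example string rather than real model output.

```python
from typing import Dict, List

# The four fields named in the text: emotion, tone, pitch variation, speaking rate.
INSTRUCTION_KEYS = ("emotion", "tone", "pitch", "rate")


def build_lsi_prompt(sentence: str, before: List[str], after: List[str], persona: str) -> str:
    """Step 1: bundle target sentence, surrounding context, and speaker persona."""
    return (
        "You are directing an audiobook narration.\n"
        f"Speaker persona: {persona}\n"
        f"Previous sentences: {' '.join(before)}\n"
        f"Target sentence: {sentence}\n"
        f"Following sentences: {' '.join(after)}\n"
        "Answer with one 'key: value' line for each of: " + ", ".join(INSTRUCTION_KEYS)
    )


def parse_instruction(raw: str) -> Dict[str, str]:
    """Step 2: read the reply into a dict, ignoring lines with unknown keys."""
    parsed = {}
    for line in raw.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip().lower()
        if sep and key in INSTRUCTION_KEYS:
            parsed[key] = value.strip()
    return parsed


def to_tts_prompt(instruction: Dict[str, str]) -> str:
    """Step 3: turn the structured instruction into a natural-language TTS prompt."""
    emotion = instruction.get("emotion", "neutral")
    tone = instruction.get("tone", "plain")
    pitch = instruction.get("pitch", "natural")
    rate = instruction.get("rate", "moderate")
    return (
        "Say this sentence with the following delivery: "
        f"emotion {emotion}; tone {tone}; pitch {pitch}; speaking rate {rate}."
    )


prompt = build_lsi_prompt(
    "Beware the shadows.",
    before=["The old wizard muttered a warning."],
    after=[],
    persona="old wizard, long grey beard, gravelly voice",
)
# Example reply written by hand for this sketch, not produced by a model.
raw = "emotion: grave concern\ntone: ominous\npitch: low and steady\nrate: slow"
instruction = parse_instruction(raw)
tts_prompt = to_tts_prompt(instruction)
print(tts_prompt)
```

Keeping the instruction as a small dictionary before rendering it to text makes it easy to log, validate, or edit by hand before synthesis.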

3. Technical Architecture & Details

3.1 System Pipeline

The end-to-end workflow can be viewed as a sequential pipeline with a per-character stage followed by a per-sentence stage.
Per character: Input Story Text → LLM (Speaker Identification & Persona Extraction) → Text2Image (Face Generation) → Face2Voice (Voice Sample)
Per sentence: [Sentence + Context + Persona] → LLM (LSI) → Prompt-TTS (with Character Voice) → Output Audio Segment
The final audiobook is the temporal concatenation of the output segments of all processed sentences.
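
The per-sentence stage and the final concatenation can be sketched as a single loop. This is an illustration under stated assumptions: `synthesize_book`, the `window` parameter, and the two stand-in callables are hypothetical, and audio is represented as a plain list of floats so the example needs no audio library.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Audio = List[float]  # placeholder waveform type for this sketch


def synthesize_book(
    sentences: Sequence[Tuple[str, str]],               # (speaker, sentence) pairs in reading order
    voices: Dict[str, bytes],                           # one reference voice per speaker
    make_instruction: Callable[[str, List[str], str], str],
    prompt_tts: Callable[[str, bytes, str], Audio],
    window: int = 2,                                    # assumed context size on each side
) -> Audio:
    """Synthesize each sentence with its speaker's voice and append it in order."""
    texts = [text for _, text in sentences]
    book: Audio = []
    for i, (speaker, text) in enumerate(sentences):
        # Context = up to `window` sentences before and after the target.
        context = texts[max(0, i - window): i] + texts[i + 1: i + 1 + window]
        instruction = make_instruction(text, context, speaker)
        book.extend(prompt_tts(text, voices[speaker], instruction))
    return book


# Hypothetical stand-ins: a fixed instruction, and one silent sample per word.
def fake_instruction(text: str, context: List[str], speaker: str) -> str:
    return f"neutral delivery for {speaker}"


def fake_tts(text: str, voice: bytes, instruction: str) -> Audio:
    return [0.0] * len(text.split())


script = [
    ("narrator", "The old wizard muttered a warning."),
    ("wizard", "Beware the shadows."),
]
book_audio = synthesize_book(
    script,
    voices={"narrator": b"voice-n", "wizard": b"voice-w"},
    make_instruction=fake_instruction,
    prompt_tts=fake_tts,
)
print(len(book_audio))
```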

3.2 Mathematical Formulation

The core generation process for a sentence $s_i$ spoken by character $c$ can be formalized as follows. Let $C$ be the context window around $s_i$, and let $P_c$ be the multimodal persona of character $c$ (comprising the text description $D_c$, the generated face $F_c$, and the voice sample $V_c$).

The LSI process generates an instruction vector $I_i$:
$$I_i = \text{LLM}_{\theta}(s_i, C, P_c)$$
where $\text{LLM}_{\theta}$ is the large language model with parameters $\theta$.

The final audio $A_i$ for the sentence is synthesized by a promptable TTS model $\text{TTS}_{\phi}$, conditioned on the character voice $V_c$ and the instruction $I_i$:
$$A_i = \text{TTS}_{\phi}(s_i \mid V_c, I_i)$$
The system's zero-shot capability stems from its use of pretrained, frozen models ($\text{LLM}_{\theta}$, Text2Image, Face2Voice, $\text{TTS}_{\phi}$) without any fine-tuning.
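
For completeness, the persona construction and the final assembly described in Sections 2.1 and 3.1 can be written in the same notation. The symbols $\text{T2I}$, $\text{F2V}$, $\oplus$ (temporal concatenation), $N$ (number of sentences), and $A$ (the full audiobook) are introduced here for illustration only and are not taken from the source:

$$F_c = \text{T2I}(D_c), \qquad V_c = \text{F2V}(F_c, D_c), \qquad P_c = (D_c, F_c, V_c)$$

$$A = A_1 \oplus A_2 \oplus \cdots \oplus A_N$$

Read this way, $V_c$ is computed once per character and reused for every $A_i$ that character speaks, while $I_i$ is recomputed for each sentence.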

4. Experimental Results & Evaluation

The paper validates MultiActor-Audiobook through comparative evaluations against commercial audiobook products and through ablation studies.

4.1 Human Evaluation

Human raters scored generated audiobook samples on criteria such as emotional expressiveness, speaker consistency, and overall naturalness. MultiActor-Audiobook achieved ratings competitive with or higher than those of commercial TTS-based audiobook services. Notably, it outperformed baseline systems that used a single voice or simple prosody schemes, especially in dialogues involving multiple characters with distinct personas.

4.2 MLLM Evaluation

To complement the human evaluation, the authors used Multimodal Large Language Models (MLLMs) such as GPT-4V. The MLLM was presented with the audio and a description of the scene and character, and was asked to judge whether the vocal delivery fit the context. This automated measure indicated that the system produces context-appropriate prosody on par with commercial systems, supporting the effectiveness of the LSI module.
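
An MLLM-as-judge loop of this kind might look like the sketch below. The prompt wording, the yes/no protocol, and the `ask_mllm` callable are assumptions; the source does not describe the exact judging prompt. The stand-in judge returns canned answers, so the printed rate is a property of the stub and not an experimental result.

```python
from typing import Callable, List, Tuple


def judge_prompt(scene: str, persona: str) -> str:
    """Ask whether the delivery in the attached clip fits the described context."""
    return (
        "Listen to the attached audio clip.\n"
        f"Scene: {scene}\n"
        f"Speaker persona: {persona}\n"
        "Does the vocal delivery fit this context? Answer 'yes' or 'no'."
    )


def parse_verdict(reply: str) -> bool:
    """Treat any reply that begins with 'yes' (case-insensitive) as a fit."""
    return reply.strip().lower().startswith("yes")


def fit_rate(
    samples: List[Tuple[bytes, str, str]],     # (audio, scene, persona)
    ask_mllm: Callable[[bytes, str], str],     # hypothetical multimodal model call
) -> float:
    """Fraction of clips the judge accepts; 0.0 for an empty sample list."""
    verdicts = [
        parse_verdict(ask_mllm(audio, judge_prompt(scene, persona)))
        for audio, scene, persona in samples
    ]
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


# Stand-in judge for demonstration: accepts any non-empty clip.
def fake_mllm(audio: bytes, prompt: str) -> str:
    return "Yes, it fits." if audio else "No."


demo_samples = [
    (b"clip-a", "a warning at night", "old wizard"),
    (b"", "a warning at night", "old wizard"),
]
rate = fit_rate(demo_samples, fake_mllm)
print(rate)
```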

4.3 Ablation Studies

The ablation studies show the contribution of each core module:

  • Without MSP (using a generic voice): Speaker consistency and character distinctiveness dropped sharply, resulting in confusing dialogue.
  • Without LSI (using neutral TTS): The audio became monotonous and emotionally flat, scoring poorly on expressiveness metrics.
  • Full system (MSP + LSI): Achieved the highest scores across all evaluation dimensions, confirming that both modules are needed and that they reinforce each other.
These results support the proposed two-process design.
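
The three ablation conditions amount to toggling two switches on the TTS conditioning. The sketch below shows one way to express that; the `Variant` class, the variant names, and the `conditioning` helper are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Variant:
    name: str
    use_msp: bool  # character-specific voice vs. one generic voice
    use_lsi: bool  # generated instruction vs. neutral (empty) prompt


VARIANTS = [
    Variant("full", use_msp=True, use_lsi=True),
    Variant("no_msp", use_msp=False, use_lsi=True),
    Variant("no_lsi", use_msp=True, use_lsi=False),
]


def conditioning(
    variant: Variant,
    persona_voice: bytes,
    generic_voice: bytes,
    instruction: str,
) -> Tuple[bytes, str]:
    """Pick the voice and prompt that the TTS model would receive for a variant."""
    voice = persona_voice if variant.use_msp else generic_voice
    prompt = instruction if variant.use_lsi else ""
    return voice, prompt


for variant in VARIANTS:
    print(variant.name, conditioning(variant, b"wizard", b"generic", "ominous, slow"))
```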

5. Analysis Framework & Case Study

Framework Application: To prepare a story for production, the system follows a fixed sequence of steps.
Case Study: A Fantasy Story Excerpt

  1. Input: "The old wizard, his beard long and grey, muttered a warning. 'Beware the shadows,' he said, his voice like grinding stones."
  2. MSP Execution: The LLM identifies "the old wizard" as a speaker and extracts the persona {age: old, role: wizard, descriptor: long grey beard, voice quality: like grinding stones}. Text2Image generates an aged face. Face2Voice produces a deep, gravelly voice sample.
  3. LSI Execution for "Beware the shadows": The LLM receives the sentence, the context (a warning), and the wizard's persona. It generates the instruction {emotion: grave concern, tone: ominous and low, pitch: low and steady, pace: slow}.
  4. Output: The promptable TTS synthesizes "Beware the shadows" in the wizard's gravelly voice, delivered slowly, ominously, and at a low pitch.
This walk-through shows how textual cues are turned into expressive, multimodal audio without human intervention.

6. Critical Analysis & Expert Insights

Core Insight: MultiActor-Audiobook is not just another TTS wrapper; it is a pivot from data-centric to prompt-centric audio generation. Its real achievement is treating audiobook creation as a multimodal context-retrieval and instruction-following problem, sidestepping the prohibitive costs of conventional voice cloning and prosody modeling. This aligns with a broader industry shift, exemplified by models such as DALL-E and Stable Diffusion in vision, in which composing pretrained components replaces training a monolithic model.

Logical Flow: The logic is elegantly linear but rests on fragile assumptions. MSP assumes that a Face-to-Voice model maps any generated face to a fitting, consistent voice. That is a leap of faith given the known challenges of cross-modal representation learning (as seen in the mismatches between image and audio latent spaces discussed in work such as AudioCLIP). LSI assumes that an LLM's textual notion of an "ominous tone" translates reliably into acoustic parameters in the downstream TTS model. This semantic-acoustic gap remains a fundamental challenge, as noted in the speech processing literature.

Strengths & Flaws: The strength is undeniably economic and operational efficiency: zero-shot operation, no licensing problems for voice actors' voices, and rapid prototyping. The flaw is the quality ceiling. The system is only as good as its weakest off-the-shelf component, namely the Face2Voice model and the promptable TTS. It will struggle with subtlety and long-range consistency. Can it handle a character's voice breaking with emotion, a nuance that requires fine-grained prosodic control? Unlikely. Relying on a visual persona to determine a voice may also amplify bias, a well-documented problem in generative AI ethics.

Actionable Insights: For investors and product managers, this is a compelling MVP for niche markets: indie game development, rapid content localization, and personalized edutainment. For mainstream publishing that demands quality competitive with human narration, however, it is a complement, not a replacement. The near-term roadmap should focus on hybrid approaches: use the system to produce a rich "first draft" audiobook that a human director can then efficiently edit and polish, potentially cutting production time by 70-80% rather than aiming for 100% automation. The top research priority must be closing the semantic-acoustic gap through better joint embedding spaces, perhaps inspired by the alignment techniques used in multimodal models such as Flamingo or CM3.

7. Future Applications & Directions

The paradigm introduced by MultiActor-Audiobook opens several avenues:

  • Interactive Media & Gaming: Generating character dialogue in real time in games or interactive stories, based on player choices and evolving character states.
  • Accessibility & Education: Rapidly converting textbooks, documents, or personalized children's stories into engaging multi-voice narration, improving accessibility for visually impaired users or creating immersive learning materials.
  • Content Localization: Fast dubbing and voice-over for video content by generating culturally and character-appropriate voices in target languages, though this requires robust multilingual TTS backends.
  • Future Research Directions:
    1. Enhanced Persona Modeling: Incorporating more modalities (e.g., character actions, described sounds) beyond the face and text description to inform voice and prosody.
    2. Long-Context Consistency: Improving LSI to stay consistent with broader narrative arcs (e.g., a character's gradual emotional decline) across an entire book, not just local sentences.
    3. Direct Acoustic Parameter Prediction: Moving beyond natural-language instructions to having the LLM directly output interpretable acoustic feature targets (F0 contours, energy) for finer-grained control, similar in spirit to the approach taken in VALL-E but in a zero-shot setting.
    4. Ethical Voice Design: Developing frameworks to audit and debias the Face2Voice and persona-generation modules to prevent stereotyping.
The ultimate goal is a fully general, controllable, and ethical "story-to-soundtrack" synthesis engine.

8. References

  1. Tan, X., et al. (2022). NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality. arXiv preprint arXiv:2205.04421.
  2. Wang, C., et al. (2023). Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers. arXiv preprint arXiv:2301.02111.
  3. Huang, S.-F., et al. (2022). Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
  4. Radford, A., et al. (2023). Robust Speech Recognition via Large-Scale Weak Supervision. Proceedings of ICML.
  5. Kim, J., et al. (2021). VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. Proceedings of ICML.
  6. OpenAI. (2023). GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.
  7. Rombach, R., et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. Proceedings of the CVPR.
  8. Alayrac, J., et al. (2022). Flamingo: a Visual Language Model for Few-Shot Learning. Advances in Neural Information Processing Systems.
  9. Park, K., Joo, S., & Jung, K. (2024). MultiActor-Audiobook: Zero-Shot Audiobook Generation with Faces and Voices of Multiple Speakers. Manuscript submitted for publication.
  10. Guzhov, A., et al. (2022). AudioCLIP: Extending CLIP to Image, Text and Audio. Proceedings of the ICASSP.