1. Introduction
Automatic movie narration, or Audio Description (AD), is an assistive technology designed to make visual media accessible to visually impaired audiences. It involves generating concise, plot-relevant descriptions of visual content that are inserted into the gaps between dialogue. Unlike standard video captioning, which typically describes short, self-contained clips, movie narration requires understanding and summarizing plots that unfold across multiple shots and scenes, including character nuances, scene transitions, and sequences of events. This paper introduces Movie101v2, an improved, large-scale bilingual benchmark dataset designed to advance research in this challenging area. The work proposes a clear three-stage task roadmap and provides an extensive evaluation using state-of-the-art vision-language models.
2. Prior Work & Motivation
Earlier datasets such as LSMDC, M-VAD, MAD, and the original Movie101 laid the groundwork but suffer from significant limitations that hinder progress toward practical, real-world narration systems.
2.1. Limitations of Existing Datasets
- Scale & Scope: Many datasets are small (e.g., the original Movie101: 101 movies) or consist of short video clips (e.g., ~4-6 seconds), preventing models from learning long-range plot coherence.
- Language Barrier: The original Movie101 is Chinese-only, which limits the use of strong models pretrained on English.
- Data Quality: Automatically collected metadata often contains errors (e.g., missing characters, inconsistent names), reducing the reliability of training and evaluation.
- Task Simplification: Some datasets, such as LSMDC, replace character names with "someone", reducing the task to generic captioning and stripping out essential plot elements.
2.2. The Need for Movie101v2
Movie101v2 is introduced to address these gaps directly. It provides a high-quality, bilingual, large-scale resource that reflects the true complexity of the movie narration task, enabling more rigorous model development and evaluation.
3. The Movie101v2 Dataset
3.1. Key Features and Improvements
- Bilingual Narrations: Provides Chinese and English narrations for each video clip, broadening accessibility and model applicability.
- Expanded Scale: Substantially extended from the original 101 movies, offering a larger and more diverse collection of video-narration pairs.
- Improved Data Quality: Metadata has been manually verified and corrected, including accurate character lists and consistent name usage across narrations.
- Longer Video Segments: Features longer movie clips that contain more complex plot development, challenging models to maintain narrative coherence.
3.2. Dataset Statistics
- Movies: significantly more than 101
- Video-narration pairs: significantly more than 14,000
- Languages: 2 (Chinese and English)
- Average clip length: longer than the 4.1 s of MAD
4. The Three-Stage Task Roadmap
The paper reframes automatic movie narration as a progressive challenge with three distinct stages of increasing complexity.
4.1. Stage 1: Visual Fact Description
The foundational stage. Models must accurately describe what is visible within a single shot or short clip: scenes, characters, objects, and atomic actions. This corresponds to traditional dense video captioning. Evaluation focuses on the precision and recall of visual entities.
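The entity-level precision and recall mentioned for Stage 1 can be illustrated with a minimal sketch. It assumes entities are compared as case-insensitive string labels; the function name `entity_prf` and the example labels are hypothetical and do not come from the paper, which may match entities differently.

```python
from typing import Iterable


def entity_prf(predicted: Iterable[str],
               reference: Iterable[str]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) over sets of normalized entity labels."""
    # Assumption: exact match after lowercasing and trimming whitespace.
    pred = {p.strip().lower() for p in predicted}
    ref = {r.strip().lower() for r in reference}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels for one clip (illustrative only, not dataset content).
predicted = ["Li Wei", "car", "running"]
reference = ["Li Wei", "car", "street", "running"]
p, r, f1 = entity_prf(predicted, reference)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Set-based matching ignores repeated mentions and synonyms; a real evaluation would likely need entity linking or fuzzy matching on top of this.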
4.2. Stage 2: Plot Understanding
The intermediate stage. Models must understand causal relationships, character motivations, and plot development across multiple shots. This requires grasping not only what is seen, but also why it happens and what it means for the story. Metrics here assess semantic consistency and plot relevance.
4.3. Stage 3: Integrated Narration Generation
The final, application-ready stage. Models must generate fluent, concise, audience-appropriate narrations that seamlessly combine visual facts with plot understanding. A narration must fit within the gaps in dialogue, maintain temporal coherence, and be useful to a visually impaired viewer. Evaluation involves holistic metrics such as BLEU, ROUGE, and METEOR, together with human judgments of fluency, coherence, and usefulness.
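The requirement that a narration fit within a dialogue gap can be made concrete with a rough sketch. It estimates spoken duration from a word count and an assumed speaking rate; the function name `fits_dialogue_gap` and the default of 2.5 words per second are illustrative assumptions, not values from the paper.

```python
def fits_dialogue_gap(narration: str, gap_start: float, gap_end: float,
                      words_per_second: float = 2.5) -> bool:
    """Check whether a narration can plausibly be spoken within a silent gap.

    words_per_second is an assumed speaking rate, not a value from the paper.
    Times are in seconds.
    """
    if gap_end <= gap_start:
        raise ValueError("gap_end must be greater than gap_start")
    if words_per_second <= 0:
        raise ValueError("words_per_second must be positive")
    estimated_duration = len(narration.split()) / words_per_second
    return estimated_duration <= gap_end - gap_start


text = "She closes the door and walks away."
print(fits_dialogue_gap(text, 10.0, 13.5))  # 7 words in a 3.5 s gap
print(fits_dialogue_gap(text, 10.0, 12.0))  # 7 words in a 2.0 s gap
```

Whitespace tokenization only suits languages such as English; for the Chinese narrations a character-based rate would be a more sensible assumption.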
5. Experimental Setup & Baselines
5.1. Models Evaluated
The study establishes baselines using a range of large vision-language models (VLMs), including but not limited to:
- GPT-4V (Vision): the multimodal version of OpenAI's GPT-4.
- Other contemporary VLMs such as BLIP-2, Flamingo, and VideoLLaMA.
5.2. Evaluation Metrics
- Stage 1: Entity-based metrics (Precision, Recall, F1) for characters, objects, and actions.
- Stage 2: Semantics-based metrics, possibly using entailment models or prediction accuracy.
- Stage 3: Text generation metrics (BLEU-4, ROUGE-L, METEOR, CIDEr) and human evaluation scores.
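To show what a Stage 3 text generation metric measures, here is a simplified sketch of ROUGE-L based on the longest common subsequence (LCS). It is not the official ROUGE implementation: it assumes whitespace tokenization, lowercasing, a single reference, and a balanced F-measure, and the function names are illustrative.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L: harmonic mean of LCS precision and LCS recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the man opens the door",
                   "the man slowly opens the door")
print(round(score, 3))  # 0.909
```

The candidate above drops the word "slowly" yet still scores highly, which hints at why such overlap metrics can miss details that matter to a listener.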
6. Results & Analysis
6.1. Performance Across Stages
The benchmark results reveal a large performance gap across the three stages:
- Stage 1 (Visual Facts): Current VLMs achieve strong performance, showing good object and scene recognition ability.
- Stage 2 (Plot Understanding): Performance drops sharply. Models struggle with causal reasoning, understanding character relationships, and linking events across time.
- Stage 3 (Integrated Narration): Even the best models, such as GPT-4V, produce narrations that are often accurate but lack the narrative depth, story flow, and precise timing required for real-world AD. Automatic scores (BLEU, etc.) do not correlate well with human judgments of usefulness.
6.2. Challenges Identified
- Long-Range Modeling: Maintaining context over long video sequences is a fundamental weakness.
- Plot Reasoning: Moving from description to an understanding of story, motivation, and meaning.
- Audience-Aware Generation: Tailoring the output to be maximally informative for a non-sighted audience, which requires theory of mind.
- Evaluation Gap: Current automatic metrics are insufficient for assessing the quality of applied narration.
7. Technical Details & Formal Framework
The three-stage framework can be formalized. Let $V = \{v_1, v_2, \dots, v_T\}$ denote a sequence of video frames or clips. The goal is to generate a narration $N = \{w_1, w_2, \dots, w_M\}$.
Stage 1: Extract visual facts $F_t = \phi(v_t)$, where $\phi$ is a visual perception module that identifies entities and actions at time $t$.
Stage 2: Infer plot elements $P = \psi(F_{1:T})$, where $\psi$ is a plot reasoning module that builds a plot graph or causal chain from the sequence of facts.
Stage 3: Generate the narration $N = \Gamma(F_{1:T}, P, C)$. Here, $\Gamma$ is a language generation module conditioned not only on the facts $F_{1:T}$ and the plot $P$, but also on contextual constraints $C$ (e.g., timing relative to dialogue, conciseness).
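The composition of $\phi$, $\psi$, and $\Gamma$ can be sketched as a small pipeline. This is only an illustration of how the three modules chain together: frames are stood in for by comma-separated tag strings, $C$ is reduced to a word budget, and the `toy_*` functions are hypothetical placeholders rather than anything proposed in the paper.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Facts = list[str]  # F_t: entity/action tags for one time step
Plot = list[str]   # P: ordered plot events


@dataclass
class Constraints:
    """Stand-in for C; only a word budget is modeled here."""
    max_words: int


def narrate(frames: Sequence[str],
            phi: Callable[[str], Facts],
            psi: Callable[[Sequence[Facts]], Plot],
            gamma: Callable[[Sequence[Facts], Plot, Constraints], str],
            constraints: Constraints) -> str:
    facts = [phi(v) for v in frames]        # Stage 1: F_t = phi(v_t)
    plot = psi(facts)                       # Stage 2: P = psi(F_{1:T})
    return gamma(facts, plot, constraints)  # Stage 3: N = Gamma(F, P, C)


def toy_phi(frame: str) -> Facts:
    # Pretend each "frame" is already a comma-separated tag string.
    return [tag.strip() for tag in frame.split(",") if tag.strip()]


def toy_psi(facts: Sequence[Facts]) -> Plot:
    # Trivial "plot": link the first tag of each pair of consecutive steps.
    return [f"{a[0]} then {b[0]}" for a, b in zip(facts, facts[1:]) if a and b]


def toy_gamma(facts: Sequence[Facts], plot: Plot, c: Constraints) -> str:
    # Emit whole plot events until the word budget would be exceeded.
    out: list[str] = []
    for event in plot:
        words = event.split()
        if len(out) + len(words) > c.max_words:
            break
        out.extend(words)
    return " ".join(out)


frames = ["door opens, hallway", "woman enters, coat", "phone rings"]
narration = narrate(frames, toy_phi, toy_psi, toy_gamma, Constraints(max_words=6))
print(narration)  # door opens then woman enters
```

Passing the modules in as callables keeps each stage swappable, which matches the idea of evaluating and improving the stages separately.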
Example Diagnostic Framework (Non-Code): To diagnose a model failure, one can apply this framework. For a poor narration output, check: 1) Are key visual entities from Stage 1 missing or wrong? 2) Has the causal link between two events (Stage 2) been misinterpreted? 3) Is the language (Stage 3) fluent but inaccurate or incomplete? This structured diagnosis helps pinpoint the specific module that needs improvement.
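If the three diagnostic questions are answered by a human reviewer, the bookkeeping can be sketched in a few lines. The `Stage` enum and the `diagnose` function are hypothetical names for illustration; the sketch only records which stages failed and does not judge the narration itself.

```python
from enum import Enum


class Stage(Enum):
    VISUAL_FACTS = 1
    PLOT_UNDERSTANDING = 2
    NARRATION = 3


def diagnose(entities_correct: bool,
             causality_correct: bool,
             language_accurate_and_complete: bool) -> list[Stage]:
    """Return the stages whose check failed, in pipeline order."""
    checks = [
        (Stage.VISUAL_FACTS, entities_correct),              # question 1
        (Stage.PLOT_UNDERSTANDING, causality_correct),       # question 2
        (Stage.NARRATION, language_accurate_and_complete),   # question 3
    ]
    return [stage for stage, ok in checks if not ok]


# Entities are right, but the causal link was misread.
print(diagnose(True, False, True))
```

An empty list means none of the three checks flagged a problem, so the failure would have to be sought elsewhere, for example in timing.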
8. Original Analysis & Expert Insight
Core Insight: Movie101v2 is not just another dataset; it is a strategic intervention that correctly identifies the root cause of the stagnation in automatic AD research: the lack of a staged, measurable path from simple description to applied narration. By decomposing the monolithic task of "generating narration" into three more tractable sub-problems, the authors provide the scaffolding needed for incremental progress, much as the introduction of ImageNet and its hierarchical structure transformed object recognition.
Logical Flow: The paper's argument is convincing. It begins by identifying why earlier datasets (short clips, a single language, noise) produced models that score well on academic metrics but fail in real-world settings. The solution is twofold: 1) build a better dataset (Movie101v2) that reflects real-world complexity, and 2) define a clear roadmap (three stages) that forces the community to confront the plot-reasoning gap directly, rather than hiding it behind surface-level text generation scores.
Strengths & Flaws: The main strength is this conceptual framing. The three-stage roadmap is the paper's most valuable contribution and is likely to influence future benchmarks beyond movie narration. The bilingual aspect is a pragmatic step toward exploiting the full power of the English-dominated VLM ecosystem. However, a flaw lies in the implied linearity. In practice, these stages are deeply intertwined; human narrators do not separate facts, plot, and language. Evaluating each stage separately therefore risks remaining siloed. Moreover, although the dataset is larger, the real test will be its diversity across genres, directors, and cinematic styles to avoid bias, a lesson learned from the challenges of face recognition datasets.
Actionable Insight: For researchers: focus on Stage 2 (Plot Understanding). This is the new frontier. Techniques from computational narrative (e.g., plot graph generation, script learning) should be combined with models that have strong temporal reasoning (such as advanced video transformers). For industry (e.g., streaming platforms): partner with academia to use benchmarks such as Movie101v2 for in-house model development. The goal should be a hybrid system in which AI handles Stage 1 robustly, assists humans in Stage 2, and humans refine Stage 3 for quality control. This is a collaborative human-in-the-loop model, in line with research on AI augmentation from MIT's human-computer interaction lab. The road to fully automatic, high-quality AD is long, but Movie101v2 provides a reliable first map.
9. Future Applications & Directions
- Accessibility-First Media: Integration into streaming services (Netflix, Disney+) to provide real-time or pre-generated AD for large content libraries.
- Educational Tools: Generating descriptive narration for educational videos and documentaries, making learning more inclusive for visually impaired students.
- Content Analysis & Search: The underlying plot-understanding models could power advanced search in video archives (e.g., "find scenes where a character faces a moral dilemma").
- Interactive Storytelling: In games or VR, generating dynamic narration based on player actions could create more immersive experiences for all users.
- Research Directions: 1) Developing joint models that learn the three stages together rather than treating them separately. 2) Creating better evaluation metrics, possibly using LLMs as judges or developing task-specific metrics. 3) Exploring few-shot or zero-shot adaptation to new movies using film scripts and metadata as additional context.
10. References
- Yue, Z., Zhang, Y., Wang, Z., & Jin, Q. (2024). Movie101v2: Improved Movie Narration Benchmark. arXiv preprint arXiv:2404.13370v2.
- Han, T., et al. (2023a). AutoAD II: The Sequel - Who, When, and What in Movie Audio Description. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
- Han, T., et al. (2023b). AutoAD: Movie Description in Context. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Soldan, M., et al. (2022). MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Rohrbach, A., et al. (2017). Movie Description. International Journal of Computer Vision (IJCV).
- Torabi, A., et al. (2015). Using Descriptive Video Services to Create a Large Data Source for Video Annotation Research. arXiv preprint arXiv:1503.01070.
- OpenAI. (2023). GPT-4V(ision) System Card. OpenAI.
- Zhu, J.-Y., et al. (2017). Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. Proceedings of the IEEE International Conference on Computer Vision (ICCV). (Cited as an example of a framework that decomposes a complex problem, image translation, into a cycle of mapping and reconstruction.)