
Weakly-Supervised Action Detection Guided by Audio Narration

A research paper investigating the use of noisy audio narration as weak supervision for training action detection models in video, reducing annotation cost while exploiting multimodal features.

1. Introduction

Videos are a rich, multimodal source of data for machine learning, carrying spatial (visual), temporal, and often auditory information. Exploiting that potential fully is held back by the high cost of obtaining precise instance-level annotations (start time, end time, action label) for action detection in untrimmed videos. This paper addresses the bottleneck with a new weakly-supervised approach that uses cheap, readily available audio narration as its main supervisory signal. The key insight is that narrations, although temporally imprecise (they give only an approximate start time, as in the EPIC Kitchens dataset), contain valuable semantic cues that can guide a model to attend to the relevant video segments and learn effective action detectors, substantially reducing the dependence on annotation.

2. Related Work & Problem Statement

2.1 Supervision Paradigms in Action Detection

Temporal action detection operates under three main supervision paradigms:

  • Full Supervision: Requires expensive instance-level annotations (exact temporal boundaries). Yields high performance but does not scale.
  • Weak Supervision (Video-Level): Uses only video-level class labels. Assumes few actions per video (e.g., THUMOS14 has ~1 class/video), which is unrealistic for long, complex videos such as those in EPIC Kitchens (~35 classes/video on average).
  • Weak Supervision (Narration): The proposed paradigm. Uses noisy, single-timestamp audio narration transcripts as weak labels. This is more informative than video-level labels but cheaper than full instance annotation.

Dataset Comparison

THUMOS14: an average of 1.08 classes per video. EPIC Kitchens: an average of 34.87 classes per video. This stark gap exposes the limits of traditional weakly-supervised action detection (WSAD) methods in real-world settings.

2.2 The Challenge of Weak Supervision

The central challenge is the temporal misalignment between the narration timestamp and the actual action instance. The model must learn to suppress irrelevant background frames and focus on the temporal segment that corresponds to the narrated action, despite the noisy label.

3. Proposed Method

3.1 Model Architecture Overview

The proposed model is a multimodal architecture designed to process and fuse features from RGB frames, optical flow (motion), and ambient audio tracks. Its core component is a temporal attention mechanism that learns importance weights for the video frames according to their relevance to the given audio narration label.

3.2 Learning from Noisy Audio Narration

Rather than treating the narration timestamp as a hard label, the model treats it as a weak cue. The learning objective encourages high activation scores for the correct action class on frames near the narration point, while suppressing activations for all other frames and classes. This resembles a form of multiple-instance learning (MIL) in which the video is a "bag" of frames and the positive "instance" (the action) lies somewhere near the narrated point.
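The idea of "frames near the narration point" can be made concrete with a small sketch. The Gaussian window, the `sigma` parameter, and the function names below are illustrative assumptions, not details taken from the paper; they show only one possible way to turn a single timestamp into soft per-frame weights.

```python
import math

def narration_prior(frame_times, tau, sigma=1.0):
    """Soft weights over frames that peak at the narration timestamp tau.

    Assumption: a Gaussian window of width sigma (seconds). The paper's
    actual weighting scheme may differ.
    """
    raw = [math.exp(-((t - tau) ** 2) / (2.0 * sigma ** 2)) for t in frame_times]
    total = sum(raw)
    return [r / total for r in raw]  # normalise so the weights sum to 1

def narration_weighted_score(frame_scores, weights):
    """Pool per-frame scores for one class using the soft weights."""
    return sum(w * s for w, s in zip(weights, frame_scores))

# Toy example: six frames sampled once per second, narration at t = 3 s.
frame_times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
prior = narration_prior(frame_times, tau=3.0, sigma=1.0)

# Hypothetical per-frame scores for the narrated class.
scores = [0.1, 0.2, 0.4, 0.9, 0.5, 0.1]
pooled = narration_weighted_score(scores, prior)
```

Because the weights are soft, an action that starts slightly before or after the timestamp still contributes to the pooled score, which is the property the MIL view relies on.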

3.3 Multimodal Feature Fusion

Features are extracted from the different modalities (RGB for appearance, flow for motion, audio for ambient sound) using pre-trained networks (e.g., I3D for RGB/flow, VGGish for audio). These features are then fused, either by early concatenation or through a more elaborate cross-modal attention module, to form a robust joint representation for action classification and localization.
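As a minimal sketch of the simpler of the two options, early concatenation, the per-frame feature vectors of each modality can be joined into one longer vector. The function name and the toy dimensions are assumptions for illustration; real extractor outputs would be much larger, and the features are assumed to be already aligned frame by frame.

```python
def early_fusion(rgb, flow, audio):
    """Concatenate per-frame features from three modalities.

    Each argument is a list of per-frame feature vectors (lists of floats).
    Assumption: all modalities have been resampled to the same number of frames.
    """
    if not (len(rgb) == len(flow) == len(audio)):
        raise ValueError("all modalities must cover the same number of frames")
    return [r + f + a for r, f, a in zip(rgb, flow, audio)]

# Toy features: 2 frames, 3-d RGB, 3-d flow, 2-d audio.
rgb = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
flow = [[1.0, 1.1, 1.2], [1.3, 1.4, 1.5]]
audio = [[5.0, 5.1], [5.2, 5.3]]

fused = early_fusion(rgb, flow, audio)  # 2 frames, 8-d each
```

Concatenation keeps every modality's information intact and leaves it to later layers to weigh them; a cross-modal attention module would instead learn that weighting explicitly.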

4. Experiments & Results

4.1 Dataset and Setup

The main evaluation is carried out on EPIC Kitchens 100, a large egocentric video dataset with dense action annotations and matching audio narrations. The model is trained using only the narration start times and the transcribed verb-noun labels. Performance is measured with standard temporal action detection metrics, such as mean Average Precision (mAP) at several temporal Intersection over Union (tIoU) thresholds.
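For readers unfamiliar with the metric, tIoU is the length of the overlap between a predicted and a ground-truth segment divided by the length of their union. The helper below is a small sketch with a hypothetical name; segments are assumed to be `(start, end)` pairs in seconds.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))  # overlap length, 0 if disjoint
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0

# Prediction 2-6 s vs. ground truth 4-8 s: overlap 2 s, union 6 s.
example = temporal_iou((2.0, 6.0), (4.0, 8.0))  # 2 / 6
```

A detection counts as correct at a given threshold only if its tIoU with a ground-truth instance of the same class reaches that threshold, so higher thresholds reward tighter localization.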

4.2 Quantitative Results

The paper shows that the proposed model, trained only with narration supervision, is competitive with models trained under more expensive supervision. While it naturally trails fully-supervised baselines, it significantly outperforms video-level weakly-supervised methods, particularly on datasets with many actions per video. This supports the hypothesis that narration provides a valuable "middle-ground" supervisory signal.

4.3 Ablation Studies

The ablation studies confirm the importance of each component:

  • Multimodality: Using RGB+Flow+Audio features consistently outperforms any single modality.
  • Temporal Attention: The proposed attention mechanism is essential for filtering out irrelevant frames and improving localization accuracy.
  • Narration vs. Video-Level: Training with narration labels yields better detection results than using video-level labels alone on EPIC Kitchens, confirming that the former carries more information.

5. Technical Analysis & Framework

5.1 Mathematical Formulation

The main learning objective can be formulated as a combination of a classification loss and a temporal localization loss guided by the weak narration signal. Let $V = \{f_t\}_{t=1}^T$ be the sequence of video frame features. For a narration label $y_n$ with timestamp $\tau_n$, the model produces frame-level class scores $s_t^c$. A temporal attention weight $\alpha_t$ is learned for each frame. The classification loss for the narrated action is a softmax cross-entropy over attention-weighted sums of the frame scores:
$$\mathcal{L}_{cls} = -\log\left(\frac{\exp(\sum_t \alpha_t s_t^{y_n})}{\sum_c \exp(\sum_t \alpha_t s_t^c)}\right)$$
In parallel, a temporal smoothness or sparsity loss $\mathcal{L}_{temp}$ is applied to $\alpha_t$ to encourage a peaked distribution around the action instance. The total loss is $\mathcal{L} = \mathcal{L}_{cls} + \lambda \mathcal{L}_{temp}$.
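The formulation can be traced numerically with a short sketch. Two choices below are assumptions made only for illustration and are not stated in the source: the attention weights $\alpha_t$ are obtained by a softmax over per-frame attention logits, and $\mathcal{L}_{temp}$ is instantiated as the entropy of $\alpha$, one simple sparsity-style penalty. All function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def classification_loss(scores, alpha, label):
    """L_cls: cross-entropy over attention-pooled class scores.

    scores[t][c] is s_t^c, alpha[t] is the attention weight, label is y_n.
    """
    num_classes = len(scores[0])
    pooled = [sum(a * s[c] for a, s in zip(alpha, scores))
              for c in range(num_classes)]
    return -math.log(softmax(pooled)[label])

def entropy_penalty(alpha):
    """Assumed stand-in for L_temp: low entropy means a peaked alpha."""
    return -sum(a * math.log(a) for a in alpha if a > 0)

def total_loss(scores, attn_logits, label, lam=0.1):
    """L = L_cls + lambda * L_temp, with alpha = softmax(attn_logits)."""
    alpha = softmax(attn_logits)
    return classification_loss(scores, alpha, label) + lam * entropy_penalty(alpha)

# Toy case: 3 frames, 2 classes; only frame 1 shows evidence for class 0.
scores = [[0.0, 0.0], [5.0, 0.0], [0.0, 0.0]]
peaked = total_loss(scores, attn_logits=[0.0, 4.0, 0.0], label=0)
uniform = total_loss(scores, attn_logits=[0.0, 0.0, 0.0], label=0)
```

In this toy case the loss is lower when attention concentrates on the informative frame than when it is spread uniformly, which is the behaviour the combined objective is meant to reward.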

5.2 Example Analysis Framework

Case Study: Examining the Model's Failure Modes
To understand the model's limits, we can set up an analysis framework:

  1. Data Inspection: Identify videos where the model's prediction (temporal segment) has low IoU with the ground truth. Manually review those videos and their narrations.
  2. Categorization: Categorize the failures. Common categories include:
    • Narration Ambiguity: The narration (e.g., "I'm preparing food") is too high-level and does not correspond to a single short action instance.
    • Compound Actions: The narrated action (e.g., "take the knife and cut the vegetables") contains several sub-actions, which confuses the model.
    • Background Dominance: The visual appearance of the action is cluttered or resembles the surrounding non-action frames.
  3. Root Cause & Mitigation: For "Narration Ambiguity," possible remedies include using a stronger language model to assess narration granularity, or adding a learning signal that penalizes overly long detections for vague labels.
This analysis framework moves beyond simple metric reporting toward actionable model diagnosis.
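The first two steps of this framework lend themselves to a small triage script. The sketch below assumes that each prediction has already been scored against the ground truth and, after manual review, tagged with one of the failure categories; the record fields, the threshold value, and the function name are all hypothetical.

```python
from collections import Counter

def triage(records, tiou_threshold=0.3):
    """Select low-tIoU predictions and count their failure categories.

    records: list of dicts with keys "video", "tiou", and "category"
    (field names are assumptions for this sketch).
    """
    failures = [r for r in records if r["tiou"] < tiou_threshold]
    counts = Counter(r["category"] for r in failures)
    return failures, counts

# Hypothetical, hand-labelled review results.
records = [
    {"video": "v01", "tiou": 0.72, "category": None},
    {"video": "v02", "tiou": 0.10, "category": "narration_ambiguity"},
    {"video": "v03", "tiou": 0.05, "category": "compound_action"},
    {"video": "v04", "tiou": 0.21, "category": "narration_ambiguity"},
]

failures, counts = triage(records)
worst_category, worst_count = counts.most_common(1)[0]
```

Ranking categories by count indicates which mitigation from step 3 is likely to pay off first.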

6. Discussion & Future Directions

Key Insight: This work is a pragmatic workaround for the data annotation problem. It correctly recognizes that, in the real world, "free" supervisory signals such as audio narrations, closed captions, or ASR transcripts are plentiful. The real contribution is not a new neural architecture but a convincing demonstration that we can, and should, design learning systems that digest these noisy real-world signals rather than wait for perfectly curated annotations.

Logical Flow: The argument is sound: instance-level annotation does not scale → video-level labels are too weak for complex videos → audio narration is a cheap, informative middle ground → here is a model that can use it. Using EPIC Kitchens, with its dense distribution of actions, is an effective way to expose the shortcomings of video-level supervision.

Strengths & Flaws: Its strength is its practicality and a clear value proposition for industry applications (e.g., content moderation, video search, assisted living) where cost matters. Its flaw, as with most weakly-supervised methods, is the performance ceiling: the model is bounded by the noise in its supervision. It is a good first step, but not a final solution for applications that require precise timing.

Actionable Insights: For researchers: explore multimodal self-supervision (e.g., building on Contrastive Language-Image Pre-training (CLIP) by Radford et al.) to further reduce reliance on any text labels. For practitioners: apply this paradigm right away to in-house video data that already has transcripts or audio logs, starting by treating the timestamps in those logs as weak narration points.

Future Directions:

  • Leveraging Large Vision-Language Models (VLMs): Models such as CLIP or BLIP-2 provide strong, aligned visual-text representations. Future work could use them as a strong prior for better grounding of narrated phrases in the video content, potentially overcoming some of the ambiguity issues.
  • Cross-Dataset Generalization: Can a model trained on egocentric kitchen videos (EPIC) detect actions in third-person sports videos with commentator audio? Investigating the transferability of narration-guided learning is important.
  • From Detection to Anticipation: Narration often describes an action while it happens or shortly afterwards. Can this signal be used to learn action anticipation models that predict an action shortly before it occurs?
  • Integration with Active Learning: The model's uncertainty or attention weights could be used to query a human annotator only on the most confusing narration-video pairs, creating an efficient human-in-the-loop annotation pipeline.
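The active-learning direction in the last bullet can be illustrated with a small selection rule. This is a hypothetical sketch, not something the paper describes: it assumes each narration-video pair comes with a normalised attention distribution over frames and uses the entropy of that distribution as an uncertainty proxy, sending the flattest (least decided) pairs to a human annotator.

```python
import math

def attention_entropy(alpha):
    """Entropy of a normalised attention distribution over frames."""
    return -sum(a * math.log(a) for a in alpha if a > 0)

def select_for_annotation(attention_by_pair, budget):
    """Return the `budget` pair ids whose attention is least peaked."""
    ranked = sorted(
        attention_by_pair,
        key=lambda pair_id: attention_entropy(attention_by_pair[pair_id]),
        reverse=True,  # highest entropy (most uncertain) first
    )
    return ranked[:budget]

# Hypothetical attention weights for three narration-video pairs.
attention_by_pair = {
    "a": [0.90, 0.05, 0.05],           # confident: one clear peak
    "b": [1.0 / 3, 1.0 / 3, 1.0 / 3],  # uniform: no idea where the action is
    "c": [0.5, 0.5, 0.0],              # torn between two frames
}

to_label = select_for_annotation(attention_by_pair, budget=2)
```

With a budget of two, the uniform and the two-peaked pairs are queried first, while the confidently localized pair is left to the weak label alone.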

7. References

  1. Ye, K., & Kovashka, A. (2021). Weakly-Supervised Action Detection Guided by Audio Narration. In Proceedings of the ... (source PDF).
  2. Damen, D., et al. (2018). Scaling Egocentric Vision: The EPIC-KITCHENS Dataset. European Conference on Computer Vision (ECCV).
  3. Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. International Conference on Machine Learning (ICML).
  4. Carreira, J., & Zisserman, A. (2017). Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. Conference on Computer Vision and Pattern Recognition (CVPR).
  5. Wang, L., et al. (2016). Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. European Conference on Computer Vision (ECCV).
  6. Zhu, J.-Y., et al. (2017). Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. International Conference on Computer Vision (ICCV).