{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,8]],"date-time":"2026-05-08T16:36:04Z","timestamp":1778258164441,"version":"3.51.4"},"reference-count":131,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2021,10,20]],"date-time":"2021-10-20T00:00:00Z","timestamp":1634688000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,10,20]],"date-time":"2021-10-20T00:00:00Z","timestamp":1634688000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100000266","name":"Engineering and Physical Sciences Research Council","doi-asserted-by":"publisher","award":["EP\/T004991\/1"],"award-info":[{"award-number":["EP\/T004991\/1"]}],"id":[{"id":"10.13039\/501100000266","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100000266","name":"Engineering and Physical Sciences Research Council","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100000266","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100000266","name":"Engineering and Physical Sciences Research Council","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100000266","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100000266","name":"Engineering and Physical Sciences Research Council","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100000266","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100000266","name":"Engineering and Physical Sciences Research Council","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100000266","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100000266","name":"Engineering and Physical Sciences Research 
Council","doi-asserted-by":"publisher","award":["EP\/N033779\/1"],"award-info":[{"award-number":["EP\/N033779\/1"]}],"id":[{"id":"10.13039\/501100000266","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100000266","name":"Engineering and Physical Sciences Research Council","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100000266","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100000266","name":"Engineering and Physical Sciences Research Council","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100000266","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Comput Vis"],"published-print":{"date-parts":[[2022,1]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>This paper introduces the pipeline to extend the largest dataset in egocentric vision, EPIC-KITCHENS. The effort culminates in EPIC-KITCHENS-100, a collection of 100\u00a0hours, 20M frames, 90K actions in 700 variable-length videos, capturing long-term unscripted activities in 45 environments, using head-mounted cameras. Compared to its previous version (Damen in Scaling egocentric vision: ECCV, 2018), EPIC-KITCHENS-100 has been annotated using a novel pipeline that allows denser (54% more actions per minute) and more complete annotations of fine-grained actions (+128% more action segments). This collection enables new challenges such as action detection and evaluating the \u201ctest of time\u201d\u2014i.e. whether models trained on data collected in 2018 can generalise to new footage collected two years later. The dataset is aligned with 6 challenges: action recognition (full and weak supervision), action detection, action anticipation, cross-modal retrieval\u00a0(from captions), as well as unsupervised domain adaptation for action recognition. 
For each challenge, we define the task, provide baselines and evaluation metrics.<\/jats:p>","DOI":"10.1007\/s11263-021-01531-2","type":"journal-article","created":{"date-parts":[[2021,10,20]],"date-time":"2021-10-20T19:53:39Z","timestamp":1634759619000},"page":"33-55","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":349,"title":["Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100"],"prefix":"10.1007","volume":"130","author":[{"given":"Dima","family":"Damen","sequence":"first","affiliation":[]},{"given":"Hazel","family":"Doughty","sequence":"additional","affiliation":[]},{"given":"Giovanni Maria","family":"Farinella","sequence":"additional","affiliation":[]},{"given":"Antonino","family":"Furnari","sequence":"additional","affiliation":[]},{"given":"Evangelos","family":"Kazakos","sequence":"additional","affiliation":[]},{"given":"Jian","family":"Ma","sequence":"additional","affiliation":[]},{"given":"Davide","family":"Moltisanti","sequence":"additional","affiliation":[]},{"given":"Jonathan","family":"Munro","sequence":"additional","affiliation":[]},{"given":"Toby","family":"Perrett","sequence":"additional","affiliation":[]},{"given":"Will","family":"Price","sequence":"additional","affiliation":[]},{"given":"Michael","family":"Wray","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2021,10,20]]},"reference":[{"key":"1531_CR1","doi-asserted-by":"crossref","unstructured":"Bearman A, Russakovsky O, Ferrari V, & Fei-Fei L (2016) What\u2019s the point: semantic segmentation with point supervision. In ECCV","DOI":"10.1007\/978-3-319-46478-7_34"},{"key":"1531_CR2","unstructured":"Bhattacharyya A, Fritz M, & Schiele B (2019) Bayesian prediction of future street scenes using synthetic likelihoods. 
In ICLR"},{"key":"1531_CR3","doi-asserted-by":"crossref","unstructured":"Bojanowski P, Lajugie R, Bach F, Laptev I, Ponce J, Schmid C, & Sivic J (2014) Weakly supervised action labeling in videos under ordering constraints. In ECCV","DOI":"10.1007\/978-3-319-10602-1_41"},{"key":"1531_CR4","doi-asserted-by":"crossref","unstructured":"Caesar H, Bankiti V, Lang AH, Vora S, Liong VE, Xu Q, Krishnan A, Pan Y, Baldan G, & Beijbom O (2019) nuScenes: A multimodal dataset for autonomous driving. arXiv","DOI":"10.1109\/CVPR42600.2020.01164"},{"key":"1531_CR5","doi-asserted-by":"crossref","unstructured":"Cao Y, Long M, Wang J, & Yu P (2017) Correlation hashing network for efficient cross-modal retrieval. In BMVC","DOI":"10.5244\/C.31.128"},{"key":"1531_CR6","doi-asserted-by":"crossref","unstructured":"Caputo B, M\u00fcller H, Martinez-Gomez J, Villegas M, Acar B, Patricia N, Marvasti N, \u00dcsk\u00fcdarl\u0131 S, Paredes R, Cazorla M, et\u00a0al. (2014) Imageclef 2014: Overview and analysis of the results. In: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer 192\u2013211","DOI":"10.1007\/978-3-319-11382-1_18"},{"issue":"9","key":"1531_CR7","doi-asserted-by":"publisher","first-page":"1023","DOI":"10.1177\/0278364915614638","volume":"35","author":"N Carlevaris-Bianco","year":"2016","unstructured":"Carlevaris-Bianco, N., Ushani, A. K., & Eustice, R. M. (2016). University of Michigan North Campus long-term vision and lidar dataset. Int J Robotics Res, 35(9), 1023\u20131035.","journal-title":"Int J Robotics Res"},{"key":"1531_CR8","doi-asserted-by":"crossref","unstructured":"Carreira J, & Zisserman A (2017) Quo Vadis, action recognition? A new model and the Kinetics dataset. In CVPR","DOI":"10.1109\/CVPR.2017.502"},{"key":"1531_CR9","unstructured":"Carreira J, Noland E, Hillier C, & Zisserman A (2019) A short note on the Kinetics-700 human action dataset. 
arXiv"},{"key":"1531_CR10","doi-asserted-by":"crossref","unstructured":"Chang C, Huang DA, Sui Y, Fei-Fei L, & Niebles JC (2019) D3TW: Discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation. In CVPR","DOI":"10.1109\/CVPR.2019.00366"},{"key":"1531_CR11","unstructured":"Chen D, & Dolan, W (2011) Collecting highly parallel data for paraphrase evaluation. In NAACL-HLT"},{"key":"1531_CR12","doi-asserted-by":"crossref","unstructured":"Chen MH, Kira Z, AlRegib G, Yoo J, Chen R, & Zheng J (2019) Temporal attentive alignment for large-scale video domain adaptation. In ICCV","DOI":"10.1109\/ICCV.2019.00642"},{"key":"1531_CR13","unstructured":"Ch\u00e9ron G, Alayrac J, Laptev I, & Schmid C (2018) A flexible model for training action localization with varying levels of supervision. In NeurIPS"},{"key":"1531_CR14","doi-asserted-by":"crossref","unstructured":"Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, & Schiele B (2016) The cityscapes dataset for semantic urban scene understanding. In CVPR","DOI":"10.1109\/CVPR.2016.350"},{"key":"1531_CR15","doi-asserted-by":"crossref","unstructured":"Damen D, Doughty H, Farinella GM, Fidler S, Furnari A, Kazakos E, Moltisanti D, Munro J, Perrett T, Price W, & Wray M (2018) Scaling egocentric vision: The EPIC-KITCHENS dataset. In ECCV","DOI":"10.1007\/978-3-030-01225-0_44"},{"key":"1531_CR16","doi-asserted-by":"crossref","unstructured":"Damen D, Leelasawassuk T, Haines O, Calway A, & Mayol-Cuevas W (2014). You-do, I-learn: Discovering task relevant objects and their modes of interaction from multi-user egocentric video. In BMVC","DOI":"10.5244\/C.28.30"},{"key":"1531_CR17","doi-asserted-by":"crossref","unstructured":"De\u00a0Geest R, Gavves E, Ghodrati A, Li Z, Snoek C, & Tuytelaars T (2016) Online action detection. 
In ECCV","DOI":"10.1007\/978-3-319-46454-1_17"},{"key":"1531_CR18","unstructured":"De\u00a0La\u00a0Torre F, Hodgins J, Bargteil A, Martin X, Macey J, Collado A, & Beltran P (2008) Guide to the Carnegie Mellon University Multimodal Activity (CMU-MMAC) database. In Robotics Institute"},{"key":"1531_CR19","doi-asserted-by":"crossref","unstructured":"Deng J, Dong W, Socher R, Li LJ, Li K, & Fei-Fei L (2009) Imagenet: A large-scale hierarchical image database. In CVPR","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"1531_CR20","unstructured":"Ding L, & Xu C (2018) Weakly-supervised action segmentation with iterative soft boundary assignment. In CVPR"},{"key":"1531_CR21","doi-asserted-by":"crossref","unstructured":"Fathi A, Li Y, & Rehg J (2012) Learning to recognize daily actions using gaze. In ECCV","DOI":"10.1007\/978-3-642-33718-5_23"},{"key":"1531_CR22","doi-asserted-by":"crossref","unstructured":"Feichtenhofer C, Fan H, Malik J, & He K (2019) SlowFast networks for video recognition. In ICCV","DOI":"10.1109\/ICCV.2019.00630"},{"key":"1531_CR23","doi-asserted-by":"crossref","unstructured":"Furnari A, & Farinella GM (2020) Rolling-unrolling LSTMs for action anticipation from first-person video. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)","DOI":"10.1109\/TPAMI.2020.2992889"},{"key":"1531_CR24","doi-asserted-by":"crossref","unstructured":"Furnari A, Battiato S, & Farinella GM (2018) Leveraging uncertainty to rethink loss functions and evaluation measures for egocentric action anticipation. In ECCVW","DOI":"10.1007\/978-3-030-11021-5_24"},{"key":"1531_CR25","doi-asserted-by":"crossref","unstructured":"Ganin Y, Ustinova E, Ajakan, H, Germain P, Larochelle H, Laviolette F, Marchand M, & Lempitsky V (2016) Domain-adversarial training of neural networks. JMLR","DOI":"10.1007\/978-3-319-58347-1_10"},{"key":"1531_CR26","doi-asserted-by":"crossref","unstructured":"Geiger A, Lenz P, & Urtasun R (2012) Are we ready for autonomous driving? 
The KITTI vision benchmark suite. In CVPR","DOI":"10.1109\/CVPR.2012.6248074"},{"key":"1531_CR27","unstructured":"Gong B, Shi Y, Sha F, & Grauman K (2012) Geodesic Flow Kernel for Unsupervised Domain Adaptation. In Computer Vision and Pattern Recognition"},{"key":"1531_CR28","unstructured":"Gorban A, Idrees H, Jiang YG, Zamir AR, Laptev I, Shah M, & Sukthankar R (2015). THUMOS challenge: Action recognition with a large number of classes. http:\/\/www.thumos.info\/"},{"key":"1531_CR29","doi-asserted-by":"crossref","unstructured":"Goyal R, Kahou SE, Michalski V, Materzynska J, Westphal S, Kim H, Haenel V, Fr\u00fcnd I, Yianilos P, Mueller-Freitag M, Hoppe F, Thurau C, Bax I, Memisevic R (2017) The \u201cSomething Something\u201d video database for learning and evaluating visual common sense. In ICCV","DOI":"10.1109\/ICCV.2017.622"},{"key":"1531_CR30","doi-asserted-by":"crossref","unstructured":"Gu C, Sun C, Ross DA, Vondrick C, Pantofaru C, Li Y, Vijayanarasimhan S, Toderici G, Ricco S, Sukthankar R, Schmid C, & Malik J (2018) AVA: A video dataset of spatio-temporally localized atomic visual actions. In CVPR","DOI":"10.1109\/CVPR.2018.00633"},{"key":"1531_CR31","unstructured":"Gupta S, & Malik J (2016) Visual semantic role labeling. In CVPR"},{"key":"1531_CR32","doi-asserted-by":"crossref","unstructured":"Gygli M, & Ferrari V (2019) Efficient object annotation via speaking and pointing. IJCV","DOI":"10.1007\/s11263-019-01255-4"},{"key":"1531_CR33","doi-asserted-by":"crossref","unstructured":"He K, Girshick R, & Doll\u00e1r P (2019) Rethinking ImageNet pre-training. In ICCV","DOI":"10.1109\/ICCV.2019.00502"},{"key":"1531_CR34","doi-asserted-by":"crossref","unstructured":"He K, Gkioxari G, Doll\u00e1r P, & Girshick R (2017) Mask R-CNN. In ICCV","DOI":"10.1109\/ICCV.2017.322"},{"issue":"1","key":"1531_CR35","doi-asserted-by":"publisher","first-page":"153","DOI":"10.2307\/1912352","volume":"47","author":"JJ Heckman","year":"1979","unstructured":"Heckman, J. J. (1979). 
Sample Selection Bias as a Specification Error. Econometrica, 47(1), 153\u2013161.","journal-title":"Econometrica"},{"key":"1531_CR36","doi-asserted-by":"crossref","unstructured":"Heilbron FC, Escorcia V, Ghanem B, & Niebles JC (2015) ActivityNet: A large-scale video benchmark for human activity understanding. In CVPR","DOI":"10.1109\/CVPR.2015.7298698"},{"key":"1531_CR37","doi-asserted-by":"crossref","unstructured":"Heilbron FC, Lee JY, Jin H, & Ghanem B (2018) What do i annotate next?, An empirical study of active learning for action localization. In ECCV","DOI":"10.1007\/978-3-030-01252-6_13"},{"key":"1531_CR38","unstructured":"Honnibal M, & Montani I (2017) spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing"},{"key":"1531_CR39","doi-asserted-by":"crossref","unstructured":"Hsu HK, Yao CH, Tsai YH, Hung WC, Tseng HY, Singh M, & Yang MH (2020) Progressive domain adaptation for object detection. In Winter Conference on Applications of Computer Vision","DOI":"10.1109\/WACV45572.2020.9093358"},{"key":"1531_CR40","doi-asserted-by":"crossref","unstructured":"Huang X, Cheng X, Geng Q, Cao B, Zhou D, Wang P, Lin Y, & Yang R (2018) The apolloscape dataset for autonomous driving. In CVPRW","DOI":"10.1109\/CVPRW.2018.00141"},{"key":"1531_CR41","doi-asserted-by":"crossref","unstructured":"Huang DA, Fei-Fei L, & Niebles JC (2016) Connectionist temporal modeling for weakly supervised action labeling. In ECCV","DOI":"10.1007\/978-3-319-46493-0_9"},{"key":"1531_CR42","unstructured":"Ioffe S, & Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML"},{"key":"1531_CR43","unstructured":"Jamal A, Namboodiri VP, Deodhare D, & Venkatesh K (2018) Deep domain adaptation in action space. In BMVC"},{"key":"1531_CR44","doi-asserted-by":"crossref","unstructured":"J\u00e4rvelin K, & Kek\u00e4l\u00e4inen J (2002) Cumulated gain-based evaluation of IR techniques. 
TOIS","DOI":"10.1145\/582415.582418"},{"key":"1531_CR45","unstructured":"Jiang YG, Liu J, Zamir AR, Toderici G, Laptev I, Shah M, & Sukthankar R (2014) THUMOS challenge: Action recognition with a large number of classes. http:\/\/crcv.ucf.edu\/THUMOS14\/"},{"key":"1531_CR46","doi-asserted-by":"crossref","unstructured":"Kang C, Xiang S, Liao S, Xu C, & Pan C (2015) Learning consistent feature representation for cross-modal multimedia retrieval. TMM","DOI":"10.1109\/TMM.2015.2390499"},{"key":"1531_CR47","doi-asserted-by":"crossref","unstructured":"Karpathy A, & Fei-Fei L (2015) Deep Visual-Semantic Alignments for Generating Image Descriptions. In CVPR","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"1531_CR48","unstructured":"Kay W, Carreira J, Simonyan K, Zhang B, Hillier C, Vijayanarasimhan S, Viola F, Green T, Back T, Natsev P, Suleyman M, & Zisserman A (2017) The Kinetics human action video dataset. arXiv"},{"key":"1531_CR49","doi-asserted-by":"crossref","unstructured":"Kazakos E, Nagrani A, Zisserman A, & Damen D (2019) EPIC-Fusion: Audio-visual temporal binding for egocentric action recognition. In ICCV","DOI":"10.1109\/ICCV.2019.00559"},{"key":"1531_CR50","unstructured":"Kingma DP, & Ba J (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980"},{"key":"1531_CR51","doi-asserted-by":"crossref","unstructured":"Koppula HS, & Saxena A (2016) Anticipating human activities using object affordances for reactive robotic response. TPAMI","DOI":"10.1109\/TPAMI.2015.2430335"},{"key":"1531_CR52","doi-asserted-by":"crossref","unstructured":"Krishna R, Hata K, Ren F, Fei-Fei L, & Niebles JC (2017) Dense-captioning events in videos. In ICCV","DOI":"10.1109\/ICCV.2017.83"},{"key":"1531_CR53","doi-asserted-by":"crossref","unstructured":"Kuehne H, Arslan A, & Serre T (2014) The language of actions: Recovering the syntax and semantics of goal-directed human activities. 
In CVPR","DOI":"10.1109\/CVPR.2014.105"},{"key":"1531_CR54","doi-asserted-by":"crossref","unstructured":"Kuehne H, Jhuang H, Garrote E, Poggio T, & Serre T (2011) HMDB: a large video database for human motion recognition. In ICCV","DOI":"10.1109\/ICCV.2011.6126543"},{"key":"1531_CR55","doi-asserted-by":"crossref","unstructured":"Lea C, Flynn MD, Vidal R, Reiter A, & Hager GM (2017) Temporal convolutional networks for action segmentation and detection. In CVPR","DOI":"10.1109\/CVPR.2017.113"},{"key":"1531_CR56","doi-asserted-by":"crossref","unstructured":"Lee N, Choi W, Vernaza P, Choy C, Torr PHS, & Chandraker M (2017) DESIRE: Distant future prediction in dynamic scenes with interacting agents. In CVPR","DOI":"10.1109\/CVPR.2017.233"},{"key":"1531_CR57","doi-asserted-by":"crossref","unstructured":"Li J, Lei P, & Todorovic S (2019) Weakly supervised energy-based learning for action segmentation. In ICCV","DOI":"10.1109\/ICCV.2019.00634"},{"key":"1531_CR58","doi-asserted-by":"crossref","unstructured":"Li Y, Ye Z, & Rehg JM (2015) Delving into egocentric actions. In CVPR","DOI":"10.1109\/CVPR.2015.7298625"},{"key":"1531_CR59","doi-asserted-by":"crossref","unstructured":"Lin J, Gan C, & Han S (2019) TSM: Temporal shift module for efficient video understanding. In ICCV","DOI":"10.1109\/ICCV.2019.00718"},{"key":"1531_CR60","doi-asserted-by":"crossref","unstructured":"Lin T, Liu X, Li X, Ding E, & Wen S (2019) BMN: Boundary-matching network for temporal action proposal generation. In ICCV","DOI":"10.1109\/ICCV.2019.00399"},{"key":"1531_CR61","doi-asserted-by":"crossref","unstructured":"Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Doll\u00e1r P, & Zitnick CL (2014) Microsoft COCO: Common objects in context. In ECCV","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"1531_CR62","doi-asserted-by":"crossref","unstructured":"Liu D, Jiang T, & Wang Y (2019) Completeness modeling and context separation for weakly supervised temporal action localization. 
In CVPR","DOI":"10.1109\/CVPR.2019.00139"},{"key":"1531_CR63","doi-asserted-by":"crossref","unstructured":"Liu Z, Miao Z, Zhan X, Lin D, Yu SX, & Icsi, UCB (2020) Open Compound Domain Adaptation. In Computer Vision and Pattern Recognition","DOI":"10.1109\/CVPR42600.2020.01242"},{"issue":"1","key":"1531_CR64","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1177\/0278364916679498","volume":"36","author":"W Maddern","year":"2017","unstructured":"Maddern, W., Pascoe, G., Linegar, C., & Newman, P. (2017). 1 year, 1000 km: the Oxford RobotCar dataset. Int J Robot Res, 36(1), 3\u201315.","journal-title":"Int J Robot Res"},{"key":"1531_CR65","unstructured":"Mahdisoltani F, Berger G, Gharbieh W, Fleet D, & Memisevic R (2018) On the effectiveness of task granularity for transfer learning. arXiv"},{"key":"1531_CR66","doi-asserted-by":"crossref","unstructured":"Marszalek M, Laptev I, & Schmid C (2009) Actions in context. In CVPR","DOI":"10.1109\/CVPR.2009.5206557"},{"key":"1531_CR67","doi-asserted-by":"crossref","unstructured":"McInnes L, Healy J, & Melville J (2018) UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv","DOI":"10.21105\/joss.00861"},{"key":"1531_CR68","doi-asserted-by":"crossref","unstructured":"Mettes P, Koelma DC, & Snoek CGM (2016) The ImageNet shuffle: Reorganized pre-training for video event detection. In ICMR","DOI":"10.1145\/2911996.2912036"},{"key":"1531_CR69","doi-asserted-by":"crossref","unstructured":"Mettes P, Van\u00a0Gemert JC, & Snoek CG (2016) Spot on: Action localization from pointly-supervised proposals. In ECCV","DOI":"10.1007\/978-3-319-46454-1_27"},{"key":"1531_CR70","doi-asserted-by":"crossref","unstructured":"Miech A, Zhukov D, Alayrac JB, Tapaswi M, Laptev I, & Sivic J (2019) HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. 
In ICCV","DOI":"10.1109\/ICCV.2019.00272"},{"key":"1531_CR71","unstructured":"Mikolov T, Chen K, Corrado G, & Dean J (2013) Efficient estimation of word representations in vector space. In ICLR"},{"key":"1531_CR72","doi-asserted-by":"crossref","unstructured":"Moltisanti D, Fidler S, & Damen D (2019). Action recognition from single timestamp supervision in untrimmed videos. In CVPR","DOI":"10.1109\/CVPR.2019.01015"},{"key":"1531_CR73","doi-asserted-by":"crossref","unstructured":"Moltisanti D, Wray M, Mayol-Cuevas W, & Damen D (2017) Trespassing the boundaries: Labeling temporal bounds for object interactions in egocentric video. In ICCV","DOI":"10.1109\/ICCV.2017.314"},{"key":"1531_CR74","doi-asserted-by":"crossref","unstructured":"Monfort M, Vondrick C, Oliva A, Andonian A, Zhou B, Ramakrishnan K, Bargal SA, Yan T, Brown L, Fan Q, & Gutfreund D (2020) Moments in Time dataset: One million videos for event understanding. TPAMI","DOI":"10.1109\/TPAMI.2019.2901464"},{"key":"1531_CR75","doi-asserted-by":"crossref","unstructured":"Munro J, & Damen D (2020) Multi-modal domain adaptation for fine-grained action recognition. In CVPR","DOI":"10.1109\/CVPR42600.2020.00020"},{"key":"1531_CR76","doi-asserted-by":"crossref","unstructured":"Narayan S, Cholakkal H, Khan F, & Shao L (2019) 3C-Net: Category count and center loss for weakly-supervised action localization. In ICCV","DOI":"10.1109\/ICCV.2019.00877"},{"key":"1531_CR77","doi-asserted-by":"crossref","unstructured":"Neuhold G, Ollmann T, Bulo SR, & Kontschieder P (2017) The mapillary vistas dataset for semantic understanding of street scenes. In ICCV","DOI":"10.1109\/ICCV.2017.534"},{"key":"1531_CR78","doi-asserted-by":"crossref","unstructured":"Nguyen P, Liu T, Prasad G, & Han B (2018). Weakly supervised action localization by sparse temporal pooling network. 
In CVPR","DOI":"10.1109\/CVPR.2018.00706"},{"key":"1531_CR79","doi-asserted-by":"crossref","unstructured":"Nguyen P, Ramanan D, & Fowlkes C (2019) Weakly-supervised action localization with background modeling. In ICCV","DOI":"10.1109\/ICCV.2019.00560"},{"key":"1531_CR80","doi-asserted-by":"crossref","unstructured":"Noroozi M, & Favaro P (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV","DOI":"10.1007\/978-3-319-46466-4_5"},{"key":"1531_CR81","doi-asserted-by":"crossref","unstructured":"Oberdiek P, Rottmann M, & Fink GA (2020) Detection and Retrieval of Out-of-Distribution Objects in Semantic Segmentation. In Computer Vision and Pattern Recognition Workshops","DOI":"10.1109\/CVPRW50498.2020.00172"},{"key":"1531_CR82","doi-asserted-by":"crossref","unstructured":"Pan B, Cao Z, Adeli E, & Niebles JC (2020) Adversarial cross-domain action recognition with co-attention. In AAAI","DOI":"10.1609\/aaai.v34i07.6854"},{"key":"1531_CR83","unstructured":"Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, & Chintala S (2019). Pytorch: An imperative style, high-performance deep learning library. In Wallach H, Larochelle H, Beygelzimer A, d\u00c1lch\u00e9-Buc F, Fox E, & Garnett R, eds. Advances in Neural Information Processing Systems 32. Curran Associates, Inc. 8024\u20138035"},{"key":"1531_CR84","doi-asserted-by":"crossref","unstructured":"Patron-Perez A, Marszalek M, Zisserman A, & Reid I (2010) High Five: Recognising human interactions in TV shows. In BMVC","DOI":"10.5244\/C.24.50"},{"key":"1531_CR85","doi-asserted-by":"crossref","unstructured":"Peng X, Bai Q, Xia X, Huang Z, Saenko K, & Wang B (2019) Moment matching for multi-source domain adaptation. 
In ICCV","DOI":"10.1109\/ICCV.2019.00149"},{"key":"1531_CR86","unstructured":"Peng X, Usman B, Kaushik N, Hoffman J, Wang D, & Saenko K (2017) VisDA: The visual domain adaptation challenge. arXiv"},{"key":"1531_CR87","doi-asserted-by":"crossref","unstructured":"Perazzi F, Pont-Tuset J, McWilliams B, Gool LV, Gross M, & Sorkine-Hornung A (2016) A benchmark dataset and evaluation methodology for video object segmentation. In CVPR","DOI":"10.1109\/CVPR.2016.85"},{"key":"1531_CR88","doi-asserted-by":"crossref","unstructured":"Pirsiavash H, & Ramanan D (2012) Detecting activities of daily living in first-person camera views. In CVPR","DOI":"10.1109\/CVPR.2012.6248010"},{"key":"1531_CR89","unstructured":"Planamente M, Plizzari C, Alberti E, & Caputo B (2021) Cross-domain first person audio-visual action recognition through relative norm alignment. arXiv preprint arXiv:2106.01689"},{"key":"1531_CR90","unstructured":"Plizzari C, Planamente M, Alberti E, & Caputo B (2021). Polito-iit submission to the epic-kitchens-100 unsupervised domain adaptation challenge for action recognition. arXiv preprint arXiv:2107.00337"},{"key":"1531_CR91","doi-asserted-by":"crossref","unstructured":"Qi F, Yang X, & Xu C (2018) A unified framework for multimodal domain adaptation. In ACM-MM","DOI":"10.1145\/3240508.3240633"},{"key":"1531_CR92","unstructured":"Rasiwasia N, Mahajan D, Mahadevan V, & Aggarwal G (2014) Cluster canonical correlation analysis. In AISTATS"},{"key":"1531_CR93","doi-asserted-by":"crossref","unstructured":"Richard A, Kuehne H, Iqbal A, & Gall J (2018) NeuralNetwork-Viterbi: A framework for weakly supervised video learning. In CVPR","DOI":"10.1109\/CVPR.2018.00771"},{"key":"1531_CR94","doi-asserted-by":"crossref","unstructured":"Rohrbach, M., Amin, S., Andriluka, M., & Schiele, B. (2012). A database for fine grained activity detection of cooking activities. 
In CVPR","DOI":"10.1109\/CVPR.2012.6247801"},{"key":"1531_CR95","doi-asserted-by":"crossref","unstructured":"Rohrbach A, Rohrbach M, Tandon N, & Schiele B (2015) A dataset for movie description. In CVPR","DOI":"10.1109\/CVPR.2015.7298940"},{"key":"1531_CR96","doi-asserted-by":"crossref","unstructured":"Roth J, Chaudhuri S, Klejch O, Marvin R, Gallagher A, Kaver L, Ramaswamy S, Stopczynski A, Schmid C, Xi Z, et\u00a0al. (2019). AVA-ActiveSpeaker: An audio-visual dataset for active speaker detection. arXiv","DOI":"10.1109\/ICCVW.2019.00460"},{"key":"1531_CR97","doi-asserted-by":"crossref","unstructured":"Saenko K, Kulis B, Fritz M, & Darrell T (2010) Adapting visual category models to new domains. In ECCV","DOI":"10.1007\/978-3-642-15561-1_16"},{"key":"1531_CR98","doi-asserted-by":"crossref","unstructured":"Saenko K, Kulis B, Fritz M, & Darrell T (2010) Adapting Visual Category Models to New Domains. In European Conference on Computer Vision","DOI":"10.1007\/978-3-642-15561-1_16"},{"key":"1531_CR99","doi-asserted-by":"crossref","unstructured":"Shan D, Geng J, Shu M, & Fouhey DF (2020) Understanding human hands in contact at internet scale. In CVPR","DOI":"10.1109\/CVPR42600.2020.00989"},{"key":"1531_CR100","unstructured":"Sigurdsson GA, Gupta A, Schmid C, Farhadi A, & Alahari K (2018) Charades-ego: A large-scale dataset of paired third and first person videos. In ArXiv"},{"key":"1531_CR101","doi-asserted-by":"crossref","unstructured":"Sigurdsson GA, Varol G, Wang X, Farhadi A, Laptev I, & Gupta A (2016) Hollywood in Homes: Crowdsourcing data collection for activity understanding. In ECCV","DOI":"10.1007\/978-3-319-46448-0_31"},{"key":"1531_CR102","doi-asserted-by":"crossref","unstructured":"Silberman N, Hoiem D, Kohli P, & Fergus R (2012) Indoor segmentation and support inference from RGBD images. In ECCV","DOI":"10.1007\/978-3-642-33715-4_54"},{"key":"1531_CR103","doi-asserted-by":"crossref","unstructured":"Singh KK, & Lee YJ (2017). 
Hide-and-Seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In ICCV","DOI":"10.1109\/ICCV.2017.381"},{"key":"1531_CR104","unstructured":"Soomro K, Zamir AR, & Shah M (2012) UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv"},{"key":"1531_CR105","doi-asserted-by":"crossref","unstructured":"Stein S, & McKenna SJ (2013). Combining Embedded Accelerometers with Computer Vision for Recognizing Food Preparation Activities. In International Joint Conference on Pervasive and Ubiquitous Computing","DOI":"10.1145\/2493432.2493482"},{"key":"1531_CR106","doi-asserted-by":"crossref","unstructured":"Stein S, & McKenna S (2013) Combining embedded accelerometers with computer vision for recognizing food preparation activities. In UbiComp","DOI":"10.1145\/2493432.2493482"},{"key":"1531_CR107","doi-asserted-by":"crossref","unstructured":"Torralba A, & Efros AA (2011) Unbiased look at dataset bias. In CVPR 2011","DOI":"10.1109\/CVPR.2011.5995347"},{"key":"1531_CR108","unstructured":"Ueberla JP (1997) Domain adaptation with clustered language models. In International Conference on Acoustics, Speech and Signal Processing"},{"key":"1531_CR109","doi-asserted-by":"crossref","unstructured":"Venkateswara H, Eusebio J, Chakraborty S, & Panchanathan S (2017) Deep hashing network for unsupervised domain adaptation. In CVPR","DOI":"10.1109\/CVPR.2017.572"},{"key":"1531_CR110","doi-asserted-by":"crossref","unstructured":"Vondrick C, Shrivastava A, Fathi A, Guadarrama S, & Murphy K (2018) Tracking emerges by colorizing videos. In ECCV","DOI":"10.1007\/978-3-030-01261-8_24"},{"key":"1531_CR111","doi-asserted-by":"crossref","unstructured":"Wang L, Xiong Y, Lin D, & Van\u00a0Gool L (2017) Untrimmednets for weakly supervised action recognition and detection. 
In CVPR","DOI":"10.1109\/CVPR.2017.678"},{"key":"1531_CR112","doi-asserted-by":"crossref","unstructured":"Wang L, Xiong Y, Wang Z, Qiao Y, Lin D, Tang X, & Gool LV (2016) Temporal Segment Networks: Towards good practices for deep action recognition. In ECCV","DOI":"10.1007\/978-3-319-46484-8_2"},{"key":"1531_CR113","unstructured":"Weinzaepfel P, Martin X, & Schmid C (2016) Human action localization with sparse spatial supervision. arXiv"},{"key":"1531_CR114","doi-asserted-by":"crossref","unstructured":"Wray M, Larlus D, Csurka G, & Damen D (2019) Fine-grained action retrieval through multiple parts-of-speech embeddings. In ICCV","DOI":"10.1109\/ICCV.2019.00054"},{"key":"1531_CR115","doi-asserted-by":"crossref","unstructured":"Wulfmeier M, Bewley A, & Posner I (2018) Incremental Adversarial Domain Adaptation for Continually Changing Environments. In International Conference on Robotics and Automation. 4489\u20134495","DOI":"10.1109\/ICRA.2018.8460982"},{"key":"1531_CR116","doi-asserted-by":"crossref","unstructured":"Xu J, Mei T, Yao T, & Rui Y (2016) MSR-VTT: A large video description dataset for bridging video and language. In CVPR","DOI":"10.1109\/CVPR.2016.571"},{"key":"1531_CR117","doi-asserted-by":"crossref","unstructured":"Xu T, Zhu F, Wong EK, & Fang Y (2016) Dual many-to-one-encoder-based transfer learning for cross-dataset human action recognition. IMAVIS","DOI":"10.1016\/j.imavis.2016.01.001"},{"key":"1531_CR118","unstructured":"Yang L, Huang Y, Sugano Y, & Sato Y (2021) Epic-kitchens-100 unsupervised domain adaptation challenge for action recognition 2021: Team m3em technical report. arXiv preprint arXiv:2106.10026"},{"key":"1531_CR119","doi-asserted-by":"crossref","unstructured":"Yeung S, Russakovsky O, Jin N, Andriluka M, Mori G, & Fei-Fei L (2017) Every Moment Counts: Dense detailed labeling of actions in complex videos. 
IJCV","DOI":"10.1007\/s11263-017-1013-y"},{"key":"1531_CR120","doi-asserted-by":"crossref","unstructured":"Yogamani S, Hughes C, Horgan J, Sistu G, Varley P, O\u2019Dea D, Uric\u00e1r M, Milz S, Simon M, Amende K et\u00a0al. (2019) Woodscape: A multi-task, multi-camera fisheye dataset for autonomous driving. In ICCV","DOI":"10.1109\/ICCV.2019.00940"},{"key":"1531_CR121","unstructured":"Yu F, Xian W, Chen Y, Liu F, Liao M, Madhavan V, & Darrell T (2018) BDD100K: A diverse driving video database with scalable annotation tooling. arXiv"},{"key":"1531_CR122","doi-asserted-by":"crossref","unstructured":"Zach C, Pock T, & Bischof H (2007) A duality based approach for realtime TV-L1 optical flow. In Pattern Recognition","DOI":"10.1109\/CVPR.2007.383196"},{"key":"1531_CR123","doi-asserted-by":"crossref","unstructured":"Zamir AR, Sax A, Shen W, Guibas L, Malik J, & Savarese S (2018) Taskonomy: Disentangling task transfer learning. In CVPR","DOI":"10.24963\/ijcai.2019\/871"},{"key":"1531_CR124","doi-asserted-by":"crossref","unstructured":"Zhai M, Bessinger Z, Workman S, & Jacobs N (2017) Predicting Ground-Level Scene Layout from Aerial Imagery. In Computer Vision and Pattern Recognition","DOI":"10.1109\/CVPR.2017.440"},{"key":"1531_CR125","unstructured":"Zhai X, Puigcerver J, Kolesnikov A, Ruyssen P, Riquelme C, Lucic M, Djolonga J, Pinto AS, Neumann M, Dosovitskiy A, Beyer L, Bachem O, Tschannen M, Michalski M, Bousquet O, Gelly S, & Houlsby N (2019) A large-scale study of representation learning with the visual task adaptation benchmark. arXiv"},{"key":"1531_CR126","doi-asserted-by":"crossref","unstructured":"Zhao H, Yan Z, Torresani L, & Torralba A (2019) HACS: Human action clips and segments dataset for recognition and temporal localization. In ICCV","DOI":"10.1109\/ICCV.2019.00876"},{"key":"1531_CR127","doi-asserted-by":"crossref","unstructured":"Zhou B, Andonian A, Oliva A, & Torralba A (2018) Temporal relational reasoning in videos. 
ECCV","DOI":"10.1007\/978-3-030-01246-5_49"},{"key":"1531_CR128","doi-asserted-by":"crossref","unstructured":"Zhou L, Kalantidis Y, Chen X, Corso JJ, & Rohrbach M (2019) Grounded video description. In CVPR","DOI":"10.1109\/CVPR.2019.00674"},{"key":"1531_CR129","doi-asserted-by":"crossref","unstructured":"Zhou B, Kr\u00e4henb\u00fchl P, & Koltun V (2019) Does computer vision matter for action? Science Robotics","DOI":"10.1126\/scirobotics.aaw6661"},{"key":"1531_CR130","doi-asserted-by":"crossref","unstructured":"Zhou L, Xu C, & Corso JJ (2017) Towards automatic learning of procedures from web instructional videos. In AAAI","DOI":"10.1609\/aaai.v32i1.12342"},{"key":"1531_CR131","doi-asserted-by":"crossref","unstructured":"Zhou B, Zhao H, Puig X, Fidler S, Barriuso A, & Torralba A (2017) Scene parsing through ADE20K dataset. In CVPR","DOI":"10.1109\/CVPR.2017.544"}],"container-title":["International Journal of Computer Vision"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-021-01531-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11263-021-01531-2\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-021-01531-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,9,10]],"date-time":"2024-09-10T07:59:55Z","timestamp":1725955195000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11263-021-01531-2"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,10,20]]},"references-count":131,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2022,1]]}},"alternative-id":["1531"],"URL":"https:\/\/doi.org\/10.1007\/s11263-021-01531-2","relation":{},"ISS
N":["0920-5691","1573-1405"],"issn-type":[{"value":"0920-5691","type":"print"},{"value":"1573-1405","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,10,20]]},"assertion":[{"value":"18 January 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"17 September 2021","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"20 October 2021","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}