ABSTRACT
Pixel-level structure segmentation has attracted considerable attention, playing a crucial role in autonomous driving within the metaverse and enhancing comprehension in light field-based machine vision. However, current light field modeling methods fail to integrate appearance and geometric structural information into a coherent semantic space, limiting the ability of light field representations to convey visual knowledge. In this paper, we propose a general light field modeling method for pixel-level structure segmentation, comprising a generative light field prompting encoder (LF-GPE) and a prompt-based masked light field pretraining (LF-PMP) network. Our LF-GPE, serving as a light field backbone, extracts appearance and geometric structural cues simultaneously and aligns them in a unified visual space, facilitating semantic interaction. Meanwhile, our LF-PMP combines mixed light field and multi-view light field reconstruction during pretraining. It prioritizes the geometric structural properties of the light field, enabling the backbone to accumulate rich prior knowledge. We evaluate the pretrained LF-GPE on two downstream tasks: light field salient object detection and semantic segmentation. Experimental results demonstrate that LF-GPE effectively learns high-quality light field features and achieves highly competitive performance on pixel-level segmentation tasks.
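To make the masked-pretraining objective concrete, below is a minimal sketch of masked multi-view light field reconstruction, assuming PyTorch. All names (LFGPESketch), shapes, the masking ratio, and the loss are illustrative assumptions, not the authors' implementation; the point is only that reconstructing masked patches jointly across sub-aperture views forces the backbone to use cross-view geometric structure in addition to appearance.

    # Hypothetical sketch of prompt-based masked light field pretraining;
    # module names and hyperparameters are assumptions for illustration.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LFGPESketch(nn.Module):
        """Toy stand-in for a light field backbone: embeds masked
        sub-aperture views and reconstructs the missing patches."""
        def __init__(self, patch=16, dim=256):
            super().__init__()
            self.patch = patch
            # One shared patch embedding for every sub-aperture view, so
            # appearance and angular (geometric) cues share one token space.
            self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
            self.encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
                num_layers=4,
            )
            self.decode = nn.Linear(dim, 3 * patch * patch)  # pixels per patch

        def forward(self, lf, mask):
            # lf: (B, V, 3, H, W) sub-aperture views; mask: (B, V*T) booleans.
            b, v, c, h, w = lf.shape
            tokens = self.embed(lf.flatten(0, 1))             # (B*V, dim, h', w')
            tokens = tokens.flatten(2).transpose(1, 2)        # (B*V, T, dim)
            tokens = tokens.reshape(b, -1, tokens.shape[-1])  # (B, V*T, dim)
            tokens = tokens.masked_fill(mask.unsqueeze(-1), 0.0)  # zero masked patches
            return self.decode(self.encoder(tokens))          # reconstructions

    # One pretraining step on a toy 3x3-view light field batch.
    model = LFGPESketch()
    lf = torch.randn(2, 9, 3, 64, 64)
    t = (64 // 16) ** 2                          # patches per view
    mask = torch.rand(2, 9 * t) < 0.75           # mask 75% of all patches
    target = F.unfold(lf.flatten(0, 1), kernel_size=16, stride=16)
    target = target.transpose(1, 2).reshape(2, 9 * t, -1)  # (B, V*T, 3*16*16)
    pred = model(lf, mask)
    loss = ((pred - target) ** 2)[mask].mean()   # reconstruct masked patches only
    loss.backward()

In this sketch the loss is computed only on masked patches, the standard choice in masked-image pretraining; whether LF-PMP does the same is an assumption here.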