## Introduction

- Two-stage approach
  - Method
    1. Generate class-agnostic mask proposals.
    2. Leverage a pre-trained CLIP to perform open-vocabulary classification on each proposal.
  - Assumptions
    1. The model can generate class-agnostic mask proposals.
    2. Pre-trained CLIP can transfer its classification performance to masked image proposals.
  - Examination
    1. Use ground-truth masks as region proposals.
    2. Feed the masked images to a pre-trained CLIP for classification.
    3. This reaches only 20.1% mIoU on the ADE20K-150 dataset.
    4. Separately, use MaskFormer (a mask proposal generator trained on COCO) as the region proposal generator.
    5. Select the region proposal with the highest overlap with each ground-truth mask.
    6. Assign the ground-truth object label to that region.
    7. This oracle reaches 66.5% mIoU, despite the imperfect region proposals.
  - Conclusion
    Pre-trained CLIP does not perform well on masked images. We hypothesize this is because CLIP is trained on natural images, which are neither cropped nor corrupted by segmentation masks.

## Vocabularies

1. Ground-truth masks: the manually annotated, pixel-level labels that define the correct segmentation of objects in an image. Each pixel in a ground-truth mask is assigned the class label of the object or region it belongs to.
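The "masked image" fed to CLIP in the examination above can be sketched as follows. This is a toy illustration, not the paper's code: `apply_mask` is a hypothetical helper that blanks out every pixel outside a binary ground-truth mask, producing the kind of cropped/noised input the conclusion argues CLIP was never trained on.

```python
import numpy as np


def apply_mask(image: np.ndarray, mask: np.ndarray, fill: int = 0) -> np.ndarray:
    """Keep pixels inside the binary mask; set everything else to `fill`.

    image: (H, W, 3) uint8 array; mask: (H, W) boolean array.
    The result is what a two-stage pipeline would hand to CLIP's
    image encoder for open-vocabulary classification.
    """
    out = np.full_like(image, fill)
    out[mask] = image[mask]
    return out


# Toy example: a 4x4 "image" with a 2x2 foreground region.
image = np.arange(4 * 4 * 3, dtype=np.uint8).reshape(4, 4, 3)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
masked = apply_mask(image, mask)
```

In a real pipeline the `masked` array would then be resized and passed through CLIP's image encoder, and its embedding compared against text embeddings of the candidate class names.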
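Steps 4–6 of the examination (matching MaskFormer proposals to ground-truth masks by overlap) amount to an IoU argmax. A minimal sketch, assuming boolean mask arrays; `iou` and `best_proposal` are illustrative names, not from the paper:

```python
import numpy as np


def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks of equal shape."""
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(a, b).sum()) / float(union)


def best_proposal(gt: np.ndarray, proposals: list) -> int:
    """Index of the proposal mask with the highest IoU against `gt`."""
    return int(np.argmax([iou(gt, p) for p in proposals]))


# Toy example: one proposal overlaps the ground truth, one is disjoint.
gt = np.zeros((4, 4), dtype=bool)
gt[:2, :2] = True
disjoint = np.zeros((4, 4), dtype=bool)
disjoint[2:, 2:] = True
partial = np.zeros((4, 4), dtype=bool)
partial[:2, :3] = True  # covers gt plus two extra pixels
best = best_proposal(gt, [disjoint, partial])
```

The ground-truth label is then assigned to the best-matching proposal, which is how the 66.5% mIoU oracle isolates classification quality from proposal quality.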