Release of PromptSRC with pretrained models.

This commit is contained in:
uzair khattak
2023-07-13 23:43:31 +05:00
commit 8be7dcff6b
132 changed files with 106641 additions and 0 deletions

22
LICENSE Normal file
View File

@@ -0,0 +1,22 @@
MIT License
Copyright (c) 2023 Muhammad Uzair Khattak
Copyright (c) 2022 Muhammad Uzair Khattak
Copyright (c) 2021 Kaiyang Zhou
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

158
README.md Normal file
View File

@@ -0,0 +1,158 @@
# Self-regulating Prompts: Foundational Model Adaptation without Forgetting
> [**Self-regulating Prompts: Foundational Model Adaptation without Forgetting**]()<br>
> [Muhammad Uzair Khattak*](https://muzairkhattak.github.io/), [Syed Talal Wasim*](https://talalwasim.github.io), [Muzammal Naseer](https://scholar.google.com/citations?user=tM9xKA8AAAAJ&hl=en&oi=ao), [Salman Khan](https://salman-h-khan.github.io/), [Ming-Hsuan Yang](http://faculty.ucmerced.edu/mhyang/), [Fahad Shahbaz Khan](https://scholar.google.es/citations?user=zvaeYnUAAAAJ&hl=en)
*Joint first authors
[![paper](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)]()
[![Website](https://img.shields.io/badge/Project-Website-87CEEB)](https://muzairkhattak.github.io/PromptSRC/)
[![slides](https://img.shields.io/badge/Presentation-Slides-B762C1)](https://drive.google.com/file/d/1d14q8hhAl6qGsiPYpNIVfShMCulVJSUa/view?usp=sharing)
Official implementation of the paper "[Self-regulating Prompts: Foundational Model Adaptation without Forgetting](https://arxiv.org/abs/2307.06948)".
<hr />
## :rocket: News
* **(July 12, 2023)**
  * Pre-trained models and evaluation code for reproducing PromptSRC official benchmark results are released.
  * Training code for [PromptSRC](configs/trainers/PromptSRC) is released.
  * This repository also supports [MaPLe (CVPR'23)](configs/trainers/MaPLe),
    [CoOp (IJCV'22)](configs/trainers/CoOp), and [Co-CoOp (CVPR'22)](configs/trainers/CoCoOp)
    architectures.
<hr />
## Highlights
![main figure](docs/main_figure.png)
> <p align="justify"> <b> <span style="color: blue;">Left</span></b>:
> Existing prompt learning approaches for foundational Vision-Language models like CLIP rely on task-specific objectives that restrict
> prompts to learning a feature space suited only to the downstream task, and they
> consequently lose the generalized knowledge of CLIP (shown in <span style="color: purple;">purple</span>).
> Our self-regulating framework explicitly guides the training trajectory of prompts
> towards the closest point between two optimal solution manifolds (solid line) to
> learn task-specific representations while also retaining generalized CLIP knowledge
> (shown in <span style="color: green;">green</span>). <b><span style="color: blue;">Middle</span></b>: Averaged
> across 11 image recognition datasets, PromptSRC surpasses existing methods on the
> base-to-novel generalization setting. <b><span style="color: blue;">Right</span></b>: We evaluate
> our approach on four diverse image recognition benchmarks for CLIP and show
> consistent gains over previous state-of-the-art approaches. </p>
> **<p align="justify"> Abstract:** *Prompt learning has emerged as an efficient alternative
> for fine-tuning foundational models, such as CLIP, for various downstream tasks.
> Conventionally trained using the task-specific objective, i.e., cross-entropy loss,
> prompts tend to overfit downstream data distributions and find it challenging to capture
> task-agnostic general features from the frozen CLIP. This leads to the loss of the model's
> original generalization capability. To address this issue, our work introduces a
> self-regularization framework for prompting called PromptSRC (Prompting with Self-regulating
> Constraints). PromptSRC guides the prompts to optimize for both task-specific and task-agnostic
> general representations using a three-pronged approach by: (a) regulating {prompted}
> representations via mutual agreement maximization with the frozen model, (b) regulating
> with self-ensemble of prompts over the training trajectory to encode their complementary
> strengths, and (c) regulating with textual diversity to mitigate sample diversity imbalance
> with the visual branch. To the best of our knowledge, this is the first regularization
> framework for prompt learning that avoids overfitting by jointly attending to pre-trained
> model features, the training trajectory during prompting, and the textual diversity.
> PromptSRC explicitly steers the prompts to learn a representation space that maximizes
> performance on downstream tasks without compromising CLIP generalization. We perform
> experiments on 4 benchmarks where PromptSRC performs favorably well compared
> to the existing methods. Our code and pre-trained models are publicly available.* </p>
## Regularization Framework for Prompt Learning
We propose PromptSRC (Prompting with Self-regulating Constraints) which steers the prompts to learn a representation space that maximizes performance on downstream tasks without compromising CLIP generalization.
**Key components of PromptSRC:**
1) **Mutual agreement maximization:** PromptSRC explicitly guides the prompts to jointly acquire both <i>task-specific knowledge</i> and <i>task-agnostic generalized knowledge</i> by maximizing the mutual agreement between the prompted features and the features of the frozen VL model (see the sketch after this list).
2) **Gaussian weighted prompt aggregation:** We propose a weighted self-ensembling strategy for prompts over the training trajectory that captures complementary features and enhances their generalization abilities.
3) **Textual diversity:** PromptSRC regulates prompts with textual diversity to mitigate sample diversity imbalance compared to the visual branch during training.
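The mutual-agreement term above can be summarized with a short sketch. The snippet below is only an illustrative re-statement of the regularization losses, assuming the prompted and frozen CLIP features and logits are already computed; the exact implementation lives in the PromptSRC trainer, and the loss-weight names mirror the `TEXT_LOSS_WEIGHT` / `IMAGE_LOSS_WEIGHT` entries in the training configs.

```python
import torch.nn.functional as F

def promptsrc_regularization(prompted_img, frozen_img,
                             prompted_txt, frozen_txt,
                             prompted_logits, frozen_logits,
                             image_loss_weight=10.0, text_loss_weight=25.0):
    """Illustrative sketch of the mutual-agreement constraints (not the official code).

    Inputs are assumed to be pre-computed CLIP tensors: image features [B, D],
    text features [C, D], and classification logits [B, C].
    """
    # (a) feature-level agreement with the frozen model (L1 distance)
    loss_img = F.l1_loss(prompted_img, frozen_img) * image_loss_weight
    loss_txt = F.l1_loss(prompted_txt, frozen_txt) * text_loss_weight
    # logit-level agreement: keep prompted predictions close to frozen CLIP predictions
    loss_kl = F.kl_div(prompted_logits.log_softmax(dim=-1),
                       frozen_logits.softmax(dim=-1),
                       reduction="batchmean")
    return loss_img + loss_txt + loss_kl
```

The Gaussian weighted prompt aggregation of point 2 is parameterized by the `GPA_MEAN` / `GPA_STD` entries of the PromptSRC configs further below; a small sketch of that weighting accompanies the first PromptSRC config file.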
## :ballot_box_with_check: Supported Methods
| Method | Paper | Configs | Training Scripts |
|---------------------------|:----------------------------------------------|:---------------------------------------------------------------:|:-------------------------------:|
| PromptSRC | [arXiv]() | [link](configs/trainers/PromptSRC/) | [link](scripts/promptsrc) |
| Independent V-L Prompting | - | [link](configs/trainers/IVLP/) | [link](scripts/independent-vlp) |
| MaPLe                     | [CVPR 2023](https://arxiv.org/abs/2210.03117) | [link](configs/trainers/MaPLe)                                    | [link](scripts/maple)           |
| CoOp | [IJCV 2022](https://arxiv.org/abs/2109.01134) | [link](configs/trainers/CoOp) | [link](scripts/coop) |
| Co-CoOp | [CVPR 2022](https://arxiv.org/abs/2203.05557) | [link](configs/trainers/CoCoOp) | [link](scripts/cocoop) |
<hr />
## Results
Results reported below show accuracy for base and novel classes across 11 recognition datasets, averaged over 3 seeds.
### Effectiveness of PromptSRC in comparison with baseline Independent V-L Prompting
PromptSRC effectively maximizes supervised task performance (base classes) without compromising CLIP's original generalization to unseen categories (novel classes).
| Name | Base Acc. | Novel Acc. | HM | Epochs |
|---------------------------------------------------------------------------------|:---------:|:----------:|:---------:|:------:|
| CLIP | 69.34 | 74.22 | 71.70 | - |
| Independent V-L Prompting | 84.21 | 71.79 | 77.51 | 20 |
| PromptSRC (ours) | **84.26** | **76.10** | **79.97** | 20 |
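Here, HM denotes the harmonic mean of base and novel accuracy, HM = 2 × Base × Novel / (Base + Novel); e.g., for PromptSRC, 2 × 84.26 × 76.10 / (84.26 + 76.10) ≈ 79.97.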
### PromptSRC in comparison with existing state-of-the-art
| Name | Base Acc. | Novel Acc. | HM | Epochs |
|--------------------------------------------|:---------:|:----------:|:---------:|:------:|
| [CLIP](https://arxiv.org/abs/2103.00020) | 69.34 | 74.22 | 71.70 | - |
| [CoOp](https://arxiv.org/abs/2109.01134) | 82.69 | 63.22 | 71.66 | 200 |
| [CoCoOp](https://arxiv.org/abs/2203.05557) | 80.47 | 71.69 | 75.83 | 10 |
| [ProDA](https://arxiv.org/abs/2205.03340)  |   81.56   |   72.30    |   76.65   |  100   |
| [MaPLe](https://arxiv.org/abs/2210.03117) | 82.28 | 75.14 | 78.55 | 5 |
| [PromptSRC (ours)]() | **84.26** | **76.10** | **79.97** | 20 |
## Installation
For installation and other package requirements, please follow the instructions detailed in [INSTALL.md](docs/INSTALL.md).
## Data Preparation
Please follow the instructions at [DATASETS.md](docs/DATASETS.md) to prepare all datasets.
## Model Zoo
### Vision-Language prompting methods
| Name (configs) | Model checkpoints |
|---------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------:|
| [Independent V-L Prompting](configs/trainers/IVLP/vit_b16_c2_ep20_batch4_4+4ctx.yaml) | [link](https://mbzuaiac-my.sharepoint.com/:f:/g/personal/syed_wasim_mbzuai_ac_ae/EuIwh-yMh_JBqB2Y_o8Jl14BPDKDRHC0JBPE1BugIeZiSQ?e=AJ8MhY) |
| [PromptSRC](configs/trainers/PromptSRC/vit_b16_c2_ep20_batch4_4+4ctx.yaml) | [link](https://mbzuaiac-my.sharepoint.com/:f:/g/personal/syed_wasim_mbzuai_ac_ae/EqFXPs2Zl9pKp39w3SqlR7QBDACTv-AgCXH6_cGflrUFwg?e=l33EBA) |
## Evaluation
Please refer to the [EVAL.md](docs/EVAL.md) for detailed instructions on using the evaluation scripts and reproducing the official results using our pre-trained models.
## Training
Please refer to the [TRAIN.md](docs/TRAIN.md) for detailed instructions on training PromptSRC and IVLP baseline from scratch.
<hr />
## Citation
If you find our work, this repository, or the pretrained models useful, please consider giving a star :star: and a citation.
```bibtex
@article{khattak2023PromptSRC,
title={Self-regulating Prompts: Foundational Model Adaptation without Forgetting},
    author={Khattak, Muhammad Uzair and Wasim, Syed Talal and Naseer, Muzammal and Khan, Salman and Yang, Ming-Hsuan and Khan, Fahad Shahbaz},
journal={arXiv:},
year={2023}
}
```
## Contact
If you have any questions, please create an issue on this repository or contact us at uzair.khattak@mbzuai.ac.ae or syed.wasim@mbzuai.ac.ae.
## Acknowledgements
Our code is based on the [MaPLe](https://github.com/muzairkhattak/multimodal-prompt-learning) repository, along with the [Co-CoOp and CoOp](https://github.com/KaiyangZhou/CoOp) repository. We thank the authors for releasing their code. If you use our model and code, please consider citing these works as well.

1
clip/__init__.py Normal file
View File

@@ -0,0 +1 @@
from .clip import *

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

221
clip/clip.py Normal file
View File

@@ -0,0 +1,221 @@
import hashlib
import os
import urllib
import warnings
from typing import Union, List
import torch
from PIL import Image
from torchvision.transforms import Compose, Resize, CenterCrop, ToTensor, Normalize
from tqdm import tqdm
from .model import build_model
from .simple_tokenizer import SimpleTokenizer as _Tokenizer
try:
from torchvision.transforms import InterpolationMode
BICUBIC = InterpolationMode.BICUBIC
except ImportError:
BICUBIC = Image.BICUBIC
_torch_version = tuple(int(p) for p in torch.__version__.split("+")[0].split(".")[:3] if p.isdigit())
if _torch_version < (1, 7, 1):
    warnings.warn("PyTorch version 1.7.1 or higher is recommended")
__all__ = ["available_models", "load", "tokenize"]
_tokenizer = _Tokenizer()
_MODELS = {
"RN50": "https://openaipublic.azureedge.net/clip/models/afeb0e10f9e5a86da6080e35cf09123aca3b358a0c3e3b6c78a7b63bc04b6762/RN50.pt",
"RN101": "https://openaipublic.azureedge.net/clip/models/8fa8567bab74a42d41c5915025a8e4538c3bdbe8804a470a72f30b0d94fab599/RN101.pt",
"RN50x4": "https://openaipublic.azureedge.net/clip/models/7e526bd135e493cef0776de27d5f42653e6b4c8bf9e0f653bb11773263205fdd/RN50x4.pt",
"RN50x16": "https://openaipublic.azureedge.net/clip/models/52378b407f34354e150460fe41077663dd5b39c54cd0bfd2b27167a4a06ec9aa/RN50x16.pt",
"ViT-B/32": "https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt",
"ViT-B/16": "https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt",
}
def _download(url: str, root: str = os.path.expanduser("~/.cache/clip")):
os.makedirs(root, exist_ok=True)
filename = os.path.basename(url)
expected_sha256 = url.split("/")[-2]
download_target = os.path.join(root, filename)
if os.path.exists(download_target) and not os.path.isfile(download_target):
raise RuntimeError(f"{download_target} exists and is not a regular file")
if os.path.isfile(download_target):
if hashlib.sha256(open(download_target, "rb").read()).hexdigest() == expected_sha256:
return download_target
else:
warnings.warn(f"{download_target} exists, but the SHA256 checksum does not match; re-downloading the file")
with urllib.request.urlopen(url) as source, open(download_target, "wb") as output:
with tqdm(total=int(source.info().get("Content-Length")), ncols=80, unit='iB', unit_scale=True) as loop:
while True:
buffer = source.read(8192)
if not buffer:
break
output.write(buffer)
loop.update(len(buffer))
if hashlib.sha256(open(download_target, "rb").read()).hexdigest() != expected_sha256:
raise RuntimeError(f"Model has been downloaded but the SHA256 checksum does not not match")
return download_target
def _transform(n_px):
return Compose([
Resize(n_px, interpolation=BICUBIC),
CenterCrop(n_px),
lambda image: image.convert("RGB"),
ToTensor(),
Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)),
])
def available_models() -> List[str]:
"""Returns the names of available CLIP models"""
return list(_MODELS.keys())
def load(name: str, device: Union[str, torch.device] = "cuda" if torch.cuda.is_available() else "cpu", jit=False):
"""Load a CLIP model
Parameters
----------
name : str
A model name listed by `clip.available_models()`, or the path to a model checkpoint containing the state_dict
device : Union[str, torch.device]
The device to put the loaded model
jit : bool
Whether to load the optimized JIT model or more hackable non-JIT model (default).
Returns
-------
model : torch.nn.Module
The CLIP model
preprocess : Callable[[PIL.Image], torch.Tensor]
A torchvision transform that converts a PIL image into a tensor that the returned model can take as its input
"""
if name in _MODELS:
model_path = _download(_MODELS[name])
elif os.path.isfile(name):
model_path = name
else:
raise RuntimeError(f"Model {name} not found; available models = {available_models()}")
try:
# loading JIT archive
model = torch.jit.load(model_path, map_location=device if jit else "cpu").eval()
state_dict = None
except RuntimeError:
# loading saved state dict
if jit:
warnings.warn(f"File {model_path} is not a JIT archive. Loading as a state dict instead")
jit = False
state_dict = torch.load(model_path, map_location="cpu")
if not jit:
model = build_model(state_dict or model.state_dict()).to(device)
if str(device) == "cpu":
model.float()
return model, _transform(model.visual.input_resolution)
# patch the device names
device_holder = torch.jit.trace(lambda: torch.ones([]).to(torch.device(device)), example_inputs=[])
device_node = [n for n in device_holder.graph.findAllNodes("prim::Constant") if "Device" in repr(n)][-1]
def patch_device(module):
try:
graphs = [module.graph] if hasattr(module, "graph") else []
except RuntimeError:
graphs = []
if hasattr(module, "forward1"):
graphs.append(module.forward1.graph)
for graph in graphs:
for node in graph.findAllNodes("prim::Constant"):
if "value" in node.attributeNames() and str(node["value"]).startswith("cuda"):
node.copyAttributes(device_node)
model.apply(patch_device)
patch_device(model.encode_image)
patch_device(model.encode_text)
# patch dtype to float32 on CPU
if str(device) == "cpu":
float_holder = torch.jit.trace(lambda: torch.ones([]).float(), example_inputs=[])
float_input = list(float_holder.graph.findNode("aten::to").inputs())[1]
float_node = float_input.node()
def patch_float(module):
try:
graphs = [module.graph] if hasattr(module, "graph") else []
except RuntimeError:
graphs = []
if hasattr(module, "forward1"):
graphs.append(module.forward1.graph)
for graph in graphs:
for node in graph.findAllNodes("aten::to"):
inputs = list(node.inputs())
for i in [1, 2]: # dtype can be the second or third argument to aten::to()
if inputs[i].node()["value"] == 5:
inputs[i].node().copyAttributes(float_node)
model.apply(patch_float)
patch_float(model.encode_image)
patch_float(model.encode_text)
model.float()
return model, _transform(model.input_resolution.item())
def tokenize(texts: Union[str, List[str]], context_length: int = 77, truncate: bool = False) -> torch.LongTensor:
"""
Returns the tokenized representation of given input string(s)
Parameters
----------
texts : Union[str, List[str]]
An input string or a list of input strings to tokenize
context_length : int
The context length to use; all CLIP models use 77 as the context length
truncate: bool
Whether to truncate the text in case its encoding is longer than the context length
Returns
-------
A two-dimensional tensor containing the resulting tokens, shape = [number of input strings, context_length]
"""
if isinstance(texts, str):
texts = [texts]
sot_token = _tokenizer.encoder["<|startoftext|>"]
eot_token = _tokenizer.encoder["<|endoftext|>"]
all_tokens = [[sot_token] + _tokenizer.encode(text) + [eot_token] for text in texts]
result = torch.zeros(len(all_tokens), context_length, dtype=torch.long)
for i, tokens in enumerate(all_tokens):
if len(tokens) > context_length:
if truncate:
tokens = tokens[:context_length]
tokens[-1] = eot_token
else:
raise RuntimeError(f"Input {texts[i]} is too long for context length {context_length}")
result[i, :len(tokens)] = torch.tensor(tokens)
return result
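A minimal usage sketch for `tokenize` (illustrative, assuming the package is importable as `clip` from the repository root via `clip/__init__.py`):

```python
import clip  # the local package in this repository; __init__.py re-exports clip.clip

# Each string becomes one row of length context_length (77), zero-padded and
# wrapped in <|startoftext|> / <|endoftext|> tokens.
tokens = clip.tokenize(["a photo of a dog", "a photo of a cat"])
print(tokens.shape)  # torch.Size([2, 77])
```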

699
clip/model.py Normal file
View File

@@ -0,0 +1,699 @@
from collections import OrderedDict
from typing import Tuple, Union
import numpy as np
import torch
import torch.nn.functional as F
from torch import nn
class Bottleneck(nn.Module):
expansion = 4
def __init__(self, inplanes, planes, stride=1):
super().__init__()
# all conv layers have stride 1. an avgpool is performed after the second convolution when stride > 1
self.conv1 = nn.Conv2d(inplanes, planes, 1, bias=False)
self.bn1 = nn.BatchNorm2d(planes)
self.conv2 = nn.Conv2d(planes, planes, 3, padding=1, bias=False)
self.bn2 = nn.BatchNorm2d(planes)
self.avgpool = nn.AvgPool2d(stride) if stride > 1 else nn.Identity()
self.conv3 = nn.Conv2d(planes, planes * self.expansion, 1, bias=False)
self.bn3 = nn.BatchNorm2d(planes * self.expansion)
self.relu = nn.ReLU(inplace=True)
self.downsample = None
self.stride = stride
if stride > 1 or inplanes != planes * Bottleneck.expansion:
# downsampling layer is prepended with an avgpool, and the subsequent convolution has stride 1
self.downsample = nn.Sequential(OrderedDict([
("-1", nn.AvgPool2d(stride)),
("0", nn.Conv2d(inplanes, planes * self.expansion, 1, stride=1, bias=False)),
("1", nn.BatchNorm2d(planes * self.expansion))
]))
def forward(self, x: torch.Tensor):
identity = x
out = self.relu(self.bn1(self.conv1(x)))
out = self.relu(self.bn2(self.conv2(out)))
out = self.avgpool(out)
out = self.bn3(self.conv3(out))
if self.downsample is not None:
identity = self.downsample(x)
out += identity
out = self.relu(out)
return out
class AttentionPool2d(nn.Module):
def __init__(self, spacial_dim: int, embed_dim: int, num_heads: int, output_dim: int = None):
super().__init__()
self.positional_embedding = nn.Parameter(torch.randn(spacial_dim ** 2 + 1, embed_dim) / embed_dim ** 0.5)
self.k_proj = nn.Linear(embed_dim, embed_dim)
self.q_proj = nn.Linear(embed_dim, embed_dim)
self.v_proj = nn.Linear(embed_dim, embed_dim)
self.c_proj = nn.Linear(embed_dim, output_dim or embed_dim)
self.num_heads = num_heads
def forward(self, x):
x = x.reshape(x.shape[0], x.shape[1], x.shape[2] * x.shape[3]).permute(2, 0, 1) # NCHW -> (HW)NC
x = torch.cat([x.mean(dim=0, keepdim=True), x], dim=0) # (HW+1)NC
x = x + self.positional_embedding[:, None, :].to(x.dtype) # (HW+1)NC
x, _ = F.multi_head_attention_forward(
query=x, key=x, value=x,
embed_dim_to_check=x.shape[-1],
num_heads=self.num_heads,
q_proj_weight=self.q_proj.weight,
k_proj_weight=self.k_proj.weight,
v_proj_weight=self.v_proj.weight,
in_proj_weight=None,
in_proj_bias=torch.cat([self.q_proj.bias, self.k_proj.bias, self.v_proj.bias]),
bias_k=None,
bias_v=None,
add_zero_attn=False,
dropout_p=0,
out_proj_weight=self.c_proj.weight,
out_proj_bias=self.c_proj.bias,
use_separate_proj_weight=True,
training=self.training,
need_weights=False
)
return x[0]
class ModifiedResNet(nn.Module):
"""
A ResNet class that is similar to torchvision's but contains the following changes:
- There are now 3 "stem" convolutions as opposed to 1, with an average pool instead of a max pool.
- Performs anti-aliasing strided convolutions, where an avgpool is prepended to convolutions with stride > 1
- The final pooling layer is a QKV attention instead of an average pool
"""
def __init__(self, layers, output_dim, heads, input_resolution=224, width=64):
super().__init__()
self.output_dim = output_dim
self.input_resolution = input_resolution
# the 3-layer stem
self.conv1 = nn.Conv2d(3, width // 2, kernel_size=3, stride=2, padding=1, bias=False)
self.bn1 = nn.BatchNorm2d(width // 2)
self.conv2 = nn.Conv2d(width // 2, width // 2, kernel_size=3, padding=1, bias=False)
self.bn2 = nn.BatchNorm2d(width // 2)
self.conv3 = nn.Conv2d(width // 2, width, kernel_size=3, padding=1, bias=False)
self.bn3 = nn.BatchNorm2d(width)
self.avgpool = nn.AvgPool2d(2)
self.relu = nn.ReLU(inplace=True)
# residual layers
self._inplanes = width # this is a *mutable* variable used during construction
self.layer1 = self._make_layer(width, layers[0])
self.layer2 = self._make_layer(width * 2, layers[1], stride=2)
self.layer3 = self._make_layer(width * 4, layers[2], stride=2)
self.layer4 = self._make_layer(width * 8, layers[3], stride=2)
embed_dim = width * 32 # the ResNet feature dimension
self.attnpool = AttentionPool2d(input_resolution // 32, embed_dim, heads, output_dim)
def _make_layer(self, planes, blocks, stride=1):
layers = [Bottleneck(self._inplanes, planes, stride)]
self._inplanes = planes * Bottleneck.expansion
for _ in range(1, blocks):
layers.append(Bottleneck(self._inplanes, planes))
return nn.Sequential(*layers)
def forward(self, x):
def stem(x):
for conv, bn in [(self.conv1, self.bn1), (self.conv2, self.bn2), (self.conv3, self.bn3)]:
x = self.relu(bn(conv(x)))
x = self.avgpool(x)
return x
x = x.type(self.conv1.weight.dtype)
x = stem(x)
x = self.layer1(x)
x = self.layer2(x)
x = self.layer3(x)
x = self.layer4(x)
x = self.attnpool(x)
return x
class LayerNorm(nn.LayerNorm):
"""Subclass torch's LayerNorm to handle fp16."""
def forward(self, x: torch.Tensor):
orig_type = x.dtype
ret = super().forward(x.type(torch.float32))
return ret.type(orig_type)
class QuickGELU(nn.Module):
def forward(self, x: torch.Tensor):
return x * torch.sigmoid(1.702 * x)
class ResidualAttentionBlock(nn.Module):
def __init__(self, d_model: int, n_head: int, attn_mask: torch.Tensor = None):
super().__init__()
self.attn = nn.MultiheadAttention(d_model, n_head)
self.ln_1 = LayerNorm(d_model)
self.mlp = nn.Sequential(OrderedDict([
("c_fc", nn.Linear(d_model, d_model * 4)),
("gelu", QuickGELU()),
("c_proj", nn.Linear(d_model * 4, d_model))
]))
self.ln_2 = LayerNorm(d_model)
self.attn_mask = attn_mask
def attention(self, x: torch.Tensor):
self.attn_mask = self.attn_mask.to(dtype=x.dtype, device=x.device) if self.attn_mask is not None else None
return self.attn(x, x, x, need_weights=False, attn_mask=self.attn_mask)[0]
def forward(self, x: torch.Tensor):
x = x + self.attention(self.ln_1(x))
x = x + self.mlp(self.ln_2(x))
return x
class ResidualAttentionBlock_IVLP(nn.Module):
def __init__(self, d_model: int, n_head: int, attn_mask: torch.Tensor = None, add_prompt=False,
text_layer=False, i=0, design_details=None):
super().__init__()
self.attn = nn.MultiheadAttention(d_model, n_head)
self.ln_1 = LayerNorm(d_model)
self.mlp = nn.Sequential(OrderedDict([
("c_fc", nn.Linear(d_model, d_model * 4)),
("gelu", QuickGELU()),
("c_proj", nn.Linear(d_model * 4, d_model))
]))
self.ln_2 = LayerNorm(d_model)
        # Only add learnable tokens if the flag is set to True.
        # For the first layer (i = 0), we should not add the learnable parameters here,
        # as they have already been taken care of at the very start, for both the text
        # and the visual branch.
self.text_layer = text_layer
self.attn_mask = attn_mask
if i != 0:
self.add_prompt = add_prompt
if self.add_prompt:
if self.text_layer:
self.n_ctx_text = design_details["language_ctx"] # hyperparameter
ctx_vectors = torch.empty(self.n_ctx_text, d_model)
else:
self.n_ctx_visual = design_details["vision_ctx"] # hyperparameter
ctx_vectors = torch.empty(self.n_ctx_visual, d_model)
# Code snippet for per layer visual prompts
nn.init.normal_(ctx_vectors, std=0.02)
self.VPT_shallow = nn.Parameter(ctx_vectors)
else:
self.add_prompt = False
def attention(self, x: torch.Tensor):
self.attn_mask = self.attn_mask.to(dtype=x.dtype, device=x.device) if self.attn_mask is not None else None
return self.attn(x, x, x, need_weights=False, attn_mask=self.attn_mask)[0]
def forward(self, x: torch.Tensor):
# Will need to append the learnable tokens for this layer here
# Check if flag was set for this layer or not
if self.add_prompt:
# Also see if this is textual transformer layer or not
if not self.text_layer:
# Remove the outputs produced by learnable tokens of previous layer
prefix = x[0:x.shape[0] - self.n_ctx_visual, :, :]
# Create/configure learnable tokens of this layer
visual_context = self.VPT_shallow.expand(x.shape[1], -1, -1).permute(1, 0, 2).half()
                # Add the learnable tokens of this layer to the input, replacing the previous
                # layer's learnable tokens
x = torch.cat([prefix, visual_context], dim=0)
else:
# Appending the learnable tokens in different way
# x -> [77, NCLS, DIM]
# First remove the learnable tokens from previous layer
prefix = x[:1, :, :]
suffix = x[1 + self.n_ctx_text:, :, :]
# Create/configure learnable tokens of this layer
textual_context = self.VPT_shallow.expand(x.shape[1], -1, -1).permute(1, 0, 2).half()
                # Add the learnable tokens of this layer to the input, replacing the
                # previous layer's learnable tokens
x = torch.cat([prefix, textual_context, suffix], dim=0)
x = x + self.attention(self.ln_1(x))
x = x + self.mlp(self.ln_2(x))
return x
class ResidualAttentionBlock_MaPLe(nn.Module):
def __init__(self, d_model: int, n_head: int, attn_mask: torch.Tensor = None, design_details=None,
text_layer=False, i=0):
super().__init__()
self.attn = nn.MultiheadAttention(d_model, n_head)
self.ln_1 = LayerNorm(d_model)
self.mlp = nn.Sequential(OrderedDict([
("c_fc", nn.Linear(d_model, d_model * 4)),
("gelu", QuickGELU()),
("c_proj", nn.Linear(d_model * 4, d_model))
]))
self.ln_2 = LayerNorm(d_model)
        # For the first layer (i = 0), we do not need to add the learnable parameters here,
        # as they are added at the beginning, for both the text and the vision branch
self.text_layer = text_layer
self.attn_mask = attn_mask
# This must be consistent with the config file prompt
self.compound_prompt_nctx = design_details['maple_length']
if i == 0:
self.first_layer = True
else:
self.first_layer = False
def attention(self, x: torch.Tensor):
self.attn_mask = self.attn_mask.to(dtype=x.dtype, device=x.device) if self.attn_mask is not None else None
return self.attn(x, x, x, need_weights=False, attn_mask=self.attn_mask)[0]
def forward(self, inputs):
# For the first layer, we do not need to add any duplicate, as it is already added
# as the shallow version
x = inputs[0]
compound_prompts_deeper = inputs[1]
counter = inputs[2]
if not self.first_layer:
if len(compound_prompts_deeper) > 0:
# This means that deeper compound prompts are turned on
# Here it behaves differently for text and visual side
# Forward function is same for both
if not self.text_layer:
# First check if the ith layer needs compound prompts or not
if not (counter > len(compound_prompts_deeper) - 1):
# Remove the outputs produced by learnable tokens of previous layer
prefix = x[0:x.shape[0] - self.compound_prompt_nctx, :, :]
# Create/configure learnable tokens of this layer
visual_context = compound_prompts_deeper[counter] # extract the correct index
visual_context = visual_context.expand(x.shape[1], -1, -1).permute(1, 0, 2).half()
                        # Add the learnable tokens of this layer to the input, replacing the previous
                        # layer's learnable tokens
x = torch.cat([prefix, visual_context], dim=0)
                        # Once done, update the counter so that, next time, it does not use the same learnable tokens
counter += 1
else:
# First check if the ith layer needs compound prompts or not
if not (counter > len(compound_prompts_deeper) - 1):
# Appending the learnable tokens in different way
# x -> [77, NCLS, DIM]
# First remove the learnable tokens from previous layer
prefix = x[:1, :, :]
suffix = x[1 + self.compound_prompt_nctx:, :, :]
# Create/configure learnable tokens of this layer
textual_context = compound_prompts_deeper[counter]
textual_context = textual_context.expand(x.shape[1], -1, -1).permute(1, 0, 2).half()
                        # Add the learnable tokens of this layer to the input, replacing the
                        # previous layer's learnable tokens
x = torch.cat([prefix, textual_context, suffix], dim=0)
                        # Once done, update the counter so that, next time, it does not use the same learnable tokens
counter += 1
x = x + self.attention(self.ln_1(x))
x = x + self.mlp(self.ln_2(x))
return [x, compound_prompts_deeper, counter] # return again as a list, so that nn.seq can work
class Transformer(nn.Module):
def __init__(self, width: int, layers: int, heads: int, attn_mask: torch.Tensor = None, prompts_needed=0,
text_layer=False, design_details=None):
super().__init__()
self.width = width
self.layers = layers
# Implements respective encoder blocks for a given design choice
current_trainer = design_details['trainer']
if current_trainer == 'IVLP' or current_trainer == 'VPT':
self.resblocks = nn.Sequential(*[ResidualAttentionBlock_IVLP(width, heads, attn_mask, True,
text_layer, i,
design_details) if prompts_needed > i
else ResidualAttentionBlock_IVLP(width, heads, attn_mask, False,
text_layer, i, design_details)
for i in range(layers)])
elif current_trainer == 'MaPLe':
self.resblocks = nn.Sequential(
*[ResidualAttentionBlock_MaPLe(width, heads, attn_mask, design_details, text_layer, i)
for i in range(layers)])
else:
# Corresponds to default CoOp or CoCoOp
assert current_trainer == 'CoOp' or current_trainer == 'CoCoOp'
self.resblocks = nn.Sequential(*[ResidualAttentionBlock(width, heads, attn_mask) for _ in range(layers)])
def forward(self, x: torch.Tensor):
return self.resblocks(x)
class VisionTransformer(nn.Module):
def __init__(self, input_resolution: int, patch_size: int, width: int, layers: int, heads: int,
output_dim: int, design_details):
super().__init__()
self.input_resolution = input_resolution
self.output_dim = output_dim
self.conv1 = nn.Conv2d(in_channels=3, out_channels=width, kernel_size=patch_size, stride=patch_size, bias=False)
if design_details["vision_depth"] == 0:
self.VPT_shallow = False
else:
self.VPT_shallow = True
if self.VPT_shallow:
# Add visual prompt tokens here
n_ctx = design_details["vision_ctx"] # hyperparameter
ctx_vectors = torch.empty(n_ctx, width)
nn.init.normal_(ctx_vectors, std=0.02)
self.VPT = nn.Parameter(ctx_vectors)
# self.VPT.half()
scale = width ** -0.5
self.class_embedding = nn.Parameter(scale * torch.randn(width))
self.positional_embedding = nn.Parameter(scale * torch.randn((input_resolution // patch_size) ** 2 + 1, width))
self.ln_pre = LayerNorm(width)
        # hyper-parameter controlling whether prompt embeddings are added to the input
        # of the transformer blocks:
self.prompt_till_layer_visual = design_details["vision_depth"]
self.transformer = Transformer(width, layers, heads, prompts_needed=self.prompt_till_layer_visual,
design_details=design_details)
self.ln_post = LayerNorm(width)
self.proj = nn.Parameter(scale * torch.randn(width, output_dim))
def forward(self, x: torch.Tensor):
x = self.conv1(x) # shape = [*, width, grid, grid]
x = x.reshape(x.shape[0], x.shape[1], -1) # shape = [*, width, grid ** 2]
x = x.permute(0, 2, 1) # shape = [*, grid ** 2, width]
x = torch.cat(
[self.class_embedding.to(x.dtype) + torch.zeros(x.shape[0], 1, x.shape[-1], dtype=x.dtype,
device=x.device),
x], dim=1) # shape = [*, grid ** 2 + 1, width]
x = x + self.positional_embedding.to(x.dtype)
        # After the positional embeddings, we attach the prompts to the input; note that these
        # are the only trainable parameters in the whole image encoder.
if self.VPT_shallow:
visual_ctx = self.VPT.expand(x.shape[0], -1, -1).half()
x = torch.cat([x, visual_ctx], dim=1)
else:
assert self.prompt_till_layer_visual == 0
# Normal code as before
x = self.ln_pre(x)
x = x.permute(1, 0, 2) # NLD -> LND
x = self.transformer(x)
x = x.permute(1, 0, 2) # LND -> NLD
x = self.ln_post(x[:, 0, :])
if self.proj is not None:
x = x @ self.proj
return x
class VisionTransformer_MaPLe(nn.Module):
def __init__(self, input_resolution: int, patch_size: int, width: int, layers: int, heads: int, output_dim: int,
design_details):
super().__init__()
self.input_resolution = input_resolution
self.output_dim = output_dim
self.conv1 = nn.Conv2d(in_channels=3, out_channels=width, kernel_size=patch_size, stride=patch_size, bias=False)
self.VPT_shallow = True
scale = width ** -0.5
self.class_embedding = nn.Parameter(scale * torch.randn(width))
self.positional_embedding = nn.Parameter(scale * torch.randn((input_resolution // patch_size) ** 2 + 1, width))
self.ln_pre = LayerNorm(width)
        # hyper-parameter controlling whether prompt embeddings are added to the input
        # of the transformer blocks:
self.prompt_till_layer_visual = 0
self.transformer = Transformer(width, layers, heads, design_details=design_details)
self.ln_post = LayerNorm(width)
self.proj = nn.Parameter(scale * torch.randn(width, output_dim))
def forward(self, x: torch.Tensor, shared_ctx, compound_deeper_prompts):
x = self.conv1(x) # shape = [*, width, grid, grid]
x = x.reshape(x.shape[0], x.shape[1], -1) # shape = [*, width, grid ** 2]
x = x.permute(0, 2, 1) # shape = [*, grid ** 2, width]
x = torch.cat(
[self.class_embedding.to(x.dtype) + torch.zeros(x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device),
x], dim=1) # shape = [*, grid ** 2 + 1, width]
x = x + self.positional_embedding.to(x.dtype)
        # After the positional embeddings, we attach the prompts to the input; note that these
        # are the only trainable parameters in the whole image encoder.
if self.VPT_shallow:
visual_ctx = shared_ctx.expand(x.shape[0], -1, -1).half()
x = torch.cat([x, visual_ctx], dim=1)
else:
assert self.prompt_till_layer_visual == 0
# Normal code as before
x = self.ln_pre(x)
x = x.permute(1, 0, 2) # NLD -> LND
# Again combine the inputs, so nn.sequential can work
outputs = self.transformer([x, compound_deeper_prompts, 0]) # third argument is counter
x = outputs[0]
x = x.permute(1, 0, 2) # LND -> NLD
x = self.ln_post(x[:, 0, :])
if self.proj is not None:
x = x @ self.proj
return x
class CLIP(nn.Module):
def __init__(self,
embed_dim: int,
# vision
image_resolution: int,
vision_layers: Union[Tuple[int, int, int, int], int],
vision_width: int,
vision_patch_size: int,
# text
context_length: int,
vocab_size: int,
transformer_width: int,
transformer_heads: int,
transformer_layers: int,
design_details
):
super().__init__()
self.context_length = context_length
trainer = design_details['trainer']
if isinstance(vision_layers, (tuple, list)):
vision_heads = vision_width * 32 // 64
self.visual = ModifiedResNet(
layers=vision_layers,
output_dim=embed_dim,
heads=vision_heads,
input_resolution=image_resolution,
width=vision_width
)
else:
vision_heads = vision_width // 64
if trainer == "MaPLe":
self.visual = VisionTransformer_MaPLe(
input_resolution=image_resolution,
patch_size=vision_patch_size,
width=vision_width,
layers=vision_layers,
heads=vision_heads,
output_dim=embed_dim,
design_details=design_details
)
else:
self.visual = VisionTransformer(
input_resolution=image_resolution,
patch_size=vision_patch_size,
width=vision_width,
layers=vision_layers,
heads=vision_heads,
output_dim=embed_dim,
design_details=design_details
)
        # hyper-parameter controlling whether prompt embeddings are added to the input
        # of the transformer blocks:
prompt_till_layer_text = design_details['language_depth']
self.transformer = Transformer(
width=transformer_width,
layers=transformer_layers,
heads=transformer_heads,
attn_mask=self.build_attention_mask(),
prompts_needed=prompt_till_layer_text,
text_layer=True,
design_details=design_details
)
self.vocab_size = vocab_size
self.token_embedding = nn.Embedding(vocab_size, transformer_width)
self.positional_embedding = nn.Parameter(torch.empty(self.context_length, transformer_width))
self.ln_final = LayerNorm(transformer_width)
self.text_projection = nn.Parameter(torch.empty(transformer_width, embed_dim))
self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))
self.initialize_parameters()
def initialize_parameters(self):
nn.init.normal_(self.token_embedding.weight, std=0.02)
nn.init.normal_(self.positional_embedding, std=0.01)
if isinstance(self.visual, ModifiedResNet):
if self.visual.attnpool is not None:
std = self.visual.attnpool.c_proj.in_features ** -0.5
nn.init.normal_(self.visual.attnpool.q_proj.weight, std=std)
nn.init.normal_(self.visual.attnpool.k_proj.weight, std=std)
nn.init.normal_(self.visual.attnpool.v_proj.weight, std=std)
nn.init.normal_(self.visual.attnpool.c_proj.weight, std=std)
for resnet_block in [self.visual.layer1, self.visual.layer2, self.visual.layer3, self.visual.layer4]:
for name, param in resnet_block.named_parameters():
if name.endswith("bn3.weight"):
nn.init.zeros_(param)
proj_std = (self.transformer.width ** -0.5) * ((2 * self.transformer.layers) ** -0.5)
attn_std = self.transformer.width ** -0.5
fc_std = (2 * self.transformer.width) ** -0.5
for block in self.transformer.resblocks:
nn.init.normal_(block.attn.in_proj_weight, std=attn_std)
nn.init.normal_(block.attn.out_proj.weight, std=proj_std)
nn.init.normal_(block.mlp.c_fc.weight, std=fc_std)
nn.init.normal_(block.mlp.c_proj.weight, std=proj_std)
if self.text_projection is not None:
nn.init.normal_(self.text_projection, std=self.transformer.width ** -0.5)
def build_attention_mask(self):
# lazily create causal attention mask, with full attention between the vision tokens
# pytorch uses additive attention mask; fill with -inf
mask = torch.empty(self.context_length, self.context_length)
mask.fill_(float("-inf"))
mask.triu_(1) # zero out the lower diagonal
return mask
@property
def dtype(self):
return self.visual.conv1.weight.dtype
def encode_image(self, image):
return self.visual(image.type(self.dtype))
def encode_text(self, text):
x = self.token_embedding(text).type(self.dtype) # [batch_size, n_ctx, d_model]
x = x + self.positional_embedding.type(self.dtype)
x = x.permute(1, 0, 2) # NLD -> LND
x = self.transformer(x)
x = x.permute(1, 0, 2) # LND -> NLD
x = self.ln_final(x).type(self.dtype)
# x.shape = [batch_size, n_ctx, transformer.width]
# take features from the eot embedding (eot_token is the highest number in each sequence)
x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection
return x
def forward(self, image, text):
image_features = self.encode_image(image)
text_features = self.encode_text(text)
# normalized features
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
# cosine similarity as logits
logit_scale = self.logit_scale.exp()
logits_per_image = logit_scale * image_features @ text_features.t()
logits_per_text = logit_scale * text_features @ image_features.t()
# shape = [global_batch_size, global_batch_size]
return logits_per_image, logits_per_text
def convert_weights(model: nn.Module):
"""Convert applicable model parameters to fp16"""
def _convert_weights_to_fp16(l):
if isinstance(l, (nn.Conv1d, nn.Conv2d, nn.Linear)):
l.weight.data = l.weight.data.half()
if l.bias is not None:
l.bias.data = l.bias.data.half()
if isinstance(l, nn.MultiheadAttention):
for attr in [*[f"{s}_proj_weight" for s in ["in", "q", "k", "v"]], "in_proj_bias", "bias_k", "bias_v"]:
tensor = getattr(l, attr)
if tensor is not None:
tensor.data = tensor.data.half()
for name in ["text_projection", "proj"]:
if hasattr(l, name):
attr = getattr(l, name)
if attr is not None:
attr.data = attr.data.half()
model.apply(_convert_weights_to_fp16)
def build_model(state_dict: dict, design_details):
vit = "visual.proj" in state_dict
if vit:
vision_width = state_dict["visual.conv1.weight"].shape[0]
vision_layers = len(
[k for k in state_dict.keys() if k.startswith("visual.") and k.endswith(".attn.in_proj_weight")])
vision_patch_size = state_dict["visual.conv1.weight"].shape[-1]
grid_size = round((state_dict["visual.positional_embedding"].shape[0] - 1) ** 0.5)
image_resolution = vision_patch_size * grid_size
else:
counts: list = [len(set(k.split(".")[2] for k in state_dict if k.startswith(f"visual.layer{b}"))) for b in
[1, 2, 3, 4]]
vision_layers = tuple(counts)
vision_width = state_dict["visual.layer1.0.conv1.weight"].shape[0]
output_width = round((state_dict["visual.attnpool.positional_embedding"].shape[0] - 1) ** 0.5)
vision_patch_size = None
assert output_width ** 2 + 1 == state_dict["visual.attnpool.positional_embedding"].shape[0]
image_resolution = output_width * 32
embed_dim = state_dict["text_projection"].shape[1]
context_length = state_dict["positional_embedding"].shape[0]
vocab_size = state_dict["token_embedding.weight"].shape[0]
transformer_width = state_dict["ln_final.weight"].shape[0]
transformer_heads = transformer_width // 64
transformer_layers = len(set(k.split(".")[2] for k in state_dict if k.startswith(f"transformer.resblocks")))
model = CLIP(
embed_dim,
image_resolution, vision_layers, vision_width, vision_patch_size,
context_length, vocab_size, transformer_width, transformer_heads, transformer_layers, design_details
)
for key in ["input_resolution", "context_length", "vocab_size"]:
if key in state_dict:
del state_dict[key]
convert_weights(model)
try:
model.load_state_dict(state_dict)
    except RuntimeError:
missing_keys, _ = model.load_state_dict(state_dict, strict=False)
print('Weights not found for some missing keys: ', missing_keys)
return model.eval()
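`build_model` above expects a `design_details` dictionary describing the prompting setup. The keys below are taken from the code in this file; the values are placeholders for illustration only (the trainers assemble the real dictionary from the YAML configs), and the checkpoint path is hypothetical:

```python
import torch
from clip.model import build_model

# Illustrative sketch: keys inferred from the code above, values are placeholders.
design_details = {
    "trainer": "IVLP",      # one of: IVLP / VPT / MaPLe / CoOp / CoCoOp
    "vision_depth": 9,      # number of vision layers that receive learnable prompts
    "vision_ctx": 4,        # learnable vision tokens per prompted layer
    "language_depth": 9,    # number of text layers that receive learnable prompts
    "language_ctx": 4,      # learnable text tokens per prompted layer
    "maple_length": 2,      # only read when trainer == "MaPLe"
}

# "ViT-B-16.pt" is a placeholder path to a downloaded CLIP JIT checkpoint.
state_dict = torch.jit.load("ViT-B-16.pt", map_location="cpu").state_dict()
model = build_model(state_dict, design_details)
```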

132
clip/simple_tokenizer.py Normal file
View File

@@ -0,0 +1,132 @@
import gzip
import html
import os
from functools import lru_cache
import ftfy
import regex as re
@lru_cache()
def default_bpe():
return os.path.join(os.path.dirname(os.path.abspath(__file__)), "bpe_simple_vocab_16e6.txt.gz")
@lru_cache()
def bytes_to_unicode():
"""
    Returns a list of utf-8 bytes and a corresponding list of unicode strings.
The reversible bpe codes work on unicode strings.
This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
    This is a significant percentage of your normal, say, 32K bpe vocab.
To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
And avoids mapping to whitespace/control characters the bpe code barfs on.
"""
bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
cs = bs[:]
n = 0
for b in range(2**8):
if b not in bs:
bs.append(b)
cs.append(2**8+n)
n += 1
cs = [chr(n) for n in cs]
return dict(zip(bs, cs))
def get_pairs(word):
"""Return set of symbol pairs in a word.
Word is represented as tuple of symbols (symbols being variable-length strings).
"""
pairs = set()
prev_char = word[0]
for char in word[1:]:
pairs.add((prev_char, char))
prev_char = char
return pairs
def basic_clean(text):
text = ftfy.fix_text(text)
text = html.unescape(html.unescape(text))
return text.strip()
def whitespace_clean(text):
text = re.sub(r'\s+', ' ', text)
text = text.strip()
return text
class SimpleTokenizer(object):
def __init__(self, bpe_path: str = default_bpe()):
self.byte_encoder = bytes_to_unicode()
self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
merges = gzip.open(bpe_path).read().decode("utf-8").split('\n')
merges = merges[1:49152-256-2+1]
merges = [tuple(merge.split()) for merge in merges]
vocab = list(bytes_to_unicode().values())
vocab = vocab + [v+'</w>' for v in vocab]
for merge in merges:
vocab.append(''.join(merge))
vocab.extend(['<|startoftext|>', '<|endoftext|>'])
self.encoder = dict(zip(vocab, range(len(vocab))))
self.decoder = {v: k for k, v in self.encoder.items()}
self.bpe_ranks = dict(zip(merges, range(len(merges))))
self.cache = {'<|startoftext|>': '<|startoftext|>', '<|endoftext|>': '<|endoftext|>'}
self.pat = re.compile(r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+""", re.IGNORECASE)
def bpe(self, token):
if token in self.cache:
return self.cache[token]
word = tuple(token[:-1]) + ( token[-1] + '</w>',)
pairs = get_pairs(word)
if not pairs:
return token+'</w>'
while True:
bigram = min(pairs, key = lambda pair: self.bpe_ranks.get(pair, float('inf')))
if bigram not in self.bpe_ranks:
break
first, second = bigram
new_word = []
i = 0
while i < len(word):
try:
j = word.index(first, i)
new_word.extend(word[i:j])
i = j
                except ValueError:
new_word.extend(word[i:])
break
if word[i] == first and i < len(word)-1 and word[i+1] == second:
new_word.append(first+second)
i += 2
else:
new_word.append(word[i])
i += 1
new_word = tuple(new_word)
word = new_word
if len(word) == 1:
break
else:
pairs = get_pairs(word)
word = ' '.join(word)
self.cache[token] = word
return word
def encode(self, text):
bpe_tokens = []
text = whitespace_clean(basic_clean(text)).lower()
for token in re.findall(self.pat, text):
token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
bpe_tokens.extend(self.encoder[bpe_token] for bpe_token in self.bpe(token).split(' '))
return bpe_tokens
def decode(self, tokens):
text = ''.join([self.decoder[token] for token in tokens])
text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors="replace").replace('</w>', ' ')
return text
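A minimal round-trip sketch for the tokenizer above (illustrative; it assumes the default BPE vocab file `bpe_simple_vocab_16e6.txt.gz` ships alongside this module, as `default_bpe()` expects):

```python
from clip.simple_tokenizer import SimpleTokenizer

tok = SimpleTokenizer()
ids = tok.encode("a photo of a dog")
print(ids)              # BPE token ids; encode() does not add <|startoftext|>/<|endoftext|>,
                        # those are added by clip.tokenize()
print(tok.decode(ids))  # "a photo of a dog " (decode re-inserts word-boundary spaces)
```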

49409
clip_words.csv Normal file

File diff suppressed because it is too large

View File

@@ -0,0 +1,2 @@
DATASET:
NAME: "Caltech101"

View File

@@ -0,0 +1,2 @@
DATASET:
NAME: "DescribableTextures"

View File

@@ -0,0 +1,2 @@
DATASET:
NAME: "EuroSAT"

View File

@@ -0,0 +1,2 @@
DATASET:
NAME: "FGVCAircraft"

View File

@@ -0,0 +1,2 @@
DATASET:
NAME: "Food101"

View File

@@ -0,0 +1,2 @@
DATASET:
NAME: "ImageNet"

View File

@@ -0,0 +1,2 @@
DATASET:
NAME: "ImageNetA"

View File

@@ -0,0 +1,2 @@
DATASET:
NAME: "ImageNetR"

View File

@@ -0,0 +1,2 @@
DATASET:
NAME: "ImageNetSketch"

View File

@@ -0,0 +1,2 @@
DATASET:
NAME: "ImageNetV2"

View File

@@ -0,0 +1,2 @@
DATASET:
NAME: "OxfordFlowers"

View File

@@ -0,0 +1,2 @@
DATASET:
NAME: "OxfordPets"

View File

@@ -0,0 +1,2 @@
DATASET:
NAME: "StanfordCars"

View File

@@ -0,0 +1,2 @@
DATASET:
NAME: "SUN397"

View File

@@ -0,0 +1,2 @@
DATASET:
NAME: "UCF101"

View File

@@ -0,0 +1,35 @@
DATALOADER:
TRAIN_X:
BATCH_SIZE: 1
TEST:
BATCH_SIZE: 100
NUM_WORKERS: 8
INPUT:
SIZE: (224, 224)
INTERPOLATION: "bicubic"
PIXEL_MEAN: [0.48145466, 0.4578275, 0.40821073]
PIXEL_STD: [0.26862954, 0.26130258, 0.27577711]
TRANSFORMS: ["random_resized_crop", "random_flip", "normalize"]
OPTIM:
NAME: "sgd"
LR: 0.002
MAX_EPOCH: 10
LR_SCHEDULER: "cosine"
WARMUP_EPOCH: 1
WARMUP_TYPE: "constant"
WARMUP_CONS_LR: 1e-5
TRAIN:
PRINT_FREQ: 20
MODEL:
BACKBONE:
NAME: "ViT-B/16"
TRAINER:
COCOOP:
N_CTX: 16
CTX_INIT: ""
PREC: "fp16"

View File

@@ -0,0 +1,35 @@
DATALOADER:
TRAIN_X:
BATCH_SIZE: 1
TEST:
BATCH_SIZE: 100
NUM_WORKERS: 8
INPUT:
SIZE: (224, 224)
INTERPOLATION: "bicubic"
PIXEL_MEAN: [0.48145466, 0.4578275, 0.40821073]
PIXEL_STD: [0.26862954, 0.26130258, 0.27577711]
TRANSFORMS: ["random_resized_crop", "random_flip", "normalize"]
OPTIM:
NAME: "sgd"
LR: 0.002
MAX_EPOCH: 10
LR_SCHEDULER: "cosine"
WARMUP_EPOCH: 1
WARMUP_TYPE: "constant"
WARMUP_CONS_LR: 1e-5
TRAIN:
PRINT_FREQ: 20
MODEL:
BACKBONE:
NAME: "ViT-B/16"
TRAINER:
COCOOP:
N_CTX: 4
CTX_INIT: ""
PREC: "fp16"

View File

@@ -0,0 +1,35 @@
DATALOADER:
TRAIN_X:
BATCH_SIZE: 1
TEST:
BATCH_SIZE: 100
NUM_WORKERS: 8
INPUT:
SIZE: (224, 224)
INTERPOLATION: "bicubic"
PIXEL_MEAN: [0.48145466, 0.4578275, 0.40821073]
PIXEL_STD: [0.26862954, 0.26130258, 0.27577711]
TRANSFORMS: ["random_resized_crop", "random_flip", "normalize"]
OPTIM:
NAME: "sgd"
LR: 0.002
MAX_EPOCH: 10
LR_SCHEDULER: "cosine"
WARMUP_EPOCH: 1
WARMUP_TYPE: "constant"
WARMUP_CONS_LR: 1e-5
TRAIN:
PRINT_FREQ: 20
MODEL:
BACKBONE:
NAME: "ViT-B/16"
TRAINER:
COCOOP:
N_CTX: 4
CTX_INIT: "a photo of a"
PREC: "fp16"

View File

@@ -0,0 +1,35 @@
DATALOADER:
TRAIN_X:
BATCH_SIZE: 1
TEST:
BATCH_SIZE: 100
NUM_WORKERS: 8
INPUT:
SIZE: (224, 224)
INTERPOLATION: "bicubic"
PIXEL_MEAN: [0.48145466, 0.4578275, 0.40821073]
PIXEL_STD: [0.26862954, 0.26130258, 0.27577711]
TRANSFORMS: ["random_resized_crop", "random_flip", "normalize"]
OPTIM:
NAME: "sgd"
LR: 0.002
MAX_EPOCH: 10
LR_SCHEDULER: "cosine"
WARMUP_EPOCH: 1
WARMUP_TYPE: "constant"
WARMUP_CONS_LR: 1e-5
TRAIN:
PRINT_FREQ: 20
MODEL:
BACKBONE:
NAME: "ViT-B/16"
TRAINER:
COCOOP:
N_CTX: 8
CTX_INIT: ""
PREC: "fp16"

View File

@@ -0,0 +1,29 @@
DATALOADER:
TRAIN_X:
BATCH_SIZE: 32
TEST:
BATCH_SIZE: 100
NUM_WORKERS: 8
INPUT:
SIZE: (224, 224)
INTERPOLATION: "bicubic"
PIXEL_MEAN: [0.48145466, 0.4578275, 0.40821073]
PIXEL_STD: [0.26862954, 0.26130258, 0.27577711]
TRANSFORMS: ["random_resized_crop", "random_flip", "normalize"]
OPTIM:
NAME: "sgd"
LR: 0.002
MAX_EPOCH: 200
LR_SCHEDULER: "cosine"
WARMUP_EPOCH: 1
WARMUP_TYPE: "constant"
WARMUP_CONS_LR: 1e-5
TRAIN:
PRINT_FREQ: 5
MODEL:
BACKBONE:
NAME: "RN101"

View File

@@ -0,0 +1,29 @@
DATALOADER:
TRAIN_X:
BATCH_SIZE: 32
TEST:
BATCH_SIZE: 100
NUM_WORKERS: 8
INPUT:
SIZE: (224, 224)
INTERPOLATION: "bicubic"
PIXEL_MEAN: [0.48145466, 0.4578275, 0.40821073]
PIXEL_STD: [0.26862954, 0.26130258, 0.27577711]
TRANSFORMS: ["random_resized_crop", "random_flip", "normalize"]
OPTIM:
NAME: "sgd"
LR: 0.002
MAX_EPOCH: 50
LR_SCHEDULER: "cosine"
WARMUP_EPOCH: 1
WARMUP_TYPE: "constant"
WARMUP_CONS_LR: 1e-5
TRAIN:
PRINT_FREQ: 5
MODEL:
BACKBONE:
NAME: "RN101"

View File

@@ -0,0 +1,29 @@
DATALOADER:
TRAIN_X:
BATCH_SIZE: 32
TEST:
BATCH_SIZE: 100
NUM_WORKERS: 8
INPUT:
SIZE: (224, 224)
INTERPOLATION: "bicubic"
PIXEL_MEAN: [0.48145466, 0.4578275, 0.40821073]
PIXEL_STD: [0.26862954, 0.26130258, 0.27577711]
TRANSFORMS: ["random_resized_crop", "random_flip", "normalize"]
OPTIM:
NAME: "sgd"
LR: 0.002
MAX_EPOCH: 200
LR_SCHEDULER: "cosine"
WARMUP_EPOCH: 1
WARMUP_TYPE: "constant"
WARMUP_CONS_LR: 1e-5
TRAIN:
PRINT_FREQ: 5
MODEL:
BACKBONE:
NAME: "RN50"

View File

@@ -0,0 +1,33 @@
DATALOADER:
TRAIN_X:
BATCH_SIZE: 32
TEST:
BATCH_SIZE: 100
NUM_WORKERS: 8
INPUT:
SIZE: (224, 224)
INTERPOLATION: "bicubic"
PIXEL_MEAN: [0.48145466, 0.4578275, 0.40821073]
PIXEL_STD: [0.26862954, 0.26130258, 0.27577711]
TRANSFORMS: ["random_resized_crop", "random_flip", "normalize"]
OPTIM:
NAME: "sgd"
LR: 0.002
MAX_EPOCH: 200
LR_SCHEDULER: "cosine"
WARMUP_EPOCH: 1
WARMUP_TYPE: "constant"
WARMUP_CONS_LR: 1e-5
TRAIN:
PRINT_FREQ: 5
MODEL:
BACKBONE:
NAME: "RN50"
TRAINER:
COOP:
CTX_INIT: "a photo of a"

View File

@@ -0,0 +1,29 @@
DATALOADER:
TRAIN_X:
BATCH_SIZE: 32
TEST:
BATCH_SIZE: 100
NUM_WORKERS: 8
INPUT:
SIZE: (224, 224)
INTERPOLATION: "bicubic"
PIXEL_MEAN: [0.48145466, 0.4578275, 0.40821073]
PIXEL_STD: [0.26862954, 0.26130258, 0.27577711]
TRANSFORMS: ["random_resized_crop", "random_flip", "normalize"]
OPTIM:
NAME: "sgd"
LR: 0.002
MAX_EPOCH: 100
LR_SCHEDULER: "cosine"
WARMUP_EPOCH: 1
WARMUP_TYPE: "constant"
WARMUP_CONS_LR: 1e-5
TRAIN:
PRINT_FREQ: 5
MODEL:
BACKBONE:
NAME: "RN50"

View File

@@ -0,0 +1,29 @@
DATALOADER:
TRAIN_X:
BATCH_SIZE: 32
TEST:
BATCH_SIZE: 100
NUM_WORKERS: 8
INPUT:
SIZE: (224, 224)
INTERPOLATION: "bicubic"
PIXEL_MEAN: [0.48145466, 0.4578275, 0.40821073]
PIXEL_STD: [0.26862954, 0.26130258, 0.27577711]
TRANSFORMS: ["random_resized_crop", "random_flip", "normalize"]
OPTIM:
NAME: "sgd"
LR: 0.002
MAX_EPOCH: 50
LR_SCHEDULER: "cosine"
WARMUP_EPOCH: 1
WARMUP_TYPE: "constant"
WARMUP_CONS_LR: 1e-5
TRAIN:
PRINT_FREQ: 5
MODEL:
BACKBONE:
NAME: "RN50"

View File

@@ -0,0 +1,33 @@
DATALOADER:
TRAIN_X:
BATCH_SIZE: 32
TEST:
BATCH_SIZE: 100
NUM_WORKERS: 8
INPUT:
SIZE: (224, 224)
INTERPOLATION: "bicubic"
PIXEL_MEAN: [0.48145466, 0.4578275, 0.40821073]
PIXEL_STD: [0.26862954, 0.26130258, 0.27577711]
TRANSFORMS: ["random_resized_crop", "random_flip", "normalize"]
OPTIM:
NAME: "sgd"
LR: 0.002
MAX_EPOCH: 50
LR_SCHEDULER: "cosine"
WARMUP_EPOCH: 1
WARMUP_TYPE: "constant"
WARMUP_CONS_LR: 1e-5
TRAIN:
PRINT_FREQ: 5
MODEL:
BACKBONE:
NAME: "RN50"
TRAINER:
COOP:
CTX_INIT: "a photo of a"

View File

@@ -0,0 +1,17 @@
DATALOADER:
TRAIN_X:
BATCH_SIZE: 200
TEST:
BATCH_SIZE: 200
NUM_WORKERS: 8
INPUT:
SIZE: (224, 224)
INTERPOLATION: "bicubic"
PIXEL_MEAN: [0.48145466, 0.4578275, 0.40821073]
PIXEL_STD: [0.26862954, 0.26130258, 0.27577711]
TRANSFORMS: ["random_resized_crop", "random_flip", "normalize"]
MODEL:
BACKBONE:
NAME: "RN50"

View File

@@ -0,0 +1,29 @@
DATALOADER:
TRAIN_X:
BATCH_SIZE: 32
TEST:
BATCH_SIZE: 100
NUM_WORKERS: 8
INPUT:
SIZE: (224, 224)
INTERPOLATION: "bicubic"
PIXEL_MEAN: [0.48145466, 0.4578275, 0.40821073]
PIXEL_STD: [0.26862954, 0.26130258, 0.27577711]
TRANSFORMS: ["random_resized_crop", "random_flip", "normalize"]
OPTIM:
NAME: "sgd"
LR: 0.002
MAX_EPOCH: 200
LR_SCHEDULER: "cosine"
WARMUP_EPOCH: 1
WARMUP_TYPE: "constant"
WARMUP_CONS_LR: 1e-5
TRAIN:
PRINT_FREQ: 5
MODEL:
BACKBONE:
NAME: "ViT-B/16"

View File

@@ -0,0 +1,29 @@
DATALOADER:
TRAIN_X:
BATCH_SIZE: 32
TEST:
BATCH_SIZE: 100
NUM_WORKERS: 8
INPUT:
SIZE: (224, 224)
INTERPOLATION: "bicubic"
PIXEL_MEAN: [0.48145466, 0.4578275, 0.40821073]
PIXEL_STD: [0.26862954, 0.26130258, 0.27577711]
TRANSFORMS: ["random_resized_crop", "random_flip", "normalize"]
OPTIM:
NAME: "sgd"
LR: 0.002
MAX_EPOCH: 100
LR_SCHEDULER: "cosine"
WARMUP_EPOCH: 1
WARMUP_TYPE: "constant"
WARMUP_CONS_LR: 1e-5
TRAIN:
PRINT_FREQ: 5
MODEL:
BACKBONE:
NAME: "ViT-B/16"

View File

@@ -0,0 +1,29 @@
DATALOADER:
TRAIN_X:
BATCH_SIZE: 32
TEST:
BATCH_SIZE: 100
NUM_WORKERS: 8
INPUT:
SIZE: (224, 224)
INTERPOLATION: "bicubic"
PIXEL_MEAN: [0.48145466, 0.4578275, 0.40821073]
PIXEL_STD: [0.26862954, 0.26130258, 0.27577711]
TRANSFORMS: ["random_resized_crop", "random_flip", "normalize"]
OPTIM:
NAME: "sgd"
LR: 0.002
MAX_EPOCH: 50
LR_SCHEDULER: "cosine"
WARMUP_EPOCH: 1
WARMUP_TYPE: "constant"
WARMUP_CONS_LR: 1e-5
TRAIN:
PRINT_FREQ: 5
MODEL:
BACKBONE:
NAME: "ViT-B/16"

View File

@@ -0,0 +1,29 @@
DATALOADER:
TRAIN_X:
BATCH_SIZE: 32
TEST:
BATCH_SIZE: 100
NUM_WORKERS: 8
INPUT:
SIZE: (224, 224)
INTERPOLATION: "bicubic"
PIXEL_MEAN: [0.48145466, 0.4578275, 0.40821073]
PIXEL_STD: [0.26862954, 0.26130258, 0.27577711]
TRANSFORMS: ["random_resized_crop", "random_flip", "normalize"]
OPTIM:
NAME: "sgd"
LR: 0.002
MAX_EPOCH: 200
LR_SCHEDULER: "cosine"
WARMUP_EPOCH: 1
WARMUP_TYPE: "constant"
WARMUP_CONS_LR: 1e-5
TRAIN:
PRINT_FREQ: 5
MODEL:
BACKBONE:
NAME: "ViT-B/32"

View File

@@ -0,0 +1,29 @@
DATALOADER:
TRAIN_X:
BATCH_SIZE: 32
TEST:
BATCH_SIZE: 100
NUM_WORKERS: 8
INPUT:
SIZE: (224, 224)
INTERPOLATION: "bicubic"
PIXEL_MEAN: [0.48145466, 0.4578275, 0.40821073]
PIXEL_STD: [0.26862954, 0.26130258, 0.27577711]
TRANSFORMS: ["random_resized_crop", "random_flip", "normalize"]
OPTIM:
NAME: "sgd"
LR: 0.002
MAX_EPOCH: 50
LR_SCHEDULER: "cosine"
WARMUP_EPOCH: 1
WARMUP_TYPE: "constant"
WARMUP_CONS_LR: 1e-5
TRAIN:
PRINT_FREQ: 5
MODEL:
BACKBONE:
NAME: "ViT-B/32"

View File

@@ -0,0 +1,39 @@
# Independent Vision Language Prompting
DATALOADER:
TRAIN_X:
BATCH_SIZE: 4
TEST:
BATCH_SIZE: 100
NUM_WORKERS: 8
INPUT:
SIZE: (224, 224)
INTERPOLATION: "bicubic"
PIXEL_MEAN: [0.48145466, 0.4578275, 0.40821073]
PIXEL_STD: [0.26862954, 0.26130258, 0.27577711]
TRANSFORMS: ["random_resized_crop", "random_flip", "normalize"]
OPTIM:
NAME: "sgd"
LR: 0.0025
MAX_EPOCH: 20
LR_SCHEDULER: "cosine"
WARMUP_EPOCH: 1
WARMUP_TYPE: "constant"
WARMUP_CONS_LR: 1e-5
TRAIN:
PRINT_FREQ: 20
MODEL:
BACKBONE:
NAME: "ViT-B/16"
TRAINER:
IVLP:
N_CTX_VISION: 4
N_CTX_TEXT: 4
CTX_INIT: "a photo of a"
PREC: "fp16"
PROMPT_DEPTH_VISION: 9
PROMPT_DEPTH_TEXT: 9

View File

@@ -0,0 +1,36 @@
DATALOADER:
TRAIN_X:
BATCH_SIZE: 4
TEST:
BATCH_SIZE: 100
NUM_WORKERS: 8
INPUT:
SIZE: (224, 224)
INTERPOLATION: "bicubic"
PIXEL_MEAN: [0.48145466, 0.4578275, 0.40821073]
PIXEL_STD: [0.26862954, 0.26130258, 0.27577711]
TRANSFORMS: ["random_resized_crop", "random_flip", "normalize"]
OPTIM:
NAME: "sgd"
LR: 0.0035
MAX_EPOCH: 2
LR_SCHEDULER: "cosine"
WARMUP_EPOCH: 1
WARMUP_TYPE: "constant"
WARMUP_CONS_LR: 1e-5
TRAIN:
PRINT_FREQ: 20
MODEL:
BACKBONE:
NAME: "ViT-B/16"
TRAINER:
MAPLE:
N_CTX: 2
CTX_INIT: "a photo of a"
PREC: "fp16"
PROMPT_DEPTH: 9

View File

@@ -0,0 +1,36 @@
DATALOADER:
TRAIN_X:
BATCH_SIZE: 4
TEST:
BATCH_SIZE: 100
NUM_WORKERS: 8
INPUT:
SIZE: (224, 224)
INTERPOLATION: "bicubic"
PIXEL_MEAN: [0.48145466, 0.4578275, 0.40821073]
PIXEL_STD: [0.26862954, 0.26130258, 0.27577711]
TRANSFORMS: ["random_resized_crop", "random_flip", "normalize"]
OPTIM:
NAME: "sgd"
LR: 0.0026
MAX_EPOCH: 2
LR_SCHEDULER: "cosine"
WARMUP_EPOCH: 1
WARMUP_TYPE: "constant"
WARMUP_CONS_LR: 1e-5
TRAIN:
PRINT_FREQ: 20
MODEL:
BACKBONE:
NAME: "ViT-B/16"
TRAINER:
MAPLE:
N_CTX: 2
CTX_INIT: "a photo of a"
PREC: "fp16"
PROMPT_DEPTH: 3

View File

@@ -0,0 +1,43 @@
# PromptSRC: Prompting with Self-regularizing constraints
DATALOADER:
TRAIN_X:
BATCH_SIZE: 4
TEST:
BATCH_SIZE: 100
NUM_WORKERS: 8
INPUT:
SIZE: (224, 224)
INTERPOLATION: "bicubic"
PIXEL_MEAN: [0.48145466, 0.4578275, 0.40821073]
PIXEL_STD: [0.26862954, 0.26130258, 0.27577711]
TRANSFORMS: ["random_resized_crop", "random_flip", "normalize"]
OPTIM:
NAME: "sgd"
LR: 0.0025
MAX_EPOCH: 20
LR_SCHEDULER: "cosine"
WARMUP_EPOCH: 1
WARMUP_TYPE: "constant"
WARMUP_CONS_LR: 1e-5
TRAIN:
PRINT_FREQ: 20
MODEL:
BACKBONE:
NAME: "ViT-B/16"
TRAINER:
PROMPTSRC:
N_CTX_VISION: 4
N_CTX_TEXT: 4
CTX_INIT: "a photo of a"
PREC: "fp16"
PROMPT_DEPTH_VISION: 9
PROMPT_DEPTH_TEXT: 9
TEXT_LOSS_WEIGHT: 25
IMAGE_LOSS_WEIGHT: 10
GPA_MEAN: 15
GPA_STD: 1
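`TEXT_LOSS_WEIGHT` and `IMAGE_LOSS_WEIGHT` scale the self-regularization terms that keep the prompted text and image features close to the frozen CLIP features. Below is a minimal sketch of how such weighted consistency losses can be combined; the function and tensor names are illustrative placeholders, not the repo's actual implementation:

```python
import torch.nn.functional as F

def feature_consistency_loss(text_feat, frozen_text_feat,
                             image_feat, frozen_image_feat,
                             text_weight=25.0, image_weight=10.0):
    # L1 consistency between prompted and frozen (pre-trained) CLIP features,
    # weighted per modality as in TEXT_LOSS_WEIGHT / IMAGE_LOSS_WEIGHT above.
    text_loss = F.l1_loss(text_feat, frozen_text_feat, reduction="mean")
    image_loss = F.l1_loss(image_feat, frozen_image_feat, reduction="mean")
    return text_weight * text_loss + image_weight * image_loss
```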

View File

@@ -0,0 +1,43 @@
# PromptSRC: Prompting with Self-regularizing constraints
DATALOADER:
TRAIN_X:
BATCH_SIZE: 4
TEST:
BATCH_SIZE: 100
NUM_WORKERS: 8
INPUT:
SIZE: (224, 224)
INTERPOLATION: "bicubic"
PIXEL_MEAN: [0.48145466, 0.4578275, 0.40821073]
PIXEL_STD: [0.26862954, 0.26130258, 0.27577711]
TRANSFORMS: ["random_resized_crop", "random_flip", "normalize"]
OPTIM:
NAME: "sgd"
LR: 0.0025
MAX_EPOCH: 20
LR_SCHEDULER: "cosine"
WARMUP_EPOCH: 1
WARMUP_TYPE: "constant"
WARMUP_CONS_LR: 1e-5
TRAIN:
PRINT_FREQ: 20
MODEL:
BACKBONE:
NAME: "ViT-B/16"
TRAINER:
PROMPTSRC:
N_CTX_VISION: 4
N_CTX_TEXT: 4
CTX_INIT: "a photo of a"
PREC: "fp16"
PROMPT_DEPTH_VISION: 3
PROMPT_DEPTH_TEXT: 3
TEXT_LOSS_WEIGHT: 25
IMAGE_LOSS_WEIGHT: 10
GPA_MEAN: 6
GPA_STD: 10

View File

@@ -0,0 +1,47 @@
# PromptSRC: Prompting with Self-regularizing constraints
DATALOADER:
TRAIN_X:
BATCH_SIZE: 4
TEST:
BATCH_SIZE: 100
NUM_WORKERS: 8
INPUT:
SIZE: (224, 224)
INTERPOLATION: "bicubic"
PIXEL_MEAN: [0.48145466, 0.4578275, 0.40821073]
PIXEL_STD: [0.26862954, 0.26130258, 0.27577711]
TRANSFORMS: ["random_resized_crop", "random_flip", "normalize"]
OPTIM:
NAME: "sgd"
LR: 0.0025
MAX_EPOCH: 50
LR_SCHEDULER: "cosine"
WARMUP_EPOCH: 1
WARMUP_TYPE: "constant"
WARMUP_CONS_LR: 1e-5
TRAIN:
PRINT_FREQ: 20
MODEL:
BACKBONE:
NAME: "ViT-B/16"
TRAINER:
PROMPTSRC:
N_CTX_VISION: 4
N_CTX_TEXT: 4
CTX_INIT: "a photo of a"
PREC: "fp16"
PROMPT_DEPTH_VISION: 9
PROMPT_DEPTH_TEXT: 9
TEXT_LOSS_WEIGHT: 25
IMAGE_LOSS_WEIGHT: 10
# Use the below configuration for: ImageNet, Caltech101, OxfordPets, Food101, UCF101 and SUN397
GPA_MEAN: 30
GPA_STD: 30
# Use the below configuration for: StanfordCars, Flowers102, FGVCAircraft, DTD and EuroSAT
# GPA_MEAN: 45
# GPA_STD: 5
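`GPA_MEAN` and `GPA_STD` parameterize a Gaussian weighting over training epochs used when aggregating the per-epoch prompt checkpoints (Gaussian-weighted prompt aggregation). The sketch below shows one way such weights could be computed; it is an illustration of the idea, not the repo's exact code:

```python
import math

def gaussian_epoch_weights(max_epoch, gpa_mean, gpa_std):
    # One weight per epoch, peaking near gpa_mean and normalized to sum to 1,
    # so the weights can be used to average per-epoch prompt parameters.
    raw = [math.exp(-((epoch - gpa_mean) ** 2) / (2 * gpa_std ** 2))
           for epoch in range(1, max_epoch + 1)]
    total = sum(raw)
    return [w / total for w in raw]

# e.g. for the 50-epoch config above with GPA_MEAN=30 and GPA_STD=30:
weights = gaussian_epoch_weights(50, gpa_mean=30, gpa_std=30)
```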

0
datasets/__init__.py Normal file
View File

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

59
datasets/caltech101.py Normal file
View File

@@ -0,0 +1,59 @@
import os
import pickle
from dassl.data.datasets import DATASET_REGISTRY, Datum, DatasetBase
from dassl.utils import mkdir_if_missing
from .oxford_pets import OxfordPets
from .dtd import DescribableTextures as DTD
IGNORED = ["BACKGROUND_Google", "Faces_easy"]
NEW_CNAMES = {
"airplanes": "airplane",
"Faces": "face",
"Leopards": "leopard",
"Motorbikes": "motorbike",
}
@DATASET_REGISTRY.register()
class Caltech101(DatasetBase):
dataset_dir = "caltech-101"
def __init__(self, cfg):
root = os.path.abspath(os.path.expanduser(cfg.DATASET.ROOT))
self.dataset_dir = os.path.join(root, self.dataset_dir)
self.image_dir = os.path.join(self.dataset_dir, "101_ObjectCategories")
self.split_path = os.path.join(self.dataset_dir, "split_zhou_Caltech101.json")
self.split_fewshot_dir = os.path.join(self.dataset_dir, "split_fewshot")
mkdir_if_missing(self.split_fewshot_dir)
if os.path.exists(self.split_path):
train, val, test = OxfordPets.read_split(self.split_path, self.image_dir)
else:
train, val, test = DTD.read_and_split_data(self.image_dir, ignored=IGNORED, new_cnames=NEW_CNAMES)
OxfordPets.save_split(train, val, test, self.split_path, self.image_dir)
num_shots = cfg.DATASET.NUM_SHOTS
if num_shots >= 1:
seed = cfg.SEED
preprocessed = os.path.join(self.split_fewshot_dir, f"shot_{num_shots}-seed_{seed}.pkl")
if os.path.exists(preprocessed):
print(f"Loading preprocessed few-shot data from {preprocessed}")
with open(preprocessed, "rb") as file:
data = pickle.load(file)
train, val = data["train"], data["val"]
else:
train = self.generate_fewshot_dataset(train, num_shots=num_shots)
val = self.generate_fewshot_dataset(val, num_shots=min(num_shots, 4))
data = {"train": train, "val": val}
print(f"Saving preprocessed few-shot data to {preprocessed}")
with open(preprocessed, "wb") as file:
pickle.dump(data, file, protocol=pickle.HIGHEST_PROTOCOL)
subsample = cfg.DATASET.SUBSAMPLE_CLASSES
train, val, test = OxfordPets.subsample_classes(train, val, test, subsample=subsample)
super().__init__(train_x=train, val=val, test=test)

95
datasets/dtd.py Normal file
View File

@@ -0,0 +1,95 @@
import os
import pickle
import random
from dassl.data.datasets import DATASET_REGISTRY, Datum, DatasetBase
from dassl.utils import listdir_nohidden, mkdir_if_missing
from .oxford_pets import OxfordPets
@DATASET_REGISTRY.register()
class DescribableTextures(DatasetBase):
dataset_dir = "dtd"
def __init__(self, cfg):
root = os.path.abspath(os.path.expanduser(cfg.DATASET.ROOT))
self.dataset_dir = os.path.join(root, self.dataset_dir)
self.image_dir = os.path.join(self.dataset_dir, "images")
self.split_path = os.path.join(self.dataset_dir, "split_zhou_DescribableTextures.json")
self.split_fewshot_dir = os.path.join(self.dataset_dir, "split_fewshot")
mkdir_if_missing(self.split_fewshot_dir)
if os.path.exists(self.split_path):
train, val, test = OxfordPets.read_split(self.split_path, self.image_dir)
else:
train, val, test = self.read_and_split_data(self.image_dir)
OxfordPets.save_split(train, val, test, self.split_path, self.image_dir)
num_shots = cfg.DATASET.NUM_SHOTS
if num_shots >= 1:
seed = cfg.SEED
preprocessed = os.path.join(self.split_fewshot_dir, f"shot_{num_shots}-seed_{seed}.pkl")
if os.path.exists(preprocessed):
print(f"Loading preprocessed few-shot data from {preprocessed}")
with open(preprocessed, "rb") as file:
data = pickle.load(file)
train, val = data["train"], data["val"]
else:
train = self.generate_fewshot_dataset(train, num_shots=num_shots)
val = self.generate_fewshot_dataset(val, num_shots=min(num_shots, 4))
data = {"train": train, "val": val}
print(f"Saving preprocessed few-shot data to {preprocessed}")
with open(preprocessed, "wb") as file:
pickle.dump(data, file, protocol=pickle.HIGHEST_PROTOCOL)
subsample = cfg.DATASET.SUBSAMPLE_CLASSES
train, val, test = OxfordPets.subsample_classes(train, val, test, subsample=subsample)
super().__init__(train_x=train, val=val, test=test)
@staticmethod
def read_and_split_data(image_dir, p_trn=0.5, p_val=0.2, ignored=[], new_cnames=None):
# The data are supposed to be organized into the following structure
# =============
# images/
# dog/
# cat/
# horse/
# =============
categories = listdir_nohidden(image_dir)
categories = [c for c in categories if c not in ignored]
categories.sort()
p_tst = 1 - p_trn - p_val
print(f"Splitting into {p_trn:.0%} train, {p_val:.0%} val, and {p_tst:.0%} test")
def _collate(ims, y, c):
items = []
for im in ims:
item = Datum(impath=im, label=y, classname=c) # is already 0-based
items.append(item)
return items
train, val, test = [], [], []
for label, category in enumerate(categories):
category_dir = os.path.join(image_dir, category)
images = listdir_nohidden(category_dir)
images = [os.path.join(category_dir, im) for im in images]
random.shuffle(images)
n_total = len(images)
n_train = round(n_total * p_trn)
n_val = round(n_total * p_val)
n_test = n_total - n_train - n_val
assert n_train > 0 and n_val > 0 and n_test > 0
if new_cnames is not None and category in new_cnames:
category = new_cnames[category]
train.extend(_collate(images[:n_train], label, category))
val.extend(_collate(images[n_train : n_train + n_val], label, category))
test.extend(_collate(images[n_train + n_val :], label, category))
return train, val, test
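Since `read_and_split_data` is a static method, it can also be exercised on its own; a hypothetical standalone call (the module path and image directory are placeholders) might look like:

```python
# Assumes an images/ directory organized as one sub-folder per category, as in the comment above.
from datasets.dtd import DescribableTextures

train, val, test = DescribableTextures.read_and_split_data("/path/to/dtd/images")
print(f"{len(train)} train / {len(val)} val / {len(test)} test items")
```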

73
datasets/eurosat.py Normal file
View File

@@ -0,0 +1,73 @@
import os
import pickle
from dassl.data.datasets import DATASET_REGISTRY, Datum, DatasetBase
from dassl.utils import mkdir_if_missing
from .oxford_pets import OxfordPets
from .dtd import DescribableTextures as DTD
NEW_CNAMES = {
"AnnualCrop": "Annual Crop Land",
"Forest": "Forest",
"HerbaceousVegetation": "Herbaceous Vegetation Land",
"Highway": "Highway or Road",
"Industrial": "Industrial Buildings",
"Pasture": "Pasture Land",
"PermanentCrop": "Permanent Crop Land",
"Residential": "Residential Buildings",
"River": "River",
"SeaLake": "Sea or Lake",
}
@DATASET_REGISTRY.register()
class EuroSAT(DatasetBase):
dataset_dir = "eurosat"
def __init__(self, cfg):
root = os.path.abspath(os.path.expanduser(cfg.DATASET.ROOT))
self.dataset_dir = os.path.join(root, self.dataset_dir)
self.image_dir = os.path.join(self.dataset_dir, "2750")
self.split_path = os.path.join(self.dataset_dir, "split_zhou_EuroSAT.json")
self.split_fewshot_dir = os.path.join(self.dataset_dir, "split_fewshot")
mkdir_if_missing(self.split_fewshot_dir)
if os.path.exists(self.split_path):
train, val, test = OxfordPets.read_split(self.split_path, self.image_dir)
else:
train, val, test = DTD.read_and_split_data(self.image_dir, new_cnames=NEW_CNAMES)
OxfordPets.save_split(train, val, test, self.split_path, self.image_dir)
num_shots = cfg.DATASET.NUM_SHOTS
if num_shots >= 1:
seed = cfg.SEED
preprocessed = os.path.join(self.split_fewshot_dir, f"shot_{num_shots}-seed_{seed}.pkl")
if os.path.exists(preprocessed):
print(f"Loading preprocessed few-shot data from {preprocessed}")
with open(preprocessed, "rb") as file:
data = pickle.load(file)
train, val = data["train"], data["val"]
else:
train = self.generate_fewshot_dataset(train, num_shots=num_shots)
val = self.generate_fewshot_dataset(val, num_shots=min(num_shots, 4))
data = {"train": train, "val": val}
print(f"Saving preprocessed few-shot data to {preprocessed}")
with open(preprocessed, "wb") as file:
pickle.dump(data, file, protocol=pickle.HIGHEST_PROTOCOL)
subsample = cfg.DATASET.SUBSAMPLE_CLASSES
train, val, test = OxfordPets.subsample_classes(train, val, test, subsample=subsample)
super().__init__(train_x=train, val=val, test=test)
def update_classname(self, dataset_old):
dataset_new = []
for item_old in dataset_old:
cname_old = item_old.classname
cname_new = NEW_CNAMES[cname_old]
item_new = Datum(impath=item_old.impath, label=item_old.label, classname=cname_new)
dataset_new.append(item_new)
return dataset_new

71
datasets/fgvc_aircraft.py Normal file
View File

@@ -0,0 +1,71 @@
import os
import pickle
from dassl.data.datasets import DATASET_REGISTRY, Datum, DatasetBase
from dassl.utils import mkdir_if_missing
from .oxford_pets import OxfordPets
@DATASET_REGISTRY.register()
class FGVCAircraft(DatasetBase):
dataset_dir = "fgvc_aircraft"
def __init__(self, cfg):
root = os.path.abspath(os.path.expanduser(cfg.DATASET.ROOT))
self.dataset_dir = os.path.join(root, self.dataset_dir)
self.image_dir = os.path.join(self.dataset_dir, "images")
self.split_fewshot_dir = os.path.join(self.dataset_dir, "split_fewshot")
mkdir_if_missing(self.split_fewshot_dir)
classnames = []
with open(os.path.join(self.dataset_dir, "variants.txt"), "r") as f:
lines = f.readlines()
for line in lines:
classnames.append(line.strip())
cname2lab = {c: i for i, c in enumerate(classnames)}
train = self.read_data(cname2lab, "images_variant_train.txt")
val = self.read_data(cname2lab, "images_variant_val.txt")
test = self.read_data(cname2lab, "images_variant_test.txt")
num_shots = cfg.DATASET.NUM_SHOTS
if num_shots >= 1:
seed = cfg.SEED
preprocessed = os.path.join(self.split_fewshot_dir, f"shot_{num_shots}-seed_{seed}.pkl")
if os.path.exists(preprocessed):
print(f"Loading preprocessed few-shot data from {preprocessed}")
with open(preprocessed, "rb") as file:
data = pickle.load(file)
train, val = data["train"], data["val"]
else:
train = self.generate_fewshot_dataset(train, num_shots=num_shots)
val = self.generate_fewshot_dataset(val, num_shots=min(num_shots, 4))
data = {"train": train, "val": val}
print(f"Saving preprocessed few-shot data to {preprocessed}")
with open(preprocessed, "wb") as file:
pickle.dump(data, file, protocol=pickle.HIGHEST_PROTOCOL)
subsample = cfg.DATASET.SUBSAMPLE_CLASSES
train, val, test = OxfordPets.subsample_classes(train, val, test, subsample=subsample)
super().__init__(train_x=train, val=val, test=test)
def read_data(self, cname2lab, split_file):
filepath = os.path.join(self.dataset_dir, split_file)
items = []
with open(filepath, "r") as f:
lines = f.readlines()
for line in lines:
line = line.strip().split(" ")
imname = line[0] + ".jpg"
classname = " ".join(line[1:])
impath = os.path.join(self.image_dir, imname)
label = cname2lab[classname]
item = Datum(impath=impath, label=label, classname=classname)
items.append(item)
return items

51
datasets/food101.py Normal file
View File

@@ -0,0 +1,51 @@
import os
import pickle
from dassl.data.datasets import DATASET_REGISTRY, Datum, DatasetBase
from dassl.utils import mkdir_if_missing
from .oxford_pets import OxfordPets
from .dtd import DescribableTextures as DTD
@DATASET_REGISTRY.register()
class Food101(DatasetBase):
dataset_dir = "food-101"
def __init__(self, cfg):
root = os.path.abspath(os.path.expanduser(cfg.DATASET.ROOT))
self.dataset_dir = os.path.join(root, self.dataset_dir)
self.image_dir = os.path.join(self.dataset_dir, "images")
self.split_path = os.path.join(self.dataset_dir, "split_zhou_Food101.json")
self.split_fewshot_dir = os.path.join(self.dataset_dir, "split_fewshot")
mkdir_if_missing(self.split_fewshot_dir)
if os.path.exists(self.split_path):
train, val, test = OxfordPets.read_split(self.split_path, self.image_dir)
else:
train, val, test = DTD.read_and_split_data(self.image_dir)
OxfordPets.save_split(train, val, test, self.split_path, self.image_dir)
num_shots = cfg.DATASET.NUM_SHOTS
if num_shots >= 1:
seed = cfg.SEED
preprocessed = os.path.join(self.split_fewshot_dir, f"shot_{num_shots}-seed_{seed}.pkl")
if os.path.exists(preprocessed):
print(f"Loading preprocessed few-shot data from {preprocessed}")
with open(preprocessed, "rb") as file:
data = pickle.load(file)
train, val = data["train"], data["val"]
else:
train = self.generate_fewshot_dataset(train, num_shots=num_shots)
val = self.generate_fewshot_dataset(val, num_shots=min(num_shots, 4))
data = {"train": train, "val": val}
print(f"Saving preprocessed few-shot data to {preprocessed}")
with open(preprocessed, "wb") as file:
pickle.dump(data, file, protocol=pickle.HIGHEST_PROTOCOL)
subsample = cfg.DATASET.SUBSAMPLE_CLASSES
train, val, test = OxfordPets.subsample_classes(train, val, test, subsample=subsample)
super().__init__(train_x=train, val=val, test=test)

91
datasets/imagenet.py Normal file
View File

@@ -0,0 +1,91 @@
import os
import pickle
from collections import OrderedDict
from dassl.data.datasets import DATASET_REGISTRY, Datum, DatasetBase
from dassl.utils import listdir_nohidden, mkdir_if_missing
from .oxford_pets import OxfordPets
@DATASET_REGISTRY.register()
class ImageNet(DatasetBase):
dataset_dir = "imagenet"
def __init__(self, cfg):
root = os.path.abspath(os.path.expanduser(cfg.DATASET.ROOT))
self.dataset_dir = os.path.join(root, self.dataset_dir)
self.image_dir = os.path.join(self.dataset_dir, "images")
self.preprocessed = os.path.join(self.dataset_dir, "preprocessed.pkl")
self.split_fewshot_dir = os.path.join(self.dataset_dir, "split_fewshot")
mkdir_if_missing(self.split_fewshot_dir)
if os.path.exists(self.preprocessed):
with open(self.preprocessed, "rb") as f:
preprocessed = pickle.load(f)
train = preprocessed["train"]
test = preprocessed["test"]
else:
text_file = os.path.join(self.dataset_dir, "classnames.txt")
classnames = self.read_classnames(text_file)
train = self.read_data(classnames, "train")
# Follow standard practice to perform evaluation on the val set
# Also used as the val set (so evaluate the last-step model)
test = self.read_data(classnames, "val")
preprocessed = {"train": train, "test": test}
with open(self.preprocessed, "wb") as f:
pickle.dump(preprocessed, f, protocol=pickle.HIGHEST_PROTOCOL)
num_shots = cfg.DATASET.NUM_SHOTS
if num_shots >= 1:
seed = cfg.SEED
preprocessed = os.path.join(self.split_fewshot_dir, f"shot_{num_shots}-seed_{seed}.pkl")
if os.path.exists(preprocessed):
print(f"Loading preprocessed few-shot data from {preprocessed}")
with open(preprocessed, "rb") as file:
data = pickle.load(file)
train = data["train"]
else:
train = self.generate_fewshot_dataset(train, num_shots=num_shots)
data = {"train": train}
print(f"Saving preprocessed few-shot data to {preprocessed}")
with open(preprocessed, "wb") as file:
pickle.dump(data, file, protocol=pickle.HIGHEST_PROTOCOL)
subsample = cfg.DATASET.SUBSAMPLE_CLASSES
train, test = OxfordPets.subsample_classes(train, test, subsample=subsample)
super().__init__(train_x=train, val=test, test=test)
@staticmethod
def read_classnames(text_file):
"""Return a dictionary containing
key-value pairs of <folder name>: <class name>.
"""
classnames = OrderedDict()
with open(text_file, "r") as f:
lines = f.readlines()
for line in lines:
line = line.strip().split(" ")
folder = line[0]
classname = " ".join(line[1:])
classnames[folder] = classname
return classnames
def read_data(self, classnames, split_dir):
split_dir = os.path.join(self.image_dir, split_dir)
folders = sorted(f.name for f in os.scandir(split_dir) if f.is_dir())
items = []
for label, folder in enumerate(folders):
imnames = listdir_nohidden(os.path.join(split_dir, folder))
classname = classnames[folder]
for imname in imnames:
impath = os.path.join(split_dir, folder, imname)
item = Datum(impath=impath, label=label, classname=classname)
items.append(item)
return items
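`read_classnames` expects `classnames.txt` to contain one `<folder name> <class name>` pair per line; for illustration, the first entries of the CLIP-provided file look like:

```
n01440764 tench
n01443537 goldfish
```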

46
datasets/imagenet_a.py Normal file
View File

@@ -0,0 +1,46 @@
import os
from dassl.data.datasets import DATASET_REGISTRY, Datum, DatasetBase
from dassl.utils import listdir_nohidden
from .imagenet import ImageNet
TO_BE_IGNORED = ["README.txt"]
@DATASET_REGISTRY.register()
class ImageNetA(DatasetBase):
"""ImageNet-A(dversarial).
This dataset is used for testing only.
"""
dataset_dir = "imagenet-adversarial"
def __init__(self, cfg):
root = os.path.abspath(os.path.expanduser(cfg.DATASET.ROOT))
self.dataset_dir = os.path.join(root, self.dataset_dir)
self.image_dir = os.path.join(self.dataset_dir, "imagenet-a")
text_file = os.path.join(self.dataset_dir, "classnames.txt")
classnames = ImageNet.read_classnames(text_file)
data = self.read_data(classnames)
super().__init__(train_x=data, test=data)
def read_data(self, classnames):
image_dir = self.image_dir
folders = listdir_nohidden(image_dir, sort=True)
folders = [f for f in folders if f not in TO_BE_IGNORED]
items = []
for label, folder in enumerate(folders):
imnames = listdir_nohidden(os.path.join(image_dir, folder))
classname = classnames[folder]
for imname in imnames:
impath = os.path.join(image_dir, folder, imname)
item = Datum(impath=impath, label=label, classname=classname)
items.append(item)
return items

46
datasets/imagenet_r.py Normal file
View File

@@ -0,0 +1,46 @@
import os
from dassl.data.datasets import DATASET_REGISTRY, Datum, DatasetBase
from dassl.utils import listdir_nohidden
from .imagenet import ImageNet
TO_BE_IGNORED = ["README.txt"]
@DATASET_REGISTRY.register()
class ImageNetR(DatasetBase):
"""ImageNet-R(endition).
This dataset is used for testing only.
"""
dataset_dir = "imagenet-rendition"
def __init__(self, cfg):
root = os.path.abspath(os.path.expanduser(cfg.DATASET.ROOT))
self.dataset_dir = os.path.join(root, self.dataset_dir)
self.image_dir = os.path.join(self.dataset_dir, "imagenet-r")
text_file = os.path.join(self.dataset_dir, "classnames.txt")
classnames = ImageNet.read_classnames(text_file)
data = self.read_data(classnames)
super().__init__(train_x=data, test=data)
def read_data(self, classnames):
image_dir = self.image_dir
folders = listdir_nohidden(image_dir, sort=True)
folders = [f for f in folders if f not in TO_BE_IGNORED]
items = []
for label, folder in enumerate(folders):
imnames = listdir_nohidden(os.path.join(image_dir, folder))
classname = classnames[folder]
for imname in imnames:
impath = os.path.join(image_dir, folder, imname)
item = Datum(impath=impath, label=label, classname=classname)
items.append(item)
return items

43
datasets/imagenet_sketch.py Normal file
View File

@@ -0,0 +1,43 @@
import os
from dassl.data.datasets import DATASET_REGISTRY, Datum, DatasetBase
from dassl.utils import listdir_nohidden
from .imagenet import ImageNet
@DATASET_REGISTRY.register()
class ImageNetSketch(DatasetBase):
"""ImageNet-Sketch.
This dataset is used for testing only.
"""
dataset_dir = "imagenet-sketch"
def __init__(self, cfg):
root = os.path.abspath(os.path.expanduser(cfg.DATASET.ROOT))
self.dataset_dir = os.path.join(root, self.dataset_dir)
self.image_dir = os.path.join(self.dataset_dir, "images")
text_file = os.path.join(self.dataset_dir, "classnames.txt")
classnames = ImageNet.read_classnames(text_file)
data = self.read_data(classnames)
super().__init__(train_x=data, test=data)
def read_data(self, classnames):
image_dir = self.image_dir
folders = listdir_nohidden(image_dir, sort=True)
items = []
for label, folder in enumerate(folders):
imnames = listdir_nohidden(os.path.join(image_dir, folder))
classname = classnames[folder]
for imname in imnames:
impath = os.path.join(image_dir, folder, imname)
item = Datum(impath=impath, label=label, classname=classname)
items.append(item)
return items

46
datasets/imagenetv2.py Normal file
View File

@@ -0,0 +1,46 @@
import os
from dassl.data.datasets import DATASET_REGISTRY, Datum, DatasetBase
from dassl.utils import listdir_nohidden
from .imagenet import ImageNet
@DATASET_REGISTRY.register()
class ImageNetV2(DatasetBase):
"""ImageNetV2.
This dataset is used for testing only.
"""
dataset_dir = "imagenetv2"
def __init__(self, cfg):
root = os.path.abspath(os.path.expanduser(cfg.DATASET.ROOT))
self.dataset_dir = os.path.join(root, self.dataset_dir)
image_dir = "imagenetv2-matched-frequency-format-val"
self.image_dir = os.path.join(self.dataset_dir, image_dir)
text_file = os.path.join(self.dataset_dir, "classnames.txt")
classnames = ImageNet.read_classnames(text_file)
data = self.read_data(classnames)
super().__init__(train_x=data, test=data)
def read_data(self, classnames):
image_dir = self.image_dir
folders = list(classnames.keys())
items = []
for label in range(1000):
class_dir = os.path.join(image_dir, str(label))
imnames = listdir_nohidden(class_dir)
folder = folders[label]
classname = classnames[folder]
for imname in imnames:
impath = os.path.join(class_dir, imname)
item = Datum(impath=impath, label=label, classname=classname)
items.append(item)
return items

89
datasets/oxford_flowers.py Normal file
View File

@@ -0,0 +1,89 @@
import os
import pickle
import random
from scipy.io import loadmat
from collections import defaultdict
from dassl.data.datasets import DATASET_REGISTRY, Datum, DatasetBase
from dassl.utils import read_json, mkdir_if_missing
from .oxford_pets import OxfordPets
@DATASET_REGISTRY.register()
class OxfordFlowers(DatasetBase):
dataset_dir = "oxford_flowers"
def __init__(self, cfg):
root = os.path.abspath(os.path.expanduser(cfg.DATASET.ROOT))
self.dataset_dir = os.path.join(root, self.dataset_dir)
self.image_dir = os.path.join(self.dataset_dir, "jpg")
self.label_file = os.path.join(self.dataset_dir, "imagelabels.mat")
self.lab2cname_file = os.path.join(self.dataset_dir, "cat_to_name.json")
self.split_path = os.path.join(self.dataset_dir, "split_zhou_OxfordFlowers.json")
self.split_fewshot_dir = os.path.join(self.dataset_dir, "split_fewshot")
mkdir_if_missing(self.split_fewshot_dir)
if os.path.exists(self.split_path):
train, val, test = OxfordPets.read_split(self.split_path, self.image_dir)
else:
train, val, test = self.read_data()
OxfordPets.save_split(train, val, test, self.split_path, self.image_dir)
num_shots = cfg.DATASET.NUM_SHOTS
if num_shots >= 1:
seed = cfg.SEED
preprocessed = os.path.join(self.split_fewshot_dir, f"shot_{num_shots}-seed_{seed}.pkl")
if os.path.exists(preprocessed):
print(f"Loading preprocessed few-shot data from {preprocessed}")
with open(preprocessed, "rb") as file:
data = pickle.load(file)
train, val = data["train"], data["val"]
else:
train = self.generate_fewshot_dataset(train, num_shots=num_shots)
val = self.generate_fewshot_dataset(val, num_shots=min(num_shots, 4))
data = {"train": train, "val": val}
print(f"Saving preprocessed few-shot data to {preprocessed}")
with open(preprocessed, "wb") as file:
pickle.dump(data, file, protocol=pickle.HIGHEST_PROTOCOL)
subsample = cfg.DATASET.SUBSAMPLE_CLASSES
train, val, test = OxfordPets.subsample_classes(train, val, test, subsample=subsample)
super().__init__(train_x=train, val=val, test=test)
def read_data(self):
tracker = defaultdict(list)
label_file = loadmat(self.label_file)["labels"][0]
for i, label in enumerate(label_file):
imname = f"image_{str(i + 1).zfill(5)}.jpg"
impath = os.path.join(self.image_dir, imname)
label = int(label)
tracker[label].append(impath)
print("Splitting data into 50% train, 20% val, and 30% test")
def _collate(ims, y, c):
items = []
for im in ims:
item = Datum(impath=im, label=y - 1, classname=c) # convert to 0-based label
items.append(item)
return items
lab2cname = read_json(self.lab2cname_file)
train, val, test = [], [], []
for label, impaths in tracker.items():
random.shuffle(impaths)
n_total = len(impaths)
n_train = round(n_total * 0.5)
n_val = round(n_total * 0.2)
n_test = n_total - n_train - n_val
assert n_train > 0 and n_val > 0 and n_test > 0
cname = lab2cname[str(label)]
train.extend(_collate(impaths[:n_train], label, cname))
val.extend(_collate(impaths[n_train : n_train + n_val], label, cname))
test.extend(_collate(impaths[n_train + n_val :], label, cname))
return train, val, test

186
datasets/oxford_pets.py Normal file
View File

@@ -0,0 +1,186 @@
import os
import pickle
import math
import random
from collections import defaultdict
from dassl.data.datasets import DATASET_REGISTRY, Datum, DatasetBase
from dassl.utils import read_json, write_json, mkdir_if_missing
@DATASET_REGISTRY.register()
class OxfordPets(DatasetBase):
dataset_dir = "oxford_pets"
def __init__(self, cfg):
root = os.path.abspath(os.path.expanduser(cfg.DATASET.ROOT))
self.dataset_dir = os.path.join(root, self.dataset_dir)
self.image_dir = os.path.join(self.dataset_dir, "images")
self.anno_dir = os.path.join(self.dataset_dir, "annotations")
self.split_path = os.path.join(self.dataset_dir, "split_zhou_OxfordPets.json")
self.split_fewshot_dir = os.path.join(self.dataset_dir, "split_fewshot")
mkdir_if_missing(self.split_fewshot_dir)
if os.path.exists(self.split_path):
train, val, test = self.read_split(self.split_path, self.image_dir)
else:
trainval = self.read_data(split_file="trainval.txt")
test = self.read_data(split_file="test.txt")
train, val = self.split_trainval(trainval)
self.save_split(train, val, test, self.split_path, self.image_dir)
num_shots = cfg.DATASET.NUM_SHOTS
if num_shots >= 1:
seed = cfg.SEED
preprocessed = os.path.join(self.split_fewshot_dir, f"shot_{num_shots}-seed_{seed}.pkl")
if os.path.exists(preprocessed):
print(f"Loading preprocessed few-shot data from {preprocessed}")
with open(preprocessed, "rb") as file:
data = pickle.load(file)
train, val = data["train"], data["val"]
else:
train = self.generate_fewshot_dataset(train, num_shots=num_shots)
val = self.generate_fewshot_dataset(val, num_shots=min(num_shots, 4))
data = {"train": train, "val": val}
print(f"Saving preprocessed few-shot data to {preprocessed}")
with open(preprocessed, "wb") as file:
pickle.dump(data, file, protocol=pickle.HIGHEST_PROTOCOL)
subsample = cfg.DATASET.SUBSAMPLE_CLASSES
train, val, test = self.subsample_classes(train, val, test, subsample=subsample)
super().__init__(train_x=train, val=val, test=test)
def read_data(self, split_file):
filepath = os.path.join(self.anno_dir, split_file)
items = []
with open(filepath, "r") as f:
lines = f.readlines()
for line in lines:
line = line.strip()
imname, label, species, _ = line.split(" ")
breed = imname.split("_")[:-1]
breed = "_".join(breed)
breed = breed.lower()
imname += ".jpg"
impath = os.path.join(self.image_dir, imname)
label = int(label) - 1 # convert to 0-based index
item = Datum(impath=impath, label=label, classname=breed)
items.append(item)
return items
@staticmethod
def split_trainval(trainval, p_val=0.2):
p_trn = 1 - p_val
print(f"Splitting trainval into {p_trn:.0%} train and {p_val:.0%} val")
tracker = defaultdict(list)
for idx, item in enumerate(trainval):
label = item.label
tracker[label].append(idx)
train, val = [], []
for label, idxs in tracker.items():
n_val = round(len(idxs) * p_val)
assert n_val > 0
random.shuffle(idxs)
for n, idx in enumerate(idxs):
item = trainval[idx]
if n < n_val:
val.append(item)
else:
train.append(item)
return train, val
@staticmethod
def save_split(train, val, test, filepath, path_prefix):
def _extract(items):
out = []
for item in items:
impath = item.impath
label = item.label
classname = item.classname
impath = impath.replace(path_prefix, "")
if impath.startswith("/"):
impath = impath[1:]
out.append((impath, label, classname))
return out
train = _extract(train)
val = _extract(val)
test = _extract(test)
split = {"train": train, "val": val, "test": test}
write_json(split, filepath)
print(f"Saved split to {filepath}")
@staticmethod
def read_split(filepath, path_prefix):
def _convert(items):
out = []
for impath, label, classname in items:
impath = os.path.join(path_prefix, impath)
item = Datum(impath=impath, label=int(label), classname=classname)
out.append(item)
return out
print(f"Reading split from {filepath}")
split = read_json(filepath)
train = _convert(split["train"])
val = _convert(split["val"])
test = _convert(split["test"])
return train, val, test
@staticmethod
def subsample_classes(*args, subsample="all"):
"""Divide classes into two groups. The first group
represents base classes while the second group represents
new classes.
Args:
args: a list of datasets, e.g. train, val and test.
subsample (str): what classes to subsample.
"""
assert subsample in ["all", "base", "new"]
if subsample == "all":
return args
dataset = args[0]
labels = set()
for item in dataset:
labels.add(item.label)
labels = list(labels)
labels.sort()
n = len(labels)
# Divide classes into two halves
m = math.ceil(n / 2)
print(f"SUBSAMPLE {subsample.upper()} CLASSES!")
if subsample == "base":
selected = labels[:m] # take the first half
else:
selected = labels[m:] # take the second half
relabeler = {y: y_new for y_new, y in enumerate(selected)}
output = []
for dataset in args:
dataset_new = []
for item in dataset:
if item.label not in selected:
continue
item_new = Datum(
impath=item.impath,
label=relabeler[item.label],
classname=item.classname
)
dataset_new.append(item_new)
output.append(dataset_new)
return output
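For intuition, a hypothetical call on a toy four-class list of `Datum` items behaves as follows: `base` keeps the first `ceil(n/2)` labels and `new` keeps the rest, each relabeled from 0 (a sketch only, assuming `dassl` is installed and the module path is `datasets.oxford_pets`):

```python
from dassl.data.datasets import Datum
from datasets.oxford_pets import OxfordPets

toy = [Datum(impath=f"img_{i}.jpg", label=i, classname=f"class_{i}") for i in range(4)]
(base,) = OxfordPets.subsample_classes(toy, subsample="base")  # keeps original labels {0, 1}
(new,) = OxfordPets.subsample_classes(toy, subsample="new")    # keeps {2, 3}, relabeled to {0, 1}
```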

75
datasets/stanford_cars.py Normal file
View File

@@ -0,0 +1,75 @@
import os
import pickle
from scipy.io import loadmat
from dassl.data.datasets import DATASET_REGISTRY, Datum, DatasetBase
from dassl.utils import mkdir_if_missing
from .oxford_pets import OxfordPets
@DATASET_REGISTRY.register()
class StanfordCars(DatasetBase):
dataset_dir = "stanford_cars"
def __init__(self, cfg):
root = os.path.abspath(os.path.expanduser(cfg.DATASET.ROOT))
self.dataset_dir = os.path.join(root, self.dataset_dir)
self.split_path = os.path.join(self.dataset_dir, "split_zhou_StanfordCars.json")
self.split_fewshot_dir = os.path.join(self.dataset_dir, "split_fewshot")
mkdir_if_missing(self.split_fewshot_dir)
if os.path.exists(self.split_path):
train, val, test = OxfordPets.read_split(self.split_path, self.dataset_dir)
else:
trainval_file = os.path.join(self.dataset_dir, "devkit", "cars_train_annos.mat")
test_file = os.path.join(self.dataset_dir, "cars_test_annos_withlabels.mat")
meta_file = os.path.join(self.dataset_dir, "devkit", "cars_meta.mat")
trainval = self.read_data("cars_train", trainval_file, meta_file)
test = self.read_data("cars_test", test_file, meta_file)
train, val = OxfordPets.split_trainval(trainval)
OxfordPets.save_split(train, val, test, self.split_path, self.dataset_dir)
num_shots = cfg.DATASET.NUM_SHOTS
if num_shots >= 1:
seed = cfg.SEED
preprocessed = os.path.join(self.split_fewshot_dir, f"shot_{num_shots}-seed_{seed}.pkl")
if os.path.exists(preprocessed):
print(f"Loading preprocessed few-shot data from {preprocessed}")
with open(preprocessed, "rb") as file:
data = pickle.load(file)
train, val = data["train"], data["val"]
else:
train = self.generate_fewshot_dataset(train, num_shots=num_shots)
val = self.generate_fewshot_dataset(val, num_shots=min(num_shots, 4))
data = {"train": train, "val": val}
print(f"Saving preprocessed few-shot data to {preprocessed}")
with open(preprocessed, "wb") as file:
pickle.dump(data, file, protocol=pickle.HIGHEST_PROTOCOL)
subsample = cfg.DATASET.SUBSAMPLE_CLASSES
train, val, test = OxfordPets.subsample_classes(train, val, test, subsample=subsample)
super().__init__(train_x=train, val=val, test=test)
def read_data(self, image_dir, anno_file, meta_file):
anno_file = loadmat(anno_file)["annotations"][0]
meta_file = loadmat(meta_file)["class_names"][0]
items = []
for i in range(len(anno_file)):
imname = anno_file[i]["fname"][0]
impath = os.path.join(self.dataset_dir, image_dir, imname)
label = anno_file[i]["class"][0, 0]
label = int(label) - 1 # convert to 0-based index
classname = meta_file[label][0]
names = classname.split(" ")
year = names.pop(-1)
names.insert(0, year)
classname = " ".join(names)
item = Datum(impath=impath, label=label, classname=classname)
items.append(item)
return items

80
datasets/sun397.py Normal file
View File

@@ -0,0 +1,80 @@
import os
import pickle
from dassl.data.datasets import DATASET_REGISTRY, Datum, DatasetBase
from dassl.utils import mkdir_if_missing
from .oxford_pets import OxfordPets
@DATASET_REGISTRY.register()
class SUN397(DatasetBase):
dataset_dir = "sun397"
def __init__(self, cfg):
root = os.path.abspath(os.path.expanduser(cfg.DATASET.ROOT))
self.dataset_dir = os.path.join(root, self.dataset_dir)
self.image_dir = os.path.join(self.dataset_dir, "SUN397")
self.split_path = os.path.join(self.dataset_dir, "split_zhou_SUN397.json")
self.split_fewshot_dir = os.path.join(self.dataset_dir, "split_fewshot")
mkdir_if_missing(self.split_fewshot_dir)
if os.path.exists(self.split_path):
train, val, test = OxfordPets.read_split(self.split_path, self.image_dir)
else:
classnames = []
with open(os.path.join(self.dataset_dir, "ClassName.txt"), "r") as f:
lines = f.readlines()
for line in lines:
line = line.strip()[1:] # remove /
classnames.append(line)
cname2lab = {c: i for i, c in enumerate(classnames)}
trainval = self.read_data(cname2lab, "Training_01.txt")
test = self.read_data(cname2lab, "Testing_01.txt")
train, val = OxfordPets.split_trainval(trainval)
OxfordPets.save_split(train, val, test, self.split_path, self.image_dir)
num_shots = cfg.DATASET.NUM_SHOTS
if num_shots >= 1:
seed = cfg.SEED
preprocessed = os.path.join(self.split_fewshot_dir, f"shot_{num_shots}-seed_{seed}.pkl")
if os.path.exists(preprocessed):
print(f"Loading preprocessed few-shot data from {preprocessed}")
with open(preprocessed, "rb") as file:
data = pickle.load(file)
train, val = data["train"], data["val"]
else:
train = self.generate_fewshot_dataset(train, num_shots=num_shots)
val = self.generate_fewshot_dataset(val, num_shots=min(num_shots, 4))
data = {"train": train, "val": val}
print(f"Saving preprocessed few-shot data to {preprocessed}")
with open(preprocessed, "wb") as file:
pickle.dump(data, file, protocol=pickle.HIGHEST_PROTOCOL)
subsample = cfg.DATASET.SUBSAMPLE_CLASSES
train, val, test = OxfordPets.subsample_classes(train, val, test, subsample=subsample)
super().__init__(train_x=train, val=val, test=test)
def read_data(self, cname2lab, text_file):
text_file = os.path.join(self.dataset_dir, text_file)
items = []
with open(text_file, "r") as f:
lines = f.readlines()
for line in lines:
imname = line.strip()[1:] # remove /
classname = os.path.dirname(imname)
label = cname2lab[classname]
impath = os.path.join(self.image_dir, imname)
names = classname.split("/")[1:]  # drop the leading single-letter folder, e.g. "a/abbey" -> ["abbey"]
names = names[::-1] # put words like indoor/outdoor at first
classname = " ".join(names)
item = Datum(impath=impath, label=label, classname=classname)
items.append(item)
return items

84
datasets/ucf101.py Normal file
View File

@@ -0,0 +1,84 @@
import os
import pickle
import re
from dassl.data.datasets import DATASET_REGISTRY, Datum, DatasetBase
from dassl.utils import mkdir_if_missing
from .oxford_pets import OxfordPets
@DATASET_REGISTRY.register()
class UCF101(DatasetBase):
dataset_dir = "ucf101"
def __init__(self, cfg):
root = os.path.abspath(os.path.expanduser(cfg.DATASET.ROOT))
self.dataset_dir = os.path.join(root, self.dataset_dir)
self.image_dir = os.path.join(self.dataset_dir, "UCF-101-midframes")
self.split_path = os.path.join(self.dataset_dir, "split_zhou_UCF101.json")
self.split_fewshot_dir = os.path.join(self.dataset_dir, "split_fewshot")
mkdir_if_missing(self.split_fewshot_dir)
if os.path.exists(self.split_path):
train, val, test = OxfordPets.read_split(self.split_path, self.image_dir)
else:
cname2lab = {}
filepath = os.path.join(self.dataset_dir, "ucfTrainTestlist/classInd.txt")
with open(filepath, "r") as f:
lines = f.readlines()
for line in lines:
label, classname = line.strip().split(" ")
label = int(label) - 1 # convert to 0-based index
cname2lab[classname] = label
trainval = self.read_data(cname2lab, "ucfTrainTestlist/trainlist01.txt")
test = self.read_data(cname2lab, "ucfTrainTestlist/testlist01.txt")
train, val = OxfordPets.split_trainval(trainval)
OxfordPets.save_split(train, val, test, self.split_path, self.image_dir)
num_shots = cfg.DATASET.NUM_SHOTS
if num_shots >= 1:
seed = cfg.SEED
preprocessed = os.path.join(self.split_fewshot_dir, f"shot_{num_shots}-seed_{seed}.pkl")
if os.path.exists(preprocessed):
print(f"Loading preprocessed few-shot data from {preprocessed}")
with open(preprocessed, "rb") as file:
data = pickle.load(file)
train, val = data["train"], data["val"]
else:
train = self.generate_fewshot_dataset(train, num_shots=num_shots)
val = self.generate_fewshot_dataset(val, num_shots=min(num_shots, 4))
data = {"train": train, "val": val}
print(f"Saving preprocessed few-shot data to {preprocessed}")
with open(preprocessed, "wb") as file:
pickle.dump(data, file, protocol=pickle.HIGHEST_PROTOCOL)
subsample = cfg.DATASET.SUBSAMPLE_CLASSES
train, val, test = OxfordPets.subsample_classes(train, val, test, subsample=subsample)
super().__init__(train_x=train, val=val, test=test)
def read_data(self, cname2lab, text_file):
text_file = os.path.join(self.dataset_dir, text_file)
items = []
with open(text_file, "r") as f:
lines = f.readlines()
for line in lines:
line = line.strip().split(" ")[0] # trainlist: filename, label
action, filename = line.split("/")
label = cname2lab[action]
elements = re.findall("[A-Z][^A-Z]*", action)
renamed_action = "_".join(elements)
filename = filename.replace(".avi", ".jpg")
impath = os.path.join(self.image_dir, renamed_action, filename)
item = Datum(impath=impath, label=label, classname=renamed_action)
items.append(item)
return items

99
docs/Co-CoOp.md Normal file
View File

@@ -0,0 +1,99 @@
# Conditional Prompt Learning for Vision-Language Models (Co-CoOp, CVPR'22)
[![paper](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2203.05557)
We provide the scripts in [scripts/cocoop](../scripts/cocoop) to reproduce Co-CoOp results (CVPR'22).
Make sure to configure the dataset paths in environment variable `DATA` and run the commands from the main directory `PromptSRC/`.
## Generalization From Base to New Classes
This corresponds to the experiments in Section 4.1, i.e., Table 1.
You will need both `scripts/cocoop/base2new_train.sh` and `scripts/cocoop/base2new_test.sh`. The former trains a model on the base classes while the latter evaluates the trained model on the new classes. Both scripts take two input arguments, i.e., `DATASET` and `SEED`.
`DATASET` takes as input a dataset name, like `imagenet` or `caltech101`. The valid names are the file names in `configs/datasets/`.
Below we provide an example of how to train and evaluate the model on ImageNet.
```bash
# seed=1
bash scripts/cocoop/base2new_train.sh imagenet 1
bash scripts/cocoop/base2new_test.sh imagenet 1
# seed=2
bash scripts/cocoop/base2new_train.sh imagenet 2
bash scripts/cocoop/base2new_test.sh imagenet 2
# seed=3
bash scripts/cocoop/base2new_train.sh imagenet 3
bash scripts/cocoop/base2new_test.sh imagenet 3
```
When the evaluation is done, you can use `parse_test_res.py` to automatically calculate the average results. For instance, after you finish the evaluation (including `base2new_train.sh` and `base2new_test.sh`) on ImageNet using the aforementioned commands, you would get
```
output
| base2new/
| | test_new/
| | | imagenet/
| | | | shots_16/
| | | | | CoCoOp/
| | | | | | vit_b16_c4_ep10_batch1_ctxv1/
| | | | | | | seed1/
| | | | | | | seed2/
| | | | | | | seed3/
| | train_base/
| | | imagenet/
| | | | shots_16/
| | | | | CoCoOp/
| | | | | | vit_b16_c4_ep10_batch1_ctxv1/
| | | | | | | seed1/
| | | | | | | seed2/
| | | | | | | seed3/
```
Then, to get the average performance on the base classes, run
```bash
python parse_test_res.py output/base2new/train_base/imagenet/shots_16/CoCoOp/vit_b16_c4_ep10_batch1_ctxv1
```
To get the average performance on the new classes, run
```bash
python parse_test_res.py output/base2new/test_new/imagenet/shots_16/CoCoOp/vit_b16_c4_ep10_batch1_ctxv1 --test-log
```
## Cross-Dataset Transfer
This corresponds to the experiments in Section 4.2, i.e., Table 2.
The relevant scripts are `scripts/cocoop/xd_train.sh` and `scripts/cocoop/xd_test.sh` where the `DATASET` variable is set to the default, namely `imagenet`. To train the model, run
```bash
# seed=1
bash scripts/cocoop/xd_train.sh 1
# seed=2
bash scripts/cocoop/xd_train.sh 2
# seed=3
bash scripts/cocoop/xd_train.sh 3
```
Then, you evaluate the model on other datasets, e.g.,
```bash
for SEED in 1 2 3
do
bash scripts/cocoop/xd_test.sh caltech101 ${SEED}
bash scripts/cocoop/xd_test.sh oxford_pets ${SEED}
bash scripts/cocoop/xd_test.sh stanford_cars ${SEED}
done
```
## Domain Generalization
This corresponds to the experiments in Section 4.3, i.e., Table 3.
The steps are similar to those discussed in "Cross-Dataset Transfer" except you evaluate the model on the variants of ImageNet, i.e., `imagenetv2`, `imagenet_sketch`, `imagenet_a` and `imagenet_r`.
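Concretely, following the same `xd_test.sh` pattern:
```bash
for SEED in 1 2 3
do
    bash scripts/cocoop/xd_test.sh imagenetv2 ${SEED}
    bash scripts/cocoop/xd_test.sh imagenet_sketch ${SEED}
    bash scripts/cocoop/xd_test.sh imagenet_a ${SEED}
    bash scripts/cocoop/xd_test.sh imagenet_r ${SEED}
done
```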

99
docs/CoOp.md Normal file
View File

@@ -0,0 +1,99 @@
# Conditional Prompt Learning for Vision-Language Models (Co-CoOp, CVPR'22)
[![paper](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2203.05557)
We provide the scripts in [scripts/cocoop](../scripts/cocoop) to reproduce Co-CoOp results (CVPR'22).
Make sure to configure the dataset paths in environment variable `DATA` and run the commands from the main directory `PromptSRC/`.
## Generalization From Base to New Classes
This corresponds to the experiments in Section 4.1, i.e., Table 1.
You will need both `scripts/cocoop/base2new_train.sh` and `scripts/cocoop/base2new_test.sh`. The former trains a model on the base classes while the latter evaluates the trained model on the new classes. Both scripts take two input arguments, i.e., `DATASET` and `SEED`.
`DATASET` takes as input a dataset name, like `imagenet` or `caltech101`. The valid names are the file names in `configs/datasets/`.
Below we provide an example of how to train and evaluate the model on ImageNet.
```bash
# seed=1
bash scripts/cocoop/base2new_train.sh imagenet 1
bash scripts/cocoop/base2new_test.sh imagenet 1
# seed=2
bash scripts/cocoop/base2new_train.sh imagenet 2
bash scripts/cocoop/base2new_test.sh imagenet 2
# seed=3
bash scripts/cocoop/base2new_train.sh imagenet 3
bash scripts/cocoop/base2new_test.sh imagenet 3
```
When the evaluation is done, you can use `parse_test_res.py` to automatically calculate the average results. For instance, after you finish the evaluation (including `base2new_train.sh` and `base2new_test.sh`) on ImageNet using the aforementioned commands, you would get
```
output
| base2new/
| | test_new/
| | | imagenet/
| | | | shots_16/
| | | | | CoCoOp/
| | | | | | vit_b16_c4_ep10_batch1_ctxv1/
| | | | | | | seed1/
| | | | | | | seed2/
| | | | | | | seed3/
| | train_base/
| | | imagenet/
| | | | shots_16/
| | | | | CoCoOp/
| | | | | | vit_b16_c4_ep10_batch1_ctxv1/
| | | | | | | seed1/
| | | | | | | seed2/
| | | | | | | seed3/
```
Then, to get the average performance on the base classes, run
```bash
python parse_test_res.py output/base2new/train_base/imagenet/shots_16/CoCoOp/vit_b16_c4_ep10_batch1_ctxv1
```
To get the average performance on the new classes, run
```bash
python parse_test_res.py output/base2new/test_new/imagenet/shots_16/CoCoOp/vit_b16_c4_ep10_batch1_ctxv1 --test-log
```
## Cross-Dataset Transfer
This corresponds to the experiments in Section 4.2, i.e., Table 2.
The relevant scripts are `scripts/cocoop/xd_train.sh` and `scripts/cocoop/xd_test.sh` where the `DATASET` variable is set to the default, namely `imagenet`. To train the model, run
```bash
# seed=1
bash scripts/cocoop/xd_train.sh 1
# seed=2
bash scripts/cocoop/xd_train.sh 2
# seed=3
bash scripts/cocoop/xd_train.sh 3
```
Then, you evaluate the model on other datasets, e.g.,
```bash
for SEED in 1 2 3
do
bash scripts/cocoop/xd_test.sh caltech101 ${SEED}
bash scripts/cocoop/xd_test.sh oxford_pets ${SEED}
bash scripts/cocoop/xd_test.sh stanford_cars ${SEED}
done
```
## Domain Generalization
This corresponds to the experiments in Section 4.3, i.e., Table 3.
The steps are similar to those discussed in "Cross-Dataset Transfer" except you evaluate the model on the variants of ImageNet, i.e., `imagenetv2`, `imagenet_sketch`, `imagenet_a` and `imagenet_r`.

233
docs/DATASETS.md Normal file
View File

@@ -0,0 +1,233 @@
# How to install datasets
### Acknowledgement: This readme file for installing datasets has been borrowed directly from [MaPLe's](https://github.com/muzairkhattak/multimodal-prompt-learning) official repository.
We recommend putting all datasets under the same folder (say `$DATA`) to ease management, and following the instructions below to organize the datasets so that the source code does not need to be modified. The file structure should look like:
```
$DATA/
| imagenet/
| caltech-101/
| oxford_pets/
| stanford_cars/
```
If you have some datasets already installed somewhere else, you can create symbolic links in `$DATA/dataset_name` that point to the original data to avoid duplicate download.
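For example, if ImageNet is already stored elsewhere, a symbolic link avoids re-downloading it (the source path is a placeholder):
```bash
ln -s /path/to/existing/imagenet $DATA/imagenet
```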
Datasets list:
- [ImageNet](#imagenet)
- [Caltech101](#caltech101)
- [OxfordPets](#oxfordpets)
- [StanfordCars](#stanfordcars)
- [Flowers102](#flowers102)
- [Food101](#food101)
- [FGVCAircraft](#fgvcaircraft)
- [SUN397](#sun397)
- [DTD](#dtd)
- [EuroSAT](#eurosat)
- [UCF101](#ucf101)
- [ImageNetV2](#imagenetv2)
- [ImageNet-Sketch](#imagenet-sketch)
- [ImageNet-A](#imagenet-a)
- [ImageNet-R](#imagenet-r)
The instructions to prepare each dataset are detailed below. To ensure reproducibility and fair comparison for future work, we provide fixed train/val/test splits for all datasets except ImageNet, where the validation set is used as the test set. The fixed splits are either taken from the original datasets (if available) or created by us.
### ImageNet
- Create a folder named `imagenet/` under `$DATA`.
- Create `images/` under `imagenet/`.
- Download the dataset from the [official website](https://image-net.org/index.php) and extract the training and validation sets to `$DATA/imagenet/images`. The directory structure should look like
```
imagenet/
| images/
| | train/ # contains 1,000 folders like n01440764, n01443537, etc.
| | val/
```
- If you had downloaded the ImageNet dataset before, you can create symbolic links to map the training and validation sets to `$DATA/imagenet/images`.
- Download the `classnames.txt` to `$DATA/imagenet/` from this [link](https://drive.google.com/file/d/1-61f_ol79pViBFDG_IDlUQSwoLcn2XXF/view?usp=sharing). The class names are copied from [CLIP](https://github.com/openai/CLIP/blob/main/notebooks/Prompt_Engineering_for_ImageNet.ipynb).
### Caltech101
- Create a folder named `caltech-101/` under `$DATA`.
- Download `101_ObjectCategories.tar.gz` from http://www.vision.caltech.edu/Image_Datasets/Caltech101/101_ObjectCategories.tar.gz and extract the file under `$DATA/caltech-101`.
- Download `split_zhou_Caltech101.json` from this [link](https://drive.google.com/file/d/1hyarUivQE36mY6jSomru6Fjd-JzwcCzN/view?usp=sharing) and put it under `$DATA/caltech-101`.
The directory structure should look like
```
caltech-101/
| 101_ObjectCategories/
| split_zhou_Caltech101.json
```
### OxfordPets
- Create a folder named `oxford_pets/` under `$DATA`.
- Download the images from https://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz.
- Download the annotations from https://www.robots.ox.ac.uk/~vgg/data/pets/data/annotations.tar.gz.
- Download `split_zhou_OxfordPets.json` from this [link](https://drive.google.com/file/d/1501r8Ber4nNKvmlFVQZ8SeUHTcdTTEqs/view?usp=sharing).
The directory structure should look like
```
oxford_pets/
| images/
| annotations/
| split_zhou_OxfordPets.json
```
### StanfordCars
- Create a folder named `stanford_cars/` under `$DATA`.
- Download the train images http://ai.stanford.edu/~jkrause/car196/cars_train.tgz.
- Download the test images http://ai.stanford.edu/~jkrause/car196/cars_test.tgz.
- Download the train labels https://ai.stanford.edu/~jkrause/cars/car_devkit.tgz.
- Download the test labels http://ai.stanford.edu/~jkrause/car196/cars_test_annos_withlabels.mat.
- Download `split_zhou_StanfordCars.json` from this [link](https://drive.google.com/file/d/1ObCFbaAgVu0I-k_Au-gIUcefirdAuizT/view?usp=sharing).
The directory structure should look like
```
stanford_cars/
| cars_test/
| cars_test_annos_withlabels.mat
| cars_train/
| devkit/
| split_zhou_StanfordCars.json
```
### Flowers102
- Create a folder named `oxford_flowers/` under `$DATA`.
- Download the images and labels from https://www.robots.ox.ac.uk/~vgg/data/flowers/102/102flowers.tgz and https://www.robots.ox.ac.uk/~vgg/data/flowers/102/imagelabels.mat respectively.
- Download `cat_to_name.json` from [here](https://drive.google.com/file/d/1AkcxCXeK_RCGCEC_GvmWxjcjaNhu-at0/view?usp=sharing).
- Download `split_zhou_OxfordFlowers.json` from [here](https://drive.google.com/file/d/1Pp0sRXzZFZq15zVOzKjKBu4A9i01nozT/view?usp=sharing).
The directory structure should look like
```
oxford_flowers/
| cat_to_name.json
| imagelabels.mat
| jpg/
| split_zhou_OxfordFlowers.json
```
### Food101
- Download the dataset from https://data.vision.ee.ethz.ch/cvl/datasets_extra/food-101/ and extract the file `food-101.tar.gz` under `$DATA`, resulting in a folder named `$DATA/food-101/`.
- Download `split_zhou_Food101.json` from [here](https://drive.google.com/file/d/1QK0tGi096I0Ba6kggatX1ee6dJFIcEJl/view?usp=sharing).
The directory structure should look like
```
food-101/
| images/
| license_agreement.txt
| meta/
| README.txt
| split_zhou_Food101.json
```
### FGVCAircraft
- Download the data from https://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/archives/fgvc-aircraft-2013b.tar.gz.
- Extract `fgvc-aircraft-2013b.tar.gz` and keep only `data/`.
- Move `data/` to `$DATA` and rename the folder to `fgvc_aircraft/`.
The directory structure should look like
```
fgvc_aircraft/
| images/
| ... # a bunch of .txt files
```
### SUN397
- Create a folder named `sun397/` under `$DATA`.
- Download the images http://vision.princeton.edu/projects/2010/SUN/SUN397.tar.gz.
- Download the partitions https://vision.princeton.edu/projects/2010/SUN/download/Partitions.zip.
- Extract these files under `$DATA/sun397/`.
- Download `split_zhou_SUN397.json` from this [link](https://drive.google.com/file/d/1y2RD81BYuiyvebdN-JymPfyWYcd8_MUq/view?usp=sharing).
The directory structure should look like
```
sun397/
| SUN397/
| split_zhou_SUN397.json
| ... # a bunch of .txt files
```
### DTD
- Download the dataset from https://www.robots.ox.ac.uk/~vgg/data/dtd/download/dtd-r1.0.1.tar.gz and extract it to `$DATA`. This should lead to `$DATA/dtd/`.
- Download `split_zhou_DescribableTextures.json` from this [link](https://drive.google.com/file/d/1u3_QfB467jqHgNXC00UIzbLZRQCg2S7x/view?usp=sharing).
The directory structure should look like
```
dtd/
| images/
| imdb/
| labels/
| split_zhou_DescribableTextures.json
```
### EuroSAT
- Create a folder named `eurosat/` under `$DATA`.
- Download the dataset from http://madm.dfki.de/files/sentinel/EuroSAT.zip and extract it to `$DATA/eurosat/`.
- Download `split_zhou_EuroSAT.json` from [here](https://drive.google.com/file/d/1Ip7yaCWFi0eaOFUGga0lUdVi_DDQth1o/view?usp=sharing).
The directory structure should look like
```
eurosat/
| 2750/
| split_zhou_EuroSAT.json
```
### UCF101
- Create a folder named `ucf101/` under `$DATA`.
- Download the zip file `UCF-101-midframes.zip` from [here](https://drive.google.com/file/d/10Jqome3vtUA2keJkNanAiFpgbyC9Hc2O/view?usp=sharing) and extract it to `$DATA/ucf101/`. This zip file contains the extracted middle video frames.
- Download `split_zhou_UCF101.json` from this [link](https://drive.google.com/file/d/1I0S0q91hJfsV9Gf4xDIjgDq4AqBNJb1y/view?usp=sharing).
The directory structure should look like
```
ucf101/
| UCF-101-midframes/
| split_zhou_UCF101.json
```
### ImageNetV2
- Create a folder named `imagenetv2/` under `$DATA`.
- Go to this github repo https://github.com/modestyachts/ImageNetV2.
- Download the matched-frequency dataset from https://s3-us-west-2.amazonaws.com/imagenetv2public/imagenetv2-matched-frequency.tar.gz and extract it to `$DATA/imagenetv2/`.
- Copy `$DATA/imagenet/classnames.txt` to `$DATA/imagenetv2/`.
The directory structure should look like
```
imagenetv2/
| imagenetv2-matched-frequency-format-val/
| classnames.txt
```
### ImageNet-Sketch
- Download the dataset from https://github.com/HaohanWang/ImageNet-Sketch.
- Extract the dataset to `$DATA/imagenet-sketch`.
- Copy `$DATA/imagenet/classnames.txt` to `$DATA/imagenet-sketch/`.
The directory structure should look like
```
imagenet-sketch/
| images/ # contains 1,000 folders whose names have the format of n*
| classnames.txt
```
### ImageNet-A
- Create a folder named `imagenet-adversarial/` under `$DATA`.
- Download the dataset from https://github.com/hendrycks/natural-adv-examples and extract it to `$DATA/imagenet-adversarial/`.
- Copy `$DATA/imagenet/classnames.txt` to `$DATA/imagenet-adversarial/`.
The directory structure should look like
```
imagenet-adversarial/
| imagenet-a/ # contains 200 folders whose names have the format of n*
| classnames.txt
```
### ImageNet-R
- Create a folder named `imagenet-rendition/` under `$DATA`.
- Download the dataset from https://github.com/hendrycks/imagenet-r and extract it to `$DATA/imagenet-rendition/`.
- Copy `$DATA/imagenet/classnames.txt` to `$DATA/imagenet-rendition/`.
The directory structure should look like
```
imagenet-rendition/
| imagenet-r/ # contains 200 folders whose names have the format of n*
| classnames.txt
```

149
docs/EVAL.md Normal file
View File

@@ -0,0 +1,149 @@
# Evaluating and Reproducing PromptSRC Results
We provide bash scripts in the [scripts/](../scripts) directory for evaluating PromptSRC and the independent V-L prompting baseline using the provided pre-trained model checkpoints.
Make sure to update the `DATA` variable with the dataset path in the script file and run the commands from the main directory `PromptSRC/`.
Below we provide evaluation instructions for the PromptSRC pre-trained models. The same instructions apply for reproducing results for the baseline *independent V-L prompting* and MaPLe.
## PromptSRC
#### (1) Base-to-Novel class generalization setting
The base-to-novel PromptSRC configuration is provided in the config file at `configs/trainers/PromptSRC/vit_b16_c2_ep20_batch4_4+4ctx.yaml`. No hyper-parameters or other settings should be changed in the config file when evaluating the pre-trained models.
Below, we show an example of reproducing results for imagenet using our pre-trained model weights. Follow the instructions below:
* Download the zipped folder containing base-to-novel generalization pre-trained weights for a single dataset from this [link](https://mbzuaiac-my.sharepoint.com/:f:/g/personal/syed_wasim_mbzuai_ac_ae/Em_3tkSj6T9AmhVjmzKTL3gBYNehhvfJl8ke2pU3U0nabA?e=9ecjQA). After unzipping, the directory should look like this:
```
imagenet
| base/
| | seed1/
| | seed2/
| | seed3/
```
Now use the evaluation script `scripts/promptsrc/reproduce_base2novel_setting.sh` and run the commands below to calculate the results over 3 seeds:
```bash
# Other possible dataset values include [caltech101, food101, dtd, ucf101, oxford_flowers, oxford_pets, fgvc_aircraft, stanford_cars, sun397, eurosat]
# evaluate on base and novel classes for SEED1
bash scripts/promptsrc/reproduce_base2novel_setting.sh imagenet 1 /path/to/imagenet/weights/folder
# evaluate on base and novel classes for SEED2
bash scripts/promptsrc/reproduce_base2novel_setting.sh imagenet 2 /path/to/imagenet/weights/folder
# evaluate on base and novel classes for SEED3
bash scripts/promptsrc/reproduce_base2novel_setting.sh imagenet 3 /path/to/imagenet/weights/folder
```
This should evaluate and save the log files in the `output/` directory. To obtain the averaged results, run:
```bash
# prints averaged results for base classes
python parse_test_res.py output/base2new/test_base/imagenet/shots_16/PromptSRC/vit_b16_c2_ep20_batch4_4+4ctx --test-log
# prints averaged results for novel classes
python parse_test_res.py output/base2new/test_new/imagenet/shots_16/PromptSRC/vit_b16_c2_ep20_batch4_4+4ctx --test-log
```
The same steps can be repeated for other datasets by providing the respective dataset name and checkpoint path.
#### (2) Cross-dataset and domain generalization setting
In the cross-dataset and domain generalization settings, we first train PromptSRC on ImageNet-1k in a few-shot manner (16 shots) for all 3 seeds and then evaluate the trained model directly on the cross-dataset and out-of-distribution datasets.
Follow the instructions below to reproduce the cross-dataset and domain generalization results using our pre-trained imagenet model weights for PromptSRC:
* Download the zipped folder containing pre-trained weights for imagenet from this [link](https://mbzuaiac-my.sharepoint.com/:f:/g/personal/syed_wasim_mbzuai_ac_ae/Ekr9qF0cSaVDr0X6OlP2JAEBG1xjlTMjHNLc28g1SjwW-w?e=AA5ABi). After unzipping, the directory should look like this:
```
imagenet
| seed1/
| seed2/
| seed3/
```
Now use the evaluation script `scripts/promptsrc/reproduce_xd.sh` and run the commands below to calculate the results for food101 dataset over 3 seeds:
```bash
# Other possible dataset values for cross-datasets include [caltech101, food101, dtd, ucf101, oxford_flowers, oxford_pets, fgvc_aircraft, stanford_cars, sun397, eurosat]
# Possible dataset values for the domain generalization benchmark include [imagenetv2, imagenet_sketch, imagenet_a, imagenet_r]
# evaluate on given dataset for SEED1
bash scripts/promptsrc/reproduce_xd.sh food101 1 /path/to/imagenet/weights/folder
# evaluate on given dataset for SEED2
bash scripts/promptsrc/reproduce_xd.sh food101 2 /path/to/imagenet/weights/folder
# evaluate on given dataset for SEED3
bash scripts/promptsrc/reproduce_xd.sh food101 3 /path/to/imagenet/weights/folder
```
This should evaluate and save the log files in the `output/` directory. To obtain the results averaged over 3 seeds, run:
```bash
# prints averaged results for food101 dataset
python parse_test_res.py output/evaluation/PromptSRC/vit_b16_c2_ep20_batch4_4+4ctx_cross_datasets_16shots/food101 --test-log
```
The same steps can be repeated for other datasets by providing the respective dataset name and checkpoint path.
#### (3) Few-shot setting
In this setting, PromptSRC is trained on all classes of individual datasets with different few-shot splits (K = 1, 2, 4, 8, 16). The PromptSRC config for the few-shot setting is available at: `configs/trainers/PromptSRC/vit_b16_c2_ep50_batch4_4+4ctx_few_shot.yaml`.
Follow the instructions below to reproduce PromptSRC few-shot results using our pre-trained models.
Use the evaluation script `scripts/promptsrc/reproduce_few_shot.sh` and run the commands below to calculate the results for the food101 dataset over 3 seeds:
```bash
# reproduce_few_shot.sh calculates results for all 3 seeds for a given K
# Other possible dataset values include [caltech101, food101, dtd, ucf101, oxford_flowers, oxford_pets, fgvc_aircraft, stanford_cars, sun397, eurosat]
# evaluate on given dataset for K=1 shot
bash scripts/promptsrc/reproduce_few_shot.sh food101 1 /path/to/imagenet/weights/folder
# evaluate on given dataset for K=2 shot
bash scripts/promptsrc/reproduce_few_shot.sh food101 2 /path/to/imagenet/weights/folder
# evaluate on given dataset for K=4 shot
bash scripts/promptsrc/reproduce_few_shot.sh food101 4 /path/to/imagenet/weights/folder
# evaluate on given dataset for K=8 shot
bash scripts/promptsrc/reproduce_few_shot.sh food101 8 /path/to/imagenet/weights/folder
# evaluate on given dataset for K=16 shot
bash scripts/promptsrc/reproduce_few_shot.sh food101 16 /path/to/imagenet/weights/folder
```
This should evaluate and save the log files in the `output/` directory. To obtain the results averaged over 3 seeds for all shots, run:
```bash
# prints averaged results for food101 dataset for K=1
python parse_test_res.py output/few_shot/food101/PromptSRC/vit_b16_c2_ep50_batch4_4+4ctx_few_shot_1shots/food101 --test-log
# prints averaged results for food101 dataset for K=2
python parse_test_res.py output/few_shot/food101/PromptSRC/vit_b16_c2_ep50_batch4_4+4ctx_few_shot_2shots/food101 --test-log
# prints averaged results for food101 dataset for K=4
python parse_test_res.py output/few_shot/food101/PromptSRC/vit_b16_c2_ep50_batch4_4+4ctx_few_shot_4shots/food101 --test-log
# prints averaged results for food101 dataset for K=8
python parse_test_res.py output/few_shot/food101/PromptSRC/vit_b16_c2_ep50_batch4_4+4ctx_few_shot_8shots/food101 --test-log
# prints averaged results for food101 dataset for K=16
python parse_test_res.py output/few_shot/food101/PromptSRC/vit_b16_c2_ep50_batch4_4+4ctx_few_shot_16shots/food101 --test-log
```
The same steps can be repeated for other datasets by providing the respective dataset name and checkpoint path.
<br>
## Training and Evaluating the independent V-L prompting baseline results
For the IVLP baseline method, we provide the corresponding default configs and evaluation scripts as follows.
```
configs
| datasets/
| trainers/
| | CoCoOp/
| | CoOp/
| | MaPLe/
| | IVLP/
| | PromptSRC/
```
```
scripts
| cocoop/
| coop/
| maple/
| independent-vlp/
| promptsrc/
```
Please use the corresponding config and script files and follow the same instructions as provided for PromptSRC in order to evaluate and reproduce results of the IVLP baseline approach. The pre-trained weights for the IVLP baseline are provided [at this link](https://mbzuaiac-my.sharepoint.com/:f:/g/personal/syed_wasim_mbzuai_ac_ae/EuIwh-yMh_JBqB2Y_o8Jl14BPDKDRHC0JBPE1BugIeZiSQ?e=oJnJwy).
This repository also supports using official [CoOp](CoOp.md) and [Co-CoOp](Co-CoOp.md) configs and models.

48
docs/INSTALL.md Normal file

@@ -0,0 +1,48 @@
# Installation
### Acknowledgement: This installation guide is adapted from [MaPLe's](https://github.com/muzairkhattak/multimodal-prompt-learning) official repository.
This codebase is tested on Ubuntu 20.04.2 LTS with Python 3.8. Follow the steps below to create the environment and install the dependencies.
* Setup conda environment (recommended).
```bash
# Create a conda environment
conda create -y -n promptsrc python=3.8
# Activate the environment
conda activate promptsrc
# Install torch (requires version >= 1.8.1) and torchvision
# Please refer to https://pytorch.org/ if you need a different cuda version
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
```
* Install dassl library.
```bash
# Instructions borrowed from https://github.com/KaiyangZhou/Dassl.pytorch#installation
# Clone this repo
git clone https://github.com/KaiyangZhou/Dassl.pytorch.git
cd Dassl.pytorch/
# Install dependencies
pip install -r requirements.txt
# Install this library (no need to re-build if the source code is modified)
python setup.py develop
cd ..
```
* Clone PromptSRC code repository and install requirements
```bash
# Clone PromptSRC code base
git clone https://github.com/muzairkhattak/PromptSRC.git
cd PromptSRC/
# Install requirements
pip install -r requirements.txt
# Update setuptools package
pip install setuptools==59.5.0
```

211
docs/MaPLe.md Normal file

@@ -0,0 +1,211 @@
# Training and Evaluation
We provide bash scripts in [scripts/](../scripts) for each prompting variant including MaPLe, vision, language and independent V-L prompting.
Make sure to configure the dataset paths in the environment variable `DATA` and run the commands from the main directory `multimodal-prompt-learning/`.
Below we provide training and evaluation instructions for MaPLe. The same instructions apply for all other variants, including *Vision (VPT), Language and independent V-L prompting*.
### Training time and compute
We train MaPLe on each dataset with a batch size of 4 using a **single** NVIDIA A100 GPU.
Training MaPLe on ImageNet for 5 epochs takes 1 hour for a single seed, so results for 3 seeds take around 3 hours. The remaining 10 datasets together take around 4 hours (for all 3 seeds) on a single A100 GPU. To ease reproduction of MaPLe results, we have provided [training logs](https://drive.google.com/drive/folders/1EvuvgR8566bL0T7ucvAL3LFVwuUPMRas?usp=sharing) for all datasets.
## MaPLe
#### (1) Base-to-Novel class generalization setting
The default training settings are provided in config file at `configs/trainers/MaPLe/vit_b16_c2_ep5_batch4_2ctx.yaml`. All hyper-parameters such as prompt length, prompt depth, etc., can be modified using this config file.
Below, we provide instructions to train MaPLe on imagenet.
```bash
# Other possible dataset values include [caltech101, food101, dtd, ucf101, oxford_flowers, oxford_pets, fgvc_aircraft, stanford_cars, sun397, eurosat]
# seed=1
# trains and evaluates on base classes
bash scripts/maple/base2new_train_maple.sh imagenet 1
# evaluates on novel classes
bash scripts/maple/base2new_test_maple.sh imagenet 1
# seed=2
# trains and evaluates on base classes
bash scripts/maple/base2new_train_maple.sh imagenet 2
# evaluates on novel classes
bash scripts/maple/base2new_test_maple.sh imagenet 2
# seed=3
# trains and evaluates on base classes
bash scripts/maple/base2new_train_maple.sh imagenet 3
# evaluates on novel classes
bash scripts/maple/base2new_test_maple.sh imagenet 3
```
#### Averaging results over 3 seeds:
Once the above training and evaluation runs are completed, the `output/` directory should have the following structure:
```
output
| base2new/
| | test_new/
| | | imagenet/
| | | | shots_16/
| | | | | MaPLe/
| | | | | | vit_b16_c2_ep5_batch4_2ctx/
| | | | | | | seed1/
| | | | | | | seed2/
| | | | | | | seed3/
| | train_base/
| | | imagenet/
| | | | shots_16/
| | | | | MaPLe/
| | | | | | vit_b16_c2_ep5_batch4_2ctx/
| | | | | | | seed1/
| | | | | | | seed2/
| | | | | | | seed3/
```
Now use the script `parse_test_res.py` and run the commands below to calculate the averaged results:
```bash
# prints averaged results for base classes
python parse_test_res.py output/base2new/train_base/imagenet/shots_16/MaPLe/vit_b16_c2_ep5_batch4_2ctx
# averaged results for novel classes
python parse_test_res.py output/base2new/test_new/imagenet/shots_16/MaPLe/vit_b16_c2_ep5_batch4_2ctx --test-log
```
The above steps can be repeated for other individual datasets.
#### Reproducing results using pre-trained weights for base-to-novel generalization setting
Below we show an example of reproducing results for imagenet using our pre-trained model weights. Follow the instructions below:
* Download the zipped folder containing pre-trained weights for a single dataset from this [link](https://drive.google.com/drive/folders/1-tB6BUDBzs9CXTOJ7p5hM4Svq1tL_mGz?usp=sharing). Additionally, we provide the log files for both training and evaluation. After unzipping, the directory should look like this:
```
imagenet
| base/
| | seed1/
| | seed2/
| | seed3/
| novel/
| | seed1/
| | seed2/
| | seed3/
```
Now use the evaluation script `scripts/maple/reproduce_maple.sh` and run the commands below to calculate the averaged results:
```bash
# evaluate on base and novel classes for SEED1
bash scripts/maple/reproduce_maple.sh imagenet 1 /path/to/imagenet/weights/folder
# evaluate on base and novel classes for SEED2
bash scripts/maple/reproduce_maple.sh imagenet 2 /path/to/imagenet/weights/folder
# evaluate on base and novel classes for SEED3
bash scripts/maple/reproduce_maple.sh imagenet 3 /path/to/imagenet/weights/folder
```
This should evaluate and save the log files in the `output/` directory. To obtain the averaged results, run:
```bash
# prints averaged results for base classes
python parse_test_res.py output/base2new/train_base/imagenet/shots_16/MaPLe/vit_b16_c2_ep5_batch4_2ctx
# averaged results for novel classes
python parse_test_res.py output/base2new/test_new/imagenet/shots_16/MaPLe/vit_b16_c2_ep5_batch4_2ctx --test-log
```
#### (2) Cross-Dataset Transfer
We provide instructions to train MaPLe on ImageNet using all 1000 classes and then evaluate it directly on new downstream datasets.
The cross-dataset config for MaPLe is provided at `configs/MaPLe/vit_b16_c2_ep5_batch4_2ctx_cross_datasets.yaml`.
* First, train MaPLe on imagenet in a few-shot manner (for all 3 seeds).
```bash
# seed=1
bash scripts/maple/xd_train_maple.sh imagenet 1
# seed=2
bash scripts/maple/xd_train_maple.sh imagenet 2
# seed=3
bash scripts/maple/xd_train_maple.sh imagenet 3
```
* Now evaluate the ImageNet-trained model on the downstream datasets.
```bash
for SEED in 1 2 3
do
bash scripts/maple/xd_test_maple.sh caltech101 ${SEED}
bash scripts/maple/xd_test_maple.sh oxford_pets ${SEED}
bash scripts/maple/xd_test_maple.sh stanford_cars ${SEED}
done
```
#### (3) Domain Generalization
We use the imagenet-trained MaPLe model for the domain generalization experiments. The steps are similar to the cross-dataset experiments above; however, the model is evaluated on imagenet variants.
* Evaluate the ImageNet-trained model on variants of imagenet (domain-shift datasets).
```bash
for SEED in 1 2 3
do
bash scripts/maple/xd_test_maple.sh imagenetv2 ${SEED}
bash scripts/maple/xd_test_maple.sh imagenet_sketch ${SEED}
bash scripts/maple/xd_test_maple.sh imagenet_a ${SEED}
bash scripts/maple/xd_test_maple.sh imagenet_r ${SEED}
done
```
You can obtain averaged results by using the script `parse_test_res.py` and following similar steps to those provided for the base-to-novel generalization experiments.
<br>
#### Reproducing official results for cross-dataset and domain generalization setting
We provide the instructions below to reproduce domain generalization and cross-dataset results using our pre-trained imagenet model weights for MaPLe:
* Download the zipped folder containing pre-trained weights for imagenet from this [link](https://drive.google.com/drive/folders/1bmhvmNZc13WJ5U71qt0t8k91wyuoemVF?usp=sharing). Additionally, we provide the log files for both training and evaluation. After unzipping, the directory should look like this:
```
imagenet
| seed1/
| seed2/
| seed3/
```
Now use the evaluation script `scripts/maple/reproduce_maple_xd.sh` and run the commands below to calculate the averaged results:
```bash
# evaluate on given dataset for SEED1
bash scripts/maple/reproduce_maple_xd.sh food101 1 /path/to/imagenet/weights/folder
# evaluate on given dataset for SEED2
bash scripts/maple/reproduce_maple_xd.sh food101 2 /path/to/imagenet/weights/folder
# evaluate on given dataset for SEED3
bash scripts/maple/reproduce_maple_xd.sh food101 3 /path/to/imagenet/weights/folder
```
This should evaluate and save the log files in the `output/` directory. To obtain the averaged results, run:
```bash
# prints averaged results for food101 dataset
python parse_test_res.py output/evaluation/MaPLe/vit_b16_c2_ep5_batch4_2ctx_cross_datasets_16shots/food101 --test-log
```
#### Training and Evaluating other variants
For other variants including vision, language and independent V-L prompting techniques, we provide their corresponding configs and scripts as follows.
```
configs
| datasets/
| trainers/
| | CoCoOp/
| | CoOp/
| | MaPLe/
| | IVLP/
| | VPT/
```
```
scripts
| cocoop/
| coop/
| language-prompting/
| maple/
| independent-vlp/
```
Please use the corresponding config and script files and follow the same instructions as provided for MaPLe in order to train and evaluate the other variants. The same instructions can be followed to reproduce results of the other variants using the provided pre-trained weights.

169
docs/TRAIN.md Normal file

@@ -0,0 +1,169 @@
# PromptSRC Training
We provide bash scripts in [scripts/](../scripts) for training PromptSRC and independent V-L prompting baseline.
Make sure to update the `DATA` variable with the dataset path in the script file and run the commands from the main directory `PromptSRC/`.
Below we provide training and testing instructions for PromptSRC. The same instructions are applicable for the baseline *independent V-L prompting* approach, MaPLe, CoOp and CoCoOp.
### Training time and compute
We train PromptSRC on each dataset with a batch size of 4 using a **single** NVIDIA A100 GPU.
Training PromptSRC on ImageNet for 20 epochs takes around 6 hours for a single seed, so results for 3 seeds take around 18 hours. The remaining 10 datasets together take around 8 hours (for all 3 seeds) on a single A100 GPU.
## PromptSRC
#### (1) Base-to-Novel class generalization setting
The base-to-novel PromptSRC configuration is provided in the config file at `configs/trainers/PromptSRC/vit_b16_c2_ep20_batch4_4+4ctx.yaml`. All hyper-parameters, such as GPA STD, GPA mean, SCL loss weight coefficients, prompt length and prompt depth, can be modified using this config file.
Run the commands below to train PromptSRC on ImageNet.
```bash
# Other possible dataset values include [caltech101, food101, dtd, ucf101, oxford_flowers, oxford_pets, fgvc_aircraft, stanford_cars, sun397, eurosat]
# seed=1
# trains and evaluates on base classes
bash scripts/promptsrc/base2new_train.sh imagenet 1
# evaluates on novel classes
bash scripts/promptsrc/base2new_test.sh imagenet 1
# seed=2
# trains and evaluates on base classes
bash scripts/promptsrc/base2new_train.sh imagenet 2
# evaluates on novel classes
bash scripts/promptsrc/base2new_test.sh imagenet 2
# seed=3
# trains and evaluates on base classes
bash scripts/promptsrc/base2new_train.sh imagenet 3
# evaluates on novel classes
bash scripts/promptsrc/base2new_test.sh imagenet 3
```
#### Averaging results over 3 seeds:
Once the above training and evaluation runs are completed, the `output/` directory should have the following structure:
```
output
| base2new/
| | test_new/
| | | imagenet/
| | | | shots_16/
| | | | | PromptSRC/
| | | | | | vit_b16_c2_ep20_batch4_4+4ctx/
| | | | | | | seed1/
| | | | | | | seed2/
| | | | | | | seed3/
| | train_base/
| | | imagenet/
| | | | shots_16/
| | | | | PromptSRC/
| | | | | | vit_b16_c2_ep20_batch4_4+4ctx/
| | | | | | | seed1/
| | | | | | | seed2/
| | | | | | | seed3/
```
Now use the script `parse_test_res.py` and run the commands below to calculate the averaged results:
```bash
# prints averaged results for base classes
python parse_test_res.py output/base2new/train_base/imagenet/shots_16/PromptSRC/vit_b16_c2_ep20_batch4_4+4ctx --test-log
# averaged results for novel classes
python parse_test_res.py output/base2new/test_new/imagenet/shots_16/PromptSRC/vit_b16_c2_ep20_batch4_4+4ctx --test-log
```
The above steps can be repeated for other individual datasets.
#### (2) Cross-Dataset Transfer setting
We provide instructions to train PromptSRC on ImageNet using all 1000 classes with 16 shots and then evaluate it directly on new downstream datasets.
The corresponding cross-dataset config for PromptSRC is available at: `configs/trainers/PromptSRC/vit_b16_c2_ep20_batch4_4+4ctx_cross_datasets.yaml`. All PromptSRC hyper-parameters can be modified in this config file.
* First, train PromptSRC on imagenet in a few-shot manner (for all 3 seeds).
```bash
# seed=1
bash scripts/promptsrc/xd_train.sh imagenet 1
# seed=2
bash scripts/promptsrc/xd_train.sh imagenet 2
# seed=3
bash scripts/promptsrc/xd_train.sh imagenet 3
```
* Now directly evaluate the ImageNet trained model on downstream cross-datasets.
```bash
# Other possible dataset values include [imagenet, food101, dtd, ucf101, oxford_flowers, fgvc_aircraft, sun397, eurosat]
for SEED in 1 2 3
do
bash scripts/promptsrc/xd_test.sh caltech101 ${SEED}
bash scripts/promptsrc/xd_test.sh oxford_pets ${SEED}
bash scripts/promptsrc/xd_test.sh stanford_cars ${SEED}
done
```
You can obtain averaged results by using the script `parse_test_res.py` and following similar steps to those provided for the base-to-novel generalization experiments.
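For example, assuming the cross-dataset evaluation output follows the same layout as the paths used in EVAL.md, the averaged results for a single target dataset could be obtained roughly as follows (the exact path depends on the output directory configured in `xd_test.sh`):
```bash
# hypothetical output path; adjust it to match what xd_test.sh actually writes
python parse_test_res.py output/evaluation/PromptSRC/vit_b16_c2_ep20_batch4_4+4ctx_cross_datasets_16shots/caltech101 --test-log
```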
#### (3) Domain Generalization setting
We use the same ImageNet-trained PromptSRC model for the domain generalization experiments. The steps are similar to the cross-dataset experiments above; however, the trained model is now evaluated on ImageNet variants.
The corresponding domain generalization config for PromptSRC is available at: `configs/trainers/PromptSRC/vit_b16_c2_ep20_batch4_4+4ctx_cross_datasets.yaml`.
* Evaluate ImageNet model on different variants of ImageNet (datasets with domain shifts).
```bash
for SEED in 1 2 3
do
bash scripts/promptsrc/xd_test.sh imagenetv2 ${SEED}
bash scripts/promptsrc/xd_test.sh imagenet_sketch ${SEED}
bash scripts/promptsrc/xd_test.sh imagenet_a ${SEED}
bash scripts/promptsrc/xd_test.sh imagenet_r ${SEED}
done
```
You can obtain averaged results by using the script `parse_test_res.py` and following similar steps to those provided for the base-to-novel generalization experiments.
#### (4) Few-shot setting
In this setting, PromptSRC is trained on all classes of individual datasets with different few-shot splits (K = 1, 2, 4, 8, 16). The corresponding few-shot setting config for PromptSRC is available at: `configs/trainers/PromptSRC/vit_b16_c2_ep50_batch4_4+4ctx_few_shot.yaml`.
Now use the training script `scripts/promptsrc/few_shot.sh` and run the commands below to calculate the results for the imagenet dataset for all shots over 3 seeds:
```bash
# Other possible dataset values include [caltech101, food101, dtd, ucf101, oxford_flowers, oxford_pets, fgvc_aircraft, stanford_cars, sun397, eurosat]
# train and test on given dataset for K=1 shot
bash scripts/promptsrc/few_shot.sh imagenet 1
# train and test on given dataset for K=2 shot
bash scripts/promptsrc/few_shot.sh imagenet 2
# train and test on given dataset for K=4 shot
bash scripts/promptsrc/few_shot.sh imagenet 4
# train and test on given dataset for K=8 shot
bash scripts/promptsrc/few_shot.sh imagenet 8
# train and test on given dataset for K=16 shot
bash scripts/promptsrc/few_shot.sh imagenet 16
```
You can obtain averaged results by using the script `parse_test_res.py` and following similar steps to those provided for the base-to-novel generalization experiments.
<br>
#### Training and testing independent V-L prompting baseline approach
For training the independent V-L prompting baseline approach, we provide the corresponding configs and scripts as follows.
```
configs
| datasets/
| trainers/
| | CoCoOp/
| | CoOp/
| | IVLP/
| | PromptSRC/
```
```
scripts
| cocoop/
| coop/
| promptsrc/
| independent-vlp/
```
Please use the corresponding config and script files and follow the same instructions as provided for PromptSRC for training and testing.
This repository also supports using official [MaPLe](MaPLe.md), [CoOp](CoOp.md) and [Co-CoOp](Co-CoOp.md) configs and models.

BIN
docs/main_figure.png Normal file
Binary image (2.9 MiB), not shown in this diff.

File diff suppressed because it is too large.


@@ -0,0 +1,84 @@
import os
import sys
import argparse
import torch
from clip.simple_tokenizer import SimpleTokenizer
from clip import clip
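# This script interprets learned prompt vectors by mapping them to the nearest words in
# CLIP's token-embedding space: for each layer's context vectors it computes Euclidean
# distances to all token embeddings and prints the top-k closest words per vector.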
# "ViT-B/16"
# "RN50"
def load_clip_to_cpu(backbone_name="ViT-B/16"):
url = clip._MODELS[backbone_name]
model_path = clip._download(url)
try:
# loading JIT archive
model = torch.jit.load(model_path, map_location="cpu").eval()
state_dict = None
except RuntimeError:
state_dict = torch.load(model_path, map_location="cpu")
model = clip.build_model(state_dict or model.state_dict())
return model
# parser = argparse.ArgumentParser()
# parser.add_argument("fpath", type=str, help="Path to the learned prompt")
# parser.add_argument("topk", type=int, help="Select top-k similar words")
# args = parser.parse_args()
fpath = "./compound_prompt_weights/train_base/food101/shots_16/cocoop/vit_b16_c4_ep10_batch1_ctxv1/seed1/prompt_learner/model.pth.tar-5"
topk = 10
assert os.path.exists(fpath)
print(f"Return the top-{topk} matched words")
tokenizer = SimpleTokenizer()
clip_model = load_clip_to_cpu()
token_embedding = clip_model.token_embedding.weight
print(f"Size of token embedding: {token_embedding.shape}")
prompt_learner = torch.load(fpath, map_location="cpu")["state_dict"]
# Extract the input tokens
ctx = prompt_learner["prompt_learner.ctx"]
ctx = ctx.float()
# Now extract the intermediate tokens
intermediate_embeddings = []
depth = 9 - 1
for i in range(depth):
# Now extract the prompt embeddings and store it
query = 'prompt_learner.compound_prompts_text.' + str(i)
temp = prompt_learner[query].float()
intermediate_embeddings.append(temp)
print(f"Size of context: {ctx.shape}")
# Now repeat this for all layer context embeddings
all_layer_ctx = [ctx] + intermediate_embeddings
for idx, single_ctx in enumerate(all_layer_ctx):
print("SHOWING RESULTS FOR CTX Vectors of Layer: ", idx + 1)
ctx = single_ctx
if ctx.dim() == 2:
# Generic context
distance = torch.cdist(ctx, token_embedding)
print(f"Size of distance matrix: {distance.shape}")
sorted_idxs = torch.argsort(distance, dim=1)
sorted_idxs = sorted_idxs[:, :topk]
for m, idxs in enumerate(sorted_idxs):
words = [tokenizer.decoder[idx.item()] for idx in idxs]
dist = [f"{distance[m, idx].item():.4f}" for idx in idxs]
print(f"{m+1}: {words} {dist}")
elif ctx.dim() == 3:
# Class-specific context
raise NotImplementedError
print("##############################")
print("##############################")

17
lpclip/README.md Normal file

@@ -0,0 +1,17 @@
# Linear Probe CLIP
To run linear probe baselines, make sure that your current working directory is `lpclip/`.
Step 1: Extract Features using the CLIP Image Encoder
```bash
sh feat_extractor.sh
```
Step 2: Train few-shot linear probe
```bash
sh linear_probe.sh
```
We follow the instructions stated in Appendix A3 (p. 38) of [the original CLIP paper](https://arxiv.org/pdf/2103.00020.pdf), with a careful hyper-parameter sweep.
Note: please pull the latest Dassl (version >= `606a2c6`).
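If Dassl was installed from source in develop mode (as in the installation guide), one way to update it is sketched below; the path is a placeholder and this is only a suggestion, not part of the official scripts:
```bash
# assumes Dassl.pytorch was cloned and installed with `python setup.py develop`,
# so pulling the latest commits is enough (no re-installation needed)
cd /path/to/Dassl.pytorch
git pull
git log --oneline -1   # check that the current commit is 606a2c6 or later
```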

189
lpclip/feat_extractor.py Normal file

@@ -0,0 +1,189 @@
import os, argparse
import numpy as np
import torch
import sys
sys.path.append(os.path.abspath(".."))
from datasets.oxford_pets import OxfordPets
from datasets.oxford_flowers import OxfordFlowers
from datasets.fgvc_aircraft import FGVCAircraft
from datasets.dtd import DescribableTextures
from datasets.eurosat import EuroSAT
from datasets.stanford_cars import StanfordCars
from datasets.food101 import Food101
from datasets.sun397 import SUN397
from datasets.caltech101 import Caltech101
from datasets.ucf101 import UCF101
from datasets.imagenet import ImageNet
from datasets.imagenetv2 import ImageNetV2
from datasets.imagenet_sketch import ImageNetSketch
from datasets.imagenet_a import ImageNetA
from datasets.imagenet_r import ImageNetR
from dassl.utils import setup_logger, set_random_seed, collect_env_info
from dassl.config import get_cfg_default
from dassl.data.transforms import build_transform
from dassl.data import DatasetWrapper
import clip
# import pdb; pdb.set_trace()
def print_args(args, cfg):
print("***************")
print("** Arguments **")
print("***************")
optkeys = list(args.__dict__.keys())
optkeys.sort()
for key in optkeys:
print("{}: {}".format(key, args.__dict__[key]))
print("************")
print("** Config **")
print("************")
print(cfg)
def reset_cfg(cfg, args):
if args.root:
cfg.DATASET.ROOT = args.root
if args.output_dir:
cfg.OUTPUT_DIR = args.output_dir
if args.trainer:
cfg.TRAINER.NAME = args.trainer
if args.backbone:
cfg.MODEL.BACKBONE.NAME = args.backbone
if args.head:
cfg.MODEL.HEAD.NAME = args.head
def extend_cfg(cfg):
"""
Add new config variables.
E.g.
from yacs.config import CfgNode as CN
cfg.TRAINER.MY_MODEL = CN()
cfg.TRAINER.MY_MODEL.PARAM_A = 1.
cfg.TRAINER.MY_MODEL.PARAM_B = 0.5
cfg.TRAINER.MY_MODEL.PARAM_C = False
"""
from yacs.config import CfgNode as CN
cfg.TRAINER.OURS = CN()
cfg.TRAINER.OURS.N_CTX = 10 # number of context vectors
cfg.TRAINER.OURS.CSC = False # class-specific context
cfg.TRAINER.OURS.CTX_INIT = "" # initialize context vectors with given words
cfg.TRAINER.OURS.WEIGHT_U = 0.1 # weight for the unsupervised loss
def setup_cfg(args):
cfg = get_cfg_default()
extend_cfg(cfg)
# 1. From the dataset config file
if args.dataset_config_file:
cfg.merge_from_file(args.dataset_config_file)
# 2. From the method config file
if args.config_file:
cfg.merge_from_file(args.config_file)
# 3. From input arguments
reset_cfg(cfg, args)
cfg.freeze()
return cfg
def main(args):
cfg = setup_cfg(args)
if cfg.SEED >= 0:
print("Setting fixed seed: {}".format(cfg.SEED))
set_random_seed(cfg.SEED)
setup_logger(cfg.OUTPUT_DIR)
if torch.cuda.is_available() and cfg.USE_CUDA:
torch.backends.cudnn.benchmark = True
print_args(args, cfg)
print("Collecting env info ...")
print("** System info **\n{}\n".format(collect_env_info()))
######################################
# Setup DataLoader
######################################
dataset = eval(cfg.DATASET.NAME)(cfg)
if args.split == "train":
dataset_input = dataset.train_x
elif args.split == "val":
dataset_input = dataset.val
else:
dataset_input = dataset.test
tfm_train = build_transform(cfg, is_train=False)
data_loader = torch.utils.data.DataLoader(
DatasetWrapper(cfg, dataset_input, transform=tfm_train, is_train=False),
batch_size=cfg.DATALOADER.TRAIN_X.BATCH_SIZE,
sampler=None,
shuffle=False,
num_workers=cfg.DATALOADER.NUM_WORKERS,
drop_last=False,
pin_memory=(torch.cuda.is_available() and cfg.USE_CUDA),
)
########################################
# Setup Network
########################################
clip_model, _ = clip.load("RN50", "cuda", jit=False)
clip_model.eval()
###################################################################################################################
# Start Feature Extractor
feature_list = []
label_list = []
train_dataiter = iter(data_loader)
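# Iterate over the selected split once, extract CLIP image features batch by batch,
# and accumulate the features and labels on the CPU so they can be saved as one .npz file.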
for train_step in range(1, len(train_dataiter) + 1):
batch = next(train_dataiter)
data = batch["img"].cuda()
feature = clip_model.visual(data)
feature = feature.cpu()
for idx in range(len(data)):
feature_list.append(feature[idx].tolist())
label_list.extend(batch["label"].tolist())
save_dir = os.path.join(cfg.OUTPUT_DIR, cfg.DATASET.NAME)
os.makedirs(save_dir, exist_ok=True)
save_filename = f"{args.split}"
np.savez(
os.path.join(save_dir, save_filename),
feature_list=feature_list,
label_list=label_list,
)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--root", type=str, default="", help="path to dataset")
parser.add_argument("--output-dir", type=str, default="", help="output directory")
parser.add_argument("--config-file", type=str, default="", help="path to config file")
parser.add_argument(
"--dataset-config-file",
type=str,
default="",
help="path to config file for dataset setup",
)
parser.add_argument("--num-shot", type=int, default=1, help="number of shots")
parser.add_argument("--split", type=str, choices=["train", "val", "test"], help="which split")
parser.add_argument("--trainer", type=str, default="", help="name of trainer")
parser.add_argument("--backbone", type=str, default="", help="name of CNN backbone")
parser.add_argument("--head", type=str, default="", help="name of head")
parser.add_argument("--seed", type=int, default=-1, help="only positive value enables a fixed seed")
parser.add_argument("--eval-only", action="store_true", help="evaluation only")
args = parser.parse_args()
main(args)

20
lpclip/feat_extractor.sh Normal file

@@ -0,0 +1,20 @@
# sh feat_extractor.sh
DATA=/path/to/datasets
OUTPUT='./clip_feat/'
SEED=1
# oxford_pets oxford_flowers fgvc_aircraft dtd eurosat stanford_cars food101 sun397 caltech101 ucf101 imagenet
for DATASET in oxford_pets
do
for SPLIT in train val test
do
python feat_extractor.py \
--split ${SPLIT} \
--root ${DATA} \
--seed ${SEED} \
--dataset-config-file ../configs/datasets/${DATASET}.yaml \
--config-file ../configs/trainers/CoOp/rn50_val.yaml \
--output-dir ${OUTPUT} \
--eval-only
done
done

129
lpclip/linear_probe.py Normal file

@@ -0,0 +1,129 @@
import numpy as np
import os
from sklearn.linear_model import LogisticRegression
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--dataset", type=str, default="", help="path to dataset")
parser.add_argument("--num_step", type=int, default=8, help="number of steps")
parser.add_argument("--num_run", type=int, default=10, help="number of runs")
parser.add_argument("--feature_dir", type=str, default="clip_feat", help="feature dir path")
args = parser.parse_args()
dataset = args.dataset
dataset_path = os.path.join(f"{args.feature_dir}", dataset)
train_file = np.load(os.path.join(dataset_path, "train.npz"))
train_feature, train_label = train_file["feature_list"], train_file["label_list"]
val_file = np.load(os.path.join(dataset_path, "val.npz"))
val_feature, val_label = val_file["feature_list"], val_file["label_list"]
test_file = np.load(os.path.join(dataset_path, "test.npz"))
test_feature, test_label = test_file["feature_list"], test_file["label_list"]
os.makedirs("report", exist_ok=True)
val_shot_list = {1: 1, 2: 2, 4: 4, 8: 4, 16: 4}
for num_shot in [1, 2, 4, 8, 16]:
test_acc_step_list = np.zeros([args.num_run, args.num_step])
for seed in range(1, args.num_run + 1):
np.random.seed(seed)
print(f"-- Seed: {seed} --------------------------------------------------------------")
# Sampling
all_label_list = np.unique(train_label)
selected_idx_list = []
for label in all_label_list:
label_collection = np.where(train_label == label)[0]
selected_idx = np.random.choice(label_collection, size=num_shot, replace=False)
selected_idx_list.extend(selected_idx)
fewshot_train_feature = train_feature[selected_idx_list]
fewshot_train_label = train_label[selected_idx_list]
val_num_shot = val_shot_list[num_shot]
val_selected_idx_list = []
for label in all_label_list:
label_collection = np.where(val_label == label)[0]
selected_idx = np.random.choice(label_collection, size=val_num_shot, replace=False)
val_selected_idx_list.extend(selected_idx)
fewshot_val_feature = val_feature[val_selected_idx_list]
fewshot_val_label = val_label[val_selected_idx_list]
# search initialization
search_list = [1e6, 1e4, 1e2, 1, 1e-2, 1e-4, 1e-6]
acc_list = []
for c_weight in search_list:
clf = LogisticRegression(solver="lbfgs", max_iter=1000, penalty="l2", C=c_weight).fit(fewshot_train_feature, fewshot_train_label)
pred = clf.predict(fewshot_val_feature)
acc_val = sum(pred == fewshot_val_label) / len(fewshot_val_label)
acc_list.append(acc_val)
print(acc_list, flush=True)
# binary search
peak_idx = np.argmax(acc_list)
c_peak = search_list[peak_idx]
c_left, c_right = 1e-1 * c_peak, 1e1 * c_peak
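# Coarse-to-fine search over the regularization strength C: each binary_search step fits
# a classifier at both ends of the bracket, keeps the end with the higher validation
# accuracy, and narrows the bracket towards it in log10 space for the next step.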
def binary_search(c_left, c_right, seed, step, test_acc_step_list):
clf_left = LogisticRegression(solver="lbfgs", max_iter=1000, penalty="l2", C=c_left).fit(fewshot_train_feature, fewshot_train_label)
pred_left = clf_left.predict(fewshot_val_feature)
acc_left = sum(pred_left == fewshot_val_label) / len(fewshot_val_label)
print("Val accuracy (Left): {:.2f}".format(100 * acc_left), flush=True)
clf_right = LogisticRegression(solver="lbfgs", max_iter=1000, penalty="l2", C=c_right).fit(fewshot_train_feature, fewshot_train_label)
pred_right = clf_right.predict(fewshot_val_feature)
acc_right = sum(pred_right == fewshot_val_label) / len(fewshot_val_label)
print("Val accuracy (Right): {:.2f}".format(100 * acc_right), flush=True)
# find maximum and update ranges
if acc_left < acc_right:
c_final = c_right
clf_final = clf_right
# range for the next step
c_left = 0.5 * (np.log10(c_right) + np.log10(c_left))
c_right = np.log10(c_right)
else:
c_final = c_left
clf_final = clf_left
# range for the next step
c_right = 0.5 * (np.log10(c_right) + np.log10(c_left))
c_left = np.log10(c_left)
pred = clf_final.predict(test_feature)
test_acc = 100 * sum(pred == test_label) / len(pred)
print("Test Accuracy: {:.2f}".format(test_acc), flush=True)
test_acc_step_list[seed - 1, step] = test_acc
saveline = "{}, seed {}, {} shot, weight {}, test_acc {:.2f}\n".format(dataset, seed, num_shot, c_final, test_acc)
with open(
"./report/{}_s{}r{}_details.txt".format(args.feature_dir, args.num_step, args.num_run),
"a+",
) as writer:
writer.write(saveline)
return (
np.power(10, c_left),
np.power(10, c_right),
seed,
step,
test_acc_step_list,
)
for step in range(args.num_step):
print(
f"{dataset}, {num_shot} Shot, Round {step}: {c_left}/{c_right}",
flush=True,
)
c_left, c_right, seed, step, test_acc_step_list = binary_search(c_left, c_right, seed, step, test_acc_step_list)
# save results of last step
test_acc_list = test_acc_step_list[:, -1]
acc_mean = np.mean(test_acc_list)
acc_std = np.std(test_acc_list)
save_line = "{}, {} Shot, Test acc stat: {:.2f} ({:.2f})\n".format(dataset, num_shot, acc_mean, acc_std)
print(save_line, flush=True)
with open(
"./report/{}_s{}r{}.txt".format(args.feature_dir, args.num_step, args.num_run),
"a+",
) as writer:
writer.write(save_line)

10
lpclip/linear_probe.sh Normal file

@@ -0,0 +1,10 @@
feature_dir=clip_feat
for DATASET in OxfordPets
do
python linear_probe.py \
--dataset ${DATASET} \
--feature_dir ${feature_dir} \
--num_step 8 \
--num_run 3
done

174
parse_test_res.py Normal file

@@ -0,0 +1,174 @@
"""
Goal
---
1. Read test results from log.txt files
2. Compute mean and std across different folders (seeds)
Usage
---
Assume the output files are saved under output/my_experiment,
which contains results of different seeds, e.g.,
my_experiment/
seed1/
log.txt
seed2/
log.txt
seed3/
log.txt
Run the following command from the root directory:
$ python tools/parse_test_res.py output/my_experiment
Add --ci95 to the argument if you wanna get 95% confidence
interval instead of standard deviation:
$ python tools/parse_test_res.py output/my_experiment --ci95
If my_experiment/ has the following structure,
my_experiment/
exp-1/
seed1/
log.txt
...
seed2/
log.txt
...
seed3/
log.txt
...
exp-2/
...
exp-3/
...
Run
$ python tools/parse_test_res.py output/my_experiment --multi-exp
"""
import re
import numpy as np
import os.path as osp
import argparse
from collections import OrderedDict, defaultdict
from dassl.utils import check_isfile, listdir_nohidden
def compute_ci95(res):
return 1.96 * np.std(res) / np.sqrt(len(res))
def parse_function(*metrics, directory="", args=None, end_signal=None):
print(f"Parsing files in {directory}")
subdirs = listdir_nohidden(directory, sort=True)
outputs = []
for subdir in subdirs:
fpath = osp.join(directory, subdir, "log.txt")
assert check_isfile(fpath)
good_to_go = False
output = OrderedDict()
with open(fpath, "r") as f:
lines = f.readlines()
for line in lines:
line = line.strip()
if line == end_signal:
good_to_go = True
for metric in metrics:
match = metric["regex"].search(line)
if match and good_to_go:
if "file" not in output:
output["file"] = fpath
num = float(match.group(1))
name = metric["name"]
output[name] = num
if output:
outputs.append(output)
assert len(outputs) > 0, f"Nothing found in {directory}"
metrics_results = defaultdict(list)
for output in outputs:
msg = ""
for key, value in output.items():
if isinstance(value, float):
msg += f"{key}: {value:.2f}%. "
else:
msg += f"{key}: {value}. "
if key != "file":
metrics_results[key].append(value)
print(msg)
output_results = OrderedDict()
print("===")
print(f"Summary of directory: {directory}")
for key, values in metrics_results.items():
avg = np.mean(values)
std = compute_ci95(values) if args.ci95 else np.std(values)
print(f"* {key}: {avg:.2f}% +- {std:.2f}%")
output_results[key] = avg
print("===")
return output_results
def main(args, end_signal):
metric = {
"name": args.keyword,
"regex": re.compile(fr"\* {args.keyword}: ([\.\deE+-]+)%"),
}
if args.multi_exp:
final_results = defaultdict(list)
for directory in listdir_nohidden(args.directory, sort=True):
directory = osp.join(args.directory, directory)
results = parse_function(
metric, directory=directory, args=args, end_signal=end_signal
)
for key, value in results.items():
final_results[key].append(value)
print("Average performance")
for key, values in final_results.items():
avg = np.mean(values)
print(f"* {key}: {avg:.2f}%")
else:
parse_function(
metric, directory=args.directory, args=args, end_signal=end_signal
)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("directory", type=str, help="path to directory")
parser.add_argument(
"--ci95", action="store_true", help=r"compute 95\% confidence interval"
)
parser.add_argument("--test-log", action="store_true", help="parse test-only logs")
parser.add_argument(
"--multi-exp", action="store_true", help="parse multiple experiments"
)
parser.add_argument(
"--keyword", default="accuracy", type=str, help="which keyword to extract"
)
args = parser.parse_args()
end_signal = "Finished training"
if args.test_log:
end_signal = "=> result"
main(args, end_signal)

3
requirements.txt Normal file

@@ -0,0 +1,3 @@
ftfy==6.1.1
regex
tqdm


@@ -0,0 +1,54 @@
#!/bin/bash
#cd ../..
# custom config
DATA="/path/to/dataset/folder"
TRAINER=CoCoOp
DATASET=$1
SEED=$2
CFG=vit_b16_c4_ep10_batch1_ctxv1
SHOTS=16
LOADEP=10
SUB=new
COMMON_DIR=${DATASET}/shots_${SHOTS}/${TRAINER}/${CFG}/seed${SEED}
MODEL_DIR=output/base2new/train_base/${COMMON_DIR}
DIR=output/base2new/test_${SUB}/${COMMON_DIR}
if [ -d "$DIR" ]; then
echo "Evaluating model"
echo "Results are available in ${DIR}. Resuming..."
python train.py \
--root ${DATA} \
--seed ${SEED} \
--trainer ${TRAINER} \
--dataset-config-file configs/datasets/${DATASET}.yaml \
--config-file configs/trainers/${TRAINER}/${CFG}.yaml \
--output-dir ${DIR} \
--model-dir ${MODEL_DIR} \
--load-epoch ${LOADEP} \
--eval-only \
DATASET.NUM_SHOTS ${SHOTS} \
DATASET.SUBSAMPLE_CLASSES ${SUB}
else
echo "Evaluating model"
echo "Runing the first phase job and save the output to ${DIR}"
python train.py \
--root ${DATA} \
--seed ${SEED} \
--trainer ${TRAINER} \
--dataset-config-file configs/datasets/${DATASET}.yaml \
--config-file configs/trainers/${TRAINER}/${CFG}.yaml \
--output-dir ${DIR} \
--model-dir ${MODEL_DIR} \
--load-epoch ${LOADEP} \
--eval-only \
DATASET.NUM_SHOTS ${SHOTS} \
DATASET.SUBSAMPLE_CLASSES ${SUB}
fi

Some files were not shown because too many files have changed in this diff.