Depth Anything Review

zz0622

2024. 3. 19. 17:15

DPT + DINOv2

Depth Anything

$\text{labeled data} :: D^l = \{(x_i, d_i)\}^M_{i=1}$

$\text{unlabeled data} :: D^u = \{u_i\}^N_{i=1}$
A teacher model $T$ is first trained on $D^l$,
and $T$ is then used to assign pseudo depth labels to $D^u$.
Finally, a student model $S$ is trained on both the labeled set and the pseudo-labeled set.

3.1 Learning Labeled Images

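This process is similar to how MiDaS is trained.
First, each depth value is converted to disparity via $d = 1/t$, and each depth map is then normalized to the range 0~1.
To enable multi-dataset joint training, an affine-invariant loss is adopted,
which ignores the unknown scale and shift of each sample.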

$\mathcal{L}_l=\frac{1}{H W} \sum_{i=1}^{H W} \rho\left(d_i^*, d_i\right)$

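Here $d_i^*$ is the prediction and $d_i$ is the ground truth, with $\rho(d_i^*, d_i) = |\hat{d}_i^* - \hat{d}_i|$. The prediction and the ground truth are each normalized as

$\hat{d}_i = \frac{d_i - t(d)}{s(d)}$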

$$
t(d) = \text{median}(d), \quad s(d) = \frac{1}{HW} \sum_{i=1}^{HW} |d_i - t(d)|
$$
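The formulas above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function names and the `eps` guard are assumptions, and $\rho$ is taken to be the absolute difference between the normalized prediction and the normalized ground truth.

```python
import numpy as np

def to_normalized_disparity(depth, eps=1e-6):
    """Convert depth t to disparity d = 1/t, then min-max scale the map to [0, 1]."""
    disparity = 1.0 / np.maximum(depth, eps)
    lo, hi = disparity.min(), disparity.max()
    return (disparity - lo) / max(hi - lo, eps)

def scale_shift_normalize(d, eps=1e-6):
    """d_hat = (d - t(d)) / s(d), with t = median and s = mean absolute deviation."""
    t = np.median(d)
    s = np.mean(np.abs(d - t))
    return (d - t) / max(s, eps)

def affine_invariant_loss(pred, target):
    """Mean absolute difference between the normalized prediction and target."""
    diff = scale_shift_normalize(pred) - scale_shift_normalize(target)
    return float(np.mean(np.abs(diff)))

rng = np.random.default_rng(0)
depth = rng.uniform(1.0, 10.0, size=(4, 5))
target = to_normalized_disparity(depth)
pred = 3.0 * target + 0.7  # same map up to an unknown positive scale and shift
print(affine_invariant_loss(pred, target))  # numerically close to 0
```

Because the median and the mean absolute deviation transform the same way as the map itself, a prediction that differs from the target only by a positive scale and a shift yields a loss of (numerically) zero.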

1.5M labeled images are collected from 6 public datasets.

MiDaS used 12 training datasets, but NYUv2 and KITTI are deliberately left out here so that zero-shot performance can be evaluated on them.

The teacher model is strengthened by initializing its encoder with DINOv2 pre-trained weights,
and during training a semantic segmentation model is used to detect sky regions, whose disparity values are set to 0.
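As a small illustration of the sky handling, the sketch below zeroes the disparity wherever a boolean sky mask is set. The function name is hypothetical, and the mask is assumed to come from whatever segmentation model is used; the post does not say which one.

```python
import numpy as np

def zero_sky_disparity(disparity, sky_mask):
    """Return a copy of the disparity map with sky pixels set to 0 (farthest point)."""
    cleaned = disparity.copy()
    cleaned[sky_mask] = 0.0
    return cleaned

disparity = np.array([[0.8, 0.6], [0.3, 0.2]])
sky_mask = np.array([[True, True], [False, False]])  # assumed output of a segmentation model
cleaned = zero_sky_disparity(disparity, sky_mask)
print(cleaned)
```

Since disparity is the inverse of depth, a value of 0 corresponds to a point infinitely far away, which is why it is a natural label for sky.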

Unleashing the Power of Unlabeled Images

Using unlabeled images improves data coverage.
62M images are collected (Table 1).
$\text{pseudo label} :: \hat{D}^u = \{(u_i, T(u_i)) \mid u_i \in D^u\}^N_{i=1}$

Following prior works, $S$ is not fine-tuned from $T$;
instead, $S$ is re-initialized.
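The pseudo-labeled set defined above amounts to running the teacher once over every unlabeled image. A minimal sketch, where `toy_teacher` is a hypothetical stand-in for $T$ and images are plain lists of pixel values:

```python
def pseudo_label(teacher, unlabeled_images):
    """Build the pseudo-labeled set {(u_i, T(u_i))} from a teacher callable."""
    return [(u, teacher(u)) for u in unlabeled_images]

def toy_teacher(image):
    # Hypothetical stand-in for T: maps each pixel value to a fake disparity in [0, 1].
    return [pixel / 255.0 for pixel in image]

unlabeled = [[0, 128, 255], [64, 64, 64]]
pseudo_labeled = pseudo_label(toy_teacher, unlabeled)
print(len(pseudo_labeled))
```

The student is then trained on these pairs from a fresh initialization rather than starting from the teacher's weights.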

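In pilot studies, this self-training pipeline brought no improvement.
The conjecture is that with enough labeled images already available, the additional knowledge gained from extra unlabeled images is limited.
In particular, since $T$ and $S$ share the same pre-training and architecture, they tend to make similar predictions, including the same false predictions, on $D^u$,
even without an explicit self-training procedure.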

To resolve this dilemma,
$S$ is given a harder optimization target on the unlabeled dataset so that it picks up additional knowledge.

During training, the authors apply strong perturbations to the unlabeled images.
This pushes $S$ to actively seek extra visual knowledge and to learn invariant representations,
which helps the model behave more robustly in the open world.

The authors apply two kinds of perturbation:

strong color distortions
- color jittering
- Gaussian blurring
strong spatial distortion
- Cutmix

$u_{ab} = u_a \odot M + u_b \odot (1-M)$
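The CutMix formula can be sketched as below. This assumes $M$ is a binary mask with a single rectangle set to 1; the rectangle size (`ratio`) and the helper names are illustrative assumptions, not values from the post.

```python
import numpy as np

def random_box_mask(height, width, rng, ratio=0.5):
    """Binary mask M: 1 inside one random rectangle, 0 elsewhere."""
    box_h, box_w = int(height * ratio), int(width * ratio)
    top = int(rng.integers(0, height - box_h + 1))
    left = int(rng.integers(0, width - box_w + 1))
    mask = np.zeros((height, width), dtype=np.float32)
    mask[top:top + box_h, left:left + box_w] = 1.0
    return mask

def cutmix(u_a, u_b, mask):
    """u_ab = u_a * M + u_b * (1 - M); the mask is broadcast over the channel axis."""
    m = mask[..., None]
    return u_a * m + u_b * (1.0 - m)

rng = np.random.default_rng(0)
u_a = np.ones((8, 8, 3), dtype=np.float32)   # toy image a
u_b = np.zeros((8, 8, 3), dtype=np.float32)  # toy image b
mask = random_box_mask(8, 8, rng)
u_ab = cutmix(u_a, u_b, mask)
print(mask.sum())
```

With these toy inputs, every channel of `u_ab` equals the mask itself, which makes it easy to see which pixels came from which image.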

For the unlabeled loss $\mathcal{L}_u$, the loss is computed separately over the valid regions $M$ and $1-M$:

$$
\mathcal{L}_u^M = \rho(S(u_{ab}) \odot M, T(u_a) \odot M)
$$

$$
\mathcal{L}^{1-M}_u = \rho(S(u_{ab}) \odot (1-M), T(u_b) \odot (1-M))
$$

$\mathcal{L}_u=\frac{\sum M}{H W} \mathcal{L}_u^M+\frac{\sum(1-M)}{H W} \mathcal{L}_u^{1-M}$
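A rough NumPy sketch of this region-weighted loss follows. It assumes $\rho$ is the affine-invariant loss evaluated only over the pixels of each region, with the median and scale computed inside that region; that reading, and all function names, are assumptions.

```python
import numpy as np

def masked_affine_invariant_loss(pred, target, mask, eps=1e-6):
    """Affine-invariant loss restricted to pixels where mask == 1."""
    region = mask > 0.5
    if not region.any():
        return 0.0

    def normalize(d):
        t = np.median(d)
        s = np.mean(np.abs(d - t))
        return (d - t) / max(s, eps)

    diff = normalize(pred[region]) - normalize(target[region])
    return float(np.mean(np.abs(diff)))

def unlabeled_loss(student_pred, teacher_a, teacher_b, mask):
    """L_u = (sum(M) / HW) * L_u^M + (sum(1 - M) / HW) * L_u^{1-M}."""
    weight_m = float(mask.mean())  # sum(M) / (H * W)
    loss_m = masked_affine_invariant_loss(student_pred, teacher_a, mask)
    loss_rest = masked_affine_invariant_loss(student_pred, teacher_b, 1.0 - mask)
    return weight_m * loss_m + (1.0 - weight_m) * loss_rest

rng = np.random.default_rng(0)
teacher_a = rng.uniform(0.0, 1.0, size=(6, 6))  # T(u_a) on the clean image
teacher_b = rng.uniform(0.0, 1.0, size=(6, 6))  # T(u_b) on the clean image
mask = np.zeros((6, 6))
mask[:, :3] = 1.0
# A student that matches each teacher up to scale and shift inside its own region.
student_pred = np.where(mask > 0.5, 2.0 * teacher_a + 1.0, 0.5 * teacher_b + 3.0)
print(unlabeled_loss(student_pred, teacher_a, teacher_b, mask))  # numerically close to 0
```

Weighting each term by the area of its region keeps the total comparable to a loss averaged over the full image.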

CutMix is applied with 50% probability. The images fed to $T$ for pseudo labeling, however, are clean and carry no distortion at all.

Semantic-Assisted Perception

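The authors believe that high-level semantic information is beneficial for depth estimation.
Moreover, in the specific setting of using unlabeled images, such auxiliary supervision signals from other tasks can also counteract the potential noise in the pseudo depth labels.

So in an initial attempt, a combination of RAM + GroundingDINO + HQ-SAM models is used to assign semantic segmentation maps to the unlabeled images, producing a class space of 4K classes.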

During the joint-training stage, the model predicts both depth and segmentation,
using a shared encoder and two independent decoders. Unfortunately, this initial attempt failed.

Reflecting on this, the authors conjecture that decoding an image into a discrete class space loses too much semantic information, and that the limited information in such semantic masks cannot further improve the depth model.

The authors therefore look for a more informative semantic signal to serve as auxiliary supervision for the depth estimation task.

The authors were impressed by how well DINOv2 performs on semantic-related tasks, even with frozen weights and no fine-tuning.
Motivated by this, they propose transferring DINOv2's strong semantic capability to the depth model through an auxiliary feature alignment loss.

Because the feature space is high-dimensional and continuous, it carries richer semantic information than discrete masks and can therefore yield better performance.

$\mathcal{L}_{\text{feat}} = 1 - \frac{1}{HW} \sum_{i=1}^{HW} \cos \left(f_i, f_i^{\prime}\right)$

Here $\cos(\cdot, \cdot)$ is the cosine similarity between two feature vectors;
$f$ is the feature extracted by the depth model $S$, and $f^{\prime}$ is the feature from a frozen DINOv2 encoder.
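The feature alignment loss can be sketched like this, assuming both feature maps have shape `(H, W, C)` so that each pixel holds one feature vector. The shapes and names are illustrative assumptions.

```python
import numpy as np

def cosine_similarity_map(f, f_ref, eps=1e-8):
    """Per-pixel cosine similarity between two (H, W, C) feature maps."""
    dot = np.sum(f * f_ref, axis=-1)
    norms = np.linalg.norm(f, axis=-1) * np.linalg.norm(f_ref, axis=-1)
    return dot / np.maximum(norms, eps)

def feature_alignment_loss(f, f_ref):
    """L_feat = 1 - mean over pixels of cos(f_i, f'_i)."""
    return float(1.0 - cosine_similarity_map(f, f_ref).mean())

rng = np.random.default_rng(0)
f_student = rng.normal(size=(4, 4, 8))  # stand-in for features from S
f_frozen = rng.normal(size=(4, 4, 8))   # stand-in for frozen DINOv2 features
print(feature_alignment_loss(f_student, f_frozen))
```

Identical features give a loss of 0 and exactly opposite features give 2, so the loss is bounded in [0, 2].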

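[We do not follow works such as [19] that project $f$ into a new space for alignment, because a randomly initialized projector produces a large alignment loss that dominates the overall loss in the early stage.] I don't quite understand this part.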

Another key point in feature alignment is that
a semantic encoder such as DINOv2 tends to produce similar features for different parts of the same object, e.g. a car's front and rear. In depth estimation, however, different parts of the same object can have different depths.
In other words, forcing the depth model to produce exactly the same features as the frozen encoder is not beneficial.
To address this issue, a tolerance margin $\alpha$ is set for the feature alignment.
If the cosine similarity of $f_i$ and $f_i^{\prime}$ exceeds $\alpha$, that pixel is not counted in $\mathcal{L}_{\text{feat}}$. This lets the method keep both the semantic-aware representation from DINOv2 and the part-level discriminative representation from depth supervision. As a side effect, the encoder not only performs well on downstream MDE datasets but also achieves strong results on segmentation tasks. Finally, the overall loss is the average of $\mathcal{L}_l$, $\mathcal{L}_u$, and $\mathcal{L}_{\text{feat}}$.
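One possible reading of the margin rule and the final loss is sketched below. Averaging only over the pixels that remain after the margin test is an assumption, since the post does not say how the excluded pixels affect the normalization; the function names are also hypothetical.

```python
import numpy as np

def feature_alignment_loss_with_margin(cos_sim, alpha=0.15):
    """Average 1 - cos over pixels whose similarity has not exceeded alpha."""
    keep = cos_sim <= alpha
    if not keep.any():
        return 0.0  # every pixel is already aligned beyond the margin
    return float(np.mean(1.0 - cos_sim[keep]))

def total_loss(loss_l, loss_u, loss_feat):
    """Overall loss: plain average of the three terms."""
    return (loss_l + loss_u + loss_feat) / 3.0

cos_sim = np.array([[0.9, 0.1], [0.5, -0.2]])  # toy per-pixel similarities
loss_feat = feature_alignment_loss_with_margin(cos_sim)
print(loss_feat, total_loss(0.3, 0.6, loss_feat))
```

In the toy map, the pixels with similarity 0.9 and 0.5 are already past the margin and are ignored, so only the two remaining pixels contribute.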

Experiment

4.1 Implementation Details

A teacher model is first trained on labeled images for 20 epochs.
The unlabeled images are annotated by the teacher model (ViT-L), and the student model is trained with a single pass over all unlabeled images.

In each batch, labeled and unlabeled images are mixed at a ratio of 1:2.
The pre-trained encoder uses a learning rate of 5e-6, and the randomly initialized decoder uses a 10x larger learning rate.
AdamW is used with a linear schedule.
Only horizontal flipping is applied to labeled images.
The tolerance margin $\alpha$ is 0.15.
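The batch ratio and learning-rate settings above can be written down as a small framework-free sketch. Decaying linearly to exactly 0 at the last step is an assumption (the post only says a linear schedule is used), and the helper names are hypothetical.

```python
ENCODER_BASE_LR = 5e-6                   # pre-trained encoder
DECODER_BASE_LR = 10 * ENCODER_BASE_LR   # randomly initialized decoder, 10x larger

def linear_decay_lr(base_lr, step, total_steps):
    """Linearly decay from base_lr at step 0 down to 0 at total_steps (assumed end value)."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return base_lr * remaining

def split_batch(batch_size):
    """Split a batch into labeled and unlabeled counts at a 1:2 ratio."""
    if batch_size % 3 != 0:
        raise ValueError("batch_size must be divisible by 3 for a 1:2 split")
    labeled = batch_size // 3
    return labeled, batch_size - labeled

print(split_batch(48))  # (16, 32)
print(linear_decay_lr(ENCODER_BASE_LR, 500, 1000))
print(linear_decay_lr(DECODER_BASE_LR, 500, 1000))
```

In a real training loop these two base rates would typically be passed to the optimizer as separate parameter groups, one for the encoder and one for the decoder.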
