41. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
