InternVL2 Series Quick Start

This page provides example code for running the InternVL2 series with transformers.

You can also try the InternVL2 series models in the online demo.

Use transformers==4.37.2 to ensure the models work correctly.
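
Before loading a model, it can help to confirm that the pinned version is the one actually installed. The following is a minimal sketch; the helper name `transformers_version_ok` is hypothetical and not part of InternVL2 or transformers. It only reads package metadata, so it runs even when transformers is not installed.

```python
from importlib import metadata

REQUIRED = "4.37.2"  # version pinned by this guide


def transformers_version_ok(required=REQUIRED):
    """Return (matches, installed_version); installed_version is None if absent."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return False, None
    return installed == required, installed


ok, installed = transformers_version_ok()
if not ok:
    print(f"Expected transformers=={REQUIRED}, found {installed}")
```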

Model Preparation

Model name	Type	Parameters	Download	Size
InternVL2-1B	MLLM	0.9B	🤗 HF link	1.8 GB
InternVL2-2B	MLLM	2.2B	🤗 HF link	4.2 GB
InternVL2-4B	MLLM	4.2B	🤗 HF link	7.8 GB
InternVL2-8B	MLLM	8.1B	🤗 HF link	16 GB
InternVL2-26B	MLLM	25.5B	🤗 HF link	48 GB
InternVL2-40B	MLLM	40.1B	🤗 HF link	75 GB
InternVL2-Llama3-76B	MLLM	76.3B	🤗 HF link	143 GB

Download the model weights you need from the table above and place them in the pretrained/ folder.

cd pretrained/
# pip install -U huggingface_hub
# Download OpenGVLab/InternVL2-1B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-1B --local-dir InternVL2-1B
# Download OpenGVLab/InternVL2-2B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-2B --local-dir InternVL2-2B
# Download OpenGVLab/InternVL2-4B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-4B --local-dir InternVL2-4B
# Download OpenGVLab/InternVL2-8B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-8B --local-dir InternVL2-8B
# Download OpenGVLab/InternVL2-26B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-26B --local-dir InternVL2-26B
# Download OpenGVLab/InternVL2-40B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-40B --local-dir InternVL2-40B
# Download OpenGVLab/InternVL2-Llama3-76B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-Llama3-76B --local-dir InternVL2-Llama3-76B
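
The same downloads can also be scripted from Python, which avoids repeating the command for every model. This is a sketch, not the official procedure: it assumes `huggingface_hub` is installed and uses its `snapshot_download` function, and the helper names `download_plan` and `download_models` are hypothetical. Unlike the shell commands above, it is meant to be run from the directory that contains pretrained/.

```python
from pathlib import Path

MODEL_NAMES = [
    "InternVL2-1B", "InternVL2-2B", "InternVL2-4B", "InternVL2-8B",
    "InternVL2-26B", "InternVL2-40B", "InternVL2-Llama3-76B",
]


def download_plan(names, root="pretrained"):
    """Map each Hub repo ID to the local folder it should be saved in."""
    return {f"OpenGVLab/{name}": str(Path(root) / name) for name in names}


def download_models(names, root="pretrained"):
    """Download the selected models (needs network access and disk space)."""
    from huggingface_hub import snapshot_download  # imported only when used

    for repo_id, local_dir in download_plan(names, root).items():
        snapshot_download(repo_id=repo_id, local_dir=local_dir)


# Example: fetch only the smallest model.
# download_models(["InternVL2-1B"])
```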

The directory structure is as follows.

pretrained
├── InternVL2-1B
├── InternVL2-2B
├── InternVL2-4B
├── InternVL2-8B
├── InternVL2-26B
├── InternVL2-40B
└── InternVL2-Llama3-76B
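
To check which of the expected folders are present after downloading, something like the following could be used. It is a small sketch; the helper name `missing_models` is hypothetical, and it only tests that each folder exists, not that the weights inside are complete.

```python
from pathlib import Path

EXPECTED = (
    "InternVL2-1B", "InternVL2-2B", "InternVL2-4B", "InternVL2-8B",
    "InternVL2-26B", "InternVL2-40B", "InternVL2-Llama3-76B",
)


def missing_models(root="pretrained", expected=EXPECTED):
    """Return the expected model folder names that do not exist under root."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]


print("Missing:", missing_models())
```

Since only the models you need have to be downloaded, a non-empty result is not necessarily an error.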

Model Loading

16-bit (bf16 / fp16)
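
A minimal sketch of 16-bit loading, assuming a CUDA GPU and that torch and transformers==4.37.2 are installed. The helper names `build_load_kwargs` and `load_16bit` are hypothetical; the keyword arguments are standard `from_pretrained` options, and `trust_remote_code=True` is needed because the checkpoints ship their own modeling code.

```python
def build_load_kwargs(precision="bf16"):
    """Keyword arguments for from_pretrained, with the dtype given by name."""
    dtype_name = {"bf16": "bfloat16", "fp16": "float16"}[precision]
    return {
        "torch_dtype": dtype_name,
        "low_cpu_mem_usage": True,
        "trust_remote_code": True,
    }


def load_16bit(path, precision="bf16"):
    """Load model and tokenizer in half precision on the current GPU."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    kwargs = build_load_kwargs(precision)
    # Turn the dtype name into the actual torch.dtype object.
    kwargs["torch_dtype"] = getattr(torch, kwargs["torch_dtype"])
    model = AutoModel.from_pretrained(path, **kwargs).eval().cuda()
    tokenizer = AutoTokenizer.from_pretrained(
        path, trust_remote_code=True, use_fast=False)
    return model, tokenizer


# Example (downloads or reads the weights, so it is left commented out):
# model, tokenizer = load_16bit("OpenGVLab/InternVL2-8B")
```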

BNB 8-bit Quantization
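
A sketch of 8-bit loading, assuming the bitsandbytes and accelerate packages are installed alongside transformers==4.37.2. `load_8bit` and `INT8_KWARGS` are hypothetical names; `load_in_8bit=True` is the `from_pretrained` option that enables bitsandbytes 8-bit quantization.

```python
INT8_KWARGS = {
    "load_in_8bit": True,       # quantize weights to 8-bit via bitsandbytes
    "low_cpu_mem_usage": True,
    "trust_remote_code": True,  # the checkpoints ship custom modeling code
}


def load_8bit(path):
    """Load model and tokenizer with 8-bit quantized weights."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    model = AutoModel.from_pretrained(
        path, torch_dtype=torch.bfloat16, **INT8_KWARGS).eval()
    tokenizer = AutoTokenizer.from_pretrained(
        path, trust_remote_code=True, use_fast=False)
    return model, tokenizer


# model, tokenizer = load_8bit("OpenGVLab/InternVL2-8B")
```

There is no `.cuda()` call here because a quantized model is placed on the GPU while it loads.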

BNB 4-bit Quantization
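
A sketch of 4-bit loading under the same assumptions (bitsandbytes and accelerate installed, transformers==4.37.2). `load_4bit` and `INT4_KWARGS` are hypothetical names; `load_in_4bit=True` is the `from_pretrained` option that enables bitsandbytes 4-bit quantization.

```python
INT4_KWARGS = {
    "load_in_4bit": True,       # quantize weights to 4-bit via bitsandbytes
    "low_cpu_mem_usage": True,
    "trust_remote_code": True,  # the checkpoints ship custom modeling code
}


def load_4bit(path):
    """Load model and tokenizer with 4-bit quantized weights."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    model = AutoModel.from_pretrained(
        path, torch_dtype=torch.bfloat16, **INT4_KWARGS).eval()
    tokenizer = AutoTokenizer.from_pretrained(
        path, trust_remote_code=True, use_fast=False)
    return model, tokenizer


# model, tokenizer = load_4bit("OpenGVLab/InternVL2-8B")
```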

Multiple GPUs

The code is written this way to avoid errors that occur during multi-GPU inference when tensors are not on the same device. Placing the first and last layers of the large language model (LLM) on the same device prevents these errors.

import math
import torch
from transformers import AutoTokenizer, AutoModel

def split_model(model_name):
    device_map = {}
    world_size = torch.cuda.device_count()
    num_layers = {
        'InternVL2-1B': 24, 'InternVL2-2B': 24, 'InternVL2-4B': 32, 'InternVL2-8B': 32,
        'InternVL2-26B': 48, 'InternVL2-40B': 60, 'InternVL2-Llama3-76B': 80}[model_name]
    # The first GPU also hosts the ViT, so treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map
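
To see how the layers end up distributed, the arithmetic in `split_model` can be reproduced without torch. The sketch below uses a hypothetical helper, `plan_layers`, that takes the GPU count as a parameter instead of calling `torch.cuda.device_count()`. It differs from `split_model` in one way: it skips the surplus slots that rounding up creates, so it only lists layers that exist.

```python
import math
from collections import Counter


def plan_layers(num_layers, world_size):
    """Return {layer_index: gpu_index} using the same split as split_model."""
    per_gpu = math.ceil(num_layers / (world_size - 0.5))
    quota = [per_gpu] * world_size
    quota[0] = math.ceil(quota[0] * 0.5)  # GPU 0 also hosts the ViT
    plan = {}
    layer = 0
    for gpu, count in enumerate(quota):
        for _ in range(count):
            if layer < num_layers:  # ignore slots beyond the last real layer
                plan[layer] = gpu
            layer += 1
    plan[num_layers - 1] = 0  # last layer joins the first on GPU 0
    return plan


# InternVL2-Llama3-76B (80 layers) on 8 GPUs
plan = plan_layers(80, 8)
print(sorted(Counter(plan.values()).items()))
```

For 80 layers on 8 GPUs this gives GPU 0 seven layers (six from its halved quota plus the final layer), GPUs 1 to 6 eleven layers each, and GPU 7 seven layers. The dictionary returned by `split_model` is intended to be passed as the `device_map` argument of `AutoModel.from_pretrained`, together with the same `torch_dtype`, `low_cpu_mem_usage`, and `trust_remote_code` options used for single-GPU loading.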