3.3.24 [LLM] Natural Language-Image Multimodal: Text-to-Image Generation, Image Captioning

mylee 2025. 9. 28. 21:26

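Overview of Natural Language-Image Multimodal Learning

Multimodal learning

Multimodal learning means processing and learning from different kinds of data together. For example, it integrates heterogeneous data such as text and images, text and speech, or images and sensor data within a single model or system.

The core of multimodal learning

Extract a representation for each modality, then combine them effectively and learn how they interact.

Combining text and images in particular enables AI applications that come close to intuitive human perception.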
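The two steps above (one representation per modality, then a combination) can be sketched in a few lines. This is only a toy illustration: the vectors stand in for encoder outputs, `fuse_concat` and `project` are hypothetical helper names, and the weight matrix is hand-written here, whereas a real model would learn it.

```python
def fuse_concat(text_vec, image_vec):
    """Late fusion by concatenation: join per-modality vectors into one."""
    return list(text_vec) + list(image_vec)


def project(vec, weights):
    """Linear projection: each output is a weighted sum of the fused vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


text_vec = [0.2, 0.8]         # stand-in for a text encoder output
image_vec = [0.5, 0.1, 0.4]   # stand-in for an image encoder output
fused = fuse_concat(text_vec, image_vec)

# Hypothetical 2x5 weight matrix; in practice these values are learned.
weights = [
    [1.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 1.0, 1.0],
]
joint = project(fused, weights)
print(fused)
print(joint)
```

Concatenation is the simplest way to combine modalities; attention-based fusion, which lets one modality weight the other, is a common alternative.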

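Representative applications

Image captioning: describe visual information as a sentence (img → txt)

Text-to-image generation: render an imagined scene as a picture (txt → img)

Image retrieval: find matching images from a text query

VQA (Visual Question Answering): answer natural-language questions about an image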

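Image Captioning

The task of generating a natural-language sentence that describes the content of a given image.

Humans naturally come up with a description when they see an image, but for a machine this involves a complex pipeline of recognition and generation.

Basic architecture: CNN + RNN

CNN (Convolutional Neural Network)

: takes an image as input and extracts its important visual features, e.g. ResNet, Inception

RNN (Recurrent Neural Network) or LSTM/GRU

: generates a sentence by predicting words one at a time, conditioned on the features extracted by the CNN.

Attention Mechanism

To overcome the limitation of the basic model, which compresses the whole image into a single feature vector, the attention mechanism was introduced so that the model can focus on specific regions of the image.

- "Show, Attend and Tell" is the representative model: it assigns a weight to each image region at every step of sentence generation, much like a person looking longer at certain parts while describing a scene.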
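The per-region weighting can be made concrete with a minimal soft-attention step. Note the simplification: Show, Attend and Tell scores regions with a small learned network, while this sketch assumes a plain dot product between each region feature and the decoder state. The feature values and the `attend` helper are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, query):
    """Return (weights, context) for one decoding step."""
    # Score each region against the decoder state (dot product, an assumption).
    scores = [sum(f * q for f, q in zip(region, query)) for region in region_features]
    weights = softmax(scores)
    dim = len(region_features[0])
    # Context vector: weighted average of the region features.
    context = [
        sum(w * region[d] for w, region in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, context


regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy image regions
decoder_state = [2.0, 0.0]                      # toy decoder hidden state
weights, context = attend(regions, decoder_state)
print(weights)
print(context)
```

Regions that match the decoder state get larger weights, so the context vector fed to the next word prediction is dominated by those regions.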

Representative models

Show and Tell

Show, Attend and Tell

Transformer-based models

BLIP, BLIP-2 (2022~2023)

Datasets

MS COCO

Flickr8k, Flickr30k

NoCaps, Conceptual Captions

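Text-to-Image Generation

A technique in which AI takes user-written text as input and generates a matching image.

For example, given the sentence "sunrise over Earth seen from a space station", the model automatically generates a high-resolution image that fits it.

The technology is attracting a lot of attention in advertising, art, games, design, and film concept art as a tool that complements human creativity.

CLIP (Contrastive Language-Image Pre-training)
DALL-E
Stable Diffusion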

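DrawBench evaluation

DrawBench provides a common basis for comparing and evaluating text-to-image models. Proposed by Google alongside Imagen, the benchmark presents prompts of many different types and compares how appropriate and creative each model's generated images are.

👉🏻 Main evaluation criteria

- accuracy, diversity, realism, creativity

👉🏻 Models compared

- Imagen (Google), Parti (Google Brain), DALL-E 2, Stable Diffusion, etc.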
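Comparisons of this kind are typically scored by people rather than by an automatic metric: raters see outputs from two models for the same prompt and pick the one they prefer. Assuming that setup, the aggregation step might look like the sketch below. The votes are made-up placeholder data, and `preference_rates` is a hypothetical helper, not part of any benchmark toolkit.

```python
from collections import Counter


def preference_rates(votes):
    """Fraction of votes each option received."""
    counts = Counter(votes)
    total = len(votes)
    return {choice: counts[choice] / total for choice in counts}


# Made-up rater votes for one prompt category (illustration only).
votes = ["model_a", "model_b", "model_a", "tie", "model_a"]
rates = preference_rates(votes)
print(rates)
```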

Practice (basic Image Captioning architecture: CNN + RNN, Colab)

# vocabulary
vocab = {
    0: "<pad>",
    1: "<start>",
    2: "<end>",
    3: "a",
    4: "dog",
    5: "is",
    6: "sitting",
    7: "on",
    8: "grass"
}

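# image preprocessing
from torchvision import transforms
from PIL import Image

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    # ImageNet statistics expected by the pretrained VGG weights
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

image = Image.open('dog.png').convert('RGB')
image_tensor = transform(image).unsqueeze(0)	# (1, 3, 224, 224)

# load VGG (extract features from the image)
from torchvision.models import vgg16, VGG16_Weights
import torch

# weights= replaces the deprecated pretrained=True (torchvision 0.13+)
vgg = vgg16(weights=VGG16_Weights.DEFAULT).features	# feature-extraction layers
vgg.eval()
for param in vgg.parameters():
    param.requires_grad = False

with torch.no_grad():	# no gradient tracking
    features = vgg(image_tensor)	# (1, 512, 7, 7)
    features = features.view(features.size(0), -1).unsqueeze(1)	# (1, 1, 25088)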
    
# build the training input/target sequences from the vocabulary
caption = [1, 3, 4, 5, 6, 7, 8, 2]
input_seq = torch.tensor([caption[:-1]])
target_seq = torch.tensor([caption[1:]])

# RNN-based caption generator (image features + previous word -> next-word prediction)
import torch.nn as nn

class CaptionGenerator(nn.Module):
    def __init__(self, feature_dim, embed_dim, hidden_dim, vocab_size):
        super(CaptionGenerator, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)
        self.init_linear = nn.Linear(feature_dim, embed_dim)

    def forward(self, features, captions):
        embedded_features = self.init_linear(features)
        embeds = self.embed(captions)
        inputs = torch.cat((embedded_features, embeds), dim=1)
        hiddens, _ = self.lstm(inputs)
        outputs = self.decoder(hiddens)
        return outputs[:, 1:, :]

import torch.optim as optim

# train the model
model = CaptionGenerator(feature_dim=25088, embed_dim=256, hidden_dim=512, vocab_size=len(vocab))
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(20):
    model.train()
    optimizer.zero_grad()

    outputs = model(features, input_seq)
    loss = criterion(outputs.squeeze(0), target_seq.squeeze())
    loss.backward()
    optimizer.step()
    
# inference (greedy decoding)
model.eval()
with torch.no_grad():
    generated =[]
    input_word = torch.tensor([[1]])
    embed_feat = model.init_linear(features)
    hidden = None

    for _ in range(10):
        embed_input = model.embed(input_word)
        lstm_input = torch.cat((embed_feat, embed_input), dim=1) if len(generated) == 0 else embed_input
        out, hidden = model.lstm(lstm_input, hidden)
        pred = model.decoder(out[:, -1, :])
        pred_id = pred.argmax(dim=-1).item()

        if pred_id == 2:
            break

        generated.append(pred_id)
        input_word = torch.tensor([[pred_id]])
        embed_feat = None
    
    sentence = " ".join([vocab[idx] for idx in generated])
    print("Generated caption:", sentence)		# Generated caption: a dog is sitting on grass
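Two numbers in the practice code above are easy to take on faith: `feature_dim=25088` and the one-token shift between `input_seq` and `target_seq`. The sketch below works both out in plain Python. `vgg16_feature_dim` is a hypothetical helper written for this note; it assumes the standard VGG16 layout of five 2x2 max-pool layers and 512 output channels.

```python
def vgg16_feature_dim(height, width, channels=512, num_pools=5):
    """Flattened size of VGG16's conv output for a given input resolution."""
    for _ in range(num_pools):
        height //= 2   # each 2x2 max-pool halves the spatial size
        width //= 2
    return channels * height * width


feature_dim = vgg16_feature_dim(224, 224)
print(feature_dim)  # 25088 = 512 * 7 * 7

# Teacher forcing: the input is the caption without its last token,
# the target is the caption without its first token.
vocab = {0: "<pad>", 1: "<start>", 2: "<end>", 3: "a", 4: "dog",
         5: "is", 6: "sitting", 7: "on", 8: "grass"}
caption = [1, 3, 4, 5, 6, 7, 8, 2]
pairs = list(zip(caption[:-1], caption[1:]))
for prev_id, next_id in pairs:
    print(f"{vocab[prev_id]:>8} -> {vocab[next_id]}")
```

Each printed pair is one training example: given the previous word (plus the image feature), predict the next word.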

Practice (CLIP & BLIP, Colab)

CLIP (Contrastive Language-Image Pre-training)

!pip install git+https://github.com/openai/CLIP.git

import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, processor = clip.load("ViT-B/32", device=device)

from PIL import Image

image = processor(Image.open("dog.png")).unsqueeze(0).to(device)
caption_options = [
    "a dog on the grass",
    "a cat on the grass",
    "a pug sitting",
    "a cat on the table"
]
captions = clip.tokenize(caption_options).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(captions)
    logits_per_image, _ = model(image, captions)	# similarity score of each caption with respect to the image
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(image_features)
print(text_features)
print("Best caption picked by CLIP:", caption_options[probs.argmax()])	# Best caption picked by CLIP: a dog on the grass
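The code above prints `image_features` and `text_features` but takes the scores from `model(image, captions)`. Conceptually, those scores are scaled cosine similarities between the two sets of features. The sketch below reproduces that calculation on toy vectors in plain Python. The three-dimensional vectors and the `logit_scale=100.0` default are assumptions for illustration; the real model uses its own learned scale and much larger embeddings.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def clip_style_probs(image_feat, text_feats, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and N texts."""
    img = l2_normalize(image_feat)
    logits = []
    for text_feat in text_feats:
        txt = l2_normalize(text_feat)
        logits.append(logit_scale * sum(a * b for a, b in zip(img, txt)))
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


image_feat = [0.9, 0.1, 0.0]          # toy image embedding
text_feats = [
    [1.0, 0.0, 0.0],                  # toy embedding of caption 0
    [0.0, 1.0, 0.0],                  # toy embedding of caption 1
    [0.5, 0.5, 0.0],                  # toy embedding of caption 2
]
probs = clip_style_probs(image_feat, text_feats)
best = max(range(len(probs)), key=lambda i: probs[i])
print(probs)
print("best caption index:", best)
```

Because the similarities are multiplied by a large scale before the softmax, even a modest gap in cosine similarity becomes a near one-hot probability.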

BLIP (Bootstrapping Language-Image Pre-training)

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open('dog.png').convert('RGB')
inputs = processor(images=image, return_tensors='pt')

output = model.generate(**inputs)
print(output)
# tensor([[30522,  1037,  2235,  2317, 17022,  3564,  2006,  1037,  4799,  2723,
#            102]])	# token ids (still encoded)

print("Caption generated by BLIP:", processor.decode(output[0], skip_special_tokens=True))
# Caption generated by BLIP: a small white puppy sitting on a wooden floor

Practice (BLIP: VQA, Colab)

https://huggingface.co/Salesforce/blip-vqa-base

from transformers import BlipProcessor, BlipForQuestionAnswering
from PIL import Image

model_id = "Salesforce/blip-vqa-base"

processor = BlipProcessor.from_pretrained(model_id)
model = BlipForQuestionAnswering.from_pretrained(model_id)

img_path = 'dog.png'
image = Image.open(img_path).convert('RGB')

question = "What is the dog doing?"

inputs = processor(
    images=image,
    text=question,
    return_tensors="pt",
)

print(inputs)

import torch

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=50,
        num_beams=3
    )

print(outputs)
print("Q:", question)
print("A:", processor.decode(outputs[0], skip_special_tokens=True))
"""
tensor([[30522,  3564,   102]])
Q: What is the dog doing?
A: sitting
"""

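Practice (text-to-image generation with the midjourney-mini model, Colab)

Calling the Inference API

API_URL = "https://api-inference.huggingface.co/models/midjourney-community/midjourney-mini"
headers = {"Authorization": "Bearer hf_token"}	# replace hf_token with your own Hugging Face token

import requests

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    response.raise_for_status()	# fail early instead of treating an error body as image bytes
    return response.content

# note: at the time of writing this endpoint responds with Not Found,
# so the call below raises an HTTPError
results = query({
    "inputs": "3 puppies eating doughnuts deliciously in space"
})

results

from PIL import Image
import io

image = Image.open(io.BytesIO(results))
image.save("result.png")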
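Since the endpoint can return an error message instead of an image, it may help to check the raw bytes before handing them to PIL. The sketch below uses the well-known PNG and JPEG file signatures. `looks_like_image` is a hypothetical helper written for this note, not something provided by `requests` or PIL.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"   # first 8 bytes of every PNG file
JPEG_SIGNATURE = b"\xff\xd8\xff"       # first 3 bytes of every JPEG file


def looks_like_image(data: bytes) -> bool:
    """Cheap check that a response body starts like a PNG or JPEG file."""
    return data.startswith(PNG_SIGNATURE) or data.startswith(JPEG_SIGNATURE)


print(looks_like_image(PNG_SIGNATURE + b"...rest of file..."))
print(looks_like_image(b"Not Found"))
```

In the notebook this could guard the `Image.open(io.BytesIO(results))` call, printing the response body when the check fails.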

Using DiffusionPipeline

!pip install diffusers

from diffusers import DiffusionPipeline
import os
from uuid import uuid4
from google.colab import userdata

HF_TOKEN = userdata.get("HF_TOKEN")

pipeline = DiffusionPipeline.from_pretrained("midjourney-community/midjourney-mini", token=HF_TOKEN)

def generate_image(prompt: str, save_dir: str = './generated_img'):
    try:
        os.makedirs(save_dir, exist_ok=True)	# create the output directory if needed

        file_name = f"{uuid4()}.png"
        file_path = os.path.join(save_dir, file_name)

        image = pipeline(prompt).images[0]
        image.save(file_path)

        return file_path
    except Exception as e:
        print(f"Error while generating the image: {e}")
        return None

generate_image(input("Enter a prompt to turn into an image: "))
"""
Enter a prompt to turn into an image: 3 puppies eating doughnuts deliciously in space
"""

Practice (Stable Diffusion, Colab)

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
# float16 is meant for GPU inference; fall back to float32 on CPU
dtype = torch.float16 if device == "cuda" else torch.float32

from diffusers import StableDiffusionPipeline

model_id = "runwayml/stable-diffusion-v1-5"
pipeline = StableDiffusionPipeline.from_pretrained(
    model_id,
    torch_dtype=dtype,
    use_safetensors=True
).to(device)

def set_seed(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    
set_seed(42)

prompt = "A tranquil mountain landscape at sunset, watercolor style."

results = pipeline(
    prompt,
    num_inference_steps=50,
    guidance_scale=7.5
)
display(results.images[0])

prompts = [
    "A futuristic city skyline at sunrise",
    "Portrait of a medieval knight in armor",
    "A colorful abstract painting featuring geometric shapes"
]

results = pipeline(
    prompts,
    num_inference_steps=50,
    guidance_scale=7.5,
    width=768,
    height=512
)
print(results.images)

import matplotlib.pyplot as plt

images = results.images
n = len(images)

plt.figure(figsize=(4, 3 * n))
for i, image in enumerate(images):
    plt.subplot(n, 1, i+1)
    plt.imshow(image)
    plt.title(prompts[i])
    plt.axis("off")
plt.tight_layout()
plt.show()
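The `guidance_scale=7.5` argument used above controls classifier-free guidance. At each denoising step the model makes two noise predictions, one with an empty prompt and one with the real prompt, and the scale decides how far to push toward the prompted one. A minimal sketch of that combination follows, using made-up two-element "noise" vectors in place of real latents. `apply_guidance` is a name chosen for this note.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Classifier-free guidance: move from the unconditional prediction toward the prompted one."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


noise_uncond = [0.0, 1.0]   # toy prediction for an empty prompt
noise_text = [1.0, 1.0]     # toy prediction for the actual prompt

print(apply_guidance(noise_uncond, noise_text, 1.0))   # [1.0, 1.0]
print(apply_guidance(noise_uncond, noise_text, 7.5))   # [7.5, 1.0]
```

A scale of 1.0 simply returns the prompted prediction. Larger values exaggerate the difference between the two predictions, which usually makes images follow the prompt more closely at some cost in variety.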

Practice (DALL-E, Colab)

from google.colab import userdata
import os

os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')

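Basic DALL-E prompting

1. Prompt structure: writing in the order subject → action/situation → style gives stable results.

2. Use the style parameter to switch between a vivid (intense) and a natural tone.

3. Adding quality="hd" gives a sharper image, but takes longer and costs more.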
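The ordering in tip 1 is easy to enforce with a tiny helper. This is just one possible sketch: `build_prompt` is a hypothetical function, and joining the parts with commas is an assumption about formatting, not a requirement of the API.

```python
def build_prompt(subject, action, style):
    """Join the parts in subject -> action/situation -> style order, skipping empty ones."""
    parts = [p.strip() for p in (subject, action, style) if p and p.strip()]
    return ", ".join(parts)


prompt = build_prompt(
    subject="a small squirrel",
    action="sitting inside a mug",
    style="watercolor illustration",
)
print(prompt)
```

The resulting string can be passed as the `prompt` argument of `client.images.generate`, which keeps the three parts easy to vary independently.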

from openai import OpenAI

client = OpenAI()

response = client.images.generate(
    model='dall-e-3',
    prompt='머그컵 안에 작은 다람쥐가 들어있는 그림',	# Korean: "a picture of a small squirrel inside a mug"
    size='1024x1024',
    quality='hd',
    n=1
)

print(response)
print("Generated image:", response.data[0].url)		# open the URL to view the image (valid for about 1 hour)
print("Revised prompt:", response.data[0].revised_prompt)
"""
ImagesResponse(created=1758587030, background=None, data=[Image(b64_json=None, revised_prompt='An illustration of a small squirrel snugly nestled inside a mug', url='https://oaidalleapiprodscus.blob.core.windows.net/private/org-5SQhhPsxMy8IOdDCYA1twagr/user-DtVAcy2wGhnWC4aizlKsBFwe/img-DvhZFLc4XOukPRTjAulaSCG2.png?st=2025-09-22T23%3A23%3A50Z&se=2025-09-23T01%3A23%3A50Z&sp=r&sv=2024-08-04&sr=b&rscd=inline&rsct=image/png&skoid=6e4237ed-4a31-4e1d-a677-4df21834ece0&sktid=a48cca56-e6da-484e-a814-9c849652bcb3&skt=2025-09-22T01%3A38%3A36Z&ske=2025-09-23T01%3A38%3A36Z&sks=b&skv=2024-08-04&sig=j167HdAKRf/jqsFJJVoYdKMTdublF8i7wtYqwg369vo%3D')], output_format=None, quality=None, size=None, usage=None)
Generated image: https://oaidalleapiprodscus.blob.core.windows.net/private/org-5SQhhPsxMy8IOdDCYA1twagr/user-DtVAcy2wGhnWC4aizlKsBFwe/img-DvhZFLc4XOukPRTjAulaSCG2.png?st=2025-09-22T23%3A23%3A50Z&se=2025-09-23T01%3A23%3A50Z&sp=r&sv=2024-08-04&sr=b&rscd=inline&rsct=image/png&skoid=6e4237ed-4a31-4e1d-a677-4df21834ece0&sktid=a48cca56-e6da-484e-a814-9c849652bcb3&skt=2025-09-22T01%3A38%3A36Z&ske=2025-09-23T01%3A38%3A36Z&sks=b&skv=2024-08-04&sig=j167HdAKRf/jqsFJJVoYdKMTdublF8i7wtYqwg369vo%3D
Revised prompt: An illustration of a small squirrel snugly nestled inside a mug
"""

👉🏻 Changing the tone with the style parameter

from IPython.display import Image, display	# note: this Image shadows the PIL Image imported earlier

vivid_res = client.images.generate(
    model='dall-e-3',
    prompt='우주복을 입은 다람쥐가 우주를 유영하는 그림',	# Korean: "a squirrel in a spacesuit floating through space"
    size='1024x1024',
    style='vivid'
)

natural_res = client.images.generate(
    model='dall-e-3',
    prompt='우주복을 입은 다람쥐가 우주를 유영하는 그림',	# Korean: "a squirrel in a spacesuit floating through space"
    size='1024x1024',
    style='natural'
)

print(vivid_res.data[0].revised_prompt)
print(natural_res.data[0].revised_prompt)
display(Image(url=vivid_res.data[0].url, width=500))
display(Image(url=natural_res.data[0].url, width=500))

subject = '해적 선장의 초상화'	# Korean: "portrait of a pirate captain"
styles = [
    "oil painting by Rembrandt, baroque style",
    "steampunk digital illustration, artstation style",
    "photorealistic 35mm film, shallow depth of field",
    "pixel-art, 16-bit retro game sprite"
]

for style in styles:
    prompt = f'{subject}, {style}'
    response = client.images.generate(
        model='dall-e-3',
        prompt=prompt,
        size='1024x1024',
        style='vivid'
    )
    print(f'{style}:')
    display(Image(url=response.data[0].url, width=500))

Specific artist styles

- Claude Monet, Vincent van Gogh, Alphonse Mucha, Gustav Klimt, Katsushika Hokusai

import time

artists = [
    ("Vincent van Gogh",
     "A star-lit night over a futuristic city, swirling sky"),
    ("Gustav Klimt",
     "A golden forest goddess standing by a river, rich ornamental patterns"),
    ("Katsushika Hokusai",
     "A giant wave crashing over a cyberpunk harbor, traditional woodblock print aesthetic"),
    ("Claude Monet", "A serene lily pond at twilight, impressionist oil painting"),
    ("Alphonse Mucha", "An elegant Art Nouveau poster of an electric car, flowing lines"),
]

for idx, (artist, subject) in enumerate(artists):
    print(idx, f'[{artist}]', subject)
    prompt = f'{subject}, in the style of {artist}'

    response = client.images.generate(
        model='dall-e-3',
        prompt=prompt,
        size='1024x1024',
        style='vivid'
    )

    display(Image(url=response.data[0].url, width=300))
    time.sleep(10)

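👉🏻 Saving the generated image

import requests

# response here is the result of the last images.generate call above
img_response = requests.get(response.data[0].url)
img_response.raise_for_status()	# the signed URL expires, so fail loudly if it is no longer valid
img_data = img_response.content
filename = 'image.png'

with open(filename, 'wb') as f:
    f.write(img_data)

# download in the Google Colab environment
from google.colab import files

files.download(filename)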
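Because the returned URL is only valid for a limited time, another option is to ask the API for the image bytes directly by passing `response_format='b64_json'` to `client.images.generate` and reading `response.data[0].b64_json`. The sketch below covers only the decoding and saving step, so it runs without an API key. The helper names are hypothetical, and the demo payload is a stand-in rather than a real image.

```python
import base64


def decode_b64_image(b64_json: str) -> bytes:
    """Decode the base64 string returned in the b64_json field into raw bytes."""
    return base64.b64decode(b64_json)


def save_b64_image(b64_json: str, path: str) -> int:
    """Write the decoded image to disk and return the number of bytes written."""
    data = decode_b64_image(b64_json)
    with open(path, "wb") as f:
        f.write(data)
    return len(data)


# Stand-in payload (not a real image) to show the round trip.
fake_b64 = base64.b64encode(b"fake image bytes").decode("ascii")
print(decode_b64_image(fake_b64))
```

With a real response, `save_b64_image(response.data[0].b64_json, 'image.png')` would replace the `requests.get` download.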

License: CC BY-NC-ND (Attribution, NonCommercial, NoDerivatives)
