꿈 많은 사람의 이야기

How to receive streaming responses from an Ollama LLM - getting Ollama responses in real time (feat. Streamlit)

이수진's blog, 2025. 4. 28. 09:49


Overview

This post explains how to receive LLM responses in streaming form from Ollama, a tool widely used to run LLMs in a local environment. It covers sending requests to Ollama directly, calling it with Python requests, relaying it through a Python FastAPI server, and displaying the output in a web page built with Python Streamlit, which is often used for PoCs (proofs of concept).

This post does not introduce Ollama itself. If you want to know more about Ollama, which runs LLMs locally and can also serve them, please see my earlier post or other articles.

- What is Ollama?: https://lsjsj92.tistory.com/666


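Main content

As mentioned in the overview, this post covers how to receive an LLM's response from Ollama in real time, in streaming form. It walks through the following four methods.

1. Receiving a streaming response when talking to Ollama directly with curl

2. Receiving a streaming response when calling the Ollama API with Python requests

3. Receiving a streaming response when talking to Ollama over an API with Python FastAPI

4. Displaying Ollama's streaming output in a Python Streamlit page

Let's go through them one by one. For reference, the LLM I used in Ollama is llama3.2-bllossom-3b-kr.

1. Receiving a streaming response when talking to Ollama directly with curl

Ollama provides endpoints for communicating with a model through a REST API. You send a query (the user's request) to the LLM through the API and receive a response, and with streaming you can watch the response as it is generated.

The most basic approach is to send the streaming request with curl's -N option. -N disables curl's output buffering so that each piece of the response is printed as soon as it arrives. Below is an example curl command that sends a streaming request to the Ollama API.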

curl -N http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2-bllossom-3b-kr:latest",
    "prompt": "안녕하세요! 인공지능에 대해 간단히 설명해주세요.",
    "stream": true
  }'

The curl command above uses the "stream": true parameter to ask Ollama to return the response as a stream. Running it prints the output of the LLM that Ollama is serving as it is generated. The raw output is hard to read, however, because it comes back as JSON: each time a token is generated, it is returned as a separate JSON object. Each object carries metadata such as the model name, creation time, response text, and completion flag, which makes the stream awkward for a person to read.
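To make that format concrete, here is a minimal sketch of how two lines of such a stream might decode. The sample values (timestamps, text) are invented for illustration, and a real final line typically carries additional timing metadata that is omitted here.

```python
import json

# Two illustrative lines of a streamed reply (values are invented).
sample_lines = [
    '{"model": "llama3.2-bllossom-3b-kr:latest", "created_at": "2025-04-28T00:00:00Z", "response": "안녕", "done": false}',
    '{"model": "llama3.2-bllossom-3b-kr:latest", "created_at": "2025-04-28T00:00:01Z", "response": "", "done": true}',
]

def describe(line: str) -> tuple[str, bool]:
    """Return the text fragment and the completion flag carried by one line."""
    obj = json.loads(line)
    return obj.get("response", ""), bool(obj.get("done", False))

fragments = [describe(line) for line in sample_lines]
for text, done in fragments:
    print(repr(text), done)
```

Only the "response" field is needed to rebuild the answer; "done" tells the reader when to stop.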

To get cleaner output, you can build a pipeline that extracts only the part of each response that you need.

Still using curl, the command can be modified as follows.

curl -sN http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2-bllossom-3b-kr:latest",
    "prompt": "안녕하세요! 인공지능에 대해 간단히 설명해주세요.",
    "stream": true
  }' | while IFS= read -r line; do
    response=$(echo "$line" | grep -o '"response":"[^"]*"' | sed 's/"response":"//;s/"$//')
    printf "%s" "$response"
done

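The command works as follows.

1. The -s option hides curl's progress information so the output stays clean.

2. The -N option disables buffering, as described earlier.

3. The while loop after the pipe ( | ) processes each line in turn.

4. while IFS= read -r line reads the input one line at a time. IFS= clears the input field separator so that the whole line, including leading and trailing whitespace, is preserved, and -r keeps backslashes from being interpreted.

5. grep -o '"response":"[^"]*"' extracts only the response text field from the JSON.

6. sed 's/"response":"//;s/"$//' strips the field name and the quotes from the extracted string.

Running this curl command prints only the text, without the JSON wrapper that the first command showed. Note that the pattern [^"]* stops at the first double quote, so a fragment containing an escaped quote (\") is cut short, and escape sequences such as \n are printed literally rather than as line breaks.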
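As a small illustration of why a JSON parser is more robust than a quote-delimited pattern, the sketch below runs both on one hypothetical line. The line is built with json.dumps, which puts a space after the colon, so the pattern allows optional whitespace there; otherwise it mirrors the grep expression.

```python
import json
import re

# A hypothetical line whose text contains escaped quotes and a newline.
line = json.dumps({"response": 'He said "hi"\n', "done": False})

# Same idea as the grep pattern: capture everything up to the next double quote.
match = re.search(r'"response":\s*"([^"]*)"', line)
regex_text = match.group(1) if match else ""

# A JSON parser undoes the escaping and returns the complete text.
json_text = json.loads(line)["response"]

print(repr(regex_text))
print(repr(json_text))
```

The pattern stops at the backslash before the first escaped quote, while json.loads recovers the full string including the line break.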

2. Receiving a streaming response when calling the Ollama API with Python requests

The curl approach is handy for quick tests in a terminal, but when you are building a more complex application or working in Python, the requests library is more convenient. Let's look at how to call the Ollama API and process the response from Python.

The following Python code is an example that talks to the Ollama API using the requests library.

import requests
import json
import sys

def stream_ollama_response(prompt, model="llama3.2-bllossom-3b-kr:latest", api_url="http://localhost:11434/api/generate"):
    """
    Stream responses from an Ollama model
    
    Args:
        prompt (str): The input text to send to the model
        model (str): The Ollama model to use
        api_url (str): The Ollama API URL
    """
    # Prepare the request payload
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": True  # Enable streaming
    }
    
    print("\nStreaming response from model:", model)
    print("-" * 50)
    
    # Make the request with streaming enabled
    with requests.post(api_url, json=payload, stream=True) as response:
        if response.status_code != 200:
            print(f"Error: Received status code {response.status_code}")
            print(response.text)
            return
        
        # Process the streaming response
        full_response = ""
        for line in response.iter_lines():
            if line:
                # Decode the JSON line
                try:
                    json_data = json.loads(line.decode('utf-8'))
                    
                    # Extract and print the response chunk
                    if 'response' in json_data:
                        chunk = json_data['response']
                        sys.stdout.write(chunk)
                        sys.stdout.flush()
                        full_response += chunk
                    
                    # Check if this is the final response
                    if json_data.get('done', False):
                        break
                        
                except json.JSONDecodeError as e:
                    print(f"Error decoding JSON: {e}")
                    print(f"Received data: {line.decode('utf-8')}")
        
        print("\n" + "-" * 50)
        return full_response

def main():
    # Constants
    OLLAMA_MODEL = "llama3.2-bllossom-3b-kr:latest"
    OLLAMA_API_URL = "http://localhost:11434/api/generate"
    
    # Get user input or use a default prompt
    if len(sys.argv) > 1:
        user_prompt = " ".join(sys.argv[1:])
    else:
        user_prompt = input("Enter your prompt (or press Enter for a default Korean prompt): ")
        if not user_prompt:
            user_prompt = "안녕하세요! 인공지능에 대해 간략하게 설명해주세요."
    
    # Stream the response
    stream_ollama_response(user_prompt, OLLAMA_MODEL, OLLAMA_API_URL)

if __name__ == "__main__":
    main()

This code talks to the Ollama API by sending a POST request. Setting stream=True lets the response be processed in real time as it arrives. Because the data comes back as JSON, each line is parsed as JSON, and the actual text is extracted from json_data['response'].

sys.stdout.write(chunk) followed by sys.stdout.flush() then prints each piece to the console immediately.

When you run this Python code, the response is printed in streaming form as it arrives.
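If you would rather consume the fragments than print them, the parsing step can be separated from the network call. The following is a minimal sketch under that assumption: iter_text_chunks is a hypothetical helper name, and the byte strings stand in for what response.iter_lines() would produce.

```python
import json
from typing import Iterable, Iterator

def iter_text_chunks(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the text fragment of each JSON line until a line reports done=true."""
    for raw in lines:
        if not raw:
            continue  # skip blank keep-alive lines
        try:
            data = json.loads(raw.decode("utf-8"))
        except json.JSONDecodeError:
            continue  # skip a malformed line rather than abort the stream
        chunk = data.get("response", "")
        if chunk:
            yield chunk
        if data.get("done", False):
            break

# Stand-in for response.iter_lines(); the contents are invented.
fake_lines = [
    b'{"response": "Hello", "done": false}',
    b"",
    b"not json",
    b'{"response": " world", "done": false}',
    b'{"response": "", "done": true}',
    b'{"response": "ignored", "done": false}',
]
result = "".join(iter_text_chunks(fake_lines))
print(result)
```

With requests, the same generator could be fed response.iter_lines() directly, which keeps the parsing logic testable without a running Ollama server.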

3. Receiving a streaming response when talking to Ollama over an API with Python FastAPI

So far we have talked to the Ollama API directly. In a real service, however, you may place an intermediate API server between the client and Ollama to manage the communication. This section builds such a server with Python's FastAPI framework and uses it to talk to Ollama in streaming mode. Below is the code for a streaming API server built with FastAPI.

from typing import Optional

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import httpx
import json

app = FastAPI()

OLLAMA_API_URL = "http://localhost:11434/api/generate"
OLLAMA_MODEL = "llama3.2-bllossom-3b-kr:latest"  # example model name

class PromptRequest(BaseModel):
    prompt: str
    model: Optional[str] = None  # optional per-request model override

@app.post("/generate-stream")
async def generate_stream(request: PromptRequest):
    # Use the model named in the request if there is one, otherwise the default
    model = request.model if request.model else OLLAMA_MODEL

    payload = {
        "model": model,
        "prompt": request.prompt,
        "stream": True
    }

    async def event_stream():
        try:
            async with httpx.AsyncClient(timeout=None) as client:
                async with client.stream("POST", OLLAMA_API_URL, json=payload) as response:
                    if response.status_code != 200:
                        error_content = await response.aread()
                        yield f"Error: {error_content.decode('utf-8')}"
                        return

                    async for line in response.aiter_lines():
                        if line.strip():  # skip blank lines
                            try:
                                data = json.loads(line)
                                content = data.get("response", "")
                                if content:
                                    yield content

                                # Stop once Ollama reports the response is complete
                                if data.get("done", False):
                                    break

                            except json.JSONDecodeError:
                                continue
        except Exception as e:
            yield f"Error while streaming: {str(e)}"

    return StreamingResponse(event_stream(), media_type="text/plain")

# Health check endpoint
@app.get("/health")
async def health_check():
    return {"status": "ok", "model": OLLAMA_MODEL}


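This FastAPI code acts as a relay: it receives the client's request, forwards it to the Ollama API, and passes Ollama's response back to the client as a stream. The key points are:

- API endpoint: the /generate-stream endpoint receives the user's prompt.

- StreamingResponse is used to return the response to the client as a stream.

- The asynchronous HTTP client from the httpx library handles asynchronous communication with the Ollama API.

- The core is the event_stream function, which processes the Ollama API's response in real time and streams it to the client.

- async for line in response.aiter_lines() reads Ollama's response line by line asynchronously, and only the needed data is extracted and passed on to the client.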
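The relay pattern in these points can be tried without FastAPI or a running Ollama server. The sketch below is a stdlib-only approximation: fake_upstream is a hypothetical stand-in for response.aiter_lines(), and relay plays the role that event_stream plays in the server.

```python
import asyncio
import json
from typing import AsyncIterator

async def fake_upstream() -> AsyncIterator[str]:
    """Stand-in for response.aiter_lines(); the lines are invented."""
    for line in ['{"response": "Hi", "done": false}',
                 "",
                 '{"response": " there", "done": false}',
                 '{"response": "", "done": true}']:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield line

async def relay(lines: AsyncIterator[str]) -> AsyncIterator[str]:
    """Forward only the text of each upstream line, stopping at done=true."""
    async for line in lines:
        if not line.strip():
            continue
        data = json.loads(line)
        if data.get("response"):
            yield data["response"]
        if data.get("done", False):
            break

async def collect() -> list[str]:
    return [chunk async for chunk in relay(fake_upstream())]

chunks = asyncio.run(collect())
print(chunks)
```

Because relay is an async generator, each fragment is handed on as soon as it is parsed, which is what lets a streaming response start before the upstream reply has finished.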

Once started, the FastAPI server should come up normally.

I ran it with uvicorn app:app --port 8004 --reload so that it listens on port 8004. If you also want to specify the host, run uvicorn app:app --host 0.0.0.0 --port 8004 --reload.

Now let's call the API and check that the result comes out correctly. There are two main ways to send a request to the FastAPI server.

1. Using curl

curl -N -X POST http://localhost:8004/generate-stream \
     -H "Content-Type: application/json" \
     -d '{"prompt": "안녕하세요?"}'

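As before, this command uses the -N option to disable buffering and print the streaming response in real time. Because the FastAPI server already extracts the plain text, the curl output is clean without any further processing.

Running the curl command prints the model's answer as plain text while it is generated.

2. Using Python code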

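import asyncio
import httpx

async def main():
    url = "http://localhost:8004/generate-stream"
    payload = {
        "prompt": "안녕하세요? 제 이름은 이수진이라고 합니다."
    }

    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", url, json=payload) as response:
            # Read decoded text chunks; aiter_text() decodes incrementally,
            # so a multi-byte character split across chunks is handled
            async for text in response.aiter_text():
                if text:
                    print(text, end="", flush=True)

asyncio.run(main())

This Python code uses the httpx library to send a POST request to the FastAPI server and handles the streaming response in real time. response.aiter_text() yields the response as decoded text chunks, which are printed as they arrive. Decoding raw chunks from response.aiter_bytes() one at a time can fail when a multi-byte UTF-8 character, such as a Hangul syllable, is split across two chunks.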

Running the code shows Ollama's output arriving through FastAPI in streaming form.
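One detail matters when a client reads a text stream as raw bytes: a chunk boundary can fall in the middle of a multi-byte UTF-8 character, which is common with Korean text. The following minimal sketch, using only the standard library, shows the failure and an incremental decoder that avoids it. decode_chunks is a hypothetical helper name, and the split point is chosen by hand to force the problem.

```python
import codecs

def decode_chunks(chunks):
    """Decode byte chunks as UTF-8, buffering incomplete characters between chunks."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    tail = decoder.decode(b"", final=True)  # raises if bytes are left dangling
    if tail:
        yield tail

data = "안녕".encode("utf-8")   # 6 bytes, 3 per syllable
split = [data[:4], data[4:]]    # cut in the middle of the second syllable

try:
    split[0].decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

pieces = list(decode_chunks(split))
print(naive_ok, pieces)
```

The per-chunk decode fails on the first chunk, while the incremental decoder holds the dangling byte until the rest of the character arrives.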

4. Displaying Ollama's streaming output in a Python Streamlit page

So far we have talked to Ollama from the CLI and through an API server. Now let's see how to do the streaming communication from Streamlit, which is widely used for PoCs in Python development. Streamlit is a library for quickly building and viewing data applications, and many readers are probably already using it. Here, Streamlit performs streaming communication with the LLM that Ollama is serving and renders the output in a web page. Below is the Streamlit code.

import streamlit as st
import httpx
import asyncio
from typing import AsyncIterator, Callable

# Page settings
st.set_page_config(
    page_title="Ollama streaming chat",
    layout="wide"
)

# App title
st.title("Ollama streaming chat example")

# API settings
API_URL = "http://localhost:8004/generate-stream"  # FastAPI server address

# Initialize session state
if "messages" not in st.session_state:
    st.session_state.messages = []
if "current_response" not in st.session_state:
    st.session_state.current_response = ""

# Function: generate the streaming response
async def generate_streaming_response(prompt: str) -> AsyncIterator[str]:
    """Asynchronously yield the streaming response obtained through the API."""
    payload = {"prompt": prompt}

    # Reset the response
    st.session_state.current_response = ""

    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", API_URL, json=payload) as response:
            # aiter_text() decodes incrementally, so a multi-byte character
            # split across chunks does not raise a decode error
            async for text in response.aiter_text():
                if text:
                    st.session_state.current_response += text
                    # Yield the full response so far
                    yield st.session_state.current_response

# Function: handle the async result in Streamlit
def stream_response(prompt: str, callback: Callable[[str], None]):
    """Show the asynchronous streaming response in the Streamlit UI."""
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)

    # Reset the response
    st.session_state.current_response = ""

    async def process_response():
        async for response_text in generate_streaming_response(prompt):
            # Call the callback to update the UI
            callback(response_text)
            # Yield control to the event loop between updates
            await asyncio.sleep(0)

    # Run the async processing
    loop.run_until_complete(process_response())

    # Return the final response (read from session state)
    return st.session_state.current_response

# Show the chat history
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# User input
user_input = st.chat_input("Type a message!")

if user_input:
    # Show and store the user message
    with st.chat_message("user"):
        st.markdown(user_input)
    st.session_state.messages.append({"role": "user", "content": user_input})

    # Create the response container
    with st.chat_message("assistant"):
        response_container = st.empty()

        # Callback that redraws the container with the streamed text
        def update_response(text):
            response_container.markdown(text)

        # Generate and stream the response
        final_response = stream_response(user_input, update_response)

        # Store the response in the chat history
        if final_response and final_response.strip():
            st.session_state.messages.append({"role": "assistant", "content": final_response})

# Sidebar information
with st.sidebar:
    st.subheader("Model info")
    st.write("Current model: llama3.2-bllossom-3b-kr:latest")
    st.write("API endpoint: " + API_URL)

Running the Streamlit code above starts a simple web page that can talk to Ollama over the API. Here I had it go through the FastAPI server started earlier. The generate_streaming_response function sends the request to the FastAPI server asynchronously and processes the streaming response.

The stream_response function contains the logic for showing that asynchronous response in the Streamlit UI.

When you type a message, the app calls Ollama through the API and renders the result in streaming form.
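The bridge between asynchronous streaming and a synchronous script can also be written as a plain iterator instead of a callback. The sketch below is one possible variant, not what the app above does: fake_stream stands in for the HTTP stream, and iter_sync and cumulative are hypothetical helper names.

```python
import asyncio
from typing import AsyncIterator, Iterator

async def fake_stream() -> AsyncIterator[str]:
    """Stand-in for the HTTP stream; yields invented text fragments."""
    for part in ["Hel", "lo", "!"]:
        await asyncio.sleep(0)
        yield part

def iter_sync(agen: AsyncIterator[str]) -> Iterator[str]:
    """Drive an async generator from synchronous code, one item at a time."""
    loop = asyncio.new_event_loop()
    try:
        while True:
            try:
                yield loop.run_until_complete(agen.__anext__())
            except StopAsyncIteration:
                break
    finally:
        loop.close()

def cumulative(parts: Iterator[str]) -> Iterator[str]:
    """Turn fragments into the running text that a placeholder redraw expects."""
    text = ""
    for part in parts:
        text += part
        yield text

frames = list(cumulative(iter_sync(fake_stream())))
print(frames)
```

Each element of frames is what a placeholder such as response_container.markdown would be redrawn with. Recent Streamlit versions also provide st.write_stream, which consumes a generator of fragments; it is mentioned here only as a pointer.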

Closing

This post looked at how to receive data in streaming form when communicating over the API with Ollama, which deploys and serves LLMs.

I hope it helps.
