How to Emulate Ollama

17th Nov 2024 python ollama socket llamaindex create-llama curl

Hey,

I wanted to share some code snippets from my latest experiment, where I attempted to emulate an Ollama server.

While working with create-llama, I needed to capture the exact queries sent to the Ollama server and the responses received from it. To do this, I created a mock Ollama server. However, it turned out to be more challenging than I initially expected! :)

I’m not entirely sure, but I think I may have discovered a bug during my experiment! 🥳

I used Python's socket module to simulate the server communication, which definitely added some complexity to the project. In this post, I'll share the Python socket program I wrote, along with other relevant code snippets and commands I used during the process.

After experimenting with curl for a while, I figured out how to handle the headers and responses. However, sending data in chunks via a socket wasn't as straightforward as I'd anticipated.
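For reference, HTTP/1.1 chunked transfer encoding frames each piece of the body as the size in hexadecimal, a CRLF, the data, and another CRLF. A zero-size chunk ends the body. Below is a minimal sketch of that framing with a small decoder to check the round trip. The names `encode_chunk`, `decode_chunked`, and `LAST_CHUNK` are made up for illustration, and the decoder assumes a complete stream with no trailers or chunk extensions.

```python
def encode_chunk(payload: bytes) -> bytes:
    """Frame one chunk: hex size, CRLF, data, CRLF."""
    return f"{len(payload):x}".encode("ascii") + b"\r\n" + payload + b"\r\n"


LAST_CHUNK = b"0\r\n\r\n"  # zero-size chunk followed by an empty trailer section


def decode_chunked(stream: bytes) -> bytes:
    """Reassemble the body from a complete chunked stream (no trailers, no extensions)."""
    body = b""
    pos = 0
    while True:
        line_end = stream.index(b"\r\n", pos)
        size = int(stream[pos:line_end], 16)  # chunk size is written in hex
        if size == 0:
            return body
        start = line_end + 2
        body += stream[start:start + size]
        pos = start + size + 2  # skip the CRLF that closes the chunk


wire = encode_chunk(b'{"done":false}\n') + encode_chunk(b'{"done":true}\n') + LAST_CHUNK
print(wire)
print(decode_chunked(wire))
```

The size is computed from the encoded bytes, not from the string, so it stays correct if a response ever contains non-ASCII text.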

To better understand the Ollama server's behavior, I also ran several commands, such as:

curl -v http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "just say yes!"
}'

curl -v http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text:latest",
  "prompt": "merhaba",
  "options":{"num_ctx":4096,"top_p":0.9,"temperature":0.7}
}'
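The first curl command can also be sketched in Python with only the standard library. This assumes a server listening on localhost:11434 and that each JSON object in the streamed response arrives on its own line, which the application/x-ndjson content type suggests. The helper names `build_request` and `stream_generate` are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # same host and port as the curl commands


def build_request(path, payload):
    """Build a POST request with a JSON body, like curl -d does."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stream_generate(model, prompt):
    """Yield each JSON object from the streamed /api/generate response."""
    req = build_request("/api/generate", {"model": model, "prompt": prompt})
    # urllib decodes the chunked transfer encoding; we only see the body lines.
    with urllib.request.urlopen(req, timeout=30) as resp:
        for line in resp:
            if line.strip():
                yield json.loads(line)
```

Looping over `stream_generate("llama3.2", "just say yes!")` and printing each part's `response` field should show the tokens as they arrive, ending with the object where `done` is true.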

Here's the first request LlamaIndex sent, as logged by my mock server listening on the Ollama port:

Connection from ('127.0.0.1', 34094)
Received data: POST /api/embeddings HTTP/1.1
host: 127.0.0.1:11434
connection: keep-alive
Content-Type: application/json
Accept: application/json
User-Agent: ollama-js/0.5.10 (x64 linux Node.js/v20.18.0)
accept-language: *
sec-fetch-mode: cors
accept-encoding: gzip, deflate
content-length: 112

{"model":"mxbai-embed-large:latest","prompt":"merhaba","options":{"num_ctx":4096,"top_p":0.9,"temperature":0.7}}
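A captured request like this can be split into its parts with a few lines of Python. This is only a sketch: `parse_http_request` is a hypothetical helper, the sample below keeps just a few of the headers from the log, and it assumes the whole request is already in memory.

```python
import json


def parse_http_request(raw: bytes):
    """Split a raw HTTP/1.1 request into (method, path, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    method, path, _version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # header names are case-insensitive
    return method, path, headers, body


# The request logged above, abbreviated to the headers that matter here.
captured_body = b'{"model":"mxbai-embed-large:latest","prompt":"merhaba","options":{"num_ctx":4096,"top_p":0.9,"temperature":0.7}}'
raw = (
    b"POST /api/embeddings HTTP/1.1\r\n"
    b"host: 127.0.0.1:11434\r\n"
    b"Content-Type: application/json\r\n"
    b"content-length: 112\r\n"
    b"\r\n" + captured_body
)

method, path, headers, payload = parse_http_request(raw)
print(method, path)
print(int(headers["content-length"]) == len(payload))  # header matches the body size
print(json.loads(payload)["model"])
```

The content-length check is a quick way to confirm that the full body was captured and not cut off by a short read.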

While using an HTTP server would have made things much simpler, I found that forcing myself to tackle the challenges with raw sockets was a valuable learning experience. Sometimes, making things difficult can lead to deeper understanding! 😉

Here's the Python socket program I developed:

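import socket

server_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Allow quick restarts without waiting for the old socket to leave TIME_WAIT
server_socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server_socket.bind(('0.0.0.0', 11434))  # 11434 is the port Ollama clients expect
server_socket.listen(5)
print("Server listening on port 11434")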

while True:
    conn, addr = server_socket.accept()
    conn.settimeout(5)
    print(f"Connection from {addr}")

    try:
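        # A single recv() may return only part of a larger request;
        # 4096 bytes is plenty for the small requests shown in this post
        data = conn.recv(4096)
        print(f"Received data: {data.decode(errors='replace')}")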

        # HTTP Response headers
        response_headers = [
            "HTTP/1.1 200 OK",
            "Content-Type: application/x-ndjson",
            "Date: Fri, 15 Nov 2024 15:46:59 GMT",
            "Transfer-Encoding: chunked",  # Telling the client to expect chunked transfer
        ]
        # Send the headers
        conn.sendall(("\r\n".join(response_headers) + "\r\n\r\n").encode())

        # Chunked response data
        chunks = [
            '{"model":"llama3.2","created_at":"2024-11-15T15:46:15.027881945Z","response":"Yes","done":false}',
            '{"model":"llama3.2","created_at":"2024-11-15T15:46:59.248901276Z","response":"","done":true,"done_reason":"stop","context":[128006,9125,128007,271,38766,1303,33025,2696,25,6790,220,2366,18,271,128009,128006,882,128007,271,4345,2019,10035,0,128009,128006,78191,128007,271,9642,13],"total_duration":242612055,"load_duration":31211312,"prompt_eval_count":29,"prompt_eval_duration":70000000,"eval_count":3,"eval_duration":139000000}',
        ]

        # Sending chunks
        for chunk in chunks:
            payload = chunk.encode()  # Encode first: the chunk size counts bytes, not characters
            chunk_size = f"{len(payload):x}"  # Chunk size in hex, without the '0x' prefix
            conn.sendall(f"{chunk_size}\r\n".encode())  # Send chunk size
            conn.sendall(payload)  # Send the actual chunk data
            conn.sendall(b"\r\n")  # End of chunk (CRLF)

        # Final chunk with size 0 to indicate end of data
        conn.sendall(b"0\r\n\r\n")  # End of chunked response

    except Exception as e:
        print(f"Something went wrong: {e}")
    finally:
        # Close the connection whether or not the exchange succeeded
        conn.close()
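One limitation of reading the request with a single recv() call is that a longer request, such as a chat prompt with a lot of context, may arrive in several pieces. A possible way around that is to keep reading until the headers are complete and then until Content-Length bytes of body have arrived. The sketch below uses a hypothetical helper, `recv_http_request`, and assumes the client sends a Content-Length header and does not chunk its request body.

```python
import socket


def recv_http_request(conn: socket.socket, max_bytes: int = 1_000_000) -> bytes:
    """Read one HTTP request: headers first, then Content-Length bytes of body."""
    data = b""
    # Keep reading until the blank line that ends the headers has arrived.
    while b"\r\n\r\n" not in data:
        part = conn.recv(4096)
        if not part:  # peer closed the connection early
            return data
        data += part
        if len(data) > max_bytes:
            raise ValueError("request too large")

    head, _, body = data.partition(b"\r\n\r\n")
    content_length = 0
    for line in head.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            content_length = int(value.strip())

    # Read whatever part of the body has not arrived yet.
    while len(body) < content_length:
        part = conn.recv(4096)
        if not part:
            break
        body += part
    return head + b"\r\n\r\n" + body
```

In the server loop, `data = recv_http_request(conn)` could stand in for the single recv() call; the rest of the program would stay the same.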
