Hey,
I wanted to share some code snippets from my latest experiment, where I attempted to emulate an Ollama server.
While working with create-llama, I needed to capture the exact queries sent to the Ollama server and the responses received from it. To do this, I created a mock Ollama server. However, it turned out to be more challenging than I initially expected! :)
I'm not entirely sure, but I think I may have discovered a bug during my experiment! 🥳
I used Python's socket module to simulate the server communication, which definitely added some complexity to the project. In this post, I'll share the Python socket program I wrote, along with other relevant code snippets and commands I used during the process.
After experimenting with curl for a while, I figured out how to handle the headers and responses. However, sending data in chunks via a socket wasn't as straightforward as I'd anticipated.
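As a quick refresher, chunked transfer encoding frames every piece of the body with its size in hexadecimal, a CRLF, the data itself, and another CRLF, and the whole stream ends with a zero-length chunk. A tiny helper that frames a single chunk could look like this (just a sketch to illustrate the wire format; the full server code is further down):

def frame_chunk(data: bytes) -> bytes:
    # <size-in-hex>\r\n<payload>\r\n
    return f"{len(data):x}\r\n".encode() + data + b"\r\n"

# The body is terminated by a zero-length chunk: b"0\r\n\r\n"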
To better understand the Ollama server's behavior, I also ran several commands, such as:
curl -v http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "just say yes!"
}'
curl -v http://localhost:11434/api/embeddings -d '{
"model": "nomic-embed-text:latest",
"prompt": "merhaba",
"options":{"num_ctx":4096,"top_p":0.9,"temperature":0.7}
}'
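If you prefer Python over curl for poking at the API, roughly the same request can be sent with the requests library (assuming you have it installed); since /api/generate streams newline-delimited JSON by default, I read the response line by line:

import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2", "prompt": "just say yes!"},
    stream=True,  # the endpoint streams ndjson lines until "done": true
)
for line in resp.iter_lines():
    if line:
        print(line.decode())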
Here's the first query LlamaIndex sent to the Ollama server:
Connection from ('127.0.0.1', 34094)
Received data: POST /api/embeddings HTTP/1.1
host: 127.0.0.1:11434
connection: keep-alive
Content-Type: application/json
Accept: application/json
User-Agent: ollama-js/0.5.10 (x64 linux Node.js/v20.18.0)
accept-language: *
sec-fetch-mode: cors
accept-encoding: gzip, deflate
content-length: 112
{"model":"mxbai-embed-large:latest","prompt":"merhaba","options":{"num_ctx":4096,"top_p":0.9,"temperature":0.7}}
While using an HTTP server would have made things much simpler, I found that forcing myself to tackle the challenges with raw sockets was a valuable learning experience. Sometimes, making things difficult can lead to deeper understanding!
Here's the Python socket program I developed:
import socket
server_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_socket.bind(('0.0.0.0', 11434))
server_socket.listen(5)
print("Server listening on port 11434")
while True:
    conn, addr = server_socket.accept()
    conn.settimeout(5)
    print(f"Connection from {addr}")
    try:
        data = conn.recv(1024)
        print(f"Received data: {data.decode()}")

        # HTTP response headers
        response_headers = [
            "HTTP/1.1 200 OK",
            "Content-Type: application/x-ndjson",
            "Date: Fri, 15 Nov 2024 15:46:59 GMT",
            "Transfer-Encoding: chunked",  # telling the client to expect a chunked body
        ]

        # Send the headers, terminated by a blank line
        conn.sendall(("\r\n".join(response_headers) + "\r\n\r\n").encode())

        # Chunked response data
        chunks = [
            '{"model":"llama3.2","created_at":"2024-11-15T15:46:15.027881945Z","response":"Yes","done":false}',
            '{"model":"llama3.2","created_at":"2024-11-15T15:46:59.248901276Z","response":"","done":true,"done_reason":"stop","context":[128006,9125,128007,271,38766,1303,33025,2696,25,6790,220,2366,18,271,128009,128006,882,128007,271,4345,2019,10035,0,128009,128006,78191,128007,271,9642,13],"total_duration":242612055,"load_duration":31211312,"prompt_eval_count":29,"prompt_eval_duration":70000000,"eval_count":3,"eval_duration":139000000}',
        ]

        # Send each chunk: size in hex, CRLF, data, CRLF
        for chunk in chunks:
            chunk_size = hex(len(chunk))[2:]  # chunk size in hex (without the '0x' prefix)
            conn.sendall(f"{chunk_size}\r\n".encode())  # send chunk size
            conn.sendall(chunk.encode())                # send the actual chunk data
            conn.sendall(b"\r\n")                       # end of chunk (CRLF)

        # Final chunk with size 0 to indicate end of data
        conn.sendall(b"0\r\n\r\n")

        # Close the connection
        conn.close()
    except Exception as e:
        print(f"Something went wrong: {e}")
        conn.close()
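Since the mock listens on the default Ollama port (11434), both curl and the create-llama app talk to it without any extra configuration. For a quick sanity check, the same generate request from earlier can be pointed at the mock to watch the chunked response come back:

curl -v http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "just say yes!"
}'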