Generative AI Series

Ollama — a runtime to serve LLMs everywhere.

Working with Ollama to run models locally and build LLM applications that can be deployed as Docker containers.

A B Vijay Kumar

--

This blog is part of an ongoing series on Generative AI and is a continuation of the previous blogs. In this series, we will explore Ollama and build applications that we can deploy in a distributed architecture using Docker.

Ollama is a framework that makes it easy to run powerful language models on your own computer. You don’t need to rely on cloud services or remote servers. You can use Ollama on macOS and Linux, and soon on Windows too. Ollama supports a range of models, such as LLaMA-2, uncensored LLaMA, CodeLLaMA, Falcon, Mistral, and others. You can install and run these models with a few simple commands. Ollama lets you harness the potential of local language models for your own projects. Ollama also works with Docker, so you can deploy language models in a distributed system (such as Kubernetes) and serve them to your applications.

In this blog, we will first explore what Ollama is and how to use it; in a subsequent blog, we will build a simple Streamlit chatbot application that uses Langchain to talk to the models served by Ollama. We will then deploy that application as Docker containers using docker-compose.

Ollama can be installed from the ollama.ai site. Ollama is a platform that enables you to access and interact with various models using different methods: you can use the CLI, the REST API, or an SDK to communicate with Ollama and run your tasks. Additionally, you can use the Langchain SDK, which makes working with Ollama even more convenient. Later in this series, we will show you how to use the Langchain SDK to create a chatbot that can answer your questions.
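On Linux, the documented install at the time of writing is a one-line script (check ollama.ai for the current command, as it may change); on macOS you download the app from the same site:

curl -fsSL https://ollama.ai/install.sh | sh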

Using Ollama on the command line is very simple. The following are the commands you can try to run Ollama on your computer; a short example session follows the list.

  • ollama pull — This command is used to pull a model from the Ollama model hub.
  • ollama rm — This command is used to remove the already downloaded model from the local computer.
  • ollama cp — This command is used to make a copy of the model.
  • ollama list — This command is used to see the list of downloaded models.
  • ollama run — This command is used to run a model. If the model is not already downloaded, it will pull the model and then serve it.
  • ollama serve — This command is used to start the Ollama server, which serves the downloaded models.
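Here is a minimal example session; the model name mistral is only an illustration, and any model from the Ollama library works the same way:

ollama pull mistral            # download the model from the model hub
ollama list                    # confirm it appears in the local list
ollama run mistral             # chat with it interactively
ollama cp mistral my-mistral   # make a local copy under a new name
ollama rm my-mistral           # remove the copy when no longer needed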

Ollama also provides a way to customize an existing model by defining our own version of it. Let's explore that first, before we start coding.

To create a custom model, we need to define the model specification in a Modelfile (very similar to a Dockerfile, if you have worked with Docker).

In the following example, we will create our own model, based on Mistral, with a custom persona of Spider-Man. The following shows the source code of the Modelfile.
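Here is a minimal sketch of such a Modelfile; the exact SYSTEM prompt wording is only an illustration:

FROM mistral
PARAMETER temperature 1
SYSTEM """
You are Spider-Man. Answer every question the way Spider-Man would, with friendly-neighborhood humor.
"""

FROM names the base model, PARAMETER sets a model hyperparameter, and SYSTEM provides the persona prompt that every conversation starts with.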

The file is self-explanatory: we are instructing that our model is based on mistral (Line 1), uses the temperature hyperparameter set to 1 (Line 2), which makes it highly creative, and responds like Spider-Man (Lines 3–5), based on the prompt provided in SYSTEM.

To build this model, we use the ollama create command, as shown below.
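Assuming the Modelfile above is saved as Modelfile in the current directory, the command looks like this:

ollama create spiderman -f ./Modelfile

Here, -f points at the Modelfile, and spiderman is the name the new model will be registered under.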

As you can see, Ollama downloads the base model and creates a new model called spiderman from it. We can see the list of all models by executing the ollama list command, as shown below.

Now let's run that model and see how it responds to our queries, by executing the ollama run spiderman command.
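For reference, the interactive session starts like this; the question is just an example, and you can type anything at the >>> prompt:

ollama run spiderman
>>> Who are you?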

The screenshot below shows the response.

As you can see, this model responds like Spider-Man. This is a silly example, but imagine having to customize a model to respond as a particular persona; this can be very powerful. In future blogs, we will discuss AutoGen agents, where we build agents with various personas to solve a problem.

The models can also be invoked over the REST API by sending a POST request. Here is a sample curl command.

curl http://localhost:11434/api/generate -d '{
  "model": "spiderman",
  "prompt": "Who are you?",
  "stream": false
}'

You can see the output below. The response comes back as JSON, and we can extract the generated text from the “response” key.
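For example, if you have jq installed (an assumption on my part; it is not part of Ollama), the generated text can be pulled out directly on the command line:

curl -s http://localhost:11434/api/generate -d '{
  "model": "spiderman",
  "prompt": "Who are you?",
  "stream": false
}' | jq -r '.response'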

Ollama also ships as a Docker image, which allows us to run the Ollama server as a container. This is very useful for building microservices applications that use Ollama models, since we can easily deploy them in the Docker ecosystem, on platforms such as OpenShift, Kubernetes, and others. To run Ollama in Docker, we use the docker run command, as shown below. Before this, you should have Docker installed on your system.

docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

You can see the output below.

We should then be able to interact with this container using docker exec, as shown below, and run the prompts.

docker exec -it ollama ollama run phi

In the above command, we are running the phi model inside the Docker container.

Note that Docker containers are ephemeral: unless the model directory is persisted, whatever models we pull will disappear when the container is removed and recreated. We will solve this issue in the next blog, where we will build a distributed Streamlit application from the ground up and map the container's model volume to the host.
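As a preview, a minimal sketch of that mapping uses a bind mount instead of a named volume (the host directory ollama-models is just an example name, and you would first remove the earlier container with docker rm -f ollama):

docker run -d -v $(pwd)/ollama-models:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

With this, the models pulled inside the container land in ./ollama-models on the host and survive container re-creation.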

Ollama is a powerful tool that enables new ways of creating and running LLM applications, both locally and in the cloud. It simplifies the development process and offers flexible deployment options, along with easy management and scaling of the applications.

I hope you enjoyed this blog and learned something new. In the next blog, we will see how to integrate Ollama with Langchain, and how to use docker-compose to deploy the application as Docker containers.

I will be back with more experiments. In the meantime, have fun and keep coding!!! See you soon!

You can also check out my other blog on Ollama, where we build a chatbot application and deploy it on Docker.

Ollama — Build a ChatBot with Langchain, Ollama & Deploy on Docker | by A B Vijay Kumar | Feb, 2024 | Medium

