PoolToy: A (toy) process pool manager in Elixir 1.6 (part 1.9)

Managing a single pool (continued)

(This post is part of a series on writing a process pool manager in Elixir.)

Figuring out the shenanigans

Let’s get on the same page (after last post) as to what is happening when we kill workers in our pool. After starting an IEx session with iex -S mix, starting the Observer with :observer.start(), and killing a worker, I’ve got the following state in my pool manager:

I started with 3 workers, but after killing one of them (0.138.0), I now have 4 workers (one too many). In addition, of the two new workers that appeared (0.204.0 and 0.205.0), only one of them was added to the pool manager’s list of available workers. You’ll notice that it was also the one to be linked. The other worker is just floating around, and will never be assignable to handle work.

Now, if I kill the process with pid 0.204.0, it gets replaced by a single new process. If on the other hand I kill 0.205.0, it will be replaced by two new processes. What gives?

Well, with the changes implemented in the last post, we’ve got the pool manager linked to all workers, and replacing them with new instances when they fail. But remember who else replaces worker processes as they die? That’s right: the worker supervisor. So when we killed a worker process that was spawned during initialization (and therefore linked to the pool manager), it triggered 2 responses: the worker supervisor started a new worker instance (because that’s just what it does), and the pool manager also started a new worker (which it then linked).

The new worker spawned by the worker supervisor isn’t linked by the pool manager (as you can see in the image above: no orange line), so we end up with different results when we kill these newly spawned workers:

  • kill the worker created by the worker supervisor, and it simply gets replaced by a new worker: we don’t end up with any additional workers;
  • kill the worker created by the pool manager, and we end up with an additional worker in our tree: one worker was created by the worker supervisor in response to a worker’s death, the other was created by the pool manager because it saw that a linked worker terminated.

This obviously isn’t the behavior we want in our application. What should we do? We need to tell the worker supervisor to not restart its children.

Supervisors start their children according to a child spec (that we met back in part 1.2), which has an optional :restart value (see docs). This value dictates whether that particular child will get restarted when it terminates.

First, let’s check our current worker supervisor state by double-clicking on it within the Observer: in the “State” tab, you can see a list of children, each of which indicates permanent as the restart strategy.

Customizing the child spec

We’ve found the cause of the problem, so let’s move on to the solution. One possibility would be to ensure that our Doubler worker always returns a child spec with a :temporary restart strategy. We won’t go that route here, because down the road we want PoolToy users to be able to provide their own worker implementations to be used in the pool. So it’s safer to ensure that workers have temporary restart strategies ourselves, instead of relying on behavior implemented by others.
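For reference, here’s roughly what that dismissed approach would look like (a sketch only, not something we’re adding to the project): use GenServer accepts options that customize the child_spec/1 it generates, so Doubler could declare its own restart strategy.

defmodule Doubler do
  # Hypothetical sketch: make the generated child_spec/1 default to :temporary
  use GenServer, restart: :temporary

  # ... rest of the module unchanged
end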

Here’s our current code starting workers (lib/pool_toy/pool_man.ex):

defp new_worker(sup, spec) do
  {:ok, pid} = PoolToy.WorkerSup.start_worker(sup, spec)
  true = Process.link(pid)
  pid
end

We tell the dynamic worker supervisor to start a new instance of the spec argument. If you check the code, you’ll see that the value of that argument is always Doubler. How does that become a child specification? Well, Doubler invokes use GenServer: one of its effects is adding a child_spec/1 function to the module. Then when start_child/2 is given a second argument that isn’t a child spec (a child spec being simply a map with certain mandatory key/value pairs), it will convert it into a child spec.

If given a tuple with module and args like {Doubler, [foo: :bar]} it will call Doubler.child_spec([foo: :bar]) to get the child spec instance it needs (the args will then be forwarded to start_link/1). If given simply a module name such as Doubler it will call Doubler.child_spec([]) (so in effect Doubler is equivalent to {Doubler, []}).

In fact, you can try that in an IEx session with iex -S mix:

iex(1)> Doubler.child_spec([])
%{id: Doubler, start: {Doubler, :start_link, [[]]}}
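And if we pass args, they end up forwarded in the :start tuple (using a made-up :foo option for illustration):

iex(2)> Doubler.child_spec([foo: :bar])
%{id: Doubler, start: {Doubler, :start_link, [[foo: :bar]]}}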

So that explains how our dynamic worker supervisor ends up with the child spec it requires in the code above. And we know that we can provide a valid child spec directly to start_child/2, so maybe we could try:

defp new_worker(sup, spec) do
  {:ok, pid} = PoolToy.WorkerSup.start_worker(sup,
       %{id: spec, start: {spec, :start_link, [[]]}, restart: :temporary})
  true = Process.link(pid)
  pid
end

The problem is that although that will technically work, it’ll be very brittle. What happens when Doubler changes how it defines its child specification (e.g. by adding default arguments provided to start_link/1)? Our code will disregard the change and will break, that’s what. Even worse, our code won’t work with other ways of specifying the spec (e.g. a tuple, or the full map).

The better way would be to generate a child spec from the spec argument, and provide an override value for the :restart attribute. Fortunately, there’s a function for that: Supervisor.child_spec/2. It takes a child spec, tuple of module and args, or module as the first arg, and a keyword list of overrides as the second arg. So we can change our code to (lib/pool_toy/pool_man.ex):

defp new_worker(sup, spec) do
  child_spec = Supervisor.child_spec(spec, restart: :temporary)
  {:ok, pid} = PoolToy.WorkerSup.start_worker(sup, child_spec)
  true = Process.link(pid)
  pid
end
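If you’re curious, you can check in IEx what that override produces (assuming the default child_spec/1 generated for Doubler; the exact map may differ if Doubler customizes it):

iex(1)> Supervisor.child_spec(Doubler, restart: :temporary)
%{id: Doubler, restart: :temporary, start: {Doubler, :start_link, [[]]}}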

But hold on a minute: if the worker supervisor is no longer restarting terminated processes (since its children are now :temporary and the pool manager handles the restarting), is it even useful anymore? Shouldn’t we refactor and remove the supervisor? Nope, because one aspect of a supervisor is indeed that it handles restarting terminated children, but the other is that it provides a central location to terminate all of its children and prevent memory leaks. In other words, our worker supervisor will allow us to terminate all worker processes by simply terminating the worker supervisor itself. This will be particularly handy in the future, when we want to enable clients to stop pools.

Our changes so far

And with that last change above, we’ve finally got a functional worker pool. Here it is in use:

iex(1)> w1 = PoolToy.PoolMan.checkout()
#PID<0.137.0>
iex(2)> w2 = PoolToy.PoolMan.checkout()
#PID<0.138.0>
iex(3)> w3 = PoolToy.PoolMan.checkout()
#PID<0.139.0>
iex(4)> w4 = PoolToy.PoolMan.checkout()
:full
iex(5)> PoolToy.PoolMan.checkin(w2)
:ok
iex(6)> w4 = PoolToy.PoolMan.checkout()
#PID<0.138.0>
iex(7)> Doubler.compute(w4, 3)
Doubling 3
6
iex(8)> 1..9 |> Enum.map(& Doubler.compute(w1, &1))
Doubling 1
Doubling 2
Doubling 3
Doubling 4
Doubling 5
Doubling 6
Doubling 7
Doubling 8
Doubling 9
[2, 4, 6, 8, 10, 12, 14, 16, 18]

Since we’ve limited our pool size to 3 doublers, we know that at any one time only 3 doubling processes may happen simultaneously: that’s what we wanted by design to prevent becoming overloaded.

We’ve got a single functional pool, but there’s more for us to do (and learn!): clients should be able to start/stop as many pools as they want, with variable sizes and worker modules. We could also enhance our pool functionality to handle bursts of activity by allowing the pool to start a certain number of “extra” overflow workers that will shut down after some cool down time when demand returns to normal levels. We can also allow our clients to decide what to do if no worker is available: return a response immediately saying the pool is at capacity, or queue the client’s demand and execute when a worker becomes available.

Along the way, we’ll find that some choices made so far weren’t such a great design. But worry not! We’ll learn why and how our design will need to evolve to accommodate the new desired requirements.


Would you like to see more Elixir content like this? Sign up to my mailing list so I can gauge how much interest there is in this type of content.


PoolToy: A (toy) process pool manager in Elixir 1.6 (part 1.8)

Managing a single pool (continued)

(This post is part of a series on writing a process pool manager in Elixir.)

First, let’s point out the problem in our current pool design as we left after last post: our pool manager’s state gets out of whack when a checked out worker dies. Here’s how to witness that with your own eyes (after running iex -S mix from your project’s root folder, of course):

iex(1)> PoolToy.PoolMan.checkout()
#PID<0.137.0>
iex(2)> :observer.start()
:ok

I’ve checked out a process with pid 0.137.0 (note that your value could of course be different) and started the Observer. In there, if I look at the pool manager’s state (quick reminder: in the “Applications” tab, click on pool_toy in the list on the left, then double-click on PoolMan, then go to the “State” tab in the new window), I can see that I now have only 2 available workers, which is correct:

Additionally, if I go to the “Table Viewer” tab in the Observer and double click on the “monitors” table, I can see that we have an entry for a monitor on the process with pid 0.137.0:

Back in the “Applications” tab (where you might have to select pool_toy on the left again), let’s kill the checked out process by right-clicking on it and selecting the last entry in the menu: “kill process”. Click ok on the popup asking for the exit reason (thereby using the default :kill) and you’ll see our application quickly replaces the downed worker in the pool with a fresh new instance (it has a different pid) ready to accept a workload.

For completeness, let’s quickly point out that what we’ve just accomplished in the Observer (killing a specific process) can of course be achieved in code:

iex(1)> w = PoolToy.PoolMan.checkout()
#PID<0.137.0>
iex(2)> Process.exit(w, :kill)
true

And if you consult the docs, you’ll see that the default :kill reason proposed by the Observer is special: it forces the process to die unconditionally (more on that below).

But let’s get back to our earlier train of thought, and turn our attention back to the Observer: if we look at the pool manager’s state and the monitors table, what do we see? Only 2 workers are considered available by the pool manager, and there’s still an entry for the dead worker in the monitors table.

If we return to the “Applications” tab and kill one of the available workers, it gets even worse. Look at the pool manager’s state now, and you can see that it’s still the same. That means that we’ve got pids in there that are dead (and therefore shouldn’t be given to clients when they request a worker), and alive workers that aren’t in the list of available workers. Oh my!

So why is this happening? As you’ve probably guessed, it’s because we’re not handling worker death properly (or some might say we’re not handling it at all): we’ve got a supervisor watching over the workers and restarting them as they die, but the pool manager is never made aware of this, so it can’t react and update its state.

How can we fix it? I know: we can have the worker supervisor monitor its children, and notify the pool manager when one of them dies, along with the pid of the new worker it started to replace the dead one! NO. No no no. No. Supervisors only supervise.

Supervisors only supervise

Supervisors are the cornerstone of OTP reliability, so they must be rock solid. In turn, this means that any code we can put elsewhere doesn’t belong in the supervisor: less code means fewer bugs, which equals more robust supervisors.

Ok, so we can’t have the supervisor monitoring the processes, but someone’s got to do it, right? Let’s have the brains of our operation do it: the pool manager. But let’s think this through a bit before heading straight back to mashing our keyboards.

Our pool manager already makes the dynamic worker supervisor start workers. This way the pool manager knows what the worker pids are. So it’ll be pretty easy to monitor (or link!) them there. This way, when a worker dies, the pool manager gets a message and can react to it (start a new worker, monitor/link the new worker, and clean up the state so it’s consistent). Sounds like a plan, let’s get back to coding!

Handling worker death

Our pool manager needs to know when workers die. Should we use monitors or links for this? To determine which is more appropriate, we once again need to think about how the pool manager and worker deaths should affect one another.

If process A monitors process B and B dies, process A will receive a :DOWN message as we saw in part 1.6. But if A dies, nothing special happens.

If process A is linked to process B and B dies, process A will be killed. Reciprocally, if A dies, B will be killed.

In our case, we’d like a mix of the two: we want the pool manager to know when a worker dies (so it can start a new worker and add it to the pool), but if the pool manager dies all workers should be killed (because the best way to ensure the pool manager’s state is correct is to start fresh with all new workers that will all be available).

As this is a common use case, OTP provides the ability to trap exits which in Elixir is achieved via Process.flag/2. To learn more about this, let’s see the Erlang docs we’re referred to:

When trap_exit is set to true, exit signals arriving to a process are converted to {‘EXIT’, From, Reason} messages, which can be received as ordinary messages.

Why is this useful? Let’s contrast it to the usual case for linked processes (again from the Erlang docs):

The default behaviour when a process receives an exit signal with an exit reason other than normal, is to terminate and in turn emit exit signals with the same exit reason to its linked processes. An exit signal with reason normal is ignored.

When a process is trapping exits, it does not terminate when an exit signal is received. Instead, the signal is transformed into a message {‘EXIT’,FromPid,Reason}, which is put into the mailbox of the process, just like a regular message.

An exception to the above is if the exit reason is kill, that is if exit(Pid,kill) has been called. This unconditionally terminates the process, regardless of if it is trapping exit signals.

So normally, if process A is linked to process B, when process B crashes it will send an exit signal to process A. Process A, upon receiving the exit signal will terminate for the same reason B did.

However, if we make process A trap exits by calling Process.flag(:trap_exit, true) from within the process, all exit signals it is sent will instead be converted into :EXIT messages that will sit in the process’ mailbox until they are treated. The last quoted paragraph documents the exception: if the exit reason is :kill, a process cannot trap the exit signal and will be terminated unconditionally. In other words, you can still force a process that traps exits to terminate: just call Process.exit(pid, :kill).
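Here’s a quick way to see exit trapping in action in an IEx session (pids will differ on your machine):

iex(1)> Process.flag(:trap_exit, true)
false
iex(2)> spawn_link(fn -> exit(:boom) end)
#PID<0.110.0>
iex(3)> flush()
{:EXIT, #PID<0.110.0>, :boom}
:ok

Without the trap_exit flag, that spawn_link would have taken the shell process down with it; with it, the exit signal simply shows up as an :EXIT message in the mailbox.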

Back to our use case: by having the pool manager link to workers and trap exits, we’ll achieve exactly what we want. If the pool manager goes down, it will take all workers down with it, but if a worker dies, the pool manager won’t be killed: it’ll instead receive an :EXIT message that it can process as it pleases (in our case to replace the dead worker in the pool).

Linking worker processes

Let’s start by refactoring worker starting to extract it into its own function (lib/pool_toy/pool_man.ex):

def handle_info(:start_workers, %State{size: size} = state) do
  workers =
    for _ <- 1..size do
      new_worker(PoolToy.WorkerSup, Doubler)
    end

  {:noreply, %{state | workers: workers}}
end

We’ve simply shifted the worker creation to within a new_worker/2 function on line 4. Here’s that function definition (lib/pool_toy/pool_man.ex):

def handle_info(msg, state) do
  # edited for brevity
end

defp new_worker(sup, spec) do
  {:ok, pid} = PoolToy.WorkerSup.start_worker(sup, spec)
  true = Process.link(pid)
  pid
end

Most of that code is the same, we’ve only added line 7 where we call Process.link/1 which will link the calling process to the one given. That handles the linking part, but we still need to handle the exit trapping part. That’s pretty easy (lib/pool_toy/pool_man.ex):

def init(size) do
  Process.flag(:trap_exit, true)
  send(self(), :start_workers)
  monitors = :ets.new(:monitors, [:protected, :named_table])
  {:ok, %State{size: size, monitors: monitors}}
end

Line 2 is the only addition required to make the magic happen: when a worker dies, our pool manager will no longer die with it, since it’ll now receive a special message instead.

Let’s take a quick peek at the Observer again:

You can see that orange lines have appeared: they indicate link relationships. In the above image, we can see that PoolMan is linked to every worker process.

Handling a terminated worker

Let’s first write the code to handle a terminated worker. If a worker dies, we need to remove it from the list of available workers (if it even is in that list), and add a new worker to take its place. Of course, the new worker won’t have been checked out, so it’ll be added to the list of available workers (lib/pool_toy/pool_man.ex):

defp handle_idle_worker(%State{workers: workers} = state, idle_worker) when is_pid(idle_worker) do
  # edited for brevity
end

defp handle_worker_exit(%State{workers: workers} = state, pid) do
  w = workers |> Enum.reject(& &1 == pid)
  %{state | workers: [new_worker(PoolToy.WorkerSup, Doubler) | w]}
end

Right, so now let’s actually handle that mysterious :EXIT message that we’ll receive when a worker dies (because the pool manager is trapping exits) in lib/pool_toy/pool_man.ex:

def handle_info({:DOWN, ref, :process, _, _}, %State{workers: workers, monitors: monitors} = state) do
  # edited for brevity
end

def handle_info({:EXIT, pid, _reason}, %State{workers: workers, monitors: monitors} = state) do
  case :ets.lookup(monitors, pid) do
    [{pid, ref}] ->
      true = Process.demonitor(ref)
      true = :ets.delete(monitors, pid)
      {:noreply, state |> handle_worker_exit(pid)}
    [] ->
      if workers |> Enum.member?(pid) do
        {:noreply, state |> handle_worker_exit(pid)}
      else
        {:noreply, state}
      end
  end
end

The :EXIT messages will contain the pid of the process that died (line 5). Since we’re only linking to worker processes, we know that this pid is going to be a worker. We can then check (on line 6) whether the worker was checked out, and act appropriately.

If the worker was checked out (line 7):

  • stop monitoring the associated client, as we no longer care if it dies (line 8);
  • remove the monitor reference from our ETS monitors table (line 9);
  • handle the terminated worker and return the updated state (line 10).

If the worker wasn’t checked out (line 11):

  • if the worker is in the list of available workers, we handle the terminated worker and return the updated state (lines 12-13);
  • otherwise, we do nothing and return the state as is.

A quick refactor

We’ve got some “magic” module values strewn about our code: PoolToy.WorkerSup as the name for the worker supervisor process, and Doubler as the spec used for worker instances. Ideally, the @name module attribute we use to name the pool manager process should also get refactored, but as it is handled differently (being passed as an option to GenServer.start_link/3 as opposed to used within the GenServer callback implementations), we’ll leave it aside for simplicity. Don’t worry, we’ll get to it when implementing multiple pools.

Hard coding process names everywhere works for now, but our end goal is to allow users to have multiple pools running at the same time. Our current approach won’t work for that: names must be unique. So let’s clean up our act by moving these values that are specific to a pool instance into the pool manager’s state. We’ll still hard code them for now, but at least that way they’ll be easier to change when the time comes.

Let’s start by adding these values to the state (lib/pool_toy/pool_man.ex):

defmodule State do
  defstruct [
    :size, :monitors,
    worker_sup: PoolToy.WorkerSup, worker_spec: Doubler, workers: []
  ]
end

And then we still need to refactor the rest of our code to work with this change (lib/pool_toy/pool_man.ex):

def handle_info(:start_workers, %State{worker_sup: sup, worker_spec: spec, size: size} = state) do
  workers =
    for _ <- 1..size do
      new_worker(sup, spec)
    end

  {:noreply, %{state | workers: workers}}
end

# edited for brevity

defp handle_worker_exit(%State{workers: workers, worker_sup: sup, worker_spec: spec} = state, pid) do
  w = workers |> Enum.reject(& &1 == pid)
  %{state | workers: [new_worker(sup, spec) | w]}
end

Our changes so far

Awesome! Start an IEx session with iex -S mix, start the Observer, and kill some workers. Does everything work as expected? Experiment a little, and try to figure out what could be going on… And move on to the next post!


Would you like to see more Elixir content like this? Sign up to my mailing list so I can gauge how much interest there is in this type of content.


PoolToy: A (toy) process pool manager in Elixir 1.6 (part 1.7)

Managing a single pool (continued)

(This post is part of a series on writing a process pool manager in Elixir.)

We’ve now realized (in last post) that storing the references to the client monitors in a map was a bit naive: although it works fine to locate the relevant information when a client crashes, it’s not so great to handle the “worker check in” case (where we have to locate the monitor reference by worker pid instead).

You know what’s good for finding data with varying conditions? Databases. Erlang does provide a full-blown database (called Mnesia) which would actually be counter-productive for our use case: since we’ll be storing monitors, if our entire application crashes we want the monitor information to be wiped also (since it will no longer be relevant). Since Mnesia persists data to disk, we’d have to erase the obsolete monitor data ourselves in case of crash.

Instead, we can use an in-memory “database” also provided by Erlang, called ETS (for Erlang Term Storage).

Erlang Term Storage basics

First, here’s a brief description of what an ETS table is:

Data is organized as a set of dynamic tables, which can store tuples. Each table is created by a process. When the process terminates, the table is automatically destroyed. Every table has access rights set at creation.

The first element in the stored tuple will be that tuple’s key (default behavior, can be modified). Retrieving data via this key is very efficient, whereas finding data by matching the rest of the tuple requires the entire table to be searched which can be slow if the table is very large.

Now that introductions have been made, let’s get our hands dirty. The first thing we’ll need is to create a new ETS table instance when our pool manager process initializes itself. Of course, we’ll also need to change the default value of the :monitors attribute in our state, since we won’t be using maps anymore (lib/pool_toy/pool_man.ex):

defmodule State do
  defstruct [:size, :monitors, workers: []]
end

# edited for brevity

def init(size) do
  send(self(), :start_workers)
  monitors = :ets.new(:monitors, [:protected, :named_table])
  {:ok, %State{size: size, monitors: monitors}}
end

(Your keen eyes will probably have noticed that we switched the positions of the :monitors and :workers attributes in the struct on line 2. This is because struct fields with default values must come after those without.)

The magic mainly happens on line 9, where we use new/2 to create the ETS table we’ll be using to track client monitors. The first argument is the (unique) name we give to the table (which must be an atom), followed by the options. Here, the options we give are for our convenience when inspecting the application state with Observer: they are not strictly necessary in our case, just nice to have. :protected means that only the owner process can write to the table, but other processes can read from it: this means that later on we’ll be able to look at the table contents from within Observer. :protected is a default setting, and was only included here for explicitness.

:named_table works much like named processes: if the option is present, the table will be registered under the given name (first argument mentioned above, which in this case is :monitors). This will enable us to work more conveniently with the table from IEx if we wanted to as we can use the table name instead of the reference that was returned upon table creation. Having a named table also means that it will be displayed under that name in the Observer, which will be convenient in itself.

Finally, note that when using the :named_table option, the return value of new/2 will simply be the table name, instead of the table reference (the reference can be obtained from the name with whereis/1). As we’re using a named table in our application, we technically wouldn’t need to store the value returned by new/2. However, since we’re using a named table for convenience instead of design imperatives, our code will consistently use the value returned by new/2 to interact with the ETS table. This way, our code will work even if/when the table is no longer named (feel free to try that later!).

With our monitors table available for use, let’s update the check out code (lib/pool_toy/pool_man.ex):

def handle_call(:checkout, {from, _}, %State{workers: [worker | rest]} = state) do
  %State{monitors: monitors} = state
  monitor(monitors, {worker, from})
  {:reply, worker, %{state | workers: rest}}
end

# edited for brevity

def handle_info(msg, state) do
  # edited for brevity
end

defp monitor(monitors, {worker, client}) do
  ref = Process.monitor(client)
  :ets.insert(monitors, {worker, ref})
  ref
end

Easy, peasy: we just insert the {worker, ref} into our ETS table.

We also have to update the code handling dying client processes. So what happens when a monitored process dies (docs)?

Once the monitored process dies, a message is delivered to the monitoring process in the shape of:

{:DOWN, ref, :process, object, reason}

We’ll get a special message with the reference to the monitor, an object value (which will be the client’s pid in our case), and the reason the client exited (which we’re not interested in, because it makes no difference to us: regardless of why the client exited, we need to check its worker back in).

So let’s handle this new message in our code (lib/pool_toy/pool_man.ex):

def handle_info({:DOWN, ref, :process, _, _}, %State{workers: workers, monitors: monitors} = state) do
  case :ets.match(monitors, {:"$0", ref}) do
    [[pid]] ->
      true = :ets.delete(monitors, pid)
      {:noreply, state |> handle_idle_worker(pid)}
    [] ->
      {:noreply, state}
  end
end

So what does match/2 on line 2 do? It looks through the table, trying to find matches to the provided pattern. What’s a pattern?

A pattern is a term that can contain:

  • Bound parts (Erlang terms)
  • ‘_’ that matches any Erlang term
  • Pattern variables ‘$N’, where N=0,1,…

So in the case above, line 2 says: in the monitors table, return the list of results that have ref as the second element in the tuple, and put the first element of the tuple as the first element in each result. Note that :"$0" is shorthand for “turn the "$0" string into an atom”: Erlang/ETS expects pattern variables to be atoms formed of a dollar sign followed by an integer.

Line 3 matches on a [[pid]] result. Why the nested list? Well, match/2 will return a list of results. That’s the first level list. Within that results list, each single result is a list of the values corresponding to the pattern variables we provided: since we provided only a single pattern variable, each result list contains only a single element. For completeness, let’s also point out that since we have at most one monitor per client process, we don’t expect to ever have more than 1 result.

For a little more context on ETS match results, allow me to provide another example: let’s say we were storing a list of cars and their license plates in a cars ETS table, using the following tuple convention: {license_plate, make, year, color}. If we wanted to see the year and make of red cars, we could call :ets.match(cars, {:"_", :"$1", :"$0", "red"}). Notice that since we bound the year to :"$0" and the make to :"$1", and results are ordered by pattern variable number, an example result would be ["2018", "Tesla"]: the year comes before the make. As you’ll have noticed, we disregard the “license_plate” field in our query by using the :"_" pattern, so that information is absent from our result.
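If you want to play with that example, here’s a rough IEx equivalent (table name and data made up for illustration):

iex(1)> :ets.new(:cars, [:named_table])
:cars
iex(2)> :ets.insert(:cars, {"ABC-123", "Tesla", "2018", "red"})
true
iex(3)> :ets.insert(:cars, {"DEF-456", "Honda", "2015", "blue"})
true
iex(4)> :ets.match(:cars, {:"_", :"$1", :"$0", "red"})
[["2018", "Tesla"]]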

Back to our actual code: if we have a matching record in our ETS table (line 3), we delete it from the table on line 4 since it’s no longer relevant (the process just crashed). Note that we match the delete result to true, so that we crash in case the returned result is different (although in this case it isn’t so useful as delete/2 will always return true). Finally, we handle our idle worker on line 5 (we’ll implement handle_idle_worker/2 soon) and return the updated state.

If there’s no monitoring info in the database (lines 6 and 7), we don’t do anything (this could happen e.g. if the client checked in its worker process before crashing).

But let’s take a brief moment to talk about our lord and savior, the “let it crash” philosophy. It’s subtly present in how we handle the matches above: we expect to have exactly one match in the ETS table, or none at all. Therefore, that’s exactly what our case statements match for! In any other case, we clearly don’t know what’s going on, so we *want* to crash. In particular, on line 6 we match [] an explicit empty result, and not just _ which you might be tempted to have as a programmer with defensive reflexes.

Matching overly lax patterns like _ in this case is counter-productive in OTP land, as it means that our software is potentially running with a corrupted state (e.g. there are 2 entries in the ETS table) which could easily start corrupting other parts of our program as the incorrect information spreads. And once that has happened, it will be near impossible to cleanly recover from. Instead, it is much safer, simpler, and better, to nip the corruption in the bud: as soon as something unexpected happens, terminate and start fresh from a known-good state.

Let’s add the handle_idle_worker/2 function we so conveniently glossed over in the above code (lib/pool_toy/pool_man.ex):

def handle_info(msg, state) do
  # edited for brevity
end

defp handle_idle_worker(%State{workers: workers} = state, idle_worker) when is_pid(idle_worker) do
  %{state | workers: [idle_worker | workers]}
end

We still have to get around to why we introduced an ETS table in the first place: demonitoring client processes when they check their workers back in to the pool. First, let’s update the check in code (lib/pool_toy/pool_man.ex):

def handle_cast({:checkin, worker}, %State{monitors: monitors} = state) do
  case :ets.lookup(monitors, worker) do
    [{pid, ref}] ->
      Process.demonitor(ref)
      true = :ets.delete(monitors, pid)
      {:noreply, state |> handle_idle_worker(pid)}
    [] ->
      {:noreply, state}
  end
end

This time around, we conveniently have the worker pid provided, which is used as the key in our ETS table (being the first value in the tuples we store). We can therefore use lookup/2 to get all information related to the monitor. Notice that lookup/2 will return a list of tuples matching the key (i.e. we don’t have a nested list).
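Continuing the hypothetical cars table from earlier, you can see the difference in result shape compared to match/2:

iex(5)> :ets.lookup(:cars, "ABC-123")
[{"ABC-123", "Tesla", "2018", "red"}]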

We’ve got a bunch of things going here, but nothing’s too complicated:

  • stop monitoring the client: the worker has been returned to the pool, so we no longer care to know if the client has crashed (line 4);
  • delete the monitor entry in the ETS table, since the monitor has been demonitored (line 5);
  • return the updated state, after handling the newly idle worker pid (line 6).

If there’s no entry in the ETS table for the given pid, we don’t do anything: the pid for alive workers would always be in the ETS table. If the pid that’s being checked in isn’t there, it means it’s not valid: it could belong to a dead worker, or any random process. In any case, we don’t want invalid pids to be added to the list of available workers in our state!

Our changes so far

Let’s take our code for a spin:

iex(1)> me = self()
#PID<0.140.0>
iex(2)> client = spawn(fn ->
...(2)> Process.send(me, {:worker, PoolToy.PoolMan.checkout()}, [])
...(2)> receive do
...(2)> :die -> :ok
...(2)> end
...(2)> end)
#PID<0.148.0>
iex(3)> worker = receive do
...(3)> {:worker, w} -> w
...(3)> end
#PID<0.137.0>
iex(4)> :observer.start()
:ok

This is the first part of the code that proved problematic in the last post. In the Observer, go to the “Table Viewer” tab. What can we see there? A “monitors” entry, which is the name of the ETS table we’re using. Double-click it to see its contents. As you can see, we’ve got an entry in there for our checked out worker. If you take a look at the state for PoolMan in the “Applications” tab, you’ll see we’ve got only 2 remaining workers as expected. Let’s continue:

iex(5)> PoolToy.PoolMan.checkin(worker)
:ok

Look in the Observer again: the monitors table is now empty, and the PoolMan state has 3 workers available. Continuing:

iex(6)> worker = PoolToy.PoolMan.checkout()                                   
#PID<0.137.0>
iex(7)> Process.send(client, :die, [])
:ok

And our application state is still correct: we’ve got a single entry in the monitors table, and 2 available workers. In other words, killing the client process didn’t affect anything because the worker it was using had been checked in before it died.

So we’re done with our single pool implementation, right? Nope. We’ve still got a small issue to iron out. Try poking around in Observer killing processes in various application states (you can kill a process in the “Applications” tab by right-clicking on the process and selecting the last “kill process” option). Does the pool manager state remain correct in all cases? Can you identify the problem? Check your answer in the next post.


Would you like to see more Elixir content like this? Sign up to my mailing list so I can gauge how much interest there is in this type of content.


PoolToy: A (toy) process pool manager in Elixir 1.6 (part 1.6)

Managing a single pool (continued)

(This post is part of a series on writing a process pool manager in Elixir.)

So we’ve got a working pool we can check worker processes in/out of (previous post). However, we’re not handling dying client processes, resulting in not properly freeing up worker processes when clients die.

Babysitting processes with monitors

The first thing we need to do is think about how client and worker processes should affect one another upon death:

  • if a client dies, we want the pool manager to know so it can put the associated worker process back into the pool of available workers;
  • if a worker dies, we don’t want the client to be affected (in particular, we don’t want to kill the client).

This guides us to using a monitor for the pool manager to keep an eye on the client process’ health (as opposed to a link, which we’ll cover in part 1.8). Before returning an available worker to a client process, we will monitor the client: if it dies, we’ll return the assigned worker pid back into the pool.

Up until now, if a client requested a worker and we had one available, we handled the case like so:

def handle_call(:checkout, _from, %State{workers: [worker | rest]} = state) do
  {:reply, worker, %{state | workers: rest}}
end

But now that we’re concerning ourselves with the well-being of client processes, we need to monitor them. Monitoring a process is straightforward: Process.monitor(pid) will create a unique reference and monitor the process. When the process dies, we’ll get a message containing that same unique reference (so we can easily determine which process died).
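If you’d like to see a monitor in action before we wire it into the pool manager, here’s a quick IEx experiment (pids and references will differ):

iex(1)> pid = spawn(fn -> receive do :stop -> :ok end end)
#PID<0.110.0>
iex(2)> ref = Process.monitor(pid)
#Reference<0.2354102003.2533359619.53157>
iex(3)> send(pid, :stop)
:stop
iex(4)> flush()
{:DOWN, #Reference<0.2354102003.2533359619.53157>, :process, #PID<0.110.0>, :normal}
:ok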

So let’s monitor client processes, and store the references in our state (lib/pool_toy/pool_man.ex):

def handle_call(:checkout, {from, _}, %State{workers: [worker | rest]} = state) do
  %State{monitors: monitors} = state
  ref = Process.monitor(from)
  monitors = monitors |> Map.put(ref, worker)
  {:reply, worker, %{state | workers: rest, monitors: monitors}}
end

Line 3 is where the magic happens: we create a new monitor to track the client process. You’ll note that we’re now actually interested in the mysterious “from” value provided as the second argument to handle_call/3 which the documentation tells us is

a 2-tuple containing the caller’s PID and a term that uniquely identifies the call

At this time we’re not interested in the unique identifier, only the caller’s pid: it’s the calling client process we want to monitor! Right, so we’re monitoring the client with the call on line 3, and we need to store this monitor somewhere in the state. To achieve that, we get the current collection of monitors from the state (line 2), add the new monitor to it on line 4, and finally return the updated state.

You may be wondering why I’m again matching on state values on line 2 when I already have some matching going on in the function head. For more on that, see here.

Naturally, in order to be able to keep track of our monitors in the state, we need to add a monitors attribute to it:

defmodule State do
  defstruct [:size, workers: [], monitors: %{}]
end

At this point, we’re monitoring the client, but we’re still doing absolutely nothing if the client dies… Now that the client is monitored, what happens to it when it dies? (Documentation)

Once the monitored process dies, a message is delivered to the monitoring process in the shape of:

{:DOWN, ref, :process, object, reason}

Ok, so we need to handle that incoming message (lib/pool_toy/pool_man.ex):

def handle_info({:DOWN, ref, :process, _, _}, %State{workers: workers, monitors: monitors} = state) do
  workers =
    case Map.get(monitors, ref) do
      nil -> workers
      worker -> [worker | workers]
    end

  monitors = Map.delete(monitors, ref)

  {:noreply, %{state | workers: workers, monitors: monitors}}
end

def handle_info(msg, state) do
  # edited for brevity
end

When we receive a message notifying us that a monitored process died (line 1), we look up the corresponding monitor reference (lines 3-6). If we find one, we return the associated worker pid to the pool. But hold on: if we always add a monitor when a worker gets checked out (and only remove it when it’s checked in again, although we haven’t implemented that part yet), how could we end up with a client crashing without an associated monitor reference?

Because processes can crash anytime, including when it’s inconvenient ;). Consider the following:

  1. Client checks out worker, and the pool manager starts monitoring the client;
  2. Client checks in worker, the pool manager stops monitoring the client, and the worker is then returned to the pool;
  3. Client crashes.

Wait a minute. This looks fine! On step 2 we stop monitoring the client, so when it crashes in step 3, we shouldn’t get a message about the crash, right?

Except we’re dealing with a gen server, where all communication happens via messages and isn’t instantaneous: messages pile up in the server mailbox, and get processed in order. Here’s what could be happening in reality, as opposed to the “mental shortcut” we have above:

  1. Client requests a worker;
  2. A checkout message is sent to the server;
  3. The server processes the checkout request: it starts monitoring the client, and returns a worker pid to the client;
  4. Client wants to check a process back in;
  5. A checkin message is sent to the server (with the worker pid);
  6. The server processes the checkin request: it stops monitoring the client, and returns the worker to the pool;
  7. Client crashes;
  8. A :DOWN message is sent to the server.

Hmm. Still don’t see a problem here… Ah yes: in the real world, several client processes will be interacting with the server. Therefore, the server’s mailbox will contain a bunch of messages from other clients that it will have to process in order. Let’s try this again:

  1. Client requests a worker;
  2. A checkout message is sent to the server;
  3. The server processes the checkout request: it starts monitoring the client, and returns a worker pid to the client;
  4. Client wants to check a process back in;
  5. A checkin message is sent to the server (with the worker pid);
  6. Server is busy processing messages from other clients, doesn’t get around to processing the checkin message for the time being;
  7. Client crashes;
  8. A :DOWN message is sent to the server;
  9. The server has worked through the backlog of messages in its mailbox and finally processes the checkin request: it stops monitoring the client, and returns the worker to the pool;
  10. The server processes more messages in the mailbox (if they came before the :DOWN message);
  11. The server processes the :DOWN message: at this point, there is no active monitor on the (dead) client process, because it was removed when the server processed the checkin message.

It’s important to remember that when a process crashes, any messages it sent out will stay around: they don’t get magically deleted from the mailboxes in other processes. In other words, any time you’re processing a message it could be from a process that is no longer alive by the time you read it.

Our changes so far

Ok, let’s check whether our code now solves our immediate problem:

iex(1)> client = spawn(fn ->
...(1)>   PoolToy.PoolMan.checkout()
...(1)>   receive do
...(1)>     :die -> :ok
...(1)>   end
...(1)> end)
#PID<0.126.0>
iex(2)> Process.alive? client
true
iex(3)> :observer.start()
:ok

We’re going to use a fancier spawned process this time around: instead of dying immediately, it’s going to wait for a :die message before terminating.

Navigate to the “State” tab for the PoolMan process, and you’ll see that we’ve got a monitor reference in the state, and only 2 workers are available. Let’s kill the client and see what happens:

iex(4)> Process.send(client, :die, [])
:ok

We’ve once again got 3 available workers, and no monitor references. Hooray! (Note that you’ll need to close the window for the PoolMan process and reopen it to see the changes: it doesn’t refresh itself.)

There’s still a problem, though. See if you can tell what it is in the Observer (don’t read further than this code sample before thinking about the problem and exploring it for yourself). Kill your previous IEx session if you still have it open, and start a new one:

iex(1)> me = self()
#PID<0.119.0>
iex(2)> client = spawn(fn ->
...(2)>   Process.send(me, {:worker, PoolToy.PoolMan.checkout()}, [])
...(2)>   receive do
...(2)>     :die -> :ok
...(2)>   end
...(2)> end)
#PID<0.127.0>
iex(3)> worker = receive do
...(3)>   {:worker, w} -> w
...(3)> end
#PID<0.116.0>
iex(4)> PoolToy.PoolMan.checkin(worker)
:ok
iex(5)> worker = PoolToy.PoolMan.checkout()
#PID<0.116.0>
iex(6)> :observer.start()
:ok
iex(7)> Process.send(client, :die, [])
:ok

What happens here is that we spawn a process that checks a worker out, and sends it to the IEx process (expression 2). We then retrieve the worker pid (expr 3), and check the worker back in (expr 4). Then we check a worker out again (expr 5), which is the same worker as we had earlier and checked back in. In expression 7, we kill the process we had originally spawned.

Although this spawned process no longer has a worker process (it was returned to the pool in expression 4), killing the spawned client process results in returning the worker from expression 5 back to the pool of available workers, even though we weren’t done with it (since we hadn’t checked it back in yet). Another related issue is that at expression 6 within the Observer you can see that the pool manager’s state has 2 different monitor references active for the same pid. Oops!

The fix, of course, is to stop monitoring client processes when the worker process they borrowed is checked back in to the pool. We’ll do that in the next post.


Would you like to see more Elixir content like this? Sign up to my mailing list so I can gauge how much interest there is in this type of content.


PoolToy: A (toy) process pool manager in Elixir 1.6 (part 1.5)

Managing a single pool (continued)

(This post is part of a series on writing a process pool manager in Elixir.)

After last post, we’ve got a pretty fancy-looking pool:

Unfortunately, we don’t have any sort of mechanism to make use of the worker processes. We need to let client processes check out a worker, do some work with it, and check the worker back in once the client is done with it.

Implementing worker checkin and checkout

In lib/pool_toy/pool_man.ex:

def checkout() do
  GenServer.call(@name, :checkout)
end

def checkin(worker) do
  GenServer.cast(@name, {:checkin, worker})
end

def init(size) do
  # truncated for brevity
end

def handle_call(:checkout, _from, %State{workers: []} = state) do
  {:reply, :full, state}
end

def handle_call(:checkout, _from, %State{workers: [worker | rest]} = state) do
  {:reply, worker, %{state | workers: rest}}
end

def handle_cast({:checkin, worker}, %State{workers: workers} = state) do
  {:noreply, %{state | workers: [worker | workers]}}
end

Pretty easy, right? Add checkout/0 and checkin/1 to the API (lines 1-7), and implement the matching server-side functions (lines 13-23). When a client wants to check out a worker, we reply :full if none are available (line 14), and otherwise we provide the pid of the first available worker (line 18). When checking in a worker, we simply add that pid to the list of available worker pids (line 22). Note that since all workers are equal, there is no need to differentiate them and we can therefore always take/return pids from the head of the workers list (i.e. workers at the head of the list will be used more often, but we don’t care in this case).

But wait, how come this time around we didn’t implement catch-all clauses for handle_call/3 and handle_cast/2? After all, we had to add one for handle_info/2 back in part 1.4, why would these be different? Here’s what the getting started guide has to say about it:

Since any message, including the ones sent via send/2, go to handle_info/2, there is a chance unexpected messages will arrive to the server. Therefore, if we don’t define the catch-all clause, those messages could cause our [pool manager] to crash, because no clause would match. We don’t need to worry about such cases for handle_call/3 and handle_cast/2 though. Calls and casts are only done via the GenServer API, so an unknown message is quite likely a developer mistake.

In other words, this is the “let it crash” philosophy in action: we shouldn’t receive any unexpected messages in calls or casts. But if we do, it means something went wrong and we should simply crash.

Our changes so far

We can now fire up IEx with iex -S mix and try out our worker pool:

iex(1)> w1 = PoolToy.PoolMan.checkout()
#PID<0.116.0>
iex(2)> w2 = PoolToy.PoolMan.checkout()
#PID<0.117.0>
iex(3)> w3 = PoolToy.PoolMan.checkout()
#PID<0.118.0>
iex(4)> w4 = PoolToy.PoolMan.checkout()
:full
iex(5)> PoolToy.PoolMan.checkin(w1)
:ok
iex(6)> w4 = PoolToy.PoolMan.checkout()
#PID<0.116.0>
iex(7)> Doubler.compute(w4, 21)
Doubling 21
42

Great success! We were able to check out workers until the pool told us it was :full, but then as soon as we checked in a worker (and the pool was therefore no longer full), we were once again able to check out a worker.

So we’re done with the basic pool implementation, right? No. When working in the OTP world, we need to constantly think about how processes failing will impact our software, and how it should react.

In this case for example, what happens if a client checks out a worker, but dies before it can check the worker back in? We’ve got a worker that isn’t doing any work (because the client died), but still isn’t available for other clients to check out because it was never returned to the pool.

But don’t take my word for it, let’s verify the problem in IEx (after either starting a new session, or checking all the above workers back into the pool):

iex(1)> client = spawn(fn -> PoolToy.PoolMan.checkout() end)
#PID<0.121.0>
iex(2)> Process.alive? client
false
iex(3)> :observer.start()

Within the Observer, if you go to the applications tab (selecting pool_toy on the left if it isn’t already displayed) and double-click PoolMan, then navigate to the “state” tab in the newly opened window, you can see that only 2 workers are available. In other words, even though the worker we checked out is no longer in use by the client (since that process died), the worker process will never be assigned any more work because it was never checked back into the pool.

How can we handle this problem? How about babysitting? Coming right up next!


Would you like to see more Elixir content like this? Sign up to my mailing list so I can gauge how much interest there is in this type of content.


PoolToy: A (toy) process pool manager in Elixir 1.6 (part 1.4)

Managing a single pool (continued)

(This post is part of a series on writing a process pool manager in Elixir.)

After the last part, we’ve made some headway but still need some workers:

More useful state

Let’s start by enhancing the state in the pool manager to be a struct: we’re going to have to remember a bunch of things to manage the pool, so a struct will be handy. In lib/pool_toy/pool_man.ex:

defmodule PoolToy.PoolMan do
  use GenServer

  defmodule State do
    defstruct [:size]
  end

  @name __MODULE__

  def start_link(size) when is_integer(size) and size > 0 do
    # edited for brevity
  end

  def init(size) do
    {:ok, %State{size: size}}
  end
end

Starting workers

Our pool manager must now get some worker processes started. Let’s begin with attempting to start a single worker dynamically, from within init/1. Look at the docs for DynamicSupervisor and see if you can figure out how to start a child.

Here we go (still in lib/pool_toy/pool_man.ex):

def init(size) do
  DynamicSupervisor.start_child(PoolToy.WorkerSup, Doubler)
  {:ok, %State{size: size}}
end

start_child/2 requires the dynamic supervisor location and a child spec. Since we’ve named our worker supervisor, we reuse the name to locate it here (but giving its pid as the argument would have worked also). Child specs can be provided the same way as for a normal supervisor: complete child spec map, tuple with arguments, or just a module name (if we don’t need any start arguments). We’ve gone with the latter.

Let’s try it out in IEx. When running iex -S mix we get:

Erlang/OTP 20 [erts-9.3]  [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:10] [hipe] [kernel-poll:false]

Compiling 6 files (.ex)
Generated pool_toy app

08:52:07.275 [info]  Application pool_toy exited: PoolToy.Application.start(:normal, []) returned an error: shutdown: failed to start child: PoolToy.PoolSup
    ** (EXIT) shutdown: failed to start child: PoolToy.PoolMan
        ** (EXIT) exited in: GenServer.call(PoolToy.WorkerSup, {:start_child, {{Doubler, :start_link, [[]]}, :permanent, 5000, :worker, [Doubler]}}, :infinity)
            ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
** (Mix) Could not start application pool_toy: PoolToy.Application.start(:normal, []) returned an error: shutdown: failed to start child: PoolToy.PoolSup
    ** (EXIT) shutdown: failed to start child: PoolToy.PoolMan
        ** (EXIT) exited in: GenServer.call(PoolToy.WorkerSup, {:start_child, {{Doubler, :start_link, [[]]}, :permanent, 5000, :worker, [Doubler]}}, :infinity)
            ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started

Oops. What’s going on? Well, let’s read through the error messages:

  1. The pool toy application couldn’t start, because:
  2. The pool supervisor couldn’t start, because:
  3. The pool manager couldn’t start, because:
  4. It tries calling a GenServer process called PoolToy.WorkerSup, which fails because:
  5. the process is not alive or there’s no process currently associated with the given name, possibly because its application isn’t started.

Wait a minute: the worker supervisor should be up and running with that name, since it was started by the pool supervisor! Let’s take a look at that code and see how the pool supervisor goes about its job of starting its children (lib/pool_toy/pool_sup.ex):

def init(args) do
    pool_size = args |> Keyword.fetch!(:size)

    children = [
      {PoolToy.PoolMan, pool_size},
      PoolToy.WorkerSup
    ]

    Supervisor.init(children, strategy: :one_for_all)
  end

Right, so on line 9 the pool supervisor starts its children. Which children? The ones defined on lines 4-7. And how are they started? The docs say

When the supervisor starts, it traverses all child specifications and then starts each child in the order they are defined.

So in our code, we start the pool manager and then move on to the worker supervisor. But since the pool manager expects the worker supervisor to already be up and running when the pool manager initializes itself (since it calls the worker sup), our code blows up.

Put another way, our software design has taken the position that the worker supervisor will ALWAYS be available if the pool manager is running. However, in our current implementation we’ve failed this guarantee quite spectacularly. (Take a look at what Fred Hébert has to say about supervisor start up and guarantees.)

The fix is easy enough: tell the pool supervisor to start the worker supervisor before the manager (lib/pool_toy/pool_sup.ex).

children = [
  PoolToy.WorkerSup,
  {PoolToy.PoolMan, pool_size}
]

Try running PoolToy in IEx again, and not only does it work, you can actually see a baby worker in the Observer! Huzzah!

As a quick side note, the current example is quite simple and straightforward but initialization order needs to be properly thought through in your OTP projects. Once again, take a look at Fred’s excellent writing on the subject.

With our first worker process created, let’s finish the job and actually start enough workers to fill the requested pool size. Let’s get started with the naive approach (lib/pool_toy/pool_man.ex):

defmodule State do
  defstruct [:size, workers: []]
end

def init(size) do
  start_worker = fn _ ->
    {:ok, pid} = DynamicSupervisor.start_child(PoolToy.WorkerSup, Doubler)
    pid
  end

  workers = 1..size |> Enum.map(start_worker)

  {:ok, %State{size: size, workers: workers}}
end

Pretty straightforward, right? Since we’re going to need to keep track of the worker pids (to know which ones are available), we add a :workers attribute (initialized to an empty list) to our state on line 2. On lines 6-9 we define a helper function that starts a worker and returns its pid. You’ll note that with the pattern match on line 7 we expect the worker creation to always succeed: if anything but an “ok tuple” is returned, we want to crash. Finally, on line 11 we start the requested number of workers, and add them to the state on line 13.

If you take this for a run in IEx and peek at the state for the PoolMan process, you’ll see we indeed have pids in the :workers state property, and that they match the ones visible in the application tab. Great!

Our changes so far

Deferring initialization work

Actually, it’s not so great: GenServer initialization is synchronous (start_link only returns once init/1 has finished), yet we’re doing a bunch of work in there. This could tie up the calling process and contribute to making our project unresponsive: the init/1 callback should always return as soon as possible.

So what are we to do, given we need to start the worker processes when the pool manager is starting up? Well, we have the server process send a message to itself to start the workers. That way, init/1 will return right away, and will then process the next message in its mailbox which should be the one telling it to start the worker processes. Here we go (lib/pool_toy/pool_man.ex):

def init(size) do
  send(self(), :start_workers)
  {:ok, %State{size: size}}
end

def handle_info(:start_workers, %State{size: size} = state) do
  start_worker = fn _ ->
    {:ok, pid} = DynamicSupervisor.start_child(PoolToy.WorkerSup, Doubler)
    pid
  end

  workers = 1..size |> Enum.map(start_worker)

  {:noreply, %{state | workers: workers}}
end

def handle_info(msg, state) do
  IO.puts("Received unexpected message: #{inspect(msg)}")
  {:noreply, state}
end

Take a good long look at this code, and try to digest it. Think about why each line/function is there.

First things first: in init/1, we have the process send a message to itself. Remember that init/1 is executed within the server process of a gen server, so send(self(), ...) will send the given message to the server process’ own mailbox.

Although we’re sending the message immediately, it’s important to realize that nothing (e.g. starting new workers) actually gets done at this point: the only thing happening is a message being sent. After sending the message, init/1 is free to do whatever else it needs to before returning {:ok, state}. Only after init/1 has returned will the server process look in its mailbox and see the :start_workers message we sent (assuming no other message came in before it).

Side note: if there’s a possibility in your project that a message could be sent to the gen server before it is fully initialized, you can have it respond with e.g. {:error, :not_ready} (in handle_call, etc.) when the state indicates that prerequisites are missing. It is then up to the client to handle this case when it presents itself.
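Purely as a hypothetical sketch (our pool manager won’t need this; :do_work and the :ready flag are made-up names for illustration), such a clause could look something like:

def handle_call(:do_work, _from, %{ready: false} = state) do
  # Prerequisites are missing: let the client decide how to react.
  {:reply, {:error, :not_ready}, state}
end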

Ok, onwards: we’ve sent a message to the server process so we need code to handle it. If a message is simply sent to a gen server process (as opposed to using something like GenServer.call/3), it will be treated as an “info” message. So on lines 6-15 we handle that info message, start the worker processes, and store their pids in the server state.

But why did we define another handle_info function on lines 17-20?

A short primer on process mailboxes

Each process in the BEAM has its own mailbox, where it receives messages sent to it and from which it can read using receive/1. But as you may recall, receive/1 will pattern match on the message, and process the first message it finds that matches a pattern. What about the messages that don’t match any pattern? They stay in the mailbox.

Every time receive/1 is invoked, the mailbox is consulted, starting with the oldest message and moving chronologically. So if messages don’t match clauses in receive/1 they will just linger in the mailbox and be checked again and again for matching clauses. This is bad: not only does it slow your process down (it’s spending a lot of time needlessly checking messages that will never match), but you also run the risk of running out of memory for the process’ mailbox as the list of unprocessed messages grows until finally the process is killed (and all unread messages are lost).
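Here’s a minimal illustration you can paste into an IEx session (the shell process plays the role of our server): only the message matching the pattern is consumed, while the other one lingers in the mailbox, as Process.info/2 shows.

send(self(), {:job, 1})
send(self(), :unexpected)

receive do
  {:job, n} -> IO.puts("got job #{n}")
end

# :unexpected was never consumed and is still sitting in the mailbox
Process.info(self(), :messages)
# => {:messages, [:unexpected]}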

To prevent this undesirable state of affairs, it is good practice to have a “catch all” clause in handle_info/2 that logs the fact that an unexpected message was received while also removing it from the process’ mailbox, so that unprocessed messages can’t pile up.

When we use GenServer, default catch all implementations of handle_info/2 (as well as handle_call/3 and handle_cast/2, which we’ll meet later) are created for us and added to the module. But as soon as we define a function with the same name (e.g. we write a handle_info/2 function as above) it will override the one created by the use statement. So when we write a handle_info/2 message-handling function in a gen server, we need to follow through and write a catch all version also.

Since these unexpected messages shouldn’t have been sent to this process, in production code we would most likely log them as errors using the Logger instead of just printing a message.
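For example, the catch-all clause could be rewritten along these lines (just a sketch; the exact wording and log level are up to you):

# near the top of the module
require Logger

def handle_info(msg, state) do
  Logger.error("Received unexpected message: #{inspect(msg)}")
  {:noreply, state}
end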

Improving the internal API

Right now we’re starting workers like so (lib/pool_toy/pool_man.ex):

{:ok, pid} = DynamicSupervisor.start_child(PoolToy.WorkerSup, Doubler)

This isn’t great, because it means that our pool manager needs relatively intimate knowledge about the internals of the worker supervisor (namely, that it is a dynamic supervisor, and not a plain vanilla one). It also means that if we change how children get started (e.g. by setting a default config), we’d have to make that change everywhere that requests new workers to be started.

Let’s move the knowledge of how to start workers where it belongs: within the WorkerSup module (lib/pool_toy/worker_sup.ex):

defmodule PoolToy.WorkerSup do
  use DynamicSupervisor

  @name __MODULE__

  def start_link(args) do
    DynamicSupervisor.start_link(__MODULE__, args, name: @name)
  end

  defdelegate start_worker(sup, spec), to: DynamicSupervisor, as: :start_child

  def init(_arg) do
    DynamicSupervisor.init(strategy: :one_for_one)
  end
end

I’ve reproduced the whole module here, but the only change is on line 10: we define a start_worker/2 function, which in effect proxies the call to DynamicSupervisor.start_child/2. If you’re not familiar with the defdelegate macro, you can learn more about it here, but in essence the line we’ve added is equivalent to:

def start_worker(sup, spec) do
  DynamicSupervisor.start_child(sup, spec)
end

With this addition in place, we can update our pool manager code to make use of it (lib/pool_toy/pool_man.ex):

def handle_info(:start_workers, %State{size: size} = state) do
  workers =
    for _ <- 1..size do
      {:ok, pid} = PoolToy.WorkerSup.start_worker(PoolToy.WorkerSup, Doubler)
      pid
    end

  {:noreply, %{state | workers: workers}}
end

Our changes so far

At this stage in our journey, we’ve got a pool with worker processes. However, right now clients can’t check out worker processes to do work with them. Let’s get to that next!


Would you like to see more Elixir content like this? Sign up to my mailing list so I can gauge how much interest there is in this type of content.


PoolToy: A (toy) process pool manager in Elixir 1.6 (part 1.3)

Managing a single pool (continued)

(This post is part of a series on writing a process pool manager in Elixir.)

In the previous post, we managed to get our pool supervisor to start the pool manager. Now, our next objective is to have the pool supervisor start the worker supervisor, and we’ll be one step closer to our intermediary goal:

Before we just dive in, let’s think about our worker supervisor a bit and how it differs from the pool supervisor. The worker supervisor’s children are all going to be the same process type (i.e. all workers based on the same module), whereas the pool supervisor has different children. Further, the number of children under the worker supervisor’s care will vary: in the future, we’ll want to be able to spin up temporary “overflow” workers to handle bursts in demand. These features make it a great fit for the DynamicSupervisor.

The worker supervisor

Here’s our bare-bones implementation of the worker supervisor (lib/pool_toy/worker_sup.ex):

defmodule PoolToy.WorkerSup do
  use DynamicSupervisor

  @name __MODULE__

  def start_link(args) do
    DynamicSupervisor.start_link(__MODULE__, args, name: @name)
  end

  def init(_arg) do
    DynamicSupervisor.init(strategy: :one_for_one)
  end
end

This should look familiar: it’s not very different from the code we wrote for the pool supervisor. The main differences, aside from using DynamicSupervisor on line 2, are the supervision strategy on line 11 (DynamicSupervisor only supports :one_for_one at this time) and the fact that we don’t initialize any children. Dynamic supervisors always start with no children and add them later dynamically (hence their name).
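You can see this for yourself once the worker supervisor is running (a quick sketch, just peeking ahead: we’ll use DynamicSupervisor.start_child/2 for real in the next post, and the pids will obviously differ in your session): it starts out with no children, and children only appear when we explicitly ask for them.

iex(1)> PoolToy.WorkerSup.start_link([])
{:ok, #PID<0.118.0>}
iex(2)> DynamicSupervisor.which_children(PoolToy.WorkerSup)
[]
iex(3)> DynamicSupervisor.start_child(PoolToy.WorkerSup, Doubler)
{:ok, #PID<0.120.0>}
iex(4)> DynamicSupervisor.which_children(PoolToy.WorkerSup)
[{:undefined, #PID<0.120.0>, :worker, [Doubler]}]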

We still need our pool supervisor to start the worker supervisor (lib/pool_toy/pool_sup.ex):

def init(args) do
    pool_size = args |> Keyword.fetch!(:size)
    children = [
      {PoolToy.PoolMan, pool_size},
      PoolToy.WorkerSup
    ]

    Supervisor.init(children, strategy: :one_for_all)
  end

Since this time around we don’t need to pass any special values for the initialization, we don’t need a tuple and can just pass in the module name directly. A quick detour in IEx to try everything out:

iex -S mix
PoolToy.PoolSup.start_link(size: 4)
:observer.start()

In the “Processes” tab, you can see we’ve got entries for PoolSup, PoolMan, and WorkerSup. In addition, if you double click on the pool manager and worker supervisor entries and navigate to their “State” tabs, you’ll see that they both have the pool supervisor as their parent. Great (intermediary) success!

Our changes so far

Throwing the OTP application into the mix

Poking around the Observer using the process list has been helpful, but it’d be even better if we could have a more visual representation of the hierarchy between our processes. You may have noticed the “applications” tab in the Observer: it has pretty charts of process hierarchies, but our processes are nowhere to be found. It turns out that’s because right now our code is just a bunch of processes: it’s not an actual OTP application.

First of all, what is an OTP application? Well, it’s probably not the same size/scope as the application you’re thinking of: OTP applications are more like components, in that they’re bundles of reusable code that get started/stopped as a unit. In other words, the software you build in OTP (whether it’s with Elixir, Erlang, or something else) will nearly always contain or depend on several OTP applications.
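To make this a bit more concrete: in any IEx session you can list the OTP applications that are currently running (a sketch; the exact list and versions depend on your setup, and the output below is truncated). Once we’re done with this section, :pool_toy will show up in that list too.

iex(1)> Application.started_applications()
[
  {:logger, 'logger', '1.6.4'},
  {:mix, 'mix', '1.6.4'},
  {:iex, 'iex', '1.6.4'},
  {:elixir, 'elixir', '1.6.4'},
  ...
]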

To turn our project into an application, we need to first write the application module callback (lib/pool_toy/application.ex):

defmodule PoolToy.Application do
  use Application

  def start(_type, _args) do
    children = [
      {PoolToy.PoolSup, [size: 3]}
    ]

    opts = [strategy: :one_for_one]
    Supervisor.start_link(children, opts)
  end
end

Pretty straightforward, right? This code once again closely resembles the code we’ve been writing thus far, except we’re using Application on line 2 and the callback we needed to implement is start/2.

You’ll probably have noticed that we’ve hard-coded a pool size of 3 on line 6. That’s just temporary to get us up and running: in our final implementation, the application will start a top-level supervisor and we’ll once again be able to specify pool sizes as we create them.

So now that we’ve defined how our application should be started, we still need to wire it up within mix so it gets started automatically (mix.exs):

def application do
    [
      mod: {PoolToy.Application, []},
      registered: [
        PoolToy.PoolSup,
        PoolToy.PoolMan,
        PoolToy.WorkerSup
      ],
      extra_applications: [:logger]
    ]
  end

Line 3 is where the magic happens: we specify the module that is the entry point for our application, as well as the start argument. Here, that means the application gets started by calling the start/2 callback in PoolToy.Application and providing it [] as the second argument (the first argument is used to specify how the application is started, which is useful in more complex failover/takeover scenarios which won’t be covered here).

You can safely ignore the registered key/value here: I’ve only included it for completeness. It’s essentially a list of named processes our application will register. The Erlang runtime uses this information to detect name collisions. If you leave it out, it will default to [] and your app will still work.

Our changes so far

Start an IEx session once again (using iex -S mix, remember?), fire up the Observer with :observer.start(), and check out what we can see in the “applications” tab:

Pretty sweet, right? (If you don’t see this, click on “pool_toy” on the left.)

Now that we’ve got a visual representation of what the relationships between our processes look like, let’s mess around with them a bit… Double click on each of our processes, take note of their pids (visible in the window’s title bar), and close their windows. Now, right click on PoolMan and select “Kill process” (press “ok” when prompted for the exit reason). Look at the pid for PoolMan by double clicking on it again: it’s different now, since it was restarted by PoolSup. If you look at WorkerSup‘s pid, you’ll see it’s also different: PoolSup restarted it as well, because we told it to use a :one_for_all strategy. Ah, the magic of Erlang’s supervision trees…

Starting new projects as OTP applications

We went about writing our application in a bit of a roundabout way: we wrote our code, and introduced the OTP application when we needed/wanted it by writing the application callback module ourselves and adding it to mix.exs.

Naturally, Elixir can help us out here when we’re starting a new project: passing the --sup option to mix new will create a skeleton of an OTP application with a supervision tree (docs). So we could have saved ourselves some work by starting our project with

mix new pool_toy --sup

The more you know!

Our application is starting to look like the figure at the top of the post, but we still need actual worker processes. Let’s get to that next!


Would you like to see more Elixir content like this? Sign up to my mailing list so I can gauge how much interest there is in this type of content.


PoolToy: A (toy) process pool manager in Elixir 1.6 (part 1.2)

Managing a single pool (continued)

(This post is part of a series on writing a process pool manager in Elixir.)

Continuing from the previous post, we’re working on having a single pool of processes that we can check out to do some work with. When we’re done, it should look like this:

The pool manager

Our pool manager is going to be a GenServer: it needs to be a process so it can maintain its state (e.g. tracking which worker processes are checked out). Let’s start with a super simplified version (lib/pool_toy/pool_man.ex):

defmodule PoolToy.PoolMan do
  use GenServer

  @name __MODULE__

  def start_link(size) when is_integer(size) and size > 0 do
    GenServer.start_link(__MODULE__, size, name: @name)
  end

  def init(size) do
    state = List.duplicate(:worker, size)
    {:ok, state}
  end
end

start_link/1 on line 6 takes a size indicating the number of workers we want to have in the pool. At this time, our pool manager is basically a trivial GenServer whose state consists of a list of identical atoms. These atoms represent workers we’ll implement later: for now, we’ll just use them as simple stand-ins.

Let’s try it out and make sure it’s behaving as intended:

iex -S mix

iex(1)> PoolToy.PoolMan.start_link(3)
{:ok, #PID<0.112.0>}
iex(2)> :observer.start()
:ok

In the Observer, navigate to the pool manager process, and take a look at its state. Having trouble? Follow the steps in part 1.1. You can see that the process’ state is [worker, worker, worker] as expected (don’t forget, those are Erlang’s representation of atoms). Yay!

Our changes so far

And now, for our next trick, let’s have the pool supervisor start the pool manager upon startup.

Starting supervised children

This is what our pool supervisor currently has (lib/pool_toy/pool_sup.ex):

defmodule PoolToy.PoolSup do
  use Supervisor

  @name __MODULE__

  def start_link() do
    Supervisor.start_link(__MODULE__, [], name: @name)
  end

  def init([]) do
    children = []

    Supervisor.init(children, strategy: :one_for_all)
  end
end

We’ve got this convenient children value on line 11, so let’s throw the pool manager in there and see what happens: children = [PoolToy.PoolMan]. Within an IEx session (again, started with iex -S mix), try calling PoolToy.PoolSup.start_link() and you’ll get the following result:

** (EXIT from #PID<0.119.0>) shell process exited with reason: shutdown: failed to start child: PoolToy.PoolMan
    ** (EXIT) an exception was raised:
    ** (FunctionClauseError) no function clause matching in PoolToy.PoolMan.start_link/1
        (pool_toy) lib/pool_toy/pool_man.ex:6: PoolToy.PoolMan.start_link([])
        (stdlib) supervisor.erl:365: :supervisor.do_start_child/2
        (stdlib) supervisor.erl:348: :supervisor.start_children/3
        (stdlib) supervisor.erl:314: :supervisor.init_children/2
        (stdlib) gen_server.erl:365: :gen_server.init_it/2
        (stdlib) gen_server.erl:333: :gen_server.init_it/6
        (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3

Darn. It looks like this programming stuff is going to be harder than just making random changes until we get something to work! Let’s take a closer look at the problem: the process couldn’t start PoolMan (line 1), because it was calling start_link([]) (line 4) but no function clause matched (line 3).

So how come PoolMan.start_link/1 is getting called with []? Let’s back up and see what Supervisor.init/2 does (docs). It refers us to more docs for additional information, where we’re told that each child in the list given to init/2 may be:

  • a map representing the child specification itself – as outlined in the “Child specification” section
  • a tuple with a module as first element and the start argument as second – such as {Stack, [:hello]}. In this case, Stack.child_spec([:hello]) is called to retrieve the child specification
  • a module – such as Stack. In this case, Stack.child_spec([]) is called to retrieve the child specification

What’s this about a child specification? It’s basically how a supervisor will (re)start and shut down the processes it supervises. Let’s leave it at that for now (but feel free to read more about child specs in the docs).

So based on what we’ve read, since we’re giving just a module within the children variable, at some point PoolMan.child_spec([]) gets called. This is pretty suspicious, because we haven’t defined a child_spec/1 function in the pool manager module. Let’s once again turn to the console to investigate what the heck is going on:

iex(1)> PoolToy.PoolMan.child_spec([])
%{id: PoolToy.PoolMan, start: {PoolToy.PoolMan, :start_link, [[]]}}
iex(2)> PoolToy.PoolMan.child_spec([foo: :bar])
%{id: PoolToy.PoolMan, start: {PoolToy.PoolMan, :start_link, [[foo: :bar]]}}

Let’s take another look at our pool manager module (lib/pool_toy/pool_man.ex):

defmodule PoolToy.PoolMan do
  use GenServer

  @name __MODULE__

  def start_link(size) when is_integer(size) and size > 0 do
    GenServer.start_link(__MODULE__, size, name: @name)
  end

  def init(size) do
    state = List.duplicate(:worker, size)
    {:ok, state}
  end
end

There’s definitely no child_spec/1 defined in there, so where’s it coming from? Well, the only candidate is line 2, since use can generate code within our module. Sure enough, the GenServer docs tell us what happened:

use GenServer also defines a child_spec/1 function, allowing the defined module to be put under a supervision tree.

Now that that’s cleared up, let’s look at the resulting child spec again:

iex(2)> PoolToy.PoolMan.child_spec([foo: :bar])
%{id: PoolToy.PoolMan, start: {PoolToy.PoolMan, :start_link, [[foo: :bar]]}}

We’ve got an id attribute that the supervisor uses to differentiate its children: parents do the same thing by giving their kids different names, right? (Besides George Foreman and his sons, I mean.) The other key in there, :start, is used to define how the child should be started: as the docs indicate, it is a tuple containing the module and function to call, as well as the arguments to pass in. This is often referred to as an MFA (i.e. Module, Function, Arguments). You’ll note the arguments are always wrapped within a single list. So in the example above, where we want to pass a keyword list as an argument, it gets wrapped within another list in the MFA tuple.
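In other words, when it’s time to start the child, the supervisor essentially does the equivalent of apply(module, function, args) with the contents of that tuple. A quick illustration (hypothetical, just to show the mechanics):

# {PoolToy.PoolMan, :start_link, [[foo: :bar]]} ends up being invoked as
apply(PoolToy.PoolMan, :start_link, [[foo: :bar]])
# which is the same as calling
PoolToy.PoolMan.start_link([foo: :bar])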

Fixing the startup problem

Right, so after this scenic detour discussing child specs, let’s get back to our immediate problem:

  1. If we give just a module name to Supervisor.init/2, it will call PoolMan.child_spec([]);
  2. This will generate a child spec with a :start value of `{PoolToy.PoolMan, :start_link, [[]]}`;
  3. The supervisor will attempt to call PoolMan‘s start_link function with [] as the argument;
  4. The process crashes, because PoolMan only defines a start_link/1 function expecting an integer.

To fix this issue, we need the child spec to be something like `%{id: PoolToy.PoolMan, start: {PoolToy.PoolMan, :start_link, [3]}}`, which would make the supervisor call PoolMan’s start_link/1 function with a size of 3, and we’d get a pool with 3 workers.

How can this be solved? One option is to override the child_spec/1 function defined by use GenServer, to return the child spec we want, for example (lib/pool_toy/pool_man.ex):

def child_spec(_) do
    %{
      id: @name,
      start: {__MODULE__, :start_link, [3]}
    }
  end

That’ll work, but it’s not the best choice in our case: this isn’t flexible and is overkill for what we’re trying to achieve.

Another possibility is to customize the generated child_spec/1 function by altering the use GenServer statement on line 2 (lib/pool_toy/pool_man.ex):

use GenServer, start: {__MODULE__, :start_link, [3]}

Once again, not great: we want to be able to specify the pool size dynamically.

Referring back to the Supervisor.init/2 docs, we can see the other options we’ve got:

  • a map representing the child specification itself – as outlined in the “Child specification” section
  • a tuple with a module as first element and the start argument as second – such as {Stack, [:hello]}. In this case, Stack.child_spec([:hello]) is called to retrieve the child specification
  • a module – such as Stack. In this case, Stack.child_spec([]) is called to retrieve the child specification

Per the first bullet point, we could also directly provide the child spec to Supervisor.init/2 in lib/pool_toy/pool_sup.ex:

defmodule PoolToy.PoolSup do
  use Supervisor

  @name __MODULE__

  def start_link(args) when is_list(args) do
    Supervisor.start_link(__MODULE__, args, name: @name)
  end

  def init(args) do
    pool_size = args |> Keyword.fetch!(:size)
    children = [
      %{
        id: PoolToy.PoolMan,
        start: {PoolToy.PoolMan, :start_link, [pool_size]}
      }
    ]

    Supervisor.init(children, strategy: :one_for_all)
  end
end

On line 6 we’ve modified start_link to take a keyword list with our options (as recommended by José) and forward it to init/1. In there, we extract the size value on line 11, and use that in our hand-made child spec on line 15. In IEx, we can now call PoolToy.PoolSup.start_link(size: 5) and have our pool supervisor start up, and start its pool manager child. So this version also works, but it seems writing our own child spec is a lot of extra work when we were almost there using just the module name…

If you look at the 2nd bullet point in the quoted docs above, you’ll find the simpler solution: use a tuple to specify the child spec. Just provide a tuple with the child module as the first element, and the startup args as the second element (lib/pool_toy/pool_sup.ex):

defmodule PoolToy.PoolSup do
  use Supervisor

  @name __MODULE__

  def start_link(args) when is_list(args) do
    Supervisor.start_link(__MODULE__, args, name: @name)
  end

  def init(args) do
    pool_size = args |> Keyword.fetch!(:size)
    children = [{PoolToy.PoolMan, pool_size}]

    Supervisor.init(children, strategy: :one_for_all)
  end
end

Back in our IEx shell, let’s make sure we didn’t break anything:

iex(1)> PoolToy.PoolSup.start_link(size: 5)
{:ok, #PID<0.121.0>}

Our changes so far

Another look at Observer

Let’s take a look in Observer with :observer.start(). Go to the “Processes” tab, find the Elixir.PoolToy.PoolMan process, and double-click it to open its process info window. If you now navigate to the “State” tab, you’ll see that we indeed have 5 worker atoms as the state: the size option was properly forwarded down to our pool manager from the pool supervisor! You can also see that there’s a parent pid on this screen: click it.

In the new window, you’ll be looking at the pool manager’s parent process: the window title indicates that it’s Elixir.PoolToy.PoolSup! So our pool supervisor did indeed start the pool manager as its child, as we intended. Finally, go to the “State” tab in this window for PoolSup and click on the “expand above term” link: you’ll see something like

{state,{local,'Elixir.PoolToy.PoolSup'},
       one_for_all,
       {['Elixir.PoolToy.PoolMan'],
        #{'Elixir.PoolToy.PoolMan' =>
              {child,<0.143.0>,'Elixir.PoolToy.PoolMan',
                     {'Elixir.PoolToy.PoolMan',start_link,[5]},
                     permanent,5000,worker,
                     ['Elixir.PoolToy.PoolMan']}}},
       undefined,3,5,[],0,'Elixir.PoolToy.PoolSup',
       [{size,5}]}

Once again, we’re poking our nose into something that we’re not really supposed to be aware of, but you can probably guess what a lot of the information in there corresponds to: we’ve got a one_for_all restart strategy, and a single child with pid <0.143.0> (followed essentially by Erlang’s representation of the child spec used to start this particular child).

Continue on to the next post to add the worker supervisor and convert our app into an OTP application.


Would you like to see more Elixir content like this? Sign up to my mailing list so I can gauge how much interest there is in this type of content.


PoolToy: A (toy) process pool manager in Elixir 1.6 (part 1.1)

A new project

(This post is part of a series on writing a process pool manager in Elixir.)

Without further ado, let’s get started with PoolToy:

mix new pool_toy

Pools will contain worker processes that will actually be doing the work (as their name implies). In the real world, workers could be connections to databases, APIs, etc. for which we want to limit concurrency. In this example, we’ll pretend that doubling a number is expensive in terms of computing resources, so we want to avoid having too many computations happening at the same time.

Here’s our worker GenServer (lib/doubler.ex):

defmodule Doubler do
  use GenServer

  def start_link([] = args) do
    GenServer.start_link(__MODULE__, args, [])
  end

  defdelegate stop(pid), to: GenServer

  def compute(pid, n) when is_pid(pid) and is_integer(n) do
    GenServer.call(pid, {:compute, n})
  end

  def init([]) do
    {:ok, nil}
  end

  def handle_call({:compute, n}, _from, state) do
    IO.puts("Doubling #{n}")
    :timer.sleep(500)
    {:reply, 2 * n, state}
  end
end

Nothing fancy going on here: we’ve got a public API with start_link/1, stop/1, and compute/2 functions. As you can see from the return value in init/1, we’re not making use of state here, since our use case is so trivial. Finally, when handling a compute request on the server, we sleep for half a second to simulate expensive processing.
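To take it for a quick spin on its own in IEx (pids will differ, and remember that compute/2 takes about half a second to reply):

iex(1)> {:ok, pid} = Doubler.start_link([])
{:ok, #PID<0.107.0>}
iex(2)> Doubler.compute(pid, 21)
Doubling 21
42
iex(3)> Doubler.stop(pid)
:ok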

Also of note, the Doubler module is at the top level within the lib folder, and not within a pool_toy subfolder. This is because we really only have Doubler within PoolToy for convenience: if PoolToy became a fully-fledged library, the worker would be provided by the client code.

Our changes so far

Managing a single pool

So what’s our pool going to look like? Roughly (ok, exactly…) like this:

 

No need to praise me for my elite design skills, this is straight from the Observer tool, which we’ll get to know better later on.

At the top level, we’ve got PoolSup which is responsible for supervising an entire pool: after all, once we’ve got several pools being managed, if one pool crashes the others should keep chugging along. Below that, we’ve got PoolMan and WorkerSup. PoolMan is the pool manager: it’s the process we communicate with to interact with the pool (such as borrowing a worker to do some work, then returning it).

WorkerSup supervises the actual worker processes: if one goes down, a new worker will be started in its place. The unnamed workers will be instances of Doubler in our case. After all, we didn’t write that module for nothing…

This nice little gathering of processes is commonly referred to as a supervision tree: each black line essentially means “supervises” when read from left to right. You’ve probably heard of Erlang’s “let it crash” philosophy, and it goes hand in hand with the supervision tree: if something goes wrong within a process and we don’t know how to fix it or haven’t anticipated that failure, it’s best to simply kill the process and start a new one from a “known good” state.

After all, we do the same thing when our computers or phones start acting up and no obvious actions seem to fix the issue: reboot. Rebooting a device lets it start from a clean state, and fixes many problems. In fact, if you’ve ever had to provide some sort of IT support to friends or family, “Have you tried turning it off and on again?” is probably one of your trusty suggested remedies.

But why are we bothering with a worker supervisor and a pool manager? Couldn’t we combine them? While it’s definitely possible to do so technically, it’s a bad idea. In order to provide robust fault-tolerance, supervisors should essentially be near impossible to crash, which means they should have as little code as possible (since the probability of bugs will only increase with more code). Therefore, our design has a worker supervisor that only takes care of supervision (start/stopping processes, etc.), while the manager is the brains of the operation (tracking which workers are available, which process is currently using a checked out worker, etc.).

The pool supervisor

Let’s write our pool supervisor (lib/pool_toy/pool_sup.ex):

defmodule PoolToy.PoolSup do
  use Supervisor

  @name __MODULE__

  def start_link() do
    Supervisor.start_link(__MODULE__, [], name: @name)
  end

  def init([]) do
    children = []

    Supervisor.init(children, strategy: :one_for_all)
  end
end

On line 2, we indicate this module will be a supervisor, which does a few things for us behind the scenes. Don’t worry about exactly what it does for now; we’ll get back to it later.

We define a start_link/0 function on line 6 because, hey, it’s going to start the supervisor and link the calling process to it. Technically, the function could have been given any name, but start_link is conventional as it communicates clearly what the expected outcome is. You’ll note that we also give this supervisor process a name, to make it easier to locate later.

Our code so far

Naming processes

Since we expect there to be a single pool supervisor (at least for now), we can name the process to make it easier to find and work with.

Process names have to be unique, so we need to make sure we don’t accidentally provide a name that is already (or will be) in use. You know what else needs to be unique? Module names! This is one of the main reasons unique processes are named with their module name. As an added bonus, it makes it easy to know at a glance what the process is supposed to be doing (because we know what code it’s running).

But why have Supervisor.start_link(__MODULE__, [], name: @name) when Supervisor.start_link(__MODULE__, [], name: __MODULE__) would work just as well? Because the first argument is actually the current module’s name, while the name option could be anything (i.e. using the module name is just a matter of convention/convenience). By declaring a @name module attribute and using that as the option value, we’re free to have our module and process names change independently.

And now, back to our regularly scheduled programming

The supervisor behaviour requires an init/1 callback to be defined (see docs):

def init([]) do
  children = []

  Supervisor.init(children, strategy: :one_for_all)
end

At some point in the future, our supervisor will supervise some child processes (namely the “pool manager” and “worker supervisor” processes we introduced above). But for now, let’s keep our life simple and child-free ;-)

Finally, we call Supervisor.init/2 (docs) to properly set up the initialization information as the supervisor behaviour expects it. We provided :one_for_all as the supervision strategy. Supervision strategies dictate what a supervisor should do when a supervised process dies. In our case, if a child process dies, we want the supervisor to kill all other surviving supervised processes, before restarting them all.
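For reference, here’s what the three available strategies look like (a quick sketch; only the strategy atom changes, and we’re sticking with :one_for_all in our code):

Supervisor.init(children, strategy: :one_for_one)  # restart only the child that died
Supervisor.init(children, strategy: :rest_for_one) # restart the dead child and the children started after it
Supervisor.init(children, strategy: :one_for_all)  # kill and restart all children when one dies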

But is that really necessary, or is it overkill? After all, there are other supervision strategies we could use (only restarting the dead process, or restarting the processes that were started after it), why kill all supervised processes if a single one dies? Let’s take another look at our process tree and think it through:

As mentioned above, PoolMan will take care of the pool management (checking out a worker, keeping track of which workers are busy, etc.), while WorkerSup will supervise the workers (replacing dead ones with fresh instances, creating a new worker if the pool manager asks for it, etc.).

What happens if PoolMan dies? Well, we’ll lose all information about which workers are checked out (and busy) and which ones can be used by clients needing work to be done. So if PoolMan dies, we want to kill WorkerSup also, because then once WorkerSup gets restarted all of its children will be available (and therefore PoolMan will know all workers are available for use).

You might be worried about the poor clients that checked out workers to perform a task, and suddenly have that worker killed. The truth is, in the Erlang/Elixir world, you always have to think about processes dying, being unreachable, and so on: after all, the worker process could have died at any time and for whatever reason. In other words, the client process should have code to handle the worker dying and handle that case appropriately: the worker process can die for any number of reasons (e.g. software bug, remote timeout) and not necessarily due to a supervisor killing it. And of course, the client process could very well decide that “appropriately handling the death of a worker process” means “just crash”. We are in the Erlang world, after all :D

Ok, so we’ve determined that if PoolMan dies, we should bring down WorkerSup along with it. What about the other way around? What happens if WorkerSup dies? We’ll have no more workers, and the accounting done within PoolMan will be useless: the list of busy processes (referenced by their pid) will no longer match any existing pid, since the new workers will have been given new pids. So we’ll have to kill PoolMan to ensure it starts back up with an empty state (i.e. no worker processes registered as checked out).

Since we’ve concluded that in the event of one child process dying we should kill the other one, the correct supervision strategy here is :one_for_all.

Poking around in Observer

Let’s start an IEx session and investigate what we’ve got so far. From within the pool_toy folder, run iex -S mix: in case you’ve forgotten, this will start an IEx session and run the mix script, so we’ll have our project available to use within IEx.

First, let’s start the pool supervisor with PoolToy.PoolSup.start_link(). Then, let’s start Erlang’s Observer with :observer.start() (note that autocompletion doesn’t work when calling Erlang modules, so you have to type the whole thing out). Since Observer is an Erlang module, the syntax to call it is slightly different. Here’s a quick (and very inadequate) primer on Erlang syntax: in Erlang, atoms are written in lower snake_case, while upper CamelCase tokens are variables (which cannot be rebound in Erlang). Whereas in Elixir module names are CamelCase atoms while “normal” atoms are lower snake_case, in Erlang both are lower snake_case (i.e. a module name is an atom like any other).

To call an Erlang module’s function you join them with a colon. So in Erlang, foo:bar(MyVal) would execute the bar/1 function within the foo module with the MyVal variable as the argument. Finally, back in the Elixir world, we need to prepend the Erlang module’s name with a colon to make it an atom: the Observer module therefore gets referenced as :observer and :observer.start() will call the start/0 function within the Observer module. Whew!

Ok, so a new window should have popped up, containing the Erlang Observer:

If nothing came up, search the web for the error message you get: it’s likely you’re missing a dependency (e.g. wxWidgets).

Click on the “Processes” tab, then on the “Name or Initial Func” header to sort the list of processes, then scroll down to Elixir.PoolToy.PoolSup which is the supervisor process we just created (don’t forget that Elixir prefixes all module names with Elixir). Let’s now open the information window for that particular process by double clicking on it (or right-clicking and selecting the “Process info for <pid>” option):

Notice that I’ve switched to the last tab, because that’s all we’ll be looking at for now. We can see a few things of interest here: our supervisor implements the GenServer behaviour, it’s running, and its parent process is <0.115.0> (this pid could very well be different in your case). If you click on the pid, a new window will open, where you’ll find out that the parent is indeed the IEx session (in the “Process Information” tab, the “Current Function” value is Elixir.IEx.Evaluator:loop/1).

We can also see our supervisor has some sort of internal state: click the provided link in the window to see what the state contains. You’ll see something like

{state,{local,'Elixir.PoolToy.PoolSup'}, 
           one_for_all,{[],#{}},undefined,3,5,[],0,'Elixir.PoolToy.PoolSup',[]}

This is the supervisor’s internal state (as Erlang terms), so we’re kind of peeking behind the curtains here, but we can see that the state contains the :one_for_all strategy, and the name of the module where we’ve defined the supervisor behaviour. Don’t worry about the other stuff: we’re looking at the internal details of something we didn’t write, so it’s not really our business to know what all that stuff is. It’ll be much more meaningful later when we observe processes where we defined the state ourselves (because then what the state contains will make sense to us!).

Take a break, and join me in the next post where we’ll implement the pool manager and have our supervisor start it.


Would you like to see more Elixir content like this? Sign up to my mailing list so I can gauge how much interest there is in this type of content.


PoolToy: A (toy) process pool manager in Elixir 1.6

Contents

This series of posts will guide you in writing an OTP application that will manage pools of processes. In particular, it will do so using features introduced in Elixir 1.6 such as the DynamicSupervisor and the Registry.

There’s quite a bit of content I wanted to cover, and I tried to present the material in a way that lets readers learn (and retain!) as much as possible given the time spent. One reason for the length is that we’ll be making mistakes on our journey. Improving a skill is about learning from your mistakes, and Elixir/OTP is no different: I want you to see how/why things don’t work out sometimes, and to understand why another design is better suited. In other words, I won’t show you the “best” implementation right away, but my hope is that you’ll be better off for it: you’ll be able to think critically about your own designs and implement corrections as necessary.

Another focus is on the Observer: everybody says it’s a great tool, but beyond that you’re most often left to your own devices. We’ll periodically use the Observer as we develop PoolToy to see what’s going on in our software and to help us diagnose and fix design problems.

Without further ado, here are the contents of this series:

  1. Managing a single pool
    • Part 1.1: create a new mix project, add a worker and pool supervisor
    • Part 1.2: implement the pool manager and have our pool supervisor start it
    • Part 1.3: add the (dynamic) worker supervisor and have our pool supervisor start it; convert our code into an OTP application
    • Part 1.4: make the worker supervisor start workers when it initializes
    • Part 1.5: implement worker checkin and checkout
    • Part 1.6: monitor client processes
    • Part 1.7: use an ETS table to track client monitors
    • Part 1.8: handle worker deaths
    • Part 1.9: make workers temporary and have the pool manager in charge of restarting them
  2. Managing multiple pools (coming soon)
    1. Get configuration (worker module, pool size/name, etc.) dynamically
    2. Start/stop pools
    3. Find pools with Registry
  3. Enhancing our pool capabilities (coming soon)
    1. Allow overflow workers
    2. Cooldown period before terminating overflow workers
    3. Queueing client demand

Intro

This series of posts assumes that you’re familiar with Elixir syntax and basic concepts (such as GenServers and supervisors). But don’t worry: no need to be an expert, if you’ve read an introduction or two about Elixir you should be fine. If you haven’t read anything about Elixir yet, have a quick look at the excellent Getting started guide.

Processes and supervision are a core tenet of Elixir, Erlang, and the OTP ecosystem. In this series of posts, we’ll see how to create a process pool similar to Poolboy as this will give us ample opportunity to see how processes behave and how they can be handled within a supervision tree.

As a quick reminder, process pools can be used to:

  • limit concurrency (e.g. only N simultaneous connections to a DB, or M simultaneous requests to an API with rate limiting)
  • smooth occasional heavy activity bursts by queueing the excess demand
  • allocate more resources to more important parts of the system (e.g. payment processing gets a bigger pool than report creation)

Hopefully, you won’t have any trouble following along (if you struggle, please let me know!), but if you do here’s a list of alternative learning resources that might give you a different perspective and get you “unstuck”:

  • a blog post on the service/worker pattern (the code is in Erlang, but the concepts fully apply to Elixir).
  • building an OTP application from Fred Hébert’s “Learn you some Erlang”. As you can guess, it’s in Erlang and the design is a bit different (e.g. no DynamicSupervisor in Erlang, so you must use a :simple_one_for_one strategy) but even so it’s definitely worth a read to give you a better understanding about process wrangling.
  • pooly, another simple process pool manager written in Elixir, but with older concepts (as it was written before they were introduced). Mainly: it uses a supervisor with a :simple_one_for_one strategy instead of a DynamicSupervisor, and doesn’t use Elixir’s Registry to locate the pools.
  • the poolboy application itself (in Erlang), which is probably what you’d reach for if you need pool management in a production application

And if you’re the kind of person who prefers to just dive into the code, here it is.

Now that we’re situated, let’s get started in the next post.


Would you like to see more Elixir content like this? Sign up to my mailing list so I can gauge how much interest there is in this type of content.
