Actor starvation problem
Sat, 2010-02-20, 17:39
Please take a look at: http://stackoverflow.com/questions/2288723/scala-actors-different-behavior-on-jre-1-5-and-1-6/2302738#2302738
I don't have time to dig in and figure out whether my analysis is right (the innards of the FJ pools have always left me in a state of confused awe), but I think it's at least close. Basically, if one task hogs a thread, and all the other threads in the pool are busy so they don't do any task stealing, then tasks placed directly on the hogged thread's deque never get executed.
I'd like to suggest that the Scheduler be modified so that it detects this situation and moves the tasks from the hogged thread's deque to other threads. What I'm picturing would work something like this:
- Scan the worker threads and record each one's current task (using a WeakReference, so we don't cause memory leaks!)
- Wait a while
- Scan the worker threads again; if a thread is still on the same task and has tasks on its deque, distribute those tasks to threads that are moving through their deques
- If all the threads are stuck and the pool size is less than the max, spawn another worker thread
I'm not sure whether this is possible without disrupting the locking mechanisms in the pool and thus hurting performance. Another option could be to have the worker threads drop into task-stealing mode at pseudo-random intervals (probably by counting tasks executed) even if they have tasks to execute on their own deques.
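Sketched in Scala, the watchdog loop might look roughly like this. Note this is only an illustration of the steps above: the `Worker` trait, `StarvationWatchdog`, and everything else here are hypothetical names, since the real fork/join pool internals aren't public API.

```scala
import java.lang.ref.WeakReference

// Hypothetical view of a pool worker; the real FJ internals differ.
trait Worker {
  def currentTask: AnyRef   // task being executed right now
  var deque: List[AnyRef]   // tasks waiting on this worker's deque
}

class StarvationWatchdog(workers: Seq[Worker]) {
  private var snapshot: Map[Worker, WeakReference[AnyRef]] = Map.empty

  // Step 1: record each worker's current task behind a WeakReference,
  // so the watchdog never pins a completed task in memory.
  def recordTasks(): Unit =
    snapshot = workers.map(w => w -> new WeakReference(w.currentTask)).toMap

  // Step 3 (called after waiting a while): a worker still on the same
  // task with a non-empty deque is "hogged"; hand its queued tasks to
  // a worker that has made progress.
  def redistribute(): Unit = {
    val hogged = workers.filter { w =>
      snapshot.get(w).flatMap(r => Option(r.get)).exists(_ eq w.currentTask) &&
        w.deque.nonEmpty
    }
    val moving = workers.filterNot(hogged.contains)
    if (moving.nonEmpty)
      for (h <- hogged) {
        val target = moving.head
        target.deque = target.deque ++ h.deque
        h.deque = Nil
      }
    // Step 4 (omitted): if `moving` is empty and the pool is below its
    // maximum size, spawn an additional worker thread instead.
  }
}
```

The tricky part, as noted above, is doing the rescan and task transfer without racing against the pool's own locking.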
-Erik
Tue, 2010-02-23, 13:47
#2
Re: Actor starvation problem
Hmmm...that's an interesting solution. I'll have to take a look!
On Tue, Feb 23, 2010 at 5:24 AM, Philipp Haller <philipp.haller@epfl.ch> wrote:
Hi Erik,
I don't have time to dig in and figure out whether my analysis is right (the innards of the FJ pools have always left me in a state of confused awe), but I think it's at least close. Basically, if one task hogs a thread, and all the other threads in the pool are busy so they don't do any task stealing, then tasks placed directly on the hogged thread's deque never get executed.
Yes, that's actually what happened. Thanks for hinting at it!
I'd like to suggest that the Scheduler be modified so that it detects this situation and moves the tasks from the hogged thread's deque to other threads. What I'm picturing would work something like this:
1. Scan the worker threads and record each one's current task (using a WeakReference, so we don't cause memory leaks!)
2. Wait a while
3. Scan the worker threads again; if a thread is still on the same task and has tasks on its deque, distribute those tasks to threads that are moving through their deques
4. If all the threads are stuck and the pool size is less than the max, spawn another worker thread
To solve the problem at hand, I came up with a much simpler solution:
Instead of always putting new tasks into the local deque (if possible), put them into the global queue with a chance of, say, 2%. The rate shouldn't be too high, so that work-stealing still has a positive effect; if it's too low, the actual behavior may not be very "fair".
This solves the starvation problem in the gears example on Stack Overflow, since the gear actors will sometimes put their next task into the global queue. When that happens, a different gear gets executed, since the current worker's deque is empty, causing it to fetch a task from the global queue.
I added a Boolean `fair` parameter to `ForkJoinScheduler`, which enables this behavior; it's enabled by default for actors. However, it incurs a small performance hit: around 10% for chameneos-redux. Fortunately, the above situation cannot occur with `Reactor`s, since they cannot receive while holding on to their thread. Therefore, `Reactor` uses a `ForkJoinScheduler` with `fair` disabled.
The changes are available in trunk (r20950).
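The submission policy described above can be sketched like this. To be clear, this is not the actual r20950 code; `FairSubmitter` and its parameters are hypothetical names, and only the `fair` flag and the ~2% figure come from the discussion.

```scala
import scala.util.Random

// Sketch of the "fair" submission policy: with a small probability,
// a new task goes to the shared global queue instead of the
// submitting worker's local deque, so other workers can eventually
// pick it up even if this worker's thread is hogged.
class FairSubmitter(fair: Boolean, globalChancePercent: Int = 2) {
  private val rnd = new Random

  // Returns true when this task should bypass the local deque
  // and be placed on the global queue instead.
  def submitGlobally(): Boolean =
    fair && rnd.nextInt(100) < globalChancePercent
}
```

With `fair` disabled (as for `Reactor`s), every submission stays local and the policy costs nothing.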
I'm not sure whether this is possible without disrupting the locking mechanisms in the pool and thus hurting performance. Another option could be to have the worker threads drop into task-stealing mode at pseudo-random intervals (probably by counting tasks executed) even if they have tasks to execute on their own deques.
That might be a workable alternative.
Cheers,
Philipp
--
http://erikengbrecht.blogspot.com/
Tue, 2010-02-23, 14:57
#3
Re: Actor starvation problem
Hmm...you might want to try using a dedicated instance of Random rather than the global Random object. If something else is using that instance a lot, you'll end up with a strange source of contention. Also, I'm not sure how much of a performance penalty accessing a singleton object incurs versus a regular instance variable, but it must be at least a little, and that's an important code path.
Another thing you could try: instead of generating a random number on every submission, generate the interval until the next task goes to the global queue instead of the thread-local deque, and on normal calls just increment a counter toward that interval. That would avoid doing the "expensive" random number generation every time.
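Both suggestions together might look something like this (a sketch with hypothetical names, not actual scheduler code): each worker owns its own `Random`, draws the interval to the next global-queue hand-off once, and then just counts down on each submission.

```scala
import scala.util.Random

// Per-worker submission policy: draw a random interval once, then
// decrement a counter on every submission; only when it reaches zero
// does a task go to the global queue (and a new interval is drawn).
// A mean interval of ~50 approximates the 2% rate discussed above.
class IntervalSubmitter(meanInterval: Int = 50) {
  private val rnd = new Random            // dedicated instance, one per worker
  private var countdown = nextInterval()

  // Uniform on 1 .. 2*meanInterval, so the average gap is ~meanInterval.
  private def nextInterval(): Int = 1 + rnd.nextInt(2 * meanInterval)

  // Called on every task submission; true means "use the global queue".
  def submitGlobally(): Boolean = {
    countdown -= 1
    if (countdown <= 0) { countdown = nextInterval(); true }
    else false
  }
}
```

The common path is then a single decrement and compare, with the random number generation amortized over the whole interval.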