Fast polynomial evaluation
I guess everybody knows what a polynomial is.
Polynomials are really cool and used all over the place in programming. Evaluating a polynomial is really important, so how should we do it?
First of all lets see how the polynomial function looks like:
$a_n x^n + a_{n-1} x^{n-1} + \dots + a_2 x^2 + a_1 x + a_0$
We can see that a polynomial is basically defined by it’s coefficients $a_i$. So we can actually store all we need for the evaluation by storing the coefficients into an array.
We can use a shorter notation $\sum\limits_{i=1}^n a_i x^i$ that actually implies the first algorithm.
We implement this in Javascript
function polyEvalSimple(x,coeff)
{
  var xacc = 1;
  var finalValue = 0.0
  for(var i=0; i< coeff.length; i++)
  {
    finalValue = finalValue + xacc * coeff[i];
    xacc = xacc * x;
  }
  return finalValue;
}
and then in Python
def polyeval(x,coeff):
    xacc = 1;
    finalValue = 0.0
    for i in xrange(0,len(coeff)):
        finalValue = finalValue + coeff[i] * xacc
        xacc = xacc * x
    return finalValue
and of course C
double polyEvalSimple(double x, double *coeff, unsigned int numCoeff)
{
	double xacc = 1;
	double finalValue = 0.0f;
	for (unsigned int i = 0; i < numCoeff; i++)
	{
		finalValue = finalValue + coeff[i] * xacc;
		xacc = xacc * x;
	}
	return finalValue;
}
Ok, now we can do better than this. We just use a little math trickery and split the polynomial into linear parts (monomial basis):
$p(x) = a_0 + x(a_1+x(a_2+ \dots +x(a_{n-1} + a_n x)))$
This is called the Horner method and it’s been around ages (even before Horner of course :) )
Now lets see the implementation:
Javascript
function polyEvalHorner(x,coeff)
{
  var finalValue = 0
  for(var i=coeff.length -1; i>= 0 ; i--)
  {
    finalValue = finalValue * x + coeff[i];
  }
  return finalValue;
}
and then in Python
def polyevalHorner(x,coeff):
    finalValue = 0.0
    for i in xrange(len(coeff)-1 , -1,-1):
        finalValue = finalValue * x+ coeff[i]
    return finalValue
and of course C
double polyEvalHorner(double x, double *coeff, unsigned int numCoeff)
{
	double finalValue = 0.0f;
	for (int i = numCoeff-1; i >= 0 ; i--)
	{
		finalValue = finalValue* x + coeff[i];
	}
	return finalValue;
}
As we can see, we basically avoid a multiplication by $x$ each iteration.
Now that’s all fine and dandy but will that actually matter on todays computers?
Let’s do some profiling.
We note the G = $\frac{t_{horner}}{t_{simple}}$ .
| Python | Javascript | C | |
|---|---|---|---|
| G | 0.71 | 0.375 | 1.11 | 
I did my tests on my home CPU: i7-2600 at 3.4GHZ (they could vary on your computers so feel free to run them and tell me your results)
We can see that there are some gains in speed when using the Horner method in Python and in Javascript.
We also see that the Horner method is actually worst in C . Why is that? We should be having better times, because we avoid doing a multiplication ! :)
We can also check the assembly code (I took the debug code so we can actually understand what is happening, the debug times are still better for the simple evaluation):
Simple method
finalValue = finalValue + coeff[i] * xacc;
00007FF69D321BCE  mov         eax,dword ptr [rbp+44h]  
00007FF69D321BD1  mov         rcx,qword ptr [coeff]  
00007FF69D321BD8  movsd       xmm0,mmword ptr [rcx+rax*8]  
00007FF69D321BDD  mulsd       xmm0,mmword ptr [xacc]  
00007FF69D321BE2  movsd       xmm1,mmword ptr [finalValue]  
00007FF69D321BE7  addsd       xmm1,xmm0  
00007FF69D321BEB  movaps      xmm0,xmm1  
00007FF69D321BEE  movsd       mmword ptr [finalValue],xmm0  
xacc = xacc * x;
00007FF69D321BF3  movsd       xmm0,mmword ptr [xacc]  
00007FF69D321BF8  mulsd       xmm0,mmword ptr [x]  
00007FF69D321C00  movsd       mmword ptr [xacc],xmm0  
Horner
finalValue = finalValue* x + coeff[i];
00007FF69D321B10  movsd       xmm0,mmword ptr [finalValue]  
00007FF69D321B15  mulsd       xmm0,mmword ptr [x]  
00007FF69D321B1D  movsxd      rax,dword ptr [rbp+24h]  
00007FF69D321B21  mov         rcx,qword ptr [coeff]  
00007FF69D321B28  addsd       xmm0,mmword ptr [rcx+rax*8]  
00007FF69D321B2D  movsd       mmword ptr [finalValue],xmm0  
The problem is that the Horner scheme is really bad for the cpu pipeline, because one iteration needs to wait for the other one to finish. Horner is starving the poor pipeline.
In higher level languages (that don’t get converted to machine code) you don’t really feel this, because everything is slower and you are waiting for the language interpreter to do it’s job (which is not starving anytime soon :) ).
If we want to improve the c polynomial evaluation we could actually use a Divede et impera method like Estrin’s scheme.
In practice, if we know the polynomial we sometimes can rewrite it using Estrin’s scheme, without coding the whole algorithm, as it has overhead that is insignificant only when the polynomial is huge.
if you tldr and just want to see the test code click here
