我想优化这个短循环 [英] I want to optimize this short loop

查看:79
本文介绍了我想优化这个短循环的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想优化这个简单的循环:

I would like to optimize this simple loop:

unsigned int i;
while(j-- != 0){ //j is an unsigned int with a start value of about N = 36.000.000
   float sub = 0;
   i=1;
   unsigned int c = j+s[1];
   while(c < N) {
       sub += d[i][j]*x[c];//d[][] and x[] are arrays of float
       i++;
       c = j+s[i];// s[] is an array of unsigned int with 6 entries.
   }
   x[j] -= sub;                        // only one memory-write per j
}

循环的执行时间为使用4000 MHz AMD Bulldozer大约需要一秒钟。我考虑过SIMD和OpenMP(通常使用它们来提高速度),但是此循环是递归的。

The loop has an execution time of about one second with a 4000 MHz AMD Bulldozer. I thought about SIMD and OpenMP (which I normally use to get more speed), but this loop is recursive.

有什么建议吗?

推荐答案

我同意为更好地缓存而进行转置(但最后请参见我对此的评论),还有很多事情要做,所以让我们看看我们可以使用完整功能...

I agree with transposing for better caching (but see my comments on that at the end), and there's more to do, so let's see what we can do with the full function...

原始功能,以供参考(出于我的理智,有些整理):

Original function, for reference (with some tidying for my sanity):

void MultiDiagonalSymmetricMatrix::CholeskyBackSolve(float *x, float *b){
    //We want to solve L D Lt x = b where D is a diagonal matrix described by Diagonals[0] and L is a unit lower triagular matrix described by the rest of the diagonals.
    //Let D Lt x = y. Then, first solve L y = b.

    float *y = new float[n];
    float **d = IncompleteCholeskyFactorization->Diagonals;
    unsigned int *s = IncompleteCholeskyFactorization->StartRows;
    unsigned int M = IncompleteCholeskyFactorization->m;
    unsigned int N = IncompleteCholeskyFactorization->n;
    unsigned int i, j;
    for(j = 0; j != N; j++){
        float sub = 0;
        for(i = 1; i != M; i++){
            int c = (int)j - (int)s[i];
            if(c < 0) break;
            if(c==j) {
                sub += d[i][c]*b[c];
            } else {
                sub += d[i][c]*y[c];
            }
        }
        y[j] = b[j] - sub;
    }

    //Now, solve x from D Lt x = y -> Lt x = D^-1 y
    // Took this one out of the while, so it can be parallelized now, which speeds up, because division is expensive
#pragma omp parallel for
    for(j = 0; j < N; j++){
        x[j] = y[j]/d[0][j];
    }

    while(j-- != 0){
        float sub = 0;
        for(i = 1; i != M; i++){
            if(j + s[i] >= N) break;
            sub += d[i][j]*x[j + s[i]];
        }
        x[j] -= sub;
    }
    delete[] y;
}

因为关于并行除法的评论提高了速度(尽管只有O (N)),我假设函数本身被调用很多。那么为什么要分配内存呢?只需将 x 标记为 __ restrict __ 并将 y 更改为 x 到处都是( __ restrict __ 是GCC扩展,取自C99。您可能要使用 define 。也许图书馆已经有一个)。

Because of the comment about parallel divide giving a speed boost (despite being only O(N)), I'm assuming the function itself gets called a lot. So why allocate memory? Just mark x as __restrict__ and change y to x everywhere (__restrict__ is a GCC extension, taken from C99. You might want to use a define for it. Maybe the library already has one).

类似地,虽然我猜您不能更改签名,但您可以函数仅接受一个参数并对其进行修改。设置 x y 时,从不使用 b 。这也意味着您可以在运行〜N * M次的第一个循环中摆脱分支。如果必须具有2个参数,请在开始时使用 memcpy

Similarly, though I guess you can't change the signature, you can make the function take only a single parameter and modify it. b is never used when x or y have been set. That would also mean you can get rid of the branch in the first loop which runs ~N*M times. Use memcpy at the start if you must have 2 parameters.

为什么 d 指针数组?一定是吗这在原始代码中似乎太深了,所以我不会去碰它,但是,即使有可能展平存储的数组,即使您无法转置它,也可以提高速度(乘以,添加,取消引用更快

And why is d an array of pointers? Must it be? This seems too deep in the original code, so I won't touch it, but if there's any possibility of flattening the stored array, it will be a speed boost even if you can't transpose it (multiply, add, dereference is faster than dereference, add, dereference).

因此,新代码:

void MultiDiagonalSymmetricMatrix::CholeskyBackSolve(float *__restrict__ x){
    // comments removed so that suggestions are more visible. Don't remove them in the real code!
    // these definitions got long. Feel free to remove const; it does nothing for the optimiser
    const float *const __restrict__ *const __restrict__ d = IncompleteCholeskyFactorization->Diagonals;
    const unsigned int *const __restrict__ s = IncompleteCholeskyFactorization->StartRows;
    const unsigned int M = IncompleteCholeskyFactorization->m;
    const unsigned int N = IncompleteCholeskyFactorization->n;
    unsigned int i;
    unsigned int j;
    for(j = 0; j < N; j++){ // don't use != as an optimisation; compilers can do more with <
        float sub = 0;
        for(i = 1; i < M && j >= s[i]; i++){
            const unsigned int c = j - s[i];
            sub += d[i][c]*x[c];
        }
        x[j] -= sub;
    }

    // Consider using processor-specific optimisations for this
#pragma omp parallel for
    for(j = 0; j < N; j++){
        x[j] /= d[0][j];
    }

    for( j = N; (j --) > 0; ){ // changed for clarity
        float sub = 0;
        for(i = 1; i < M && j + s[i] < N; i++){
            sub += d[i][j]*x[j + s[i]];
        }
        x[j] -= sub;
    }
}

看起来很整洁,而且缺少内存分配减少分支(如果没有其他要求的话)是一个促进。如果您可以更改 s 以在末尾包含额外的 UINT_MAX 值,则可以删除更多分支(两个 i< M 支票,它再次运行〜N * M次。)

Well it's looking tidier, and the lack of memory allocation and reduced branching, if nothing else, is a boost. If you can change s to include an extra UINT_MAX value at the end, you can remove more branches (both the i<M checks, which again run ~N*M times).

现在我们不能再做了循环是并行的,我们不能合并循环。正如另一个答案所建议的那样,现在的提升将是重新排列 d 。除了……重新安排 d 所需的工作与执行循环的工作具有完全相同的缓存问题。这将需要分配内存。不好。进一步优化的唯一选择是:更改 IncompleteCholeskyFactorization-> Diagonals 本身的结构,这可能意味着很多更改,或者找到一种更有效的算法

Now we can't make any more loops parallel, and we can't combine loops. The boost now will be, as suggested in the other answer, to rearrange d. Except… the work required to rearrange d has exactly the same cache issues as the work to do the loop. And it would need memory allocated. Not good. The only options to optimise further are: change the structure of IncompleteCholeskyFactorization->Diagonals itself, which will probably mean a lot of changes, or find a different algorithm which works better with data in this order.

如果想进一步,您的优化将需要影响很多代码(这不是一件坏事;除非有充分的理由,否则)对于对角线是指针数组,似乎可以通过重构来完成。)

If you want to go further, your optimisations will need to impact quite a lot of the code (not a bad thing; unless there's a good reason for Diagonals being an array of pointers, it seems like it could do with a refactor).

这篇关于我想优化这个短循环的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆