Looking for advice on how to speed up my compute shader

I have 4 render targets in R10G10B10A2 format, each storing a color and a normal packed as two RGB555 values. For the lighting pass I am forced to use extra render targets: I can read (extract) the normal and color from these surfaces, but I can't write the lit color directly back to them.

So I moved to a compute shader (DX11 CS5) to do this. I found that I'm limited to R32_UINT for read/write access in DX11 CS5, so I adapted my encoding/decoding accordingly. It works, although the final colors are a little ugly with my encoding.
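
To make the encoding concrete, here is a minimal CPU-side sketch of packing a color and a normal as two 5:5:5 triples into one 32-bit value. The bit layout (color in bits 0-14, normal in bits 15-29) and the names `Float3`, `PackColorNormal` and `UnpackColorNormal` are assumptions for illustration only, not necessarily the exact layout used in the shader.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

struct Float3 { float x, y, z; };

// Quantize a value in [0,1] to 5 bits (0..31), rounding to nearest.
static uint32_t Quantize5(float v) {
    v = std::min(std::max(v, 0.0f), 1.0f);
    return static_cast<uint32_t>(std::lround(v * 31.0f));
}

// Pack three [0,1] components into 15 bits (5 bits each).
static uint32_t Pack555(Float3 v) {
    return Quantize5(v.x) | (Quantize5(v.y) << 5) | (Quantize5(v.z) << 10);
}

// Expand 15 bits back into three [0,1] components.
static Float3 Unpack555(uint32_t bits) {
    return { (bits & 31u) / 31.0f,
             ((bits >> 5) & 31u) / 31.0f,
             ((bits >> 10) & 31u) / 31.0f };
}

// Assumed layout: bits 0-14 = color, bits 15-29 = normal remapped from [-1,1] to [0,1].
uint32_t PackColorNormal(Float3 color, Float3 normal) {
    Float3 n01 = { normal.x * 0.5f + 0.5f,
                   normal.y * 0.5f + 0.5f,
                   normal.z * 0.5f + 0.5f };
    return Pack555(color) | (Pack555(n01) << 15);
}

void UnpackColorNormal(uint32_t data, Float3& color, Float3& normal) {
    color = Unpack555(data & 0x7FFFu);
    Float3 n01 = Unpack555((data >> 15) & 0x7FFFu);
    normal = { n01.x * 2.0f - 1.0f, n01.y * 2.0f - 1.0f, n01.z * 2.0f - 1.0f };
}
```

With only 32 levels per channel, banding in the lit color is expected, which probably explains part of the ugliness. Since R32_UINT has 32 bits available and only 30 are used here, a different split might give the color a little more precision.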

Unfortunately the final frame rate is lower than with the regular pixel shader, even though my previous method needs 4 CopyResource calls from the intermediate render targets.

I implemented the CS starting from the ones in the legacy DX11 samples. My code looks like this:

CPU calls:

 //some CS constant buffer and CS shader resource view settings (e.g. shadow maps, depth textures)
 gpDC11->CSSetShader(pCS, 0, 0);
 //I'm working on 4 UAVs at the same time in the CS shader. With only one it is even slower
 gpDC11->CSSetUnorderedAccessViews(0, 4, gSRDeffered.ppUAV, (UINT*)(&gSRDeffered.ppUAV));
 gpDC11->Dispatch(960, 540, 1);//the size of my screen
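
For context, the arguments of Dispatch are thread-group counts, not thread counts, so the call above launches 960 x 540 groups. The sketch below shows how the counts could be derived for a larger group size. The 16x16 size is an assumption (it would have to match a `[numthreads(16, 16, 1)]` attribute on the shader, which is not shown here), and `GroupCount` and `DispatchSizeFor` are hypothetical helper names.

```cpp
#include <cstdint>

// Hypothetical group size; it must match the [numthreads(X, Y, 1)]
// attribute declared on the compute shader.
constexpr uint32_t kGroupSizeX = 16;
constexpr uint32_t kGroupSizeY = 16;

// Number of groups needed to cover `pixels` with `groupSize` threads each,
// rounding up so a partial group at the edge is still dispatched.
constexpr uint32_t GroupCount(uint32_t pixels, uint32_t groupSize) {
    return (pixels + groupSize - 1) / groupSize;
}

struct DispatchSize { uint32_t x, y, z; };

// Group counts for a full-screen pass over a width x height target.
inline DispatchSize DispatchSizeFor(uint32_t width, uint32_t height) {
    DispatchSize d;
    d.x = GroupCount(width, kGroupSizeX);
    d.y = GroupCount(height, kGroupSizeY);
    d.z = 1;
    return d;
}
```

For a 960x540 target this gives 60 x 34 groups, to be passed as `gpDC11->Dispatch(d.x, d.y, d.z)`. Because 34 * 16 = 544 exceeds 540, the shader would also need to skip threads whose nDTid.y falls outside the screen.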

GPU CS shader:

void CS_PostDeferred( uint3 nGid : SV_GroupID, uint3 nDTid : SV_DispatchThreadID, uint3 nGTid : SV_GroupThreadID )//only nDTid is used in fact here
{
    uint Output;
    float2 Tex = float2(nDTid.x/960.0, nDTid.y/540.0);
    float Depth1 = txDepth1.SampleLevel(samPoint, Tex, 0).r;
    float Depth2 = txDepth2.SampleLevel(samPoint, Tex, 0).r;
    float Depth3 = txDepth3.SampleLevel(samPoint, Tex, 0).r;
    float Depth4 = txDepth4.SampleLevel(samPoint, Tex, 0).r;

    //read the packed value, light it, and write the result back to the same UAV
    Output = UAVDiffuse0[nDTid.xy];
    if ( Depth1 < 1 ) UAVDiffuse0[nDTid.xy] = GetLColorUnPackPack_CS(Depth1, Tex, Output);
    Output = UAVDiffuse1[nDTid.xy];
    if ( Depth2 < 1 ) UAVDiffuse1[nDTid.xy] = GetLColorUnPackPack_CS(Depth2, Tex, Output);
    Output = UAVDiffuse2[nDTid.xy];
    if ( Depth3 < 1 ) UAVDiffuse2[nDTid.xy] = GetLColorUnPackPack_CS(Depth3, Tex, Output);
    Output = UAVDiffuse3[nDTid.xy];
    if ( Depth4 < 1 ) UAVDiffuse3[nDTid.xy] = GetLColorUnPackPack_CS(Depth4, Tex, Output);
}

The function uint GetLColorUnPackPack_CS(float Depth, float2 UV, uint Data) extracts the color and normal as two float3 values from the R32_UINT Data parameter, computes lighting and shadows for the pixel as usual, and then repacks the final color into the R32_UINT (the normal is left unchanged). The Depth and UV parameters are used to recover the view-space position needed for the point lights I'm using.
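
The position recovery from Depth and UV can be sketched on the CPU as follows. This assumes a standard left-handed D3D perspective projection with depth in [0,1] and UV origin at the top-left. The `Projection` struct and the function names are hypothetical, and the actual shader may use a different reconstruction.

```cpp
#include <cmath>

struct Float3 { float x, y, z; };

// Hypothetical camera description: the two scale terms of a D3D-style
// perspective matrix plus the near and far plane distances.
struct Projection {
    float xScale;  // 1 / (aspect * tan(fovY / 2))
    float yScale;  // 1 / tan(fovY / 2)
    float zNear;
    float zFar;
};

Projection MakeProjection(float fovY, float aspect, float zNear, float zFar) {
    float yScale = 1.0f / std::tan(fovY * 0.5f);
    return { yScale / aspect, yScale, zNear, zFar };
}

// Forward mapping (view space -> UV and depth), included only to check the inverse.
void ProjectToUVDepth(const Projection& p, Float3 view, float& u, float& v, float& depth) {
    float ndcX = view.x * p.xScale / view.z;
    float ndcY = view.y * p.yScale / view.z;
    depth = p.zFar / (p.zFar - p.zNear) * (1.0f - p.zNear / view.z);
    u = ndcX * 0.5f + 0.5f;
    v = 0.5f - ndcY * 0.5f;  // V grows downward in texture space
}

// Inverse mapping: recover the view-space position from UV and depth.
Float3 ViewPositionFromDepth(const Projection& p, float u, float v, float depth) {
    // Invert depth = f/(f-n) * (1 - n/z)  =>  z = n*f / (f - depth*(f-n))
    float z = p.zNear * p.zFar / (p.zFar - depth * (p.zFar - p.zNear));
    float ndcX = u * 2.0f - 1.0f;
    float ndcY = 1.0f - v * 2.0f;
    return { ndcX * z / p.xScale, ndcY * z / p.yScale, z };
}
```

In a shader the same arithmetic would normally run per pixel with the projection terms supplied through a constant buffer.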

Any advice is welcome.

