word to float to word conversion (DSP for using VST effects)

Post by soerenkj »

I am trying to use a VST plugin to process a 16-bit .wav file (to put some flanger or chorus on it). Each sample in the .wav file is stored in a word (because it is a 16-bit sample), but the VST standard requires each sample to be a float in the range -1.0 to 1.0, i.e. I need to do some conversion. I am going to do the conversion like this:

Code: Select all

sample.w = 12345

If sample = - 32768
  sample + 1
EndIf
convertedSample.f = sample / 32767

processedSample.f = 0.23423244432343  ; = output from VST-procedure: process(convertedSample)

backConvertedSample.w = processedSample * 32767
However, I believe that this is not the optimal way of doing the conversions. As for the word to float conversion I have read some people claiming that the above method is very slow - in this post

http://groups.google.dk/group/alt.stein ... 0fdae8dfa2

Mark Robinson claims that there is a method to do the word-to-float conversion where you move the bits around somehow. I have been reading a lot about the different number representations and have done some experiments, but I still can't see how it can be done.

As for the float-to-word conversion, which is more important for me, I think the above method might create some 'harmonic distortion' because of the rounding (if I understand http://en.wikipedia.org/wiki/Dithering correctly). I have heard some people talking about 'dithering'. Is that necessary in this case, and can someone give me an example/explanation of how to do that on a float sample value?
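
For reference, here is a minimal sketch of what dithering a float sample could look like. It is not taken from the thread or from any VST API: the procedure name FloatToWordDithered() is hypothetical, the noise is plain triangular (TPDF) dither of about one LSB, and the scale factor of 32768 is an assumption. Whether dithering is worth doing at all for this use case is a separate question.

```purebasic
; Sketch only: TPDF dither before rounding a float sample to 16 bits.
; FloatToWordDithered() is a hypothetical helper, not part of any VST API.
Procedure.w FloatToWordDithered(value.f)
  Protected dither.f, scaled.f, result.l

  ; The difference of two uniform random numbers gives triangular noise
  ; in the open range (-1, +1) LSB.
  dither = (Random(65535) - Random(65535)) / 65536.0

  ; Scale to the 16-bit range and add the noise before rounding.
  scaled = value * 32768.0 + dither
  result = Round(scaled, #PB_Round_Nearest)

  ; Clip so the result always fits in a signed word.
  If result > 32767
    result = 32767
  ElseIf result < -32768
    result = -32768
  EndIf

  ProcedureReturn result
EndProcedure

Debug FloatToWordDithered(0.23423244)  ; 7674, 7675 or 7676, varies per call
```

Without the dither term the same input always rounds to 7675; with it, the rounding error is turned into low-level noise instead of being correlated with the signal.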

Post by eriansa »

Post by dioxin »

soerenkj,
don't bother fiddling with bits. Maybe it used to be faster 5+ years ago, but FPU speeds have increased so much in modern processors compared to integer speeds that you'll get no advantage from it now.

The method mentioned by Mark Robinson in your link doesn't work, at least not on x86 CPUs. It fails to take into account that for 32 bit floats the most significant bit is implied, it's not actually stored.

Where you could save time is to change that divide into a multiply:

Code: Select all

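multiplier.f = 1.0 / 32767  ; compute the reciprocal once, outside the sample loop

convertedSample.f = sample * multiplier  ; one multiply per sample instead of a divide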
 
Also, your logic for changing -32768 to +1 looks odd.

Don't believe that reference in the link above from eriansa. You will not get 6 FPU divides in a clock cycle on any x86 CPU.

As for the dithering, the FPU will round values for you to give the closest match for your result. It'll be a maximum of 0.5 LSBs. You can't hear changes of 0.5LSBs in a 16 bit number on a PC sound card, the background noise in a PC sound card is likely to be 100 times greater than that.


And another thing: shouldn't you really be dividing by 32768, not 32767?


Paul.

Post by soerenkj »

OK, Thanks for the info!
I'm relieved that I do not have to learn about dithering..
I'll do the conversions as planned then, except I will do the multiplication instead of division (the PB compiler does not make that optimization?) and I'll divide/multiply by 32768 (I thought it was a good idea to do the clipping before converting to float, but I guess not..)
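
The plan described above could be sketched roughly like this. The names WordToFloat(), FloatToWord() and #WordToFloatScale are hypothetical; the sketch assumes a single scale factor of 32768 in both directions and does the clipping on the float value after processing, just before converting back to a word.

```purebasic
; Sketch only: word <-> float conversion with one scale factor (32768).
; All names here are hypothetical helpers, not from the VST SDK.
#WordToFloatScale = 0.000030517578125  ; exactly 1/32768

Procedure.f WordToFloat(sample.w)
  ; One multiply per sample; -32768 maps to exactly -1.0
  ProcedureReturn sample * #WordToFloatScale
EndProcedure

Procedure.w FloatToWord(value.f)
  Protected scaled.f, result.l
  scaled = value * 32768.0

  ; Clip in the float domain: +1.0 would otherwise become 32768,
  ; which does not fit in a signed word.
  If scaled > 32767.0
    scaled = 32767.0
  ElseIf scaled < -32768.0
    scaled = -32768.0
  EndIf

  result = Round(scaled, #PB_Round_Nearest)
  ProcedureReturn result
EndProcedure

Debug WordToFloat(-32768)              ; -1.0
Debug FloatToWord(1.0)                 ; clipped to 32767
Debug FloatToWord(0.23423244432343)    ; 7675
```

Clipping after processing matters because a plugin can return values slightly outside -1.0 to 1.0 even when the input was in range.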

Post by Rescator »

The following is the correct math.
I'm sure there are more efficient ways to do this,
and I'm very sure there exists a better asm solution.

But the general math method is correct.
The reason negative and positive values have to be treated differently is that a 16-bit signed sample ranges from -32768 up to 32767.
The negative range is -32768 to -1 and the positive range is 0 to 32767,
which is why the whole thing is off balance really.

This is why I prefer to stick to float as much as possible when working with sound.

Code: Select all

sample.w=-32768

Debug sample

;16bit signed to normalized float
If sample<0
 float.f=sample/32768
Else
 float.f=sample/32767
EndIf

Debug float

;float (normalized assumed) to 16 bit signed
If float<0.0
 sample=float*32768
Else
 sample=float*32767
EndIf

Debug sample

Post by KarLKoX »

Here is how i do it :

Code: Select all

Procedure VSTProcess(*buffer.l, samples.l)
Protected leftsample.w, rightsample.w

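  ; sanity check
  ; (part of this check was lost from the posted code; the early
  ;  return below is an assumed reconstruction)
  If samples <= 0
    ProcedureReturn
  EndIf

  If ( *ptrEvList And *ptrEvList\numEvents > 0 )
    Dispatch(#effProcessEvents, 0, 0, @*ptrEvList, 0)
  EndIf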
  
  If ( *ptrPlug And NeedIdle)
    Idle()
  EndIf
  
  If ( *ptrPlug And *ptrPlug\process )
    If ( *ptrPlug\numInputs = 1)
      ;FIXME : MIX input0 And input1!
    EndIf  
  
   sampleframes = samples / 4
    
; VST Sound Buffers
  Dim *pinp.f(2)
  Dim *pout.f(2)
  Dim L_ib.f(samples)
  Dim R_ib.f(samples)  
  Dim L_ob.f(samples)
  Dim R_ob.f(samples)  

    ; Init sound buffers pointers
    PokeL(@*pinp(0), @L_ib())
    PokeL(@*pinp(1), @R_ib())
    PokeL(@*pout(0), @L_ob())
    PokeL(@*pout(1), @R_ob())
           
    ; Optimised short to float conversion
    ; taken from VLC (aka VideoLan)
    Structure p_f
      StructureUnion
        f.f
        i.l
      EndStructureUnion
    EndStructure
    u.p_f
    
    ; Put Left buffer to L_ib() and Right buffer to R_ib()
    m_dwBufferPos = 0
    *stereo16bitbuffers.w =  *buffer       
    For n = 0 To sampleframes - 1 
      leftsample =  PeekW(*stereo16bitbuffers)
      *stereo16bitbuffers + 2
      rightsample =  PeekW(*stereo16bitbuffers)
      *stereo16bitbuffers + 2

      u\i     = leftsample + $43c00000
      L_ib(n) = u\f - 384.000000
      
      u\i     = rightsample + $43c00000
      R_ib(n) = u\f - 384.000000
      
      L_ob(n) = 0.0
      R_ob(n) = 0.0
      
    Next n  

    SetBlockSize(sampleframes)
    If ( *ptrPlug\flags & #effFlagsCanReplacing)
      ProcessReplacing(@*pinp(), @*pout(), sampleframes)
    Else
      Process(@*pinp(), @*pout(), sampleframes)
    EndIf
    
    m_dwBufferPos = 0
    For n = 0 To sampleframes - 1
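      ; Optimised float to short conversion with clipping (same VLC trick in reverse).
      ; Several lines here were lost from the posted code; the branches below
      ; are an assumed reconstruction based on the surviving fragments.
      u\f = L_ob(n) + 384.000000
      If u\i > $43c07fff
        leftsample  = 32767
      ElseIf u\i < $43bf8000
        leftsample  = -32768
      Else
        leftsample  = u\i - $43c00000
      EndIf

      u\f = R_ob(n) + 384.000000
      If u\i > $43c07fff
        rightsample = 32767
      ElseIf u\i < $43bf8000
        rightsample = -32768
      Else
        rightsample = u\i - $43c00000
      EndIf

      If leftsample > 32767 : leftsample = 32767
      ElseIf leftsample < -32767 : leftsample = -32767
      EndIf
      If rightsample > 32767 : rightsample = 32767
      ElseIf rightsample < -32767 : rightsample = -32767
      EndIf

      ; Write the samples back byte by byte (low byte first)
      PokeW(*buffer+m_dwBufferPos, leftsample & $FF)
      m_dwBufferPos + 1
      PokeW(*buffer+m_dwBufferPos, leftsample >> 8)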
      m_dwBufferPos + 1
      PokeW(*buffer+m_dwBufferPos, rightsample & $FF)
      m_dwBufferPos + 1
      PokeW(*buffer+m_dwBufferPos, rightsample >> 8)
      m_dwBufferPos + 1

    Next n  
   
   ; Taken from Psycle : don't know if it improve something
   ; in any way but seems to be ok
   PokeL(@*tempSamplesL, @*pinp(0))
   PokeL(@*tempSamplesR, @*pinp(1))    
   PokeL(@*pinp(0), @L_ob())
   PokeL(@*pinp(1), @R_ob()) 
   PokeL(@L_ob(),   *tempSamplesL)
   PokeL(@R_ob(),   *tempSamplesR)     
   PokeL(@*pout(0), *tempSamplesL)
   PokeL(@*pout(1), *tempSamplesR)     
 

   If ( *ptrEvList And *ptrEvList\numEvents > 0)
      *ptrEvList\numEvents = 0
   EndIf
   
  EndIf

EndProcedure
This is old code (peek/poke ...).
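
To make the bit trick in the loop above easier to follow, here is a small standalone sketch. The structure name FloatBits is hypothetical. The idea: 384.0 is stored as $43c00000, and at that magnitude one step of the float mantissa is worth 2^-15, so adding a 16-bit sample to the bit pattern adds sample/32768 to the float value.

```purebasic
; Sketch only: the "add $43c00000, subtract 384.0" short-to-float trick.
; FloatBits is a hypothetical name for the float/long union.
Structure FloatBits
  StructureUnion
    f.f
    i.l
  EndStructureUnion
EndStructure

Define u.FloatBits
Define sample.w = 12345

; Reinterpret (bit pattern of 384.0) + sample as a float ...
u\i = $43c00000 + sample

; ... and remove the 384.0 offset again.
Debug u\f - 384.0         ; result of the bit trick
Debug sample / 32768.0    ; direct conversion, same value
```

The trick only works because every 16-bit sample keeps the float inside the range 256 to 512, where the exponent does not change.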
"Qui baise trop bouffe un poil." P. Desproges

http://karlkox.blogspot.com/

Post by soerenkj »

@dioxin: I made a small test of the multiply/divide thing and to me it does not seem that there is a difference in efficiency (and the PB compiler does not seem to do the optimization of rewriting divide to multiply - I checked the assembly code generated by the compiler)

@Rescator: I found that 32767/32768 is rounded down to 1.0 and that 1.0*32768 is rounded to 32767 so I guess the first method is also ok (?)

@KarLKoX: you do not agree with dioxin that nothing is gained from 'fiddling with bits'? I guess I will have to do some tests..

Post by dioxin »

soerenkj,
compare the processor operations required with both methods.
The underlying job done if you use KarLKoX's method is:

1a) load 32 bit integer value (the sample)
2a) add constant $43c00000 (could be prestored in a register for speed)
3a) Store 32 bit integer result somewhere

4a) Floating point integer load of the result just stored
5a) Floating point subtract of 384 (which could have been preloaded into a register)
6a) Floating point store of the final result.

Compare with the direct method

1b) Floating point integer load of the sample
2b) Floating point multiply by 1/32768 (the constant could be preloaded to a register for speed)
3b) Floating point store of the result.


Now note that:
1b is directly matched by 4a
3b is directly matched by 6a


So the one FMUL that we need for the direct method is compared to 3 integer operations and an FSUB for the bit-fiddling method.

Now note that (at least for an Athlon CPU) the FSUB and FMUL actually both have the same execute latency.
Also notice that the first method requires the FPU to wait for the integer result to be calculated and stored in RAM before it can be loaded, which causes a pipeline stall.
And the first method performs twice the memory accesses of the second method, so you'll soon see that you aren't going to gain anything from bit-fiddling except harder-to-understand source code.

soerenkj wrote: I checked the assembly code generated by the compiler
I bet it didn't produce:

Code: Select all

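; assumes sample.w, scale.f and answer.f are global PureBasic variables
!FILD word [v_sample]    ; load the 16-bit integer sample onto the FPU stack
!FMUL dword [v_scale]    ; multiply by the preloaded scale factor (1/32768)
!FSTP dword [v_answer]   ; store the 32-bit float result and pop the FPU stack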
which is the quickest way to do this.


Paul.

Post by KarLKoX »

In my opinion, the best way to optimize word/float <--> float/word is to use SIMD extensions like MMX, SSE and so on, which allow you to process more samples per cycle.
Depending on which extension you use, you can even optimize a bit more by unrolling loops.


PS: I am not an asm guru, but there are some tips I can understand :)
"Qui baise trop bouffe un poil." P. Desproges

http://karlkox.blogspot.com/

Post by soerenkj »

I am glad that the method I understand is also the better one! (Though, as I said, I did not measure any clear difference between floating point multiply and divide.)
I prefer not to use MMX stuff that not all users have. (Or do all modern processors have that?)

Post by KarLKoX »

All modern CPUs have at least MMX, including the AMD Athlon (the first @300 MHz).
You can write two versions of the function: one for the detected instruction set and another one in plain PureBasic.
"Qui baise trop bouffe un poil." P. Desproges

http://karlkox.blogspot.com/

Post by Rescator »

soerenkj wrote:@Rescator: I found that 32767/32768 is rounded down to 1.0 and that 1.0*32768 is rounded to 32767 so I guess the first method is also ok (?)
It's rounded, yes, but you get a rounding error because of this.
Also, 1.0*32768 is wrong, as a 16-bit WAV does not have a 32768 value.
The highest positive value is 32767.

This stuff would have been much simpler if 16-bit samples were unsigned,
or were artificially limited to a -32767 to 32767 range instead.

As it is now, 0 is the middle, but the negative side has 32768 possible values
while the positive side has 32767 values.

Way back I wrote an audio comparison tool and kept cursing for days until I realized that 0 really isn't the center. In 16-bit signed values the opposite of 0 is actually -1, and the opposite of 32767 is -32768,
although 0 is used as silence or the middle. There is no way to fix this "flaw"; it's just how signed values are,
and you have the same issue with 8-bit, 32-bit etc. as well.
(The 8-bit range is -128 to 127.)

However, what matters is that what you input is the same as the output.
(Check this by writing a loader and a saver with no processing at all.)
Load a 16-bit signed value, turn it into a normalized float, then back into a 16-bit signed value,
and compare it to the original value (like in my example).
If the values are the same before and after, you are doing it correctly.

If -32768 ends up as -32767, you have a flaw in the math.
If 32767 ends up as 32768, not only do you have a flaw,
but 32768 does not exist and will end up as -32768, and the saved sound will have some nasty static because of it.
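
A small sketch of the two effects described above, assuming "opposite" is read as the bitwise complement and that a word variable simply wraps around on overflow. The variable names are illustrative only.

```purebasic
; Sketch only: asymmetry and wrap-around of 16-bit signed values.
Define zero.w = 0
Define top.w  = 32767

Debug ~zero      ; -1     (bitwise complement of 0)
Debug ~top       ; -32768 (bitwise complement of 32767)

; 32768 does not fit in a signed word: one step past 32767 wraps around.
Define wrapped.w = 32767
wrapped + 1
Debug wrapped    ; -32768
```

This is why an unclipped +1.0 * 32768 shows up as a full-scale negative spike in the saved file.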

I have seen a lot of routines that actually clamp the negative side from -32768 to -32767 so that negative and positive are "balanced",
but you do risk a slight bias error because of this.
Whether it is audible or not is a totally different discussion.

But as I said: test your 16-bit signed to float to 16-bit signed process.
Do not adjust volume or do anything else, just test the conversion itself.
The goal is to have the output match the input exactly.

Good test values are 0, 1, -1, 32767 and -32768;
if those match in the input and output you can be pretty confident the other values will as well.
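
Such a round-trip test could look roughly like the sketch below. RoundTrip() is a hypothetical helper, and it assumes the simpler single-factor conversion (scale by 32768 in both directions) rather than the separate positive/negative factors from my example. Since checking all 65536 values is cheap, the loop does not stop at the five values listed above.

```purebasic
; Sketch only: word -> float -> word with no processing in between.
; RoundTrip() is a hypothetical helper name.
Procedure.w RoundTrip(sample.w)
  Protected value.f, result.l
  value  = sample / 32768.0                            ; word -> normalized float
  result = Round(value * 32768.0, #PB_Round_Nearest)   ; float -> word
  ProcedureReturn result
EndProcedure

Define failures.l = 0
Define i.l

; Check every possible 16-bit sample value
For i = -32768 To 32767
  If RoundTrip(i) <> i
    failures + 1
  EndIf
Next

Debug failures   ; 0 means every value survived the round trip
```

With a power-of-two scale factor the division and multiplication are exact in a 32-bit float, so no value should change.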

Btw! The example I did, although not optimized in any way, is actually damn quick (thanks to its simplicity); even a wav that is 50+MB should only take a few seconds to "load" if your read routine has good buffering,
and thanks to PB4 a read buffer is even built in by default :P

PS! Once you have converted the input to float,
make sure all following math is in the float domain.
Mixing float and integer will be slower than pure float,
because I'm pretty damn sure a modern CPU is able to optimize its cache
better when there is just pure float math or just pure integer math.
But don't take my word for it; test it, or wait for one of the gurus here to chime in on that part :)

Post by dioxin »

MMX doesn't help as it's integer only and you'll still need to convert to float at some point.
One of the SSE's might help as it has the integer to float conversion, but I don't know if that's SSE, SSE2 or SSE3.

You can usually assume that MMX will be present in everything newer than 8 years but SSE is more recent so you shouldn't assume the CPU has it.

Code: Select all

Timings      AthlonXP      CeleronM
Bit Fiddle    3.9             5.9
FP divide     8.0             8.0
FP mul        3.0             5.1
The figures are the number of CPU clk cycles each CPU takes to do a single conversion for each method of converting integer to float and scaling for a block of 10,000 samples.
FP multiply wins, but I believe a Pentium IV may have poorer FP performance and may therefore be a little quicker with the bit fiddling.

Rescator wrote: even a wav that is 50+MB
With a WAV of that size the speed of the code becomes almost irrelevant. The CPU will spend most of its time waiting for data from the relatively slow RAM.


Paul.