
Base64EncoderBuffer(): Bad explanation of OutputSize parameter, example with bug, and weird behavior of function itself.

Posted: Thu May 23, 2024 7:19 pm
by NilsH
Hello!

In the documentation for Base64EncoderBuffer() we can find the following explanation for the required size of the output buffer:
The output buffer should be at least 33% bigger than the input buffer, with a minimum size of 64 bytes. It's recommended to get a slightly larger buffer, like 35% bigger to avoid overflows.
This seems terribly wrong to me for several reasons.

First, it might lead the user to write code like this:

Code: Select all

OutputSize = InputSize * 1.35
This is already a very bad thing, because floating point arithmetic should never be used to calculate buffer sizes and memory addresses.

Second, the code above yields output buffer sizes that are too small for small inputs.
For example, 1 input byte requires 4 output bytes including padding, but the code gives Int(1*1.35) = 1.
For 4 input bytes, 8 output bytes are required, but the code gives Int(4*1.35) = 5.
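
A quick check of these two cases (just a sketch):

Code: Select all

Debug Int(1 * 1.35)   ; gives 1, but one input byte needs 4 output bytes (e.g. "EQ==")
Debug Int(4 * 1.35)   ; gives 5, but four input bytes need 8 output bytes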

Third, it makes no sense to provide a rule of thumb when the output buffer size can be predicted exactly.

The required number of characters Base64Size including padding for the Base64 representation of InputSize bytes is given by the following equation, where ceil() returns the smallest integer not smaller than its argument:
Base64Size = 4 * ceil(InputSize / 3)

Using integer arithmetics in PureBasic, the whole calculation could be done like this:

Code: Select all

Base64Size = ((InputSize + 2) / 3) * 4
Note that the outer parentheses should not be removed from this expression: for the rounding to work, the division must be carried out before the multiplication. Technically it also works without them, because PureBasic evaluates operators of equal precedence from left to right, but better safe than sorry.
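
A small sketch to illustrate the evaluation order (InputSize = 2 is just an arbitrary test value):

Code: Select all

InputSize = 2
Debug ((InputSize + 2) / 3) * 4   ; 4 - the division is explicitly done first
Debug (InputSize + 2) / 3 * 4     ; also 4 - / and * have equal precedence and run left to right
Debug ((InputSize + 2) * 4) / 3   ; 5 - if the multiplication ran first, the rounding would be lost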

While writing the above, I noticed that the example on the page linked at the top contains errors as well.
Base64EncoderBuffer() writes single bytes representing ASCII characters into the output buffer, but in the example the output data is written into a PureBasic unicode string, causing gibberish to be displayed by the Debug command.
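
For illustration, here is a little sketch (not the example from the manual) of that mismatch: the ASCII output bytes land in the UTF-16 slots of a PureBasic string, so Debug shows gibberish, while reading the same memory as ASCII gives the expected Base64 text.

Code: Select all

DataSection
  Demo:
  Data.a $41, $42, $43               ; the bytes of "ABC"
EndDataSection

Output$ = Space(64)                  ; 64 characters = 128 bytes of UTF-16 storage
Debug Base64EncoderBuffer(?Demo, 3, @Output$, 64)
Debug Output$                        ; the ASCII bytes "QUJD" read as UTF-16 -> gibberish
Debug PeekS(@Output$, 4, #PB_Ascii)  ; the same memory read as ASCII -> "QUJD"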

Also, the documentation does not even state what encoding is used for the output. The Base64 algorithm produces a stream of characters after all, and they have to be encoded in some way. This means that in the current state, it is not possible to use this function without relying on undocumented behavior.

I tried to write a new and correct example code, and I came up with this:

Code: Select all

; This example has problems due to weird undocumented behavior from Base64EncoderBuffer()

InputString$ = "This is a test string!"
InputSize = StringByteLength(InputString$)

OutputSize = ((InputSize + 2) / 3) * 4
*OutputBuffer = AllocateMemory(OutputSize)

Debug Base64EncoderBuffer(@InputString$, InputSize, *OutputBuffer, OutputSize)
Debug PeekS(*OutputBuffer, OutputSize, #PB_Ascii)
Technically, this code should be correct, but when I decoded the resulting Base64 string, I noticed that there were always one or two characters missing at the end of the message. Investigating this issue led to the discovery of another weird and undocumented behavior of Base64EncoderBuffer().

To understand the issue, please run the following code and observe the results in the memory viewer:

Code: Select all

DataSection
  InputData:
  Data.a $11, $22, $33, $44
  InputEnd:
EndDataSection


#BufferSize = 16
*OutputBuffer = AllocateMemory(#BufferSize)


FillMemory(*OutputBuffer, #BufferSize, $FF, #PB_Ascii)
Debug Base64EncoderBuffer(?InputData, 1, *OutputBuffer, 4)
ShowMemoryViewer(*OutputBuffer, #BufferSize)
CallDebugger
; result: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF  ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ

FillMemory(*OutputBuffer, #BufferSize, $FF, #PB_Ascii)
Debug Base64EncoderBuffer(?InputData, 1, *OutputBuffer, 5)
ShowMemoryViewer(*OutputBuffer, #BufferSize)
CallDebugger
; result: 45 51 3D 3D 00 FF FF FF FF FF FF FF FF FF FF FF  EQ==.ÿÿÿÿÿÿÿÿÿÿÿ


FillMemory(*OutputBuffer, #BufferSize, $FF, #PB_Ascii)
Debug Base64EncoderBuffer(?InputData, 4, *OutputBuffer, 8)
ShowMemoryViewer(*OutputBuffer, #BufferSize)
CallDebugger
; result: 45 53 49 7A FF FF FF FF FF FF FF FF FF FF FF FF  ESIzÿÿÿÿÿÿÿÿÿÿÿÿ

FillMemory(*OutputBuffer, #BufferSize, $FF, #PB_Ascii)
Debug Base64EncoderBuffer(?InputData, 4, *OutputBuffer, 9)
ShowMemoryViewer(*OutputBuffer, #BufferSize)
CallDebugger
; result: 45 53 49 7A 52 41 3D 3D 00 FF FF FF FF FF FF FF  ESIzRA==.ÿÿÿÿÿÿÿ
First, we try to compute the Base64 representation of 1 input byte, which requires 4 output bytes including padding.
With an OutputSize of 4, the function returns 0 and the output buffer is not changed.
With an OutputSize of 5, the function returns the correct number of output bytes including padding and the output buffer contains the correct Base64 representation, but there is a zero byte appended to the output.

Finally, we try to compute the Base64 representation of 4 input bytes, which requires 8 output bytes including padding.
With an OutputSize of 8, the function returns 0, and it only writes the first Base64 block (4 bytes) into the buffer, without a trailing zero byte.
With an OutputSize of 9, the function correctly writes two Base64 blocks into the buffer, again appending a zero byte.

It seems that Base64EncoderBuffer() insists on appending a zero byte to the end of the output buffer.
I think this behavior makes absolutely no sense and should be removed for various reasons.

First, it makes no sense to write a trailing zero byte to a memory buffer. Strings might be zero-terminated, but memory buffers are not.

Second, it makes the use of the function unnecessarily complicated. The size of the output buffer, and the OutputSize parameter passed to the function, always need to be 1 larger than the number of bytes or characters required for the Base64 representation (Base64Size as calculated above); see the corrected example below.

Third, if you have a case where you have to write Base64-encoded data into an existing data structure, and this structure only provides space for the exact number of characters required for a certain input length, the appended zero byte will overwrite following data in the structure.
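
For reference, here is a variant of my example from above that works with this behavior (just a sketch based on the observations; note that the 64-byte minimum from the docs is apparently not enforced, as the experiments above show):

Code: Select all

InputString$ = "This is a test string!"
InputSize = StringByteLength(InputString$)   ; UTF-16 bytes of the PureBasic string

Base64Size = ((InputSize + 2) / 3) * 4       ; exact size of the padded Base64 output
OutputSize = Base64Size + 1                  ; one extra byte for the appended zero byte
*OutputBuffer = AllocateMemory(OutputSize)

Written = Base64EncoderBuffer(@InputString$, InputSize, *OutputBuffer, OutputSize)
Debug Written                                ; number of Base64 characters written
Debug PeekS(*OutputBuffer, Written, #PB_Ascii)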

One could argue that changing the behavior would break existing code, but considering that it is not documented, only code relying on undocumented behavior would be affected.

In conclusion, the following changes should be made:
Base64EncoderBuffer() should no longer append a zero byte to the output, and therefore should no longer fail when the OutputSize is not 1 byte larger than the required Base64Size.
It should be documented that no zero byte is appended to the output.
The correct calculation of Base64Size should be added to the docs.
The docs should state that the function produces single byte ASCII characters as the output.
A correct example should be added to the docs.
It should be documented that the function returns 0 if the OutputSize is too small (or if it fails for any other reason). Note that the internal checking of the OutputSize should take the presence of the #PB_Cipher_NoPadding flag into account.

Finally, the documentation for Base64DecoderBuffer() should also be reviewed.

Regards,
Nils (who spent half a day writing all this)

Re: Base64EncoderBuffer(): Bad explanation of OutputSize parameter, example with bug, and weird behavior of function itself

Posted: Thu May 23, 2024 8:19 pm
by STARGÅTE
Dear NilsH,

thank you for your contribution.
I agree, the documentation is outdated in several points.
At some point Base64Encoder was introduced, which returns a string instead of using the encoded ascii buffer.
Furthermore, it encodes the input string (buffer) as a unicode string, which leads to confusion when comparing with online tools that use the UTF-8 representation.

However, in your new examples you ignore the fact that the output buffer needs a size of at least 64 bytes!
In such cases there is no problem with tiny input buffer sizes and "InputSize * 1.35", because you have to use 64 bytes anyway.

I would recommend adding a function like:

Code: Select all

Procedure.s Base64StringEncoder(String.s, Flags.i=#Null)
	
	Protected *InputBuffer = UTF8(String)
	Protected InputBufferLength = StringByteLength(String, #PB_UTF8)
	Protected EncodedString.s
	
	EncodedString = Base64Encoder(*InputBuffer, InputBufferLength, Flags)
	
	FreeMemory(*InputBuffer)
	
	ProcedureReturn EncodedString
	
EndProcedure
And similarly for the decoding.
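
A counterpart for decoding could look roughly like this (just a sketch; it hands the ASCII bytes of the Base64 text to Base64DecoderBuffer() and reads the decoded bytes back as UTF-8, mirroring the encoder above):

Code: Select all

Procedure.s Base64StringDecoder(Encoded.s)
	
	Protected *InputBuffer = Ascii(Encoded)                    ; the Base64 text as single-byte ASCII
	Protected InputLength = StringByteLength(Encoded, #PB_Ascii)
	Protected *OutputBuffer = AllocateMemory(InputLength + 1)  ; decoded data is always smaller than the Base64 text
	Protected OutputLength, DecodedString.s
	
	OutputLength = Base64DecoderBuffer(*InputBuffer, InputLength, *OutputBuffer, MemorySize(*OutputBuffer))
	PokeA(*OutputBuffer + OutputLength, 0)                     ; terminate, so the buffer can be read as a string
	DecodedString = PeekS(*OutputBuffer, -1, #PB_UTF8)         ; matches the UTF-8 used by the encoder
	
	FreeMemory(*InputBuffer)
	FreeMemory(*OutputBuffer)
	
	ProcedureReturn DecodedString
	
EndProcedure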

Re: Base64EncoderBuffer(): Bad explanation of OutputSize parameter, example with bug, and weird behavior of function itself

Posted: Thu May 23, 2024 9:56 pm
by NilsH
STARGÅTE wrote: Thu May 23, 2024 8:19 pm At some point Base64Encoder was introduced, which returns a string instead of using the encoded ascii buffer.
That's good if you need the result as a PureBasic string, but when you want to write ASCII bytes to a buffer, Base64EncoderBuffer() is the way to go.
STARGÅTE wrote: Thu May 23, 2024 8:19 pm Furthermore, it encodes the input string (buffer) as a unicode string, which leads to confusion when comparing with online tools that use the UTF-8 representation.
Base64Encoder() does not accept an input string; it accepts a memory address and simply encodes the bytes it finds there. If you point it to a UTF-16 string, it will of course encode that string. Nothing unclear here.
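
A quick sketch of that difference (with an arbitrary test string):

Code: Select all

Text$ = "Hello"

Debug Base64Encoder(@Text$, StringByteLength(Text$))           ; encodes the UTF-16 bytes of the PB string
*Utf8 = UTF8(Text$)
Debug Base64Encoder(*Utf8, StringByteLength(Text$, #PB_UTF8))  ; encodes the UTF-8 bytes, which is what online tools usually show
FreeMemory(*Utf8)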
STARGÅTE wrote: Thu May 23, 2024 8:19 pm However, in your new examples you ignore the fact that the output buffer needs a size of at least 64 bytes!
I have to admit that I overlooked this when I started writing, but I decided to post it anyway because that minimum size makes no sense.

As I said, the number of output characters for a certain input size can be calculated exactly using integer arithmetic.
And I can see no reason why a Base64 encoding algorithm would need an output buffer larger than the size necessary for the encoded representation.
Really, why does this minimum size exist? Does the algorithm use the extra bytes as a scratchpad to store temporary data? I don't think so.
I would really like to hear the reason from the developers.

The only explanation I have for this minimum size of 64 bytes is that the developers themselves realized that the "just add 35%" rule would lead to problems, and they added this constraint in order to avoid situations where their floating point arithmetic would fail.

Too bad that there are still cases where it fails.
The following code calculates the output size for input sizes between 0 and 100000.
It uses both the correct integer calculation and the floating point calculation, and it applies the 64 bytes minimum rule to the result of the floating point calculation.
Whenever the floating point calculation gives a smaller output buffer size than the integer calculation, it prints the input and output sizes.

Code: Select all

For InputSize = 0 To 100000
  
  OutputSize = ((InputSize + 2) / 3) * 4
  
  OutputSizeFP.d = (InputSize * 1.35)
  
  OutputSizeFPInt = OutputSizeFP
  
  If OutputSizeFPInt < 64
    OutputSizeFPInt = 64
  EndIf
  
  If OutputSizeFPInt < OutputSize
    Debug "input = " + Str(InputSize) + "   correct calc = " + Str(OutputSize) + "   float calc = " + Str(OutputSizeFPInt)
  EndIf

Next
STARGÅTE wrote: Thu May 23, 2024 8:19 pm In such cases there is no problem with tiny input buffer sizes and "InputSize * 1.35", because you have to use 64 bytes anyway.
As you can see, there are still problems.
I think every sane programmer will agree with me that floating point arithmetic should never be used in memory address and size calculations.

I strongly suspect that there is no technical reason for that 64-byte minimum constraint.
And that makes it even worse, because then it would be a deliberately introduced, unnecessary limitation of the function's usability.
Like I said in the opening post, there might be cases where you have to write Base64-encoded data into an existing buffer, the size of which is out of your control, and that buffer might be much smaller than 64 bytes.
In that case the function would be unusable, and you would have to allocate a temporary buffer, encode the data into that buffer, and then copy the encoded data into the original buffer, causing entirely avoidable overhead for memory allocation and copying.
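
Just to illustrate what that workaround would look like (a sketch with made-up sizes; *Target stands in for the exact-size field inside such a structure):

Code: Select all

InputSize = 30
Base64Size = ((InputSize + 2) / 3) * 4     ; exactly 40 Base64 characters

*Input = AllocateMemory(InputSize)         ; the data to encode
*Target = AllocateMemory(Base64Size)       ; stands in for the exact-size destination

TempSize = Base64Size + 1                  ; extra byte for the appended zero
If TempSize < 64 : TempSize = 64 : EndIf   ; respect the documented minimum, just in case
*Temp = AllocateMemory(TempSize)

If Base64EncoderBuffer(*Input, InputSize, *Temp, TempSize)
  CopyMemory(*Temp, *Target, Base64Size)   ; copy only the Base64 characters into the destination
EndIf
FreeMemory(*Temp)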

Regards,
Nils

Re: Base64EncoderBuffer(): Bad explanation of OutputSize parameter, example with bug, and weird behavior of function itself

Posted: Fri May 24, 2024 8:34 am
by STARGÅTE
NilsH wrote: Thu May 23, 2024 9:56 pm
STARGÅTE wrote: Thu May 23, 2024 8:19 pm Furthermore, it encodes the input string (buffer) as a unicode string, which leads to confusion when comparing with online tools that use the UTF-8 representation.
Base64Encoder() does not accept an input string; it accepts a memory address and simply encodes the bytes it finds there. If you point it to a UTF-16 string, it will of course encode that string. Nothing unclear here.
Nothing unclear for you (and also for me). But a beginner doesn't know (and isn't interested in) the internal format of PureBasic strings.
We (the community) had the same "issue" with the MD5 fingerprint, because users use @String and wonder why the MD5 is "wrong".
NilsH wrote: Thu May 23, 2024 9:56 pm I have to admit that I overlooked this when I started writing, but I decided to post it anyway because that minimum size makes no sense.
Whether it makes sense or not is not your decision; it is a restriction of the implementation within PureBasic.
Probably, 64 bytes are needed for internal calculation or something else. Maybe it is also outdated, but up to now, it is a rule.
NilsH wrote: Thu May 23, 2024 9:56 pm As you can see, there are still problems.
Yes, you are right, because of the padding. And having a memory size that is a multiple of 4 is better for the computer anyway.

Re: Base64EncoderBuffer(): Bad explanation of OutputSize parameter, example with bug, and weird behavior of function itself

Posted: Fri May 24, 2024 3:56 pm
by NilsH
STARGÅTE wrote: Fri May 24, 2024 8:34 am
NilsH wrote: Thu May 23, 2024 9:56 pm As you can see, there are still problems.
Yes, you are right, because of the padding.
I'm not exactly sure what you mean by that. I do know that Base64 usually pads its output to a multiple of 4 characters, because 4 output characters map exactly to 3 input bytes.
My point is that the manual claims that you can calculate the OutputSize for a certain InputSize by doing this:

Code: Select all

OutputSize.i = InputSize * 1.35

If OutputSize < 64
  OutputSize = 64
EndIf
And this is wrong and will fail with certain input sizes.

I just realized that I made a mistake in the code in my last post. I forgot to take into account that Base64EncoderBuffer() requires the OutputSize to be one larger than the Base64Size. With the fixed version below, there are even more cases where the floating point calculation gives an OutputSize that is too small, causing Base64EncoderBuffer() to fail.

Code: Select all

For InputSize = 0 To 100000
  
  Base64Size = ((InputSize + 2) / 3) * 4
  OutputSize = Base64Size + 1
  
  OutputSizeFP.d = InputSize * 1.35
  OutputSizeFPInt = OutputSizeFP
  If OutputSizeFPInt < 64
    OutputSizeFPInt = 64
  EndIf
  
  If OutputSizeFPInt < OutputSize
    Debug "input = " + Str(InputSize) + "   correct calc = " + Str(OutputSize) + "   float calc = " + Str(OutputSizeFPInt)
  EndIf

Next
The documentation page for Base64DecoderBuffer() has example code that uses the floating point calculation described above.
Here is a modified version of that example, which lets you set an arbitrary InputSize (except zero). Whenever you enter one of the InputSizes that the program above prints, it fails because the output buffer passed to Base64EncoderBuffer() is too small. So this example is flawed as well.

Code: Select all

InputSize = 190
*Input = AllocateMemory(InputSize)

Size = InputSize * 1.35
If Size < 64
  Size = 64
EndIf

*EncodeBuffer = AllocateMemory(Size)
Size = Base64EncoderBuffer(*Input, InputSize, *EncodeBuffer, MemorySize(*EncodeBuffer))
Encoded$ = PeekS(*EncodeBuffer, Size, #PB_Ascii)
Debug Encoded$

*DecodeBuffer = AllocateMemory(Size)
Size = PokeS(*EncodeBuffer, Encoded$, StringByteLength(Encoded$, #PB_Ascii), #PB_Ascii|#PB_String_NoZero)
Size = Base64DecoderBuffer(*EncodeBuffer, Size, *DecodeBuffer, MemorySize(*DecodeBuffer))
ShowMemoryViewer(*DecodeBuffer, Size)
STARGÅTE wrote: Fri May 24, 2024 8:34 am Whether it makes sense or not is not your decision; it is a restriction of the implementation within PureBasic.
Probably, 64 bytes are needed for internal calculation or something else. Maybe it is also outdated, but up to now, it is a rule.
Well, at least it makes no sense to me. I can see no reason (technical requirement) why a Base64 encoding algorithm should need an output buffer larger than necessary for the Base64-encoded output. If there is a reason, I'd be happy if the developers could tell me, so that I can understand it.
And you're right, it's a rule, because it's what the documentation says. But it is still a bad thing, because it makes the use of the function more complicated than necessary. And this holds true even if there is a technical requirement for the minimum size.
I want to claim at this point that the 64-byte minimum constraint does not actually exist; it's only in the docs.
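
A quick way to check this claim (just a sketch):

Code: Select all

Test$ = "Hi"
*Tiny = AllocateMemory(9)                                        ; far below the documented 64-byte minimum
Written = Base64EncoderBuffer(@Test$, StringByteLength(Test$), *Tiny, 9)
Debug Written                                                    ; 8 expected, as in the experiment in my first post
Debug PeekS(*Tiny, Written, #PB_Ascii)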

Finally, even if there is a 64-byte minimum rule, the only correct way to get the OutputSize is integer arithmetic:

Code: Select all

; OutputSize for Base64 with padding

Base64Size = ((InputSize + 2) / 3) * 4 ; size of Base64 representation with padding
OutputSize = Base64Size + 1            ; Base64EncoderBuffer() needs an extra byte for zero termination
If OutputSize < 64                     ; 64 bytes minimum rule
  OutputSize = 64
EndIf

; OutputSize for Base64 without padding

Base64SizeNoPad = (InputSize * 8 + 5) / 6 ; size of Base64 representation without padding
OutputSize = Base64SizeNoPad + 1          ; Base64EncoderBuffer() needs an extra byte for zero termination
If OutputSize < 64                        ; 64 bytes minimum rule
  OutputSize = 64
EndIf

Regards,
Nils