Base64EncoderBuffer(): Bad explanation of OutputSize parameter, example with bug, and weird behavior of function itself.
Posted: Thu May 23, 2024 7:19 pm
Hello!
In the documentation for Base64EncoderBuffer() we can find the following explanation for the required size of the output buffer:
"The output buffer should be at last 33% bigger than the input buffer, with a minimum size of 64 bytes. It's recommended to get a slightly larger buffer, like 35% bigger to avoid overflows."
This seems terribly wrong to me for several reasons.
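First, it might lead the user to write code like this:
Code: Select all
OutputSize = InputSize * 1.35
This is already a very bad thing, because floating-point arithmetic should never be used to calculate buffer sizes and memory addresses.
Second, the code above would give output buffer sizes that are too small for small inputs.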
For example, 1 input byte would require 4 output bytes including padding, but the code would give Int(1*1.35) = 1.
In case of 4 input bytes, 8 output bytes would be required, but it would give Int(4*1.35) = 5.
Third, it makes no sense to provide a rule of thumb when the output buffer size can be predicted exactly.
The required number of characters Base64Size including padding for the Base64 representation of InputSize bytes is given by the following equation, where ceil() returns the smallest integer not smaller than its argument:
Base64Size = 4 * ceil(InputSize / 3)
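Using integer arithmetic in PureBasic, the whole calculation could be done like this:
Code: Select all
Base64Size = ((InputSize + 2) / 3) * 4
Note that the outer parentheses should never be removed in this expression, because for the rounding to work, the division must always be carried out before the multiplication. Technically, it should work without the parentheses, because expressions are evaluated left to right, but better safe than sorry.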
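To make the difference concrete, here is a small sketch that prints the exact size next to the truncated 35% estimate for a few input lengths. The procedure name Base64Size() is only a hypothetical helper for this post, not something PureBasic provides.

```purebasic
; Hypothetical helper (not part of PureBasic): exact Base64 length including padding.
Procedure.i Base64Size(InputSize.i)
  ; Integer division rounds down, so adding 2 first gives ceil(InputSize / 3).
  ProcedureReturn ((InputSize + 2) / 3) * 4
EndProcedure

For InputSize = 0 To 6
  ; Show the exact size next to the truncated rule-of-thumb estimate.
  Debug Str(InputSize) + " input bytes -> exact: " + Str(Base64Size(InputSize)) + ", Int(InputSize * 1.35): " + Str(Int(InputSize * 1.35))
Next
```

For 1 and 4 input bytes this should reproduce the numbers given above (4 versus 1, and 8 versus 5).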
While writing the above, I also noticed that the example in the page linked at the top contains errors as well.
Base64EncoderBuffer() writes single bytes representing ASCII characters into the output buffer, but in the example the output data is written into a PureBasic unicode string, causing gibberish to be displayed by the Debug command.
Also, the documentation does not even state what encoding is used for the output. The Base64 algorithm produces a stream of characters after all, and they have to be encoded in some way. This means that in the current state, it is not possible to use this function without relying on undocumented behavior.
I tried to write a new, correct example, and I came up with this:
Code: Select all
; This example has problems due to weird undocumented behavior from Base64EncoderBuffer()
InputString$ = "This is a test string!"
InputSize = StringByteLength(InputString$)
OutputSize = ((InputSize + 2) / 3) * 4
*OutputBuffer = AllocateMemory(OutputSize)
Debug Base64EncoderBuffer(@InputString$, InputSize, *OutputBuffer, OutputSize)
Debug PeekS(*OutputBuffer, OutputSize, #PB_Ascii)
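Technically, this code should be correct, but when I decoded the resulting Base64 string, I noticed that there were always one or two characters missing at the end of the message. Investigating this issue led to the discovery of another weird and undocumented behavior of Base64EncoderBuffer().
To understand the issue, please run the following code and observe the results in the memory viewer: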
Code: Select all
DataSection
InputData:
Data.a $11, $22, $33, $44
InputEnd:
EndDataSection
#BufferSize = 16
*OutputBuffer = AllocateMemory(#BufferSize)
FillMemory(*OutputBuffer, #BufferSize, $FF, #PB_Ascii)
Debug Base64EncoderBuffer(?InputData, 1, *OutputBuffer, 4)
ShowMemoryViewer(*OutputBuffer, #BufferSize)
CallDebugger
; result: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
FillMemory(*OutputBuffer, #BufferSize, $FF, #PB_Ascii)
Debug Base64EncoderBuffer(?InputData, 1, *OutputBuffer, 5)
ShowMemoryViewer(*OutputBuffer, #BufferSize)
CallDebugger
; result: 45 51 3D 3D 00 FF FF FF FF FF FF FF FF FF FF FF EQ==.ÿÿÿÿÿÿÿÿÿÿÿ
FillMemory(*OutputBuffer, #BufferSize, $FF, #PB_Ascii)
Debug Base64EncoderBuffer(?InputData, 4, *OutputBuffer, 8)
ShowMemoryViewer(*OutputBuffer, #BufferSize)
CallDebugger
; result: 45 53 49 7A FF FF FF FF FF FF FF FF FF FF FF FF ESIzÿÿÿÿÿÿÿÿÿÿÿÿ
FillMemory(*OutputBuffer, #BufferSize, $FF, #PB_Ascii)
Debug Base64EncoderBuffer(?InputData, 4, *OutputBuffer, 9)
ShowMemoryViewer(*OutputBuffer, #BufferSize)
CallDebugger
; result: 45 53 49 7A 52 41 3D 3D 00 FF FF FF FF FF FF FF ESIzRA==.ÿÿÿÿÿÿÿ
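First, we try to compute the Base64 representation of 1 input byte, which requires 4 output bytes including padding.
With an OutputSize of 4, the function returns 0 and the output buffer is not changed.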
With an OutputSize of 5, the function returns the correct number of output bytes including padding and the output buffer contains the correct Base64 representation, but there is a zero byte appended to the output.
Finally, we try to compute the Base64 representation of 4 input bytes, which requires 8 output bytes including padding.
With an OutputSize of 8, the function returns 0, and it only writes the first Base64 block (4 bytes) into the buffer, without a trailing zero byte.
With an OutputSize of 9, the function correctly writes two Base64 blocks into the buffer, again appending a zero byte.
It seems that Base64EncoderBuffer() insists on appending a zero byte to the end of the output buffer.
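For anyone who wants to check this without stepping through the memory viewer, the following sketch probes several input lengths, once with the exact Base64 size and once with one extra byte. The label ProbeData and the variable names are made up for this illustration. I only observed the four cases listed above, so the output for other lengths is something to verify, not something I claim.

```purebasic
; Probe sketch: try OutputSize = exact size and exact size + 1 for several input lengths.
DataSection
  ProbeData:
  Data.a $11, $22, $33, $44, $55, $66
EndDataSection

#ProbeBufferSize = 16
*Probe = AllocateMemory(#ProbeBufferSize)

If *Probe
  For InputSize = 1 To 6
    Needed = ((InputSize + 2) / 3) * 4            ; exact size including padding
    For Extra = 0 To 1
      ; Pre-fill with $FF so every byte the encoder writes is visible.
      FillMemory(*Probe, #ProbeBufferSize, $FF, #PB_Ascii)
      Result = Base64EncoderBuffer(?ProbeData, InputSize, *Probe, Needed + Extra)
      ; The byte right after the expected output shows whether a zero byte was appended.
      Debug "InputSize=" + Str(InputSize) + " OutputSize=" + Str(Needed + Extra) + " returned " + Str(Result) + ", byte after output: $" + Hex(PeekA(*Probe + Needed))
    Next
  Next
  FreeMemory(*Probe)
EndIf
```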
I think this behavior makes absolutely no sense and should be removed for various reasons.
First, it makes no sense to write a trailing zero byte to a memory buffer. Strings might be zero-terminated, but memory buffers are not.
Second, it makes the use of the function unnecessarily complicated. The size of the output buffer, and the OutputSize parameter passed to the function, always need to be 1 larger than the number of bytes or characters required for the Base64 representation (Base64Size as calculated above).
Third, if you have a case where you have to write Base64-encoded data into an existing data structure, and this structure only provides space for the exact number of characters required for a certain input length, the appended zero byte will overwrite following data in the structure.
One could argue that changing the behavior would break existing code, but considering that it is not documented, only code relying on undocumented behavior would be affected.
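Until something changes, a workaround consistent with the observations above is to reserve one extra byte for the appended zero. This is only a sketch based on the behavior I observed, so it relies on undocumented behavior as well. It keeps the input handling of my earlier example, meaning the bytes of the in-memory unicode string are encoded.

```purebasic
; Workaround sketch: one extra output byte for the zero byte that was observed to be appended.
InputString$ = "This is a test string!"
InputSize = StringByteLength(InputString$)        ; byte length of the in-memory string
Base64Size = ((InputSize + 2) / 3) * 4            ; exact number of Base64 characters
OutputSize = Base64Size + 1                       ; assumption: room for the trailing zero byte

*OutputBuffer = AllocateMemory(OutputSize)
If *OutputBuffer
  Written = Base64EncoderBuffer(@InputString$, InputSize, *OutputBuffer, OutputSize)
  Debug Written                                   ; 0 would indicate failure
  Debug PeekS(*OutputBuffer, Base64Size, #PB_Ascii) ; read the single-byte output characters
  FreeMemory(*OutputBuffer)
EndIf
```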
In conclusion, the following changes should be made:
Base64EncoderBuffer() should no longer append a zero byte to the output (and therefore no longer fail when the OutputSize is not 1 byte larger than the required Base64Size).
It should be documented that no zero byte is appended to the output.
The correct calculation of Base64Size should be added to the docs.
The docs should state that the function produces single-byte ASCII characters as the output.
A correct example should be added to the docs.
It should be documented that the function returns 0 if the OutputSize is too small (or if it fails for any other reason). Note that the internal checking of the OutputSize should take the presence of the #PB_Cipher_NoPadding flag into account.
Finally, the documentation for Base64DecoderBuffer() should also be reviewed.
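Regarding the #PB_Cipher_NoPadding point: without padding, the number of Base64 characters is ceil(4 * InputSize / 3) instead of 4 * ceil(InputSize / 3). A size check that respects the flag could look roughly like the sketch below. Base64SizeEx() and its NoPadding parameter are hypothetical names for illustration only, and I have not tested how the current implementation treats OutputSize when the flag is set.

```purebasic
; Hypothetical helper: number of Base64 characters for InputSize bytes.
Procedure.i Base64SizeEx(InputSize.i, NoPadding.i = #False)
  If NoPadding
    ; ceil(4 * InputSize / 3) using integer arithmetic only
    ProcedureReturn (InputSize * 4 + 2) / 3
  EndIf
  ; 4 * ceil(InputSize / 3), including the "=" padding characters
  ProcedureReturn ((InputSize + 2) / 3) * 4
EndProcedure

Debug Base64SizeEx(1)          ; 4 characters with padding
Debug Base64SizeEx(1, #True)   ; 2 characters without padding
Debug Base64SizeEx(4)          ; 8 characters with padding
Debug Base64SizeEx(4, #True)   ; 6 characters without padding
```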
Regards,
Nils (who spent half a day writing all this)