Extract XML data from MS .docx file

Posted: Tue Jan 24, 2017 5:35 pm
by IdeasVacuum
Does anybody know how to extract the XML data from an MS .docx file programmatically?

Re: Extract XML data from MS .docx file

Posted: Tue Jan 24, 2017 5:40 pm
by Kukulkan
A .docx file is nothing more than a zipped folder, so you can use PB's packer functions to open it and extract the content (just like opening a ZIP file).

More about the content here: http://forensicswiki.org/wiki/Word_Document_(DOCX)

Re: Extract XML data from MS .docx file

Posted: Tue Jan 24, 2017 10:00 pm
by IdeasVacuum
Yeah, I had assumed it was, similar to an XLSX file.
I have recently processed XLSX files with code like this:

Code:

; Open the packed file
If OpenPack(#Pack, sgCopyXLSX, #PB_PackerPlugin_Zip)

        If ExaminePack(#Pack)

                While NextPackEntry(#Pack)

                        ; Extract the first entry whose name contains "sheet1.xml"
                        If FindString(PackEntryName(#Pack), "sheet1.xml", 1, #PB_String_NoCase)
                                sFileXML.s = PackEntryName(#Pack)
                                UncompressPackFile(#Pack, sFileXML, PackEntryName(#Pack))
                                Break
                        EndIf
                Wend
        EndIf

        ClosePack(#Pack)
EndIf
In that case, I knew the names of the packed files I needed. I don't know the names of the files stored in the DOCX files to be processed, so I tried listing all the packed files first. No joy at the moment: no filenames are returned.

Code:

#Pack = 0

sDOCX.s = "C:\MY TEMP\Sample.docx"

If OpenPack(#Pack, sDOCX, #PB_PackerPlugin_Zip)

        If ExaminePack(#Pack)

                While NextPackEntry(#Pack)

                         Debug PackEntryName(#Pack)
                Wend
        EndIf

        ClosePack(#Pack)
EndIf
The very reliable BandiZip does not recognize them as Zip files either, so perhaps they have been scrambled.

Edit: Nope, SoftMaker Office (TextMaker) opens the files with ease, so clearly the contents can be read.

Re: Extract XML data from MS .docx file

Posted: Wed Jan 25, 2017 4:31 am
by normeus
Open Bandizip from "All Programs" and choose your file from there.
You'll see the folders:

Code:

_rels
docProps
word

Norm.

Re: Extract XML data from MS .docx file

Posted: Wed Jan 25, 2017 12:51 pm
by IdeasVacuum
Hi normeus - Bandizip reputation intact 8)

Bandizip also works if the file extension is changed to .zip, but that makes no difference to PB's Pack function.
7zip can do it effortlessly, so I could use the command line version via RunProgram, though it would be nice to be able to use PB's Pack functions.
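
A rough sketch of the 7-Zip route, untested against these particular files. The 7z.exe location and the output folder are assumptions (hypothetical paths), and the exit code is not checked, so this only confirms that 7-Zip was launched and has finished:

```purebasic
; Hypothetical paths: adjust to the local 7-Zip install and the DOCX being processed
s7z.s   = "C:\Program Files\7-Zip\7z.exe"
sDOCX.s = "C:\MY TEMP\Sample.docx"
sOut.s  = "C:\MY TEMP\Sample_unpacked"

; 7-Zip switches: x = extract with full paths, -o<dir> = output folder, -y = assume Yes on prompts
; Chr(34) is the double quote, needed because the paths contain spaces
sParams.s = "x " + Chr(34) + sDOCX + Chr(34) + " -o" + Chr(34) + sOut + Chr(34) + " -y"

; #PB_Program_Wait blocks until 7-Zip exits, #PB_Program_Hide suppresses the console window
If RunProgram(s7z, sParams, "", #PB_Program_Wait | #PB_Program_Hide)
  Debug "7-Zip has finished, check " + sOut
Else
  Debug "Could not start " + s7z
EndIf
```

If the result of the extraction matters, opening the program with #PB_Program_Open and reading its exit code would be the more careful approach.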

Re: Extract XML data from MS .docx file

Posted: Tue May 02, 2017 1:07 am
by Karig
IdeasVacuum: Your code (or at least what you've presented here) doesn't call UseZipPacker(). You need to call that before trying to examine a DOCX or XLSX file's contents.
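
As a minimal sketch of that fix, here is the listing code from the earlier post with the one missing call added at the top. It also extracts word/document.xml, which is where the main body text of a DOCX is normally stored. The output file name (sOutXML) is a hypothetical choice for illustration:

```purebasic
UseZipPacker()   ; registers the ZIP plugin; must run before any OpenPack() with #PB_PackerPlugin_Zip

#Pack = 0

sDOCX.s   = "C:\MY TEMP\Sample.docx"                  ; path from the earlier post
sOutXML.s = GetTemporaryDirectory() + "document.xml"  ; hypothetical output location

If OpenPack(#Pack, sDOCX, #PB_PackerPlugin_Zip)

  If ExaminePack(#Pack)

    While NextPackEntry(#Pack)

      ; List every entry so the internal file names become visible
      Debug PackEntryName(#Pack)

      ; Assumption: the main document part uses the usual name word/document.xml
      If LCase(PackEntryName(#Pack)) = "word/document.xml"
        ; UncompressPackFile() returns -1 if the entry could not be extracted
        If UncompressPackFile(#Pack, sOutXML, PackEntryName(#Pack)) = -1
          Debug "Extraction failed"
        Else
          Debug "Extracted to " + sOutXML
        EndIf
      EndIf
    Wend
  EndIf

  ClosePack(#Pack)
Else
  Debug "Could not open " + sDOCX
EndIf
```

The same single call should apply to the XLSX snippet as well, since both formats are ZIP containers.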