Extract XML data from MS .docx file

IdeasVacuum
Always Here
Posts: 6426
Joined: Fri Oct 23, 2009 2:33 am
Location: Wales, UK

Extract XML data from MS .docx file

Post by IdeasVacuum »

Does anybody know how to extract the XML data from an MS .docx file programmatically?
IdeasVacuum
If it sounds simple, you have not grasped the complexity.
Kukulkan
Addict
Posts: 1396
Joined: Mon Jun 06, 2005 2:35 pm
Location: germany

Re: Extract XML data from MS .docx file

Post by Kukulkan »

It is nothing more than a zipped folder, so you can use PB's packer functions to open it and extract the content, just like any other ZIP file.

More about the content here: http://forensicswiki.org/wiki/Word_Document_(DOCX)

Re: Extract XML data from MS .docx file

Post by IdeasVacuum »

Yeah, I had assumed it was similar to an XLSX file.
I have recently processed XLSX files with code like this:

Code: Select all

; Open the packed file
If OpenPack(#Pack, sgCopyXLSX, #PB_PackerPlugin_Zip)

        If ExaminePack(#Pack)

                While NextPackEntry(#Pack)

                        ; Extract the first entry whose name contains "sheet1.xml"
                        If FindString(PackEntryName(#Pack), "sheet1.xml", 1, #PB_String_NoCase)
                                sFileXML.s = PackEntryName(#Pack)
                                UncompressPackFile(#Pack, sFileXML, PackEntryName(#Pack))
                                Break
                        EndIf
                Wend
        EndIf

        ClosePack(#Pack)
EndIf
In that case, I knew the names of the packed files I needed. I don't know the names of the files stored in the DOCX files to be processed, so I tried listing all the entries first. No joy at the moment; no filenames are returned:

Code: Select all

#Pack = 0

sDOCX.s = "C:\MY TEMP\Sample.docx"

If OpenPack(#Pack, sDOCX, #PB_PackerPlugin_Zip)

        If ExaminePack(#Pack)

                While NextPackEntry(#Pack)

                         Debug PackEntryName(#Pack)
                Wend
        EndIf

        ClosePack(#Pack)
EndIf
The very reliable BandiZip does not recognize them as Zip files either, so perhaps they have been scrambled.

Edit: Nope, SoftMaker Office (TextMaker) opens the files with ease, so clearly the contents can be read.
normeus
Enthusiast
Posts: 470
Joined: Fri Apr 20, 2012 8:09 pm

Re: Extract XML data from MS .docx file

Post by normeus »

Open Bandizip from "All Programs" and choose your file from there.
You'll see the folders:

Code: Select all

_rels
docProps
word
Norm.
google Translate;Makes my jokes fall flat- Fait mes blagues tombent à plat- Machte meine Witze verpuffen- Eh cumpari ci vo sunari

Re: Extract XML data from MS .docx file

Post by IdeasVacuum »

Hi normeus - Bandizip reputation intact 8)

Bandizip also works if the file extension is changed to .zip, but that makes no difference to PB's Pack functions.
7-Zip can do it effortlessly, so I could use the command-line version via RunProgram, though it would be nice to be able to use PB's Pack functions.
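
For reference, the RunProgram fallback could look roughly like the sketch below. It is only a sketch under assumptions: it presumes 7z.exe can be found on the PATH, and the output folder name is hypothetical. The `x` command, the `-o` switch (no space before the folder) and `-y` are standard 7-Zip command-line options.

```purebasic
; Sketch only: assumes 7z.exe is reachable via the PATH.
; sOutDir is a hypothetical output folder chosen for illustration.
Define sDOCX.s   = "C:\MY TEMP\Sample.docx"
Define sOutDir.s = GetTemporaryDirectory() + "docx_out"

; x = extract with full paths, -o = output folder, -y = answer yes to all prompts
Define sArgs.s = "x " + #DQUOTE$ + sDOCX + #DQUOTE$ + " -o" + #DQUOTE$ + sOutDir + #DQUOTE$ + " -y"

If RunProgram("7z.exe", sArgs, "", #PB_Program_Wait | #PB_Program_Hide)
  Debug "7-Zip has finished, extracted files should be under " + sOutDir
Else
  Debug "Could not start 7z.exe"
EndIf
```

RunProgram only reports whether the program could be started, so a real tool would still need to check that the expected XML files exist afterwards.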
Karig
New User
Posts: 7
Joined: Mon Jul 11, 2016 10:20 pm

Re: Extract XML data from MS .docx file

Post by Karig »

IdeasVacuum: Your code (or at least what you've presented here) doesn't call UseZipPacker(). You need to call that before trying to examine a DOCX or XLSX file's contents.
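
As a minimal sketch, here is the listing code from earlier in the thread with that call added. The extraction step is an addition for illustration: it assumes the main document part is stored as word/document.xml (the usual location in a DOCX package), and the output path sOut is a hypothetical name.

```purebasic
UseZipPacker()   ; register the ZIP plugin before any OpenPack() call

#Pack = 0

Define sDOCX.s = "C:\MY TEMP\Sample.docx"
Define sOut.s  = GetTemporaryDirectory() + "document.xml"   ; hypothetical output path

If OpenPack(#Pack, sDOCX, #PB_PackerPlugin_Zip)

  If ExaminePack(#Pack)

    While NextPackEntry(#Pack)

      Debug PackEntryName(#Pack)   ; list every entry in the package

      ; Assumption: the main body lives in word/document.xml
      If LCase(PackEntryName(#Pack)) = "word/document.xml"
        ; Without a third parameter, the entry currently being examined is extracted
        If UncompressPackFile(#Pack, sOut) = -1
          Debug "Extraction failed"
        EndIf
      EndIf
    Wend
  EndIf

  ClosePack(#Pack)
Else
  Debug "OpenPack() failed"
EndIf
```

The same UseZipPacker() call would also be needed before the XLSX snippet, assuming it is not already made elsewhere in that program.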