Our work focuses mainly on DOM simulation. We believe the features below are the most important for deobfuscation. We also implemented many others so that our program can handle normal web pages, but we do not list them here.
Our code can be found at:
1. DOM tree generation.
We define a class 'DOMObject' in Python, which has a list 'children' as a member. We use SGMLParser to parse the HTML document and create a DOMObject each time a start tag is encountered. The resulting DOM tree can be output for further analysis.
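The tree construction can be sketched roughly as follows. This is a minimal illustration, not our actual code: it assumes Python 3 and uses html.parser.HTMLParser in place of SGMLParser (the sgmllib module was removed in Python 3). Apart from 'DOMObject' and 'children', every name here (DOMBuilder, build_dom, dump, VOID_TAGS) is hypothetical.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag, so they must not become the "current" node.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DOMObject:
    """One node of the simulated DOM tree."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""

    def dump(self, depth=0):
        """Return the subtree as indented lines, one tag per line."""
        lines = ["  " * depth + self.tag]
        for child in self.children:
            lines.extend(child.dump(depth + 1))
        return lines


class DOMBuilder(HTMLParser):
    """Creates a DOMObject for every start tag (assumed stand-in for SGMLParser)."""

    def __init__(self):
        super().__init__()
        self.root = DOMObject("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = DOMObject(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def build_dom(html):
    builder = DOMBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Calling build_dom(html).dump() returns the tree as a list of indented tag names, which is one simple way to output it for later analysis.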
Each time the function document.write is called, its argument is passed to a new parser. This parser is linked to the script object, so it can handle special cases such as a tag that is split into several parts and written by several document.write calls, or written text that itself contains a call to document.write.
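The split-tag case can be illustrated with a parser that stays attached to the script object across calls. The sketch below is an assumption about how this might look in Python 3, not our implementation: ScriptObject, WriteParser and document_write are hypothetical names, and it relies on html.parser.HTMLParser buffering an incomplete tag until a later feed() call completes it. The nested document.write case needs a JavaScript interpreter and is not shown.

```python
from html.parser import HTMLParser


class WriteParser(HTMLParser):
    """Parser kept alive for the lifetime of one script object (hypothetical)."""

    def __init__(self):
        super().__init__()
        self.tags = []  # start tags seen so far, as (name, attrs) pairs

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


class ScriptObject:
    """Stand-in for a <script> element being executed by the simulator."""

    def __init__(self):
        # One parser per script, so partial markup survives between writes.
        self.parser = WriteParser()

    def document_write(self, text):
        # An incomplete tag stays in the parser's buffer until it is completed.
        self.parser.feed(text)


script = ScriptObject()
script.document_write("<ifr")
script.document_write("ame src='http://example.com/a.html'>")
```

After the second call, script.parser.tags holds a single iframe entry, even though neither write contained a complete tag on its own.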
When a DOMObject's innerHTML is changed, its HTML text is updated and the new innerHTML is parsed.
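One way to model this in Python is a property whose setter reparses the new markup. This is only a sketch under assumptions: it uses Python 3's html.parser.HTMLParser rather than SGMLParser, and _FragmentParser, _inner_html and VOID_TAGS are hypothetical names.

```python
from html.parser import HTMLParser

# Tags without a closing tag; they are added as children but never left open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DOMObject:
    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []
        self._inner_html = ""

    @property
    def innerHTML(self):
        return self._inner_html

    @innerHTML.setter
    def innerHTML(self, html):
        # Store the new markup, drop the old subtree, and parse the new one.
        self._inner_html = html
        self.children = []
        parser = _FragmentParser(self)
        parser.feed(html)
        parser.close()


class _FragmentParser(HTMLParser):
    """Builds children under an existing node (hypothetical helper)."""

    def __init__(self, root):
        super().__init__()
        self.stack = [root]

    def handle_starttag(self, tag, attrs):
        node = DOMObject(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open element; never pop the fragment root.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break
```

Assigning to node.innerHTML a second time discards the previously parsed children, which mirrors how a browser replaces the subtree.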
The source file of an iframe object is downloaded and parsed. Likewise, when document.location is changed, the HTML document at the new location is downloaded and parsed.
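The download-and-parse behaviour might be sketched as below. This is an assumption-laden illustration in Python 3, not our code: Document, IframeFinder, load and fetch_url are hypothetical names, the depth limit and the duplicate check are additions to keep the sketch from looping, and the fetch function is injectable so that the logic can be exercised without network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


def fetch_url(url):
    """Default downloader; any callable mapping a URL to HTML text can replace it."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


class IframeFinder(HTMLParser):
    """Collects the src attribute of every iframe in a document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get("src")
        if tag == "iframe" and src:
            self.sources.append(src)


def load(url, fetch=fetch_url, seen=None, max_depth=3):
    """Download and parse url, then recurse into its iframes.

    Returns the list of URLs loaded, in order.
    """
    seen = [] if seen is None else seen
    if url in seen or max_depth < 0:
        return seen
    seen.append(url)
    finder = IframeFinder()
    finder.feed(fetch(url))
    finder.close()
    for src in finder.sources:
        # Resolve relative iframe sources against the containing document.
        load(urljoin(url, src), fetch, seen, max_depth - 1)
    return seen


class Document:
    """Simulated document whose location setter triggers a new download."""

    def __init__(self, url, fetch=fetch_url):
        self._fetch = fetch
        self.loaded = []
        self.location = url

    @property
    def location(self):
        return self._location

    @location.setter
    def location(self, url):
        self._location = url
        self.loaded.extend(load(url, self._fetch))
```

For offline analysis, a dictionary of saved pages can serve as the downloader, for example Document(url, fetch=pages.__getitem__).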
An object tag (COM) is treated as an unknownObject. Calling one of its methods or changing one of its attributes outputs a message for further analysis.
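A logging stand-in of this kind can be sketched with Python's attribute hooks. The class below is a hypothetical illustration, not our implementation: the message format, the _name and _log fields, and the choice to treat every unknown member as a method are all assumptions.

```python
class UnknownObject:
    """Stand-in for a COM object created by an <object> tag; records every access."""

    def __init__(self, name, log=None):
        # Bypass our own __setattr__ while setting up internal state.
        object.__setattr__(self, "_name", name)
        object.__setattr__(self, "_log", log if log is not None else [])

    def __setattr__(self, attr, value):
        # Record the attribute change, then store it so later reads work.
        self._log.append("set %s.%s = %r" % (self._name, attr, value))
        object.__setattr__(self, attr, value)

    def __getattr__(self, attr):
        # Only reached when normal lookup fails, i.e. for members never set:
        # treat them as methods and record each call with its arguments.
        if attr.startswith("_"):
            raise AttributeError(attr)

        def method(*args):
            self._log.append("call %s.%s%r" % (self._name, attr, args))

        return method


obj = UnknownObject("obj1")
obj.url = "http://example.com/x.exe"
obj.DownloadAndExecute("a", 1)
```

After these two statements, obj._log contains one "set" entry and one "call" entry, which can be written out for later inspection.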