Project 1 - Automated Attack Community Graph Construction (Julia)

Student: Julia Y. Cheng
Primary mentor: Chris Horsley
Backup mentors: Natalia Stakhanova, Franck Guenichot, Thanh Nguyen, Claudio Guarnieri

Google Melange: http://www.google-melange.com/gsoc/project/google/gsoc2012/juliaycheng/16001

Project Overview:
The large volume of honeypot logs makes data analysis and interpretation difficult. To reduce the experts' workload and the complexity of the analysis, this GSoC project will automatically build an attack community graph that elicits attack approaches and describes attacker intentions.

The GSoC idea is divided into three stages. The first constructs the attack graph by extracting relationships among criminals, victims and malicious servers from honeypot logs. For this project, I will use dionaea, glaspot and kippo logs as the first-level raw data. To derive second-level analysis data from the first-level data, the Cuckoo Sandbox developed by Claudio Guarnieri, the PHP sandbox by Lukas Rist, Thug by Angelo Dell'Aera, Hale by Patrik Lantz and fast-flux detection will be applied for advanced data collection and analysis. After completing data collection and processing, I will extract relationships from those data to build the attack graph.
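As a rough illustration of the relationship-extraction step, the sketch below builds a directed attack graph from already-parsed honeypot events with networkx. The event field names (attacker_ip, victim_ip, url, md5) are hypothetical placeholders, not the actual dionaea/glastopf/kippo log schemas.

    # Sketch only: build an attack graph from parsed honeypot events.
    # The field names below are illustrative placeholders, not the real
    # dionaea/glastopf/kippo log schemas.
    import networkx as nx

    def build_attack_graph(events):
        g = nx.DiGraph()
        for ev in events:
            attacker = ev.get("attacker_ip")
            victim = ev.get("victim_ip")
            url = ev.get("url")    # e.g. landing or malware download URL
            md5 = ev.get("md5")    # hash of a captured sample
            if attacker and victim:
                g.add_node(attacker, type="attacker")
                g.add_node(victim, type="honeypot")
                g.add_edge(attacker, victim, relation="attacked")
            if victim and url:
                g.add_edge(victim, url, relation="redirected_to")
            if url and md5:
                g.add_edge(url, md5, relation="served_malware")
        return g

    # Example with two fabricated events:
    events = [
        {"attacker_ip": "198.51.100.7", "victim_ip": "203.0.113.1",
         "url": "http://malicious.example/dl", "md5": "d41d8cd98f00b204e9800998ecf8427e"},
        {"attacker_ip": "198.51.100.7", "victim_ip": "203.0.113.2"},
    ]
    print(build_attack_graph(events).number_of_edges())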
The second stage applies a centrality mechanism to group the graph into individual attack approach compartments. By evaluating the relative centrality of the different attack approach compartments, the attack community graph will be constructed by connecting high-density compartments. I will also map each attack approach compartment to its attack behavior intentions. The second deliverable is a Python package that expresses the attack community graph and its intentions.
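To make the centrality idea concrete, here is a minimal sketch (an assumption of mine, not the project's final grouping algorithm) that treats weakly connected components of the attack graph as candidate attack approach compartments and scores them with betweenness centrality:

    # Sketch only: score candidate "attack approach compartments" by
    # centrality. Components stand in for compartments here; the real
    # grouping mechanism is part of the project work itself.
    import networkx as nx

    def rank_compartments(g, top_n=5):
        compartments = [g.subgraph(c).copy()
                        for c in nx.weakly_connected_components(g)]
        scored = []
        for comp in compartments:
            # Betweenness centrality highlights nodes that bridge attack
            # paths; summing it gives a rough per-compartment density score.
            bc = nx.betweenness_centrality(comp)
            scored.append((sum(bc.values()), comp))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_n]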

The Honeynet Project uses hpfeeds by Mark Schloesser as a generic authenticated data feed protocol to collect honeypot data from around the world, and Ben Reardon has used Splunk for data analysis and visualization. The third stage is to develop an App that presents the attack community graph and integrates into the Splunk platform as the final deliverable.
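On the collection side, below is a minimal subscriber sketch: it appends each hpfeeds event to a JSON-lines file that a Splunk file monitor can index. The host, credentials and channel names are placeholders, and the new()/subscribe()/run() calls follow the classic hpfeeds Python client, so treat this as an assumption rather than the project's actual ingestion code.

    # Sketch only: subscribe to hpfeeds channels and append events to a
    # JSON-lines file for Splunk to monitor. Host, credentials and channel
    # names are placeholders; the API follows the classic hpfeeds client.
    import json
    import hpfeeds

    HOST, PORT = "hpfeeds.example.org", 10000
    IDENT, SECRET = "my-ident", "my-secret"
    CHANNELS = ["glastopf.events", "thug.events", "cuckoo.analysis"]
    OUTFILE = "/var/log/hpfeeds/events.json"

    def on_message(identifier, channel, payload):
        record = {"ident": identifier, "channel": channel,
                  "payload": payload.decode("utf-8", "replace")}
        with open(OUTFILE, "a") as fh:
            fh.write(json.dumps(record) + "\n")

    def on_error(payload):
        print("hpfeeds error:", payload)

    hpc = hpfeeds.new(HOST, PORT, IDENT, SECRET)
    hpc.subscribe(CHANNELS)
    hpc.run(on_message, on_error)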

Project Plan:

  • April 23rd - May 20th: Community Bonding Period
    - Prepare the development and testing environments
    - Learn how to develop a Splunk App
    - Study social network graph drawing tools and libraries
    - Read papers on social network centrality algorithms
  • May 21st : GSoC 2012 coding officially starts
  • May 21st - May 28th:
    - Decide which honeypot logs to use.
    - Modify and integrate the current hpfeeds client to collect instances, called the first-level data set, and store them on the testing site
    - Format, index, and create Splunk searches to extract relationships for graph construction.
  • May 29th - June 15th:
    - Code advanced processing of the first-level data to produce the second-level data
    - Develop a Python module for attack social network construction.
    - Integrate into the Splunk App and show the first draft graph, which only shows the relationships between nodes and edges.
  • June 19th - July 7th:
    - Discuss whether we should do graph reduction based on a centrality algorithm to decrease the visualization complexity
    - Integrate into the Splunk App and show the second version of the graph, which can show the attack compartments
    - Evaluate the current project results and scope. Adjust the project scope and deliverables.
  • July 9th - July 13th: Mid Term Assessments
    - Prepare and deliver mid-term evaluation
  • July 14th - July 20th:
    - Review the current project results and methods
    - Code the graph centrality algorithm to build attack approach compartments
    - Connect high-density attack compartments to identify the attack community
    - Design the display UI and options
    - Improve and clean up the code
  • July 21st - August 10th:
    - Code the Splunk display UI and integrate it into the Splunk App
    - Map attack approach compartments to attack behavior intention lists.
  • August 13th: Suggested "pencils down" date, coding close to done
    - Bug fixing, code commits, comments and documentation writing

  • August 20th: Firm "pencils down" date, coding must be done
  • August 24th - August 27th: Final Assessments
  • August 31st - Public code uploaded and available to Google
    Project Deliverables:
    A Splunk App to display the attack social network graph and its attack intentions, built from honeypot logs.

    Project Source Code Repository:
    https://github.com/yuchincheng/HpfeedsHoneyGraph

    Student Weekly Blog: https://www.honeynet.org/blog/327

    Project Useful Links:
    Papers:
    [1] White, D. R., & Borgatti, S. P. (1994). Betweenness centrality measures for directed graphs. Social Networks, 16(4), 335-346.
    [2] Zemljič, B., & Hlebec, V. (2005). Reliability of measures of centrality and prominence. Social Networks, 27(1), 73-88.
    [3] Kang, S. M. (2007). A note on measures of similarity based on centrality. Social Networks, 29(1), 137-142.

    Project Updates:
    May 27th
    Done last week:
    - Tested the hpfeeds CLI to subscribe to glastopf, thug and cuckoo data
    - Started modifying Franck's code for subscribing hpfeeds logs into Splunk indexing
    - Installed the Splunk testing environment

    Planned for next week:
    - Extract IP and hostname to do Fast-Flux detection
    - Get more live data for research discussion
    - Finish and test the code for getting data from hpfeeds into Splunk indexing

    June 10th
    Done last week:
    - Extracted IPs and hostnames to do fast-flux detection (ff_ipget.py); a simplified sketch of the idea follows below
    - Prepared the data collection processing (Glastopf_events, Glastopf_files, Glastopf_sandbox, fast-flux domain-IP data)
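    A simplified sketch of the fast-flux idea (not the actual ff_ipget.py code): resolve a hostname repeatedly and flag it if it keeps returning many distinct A records, as fast-flux domains typically rotate through large IP pools.

        # Simplified fast-flux heuristic, for illustration only.
        import socket
        import time

        def looks_fast_flux(hostname, rounds=5, interval=2, threshold=10):
            seen_ips = set()
            for _ in range(rounds):
                try:
                    _, _, ips = socket.gethostbyname_ex(hostname)
                    seen_ips.update(ips)
                except socket.gaierror:
                    pass
                time.sleep(interval)
            # Many distinct A records across a short window is suspicious.
            return len(seen_ips) >= threshold, sorted(seen_ips)

        # Usage: flagged, ips = looks_fast_flux("suspicious.example.com")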

    Planned for next week:
    - Finish and test the code for getting data from hpfeeds into the Splunk index (data replication on Splunk indexing)
    (Need to solve the data redundancy problem in Splunk)
    - Centrality calculation coding
    - Design and implement the data vectors for graphing and centrality calculation from the Splunk index data

    June 17th
    Done last week:
    - Designed and implemented the data vectors for graphing and centrality calculation from the Splunk index data
    (Testing environment on http://114.35.193.28:8000/) (Code integrates with Splunk4HPfeeds from Franck)
    - Centrality calculation coding; still debugging

    Planned for next week:
    - Debug the centrality calculation code and integrate it into Splunk4HPfeeds
    - Use networkx to draw the graph
    - Draw the graph by applying the centrality calculation

    June 24th
    Done last week:
    - Centrality calculation code debugging and integration into Splunk4HPfeeds (still debugging)
    (I cannot feed hpfeeds logs into Splunk directly; the workaround is to store the hpfeeds log in a file and have Splunk monitor that file.)
    - Used networkx to draw the graph. The code is in the graph_1 folder and uses the glastopf_events and glastopf_sandbox logs; a generic drawing sketch follows below.
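    The graph_1 code itself is not reproduced here; the following is a generic networkx drawing sketch of the kind of output described, colouring nodes by a hypothetical "type" attribute.

        # Generic drawing sketch (not the graph_1 code): force-directed layout
        # with nodes coloured by a hypothetical "type" attribute.
        import matplotlib
        matplotlib.use("Agg")            # render to a file, no display needed
        import matplotlib.pyplot as plt
        import networkx as nx

        def draw_attack_graph(g, outfile="attack_graph.png"):
            pos = nx.spring_layout(g, seed=42)
            colours = ["red" if g.nodes[n].get("type") == "attacker"
                       else "skyblue" for n in g.nodes]
            nx.draw_networkx(g, pos, node_color=colours, node_size=300,
                             font_size=6, arrows=True)
            plt.axis("off")
            plt.savefig(outfile, dpi=150)
            plt.close()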

    Planned for next week:
    - Debug the centrality calculation code and integrate it into Splunk4HPfeeds
    - Use networkx to draw the graph, version 2
    - Draw the graph by applying the centrality calculation

    Blocked issues:
    - Splunk changed its internal framework in the newest version; it took a lot of time to read the documentation.
    - The graph version 1 data format is complicated and heavily duplicated. Re-designing the data format makes the graph simpler and more efficient.

    July 1st
    Done last week:
    - Centrality calculation code debugging and integration into Splunk4HPfeeds (still debugging)
    (I cannot feed hpfeeds logs into Splunk directly; the workaround is to store the hpfeeds log in a file and have Splunk monitor that file.)
    - Used networkx to draw graph version 2. The code is in the graph_2 folder and integrates the glastopf_events, glastopf_sandbox, glastopf_files, thug_events and thug_files logs. The data format has been re-designed based on relationships and node types.

    Planned for next week:
    - Debug the centrality calculation code and integrate it into Splunk4HPfeeds
    - Use networkx to draw the graph, version 2
    - Draw the graph by applying the centrality calculation

    Blocked issue:
    - My final exams are this week. Sorry for the delayed progress.

    July 8th
    Done last week:
    - Pre-processed thug.events and thug.files to extract the malicious web page visiting paths and downloaded malware
    - Stored the pre-processed data in a sqlite db and polled it into the Splunk index; a schematic sketch follows below
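    A schematic sketch of the sqlite step; the table name and columns are hypothetical, not the project's actual schema.

        # Sketch only: store pre-processed thug events in sqlite before
        # polling them into Splunk. Table name and columns are hypothetical.
        import sqlite3

        def store_events(db_path, rows):
            conn = sqlite3.connect(db_path)
            conn.execute("""CREATE TABLE IF NOT EXISTS thug_events (
                                ts TEXT, source_url TEXT,
                                next_url TEXT, md5 TEXT)""")
            conn.executemany(
                "INSERT INTO thug_events (ts, source_url, next_url, md5) "
                "VALUES (?, ?, ?, ?)", rows)
            conn.commit()
            conn.close()

        # rows would be tuples such as:
        # ("2012-07-08T10:00:00", "http://landing.example",
        #  "http://hop.example", None)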

    Planned for next week:
    - Midterm evaluation
    - Code debugging

    July 15th
    Done last week:
    - Finished debugging the code for submitting hpfeeds instances into the Splunk index
    - Finished the midterm evaluation

    Planned for next week:
    - Study how to use D3 for dynamic graphs
    - Study how to use Splunk as a substitute for a traditional DB structure

    July 22nd
    Done last week:
    - Studied how to use D3 for dynamic graphs
    - Studied how to use Splunk as a substitute for a traditional DB structure

    Planned for next week:
    - Implement D3 on Splunk to show the social graph
    - Implement DB select and search using Splunk for extracting data to present on the graph.

    July 29th
    Done last week:
    (1) http://140.116.163.148:8000/en-US/app/HpfeedsHoneyGraph/HpfeedsHoneyGraph
    Displays landing_site -> hopping_site -> malware download, with node information shown on mouseover.
    - This graph cannot be shown in the center of the screen; I am still debugging this.
    - As you can see, the graph has too many single nodes. The problem is that "http://www.aaa.com/sjdksd" and "http://www.aaa.com/weuwie" show up as two separate nodes. I would like to discuss how to simplify the graph.

    (2) http://140.116.163.148:8000/en-US/app/HpfeedsHoneyGraph/ThugFilesGraph
    - This graph is the unchanged graph. After discussing with my mentor, Chris, I found a big mistake in the force-collapsible graph in D3: it cannot display unique nodes.
    - With "Malware A --> google.com" and "Malware B --> google.com", google.com shows as two different nodes in the graph. It therefore took me two days to change the data format and merge sub-trees so that unique nodes are displayed (a small sketch of this merging step follows after this list).

    (3) http://140.116.163.148:8000/en-US/app/HpfeedsHoneyGraph/ThugFilesUnique
    This graph extracts malicious hostnames from the cuckoo report --> runs Pffdetect to detect fast-flux IPs --> does passive DNS lookups to find more corresponding domains and IPs.
    - This graph takes objects as unique nodes, links their relationships, and shows node information on mouseover.
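    A small sketch of the unique-node merging step: convert an edge list into the {"nodes": ..., "links": ...} JSON that D3's force layout expects, keeping one node per identifier so that, for example, google.com appears only once. The field names are illustrative, not the App's actual format.

        # Sketch only: de-duplicate nodes while building D3 force-layout JSON.
        import json

        def to_d3_json(edges):
            index = {}              # node id -> position in the nodes list
            nodes, links = [], []
            for src, dst, relation in edges:
                for node_id in (src, dst):
                    if node_id not in index:
                        index[node_id] = len(nodes)
                        nodes.append({"name": node_id})
                links.append({"source": index[src], "target": index[dst],
                              "relation": relation})
            return json.dumps({"nodes": nodes, "links": links})

        edges = [("malware_A", "google.com", "contacts"),
                 ("malware_B", "google.com", "contacts")]
        print(to_d3_json(edges))    # google.com appears once in "nodes"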

    Planned for next week:
    (1) Add a time search bar to the graph.
    (2) Add GeoIP information to the graph.
    (3) Use the REST API instead of pure Python to run Splunk searches (a minimal sketch follows below).
    (4) Use the Splunk App framework to pass information within the App.
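    For item (3), here is a minimal sketch of running a search through the Splunk REST API with requests; the host, credentials and search string are placeholders, and the oneshot call reflects standard Splunk REST usage rather than the App's final code.

        # Sketch only: run a Splunk search via the REST API instead of
        # calling search code from pure Python. Host, credentials and the
        # query are placeholders.
        import requests

        SPLUNK = "https://localhost:8089"
        AUTH = ("admin", "changeme")

        def run_search(query):
            resp = requests.post(
                SPLUNK + "/services/search/jobs",
                auth=AUTH,
                verify=False,       # self-signed cert on a test instance
                data={"search": query, "exec_mode": "oneshot",
                      "output_mode": "json"})
            resp.raise_for_status()
            return resp.json()

        # Example:
        # results = run_search(
        #     'search index=hpfeeds sourcetype=glastopf_events | head 10')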