Project 2 - Automated Attack Community Graph Construction (Hugo)

Student: Hugo Gascon
Primary mentor: Natalia Stakhanova
Backup mentors: Franck Guenichot, Thanh Nguyen, Claudio Guarnieri, Chris Horsley

Google Melange: http://www.google-melange.com/gsoc/project/google/gsoc2012/hgascon/26001

Project Overview:
The goal of this project is to implement a Splunk application that can be deployed on a central server to automatically generate community attack graphs from a set of honeypot sources distributed across networks. An attack graph is a collection of scenarios showing how a malicious agent can compromise the integrity of a target system. When built from a wide range of sensors, it can provide a comprehensive view of attackers' behavior at a large scale.

Project Plan:

  • April 23rd - May 20th: Community Bonding Period
  • Continue to do research on previous approaches for attack graph construction.
  • Gain more familiarity with Splunk app development and graphical user interaction opportunities.
  • Define initial sources according to available data from hpfeeds live feeds.
  • May 21st : GSoC 2012 coding officially starts
  • May 21st - Jun 10th:
  • Learn about the deployed honeynet architecture from mentors and documentation of existing projects.
  • Get access to the running Splunk instance and understand how data submitted over hpfeeds is formatted and stored. As the data appears to be already indexed, it will be necessary to determine whether that is sufficient to run the searches needed for attack graph construction, or whether additional parsing is required.
  • Design a way to integrate parsers and new modules into the existing system.
  • Find the best way to parametrize data from different sources according to the intended graph construction technique.
  • Jun 11th - Jul 8th:
  • Develop parsers for every input as part of the Splunk application so that events can be properly indexed.
  • Develop Python core modules for attack graph construction. The creation and manipulation of graph elements will be based on the NetworkX library (http://networkx.lanl.gov/), which supports complex network analysis. Some algorithms for community detection in graphs, such as MOSES (https://sites.google.com/site/aaronmcdaid/moses) and Greedy Clique Expansion (https://sites.google.com/site/greedycliqueexpansion/), will be explored and probably adapted or reimplemented for the specific scope of this project (see the sketch after this plan).
  • July 9th - July 13th: Mid-term evaluation deadline
  • Prepare and deliver mid-term evaluation
  • July 14th - July 28th
  • Visualization. Develop the UI of the application. An example of what can currently be done with the Splunk web engine can be seen here (http://splunkbot.splunk.com:8080/map). This is a toy project from the Splunk team (https://github.com/coccyx/Splunkbot), but I have talked with one of the developers and they are planning to release some new applications for displaying network-like graphs. In any case, it can serve as a starting point.
  • July 29th - August 12th
  • Reassessment. If there have been difficulties or changes during the project development, this time is reserved to address those issues. Otherwise, some improvements can be introduced, especially on UI interaction.
  • August 13th: Suggested "pencils down" date, coding close to done
  • This is the polishing week. Tasks for this week involve bug fixing, code cleaning and proper documentation writing.
  • August 20th: Firm "pencils down" date, coding must be done
  • August 24th - August 27th: Final Assessments
  • August 31st - Public code uploaded and available to Google
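
As a rough illustration of the graph construction step planned above, here is a minimal sketch, assuming honeypot events reduced to simple source/destination records, of how events could be loaded into a NetworkX graph and how maximal cliques could serve as community seeds in the spirit of Greedy Clique Expansion. The event fields and node naming are illustrative assumptions, not the final design.

```python
# Minimal sketch: build an attacker/target graph from honeypot events with
# NetworkX and enumerate maximal cliques as community seeds (in the spirit of
# Greedy Clique Expansion). The event fields are illustrative assumptions.
import networkx as nx

events = [
    {"src": "10.0.0.1", "dst": "192.168.1.5", "dport": 445},
    {"src": "10.0.0.1", "dst": "192.168.1.7", "dport": 445},
    {"src": "10.0.0.2", "dst": "192.168.1.5", "dport": 22},
]

g = nx.Graph()
for ev in events:
    # One edge per observed attack event, annotated with the targeted port.
    g.add_edge(ev["src"], ev["dst"], dport=ev["dport"])

# Maximal cliques can act as seeds for overlapping community detection.
seeds = list(nx.find_cliques(g))
print(seeds)
```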

Project Deliverables:
The final deliverable is a Splunk application that will include Python modules for parsing the data provided by the hpfeeds system, analysis modules for clustering the gathered data and building community graphs, and a dynamic web visualization layer.

Project Source Code Repository: https://github.com/hgascon/Acapulco4HNP

Student Weekly Blog: https://www.honeynet.org/blog/342

Project Useful Links:

Project Updates:
May 27th
Done last week:

  • Downloaded and installed a local Splunk instance for testing and development.
  • Implemented file logging capabilities in hpfeeds/cli/feed.py (a minimal sketch of this kind of logging appears after this update).
  • Started to devise and implement the parser functions in the acapulco.py module that will load hpfeeds event data from log files into a graph structure using NetworkX.
Planned for next week:

  • Ask for access to more live data feeds.
  • Continue thinking and discussing about the best possible graph representation for the available data.
  • Determine whether and how the Splunk example code provided by Franck can be reused and integrated.
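
The file logging mentioned above can be done with a small callback; here is a minimal sketch, assuming the classic hpfeeds Python client API (hpfeeds.new / subscribe / run). The broker host, credentials and channel names are placeholders.

```python
# Minimal sketch: subscribe to hpfeeds channels and append every event to a
# per-channel log file. Host, ident, secret and channel names are placeholders.
import hpfeeds

HOST, PORT = "hpfeeds.example.org", 10000
IDENT, SECRET = "my-ident", "my-secret"
CHANNELS = ["dionaea.capture", "glastopf.events"]

def on_message(identifier, channel, payload):
    # Decode if the broker hands us bytes, then log one event per line.
    if isinstance(payload, bytes):
        payload = payload.decode("utf-8", "replace")
    with open("%s.log" % channel, "a") as f:
        f.write(payload.strip() + "\n")

def on_error(payload):
    print("hpfeeds error: %r" % (payload,))

hpc = hpfeeds.new(HOST, PORT, IDENT, SECRET)
hpc.subscribe(CHANNELS)
hpc.run(on_message, on_error)
```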

June 3rd
Done last week:

  • Checked the code provided by Franck and installed it in the local Splunk instance (very useful indeed).
  • Asked for access to more live data feeds (currently waiting for access to the thug and kippo feeds).
  • Discovered graph-tool, an interesting library that may be useful for community finding, as it already implements community detection methods based on the Potts model approach.
Planned for next week:

  • Continue thinking and discussing about the best possible graph representation for the available data.
  • Start implementing parser functions in acapulco.py that will load hpfeeds event data from different log files into a unified structure.
Blocking issues:

  • We currently have access to a limited number of channels from hpfeeds. Furthermore, some of them are not streaming any events. We need to solve this in order to have an adequate amount of (sufficiently complex) data. Otherwise, we won't be able to observe significant attack behaviors at a large scale or take full advantage of clustering algorithms.

June 10th
Done last week:

  • Access granted to the new thug and glastopf data feeds (currently waiting for access to the kippo feeds).
  • Started implementing parser functions in acapulco.py for the available data (a hedged sketch of this kind of normalization appears after this update).
Planned for next week:

  • Finish parsing functions.
  • Decide whether parallel coordinate graphs are ultimately our best choice and start loading data into a graph structure.
Blocking issues:

  • Basically the same as last week. We are getting access to more channels from hpfeeds, but in a non-negligible number of them the activity is very low or nonexistent.
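
As a rough idea of the parsing just mentioned, the sketch below maps channel-specific JSON events onto a unified record. The per-channel key names are illustrative assumptions, not the actual hpfeeds schemas.

```python
# Minimal sketch: normalize per-channel JSON events into a unified record.
# The per-channel key names below are illustrative assumptions only.
import json

FIELD_MAPS = {
    "dionaea.capture": {"saddr": "src", "daddr": "dst", "sport": "sport",
                        "dport": "dport", "url": "url", "md5": "md5"},
    "glastopf.events": {"source": "src", "request_url": "url"},
}

def parse_line(channel, line):
    """Parse one logged event and return a unified dictionary."""
    raw = json.loads(line)
    record = {"channel": channel}
    for raw_key, unified_key in FIELD_MAPS.get(channel, {}).items():
        if raw_key in raw:
            record[unified_key] = raw[raw_key]
    return record

print(parse_line("glastopf.events",
                 '{"source": "10.0.0.1:4444", "request_url": "/phpmyadmin"}'))
```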

June 17th
Done last week:

  • After some reading and debating, we have decided to go with parallel coordinate graphs and try to introduce clustering on each vertical coordinate. They seem to be our best choice.
  • I have been working on the Splunk code from Franck, fixing some errors and adapting it to the latest version of Splunk and to the new feeds that I am getting access to. Currently, the application can be configured from the web interface and, while it is running, every event from the subscribed channels is logged to its own file. All files are monitored by Splunk, and an individual index is created and updated for each feed. I have learned that it is Splunk that defines the fields from the data in every index, and that this can be done at search time or at index time: search-time extraction decreases performance but increases flexibility, while index-time extraction increases performance but decreases flexibility (a small sketch of this kind of field extraction appears after this update).
Planned for next week:

  • Finish defining custom fields at index time and decide exactly what data from each feed fits in what dimension of the parallel coordinate graph.
Blocking issues:

  • Currently waiting for access to kippo feeds.
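
As a small illustration of what such field extraction boils down to, the snippet below applies a named-group regular expression to a logged event, which is conceptually what a Splunk search-time extraction rule does. The log line format and field names are assumptions for illustration.

```python
# Minimal sketch: extract fields from a logged event with a named-group regex,
# which is conceptually what a Splunk search-time field extraction does.
# The log line format below is an illustrative assumption.
import re

LINE = "src=10.0.0.1 sport=51322 dst=192.168.1.5 dport=445"

PATTERN = re.compile(
    r"src=(?P<src>\S+)\s+sport=(?P<sport>\d+)\s+"
    r"dst=(?P<dst>\S+)\s+dport=(?P<dport>\d+)"
)

match = PATTERN.search(LINE)
if match:
    print(match.groupdict())
    # {'src': '10.0.0.1', 'sport': '51322', 'dst': '192.168.1.5', 'dport': '445'}
```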

June 24th
Done last week:

  • Solved a problem with the formatting of the glastopf.event.anon channel. Now event values are logged together with their corresponding keys (compliance with the Common Information Model is recommended by Splunk for better indexing; a minimal sketch of this key=value formatting appears after this update).
  • Added new channels to the running local application: glastopf.events.in and dionaea.capture.in. I am waiting for some streamed data to check whether the events are correctly logged and indexed.
  • I have spent some non-negligible time going through the Splunk documentation in order to understand how every field is defined and extracted by means of regular expressions. I have also been advised to extract the fields at search time instead of at index time.
Planned for next week:

  • I expect to have figured out all the regular expressions for the extraction of data fields in every event type. Clustering can also be done at search time, so the next step will be to find out how to integrate the graphic layer with Splunk's search capabilities.
Blocking issues:

  • My own slow, but steadily progressing, learning of the inner workings of Splunk.
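
The key=value formatting mentioned above takes only a few lines; the event fields below are placeholders, and the point is simply that Splunk can automatically extract key=value pairs at search time.

```python
# Minimal sketch: serialize an event as key=value pairs so that Splunk can
# extract the fields automatically. The field names below are placeholders.
def to_kv(event):
    return " ".join('%s="%s"' % (k, v) for k, v in sorted(event.items()))

event = {"src": "10.0.0.1", "dst": "192.168.1.5", "dport": 445,
         "url": "http://malicious.example/payload.exe"}
print(to_kv(event))
# dport="445" dst="192.168.1.5" src="10.0.0.1" url="http://malicious.example/payload.exe"
```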

July 8th
Done last week:

  • I managed to understand how Splunk uses regexes to index the fields and, finally, every feed (with data) is correctly parsed.
  • I have been testing the Splunk JavaScript SDK in order to develop a client application that can authenticate against the server and query the indexed data (an analogous sketch of this flow appears after this update).
  • I have designed an icon for the app :)
Planned for next week:

  • Finish a rough version of the JavaScript client and integrate it with the d3.js visualization toolkit.
  • Submit the mid-term evaluation.
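
The client above is being written with the JavaScript SDK; purely as an illustration of the authenticate-then-search flow, here is an analogous sketch using Splunk's Python SDK (splunklib). The host, credentials and search query are placeholders.

```python
# Minimal sketch of the authenticate-then-search flow, shown with Splunk's
# Python SDK rather than the JavaScript SDK used in the project.
# Host, credentials and the search query are placeholders.
import splunklib.client as client
import splunklib.results as results

service = client.connect(host="localhost", port=8089,
                         username="admin", password="changeme")

# Run a blocking one-shot search and iterate over the returned events.
stream = service.jobs.oneshot('search index="dionaea_capture" | head 10')
for item in results.ResultsReader(stream):
    if isinstance(item, dict):
        print(item)
```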

July 29th
Done last week:

During the last couple of weeks I have worked on different things:
  • First, once every data channel from hpfeeds was correctly logged and indexed, I wrote a Python module that parses several log files, selects the data from each channel that will be displayed as a coordinate in the graph, transforms each of them in an adequate (but preliminary) way, and writes the formatted data to a new log file. This file is also indexed by Splunk.
  • As the new formatted JSON data is indexed by Splunk, I have developed a first rough version of the JavaScript Splunk-d3 client. Using the Splunk JavaScript SDK, the user is able to enter their credentials and log in to the Splunk server. Once the user is logged in, a new "Run" button appears in the web interface. When it is hit, the client retrieves the formatted data from Splunk and D3 starts building the parallel coordinates graph.
Planned for next week:

  • Right now, events from dionaea and thug are used. The parameters that can be visualized are source address, source port, destination address, destination port, the URL involved in the event and the hash of the file. The next step is to explore the use of locality-sensitive hashing functions in order to reformat each parameter/coordinate so that different values can be represented and clustered on the same axis (a minimal MinHash sketch follows).
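
As a first idea of what such a locality-sensitive hash could look like for string-valued coordinates (for example URLs), here is a minimal MinHash sketch over character 3-grams: similar strings tend to receive similar signatures, so they can be placed close together on an axis. The n-gram size and the number of hash functions are arbitrary choices for illustration.

```python
# Minimal MinHash sketch: a locality-sensitive hash over character 3-grams,
# so that similar strings (e.g. URLs) tend to receive similar signatures.
# The number of hash functions and the n-gram size are arbitrary choices.
import hashlib

def ngrams(value, n=3):
    return {value[i:i + n] for i in range(max(1, len(value) - n + 1))}

def minhash(value, num_hashes=8):
    signature = []
    for seed in range(num_hashes):
        # Salting the hash with the seed simulates independent hash functions.
        signature.append(min(
            int(hashlib.md5(("%d|%s" % (seed, gram)).encode()).hexdigest(), 16)
            for gram in ngrams(value)))
    return signature

a = minhash("http://evil.example/exploit.php?id=1")
b = minhash("http://evil.example/exploit.php?id=2")
# Similar URLs share most of their MinHash components.
print(sum(1 for x, y in zip(a, b) if x == y), "of", len(a), "components match")
```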