Join us on Slack and talk to your potential mentors:
If there are any questions, please don’t hesitate to get in touch! 🙂
GSoC and The Honeynet Project
During the previous years of GSoC, the Honeynet Project's students have created a wide range of very successful open source security projects, many of which have gone on to become the industry standard open source tools in their respective fields. Examples for these include:
We are also always interested in hearing any ideas for additional relevant computer security and honeynet-related R&D projects (although remember that to qualify for receiving GSoC funding from Google your project deliverables need to fit in to GSoC's 3-month project timescales!). If you have a suitable and interesting project, we will always try and find the right resources to mentor it and support you.
Please note - even if you aren't an eligible GSoC student, we are also always looking for general volunteers who are enthusiastic and interested in getting involved in honeynet R&D.
Each sponsored GSoC 2017 project will have one or more mentors available to provide a guaranteed contact point to students, plus one or more technical advisors to help applicants with the technical direction and delivery of the project (often the original author of a tool or its current maintainer, and usually someone recognised as an international expert in their particular field). Our Google Summer of Code organisational administrators will also be available to all sponsored GSoC students for general advice and logistical support. We'll also provide supporting hosted project infrastructure, if required.
SNARE/TANNER: Make our web application honeypot attract new sorts of maliciousness.
HoneyThing: Attract Mirai and other botnets with the TR-069 honeypot.
#1 - Port Independent Protocol Identification Library
Library that can be attached to a network stream to identify the protocols in use.
We are currently developing a protocol-agnostic honeypot and use a port-to-protocol mapping to choose the correct connection handler. This is obviously a very simplified approach, and we would like to be able to assume arbitrary protocols on any port. With that assumption, we need a means to identify the protocol used in a network stream. There are various papers on the different identification approaches. We assume the most efficient approach would be a layered architecture where we start with cheap and quick pattern matching before deploying more expensive measures (e.g. heuristics, statistics, machine learning). A good starting point would be getting familiar with packet manipulation using Go. The gopacket library is an excellent tool for that purpose. The Wireshark wiki has an exhaustive collection of labeled pcap samples, excellent for testing a signature or training a model. There are existing implementations that go in the right direction.
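The layered idea described above can be illustrated with a minimal sketch. It is written in Python purely for illustration (the project itself targets Go), and the signature list, the threshold and the function names are assumptions made for this example, not part of any existing library. A first layer tries cheap pattern matches on the first bytes of a stream, and only if none match does a second, cruder heuristic run.

```python
import re
from typing import Optional

# Layer 1: cheap signatures on the first bytes a client sends.
# This list is illustrative only, not a complete signature set.
SIGNATURES = [
    ("ssh", re.compile(rb"^SSH-\d+\.\d+-")),
    ("http", re.compile(rb"^(GET|POST|HEAD|PUT|DELETE|OPTIONS) \S+ HTTP/\d\.\d\r\n")),
    # TLS record header: handshake content type (0x16) followed by version 3.x.
    ("tls", re.compile(rb"^\x16\x03[\x00-\x04]")),
]


def match_signature(payload: bytes) -> Optional[str]:
    """Return the first protocol whose signature matches, or None."""
    for name, pattern in SIGNATURES:
        if pattern.match(payload):
            return name
    return None


def printable_ratio(payload: bytes) -> float:
    """Fraction of bytes that are printable ASCII or common whitespace."""
    if not payload:
        return 0.0
    printable = sum(1 for b in payload if 32 <= b < 127 or b in (9, 10, 13))
    return printable / len(payload)


def identify(payload: bytes) -> str:
    """Try cheap signatures first, then fall back to a crude heuristic."""
    name = match_signature(payload)
    if name is not None:
        return name
    # Layer 2 (placeholder): a real system might use statistics or a
    # trained model here; the 0.9 threshold is an arbitrary assumption.
    if printable_ratio(payload) > 0.9:
        return "unknown-text"
    return "unknown-binary"
```

The ordering matters for cost: the regular expressions only inspect a short prefix, so most streams would be classified before any expensive step is reached.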
Projects for Mitmproxy
Mitmproxy is an interactive TLS-capable man-in-the-middle proxy. It can be used to intercept, inspect, modify and replay HTTP, HTTP/2, HTTPS, WebSockets, and raw TCP traffic. Think of it as a mix of Wireshark and the Chrome developer tools - you can hook up any device or program and see how it communicates on the network. Mitmproxy is used by software developers, penetration testers, privacy advocates and researchers to fix bugs, find vulnerabilities, uncover privacy violations, conduct empirical research, and more.
Spend the summer working on mitmproxy's core and its addons!
We have a couple of feature requests for mitmproxy that would make really great additions to mitmproxy, but haven’t been tackled yet. This project would consist of multiple “mini-projects” spanning from a few days to multiple weeks, allowing you to work on isolated tasks at different parts of the code base.
"Map Remote Editor": Other proxies have a feature which maps one URL to another, e.g. one can map https://example.com/foo.js to a local file that is served to the client instead. It is easy to write a mitmproxy script that does this, but we want this to be a built-in feature! Fun fact: This task was initially proposed by our last year GSoC student in issue #1454!
Mitmproxy currently supports four different protocols: HTTP/1.x, HTTP/2.0, WebSockets, and a raw TCP mode as fallback for everything else. WebSockets and TCP are new additions that are not exposed in the UI yet. One project would be to display WebSocket connections in the mitmproxy flow list, and allow users to view all exchanged WebSocket frames.
With Pillow and watchdog, mitmproxy has two large dependencies with platform-specific binary components. We use very little functionality of those, so we want to reimplement what is needed in pure Python. This makes mitmproxy smaller, easier to install, and less prone to security vulnerabilities caused by the underlying C code. See #1900 for details regarding Pillow!
Mitmproxy already supports streaming of responses back to the client. However, a commonly requested feature is to also support request streaming, e.g., large file uploads from a client to the server. Recent changes in the core should make this feature pretty straightforward to implement.
The mitmproxy project is keen to foster an addon ecosystem. This means that we need a clean, usable way for users to discover and install modules not bundled with mitmproxy itself.
The next step for mitmproxy is to work on higher-level functionality like security scanners, reconnaissance tools and end-point discovery mechanisms. If you have a neat idea that is central enough to belong in the mitmproxy core, pitch us on it.
Mitmproxy’s console interface can be improved in many areas - we have plans for a modal interface, configurable key bindings and other improvements.
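As a rough illustration of the "Map Remote Editor" idea in the list above, the core of such a feature is a lookup from request URLs to local files. The sketch below shows only that lookup in plain Python; the function name and the rule format are hypothetical, and it deliberately does not use mitmproxy's addon API. In an actual script, a lookup like this would presumably be called from a request hook that then serves the mapped file.

```python
from typing import Dict, Optional


def resolve_mapping(url: str, rules: Dict[str, str]) -> Optional[str]:
    """Map a URL to a local path using the longest matching URL prefix.

    `rules` maps URL prefixes to local path prefixes. An exact URL is
    simply a prefix that matches the whole URL. Returns None if no rule
    applies, in which case the request would be forwarded unchanged.
    """
    best = None
    for prefix in rules:
        if url.startswith(prefix) and (best is None or len(prefix) > len(best)):
            best = prefix
    if best is None:
        return None
    # Keep whatever follows the matched prefix, so directory rules work too.
    return rules[best] + url[len(best):]
```

Choosing the longest prefix lets a specific file rule override a broader directory rule without depending on rule order.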
Spend the summer improving mitmproxy’s web interface!
Last December, we shipped the first version of our web front-end “mitmweb”, which finally brought a graphical user interface and Windows support to mitmproxy. Mitmweb is largely based on last year’s GSoC work and currently only supports mitmproxy’s most important features. Our long-term aim is to achieve feature-parity between the web-interface and the console application. The goal of this project is to add some major mitmproxy features to mitmweb, ideally with a better UX than what the console interface provides. For example, one part of your project would be to implement a replacement editor so that users can define rules to automatically modify requests and responses. Another aspect we’d like to tackle is traffic visualization. While we have a good idea of further features that we want to see implemented, the first task for your application is to try out mitmweb and make a rough list of ideas/features how you would improve mitmweb to show us that you understand the product. We’ll then mix that with what we have in mind and create a great project plan for the summer!
Mitmweb is based on a modern web app technology stack (React.js, Redux.js, ES6, Bootstrap, Gulp, ...), so you can work with the latest technologies and focus on good code rather than IE support. 🚀
A large amount of Android malware needs user interaction (e.g. clicking some buttons) before starting its malicious actions. Thus a test input generator is necessary for an online Android sandbox. The official Android tool for automated input generation is Monkey, but Monkey's pseudo-random strategy is not effective for malware detection. In previous years of GSoC, we introduced DroidBot, a model-based test input generator for Android.
This year, we want to extend DroidBot to support semi-automated testing. That is, DroidBot learns from humans how to interact with apps. For example, when a user interacts with an app for the first time (e.g. swiping windows, drawing a PIN code, etc.), we record the user’s input and send it to DroidBot, and based on the user’s input, DroidBot will be able to get past the difficult UI states.
Specifically, what you need to do in GSoC 2017 might include:
An Android app which is able to record user’s input
Extending DroidBot to learn from user’s input
To get started:
Have a basic understanding of black-box testing by trying to test an app with Monkey
Try to test an app with DroidBot, and understand how it works
Learn about Android record/replay techniques [1,2]
Discuss with us.
#5 - Android sandbox detection and countermeasure
Android (JNI would be a plus)
Android system instrumentation (eg. Android kernel, Xposed framework)
New tool and empirical study
Investigate sandbox detection techniques and/or design an undetectable sandbox
Many Android apps (especially malware) are using sandbox-detection techniques[1,2]. An app may try to hide some behaviors if it is running in a test environment. Sandbox-detection techniques make it difficult (even impossible) for existing anti-malware systems to perform malware analysis.
The goals of this project include:
Investigating and collecting existing sandbox-detection techniques used in malware;
A sandbox detector (an app that makes use of many techniques in #1)
[not required] A detection-aware system (a system that is aware of sandbox-detection behaviors)
[not required] An undetectable system (a system that is immune from sandbox-detection techniques, should be something like )
Write an Android app using some of these techniques
Get to know some API hooking techniques, such as Xposed.
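To make the notion of a sandbox-detection technique concrete, here is a small sketch of one common family: checking system properties for emulator indicators. It is written in Python and operates on a plain dictionary of properties (for example, as collected with `adb shell getprop`), whereas a real detector for this project would be an Android app. The indicator table and the function name are assumptions for this example and cover only a few well-known emulator defaults.

```python
from typing import Dict, List

# A few well-known Android emulator defaults. Real malware uses many more
# checks (sensors, telephony IDs, timing, etc.); this table is illustrative.
EMULATOR_INDICATORS = {
    "ro.kernel.qemu": lambda v: v == "1",
    "ro.hardware": lambda v: v in ("goldfish", "ranchu"),
    "ro.build.fingerprint": lambda v: v.startswith("generic"),
    "ro.product.model": lambda v: "sdk" in v.lower() or "emulator" in v.lower(),
}


def detect_sandbox(props: Dict[str, str]) -> List[str]:
    """Return the sorted names of properties that look like an emulator."""
    hits = []
    for key, looks_emulated in EMULATOR_INDICATORS.items():
        value = props.get(key)
        if value is not None and looks_emulated(value):
            hits.append(key)
    return sorted(hits)
```

A detection-aware or undetectable system would need to observe or falsify exactly these kinds of lookups, which is where API hooking comes in.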
Projects for Holmes Processing
#6 - Holmes Framework to Automate Advanced Analytics
Languages: Go and/or Scala, some Python
Technologies: Linux in a server environment, Apache Spark, Apache Cassandra, Tensorflow (optional, nice to have)
Techniques: Microservice Architecture, Distributed Machine Learning, Actor or CSP based design
Improve existing tools suite
Manage the execution of advanced statistical or machine learning analysis. The supported back-ends must be modular with a prototype developed for Apache Spark
Enable the scheduling of batch job taskings
Enable streaming jobs
Allow chaining of multiple jobs
In this project the student will design and develop a semi-generic interface that enables Holmes Processing to manage the execution of advanced statistical and machine learning analysis operations. The system will have three core parts: core, analytic engine, and analytic service.
The Core will receive tasking from RESTful connections, scheduled taskings, and AMQP messages. The core system will also manage the execution of taskings and monitor the operations for success.
The Core will provide a scheduling component for managing Analytic Services. This will make it possible to define when large bulk operations should execute, by minute, hour, and day.
The Analytic Engine will provide a modular way for connecting Machine Learning and statistical frameworks to the core, called analytic engine services. These Analytic Engine Services will provide a generic way to connect to these various technologies and monitor the results. In this project we only expect the student to develop the back-end connectors for Apache Spark.
(optional) provide support for Tensorflow.
Analytic Services should be stored in a logical format and register themselves with the Core. The analytic logic of the service should be encapsulated into a Job for execution on an Analytic Engine. Furthermore, Analytic Services should be able to chain Jobs together so that multiple Analytic Engines can be leveraged together.
Provide configuration information needed to execute the service
Provide Job logic to execute on an Analytic Engine
Identify how to execute the service
What Analytic Engine to use for each job
In what order the jobs should execute
How the job should be executed, i.e. attached to an AMQP feed, executed on a schedule, or kicked off by a manual RESTful query
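The requirements listed above (configuration, Job logic, which Analytic Engine to use, job order, and how execution is triggered) suggest a simple data model. The following Python sketch is one possible shape for it, offered only as an illustration: the class names, fields and the `local_engine` helper are hypothetical, and the project itself names Go and/or Scala as implementation languages.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

TRIGGERS = {"amqp", "schedule", "rest"}


@dataclass
class Job:
    name: str
    engine: str                    # which Analytic Engine runs this job
    logic: Callable[[Any], Any]    # the analytic logic itself


@dataclass
class AnalyticService:
    name: str
    trigger: str                   # how execution starts: amqp, schedule or rest
    config: Dict[str, Any] = field(default_factory=dict)
    jobs: List[Job] = field(default_factory=list)  # executed in list order

    def __post_init__(self) -> None:
        if self.trigger not in TRIGGERS:
            raise ValueError("unknown trigger: " + self.trigger)

    def run(self, data: Any, engines: Dict[str, Callable[[Job, Any], Any]]) -> Any:
        """Chain the jobs: each job's output is the next job's input."""
        for job in self.jobs:
            data = engines[job.engine](job, data)
        return data


def local_engine(job: Job, data: Any) -> Any:
    """Stand-in engine that runs the job in-process (no Spark involved)."""
    return job.logic(data)
```

Because each job names its own engine, a single service can route one step to a Spark connector and the next to another back-end without changing the chaining logic.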
Create an example Totem service for performing static analysis against an object using python and go
Create an example Totem-Dynamic service for performing dynamic analysis using Python and Go
Create a guide that explains how to develop a Holmes Processing Totem and Totem-Dynamic Service
(Optionally) Develop a static analysis service
Holmes Processing is designed for automated and efficient large-scale malware analysis. To allow for scaling and performance gains, this has required multiple techniques, including microservices, Docker, virtual machines, external message queues, and RESTful communication. While this is great for someone already familiar with these design patterns, it increases the complexity required for new developers to extend the project.
The goal of this project is to help your peers overcome the learning curve required to extend the analytic capabilities of Holmes Processing. Your work will enable analysts to extend Holmes Processing so that the system can provide malware analytics and help thwart cyber crime.
To accomplish this goal you will develop four example programs that demonstrate how to develop services for executing static and dynamic analysis. You will become familiar with how to use Docker, create a RESTful microservice, and issue taskings using AMQP messaging.
(Stretch Goal) Create an Analytic Service of your choice
#8 - Holmes Automated Malware Relationships
Languages: Scala, d3.js or similar graphical package, and potentially Python
Technologies: Apache Spark, Apache Cassandra
Techniques: Distributed Machine Learning
Identify relationships between malware objects (Domains, IP Addresses, Executables, Source Code, etc.)
Score relationship confidence based on supporting evidence. For example, additional relationships that link the objects together, frequency, time, etc.
Provide a structured output that represents the relationship and score
(optional) Develop a method for graphically representing the relationships
The purpose of this project is to develop a method capable of automatically identifying and managing the relationships between malware objects (IP addresses, Domains, Executables, etc.) in the Holmes Processing system. This will provide the users of the system with clues to help make sense of large volumes of information and in turn make better assessments.
The Holmes Processing system extracts information from objects using a wide range of static and dynamic services. For example, the services can provide the results of Yara signature matches, metadata from PE32 headers, dynamic analysis events from Cuckoo, antivirus matches from VirusTotal, ASN information, and DNS records. These results are then formatted in JSON, linked to a primary object, and stored in a database (Apache Cassandra).
A relationship is defined as a potential connection between two objects. This should not be limited to identifying clusters of similar malware but should also focus on correlations and interactions between objects. For example, if an analyst is interested in object A (for example, a PE32), the method can utilize the output of a dynamic analytic service to look for a domain (e.g. “bad.net”) or IP address with which the object has communicated. This connection between “object A” and the domain “bad.net” would be provided as a potential relationship. As another example, a relationship can be generated between executables that share common properties as detected by PEInfo, YARA, IDAPRO etc. As such, the method used for identifying relationships should utilize manual input, machine learning, and searching through existing results.
The project should supply a confidence rating for each generated relationship. The confidence score is generated by taking into account every facet and shared characteristics between objects and weighing them to calculate the final score. The final output of the method is a set of all generated and scored relationships.
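One simple way to read "weighing them to calculate the final score" is a normalized weighted sum over the shared characteristics. The sketch below shows that interpretation; the facet names, the weights and the function itself are assumptions made for illustration, and the project may well call for something more sophisticated (for example, a learned model).

```python
from typing import Dict


def confidence(evidence: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-facet evidence strengths, in the range 0..1.

    `evidence` maps a facet name to a strength between 0 and 1; facets
    that are missing count as 0. `weights` says how much each facet matters.
    """
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    score = 0.0
    for facet, weight in weights.items():
        strength = min(max(evidence.get(facet, 0.0), 0.0), 1.0)  # clamp to 0..1
        score += weight * strength
    return score / total_weight
```

For example, with weights of 3, 2 and 1 for a shared domain, YARA overlap and time proximity, a pair of objects with a shared domain and half-overlapping YARA matches but no time proximity would score (3 + 1) / 6, or about 0.67.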
This project should provide structured results containing the relationships and associated confidence score. (optional) develop a method for graphically viewing the results
Projects for Cuckoo Sandbox
Cuckoo Sandbox (developed during GSoC 2010-2016 with The Honeynet Project) has evolved to become the de-facto open-source standard for malware analysis systems. It contains capabilities for analyzing malware in various Windows, Android and Apple environments, and has a clean architecture and an easy-to-use UI. It is used by many open source and commercial sandboxing efforts, including Google's own VirusTotal infrastructure. You may find more information about it in the following resources:
Porting of legacy Longterm Analysis support in Cuckoo Sandbox to the latest version and performance/architecture/UX improvements to make it easily available & usable.
Around three years ago we started on "longterm analysis" support in Cuckoo Sandbox in a separate repository. The aim of longterm support is to be able to monitor specific malware samples and/or families over a longer period of time, e.g., to run a specific sample not just 1 or 2 minutes as we'd usually do, but 4 to 8 hours a day, for 5 (working) days in a row. In order to do so, a couple of core components of Cuckoo had to be modified in a way to allow these changes to become reality:
As there are N analyses tied to one sample execution, but each analysis is on a different day (i.e., between the executions the VM is shut down / paused), a new VM-tracking mechanism is required - namely that each analyzed sample obtains its own VM. As the VM is tied to a sample, every day that the VM boots up again it restores its state to the way it was the day before and not to a default state as one would normally expect to happen in Cuckoo.
In Cuckoo we rely on a fairly simple path from start to end. A VM is started, the sample is executed in the VM, the VM is suspended, a report is generated and presented to the user, and that's it. For a 1 or 2 minute analysis this is fine; however, for Longterm Analysis, where each analysis may take up to, e.g., 8 hours, this is not enough. In the past we implemented a realtime PCAP interpreter to fetch network statistics every few minutes. For this project, further development on the realtime information processing will be required as well.
Performance & Architectural improvements. If one or two minutes of sample analysis may produce up to 250 MB of raw data, imagine how much data may be retrieved from hours and hours of analyses. To handle this, optimizations (or limitations) in the Cuckoo Monitor should be evaluated as well as performance optimizations such as porting CPU-critical code from Python to Cython.
An API endpoint should be created that's able to handle everything related to Longterm Analysis VMs so that we can feature Longterm Analysis in a relatively simple and clean UX way.
If time allows, some user interactions in the VM would be preferable, e.g., opening a Word document at random and automatically starting to "type" in it, so that the VM somewhat resembles a real machine.
And finally, Proof of Concept scripts and executables should be developed and delivered to showcase the new functionality in a way that it may be reproduced and unit tested.
#10 - Unit testing & Continuous Integration for Cuckoo Sandbox
Python 2 (strong)
Windows Internals (good)
C & scripting (preferable)
Improve existing tool
Further work on our unit tests and functional unit tests to ensure that future changes in Cuckoo don't break existing functionality & features
Cuckoo Sandbox has been around for quite a while now, and with the increasing number of features and growing complexity, it is essential to ensure that features which worked at some point remain working. In the past year we've been working towards more code coverage and functional unit testing, but there's still a lot of work to be done in this area.
Your task for this GSoC project is as follows:
Implement unit tests for the Cuckoo Core & Cuckoo Analyzer (the components running on the host and in the virtual machine, respectively). For this a good understanding of Python is required. By implementing more unit tests we will get more code coverage which in turn ensures that any changes to said code will get reviewed properly before being merged in the Cuckoo repository - any errors introduced will get caught early on due to unit tests failing because of these errors (or, well, that's the goal anyway).
Implement functional unit tests. For functional unit tests you won't necessarily be looking at specific Python functions to test, but more at the global overview of things in Cuckoo. For this task you will need some creative thinking to come up with new samples (and types of samples, e.g., powershell, vbs, word macros, etc) as well as being able to reconstruct & integrate in-the-wild samples to our functional unit testing repository so that they will be evaluated during each testing round. Imagine that these testing rounds will be done after every commit in Cuckoo or one of its components to ensure stability and correctness. Questions that will be answered through functional unit tests include but are not limited to the following:
if an executable loads a certain DLL, does this show up in the report?
if a VBS script is submitted which connects to google.com, does this behavior show up in the behavior & network logs of the report?
if a malicious Word document drops ransomware, do we see the malicious Word behavior, the ransomware dropped file, and a decryption instruction file in the report?
if a WSF file drops a PowerShell script to download ransomware, do all of these IoC's & artifacts show up in the report?
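Questions like the ones above boil down to comparing a generated report against a list of expected artifacts. The sketch below shows the shape such a check could take. Note that the report layout used here (`behavior.loaded_dlls`, `network.hosts`) is a simplified, hypothetical structure invented for this example and is not Cuckoo's actual report format.

```python
from typing import Any, Dict, Iterable, List


def check_report(
    report: Dict[str, Any],
    expect_dlls: Iterable[str] = (),
    expect_hosts: Iterable[str] = (),
) -> List[str]:
    """Return the expected artifacts that are missing from the report.

    An empty list means the functional test passed. Comparison is
    case-insensitive, since DLL and host names are not case-sensitive.
    """
    behavior = report.get("behavior", {})
    network = report.get("network", {})
    loaded = {name.lower() for name in behavior.get("loaded_dlls", [])}
    hosts = {name.lower() for name in network.get("hosts", [])}

    missing = []
    for dll in expect_dlls:
        if dll.lower() not in loaded:
            missing.append("dll:" + dll)
    for host in expect_hosts:
        if host.lower() not in hosts:
            missing.append("host:" + host)
    return missing
```

Returning the list of missing artifacts, rather than a bare pass/fail flag, makes a failing test round easier to diagnose.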
Provide a web GUI for Thug, designed as a sort of social network where data can be enriched with metadata coming from various sources, and where users can share results, settings, analyses and whatever else.
Thug is a client honeypot developed during previous GSoC years that is used to analyse potentially malicious websites. Now that Thug is pretty stable and in general use, this project aims to be Thug's dress - providing a convenient web GUI - but also its weapon, as it should provide a set of tools that should enrich Thug's output with new metadata and allow for correlation of results.
While it is perfectly possible to use it as a simple web GUI for Thug on your own computer, with you as the only user, we want to turn Rumāl into a powerful multi-user environment with your help. During GSoC 2017, we want to revamp the web interface to take it to the next level. Some work also needs to be done to complete the social elements that are required to make it a strong, cooperative platform (user profiles, data sharing, correlated searches and so on).
#12 - Heralding
Python 2 and 3 (good)
Protocol understanding (good)
Understanding of authentication mechanisms (good)
Improve existing tool
In prioritized order:
Implement the relevant parts of the IMAP protocol.
Convert the project to Python 3.
Expand the number of authentication mechanisms supported by each protocol.
Heralding is a low-interaction credentials-catching honeypot, primarily designed to serve as an alert source for SIEM systems.
It currently exposes 6 protocols to the network; for each protocol the initial handshake and authentication mechanism is implemented.
For this year's GSoC, we would want the student to complete two major tasks and a number of minor tasks. The major tasks would be implementing the IMAP protocol (handshake and a number of authentication mechanisms) and converting the project to Python 3. The minor tasks would include expanding the authentication mechanisms supported by the already implemented protocols.
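To give a feel for what "handshake and authentication" means for IMAP in a credentials-catching honeypot, here is a minimal sketch that parses the plaintext `LOGIN` command and always rejects it. The function names are hypothetical and this is not Heralding's code. It handles only atoms and quoted strings as arguments; IMAP literals and other authentication mechanisms (e.g. `AUTHENTICATE`) are left out.

```python
import re
from typing import Optional, Tuple

# An argument is either a quoted string (with backslash escapes) or an atom.
_ARG = r'("(?:[^"\\]|\\.)*"|[^\s"]+)'
_LOGIN = re.compile(r"^(\S+) LOGIN " + _ARG + " " + _ARG + r"$", re.IGNORECASE)


def _unquote(arg: str) -> str:
    """Strip surrounding quotes and resolve backslash escapes."""
    if arg.startswith('"'):
        return re.sub(r"\\(.)", r"\1", arg[1:-1])
    return arg


def parse_login(line: bytes) -> Optional[Tuple[str, str, str]]:
    """Return (tag, username, password) for a LOGIN command, else None."""
    match = _LOGIN.match(line.decode("utf-8", errors="replace").strip())
    if match is None:
        return None
    tag, user, password = match.groups()
    return tag, _unquote(user), _unquote(password)


def handle_line(line: bytes) -> bytes:
    """Answer one client line; credentials are never accepted."""
    parsed = parse_login(line)
    if parsed is None:
        tag = line.split(b" ", 1)[0] if line.strip() else b"*"
        return tag + b" BAD command not supported\r\n"
    tag, user, password = parsed
    # A honeypot would record (user, password) as an alert at this point.
    return tag.encode("utf-8") + b" NO LOGIN failed\r\n"
```

Echoing the client's tag in the response is required by the protocol, so even a honeypot that rejects every login has to track it.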
#13 - Conpot Protocol on Steroids
Python, Protocol Understanding
Improve existing tool
Improve/extend the protocol stacks provided by conpot
Conpot handles a number of protocols (including IPMI, HTTP, SNMP, BACnet, modbus and s7comm) that are usually used in industrial applications / environments. For some of these protocol stacks we rely on third-party libraries, others were developed specifically for conpot.
Protocol compliance and resemblance are key to convincing both human and automated attackers that the honeypot is a real system rather than just a simulated product that does not provide any value to the attacker. Furthermore, better and deeper support for communication at the protocol level allows for better reporting and understanding of what an intruder is trying to do to the honeypot.
The student should be willing to read technical specs and documentation like RFCs and blueprints, as well as dissect real traffic or analyze testing tools, in order to improve conpot, e.g. by fixing bugs and adding new features and commands to existing protocols, so that the honeypot looks like “the real thing™”.
Adding new functionality to SNARE/TANNER for making it stable and powerful.
SNARE is a web application honeypot sensor attracting all sorts of maliciousness from the Internet. The web page is generated by cloning a real web application and injecting known vulnerabilities. SNARE connects to TANNER, a remote data analysis and classification service, to evaluate HTTP requests and compose the response that is then served by SNARE.
We are completely open to any ideas, and this list can give you a good starting point:
Expand the set of emulators / improve existing emulators. Emulation of some popular vulnerabilities was implemented during previous GSoC, but we are interested in expanding the set of emulators and adding new functionality to existing emulators (e.g. emulating time-based/blind SQL injection).
Implement TANNER API. We have a really simple API, which allows getting sessions from TANNER storage in JSON format, but we want a more powerful tool for working with the collected data outside of SNARE.
Implement a Web UI for TANNER to get session info.
Improve storing and analyzing sessions in TANNER. The current analysis process makes simple guesses about the peer status (bot/attacker/user), but we can extract more from the collected data.
Improve the SNARE cloning system. Developing a great cloning system for all kinds of CMS and custom sites is quite a challenge.
Architecture and performance improvements.
Testing with popular vulnerability scanners.
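As an example of what emulating time-based blind SQL injection (mentioned in the list above) might involve: an attacker probes with a payload such as `SLEEP(5)` and measures the response time, so an emulator needs to spot the requested delay and then respond late. The sketch below covers only the detection step. The function name and the cap are assumptions for this example, and it is not TANNER's emulator interface.

```python
import re

# MySQL's SLEEP(n) and PostgreSQL's pg_sleep(n); other dialects and
# techniques (e.g. BENCHMARK, WAITFOR DELAY) are not handled here.
_SLEEP = re.compile(
    r"\b(?:sleep|pg_sleep)\s*\(\s*(\d+(?:\.\d+)?)\s*\)", re.IGNORECASE
)


def emulated_delay(payload: str, max_delay: float = 10.0) -> float:
    """Return how many seconds the response should be delayed.

    Returns 0.0 if the payload requests no delay. The delay is capped so
    that an attacker cannot tie up the honeypot with SLEEP(100000).
    """
    delays = [float(seconds) for seconds in _SLEEP.findall(payload)]
    if not delays:
        return 0.0
    return min(max(delays), max_delay)
```

An emulator would then wait for the returned number of seconds before sending its response, so that the attacker's timing probe appears to succeed.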
#15 - HoneyThing Improvements
Improve existing tool
Add new and improve existing features for the HoneyThing
The HoneyThing is a honeypot for the internet of TR-069 things. It was designed to act completely as a modem/router that has the RomPager embedded web server and supports the TR-069 (CWMP) protocol. Development on the HoneyThing began last year as part of Ömer Erdem’s Master Thesis. The idea is described in detail here and the current code resides here.
For this year we want to make sure the HoneyThing has more deception tactics than just the mentioned RomPager vulnerabilities, and to expand it to support different modems/routers and more IoT-type devices in general.
The current code base requires some reworking to make it more generic and modular. The project will begin with understanding and refactoring the current code; the new features should then be added.
New vulnerabilities and deception tactics. HoneyThing supports a limited set of vulnerabilities. Adding new/popular vulnerabilities and deception tactics (e.g. Mirai), not restricted to RomPager, will make it more attractive to attackers.
Telnet support. TR-069 and HTTP services run on different ports in the current system. A Telnet service (e.g. on port 23) can be added to the system. It would let the attacker obtain a shell after a successful exploit. Some important shell commands can be simulated at this stage.
System logs are written in a parsable text format. As a new feature, logs can be sent to syslog or a database. This will make it easy to process logs with visualization applications such as Kibana, Splunk etc.
Management Web Interface. An interface to configure and manage the HoneyThing instance, present statistics, etc.
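The Telnet and logging ideas in the list above can be combined in a small sketch: a simulated shell that answers a handful of commands with canned output and emits one structured log entry per command. Everything here is invented for illustration, including the canned outputs, the error message format and the log fields; none of it is taken from the HoneyThing code base.

```python
import json
from typing import Tuple

# Canned responses for a few commands an attacker is likely to try first.
CANNED = {
    "id": "uid=0(root) gid=0(root)",
    "pwd": "/",
    "uname": "Linux",
}


def handle_command(line: str) -> Tuple[str, str]:
    """Return (shell output, JSON log entry) for one attacker command."""
    command = line.strip()
    if not command:
        output = ""
    else:
        name = command.split()[0]
        # Unknown commands get a shell-style error instead of silence.
        output = CANNED.get(command, "-sh: {}: not found".format(name))
    log_entry = json.dumps(
        {"event": "telnet_command", "command": command}, sort_keys=True
    )
    return output, log_entry
```

Emitting each entry as one JSON object per line keeps the log easy to forward to syslog or a database and to index in tools such as Kibana or Splunk.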