Bujlow 2012
Machine Learning Algorithm
Tomasz Bujlow, Tahir Riaz, Jens Myrup Pedersen
Section for Networking and Security, Department of Electronic Systems
Aalborg University, DK-9220 Aalborg East, Denmark
{tbu, tahir, jens}@es.aau.dk
Abstract—Our previous work demonstrated the possibility of distinguishing several kinds of applications with an accuracy of over 99 %. Today, most traffic is generated by web browsers, which provide different kinds of services based on the HTTP protocol: web browsing, file downloads, audio and voice streaming through third-party plugins, etc. This paper proposes and evaluates two approaches to distinguishing various kinds of HTTP content: a distributed one, running on volunteers' machines, and a centralized one, running in the core of the network. We also assess the accuracy of the global classifier for both HTTP and non-HTTP traffic. We achieved an accuracy of 94 %, which is expected to be even higher in real-life usage. Finally, we provide graphical characteristics of different kinds of HTTP traffic.

Index Terms—traffic classification, computer networks, HTTP traffic, browser traffic, C5.0, Machine Learning Algorithms (MLAs), performance monitoring

I. INTRODUCTION

The assessment of Quality of Service (QoS) in computer networks is a challenging task because different kinds of applications (voice, video, file download) have different requirements. Therefore, to estimate the performance, we need to know what type of data flow is currently being assessed. There are many methods for distinguishing computer network traffic, including classification by ports, Deep Packet Inspection (DPI), and statistical classification [1]. We compared them in [2] and concluded that these methods are not sufficient for the real-time identification of HTTP traffic.

We considered two possible approaches to the classification of data in the high-speed computer network infrastructure: centralized and distributed. We implemented the distributed approach as the Volunteer-Based System (VBS) and presented it in [2]. It involves collecting data by VBS clients installed on users' computers, together with the name of the corresponding application. The necessary statistical parameters are calculated on the client side and sent to the database located on the VBS server. We designed the centralized solution as a flow-examining application installed in the central point of the network. All the flows passing through that point are captured and assigned to a particular application class by the C5.0 Machine Learning Algorithm (MLA) [3]. As training data we used the data collected by our VBS. The proposed design of a solution for estimating QoS using both these approaches and combining passive and active measurements was described in [4]. The accuracy of the distributed solution approaches 100 % (as it uses the application names taken directly from system sockets), and the centralized solution was assessed to be 99.3–99.9 % accurate, due to the C5.0 classification error, when classifying 7 different applications [3].

In previous papers, we assumed that one application carries only one type of traffic and, for this reason, we took into account only applications fulfilling this criterion. However, the data collected by VBS showed that nowadays the majority of traffic is generated by HTTP-based applications such as web browsers. Until now, we treated this kind of traffic as a general web traffic class, which in effect consisted of interactive traffic (web pages), audio and video streams, and big file downloads. All these kinds of content have different characteristics and QoS requirements [5], and therefore they need to be distinguished and processed in different ways. The measured characteristics of different content types found within HTTP flows are shown at the end of this paper. All these factors lead to the conclusion that during QoS assessment, we are interested in the type of the traffic, not in the application which generates it. In this paper we present and evaluate a method for recognizing different kinds of HTTP traffic.

Other methods for HTTP traffic classification are shown in [6] and [7]. In [6], the authors propose to use the size of the flow and the number of flows associated with the same IP address to determine the character of the traffic using 3 different MLAs. Unfortunately, this approach requires the traffic to be collected in advance, and in consequence, it is not suitable for the real-time classification needed for QoS assessment purposes. The method described in [7] is based on keyword matching, flow statistics and a self-developed algorithm. This approach also does not fulfill our needs because it requires processing whole flows: first to match the signature, then to extract statistics, such as the number of packets contained in the flow. In contrast, our centralized solution is able to classify the data based on 35 packets from any point of a flow. As a consequence, we can monitor flows very quickly.

The remainder of this paper not only gives an overview of our solutions for the distributed and centralized classification of network traffic and our methods for providing precise input data, but also describes the results and finally shows different traffic profiles. We assessed the accuracy of the classification when using different algorithm parameters. The data used in our experiments originate from 5 private machines running in Denmark and in Poland as well as 18 machines installed in computer classrooms in Gimnazjum nr 3 z Oddziałami
Integracyjnymi i Dwujęzycznymi imienia Karola Wojtyły w Mysłowicach, a high school in Poland.

II. THE CENTRALIZED CLASSIFICATION METHOD

We designed the centralized solution to be implemented in the core of the network. 35-packet long snippets from the selected flows are inspected by the statistics generator, which calculates the values of the relevant parameters. Based on the calculated statistics, the C5.0 MLA is able to predict the traffic class of the flow. The first and most important issue in our solution was how to train the classifier properly. Consequently, we designed and implemented an algorithm which uses pre-classified browser traffic to generate training cases for different classes of traffic. The description of the algorithm is based on Figure 1.

Browser traffic can be classified using two different approaches: by HTTP headers, or by application names together with additional flow conditions such as ports. Table I contains examples of different services provided by the Firefox web browser, together with information about the chosen method of classification. As shown, most services can be classified accurately by HTTP content type. Unfortunately, in some cases, we are not able to distinguish HTTP audio from HTTP video streams (as shown in the case of the application/x-mms-framed content type, used both for streamed audio and video content). However, streamed multimedia content is often played by plugins which use the Real-Time Messaging Protocol (RTMP) instead of HTTP, so the content can be separated using plugin names (such as plugin-container) and the RTMP remote port (1935). The other problem is that we cannot distinguish streamed multimedia content from multimedia files embedded on websites, such as YouTube, because they use the same content type, for example, audio/mpeg. Moreover, the traffic generated by multimedia files downloaded through web browsers appears like a regular multimedia transmission.

Table I. Examples of services provided through the Firefox web browser and the chosen method of classification
Service | URL | Application name | Content type | Classification method
Internet radio
The Voice | http://www.thevoice.dk/popup/popup.php?tab=radio | firefox | audio/mpeg | By content type
NOVAfm | http://www.novafm.dk/popup/popup.php?tab=radio | firefox | audio/mpeg | By content type
Radio 3 | http://www.radio3.dk/sites/all/modules/netplayer/player.php | plugin-container | | Impossible
RMF FM | http://www.rmfon.pl/play,5 | plugin-container | audio/aacp | By content type
ESKA | http://www.eska.pl/player?streamId=101 | firefox | audio/mpeg | By content type
CNN radio | http://radioradio7.com/radio/CNN.html | totem-plugin- | application/x-mms-framed | By content type
Embedded audio
Wrzuta.pl | http://www.wrzuta.pl | firefox | audio/mpeg | By content type
Video on Demand
Youtube | http://www.youtube.com/ | firefox | video/x-flv | By content type
Ipla | http://www.ipla.pl | iplalite | video/x-flv | By content type
Onet Video News | http://www.onet.pl | firefox | video/x-flv | By content type
CNN Video News | http://edition.cnn.com/video/ | firefox | video/x-flv | By content type
Wrzuta.pl | http://www.wrzuta.pl | firefox | video/mp4 | By content type
Internet TV
Justin.tv | http://www.justin.tv/ | plugin-container | | By app name and remote port 1935
Al-Jazeera | http://www.aljazeera.com/watch_now/ | plugin-container | | By app name and remote port 1935
PDR | http://www.pdr.pl | totem-plugin- | application/x-mms-framed | By content type
File download
File 1 | http://download.oracle.com/otn-pub/java/jdk/7u1-b08/jdk-7u1-solaris-sparc.tar.Z | firefox | application/x-compress | By content type
File 2 | http://www.skatnet.dk/test/testfile.avi | firefox | video/x-msvideo | False classification as video

As the first step, we need to decide if we are dealing with an HTTP-based flow or another kind of transport-layer flow. For this purpose, we examine each packet in the flow and check if the HTTP header exists. If yes, we look for the content-type field. If we can obtain the information, the preferred way of processing is always to handle the flow as an HTTP-based flow, as it allows us to recognize different kinds of flows generated by one application. Short flows (below 200 packets in the case of regular flows, and below 35 packets in the case of HTTP-based flows) are discarded because they are useless from the QoS measurement point of view.

A. Regular transport-layer flows

Regular flows are processed based on the application name to traffic class assignment list. Most applications are specialized to handle specific types of traffic (voice conversations for Skype, file transfer for FTP clients, or interactive traffic for games), but they also generate background traffic. For example, Skype shares a distributed users' directory, free file transfer clients tend to download advertisements to display on the screen while doing their job, and games have control connections to the main server. These flows, acting as noise, are usually quite short. To eliminate their impact, we decided to discard all flows shorter than 200 packets. If there is no application name assigned to the flow, the flow is discarded. Flows associated with HTTP-based applications (like web browsers) are discarded as well, because they are not recognized as HTTP flows and their type is unknown. Then the flows are checked against the application name to traffic class assignment list. If we cannot find any match, the flow is discarded as well. As described in [2], around the first 10 and the last 5 packets of each flow have different size characteristics than other packets. As a result, these packets are cut out of the flow. Next, the flow is split into 35-packet subflows, which are provided to the statistics generator. The generated statistics are given as the input to the C5.0 classifier as training or test data. It was shown in [2] that further increasing the number of packets in the subflow does not significantly improve the accuracy of the classifier. Using the smallest reasonable number of packets makes traffic classification faster and saves system resources, which allows more flows to be processed at a time.

B. HTTP-based transport-layer flows

Dealing with HTTP-based flows is more complex, as one transport-layer flow can contain multiple HTTP flows, which can carry various kinds of content: text files, images, audio and video data. For this reason, we split the transport-layer flow into separate HTTP flows, which are mapped to a traffic class based on the content-type field in the HTTP header. We found that the content-type field in the HTTP header is present in, and only in, the first inbound packet of a new logical HTTP flow. If the mapping does not exist, the HTTP flow is discarded. We decided to specify the following traffic classes: audio, file, multimedia, video and web. The multimedia class was assigned to traffic with content types which could carry audio
Figure 2. Misclassification table for HTTP (above) and whole traffic (below)
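
The preprocessing of regular flows described in Section II-A (discard flows shorter than 200 packets, cut out the first 10 and the last 5 packets, split the remainder into 35-packet subflows) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' code: the function name and the generic packet type are hypothetical, and dropping an incomplete trailing chunk is an assumption. The text does not say whether the trimming also applies to HTTP-based flows, so the thresholds are exposed as parameters.

```python
from typing import List, Sequence, TypeVar

P = TypeVar("P")  # a packet record; its structure is not specified in the text


def split_into_subflows(packets: Sequence[P],
                        min_flow_len: int = 200,
                        skip_head: int = 10,
                        skip_tail: int = 5,
                        subflow_len: int = 35) -> List[List[P]]:
    """Trim a flow and cut it into fixed-length subflows.

    Returns an empty list when the flow is too short and is discarded.
    """
    if len(packets) < min_flow_len:
        return []
    # Cut out the leading and trailing packets, whose size characteristics
    # differ from the rest of the flow.
    core = list(packets[skip_head:len(packets) - skip_tail])
    # Assumption: a trailing chunk shorter than subflow_len is dropped.
    last_start = len(core) - subflow_len
    return [core[i:i + subflow_len]
            for i in range(0, last_start + 1, subflow_len)]
```

Each returned subflow would then be passed to the statistics generator. For an HTTP-based flow, a caller might pass min_flow_len=35 and, as a further assumption, skip_head=0 and skip_tail=0.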
audio, file, p2p, ssh and video. The non-HTTP video streams were played mostly through third-party plugins in the browser, such as Adobe Flash. The error rate was in this case 6.0 %. The number of misclassifications between the file and the video classes decreased (not only in percentage terms, but in some cases also in absolute values). This means that the training process is more efficient when using a higher number of cases, because it leads to better classification accuracy.

V. THE TRAFFIC PROFILES

Based on the output from C5.0, we found the classification attributes most used to distinguish different types of HTTP traffic. We chose two of them (the number of PSH flags for the inbound direction and the total payload size) to perform the graphical analysis. The distributions of these attributes, shown in Figure 3 and Figure 4, confirm that the audio and the web traffic differ significantly from each other, and from the video traffic and the big file transfers. This shows that we can easily identify HTTP-based audio traffic, which is the most sensitive to network performance issues. It also justifies the need for a separate group of interactive web traffic. During this experiment we used M=100 and N=300 in the algorithm generating the cases.

Figure 3. Distribution of number of PSH flags for the inbound direction

Figure 4. Distribution of total payload size in the sample

VI. CONCLUSION

This paper presents two novel methods for recognizing different kinds of HTTP traffic in computer networks. The distributed method implemented among VBS clients uses the HTTP content type information to extract logical HTTP flows from the transport-layer flows. Later, the traffic classes are assigned based on the particular content type. The centralized method based on the C5.0 MLA is able to distinguish different types of HTTP traffic in the central point of the network. Furthermore, the high classification error rate (17 %) is possibly caused by numerous mistakes in both the training and the test sets. These mistakes were made by assigning downloaded movies to the video class instead of to the file download class. However, as discussed, the predicted traffic class is probably correct, but this can be proved only by real-time observation of what kind of tasks the user performs. We demonstrated that the classifier did not have problems with recognizing interactive voice traffic. The last step of our experiment was to classify the whole traffic: HTTP as well as non-HTTP. In this case we achieved a much lower error rate of 6.0 %.

REFERENCES

[1] Jun Li, Shunyi Zhang, Yanqing Lu, Junrong Yan, Real-time P2P Traffic Identification, IEEE GLOBECOM 2008 Proceedings, pp. 1–5.
[2] Tomasz Bujlow, Kartheepan Balachandran, Tahir Riaz, Jens Myrup Pedersen, Volunteer-Based System for classification of traffic in computer networks, 19th Telecommunications Forum TELFOR 2011, IEEE 2011, pp. 210–213.
[3] Tomasz Bujlow, Tahir Riaz, Jens Myrup Pedersen, A method for classification of network traffic based on C5.0 Machine Learning Algorithm, International Conference on Computing, Networking and Communications (ICNC 2012), IEEE 2012, pp. 244–248.
[4] Tomasz Bujlow, Tahir Riaz, Jens Myrup Pedersen, A Method for Assessing Quality of Service in Broadband Networks, Proceedings of the 14th International Conference on Advanced Communication Technology (ICACT 2012), IEEE 2012, pp. 826–831.
[5] Gerhard Haßlinger, Implications of Traffic Characteristics on Quality of Service in Broadband Multi Service Networks, Proceedings of the 30th EUROMICRO Conference (EUROMICRO'04), IEEE Computer Society 2004, pp. 196–204.
[6] Kei Takeshita, Takeshi Kurosawa, Masayuki Tsujino, Motoi Iwashita, Evaluation of HTTP Video Classification Method Using Flow Group Information, 14th International Telecommunications Network Strategy and Planning Symposium (NETWORKS), IEEE 2010, pp. 1–6.
[7] Samruay Kaoprakhon, Vasaka Visoottiviseth, Classification of Audio and Video Traffic over HTTP Protocol, 9th International Symposium on Communications and Information Technology (ISCIT 2009), IEEE 2009, pp. 1534–1539.