Tuesday, 09 August 2022

An example project management paper for you: Locally Linear Metric Adaptation for Semi-Supervised Clustering

 


Locally Linear Metric Adaptation for Semi-Supervised Clustering

Hong Chang HONGCH@CS.UST.HK

Dit-Yan Yeung DYYEUNG@CS.UST.HK

Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong

 

Abstract

Many supervised and unsupervised learning algorithms are very sensitive to the choice of an appropriate distance metric. While classification tasks can make use of class label information for metric learning, such information is generally unavailable in conventional clustering tasks. Some recent research sought to address a variant of the conventional clustering problem called semi-supervised clustering, which performs clustering in the presence of some background knowledge or supervisory information expressed as pairwise similarity or dissimilarity constraints. However, existing metric learning methods for semi-supervised clustering mostly perform global metric learning through a linear transformation. In this paper, we propose a new metric learning method which performs nonlinear transformation globally but linear transformation locally. In particular, we formulate the learning problem as an optimization problem and present two methods for solving it. Through some toy data sets, we show empirically that our locally linear metric adaptation (LLMA) method can handle some difficult cases that cannot be handled satisfactorily by previous methods. We also demonstrate the effectiveness of our method on some real data sets.

1. Introduction

Many machine learning and pattern recognition algorithms rely on a distance metric. Some commonly used methods are nearest neighbor classifiers, radial basis function networks and support vector machines for classification tasks and the k-means algorithm for clustering tasks. The performance of these methods often depends critically on the choice of an appropriate metric. Instead of choosing the metric manually, a promising approach is to learn the metric from data automatically. This idea can be dated back to some early work on optimizing the metric for k-nearest neighbor density estimation (Fukunaga & Hostetler, 1973). More recent research along this line continued to develop various locally adaptive metrics for nearest neighbor classifiers, e.g., (Domeniconi et al., 2002; Friedman, 1994; Hastie & Tibshirani, 1996; Lowe, 1995; Peng et al., 2002). Besides nearest neighbor classifiers, there are other methods that also perform metric learning based on nearest neighbors, e.g., radial basis function networks and variants (Poggio & Girosi, 1990).

Appearing in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. Copyright 2004 by the authors.

While class label information is available for metric learning in classification tasks, such information is generally unavailable in conventional clustering tasks. To adapt the metric appropriately to improve the clustering results, some additional background knowledge or supervisory information should be made available. This learning paradigm between the supervised and unsupervised learning extremes is referred to as semi-supervised clustering, as contrasted with another type of semi-supervised learning task called semi-supervised classification, which solves the classification problem with the aid of additional unlabeled data.

One type of supervisory information is in the form of limited labeled data.1 Based on such information, Sinkkonen and Kaski (2002) proposed a local metric learning method to improve clustering and visualization results. Basu et al. (2002) explored using labeled data to generate initial seed clusters for the k-means clustering algorithm. Also, Zhang et al. (2003) proposed a parametric distance metric learning method for both classification and clustering tasks.

1 Semi-supervised clustering with the aid of labeled data is essentially the same as semi-supervised classification with the aid of unlabeled data.

 

Another type of supervisory information is in the form of pairwise similarity or dissimilarity constraints. This type of supervisory information is weaker than the first type, in that pairwise constraints can be derived from labeled data but not vice versa. Wagstaff and Cardie (2000) and Wagstaff et al. (2001) proposed using such pairwise constraints to improve clustering results. Klein et al. (2002) introduced spatial generalizations to pairwise constraints, so that the pairwise constraints can also have influence on the neighboring data points. However, neither method incorporates metric learning into the clustering algorithms. Xing et al. (2003) proposed using pairwise side information in a novel way to learn a global Mahalanobis metric before performing clustering with constraints. Both Klein et al.'s and Xing et al.'s methods generally outperform Wagstaff et al.'s method in the experiments reported. Instead of using an iterative algorithm as in (Xing et al., 2003), Bar-Hillel et al. (2003) devised a more efficient, non-iterative algorithm called relevant component analysis (RCA) for learning a global Mahalanobis metric. However, their method can only incorporate similarity constraints. Shental et al. (2004) extended the work of (Bar-Hillel et al., 2003) by incorporating both pairwise similarity and dissimilarity constraints into the expectation-maximization (EM) algorithm for model-based clustering based on Gaussian mixture models. Kwok and Tsang (2003) established the relationship between metric learning and kernel matrix adaptation.

To summarize, we can categorize metric learning methods according to two different dimensions. The first dimension is concerned with whether (supervised) classification or (unsupervised) clustering is performed. Most methods were proposed for classification tasks, but some recent methods extended metric learning to clustering tasks under the semi-supervised learning paradigm. Supervisory information may be in the form of class label information or pairwise (dis)similarity information. The second dimension categorizes metric learning methods into global and local ones. Provided that sufficient data are available, local metric learning is generally preferred as it is more flexible in allowing different local metrics at different locations of the input space. In this paper, we propose a new metric learning method for semi-supervised clustering with pairwise similarity side information. While our method is local in the sense that it performs metric learning through locally linear transformation, it also achieves global consistency through interaction between adjacent local neighborhoods.

The rest of this paper is organized as follows. In Section 2, we present our metric learning method based on locally linear transformation. We also formulate the learning problem as an optimization problem. In Section 3, we present two methods for solving this optimization problem. Experimental results on both toy and real data are presented in Section 4, comparing our method with some previous methods. Finally, some concluding remarks are given in the last section.

2. Locally Linear Metric Adaptation

2.1. Basic Ideas

Let us denote a set of n data points in a d-dimensional input space by X = {x_1, x_2, ..., x_n}. As in (Bar-Hillel et al., 2003), we only consider pairwise similarity constraints, which are given in the form of a set S0 of similar point pairs. Intuitively, we want to transform the n data points to a new space in which the points in each similar pair will get closer to each other. To preserve the topological relationships between data points, we move not only the points involved in the similar pairs but also other points. For computational efficiency, we resort to linear transformation. One promising approach is to apply locally linear transformation so that the overall transformation of all points in X is linear locally but nonlinear globally, generalizing previous metric learning methods based on applying linear transformation globally (Bar-Hillel et al., 2003; Xing et al., 2003). We call this new metric learning method locally linear metric adaptation (LLMA). However, caution should be taken when applying linear transformation to reduce the distance between similar points, as a degenerate transformation will simply map all points to the same location so that all inter-point distances vanish (and hence become the smallest possible). Obviously this degenerate case is undesirable and should be avoided.

2.2. Metric Adaptation as an Optimization Problem

We now proceed to devise the metric learning algorithm more formally. We first generate the transitive and reflexive closure S from S0. For each point pair (x_r, x_s) ∈ S, we apply a linear transformation to the vector (x_s − x_r) to give A_r(x_s − x_r) + c_r for some d×d matrix A_r and d-dimensional vector c_r. If a data point is involved in more than one point pair, we consider the transformation for each pair separately. The same linear transformation is also applied to every data point x_i in the neighborhood set N_r of x_r. In other words, every data point x_i ∈ N_r is transformed to

y_i = A_r(x_i − x_r) + c_r + x_r = x_i + (A_r − I) x_i + b_r,

 

where b_r = (I − A_r) x_r + c_r is the translation vector for all points x_i in N_r.

However, a data point x_i may belong to multiple neighborhood sets corresponding to different point pairs in S. Thus, the new location y_i of x_i is the overall transformation effected by possibly all similar point pairs (and hence neighborhood sets):

y_i = x_i + Σ_{(x_r,x_s)∈S} γ_ri [(A_r − I) x_i + b_r],

where γ_ri = 1 if x_i ∈ N_r and 0 otherwise.

Let m denote the number of point pairs in S. Thus a total of m different transformations have to be estimated from the point pairs in S, requiring O(md²) transformation parameters in the A_r's and b_r's. When m is small compared with the dimensionality d, we cannot estimate the O(md²) transformation parameters accurately. One way to get around this problem is to focus on a more restrictive set of linear transformations. The simplest case is to allow only translation, which can be described by md parameters. Obviously, translating all data points in a neighborhood set by the same amount leads to no change in the inter-point distances. Although this no longer holds exactly when some data points fall into multiple neighborhood sets, we want to incorporate an extra degree of freedom by changing the neighborhood sets to Gaussian neighborhood functions. More specifically, we set A_r to the identity matrix I and express the new location y_i of x_i as

yi = xi + ribr, (1)

(Xr,Xs)S

where ri is a Gaussian function defined as

ri = exp 1 2 (xi  xr)T~1

r (xi  xr ) ,

with Er being the covariance matrix. For simplicity, we use a hyperspherical Gaussian function, meaning that the covariance matrix is diagonal with all diag¬onal entries being 2. Thus ri can be rewritten as ri = exp xi  xr2/(22). Note that (1) can be expressed as

y_i = x_i + B π_i,    (2)

where B = [b_1, b_2, ..., b_m] is a d×m matrix and π_i = (γ_1i, γ_2i, ..., γ_mi)ᵀ is an m-dimensional column vector. For data points that are far away from all points involved in S (and hence from the centers of the neighborhoods), all γ_ri's are close to 0 and hence those points essentially do not move (since y_i ≈ x_i).
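As a concrete illustration, the following minimal Python sketch evaluates the transformation (2); the array names and shapes are our own assumptions for exposition (X holds the n×d data, centers the m points x_r from the similar pairs, and B the d×m matrix of translation vectors b_r).

```python
import numpy as np

def transform_points(X, centers, B, sigma):
    # Minimal sketch of equations (1)-(2), translation-only LLMA.
    # gamma[r, i] = exp(-||x_i - x_r||^2 / (2 sigma^2)), an m x n array.
    sq_dists = ((X[None, :, :] - centers[:, None, :]) ** 2).sum(axis=2)
    gamma = np.exp(-sq_dists / (2.0 * sigma ** 2))
    # y_i = x_i + B pi_i, stacked over all i: Y = X + (B gamma)^T
    return X + (B @ gamma).T
```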

 

We now formulate the optimization problem for finding the transformation parameters. The optimization criterion is defined as

J = dS + P, (3)

where dS is the sum of squared Euclidean distances for all similar pairs in the transformed space

dS = yr ys2,

(Xr,Xs)S

and P, a penalty term used to constrain the degree of transformation, is defined as

P = (dij) (qij  dij)2, (4)

i j

where qij = yi  yj and dij = xi  xj repre¬sent the inter-point Euclidean distances in the trans¬formed and original spaces, respectively. (dij) (dij) = exp d2

is again in the form of a Gaussian function, as ij/2 , with parameter  specify¬ing the spread of the Gaussian window. The regular¬ization parameter  > 0 in (3) determines the rela¬tive significance of the penalty term in the objective function for the optimization problem. Note that the optimization criterion in (3) is analogous to objective functions commonly used in energy minimization mod¬els such as deformable models (Cheung et al., 2002), with the penalty term P playing the role of an internal energy term.
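Under the same illustrative conventions as the sketch above, the criterion (3) can be evaluated directly; pairs lists the index pairs in S, and lam and beta stand in for λ and β.

```python
import numpy as np

def objective(X, Y, pairs, lam, beta):
    # Minimal sketch of criterion (3): J = d_S + lam * P.
    d_S = sum(np.sum((Y[r] - Y[s]) ** 2) for r, s in pairs)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # original distances d_ij
    q = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)  # transformed distances q_ij
    w = np.exp(-d ** 2 / beta ** 2)                            # Gaussian window omega(d_ij)
    P = np.sum(w * (q - d) ** 2)                               # penalty term (4)
    return d_S + lam * P
```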

2.3. Iterative Metric Adaptation Procedure

The optimization problem formulated above is solved in an iterative manner, resulting in an iterative metric adaptation procedure. To increase the local specificity gradually over time and allow global nonlinearity in the transformation, the Gaussian window parameters σ and β, determining the neighborhood size and the weights in the penalty term, respectively, should decrease over time. We apply a simple method of decreasing the window parameters: σ(t) = α q̄(t)/√t and β(t) = γ σ(t), for iteration t = 1, 2, ..., where q̄(t) = (2/(n(n−1))) Σ_{i<j} ‖y_i(t) − y_j(t)‖ is the average inter-point Euclidean distance in the transformed space over all point pairs, and α, γ > 0 are two constant parameters.

At iteration t, given the data point locations y_i(t) and the window parameters σ(t) and β(t), the overall optimization criterion in (3) is rewritten as the optimization criterion for iteration t:

J(t)({b_r}; {y_i(t)}, σ(t), β(t)) = Σ_{(x_r,x_s)∈S} ‖y_r − y_s‖² + λ Σ_i Σ_j ω(t)(d_ij)(q_ij − d_ij)².    (5)

 

Note that y_r, y_s and q_ij depend on {b_r} and {y_i(t)}. However, for simplicity, the dependency is not explicitly shown on the right-hand side of (5). We seek to minimize J(t) by finding the optimal values of {b_r} as {b_r(t)}, which are then used to compute the location changes from {y_i(t)} to {y_i(t+1)}.

There are two stopping criteria in our iterative algorithm. The first criterion is based on the ratio ρ(t) of the average inter-point distance over point pairs in S to that over all point pairs in X (i.e., q̄(t)). The procedure will stop when ρ(t) becomes smaller than some prespecified threshold ρ0. Another stopping criterion is simply to set a maximum number of iterations T. The metric learning procedure will stop when either stopping criterion is satisfied.

We summarize our LLMA algorithm as follows:

1. y_i(1) = x_i, 1 ≤ i ≤ n; t = 1.

2. If ρ(t) < ρ0 or t = T, then exit.

3. σ(t) = α q̄(t)/√t; β(t) = γ σ(t).

4. Compute π_i(t) = (γ_1(t)(y_i(t)), ..., γ_m(t)(y_i(t)))ᵀ, 1 ≤ i ≤ n, based on σ(t).

5. Compute the optimal b_r(t), 1 ≤ r ≤ m, by minimizing J(t) in (5) with respect to {b_r}.

6. Update all data points as y_i(t+1) = y_i(t) + B(t) π_i(t), 1 ≤ i ≤ n, following (2).

7. t = t + 1; go to Step 2.

In the algorithm, Step 5 is the key step which solves the optimization problem for each iteration based on the criterion in (5). In the next section, we present two methods for solving this optimization problem.
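To make the procedure concrete, here is a minimal Python skeleton of the iterative loop, assuming the illustrative helpers transform_points and objective sketched earlier; the generic scipy optimizer stands in for the Step 5 solvers described in the next section, and the parameter names (alpha, gamma_, lam, rho0, T) mirror the symbols in the text.

```python
import numpy as np
from scipy.optimize import minimize

def pairwise_mean_dist(Y):
    # q_bar(t): average inter-point distance over all point pairs in X
    D = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    return D[np.triu_indices(len(Y), k=1)].mean()

def mean_pair_dist(Y, pairs):
    # average inter-point distance over the similar pairs in S
    return np.mean([np.linalg.norm(Y[r] - Y[s]) for r, s in pairs])

def llma(X, pairs, alpha, gamma_, lam, rho0, T):
    n, d = X.shape
    m = len(pairs)
    Y = X.copy()                                   # Step 1: y_i(1) = x_i
    for t in range(1, T + 1):
        q_bar = pairwise_mean_dist(Y)
        rho = mean_pair_dist(Y, pairs) / q_bar     # stopping ratio rho(t)
        if rho < rho0:                             # Step 2
            break
        sigma = alpha * q_bar / np.sqrt(t)         # Step 3: sigma(t)
        beta = gamma_ * sigma                      #         beta(t)
        centers = np.array([Y[r] for r, _ in pairs])

        def J(b_flat):                             # criterion (5) as a function of {b_r}
            B = b_flat.reshape(d, m)
            Y_new = transform_points(Y, centers, B, sigma)
            return objective(Y, Y_new, pairs, lam, beta)

        res = minimize(J, np.zeros(d * m))         # Step 5: optimal {b_r}
        B_opt = res.x.reshape(d, m)
        Y = transform_points(Y, centers, B_opt, sigma)  # Step 6: update all points
    return Y
```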

3. Optimization Methods

We now proceed to solve the optimization problem in Step 5 of the LLMA algorithm shown above. Two different optimization methods are discussed in the following two subsections.

3.1. Gradient Method

While the first term of J(t) in (5) is quadratic in {b_r}, the second term is of a more complex form. So we cannot find a closed-form solution for the optimal values of {b_r}; instead, we apply a gradient-based method to minimize J(t) numerically.

 

3.2. Iterative Majorization

Alternatively, we can solve the optimization problem by iterative majorization, which can be seen as an EM-like algorithm for problems with no missing data. We define

C = Σ_i Σ_j a_ij (π_i − π_j)(π_i − π_j)ᵀ

and

D(L) = Σ_i Σ_j e_ij(L) (π_i − π_j)(π_i − π_j)ᵀ

with

e_ij(L) = λω(d_ij) d_ij / q_ij(L)   if q_ij(L) > 0,
e_ij(L) = 0                         if q_ij(L) = 0.

Then the optimization problem consists of the following steps:3

1. Initialize L(0); u = 0.

2. u = u + 1; compute L(u) = L(u−1) (D(L(u−1)))ᵀ (C⁻¹)ᵀ.

3. If converged, then stop; otherwise go to Step 2.

3.3. Other Methods

Recall that the penalty term P in (3) serves to constrain the degree of transformation, partly to avoid the occurrence of a degenerate transformation and partly to preserve the local topological relationships between data points. Besides defining the penalty term as in (4), there also exist other ways to achieve this goal. One possibility is to preserve the locally linear relationships between nearest neighbors, as in a nonlinear dimensionality reduction method called locally linear embedding (LLE) (Roweis & Saul, 2000). Due to the page limit, details of this method are omitted here.
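Although the paper omits the details, one plausible instantiation (our assumption, following Roweis & Saul, 2000, not necessarily the authors' construction) is to compute locally linear reconstruction weights in the original space and penalize the reconstruction error in the transformed space:

```python
import numpy as np

def lle_weights(X, k):
    # Reconstruction weights w_ij of each x_i from its k nearest neighbors.
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.argsort(dists)[1:k + 1]            # k nearest neighbors of x_i
        Z = X[nbrs] - X[i]                           # centered neighbor matrix
        G = Z @ Z.T + 1e-3 * np.trace(Z @ Z.T) * np.eye(k)  # regularized Gram matrix
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs] = w / w.sum()                     # weights sum to one
    return W

def lle_penalty(Y, W):
    # Penalize ||y_i - sum_j w_ij y_j||^2 in the transformed space.
    return np.sum((Y - W @ Y) ** 2)
```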

4. Experimental Results

To assess the efficacy of LLMA, we perform extensive experiments on toy data as well as real data from the UCI Machine Learning Repository.4

4.1. Illustrative Examples

Figure 1 demonstrates the power of our LLMA method by comparing it with the RCA method (Bar-Hillel et al., 2003) on three toy data sets.5 RCA, as a metric learning method, changes the feature space by a global linear transformation which assigns large weights to relevant dimensions and low weights to irrelevant dimensions. The relevant dimensions are estimated based on connected components composed of similar patterns. For each data set, we randomly select 10 similar pairs to form S0. While RCA can perform well on the first data set, its performance is significantly worse than that of LLMA on the second and third data sets, which are much more difficult cases. On the other hand, LLMA can give satisfactory results for all three cases. More details about these experiments will be given in Section 4.3.

3 Note that the iteration count u here is different from t in the LLMA algorithm shown above. This optimization problem is for Step 5 of each iteration t of the algorithm.

4 http://www.ics.uci.edu/~mlearn/MLRepository.html

4.2. Clustering Algorithms and Performance Measures for Comparative Study

In order to assess the efficacy of LLMA for semi-supervised clustering, we compare the clustering results based on k-means with and without metric learning. Besides the RCA method, we also repeat the experiments using the constrained k-means algorithm (Wagstaff et al., 2001). The constrained k-means algorithm is based on the default Euclidean metric, subject to the constraints that the patterns in a pair (x_r, x_s) ∈ S are always assigned to the same cluster. More specifically, the following four clustering algorithms are compared:

1. k-means without metric learning

2. Constrained k-means without metric learning

3. k-means with RCA for metric learning

4. k-means with LLMA for metric learning

The Rand index (Rand, 1971) is used to measure the clustering quality in our experiments. It reflects the agreement of the clustering result with the ground truth. Let n_s be the number of point pairs that are assigned to the same cluster (i.e., matched pairs) in both the resultant partition and the ground truth, and n_d be the number of point pairs that are assigned to different clusters (i.e., mismatched pairs) in both the resultant partition and the ground truth. The Rand index is defined as the ratio of (n_s + n_d) to the total number of point pairs, i.e., n(n − 1)/2. When there are more than two clusters, however, the standard Rand index will favor assigning data points to different clusters. We modify the Rand index as in (Xing et al., 2003) so that matched pairs and mismatched pairs are assigned weights to give them equal chance of occurrence (0.5).
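For reference, the (unmodified) Rand index described above can be computed as follows; labels_pred and labels_true are illustrative names for the resultant partition and the ground truth.

```python
from itertools import combinations

def rand_index(labels_pred, labels_true):
    # Fraction of point pairs on which the two partitions agree.
    n_s = n_d = 0
    pairs = list(combinations(range(len(labels_true)), 2))
    for i, j in pairs:
        same_pred = labels_pred[i] == labels_pred[j]
        same_true = labels_true[i] == labels_true[j]
        if same_pred and same_true:
            n_s += 1          # matched pair in both partitions
        elif not same_pred and not same_true:
            n_d += 1          # mismatched pair in both partitions
    return (n_s + n_d) / len(pairs)
```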

5 The MATLAB code for RCA was downloaded from the web page of an author of (Bar-Hillel et al., 2003).

 

 

 

[Figure 1 panels: (a), (e), (i) toy data sets 1-3; (b), (f), (j) similar pairs; (c), (g), (k) RCA; (d), (h), (l) LLMA.]

Figure 1. Comparison of LLMA with RCA on three toy data sets. Subfigures in the first column show the data sets each with two classes, while subfigures in the second column show 10 similar pairs in S0 for each data set. The third and fourth columns show the data sets after applying RCA and LLMA, respectively, for metric learning.

 

To see how different algorithms vary their performance with the background knowledge provided, we use 20 randomly generated S0 sets for each data set. Moreover, we compute the average Rand index over 20 random runs of (constrained) k-means for each S0 set. The results for all four algorithms are then shown as box-plots using MATLAB.

4.3. Semi-Supervised Clustering on Toy and UCI Data Sets

In the LLMA algorithm, there are a few parameters that need to be set before running the experiments. These parameters are quite easy to set based on their physical meanings. The two parameters α and γ for the decay functions of the Gaussian windows are set to values in [0.1, 3] and (0, 1), respectively. The regularization parameter λ, adjusting the tradeoff between local transformation and geometry preservation, is set to a value in [1, 5]. For the stopping criteria, we set ρ0 to a value in [0.1, 0.2] and T to 5 (i.e., very few iterations of the LLMA algorithm are run). All data sets are normalized before applying the four algorithms. The gradient method is used to obtain the experimental results shown, which are similar to those obtained using iterative majorization.

 

Figure 2 shows the clustering results for the three toy data sets illustrated in Section 4.1. Obviously, none of the three data sets can be clustered well using the standard and constrained k-means algorithms. Even RCA gives a good result only on the first data set. On the other hand, LLMA can handle all these cases and performs particularly well on the second and third data sets, which cannot be handled satisfactorily by the other methods.

We further conduct experiments on nine UCI data sets. The number of data points n, the number of features d, the number of classes c, and the number of randomly selected similar pairs |S0| are shown under each subfigure in Figure 3. From the clustering results, we can see that LLMA outperforms the other methods on most of these data sets. As for the iris, Boston housing and balance data sets, RCA improves the clustering results most.

To summarize, these experimental results on both toy and real data sets demonstrate the effectiveness of our LLMA method.

 

[Figure 2 panels (a)-(c): box-plots of the Rand index (vertical axis, 0.5-1) for the four clustering algorithms (horizontal axis, 1-4) on toy data sets 1-3.]


Figure 2. Clustering results for toy data sets shown as box-plots for 20 different S0 sets with |S0| = 10 (the four clustering algorithms are numbered as in Section 4.2).

 

5. Concluding Remarks

In this paper, we have proposed a new metric learning method called LLMA for semi-supervised clustering. Unlike previous methods which can only perform linear transformation globally, LLMA performs nonlinear transformation globally but linear transformation locally. This generalization makes it more powerful for solving some difficult clustering tasks, as demonstrated through the toy data sets. To solve the optimization problem as one step in the LLMA algorithm, we have presented two methods and hinted at some other possibilities, such as a spectral method like that used in LLE. We have also compared our method with some previous methods using real data sets.

Note that in LLMA, the original input space and the transformed space are explicitly related via a mapping Y = LΠ, where Π is a nonlinear function with respect to X. Although this is not necessary for clustering problems, it makes it possible for new data points added to the input space to be mapped onto the transformed space. This possibility will be explored as we extend our LLMA method to other applications.

Currently, our method can only utilize similarity constraints. A natural question to ask is whether we can extend LLMA by incorporating dissimilarity constraints. In principle this is possible, but the optimization criterion has to be modified in order to incorporate the new constraints. One challenge to face is to maintain the form of the objective function so that the optimization problem remains tractable.

Moreover, we have only considered a restrictive form of locally linear transformation, namely translation. A potential direction to pursue is to generalize it to more general types of linear transformation. Other possible research directions include improving the current LLMA algorithm, such as performing globally linear transformation first and then LLMA only when necessary.

 

References

Bar-Hillel, A., Hertz, T., Shental, N., & Weinshall, D. (2003). Learning distance functions using equivalence relations. Proceedings of the Twentieth International Conference on Machine Learning (pp. 11–18). Washington, DC, USA.

Basu, S., Banerjee, A., & Mooney, R. (2002). Semi-supervised clustering by seeding. Proceedings of the Nineteenth International Conference on Machine Learning (pp. 19–26). Sydney, Australia.

Cheung, K., Yeung, D., & Chin, R. (2002). On deformable models for visual pattern recognition. Pattern Recognition, 35, 1507–1526.

Domeniconi, C., Peng, J., & Gunopulos, D. (2002). Locally adaptive metric nearest-neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 1281–1285.

Friedman, J. (1994). Flexible metric nearest neighbor classification (Technical Report). Department of Statistics, Stanford University, Stanford, CA, USA.

Fukunaga, K., & Hostetler, L. (1973). Optimization of k-nearest neighbor density estimates. IEEE Transactions on Information Theory, 19, 320–326.

Hastie, T., & Tibshirani, R. (1996). Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18, 607–616.

Klein, D., Kamvar, S., & Manning, C. (2002). From instance-level constraints to space-level constraints: making the most of prior knowledge in data clustering. Proceedings of the Nineteenth International Conference on Machine Learning (pp. 307–314). Sydney, Australia.

Kwok, J., & Tsang, I. (2003). Learning with idealized kernels. Proceedings of the Twentieth International Conference on Machine Learning (pp. 400–407). Washington, DC, USA.

Lowe, D. (1995). Similarity metric learning for a variable-kernel classifier. Neural Computation, 7, 72–85.

Peng, J., Heisterkamp, D., & Dai, H. (2002). Adaptive kernel metric nearest neighbor classification. Proceedings of the Sixteenth International Conference on Pattern Recognition (pp. 33–36). Québec City, Québec, Canada.

 


 

International Journal of Information Technology Project Management, 2(2), 1-13, April-June 2011

Human Resource Related Problems in Agile and Traditional Software Project Process Models

Stefan Koch, Bogazici University, Turkey

Gerhard Turk, Vienna University of Economics and Business, Austria


 


 

and on efficiency and effectiveness of work should not be underestimated. For example, eXtreme Programming (XP) explicitly has a “no overtime” rule designed to maintain spirit and motivation (Beck, 1999). We addressed this question with a literature review on human resource related topics in software development, and an empirical study. The empirical study included interviews with a number of project participants, mostly project managers from different software developing companies.

The outline of the paper is as follows: We will start with a literature review which covers the types of software process models included in our study, as well as possible human resource related problems and prior research on the intersection of both issues. Then we will detail the empirical study undertaken, starting with a description of the methodology, followed by the results themselves. The paper closes with discussion and directions for future research.

LITERATURE REVIEW

Process Models in

Software Development

The concept of a process model in software development is still not uniformly defined, and some of the approaches to be found in the literature are not detailed enough to be called a full-blown process model. The basic idea we will use is that a process model contains a list of time-successive steps in a problem-solving process, thus constituting a guideline for developing software, structuring the overall project, helping in achieving goals including quality levels, and aiding in planning, controlling and monitoring (Liu et al., 2008). A process model also helps in giving a common understanding to all involved parties, and in defining clear interfaces between different tasks. Process models also gain special importance in multi-project management (Demirkan & Nichols, 2008).

In the literature, a plethora of approaches can be found, including the V-model, SA/SD, XP, Scrum and many more. Although they are difficult to group, for this paper we will adopt a widespread differentiation between those models which are of a more consecutive, waterfall-based nature (Royce, 1970), like the V-model or SA/SD, for which Syed-Abdullah et al. (2006) also use the term designed-based approaches, and those which show more prototyping, spiral-oriented features (Boehm, 1988). This last group is currently primarily embodied by agile approaches like Scrum or XP, which are based on the Agile Manifesto. This holds as a central tenet that both the business and technology environment continue to change at an increasing pace. Therefore the main question to be addressed is how to better handle these changes, not stop them early in the life cycle by anticipating them all, while still aiming at achieving high quality and timeliness. A focal point therefore is the flexibility that is to be ensured (Ollson, 2008). This leads to a set of principles which value individuals and interactions over processes and tools, working software over comprehensive documentation, customer collaboration over contract negotiation, and responding to change over following a plan, while explicitly acknowledging the value of the items on the right. While all the instruments of more process- and planning-oriented methods are recognized as useful, they are the ones seen as needing to give in problematic situations.

Quite often, the right choice of model is a huge challenge. Much research has focused on comparing different models and trying to determine in which contexts different approaches have merit or not (Boehm & Turner, 2003; Reifer, 2002; Stephens & Rosenberg, 2003). For example, agile approaches have faced a lot of criticism, one point being that planning, processes and documentation are essential, and agile development is nothing more than an excuse for hackers to do as they like, coding away without planning or design (Rakitin, 2001). Another point of criticism is the reliance of agile methods on individual competency and craftsmanship, with participants working together in self-organizing teams even including members of other organizations, e.g., customers (Cockburn & Highsmith, 2001; Lindvall et al., 2002). It is argued that the number

 


 



 


 

MSc Real Estate Management

 


 

Brilliant outlook.

Solid planning.

Many people tend to choose their first degree based on personal interest; the next step is to focus on identifying a sustainable career. The management, valuation and development of real estate provide access to an interesting and rewarding business life.

Mode of study and duration

The MSc Real Estate Management is an accelerated, intensive programme that can be undertaken full-time in a year or part-time over two. You can progress from first degrees in a wide range of disciplines, ranging from mathematics to the history of art.

 


 

All academic staff involved in teaching have direct working experience in professional real estate, offering invaluable insight into the workings of the industry.

The programme concentrates on the acquisition of property and surveying skills and competencies. It provides authentic tasks and case studies, all set within the appropriate theoretical framework whilst introducing students to a range of professional practitioners. This approach is applied to a range of modules and a dissertation which must be passed for the award of MSc Real Estate Management. A key feature of the programme is the London study visit which extends the internationalisation of the student experience.

Brighter futures

The programme has been developed in consultation with a wide range of companies and organisations in various branches of the property and real estate industries. This allows us to ensure that graduates obtain precisely the skills and knowledge that employers want, leading to excellent career prospects.

Entry requirements & English language requirements

UK honours degree 2:2 (or equivalent); applicants with appropriate professional experience will be considered.

IELTS score of 6.0 (or equivalent) with no element below 5.5. INTO GCU offers a range of academic and English language pathway programmes for applicants; for more information visit www.gcu.ac.uk/into

 


* Individual modules can also be studied for professional development purposes.

Fees (including study visit to London)









Programme structure

Students complete six taught modules – three in trimester A and three in trimester B – plus an MSc professional project or dissertation after completing the taught modules.

Trimester A

Core modules: 



Trimester B

Core modules (including London study visit):

One elective module selected from: Real Estate Development, International Construction Project Finance, or Urban Regeneration.

 

Assessment methods

Taught modules are assessed by a combination of examinations (as part assessment of the two double modules), coursework, and online discussion boards. The MSc dissertation is assessed by the submission of a conference paper and a dissertation. The alternative professional project is assessed by the submission of a professional project report and a viva voce.

Student support

Study support is available to all students throughout the MSc programme from personal tutors and module leaders; structured mentoring and study support sessions are also provided, covering numerous topics such as academic writing skills, personal goal setting and pastoral care.

Module information

Core modules

Property Asset Management (double module): provides an intensive introduction to the management of urban property. It includes tenant selection and mix, the negotiation of property contracts, building condition and diagnostics, repairs and maintenance, statutory obligations, financial monitoring and reporting, comparative leasing practice plus aspects of property law.

Sustainable Design & Construction: provides students with knowledge and understanding of the principles of sustainable design and construction, introducing the main requirements, technologies, tools, methods, processes and systems required to develop sustainable buildings.

Property Valuation (double module): introduces students to a range of property valuation methods and procedures.

 

The effectiveness of traditional methods of valuation and the use of alternative approaches are assessed. Modern negotiation techniques are taught within the module. A coursework – on comparative market analysis between Glasgow and London – includes a week-long London study visit of presentations, company visits and tours of current property projects based at GCU’s London campus.

Elective modules (one from)

Real Estate Development: taught by an experienced property developer, this module introduces the development process, identification, assessment and assembly as well as analysing successful project delivery.

International Construction Project Finance: investigates the financial management processes from bidding to contract financing required for international construction projects.

Urban Regeneration: demonstrates and critically reviews the range of policies, principles, processes and techniques that contribute to urban regeneration in the UK.

Professional Project/Masters Dissertation: is the student’s in-depth investigation of a topic selected by them, in consultation with staff, for advanced study and investigation.

Visit www.gcu.ac.uk/postgraduate or contact a member of the programme team on +44 (0)141 331 8572 or email realestatemanagement@gcu.ac.uk

 




The stakes of ensuring originality of content have never been higher, both for authors and publishers. In the past year alone, public allegations of unchecked plagiarism have damaged the reputations of some of the most prominent scholarly publishers and scientific organizations. The National Science Foundation has launched an investigation of what may amount to nearly $100M in plagiarized grant proposals, and scientific journals report a tenfold increase in retractions over the past 20 years. Science writer Jonah Lehrer and renowned primatologist Jane Goodall both found themselves embroiled in plagiarism scandals related to forthcoming books, and German Chancellor Angela Merkel’s Education Minister resigned after her PhD was revoked upon discovery of plagiarism in her doctoral dissertation.

Ready access to content online makes it easier than ever to reuse copy, whether intentionally or accidentally. However, increasingly sophisticated search technologies—notably Google—along with plagiarism detection software, are leading to faster and more complete discovery of acts of duplication.

Plagiarism detection software is accessible, affordable and effective, yet many common misperceptions exist about its use and limitations. This paper will explore seven of the most prominent myths surrounding plagiarism detection and provide guidance on how best to use these tools to ensure the originality of an author's own work, or of work submitted for publication or grant application.

7 Misconceptions of Plagiarism Detection Software

Plagiarism Detection Software Automatically Detects Plagiarism

Plagiarism detection software, like search technology, is algorithm driven. With search technology, the algorithm is tuned to identify keywords; the intention is to surface keyword-matched content across crawled and indexed online content. With plagiarism detection software, the algorithm may be tuned to identify chains of linked words in a strict text-to-text match or it may be tuned to pick up more nuanced matches, as with paraphrasing.
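As a toy illustration of what a strict text-to-text match can look like, the sketch below compares word 5-gram "shingles" between two documents; this is a generic baseline for exposition, not any vendor's actual algorithm.

```python
def shingles(text, k=5):
    # Break a document into overlapping k-word chains ("shingles").
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(0, len(words) - k + 1))}

def match_fraction(doc_a, doc_b, k=5):
    # Fraction of doc_a's shingles that also appear verbatim in doc_b.
    a, b = shingles(doc_a, k), shingles(doc_b, k)
    return len(a & b) / len(a) if a else 0.0
```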

 

But identifying matched content is only part of the process of detecting plagiarism. The remainder requires a critical human element—a thorough interpretation of the scan’s results to determine whether the matched content constitutes an act of plagiarism or is simply a quote or excerpt. Search algorithms cannot parse intent, nor can they determine if matched content is properly cited. The combination of technology and human analysis yields the most thorough plagiarism assessment.

 

Plagiarism Detectors Are Inaccurate

Google tops the list of search engines for the breadth and speed with which it crawls and indexes content online. That said, content is growing at a pace that far exceeds Google's capabilities. In addition, content that is gated or behind a pay wall will not be accessible to the Google search bots.

Good plagiarism detection software not only provides access to crawled and indexed online content; it can also compare text to databases of gated journal and e-book content. The software may also compare submissions to a database of masters and doctoral theses as well as submitted paper content.

There is no way to be wholly comprehensive in indexing all content, but plagiarism detection software used in the proper context—for example, in the review of journal submissions or student papers—is highly successful in identifying matched content by virtue of the additional database content it can search against.

Plagiarism Detectors Are Easy to Deceive

Especially as more academic institutions and publishers begin mandating the use of plagiarism detection software, there are increasing attempts to devise methods of tricking the system into passing off duplicated text as original. Students have gone to great lengths to cheat the system—substituting Cyrillic letters that look similar to Latin letters throughout the text, adding extra spaces between letters and words, or even adding “invisible” small letters in white font to differentiate the text, often posting their strategies online for others to use. What is less publicized is that behind the scenes, the creators of plagiarism detection software go to even greater lengths to anticipate cheating strategies and to incorporate fixes into their systems. Many plagiarism detection software systems automatically strip macros from submitted manuscripts and can easily detect oddly placed punctuation and invisible characters.

Most attempts to outsmart plagiarism detection software require effort beyond what is required to properly cite or paraphrase source material. Software engineers estimate that in order for duplicated text to pass a plagiarism detection scan, the author would need to rewrite or revise every third word.

All Plagiarism Detection Tools Are The Same

The key to a thorough plagiarism check is the database the software uses for comparison. Many popular services perform text searches of anything available on the Internet that isn’t behind a pay wall. Others include a database of academic journals and archived papers. iThenticate compares submitted manuscripts to a database of 37 billion archived and current web pages, more than 92 million offline works from academic journals, magazines and research abstracts, and 37 million scholarly articles, books, and proceedings from tens of thousands of scientific and technical journals.

Additionally, many plagiarism detection services employ algorithms similar to those used in web search, providing only text-to-text matches. iThenticate utilizes a proprietary algorithm that transforms each submitted manuscript into a “digital fingerprint”, which is compared—just like a human fingerprint—to an extensive database where subtle shades of similarity can be detected.

Plagiarism Detection Tools Are Only For Academia

While those in academia were the earliest adopters of plagiarism detection software, the utility of this technology has moved well beyond universities. Today, the range of industries employing plagiarism detection technologies is wide, and use of the software to ensure originality is standard protocol within many organizations.

Government agencies and granting bodies use plagiarism detection software to check grant proposals, legal documents, and financial reports for duplication and misconduct. Medical research institutions screen manuscripts to avoid blatant plagiarism, duplication or self-plagiarism, grant misconduct and improper patient treatment. Scholarly journals and publishers rely on plagiarism detection tools to screen submitted work before publication to avoid copyright infringement and protect their reputations. Many publishers also suggest that submitting authors run their own scan prior to submission to catch accidental errors in citation or sourcing.

Plagiarism Detectors Are Only Useful for Uncovering Unethical Work

Plagiarism detection tools are designed to identify blatant instances of duplication but plagiarism comes in many forms, some of which are far more subtle and even accidental.

Good plagiarism detection software will locate not only exact matches in text, but can also identify poor paraphrasing, missing citations and even grammatical errors. Researchers submitting a grant proposal or paper for publication may be juggling hundreds of citations, and it is easy to understand how a reference may be unintentionally omitted. It is in these cases that a plagiarism detection scan can make a meaningful difference, protecting a researcher's reputation and ensuring original, properly cited work.

Scholarly journals are also using plagiarism software to detect duplicative publication, sometimes referred to as self-plagiarism. Self-plagiarism may occur when a researcher submits an article to a journal before learning that it was accepted elsewhere or when an author borrows heavily from their own previously published work.

While plagiarism detection tools can certainly detect the work of unethical writers, often they are used to detect accidental duplication and common writing mistakes.

Plagiarism Detection Software Is Time-Consuming

Editors at scholarly publications are facing record submission volumes, and researchers and academics experience tremendous pressure to publish. And while time is precious, most regular users of plagiarism detection software report that the additional time required for scanning—typically not more than a few minutes—is a worthwhile investment. A good plagiarism detection tool can even speed up the editing process by highlighting questionable passages so the author or editor can focus only on the most suspect parts of a work.

In a survey conducted by iThenticate, researchers reported the highest level of concern over plagiarism, with more than one in four reporting that plagiarism is a "very serious" problem in their field. The additional minutes spent ensuring that work is fully original prior to submission can allay a good deal of that concern. In that same survey, editors at scholarly publications reported routinely checking authors' submitted work for plagiarism—pre-checking gives authors added confidence through the submission process.

Conclusion

Of all of the ethical problems in academia and professional work, plagiarism is both one of the most serious and most common offenses. Plagiarism checkers can help spot plagiarism, both intentional and accidental.

Furthermore, with plagiarism detectors becoming more widely used, the likelihood of a document being checked at some point in the publishing process has increased significantly. Running a scan before submission or publication allows authors to remedy any errors without risking the reputation costs of plagiarism.

© 2013 Turnitin. All rights reserved. www.turnitin.com | www.iThenticate.com

 

ICIW 2012 : The Seventh International Conference on Internet and Web Applications and Services

 

Trust as an Integral Part for Success of Cloud

Computing

Felix Meixner, Ricardo Buettner

FOM Hochschule fuer Oekonomie & Management, University of Applied Sciences

Chair of Information Systems, Organizational Behavior and Human Resource Management

Arnulfstrasse 30, 80335 Munich, Germany

f.meixner@ieee.org, ricardo.buettner@fom.de

 

Abstract—Cloud computing has become a hot topic in research in the enterprise and consumer sectors. It is clear to everyone that the opportunities and applications of cloud computing are versatile and that cloud computing is an emerging computing paradigm. However, when decisions on adopting cloud computing-related solutions are made, trust and security are two of the most critical obstacles to the adoption and growth of cloud computing today. We think there are ways to largely eliminate the concerns of potential cloud users by taking advantage of numerous existing technological possibilities, including trust-building measures like standardization, cryptography, isolation and many more.

Keywords-cloud computing; security; identity-management; encryption; trust

I. INTRODUCTION

Cloud computing can be regarded as the most important evolution of the mid-1990s concept of grid computing [1]. In recent years, cloud computing has clearly become the trend to follow in the IT industry, providing flexible and scalable software, platform and infrastructure services on demand [2]. However, to fully leverage its potential for cost savings, cloud computing still has to overcome some major obstacles. As traditional network borders are breaking down at the same time as security threats are increasing, the most important concerns about cloud computing are issues of security and trust, which have so far been only partially solved.

A lot of literature about cloud computing, trust and security does exist, though most of it is IT-centric [3][4][5][6]. What is less examined and documented is the human perspective: the shortcomings of cloud computing, people's expectations and anxieties, as well as psychological aspects. This paper's objective is to focus on both perspectives, IT and human, and to try to narrow the gap between the two by offering a state-of-the-art overview of mechanisms that help secure the use of cloud computing and thereby create trust in cloud computing. The research question is: Can cloud computing gain enough trust from its users and customers to be even more successful and become an indispensable utility like the power grid?

 

Our approach to this subject included research on the history and state of cloud computing today, thereby identifying trust and security as the most critical factors of success for future growth and adoption. With these findings in mind, our research was refined to focus on trust and security in cloud computing and their supporting and control mechanisms. The research methodology included investigating several of the most relevant online scientific journal databases (Springer Link, JSTOR, ScienceDirect, Elsevier, IEEE Xplore Digital Library and ACM Digital Library).

The remainder of the paper is organized as follows: In Section II we recognize related work. Then, in Section III, the paper gives an insight into the history, different types and sources of trust in non-technological fields. These fields include trust in general and in its psychological and economic aspects. The paper outlines the difference between party trust and control trust and sets up a framework for trust that is carried over to Section IV, where the framework is mapped to cloud computing technology. The paper continues with Section V by describing various types of technology aiming to enhance users' and decision makers' trust in cloud computing. Finally, in Section VI, we draw the conclusion, provide recommendations for future work and show the need for optimizing existing trust infrastructure and mechanisms.

II. RELATED WORK

In his article “Cloud Computing”, Brian Hayes discusses the trend of moving software applications into the cloud and the related trust, privacy, security, and reliability challenges [7]. E. Pearson focuses on privacy challenges as important issues for cloud computing, both in terms of legal compliance and user trust, and says that privacy needs to be considered at every phase of design. He suggests key design principles for software engineers and argues that privacy must be considered when designing any aspect of cloud services, for both legal compliance and user acceptance [8]. The article “A View of Cloud Computing” defines classes of utility and cloud computing and creates a ranked list of critical obstacles to the adoption and growth of cloud computing. The list includes availability, data lock-in, and data confidentiality and auditability as the top three factors for adoption [9].

 


 


 

M. Mowbray and S. Pearson of HP Labs, in their paper “A Client-Based Privacy Manager for Cloud Computing”, state that processing sensitive user data in the cloud poses a significant barrier to the adoption of cloud services and that users fear data leakage and loss of privacy. Mowbray and Pearson describe a client-based privacy manager that helps reduce this risk, as well as providing additional privacy-related benefits, by reducing the amount of sensitive information sent to the cloud [10].

III. CONCEPTS, TYPES AND SOURCES OF TRUST

People have been aware of the concept of trust for a very long time. In fact, it is as old as the history of man and the existence of human social interactions [11]. The majority of the literature and studies about trust comes from classic disciplines like philosophy, psychology and economics, all of which concentrate on exploring a general understanding of trust. This paper focuses on trust in cloud computing by referring to these studies that explain classic forms of trust, also known as offline trust.

Philosophy traces the concept of trust back to the ancient Greeks. They believed that people trusted others only if they were confident that the others feared detection and punishment enough to deter them from harming or stealing.

Psychology focuses on interpersonal trust and agrees that it is an especially important concept in psychology and vital to personality development (Erikson, 1963) [12], cooperation (Deutsch, 1962) [13] and social life (Rotter, 1980) [14]. Rotter gave a frequently cited definition of interpersonal trust as “an expectancy held by individuals or groups that the word, promise, verbal, or written statement of another can be relied on” [14]. He has also proven through experiments that trust has positive consequences for people and society overall.

Economists study trust intensively in organizational contexts. Among other factors, it is considered a predictor of satisfaction in organizational decision-making. It has also been recognized that trust is able to reduce the cost of both intra- and inter-organizational transactions and to enhance business performance [15]. Trust, defined as “a willingness to rely on an exchange partner in whom one has confidence”, assumes an essential role in establishing and maintaining long-term relationships between sellers and customers [16].

It can already be stated that trust is a complex, subjective and abstract concept that is difficult to define. Many definitions of trust in the literature substitute it with credibility, reliability or confidence. The Oxford English Dictionary (1971) defines trust as “confidence in or reliance on some quality or attribute of a person or thing, or the truth of a statement”. Mainly, though, it is a mechanism that reduces social complexity on the one hand, but causes vulnerability towards something or somebody on the other.

In an article regarding e-commerce, Tan and Thoen considered party trust, control trust and the duality between trust and control as important concepts [17]. Party trust means trust in the other party. It is subjective and has both an action and an information perspective. Mayer et al. define it as “the willingness of a party to be vulnerable to the actions of another party based on the expectation that the other party will perform a particular action important to the truster, irrespective of the ability to monitor or control that other party [18].” Control trust means the trust that is created by a control mechanism. It tends to be more objective than party trust. If there is not enough party trust in a situation, an instance of control trust should be used to increase the overall level of trust. For example, getting a receipt at the dry cleaners stating how many pieces of clothing you handed in increases your trust that you will get all the pieces back later on.

Psychology was found to be one of the most important perspectives on trust, which is why it is helpful to have a framework of criteria for how trust is generally observed. Using this framework, it is then possible to draw comparisons between offline trust, in the sense described above, and online trust in the field of technology and cloud computing. According to the overview of Wang and Emurian [11], most researchers study four characteristics of trust:

1. Trustor and trustee

A trusting relationship always consists of a trusting party (trustor) and a party to be trusted (trustee). “The development of trust is based on the ability of the trustee to act in the best interest of the trustor and the degree of trust that the trustor places on the trustee” [11].

2. Vulnerability

The concept of trust only works and is needed in environments where vulnerability, uncertainty and risk are involved. A trustor relies on the trustee not to exploit his vulnerabilities.

3. Produced actions

“Trust leads to actions, mostly risk-taking behaviors. The form of the action depends on the situation, and the action may concern something either tangible or intangible [11].”

4. Subjective matter

In every case, trust is a subjective matter. Each individual regards trust differently, on a case-by-case basis, influenced by personal and situational factors.

IV. TRUST IN CLOUD COMPUTING TECHNOLOGY

As stated in the introduction, some of the major concerns in cloud computing are trust and security. Trust is one of the most critical obstacles to the adoption and growth of cloud computing. Therefore, in this section we will not only refer to the framework with the four characteristics of trust laid out in the preceding section, but go beyond it and include security as an object of study, since security and trust interact bilaterally.

1. Trustor and trustee

The cloud also relies heavily on the concept of trustor and trustee parties to establish trusting relationships. The difference is that with online trust, the distribution of roles is narrowed down to the cloud service provider being the trustee and the cloud service customer or end user being the trustor.

 


 

2. Vulnerability

The vulnerabilities enterprises face in cloud computing are innumerable. In the digital age of software bugs and ideological hacking groups such as “Anonymous” and “LulzSec”, the news is full of exploited vulnerabilities on the Internet. They range from inadvertent loss of privacy and data theft to loss of reputation and, consequently, money. Together, these reasons contribute to the necessity of trust in an insecure and hostile online world.

3. Produced actions

Customers’ trust in cloud service providers can generate a number of desired actions. An enterprise starts using a cloud service and shares its private and precious data with the cloud computing provider. On top of that, an enterprise might be confident enough to pay for the cloud service and continue using it on a regular basis.

4. Subjective matter

Trust in cloud computing and technology is fundamentally as subjective as its offline counterpart. Again, each individual and enterprise has different affinities and preferences regarding technology that influence the level of trust towards cloud computing.

Meanwhile, further frameworks regarding trust in cloud computing exist. For example, a recent study from the University of Adelaide showed how to determine the credibility of trust feedback. In their paper “Trust as a Service: A Framework for Trust Management in Cloud Environments”, the authors implement the Trust as a Service (TaaS) framework to improve trust management in cloud environments [19].

V. CREATING SYSTEMIC TRUST THROUGH IT TECHNOLOGY

In a World Wide Web and in clouds of anonymity, personal trust is very hard to find. Therefore, cloud computing has to earn the trust of enterprises, decision makers and users by relying on other forms of trust. Fortunately, there are many methods to create systemic trust by means of control mechanisms and with the help of modern virtualization and security technology.

The next sections follow and expand a proposal by Zhao, Rong, Jaatun and Sandnes for a reference deployment model to eliminate user concerns about cloud security [20]. The model deals with security-related issues in cloud computing and proposes five service deployment models to address them, providing different security-related features for different requirements and scenarios. While some scenarios of the deployment model have multiple valid solutions at hand, others have not yet been entirely solved. We use the model as a basis and expand it with similar, but more practical, solutions towards a trusted and secure enterprise cloud:

A. Separation, Isolation and Multi-Tenancy

B. Availability and Reliability

C. Data and Service Migration 

 

D. Cryptography

E. Contractually Fixed Agreements

F. Certifications, Standards Compliance and IT Service Quality

G. Transparency

A. Separation, Isolation and Multi-Tenancy

Some central mechanisms of increasing importance are identity management and access control. They fall into the category of separation, isolation and multi-tenancy. In contrast to applications and services hosted in-house, proper access management is a must-have. As soon as enterprises decide to use more than one cloud computing service, the challenge grows quickly, due to a couple of issues. Users have to deal with an inflation of credentials, increasing the risk of simple and re-used passwords across multiple services. Responsible IT managers cannot oversee the access rights of employees or users that are spread across multiple cloud service providers. This leads to difficulties in access control management, especially if changes in responsibilities or personnel take place, or an employee resigns. Such decentralized identity management also makes central logging of access much more difficult.

A solution to this issue is to recentralize identity management and access control back into the enterprise by means of single-credential and single-sign-on (SSO) solutions. A single-credential solution uses a master identity store, either replicated to the cloud or queried by the cloud service provider, for example via the Lightweight Directory Access Protocol (LDAP). An SSO solution builds on the single-credential solution and requests authentication from the user only once, at the first login. Subsequent authentications to cloud services are automated via asymmetric encryption mechanisms such as a Public Key Infrastructure (PKI) using the trust model of certificate authorities (CAs). These underlying mechanisms are transparent to the user. Both solutions require effective protection of the central identity store, as a theft of those credentials provides potential access to all cloud services granting access based on single-credential or SSO solutions [21].
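To make the single-credential idea concrete, here is a minimal sketch (our own, not from [21]) of a cloud service validating a login against the enterprise's central directory over LDAP, using the Python ldap3 library; the host name and directory layout are hypothetical placeholders.

# Sketch: a cloud service checks a login against the enterprise's
# central identity store via LDAP instead of keeping its own
# credential database. Host and DN structure are hypothetical.
from ldap3 import Server, Connection, ALL

def authenticate(username: str, password: str) -> bool:
    server = Server("ldaps://directory.example-enterprise.com", get_info=ALL)
    user_dn = f"uid={username},ou=people,dc=example-enterprise,dc=com"
    try:
        # A successful simple bind proves the credentials are valid.
        conn = Connection(server, user=user_dn, password=password)
        return conn.bind()
    except Exception:
        return False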

In their article “Isolation in Cloud Computing and Privacy-Enhancing Technologies”, N. Sonehara, I. Echizen and S. Wohlgemuth discuss the common issues around data leakage and loss of privacy [22]. They see isolation as a special kind of privacy protection mechanism, which avoids information exchange between cloud services through their users. Furthermore, isolation should be able to hide the objectives of cloud users from the cloud service provider. They agree with Armbrust et al. (2010) [9] that the most common security mechanism in today’s clouds for reaching the goal of isolation is primarily virtualization. Armbrust et al. state: “It is a powerful defense, and protects against most attempts by users to attack one another or the underlying cloud infrastructure. However, not all resources are virtualized and not all virtualization environments are bug-free. ... Incorrect network virtualization may allow user code access to sensitive portions of the provider’s infrastructure, or to the resources of other users. These challenges, though, are similar to those involved in managing large non-cloud data centers, where different applications need to be protected from one another. Any large Internet service will need to ensure that a single security hole doesn’t compromise everything else [9].” Due to such flaws in technology, it is important not to rely on a single mechanism to provide trust and security, but to interlink and connect it with other mechanisms, as explained in the following sections.

B. Availability and Reliability

Availability and reliability are among cloud computing’s key requirements for information security. Data centers and cloud services should also be designed for scalability and performance, and should limit the necessity of human interaction [23]. Nonetheless, we have seen a number of complete datacenter outages in the recent past, including at market leaders such as Amazon and Google. Undheim, Chilwan and Heegaard focus on four different types of failures, namely failures in power distribution or cooling, network failures, management software failures and server failures [24]. For all types of potential failures there are mechanisms in place that help reduce availability and reliability risks to a minimum. We picked two of the four types of failures and investigated related work:

Regarding network failures, Gill, Jain and Nagappan present a large-scale analysis of failures in a data center network [25]. Their key observations state that data center networks are already reliable, especially because of their highly redundant design. Nevertheless, there is room for improvement in some areas. They state that load balancer reliability and the effectiveness of network redundancy have to be improved to mask the impact of network failures from applications. Further they recommend separating the network control plane from the data plane to avoid undesirable interference between application and control traffic.

Vishwanath and Nagappan study server failures, hardware repairs and reliability for large cloud computing datacenters and present a detailed analysis of failure characteristics, as well as a preliminary analysis of failure predictors. They state that “8% of all servers can expect to see at least 1 hardware incident in a given year and that this number is higher for machines with lots of hard disks. ... Chances of seeing another failure on the same server is high. We find that the distribution of successive failure on a machine fits an inverse curve. ... We also find that the location of the datacenter and the manufacturer are the strongest indicators of failures, as opposed to age, configuration etc. [26].” In ongoing work they are developing models for server reliability, including replacing hard disk drives (HDDs) with solid state drives (SSDs) for better reliability.

Now that we have given an insight into various types of failures, we want to show a conceptual and simple solution design to circumvent the types of failures that jeopardize availability and reliability of cloud services. The reference deployment model of Zhao, Rong, Jaatun and Sandnes [20] corresponds to the central point on Armbrust’s [9] top-ten list of obstacles to the growth of cloud computing, namely availability and business continuity. Their solution is to use multiple cloud service providers, as described in their reference deployment model. The model builds an availability model on top of the (at best already redundantly designed) cloud infrastructure, adding an extra layer of redundancy of its own. The model achieves this by meeting the following three requirements:

Use two independent cloud service providers offering equivalent data processing services and two independent cloud service providers offering equivalent data storage services.

Data replication between both data storage providers is bidirectional and transparent to the user.

Both data processing services must have access to both data storage services, provided authorization is granted.

“The Availability Model imposes redundancy on both data processing and cloud storage, hence there is no single point of failure with respect to data access. When a data processing service, or a cloud storage service experiences failure, there is always a backup service present to ensure the availability of the data [20].”
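As an illustration of the redundancy this model imposes, the following sketch (our own, not from [20]) replicates every write to two independent storage providers and falls back between them on reads; the StorageClient interface is a hypothetical abstraction over real provider SDKs.

# Sketch of the availability model's core idea: redundant writes to two
# independent cloud storage providers, with read failover. StorageClient
# is a hypothetical abstraction over any provider SDK.
from typing import Protocol

class StorageClient(Protocol):
    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...

class RedundantStorage:
    def __init__(self, primary: StorageClient, secondary: StorageClient):
        self.providers = [primary, secondary]

    def put(self, key: str, data: bytes) -> None:
        # Replicate to both providers: no single point of failure.
        for provider in self.providers:
            provider.put(key, data)

    def get(self, key: str) -> bytes:
        errors = []
        for provider in self.providers:
            try:
                return provider.get(key)  # first healthy provider wins
            except Exception as exc:
                errors.append(exc)
        raise RuntimeError(f"all storage providers failed: {errors}")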

All of the above clearly shows that availability and reliability can be established in multiple, redundant ways and are therefore able to contribute to establishing trust in cloud services.

C. Data and Service Migration

Another concern of cloud users is a potential lack of long-term service viability and, as a result, the inability to get data, once placed there, out of the cloud, due to data lock-in with one cloud service provider. In this scenario users would be forced to stay with their cloud service provider, who might charge premium prices and thus discourage potential customers from using the cloud service at all. They would only use it if they really had to, or if they were assured that their data could freely be migrated to other cloud service providers.

Hao, Yen and Thuraisingham consider the problem of service selection and migration in a cloud and developed a framework that simplifies service migration. It also includes a cost model and a genetic decision algorithm to discuss the tradeoffs involved and find optimal service migration decisions. In their opinion, the important issues are: “It is necessary to consider the infrastructure support in the cloud to achieve service migration. The computation resources (computer platforms) in the cloud need to be able to support execution of dynamically migrated services. We develop a virtual machine environment and corresponding infrastructure to provide such support. ... It is also essential to have a strong decision support to help determine whether to migrate some services and where to place them. The consideration involves the service migration cost, consistency maintenance cost, and the communication cost gains due to migration. We develop a cost model to correctly capture these costs and help determine the tradeoffs in service selection and migration in clouds. Then, we use a genetic algorithm to search the decision space and make service selection and migration decisions based on the cost tradeoffs... [27].”

 


 

With their reference deployment model Zhao, Rong, Jaatun and Sandnes go a bit further by stating: “a model that can ensure the capability of migrating data from one cloud to another is imperative... [20].” They demonstrate an abstract model where “the migration of data is guaranteed”. The model utilizes a data processing service through which users process their data and that is capable of migrating data from one cloud storage service to another. The model achieves this by meeting the following three requirements:

There is a Cloud Data Migration Service that can interact with the Cloud Storage Service holding users’ data in order to export it (a sketch of such a migration service follows this list).

There is a second Cloud Storage Service that allows users to import and export data.

Two independent cloud providers should provide the two Cloud Storage Services.
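A minimal sketch of such a Cloud Data Migration Service might look as follows; the CloudStorage interface and its methods are hypothetical stand-ins for real provider SDKs, not an API from [20].

# Sketch of a Cloud Data Migration Service as outlined above: stream
# every object out of the source cloud storage service and import it
# into an independent target service. The interface is hypothetical.
from typing import Iterable, Protocol

class CloudStorage(Protocol):
    def list_keys(self) -> Iterable[str]: ...
    def get(self, key: str) -> bytes: ...
    def put(self, key: str, data: bytes) -> None: ...

def migrate(source: CloudStorage, target: CloudStorage) -> int:
    """Copy all objects from source to target; return the number copied."""
    count = 0
    for key in source.list_keys():
        target.put(key, source.get(key))  # export from one cloud, import into the other
        count += 1
    return count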

Hirofuchi, Ogawa, Nakada, Itoh and Sekiguchi fulfill this migration model and believe “the next stage for IaaS cloud technology is cloud federation ... users can easily deploy their applications on any IaaS cloud providers in the same manner, and transparently relocate them to other providers on demand [28].” They back up their proposal with an “advanced storage access mechanism that strongly supports live VM migration over WAN. It rapidly relocates VM disks between source and destination sites with the minimum impact on I/O performance. It is implemented as a transparent proxy server for a storage I/O protocol ... which can be integrated into SAN services in datacenters. This means that the proposed mechanism is independent of VMM implementations [28].” This counters the risk of data lock-in with a particular provider, while still enabling users to select the most appropriate provider at any time with the framework of Hao, Yen and Thuraisingham.

The solutions and proposals in [20][27][28] correspond to the second central point on Armbrust’s [9] top-ten list of obstacles to the growth of cloud computing, namely data lock-in. Armbrust argues that standardization of APIs and compatible software would enable a surge of hybrid cloud computing. Offering different cloud service selection and migration models, as well as standards, can thus be used to increase trust in cloud computing.

D. Cryptography

One common way to preserve key requirements such as confidentiality and integrity in computing is to encrypt data before, during and even after transport through the Internet for secure storage. As the cloud service provider has access to the data of all its customers, and may offer it, inadvertently or deliberately, to third parties, there is an urgent need for data encryption. One way to take this measure is by using combinations of encryption mechanisms. The trust-building technique underlying them is pre-egression or pre-internet encryption (PIE). This simply means encrypting data with your own encryption keys before sending it to the cloud. The encryption keys are in the possession of the data owner only and are unknown to the cloud service provider or any third party. After the data is encrypted locally, it leaves the local premises and transits through the Wide Area Network (WAN).

 

The cloud service provider should not only offer a tunneled and encrypted transit through the network to the storage destination in the cloud, but also encrypted storage of the data. However, since the cloud service provider knows the encryption keys to those tunnels and that storage, the only secure method of processing data is the aforementioned PIE.
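A minimal sketch of PIE, assuming the symmetric Fernet recipe from the Python cryptography package; upload_to_cloud is a hypothetical placeholder for a provider SDK call.

# Pre-internet encryption (PIE) sketch: encrypt locally with a key that
# never leaves the premises, then hand only ciphertext to the cloud.
from cryptography.fernet import Fernet

def upload_to_cloud(blob: bytes) -> None:
    ...  # hypothetical provider SDK call

local_key = Fernet.generate_key()  # generated once, kept on-premises only
cipher = Fernet(local_key)

plaintext = b"sensitive enterprise data"
ciphertext = cipher.encrypt(plaintext)  # happens before any WAN transit
upload_to_cloud(ciphertext)             # the provider sees ciphertext only

# Later, after downloading, only the key holder can recover the data.
assert cipher.decrypt(ciphertext) == plaintext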

Pushing the idea of end-to-end encrypted data even further is the concept of homomorphic encryption. It can be used to conduct mathematical operations on encrypted data without decrypting it [29]. The major and still unsolved downsides to this approach are the immense computing power needed to process the encrypted data and the limited support for computing operations, which is why this concept is almost unheard of in the public discussion about cloud trust and cloud security.
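As a toy illustration only (not the scheme referenced in [29]), textbook RSA with tiny primes already exhibits a multiplicative homomorphism: the product of two ciphertexts decrypts to the product of the plaintexts, so a server could combine values it cannot read; fully homomorphic schemes generalize this to arbitrary operations at the cost described above.

# Toy illustration of a homomorphic property using textbook RSA with
# tiny primes (completely insecure; for intuition only).
p, q = 61, 53
n = p * q        # 3233
e, d = 17, 2753  # e*d = 1 mod lcm(p-1, q-1)

def encrypt(m: int) -> int:
    return pow(m, e, n)

def decrypt(c: int) -> int:
    return pow(c, d, n)

c1, c2 = encrypt(7), encrypt(9)
c_prod = (c1 * c2) % n           # computed on ciphertexts only
assert decrypt(c_prod) == 7 * 9  # the server never saw 7 or 9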

E. Contractually Fixed Agreements

As stated earlier, trust can be established through control mechanisms. One example of such control mechanisms is Security Service Level Agreements (SSLAs), sometimes also referred to as Protection Level Agreements (PLAs). They include contractually fixed security restrictions and compliance checks, as well as security information and event management (SIEM). They can be compared to the general terms and conditions a company bases its contracts on, or to an acceptable use policy (AUP), and are the only legal obligation of the cloud service provider. However, as of today, beyond technical standardization there are no publicly defined standards yet in the fields of information rights management, secure virtual runtime environments and externalization of identities [30][31].

F. Certifications, Standards Compliance and IT Service Quality

Online trust needs a solid and justified foundation to build upon. There are a number of trust-building measures in the field of standards compliance and certifications, three of which we find particularly appealing.

The first trust-building measure that should help choose the right cloud service provider is certification. Looking at geographical boundaries, there is the Cloud Security Alliance (CSA) in the US and the Federal Office for Information Security (BSI) in Germany. Both support an initiative called EuroCloud Star Audit, which provides a seal of quality for Software-as-a-Service (SaaS), one of the three subdomains of cloud computing. It focuses on topics like data security, data privacy, drafting of contracts and compliance on the one hand; on the other hand, topics such as professional IT management, transparent and comprehensible processes, encryption, backup, archiving, exit strategy, service level agreements and performance have top priority. By means of a scoring system, cloud service providers are rated with one to five stars, expressing the degree of fulfillment of the aforementioned criteria and therefore their trustworthiness. In the near future, EuroCloud Star Audit will be expanded to the other two subdomains of cloud computing, namely Platform-as-a-Service (PaaS) and Infrastructure-as-a-Service (IaaS), to enable a more complete rating of cloud service providers [32].

 


 

The second trust-building measure that should help choose the right cloud service provider is standards compliance. The International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) created a series of information security standards, the 27000 series. It provides best-practice recommendations on information security management, risks and controls within the context of an overall Information Security Management System (ISMS). The series is applicable to organizations of all types and sizes and, most importantly, to cloud service providers. Among other topics it covers privacy, confidentiality and IT or technical security issues. The series includes ISO/IEC 27001, a standard that specifies requirements for establishing, implementing, operating, monitoring, reviewing, maintaining and improving a documented Information Security Management System within the context of the organization’s overall business risks. It specifies requirements for the implementation of security controls customized to the needs of individual organizations or parts thereof, and is designed to ensure the selection of adequate and proportionate security controls that protect information assets and give confidence to interested parties. The succeeding standards ISO/IEC 27003, 27004, 27005 and 27006 all refer to the requirements defined in 27001. ISO/IEC 27003 focuses on the critical aspects needed for successful design and implementation of an ISMS. ISO/IEC 27004 provides guidance on the development and use of measures and measurement to assess the effectiveness of an implemented ISMS. ISO/IEC 27005 specifies guidelines for information security risk management, and ISO/IEC 27006 specifies requirements and guidance for bodies providing audit and certification of an ISMS, primarily to support the accreditation of certification bodies [33]. By implementing an ISO/IEC 27001 information security management system, an organization adopts a comprehensive and systematic approach to the security of its process control systems and can be formally audited and certified compliant with the standard.

The third trust-building measure that should help choose the right cloud service provider is IT service quality as defined in the IT Infrastructure Library (ITIL) framework. ITIL is independent of manufacturers and describes systematic procedures for the strategic development, design, introduction, transition, operation and improvement of IT services. It closely follows ISO/IEC 20000, which provides a formal and universal standard for organizations seeking to have their service management capabilities audited and certified. ITIL version 3, published in June 2007, consists of five books: Service Strategy, Service Design, Service Transition, Service Operation and Continual Service Improvement. Cloud service providers that have aligned their services to the ITIL framework can increase their trustworthiness mainly, but not only, because of three ITIL building blocks:

Information Security Management (ISM)

Availability Management

Access Management

ISM ensures most of the information security key concepts: confidentiality, integrity and availability of an organization’s assets, information, data and IT services. Information security is aligned with business security, and ISM ensures that information security is effectively managed in all service management processes and activities. The ISM process should be a focal point for all IT security issues and should increase awareness of the need for security within all IT services. A main task of ISM is to produce, maintain and enforce the information security policy.

Availability Management focuses on and manages all availability-related issues and is responsible for defining, analyzing, planning, measuring and improving all aspects of the availability of IT services. It ensures that the IT infrastructure, processes, tools, roles, etc. are appropriate for the agreed service level targets for availability. This process thus ensures that the level of availability delivered in all services matches or exceeds the current and future agreed needs of the customers in a cost-effective manner. Availability Management is important because availability and reliability are highly visible to customers and can directly influence customer satisfaction and the service provider’s reputation.

Access Management deals with protecting the confidentiality, integrity and availability of the organization’s data and intellectual property. It achieves this by ensuring that only authorized users are able to access or modify the service assets. It grants users the right to use a service or group of services, while preventing access by non-authorized users. It may also be needed for regulatory compliance reasons. Technologically, Access Management is usually executed by means of directory services [14][34].

All three suggested trust-building measures have one thing in common: through examination by a trusted third party, they prove that the cloud service provider operates with the care and accuracy required by the presented certifications, standards and frameworks, and that it complies with them. The willingness of the provider to do so creates transparency for cloud users and a chance to make a well-informed decision.

G. Transparency

As discussed, trust is always a subjective matter, which gives transparency requirements for trust a soft and elastic touch. Transparency has multiple facets, though. Trust through transparency can be induced by very simple means, such as a web interface design, or by more sophisticated means, such as a conglomeration of technological factors.

In [11], a framework of four trust-inducing features is proposed by taking existing relevant studies on enhancing online trust through web interface design and using them as dimensions of the framework. The four dimensions are graphic design, structure design, content design and social-cue design. Graphic design refers to the graphical design factors on the web site that normally give consumers a first impression. Structure design defines the overall organization and accessibility of displayed information on the web site. Content design refers to the informational components that can be included on the web site, either textual or graphical. Social-cue design relates to embedding social cues, such as face-to-face interaction and social presence, into the web interface via different communication media.

Compared to a trust-inducing web interface design, transparency as an add-on to technological security mechanisms has much clearer and more precise requirements. Contradicting the often-used principle of security by obscurity, T. Weichert demands security by transparency [31]. He sets up multiple factors for reaching this goal:

State of the art measures

Access restricted to entitled users

Differentiated access management

Encryption capabilities

Anonymization tools

Adequate separation of data by isolation

Client-side application security

Documented data privacy management

His statement is simple: the more of these factors are in place, the higher the transparency, and therefore the security, for cloud service customers.

VI. CONCLUSION AND FUTURE WORK

Cloud computing services will continue to grow, regardless of whether a cloud service provider sells services at a low level of abstraction as IaaS, at the medium level as PaaS or at the top level as SaaS. Trust and security go hand in hand; one might even go as far as saying that one induces the other.

This paper presented a state-of-the-art overview of the role of trust in cloud computing. By explaining offline trust and mapping it to online trust, we showed that the concept of trust also exists, and even plays a vital role, in the online world. Trust and security are an integral part of cloud computing and essential for its adoption and growth.

Our main contribution is showing multiple ways to improve online trust and security by leveraging and combining as many existing technologies and trust-building measures as possible, thereby minimizing the concerns of potential or existing cloud service users. In our opinion, the bottom line of this state-of-the-art overview is that trust in cloud computing can indeed be improved by means of technology.

A. Limitations

The paper provided several existing approaches to the issue of insufficient trust and security in cloud computing. However, several limitations have to be acknowledged. The paper did not examine infrastructure issues such as data transfer bottlenecks and performance unpredictability. Computing, storage and networking must all focus on the horizontal scalability of virtualized resources rather than on single-node performance. Infrastructure in all areas has to be improved, not only with respect to trust and security, but also with respect to bandwidth and cost. Furthermore, the paper only highlighted a fraction of the available security- and trust-enhancing mechanisms, namely those we found most important. There are a large number of other efficient mechanisms and standards, and an even larger number under investigation in research and development.

B. Future Research and Recommendations

This paper’s examples contribute to the ongoing effort of minimizing the challenges regarding trust and security in cloud computing. What remains is the issue that users have to trust the presented technology, certifications, standards and, finally, the cloud service provider itself.

Even though trust by definition remains the willingness of a party to be vulnerable to the actions of another party, many unsolved technical issues still exist and many solutions can be improved in order to reduce this inevitable residual risk.

Future research on this topic should include the simplification of cloud security models, for example by standardizing and leveraging protocols such as the Open Authorization protocol (OAuth) and the Security Assertion Markup Language (SAML). With the vision of inter-cloud computing in mind, which introduces an additional management layer above conventional cloud computing systems [35] to reach greater sustainability and availability, large IT companies have to work together more intensively in task forces, alliances and foundations to push towards this common goal.

ACKNOWLEDGMENT

The authors gratefully acknowledge support from grant 17103X10 from the German Federal Ministry of Education and Research.

REFERENCES

[1] C. Weinhardt, A. Anandasivam, B. Blau, N. Borissov, T. Meinl, W. Michalk, and J. Stößer, “Cloud computing – A classification, business models, and research directions,” Business & Information Systems Engineering, 5, pp. 391–399, 2009.

[2] C. Baun, M. Kunze, T. Kurze, and V. Mauch, “Private Cloud-Infrastrukturen und Cloud-Plattformen,” Informatik Spektrum, vol. 34, no. 3, pp. 242–254, 2011.

[3] Cloud Security Alliance, “Security guidance for critical areas of focus in Cloud Computing,” 2009. [Online]. Available: https://cloudsecurityalliance.org/wp-content/themes/csa/guidance-download-box.php [retrieved: April, 2012]

[4] A. Weiss, “Computing in the clouds,” networker, vol. 11, no. 4, pp. 16– 25, 2007.

[5] F. Kamoun, “Virtualizing the datacenter without compromising server performance,” Ubiquity, vol. 2009, no. August, p. 2, 2009.

[6] A. Greenberg, J. Hamilton, D. A. Maltz, and P. Patel, “The cost of a cloud: research problems in data center networks,” ACM SIGCOMM Comput. Commun. Rev., vol. 39, no. 1, p. 68, 2009.

[7] B. Hayes, “Cloud Computing,” Comm. ACM, vol. 51, no. 7, p. 9, 2008.

 


 

[8] S. Pearson, “Taking account of privacy when designing cloud computing services,” in Proceedings of the 2009 ICSE Workshop on Software Engineering Challenges of Cloud Computing (CLOUD ’09), pp. 44–52.

[9] M. Armbrust, I. Stoica, M. Zaharia, A. Fox, R. Griffith, A. D. Joseph, et al., “A view of cloud computing,” Comm. ACM, vol. 53, no. 4, p. 50, 2010.

[10] J. Bosch, S. Clarke, M. Mowbray, and S. Pearson, “A client-based privacy manager for cloud computing,” in COMSWARE '09 Proceedings of the Fourth International ICST Conference on COMmunication System softWAre and middlewaRE, p. 1, 2009.

[11] Y. D. Wang and H.H. Emurian, “An overview of online trust: Concepts, elements, and implications,” Computers in Human Behavior, vol. 21, no. 1, pp. 105–125, 2005.

[12] E. H. Erikson, “Childhood and society” (2nd ed.), New York: W.W. Norton, 1963.

[13] M. Deutsch, “Cooperation and trust: Some theoretical notes,” Nebraska Symposium on Motivation, 10, pp. 275–318, 1962.

[14] J. B. Rotter, “A new scale for the measurement of interpersonal trust,” J of Personality, vol. 35, no. 4, pp. 651–665, 1967.

[15] B. Uzzi, “Social structure and competition in interfirm networks: The paradox of embeddedness,” Administrative Science Quarterly, vol. 42, no. 1, pp. 35–67, 1997.

[16] C. Moorman, R. Deshpande, and G. Zaltman, “Factors affecting trust in market research relationships,” J of Marketing, vol. 57, no. 1, pp. 81– 101, 1993.

[17] Y. Tan and W. Thoen, “Toward a generic model of trust for electronic commerce,” International J of Electronic Commerce, vol. 5, no. 2 (Winter, 2000/2001), pp. 61-74, 2001.

[18] R. C. Mayer, J. H. Davis, and F. D. Schoorman, “An integrative model of organizational trust,” Academy of Management Review, vol. 20, no. 3, pp. 709–734, 1995.

[19] T. H. Noor and Q. Z. Sheng, “Trust as a service: A framework for trust management in cloud environments,” pp. 314–321, 2011.

[20] G. Zhao, C. Rong, M. G. Jaatun, and F. E. Sandnes, “Reference deployment models for eliminating user concerns on cloud security,” J of Supercomputing, 2010.

[21] P. Laue and O. Stiemerling, “Identitäts- und Zugriffsmanagement für Cloud Computing Anwendungen,” Datenschutz und Datensicherheit, vol. 34, no. 10, pp. 692–697, 2010.

[22] N. Sonehara, I. Echizen, and S. Wohlgemuth, “Isolation in cloud computing and privacy-enhancing technologies,” Business & Information Systems Engineering, vol. 3, no. 3, pp. 155–162, 2011.

 

[23] E. Nygren, R. K. Sitaraman, and J. Sun, “The Akamai network; a platform for high-performance internet applications,” SIGOPS Oper. Syst., vol. 44, no. 3, p. 2, 2010.

[24] A. Undheim, A. Chilwan, and P. Heegaard, “Differentiated availability in cloud computing SLAs,” in Proceedings of the 2011 IEEE/ACM 12th International Conference on Grid Computing (GRID ’11), pp. 129–136.

[25] P. Gill, N. Jain, and N. Nagappan, “Understanding network failures in data centers,” SIGCOMM Comput. Commun. Rev., vol. 41, no. 4, pp. 350-361, 2011.

[26] K. V. Vishwanath and N. Nagappan, “Characterizing cloud computing hardware reliability,” in Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC ’10), pp. 193–204, 2010.

[27] W. Hao, I. Yen, and B. Thuraisingham, “Dynamic service and data migration in the clouds,” in Computer Software and Applications Conference, COMPSAC '09. 33rd Annual IEEE International, pp. 134– 139, 2009.

[28] T. Hirofuchi, H. Ogawa, H. Nakada, S. Itoh, and S. Sekiguchi, “A live storage migration mechanism over WAN for relocatable virtual machine services on clouds,” in Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID '09), pp. 460–465.

[29] F. Kerschbaum, “Secure and sustainable benchmarking in clouds,” Business & Information Systems Engineering, vol. 3, no. 3, pp. 135–143, 2011.

[30] S: Paulus, “Standards für trusted clouds,” Datenschutz und Datensicherheit, vol. 35, no. 5, pp. 317–321, 2011.

[31] T. Weichert, “Cloud Computing und Datenschutz,” Datenschutz und Datensicherheit, vol. 34, no. 10, pp. 679–687, 2010.

[32] R. Giebichenstein and A. Weiss, “Zertifizierte Cloud durch das EuroCloud Star Audit SaaS,” Datenschutz und Datensicherheit, vol. 35, no. 5, pp. 338–342, 2011.

[33] International Organization for Standardization at http://www.iso.org [retrieved: April, 2012]

[34] Materna Information & Communications, (2012), ITIL Version 3 Pocket Guide [Online]. Available: http://www.materna.de/cae/servlet/contentblob/11600/publicationFile/2465/Pocketbrosch%C3%BCre ITIL%C2%AE Version 3.pdf [retrieved: April, 2012]

[35] T. Aoyama and H. Sakai, “Inter-cloud computing,” Business & Information Systems Engineering, vol. 3, no. 3, pp. 173–177, 2011.

 


 

Generating Text with Recurrent Neural Networks

Ilya Sutskever ILYA@CS.UTORONTO.CA

James Martens JMARTENS@CS.TORONTO.EDU

Geoffrey Hinton HINTON@CS.TORONTO.EDU

University of Toronto, 6 King’s College Rd., Toronto, ON M5S 3G4 CANADA

 

Abstract

Recurrent Neural Networks (RNNs) are very powerful sequence models that do not enjoy widespread use because it is extremely difficult to train them properly. Fortunately, recent advances in Hessian-free optimization have been able to overcome the difficulties associated with training RNNs, making it possible to apply them successfully to challenging sequence problems. In this paper we demonstrate the power of RNNs trained with the new Hessian-Free optimizer (HF) by applying them to character-level language modeling tasks. The standard RNN architecture, while effective, is not ideally suited for such tasks, so we introduce a new RNN variant that uses multiplicative (or “gated”) connections which allow the current input character to determine the transition matrix from one hidden state vector to the next. After training the multiplicative RNN with the HF optimizer for five days on 8 high-end Graphics Processing Units, we were able to surpass the performance of the best previous single method for character-level language modeling – a hierarchical nonparametric sequence model. To our knowledge this represents the largest recurrent neural network application to date.

1. Introduction

Recurrent Neural Networks (RNNs) form an expressive model family for sequence tasks. They are powerful because they have a high-dimensional hidden state with nonlinear dynamics that enable them to remember and process past information. Furthermore, the gradients of the RNN are cheap to compute with backpropagation through time. Despite their attractive qualities, RNNs failed to become a mainstream tool in machine learning due to the difficulty of training them effectively. The cause of this difficulty is the very unstable relationship between the parameters and the dynamics of the hidden states, which manifests itself in the “vanishing/exploding gradients problem” (Bengio et al., 1994). As a result, there has been surprisingly little research on standard RNNs in the last 20 years, and only a few successful applications using large RNNs (Robinson, 2002; Pollastri et al., 2002), including a recent notable application of RNNs as a word-level language model (Mikolov et al., 2010).

Appearing in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011. Copyright 2011 by the author(s)/owner(s).

Recently, Martens (2010) developed a greatly improved variant of Hessian-Free optimization (HF) which was powerful enough to train very deep neural networks from random initializations. Since an RNN can be viewed as an extremely deep neural network with weight sharing across time, the same HF optimizer should be able to train RNNs. Fortunately, Martens & Sutskever (2011) were able to show that this is indeed the case, and that this form of non-diagonal, 2nd-order optimization provides a principled solution to the vanishing gradients problem in RNNs. Moreover, with the addition of a novel damping mechanism, Martens & Sutskever (2011) showed that the HF optimizer is robust enough to train RNNs, both on pathological synthetic datasets known to be impossible to learn with gradient descent, and on complex and diverse real-world sequence datasets.

The goal of the paper is to demonstrate the power of large RNNs trained with the new Hessian-Free optimizer by applying them to the task of predicting the next character in a stream of text. This is an important problem because a better character-level language model could improve compression of text files (Rissanen & Langdon, 1979) and make it easier for people with physical disabilities to interact with computers (Ward et al., 2000). More speculatively, achieving the asymptotic limit in text compression requires an understanding that is “equivalent to intelligence” (Hutter, 2006). Good compression can be achieved by exploiting simple regularities such as the vocabulary and the syntax of the relevant languages and the shallow associations exemplified by the fact that the word “milk” often occurs soon after the word “cow”, but beyond a certain point any improvement in performance must result from a deeper understanding of the text’s meaning.

Although standard RNNs are very expressive, we found that achieving competitive results on character-level language modeling required the development of a different type of RNN that was better suited to our application. This new “MRNN” architecture uses multiplicative connections to allow the current input character to determine the hidden-to-hidden weight matrix. We trained MRNNs on over a hundred megabytes of text for several days, using 8 Graphics Processing Units in parallel, to perform significantly better than one of the best word-agnostic single character-level language models: the sequence memoizer (Wood et al., 2009; Gasthaus et al., 2010), which is a hierarchical nonparametric Bayesian method. It defines a prior process on the set of predictions at every conceivable context, with judiciously chosen details that make approximate inference computationally tractable. The memoizer induces dependencies between its predictions by making similar predictions at similar contexts. Although intelligent marginalization techniques are able to eliminate all but a relatively small number of the random variables (so the data structures used scale linearly with the amount of data), its memory requirements are still prohibitively expensive for large datasets, which is a direct consequence of its nonparametric nature.

While our method performs at the state of the art for pure character-level models, its compression performance falls short of the best models which have explicit knowledge of words, the most powerful of these being PAQ8hp12 (Mahoney, 2005). PAQ is a mixture model of a large number of well-chosen context models whose mixing proportions are computed by a neural network whose weights are a function of the current context, and whose predictions are further combined with a neural-network-like model. Unlike standard compression techniques, some of PAQ’s context models not only consider contiguous contexts but also contexts with “gaps”, allowing it to capture some types of longer-range structure cheaply. More significantly, PAQ is not word-agnostic, because it uses a combination of character-level and word-level models. PAQ also preprocesses the data with a dictionary of common English words, which we disabled because it gave PAQ an unfair advantage over models that do not use such task-specific (and indeed, English-specific) explicit prior knowledge. The numerous mixture components of PAQ were chosen because they improved performance on a development set, so in this respect PAQ is similar in model complexity to the winning entry of the Netflix Prize (Bell et al., 2007).

Figure 1. A Recurrent Neural Network is a very deep feedforward neural network whose weights are shared across time. The nonlinear activation function used by the hidden units is the source of the RNN’s rich dynamics.

Finally, language models can be used to “generate” language, and to our surprise, the text generated by the MRNNs we trained exhibited a significant amount of interesting and high-level linguistic structure, featuring a large vocabulary, a considerable amount of grammatical structure, and a wide variety of highly plausible proper names that were not in the training set. Mastering the vocabulary of English did not seem to be a problem for the MRNN: it generated very few uncapitalized non-words, and those that it did generate were often very plausible, like “homosomalist” or “un-ameliary”. Of particular interest was the fact that the MRNN learned to balance parentheses and quotes over long distances (e.g., 30 characters). A character-level N-gram language model could only do this by modeling 31-grams, and neither the memoizer nor PAQ are representationally capable of balancing parentheses because of their need for exact context matches. In contrast, the MRNN’s nonlinear dynamics enable it to extract higher-level “knowledge” from the text, and there are no obvious limits to its representational power because of the ability of its hidden states to perform general computation.

2. Recurrent Neural Networks

A Recurrent Neural Network is a straightforward adaptation of the standard feed-forward neural network that allows it to model sequential data. At each timestep, the RNN receives an input, updates its hidden state, and makes a prediction (fig. 1). The RNN’s high-dimensional hidden state and nonlinear evolution endow it with great expressive power, enabling the hidden state of the RNN to integrate information over many timesteps and use it to make accurate predictions. Even if the non-linearity used by each unit is quite simple, iterating it over time leads to very rich dynamics.

The standard RNN is formalized as follows: Given a sequence of input vectors (x_1, \ldots, x_T), the RNN computes a sequence of hidden states (h_1, \ldots, h_T) and a sequence of outputs (o_1, \ldots, o_T) by iterating the following equations for t = 1 to T:

h_t = \tanh(W_{hx} x_t + W_{hh} h_{t-1} + b_h)   (1)

o_t = W_{oh} h_t + b_o   (2)

In these equations, W_{hx} is the input-to-hidden weight matrix, W_{hh} is the hidden-to-hidden (or recurrent) weight matrix, W_{oh} is the hidden-to-output weight matrix, and the vectors b_h and b_o are the biases. The undefined expression W_{hh} h_{t-1} at time t = 1 is replaced with a special initial bias vector, h_{init}, and the tanh nonlinearity is applied coordinate-wise.
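To make eqs. 1–2 concrete, here is a minimal NumPy sketch of this forward pass; the dimensions and random initialization are illustrative assumptions, not the paper’s settings.

# Minimal NumPy sketch of the standard RNN forward pass (eqs. 1-2).
# Sizes and initialization are illustrative assumptions.
import numpy as np

V, H = 30, 100                      # input (1-of-M) and hidden dimensions
rng = np.random.default_rng(0)
W_hx = rng.normal(0, 0.01, (H, V))  # input-to-hidden weights
W_hh = rng.normal(0, 0.01, (H, H))  # hidden-to-hidden (recurrent) weights
W_oh = rng.normal(0, 0.01, (V, H))  # hidden-to-output weights
b_h, b_o = np.zeros(H), np.zeros(V)
h_init = np.zeros(H)                # the special initial bias vector

def rnn_forward(xs):
    """xs: list of length-V input vectors; returns hidden states and outputs."""
    h, hidden_states, outputs = h_init, [], []
    for x_t in xs:
        h = np.tanh(W_hx @ x_t + W_hh @ h + b_h)  # eq. (1)
        o = W_oh @ h + b_o                        # eq. (2)
        hidden_states.append(h)
        outputs.append(o)
    return hidden_states, outputs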

The gradients of the RNN are easy to compute via backpropagation through time (Rumelhart et al., 1986; Werbos, 1990)¹, so it may seem that RNNs are easy to train with gradient descent. In reality, the relationship between the parameters and the dynamics of the RNN is highly unstable, which makes gradient descent ineffective. This intuition was formalized by Hochreiter (1991) and Bengio et al. (1994), who proved that the gradient decays (or, less frequently, blows up) exponentially as it is backpropagated through time, and used this result to argue that RNNs cannot learn long-range temporal dependencies when gradient descent is used for training. In addition, the occasional tendency of the backpropagated gradient to exponentially blow up greatly increases the variance of the gradients and makes learning very unstable. As gradient descent was the main algorithm used for training neural networks at the time, these theoretical results and the empirical difficulty of training RNNs led to the near abandonment of RNN research.

One way to deal with the inability of gradient descent to learn long-range temporal structure in a standard RNN is to modify the model to include “memory” units that are specially designed to store information over long time periods. This approach is known as “Long Short-Term Memory” (Hochreiter & Schmidhuber, 1997) and has been successfully applied to complex real-world sequence modeling tasks (e.g., Graves & Schmidhuber, 2009). Long Short-Term Memory makes it possible to handle datasets which require long-term memorization and recall, but even on these datasets it is outperformed by a standard RNN trained with the HF optimizer (Martens & Sutskever, 2011).

Another way to avoid the problems associated with backpropagation through time is the Echo State Network (Jaeger & Haas, 2004), which forgoes learning the recurrent connections altogether and only trains the non-recurrent output weights. This is a much easier learning task and it works surprisingly well provided the recurrent connections are carefully initialized so that the intrinsic dynamics of the network exhibits a rich reservoir of temporal behaviours that can be selectively coupled to the output.

¹In contrast, Dynamic Bayes Networks (Murphy, 2002), the probabilistic analogues of RNNs, do not have an efficient algorithm for computing their gradients.

Figure 2. An illustration of the significance of the multiplicative connections (the product is depicted by a triangle). The presence of the multiplicative connections enables the RNN to be sensitive to conjunctions of context and character, allowing different contexts to respond in a qualitatively different manner to the same input character.

3. The Multiplicative RNN

Having applied a modestly-sized standard RNN architecture to the character-level language modeling problem (where the target output at each timestep is defined as the input character at the next timestep), we found the performance somewhat unsatisfactory, and that while increasing the dimensionality of the hidden state did help, the per-parameter gain in test performance was not sufficient to allow the method to be both practical and competitive with state-of-the-art approaches. We address this problem by proposing a new temporal architecture called the Multiplicative RNN (MRNN), which we will argue is better suited to the language modeling task.

3.1. The Tensor RNN

The dynamics of the RNN’s hidden states depend on the hidden-to-hidden matrix and on the inputs. In a standard RNN (as defined by eqs. 1–2), the current input x_t is first transformed via the visible-to-hidden weight matrix W_{hx} and then contributes additively to the input for the current hidden state. A more powerful way for the current input character to affect the hidden state dynamics would be to determine the entire hidden-to-hidden matrix (which defines the non-linear dynamics) in addition to providing an additive bias.

One motivation for this approach came from viewing an RNN as a model of an unbounded tree in which each node is a hidden state vector and each edge is labelled by a character that determines how the parent node gives rise to the child node. This view emphasizes the resemblance of an RNN to a Markov model that stores familiar strings of characters in a tree, and it also makes it clear that the RNN tree is potentially much more powerful than the Markov model because the distributed representation of a node allows different nodes to share knowledge. For example, the character string “ing” is quite probable after “fix” and also quite probable after “break”. If the hidden state vectors that represent the two histories “fix” and “break” share a common representation of the fact that this could be the stem of a verb, then this common representation can be acted upon by the character “i” to produce a hidden state that predicts an “n”. For this to be a good prediction we require the conjunction of the verb-stem representation in the previous hidden state and the character “i”. One or other of these alone does not provide half as much evidence for predicting an “n”: it is their conjunction that is important. This strongly suggests that we need a multiplicative interaction. To achieve this goal we modify the RNN so that its hidden-to-hidden weight matrix is a (learned) function of the current input x_t:

h_t = \tanh(W_{hx} x_t + W_{hh}^{(x_t)} h_{t-1} + b_h)   (3)

o_t = W_{oh} h_t + b_o   (4)

These are identical to eqs. 1 and 2, except that W_{hh} is replaced with W_{hh}^{(x_t)}, allowing each character to specify a different hidden-to-hidden weight matrix.

It is natural to define W_{hh}^{(x_t)} using a tensor. If we store M matrices, W_{hh}^{(1)}, \ldots, W_{hh}^{(M)}, where M is the number of dimensions of x_t, we could define W_{hh}^{(x_t)} by the equation

W_{hh}^{(x_t)} = \sum_{m=1}^{M} x_t^{(m)} W_{hh}^{(m)}   (5)

where x_t^{(m)} is the m-th coordinate of x_t. When the input x_t is a 1-of-M encoding of a character, it is easily seen that every character has an associated weight matrix and W_{hh}^{(x_t)} is the matrix assigned to the character represented by x_t.²
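For intuition, a short NumPy sketch of eq. 5 (our illustration, not the paper’s code): with the M per-character matrices stacked into a tensor, a 1-of-M input simply selects the matrix belonging to its character.

# Sketch of eq. (5): a 1-of-M input selects one of M stored matrices.
import numpy as np

M, H = 30, 100
W_hh_tensor = np.random.default_rng(1).normal(0, 0.01, (M, H, H))

x_t = np.zeros(M)
x_t[4] = 1.0                                      # 1-of-M encoding of character 4
W_hh_xt = np.tensordot(x_t, W_hh_tensor, axes=1)  # weighted sum over m
assert np.allclose(W_hh_xt, W_hh_tensor[4])       # equals the selected matrix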

3.2. The Multiplicative RNN

The above scheme, while appealing, has a major drawback: fully general 3-way tensors are not practical because of their size. In particular, if we want to use RNNs with a large number of hidden units (say, 1000), and if the dimensionality of x_t is even moderately large, then the storage required for the tensor W_{hh}^{(x_t)} becomes prohibitive.

It turns out we can remedy the above problem by factoring the tensor W_{hh}^{(x)} (e.g., Taylor & Hinton, 2009). This is done by introducing the three matrices W_{fx}, W_{hf}, and W_{fh}, and reparameterizing the matrix W_{hh}^{(x_t)} by the equation

W_{hh}^{(x_t)} = W_{hf} \cdot \mathrm{diag}(W_{fx} x_t) \cdot W_{fh}   (6)

²The above model, applied to discrete inputs represented with their 1-of-M encodings, is the nonlinear version of the Observable Operator Model (OOM; Jaeger, 2000), whose linear nature makes it closely related to an HMM in terms of expressive power.

 

Figure 3. The Multiplicative Recurrent Neural Network “gates” the recurrent weight matrix with the input symbol. Each triangle symbol represents a factor that applies a learned linear filter at each of its two input vertices. The product of the outputs of these two linear filters is then sent, via weighted connections, to all the units connected to the third vertex of the triangle. Consequently every input can synthesize its own hidden-to-hidden weight matrix by determining the gains on all of the factors, each of which represents a rank-one hidden-to-hidden weight matrix defined by the outer product of its incoming and outgoing weight vectors to the hidden units. The synthesized weight matrices share “structure” because they are all formed by blending the same set of rank-one matrices. In contrast, an unconstrained tensor model ensures that each input has a completely separate weight matrix.

If the dimensionality of the vector Wfxxt, denoted by F, is sufficiently large, then the factorization is as expressive as the original tensor. Smaller values of F require fewer parameters while hopefully retaining a significant fraction of the tensor’s expressive power.

The Multiplicative RNN (MRNN) is the result of factorizing the Tensor RNN by expanding eq. 6 within eq. 3. The MRNN computes the hidden state sequence (h_1, ..., h_T), an additional "factor state sequence" (f_1, ..., f_T), and the output sequence (o_1, ..., o_T) by iterating the equations

f_t = diag(W_fx x_t) · W_fh h_{t-1}    (7)

h_t = tanh(W_hf f_t + W_hx x_t)    (8)

o_t = W_oh h_t + b_o    (9)


which implement the neural network in fig. 3. The tensor factorization of eq. 6 has the interpretation of an additional layer of multiplicative units between each pair of consecutive layers (i.e., the triangles in fig. 3), so the MRNN actually has two steps of nonlinear processing in its hidden states for every input timestep. Each of the multiplicative units outputs the value f_t of eq. 7, which is the product of the outputs of the two linear filters connecting the multiplicative unit to the previous hidden states and to the inputs.
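As a concrete illustration, the sketch below implements one MRNN timestep of eqs. (7)-(9) in NumPy. The dimensions and the random initialization are illustrative assumptions, not the paper's trained parameters; the weight names mirror the symbols in the equations.

import numpy as np

rng = np.random.default_rng(0)
M, H, F = 86, 350, 350                  # alphabet size, hidden units, factors (assumed)
W_fx = rng.normal(0, 0.01, (F, M))      # input -> factor gains
W_fh = rng.normal(0, 0.01, (F, H))      # previous hidden state -> factors
W_hf = rng.normal(0, 0.01, (H, F))      # factors -> hidden
W_hx = rng.normal(0, 0.01, (H, M))      # input -> hidden
W_oh = rng.normal(0, 0.01, (M, H))      # hidden -> output
b_o = np.zeros(M)

def mrnn_step(x_t, h_prev):
    # One timestep; x_t is a 1-of-M character encoding.
    f_t = (W_fx @ x_t) * (W_fh @ h_prev)    # eq. (7): diag(W_fx x_t) * W_fh h_{t-1}
    h_t = np.tanh(W_hf @ f_t + W_hx @ x_t)  # eq. (8)
    o_t = W_oh @ h_t + b_o                  # eq. (9)
    return h_t, o_t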

We experimentally verified the advantage of the MRNN over the RNN when the two have the same number of parameters. We trained an RNN with 500 hidden units and an MRNN with 350 hidden units and 350 factors (so the RNN has slightly more parameters) on the "machine learning" dataset (dataset 3 in the experimental section). After extensive training, the MRNN achieved 1.56 bits per character and the RNN achieved 1.65 bits per character on the test set.

3.3. The difficulty of learning multiplicative units

In an MRNN, the effective weight³ W_ij^(c) from hidden unit i to hidden unit j contributed by character c is given by:

W_ij^(c) = ∑_f W_if W_fc W_fj    (10)


This product of parameters makes gradient descent learning difficult. If, for example, W_if is very small and W_fj is very large, we get a very large derivative for the very small weight and a very small derivative for the very large weight. Fortunately, this type of difficulty is exactly what second-order methods are good at handling, so multiplicative units should be better handled by a second-order approach like the HF optimizer.
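The imbalance is easy to see on a one-line example: for a loss that depends on the product of a tiny weight and a huge weight, each weight's gradient is proportional to the magnitude of the other weight. A minimal numeric sketch (all values are made up for illustration):

w_small, w_large, x, target = 1e-3, 1e3, 1.0, 0.0
y = w_small * w_large * x             # effective weight is the product of parameters
loss = 0.5 * (y - target) ** 2
g_small = (y - target) * w_large * x  # ~1e3: huge gradient on the tiny weight
g_large = (y - target) * w_small * x  # ~1e-3: tiny gradient on the huge weight
print(g_small, g_large)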

4. The RNN as a Generative Model

The goal of character-level language modeling is to predict the next character in a sequence. More formally, given a training sequence (x_1, ..., x_T), the RNN uses the sequence of its output vectors (o_1, ..., o_T) to obtain a sequence of predictive distributions P(x_{t+1} | x_{≤t}) = softmax(o_t), where the softmax distribution is defined by P(softmax(o_t) = j) = exp(o_t^(j)) / ∑_k exp(o_t^(k)). The language modeling objective is to maximize the total log probability of the training sequence ∑_{t=0}^{T-1} log P(x_{t+1} | x_{≤t}), which implies that the RNN learns a probability distribution over sequences. Even though the hidden units are deterministic, we can sample from an MRNN stochastically because the states of its output units define the conditional distribution P(x_{t+1} | x_{≤t}). We can sample from this conditional distribution to get the next character in a generated string and provide it as the next input to the RNN. This means that the RNN is a directed non-Markov model and, in this respect, it resembles the sequence memoizer (Wood et al., 2009).

5. The experiments

The goal of our experiments is to demonstrate that the MRNN, when trained by HF, learns high-quality language models. We demonstrate this by comparing the MRNN to the sequence memoizer and to PAQ on three real-world language datasets. After splitting each dataset into a training and test set, we trained a large MRNN, a sequence memoizer⁴, and PAQ, and report the bits per character (bpc) each model achieves on the test set.

³ We slightly abuse notation, using W_ij^(c) to stand for (W_hh^(c))_ij.

⁴ Which has no hyper-parameters and, strictly speaking, isn't 'trained' but rather conditioned on the training set.

 

Owing to its nonparametric nature and the nature of the data structures it uses, the sequence memoizer is very memory intensive, so it can only be applied to training datasets of roughly 130MB on a machine with 32GB of RAM. In contrast, the MRNN can be applied to datasets of unlimited size, although it typically requires considerably more total FLOPs to achieve good performance (but, unlike the memoizer, it is easily parallelized). However, to make the experimental comparison fair, we train the MRNN, the memoizer, and PAQ on datasets of the same size.

5.1. The datasets

We now describe the datasets. Each dataset is a long string of about 100MB of characters drawn from an 86-character alphabet that includes digits and punctuation, together with a special symbol which indicates that the character in the original text was not one of the other 85 characters in our alphabet. The last 10 million characters of each dataset are used as a test set.

1. The first dataset is a sequence of characters from the English Wikipedia. We removed the XML and the Wikipedia markup to clean the dataset. Since Wikipedia is extremely nonuniform, we randomly permuted its articles before partitioning it into a training and a test set.

2. The second dataset is a collection of articles from the New York Times (Sandhaus, 2008).

3. The third dataset is a corpus of machine learning papers. We constructed this dataset by downloading every NIPS and JMLR paper and converting them to plain text using the pdftotext utility. We then translated a large number of special characters to their ascii equivalents (including non-ascii punctuation, greek letters, and the "fi" and "ffi" ligatures) to clean the dataset, and removed most of the unstructured text by using only sentences consisting of at least 70% alphanumeric characters. Finally, we randomly permuted the papers.

The first two corpora are subsets of larger corpora (over 1GB each), but the semi-online nature of our optimizer makes it easy to train the MRNN on a dataset of any size.

5.2. Training details

To compute the exact gradient of the log probability of the training set (eq. 4), the MRNN needs to process the entire training set sequentially and store the hidden state sequence in order to apply backpropagation-through-time. This is infeasible due to the size of the training set, but it is also unnecessary: training the MRNN on many shorter sequences is just as effective, provided they are several hundred characters or more long. If the sequences are too short, we fail to utilize the ability of the HF optimizer to capture long-term dependencies spanning hundreds of timesteps.

Table 1. This table shows the test bits per character for each experiment, with the training bits in brackets (where available). The MRNN achieves lower bits per character than the sequence memoizer but higher than PAQ on each of the three datasets. The MRNN (full set) column refers to MRNNs trained on the larger (1GB) training corpora (except for the ML dataset, which is not a subset of a larger corpus). Note, also, that the improvement resulting from the larger dataset is modest, implying that an MRNN with 1500 units and factors is fairly well trained with 100MB of text.

DATASET   MEMOIZER   PAQ    MRNN          MRNN (FULL SET)
WIKI      1.66       1.51   1.60 (1.53)   1.55 (1.54)
NYT       1.49       1.38   1.48 (1.44)   1.47 (1.46)
ML        1.33       1.22   1.31 (1.27)   -

An advantage of using a large number of relatively short sequences over using a single long sequence is that the former is much easier to parallelize. This is essential, since our preliminary experiments suggested that HF applied to MRNNs works best when the gradient is computed using millions of characters and the curvature-matrix vector products are computed using hundreds of thousands of characters. Using a highly parallel system (consisting of 8 high-end GPUs with 4GB of RAM each), we computed the gradient on 160·300 = 48,000 sequences of length 250, of which 8·300 = 2,400 sequences were used to compute the curvature-matrix vector products that are needed for the HF optimizer (Martens & Sutskever, 2011) (so each GPU processes 300 sequences at a time).

The first few characters of any sequence are much harder to predict because they do not have a sufficiently large context, so it is not beneficial to have the MRNN spend neural resources predicting these characters. We take this effect into account by having the MRNN predict only the last 200 timesteps of the 250-long training sequences, thus providing every prediction with at least 50 characters of context.
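A minimal sketch of this masking, under the assumption that the per-timestep log probabilities log P(x_{t+1} | x_{≤t}) are already available as an array (the values below are placeholders):

import numpy as np

T, burn_in = 250, 50
log_probs = np.random.default_rng(0).normal(-2.0, 0.5, T)  # placeholder log-probabilities
mask = np.arange(T) >= burn_in        # ignore the first 50 predictions of each sequence
objective = (mask * log_probs).sum()  # the quantity maximized during training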

The Hessian-Free optimizer (Martens, 2010) and its RNN-specialized variant (Martens & Sutskever, 2011) have a small number of meta-parameters that must be specified. We set the structural damping coefficient µ to 0.1 and initialized λ to 10 (see Martens & Sutskever (2011) for a description of these meta-parameters). Our HF implementation uses a different subset of the training data at every iteration, so at a coarse temporal scale it is essentially online. In this setup, training lasted roughly 5 days for each dataset.

We found that a total of 160·150 weight updates was sufficient to adequately train an MRNN. More specifically, we used 160 steps of HF, with each of these steps using a maximum of 150 conjugate gradient iterations to approach the minimum of the quadratic Gauss-Newton-based approximation to the objective function, which remains fixed during the conjugate gradient iterations. The small number of weight updates, each requiring a massive amount of computation, makes the HF optimizer much easier to parallelize than stochastic gradient descent.

In all our experiments we use MRNNs with 1500 hidden units and 1500 factors (F), which have 4,900,000 parameters. The MRNNs were initialized with sparse connections: each unit starts out with 15 nonzero connections to other units (see Martens & Sutskever, 2011). Note that if we unroll the MRNN in time (as in fig. 3) we obtain a neural network with 500 layers of size 1500, if we view the multiplicative units f_t as layers. This is arguably the deepest and largest neural network ever trained.

5.3. The results

The main experimental results are shown in Table 1. We see that the MRNN predicts the test set more accurately than the sequence memoizer but less accurately than the dictionary-free PAQ on all three datasets.

5.4. Debagging

It is easy to convert a sentence into a bag of words, but it is much harder to convert a bag of words into a meaningful sentence. We name the latter the debagging problem. We perform an experiment where a character-level language model evaluates every possible ordering of the words in the bag and returns the ordering it deems best, as in the sketch below. To make the experiment tractable, we only considered bags of 7 words, giving a search space of size 5040.
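A minimal sketch of such an exhaustive debagger, where score(text) stands in for the trained language model's log probability and is hypothetical:

from itertools import permutations

def debag(prefix, bag, suffix, score):
    # Score all 7! = 5040 orderings of the middle words and keep the best one.
    # prefix/suffix are the surrounding context words; score returns a log probability.
    candidates = ((' '.join(prefix + list(p) + suffix), p) for p in permutations(bag))
    best_text, best_order = max(candidates, key=lambda c: score(c[0]))
    return best_order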

For our experiment, we used the MRNN and the memoizer⁵ to debag 500 bags of randomly chosen words from "Ana Karenina". We use 11 words for each bag, where the first two and the last two words are used as context to aid debagging the middle seven words.

We say that the model correctly debags a sentence if the correct ordering is assigned the highest log probability. We found that the Wikipedia-trained MRNN recovered the correct ordering 34% of the time, and the Wikipedia-trained memoizer did so 27% of the time. Given that the problem is "word-level", utilizing large character contexts is essential to achieve good performance.

⁵ We were unable to modify the implementation of PAQ to make debagging feasible.

 


 

6. Qualitative experiments

In this section we qualitatively investigate the nature of the models learned by the MRNN.

6.1. Samples from the models

The simplest qualitative experiment is to inspect the samples generated by the three MRNNs. The most salient characteristic of the samples is the richness of their vocabularies. Further inspection reveals that the text is mostly grammatical, and that parentheses are usually balanced over many characters. The artifacts of the generated text, such as consecutive commas or quotes, are the result of the data preprocessing and are frequently found in the training set.

6.1.1. SAMPLES FROM THE WIKIPEDIA MODEL

We now present a sample from the Wikipedia model. We use ? to indicate the "unknown" character. The sample below was obtained by running the MRNN less than 10 times and selecting the most intriguing sample. The beginning of the paragraph and the parentheses near the end are particularly interesting. The MRNN was initialized with the phrase "The meaning of life is":

The meaning of life is the tradition of the ancient human reproduction: it is less favorable to the good boy for when to remove her bigger. In the show's agreement unanimously resurfaced. The wild pasteured with consistent street forests were incorporated by the 15th century BE. In 1996 the primary rapford undergoes an effort that the reserve conditioning, written into Jewish cities, sleepers to incorporate the .St Eurasia that activates the population. Mar??a Nationale, Kelli, Zedlat-Dukastoe, Florendon, Ptu's thought is. To adapt in most parts of North America, the dynamic fairy Dan please believes, the free speech are much related to the

6.1.2. SAMPLES FROM THE NYT MODEL

Below is a sample from the model trained on the full NYT dataset, where the MRNN was initialized with a single space. The spaces surrounding the punctuation are an artifact of the preprocessing.

while he was giving attention to the second advantage of school building a 2-for-2 stool killed by the Cultures saddled with a half-suit defending the Bharatiya Fernall ’s office. Ms . Claire Parters will also have a history temple for him to raise jobs until naked Prodiena to paint baseball partners , provided people to ride both of Manhattan in 1978 , but what was largely directed to China in 1946 , focusing on the trademark period is the sailboat yesterday and comments on whom they obtain overheard within the 120th anniversary , where many civil rights defined, officials said early that forms , ” said Bernard J. Marco Jr. of Pennsylvania , was monitoring New York

6.1.3. SAMPLES FROM THE ML MODEL

Finally, we generate text from an MRNN trained on the ML corpus conditioned on the string "Recurrent". This MRNN is also able to balance parentheses (e.g., the third line of the sample):

Recurrent network with the Stiefel information for logistic regression methods Along with either of the algorithms previously (two or more skewprecision) is more similar to the model with the same average mismatched graph. Though this task is to be studied under the reward transform, such as (c) and (C) from the training set, based on target activities for articles a ? 2(6) and (4.3). The PHDPic (PDB) matrix of cav'va using the three relevant information contains for tieming measurements. Moreover, because of the therap tor, the aim is to improve the score to the best patch randomly, but for each initially four data sets. As shown in Figure 11, it is more than 100 steps, we used ?? \to \infty with 1000

6.2. Structured sentence completion

In this section, we investigate the MRNN's response in various situations by sampling from the MRNN's distribution conditioned on a prefix. The goal is to see whether the MRNN is able to generate "plausible" continuations to the initial strings. In our first experiment, we use the Wikipedia MRNN to complete the string "England, Spain, France, Germany,":

England, Spain, France, Germany, and Massachusetts.

England, Spain, France, Germany, cars, and direct schools

England, Spain, France, Germany, , or New Orleans and Uganda.

England, Spain, France, Germany, , Westchester, Jet State, Springfield, Athleaves and Sorvinhee

In the above completions, the MRNN correctly interpreted the string to be a list of locations, so the generated text was also a part of a list.

Next, we performed a similar experiment using the ML model and the pair of strings "(ABC et al" and "ABC et al". The system has never seen the string "(ABC et al" in its training set (simply because there is no machine learning author named ABC, and its capitalization is particularly uncommon for a citation), so the MRNN needed to generalize over an entirely new author name:

(ABC et al., 2003), ?13?, and for a supervised Mann-¡Whitnaguing

(ABC et al., 2002), based on Lebanon and Haussler, 1995b)

ABC et al. (2003b), or Penalization of Information

ABC et al. (2008) can be evaluated and motivated by providing optimal estimate

This example shows that the MRNN is sensitive to the initial bracket before "ABC", illustrating its representational power. The above effect is extremely robust. In contrast, both N-gram models and the sequence memoizer cannot make such predictions unless these exact strings (e.g., "(ABC et al., 2003)") occur in the training set, which cannot be counted on. In fact, any method which is based on precise context matches is fundamentally incapable of utilizing long contexts, because the probability that a long context occurs more than once is vanishingly small. We experimentally verified that neither the sequence memoizer nor PAQ is sensitive to the initial bracket.

7. Discussion

Modeling language at the character level seems unnecessarily difficult because we already know that morphemes are the appropriate units for making semantic and syntactic predictions. Converting large databases into sequences of morphemes, however, is non-trivial compared with treating them as character strings. Also, learning which character strings make words is a relatively easy task compared with discovering the subtleties of semantic and syntactic structure. So, given a powerful learning system like an MRNN, the convenience of using characters may outweigh the extra work of having to learn the words. All our experiments show that an MRNN finds it very easy to learn words. With the exception of proper names, the generated text contains very few non-words. At the same time, the MRNN also assigns probability to (and occasionally generates) plausible words that do not appear in the training set (e.g., "cryptoliation", "homosomalist", or "un-ameliary"). This is a desirable property which enabled the MRNN to gracefully deal with real words that it nonetheless didn't see in the training set. Predicting the next word by making a sequence of character predictions avoids having to use a huge softmax over all known words, and this is so advantageous that some word-level language models actually make up binary "spellings" of words so that they can predict them one bit at a time (Mnih & Hinton, 2009).

MRNNs already learn surprisingly good language models using only 1500 hidden units, and unlike other approaches such as the sequence memoizer and PAQ, they are easy to extend along various dimensions. If we could train much bigger MRNNs with millions of units and billions of connections, it is possible that brute force alone would be sufficient to achieve an even higher standard of performance. But this will of course require considerably more computational power.

Acknowledgements

This work was supported by a Google fellowship and NSERC. The experiments were implemented with software packages by Tieleman (2010) and Mnih (2009).

REFERENCES

Bell, R.M., Koren, Y., and Volinsky, C. The BellKor solution to the Netflix prize. KorBell Team’s Report to Netflix, 2007.

Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.

Gasthaus, J., Wood, F., and Teh, Y.W. Lossless compression based on the Sequence Memoizer. In Data Compression Conference (DCC), 2010, pp. 337–345. IEEE, 2010.

Graves, A. and Schmidhuber, J. Offline handwriting recognition with multidimensional recurrent neural networks. Advances in Neural Information Processing Systems, 21, 2009.

Hochreiter, S. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Technische Universität München, 1991.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Hutter, M. The Human Knowledge Compression Prize, 2006.

Jaeger, H. Observable operator models for discrete stochastic time series. Neural Computation, 12(6):1371–1398, 2000.

Jaeger, H. and Haas, H. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78, 2004.

Mahoney, M. Adaptive weighing of context models for lossless data compression. Florida Inst. Technol., Melbourne, FL, Tech. Rep. CS-2005-16, 2005.

Martens, J. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.

Martens, J. and Sutskever, I. Training recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning (ICML), 2011.

Mikolov, T., Karafiát, M., Burget, L., Černocký, J., and Khudanpur, S. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.

Mnih, A. and Hinton, G. A scalable hierarchical distributed language model. Advances in Neural Information Processing Systems, 21:1081–1088, 2009.

Mnih, Volodymyr. Cudamat: a CUDA-based matrix class for Python. Technical Report UTML TR 2009-004, Department of Computer Science, University of Toronto, November 2009.

Murphy, K.P. Dynamic Bayesian networks: representation, inference and learning. PhD thesis, University of California, Berkeley, 2002.

Pollastri, G., Przybylski, D., Rost, B., and Baldi, P. Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins: Structure, Function, and Bioinformatics, 47(2):228–235, 2002.

Rissanen, J. and Langdon, G.G. Arithmetic coding. IBM Journal of Research and Development, 23(2):149–162, 1979.

Robinson, A.J. An application of recurrent nets to phone probability estimation. IEEE Transactions on Neural Networks, 5(2):298–305, 1994.

Rumelhart, D.E., Hinton, G.E., and Williams, R.J. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

Sandhaus, E. The New York Times Annotated Corpus. Linguistic Data Consortium, Philadelphia, 2008.

Taylor, G.W. and Hinton, G.E. Factored conditional restricted Boltzmann machines for modeling motion style. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1025–1032. ACM, 2009.

Tieleman, T. Gnumpy: an easy way to use GPU boards in Python. Technical Report UTML TR 2010-002, University of Toronto, Department of Computer Science, 2010.

Ward, D.J., Blackwell, A.F., and MacKay, D.J.C. Dasher–a data entry interface using continuous gestures and language models. In Proceedings of the 13th annual ACM symposium on User interface software and technology, pp. 129–137. ACM, 2000.

Werbos, P.J. Backpropagation through time: What it is and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.

Wood, F., Archambeau, C., Gasthaus, J., James, L., and Teh, Y.W. A stochastic memoizer for sequence data. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1129–1136. ACM, 2009.

 

SPARSEHASH:

EMBEDDING JACCARD COEFFICIENT BETWEEN SUPPORTS OF SIGNALS

D. Valsesia, S. M. Fosson, C. Ravazzi, T. Bianchi, E. Magli

Politecnico di Torino - DET, Italy

name.surname@polito.it

 

ABSTRACT

Embeddings provide compact representations of signals to be used to perform inference in a wide variety of tasks. Random projections have been extensively used to preserve Euclidean distances or inner products of high dimensional signals in low dimensional representations. Different techniques based on hashing have been used in the past to embed set similarity metrics such as the Jaccard coefficient. In this paper we show that a class of random projections based on sparse matrices can be used to preserve the Jaccard coefficient between the supports of sparse signals. Our proposed construction can therefore be used in a variety of tasks in machine learning and multimedia signal processing where the overlap between signal supports is a relevant similarity metric. We also present an application to retrieval of similar text documents where SparseHash improves over MinHash.

Index Terms— Embedding, Jaccard coefficient, random projections, sparse matrices, MinHash

1. INTRODUCTION

Recent trends in signal processing are increasingly pushing researchers to investigate compact signal representations that build on signal sparsity. Such compact representations can be naturally used for signal acquisition and recovery, as has been extensively studied in the compressed sensing literature [1–3]. However, similar representations are also very useful when one is not interested in signal recovery, but only in performing some signal classification tasks that are based on signal properties preserved by the compact representation.

In this second case, these representations are usually referred to as embeddings. Formally, an embedding is a transformation that maps a set of signals in a high dimensional space to a lower dimensional space, in such a way that the geometry of the set is approximately preserved. An important class of signal embeddings are those preserving the distances among pairs of signals. The most famous embedding is probably the one proposed by Johnson and Lindenstrauss [4], which

This work is supported by the European Research Council under the European Community’s Seventh Framework Programme (FP7/2007-2013) / ERC Grant agreement n.279848.

 

preserves Euclidean distances using random projections. Several extensions have been later proposed, allowing one to embed the angle between signals [5, 6], or controlling the maximum distance that is embedded [7].

The concept of embedding has been successfully used also in the more general context of information retrieval [8], where it is usually called "hashing". For example, it is a fundamental ingredient of efficient indexing techniques known as locality sensitive hashing [9]. In several information retrieval problems, "bag-of-features" representations [10, 11] are used to describe complex objects (e.g. images or text documents) by counting if and how many times a particular feature from a dictionary is present in the objects. In such problems, like the search for near-duplicate documents or similar images, the usual metric is not the Euclidean distance but a similarity index between sets, where the elements in the sets are the vocabulary elements present in the objects under examination. One of the most used techniques for measuring set similarity is min-wise hashing (also known as MinHash) [12–14], which approximately preserves the Jaccard similarity coefficient between pairs of sets and is used in a wide range of applications [15].

In this paper, we introduce an alternative embedding for the Jaccard coefficient which is based on the concepts of random projections and signal sparsity. The proposed embedding builds on recent results showing that measurements acquired using a sparse random matrix can be used to estimate the number of nonzero components, i.e., the size of the support, of the acquired signal [16, 17]. Based on this result, we show that, given a pair of signals, their measurements can be efficiently used for estimating the size of both the union and the intersection of the signal supports. Hence, random projections obtained from a sparse random matrix provide an embedding of the Jaccard coefficient of the signal supports. Moreover, since these projections can be quantized using a single bit, they represent an efficient alternative to the widely used MinHash.

This paper is organized as follows. In Sec. 2 we provide some background on random projections and hashing techniques for set similarity, namely MinHash. Sec. 3 discusses the proposed method to use random projections as an embedding of the Jaccard coefficient between signal supports, also providing some theoretical results. Sec. 4 validates the proposed technique with synthetic and real datasets. Finally, we draw some conclusions and explore future lines of work in Sec. 5.

2. BACKGROUND

In this section we provide some background material on known embeddings of common distance measures such as Euclidean distance, angular distance and Jaccard distance for set similarity.

2.1. Similarity search and embeddings

Let X = {x_i}_{i=1}^N, x_i ∈ R^n, be a collection of data points and d_X be a metric defined on X. Given a query item ξ, the problem of proximity search is to find the items that are within distance τ from ξ: Q = {x ∈ X : d_X(x, ξ) ≤ τ}. It should be noticed that this computation generally requires O(Nn) operations, which can be prohibitive for large N and n. An embedding is a function f : X → Y ⊆ R^m, which maps vectors in the high dimensional space into a lower dimensional one (m ≪ n) equipped with a distance metric d_Y, preserving the geometry of the set with low distortion. Then, distances can be computed on the low dimensional embeddings, rather than in the original space, reducing the computational cost from O(Nn) to O(Nm) operations. In the following paragraphs we review two popular approaches for transforming the data to a low dimensional representation: random projections and MinHash.

2.2. Random projections

Random projections have been used as embeddings in order to reduce the dimensionality of points. The Johnson-Lindenstrauss lemma [4] states that random linear mappings f(x) = Ax with A ∈ R^{m×n}, if properly designed, preserve the Euclidean distances of points within a small tolerance with high probability. More precisely, given ε ∈ (0, 1), β > 0, and N, m ∈ N such that

m ≥ (24 + 12β) / (3ε² − 2ε³) · log N,

then there exists a distribution over R^{m×n} from which the matrix A is drawn such that for all u, v

(1 − ε)‖u − v‖₂² ≤ ‖Au − Av‖₂² ≤ (1 + ε)‖u − v‖₂²

with probability exceeding 1 − N^(−β) (the interested reader can refer to [18] for the proof).

The most common choice for the distribution of the entries of the matrix A is i.i.d. Gaussian N(0, 1/m). In [8] other distributions are proposed in order to speed up the computation, using sparse random projections of the form

A_ij = +√(s/m) w.p. 1/(2s);  0 w.p. 1 − 1/s;  −√(s/m) w.p. 1/(2s),

where only a 1/s fraction of the data needs to be processed. In [19] it is shown that, under suitable conditions, one can use s = n/log(n) to significantly speed up the computation.

Finally, another popular embedding is Sign Random Projections [5] for the angle-based distance formed by any two vectors u, v:

θ(u, v) = (1/π) arccos( uᵀv / (‖u‖₂ ‖v‖₂) ).

The hash function is formulated as f(u) = sign(Au), where A ∈ R^{m×n} has i.i.d. Gaussian entries. It can be shown that P(f(u) ≠ f(v)) = θ(u, v). The vectors can then be compared in the reduced space using Hamming distances, for which efficient algorithms are available in the literature [20]. Compared to regular random projections, for each data point Sign Random Projections need to store just one bit per projection.

In this paper, we show that sparse random matrices, if properly designed, can embed the Jaccard coefficient between the supports of sparse signals.

2.3. MinHash

The Jaccard coefficient is a similarity measure between two sets S₁, S₂ ⊆ Ω = {1, ..., n} and is defined as

J(S₁, S₂) = |S₁ ∩ S₂| / |S₁ ∪ S₂|.

The related distance is 1 − J(S₁, S₂).

The most popular technique to estimate the Jaccard coefficient is MinHash, which works as follows. Let S ⊆ Ω = {1, ..., n} and let π be a uniformly chosen permutation on Ω; then the hash function is

h(S) = min_{a∈S} π(a).

It can be easily shown that

P[h(S₁) = h(S₂)] = J(S₁, S₂).

Then, given m hash values of two sets (all permutations are generated independently), the Jaccard coefficient is estimated as

(1/m) ∑_{i=1}^m 1(h_i(S₁) = h_i(S₂)),

where 1 is the indicator function. In [15] the authors propose to use only the least significant b bits of the MinHash value, instead of using 64 bits or 40 bits as in [21] and [12], respectively. The most common solution adopted in practice is to keep a single bit, thus estimating the Jaccard coefficient from the Hamming distance between hash vectors as

(1/m) ∑_{i=1}^m 1([h_i(S₁) mod 2] ≠ [h_i(S₂) mod 2]).
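For concreteness, a minimal sketch of the plain MinHash estimator described above, using explicit random permutations (real implementations use cheaper hash functions):

import numpy as np

def minhash_estimate(S1, S2, n, m, rng=np.random.default_rng(0)):
    # S1, S2 are sets of indices in {0, ..., n-1}; m is the number of hashes.
    matches = 0
    for _ in range(m):
        pi = rng.permutation(n)                      # uniformly random permutation of the universe
        if min(pi[a] for a in S1) == min(pi[b] for b in S2):
            matches += 1
    return matches / m                               # (1/m) sum of 1(h_i(S1) = h_i(S2))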

3. PROPOSED METHOD

We propose sparse random projections as a tool to estimate the Jaccard coefficient between supports of signals in high dimensional space. We define the support of a signal u ∈ R^n as the set of its nonzero elements:

supp(u) = {i ∈ {1, ..., n} : u_i ≠ 0}.

Given u, v ∈ R^n we are interested in estimating the Jaccard coefficient J(S_u, S_v) of the two sets S_u = supp(u) and S_v = supp(v). To simplify the notation, from now on we denote J(S_u, S_v) by J(u, v).

The hash function we consider is f(x) = 1({Ax ≠ 0}), applied componentwise, where A ∈ R^{m×n} with m < n is a γ-sparsified matrix whose entries are i.i.d. according to

A_ij ~ N(0, γ⁻¹) w.p. γ;  δ₀ w.p. 1 − γ,    (1)

where δ₀ denotes a Dirac delta centered at zero. Also in this case, as in Sign Random Projections, each data point needs to store just one bit per projection.

Let now y, z ∈ R^m and define

I₁(y, z) = (1/m) ∑_{i=1}^m 1({y_i = 0, z_i = 0}),

I₂(y, z) = [∑_{i=1}^m 1({y_i = 0}) · ∑_{j=1}^m 1({z_j = 0})] / [m ∑_{i=1}^m 1({y_i = 0, z_i = 0})];

then the signals can be compared in the reduced space using the following similarity index:

I(y, z) = log I₂(y, z) / log I₁(y, z).    (2)

In the following proposition, we state that the proposed similarity index concentrates around the Jaccard coefficient between the supports of the original signals. Due to space constraints, we defer the proof to a future article.

Proposition 1. Let u, v be a pair of arbitrary vectors. Fix ε > 0; then there exists q = q(ε) ∈ (0, 1) such that

P[ |I(Au, Av) − J(u, v)| > ε ] ≤ q^m.    (3)

 

Fig. 1. Jaccard coefficient estimation with m = 50 (x-axis: Jaccard coefficient).

In this work, the choice of the distribution N(0, γ⁻¹) for the nonzero entries of A is arbitrary and can be replaced with any continuous distribution with zero mean and variance 1/γ. However, different choices of distribution can change the error tail bound.
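To make the pipeline concrete, the sketch below draws a γ-sparsified matrix as in eq. (1), keeps one bit per measurement, and estimates J(u, v) with the index I = log I₂ / log I₁ as reconstructed in eq. (2) above. It is a minimal sketch: m is assumed large enough that all empirical counts are nonzero.

import numpy as np

rng = np.random.default_rng(0)

def gamma_sparsified(m, n, gamma):
    # Entries are N(0, 1/gamma) with probability gamma and exactly 0 otherwise, as in eq. (1).
    mask = rng.random((m, n)) < gamma
    return mask * rng.normal(0.0, 1.0 / np.sqrt(gamma), (m, n))

def jaccard_estimate(u, v, m, gamma):
    A = gamma_sparsified(m, len(u), gamma)
    y0 = (A @ u) == 0                     # one bit per measurement: is the projection zero?
    z0 = (A @ v) == 0
    I1 = np.mean(y0 & z0)                 # concentrates around (1-gamma)^{|Su u Sv|}
    I2 = (y0.sum() * z0.sum()) / (m * (y0 & z0).sum())  # around (1-gamma)^{|Su n Sv|}
    return np.log(I2) / np.log(I1)        # the index I of eq. (2)

For the synthetic setting of Sec. 4.1 (n = 1000, k = 230), the entropy-maximizing choice is γ ≈ 3·10⁻³.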

4. EXPERIMENTAL RESULTS

In this section we perform some experiments with the proposed embedding technique. First, we experimentally show that the embedding metric we introduced in Eq. (2) indeed concentrates around the Jaccard coefficient as stated in Eq. (3). Then, we address a classification problem using a real dataset of text data.

4.1. Embedding validation

We perform some experiments to validate the theoretical result that the similarity metric I(Au, Av) between the random projections of two signals of interest concentrates around the Jaccard coefficient between their original supports. In order to show this result, we generate a large number of signals at random with varying amounts of support overlap and compute their random projections. The signals used in this experiment have n = 1000 and the cardinality of the support is k = 230. The dimension of the reduced space has been fixed to m = 50. The mean and variance of the similarity index I are evaluated over 500 iterations. The γ parameter controlling the sparsity of the sensing matrix is set to the value that maximizes the entropy of the binary measurements, i.e., that generates zero or nonzero measurements with equal probability. Since

P(f_i(u) = 0) = (1 − γ)^k,

then we set

γ = 1 − 2^(−1/k) ≈ 3·10⁻³.

Fig. 2. Precision and recall vs. sample size (50–500), threshold 0.5.

Fig. 3. Precision and recall vs. sample size (50–500), threshold 0.6.

Fig. 1 shows that the mean value of I(Au, Av) (solid blue curve) computed between every pair of random projections is close to J(u, v) (dashed yellow line). The shaded area represents an interval of width equal to one standard deviation above and below the mean value.

Since the proposed SparseHash method only requires storing 1 bit per measurement, we compared it to the binary version of MinHash applied to the same signals. It can be noticed that SparseHash and MinHash show a similar behaviour, with a slight reduction in variance for SparseHash.

4.2. Classification with a real dataset

The goal of this section is to test the performance of SparseHash in a classification experiment with a real dataset. Finding near-duplicate or similar documents in an archive of text data has been an important problem for a long time, and several works [12, 21, 22] from the early days of the Web to more recent times have addressed the issue. Documents can be represented with bag-of-words or bag-of-shingles (groups of consecutive words) models, where what we called the "signal" is the count of how many times a particular word or shingle appears. Such models are typically very sparse signals because the number of different words/shingles appearing in a particular document is typically small compared to the size of the vocabulary. Since our goal is to ascertain the quality of the embedding provided by SparseHash, we perform an experiment similar to the one reported in [15], where the authors compared how various quantization rates affected the performance of MinHash. We use the standard and publicly available UCI dataset of New York Times articles [23]. This dataset is composed of about 300000 news articles, with a bag-of-words representation given for each article. In terms of the signal parameters used in this paper, n = 102660. The mean sparsity (i.e., the number of different words used in each article) is k = 232. As in Sec. 4.1, γ is set to the value that maximizes the entropy. Since the sparsity degree varies, we approximate it with the mean value k, so that γ = 1 − 2^(−1/k).

We compare the performance of SparseHash and 1-bit MinHash in terms of precision and recall. Specifically, we define as similar the documents with Jaccard coefficient larger than a certain threshold, and we try to retrieve them. If Q is the set of similar documents and Q̂ its estimate, the precision is the fraction of retrieved documents that are truly similar, |Q ∩ Q̂|/|Q̂|, and the recall is the fraction of similar documents that are retrieved, |Q ∩ Q̂|/|Q|.

Fig. 4. Precision vs. recall (recall range 0.8–1), with ground truth threshold 0.5 (left) and 0.6 (right).

 


 


 

In Figures 2 and 3, we set the thresholds to 0.5 and 0.6, respectively, and show precision and recall as a function of the sample size. Concerning precision, we notice that SparseHash outperforms 1-bit MinHash, in particular for sample sizes smaller than 100. For larger sample sizes, both methods are effective, with precision very close to 1. The recall is very close to one for both methods, for all the tested sample sizes.

In Figure 4, we depict the behavior of the precision as a function of the recall. In this experiment, the goal is to recover all the documents with Jaccard coefficient larger than 0.5 (left graph) and 0.6 (right graph). We consider sample sizes in {48, 96}. In all these settings, SparseHash outperforms 1-bit MinHash, i.e., at the same recall the precision of SparseHash is higher. The gain is more evident for smaller sample sizes.

The superior performance of SparseHash with respect to 1-bit MinHash is consistent with the results in Figure 1, in which we observed a smaller variance for SparseHash in a numerical setting close to that of the dataset considered here.

5. CONCLUSIONS AND FUTURE WORK

This paper introduced SparseHash, an embedding technique that reduces the dimensionality of signals while preserving the Jaccard coefficient between their supports. Contrary to other techniques present in the literature for embedding set similarity, such as MinHash, we derived the method starting from the literature on random projections and compressed sensing. We showed that the method is an interesting alternative to binary MinHash, improving over it by providing lower variance for the same number of bits required by the hash. Moreover, we also tested SparseHash in a classification experiment with a real dataset of news articles. This test confirmed the superior performance of the proposed method with respect to MinHash. Future work will focus on providing more detailed theoretical results, as well as a fast technique to compute the measurements that does not require the matrix-vector product.

 

6. REFERENCES

[1] D. L. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory, vol. 52, no. 4, pp. 1289–1306, 2006.

[2] E. J. Candès, J. Romberg, and T. Tao, "Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information," IEEE Trans. Inf. Theory, vol. 52, no. 2, pp. 489–509, 2006.

[3] E. J. Candès and T. Tao, "Near-optimal signal recovery from random projections: Universal encoding strategies?," IEEE Trans. Inf. Theory, vol. 52, no. 12, pp. 5406–5425, 2006.

[4] W. B. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," Contemporary Mathematics, vol. 26, 1984.

[5] M. S. Charikar, "Similarity estimation techniques from rounding algorithms," in Proceedings of the Thirty-fourth Annual ACM Symposium on Theory of Computing, New York, NY, USA, 2002, STOC '02, pp. 380–388, ACM.

[6] L. Jacques, J. N. Laska, P. T. Boufounos, and R. G. Baraniuk, “Robust 1-bit compressive sensing via binary stable embeddings of sparse vectors,” IEEE Trans. Inf. Theory, vol. 59, no. 4, pp. 2082–2102, April 2013.

[7] P. T. Boufounos and S. Rane, “Efficient coding of sig¬nal distances using universal quantized embeddings,” in Data Compression Conference (DCC), 2013, March 2013, pp. 251–260.

 

[8] D. Achlioptas, "Database-friendly random projections: Johnson-Lindenstrauss with binary coins," Journal of Computer and System Sciences, vol. 66, no. 4, pp. 671–687, 2003.

[9] A. Andoni and P. Indyk, "Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions," Commun. ACM, vol. 51, no. 1, pp. 117–122, Jan. 2008.

[10] T. Joachims, "Text categorization with support vector machines: Learning with many relevant features," in Proceedings of the 10th European Conference on Machine Learning, London, UK, 1998, ECML '98, pp. 137–142, Springer-Verlag.

[11] Y. Jiang, C. Ngo, and J. Yang, “Towards optimal bag-of-features for object categorization and semantic video retrieval,” in Proceedings of the 6th ACM International Conference on Image and Video Retrieval, New York, NY, USA, 2007, CIVR ’07, pp. 494–501, ACM.

[12] A. Broder, “On the resemblance and containment of documents,” in Proceedings of the Compression and Complexity of Sequences 1997, Washington, DC, USA, 1997, pp. 21–29, IEEE Computer Society.

[13] A. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher, "Min-wise independent permutations," Journal of Computer and System Sciences, vol. 60, no. 3, pp. 630–659, 2000.

[14] P. Indyk, "A small approximately min-wise independent family of hash functions," in Proceedings of the Tenth Annual ACM-SIAM Symposium on Discrete Algorithms, Philadelphia, PA, USA, 1999, SODA '99, pp. 454–456, Society for Industrial and Applied Mathematics.

[15] P. Li and A. C. König, "b-bit minwise hashing," in Proceedings of the 19th International Conference on World Wide Web, New York, NY, USA, 2010, pp. 671–680, ACM.

[16] V. Bioglio, T. Bianchi, and E. Magli, "On the fly estimation of the sparsity degree in compressed sensing using sparse sensing matrices," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2015, pp. 3801–3805.

[17] C. Ravazzi, S. M. Fosson, T. Bianchi, and E. Magli, "Signal sparsity estimation from compressive noisy projections via γ-sparsified random matrices," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2016, to appear.

[18] S. Dasgupta and A. Gupta, “An Elementary Proof of a Theorem of Johnson and Lindenstrauss,” Random Struct. Algorithms, vol. 22, no. 1, pp. 60–65, 2003.

 

[19] P. Li, T. J. Hastie, and K. W. Church, “Very sparse random projections,” in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Dis¬covery and Data Mining, New York, NY, USA, 2006, KDD ’06, pp. 287–296, ACM.

[20] P. Indyk and R. Motwani, “Approximate nearest neigh¬bors: towards removing the curse of dimensionality,” in Proceedings of the thirtieth annual ACM symposium on Theory of computing. ACM, 1998, pp. 604–613.

[21] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig, "Syntactic clustering of the web," Comput. Netw. ISDN Syst., vol. 29, no. 8-13, pp. 1157–1166, Sept. 1997.

[22] M. Henzinger, “Finding near-duplicate web pages: A large-scale evaluation of algorithms,” in Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 2006, SIGIR ’06, pp. 284–291, ACM.

[23] M. Lichman, “UCI machine learning repository,” http://archive.ics.uci.edu/ml, 2013.

 

Submitted to MNRAS

NEURAL COMPUTATION AS A TOOL FOR

GALAXY CLASSIFICATION: METHODS AND EXAMPLES

O. Lahav¹, A. Naim¹, L. Sodré Jr.² and M. C. Storrie-Lombardi¹

1. Institute of Astronomy, Madingley Rd., Cambridge, CB3 0HA

2. Instituto Astronômico e Geofísico da Universidade de São Paulo, CEP CP9638, 01065-970, São Paulo, Brazil

ABSTRACT. We apply and compare various Artificial Neural Network (ANN) and other algorithms for automatic morphological classification of galaxies. The ANNs are presented here mathematically, as non-linear extensions of conventional statistical methods in Astronomy. The methods are illustrated using different subsets from the ESO-LV catalogue, for which both machine parameters and human classification are available. The main methods we explore are: (i) Principal Component Analysis (PCA), which tells how independent and informative the input parameters are. (ii) Encoder Neural Network, which allows us to find both linear (PCA-like) and non-linear combinations of the input, illustrating an example of unsupervised ANN. (iii) Supervised ANN (using the Backpropagation or Quasi-Newton algorithms) based on a training set for which the human classification is known. Here the output for previously unclassified galaxies can be interpreted as either a continuous (analog) output (e.g. T-type) or a Bayesian a posteriori probability for each class. Although the ESO-LV parameters are sub-optimal, the success of the ANN in reproducing the human classification is ∼2 T-type units, similar to the degree of agreement between two human experts who classify the same galaxy images on plate material. We also examine the aspects of ANN configurations, reproducibility, scaling of input parameters and redshift information.

1 INTRODUCTION

The exponential growth of data in extragalactic Astronomy calls for new approaches to analysis and interpretation. Observations with large ground-based telescopes, automatic measurement machines and satellites have produced large data bases of imaging and spectroscopy of galaxies. However, the advance in producing 'Gigabytes of data' has not been matched by Artificial Intelligence techniques of classification and interpretation. In spite of several attempts (e.g. Murtagh & Heck 1987; Thonnat 1989; Lauberts & Valentijn 1989; Okamura, Kodaira & Watanabe 1984; Spiekermann 1992; Storrie-Lombardi et al. 1992; Doi, Fukugita & Okamura 1993; Abraham et al. 1994; Lahav et al. 1995; Naim et al. 1995b), morphological classification of galaxies still remains a human-intensive process dependent on the eyes of a handful of dedicated individuals.

The motivation for classifying galaxies is two-fold:

(i) RC3-like catalogues for millions of galaxies are needed for statistical studies (e.g. correlation functions or the density-morphology relation) and as target lists of selected types for observational projects, such as Tully-Fisher measurements.

(ii) Classification is important for quantifying the Astrophysics of galaxies, in analogy with the H-R diagram for stars. It allows us to incorporate multi-wavelength and dynamical properties of the galaxies, with the hope that a new 'physical Hubble Sequence' may emerge.

Automated procedures are the only practical way of classifying the enormous amount of data produced by machine scans like those obtained at the Cambridge Automated Plate Measuring (APM) facility and the Sloan Digital Sky Survey (SDSS). The Artificial Neural Network (ANN) method is a novel technique for classifying objects which has been only little explored in Astronomy. In a pilot study (Storrie-Lombardi et al. 1992; hereafter SLSS) we investigated the ANN technique to classify galaxies. Using a Backpropagation algorithm, we showed that we could reproduce the ESO-LV classification (into 5 classes) at a success rate of 64% 'perfect match'. More recently, we have shown (Naim et al. 1995b; Lahav et al. 1995) that ANNs can replicate the human classification of APM-selected galaxies to the same degree of agreement as that between two human experts, 1.8 T-type units. This paper provides the theoretical framework and mathematical details of the methods used in these studies.

Other recent applications of ANNs in astronomy include adaptive optics (e.g. Angel et al. 1990), star/galaxy separation (e.g. Odewahn et al. 1991), meteor monitoring (Fraser 1992), and classification of stellar spectra (von Hippel et al. 1994). For reviews of astronomical applications see also Serra-Ricart et al. (1993), Miller (1993) and Storrie-Lombardi & Lahav (1994, 1995). Non-astronomical applications somewhat similar to our problem are speech recognition and identification of hand-written characters. ANNs have several practical advantages compared with traditional techniques. ANN algorithms make no prior assumptions about the statistical distribution of test objects, and invoke no heuristics to define class membership. The algorithms are general-purpose and can be applied to a variety of problems.

Surprisingly, in spite of the wide application of CCD imaging and the theoretical interest in the Hubble Sequence, there is no large uniform data set of galaxy images available. The largest available uniform samples include no more than 200 galaxies (e.g. Kent 1985, Simien & de Vaucouleurs 1986, Kodaira et al. 1986). The recent APM-selected sample of Naim et al. (1995a) includes 830 galaxy images. Here we use the ESO-LV data sets, although they are far from optimal for our problem. They are based on plate material, the galaxies were not classified uniformly by one expert (but by Lauberts, Valentijn and Corwin over a decade), and the machine parameters do not optimally reflect structural parameters like spiral arms which are so apparent to the human eye. However, this is a large data set (more than 5000 galaxies) which includes both machine parameters and human classification. The results presented here should be regarded as a lower limit to what can be done with the ANN approach to classification in the future, e.g. with the uniform large samples of CCD images which are currently being measured (e.g. Madore et al., in preparation; White et al., in preparation).

In this paper we shall also address briefly the astrophysical implications of our ANN results. One open question is whether galaxies were formed in a self-similar way, or in a way which mainly depends on their total mass or potential well. For example, Simien & de Vaucouleurs (1986) showed a tight correlation between the disk-to-bulge ratio (a distance independent property) and the Hubble type, while Meisles & Ostriker (1984) argued that the absolute luminosity of the spheroidal component (a distance dependent property) is the major parameter determining the Hubble Sequence. We shall examine this question using ESO-LV diameters.

The structure of the paper is as follows. As the ANN methods are general and are currently scattered in the ANN literature (e.g. in journals of Engineering and Biology), we present them mathematically in Appendices A (Principal Component Analysis and its non-linear extensions), B (Backpropagation and Quasi-Newton minimization algorithms) and C (Bayesian classification, Wiener filtering and weight decay). The main text of the paper gives examples of applications of these methods to the ESO-LV galaxies. Following a general Introduction (§1), §2 presents the ESO-LV parameters. §3 illustrates the use of Principal Component Analysis, while §4 presents a variety of applications of supervised non-linear ANNs. Future work is discussed in §5.

2 THE DATA SETS

Here we illustrate the method using ESO-LV galaxies (Lauberts & Valentijn 1989; hereafter LV89) at high Galactic latitude (|b| > 30°). We shall consider several samples. There are three aspects in defining the samples for training the ANN: the sample selection (e.g. by apparent diameter), the galaxy machine parameters used, and the binning into galaxy classes.

The first sample, composed of 13 galaxy parameters and hereafter called P13, is the same as we used in SLSS, i.e. galaxies with visual diameters larger than 1 arcmin (the claimed completeness of the ESO-LV catalogue). Only galaxies with morphological classification performed by visual examination of the galaxy image were considered in our analysis. We use the 13 catalogue parameters shown in Table 1 of SLSS to describe each galaxy. Briefly they are: (1) the average blue minus red colour, (2) the exponent in the generalized de Vaucouleurs law in the blue, (3) log of the ratio of diameters which include 80% and 50% of the blue light, (4) an indicator of the degree of asymmetry of the galaxy image, (5) the central blue surface brightness, (6) log of the ratio of minor to major diameters, (7) the error in the ellipse fit to blue isophotes, (8) the gradient of the blue surface brightness profile at the half-light radius, (9) log of the ratio of the blue 26 mag/arcsec² diameter and the half-light diameter, (10) the exponent in the generalized de Vaucouleurs law in the red, (11) the average blue surface brightness within a 10 arcsec diameter circular aperture, (12) the blue surface brightness at the half-light radius, (13) the red surface brightness at the half-light radius. These 13 parameters were chosen because they are almost distance-independent, and they are very similar to those used by LV89 to perform the automated classification presented in the ESO-LV catalogue. This allows us to compare meaningfully the success rate of the classifications provided by our ANN with ESO-LV. After selecting only galaxies with all 13 parameters available, our final data set has 5217 galaxies. We then randomly divided these galaxies into two independent sets of 1700 and 3517 objects for training and testing. We also normalized our input data between 0 and 1 by using the minimum and maximum values of each parameter (and also tried normalizing by the variance). We have grouped the ESO-LV catalogue sub-classes in three ways: (i) by keeping the original range of classes -5.0 < T < 10.0, where T is the coded type; (ii) by binning into 5 major classes (as in SLSS): E (-5.0 < T < -2.5; 466 galaxies); S0 (-2.5 < T < 0.5; 851 galaxies); Sa+Sb (0.5 < T < 4.5; 2403 galaxies); Sc+Sd (4.5 < T < 8.5; 1132 galaxies); and Irr (8.5 < T < 10.0; 365 galaxies); and (iii) by splitting into two classes: early type (E+S0, T < 0.5; 1317 galaxies) and late type (T > 0.5; 3900 galaxies).

The second sample, hereafter called D7, is also extracted from ESO-LV. It includes galaxies larger than 2 arcmin (as defined by the old ESO sample) which also have redshift information, and information on 7 diameters. De, D70, D80 and D90 are the major diameters of the ellipses containing 50%, 70%, 80% and 90% of the total B light, while D25, D26 and D27 are the major diameters of the ellipses at B surface brightness of 25, 26 and 27 mag/arcsec². We then converted them into metric diameters using their redshifts. This sample includes 791 galaxies, which were mainly classified by one expert, H. Corwin.

3 HOW INFORMATIVE ARE THE INPUT PARAMETERS ?

A key question when providing an ANN with an input is how many input parameters to present, and how to compress them in an efficient and informative way. There is of course a trade-off between keeping the number of parameters small and the amount of information presented.

3.1 Standard PCA

Principal Component Analysis (PCA) is a widely used method which allows one to judge how many independent parameters are needed, by looking at the directions along which the variance is maximal. The formulation of PCA is given in Appendix A.1. It is worth emphasizing that PCA is only meaningful for linear parameters (or 'the nearest to linear', e.g. by taking the log of the original variables), and may suffer from scaling problems. Nevertheless, it is a useful tool for reducing the dimensionality of the input parameter space. In the context of this paper it can be viewed as a data compression technique for the input to the ANN, as well as an 'unsupervised method' for exploring the parameter space.

We begin by applying the method to the log of the 7 metric diameters given in the D7 sample (of 791 galaxies), each scaled to have zero mean. We do not normalize here by the variance of each variable, as they all have the same metric, and we wish to preserve their relative values. Not too surprisingly, the correlation matrix indicates strong correlations between the log-diameters. Table 1 gives the eigen-values and the eigen-vectors for the log-diameters. 95% of the variance is in the first principal component (which is found to be approximately the average of the 7 log-diameters). However, as we show in §4.4 using the ANN, it is not sufficient to use just the first principal component to represent the 7-dimensional data space for classification.

We then applied PCA to the 13 distance-independent parameters of the P13 sample of 5217 galaxies, with the parameters normalized to have zero mean and unit variance, as here the parameters are made of `apples and oranges'. Indeed, one should be more cautious about applying PCA to a set of parameters which are of different character, and may well be non-linear. However, the results give some insight into the amount of useful information in this parameter space. Tables 2a and 2b give the 13 eigen-values and the eigen-vectors corresponding to the largest 3 eigen-values. We find that the first 7 linear combinations account for 90% of the variance.

The projection of the 13 parameters on the first and second Principal Components is shown in Figures 1a,b. Although the distribution of all galaxies looks like a fuzzy cloud, the different morphological types actually occupy distinct regions in this new parameter space. We see that even E and S0 galaxies can actually be separated. This plot illustrates how PCA can compress a 13-dimensional parameter space into a 2-dimensional space. Although the physical meaning of the new space is not easy to interpret, it allows one to segregate different classes of objects.

3.2 Encoder and Neural PCA

Generally, a multi-layer ANN consists of nodes (analogous to human neurons) arranged in a series of layers. The nodes in a given layer are fully connected to the nodes in the next layer. The free parameters of the ANN are the weights $w_{ij}$, which are determined by least-squares minimization of the difference between the actual and the desired output, the so-called `cost function'

$$E = \frac{1}{2} \left\langle \sum_{k} (o_k - d_k)^2 \right\rangle, \eqno(1)$$

where the sum is over the components of the output vector ($k = 1, \ldots, M$) and the average is over the galaxies. Layers between the input and the output layers are called `hidden layers' and allow non-linear mapping. The least-squares minimization can be done by a variety of efficient algorithms, e.g. Backpropagation and Quasi-Newton, which are described in detail in Appendix B.

We begin by demonstrating an encoder network in which the desired output is the input itself, as explained in detail in Appendix A.2. Figure 2 shows an M : M' : M network configuration, where M' is the number of `neck' units (in the `hidden layer'), or the number of linear combinations in PCA language. While a linear network in this configuration simply reproduces standard PCA, a non-linear transfer function allows `non-linear PCA'.

We now apply a non-linear encoder network, with a sigmoid threshold function. In Figure 3 we plot the cost function (calculated over the 5217 ESO-LV galaxies) vs. the number of hidden units M'. Clearly, if the 13 input parameters were uncorrelated, we would need 13 hidden units to fully recover the 13 parameters at the output layer. If, on the other hand, the 13 parameters were identical, then one hidden unit would be sufficient. The figure shows that the cost function drops roughly exponentially as a function of the number of hidden units. This behaviour may serve as guidance in selecting the number of hidden units for the classification network (see below). Serra-Ricart et al. (1993) have developed this unsupervised approach further, illustrating for our P13 data set that a non-linear encoder can identify classes in this data set much better than standard PCA. Other algorithms for neural PCA, such as Oja's rule, are discussed in Appendix A.3.
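To make the encoder idea concrete, the following is a minimal sketch of an M : M' : M sigmoid encoder trained by gradient descent, in Python/NumPy rather than anything used in the original work (bias terms are omitted for brevity, and the function names are our own):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_encoder(X, n_neck, lr=0.5, epochs=2000, seed=0):
    """Train an M:M':M encoder whose desired output is the input itself.
    X is (N, M) with entries scaled to [0, 1]; returns weights and cost history."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    W1 = rng.normal(0.0, 0.1, (M, n_neck))    # input -> neck units
    W2 = rng.normal(0.0, 0.1, (n_neck, M))    # neck units -> output
    costs = []
    for _ in range(epochs):
        H = sigmoid(X @ W1)                   # neck activations
        O = sigmoid(H @ W2)                   # reconstructed input
        err = O - X
        costs.append(0.5 * np.mean(np.sum(err**2, axis=1)))   # cf. eq. (1)
        dO = err * O * (1 - O)                # backpropagate reconstruction error
        dH = (dO @ W2.T) * H * (1 - H)
        W2 -= lr * H.T @ dO / N
        W1 -= lr * X.T @ dH / N
    return W1, W2, costs

Plotting the final cost against n_neck reproduces the kind of curve shown in Figure 3.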

4 SUPERVISED CLASSIFICATION OF GALAXIES WITH ANNs

We now apply a `standard' supervised ANN to classifying the ESO-LV galaxies. In the `training' process, the input vectors, containing the galaxy parameters, are presented to the network. The weights (`free parameters') $w_{ij}$ are computed by least-squares minimization with the Backpropagation or Quasi-Newton algorithms (explained in detail in Appendix B). The ANN is then ready to handle new unclassified data for which only the machine parameters are available. We shall present 3 different net configurations: (i) a single-output (`analog') network, (ii) a two-class classifier, and (iii) a 5-class classifier. We wish to emphasize that supervised ANNs do not produce an `objective', unique classification. Supervised networks replicate the choices of their trainer: a network trained according to the classification made by Hubble or de Vaucouleurs will classify new data in a manner similar to the original expert.

4.1 Single Continuous Output

Although the galaxy morphology is probably a continuous sequence (Hubble 1936), human experts provide us with a 'true answer' usually given in quantized units, to a first decimal point, e.g. T = 5.3. It is to our benefit that the single output configuration of the network can approximate a one-dimensional continuous sequence.

It is common in Astronomy to fit a model with several free parameters to the observations. This regression is usually done by means of $\chi^2$ minimization. A simple example of a model is a polynomial with the coefficients as the free parameters. Consider now the specific problem of morphological classification of galaxies. If the type is T (e.g. on the numerical system [-5, 10]), and we have a set of parameters $\mathbf{x}$ (e.g. isophotal diameters and colours), then we would like to find the free parameters $\mathbf{w}$ (`weights') such that the cost function

$$\sum_i \left[ T_i - f(\mathbf{w}, \mathbf{x}_i) \right]^2, \eqno(2)$$

is minimized. The function $f(\mathbf{w}, \mathbf{x})$ is the `network'. Commonly f is written in terms of the variable

$$z = \sum_k w_k x_k, \eqno(3)$$

where the sum here is over the input parameters to each node. A `linear network' has $f(z) = z$, while a non-linear threshold function could be a sigmoid $f(z) = 1/[1+\exp(-z)]$ or $f(z) = \tanh(z)$. Another element of non-linearity is provided by the `hidden layers', which allow curved boundaries around clouds of data points in the parameter space. A typical configuration with one `hidden layer' and a single output is shown in Figure 4. While in most computational problems we only have 10-1000 nodes, the brain has $\sim 10^{10}$ neurons, each with $\sim 10^4$ connections. We do not, of course, regard our simple ANN algorithm as a model for the human brain, but rather as a non-linear statistical method.

The determination of many free parameters, the weights $w_i$ in our case, might be unstable. It is therefore convenient to regularise the weights, e.g. by preventing them from growing too large. In the ANN literature this is called `weight decay'. This approach is analogous to Maximum Entropy, and can be justified by Bayesian arguments, with the regularising function acting as the prior in the weight space. One possibility is to add a quadratic prior to the cost function and to minimize

$$E_{tot} = \alpha E_w + \beta E_D, \eqno(4)$$

where ED is our usual cost function, based on the data (e.g. eq. 2) and

$$E_w = \frac{1}{2} \sum_{i=1}^{Q} w_i^2 \eqno(5)$$


is the chosen regularising function, where Q is the total number of weights. The coefficients $\alpha$ and $\beta$ can be viewed as `Lagrange multipliers'. While sometimes they are specified ad hoc, it is possible to evaluate them `objectively' by Bayesian arguments in the weight space. We discuss this procedure in detail in Appendix C.3.

To illustrate the above ideas, we built a network with configuration 13:3:1, resulting in 46 free weights (including the `bias' node, which represents an additive constant). In the training process the network was presented with the 13 parameters (from sample P13) for each galaxy, using a subset of 1700 galaxies. Both the input parameters and the `true answer' T-type (in the range $-5 \le T \le 10$) were scaled to the range [0, 1], so all the weights were treated on an equal footing in the regularisation process. The transfer function used was a sigmoid. By the procedure outlined in Appendix C.3 we found the weight-decay regularisation coefficient to be 0.001. We then applied least-squares minimization using a Quasi-Newton method (as implemented in a code kindly provided to us by B.D. Ripley).

As in other optimization problems, it is crucial to decide when to stop the minimization. One approach is to stop when the cost function drops below a certain value, or changes little between successive iterations. However, particularly when the sample size is small (relative to the number of weights), this may result in `over-fitting' (`memorising') of the data (including the noise). Usually the cost function with respect to the training set shows a monotonic decline, and it is difficult to define a minimum for stopping. Instead, we calculate at each iteration the cost function for the testing set (with the weights derived, of course, from the training set). In this way we monitor the ability of the ANN to `generalize' its choice of weights to data it was not trained on. Usually the cost function with respect to the testing set decreases to a minimum and then increases, so it is easy to decide where to stop according to this minimum.
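A sketch of this stopping rule, with a simple linear model trained by gradient descent standing in for the ANN (the function name and the patience threshold are our own assumptions):

import numpy as np

def fit_with_early_stopping(train, test, lr=0.01, max_iters=5000, patience=50, seed=0):
    """Stop the minimization at the minimum of the *testing* cost, as described above.
    train and test are (X, T) pairs, X of shape (N, M) and T of shape (N,)."""
    (Xtr, Ttr), (Xte, Tte) = train, test
    w = np.random.default_rng(seed).normal(0.0, 0.1, Xtr.shape[1])
    best_cost, best_w, stale = np.inf, w.copy(), 0
    for _ in range(max_iters):
        w -= lr * Xtr.T @ (Xtr @ w - Ttr) / len(Ttr)   # one training-set step
        c_test = np.mean((Xte @ w - Tte) ** 2)         # monitor generalization
        if c_test < best_cost:
            best_cost, best_w, stale = c_test, w.copy(), 0
        else:
            stale += 1
            if stale > patience:    # testing cost has passed its minimum: stop
                break
    return best_w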

Once the training phase was completed, we presented the network with a testing set (of 1700 galaxies) for which the human classification is also known. On a Sun Sparc workstation the training of the network on 1700 galaxies takes about 1 min (CPU), while testing on a sample of similar size takes only 1 sec (CPU).

Figure 5 shows the network type $T_{net}$ versus the ESO-LV human classification $T_{eso}$. The Spearman rank-order correlation coefficient is $r_s = 0.83$. As another way of quantifying the network performance we calculate the variance between the network and the ESO-LV type over the number of galaxies $N_{gal}$:

$$\sigma^2 = \frac{1}{N_{gal}} \sum (T_{net} - T_{eso})^2, \eqno(6)$$

and we find $\sigma \approx 2.0$ T-units. With a similar statistic we can compare runs of the same network configuration which start the minimization with different random weights. Figure 6 shows the results of two such runs. The scatter between the two runs is much smaller than that in Figure 5. Here the Spearman coefficient is $r_s = 0.98$ and the typical `reproducibility' scatter is $\sigma \approx 0.6$ T-units. In Figure 6 we note a `break' in the transition from early type (T < 0) to late type (T > 0). It may be that the non-linearity of the network was not sufficient to fit both classes with the same weights (i.e. in each minimization the net finds a different compromise of weights to satisfy both early and late type galaxies), or that the quality of the parameters for early and late types is different.
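The two statistics used above can be computed, for example, as follows (SciPy/NumPy notation, not the authors' code):

import numpy as np
from scipy.stats import spearmanr

def compare_types(t_net, t_eso):
    """Spearman rank-order correlation and rms scatter (eq. 6) between two type lists."""
    r_s = spearmanr(t_net, t_eso).correlation
    sigma = np.sqrt(np.mean((np.asarray(t_net) - np.asarray(t_eso)) ** 2))
    return r_s, sigma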

In the study of the blue APM images (Naim et al. 1995b, Lahav et al. 1995) we have shown that ANNs can replicate the classification by a human expert to the same degree of agreement as that between two human experts, to within 1.8 T-units (based on a comparative study where the same images were classified by 6 experts independently). The ESO-LV data give a slightly weaker result, 2 T-units, probably due to the parameters being less informative, although they include the blue minus red colour as a parameter, which is lacking in the APM sample.

4.2 Two-Class (E, S) Classifier

In a network with multiple outputs, the output vector can be interpreted in a probabilistic way. The j-th component of this vector can be viewed as the probability for class j given the input parameters, $P(C_j|\mathbf{x})$. In fact, it can be proved theoretically (Appendix C.1) that the output of an ideal ANN is indeed a Bayesian a posteriori probability. Moreover, as our experiments confirm, the sum of the output vector components is $\sum_k o_k \approx 1$, as expected for a probabilistic classifier. It is worth noting that, unlike the discrete classification of hand-written characters, galaxies form a continuous sequence. Hence the combination of probabilities assigned to different `eigen-classes' may reflect an intermediate class. The `most likely class' can be defined as the class associated with the largest output component. Here we do not include weight decay, as when it is included the interpretation of the output is no longer strictly Bayesian (see Appendix C.1).

To classify into early type ($-5 < T < 0.5$) and late type ($0.5 < T < 10$) galaxies we have used a Backpropagation algorithm (Appendix B) and a network configuration of 13:10:2 with a tanh threshold function, a learning coefficient $\eta = 0.01$ and a momentum coefficient $\alpha = 0.9$ (see eq. B6). The network was trained on 1700 ESO-LV galaxies, and was tested on the remaining 3517 galaxies. Of the 898 galaxies classified as early type by ESO-LV, 681 were classified as such by the network, while 217 were classified as late type. Of the 2619 galaxies classified as late type by ESO-LV, 2471 were classified as such by the network, while 148 were classified as early type. This means an overall success rate of 90%.

Breaking down the early class into ellipticals and lenticulars (S0s) demonstrates that the vast majority of the disagreement is in the classification of S0s. Of the 311 galaxies classified as ellipticals ($-5 < T < -2.5$) by ESO-LV, the ANN agreed on 94% of them and disagreed on only 6%. On the other hand, of the 587 galaxies classified as S0 ($-2.5 < T < 0.5$) by ESO-LV, the ANN agreed on only 66% and disagreed on 34%. This is yet another indication that the S0s form a `transition class' along the Hubble sequence.

4.3 Five-Class Classifier

This is essentially the network we presented in SLSS. The input layer consists of the 13 parameters and the output layer consists of the 5 classes described in section 2. The configuration used was 13:13:5, with a sigmoid as our non-linear transfer function. The learning and momentum coefficients were kept constant at $\eta = 0.5$ and $\alpha = 0.2$ for all layers.

During training (using 1700 ESO-LV galaxies of the P13 sample), the ANN compared the output of these five nodes to the visual classification decisions of LV89. We then tested the ANN against the remaining 3517 galaxies of the P13 sample. Morphological classification was performed by assigning each galaxy to the class corresponding to the maximal output component. Further experiments, carried out with a variety of network configurations, showed little effect of the number of hidden units and layers, the epoch, the size of the training set, the number of iterations, and the learning and momentum coefficients.

Our main result, shown in Table 2 of SLSS, still stands. The percentage of galaxies correctly classified was 64% (and 96% to within the nearest class; if either the first or the second highest output is considered in the comparison with the visual classification, the success rate is 90%). On the other hand, a simple Bayesian classifier we constructed (assuming a Gaussian multivariate distribution, see Appendix C.1, eqs. C1 and C2) gave only 56%. This is the same success rate reported by LV89 for their linear classifier. This clearly shows that non-linear ANNs can be superior to linear classifiers, and that the classifier itself is of great importance, not only the parameters.

4.4 PCA data compression as input to ANNs

In this section we address the question of how many Principal Components are needed to recover the same classification achieved with an ANN using the full input data. To illustrate this point we use the D7 sample of 791 ESO-LV galaxies, where the input parameters are the log of the 7 metric diameters. The network architecture is 7:3:1, with both input and output scaled to [0, 1], using the Quasi-Newton algorithm with a weight-decay coefficient of 0.001. Training was done on 600 galaxies, and testing on the remaining 191 galaxies. The resulting rms scatter (eq. 6), evaluated over several runs, is $\sigma = 2.2$. Using as input only the first Principal Component, which was derived in section 3.1 and accounts for 95% of the variance, we find a much larger scatter, $\sigma = 3.6$. Only when the first 3 PCs are used does one recover the scatter achieved by using all 7 diameters. This shows that the fractional variance on its own is not sufficient to tell us how many PCs are needed for classification.

 


The failure of the first Principal Component to recover the classification on its own might be due to non-linearity in the data, the effect of noise on the derivation of the Principal Components, or the fact that classification requires more information than that given just by the maximal variance (i.e. the second moment of the distribution function). We note that commonly the fractional variance of the eigen-values is used as the sole criterion in compressing the data prior to applying a classification procedure (e.g. in deriving the `concentration parameter' of Okamura et al. 1984). However, this criterion may underestimate the importance of the minor Principal Components. It may well be that classification can be improved by using more Principal Components. Furthermore, our experience shows that in some cases minor Principal Components are more important than major ones.

4.5 Scaled parameters vs. absolute parameters

So far, in this paper as well as in our previous studies (SLSS, Naim et al. 1995b, Lahav et al. 1995), we have not used the distance (as estimated from the redshift) to the galaxies. Our input parameters were always scaled such that they were distance-independent. In a sense, we have assumed that what matters in classification are the relative properties of the galaxies, e.g. that two ellipticals with high and low absolute luminosities will be classified as the same type if one is a scaled-down (or up) version of the other.

The astrophysical question of whether galaxies formed in a self-similar way, or in a way which mainly depends on their total mass or potential well, is still open. For example, Simien & de Vaucouleurs (1986) showed a tight correlation between the disk-to-bulge ratio (a distance-independent property) and the Hubble type, while Meisels & Ostriker (1984) argued that the absolute luminosity of the spheroidal component (a distance-dependent property) is the major parameter determining the Hubble Sequence.

To test this question we have used the D7 data as described in §4.4 and fed the ANN with the log of the ratios of 6 diameters to the half-light diameter. The resulting scatter was larger, $\sigma = 2.4$, compared with a scatter of 2.2 when all 7 metric diameters were presented. Our tentative conclusion is that absolute parameters are not much more informative than the scaled properties. However, the quality of the data and the parameters used (diameters) are not sufficient to prove the theoretical prejudice some may have that only scaled (self-similar) properties control the fate of a galaxy along the Hubble Sequence.


5 DISCUSSION

In this paper we have attempted to de-mystify ANNs by showing how they generalize other statistical methods commonly used in Astronomy and other fields. The methods were illustrated using the ESO-LV galaxy data, showing that ANNs can successfully replicate human classification. These results for ESO-LV are in accord with our results for the APM sample of 830 galaxy images (Lahav et al. 1995, Naim et al. 1995b): an ANN can replicate the classification by a human expert to within 2 T-type units, similar to the scatter between two human experts.

ANNs are sometimes considered to be esoteric methods. Questions commonly asked by `neuro-sceptics' are: (i) Can we understand what the ANNs are doing, or are they just `black boxes'? (ii) If one has already selected `good parameters', does it matter what classifier is used? We have shown that the ANN approach should be viewed as a general statistical framework. Some special cases of ANNs are statistics we are all familiar with. However, ANNs can do better, by allowing non-linearity. There is of course freedom in choosing what kind of `non-linearity' to apply, but sensible choices show that significant improvement can be achieved over the linear approaches. For cosmologists, there is an analogy here with N-body simulations of gravitational systems. Linear theory is reasonably well understood, but is not sufficient to describe complicated dynamics. One then needs to use numerical simulations, producing results which are not always understood by intuition or by analytic methods. However, one can verify what is happening by considering simple cases (e.g. the spherical infall model) to gain confidence in what the simulations give. Our approach to ANNs is similar.

This paper does not, of course, cover all possible approaches to classification. For example, as described in Appendix C.2, one can use just a linear network (in which the weights effectively act like a Wiener filter), but modify the input parameters to be non-linear (in a somewhat ad hoc way). In some cases such networks can do as well as the non-linear ANNs. Another important issue, not discussed here, is how to handle noisy data.

An even more challenging task is to devise `unsupervised' algorithms, where there is no external `teacher' and the data speak for themselves. Such methods could be either `cooperative' (e.g. PCA, the non-linear encoder, or the Kohonen 1989 self-organizing map) or `competitive' (e.g. cluster analysis). For preliminary applications of unsupervised methods to galaxy classification see Naim (1995). These unsupervised algorithms may well discover features in the data set which were previously ignored by the human experts.

On the more astrophysical side, the goals are to incorporate dynamical properties of galaxies (e.g. circular velocities) and multi-wavelength data (from radio to the UV). The hope is that these methods will help define a new physical space of galaxies, in analogy with the H-R diagram for stars.

Acknowledgments. We thank W. Fitzgerald, S. Folkes, J. Hertz, M. Irwin, J. Lasenby, D. Lynden-Bell, D. MacKay, B. Ripley and T. von Hippel for stimulating discussions. OL acknowledges the hospitality of the Hebrew University and the Weizmann Institute, where parts of this paper were written; AN thanks an Overseas Research Studentship and an Isaac Newton Studentship; LSJ acknowledges FAPESP and CNPq for financial support; and MCSL thanks the Sheepshanks Fund and the Harrison-Watson Foundation of Clare College for financial support.

REFERENCES

Abraham, R., Valdes, F., Yee, H.K.C. & van den Bergh, S., 1994, ApJ, 432, 75
Angel, J.R.P., Wizinowich, P., Lloyd-Hart, M. & Sandler, D., 1990, Nature, 348, 221
Doi, M., Fukugita, M. & Okamura, S., 1993, MNRAS, 264, 832
Francis, P., Hewett, P.C., Foltz, C.B. & Chaffee, F.H., 1992, ApJ, 398, 480
Fraser, D.D., Khan, Z. & Levy, D.C., 1992, in Artificial Neural Networks, eds. Aleksander, I. & Taylor, J., Elsevier, Amsterdam
Geva, S. & Sitte, J., 1992, IEEE, 3, 621
Gish, H., 1990, in Proceedings of the IEEE Conference on Acoustics, Speech and Signal Processing, p. 1361
Gull, S.F., 1989, in Maximum Entropy and Bayesian Methods, ed. J. Skilling, p. 53, Kluwer
Hebb, D.O., 1949, The Organization of Behavior, Wiley, New York
Hertz, J., Krogh, A. & Palmer, R.G., 1991, Introduction to the Theory of Neural Computation, Addison-Wesley, Redwood City, California
Hubble, E., 1936, The Realm of the Nebulae, Yale University Press, New Haven
Kent, S., 1985, ApJ Sup, 59, 115
Kodaira, K., Watanabe, M. & Okamura, S., 1986, ApJ Sup, 62, 703
Kohonen, T., 1989, Self-Organization and Associative Memory, 3rd edition, Springer-Verlag
Lahav, O., 1995, in The World of Galaxies II, Lyon 1994, eds. G. Paturel & C. Petit, Astrophysical Letters & Communications, 31, 73, Gordon & Breach
Lahav, O. & Gull, S.F., 1989, MNRAS, 240, 753
Lahav, O., Naim, A., Buta, R.J., Corwin, H.G., de Vaucouleurs, G., Dressler, A., Huchra, J.P., van den Bergh, S., Raychaudhury, S., Sodre, L. Jr. & Storrie-Lombardi, M.C., 1995, Science, 267, 859
Lasenby, J. & Fitzgerald, W.J., 1993, CUED/F-INENG/TR.142
Lauberts, A. & Valentijn, E.A., 1989, The Surface Photometry Catalogue of the ESO-Uppsala Galaxies, ESO
MacKay, D.J.C., 1992, PhD thesis, Caltech
Meisels, A. & Ostriker, J.P., 1984, AJ, 89, 1451
Miller, A.S., 1993, Vistas in Astronomy, 36(2), 141
Murtagh, F. & Heck, A., 1987, Multivariate Data Analysis, Reidel, Dordrecht
Naim, A., 1995, PhD thesis, Cambridge University
Naim, A., Lahav, O., Buta, R.J., Corwin, H.G., de Vaucouleurs, G., Dressler, A., Huchra, J.P., van den Bergh, S., Raychaudhury, S., Sodre, L. Jr. & Storrie-Lombardi, M.C., 1995a, MNRAS, 274, 1107
Naim, A., Lahav, O., Sodre, L. Jr. & Storrie-Lombardi, M.C., 1995b, MNRAS, in press
Odewahn, S.C., Stockwell, E.B., Pennington, R.L., Humphreys, R.M. & Zumach, W.A., 1991, AJ, 103, 318
Oja, E., 1982, Journal of Mathematical Biology, 15, 267
Oja, E., 1992, Neural Networks, 5, 927
Okamura, S., Kodaira, K. & Watanabe, M., 1984, ApJ, 280, 7
Pao, Y.H., 1989, Adaptive Pattern Recognition and Neural Networks, Addison-Wesley, New York
Parker, D.B., 1985, Report TR-47, MIT Center for Computational Research in Economics and Management Science, Cambridge, MA
Press, W.H., Teukolsky, S.A., Vetterling, W.T. & Flannery, B.P., 1992, Numerical Recipes, 2nd edition, Cambridge University Press, Cambridge
Rayner, J.W. & Lynch, M.R., 1989, IEEE, D7.10
Richard, M.D. & Lippmann, R.P., 1991, Neural Computation, 3, 461
Ripley, B.D., 1993, in Statistics and Images, ed. K.V. Mardia, Carfax, Abingdon
Rumelhart, D.E., Hinton, G.E. & Williams, R.J., 1986, Nature, 323, 533
Rybicki, G.B. & Press, W.H., 1992, ApJ, 398, 169
Sanger, T.D., 1989, Neural Networks, 2, 459
Serra-Ricart, M., Calbet, X., Garrido, L. & Gaitan, V., 1993, AJ, 106, 1685
Simien, F. & de Vaucouleurs, G., 1986, ApJ, 302, 564
Spiekermann, G., 1992, AJ, 103, 2102
Storrie-Lombardi, M.C., Lahav, O., Sodre, L. & Storrie-Lombardi, L.J., 1992, MNRAS, 259, 8P
Storrie-Lombardi, M.C. & Lahav, O., 1994, guest eds., Vistas in Astronomy, special issue on ANNs in Astronomy, 38(3)
Storrie-Lombardi, M.C. & Lahav, O., 1995, in Handbook of Brain Theory and Neural Networks, ed. M.A. Arbib, MIT Press, Boston
Thonnat, M., 1989, in The World of Galaxies, p. 53, eds. H.G. Corwin, Jr. & Bottinelli, L., Springer-Verlag, New York
von Hippel, T., Storrie-Lombardi, L.J., Storrie-Lombardi, M.C. & Irwin, M., 1994, MNRAS, 269, 97
Werbos, P.J., 1974, PhD thesis, Harvard University, Cambridge, MA
Wiener, N., 1949, Extrapolation and Smoothing of Stationary Time Series, Wiley, New York
Zaroubi, S., Hoffman, Y., Fisher, K.B. & Lahav, O., 1995, ApJ, in press

 


APPENDIX A: PRINCIPAL COMPONENT ANALYSIS (PCA) AND ANN

A.1 Standard PCA

A pattern can be thought of as being characterized by a point in an M-dimensional parameter space. One may wish for a more compact data description, where each pattern is described by M' quantities, with $M' \ll M$. This can be accomplished by Principal Component Analysis (PCA), a well-known statistical tool, commonly used in Astronomy (e.g. Murtagh & Heck 1987 and references therein). The PCA method is also known in the literature as the Karhunen-Loeve or Hotelling transform, and is closely related to the technique of Singular Value Decomposition. By identifying the linear combinations of input parameters with maximum variance, PCA finds the M' variables (`Principal Components') that can be most effectively used to characterize the inputs.

The first Principal Component is taken to be along the direction in the M-dimensional input parameter space with the maximum variance. More generally, the k-th component is taken along the maximum variance direction in the subspace perpendicular to the subspace spanned by the first (k- 1) Principal Components. It is convenient to apply PCA to data already standardized, e.g. transformed to zero-mean and unit-variance. However, while this scaling is appropriate for data composed of `apples and oranges' as in the present paper for the 13 ESO-LV parameters, in other problems such as the 7 ESO-LV diameters and spectral analysis of quasars and galaxies (cf. Francis et al. 1992, Lahav 1995) it is more sensible not to divide by the variance of each channel (over an ensemble of objects), in order to keep the relative strength of the lines.

The formulation of standard PCA is as follows. Consider a set of N objects ($i = 1, \ldots, N$), each with M parameters ($j = 1, \ldots, M$). If $r_{ij}$ are the original measurements, we construct normalized quantities as follows:

$$x_{ij} = \frac{r_{ij} - \bar{r}_j}{s_j}, \eqno(A1)$$

where $\bar{r}_j = \frac{1}{N}\sum_{i=1}^{N} r_{ij}$ is the mean and $s_j^2 = \frac{1}{N}\sum_{i=1}^{N} (r_{ij} - \bar{r}_j)^2$ is the variance. We then construct a correlation matrix

$$C_{jk} = \frac{1}{N}\sum_{i=1}^{N} x_{ij}\, x_{ik}. \eqno(A2)$$

It can be shown that the axis along which the variance is maximal is the eigen-vector $\mathbf{u}_1$ of the matrix equation

$$\mathbf{C}\,\mathbf{u}_1 = \lambda_1 \mathbf{u}_1, \eqno(A3)$$

where $\lambda_1$ is the largest eigen-value, which is in fact the variance along the new axis. The other principal axes and eigen-vectors obey similar equations. It is convenient to sort them in decreasing order, and to quantify the fractional variance by $\lambda / \sum \lambda$. It is also convenient to re-normalize each component by $\sqrt{\lambda}$, to give unit variance along each new axis. We note that the weakness of PCA is that it assumes linearity and also depends on the way the variables are scaled. In contrast, ANNs generally allow non-linearity.
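For reference, a compact NumPy sketch of the standard PCA just formulated (eqs. A1-A3); the function name and the unit_variance switch are our own:

import numpy as np

def standard_pca(R, unit_variance=True):
    """Standardize the data, build the correlation matrix and diagonalize it.
    R is (N, M); returns eigen-values (descending), eigen-vectors (columns)
    and the data projected onto the Principal Components."""
    X = R - R.mean(axis=0)
    if unit_variance:                 # appropriate for `apples and oranges' parameters
        X = X / X.std(axis=0)
    C = X.T @ X / len(X)              # correlation matrix (eq. A2)
    lam, U = np.linalg.eigh(C)        # eigh, since C is symmetric (eq. A3)
    order = np.argsort(lam)[::-1]     # sort in decreasing order of variance
    lam, U = lam[order], U[:, order]
    return lam, U, X @ U              # fractional variance: lam / lam.sum()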

A.2 Principal Components from Encoder

PCA is in fact an example of `unsupervised learning', in which an algorithm or a `linear network' discovers features and patterns for itself (see e.g. Hertz et al. 1991 for a review). A simple net configuration M : M' : M (see Fig. 2) with linear transfer functions allows one to find M' linear combinations of the original M parameters. The idea is to force the output layer to reproduce the input layer, by least-squares minimization (e.g. using the Backpropagation algorithm, see Appendix B). If the number of `neck units' M' equals M, then the output will exactly reproduce the input. However, if M' < M, the net will find, after minimization, the optimal linear combinations. By changing the threshold function from linear to non-linear (e.g. a sigmoid) one can allow `non-linear PCA'. Some authors (e.g. Geva & Sitte 1992; Serra-Ricart et al. 1993) advocate a configuration of M : 2M+1 : M' : 2M+1 : M to get optimal reconstruction of non-linear shapes, e.g. a circle.

A.3 Neural PCA: Oja's Neural Network

One interesting aspect of ANN theory is that a very simple artificial neuron can be trained to extract the first Principal Component of the input parameters (Oja 1982). Consider an artificial neuron which receives a set of n scalar-valued inputs x1, ..., xn through n connections with coupling strengths (weights) w1, ..., wn and produces an output Y :

$$Y = \sum_{i=1}^{n} w_i x_i. \eqno(A4)$$


During the training of this neuron, the weights wi can be changed after the presentation of a pattern according to the Hebbian rule (Hebb 1949),

$$\Delta w_i = \eta\, Y x_i, \eqno(A5)$$

where $\eta$ controls the rate of learning. Oja (1982) generalized this rule by incorporating a normalization:

$$w_i(t+1) = \frac{w_i(t) + \eta\, Y x_i}{\left\{ \sum_{j=1}^{n} \left[ w_j(t) + \eta\, Y x_j \right]^2 \right\}^{1/2}}, \eqno(A6)$$

where $\eta$ is the `learning coefficient'. Expanding this expression as a power series in $\eta$ and retaining only the first-order term yields the learning equation known as Oja's rule:

$$w_i(t+1) = w_i(t) + \eta\, Y(t) \left[ x_i - Y(t)\, w_i(t) \right]. \eqno(A7)$$

After training, Oja's rule chooses the direction of the weight vector $\mathbf{w}$ to lie along the maximal eigenvector of the correlation matrix $\langle \mathbf{x}\,\mathbf{x}^T \rangle$ (assuming zero-mean data, and using matrix notation here; $\mathbf{x}^T$ being the transposed vector). Moreover, this turns out to be also the direction which maximizes the variance of the output, $\langle Y^2 \rangle = \mathbf{w}^T \langle \mathbf{x}\,\mathbf{x}^T \rangle \mathbf{w}$ (see e.g. Hertz et al. 1991). Oja (1982) also showed that, after training, the normalization $\sum_{i=1}^{n} w_i^2$ tends to be bounded and close to one. Other rules to extract the first and higher Principal Components have been proposed, e.g. by Sanger (1989) and Oja (1992). While these learning rules give insight into the link between PCA and ANNs, it is easier in practice to extract the Principal Components by the standard method (Appendix A.1) or by a linear encoder (Appendix A.2).
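A minimal sketch of Oja's neuron (eqs. A4 and A7) in NumPy, assuming zero-mean input data; the names are our own:

import numpy as np

def oja_first_pc(X, eta=0.01, epochs=50, seed=0):
    """Extract the first Principal Component with Oja's rule. X is (N, n), zero-mean."""
    w = np.random.default_rng(seed).normal(0.0, 0.1, X.shape[1])
    for _ in range(epochs):
        for x in X:
            y = w @ x                      # neuron output (eq. A4)
            w += eta * y * (x - y * w)     # Oja's update (eq. A7)
    return w / np.linalg.norm(w)           # direction of maximal variance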

APPENDIX B: MINIMIZATION ALGORITHMS

B.1 The Backpropagation method

The Backpropagation algorithm has been re-invented several times (e.g. Werbos 1974; Parker 1985; Rumelhart, Hinton & Williams 1986) and is one of the most popular ANN algorithms. A typical configuration is shown in Fig. 4. For a given network architecture the first step is the `training' of the ANN. In this step the weights $w_{ij}$ (the `free parameters') are determined by least-squares minimization. The novel aspect of Backpropagation is the way this minimization is done, using the chain rule (gradient descent).


Each node (except the input nodes) receives the output of all nodes in the previous layer and produces its own output, which then feeds the nodes in the next layer. A node j at layer s calculates a linear combination over the inputs $x_i^{(s-1)}$ from the previous layer $s-1$ according to

$$I_j^{(s)} = \sum_{i=0}^{n} w_{ij}^{(s)}\, x_i^{(s-1)}, \eqno(B1)$$


where the $w_{ij}$ are the weights associated with that node. Commonly one takes $x_0 = 1$, with $w_{0j}$ playing the role of a `bias' or DC level. The node then fires a signal

$$x_j^{(s)} = f(z), \eqno(B2)$$

where z here stands for $I_j^{(s)}$, and f is a non-linear transfer function, usually of the sigmoid form

$$f(z) = 1/[1 + \exp(-z)] \eqno(B3)$$

in the interval [0,1], or

$$f(z) = \tanh(z) \eqno(B4)$$

in the interval [-1,1].

For each object (pattern) in the training set, the network compares its output vector in the `classification space', $\mathbf{o}$, to the desired vector $\mathbf{d}$ determined by the `true answer' (e.g. as given by a human expert). For example, the elements of the vector $\mathbf{d}$ can be defined as zeros except for one element set to 1 corresponding to the actual class, e.g. $\mathbf{d} = (1, 0, 0, 0, 0)$ for elliptical galaxies.

The comparison is done in terms of a cost function, usually of the form

$$E = \frac{1}{2} \sum_k (o_k - d_k)^2, \eqno(B5)$$

where the sum is over the components of the vectors. This cost function, averaged over all the training galaxies presented to the ANN, is minimized with respect to the free parameters, the weights $w_{ij}$. The weights are updated by gradient descent backwards (hence the name Backpropagation) from the output layer to one or more hidden layers, by a small change at each time step,

$$\Delta w_{ij}(t+1) = -\eta\, \frac{\partial E}{\partial w_{ij}} + \alpha\, \Delta w_{ij}(t), \eqno(B6)$$


where the `learning coefficient' $\eta$ and the `momentum' $\alpha$ are `knobs' which control the rate of learning and the inertia from the previous iteration, respectively (see e.g. Hertz et al. 1991).

The elegance of the Backpropagation algorithm lies in the way the derivative is evaluated. Let us consider the case of a sigmoid output

$$o_j = x_j^{(s)} = f(I_j^{(s)}), \eqno(B7)$$

where

$$f(I) = \frac{1}{1 + e^{-I}}. \eqno(B8)$$

In this case note that $f' = f(1-f)$, and the derivative can be written as

$$\frac{\partial E}{\partial w_{ij}} = o_i^{(s-1)}\, o_j^{(s)} (1 - o_j^{(s)})\, \delta_j, \eqno(B9)$$

where

$$\delta_j = (o_j - d_j) \eqno(B10)$$

for nodes in the output layer, and

$$\delta_j = \sum_k w_{jk}\, o_k (1 - o_k)\, \delta_k \eqno(B11)$$

for nodes in hidden layers (the sum is over the k nodes in the layer above node j).

The `hidden layers' allow curved boundaries around clouds of data points in the parameter space in a non-parametric way. The interpretation of the output depends on the network configuration. For example, a single output node provides a continuous output (e.g. predicting the T-type as in §4.1, or the luminosity of a galaxy), while several output nodes can be used to assign probabilities to different classes (e.g. 5 morphological types of galaxies), as explained in Appendix C.
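For concreteness, here is a sketch of one Backpropagation update for a single pattern on a network with one hidden layer and sigmoid nodes, in NumPy. It uses the common formulation in which the sigmoid derivative is folded into the deltas, algebraically equivalent to eqs. B9-B11; the momentum term of eq. B6 is omitted for brevity:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, d, W1, W2, eta=0.1):
    """One gradient-descent update. x: (M,) input; d: (K,) desired output;
    W1: (H, M+1) input->hidden weights; W2: (K, H+1) hidden->output weights.
    Column 0 of each weight matrix multiplies the bias input x0 = 1 (eq. B1)."""
    xb = np.append(1.0, x)                              # prepend bias node
    h = sigmoid(W1 @ xb)                                # hidden activations (B1, B2)
    hb = np.append(1.0, h)
    o = sigmoid(W2 @ hb)                                # output layer
    delta_o = (o - d) * o * (1 - o)                     # output deltas (cf. B9, B10)
    delta_h = (W2[:, 1:].T @ delta_o) * h * (1 - h)     # hidden deltas (cf. B11)
    W2 = W2 - eta * np.outer(delta_o, hb)               # descend the gradient (B6)
    W1 = W1 - eta * np.outer(delta_h, xb)
    return W1, W2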

B.2 The Quasi-Newton method

There are methods other than Backpropagation for minimizing the non-linear function of eq. (B5). A more efficient method is Quasi-Newton. In short, the cost function $E(\mathbf{w})$, in terms of the weight vector $\mathbf{w}$, is expanded about a current value $\mathbf{w}_0$:


$$E(\mathbf{w}) = E(\mathbf{w}_0) + (\mathbf{w} - \mathbf{w}_0) \cdot \nabla E(\mathbf{w}_0) + \frac{1}{2} (\mathbf{w} - \mathbf{w}_0) \cdot \mathbf{H} \cdot (\mathbf{w} - \mathbf{w}_0) + \ldots, \eqno(B12)$$

where $\mathbf{H}$ is the Hessian, with elements $H_{ij} = \partial^2 E / \partial w_i \partial w_j$ evaluated at $\mathbf{w}_0$. The minimum approximately occurs where

$$\nabla E(\mathbf{w}) \approx \nabla E(\mathbf{w}_0) + \mathbf{H} \cdot (\mathbf{w} - \mathbf{w}_0) = 0. \eqno(B13)$$

Hence an estimate for the optimal weight vector is

$$\mathbf{w} = \mathbf{w}_0 - \mathbf{H}^{-1} \nabla E(\mathbf{w}_0). \eqno(B14)$$

In the standard Newton's method a previous estimate of $\mathbf{w}$ is used as the new $\mathbf{w}_0$. Calculating the Hessian exactly is computationally expensive, so in the quasi-Newton method an iterative approximation to the inverse of the Hessian is used (e.g. Press et al. 1992; Hertz et al. 1991).
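In practice, a quasi-Newton minimization of a least-squares cost can be run with, for example, SciPy's BFGS implementation, which maintains exactly such an iterative approximation to the inverse Hessian (a sketch with a linear `network' and synthetic data, not the code used in this paper):

import numpy as np
from scipy.optimize import minimize

def cost(w, X, T):
    return np.sum((X @ w - T) ** 2)          # least-squares cost E(w)

def grad(w, X, T):
    return 2.0 * X.T @ (X @ w - T)           # its exact gradient

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                # synthetic inputs
T = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) # synthetic `true' outputs
res = minimize(cost, x0=np.zeros(5), args=(X, T), jac=grad, method="BFGS")
w_opt = res.x                                # converges to the least-squares solution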

APPENDIX C: RELATIONS BETWEEN ANNs AND OTHER CLASSIFIERS

C.1 Bayesian Classification and probabilities

A classifier can be formulated from first principles according to Bayes' theorem:

$$P(T_j|\mathbf{x}) = \frac{P(\mathbf{x}|T_j)\, P(T_j)}{\sum_k P(\mathbf{x}|T_k)\, P(T_k)}, \eqno(C1)$$

i.e. the a posteriori probability for a class $T_j$ given the parameter vector $\mathbf{x}$ is proportional to the probability of the data given the class (as can be deduced from a training set) times the prior probability of the class (as can be evaluated from the frequency of classes in the training set). However, applying eq. (C1) requires parameterization of the probabilities involved. It is common, although not always adequate, to use a multivariate Gaussian:

$$P(\mathbf{x}|T_j) = (2\pi)^{-M/2}\, |\mathbf{C}_j|^{-1/2} \exp\left[ -\frac{1}{2}\, \mathbf{x}\, \mathbf{C}_j^{-1}\, \mathbf{x}^T \right], \eqno(C2)$$

where $\mathbf{x}$ is of dimension M and here has zero mean, $\mathbf{x}^T$ is its transposed vector, and $\mathbf{C}_j = \langle \mathbf{x}\, \mathbf{x}^T \rangle_j$ is the covariance matrix for class j. This matrix is similar to the one used in the PCA (Appendix A.1) for all the classes. As in PCA, the matrix $\mathbf{C}_j$ can be diagonalized, hence simplifying eq. (C2).
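A sketch of this simple Bayesian classifier (eqs. C1 and C2), with per-class Gaussians estimated from a training set (NumPy notation; the names are our own):

import numpy as np

def fit_gaussian_bayes(X, y):
    """Per-class mean, covariance matrix and prior frequency from a training set."""
    return {c: (X[y == c].mean(axis=0), np.cov(X[y == c].T), np.mean(y == c))
            for c in np.unique(y)}

def posterior(x, model):
    """A posteriori probabilities P(T_j|x) via eqs. C1 and C2."""
    M = len(x)
    p = {}
    for c, (mu, C, prior) in model.items():
        d = x - mu
        like = np.exp(-0.5 * d @ np.linalg.solve(C, d)) \
               / np.sqrt((2 * np.pi) ** M * np.linalg.det(C))
        p[c] = like * prior                  # numerator of eq. C1
    total = sum(p.values())                  # denominator of eq. C1
    return {c: v / total for c, v in p.items()}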

It can be shown that certain ANN configurations behave like Bayesian classifiers, i.e. the output nodes produce Bayesian a posteriori probabilities (see e.g. Gish 1990; Richard & Lippmann 1991), although the network does not implement Bayes' theorem directly. To illustrate this important property of the networks we follow Gish (1990) for a simple heuristic example. Let the network's single output be written as $f(\mathbf{x}, \mathbf{w})$, where $\mathbf{x}$ stands for the input parameters and $\mathbf{w}$ stands for the weights (more generally these quantities are vectors). We consider a two-class problem for which the desired output of the network is 1 if $\mathbf{x}$ is in class $T_1$ and 0 if it is in class $T_2$. The cost function over all N objects is then (cf. eq. B5)

$$E = \frac{1}{N} \sum_{i=1}^{N} \left[ f(\mathbf{x}_i, \mathbf{w}) - d_i \right]^2, \eqno(C3)$$

where $d_i$ is 1 or 0 according to the class of $\mathbf{x}_i$.
For large N, and if the number of samples from each of the classes is in proportion to the a priori probability of class membership $P(T_j)$, this can be replaced by an integral:

$$E = \int \left[ f(\mathbf{x}, \mathbf{w}) - P(T_1|\mathbf{x}) \right]^2 P(\mathbf{x})\, d\mathbf{x} + P(T_1) - \int P^2(T_1|\mathbf{x})\, P(\mathbf{x})\, d\mathbf{x}. \eqno(C4)$$

The minimum of this function with respect to w clearly occurs for

$$f(\mathbf{x}, \mathbf{w}) = P(T_1|\mathbf{x}), \eqno(C5)$$

so the output of the network can be interpreted as the a posteriori probability. This can be generalised to multiple outputs. It is reassuring (and should be used as a diagnostic) that the probabilities in an `ideal' network add up approximately to unity. Moreover, if both the training and testing sets are drawn from the same parent distribution, then the frequency distribution $P(T_j)$ of the objects as classified by the ANN is similar to that of the training set. The link between minimum variance and probability also illustrates why a classification scheme in which one calculates the Euclidean distance of the ANN output from the vector representing each of the possible classes, and then assigns the object to the class producing the minimum distance, is equivalent to assigning the class with the highest probability (cf. Richard & Lippmann 1991). For a sigmoid output (eq. B3) it can be shown (Gish 1990) that the argument of the sigmoid, $z(\mathbf{x}, \mathbf{w}) = \ln\{f(\mathbf{x}, \mathbf{w})/[1 - f(\mathbf{x}, \mathbf{w})]\}$, with $f(\mathbf{x}, \mathbf{w}) = P(T_1|\mathbf{x})$ (eq. C5) and $P(T_2|\mathbf{x}) = 1 - P(T_1|\mathbf{x})$, gives

$$z(\mathbf{x}, \mathbf{w}) = \ln \frac{P(T_1|\mathbf{x})}{P(T_2|\mathbf{x})}, \eqno(C6)$$

i.e. the argument of the sigmoid models the log-likelihood ratio of the two classes. With the transfer function $\tanh(z) = 2/[1 + \exp(-2z)] - 1$ the interpretation is similar. We note that the above analysis (eq. C4) does not tell us anything about the network architecture, and it only holds for an `idealized' network and data. For more rigorous and general Bayesian approaches to modelling ANNs see MacKay (1992).

C.2 Linear Networks and Wiener Filtering

The weights, the free parameters of the ANN, have a simple interpretation when the network is linear [$f(z) = z$] and without hidden layers, commonly called the `perceptron'. For simplicity of notation we consider a network with a single continuous output, e.g. yielding the type T. In this case we can write the cost function as

$$E = \sum_{\mu=1}^{N} \left[ T^{\mu} - \sum_{k=0}^{M} w_k x_k^{\mu} \right]^2, \eqno(C7)$$
where $\mu = 1, \ldots, N$ labels the objects and $k = 0, \ldots, M$ the parameters. The index k = 0 stands for the `bias' term (with $x_0 = 1$), and it plays the role of an additive constant $w_0$ in the network equation. The minimum of E with respect to the weights occurs at

$$\frac{\partial E}{\partial w_j} = 0, \eqno(C8)$$
giving

$$\sum_k \langle x_k x_j \rangle\, w_k = \langle T x_j \rangle, \eqno(C9)$$

where $\langle \ldots \rangle$ denotes averages over the N objects.

The solution of this set of linear equations (for $j = 1, \ldots, M$) for the optimal weight vector $\mathbf{w}$ can be written as

$$\mathbf{w}_{opt} = \mathbf{A}^{-1}\, \mathbf{b}, \eqno(C10)$$

where $A_{jk} = \langle x_k x_j \rangle$ and $b_j = \langle T x_j \rangle$. More generally, if there are multiple output units (say a vector $\mathbf{s}$), so that the weights form a matrix $\mathbf{W}$ (not necessarily square), the minimum of the variance $\langle (\mathbf{s} - \mathbf{W}\mathbf{x})(\mathbf{s} - \mathbf{W}\mathbf{x})^T \rangle$ with respect to the weights occurs for

$$\mathbf{W}_{opt} = \langle \mathbf{s}\,\mathbf{x}^T \rangle \langle \mathbf{x}\,\mathbf{x}^T \rangle^{-1}. \eqno(C11)$$


This is in fact the standard Wiener (1949) filter known in digital filtering and image processing, commonly applied to signal+noise problems with $\mathbf{x} = \mathbf{s} + \mathbf{n}$ (e.g. Rybicki & Press 1992 for a review, and Zaroubi et al. 1995 and references therein for recent cosmological applications). We note that the same result can be derived by conditional probabilities with Gaussian probability distribution functions, as well as by regularisation with a quadratic prior.
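Eq. (C11) is a one-liner in practice; a sketch in NumPy (our notation), using a linear solve rather than an explicit matrix inverse for numerical stability:

import numpy as np

def wiener_weights(X, S):
    """Optimal linear weights W_opt = <s x^T><x x^T>^(-1) (eq. C11).
    X is (N, M) inputs, S is (N, K) desired outputs; averages run over the N objects."""
    A = X.T @ X / len(X)                 # <x x^T>, shape (M, M)
    B = S.T @ X / len(X)                 # <s x^T>, shape (K, M)
    return np.linalg.solve(A, B.T).T     # W_opt, shape (K, M); A is symmetric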

For an alternative, somewhat more complicated expression see Hertz et al. (1991, pg. 102), where the weights are given in terms of a covariance matrix of the objects (useful for the case of many features and few objects).

One can go one step further to generalize the above to non-linear input. This can be done e.g. by expanding the elements of the input vector as products of their powers. For example, if the input parameters are x1 and x2 the expanded input vector is

$$[1, x_1, x_2, x_1^2, x_1 x_2, x_2^2, \ldots].$$

This is sometimes called the Volterra Connectionist Model (VCM; see e.g. Rayner & Lynch 1989, Pao 1989, Lasenby & Fitzgerald 1993). Other alternatives for non-linear input are e.g. `radial basis functions' and spherical harmonics. In fact, this can be viewed as an ad-hoc hidden layer which forces the input into a new non-linear form. The advantage of the VCM network is that the global minimum is unique. This provides a reproducible solution and allows fast training. On the other hand, as the connections between the input and `hidden' layer are `hard-wired', the freedom of the network for difficult data sets is limited.
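A sketch of the second-order expansion in the example above (the helper name is our own):

import numpy as np

def vcm_expand(x1, x2):
    """Second-order Volterra (VCM) expansion of a two-parameter input."""
    return np.array([1.0, x1, x2, x1**2, x1 * x2, x2**2])

The expanded vector is then fed to an ordinary linear network, whose optimal weights follow from eq. (C10).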

C.3 Regularisation and Weight Decay

As in other inversion problems, the determination of many free parameters, the weights $w_i$ in our case, might be unstable. It is therefore convenient to regularise the weights, e.g. by preventing them from growing too large. In the ANN literature this is called `weight decay'. This approach is analogous to Maximum Entropy, and can be justified by Bayesian arguments, with the regularising function acting as the prior in the weight space. Note that this is a different application of Bayes' theorem from the one discussed in §C.1, which was applied in the class-space.

One possibility is to add a quadratic prior and to minimize

$$E_{tot} = \alpha E_w + \beta E_D, \eqno(C12)$$


where $E_D$ is our usual cost function, based on the data (e.g. eqs. B5 and C7), and

$$E_w = \frac{1}{2} \sum_{i=1}^{Q} w_i^2 \eqno(C13)$$


is the chosen regularising function, where Q is the total number of weights. The coefficients $\alpha$ and $\beta$ can be viewed as `Lagrange multipliers'. While sometimes they are specified ad hoc, it is possible to evaluate them `objectively' by Bayesian arguments in the weight space. This has been done in the context of ANNs by MacKay (1992, see also Ripley 1993), following earlier analysis in relation to Maximum Entropy by Gull (1989; see also Lahav & Gull 1989). The Bayesian analysis gives the conditions on $\alpha$ and $\beta$ as


$$\chi_w^2 \equiv 2\alpha E_w = \gamma \eqno(C14)$$

and

$$\chi_D^2 \equiv 2\beta E_D = N - \gamma, \eqno(C15)$$

where N is the number of data points (objects) and

$$\gamma = \sum_{q=1}^{Q} \frac{\lambda_q}{\lambda_q + \alpha}, \eqno(C16)$$


where the $\lambda_q$ are the eigen-values of the Hessian (in the weight space) $\beta \nabla\nabla E_D$, evaluated with the weights at which $E_{tot}$ is at its minimum. The parameter $\gamma$ has an interesting interpretation: it is the number of `well-determined' weights. If $\lambda_q \gg \alpha$ then $\gamma \approx Q$ (the total number of weights). In this case $\chi_D^2 \approx N - Q$, which is similar to the usual condition that $\chi^2$ equals the number of degrees of freedom. Moreover, if $Q \ll N$ then

$$\alpha \approx 1/\sigma_w^2 \quad {\rm and} \quad \beta \approx 1/\sigma_D^2, \eqno(C17)$$

where $\sigma_w^2 \equiv 2E_w/Q = \sum w_i^2/Q$ and $\sigma_D^2 \equiv 2E_D/N$, as expected for Gaussian probability distribution functions. We note that this analysis makes sense if the input and output are properly scaled, e.g. between [0, 1] with sigmoid transfer functions, so that all the weights are treated on an `equal footing' in the regularisation process. It can be generalized to several regularising functions, e.g. one per layer.


We note that the addition of the regularisation term $E_w$ changes the location of the minimum, which now satisfies

$$\beta \nabla E_D + \alpha \nabla E_w = \beta \nabla E_D + \alpha \mathbf{w} = 0, \eqno(C18)$$

since from eq. (C13) $\nabla E_w = \mathbf{w}$. The effect of the regularisation term here is reminiscent of the restoring force of a harmonic oscillator: the larger $\mathbf{w}$ is, the more it is suppressed. The addition of the regularisation term to eq. (C4) gives a minimum for the extended cost function which does not satisfy eq. (C5), i.e. it violates the probabilistic interpretation in the class-space. However, one could construct a network with regularisation which produces probabilities self-consistently (e.g. MacKay 1992). The weight decay term also modifies the Wiener solutions of §C.2.


FIGURE CAPTIONS

Figure 1a The distribution of 5217 ESO-LV galaxies of all morphological types (top right) in the 2-dimensional space defined by the first and second Principal Components, as derived from PCA using 13 galaxy parameters. The other three panels show subsets of this fuzzy cloud according to their classification labels Sa+Sb, Sc+Sd and E+S0, as given in ESO-LV. The different morphological types occupy distinct regions in this new parameter space.

Figure 1b The top-right panel is as in Figure 1a, and the other 3 panels are for the classes E, S0, and Irr. Note that E and S0 galaxies are segregated.

Figure 2 A schematic diagram of an encoder network with M input parameters, N (labeled as M' in the text) nodes on the hidden layer, and M output nodes. N will range between 1 and M, depending on the desired data compression factor. During training, set Input=Output to teach the encoder to reproduce a given input vector at the output layer. This network performs PCA-like dimensionality reduction when the transfer function is linear, and can be extended to perform non-linear mapping.

Figure 3 The cost function vs. the number of hidden units in encoder network with a sigmoid transfer function. The network was trained on the 5217 ESO-LV galaxies, each with 13 parameters. The cost function seems to drop roughly exponentially with the number of hidden units.

Figure 4 An ANN configuration with M input parameters, N hidden nodes and a single `analog' output. Such a network can perform a non-linear regression, and is used in our problem to predict the T-type, based on input galaxy parameters. All nodes in a given layer are connected to all nodes in the next layer. The `bias' node allows additive constants to the network equation.

Figure 5 The type Tnet predicted by the ANN for 1700 ESO-LV galaxies (trained on a different set of 1700 galaxies) against the ESO-LV human classification Teso. The Spearman correlation coefficient in this diagram is 0.83, and the average rms dispersion is 2.0 T-types.

Figure 6 ANN reproducibility: a comparison between the T-types predicted by the network for 1700 ESO-LV galaxies in two runs, starting the minimization with different random weights. The Spearman correlation coefficient is 0.98 and the rms dispersion is 0.6 T-types. Note the transition from early to late type, discussed in the main text.


Int. J. Communications, Network and System Sciences, 2014, 7, 131-140 Published Online May 2014 in SciRes. http://www.scirp.org/journal/ijcns  http://dx.doi.org/10.4236/ijcns.2014.75015 

 


Application of Machine-Learning Based

Prediction Techniques in Wireless Networks

Gitanjali Bhutani

WCDMA, Alcatel-Lucent Technologies India Private Limited, Bangalore, India

Email: gitanjali.bhutani@alcatel-lucent.com 

Received 18 April 2014; revised 30 April 2014; accepted 7 May 2014

Copyright © 2014 by author and Scientific Research Publishing Inc.

This work is licensed under the Creative Commons Attribution International License (CC BY).

http://creativecommons.org/licenses/by/4.0/

 

Abstract

Wireless networks are key enablers of ubiquitous communication. With the evolution of networking technologies and the need for these to inter-operate and dynamically adapt to user requirements, intelligent networks are the need of the hour. The use of machine learning techniques allows these networks to adapt to changing environments and enables them to make decisions while continuing to learn about their environment. In this paper, we survey the various problems of wireless networks that have been solved using machine-learning based prediction techniques and identify additional problems to which prediction can be applied. We also look at the gaps in the research done in this area to date.

Keywords

Wireless Networks, Prediction, Machine Learning, Ubiquitous Communication, Pervasive Computing

1. Introduction

In the current age of the Information Technology revolution, quick availability and use of information to make speedy decisions is becoming a competitive advantage for many businesses. In such an environment, with all decision-makers connected, ubiquitous communications have become the need of the hour. Ubiquitous computing and communication combine mobility with context awareness, adaptability, scalability and localization to create an environment where devices are smarter and take actions by predicting user behavior. It finds applications in a large variety of areas including energy conservation, manufacturing, healthcare, banking, education and telecommunications. In order to make ubiquitous computing a reality, certain devices are absolutely necessary. At the bottom of this ubiquitous computing stack are the sensors, or smart phones with sensor functionality. These are responsible for collecting information from the surrounding environment and reporting it to a decision-making entity. In order to allow these sensors to communicate, the next layer is the wireless communication layer, which can be provided by the 802.11 family of networks or any other communication technology. The final level in the stack includes the applications that collect, mine and analyze the data gathered by the sensing devices for patterns in order to make decisions. Wireless networks, being a key enabler of the ubiquitous communication paradigm, are gaining in importance. The quality of service provided by these networks is of utmost importance in determining the applications that they will be put to. Not only do these networks have to provide an enhanced quality of user experience (QoE), they must do so at optimized rates and with optimum resource usage. Wireless devices are extremely constrained for power, and network implementations have to account for this at all times.

With smart-phones being used for newer real-time applications every day, it becomes challenging for the wireless network elements to keep pace. It is thus the need of the hour for networking software to adapt to changing requirements and user trends without the need for manual intervention. One way of achieving this is to build intelligent network elements that record user behavior, characterize it, identify patterns in it and use the knowledge gained from these data to adapt various parameters. Foremost in the requirements of this type of intelligent software is the ability to predict different aspects of user behavior in order to determine any parameter changes or other actions sufficiently in advance, so that changing network conditions, due to mobility or other reasons, are seamless for the user. The use of artificial intelligence, specifically learning and prediction techniques, makes these adaptable systems a reality.

In this paper we look at the applications of prediction techniques to solve different aspects of the ubiquitous computing problem. The remainder of the paper is organized as follows: in Section 2, we discuss machine learning techniques for prediction in more detail; Section 3 categorizes the literature in the area of prediction in wireless networks, based on the problems addressed using prediction; Section 4 and beyond look at each of these areas in further detail and compare the various methodologies used; finally, we look at the research gaps in this area of building intelligent wireless networks using prediction techniques.

2. Overview of Machine Learning

Machine learning deals with algorithms that give computers the ability to learn, in much the same way as humans. This means that given a set of data, an algorithm infers information about the properties of the data, allowing it to make predictions about other data it may see in the future. The main focus of machine learning is the design of algorithms that recognize patterns and make decisions based on input data. Machine learning has found uses in areas like biotechnology, fraud detection, wireless networks, stock market analysis and national security.

Machine learning algorithms can be categorized into:

1) Supervised learning: these algorithms use training data to generate a function that maps inputs to desired outputs (also called labels). For example, in a classification problem, the system looks at example data and uses it to arrive at a function mapping input data into classes. Artificial neural networks, radial basis function networks and decision trees are forms of supervised learning.

2) Unsupervised learning: these algorithms work without previously labeled data. Their main purpose is to find the common patterns in previously unseen data. Clustering is the most popular form of unsupervised learning. Hidden Markov models and self-organizing maps are other forms of unsupervised learning.

3) Semi-supervised learning: as the name indicates, these algorithms combine labeled and unlabeled data to generate an appropriate mapping function or classifier.

Artificial neural networks are extremely popular in the field of prediction in wireless networks. Other techniques, like decision trees and unsupervised learning, are used much less. Experiments show that using a combination of techniques instead of a single one provides the best results.

3. Survey of Prediction Techniques in Wireless Networks

As wireless networks move towards being omnipresent to facilitate smart homes, offices and so on, completely eliminating any manual intervention in setting these up and operating them is essential. In order to adapt to the environment and inter-operate with other systems, making these systems intelligent will be essential for their success. One of the main problems in wireless networks is unpredictable signal quality. The signal strength at any point in a wireless network is impacted by several factors: topology of the area, presence of buildings, interference from different networks and appliances operating at similar frequencies, and so on. Since the networks or the neighboring devices can keep changing, having a static algorithm to attack this problem will not work. Instead, having wireless network elements sense interference and respond to it appropriately will allow adaptation to changing environments. One of the most researched areas in ad-hoc and wireless sensor networks is wireless link status prediction. Being able to predict when a link's strength will drop below threshold levels, and for how long, will allow applications to take corrective action in advance, thus ensuring minimal service disruption. Section 4 looks at the application of machine learning to this problem.

As the different types of networks and devices multiply, it is essential to allow inter-operability amongst them to give users maximum flexibility. Amongst other aspects, achieving inter-operability involves achieving seamless handovers between networks. Different schemes exist to ensure a seamless handover, but completing the handover in time, with minimal wastage of resources, requires predicting the time at which the mobile station will lose connectivity to its current network. This prediction of mobility and disconnection time is the focus of Section 5.

Among the most researched areas in ad-hoc and wireless sensor networks are routing and intrusion detection. The ad-hoc nature of these networks means that mobile nodes are responsible for routing packets while processing their own data, yet users can enter and leave the network at any time. The main focus of research in this area is routing algorithms that adapt to the changing topology as quickly as possible; building some prediction capability into these algorithms allows them to select the most reliable and longest-lasting routes for forwarding data. Section 6 looks at the use of prediction in routing in further detail. The ad-hoc nature of these networks also makes them extremely vulnerable to security attacks. With newer forms of security threats emerging all the time, using data from previous attacks to predict the general characteristics of attacks and to detect new ones will go a long way in securing these systems. The use of prediction in intrusion detection is the main focus of Section 7.

4. Wireless Link Status Prediction

The stability and reliability of links in wireless networks depend on a number of factors such as the topology of the area, inter-base-station or inter-mobile-station distances, weather conditions and so on. As such, there is no single way of modeling wireless link behavior that works in all cases, making it difficult to predict wireless link availability using mathematical models. At the same time, an estimate of link quality and link availability duration can drastically increase the performance of these networks, allowing the network to take proactive measures to handle impending disconnections. One such application of link disconnection prediction is discussed in [1], wherein the authors propose a scheme that prevents the Transmission Control Protocol (TCP) congestion control mechanism from kicking in during a temporary disconnection of a Mobile Station (MS). Having been designed for wired, fixed networks, TCP interprets all delayed acknowledgements and packet losses as congestion and brings its congestion control mechanisms into operation. These mechanisms can lead to a huge reduction in the data rates seen by the MS, and are wasteful if the MS experienced disconnection due to transient network conditions, which is rather normal in wireless networks. In the scheme proposed in [1], a base station measures the signal strength for each MS and uses it to predict the time and duration of disconnection. Based on the predicted time, it starts caching TCP Acknowledgements (ACKs) for the MS that is going to be disconnected. When the MS is disconnected, the stored ACK packets are spaced out in time over the duration of the disconnection, preventing the TCP source from bringing its congestion control mechanisms into play upon detecting lost packets. Another application that would greatly benefit from wireless link predictions is routing in ad-hoc and wireless sensor networks. While routing itself is a challenge in these networks, due to node mobility and the absence of a fixed infrastructure, it becomes further complicated if the routing algorithm does not take link lifetime into account; this can lead to numerous re-routings, degrading the performance of the network. These two applications are only a few of the plethora of applications that would benefit from the ability to predict wireless link quality well in advance. In this section, we look at some of the research done in this area.
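The caching-and-pacing step of [1] lends itself to a small illustration. The sketch below, a simplification under assumed inputs (the function name and all numbers are ours, not from the paper), spreads the cached ACKs uniformly over the predicted disconnection window so the TCP source keeps receiving feedback instead of timing out:

```python
# Hedged sketch of ACK pacing: release cached ACKs evenly across the
# predicted outage. All names and values are illustrative assumptions.
def ack_release_times(disconnect_start: float,
                      predicted_duration: float,
                      cached_acks: int) -> list:
    """Return one send time per cached ACK, spread over the outage."""
    if cached_acks == 0:
        return []
    spacing = predicted_duration / cached_acks
    return [disconnect_start + i * spacing for i in range(1, cached_acks + 1)]

# Outage predicted at t = 10 s lasting 4 s, with 8 ACKs cached:
print(ack_release_times(10.0, 4.0, 8))   # 8 evenly spaced send times
```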

Another scheme for link quality prediction and link estimation, called 4C [2], uses previously collected link quality data to construct three different machine-learning models: a Naïve Bayes classifier (a supervised learning technique built on Bayes' theorem, used to classify previously unseen data based on the learning performed in the training stage), logistic regression (a statistical technique that predicts the probability of the dependent variable having a particular value based on the values of the independent variables), and artificial neural networks. These models are constructed from a combination of the Packet Reception Rate (PRR) for link estimation and the Received Signal Strength Indicator (RSSI), Link Quality Indicator (LQI) and Signal to Noise Ratio (SNR). The output of each model is the success probability of delivering each packet. Once the models are trained, they are deployed for prediction, with the error from each prediction being fed back to the models to continue the learning process. The authors compare the prediction accuracy of each of these models against a Bernoulli process whose success probability is set to the packet reception rate. Experimental results show that all three models have a greater prediction accuracy than the Bernoulli process, with the logistic regression model having the best accuracy at very low computational cost. The authors also compare the 4C algorithm against other similar estimator algorithms like 4 Bit [3] and the Short Term Link Estimator (STLE) [4] and find a 20% to 30% difference in accuracy, with 4C performing the best.
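As a rough illustration of the 4C approach, the sketch below trains a logistic regression on synthetic link features standing in for PRR, RSSI, LQI and SNR, and compares it against a Bernoulli baseline whose success probability equals the observed reception rate; the data generation and coefficients are our assumptions, not those of [2].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
features = rng.normal(size=(n, 4))         # stand-ins: PRR, RSSI, LQI, SNR
true_p = 1 / (1 + np.exp(-(1.5 * features[:, 0] + 0.8 * features[:, 3])))
delivered = rng.random(n) < true_p         # synthetic delivery outcomes

model = LogisticRegression().fit(features, delivered)
pred = model.predict(features)             # per-packet success prediction

baseline = rng.random(n) < delivered.mean()   # Bernoulli process baseline
print("model accuracy    :", (pred == delivered).mean())
print("Bernoulli accuracy:", (baseline == delivered).mean())
```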

In order to predict wireless network connectivity, that is, the signal-to-noise ratio for a mobile station, [5] proposes the use of a new Taylor Kriging model, which is the Kriging model with a third-order Taylor expansion for prediction. The Kriging technique is an interpolation technique used to estimate the value of a mathematical function at unknown points, based on the values at known points; it tries to fit the function to a specified number of points. The authors compare the accuracy of the Taylor Kriging model against that of a predictor built using the Ordinary Kriging model [6] and an artificial neural network based predictor [7]. The authors use wireless data sets which contain the power of a tower at particular points and the Euclidean distance of these points from the tower; this is used as input to the models built in [5]-[7]. For each point, the output of the model is the logarithm (to the base 10) of the signal-to-noise ratio value for that point. The prediction accuracy of the Taylor Kriging model is significantly higher than that of the models proposed in [6] and [7], especially for constrained training sets. However, the experiments show that, in absolute terms, the prediction error is still substantially high.
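Since ordinary kriging is closely related to Gaussian-process regression, a rough stand-in for the predictors of [5]-[7] can be sketched with scikit-learn's GP regressor; the path-loss-like data below is synthetic, and the kernel choice is our assumption, not the models of those papers.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)
dist = rng.uniform(10, 500, size=(40, 1))   # distance from tower, metres
log_snr = 4.0 - 1.8 * np.log10(dist[:, 0]) + rng.normal(0, 0.1, 40)

gp = GaussianProcessRegressor(kernel=RBF(100.0) + WhiteKernel(0.01))
gp.fit(dist, log_snr)                       # interpolate from known points

mean, std = gp.predict(np.array([[250.0]]), return_std=True)
print(f"log10(SNR) at 250 m: {mean[0]:.2f} +/- {std[0]:.2f}")
```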

5. Handovers and Prediction

In this section, we look at prediction to facilitate smooth handovers in further detail. Given the ubiquitous computing environment, together with smart devices that support multiple technologies and applications that require staying connected all the time, handovers across technologies are more widely supported and researched than in previous generations of technologies. Handovers can be classified into:

1) Horizontal handovers: handovers between base stations of the same technology.

2) Vertical handovers: handovers between base stations belonging to different technologies. These are more challenging to handle than horizontal handovers, because handover times vary with the target network and the procedures involved in the handover differ.

Prediction to facilitate smooth handovers involves being able to predict the next location, or point of attachment, of mobile stations. Predicting the next location well in advance allows evaluation of candidate target networks to determine which one best meets the requirements, reservation of resources in the target network to avoid ping-pong handovers, and minimal loss of data, since the handover can be completed just as the mobile station loses connectivity to its current network. Prediction of the next location of a mobile station is also called mobility management. Much of the literature in this area, like [8] [9], looks at a user's past movement history and predicts future movements and locations; these mechanisms differ primarily in the information used to predict the user's location. Some schemes, like [8], look at only the current location of the user and the historical movement information to predict the future location. In [8], the authors use a Hidden Markov Model (HMM) to predict the user's next location; they use real trace datasets to train the model, which continues to be tuned as it is used for prediction in the network. A Markov chain model is used in [9] to predict the user's next location. The disadvantage of the Markov chain model is that it only takes the user's previous location into account; such a model misses the different paths that different users can take to reach the same state, severely affecting its accuracy. In contrast, [8] uses a k-th order Markov model, where k > 1, in which the system records the movements of the users for the past k transitions. The Access Point (AP) controller is responsible for collecting data from APs about transitions and using this data to predict the next location of the user; it constructs an HMM for each user and uses this to make predictions. Using an available campus wireless dataset [10], the authors build the HMM, use it to make predictions and measure the prediction accuracy of the model. The data mined from the datasets provides interesting insights into user movements, the notable one being that most users connect to one or two APs most frequently. This property of the dataset affects the prediction accuracy of the model: it can easily predict that a user will connect to one of their favored APs, but not to one of the APs the user rarely visits. In addition, the experiments also show that prediction accuracy falls as the sequence length increases.
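The difference between a first-order Markov chain and the k-th order model of [8] can be seen in a few lines. This sketch, on a hypothetical AP trace, counts k-length histories and predicts the most frequent successor; with k = 1 the two interleaved paths through AP2 would be confused, while k = 2 separates them.

```python
from collections import Counter, defaultdict

def train(trace, k):
    """Count successors of every k-length history in the trace."""
    model = defaultdict(Counter)
    for i in range(len(trace) - k):
        model[tuple(trace[i:i + k])][trace[i + k]] += 1
    return model

def predict(model, history):
    counts = model.get(tuple(history))
    return counts.most_common(1)[0][0] if counts else None

trace = ["AP1", "AP2", "AP3", "AP4", "AP2", "AP5",
         "AP1", "AP2", "AP3", "AP4", "AP2", "AP5"]
m2 = train(trace, k=2)
print(predict(m2, ("AP1", "AP2")))   # -> AP3 (history disambiguates)
print(predict(m2, ("AP4", "AP2")))   # -> AP5
```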

Other schemes use information like network topology and delve deeper into user characteristics to predict the user's next location. One such technique is described in [11], where users are classified into groups, with all users of a group having similar movement patterns. The paper discusses a technique called Behavior Based Mobility Prediction (BMP), which uses the user group, location, time-of-day and cell-duration information to predict a user's next point of attachment. The location represents the history of mobility patterns: the movement history of all mobile stations is recorded, and this information, together with the direction in which the mobile station is moving, allows the user's next location to be predicted. This method allows the structure of buildings and the layout of roads to be taken into account in the next-location prediction. The time-of-day factor captures the fact that user movements differ based on the time of day. The duration factor captures the user's speed of movement through a cell and is categorized as short, medium or long. The authors specifically look at the problem of handoff latency in wireless LANs. The main reason for the large handoff latency in wireless LANs is the time needed to scan for new Access Points (APs) when moving from one cell to another. The authors argue that predicting the next point of attachment of a Mobile Station (MS) can eliminate the scanning overhead and correspondingly reduce the handoff delay. A short duration reflects an unnecessary handoff; such handoffs can be eliminated by next-location prediction using BMP.

The BMP scheme is assumed to be implemented by a server which can be co-located with the authentication server. The server uses all of the user's movement and location characteristics, together with the time of day, to arrive at prediction lists of next points of attachment for the MS. If an association fails, the MS re-associates with the APs in the cells specified in the prediction list, in their order of appearance. During the next handoff, the first prediction in either list is used as the next-cell prediction, depending on whether the duration of the MS in the current cell is medium or long; if the first prediction fails, the second is used, and so on. A full scan is performed only when all the predictions in the lists fail.
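The re-association loop itself is straightforward; the following sketch captures it with hypothetical associate() and full_scan() hooks standing in for the real driver operations (all names are ours, not from [11]).

```python
def handoff(prediction_list, associate, full_scan):
    """Try predicted APs in order; scan only if every prediction fails."""
    for ap in prediction_list:          # ordered by predicted likelihood
        if associate(ap):               # re-associate without scanning
            return ap
    return full_scan()                  # fallback: full AP scan

# Toy usage: only 'AP7' accepts the association in this example.
print(handoff(["AP3", "AP7"],
              associate=lambda ap: ap == "AP7",
              full_scan=lambda: "AP_from_scan"))   # -> AP7
```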

The authors compare the BMP technique to other techniques used to predict a user's next location, such as determining the next location from signal strength, or employing extra devices or an overlay network to detect APs. BMP scores over these techniques because none of them can completely eliminate the need to scan for APs, nor do they take the location topology or structure into account. The authors further argue that the other prediction schemes in the literature do not take the nuances of WLANs, such as highly overlapped cells and MAC contention, into account, and that location-based schemes cannot capture mobility patterns that deviate from the norm.

Another category of literature that tries to solve the vertical handoff problem (handoff across technologies) is based on the IEEE 802.21 standard, which defines a middleware architecture to ease handoff across technologies, called the Media Independent Handover Function (MIHF). The architecture defines the Event Service, which provides information about changes in local and remote link layer conditions in the form of events and triggers. These triggers include: 1) Link Up (LU), 2) Link Down (LD), 3) Link Going Down (LGD), 4) Link Going Up (LGU). The LGD trigger leads to the network triggering a handover for the corresponding mobile station. Receiving this trigger too late means that the link will be lost before the handover is complete, and hence there will be data loss; receiving it too early means a wastage of network resources, since the link was still working in the source network when the handover occurred. Correctly timing these triggers is therefore essential for an efficient handover, and a large body of literature in this area attempts to find mechanisms to predict the timing of these triggers. We discuss some of it in this section.

The algorithm presented in [12] uses linear prediction of the signal strength at the source and target networks to obtain the handover initiation time. The authors define the threshold signal level at which prediction of the signal strength must start as a certain factor greater than the signal strength at which the signal is lost.

As the signal strength approaches this level, prediction of the source network signal strength starts. The prediction process ends when the signal is lost; at that point, the MS should already be able to connect to the target network for a seamless HO. Hence, the time taken to initiate and execute the HO is taken into account to calculate the time at which prediction of the target network signal strength starts. This prediction continues until the predicted signal strength crosses the level at which the MS can safely connect to the target network without losing any data. The authors compare this scheme to the scheme used in [13], which also predicts the signal strength of the current and target base stations, but sets the HO trigger time to the average of the link-up and link-down times: the link-down time is when the predicted Received Signal Strength (RSS) at the current base station goes below the threshold and, correspondingly, the link-up time is when the predicted RSS at the target base station goes above it. The authors argue that this scheme increases the probability of unnecessary HOs in overlapping networks, which the scheme they propose overcomes. Other schemes like [14] use only the prediction of the RSS of the current BS to obtain the LGD indication, but this may lead to an HO being triggered when there are insufficient resources in the target network, which predicting the RSS of both source and target networks overcomes.
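The linear-prediction step can be illustrated with a short numpy sketch: fit a line to recent RSS samples, extrapolate to the loss threshold, and back off by the time needed to execute the handover. All numbers are invented; [12] defines its own thresholds and factors.

```python
import numpy as np

t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])              # sample times, s
rss = np.array([-70.0, -74.0, -77.0, -81.0, -85.0])  # dBm, degrading link

slope, intercept = np.polyfit(t, rss, 1)         # linear RSS prediction
loss_threshold = -95.0                           # dBm, link lost below this
t_loss = (loss_threshold - intercept) / slope    # predicted loss time

handover_time = 1.5                              # s to initiate + execute HO
print(f"fire LGD at t = {t_loss - handover_time:.2f} s "
      f"(loss predicted at t = {t_loss:.2f} s)")
```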

A cross-layer predictive handover architecture based on the 802.21 standard is proposed in [15]. This work differs from the others in that the LGD is not triggered based on RSS predictions alone; it also takes into account the time needed to perform a handover to the identified target network. The link-down trigger from layer 2 typically results in a multitude of actions at layer 3 to complete the handover, and the time required for these differs with the target network type chosen. Hence, not taking the target network type into account can lead to very early or very late handover initiations. The authors propose a handover control layer between layer 2 and layer 3 which stores the MIHF-related information, the thresholds and the handover decision engine. When the RSS of the MS goes below the InitAction threshold configured at this layer, the MS starts a process of neighbor network discovery and evaluation, done with the help of the Information Services of MIHF. Using this information, the handover time is determined. The MS then predicts the time at which the signal will go below the threshold and, using this together with the handover time, fires the LGD event at an appropriate moment. Using simulations, the authors show that the proposed mechanism leads to minimal service disruption times and almost zero early-triggering costs. A similar mechanism of using neighboring network information to determine the time for LGD is proposed in [16].

6. Prediction in Routing and Position Estimation

Ad-hoc networks and wireless sensor networks are growing in popularity because of the limited infrastructure needed to make them a reality and their self-organizing nature. This self-organizing nature is an area of active research, driven by the need for the networks to mimic an operator's intelligence in configuring and maintaining themselves. One prominent research area is packet routing in these networks. Since these networks rely on mobile stations serving as intermediate hops, changes in topology, and thus in routing paths, are very frequent. Routing algorithms that adapt to these changing topologies while consuming minimal energy are an important requirement. This section looks at how prediction techniques can be used to overcome some of the routing problems in ad-hoc and sensor networks.

In [17], the authors propose a secure and reliable routing framework for Wireless Body Area Networks (WBANs). Wireless body area networks consist of a network of sensors that monitor bodily functions; the data from these sensors is aggregated and sent over the Internet to a central monitoring entity. Although these networks are small and extremely localized, bodily movements subject them to frequently changing topologies, and the routing algorithms must adjust quickly to allow reliable transfer of critical data. The authors propose a framework in which each node measures the link quality of all its neighbors. Using past link quality measurements, the nodes predict the quality of a link with an auto-regression model; auto-regression models are used to model time-series data and to predict the value at a particular instant of time. When the routing algorithm has to select among a set of candidate nodes, it uses the one with the best link quality. When a node sends a data packet, each neighboring node listens to the transmission even if the packet is not destined for it, and responds with an ACK containing the received signal strength. The sender node uses this to update the link-quality measurements it maintains; a node which does not respond with an ACK is marked as unreachable. Through simulations, the authors show that the proposed technique improves routing reliability, because one of the best links is chosen for each transmission and the probability of sending a packet to a dead or non-existent node is minimal.
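The auto-regressive step can be sketched directly: fit AR(p) coefficients to each neighbour's recent RSSI history by least squares and route via the neighbour with the best one-step-ahead prediction. The series and the choice p = 3 are illustrative assumptions, not values from [17].

```python
import numpy as np

def ar_predict(series, p=3):
    """Fit AR(p) by least squares; return the one-step-ahead prediction."""
    X = np.array([series[i:i + p] for i in range(len(series) - p)])
    y = series[p:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(series[-p:] @ coef)

neighbours = {                                   # recent RSSI per neighbour
    "node_a": np.array([-62.0, -63.0, -61.0, -64.0, -63.0, -62.0]),
    "node_b": np.array([-70.0, -74.0, -78.0, -83.0, -88.0, -92.0]),
}
best = max(neighbours, key=lambda n: ar_predict(neighbours[n]))
print("route via:", best)                        # stable node_a wins
```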

Cognitive radio is a technology devised to overcome the spectrum shortage problem by allowing unlicensed users (also called Cognitive Users, CUs) to use the unused parts of the spectrum originally allotted to Primary Users (PUs). Towards this end, individual nodes sense their environment and adjust their transmission parameters to minimize interference with primary users. While this allows extremely efficient spectrum utilization, it makes routing a challenge in these networks, where the chances of interference with primary users are much higher and thus the links are unreliable and available for shorter durations. The application of prediction-based algorithms to the problem of topology control and reliable routing in cognitive radio networks is discussed in [18]. The authors argue that routing algorithms must take link availability into account when choosing next hops. The paper proposes a distributed prediction-based cognitive topology control scheme to provide this capability to the routing layer. To provide a minimal-risk solution, this scheme is built into a separate layer between the cognitive radio layer and the routing layer, avoiding changes to well-established routing algorithms like AODV, DSR and distance-vector routing. This layer is referred to as the cognitive topology control layer and works to establish a reliable topology for routing protocols to operate on. The topology is constructed using a new link reliability metric, determined based on:

1) Link availability time, predicted using the scheme proposed in [19],

2) Period of time spent in re-routing,

3) Link data rate.

This reliability metric is used to determine the weight of a link and the weight of a path. The algorithm then constructs the topology by identifying all neighbors, estimating the path weights from the initial node to unvisited nodes, and then building the complete topology from the maximum-weight paths. This ensures that re-routings are minimized because, under the path weight equation, links with a high data rate but low availability time and links with a low data rate but high availability time are avoided as far as possible. This allows the routing algorithms to indirectly account for the mobility of CUs as well as the interference from PUs in routing decisions. Through simulations, the authors show that the resulting routes are more reliable and lead to fewer re-routings.
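The exact weight equation belongs to [18]; as a purely illustrative stand-in that reproduces the stated behaviour, the sketch below scores a link by the harmonic mean of its normalised data rate and availability time (the re-routing cost is omitted for brevity). The harmonic mean is small whenever either quantity is small, so fast-but-short-lived and stable-but-slow links both score poorly.

```python
def link_weight(avail_time, rate, max_avail=60.0, max_rate=54.0):
    """Harmonic mean of normalised availability and rate (illustrative)."""
    a, r = avail_time / max_avail, rate / max_rate
    return 2 * a * r / (a + r) if a + r else 0.0

def path_weight(links):
    """Treat the weakest link as the weight of the whole path."""
    return min(link_weight(a, r) for a, r in links)

print(link_weight(avail_time=2.0, rate=54.0))    # fast, short-lived: ~0.06
print(link_weight(avail_time=55.0, rate=2.0))    # stable, slow:      ~0.07
print(link_weight(avail_time=40.0, rate=24.0))   # balanced:          ~0.53
```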

The algorithm proposed in [20] performs routing based on link lifetime and coverage area. Here, the energy drain rate is used to predict the link energy, and movement is calculated by relative motion estimation. Using this information, packets can be routed on paths with a longer lifetime and a lower chance of packet loss.

7. Prediction and Intrusion Detection

The emergence of ad-hoc and wireless sensor networks has brought several advantages: efficient utilization of the spectrum, reduction in Capital Expenditure (CAPEX) due to the absence of a fixed infrastructure, reduction in Operating Expenditure (OPEX) because of their self-configuring nature, much better monitoring of military areas, and making wireless body area networks and ad-hoc vehicular communication a reality. However, greater flexibility also makes these networks more vulnerable to security threats and attacks. Hence, different authentication, attack detection and attack prevention mechanisms have been studied extensively. Increases in computing power allow complex keys and passwords to be broken in a matter of seconds, so security algorithms need to keep evolving to find and fix new vulnerabilities. In all cases of network security, detecting an attack or a security threat is the biggest challenge. In this section we look at the use of machine learning techniques to detect intrusion and denial-of-service attacks in ad-hoc and wireless sensor networks.

The algorithm presented in [21] uses support vector machines to detect Denial-of-Service (DoS) attacks in mobile ad-hoc networks. The Support Vector Machine (SVM) is a machine-learning technique used for regression as well as classification. Here, the authors use the SVM as a classifier to label packets as normal or attack; packets classified as attack packets are dropped by the network. Two datasets, one of normal traffic and one of attack traffic, are used to train the SVM classifier. Since attacks are continuously evolving, their detection requires the algorithm to learn the way the human brain does, as detailed in [22]. The authors of [22] propose a three-layer hierarchical brain-like learning algorithm for intrusion detection and prediction in wireless sensor networks. They contend that intrusion detection schemes cannot contain the damage that intrusions cause, because by the time the intrusion is detected it has mostly passed, so they propose a technique that both predicts and detects intrusions. The scheme uses three layers and four agents:

1) The supervised learning layer resides in the individual sensors: a decision tree, contained within the detection agent, is used as the classifier. This agent uses a set of rules to drive the classification process of the tree, and the results of the classification are in turn used to update the rules. If an attack unknown to the sensors occurs, it is sent to the sink node.

2) The unsupervised learning layer resides at the sink: a decision tree is used to perform clustering at the sink. If an attack unknown to the sink occurs, it is propagated to the base station.

3) The reinforcement learning layer resides at the base station: reinforcement learning is used to predict intrusions in advance. The authors use a convergent temporal-difference learning scheme [23] in this layer. Whenever an attack is detected by a sensor or sink, it is reported to the base station, which uses the report to further build its prediction system. The prediction system consists of an input layer, an output layer, a hidden layer and a stochastic layer; temporal-difference learning is used to update the weights. The prediction agent contains the logic implementing intrusion prediction based on reinforcement learning.

The database agent logs all events and attacks and provides an interface for querying by the detection and prediction agents. The communication agent facilitates communication between the sensors, sinks and the base station.

The authors evaluate the time overhead, memory consumption and communication overhead of this scheme, in addition to its prediction accuracy. The algorithm is found to have the lowest time overhead compared to the schemes proposed in [24] [25]. The energy consumed in detecting an attack is naturally greater than that of a sensor in normal operation, but is only marginally greater for an unknown attack than for a known one. The prediction rate is found to be 12 percentage points higher than that of the SGA-based scheme proposed in [26].

A scheme to detect malicious nodes based on energy prediction is proposed in [27]. While most schemes use node interactions or traffic profiles to detect an intrusion, this scheme uses the energy consumption of a node to detect an attack in cluster-based wireless sensor networks. The sensors consume energy in one of four states: 1) transmitting, 2) receiving, 3) sensing, 4) calculating. The authors propose predicting the energy consumed by a sensor by having the cluster head calculate the probability that each sensor node moves from one state to another in a given set of time slots; using these probabilities, the energy dissipation for those time slots is predicted. At the start of each time period, the cluster head predicts the energy dissipation of each sensor, and at the end of each time period it determines the remaining energy level at each sensor. If the actual energy consumed deviates strongly from the predicted dissipation, this is classified as an attack and the node is blacklisted. A blacklisted node is removed from all routing tables and thus isolated from the network. The authors also provide a means for determining the type of attack the malicious node is involved in by characterizing the energy dissipation deviation for five different types of attacks (a short sketch follows the list below), namely:

1) Selective forwarding attack: in this case, the energy dissipation is lower than the predicted value.

2) Hello flood attack: substantially higher energy dissipation than predicted.

3) Sybil attack: difference in energy consumption is larger than a preset threshold.

4) Wormhole attack: double the predicted energy is consumed.

5) Sinkhole attack: difference in energy dissipation increases gradually.
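A compressed sketch of the detection rule of [27] is shown below, with made-up per-state energies, transition probabilities and threshold; for brevity, only the first two attack signatures from the list are distinguished.

```python
STATE_ENERGY = {"tx": 5.0, "rx": 2.0, "sense": 1.0, "calc": 0.5}  # mJ/slot

def predicted_energy(state_probs, slots):
    """Expected energy over `slots` time slots from state probabilities."""
    return slots * sum(p * STATE_ENERGY[s] for s, p in state_probs.items())

def check_node(actual, predicted, threshold=0.25):
    deviation = (actual - predicted) / predicted
    if abs(deviation) <= threshold:
        return "normal"
    return ("possible hello flood (much higher energy)" if deviation > 0
            else "possible selective forwarding (lower energy)")

probs = {"tx": 0.2, "rx": 0.3, "sense": 0.4, "calc": 0.1}
pred = predicted_energy(probs, slots=100)        # -> 205.0 mJ
print(check_node(actual=310.0, predicted=pred))  # flagged as suspicious
```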

Simulation results show that the scheme is more efficient than existing ones: it has high prediction accuracy and does not require any monitoring at individual sensor nodes. As a result, it can detect attacks with very little energy consumption, which is ideal for resource-limited networks like wireless sensor networks.

8. Research Gaps

A large portion of the research in wireless networks relies on knowing the status of a link as reliably as possible. Wireless link prediction schemes predict the time at which the signal quality will degrade, but they are looked at mostly from a handover perspective. Often, signal quality degradation is a transient condition caused by environmental factors and does not eventually lead to a handover. Using historical data to predict the disconnection duration would therefore be instrumental in many applications, such as video streaming and TCP-based browsing sessions. Being able to predict the disconnection duration in non-handover scenarios is thus a gap that needs to be researched in more detail.

Most of the research in the area of imparting intelligence to network elements uses artificial neural networks to perform prediction. With machine learning itself being an active area of research, many newer models have been formulated, and several experiments have shown that these work better than artificial neural networks. Techniques like Random Forests and Deep Learning have been shown to achieve high prediction accuracies. Applying these techniques to current wireless network problems and comparing their accuracy against that of artificial neural networks will help establish their feasibility and set the stage for their use in wireless network research.

The other area that needs to be looked at in further detail is the efficient implementation and resource consumption of machine learning techniques on real-time devices. While a large body of literature uses machine learning techniques to solve problems in wireless networks, only a small portion of it actually looks at how efficient and resource-friendly each of the techniques is. Given the computation-intensive nature of some of the algorithms, it is perfectly possible that they yield extremely accurate predictions but cannot be deployed on actual devices because of the resource consumption involved. Evaluating the various techniques based on their resource consumption in different systems is thus a topic that needs further work. In addition, research on building efficient implementations of machine learning techniques for wireless networks, taking into account the limited memory, computing power and battery life in these networks, is imperative.

Finally, all machine learning models depend heavily on the availability of real datasets. Today, few datasets are available [10], and these are used by researchers to validate their experiments. The usability and ease of deployment of machine learning prediction techniques will be determined by how well the models have been trained and tuned; unrealistic models can cause side-effects on the network, such as service disruption and wastage of resources. Concentrated research effort thus needs to be spent on determining how realistic datasets can be generated, or how they can be captured from the network without side-effects for the user and the network operator.

9. Conclusion

An ever-increasing customer base and the need for ubiquitous computing pose new challenges to network operators. Network elements must be able to evolve continuously with user demands, which is only possible if they are designed to adapt to changing network conditions. Building adaptability into a system involves providing it with some level of intelligence that enables it to take decisions in different situations. In this paper, we looked at the application of prediction techniques to different wireless network problems such as handover latency reduction, routing and link duration prediction. In most cases, these techniques provided a significant improvement over their static counterparts. The large body of research in this area also indicates that the industry is slowly but surely realizing that systems have to become more and more adaptable in order to handle the data explosion. As more machine learning techniques evolve, researchers are beginning to look at more unorthodox techniques that give higher prediction accuracy and are less performance-intensive. A considerable amount of effort is also being put into adapting existing techniques for use in real-time systems.

References

[1] Bhutani, G. (2010) A Near-Optimal Scheme for TCP ACK Pacing to Maintain Throughput in Wireless Networks. Proceedings of the 2nd International Conference on Communication Systems and Networks, Bangalore, January 2010, 491-497.

[2] Liu, T. and Cerpa, A.E. (2011) Foresee (4C): Wireless Link Prediction Using Link Features. 2011 10th International Conference on Information Processing in Sensor Networks (IPSN), Chicago, 12-14 April 2011, 294-305.

[3] Fonseca, R., Gnawali, O., Jamieson, K. and Levis, P. (2007) Four-Bit Wireless Link Estimation. Proceedings of the Sixth Workshop on Hot Topics in Networks (HotNets VI), Atlanta, 14-15 November, 2007.

[4] Alizai, M.H., Landsiedel, O., Link, J.Á.B., Götz, S. and Wehrle, K. (2009) Bursty Traffic over Bursty Links. Proceedings of the 7th ACM Conference on Embedded Networked Sensor Systems, New York, 4-6 November 2009, 71-84.

[5] Liu, H., Al-Khafaji, S.K. and Smith, A.E. (2011) Prediction of Wireless Network Connectivity Using a Taylor Kriging Approach. International Journal of Advanced Intelligence Paradigms, 3, 112-121. http://dx.doi.org/10.1504/IJAIP.2011.039744 

[6] Konak, A. (2009) A Kriging Approach to Predicting Coverage in Wireless Networks. International Journal of Mobile Network Design and Innovation, 3, 65-71. http://dx.doi.org/10.1504/IJMNDI.2009.030838 

[7] Capka, J. and Boutaba, R. (2004) Mobility Prediction in Wireless Networks Using Neural Networks. Management of Multimedia Networks and Services, 3271, 320-333. http://dx.doi.org/10.1007/978-3-540-30189-9_26 

[8] Prasad, P.S. and Agrawal, P. (2010) Movement Prediction in Wireless Networks Using Mobility Traces. 7th IEEE Consumer Communications and Networking Conference (CCNC), Las Vegas, 9-10 January 2010, 1-5.

[9] Prasad, P.S. and Agrawal, P. (2009) Mobility Prediction for Wireless Network Resource Management. 41st Southeastern Symposium on System Theory, Tullahoma, 15-17 March 2009, 98-102.

[10] Crawdad: Wireless Traces from Dartmouth. http://crawdad.cs.dartmouth.edu/

[11] Wanalertlak, W., Lee, B., Yu, C., Kim, M., Park, S.M. and Kim, W.T. (2011) Behavior-Based Mobility Prediction for Seamless Handoff in Mobile Wireless Networks. Wireless Networks, 17, 645-658. http://dx.doi.org/10.1007/s11276-010-0303-x 

[12] Pahal, S., Singh, B. and Arora, A. (2013) A Prediction Based Handover Trigger in Overlapped Heterogeneous Wireless Networks. 2013 IEEE International Conference on Signal Processing, Computing and Control (ISPCC), Solan, 26-28 September 2013, 1-6.

[13] Yan, J., Zhao, L. and Li, J. (2011) A Prediction-Based Handover Trigger Time Selection Strategy in Varying Network Overlapping Environment. 2011 IEEE Vehicular Technology Conference (VTC Fall), San Francisco, 5-8 September 2011, 1-5.

[14] Wang, Q. and Ali Abu-Rgheff, M. (2003) A Multi-Layer Mobility Management Architecture Using Cross-Layer Signalling Interactions. 5th European Personal Mobile Communications Conference, Glasgow, 22-25 April 2003, 237-241.

[15] Yoo, S.J., Cypher, D. and Golmie, N. (2010) Timely Effective Handover Mechanism in Heterogeneous Wireless Networks. Wireless Personal Communications, 52, 449-475. http://dx.doi.org/10.1007/s11277-008-9633-8

[16] Salih, Y.K., See, O.H. and Yussof, S. (2012) A Fuzzy Predictive Handover Mechanism Based on MIH Links Triggering in Heterogeneous Wireless Networks. International Proceedings of Computer Science & Information Technology, 41, 225.

[17] Liang, X., Li, X., Shen, Q., Lu, R., Lin, X., Shen, X. and Zhuang, W. (2012) Exploiting Prediction to Enable Secure and Reliable Routing in Wireless Body Area Networks. 2012 Proceedings IEEE INFOCOM, 25-30 March 2012, 388-396.

[18] Guan, Q., Yu, F.R., Jiang, S. and Wei, G. (2010) Prediction-Based Topology Control and Routing in Cognitive Radio Mobile Ad Hoc Networks. IEEE Transactions on Vehicular Technology, 59, 4443-4452. http://dx.doi.org/10.1109/TVT.2010.2069105 

[19] Alavi, B. and Pahlavan, K. (2006) Modeling of the TOA-Based Distance Measurement Error Using UWB Indoor Ra¬dio Measurements. Communications Letters, 10, 275-277. http://dx.doi.org/10.1109/LCOMM.2006.1613745 

[20] Ravi, R.J. and PonLakshmi, R. (2013) A New Lifetime Prediction Algorithm Based Routing for VANETs. International Journal of Computer Science & Applications (TIJCSA), 1, 72-78.

[21] Sharma, A.K. and Parihar, P.S. (2013) An Effective DoS Prevention System to Analysis and Prediction of Network Traffic Using Support Vector Machine Learning. International Journal of Application or Innovation in Engineering & Management, 2, 249-256.

[22] Wu, J., Liu, S., Zhou, Z. and Zhan, M. (2012) Toward Intelligent Intrusion Prediction for Wireless Sensor Networks Using Three-Layer Brain-Like Learning. International Journal of Distributed Sensor Networks, 2012, 243841. http://dx.doi.org/10.1155/2012/243841 

[23] Maei, H.R., Szepesvari, C., Bhatnagar, S., Precup, D., Silver, D. and Sutton, R.S. (2009) Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation. Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS'09), Vancouver, 7-10 December 2009.

[24] Eik Loo, C., Yong Ng, M., Leckie, C. and Palaniswami, M. (2006) Intrusion Detection for Routing Attacks in Sensor Networks. International Journal of Distributed Sensor Networks, 2, 313-332. http://dx.doi.org/10.1080/15501320600692044 

[25] Chen, C., Ma, J. and Yu, K. (2006) Designing Energy-Efficient Wireless Sensor Networks with Mobile Sinks. Proceedings of the 4th ACM Conference on Embedded Networked Sensor Systems (SenSys 2006), Colorado, 31 October-3 November 2006.

[26] Yan, K.Q., Wang, S.C. and Liu, C.W. (2009) A Hybrid Intrusion Detection System of Cluster-Based Wireless Sensor Networks. Proceedings of the International MultiConference of Engineers and Computer Scientists, Hong Kong, 18-20 March 2009, 18-20.

[27] Shen, W., Han, G., Shu, L., Rodrigues, J.J. and Chilamkurti, N. (2012) A New Energy Prediction Approach for Intrusion Detection in Cluster-Based Wireless Sensor Networks. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, 51, 1-12.

 

Recurrent Neural Network Interaction Quality Estimation

Louisa Pragst1, Stefan Ultes2, and Wolfgang Minker1

1 Institute of Communications Engineering, Ulm University, Germany

{louisa.pragst,wolfgang.minker}@uni-ulm.de

2 Cambridge University, Engineering Department, UK

su259@cam.ac.uk

Abstract. Getting a good estimation of the Interaction Quality (IQ) of a spoken dialogue helps to increase user satisfaction, as the dialogue strategy may be adapted accordingly. Therefore, some research has already been conducted on automatically estimating the Interaction Quality. This paper adds to this by describing how Recurrent Neural Networks may be used to estimate the Interaction Quality for each dialogue turn and by evaluating their performance on this task. We will show that RNNs may outperform non-recurrent neural networks.

Keywords: RNN, Sequential Data, Quality of Dialogue, recurrent neural network, neural network, interaction quality, user satisfaction, spoken dialogue system

1 Introduction

While spoken human-computer interaction (HCI) is now widely available on mobile devices, the applications are still limited and the systems behave in a non-user-adaptive manner, i.e., the system does not take into account short-term properties of the user. However, having information about these short-term properties, or user states like user satisfaction, will help render this type of HCI more natural and user friendly [1–4].

Automatically deriving the user satisfaction during the ongoing interaction, for each turn, is a difficult task. We have previously proposed several approaches for solving this problem using either static methods [5–7] or sequential methods [8, 9]. For the latter, the problem is regarded as a sequence of user satisfaction values which are not independent from each other (which seems natural, as a dialogue interaction may itself be considered a sequence).

For such sequential problems, recurrent neural networks (RNNs) [10] have been shown to perform well for modelling specific problems of Spoken Dialogue Systems (SDSs) [11–14]. An RNN extends a regular neural network by adding cyclic connections with a time delay, making it very suitable for sequential problems such as the automatic estimation of user satisfaction, which is represented by the Interaction Quality (IQ) in this work. Hence, in this work, we will investigate and analyse the potential of RNNs for estimating the IQ level, which will be presented in Sect. 2 together with other relevant related work. For applying RNNs to IQ estimation, we will propose possible approaches with different topologies (Sect. 3). Based on a data set of annotated dialogues, all approaches will be evaluated and their results presented in Sect. 4.

2 Significant Related Work

Famous work on determining the satisfaction level automatically is the PARADISE framework by Walker et al. [15]. Assuming a linear dependency between objective measures and User Satisfaction (US), a linear regression model is applied to determine US at the dialogue level. This is not only very costly, as dialogues must be performed with real users, but also inadequate if quality at a finer level is of interest, e.g., the exchange level. In the following, we present work on automatically determining user satisfaction at the exchange level, followed by a description of the Interaction Quality and related estimation approaches.

2.1 User Satisfaction

For predicting subjective quality measures at the exchange level, a considerable amount of research has been conducted hitherto. However, most of this body of work either fails to take the sequential structure of the dialogue into account or suffers from insufficient performance.

Engelbrecht et al. [16] presented an approach using Hidden Markov Models (HMMs) to model the SDS as a process evolving over time. Performance ratings on a 5-point scale ("bad", "poor", "fair", "good", "excellent") were applied by the users of the SDS during the dialogue; the interaction was halted while the user rated. An HMM was created consisting of 5 states (one for each rating) and a 6-dimensional input vector. While Engelbrecht et al. [16] relied on only 6 input variables, we pursue an approach with 29 input variables. Moreover, we investigate dialogues of a real-world dialogue system annotated with quality labels by expert annotators.

Higashinaka et al. [17] proposed a model for predicting turn-wise ratings of human-human dialogues. Ratings ranging from 1 to 7 were applied by two expert annotators labeling for smoothness, closeness, and willingness. They achieved a UAR of only 0.2-0.24, which is only slightly above the random baseline of 0.14.

Hara et al. [18] derived turn-level ratings from overall ratings of the dialogue, which were applied by the users after the interaction on a five-point scale in an online questionnaire. Using n-grams to model the dialogue by calculating n-gram occurrence frequencies for each satisfaction value showed that results for distinguishing between six classes at any point in the dialogue were hardly above chance.

3 Unweighted Average Recall, see Sect. 4.1

2.2 The Interaction Quality Paradigm and the LEGO Corpus

As a more objective means of describing the user's satisfaction with the interaction, Schmitt and Ultes [19] proposed a measure called "Interaction Quality" (IQ), which fulfills the requirements of a quality metric for adaptive dialogue identified by Ultes et al. [20]. For the authors, the main aspect of user satisfaction is that it is assigned by real users. However, this is impractical in many real-world scenarios. Therefore, the usage of expert raters is proposed. Further studies have also shown that ratings applied by experts and users have a high correlation [21].

The general idea of the Interaction Quality (IQ) paradigm, IQ being defined as user satisfaction annotated by expert raters, is to derive a number of interaction parameters from the dialogue system and use those as input variables to train a statistical classifier targeting IQ. Interaction quality is modelled on a scale from 5 to 1 representing the ratings "satisfied" (5), "slightly unsatisfied" (4), "unsatisfied" (3), "strongly unsatisfied" (2), and "extremely unsatisfied" (1).

The IQ paradigm [22] is based on automatically deriving interaction parameters from the SDS. These parameters are fed into a statistical classification module which estimates the IQ level of the ongoing interaction at the current system-user exchange (a system-user exchange comprises a system turn followed by a user turn). The interaction parameters are rendered on three levels (see Fig. 1): the exchange level, the window level, and the dialogue level. The exchange level comprises parameters derived directly from the SDS modules Automatic Speech Recognition (ASR), Spoken Language Understanding (SLU), and Dialogue Management (DM). Parameters on the window and dialogue levels are sums, means, frequencies or counts of exchange level parameters. While dialogue level parameters are computed from all exchanges of the dialogue up to the current exchange, window level parameters are computed only from the last three exchanges.
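As a small illustration of the three levels, the sketch below derives window-level and dialogue-level aggregates of a single exchange-level feature; the feature name and numbers are made up, and the real parameter set is much larger.

```python
def parameter_levels(asr_confidences, window=3):
    """Exchange-, window- and dialogue-level views of one feature."""
    win = asr_confidences[-window:]                      # last 3 exchanges
    return {
        "ASRConfidence": asr_confidences[-1],            # exchange level
        "MeanASRConfidence(window)": sum(win) / len(win),      # window
        "MeanASRConfidence(dialogue)":                         # dialogue
            sum(asr_confidences) / len(asr_confidences),
    }

# ASR confidences of the exchanges seen so far in one dialogue:
print(parameter_levels([0.9, 0.8, 0.4, 0.3, 0.5]))
```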

These interaction parameters are used as input variables to a statistical classification module with the goal of estimating the IQ value. The statistical model is trained on annotated dialogues of the Let's Go Bus Information System in Pittsburgh, USA [23]. For the LEGO corpus [24], 200 calls from 2006, consisting of 4,885 exchanges, were annotated by three different raters, resulting in a rating agreement of κ = 0.54. Furthermore, the raters had to follow labeling guidelines to enable a consistent labeling process [24]. The median of all three ratings is used to establish the ground truth for the IQ value of a given turn.

Schmitt et al. [19], [22] performed IQ recognition on the LEGO corpus using linear support vector machines (SVMs). They achieved a UAR of 0.58 based on 10-fold cross-validation, which is clearly above the random baseline of 0.2. Using the same data, Ultes et al. [8] put an emphasis on the sequential character of the IQ measure by applying Hidden Markov Models (HMMs) and Conditioned Hidden Markov Models (CHMMs). Both were applied using 6-fold cross-validation and a reduced feature set of the LEGO corpus, achieving a UAR of 0.44 for HMMs and 0.39 for CHMMs. In addition to modelling the observation probabilities of the HMM with generic Gaussian Mixture Models [8], Ultes et al. also proposed the combination of static classifiers with HMMs, resulting in a hybrid approach [9]: there, the observation probabilities are modelled using the confidence scores of the static classification algorithms. By that, they were able to beat the performance of the confidence-providing SVM by 2.2% in UAR.

5 UAR, κ and ρ are defined in Sect. 4.1

 

Fig. 1: This figure, originally published by Schmitt et al. [24], shows the three parameter levels constituting the interaction parameters: the exchange level, containing information about the current exchange; the window level, containing information about the last three exchanges; and the dialogue level, containing information about the complete dialogue up to the current exchange.

3 Recurrent Neural Networks for IQ Estimation

Following the approaches presented by Ultes et al. [8], we focus on the sequential character of the IQ measure and propose recurrent neural networks [10] for IQ estimation. Recurrent neural networks enhance regular neural networks with cyclic, time-delayed connections that can be used as memory for earlier events. This makes them a reasonable choice for sequential problems such as the one presented.

The architecture of a recurrent neural network can be varied in many ways, e.g. the number of layers, the number of nodes in a layer, as well as the number, placement and time delay of the recurrent connections. All of these factors potentially influence the performance of the recurrent neural network.

In this work, we evaluate recurrent architectures that differ in the placement of the recurrent connections, as depicted in Fig. 2, as well as in the number of nodes in the hidden layer. More variations, such as the number of hidden layers and different time delays, were assessed, but yielded comparable results and are therefore not discussed further in this work.

We analyse the IQ estimation capability of three recurrent architectures for neural networks. They are introduced in the following.

An Elman network [25] is a simple recurrent network with a recurrent connection from the hidden layer to itself. The previous results of the hidden layer may be combined with the current input values in order to achieve a better estimate of the IQ value.

(a) Architecture of a non-recurrent neural network. (b) Architecture of an Elman network. (c) Architecture of a NARX network. (d) Architecture of a time delay network.

Fig. 2: The different architectures of recurrent neural networks we evaluate in this work, shown here with 2 nodes in the hidden layer. The nodes of the hidden layer are connected to the input variables and a bias; they use a tanh transfer function. The output layer contains one node that gets its input from the nodes of the previous layer and a bias; it uses a linear transfer function.

A NARX network [26] utilises a recurrent connection from the output layer to the hidden layer. Thus, the estimated IQ value of the preceding exchange may be used for the current estimation. As raters were instructed to change the IQ value by at most one from one exchange to the next (except in severe cases), the previous estimate might prove a valuable input.

Finally, a time delay network [27] is assessed. This architecture incorporates a recurrent connection from the input to the hidden layer. By comparing preceding input values with the current ones, differences that lead to an adjustment of the IQ value might be discovered and taken into consideration.

For all architectures, the nodes of both layers involved in recurrence are interconnected with each other by recurrent connections, and the time delay is set to one time step.

The number of nodes in the hidden layer varies from 2 to 20. The number of available nodes has an impact on the complexity of the calculations the network is able to perform. A large number of nodes is beneficial if IQ estimation from the provided features requires an intricate mathematical function. On the other hand, a large number of nodes impedes the training process, and a satisfactory result might not be reached with the given amount of training data. By varying the number of nodes, the best trade-off can be determined.

The following characteristics are shared by all networks covered in this work. The networks contain only one hidden layer. All nodes of this layer are connected to all input variables as well as a bias, and use a tanh transfer function. The output layer consists of one node that gets its input from all nodes of the hidden layer and possibly a recurrent connection. It has a bias and uses a linear transfer function.

Although the presented problem is a classification task, the networks thus perform a regression in their output layer. A different output layer implementing classification was also tested but yielded worse results, so the regression architecture was adopted. The final estimated IQ value is obtained by rounding the result of the network.
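A minimal numpy sketch of this shared architecture, in its Elman variant, is given below: a single tanh hidden layer with a bias and a time-delayed copy of its own previous activation, and a linear one-node output whose rounded value yields the IQ estimate (the clip is only a guard for the untrained demo weights). Sizes and weights are illustrative, not trained values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 41, 10                      # 41 input features, 10 nodes
W_in = rng.normal(0, 0.1, (n_hidden, n_in))  # input -> hidden weights
W_rec = rng.normal(0, 0.1, (n_hidden, n_hidden))  # time-delayed recurrence
b_h = np.zeros(n_hidden)                     # hidden bias
w_out = rng.normal(0, 0.1, n_hidden)         # hidden -> output weights
b_out = 0.0                                  # output bias

def estimate_dialogue(exchanges):
    """Run one dialogue (one feature row per exchange) through the net."""
    h = np.zeros(n_hidden)                   # memory of previous exchange
    iq = []
    for x in exchanges:
        h = np.tanh(W_in @ x + W_rec @ h + b_h)    # Elman recurrence
        y = w_out @ h + b_out                      # linear regression output
        iq.append(int(np.clip(np.rint(y), 1, 5)))  # round onto IQ scale 1..5
    return iq

print(estimate_dialogue(rng.normal(size=(4, n_in))))
```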

4 Experiments and Results

The performance of different recurrent neural networks is evaluated with the metrics described in the following.

4.1 Evaluation Metrics

Interaction Quality estimation is evaluated using three commonly used metrics: Unweighted Average Recall (UAR), Cohen's Kappa [28] and Spearman's Rho [29]. These metrics are selected because they were also used in Schmitt et al. [22].

Recall in general is defined as the rate of correctly classified samples belonging to one class. For a multi-class classification problem with N classes, the recall recall_i is computed for each class i, and the UAR is the average over all class-wise recalls:

UAR = (1/N) Σ_{i=1}^{N} recall_i (1)
Cohen's Kappa measures the relative agreement between two corresponding sets of ratings. In our case, we compute the number of label agreements, corrected by the chance level of agreement, divided by the maximum proportion of times the labelers could agree. As ordinal scores are compared, Cohen's weighted Kappa is applied [30]: a weighting factor w is introduced which reduces the discount of a disagreement the smaller the difference between the two ratings is:


w = rmax  rmin .

r1  r2 (2)


Here, r1 and r2 denote the rating pair and rmax and rmin the maximum and minimum ratings possible.

Correlation between two variables describes the degree to which one variable can be expressed by the other. Spearman's Rho is a non-parametric method assuming a monotonic function between the two variables [29].
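For one dialogue, all three metrics are a few library calls; the sketch below uses scikit-learn and scipy on hypothetical rating sequences (the linear weighting corresponds to Eq. (2)).

```python
from sklearn.metrics import recall_score, cohen_kappa_score
from scipy.stats import spearmanr

truth = [5, 5, 4, 3, 3, 2]        # expert IQ labels (hypothetical)
pred = [5, 4, 4, 3, 2, 2]         # estimated IQ values (hypothetical)

uar = recall_score(truth, pred, average="macro")          # Eq. (1)
kappa = cohen_kappa_score(truth, pred, weights="linear")  # weighted kappa
rho, _ = spearmanr(truth, pred)                           # Spearman's Rho

print(f"UAR={uar:.2f}  kappa={kappa:.2f}  rho={rho:.2f}")
```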

4.2 Experimental Setup

To evaluate the performance of recurrent neural networks for IQ estimation, we use the LEGO corpus presented in Sect. 2.2. Exchanges with fewer than three IQ ratings are excluded from the evaluation as too unreliable, as are dialogues with fewer than three exchanges, since the IQ value cannot be adjusted before the third exchange. Furthermore, only numeric features which can be derived automatically are used, leaving 41 input features.

For this corpus, we evaluate the performance of the recurrent neural network architectures described in Sect. 3 using 10-fold cross-validation, as well as a neural network without recurrence, in order to determine the impact of sequential information. Whole dialogues are assigned to the train or test condition of the 10-fold cross-validation, instead of assigning each exchange separately, in order to account for the sequential character of the data. The neural networks are trained on the dialogues of the train condition using dynamic backpropagation [31]. Each dialogue is provided to the neural network sequentially, as in a real conversation.

In related work ([8, 9], [19], [22]), the evaluation metrics are computed over the available data in its entirety. As we feel that this approach does not adequately take into account the sequentiality of the data, we utilised a different approach: the evaluation metrics are obtained for each of the 200 dialogues separately and then averaged over all dialogues. The performance of SVMs [19], [22] has been re-evaluated using this new approach in order to achieve comparability. This results in a UAR of 0.55, a κ of 0.30, and a ρ of 0.50 for SVMs.
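The per-dialogue evaluation can be sketched as follows (UAR only; the helper name per_dialogue_uar is ours): each metric is computed on one dialogue at a time, and the scores are then averaged over all dialogues.

    import numpy as np
    from sklearn.metrics import recall_score

    def per_dialogue_uar(y_true, y_pred, dialogue_ids):
        scores = []
        for d in np.unique(dialogue_ids):
            m = dialogue_ids == d                 # exchanges of one dialogue
            scores.append(recall_score(y_true[m], y_pred[m], average="macro"))
        return float(np.mean(scores))             # average over all dialogues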

 

8 Louisa Pragst, Stefan Ultes, and Wolfgang Minker

4.3 Results

The average results over all 200 dialogues for all configurations and evaluation metrics can be found in Table 1.

The best results are achieved using a NARX network with ten nodes in the hidden layer: this architecture yields a UAR of 0.60, a κ of 0.37, and a ρ of 0.55. It outperforms the SVM approach (UAR: 0.55, κ: 0.30, ρ: 0.50).

When comparing the recurrent architectures with the non-recurrent architecture, statistical tests suggest that recurrence does not significantly increase the performance of neural networks for IQ estimation (p > 0.05 for the Elman and NARX networks). The time delay network even significantly decreases the performance (p < 0.001).

These results might be due to the fact that training recurrent neural networks is more complex than training non-recurrent neural networks [32, 33]. By utilising more elaborate training approaches, it may be possible to achieve different results.

Furthermore, the number of nodes does not appear to have a significant impact on the ability of a neural network to estimate the IQ value. While occasional significant differences between consecutive numbers of nodes can be found, those differences are isolated and not consistent across evaluation metrics. The assumption that performance would first improve as more complex calculations become possible and then decrease as training becomes more difficult cannot be confirmed.

5 Conclusion

This work presented recurrent neural networks for the estimation of Interaction Quality on the exchange level. We showed that both non-recurrent and recurrent neural networks can outperform previous approaches such as SVMs in this task.

The sequential structure of dialogues suggests that recurrent neural networks might exceed the performance of non-recurrent neural networks in estimating the Interaction Quality on the exchange level, as they incorporate information about previous exchanges in their calculations. By evaluating the performance of different recurrent neural network architectures and comparing them to the performance of a neural network without recurrence, we found that Elman and NARX networks do not significantly outperform a corresponding non-recurrent neural network. Moreover, the time-delay network performs significantly worse than the non-recurrent network. This might be due to the fact that the training of recurrent networks is more complex than the training of non-recurrent networks. It is probable that more training data or a more refined training approach is needed.

For future research, the number of input features may be reduced in a meaningful way, or a larger corpus of training data may be used, in order to train the recurrent neural networks more efficiently. Furthermore, the training approach could be refined. Different recurrent architectures, for example a combination of the networks presented in this work, could be tested in addition.

 


Table 1: Results given by the Unweighted Average Recall (UAR), Cohen’s Kappa (κ), and Spearman’s Rho (ρ).

(a) Performance of the baseline network:

#nodes  UAR   κ     ρ
2       0.56  0.35  0.53
3       0.55  0.34  0.52
4       0.56  0.34  0.52
5       0.56  0.35  0.51
6       0.55  0.35  0.53
7       0.57  0.36  0.52
8       0.54  0.33  0.52
9       0.56  0.35  0.52
10      0.55  0.35  0.52
11      0.53  0.31  0.49
12      0.52  0.34  0.54
13      0.57  0.36  0.53
14      0.58  0.38  0.54
15      0.56  0.35  0.52
16      0.57  0.37  0.53
17      0.57  0.35  0.53
18      0.57  0.37  0.53
19      0.58  0.36  0.53
20      0.56  0.35  0.53

(b) Performance of the Elman network:

#nodes  UAR   κ     ρ
2       0.54  0.31  0.48
3       0.55  0.33  0.51
4       0.56  0.33  0.52
5       0.56  0.33  0.50
6       0.55  0.32  0.52
7       0.57  0.35  0.52
8       0.55  0.34  0.51
9       0.55  0.34  0.50
10      0.56  0.34  0.52
11      0.56  0.35  0.54
12      0.57  0.35  0.53
13      0.57  0.35  0.52
14      0.58  0.37  0.53
15      0.58  0.35  0.51
16      0.55  0.35  0.53
17      0.59  0.37  0.55
18      0.58  0.35  0.53
19      0.57  0.36  0.53
20      0.58  0.36  0.53

(c) Performance of the NARX network:

#nodes  UAR   κ     ρ
2       0.57  0.34  0.52
3       0.55  0.33  0.53
4       0.57  0.35  0.52
5       0.54  0.32  0.52
6       0.56  0.32  0.51
7       0.55  0.33  0.52
8       0.57  0.35  0.54
9       0.57  0.35  0.53
10      0.60  0.37  0.55
11      0.58  0.36  0.53
12      0.57  0.33  0.51
13      0.52  0.31  0.51
14      0.58  0.35  0.52
15      0.57  0.35  0.52
16      0.57  0.34  0.53
17      0.54  0.31  0.49
18      0.56  0.33  0.51
19      0.52  0.32  0.51
20      0.58  0.35  0.52

(d) Performance of the time delay network:

#nodes  UAR   κ     ρ
2       0.55  0.32  0.50
3       0.55  0.33  0.51
4       0.55  0.32  0.49
5       0.56  0.33  0.49
6       0.51  0.30  0.49
7       0.53  0.30  0.48
8       0.55  0.33  0.50
9       0.52  0.29  0.47
10      0.55  0.33  0.50
11      0.55  0.32  0.48
12      0.55  0.32  0.49
13      0.54  0.32  0.49
14      0.56  0.33  0.50
15      0.57  0.31  0.48
16      0.53  0.30  0.47
17      0.55  0.32  0.47
18      0.52  0.29  0.46
19      0.55  0.32  0.48
20      0.51  0.29  0.44

 



Acknowledgments This paper is part of a project that has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 645012.

References

1. Ultes, S., Heinroth, T., Schmitt, A., Minker, W.: A theoretical framework for a user-centered spoken dialog manager. In: Proceedings of the Paralinguistic Information and its Integration in Spoken Dialogue Systems Workshop. pp. 241–246. Springer (Sep 2011)

2. Ultes, S., Dikme, H., Minker, W.: Dialogue management for user-centered adaptive dialogue. In: Proceedings of IWSDS (Jan 2014), http://www.uni-ulm.de/fileadmin/website_uni_ulm/allgemein/2014_iwsds/iwsds2014_lp_ultes.pdf

3. Ultes, S., Dikme, H., Minker, W.: First insight into quality-adaptive dialogue. In: LREC. pp. 246–251 (May 2014)

4. Ultes, S., Kraus, M., Schmitt, A., Minker, W.: Quality-adaptive spoken dialogue initiative selection and implications on reward modelling. In: Proceedings of SIGDIAL. pp. 374–383. ACL (Sep 2015)

5. Ultes, S., Minker, W.: Improving interaction quality recognition using error correction. In: Proceedings of SIGDIAL. pp. 122–126. ACL (Aug 2013), http://www.aclweb.org/anthology/W/W13/W13-4018

6. Ultes, S., Minker, W.: Interaction quality: A review. Bulletin of Siberian State Aerospace University named after academician M.F. Reshetnev (4), 153–156 (2013), http://www.vestnik.sibsau.ru/images/vestnik/ves450.pdf

7. Ultes, S., Platero Sánchez, M.J., Schmitt, A., Minker, W.: Analysis of an extended interaction quality corpus. In: Proceedings of IWSDS (Jan 2015)

8. Ultes, S., ElChabb, R., Minker, W.: Application and evaluation of a conditioned hidden markov model for estimating interaction quality of spoken dialogue systems. In: Proceedings of IWSDS. pp. 141–150. Springer (Nov 2012)

9. Ultes, S., Minker, W.: Interaction quality estimation in spoken dialogue systems using hybrid-HMMs. In: Proceedings of SIGDIAL. pp. 208–217. ACL (Jun 2014), http://www.aclweb.org/anthology/W14-4328

10. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)

11. Mesnil, G., He, X., Deng, L., Bengio, Y.: Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding. In: INTERSPEECH. pp. 3771–3775 (2013)

12. Henderson, M., Thomson, B., Young, S.: Robust dialog state tracking using delexicalised recurrent neural networks and unsupervised adaptation. In: SLT. pp. 360–365. IEEE (2014)

13. Wen, T.H., Gašić, M., Kim, D., Mrkšić, N., Su, P.H., Vandyke, D., Young, S.: Stochastic language generation in dialogue using recurrent neural networks with convolutional sentence reranking. pp. 275–284. ACL (Sep 2015)

 


14. Su, P.H., Vandyke, D., Gasic, M., Kim, D., Mrksic, N., Wen, T.H., Young, S.: Learning from real users: Rating dialogue success with neural networks for reinforcement learning in spoken dialogue systems. arXiv preprint arXiv:1508.03386 (2015)

15. Walker, M., Litman, D.J., Kamm, C.A., Abella, A.: PARADISE: a framework for evaluating spoken dialogue agents. In: Proceedings of ACL. pp. 271–280. ACL, Morristown, NJ, USA (1997)

16. Engelbrecht, K.P., Gödde, F., Hartard, F., Ketabdar, H., Möller, S.: Modeling user satisfaction with hidden markov model. In: Proceedings of SIGDIAL. pp. 170–177. ACL, Morristown, NJ, USA (2009)

17. Higashinaka, R., Minami, Y., Dohsaka, K., Meguro, T.: Issues in predicting user satisfaction transitions in dialogues: Individual differences, evaluation criteria, and prediction models. In: Spoken Dialogue Systems for Ambient Environments, Lecture Notes in Computer Science, vol. 6392, pp. 48–60. Springer Berlin / Heidelberg (2010), 10.1007/978-3-642-16202-2_5

18. Hara, S., Kitaoka, N., Takeda, K.: Estimation method of user satisfaction using n-gram-based dialog history model for spoken dialog system. In: Proceedings of LREC. ELRA, Valletta, Malta (May 2010)

19. Schmitt, A., Ultes, S.: Interaction quality: Assessing the quality of ongoing spoken dialog interaction by experts—and how it relates to user satisfaction. Speech Communication (2015), accepted for publication

20. Ultes, S., Schmitt, A., Minker, W.: Towards quality-adaptive spoken dialogue management. In: NAACL-HLT: SDCTD. pp. 49–52. ACL, Montréal, Canada (Jun 2012), http://www.aclweb.org/anthology/W12-1819

21. Ultes, S., Schmitt, A., Minker, W.: On quality ratings for spoken dialogue systems – experts vs. users. In: Proceedings of NAACL-HLT. pp. 569–578. ACL (Jun 2013)

22. Schmitt, A., Schatz, B., Minker, W.: Modeling and predicting quality in spoken human-computer interaction. In: Proceedings of SIGDIAL. pp. 173–184. ACL, Portland, Oregon, USA (Jun 2011)

23. Raux, A., Bohus, D., Langner, B., Black, A.W., Eskenazi, M.: Doing research on a deployed spoken dialogue system: One year of Let's Go! experience. In: Proc. of ICSLP (Sep 2006)

24. Schmitt, A., Ultes, S., Minker, W.: A parameterized and annotated spoken dialog corpus of the CMU Let's Go bus information system. In: LREC. pp. 3369–337 (May 2012)

25. Elman, J.L.: Finding structure in time. Cognitive Science 14(2), 179–211 (1990)

26. Lin, T., Horne, B.G., Tiňo, P., Giles, C.L.: Learning long-term dependencies in NARX recurrent neural networks. IEEE Transactions on Neural Networks 7(6), 1329–1338 (1996)

27. Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., Lang, K.J.: Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech and Signal Processing 37(3), 328–339 (1989)

28. Cohen, J.: A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20, 37–46 (Apr 1960)

29. Spearman, C.E.: The proof and measurement of association between two things. American Journal of Psychology 15, 88–103 (1904)

30. Cohen, J.: Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin 70(4), 213 (1968)

31. De Jesús, O., Hagan, M.T.: Backpropagation algorithms for a broad class of dynamic networks. IEEE Transactions on Neural Networks 18(1), 14–27 (2007)

 


32. De Jesús, O., Horn, J.M., Hagan, M.T.: Analysis of recurrent network training and suggestions for improvements. In: Proceedings of IJCNN'01, International Joint Conference on Neural Networks. vol. 4, pp. 2632–2637. IEEE (2001)

33. Horn, J., De Jesús, O., Hagan, M.T.: Spurious valleys in the error surface of recurrent networks: analysis and avoidance. IEEE Transactions on Neural Networks 20(4), 686–700 (2009)

 


18th Annual STEM Graduate School and Internship Fair & Career Connections Job Fair

October 20, 2017, 8:00 am – 2:30 pm, Room A300 & Cullen Auditorium



Student poster presentations (title – faculty mentor(s) – institution):

Theoretical Studies of Gallium(III) Complexes Containing Nitrogen and Oxygen Donor Atoms – Dr. Maria Benavides – University of Houston-Downtown
pyGWml: Gravitational Wave Noise Classification with Machine Learning – Dr. Ting Zhang – University of Houston-Downtown
Investigating the role of LSD2/KDM1B in the oncogenic program of MLL-fusion acute myeloid leukemia – Dr. Margarida Albuquerque Almeida Santos – The University of Texas MD Anderson Cancer Center
Elucidating Novel Biotin Biosynthetic Pathway in Ancient Bacteria Desulfovibrio vulgaris – Dr. Gareth Butland, Dr. Valentine Trotter – Lawrence Berkeley National Laboratory
Use of Numeric Methods in the Analysis of the Nonlinear Wave Systems – Dr. Katarina Jegdic – University of Houston-Downtown
A Novel Approach to Patient Education: Cellular Effects of Chemotherapeutic Drugs Visualized by Time-lapse Fluorescent Microscopy – Dr. Richard R. Behringer – The University of Texas MD Anderson Cancer Center
Supercapacitors Based on Transition-metal Cyanoferrates for Renewable Energy Devices – Dr. Mian Jiang – University of Houston-Downtown
Mitral valve co-culture model in 3-Dimensional hydrogels – Dr. Jane Grande-Allen – Rice University
Study of Nickel and Cobalt Cyanoferrates Based Supercapacitor Property – Dr. Mian Jiang – University of Houston-Downtown
The Field of Optometry as Observed Through Shadowing – Dr. Hamida B. Qavi – Eye Excellence
Investigation of the Role of ATO Genes in Candida albicans – Dr. Michael Lorenz – UT McGovern Medical School
Preparation of Cobalt Hydroxide Film Modified Electrode for Battery and Sensor Applications – Dr. Mian Jiang – University of Houston-Downtown
Using the Smart Glove with Visual Feedback for Stroke Rehabilitation – Dr. Ting Zhang – University of Houston-Downtown
Promoting Computer Science to Young Minority Students – Dr. Dvijesh Shastri – University of Houston-Downtown
Sustainable Impact through a Deep Water Culture Aquaponic System from Experiential Learning in the Center of Urban Agriculture and Sustainability (EL CUAS) – Dr. Lisa Morano, Dr. Vassilios Tzouanas – University of Houston-Downtown
Networks of Co-Expressed Protein Coding Genes and Long Non-Coding RNAs in Multiple Cancers – Dr. Benjamin Soibam – University of Houston-Downtown
Analysis of Microbial Biofilms on Plastic Samples from Galveston Bay – Dr. Poonam Gulati Salhotra – University of Houston-Downtown
Theoretical Studies of Radiofluorinated Gallium Complexes with Potential Use as PET imaging agents – Dr. Maria Benavides – University of Houston-Downtown
Functional Dissection of Cancer Genetic Mutations via Variomics Profiling – Dr. Nidhi Sahni – University of Texas MD Anderson Cancer Center
Novel Determination of Species Involved in Solar Energy Storage and Maintenance – Dr. Mian Jiang – University of Houston-Downtown


 

 

 


 

Welcome to the Fall 2017 STEM Graduate School and Internship Fair!

The UHD Scholars Academy STEM Graduate School and Internship/Career Fair (GSIF) is an excellent opportunity for students within the College of Sciences and Technology to connect with prospective graduate schools and research/internship programs. Students are strongly encouraged to speak with as many different programs as possible to explore all available opportunities. Good luck and have fun!


8:00 - 8:50 am Registration/Check-in (Mural Area outside A300); Exhibitor Setup (Room A300)

8:00 - 8:50 am Breakfast Mixer (UHD Coffee House for Students; A300 for Exhibitors)

8:50 - 9:00 am GSIF Welcome – Drs. Parker and Uzman (Cullen Auditorium)

9:00 - 9:30 am Introduction of GSIF Exhibitors (Cullen Auditorium)

9:30 - 10:05 am Student Poster Session (Mural Area outside A300)

10:05 - 11:35 am Graduate School and Internship/Career Fair (Room A300)

11:35 - Noon GSIF 2017 Ends/Exhibitor Teardown (Room A300)

12:00 - 1:00 pm Luncheon for All Exhibitors, Student Poster Presenters, & Faculty Research Mentors (Milam & Travis) – If not attending lunch, GSIF exhibitors are dismissed at this time.

1:00 - 2:30 pm Career Connections Job Fair (A300)


Graduate School & Internship Fair Exhibitors

 

Baylor College of Medicine

28. Master of Science in Genetic Counseling Program

29. Human Genome Sequencing Center (HGSC)-Diversity Initiative

30. SMART Program and Graduate School of Biomedical Sciences

iEducate

21. Work-Study

Lamar University

20. Graduate Programs

Midwestern State University

8. McAda Graduate School

Rice University

1. Electrical and Computer Engineering Ph.D. Program

2. Professional Master's in Engineering and Science

3. Professional Science Master's Program

Sam Houston State University

13. SHSU Online Graduate Programs

14. Graduate Admissions

Southern Methodist University

51. Guildhall (Graduate Video Game Education Program)

52. Lyle School of Engineering

St. Mary's University of San Antonio

9. Office of Graduate Admission and School of Science, Engineering & Technology

Stephen F. Austin State University

47. Mathematics and Statistics

48. Graduate School

Texas A&M International University

7. Graduate School

Texas A&M University

4. Materials Science & Engineering, D3EM

5. College of Veterinary Medicine & Biomedical Sciences

6. Department of Visualization

Texas Chiropractic College

23. Doctor of Chiropractic

 

Texas Tech University

31. Graduate Programs

32. Rawls College of Business

University of Houston

25. College of Pharmacy Graduate Programs

University of Houston-Clear Lake

26. Admissions/College of Science and Engineering

27. Biological and Environmental Sciences- Biotechnology and Biology Programs

University of Houston-Downtown

33. English Department - Master of Science in Technical Communication

34. Graduate Admissions

35. MS in Data Analytics

36. MS - Professional Certificate for Dual-Credit Secondary Mathematics Teaching (Noyce Esprit Program)

University of Houston-Victoria - Katy

24. School of Arts & Sciences

University of the Incarnate Word

16. Doctor of Physical Therapy

17. Feik School of Pharmacy

University of North Texas (UNT)

15. College of Engineering

The University of Texas at Austin

18. Graduate School

The University of Texas at San Antonio

19. The Graduate School

The University of Texas Medical Branch at Galveston

12. School of Health Professions

University of Texas Medical Branch

10. School of Medicine

11. Biochemistry and Molecular Biology Graduate Program

 


 

University of Texas MD Anderson Cancer Center

39. School of Health Professions

MD Anderson Cancer Center UTHealth

40. Graduate School of Biomedical Sciences

University of Texas Health Science Center-Houston

41. Office of Diversity & Inclusion, Admissions and Student Affairs

42. Biostatistics and Data Science

UTHealth School of Biomedical Informatics

43. Office of Academic Affairs

UT Health School of Nursing

44. School of Nursing

UT Health School of Public Health

45. Admissions

University of Texas Health Science Center at San Antonio

46. School of Health Professions

University of St. Thomas

22. Cameron School of Business - Masters in Clinical Translation Management

Special Visiting Guest

Dr. Michael C. Lorenz

Professor

The University of Texas Health Science Center Houston

McGovern Medical School

Department of Microbiology and Molecular Genetics

 

Sponsored by: The Brown Foundation, Inc.; National Science Foundation (0934913); U.S. Department of Education (P120A130040); and UH-Downtown.

UHD Scholars Academy, One Main Street, 725-North; Houston, TX 77002-1001

Phone: 713-222-5344; scholars@uhd.edu; www.uhd.edu/scholars 

 

AUDIO FORENSICS MEETS MUSIC INFORMATION RETRIEVAL - A TOOLBOX FOR INSPECTION OF MUSIC PLAGIARISM

Christian Dittmar, Kay F. Hildebrand*, Daniel Gaertner, Manuel Winges, Florian Müller*, Patrick Aichroth

Fraunhofer IDMT, Department Metadata, Ilmenau, Germany

*University of Münster, Department of Information Systems, Münster, Germany

 

ABSTRACT

This paper presents a toolbox that has been developed in order to facilitate the inspection of suspected music plagiarism cases. The basic concept is the use of techniques from Music Information Retrieval for the semi-automatic inspection of an original and a suspect song. Basic types of music plagiarism are discussed, and several signal processing approaches suitable to reveal these types are introduced. They are intended to be used under the supervision of a human expert. Evaluation of the proposed methods in a non-supervised scenario is not within the scope of this paper.

Index Terms— Music Information Retrieval, Music Plagiarism, Audio Forensics

1. INTRODUCTION

Music plagiarism, i.e. the use of another work while presenting it as one’s own original music, has always been a topic of public interest, making headlines now and then. One recent example of so-called sampling plagiarism has been the case of the German rap artist Bushido [1], whose producer used music excerpts from songs of the bands Dark Sanctuary and Dimmu Borgir without authorization or attribution. A prominent example of melody plagiarism was George Harrison’s song My Sweet Lord, released in 1970. The music label Bright Tunes Music sued Harrison for unauthorized usage of the melody of He’s So Fine by the Chiffons, released in 1962. The lawsuit lasted for more than ten years, finally finding that Harrison indeed imitated the melody, even though the responsible judge believed that he did so unintentionally [2].

Today, with huge public music databases and services such as YouTube1, SoundCloud2 or Spotify3, there are endless opportunities not only for musical inspiration, but also for unintentional and intentional plagiarism. Hence, there is a need for approaches and tools to efficiently and transparently measure indications of plagiarism, helping to sift out

1www.youtube.com

2www.soundcloud.com

3www.spotify.com

 

the qualified cases, and to lower the costs associated with settling disputes.

Typically, when music plagiarism cases are brought to court, independent music experts, often musicologists, are asked to analyze the similarities between two songs, and the judges rely on their opinion. We believe that, in order to support such analysis, specialized software can be provided to analyze musical features of the suspected music recordings. Similarities can be identified by applying well-described pattern matching algorithms from the Music Information Retrieval (MIR) literature. Moreover, such software can display similarities in a way that experts can use not only to evaluate their importance, but also to visualize and explain them to an untrained audience.

This paper is organized as follows: Sec. 2 describes the types and intricacies of music plagiarism. Sec. 3 explains the proposed plagiarism analysis toolbox by outlining the signal processing approaches to inspect each plagiarism type. Sec. 4 concludes this work and gives future directions. It is important to note that a formal evaluation of the proposed algorithms is omitted in this paper. As the build-up of comprehensive test corpora is still in progress, this remains a subject for future work.

2. TYPES OF MUSIC PLAGIARISM

A clear and precise definition of the term “plagiarism” is difficult to derive. However, the notion of intellectual property has been known since ancient times. The poet Martial called another poet a “kidnapper” (lat. plagiarius), because he presented Martial’s poems without permission and claimed that these were his own. This incident is perceived as the first mention of author’s rights, even though an established copyright was unknown [3]. Authorship became more important with the invention of letterpress printing in the late 15th century. In the realm of music, it has been common practice to credit the composer for his sheet music since the 16th century.

2.1. Sampling Plagiarism

The term sampling describes the re-use of recorded sounds or music excerpts in another song [4]. The samples are often manipulated in pitch or tempo to fit the rhythm and tonality of the new song. It is very common to mix additional instruments into the sample, such as additional vocals or drums. The most common use of samples is to crop an excerpt of one or more bars and loop it. More elaborate forms of sampling include rearrangement and post-processing of the respective sample beyond recognition. Sampling has strongly influenced popular music culture; thus, there exist websites4 where sampling cases are collected by a community of music aficionados. Due to the fact that sampling is basically the use of “a song in a song”, it is related to the task of cover song detection [5]. Cover song detection is commonly approached with chroma features, as described in [6], [7] and [8]. A more recent approach is presented in [4], where a well-known audio fingerprinting algorithm [9] is modified in order to retrieve samples inside songs, based on spectral peak signatures.

2.2. Rhythm Plagiarism

A prominent example of rhythm plagiarism is the so-called “Amen Break”. It originates from the 1969 funk recording Amen Brother by The Winstons and is considered one of the most widely used drum loops in the history of rap and electronic music. Some such extraordinary beats are protected by law, but it is often difficult to judge whether two songs share the same rhythm. There is no definition of which instrument plays the rhythm: commonly, the drums make up the beat, but a guitar can also be a dominant rhythmical instrument. In general, rhythm is formed by periodic patterns of accents in the amplitude envelopes of different frequency bands. Rhythm plagiarism has been scarcely covered in the literature, but it is closely related to rhythm similarity estimation [10]. Paulus and Klapuri took the melody as a reference for rhythm [11]. They transformed the melody into rhythmical strings which are easier to compare along structural dimensions. Others extracted rhythmical features such as the beat spectrum or tempo in order to measure rhythmical similarity [12].

2.3. Melody Plagiarism

Copied melodies are less obvious than the previously explained plagiarism types. A melodic motive is considered to be identical even if it is transposed to another key, slowed down, sped up or interpreted with different rhythmic accentuation. Thus, melody plagiarism is a gray area where it is hard to discern copying from citation. However, MIR techniques [13], [14] are suited for the inspection of such cases. In the MIR literature, a closely related task is Query-by-Humming (QbH). QbH can be used to retrieve songs from a database by letting the user hum or sing the respective melody [15]. Melody plagiarism inspection can be done with basically the same approach, since means to identify and evaluate melodic

4www.whosampled.com, www.the-breaks.com

 

similarity are required. The main difference is that QbH searches across extensive databases, while plagiarism detection concentrates on one single comparison, which has to be more precise.

3. PLAGIARISM ANALYZER APPLICATION

We introduce the plagiarism analyzer application, which is developed in the scope of the REWIND5 project. It features a graphical user interface and allows the user to import two music excerpts for analysis and comparison.

3.1. Sampling Plagiarism Inspection

As described in Sec. 2.1, sampling plagiarism occurs in different forms. In this paper, we consider the most common and simple case: music excerpts that are re-sampled and looped as the basis of the plagiarism song.

3.1.1. Brute Force Approach

The most straightforward approach to detect and inspect sampling plagiarism is to compare a time-frequency representation of both music excerpts [5]. We compute the magnitude spectrogram by means of the Short-Term Fourier Transform (STFT) with an approximate hop-size of and block-size of 90 ms. We convert each spectral frame to a constant-Q representation by means of re-sampling to a logarithmically spaced frequency axis, yielding the spectrograms of the original, Xo, and of the suspected plagiarism, Xs, respectively. A number of hypotheses f for the applied re-sampling factor is derived by computing the pair-wise ratios of the strongest periodicities in the energy envelopes of Xo and Xs. In order to retrieve the occurrences of Xo inside Xs, Xo is re-sampled both in time and frequency according to each entry in f, yielding X̃o. Each X̃o is shifted frame-wise along all frames of Xs, and the accumulated absolute difference d is computed between all corresponding time-frequency tiles. Assuming only re-sampling and looping were applied, periodic minima will occur in d. These correspond to the points where an optimal match is found. At these points, it is also possible to subtract the energy of X̃o from Xs, perform the inverse STFT and auralize the result.
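A toy sketch of this brute-force search in Python, assuming for simplicity that no re-sampling took place (the hypothesis loop over f and the constant-Q mapping are omitted, and all signals are invented placeholders): the original spectrogram is slid over the suspect one, and the accumulated absolute difference d is recorded at every frame offset.

    import numpy as np
    from scipy.signal import stft

    fs = 22050
    orig = np.random.randn(fs)                          # placeholder original excerpt
    susp = np.concatenate([orig, np.random.randn(fs)])  # suspect containing the excerpt

    _, _, Xo = stft(orig, fs=fs, nperseg=2048)
    _, _, Xs = stft(susp, fs=fs, nperseg=2048)
    Xo, Xs = np.abs(Xo), np.abs(Xs)                     # magnitude spectrograms

    # accumulated absolute difference d over all frame offsets
    d = np.array([np.abs(Xs[:, t:t + Xo.shape[1]] - Xo).sum()
                  for t in range(Xs.shape[1] - Xo.shape[1] + 1)])
    print(int(d.argmin()))                              # offset of the best match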

3.1.2. Decomposition Approach

The alternative approach is depicted in Fig. 1 and is based on the decomposition of both Xo and Xs by means of Non-Negative Matrix Factorization (NMF) [16] with the modifications proposed in [17]. NMF is suited to factorize a spectrogram according to

X ≈ B · G (1)

5www.rewindproject.eu

 

Fig. 1. Algorithm for inspecting sampling plagiarism

The base spectra B represent characteristics of the sounds used. The amplitude envelopes G represent the time-varying gains associated with each spectrum in B. The plagiarism can be interpreted as a mixture of the known sample and unknown additional sources. We create two sets of base spectra: one to model the original sample, Bo, and another to model the additional sounds, Bs. The Bo are initialized by a preceding NMF of Xo. During the following NMF, these vectors stay fixed and are not updated. The Bs are initialized with random values and are adapted in every NMF iteration. At the end of the NMF, the overall G are supposed to contain separated amplitude envelopes that model the sample and the additional sounds. Of course, this only works if the sample and the plagiarism have the same pitch and tempo, which is not guaranteed. Therefore, all possible variations in pitch are checked by simply shifting the Bo along the logarithmically spaced frequency axis, which equals re-sampling in the time domain. The process described above is repeated for all variations and scored via the reconstruction error.
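A minimal sketch of this partially fixed factorization, using the standard multiplicative updates for Eq. (1) [16]; the pitch-shift loop over Bo and the reconstruction-error scoring are omitted, and all names are ours, not the authors'.

    import numpy as np

    def nmf_fixed_bases(X, Bo, k_free=8, n_iter=200, eps=1e-9):
        # X: magnitude spectrogram (freq x time); Bo: fixed sample bases (freq x ko)
        rng = np.random.default_rng(0)
        Bs = rng.random((X.shape[0], k_free))     # free bases for additional sounds
        B = np.hstack([Bo, Bs])
        G = rng.random((B.shape[1], X.shape[1]))  # gains for all bases
        for _ in range(n_iter):
            G *= (B.T @ X) / (B.T @ B @ G + eps)           # update all gains
            upd = (X @ G.T) / (B @ G @ G.T + eps)          # full base update
            Bs *= upd[:, Bo.shape[1]:]                     # adapt only the free bases
            B = np.hstack([Bo, Bs])                        # Bo stays fixed throughout
        return B, G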

 

3.2. Rhythm Plagiarism Inspection

As described in Sec. 2.2, rhythm plagiarism either poses a special case of sampling or a complete re-arrangement. In this paper, we assume that the original rhythm may have undergone a number of manipulations, such as time stretching, pitch shifting, re-sampling or even shuffling of individual beats. An overview of our approach to inspect rhythm plagiarism is depicted in Fig. 2. The single processing steps are explained in the following sections.

3.2.1. Rhythmical Source Separation

First, all rhythmical components of both Xo and Xs are again extracted by means of NMF. We follow the general approach of computing the NMF with a large number of components and clustering them later on. From both Bo, Bs and Go, Gs, features can be extracted that indicate an assignment to a certain instrument. We use a measure for periodicity [18] and further remove all components that show a low percussiveness [19]. Afterwards, a clustering of the components is necessary, since NMF often splits one instrument into several components. The assignment of components to each other is based on evaluating the correlation between the amplitude envelopes. For the sake of brevity, the reader is referred to [20]. An example of the clustered Go is depicted in Fig. 3. This visualization is also presented in the plagiarism analyzer application for visual inspection by the user.

 


 


 

Fig. 3. Extracted amplitude envelopes of a drum-loop

3.2.2. Tempo Alignment

In order to compare the extracted sources, the tempi of the sequences have to be aligned to each other. Therefore, the Go and Gs are transformed to logarithmically re-sampled autocorrelation functions (Log-Lag ACF) as described in [21]. On the log-lag axis, the lag shift between the Log-Lag ACFs of the songs is retrieved by means of cross correlation. The shift corresponds to a re-sampling factor with which the tempo difference of the sequences can be compensated. Details of the method are described in [22].
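The idea can be sketched as follows: on a logarithmic lag axis, a tempo ratio becomes a constant shift, which cross correlation can recover. The helper names and the crude log-lag re-sampling below are ours, and [21, 22] differ in detail; equally long envelopes are assumed.

    import numpy as np

    def log_lag_acf(env, n_bins=128):
        acf = np.correlate(env, env, mode="full")[len(env) - 1:]
        lags = np.unique(np.geomspace(1, len(acf) - 1, n_bins).astype(int))
        return acf[lags]                         # ACF re-sampled on a log-lag grid

    def tempo_shift(env_o, env_s):
        a, b = log_lag_acf(env_o), log_lag_acf(env_s)
        xc = np.correlate(a - a.mean(), b - b.mean(), mode="full")
        return int(xc.argmax()) - (len(b) - 1)   # shift in log-lag bins ~ tempo ratio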

3.2.3. Similarity of Sources

Every extracted source from the original is compared to the extracted ones from the suspected plagiarism. We take Pearson’s correlation coefficient [23] of all possible pairings of Go and Gs, as well as of Bo and Bs, as similarity measure. Temporal and spectral shifts are accounted for by the use of normalized cross correlation. The best correlation values indicate that a pair of components can be assigned to each other. All distances are accumulated into an overall similarity measure by averaging across the pairs.
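A sketch of the pairing step, with the simplification that the envelopes are already tempo-aligned and equally long, and with plain correlation standing in for the normalized cross correlation:

    import numpy as np

    def pair_and_score(Go, Gs):
        # rows of Go/Gs are amplitude envelopes of extracted components
        scores = []
        for go in Go:
            r = [np.corrcoef(go, gs)[0, 1] for gs in Gs]
            scores.append(max(r))      # best-matching suspect component
        return float(np.mean(scores))  # overall similarity: average over pairs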

3.3. Melody Plagiarism Inspection

As explained in Sec. 2.3, melody plagiarism can be very hard to detect automatically. Thus, our approach starts with an automatic melody transcription as described in [24], yielding discrete MIDI note objects. Since automatic music transcription is still not mature enough to cope with every style and complexity of music recordings, we built a piano-roll view into the plagiarism analyzer that allows the user to transpose, stretch, move, merge and split notes as in conventional MIDI sequencers. Once a satisfactory transcription of the melody in the original song and the suspected plagiarism is available, the comparison is conducted by means of melodic similarity measurement.

3.3.1. Pitch Vector Similarity

The first approach is a local alignment similarity measure utilizing Euclidean distances, as described in [25]. The algorithm splits both melodies into smaller time windows of duration w, each of them represented by a vector of l pitch values. Rests are overwritten by extending the previous note. Each note of a melody has its own time window, starting with the onset of the note and ending with the note active w seconds later. The course of the pitch information within the time w is sub-sampled with sampling interval w/l. We subtract the mean value from each of the sub-sampled vectors in order to guarantee invariance with respect to the musical key. Secondly, the duration of the windows is varied according to ws = µ · wo, where wo is the window size of the original. A multiplier of µ = 1 models the case where original and suspect share the same tempo, a multiplier of µ < 1 indicates that the suspect melody is played faster, and a multiplier of µ > 1 models the case where the suspect melody is played slower than the original.
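A sketch of one window comparison, assuming both melodies are given as sampled pitch curves with rests already filled; the names are ours, and only the mean removal, the w/l sampling and the tempo multiplier µ of the description are modeled:

    import numpy as np

    def pitch_vector(curve, t, fs, w=2.0, l=16):
        # l pitch values sampled at interval w/l, starting at time t
        idx = ((t + np.arange(l) * w / l) * fs).astype(int)
        v = curve[np.minimum(idx, len(curve) - 1)]
        return v - v.mean()                          # mean removal: key invariance

    def window_distance(curve_o, curve_s, t_o, t_s, fs, mu=1.0, w=2.0):
        vo = pitch_vector(curve_o, t_o, fs, w)       # original window
        vs = pitch_vector(curve_s, t_s, fs, mu * w)  # suspect window, ws = mu * wo
        return np.linalg.norm(vo - vs)               # Euclidean local score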

 

3.3.2. Sequence Alignment

The second approach relies on the Smith-Waterman algorithm [26] to find a local alignment between symbol sequences, as described in [27]. The algorithm tries to identify sub-sequences of symbols, which encode the intervals between consecutive notes in the MIDI transcription. On execution, each of these melody fragments is compared to the entire suspect sequence. The resulting scores are ordered descending and presented via the graphical user interface. Fig. 4 shows an example of this visualization: the upper melody is the original and the lower the suspected plagiarism. In this case the suspected melody is played faster, which is indicated by the relative length of both note sequences.
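A minimal Smith-Waterman sketch over interval sequences (semitone steps between consecutive notes); the match, mismatch and gap scores here are illustrative and not the values used in [27]:

    import numpy as np

    def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
        H = np.zeros((len(a) + 1, len(b) + 1))
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                s = match if a[i - 1] == b[j - 1] else mismatch
                H[i, j] = max(0, H[i - 1, j - 1] + s,   # (mis)match
                              H[i - 1, j] + gap,        # gap in b
                              H[i, j - 1] + gap)        # gap in a
        return H.max()  # score of the best local alignment

    print(smith_waterman([2, 2, -1, 3], [0, 2, 2, -1, 3, 1]))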

 

Fig. 4. Visualization of melodic similarity

4. CONCLUSIONS AND FUTURE WORK

We presented a signal processing toolbox for music plagiarism inspection. It combines several techniques from the MIR literature in order to allow semi-automatic analysis of suspected music plagiarism cases. A formal evaluation of the described methods has been omitted and will be the subject of future work. Furthermore, other aspects of music plagiarism, such as the re-use of functional chord progressions, are subjects for further research. It is planned to make a basic version of the plagiarism analyzer software toolbox freely available to the public.

5. ACKNOWLEDGMENTS

This research has been partially funded by the EU project REWIND. The project acknowledges the financial support of the Future and Emerging Technologies (FET) Programme within the Seventh Framework Programme for Research of the European Commission, under FET-Open grant number 268478.

6. REFERENCES

[1] Billboard.biz, “Court Rules Bushido Plagiarized French Act,” Billboard.biz, 2010, last accessed 2012–03–09.

[2] Music Copyright Infringement Resource, “Bright Tunes Music v. Harrisongs Music 420 F. Supp. 177 (s.d.n.y. 1976),” UCLA - School of Law, 1976, last accessed on 2011-09-24.

 

[3] G. Schricker, A. Dietz, C. Rudloff, D. Thum, and S. Hamadeh, Urheberrecht, Beck, 2006.

[4] J. Van Balen, Martín Haro, and J. Serrà, “Automatic identification of samples in hip hop music,” M.S. thesis, London, UK, 2012.

[5] J. S. Downie, M. Bay, A. F. Ehmann, and M. C. Jones, “Audio cover song identification: Mirex 2006-2007 results and analyses,” in ISMIR Systems Evaluation Laboratory, University of Illinois at Urbana-Champaign, 2008.

[6] J. Serrà, E. Gómez, and P. Herrera, “Audio cover song identification and similarity: background, approaches, evaluation, and beyond,” in Advances in Music Information Retrieval, 2010.

[7] Emanuele Di Buccio, Nicola Montecchio, and Nicola Orio, “A scalable cover identification engine,” in Proceedings of the International Conference on Multimedia, New York, NY, USA, 2010, MM ’10, pp. 1143–1146, ACM.

[8] Emanuele Di Buccio, Nicola Montecchio, and Nicola Orio, “Falcon: Fast lucene-based cover song identification,” in Proceedings of the International Conference on Multimedia, New York, NY, USA, 2010, MM ’10, pp. 1477–1480, ACM.

[9] A. L.-C. Wang, “An industrial strength audio search algorithm,” in Proc. of the 4th International Conference on Music Information Retrieval (ISMIR), Baltimore, USA, 2003.

[10] A. Kapur, M. Benning, and G. Tzanetakis, “Query-by-beat-boxing: Music retrieval for the DJ,” in Proc. of the 5th International Conference on Music Information Retrieval (ISMIR), 2004.

[11] J. Paulus and A. Klapuri, “Measuring the similarity of rhythmic patterns,” in 3rd International Conference on Music Information Retrieval (ISMIR), 2002.

[12] J. Foote, M. Cooper, and U. Nam, “Audio retrieval by rhythmic similarity,” in 3rd International Conference on Music Information Retrieval (ISMIR), 2002.

[13] R. Typke, F. Wiering, and R. C. Veltkamp, “A survey of music information retrieval systems,” in Proc. of the 6th International Conference on Music Information Retrieval (ISMIR), 2005.

[14] M. Müller, D. Ellis, A. Klapuri, and G. Richard, “Signal processing for music analysis,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 6, Oct. 2011.

[15] M. Ryynänen, Singing Transcription, Springer, 2006.

 

[16] D. D. Lee and H. Sebastian Seung, “Algorithms for non-negative matrix factorization,” in NIPS, MIT Press, 2001.

[17] P. Smaragdis, “From learning music to learning to separate,” invited paper at Forum Acusticum, Budapest, Hungary, 2005.

[18] D. Gärtner, “Tempo estimation from urban music using non-negative matrix factorization,” in Proc. of the AES 42nd International Conference on Semantic Audio, Ilmenau, Germany, 2011.

[19] C. Uhle, C. Dittmar, and T. Sporer, “Extraction of drum tracks from polyphonic music using independent subspace analysis,” in Proc. of the 4th International Symposium on Independent Component Analysis and Blind Signal Separation, 2003.

[20] C. Dittmar and C. Uhle, “Further steps towards drum transcription of polyphonic music,” in Proceedings of the Audio Engineering Society 116th Convention (AES), Berlin, Germany, 2004.

[21] C. Dittmar, M. Gruhne, and D. Gärtner, “Preprocessing methods for rhythmic mid-level features,” in Proceedings of the International Conference on Acoustics NAG/DAGA, Rotterdam, The Netherlands, 2009.

[22] T. Völkel, J. Abeßer, C. Dittmar, and H. Großmann, “Automatic genre classification on latin music using characteristic rhythmic patterns,” in Proceedings of the Audio Mostly Conference, Piteå, Sweden, 2010.

[23] J. L. Rodgers and W. A. Nicewander, “Thirteen ways to look at the correlation coefficient,” The American Statistician, February 1988.

[24] K. Dressler, “Pitch estimation by the pair-wise evaluation of spectral peaks,” in Proc. of the AES 42nd International Conference on Semantic Audio, Ilmenau, Germany, 2011.

[25] M. Ryynänen and A. Klapuri, “Query by humming of midi and audio using locality sensitive hashing,” in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008.

[26] D. Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press, 1st edition, April 1997.

[27] M. Dittrich, “Untersuchung und Optimierung eines musikalisch fundierten Suchverfahrens zur Melodieerkennung” (Investigation and optimization of a musically grounded search method for melody recognition), M.S. thesis, Ilmenau University of Technology, 2003.

 


 

ISSN: 2321-7782 (Online)

Volume 2, Issue 12, December 2014

International Journal of Advance Research in Computer Science and Management Studies

Research Article / Survey Paper / Case Study

Available online at: www.ijarcsms.com

Content Delivery Network Approach to Improve Web Performance: A Review

Meenakshi Gupta
Research Scholar, MMICT&BM (MCA), Maharishi Markandeshwar University, Mullana, Haryana, India

Atul Garg
Associate Professor, MMICT&BM (MCA), Maharishi Markandeshwar University, Mullana, Haryana, India

Abstract: The explosive growth of web traffic has made the efficient delivery of contents a major concern. Web servers are becoming overwhelmed by the increasing number of users and the escalating volume and size of contents, which adversely affects web service providers and organizations relying on the web. A Content Delivery Network (CDN) is an effective approach to alleviate congestion on the network and servers and to improve the response to end-users. It optimizes content delivery by replicating the contents on surrogate servers placed at the edge of the Internet. Apart from steadily increasing web traffic, a flash crowd is a further congestion phenomenon on the Internet: unlike general Internet congestion, it suddenly directs a heavy workload towards particular websites. Hence, it becomes crucial to maintain web performance in such unpredictable situations. Further, streaming media objects make up a steadily increasing fraction of the contents transmitted on the Internet; these objects demand higher bandwidth and consistency. This paper analyzes the existing strategies for content replication, request routing, flash crowd mitigation and streaming media contents for the design and implementation of efficient content delivery networks.

Keywords: Content delivery network; content replication; request routing; flash crowd; media streaming.

I. INTRODUCTION

The web has emerged as a universal medium for exchanging information and services. It is no longer only a medium for accessing information; it is becoming a platform for business and society. More and more activities of our life are moving online, and the number of web users is growing at a fast pace. With the increasing number of web users and web services, retrieval of information from the web poses various challenges such as latency, network bottlenecks, security and reliability problems. Furthermore, web users are no longer passive consumers; they have become active contributors to the web, creating web contents by posting on social networking sites, blogs, wikis and feedback forms.

The increasing use of the web within the existing client-server (CS) networking model results in poor performance for popular websites. These websites are often flooded with requests, which may either delay the response to end-users or prevent a response altogether. End-users may then switch to other websites, resulting in a loss for the website owners. Even with the constant improvement in Internet infrastructure and the increasing capacity of servers, web users still suffer from significant access delays [9]. To alleviate this problem, various approaches have been suggested by researchers, such as proxy caching, clustering, multihoming and mirroring. In spite of their usefulness, the limitations of these approaches have led to the development of the Content Delivery Network (CDN). The concept of a CDN is based on placing replicas of contents closer to the end-users in order to increase the scalability, availability and accessibility of the contents and, as a result, to improve the user-perceived performance in receiving the requested web contents. It transparently delivers the content to the end-users on behalf of the origin server. A request redirection algorithm is used to select the best replica server, and the user’s request is redirected to that server.

A CDN also helps sustain web performance during bursts of traffic, and it can deliver streaming services economically and reliably.

In this paper, various approaches proposed by researchers to improve web performance are discussed in section 2. Section 3 describes the basic architecture and working of a CDN. A detailed discussion of the content distribution and request routing systems is provided in sections 4 and 5. Section 6 introduces strategies to mitigate the effect of flash crowds, while strategies for streaming media contents are described in section 7. Section 8 concludes the paper.

II. RELATED WORK

In the centralized model of servicing web requests, a single physical location is used to provide information to all users. This approach is not suitable for popular websites handling an increasing volume of contents and web users: the centralized model is not scalable, and the server easily becomes overloaded, causing requests to fail and adversely affecting web performance. Various approaches have been suggested by researchers to improve web performance. Some of these are discussed below [19]:

Increase server capacity: One solution may be to improve server capacity by adding more memory and storage and improving the speed of the server. However, this approach scales only up to a point.

Proxy Caching: In this approach, the web contents that are frequently or recently accessed by end-users are stored in the cache memory of proxy servers. Future requests for these contents are satisfied by the proxy server rather than being forwarded to the origin server. This reduces network traffic, the load on the web server and the response time [34]. In [3], the Portable Extended Cache Approach (PECA) is proposed to store frequently used data at the user end in an extended cache memory to enhance the computational performance of web services; the extended cache memory may be a pen drive, compact disk (CD), Digital Versatile Disk (DVD) or any other secondary storage. However, contents are cached only after users have requested them, and the cache must be kept up to date, otherwise users may receive stale contents. Moreover, users these days often access the web to find something new, so the cache hit ratio tends to be low. This limits the improvement in web content delivery performance.

Clustering: Local clustering can improve the reliability and scalability of the server. However, it helps neither in reducing the latency of a web request nor that of the corresponding response. Further, if the data center or ISP fails, the entire cluster becomes inaccessible to end-users. It is also difficult to scale a cluster to thousands of servers.

Mirroring: To solve the problems of clustering, mirroring can be used, in which clusters are deployed at different locations and the contents are copied onto these mirrors. However, the mirroring approach is expensive and complex, as it requires the cost of establishing the mirrors and maintaining the consistency of contents on them.

Multihoming: This uses multiple ISPs to connect to the Internet in order to improve the reliability of accessing the contents. In multihoming, however, two supposedly independent links may use the same transmission line, which reduces the reliability.

Although these approaches are useful for improving user-perceived web performance, they each address one or a few aspects rather than all the issues involved in web content delivery. Therefore, the CDN approach has become more popular for overcoming the limitations of these approaches and further improving the performance of web content delivery to end-users. The interest in CDNs has also flourished due to the successful emergence of various CDN service providers such as Akamai, DigitalIsland, MirrorImage, EdgeCast, Limelight Networks etc. Some content providers that make use of CDN services to boost the delivery of contents to their users are Yahoo, Twitter, Rediff, Adobe, MTV, Kellogg’s, Pinterest, Kodak, StumbleUpon, Facebook, BBC, Toyota, Samsung, LinkedIn, McAfee etc.

 


 

III. ARCHITECTURE AND WORKING OF CDN

A content delivery network is a system of distributed surrogate servers (also called replica servers) that deliver web contents to end-users on behalf of the origin server. The contents of the origin server are replicated on the surrogate servers, and requests from end-users are redirected to surrogate servers closer to them, as shown in Figure 1. As a result, the load on the origin server is reduced and the available network bandwidth effectively increases.

 

Figure 1: A Model of Content Delivery Network

The rationale behind using a CDN is to improve web content delivery to end-users. This also increases the availability and accessibility of contents. It offers fast and reliable applications and services to end-users on behalf of the origin server. The basic components and working of a Content Delivery Network are shown in Figure 2.

Figure 2: Basic Components and Working of Content Delivery Network

The basic working of a CDN is as follows (a toy sketch of the request-routing step appears after the list):

1. The origin server provides the contents to be replicated to the Distribution System.

2. The Distribution System replicates the contents on the surrogate servers and maintains the consistency of data at the surrogate servers.

3. The Distribution System also provides information about the replication to the Request Routing System to help in selecting a surrogate server for redirecting end-user requests.

4. The request for contents from an end-user is directed to the Request Routing System.

5. The Request Routing System redirects the request to a suitable surrogate server. This process is transparent to end-users.

6. The selected surrogate server satisfies the end-user request on behalf of the origin server.

7. The surrogate server also sends a log of the data transferred to the Accounting System.

 


 

8. The Accounting System aggregates this information for use by the origin server and for billing purposes according to the agreement with the content provider.

9. The origin server uses this aggregated information to decide which contents should be replicated and where, in order to further improve web performance while minimizing cost.
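As a toy illustration of the request-routing step (5), the following Python sketch redirects each request to the geographically closest surrogate; real CDNs use far richer policies (load, measured latency, DNS), and all names and coordinates here are invented.

    # surrogate server locations as (latitude, longitude)
    surrogates = {"eu": (50.1, 8.7), "us": (40.7, -74.0), "in": (28.6, 77.2)}

    def route(user_pos):
        # pick the surrogate with the smallest squared coordinate distance
        def dist2(s):
            lat, lon = surrogates[s]
            return (lat - user_pos[0]) ** 2 + (lon - user_pos[1]) ** 2
        return min(surrogates, key=dist2)

    print(route((48.8, 2.3)))  # a user near Paris is routed to "eu"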

IV. CONTENT DISTRIBUTION SYSTEM

The content distribution system deals with the placement of surrogate servers, the selection of surrogate servers for content replication, content distribution and consistency management. The purpose is to maximize throughput and minimize cost, hence making optimum use of network resources. An optimal distribution of contents on surrogate servers helps improve network performance and facilitates the design of scalable, reliable and efficient content delivery systems.

A. Surrogate Server Placement:

The efficient operation of a CDN requires decisions about the number of surrogate servers, their locations and their capacities. Various approaches have been proposed for this purpose, considering one or more factors such as proximity to end-users, load on the surrogate server and the cost of updating the contents. Generally, the problem is to place M surrogate servers among N candidate sites (N > M) in a way that leads to the lowest cost. Several algorithms have been proposed for surrogate server placement, such as Random, Greedy, Hot spot, Super Optimal, Tree based and Topology informed [4, 23 & 33]. However, we eschew the details of surrogate server placement, as most CDN service providers already have well-established infrastructures. We will therefore mainly consider how this infrastructure can be better utilized to further improve web content delivery performance.

B. Content Replication:

An important decision related to the design and implementation of CDNs is the optimal replication of contents on surrogate servers, as the number and capacity of surrogate servers are limited. This requires deciding which contents should be replicated and on which surrogate servers. The surrogate servers for the replication of contents are selected such that the cost of placing and accessing the contents is minimized and the user-perceived performance of web content delivery is maximized.

The problem of content replication is NP-complete. Several heuristics have been proposed to solve it and improve CDN performance, mostly concentrating on the efficient replication of the contents of one origin server. Kangasharju et al. [21] considered the global case of replicating contents from several origin servers. They modeled content replication as a combinatorial optimization problem and showed that this optimization problem is NP-complete. They assumed nodes with finite storage capacity and made replication decisions on per-object granularity, taking into consideration the cooperation between CDN servers. They developed four natural heuristics, namely Random, Popularity, Greedy-Single and Greedy-Global, for possible placements of contents and showed that the best performing heuristic is Greedy-Global, in which all the CDN servers cooperate.
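In the spirit of such a greedy heuristic (cf. Greedy-Global [21]), the following toy sketch repeatedly makes the single (object, server) placement that most reduces the total expected access cost until every server's storage is full; all numbers, names and the cost model are invented for illustration, not taken from [21].

    import numpy as np

    rng = np.random.default_rng(1)
    n_srv, n_obj, capacity = 3, 5, 2
    dist = rng.random((n_srv, n_srv))           # inter-server distances
    np.fill_diagonal(dist, 0.0)
    pop = rng.dirichlet(np.ones(n_obj))         # object popularities
    placed = [set() for _ in range(n_srv)]      # replicas held per server
    ORIGIN = 1.0                                # cost of fetching from the origin

    def total_cost():
        cost = 0.0
        for u in range(n_srv):                  # users attached to each server
            for o in range(n_obj):
                holders = [s for s in range(n_srv) if o in placed[s]]
                cost += pop[o] * min((dist[u, s] for s in holders), default=ORIGIN)
        return cost

    while any(len(p) < capacity for p in placed):
        best = None                             # (cost, server, object)
        for s in range(n_srv):
            if len(placed[s]) == capacity:
                continue
            for o in set(range(n_obj)) - placed[s]:
                placed[s].add(o)                # tentatively place the replica
                c = total_cost()
                placed[s].remove(o)
                if best is None or c < best[0]:
                    best = (c, s, o)
        placed[best[1]].add(best[2])            # commit the best placement

    print(placed)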

Most object replication algorithms assume a static placement of replicas and do not adapt to changes in clients’ access patterns. Scalable Content Access Network (SCAN) [40] is a scalable replica management framework that dynamically places a minimal number of replicas while meeting client Quality of Service (QoS) and server capacity constraints. The Adaptive Genetic Replication Algorithm (AGRA) [37] is a hybrid genetic algorithm that combines the features of both static and dynamic algorithms. It uses the current replica distribution as input and calculates a new one using the network attributes and the changes that have occurred; AGRA adapts to a changing environment very quickly. Adaptive Distributed Request Window (ADRW) [25], an adaptive object replication algorithm, dynamically adjusts the allocation scheme of objects to minimize the total servicing cost of the arriving requests in a distributed system.

 


Latency-based object placement (Lat-cdn) [12] is a network-adaptive technique for replicating contents to surrogate servers that does not require any prior knowledge of request statistics. The algorithm uses an object's latency to make the replication decision under a cooperative push-based scheme. However, this approach does not consider the load of the objects, so during a flash crowd event some surrogate servers may become overloaded. This technique was later improved as the il2p algorithm [13], which integrates both network latency and object load on servers to significantly improve request response times.

A CDN outsources contents on behalf of the origin server and in return charges according to usage. Constraint P-median (CPM) [17] is an optimization model based on a multiple minimum-cost flow model. The CPM algorithm consists of three parts: a replication-algorithm preprocess, the constraint P-median model, and an iterative method for solving constraint P-median problems. It allocates replicas of files on storage-capacity-constrained servers in order to minimize the total cost. In [10], a CDN utility metric is proposed as a parameter for the pricing policy. The metric relates the number of bytes of served content to the number of bytes of pulled content; the purpose is to place requested objects on surrogate servers with respect to the CDN utility metric. The best surrogate server is the one that produces the maximum CDN utility value and improves the CDN's performance. Figure 3 shows different content placement strategies:

Figure 3: Content Placement Strategies
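As a minimal sketch of the utility-driven choice described above: the exact formula in [10] may be normalized differently, so the plain served-to-pulled byte ratio below is an assumption, and the server names are illustrative.

```python
def cdn_utility(bytes_served, bytes_pulled):
    """Sketch of a CDN utility metric in the spirit of [10]: it relates
    bytes served to clients against bytes pulled from the origin; a
    higher value means the replicas absorb more of the demand."""
    if bytes_pulled == 0:
        return float("inf")   # everything served from replicas
    return bytes_served / bytes_pulled

# The candidate surrogate with the maximum utility value is preferred.
candidates = {"s1": (9_000, 1_500), "s2": (7_000, 700)}
best = max(candidates, key=lambda s: cdn_utility(*candidates[s]))
print(best)  # -> s2
```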

The various approaches suggested for content replication focus on different parameters, such as web object size, server and network limits, coarse/fine-grain replication, static/dynamic replication and cooperation between surrogate servers. Selecting a proper mix of these parameters for content replication has a profound effect on web access performance through a CDN.

C. Content Consistency Management:

The contents replicated on surrogate servers must be consistent with the origin server: end-users must get the updated contents, otherwise no content replication scheme will be effective. The approaches used for this purpose are as follows [11]:

Cooperative push-based: Initially, contents are pushed from the origin server to the surrogate servers. The surrogate servers cooperate to minimize the replication and update cost, and the CDN maintains a mapping between content and surrogate servers. A web request is routed to the closest surrogate server; if that surrogate server does not have the requested web content, the request is directed to the origin server.

Uncooperative pull-based: A web request is routed to the closest surrogate server. If the surrogate server does not have the requested web contents, the contents are pulled from the origin server to satisfy the request.

Cooperative pull-based: In this approach, surrogate servers cooperate with one another. If the contents are not available on a surrogate server, it pulls them from a nearby surrogate server that has the requested contents.
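As a concrete illustration of the cooperative pull-based approach just described, the following minimal Python sketch checks the local cache, then cooperating neighbor surrogates, and only then the origin; the function and data-structure names are hypothetical, with caches modelled as plain dicts.

```python
def serve(url, local_cache, neighbors, origin):
    """Cooperative pull-based delivery sketch: try the local cache, then
    nearby cooperating surrogates, and fall back to the origin server
    only as a last resort."""
    if url in local_cache:                    # hit on this surrogate
        return local_cache[url]
    for peer_cache in neighbors:              # pull from a nearby surrogate
        if url in peer_cache:
            local_cache[url] = peer_cache[url]
            return local_cache[url]
    local_cache[url] = origin[url]            # final fallback: origin pull
    return local_cache[url]

origin = {"/a.html": b"<html>...</html>"}
surrogate1, surrogate2 = {}, {"/a.html": b"<html>...</html>"}
print(serve("/a.html", surrogate1, [surrogate2], origin))  # pulled from peer
```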

 


V. REQUEST ROUTING SYSTEM

A request routing system redirects the client's request to a suitable surrogate server. The redirection decision is complementary to the decision about the placement of surrogate servers: the policies used for placing the surrogate servers are implemented during request redirection. The request routing system consists of redirection algorithms and redirection mechanisms. The redirection algorithm decides which surrogate server will satisfy the client's request, whereas the redirection mechanism informs the client about the selection [35].

A. Request Routing Algorithms:

Request redirection approaches [35] can be classified as client-side, server-side, or somewhere in the network between the two, based on the point at which the redirection decision is taken. In a client-side redirection approach, the decision is taken by a client-side proxy, whereas in a server-side redirection approach the request is first sent to the origin server, which then redirects it to a suitable surrogate server.

One of the important factors in the efficient utilization of surrogate servers is redirecting the client's request to the best server based on some optimality criterion. The simplest selection approach is the random approach, which has low computational complexity; however, it is not efficient, as it takes neither server load nor client latency into consideration. The round-robin approach tries to balance the load on servers but still does not take proximity to the client into account. In [42], a light-weight server push approach is combined with a client-probe approach to estimate the expected response time; the goal is to allocate a server that minimizes the client's response time. However, this approach requires the installation of proxies to act as probing agents.

As the contents are replicated on multiple surrogate servers, a Transmission Control Protocol (TCP) parallel access scheme [32] has been suggested to retrieve a file from multiple servers at the same time instead of using complex algorithms to select a single suitable surrogate server. In this scheme, the transfer time is improved by providing the addresses of the best servers to the clients. In [29], replicated servers are ranked based on a prediction of content transfer time, considering the load on the servers as well as the characteristics of the path between the client and the servers. A client can then download from the best server, or use a subset of the top-ranked servers in parallel.

The performance of a redirection scheme directly or indirectly depends on the replication scheme; hence replication and redirection approaches have been considered jointly. The Co-Operative Cost Optimization Algorithm (COCOA) is a placement and retrieval algorithm [16] based on a hybrid architecture. It is as scalable as the popularity-local algorithm and provides performance similar to the greedy-global replica placement algorithm (RPA). The Randomized and Next Neighbor (RNN) algorithm [7] is also scalable with an increasing number of surrogate servers and request load. Each server maintains information about its own load and its next-neighbor's load; it randomly selects a server and makes a choice between that server and its next-neighbor depending on their loads.
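The selection step of RNN as just described lends itself to a few lines of Python; this is a sketch under the assumption that loads are known, and all names and data structures are illustrative.

```python
import random

def rnn_select(servers, load, next_neighbor):
    """Randomized and Next Neighbor (RNN) selection sketch [7]: pick a
    server uniformly at random, then choose between it and its
    next-neighbor based on their current loads."""
    s = random.choice(servers)
    n = next_neighbor[s]
    return s if load[s] <= load[n] else n

servers = ["a", "b", "c"]
load = {"a": 12, "b": 3, "c": 7}
next_neighbor = {"a": "b", "b": "c", "c": "a"}  # ring of next-neighbors
print(rnn_select(servers, load, next_neighbor))
```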

Typically there is a trade-off between load-aware and locality-aware approaches: locality-aware approaches usually lack load balancing, and vice versa. The Load-Aware Network Coordinates (LANCs) approach [30] tries to strike a balance between the two. Popular contents are replicated among nearby content servers with low load, so as to balance requests between content servers while assigning clients to nearby servers.

An efficient redirection approach may be one that considers multiple factors at a time, such as proximity to the client, availability of the contents on the surrogate server, load on the surrogate server, and network conditions. FuzzyCDN, an adaptive redirection algorithm based on fuzzy logic for choosing the best replica server, has been proposed in [38]; it takes queue size, service time and response time metrics into consideration simultaneously. On the other hand, the Hybrid Network Heuristic (HNH) [39] is a hybrid method that solves the Replica Placement and Request Distribution Problem (RPRDP) jointly. It simultaneously considers exact and heuristic constraints such as server disk space and bandwidth, QoS requirements of requests, and changes in network conditions.

Various request redirection algorithms can be categorized as adaptive or non-adaptive. An adaptive algorithm selects a server for satisfying the request on the basis of the current state of the surrogate servers. In a non-adaptive algorithm, the decision is based on some heuristics, without the overhead of tracking the current state of the surrogate servers [35 & 2].

B. Request Routing Mechanisms:

According to the decision taken by the request routing algorithm, a request routing mechanism is used to redirect the client's request to the selected surrogate server. Various request routing mechanisms [9] have been proposed, such as:

Client Multiplexing: The client gets the addresses of suitable surrogate servers and selects one of them to which to send the request.

HTTP Redirection: The request is sent to the origin server, which in turn redirects it to a new Uniform Resource Locator (URL). The end-user gets the response from this new URL.

DNS Indirection: The request is sent to the Domain Name System (DNS) server, which in turn returns the Internet Protocol (IP) address of one of a set of surrogate servers. This technique is transparent to the client.

Anycasting: An IP anycast address is used to represent a set of servers providing the same service. The client requests the contents using this anycast address, and anycast-aware routers direct the request to one of the servers identified by the anycast address.

Peer-to-Peer Routing: In this mechanism, a peer node in an ad-hoc network has information about the contents available on some other peer nodes in the network. On this basis, the request is redirected to one of the candidate peer nodes.

VI. FLASH CROWD MITIGATION

A flash crowd refers to a situation in which a very large number of users simultaneously access a popular website, typically to get information about some popular event. Such events may be known in advance, such as sports events like the Olympics or the World Cup; however, a flash crowd may also occur without any advance warning, as with the September 2001 terrorist attacks [15]. In such circumstances, the traffic pattern to the website differs from the usual one and often arrives in bursts. Though a flash crowd usually lasts only a short time, for that period it may make the CDN inefficient: the CDN may not be able to handle the sudden increase in web requests, and its performance may deteriorate considerably. End-users may be unable to access the website, or access may take too long due to network congestion. The strategies used to direct web requests to surrogate servers under normal workload may not be fit to handle a flash crowd; therefore, a CDN requires specific strategies to handle such situations and satisfy the web requests.

Denial-of-Service (DoS) attacks create a situation similar to flash crowds; however, the two differ. Flash crowds consist of legitimate requests, whereas DoS attacks are malicious requests intended to degrade the normal functioning of the website. Identifying DoS attacks helps websites discard malicious traffic and efficiently handle the legitimate requests.

The main challenges in addressing flash crowd events are:

Monitoring CDN to find that a flash crowd situation has occurred

Adjusting the policies and resources to handle the situation

Different strategies to handle flash crowd events are categorized, based on network architecture, as server-layer, intermediate-layer and client-layer [31]. At the server layer, over-provisioning is used to handle flash crowd events in static CDNs. However, large-scale over-provisioning in a CDN is a costly and inefficient approach due to the unpredictable nature of flash crowds.

 


At the intermediate layer, caching techniques are used to curb the server load during flash crowds. At the client layer, clients also play the role of servers to reduce the load on the servers: a client's request is redirected to another client that has recently downloaded the contents. Different client-layer mechanisms, such as CoopNet and PROOFS, have been proposed to handle flash crowds.

The decision regarding redirection of a request to a suitable surrogate server is critical in handling a flash crowd. Redirection algorithms that dynamically adjust the number of replica servers [24] for a given object make the system better able to support the load without affecting user-perceived response latency. These algorithms do not require perfect information about server load, work well under a wide range of loads, and are robust during flash crowds and Distributed Denial-of-Service (DDoS) attacks. An adaptive CDN architecture based on dynamic delegation of requests to handle flash crowds is proposed in [15]. In this approach, surrogate servers are organized into groups. Within a group, one server is the primary for a given website, and the rest of the servers are primaries for other websites. Clients' requests are redirected only to the primary servers in the groups. During a flash event, when the load on a primary server increases, requests are distributed among the other servers in the group, called delegates; when the flash event ends, the delegate servers are released. In [1], a prototype is proposed for dynamic allocation of existing resources to effectively handle flash crowds with different characteristics without over-provisioning. According to this prototype, handling very sharp growth in load requires a dynamic allocation scheme that is either extremely responsive or employs low-overhead mechanisms, whereas gradually increasing flash crowds can equally be handled with larger overheads and slower reaction times.
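A minimal sketch of the delegation decision described for [15], assuming per-server load counters and a fixed threshold; the data structures and the least-loaded tie-breaking rule are assumptions, not details from the paper.

```python
def route(request_site, groups, primary, load, threshold):
    """Dynamic-delegation sketch: requests normally go to the site's
    primary server; when the primary's load crosses a threshold, they
    spill over to the least-loaded delegate in the same group."""
    p = primary[request_site]                 # designated primary for this site
    if load[p] < threshold:
        return p
    delegates = [s for s in groups[p] if s != p]
    delegate = min(delegates, key=lambda s: load[s])
    return delegate if load[delegate] < load[p] else p

groups = {"srv1": ["srv1", "srv2", "srv3"]}
primary = {"example.com": "srv1"}
load = {"srv1": 95, "srv2": 20, "srv3": 60}
print(route("example.com", groups, primary, load, threshold=80))  # -> srv2
```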

Caching and replication techniques are mainly implemented in proxy servers and CDNs, respectively. In [22], surrogate servers are used both as replicators and as proxy caches to make the CDN system robust during a flash crowd event. In this approach, the storage capacity at surrogate servers is partitioned into a static cache and a proxy cache: the static cache replicates contents statically, while the proxy cache runs a caching policy that replicates contents dynamically. With this integration, the CDN can benefit from the dynamic nature of caching while using replication for availability, reliability and bounded update propagation cost. CDNs have also been combined with peer-to-peer (P2P) networks to handle flash events. The Peer Assisted Peer-Allocation (PAPA) algorithm [8] is based on collaboration between servers and peers: proactive strategies are used to handle sudden workload on the system, with servers pre-allocating contents to idle peers based on users' preferences for similar files published earlier. PAPA is not applicable to live streaming, where the contents are not available before release. An adaptive CDN, the Flash Crowd Alleviation Network (FCAN), is proposed in [6]; it dynamically changes its network structure between client-server and cache-proxy P2P modes based on load fluctuation to deal with flash crowds.

The existing approaches show that a flash crowd can be handled effectively if its occurrence can be predicted in advance. However, CDNs still need specific strategies to handle flash crowd situations.

VII. MEDIA CONTENTS STREAMING

The nature of web content is changing from text and images to multimedia. With the rise of online shopping, social networking, gaming, education, music listening and movie watching, there has been an explosion of multimedia applications. Further, with the advancement of broadband technology and cheaper Internet access, users prefer streaming to downloading. Streaming can be live or on demand; playback begins alongside downloading, so users do not have to wait for the complete download. Media streaming now dominates Internet traffic. However, it requires consistent performance so that end-users can listen and view without jitter. Efficient media streaming requires coherent decisions regarding content selection and placement, request routing, and delivery of contents to end-users. A main issue is to design a content distribution system that is scalable, places replica servers close to clients, and minimizes the total server and network delivery cost [20].

Media streaming can be unicast or multicast based. In unicast streaming, shortest-path routing is used to minimize network bandwidth and total server bandwidth usage; it does not take into account the number of replicas and their placement. In multicast streaming, contents are delivered to multiple interested receivers simultaneously. However, the issue with multicasting is that minimizing distance reduces the number of clients that can be served simultaneously, and vice versa. In [20], a greedy min-cost tsp (tree of shortest paths) heuristic for placement, together with shortest-path routing and a greedy ordered min-cost heuristic for routing client requests and multicast streams, has been suggested to produce the best near-optimal solution. In the Multiple Description-Content Delivery Network (MD-CDN) architecture [18], Multiple Description Coding (MDC) is combined with path diversity to achieve reduced client response times, server load balancing, scalability and robustness in a streaming media CDN.

Multimedia files are usually large. Considering their characteristics, deploying as many replicas as possible is not always a good strategy: it increases the cost of replicating the contents and may also degrade client performance. Therefore, considering the cost of distributing contents, an optimal number of servers should be selected from the potential servers for replication [28]. Further, instead of placing an entire file on n nodes, the file can be subdivided into n subfiles of equal size, with each subfile placed on a different node. The FastReplica algorithm [26] is based on this approach for scalable and reliable replication of large files, speeding up the overall download time. For media files encoded with MDC, the different descriptions can be treated as subfiles.
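The core subdivision step behind FastReplica-style distribution can be sketched in a few lines; only the splitting into n near-equal subfiles is shown here, and the actual transfer and exchange phases of [26] are omitted.

```python
def split_file(data, n):
    """Split a file into n near-equal subfiles so each can be pushed to
    a different node, after which nodes exchange the missing pieces."""
    size = -(-len(data) // n)                 # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(n)]

chunks = split_file(b"x" * 1000, 4)
assignment = dict(zip(["node1", "node2", "node3", "node4"], chunks))
print({node: len(c) for node, c in assignment.items()})  # 250 bytes each
```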

To meet user expectations and improve Hyper Text Transfer Protocol (HTTP) media streaming quality, a bit-rate adaptation metric is used in the Dynamic Adaptive Streaming over HTTP (DASH) architecture [5]. The metric detects network congestion and spare network capacity. Step-wise switch-up and multi-step switch-down strategies adapt the bit rate to match the end-to-end network capacity when fetching serial and parallel media segments.
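A sketch of the switching logic just described: step up one level when measured throughput comfortably exceeds the next higher bit rate, and drop multiple levels at once under congestion. The bit-rate ladder and the 1.2 safety margin are assumptions, not values from [5].

```python
def adapt_bitrate(current_idx, throughput, bitrates):
    """Step-wise up / multi-step down bit-rate adaptation sketch.
    bitrates is an ascending ladder of encodings (kbps)."""
    if (current_idx + 1 < len(bitrates)
            and throughput > 1.2 * bitrates[current_idx + 1]):
        return current_idx + 1                        # step-wise switch up
    if throughput < bitrates[current_idx]:            # congestion detected
        sustainable = [i for i, b in enumerate(bitrates) if b <= throughput]
        return sustainable[-1] if sustainable else 0  # multi-step switch down
    return current_idx

bitrates = [500, 1000, 2000, 4000]                    # kbps levels
print(adapt_bitrate(2, 900, bitrates))  # throughput collapsed -> index 0 (500 kbps)
```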

In general, users watch the first part of a video and then switch to another one. The performance of a CDN can therefore be improved, at reduced extra cost, by caching only the first part of each video. In [36], a Solid State Drive (SSD) caching scheme is used to store the first part of videos, increasing the accessibility of popular contents for a greater number of simultaneous users, since SSD data rates and access times are several times faster than those of a Hard Disk Drive (HDD).

Flash crowds can also occur in media streaming. FCAN, the adaptive CDN for static content delivery [6], has been extended to handle flash crowds during live and on-demand streaming by adding dynamic resizing and quality restriction to enhance the resilience of the system [41].

Hybrid CDN-P2P networks (HCDN) have been used for media streaming. The Hybrid Replica Placement Mechanism (HRPM) [27] is a dynamic, economical mechanism for optimizing the number and placement of replica servers in an HCDN. Its streaming content delivery services are based on a recursive, hierarchical, push-based cooperative replica placement strategy. End-users can receive the service in either client-server (CS) or P2P mode; the streams are replicated with different qualities for CS and P2P end-users, and content providers are charged less for delivery in P2P mode. It therefore takes into account not only the content delivery costs and popularity of contents but also the revenue of the HCDN. The TrustStream architecture [14] combines the best features of scalable coding, CDNs and P2P networks to achieve security, scalability, heterogeneity and certain QoS guarantees simultaneously. In TrustStream, video is encoded in two layers, base and enhanced: the base layer is delivered through a CDN-featured single-source multi-receiver P2P network to guarantee a minimum level of quality, while the enhanced layer is delivered through a pure multi-source multi-receiver P2P network to achieve maximum scalability and bandwidth utilization.

Streaming media files are usually large objects with specific characteristics, such as unequal access to different parts of the file, the need for consistent delivery, and higher bandwidth requirements. A distribution system for media streaming should therefore be designed around these characteristics to give end-users a better quality of experience.

 


VIII. CONCLUSION AND FUTURE WORK

The Content Delivery Network approach optimizes the delivery of contents by replicating them on surrogate servers placed at the edge of the Internet, helping to reduce bandwidth consumption and improve user-perceived latency. In this paper, the basic strategies used in content delivery networks have been analyzed, providing a thorough insight into content delivery networks. The analysis shows that the design and implementation of a scalable, reliable and efficient content delivery network entails focusing on a number of technical aspects: what contents should be replicated and where, which server a request should be redirected to, how to handle the load during flash crowd events, and what strategy should be used to deliver streaming media contents. A lot of research work has been done on these fronts; however, there is still scope for improving these strategies to further enhance web performance. As future work, we intend to propose a new framework for efficient distribution of contents over surrogate servers.

References

1. Abhishek Chandra, Prashant Shenoy, "Effectiveness of Dynamic Resource Allocation for Handling Internet Flash Crowds", TR03-37, Department of Computer Science, University of Massachusetts, USA, Nov. 2003.

2. Al-Mukaddim Khan Pathan, Rajkumar Buyya, “A Taxonomy and Survey of Content Delivery Networks”, Technical Report, GRIDS-TR-2007-4, Grid Computing and Distributed Systems Laboratory, The University of Melbourne, Australia, 2007.

3. Atul Garg, Anil Kapil, “Potable Extended Cache Memory to Reduce Web Traffic”, International Journal of Engineering Science and Technology, Vol. 2(9), pp. 4744-4750, 2010.

4. Bo Li, Mordecai J. Golin, Giuseppe F. Italiano, Xin Deng and Kazem Sohraby, “On the Optimal Placement of Web Proxies in the Internet,” In Proceedings of IEEE INFOCOM’99, pp. 1282-1290, 1999.

5. Chenghao Liu, Imed Bouazizi, Miska M. Hannuksela, Moncef Gabbouj, “Rate Adaptation for Dynamic Adaptive Streaming over HTTP in Content Distribution Network”, Signal Processing: Image Communication, Elsevier, 27(4), 288–311, 2012.

6. Chenyu Pan, Merdan Atajanov, Mohd. Belayet Hossain, Toshihiko Shimokawa, Norihiko Yoshida, “FCAN: Flash Crowds Alleviation Network using Adaptive P2P Overlay of Cache Proxies”, IEICE Transactions on Communications, Vol. 89, No. 4, pp. 1119–1126, 2006.

7. Chung-Min Chen, Yibei Ling, Marcus Pang, Wai Chen, Shengwei Cai, Yoshihisa Suwa, Onur Altintas, "Scalable Request Routing with Next-Neighbor Load Sharing in Multi-server Environments", 19th International Conference on Advanced Information Networking and Applications, AINA’05, IEEE, Vol. 1, pp. 441-446, 2005.

8. Dan Huang, Min Zhang, Yi Zheng, Changjia Chen, Yan Huang, "Pre-allocation based Flash Crowd Mitigation Algorithm for Large-scale Content Delivery System", Peer-to-Peer Networking and Applications, pp. 1-8, 2014.

9. Gang Peng, “CDN: Content Distribution Network”, CoRR, arXiv:cs/0411069v1, Technical Report TR-125 of Experimental Computer Systems Lab, Stony Brook University, NY, 2004.

10. George Pallis, “Improving Content Delivery by Exploiting the Utility of CDN Servers”, In Proc. of the 5th Int. Conf. on Data Management in Cloud, Grid and P2P Systems (Globe). LNCS, Springer, 88–99, 2012.

11. George Pallis, Athena Vakali, “Insight and Perspectives for Content Delivery Networks”, Communications of the ACM - Personal information management, Volume 49 Issue 1, pp. 101-106, 2006.

12. George Pallis, Athena Vakali, Konstantinos Stamos, Antonis Sidiropoulos, Dimitrios Katsaros, Yannis Manolopoulos, “A Latency-Based Object Placement Approach in Content Distribution Networks,” In Proceedings of the 3rd Latin American Web Congress (La-Web 2005), IEEE Press, Buenos Aires, Argentina, pp. 140-147, October 2005.

13. George Pallis, Konstantinos Stamos, Athena Vakali, Dimitrios Katsaros, Antonis Sidiropoulos, Yannis Manolopoulos, “Replication Based on Objects Load under a Content Distribution Network”, 22nd International Conference on Data Engineering Workshops (ICDEW’06), IEEE, p. 53, 2006.

14. Hao Yin, Chuang Lin, Qian Zhang, Zhijia Chen, Dapeng Wu, "TrustStream: A Secure and Scalable Architecture for Large-scale Internet Media Streaming", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 18, No. 12, pp. 1692-1702, 2008.

15. Jaeyeon Jung, Balachander Krishnamurthy, Michael Rabinovich, “Flash Crowds and Denial of Service Attacks: Characterization and Implications for CDNs and Web Sites”, Proceedings of the 11th International Conference on World Wide Web, ACM, pp. 293-304, 2002.

16. Jan Coppens, Tim Wauters, Filip De Turck, Bart Dhoedt, Piet Demeester, "Evaluation of Replica Placement and Retrieval Algorithms in Self-organizing CDNs", Proceeding of the IFIP/IEEE International Workshop on Self-Managed Systems & Services (SelfMan'05), 2005.

17. Jing Sun, Suixiang Gao, Wenguo Yang, Zhipeng Jiang, “Heuristic Replica Placement Algorithms in Content Distribution Networks”, Journal of Networks. Vol. 6, No. 3, pp. 416-423, March 2011.

18. John Apostolopoulos, Tina Wong, Wai-tian Tan, Susie Wee, “On Multiple Description Streaming with Content Delivery Networks”, IEEE INFOCOM, Twenty-First Annual Joint Conference of the IEEE Computer and Communications Societies, Vol. 3, pp. 1736-1745, 2002.

 


19. John Dilley, Bruce Maggs, Jay Parikh, Harald Prokop, Ramesh Sitaraman, Bill Weihl, “Globally Distributed Content Delivery”, Internet Computing, IEEE, Vol. 6, No. 5, 50–58, 2002.

20. Jussara M. Almeida, Derek L. Eager, Mary K. Vernon, Stephen J. Wright, “Minimizing Delivery Cost in Scalable Streaming Content Distribution Systems”, IEEE Transactions on Multimedia, 6(2), 356–365, 2004.

21. Jussi Kangasharju, James Roberts, Keith W. Ross, “Object Replication Strategies in Content Distribution Networks”, Computer Communications, Elsevier, Vol. 25, No. 4, pp. 376-383, 2002.

22. Konstantinos Stamos, George Pallis, Athena Vakali, "Integrating Caching Techniques on a Content Distribution Network", In Advances in Databases and Information Systems, Springer, pp. 200-215, 2006.

23. Lili Qiu, Venkata N. Padmanabhan, Geoffrey M. Voelker, “On the Placement of Web Server Replicas”, Proc. IEEE INFOCOM’01, Vol. 3, 1587-1596, 2001.

24. Limin Wang, Vivek Pai, Larry Peterson, “The Effectiveness of Request Redirection on CDN Robustness”, ACM SIGOPS Operating Systems Review, 36, SI, 345-360, 2002.

25. Wujuan Lin, Bharadwaj Veeravalli, “Design and Analysis of an Adaptive Object Replication Algorithm in Distributed Network Systems”, Computer Communications, Elsevier, 31(10), pp. 2005-2015, 2008.

26. Ludmila Cherkasova, Jangwon Lee, “FastReplica: Efficient Large File Distribution within Content Delivery Networks”, In Proceedings of the 4th USENIX Symposium on Internet Technologies and Systems, 2003.

27. Mehran Garmehi, Morteza Analoui, Mukaddim Pathan, Rajkumar Buyya, “An Economic Replica Placement Mechanism for Streaming Content Distribution in Hybrid CDN-P2P Networks”, Computer Communications, Elsevier, Vol. 52, pp. 60–70, 2014.

28. Mengkun Yang, Zongming Fei, “A Model for Replica Placement in Content Distribution Networks for Multimedia Applications”, Proceedings of IEEE international conference on communications (ICC’03), vol. 1, pp. 557–561, 2003.

29. Mohammad Malli, Chadi Barakat, Walid Dabbous, "An Efficient Approach for Content Delivery in Overlay Networks", Consumer Communications and Networking Conference, IEEE, pp. 128-133, 2005.

30. Nicholas Ball, Peter Pietzuch, “Distributed Content Delivery using Load-Aware Network Coordinates”, Proceeding of ACM CoNEXT Conference, 2008.

31. Norihiko Yoshida, “Dynamic CDN against Flash Crowds”, in Content Delivery Networks, Springer, pp. 275-296, 2008.

32. Pablo Rodriguez, Ernst W. Biersack, “Dynamic Parallel-Access to Replicated Content in the Internet”, IEEE/ACM Transactions on Networking, 10(4), 455-465, 2002.

33. Pavlin Radoslavov, Ramesh Govindan, Deborah Estrin, “Topology-Informed Internet Replica Placement”, Computer Communications, Elsevier, 25(4), 384–392, 2002.

34. Radhika Malpani, Jacob Lorch, David Berger, "Making World Wide Web Caching Servers Cooperate", In Proceedings of the Fourth International World Wide Web Conference, pp. 107-117, 1995.

35. Swaminathan Sivasubramanian, Michal Szymaniak, Guillaume Pierre, Maarten Van Steen, “Replication for Web Hosting Systems,” ACM Computing Surveys (CSUR), Vol. 36, No. 3, pp. 291–334, 2004.

36. Taekook Kim, Eui-Jik Kim, “Hybrid Storage-based Caching Strategy for Content Delivery Network Services”, Multimedia Tools and Applications, Springer, 1-13, 2014.

37. Thanasis Loukopoulos, Ishfaq Ahmad, “Static and Adaptive Distributed Data Replication using Genetic Algorithms,” Journal of Parallel and Distributed Computing, Elsevier, 64(11), pp. 1270-1285, 2004.

38. Thiago Queiroz de Oliveira, Marcial P. Fernandez, “Fuzzy Redirection Algorithm for Content Delivery Network (CDN)”, ICN 2013, The Twelfth International Conference on Networks, pp.137-143, 2013.

39. Tiago Neves, Luiz Satoru Ochi, Célio Albuquerque, “A New Hybrid Heuristic for Replica Placement and Request Distribution in Content Distribution Networks”, Optimization Letters, Springer, pp. 1-16, 2014.

40. Yan Chen, Randy H. Katz, John D. Kubiatowicz, “SCAN: A Dynamic, Scalable and Efficient Content Distribution Network”, In Proceedings of First International Conference on Pervasive Computing 2002, LNCS 2414, Springer, pp. 282-296, 2002.

41. Yuta Miyauchi, Noriko Matsumoto, Norihiko Yoshida, Yuko Kamiya, Toshihiko Shimokawa, “Adaptive Content Distribution Network for Live and On-Demand Streaming”, ARCS Workshops, pp. 27-37, 2012.

42. Zongming Fei, Samrat Bhattacharjee, Ellen W. Zegura, Mostafa H. Ammar, “A Novel Server Selection Technique for Improving the Response Time of a Replicated Service”, in Proceedings of IEEE INFOCOM’98, Vol. 2, 1998.

 


AUTHOR(S) PROFILE

Meenakshi Gupta received the degree of Master of Computer Applications from IGNOU, New Delhi, and M.Phil. (Computer Science) from Periyar University, in 2005 and 2008 respectively. Currently, she is working as an Assistant Professor at Maharaja Agrasen Institute of Management and Technology, Jagadhri, Haryana. She has 8 years of teaching experience and has several national and international publications to her credit. Her areas of interest are fuzzy-logic-based systems and web optimization.

Atul Garg received the degree of Master of Computer Applications from Kurukshetra University, Kurukshetra, in 2004 and completed his Ph.D. degree at Maharishi Markandeshwar University, Mullana (Ambala), in 2013. Currently, he is working as an Associate Professor at M.M.I.C.T.&B.M., Maharishi Markandeshwar University, Mullana (Ambala), Haryana. He is a Senior Member of the Universal Association of Computer & Electronics Engineers (UACEE), Australia, a member of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering (ICST), Belgium, and a member of the International Association of Engineers, Hong Kong. His areas of interest are the web, query optimization and mobile ad hoc networks.



 

International Journal of Network Security, Vol.18, No.6, PP.1159-1172, Nov. 2016

 

Anomalies Classification Approach for

Network-based Intrusion Detection System

Qais Saif Qassim, Abdullah Mohd Zin, and Mohd Juzaiddin Ab Aziz

(Corresponding author: Qais Saif Qassim)

Research Center for Software Technology and Management, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, 43600 Bangi, Selangor Darul Ehsan, Malaysia

(Email: qaisjanabi@gmail.com)

(Received Sep. 1, 2015; revised and accepted Dec. 7, 2015 & Jan. 15, 2016)

 

Abstract

An anomaly-based intrusion detection system (A-IDS) is considered a better option than a signature-based system since it does not require prior knowledge of attack signatures before it can be used to detect an intrusion. However, managing the alarms generated by such a system is more difficult than for signature-based intrusion detection systems (S-IDSs). This is due to the fact that an S-IDS generates rich information along with the reported alarms, whereas an A-IDS may just identify the connection stream detected as malicious. An A-IDS raises an alarm every time it detects an activity that deviates from the baseline model of normal behaviour; therefore, the cause of the anomaly itself is unknown to the intrusion detection system. This creates a substantial challenge in managing IDS alarms and distinguishing false positives from true alarms. Determining the class of an attack detected by an anomaly-based detection system is therefore a significant task. This paper serves two purposes: firstly, it presents a set of network traffic features deemed to be the most relevant for identifying a wide range of network anomalies; secondly, it presents an A-IDS alarm classifier based on machine learning technologies that automatically classifies activities detected by a packet header-based anomaly detection system. Evaluation experiments showed that machine learning algorithms are capable of classifying malicious activities effectively and efficiently.

Keywords: Alarm classification, anomaly-based, feature selection, machine learning

1 Introduction

An anomaly-based detection system is designed to uncover abnormal patterns of behavior, in which anything that widely deviates from normal usage patterns is considered an intrusion [4]. It is considered a better option than a signature-based system since it does not require prior knowledge of an attack signature before it can be used to detect an intrusion. However, identifying the class of an attack poses a significant problem in anomaly-based IDS. In signature-based IDS, this process is trivial, since each signature is the result of an analysis of the corresponding attack conducted by security experts, in which the attack class is manually assigned during the signature development process [9, 12]. Unlike a signature-based IDS, an anomaly-based detection system cannot associate the detected activity with an attack class. In fact, one of the major weaknesses of anomaly-based intrusion detection systems is that they cannot classify the detected activity to determine its severity level and consequences [10].

By classifying an attack, it is possible to set default actions for handling a certain alarm. Moreover, in order to estimate the risk of unknown attacks, a solution for automating the classification of anomaly-based alarms is required. However, no effective and efficient automatic or semi-automatic approach is currently available that can classify anomaly-based alarms at runtime [15, 31]. Thus, every anomaly-based alarm must be manually processed to identify its class; this increases the workload of security analysts, the time required, and the dependence on security analysts. Another limitation of manual alarm processing is that the complexity and dynamically changing traffic statistics may introduce the possibility of human error. This paper presents the Network Anomalies Classifier (NAC), which uses machine learning technologies to automatically classify activities detected by a packet header-based anomaly detection system.

The rest of this paper is organized as follows: Section 2 presents an overview of the current state of attack identification and classification, addressing the feature sets that have been monitored; Section 3 presents the attack scenarios, providing the common network traffic features to be monitored to identify different attack classes; Section 4 describes the research methodology; Section 5 discusses the evaluation of the proposed system; and Section 6 presents the conclusions and future works.

2 Related Works

IDS alarm classification has been an active research area for the past few years; recent research has focused on managing the generated alarms to distinguish real threats from false alarms and to classify the alarms into distinct classes. Several methods have been proposed to analyse the reported alarms based on different classification algorithms and network traffic features [1]. This section presents some of the recently proposed methods.

Entropy-based analysis [21] has been employed to analyze signature-based IDS alarms (specifically, Snort alarms) and detect real network attacks. The proposed method uses Shannon entropy to examine the distributions of five statistical features of Snort alarms, as illustrated in Table 1: the number of alarms generated from each distinct source IP address, the number of alarms sent to a destination host, source and destination threat severity grades, and datagram length. An adaptive false alarm filter [23] has been utilized to filter out false alarms using the best machine learning algorithm based on distinct network features. The authors intended to reduce the false alarms generated by a signature-based IDS (Snort) in real time, and selected 8 network features to represent the generated alarms: Snort's description of an attack, attack classification, priority of an attack, packet type, source IP address, source port number, destination IP address and destination port number. They used the DARPA dataset to evaluate six different machine learning algorithms (k-nearest neighbor, support vector machines, naive Bayes, neural networks and decision trees) on the Weka platform, and then designed an adaptive false alarm filter to select the best single-performance algorithm for filtering out false alarms.
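Since the method in [21] examines the Shannon entropy of alarm feature distributions, a small sketch may help; the alarm feature shown (source IPs) and the example values are illustrative, not from the paper.

```python
from collections import Counter
from math import log2

def shannon_entropy(values):
    """Shannon entropy H = -sum(p * log2(p)) over the empirical
    distribution of a feature. A sudden change in the entropy of,
    e.g., the source IPs of alarms can indicate a real attack."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Source IPs of reported alarms: one noisy host vs. many distinct hosts.
concentrated = ["10.0.0.5"] * 9 + ["10.0.0.9"]
dispersed = [f"10.0.0.{i}" for i in range(10)]
print(shannon_entropy(concentrated))  # low entropy (~0.47 bits)
print(shannon_entropy(dispersed))     # high entropy (~3.32 bits)
```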

A semi-supervised learning mechanism was introduced by Chiu [3] to build an alarm filter for a signature-based intrusion detection system. The authors selected eight network features: the connection's start time, the connection's duration, the local and remote IP addresses that participated in the connection, the connection's service, the local and remote ports used by the connection, the number of bytes sent and received, and the state of the connection. In [27], the author used the Lincoln Laboratory dataset to find suitable subsets of features for network attack detection. The feature subsets were formed using prior knowledge from previous IDS research and, in addition, from analysing network attacks and their effect on the traffic flows; the selected features are illustrated in Table 1. The author showed that attacks of similar type have a similar effect on the network traffic, and thus subsets of features were formed for each attack type.

Flow-based analysis has been considered by Knuuti [16]. The author compared the usability and performance of three different intrusion detection systems based on the identified network traffic flow features. The evaluated systems were Snort, Bro-IDS and TRCNetAD; Snort and Bro-IDS are signature-based intrusion detection systems, while the latter is an anomaly-based IDS. The feature set the author used is illustrated in Table 1; the features are statistical representations of the network traffic flow. The study conducted two one-week-long traffic capturing periods to collect data for the evaluation. Using the selected features, Snort detected over 1.5 million intrusions during the one-week traffic capturing period, including buffer overflow attacks, Trojans, denial of service, VoIP attacks, heap overflow attacks, DNS spoofing attacks and spyware. Bro-IDS detected approximately eight thousand intrusions, which were address and port scans. TRCNetAD detected 150 thousand anomalies during the same time period.

A rule adaptation approach to managing IDS alarms has been considered by Lin [20]. The study proposed a Weighted Score-based Rule Adaptation (WSRA) mechanism, which has the ability to learn from experts' feedback. The features used in this work are illustrated in Table 1: the total number of source and destination IP addresses in a defined time window, source and destination port numbers, Snort's signature, attack class, and timestamp.

Monitoring deviations of network traffic feature distributions from a baseline model has also been considered in IDS alarm management approaches [5]. The study analysed events that affect the distribution of traffic features and marked them as anomalies. The proposed system monitored network-wide backbone traffic using the features listed in Table 1, tracking the changes in the four IP packet header features between traffic flows using different algorithms. However, the study did not evaluate the proposed method on real network traffic.

3 Feature Selection Based on Attack Scenarios

Feature selection is an important step in building intrusion detection and alarm classification modules. During the feature selection phase, a set of network traffic attributes or features deemed to be the most effective is extracted in order to construct a suitable classification module [29, 33]. A key challenge many researchers face is how to choose the optimal set of features [1, 28], as not all features are relevant to or have an impact on the classification performance; in many cases, irrelevant features can hurt classification accuracy and cause slow training and testing. By analysing known attacks and their influence on normal network traffic, it is possible to define which traffic features are relevant and should therefore be monitored. The idea behind this approach is to define the characteristics of a specific attack category. This is done by analysing the attack classification developed by the MITRE Corporation [24].

 


Table 1: Network traffic features used in prior studies

Study [21] (5 features): the number of alarms generated from each distinct source IP address, the number of alarms sent to a destination host, source and destination threats' severity grades, and datagram length.

Study [23] (8 features): description of the attack, Snort's classification, alarm priority, packet type, source IP address, source port number, destination IP address and destination port number.

Study [3] (8 features): the connection's start time, the connection's duration, the local and remote IP addresses that participated in the connection, the connection's service, the local and remote ports used by the connection, the number of bytes sent and received, and the state of the connection.

Study [27] (24 features): IP address, timestamp, number of receiving sequences, number of receiving sequences from different IPs, number of sending sequences, number of sending sequences to different IPs, amount of data received, amount of data sent, amount of packets received, amount of packets sent, number of different port numbers used over 1024, number of port numbers used over 1024, number of different port numbers used below or at 1024, number of port numbers used below or at 1024, number of UDP flows, number of TCP connections, number of ICMP packets, number of SMTP connections, number of FTP connections, number of HTTP connections, number of DNS connections, number of Telnet connections, and number of SSH connections.

Study [20] (5 features): total number of source and destination IP addresses in a defined time window, source port number, destination port number, Snort's signature, attack class, and timestamp.

Study [16] (17 features): IP address, timestamp, number of ICMP packets, number of UDP flows, number of TCP connections, amount of received data, amount of sent data, number of received packets, number of sent packets, number of different port numbers used over 1024, number of port numbers used over 1024, number of different port numbers used below 1024, number of port numbers used below 1024, number of receiving sequences from different IPs, number of receiving sequences, number of sending sequences to different IPs, and number of sending sequences.

Study [5] (4 features): source IP address, destination IP address, source port number and destination port number.

 

Researchers at MITRE Corp. have developed an attack taxonomy for the United States Department of Homeland Security [7]; the main goal of this taxonomy is to create a list of patterns employed by attackers when compromising information systems, along with a comprehensive schema and classification taxonomy [34]. The project is entitled the Common Attack Pattern Enumeration and Classification (CAPEC). The classification in CAPEC is based on the attack mechanism used, which includes resource depletion, network reconnaissance, spoofing, exploitation of authentication, and exploitation of privileges.

 

3.1 Resource Depletion (DOS)

An attacker depletes a resource to the point that the target's functionality is affected. The result of a successful resource depletion attack is usually the denial of one or more services offered by the target [11, 19]. To deplete the target's resources, the attacker must interact with the target using a client or script capable of making repeated requests over a network. If the attacker has some privileges on the system, the required resource will likely be the ability to run a binary or upload a compiled exploit, or to write and execute a script or program that consumes resources. Most resource depletion attacks are detectable by monitoring the traffic flows and the amount of data sent by the source. Therefore, the features that should be monitored for resource depletion attacks are as follows [26, 27]:

1) Number of sequences received during the observation period;

2) Amount of bytes received during the observation period;

3) Total number of packets received;

4) Total number of sequences received during the observation period from different IPs;

5) Number of sequences sent during the observation period;

6) Amount of bytes sent during the observation period;

7) Total number of packets sent;

8) Total number of sequences sent during the observation period to different IPs;

9) Total number of different TCP and UDP port numbers used by the source;

10) Total number of different TCP and UDP port numbers used by the host;

11) Number of TCP requests for transmission;

12) Number of half-open connections;

13) Number of established connections, each representing an open connection;

14) Number of connection termination requests sent;

15) Number of confirming connection terminations received;

16) Total number of TCP connections during the observation period;

17) Total number of UDP flows during the observation period;

18) Total number of TCP connections initiated by the source;

19) Total number of UDP flows received;

20) Total number of TCP connections initiated by the host;

21) Total number of UDP flows sent.
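To illustrate how a few of the counters above could be computed, here is a sketch over one observation window, with packets modelled as plain dicts; the field names and the simplified SYN/ACK handshake tracking are assumptions, and a real feature extractor would decode live traffic instead.

```python
from collections import defaultdict

def depletion_counters(packets, window):
    """Compute a small subset of the resource-depletion features listed
    above for one observation window: packets received, bytes received
    and half-open TCP connections, keyed by destination host."""
    stats = defaultdict(lambda: {"pkts_in": 0, "bytes_in": 0, "half_open": 0})
    pending_syn = set()          # (src, dst, sport, dport) awaiting final ACK
    for p in (q for q in packets if q["time"] < window):
        host = stats[p["dst"]]
        host["pkts_in"] += 1
        host["bytes_in"] += p["size"]
        conn = (p["src"], p["dst"], p["sport"], p["dport"])
        if p.get("flags") == "SYN":                   # connection request seen
            pending_syn.add(conn)
        elif p.get("flags") == "ACK" and conn in pending_syn:
            pending_syn.discard(conn)                 # handshake completed
    for _, dst, _, _ in pending_syn:                  # never completed
        stats[dst]["half_open"] += 1
    return dict(stats)
```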

 

3.2 Network Reconnaissance (Probe)

An attacker engages in network reconnaissance operations to gather information about a target network or its hosts. Network reconnaissance techniques range from stealthy to noisy and utilize different tools and methods depending on the scope of the reconnaissance [24, 26]. Host discovery and port scanning are common examples, where the attacker tries to map out the IP addresses and operating systems in use, as well as what services the hosts are providing [14]. In general, in network reconnaissance operations the attacker tries to find all the possible means and methods that can be used to perform other attacks, such as denial of service or gaining unauthorised access to the inner network. Most network reconnaissance attacks are detectable by monitoring the traffic flows. Therefore, the features that should be monitored for such attacks are as follows [17, 23]:

1) Number of sequences received during the observation period;

2) Total number of sequences received during the observation period from different IPs;

3) Number of sequences sent during the observation period;

4) Total number of sequences sent during the observation period to different IPs;

5) Total number of different TCP and UDP port numbers used by the source;

6) Total number of different TCP and UDP port numbers used by the host;

7) Number of half-open connections;

8) Number of connection termination requests sent;

9) Number of confirming connection terminations received;

10) Total number of TCP connections during the observation period;

11) Total number of UDP flows during the observation period;

12) Total number of TCP connections initiated by the source;

13) Total number of UDP flows received;

14) Total number of TCP connections initiated by the host;

15) Total number of UDP flows sent.

 


3.3 Spoofing

An attacker interacts with the target in such a way as to convince the target that it is interacting with some other principal, so that the target takes actions based on the level of trust that exists between it and the other principal [30]. Many of the protocols in the TCP/IP suite do not provide mechanisms for authenticating the source or destination of a message. They are thus vulnerable to spoofing attacks when applications do not take extra precautions to verify the identity of the sending or receiving host. IP spoofing may be used to mount man-in-the-middle attacks against hosts on a computer network. Spoofing attacks which take advantage of TCP/IP suite protocols may be mitigated with deep packet inspection. The features that should be monitored for such attacks are as follows [25]:

1) Total number of sequences received during the observation period from different IPs;

2) Total number of sequences sent during the observation period to different IPs;

3) Number of privileged port numbers used during the observation period;

4) Number of different privileged port numbers used during the observation period;

5) Number of registered ports used during the observation period;

6) Number of different registered port numbers used;

7) Total number of different TCP and UDP port numbers used by the source;

8) Total number of different TCP and UDP port numbers used by the host;

9) Number of TCP requests for transmission;

10) Number of half-open connections;

11) Number of established connections, each representing an open connection;

12) Number of connection termination requests sent;

13) Number of confirming connection terminations received.

3.4 Exploitation of Authentication

An attacker actively targets exploitation of weaknesses, limitations and assumptions in the mechanisms a target utilizes to manage identity and authentication. Such exploitation can lead to the complete subversion of any trust the target system may have in the identity of any entity with which it interacts. Exploitation-of-authentication attacks are detectable from the payload data by looking for specific patterns. Some of the attacks are, however, also detectable from the network traffic by looking for malformed packets that are oversized, fragmented or that use, for example, abnormal TCP flag options [22]. Therefore, the features that should be monitored for such attacks are as follows:

1) Total number of sequences received during the observation period from different IPs;

2) Number of privileged port numbers used during the observation period;

3) Number of different privileged port numbers used during the observation period;

4) Number of registered ports used during the observation period;

5) Number of different registered port numbers used;

6) Number of half-open connections;

7) Total number of TCP connections during the observation period;

8) Total number of UDP flows during the observation period;

9) Total number of TCP connections initiated by the source;

10) Total number of UDP flows received;

11) Total number of TCP connections initiated by the host;

12) Total number of UDP flows sent.

3.5 Exploitation of Privilege/Trust

An attacker actively targets exploitation of weaknesses, limitations and assumptions in the mechanisms a target utilizes to manage access to its resources or authorize the utilization of its functionality. Such exploitation can lead to the complete subversion of any control the target has over its data or functionality, enabling almost any desired action on the part of the attacker. Similarly to exploitation-of-authentication attacks, this type of attack is detectable from the payload data by looking for specific patterns; however, some of the attacks are also detectable from the network traffic. Therefore, the features that should be monitored for such attacks are as follows [35, 36]:

1) Total number of sequences received during the observation period from different IPs;

2) Total number of sequences sent during the observation period to different IPs;

3) Number of privileged port numbers used during the observation period;

 


4) Number of different privileged port numbers used during the observation period;

5) Number of registered ports used during the observation period;

6) Number of different registered port numbers used;

7) Number of half-open connections;

8) Total number of TCP connections during the observation period;

9) Total number of UDP flows during the observation period.

4 Network Anomalies Classifier (NAC)

This section presents an A-IDS alarm classification method which relies on a machine learning algorithm and attack examples learnt from an S-IDS during the training process. The proposed method monitors the network communication pattern and actively extracts the required network traffic features. The proposed system analyses IDS alarms and attempts to classify them based on a pre-learnt classification model. The classification model is constructed from attack examples supplied during the training phase, during which Snort is used to provide alarm class definitions for the activities detected by the anomaly detection system. The proposed system is represented by the Network Anomalies Classifier (NAC) module depicted in Figure 1. The NAC is responsible for the automatic classification of activities detected by a packet header-based anomaly detection system (specifically, PHAD), based on a predefined set of patterns of attack mechanisms.

Figure 1: Online anomalies classifier

The proposed network anomalies classifier uses a machine learning algorithm to assign class labels to the detected activities. The NAC consists of two interacting components, the Packet Features Extractor (PFE) and the Anomaly Classifier Engine (ACE), as illustrated in Figure 1. The PFE monitors the network traffic flow and extracts traffic flow features to generate alarm meta-information as a vector, referred to as the symptoms vector. The symptoms vector is then directed to the anomaly classifier engine for further analysis. The most suitable traffic flow features are selected by hand from the feature spectrum, based on prior knowledge about the environment that the IDS is monitoring and the analysis of known attack types.
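To make the extraction step concrete, the following minimal Python sketch (purely illustrative and not part of the proposed system; the flow-record fields and the helper name symptoms_vector are assumptions) computes a handful of the flow features listed in Sections 3.4 and 3.5 over one observation window:

    def symptoms_vector(flows):
        # Map the flow records of one observation window to a feature vector
        # covering a subset of the monitored features. The flow-record layout
        # (dicts with proto, src_ip, dst_port, state) is an assumption.
        tcp = [f for f in flows if f["proto"] == "tcp"]
        udp = [f for f in flows if f["proto"] == "udp"]
        priv = [f["dst_port"] for f in flows if f["dst_port"] < 1024]
        reg = [f["dst_port"] for f in flows if 1024 <= f["dst_port"] <= 49151]
        return [
            len({f["src_ip"] for f in flows}),   # sequences from different IPs
            len(priv),                           # privileged ports used
            len(set(priv)),                      # distinct privileged ports
            len(reg),                            # registered ports used
            len(set(reg)),                       # distinct registered ports
            sum(1 for f in tcp if f["state"] == "half_open"),  # half-open conns
            len(tcp),                            # TCP connections in the window
            len(udp),                            # UDP flows in the window
        ]

    window = [
        {"proto": "tcp", "src_ip": "10.0.0.5", "dst_port": 22, "state": "half_open"},
        {"proto": "tcp", "src_ip": "10.0.0.6", "dst_port": 80, "state": "established"},
        {"proto": "udp", "src_ip": "10.0.0.5", "dst_port": 5353, "state": "n/a"},
    ]
    print(symptoms_vector(window))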

The ACE is responsible for automatically classifying the detected activity and determining the attack class. Before the classifier engine is able to classify new incoming alarms automatically, the ACE is trained with several types of attack meta-information to build a classification model. During the training phase, the attack meta-information is provided automatically by extracting specific information from known attack signatures. In this work, a signature-based IDS is deployed next to the anomaly detection system, and both monitor the same network traffic. Consequently, the S-IDS is responsible for feeding the NAC with the attack class of any alarm generated by the two systems.

4.1 Packet Features Extractor (PFE)

Network traffic contains features that are redundant or that contribute little to the classification process. Therefore, it is essential to choose which data are relevant to consider and which are not [8]. By reducing the number of features, the classifier's computational speed is improved and the overall performance is increased. Thus, feature selection plays an important role when creating a model of the network traffic. The features should represent the traffic data as accurately as possible. The challenge lies in discovering the most suitable features with a major contribution to the classification process [4].

Network traffic is collected based on either packet data or network traffic flows; each provides a different type of visibility, and collectively they can provide a complete view of the network activities. As data streams flow across the network, a network packet-based sniffer captures each packet and decodes the packet's raw data, showing the values of various fields in the packet. However, the network traffic contains users' confidential information [33]. Consequently, a deep packet analysis cannot be done, and only limited analysis of the network traffic can be achieved. Therefore, the header fields of the packets can be checked, but not the user's data in the payload.

 


 

A traffic flow can be described as follows: all network packets with the same source/destination IP address, source/destination ports, protocol interface and class of service are grouped into a flow. Traffic flow is summarized data that provides a simple, effective, and scalable way to gain visibility into traffic types and bandwidth usage on the network. One important fact about network flows is that flows do not provide any packet payload; rather, only meta-information about network connections is collected. The meta-information contains several attributes (e.g., the packets or bytes transferred in a flow). Unlike the packet data approach, since network flows do not carry packet payload, all information that was transported in the original payload is irretrievably lost. On the other hand, the lack of payload contributes advantages such as privacy and scalability [13].
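As a rough illustration of this grouping rule (the packet-record field names below are illustrative assumptions, not the authors' implementation), the following sketch aggregates packet records into flows keyed by the usual 5-tuple, keeping only meta-information such as packet and byte counts:

    from collections import defaultdict

    def packets_to_flows(packets):
        # Group packet records into flows keyed by the 5-tuple; keep only
        # meta-information (packet and byte counts), never the payload.
        flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
        for p in packets:
            key = (p["src_ip"], p["dst_ip"], p["src_port"], p["dst_port"], p["proto"])
            flows[key]["packets"] += 1
            flows[key]["bytes"] += p["length"]
        return flows

    packets = [
        {"src_ip": "10.0.0.5", "dst_ip": "10.0.0.9", "src_port": 40000,
         "dst_port": 80, "proto": "tcp", "length": 60},
        {"src_ip": "10.0.0.5", "dst_ip": "10.0.0.9", "src_port": 40000,
         "dst_port": 80, "proto": "tcp", "length": 1500},
    ]
    for key, meta in packets_to_flows(packets).items():
        print(key, meta)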

Based on the available information from the literature, it seems that efficient attack classification can be done using network traffic flow information. Recent research has shown that network traffic flows can improve the accuracy of attack classification [13, 18]. Therefore, the network traffic flow method has been used in this work to monitor network behaviour. There are many advantages in using flow data instead of packet data. The major advantage comes from protecting the privacy and confidentiality of the protected network, as well as the reduced need for storage space, since network flows require roughly one tenth of the space of the original packet-based data. Network traffic flow provides an abstract overview of the network state, performance and behaviour, which is required to train the anomalies classifier engine.

Two approaches were used to select the relevant features from the network traffic. First, an analysis of the information that the literature holds on this topic was carried out; then, an evaluation of different attack scenarios and how they affect network traffic behaviour was prepared. The most suitable traffic flow features were selected from the feature spectrum based on prior knowledge about the environment that the IDS is monitoring and the analysis of known attack types.

4.1.1 Packet Features Selection

After analysing the features from the attack scenarios point of view and what has been utilized in the literature, it appeared that the features used by [16] are very comparable to the features that should be monitored for each attack class. Therefore, the features used by [16] were chosen, as well as some other related features obtained from the attack scenario analysis. The features to be monitored are listed in Table 2. The selected feature set contains statistical information that reflects the amount of change within each time interval.

As illustrated in Table 2, twenty-five features have been selected to be monitored. The selected features will be represented as a vector of 25 elements, where each element holds its designated value. At this stage, the extracted vectors are defined as the symptoms vectors. To expound on the functionality of the packet features extractor, the functional model of the proposed system is shown in Figure 2.

4.2 Anomaly Classifier Engine (ACE)

The anomaly classifier engine is responsible for automatically classifying the detected activity and determining the attack class, based on a predefined set of patterns of known attack mechanisms that are defined in the CAPEC and CVE databases. The PFE monitors the network traffic flow and extracts traffic flow features to generate alarm meta-information as a symptoms vector. The symptoms vector is then passed to the anomaly classifier engine, which automatically determines the attack class. The development of the ACE goes through two stages. First, the ACE is trained with several types of attack symptoms vectors. Then, when the training is completed, the ACE is ready to classify new incoming alarms automatically.

During the training phase, a signature-based IDS is deployed next to the A-IDS such that the two systems monitor the same network traffic, as illustrated in Figure 3. Once the A-IDS generates an alarm, the anomaly classifier engine learns the alarm class from the signature-based system. The strategy of the alarm labelling process is as follows: if the activity reported by the A-IDS did not trigger the S-IDS to generate an alarm, it is considered a false alarm; otherwise, the classification engine acknowledges the S-IDS classification of the detected activity. Once the training phase is over, the proposed system enters the classification phase. During this phase, the packet features extractor actively extracts network traffic flow features of A-IDS reported activities, and the anomaly classifier engine classifies the events based on the learnt classification model. The ACE includes the algorithm used to classify attacks; machine learning technologies have been used for the classification process, to automatically and systematically classify attacks detected by an anomaly-based intrusion detection system. Machine learning can help to automate tasks and provide predictions where humans have difficulty comprehending large amounts of data. One major benefit of machine learning is generalization: the ability of an algorithm to perform accurately on new, unseen examples after having been trained on a learning data set.
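This labelling rule can be summarised in a few lines. The sketch below is only an illustration of the strategy described above; the matches predicate, which decides when two alarms refer to the same activity, is a hypothetical helper:

    def label_alarm(aids_alarm, sids_alarms, matches):
        # Training-phase labelling rule: if no S-IDS alarm matches the same
        # activity, the A-IDS alarm is treated as a false alarm; otherwise
        # the S-IDS attack class is adopted. `matches` is a hypothetical
        # predicate (e.g., same addresses within a small time window).
        for s in sids_alarms:
            if matches(aids_alarm, s):
                return s["attack_class"]
        return "false_alarm"

    same_src = lambda a, s: a["src_ip"] == s["src_ip"]
    print(label_alarm({"src_ip": "10.0.0.5"},
                      [{"src_ip": "10.0.0.5", "attack_class": "dos"}],
                      same_src))          # -> dos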

4.2.1 Machine Learning Algorithm Selection

The choice of which specific learning algorithm to use is a critical step. A classifier's evaluation is most often based on classification accuracy (the number of correct classifications divided by the total number of events in the data set). There are various techniques available to estimate a classifier's accuracy. One technique is to split the training set, using two-thirds for training and the other third for estimating performance.

 


Table 2: Selected network traffic flow-based features (RD: Resource Depletion, NR: Network Reconnaissance, Spf: Spoofing, ExA: Exploitation of Authentication, ExP: Exploitation of Privilege/Trust)

F1  Number of sequences received during the observation period
F2  Amount of bytes received during the observation period
F3  Total number of packets received
F4  Total number of sequences received during the observation period from different IPs
F5  Number of sequences sent during the observation period
F6  Amount of bytes sent during the observation period
F7  Total number of packets sent
F8  Total number of sequences sent during the observation period to different IPs
F9  Number of privileged port numbers used during the observation period
F10 Number of different privileged port numbers used during the observation period
F11 Number of registered ports used during the observation period
F12 Number of different registered port numbers used
F13 Total number of different TCP and UDP port numbers used by source
F14 Total number of different TCP and UDP port numbers used by the host
F15 Number of TCP requests for transmission
F16 Number of half-open connections
F17 Number of established connections, which represents an open connection
F18 Number of connection termination requests sent
F19 Number of confirming connection terminations received
F20 Total number of TCP connections during the observation period
F21 Total number of UDP flows during the observation period
F22 Total number of TCP connections initiated by source
F23 Total number of UDP flows received
F24 Total number of TCP connections initiated by the host
F25 Total number of UDP flows sent

In another technique, known as cross-validation, the training set is divided into mutually exclusive, equal-sized subsets, and for each subset the classifier is trained on the union of all the other subsets. The average of the error rates over the subsets is then an estimate of the error rate of the classifier. If the error rate evaluation is unsatisfactory, the selected features must be re-examined.
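Both accuracy-estimation techniques are straightforward to express in code. The following sketch, using scikit-learn on synthetic data purely as an illustrative analogue of the authors' Weka-based evaluation, shows a two-thirds/one-third split and a k-fold cross-validation error estimate:

    import numpy as np
    from sklearn.model_selection import KFold, train_test_split
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 25))        # 300 synthetic 25-feature vectors
    y = rng.integers(0, 5, size=300)      # five attack classes

    # Technique 1: two-thirds for training, one third for estimating performance.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    print("holdout accuracy:", clf.score(X_te, y_te))

    # Technique 2: cross-validation -- train on the union of the other subsets,
    # then average the per-subset error rates.
    errors = []
    for tr, te in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        clf = DecisionTreeClassifier(random_state=0).fit(X[tr], y[tr])
        errors.append(1 - clf.score(X[te], y[te]))
    print("cross-validated error estimate:", np.mean(errors))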

Since the attack class and the related meta-information can be obtained, only supervised machine learning algorithms have been considered in this work.

 


 


Figure 2: Functional model of the proposed method

Figure 3: Online anomalies classifier during training phase

 

These algorithms generally achieve better results than unsupervised methods. However, the classification algorithm must meet several requirements, as listed in Table 3.

In this work, five machine learning algorithms have been considered: Random Committee, Rotation Forest, PART, Random Forest and Random Tree. These algorithms implement supervised techniques; their training and classification phases are fast, and they are able to handle large amounts of data. In this work, machine learning evaluations have been implemented using the Weka platform. Weka is a well-known collection of machine learning algorithms; it also provides a comprehensive framework to execute benchmarks on several datasets under the same testing conditions.
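The benchmark loop itself is simple. Below is a hedged scikit-learn analogue of such a comparison on synthetic data; Random Committee, Rotation Forest and PART have no direct scikit-learn equivalents, so ExtraTrees, RandomForest and a single decision tree stand in purely for illustration:

    import numpy as np
    from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 25))        # synthetic symptoms vectors
    y = rng.integers(0, 5, size=500)      # five attack classes

    models = {
        "ExtraTrees (illustrative stand-in)": ExtraTreesClassifier(random_state=1),
        "RandomForest": RandomForestClassifier(random_state=1),
        "DecisionTree (stand-in for Random Tree)": DecisionTreeClassifier(random_state=1),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)   # 5-fold accuracy
        print(f"{name}: mean accuracy {scores.mean():.3f}")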

Random Tree is a decision tree that considers a number of randomly chosen attributes at each node. Random Trees were introduced by [2] as the base classifier for his random forest classification algorithm. Random Tree develops unpruned decision trees; furthermore, it does not perform any optimization on its resultant rulesets.

Random Committee is an ensemble of randomized Random Tree classifiers. Each Random Tree classifier is built using a different random number seed, and the final prediction is a straight average of the predictions generated by the individual base classifiers. Rotation Forest [32] is an ensemble classifier based on feature extraction. The model uses a decision tree algorithm (J48) as the base classifier, and the feature extraction is based on Principal Component Analysis (PCA). PCA is used to determine feature feasibility and to find out whether the features contribute to increased classification accuracy. In generating the training dataset, the feature set is randomly split into a number of subsets and PCA is applied to each subset. The coefficients of the principal components are represented in a vector for each subset and organized in a rotation matrix. All principal components are retained in order to preserve the variability information in the data.

 


 

Table 3: Machine learning selection criteria

1) Support for multiple classes: The attacks fall into five different categories; the selected algorithm must therefore support multiclass classification.

2) Able to handle large amounts of data: Using large amounts of memory can seriously degrade the system. Quite a few learning algorithms can be trained incrementally, one data row at a time; these methods generally have runtime that is linear in the number of rows and fields in the data and only require the current data row to be present in main memory, so they can process large amounts of data.

3) High-accuracy classification: One of the significant requirements is that the machine learning algorithm should classify with high accuracy and low false positive and false negative rates.

4) Able to train with a small data set (fast training): The machine learning algorithm should be able to develop the classification model from a small data set, to decrease the number of alarms required.

5) Having an explicit underlying probability model: The machine learning algorithm should be based on statistical approaches, which provide a probability that an instance belongs to each class, rather than simply a classification.

6) Developed in academic research: Developing machine learning algorithms is beyond the scope of this work.


 

Thus, a number of axis rotations take place to form the new features for a base classifier. The rotation forest ensemble was evaluated on a selection of 33 benchmark data sets from the UCI repository and compared with Bagging, AdaBoost, and Random Forest; it was more accurate than AdaBoost and Random Forest, and produced more diverse ensembles than Bagging.
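The rotation step can be sketched compactly. The following simplified illustration (the function name is an assumption, and details of the published algorithm such as bootstrapping and class subsampling are omitted) splits the features into random subsets, fits PCA to each subset, and assembles the component coefficients into a block-diagonal rotation matrix:

    import numpy as np
    from sklearn.decomposition import PCA

    def rotation_matrix(X, n_subsets=5, seed=0):
        # Split the features into random subsets, fit PCA on each subset,
        # and place each subset's component coefficients in a block of a
        # feature-space rotation matrix. All components are retained.
        rng = np.random.default_rng(seed)
        n_features = X.shape[1]
        subsets = np.array_split(rng.permutation(n_features), n_subsets)
        R = np.zeros((n_features, n_features))
        for cols in subsets:
            pca = PCA(n_components=len(cols)).fit(X[:, cols])
            R[np.ix_(cols, cols)] = pca.components_.T  # PC coefficients
        return R

    X = np.random.default_rng(2).normal(size=(200, 25))
    X_rotated = X @ rotation_matrix(X)    # new features for one base classifier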

PART [6] is a rule-induction algorithm that draws on the C4.5 and RIPPER rule-learning algorithms to infer classification rules by repeatedly generating partial decision trees, without the need for complex optimization. It adopts the separate-and-conquer strategy: it builds a rule, removes the instances that rule covers, and continues creating rules recursively for the remaining instances until none are left. In essence, to make a single rule, a decision tree is built for the current set of instances, the leaf with the largest coverage is made into a rule, and the tree is discarded. The main speciality of the PART algorithm is that, unlike C4.5 and RIPPER, it does not need to perform global optimisation to produce appropriate rules; instead it recursively builds a partial C4.5 decision tree and makes the "best" leaf into a rule.
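A heavily simplified sketch of this separate-and-conquer loop is given below; it uses a scikit-learn decision tree as the partial-tree builder and records only the best leaf's majority class and coverage, rather than extracting the full rule as PART does:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def part_like_rules(X, y, max_rules=10):
        # Simplified separate-and-conquer loop in the spirit of PART:
        # grow a tree on the remaining instances, turn the leaf with the
        # largest coverage into a "rule" (here just its majority class and
        # coverage), discard the tree, and remove the covered instances.
        rules = []
        remaining = np.arange(len(y))
        while len(remaining) and len(rules) < max_rules:
            tree = DecisionTreeClassifier(max_depth=3, random_state=0)
            tree.fit(X[remaining], y[remaining])
            leaves = tree.apply(X[remaining])          # leaf id per instance
            ids, counts = np.unique(leaves, return_counts=True)
            best = ids[np.argmax(counts)]              # largest-coverage leaf
            covered = leaves == best
            majority = int(np.bincount(y[remaining][covered]).argmax())
            rules.append((majority, int(covered.sum())))
            remaining = remaining[~covered]            # separate and conquer
        return rules

    rng = np.random.default_rng(3)
    X, y = rng.normal(size=(300, 25)), rng.integers(0, 5, size=300)
    print(part_like_rules(X, y))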

Random Forest is a combination of decision trees such that each constructed tree depends on the values of a random vector sampled independently, with the same distribution for all trees in the forest. The concept behind random forests is that significant improvements in classification accuracy can be achieved by growing an ensemble of trees and letting each tree vote for the most popular class. Random forests were introduced by [2] and are defined as an ensemble learning method for classification that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by the individual trees.
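In miniature, the bootstrap-and-vote idea looks as follows (an illustrative sketch rather than Breiman's full algorithm; per-node feature subsampling is delegated here to scikit-learn's max_features option):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def forest_vote(X_train, y_train, X_test, n_trees=25, seed=0):
        # Miniature random forest: each tree is grown on an independent
        # bootstrap sample, and the predicted class for each test instance
        # is the mode of the individual trees' votes.
        rng = np.random.default_rng(seed)
        votes = []
        for _ in range(n_trees):
            idx = rng.integers(0, len(y_train), size=len(y_train))  # bootstrap
            tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
            votes.append(tree.fit(X_train[idx], y_train[idx]).predict(X_test))
        votes = np.stack(votes)                        # (n_trees, n_test)
        return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])

    rng = np.random.default_rng(4)
    X, y = rng.normal(size=(200, 25)), rng.integers(0, 5, size=200)
    print(forest_vote(X[:150], y[:150], X[150:])[:10])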

5 Evaluation Results of NAC

This section presents the evaluation results of the proposed network anomalies classifier. First, it describes the datasets employed; then the evaluation results are presented.

 


 

Table 4: Attack class distribution of instances in the five evaluation datasets


Attack Class Dataset A Dataset B Dataset C Dataset D Dataset E

dos 615 1004 689 0 1378

u2r 15 0 1807 1084 3614

r2l 310 333 148 0 270

data 41 0 2 198 114

probe 171 13 144 148 195

Total 1152 1350 2790 1430 5571


 

5.1 Evaluation Dataset

The selected machine learning algorithms have been evaluated against five different datasets. The evaluation was based on the classification accuracy using the defined network traffic features. The datasets contain network traffic features representing the network state during alarms identified by a security analyst or raised by the signature-based IDS (attack-only datasets); each dataset contains a number of instances representing network traffic audit records during a detected malicious activity, as shown in Table 4, and Figure 4 illustrates the percentage distribution of attack types in the datasets.

Dataset A: This dataset contains 1152 instances, with a majority of denial-of-service attacks selected at random. The occupancy ratio of denial-of-service attacks to remote-to-local attacks is nearly 2:1, and the ratio of remote-to-local attacks to probes is also about 2:1. The dataset contains some attacks representing the user-to-root and data classes. However, some classes have few audit records, which may negatively impact detection accuracy.

Dataset B: contains 1350 instances, with a majority of dos attacks and some other attacks randomly selected. This dataset represents a scenario in which an attacker uses probe and remote-to-local attacks to make network resources unavailable to their intended users, which is common in real scenarios.

Dataset C: includes 2790 instances and represents a scenario in which an attacker uses probe and remote-to-local attacks together with dos to gain root privileges. Therefore, the dataset has a majority of user-to-root attacks. The occupancy ratio of denial-of-service attacks to remote-to-local attacks is nearly 1:2.

Dataset D: includes 1430 instances and represents the same scenario as Dataset C, in which an attacker uses probe and remote-to-local attacks to gain root privileges, but without the use of dos attacks.

Dataset E: This dataset contains 5571 instances randomly collected, with a majority of u2r attacks. The occupancy ratio of denial-of-service attacks to remote-to-local attacks is nearly 2:1.

 

Figure 4: Percentage distributions of attack types in datasets

5.2 Evaluation Result

Three performance metrics have been used for the machine learning comparison: classification accuracy, precision and F-measure. The performance evaluation of the selected machine learning algorithms was conducted by training and testing with the above five datasets to show their performance in different scenarios. There are four factors which influence classification accuracy: the number of samples (alarms) processed during the training phase, the frequency distribution of the alarms, the machine learning algorithm used, and the network traffic features used. Table 5 shows the detection accuracy on the five datasets using the different machine learning algorithms. Based on these results, it can be concluded that Random Committee and Random Tree perform better than the other algorithms and that their detection accuracy is almost identical, but the precision of Random Committee is higher than that of Random Tree. Therefore, in this work Random Committee will be used to classify the detected activities.
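All three metrics are standard and easy to compute; the following small example uses synthetic labels (not the paper's data) purely to show the calculation:

    from sklearn.metrics import accuracy_score, f1_score, precision_score

    y_true = ["dos", "u2r", "r2l", "dos", "probe", "dos"]   # synthetic labels
    y_pred = ["dos", "u2r", "dos", "dos", "probe", "r2l"]

    print("accuracy :", accuracy_score(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
    print("f-measure:", f1_score(y_true, y_pred, average="macro", zero_division=0))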

6 Conclusions and Future Work

In order to estimate the risk of unknown attacks, a solution to automate the classification of anomaly-based alarms is required. However, so far no effective and efficient automatic or semi-automatic approach is available that can classify anomaly-based alarms at run time.

 


 

Table 5: Classification accuracy of the machine learning algorithms on the five evaluation datasets


Machine Learning Dataset A Dataset B Dataset C Dataset D Dataset E

Random Committee 96.78% 99.85% 98.49% 99.23% 98.20%

Rotation Forest 94.18% 99.03% 97.88% 98.04% 98.09%

PART 93.22% 99.18% 97.13% 98.04% 98.06%

Random Forest 96.61% 99.70% 98.45% 99.09% 98.18%

Random Tree 96.78% 99.85% 98.49% 99.23% 98.20%


 

Thus, any anomaly-based alarm must be manually processed to identify its class; this may increase the workload of security analysts and will effectively increase the time required, as well as the dependence on security analysts.

This paper presents the Network Anomalies Classifier (NAC), which uses machine learning technologies to automatically classify activities detected by a packet header-based anomaly detection system. The concept behind the proposed methodology is that attacks which share common network traffic flow behaviours are usually in the same class. Based on the available literature, efficient attack classification can be done using network traffic flow information; recent research has shown that network traffic flows can improve the accuracy of attack classification. Therefore, the network traffic flow method has been used in this work to monitor network behaviour. Thus, by extracting the traffic flow sequences triggered by a certain attack, it is possible to compare those sequences to previously collected data using a machine learning algorithm, and then to infer the attack class from the matching sequences.

Two approaches were used to select the relevant features from the network traffic: first, an analysis of the information that the literature holds on this topic; then, an evaluation of different attack scenarios and how they affect network traffic behaviour. The most suitable traffic flow features were selected by hand from the feature spectrum, based on prior knowledge about the environment that the IDS is monitoring and the analysis of known attack types.

In this work, five machine learning algorithms have been considered: Random Committee, Rotation Forest, PART, Random Forest and Random Tree. Evaluation experiments showed that machine learning algorithms are capable of classifying malicious activities effectively and efficiently. However, too low a number of samples can produce inaccurate classification; as the number of training samples increases, accuracy increases. Based on the evaluation results, it can be concluded that Random Committee and Random Tree perform better than the other algorithms and that their detection accuracy is almost identical, but the precision of Random Committee is higher than that of Random Tree. Therefore, in future work the Random Committee algorithm will be used to classify the detected activities in order to estimate the security risk level.

 

References

[1] M. H. Aghdam and P. Kabiri, "Feature selection for intrusion detection system using ant colony optimization," International Journal of Network Security, vol. 18, no. 3, pp. 420–432, 2016.

[2] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[3] C. Y. Chiu, Y. J. Lee, C. C. Chang, W. Y. Luo, and H. C. Huang, "Semi-supervised learning for false alarm reduction," in Advances in Data Mining Applications and Theoretical Aspects, Springer Berlin Heidelberg, vol. 6171, pp. 595–605, 2010.

[4] G. Fernandes and P. Owezarski, "Automated classification of network traffic anomalies," Security and Privacy in Communication Networks, vol. 19, pp. 91–100, 2009.

[5] R. Fontugne, T. Hirotsu, and K. Fukuda, "An image processing approach to traffic anomaly detection," in ACM Proceedings of the 4th Asian Conference on Internet Engineering (Aintec'08), pp. 17, 2008.

[6] E. Frank and I. H. Witten, "Generating accurate rule sets without global optimization," in Proceedings of the Fifteenth International Conference on Machine Learning, pp. 144–151, 1998.

[7] V. N. L. Franqueira, Z. Bakalova, T. T. Tun, and M. Daneva, "Towards agile security risk management in RE and beyond," in IEEE Workshop on Empirical Requirements Engineering, pp. 33–36, 2011.

[8] S. Ganapathy, K. Kulothungan, S. Muthurajkumar, M. Vijayalakshmi, P. Yogesh, and A. Kannan, "Intelligent feature selection and classification techniques for intrusion detection in networks: A survey," EURASIP Journal on Wireless Communications and Networking, vol. 2013, no. 1, pp. 271, 2013.

[9] A. A. Ghorbani, W. Lu, and M. Tavallaee, "Network intrusion detection and prevention," Information Security, vol. 47, pp. 27–54, 2010.

[10] M. Guimaraes and M. Murray, "Overview of intrusion detection and intrusion prevention," in Proceedings of the ACM 5th Annual Conference on Information Security Curriculum Development, pp. 44–46, 2008.

[11] X. Hongbin and X. Wenbo, "Research on method of network abnormal detection based on hurst parameter estimation," in Proceedings International Conference on Computer Science and Software Engineering, vol. 3, pp. 559–562, 2008.

[12] G. Javadzadeh and R. Azmi, "IDuFG: Introducing an intrusion detection using hybrid fuzzy genetic approach," International Journal of Network Security & Its Applications, vol. 17, no. 6, pp. 754–770, 2015.

[13] J. H. Jun, D. Lee, C. W. Ahn, and S. H. Kim, "DDoS attack detection using flow entropy and packet sampling on huge networks," in The Thirteenth International Conference on Networks, pp. 185–190, Nice, France, 2014.

[14] Y. Kim, J. Y. Jo, and K. K. Suh, "Baseline profile stability for network anomaly detection," International Journal of Network Security, vol. 6, no. 1, pp. 60–66, 2008.

[15] J. M. Kizza, "Introduction to computer network vulnerabilities," in Guide to Computer Network Security, vol. 4, pp. 87–103, 2015.

[16] O. Knuuti, T. Seppälä, T. Alapaholuoma, J. Ylinen, P. Loula, P. Kumpulainen, and K. Hätönen, "Constructing communication profiles by clustering selected network traffic attributes," in 5th International Conference on Internet Monitoring and Protection (ICIMP'10), pp. 105–109, 2010.

[17] P. G. Kumar and D. Devaraj, "Network intrusion detection using hybrid neural networks," in 2007 International Conference on Signal Processing Communications and Networking, pp. 563–569, 2007.

[18] A. Lakhina, M. Crovella, and C. Diot, "Mining anomalies using traffic feature distributions," in Proceedings of the ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pp. 217, 2005.

[19] F. Y. Leu and Z. Y. Li, "Detecting DoS and DDoS attacks by using an intrusion detection and remote prevention system," in Fifth International Conference on Information Assurance and Security, pp. 251–254, 2009.

[20] H. H. Lin, C. H. Mao, and H. M. Lee, "False alarm reduction by weighted score-based rule adaptation through expert feedback," in IEEE 2nd International Conference on Computer Science and its Applications, pp. 1–8, 2009.

[21] T. Liu, Z. Wang, H. Wang, and K. Lu, "An entropy-based method for attack detection in large scale network," International Journal of Computers, Communications and Control, vol. 7, no. 3, pp. 242–250, 2012.

[22] M. V. Mahoney, "Network traffic anomaly detection based on packet bytes," in Proceedings of the ACM Symposium on Applied Computing, pp. 346, 2003.

[23] Y. Meng and L. Kwok, "Adaptive false alarm filter using machine learning in intrusion detection," Practical Applications of Intelligent Systems, pp. 573–584, 2011.

[24] Mitre Corporation, "Common attack pattern enumeration and classification (CAPEC)," 2011.

[25] N. Mohd, S. Annapurna, and H. S. Bhadauria, "Taxonomy on security attacks on self configurable networks," International Journal of Electronics and Information Engineering, vol. 3, no. 1, pp. 44–52, 2015.

[26] T. L. Nielsen, J. Abildskov, P. M. Harper, I. Papaeconomou, and R. Gani, "The CAPEC database," Journal of Chemical & Engineering Data, vol. 46, pp. 1041–1044, 2001.

[27] A. Niemelä, "Traffic analysis for intrusion detection in telecommunications networks," Master of Science Thesis, Tampere University of Technology, 2011.

[28] I. V. Onut and A. A. Ghorbani, "A feature classification scheme for network intrusion detection," International Journal of Network Security, vol. 5, no. 1, pp. 1–15, 2007.

[29] S. Parsazad, E. Saboori, and A. Allahyar, "Fast feature reduction in intrusion detection datasets," in MIPRO Proceedings of the IEEE 35th International Convention, pp. 1023–1029, 2012.

[30] Q. Qian, T. Wang, and R. Zhan, "Relative network entropy based clustering algorithm for intrusion detection," International Journal of Network Security, vol. 15, no. 1, pp. 16–22, 2013.

[31] O. Rodas and M. A. To, "A study on network security monitoring for the hybrid classification-based intrusion prevention systems," International Journal of Space-Based and Situated Computing, vol. 5, no. 2, pp. 115, 2015.

[32] J. J. Rodríguez, L. I. Kuncheva, and C. J. Alonso, "Rotation forest: A new classifier ensemble method," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, pp. 1619–1630, 2006.

[33] N. Sharma and S. Mukherjee, "A layered approach to enhance detection of novel attacks in IDS," International Journal of Advances in Engineering Technology, vol. 4, no. 2, pp. 444–455, 2012.

[34] H. Wang, M. Guo, L. Zhou, and J. Camargo, "Ranking attacks based on vulnerability analysis," in 2010 43rd IEEE Hawaii International Conference on System Sciences, pp. 1–10, 2010.

[35] K. Wang and S. J. Stolfo, "Anomalous payload-based network intrusion detection," in Recent Advances in Intrusion Detection, LNCS 3224, pp. 203–222, Springer, 2004.

[36] J. Yu and Y. V. R. Reddy, "TRINETR: An intrusion detection alert management system," in 13th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises, pp. 235–240, 2004.

Qais Saif Qassim is a Ph.D. candidate in the Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia (UKM). His research interests are in network security and management.

Abdullah Mohd Zin received his PhD from the University of Nottingham, United Kingdom, in 1993. He is currently the dean of the Faculty of Information Science and Technology, UKM. He specializes in system architecture, programming languages, communication and distributed systems, and formal methods.

 

Mohd Juzaiddin Ab Aziz received his PhD degree in Computer Science from Universiti Putra Malaysia (UPM). Currently, he is the deputy dean of undergraduate studies in the Faculty of Information Science and Technology, UKM. He specializes in programming language technology and natural language processing.

 

 

Call For Papers: The 28th IJCNN 2016 Special Session on Advanced Machine Learning Methods and Applications from Complicated Data Environment

The IEEE WCCI 2016 will host three conferences: The 2016 International Joint Conference on Neural Networks (IJCNN 2016), the 2016 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2016), and the 2016 IEEE Congress on Evolutionary Computation (IEEE CEC 2016) under one roof. It encourages cross-fertilization of ideas among the three big areas and provides a forum for intellectuals from all over the world to discuss and present their research findings on computational intelligence.

AIMS AND SCOPE

Traditional machine learning methods have been commonly used for many applications, such as text classification, image recognition, and video tracking. For learning purposes, these data are often required to be represented as vectors. However, many other types of data objects in real-world applications, such as chemical compounds in biopharmacy, brain regions in brain networks and users in social networks, contain rich feature vectors and structure information. Such a simple feature-vector representation inherently loses the structure information of the objects. In reality, objects may have complicated characteristics, depending on how the objects are assessed and characterized. Meanwhile, the data may come from heterogeneous domains, such as traditional tabular-based data, sequential patterns, social networks, time series information, and semi-structured data. Novel machine learning methods are desired to discover meaningful knowledge in advanced applications from objects with complicated characteristics.

This special session expects to solicit contributions on the advanced machine learning methods and applications from complicated data environment.

TOPICS

The topics of interest include, but are not limited to:

 

Supervised/Unsupervised/Semi-supervised Learning

Semi-structured Learning

Graph-based Learning

Graph Classification/Clustering/Streaming

Multi-Graph Learning

Deep Graph Learning

Online Graph Learning

IMPORTANT DATES

Paper submission: 15 January 2016.

Notification of paper acceptance: 15 March 2016.


 

SUBMISSION GUIDELINES

All papers must be submitted through the IEEE IJCNN 2016 online submission system. For special session papers, please make sure you select the respective special session title “Advanced Machine Learning Methods and Applications from Complicated Data Environment” under the list of research topics in the submission system. For manuscript style information, you can refer to paper submission/templates/style at the conference website.

*Papers submitted for special sessions are to be peer-reviewed with the same criteria used for the rest of contributed papers. As a result, all accepted papers will be included in the main conference proceedings of IEEE IJCNN 2016.*

SPECIAL SESSION CHAIRS

Jia Wu QCIS Centre, University of Technology, Sydney, Australia Jia.Wu@uts.edu.au

Shirui Pan QCIS Centre, University of Technology, Sydney, Australia Shirui.Pan@uts.edu.au

Peng Zhang QCIS Centre, University of Technology, Sydney, Australia Peng.Zhang@uts.edu.au 

Xingquan Zhu Florida Atlantic University, USA xzhu3@fau.edu 

Chengqi Zhang QCIS Centre, University of Technology, Sydney, Australia Chengqi.Zhang@uts.edu.au

Philip S. Yu University of Illinois at Chicago, USA psyu@uic.edu

 

 


 


  Institute of Island Studies

University of Prince Edward Island

Conceived by a small group of PEI scholars and community leaders in 1982 and brought into being at UPEI in 1985, the Institute of Island Studies has strived to fulfil its purpose: to improve the quality of life on Prince Edward Island through practical applications of the scholarly field of Island Studies.

With the generous support of the University of Prince Edward Island, the provincial government, federal agencies, private sponsors, and many, many Island volunteers who have served on the Advisory Council and its committees, the Institute works to fulfill its four-fold mandate:

  To encourage a deep knowledge, understanding and cultural expression of Prince Edward Island;

To serve as a bridge between the University and Island communities;

To contribute to the formulation of progressive public policy on Prince Edward Island;

To undertake comparative studies of Prince Edward Island and other islands.



 

The Island Landscape –

a Non-renewable Resource

Carol Horne, MA

Institute of Island Studies, UPEI

June, 2009

 


Contents

Executive Summary

Landscape and Development on Prince Edward Island

Landscape as a Research Subject

The Value of Landscape on Prince Edward Island

The Island Landscape and Agricultural Development

The Island Landscape and Tourism Development

Managing a Working Landscape

Non-resident Land Ownership

Creeping Exurbia

Legislative Protections for PEI Landscape

Scenic Viewscape Zones and Special Planning Areas

The Lands Protection Act

Alternative Land Use Service (ALUS)

Landscape Management Tools in Other Jurisdictions

Overview

International and European Landscape Management

United Kingdom

United States and Canada

Islands

Applying Global Lessons to Islands

Moving Towards Landscape Management on PEI

Filling Gaps in the Management Framework

The Public's Role in Protecting Prince Edward Island Landscapes

Tools for Evaluating Scenic Viewscapes

Beyond the Aesthetic Value of Landscape

Possible Next Steps

Conclusions

References

 


 

Executive Summary

In Prince Edward Island, managing the land has always posed significant challenges for local administrators. At times the issue has simmered just below the surface; at other times, land use concerns have been the subject of widespread and very vocal debate. While the topic of land use is broad and far-reaching, one singular aspect of this issue, the Island landscape, elicits a common response, yet few solutions. Prince Edward Island's pastoral scenery is universally admired and indeed is a key element of the "Prince Edward Island brand." That unified appreciation is possibly where agreement ends, as preservation and protection face off against economic growth and development, tradition meets the forces of change, and individual property rights are challenged against the public good. How is a small, island-based administration to manage an admittedly very valuable scenic resource in the face of these conflicts?

Through discussion of land use and the value of the Island landscape, it has become clear that one useful step in a management process would be the proper identification of the most valuable of the province's scenic assets. While PEI as a whole is indeed very beautiful, it is neither practical nor desirable to designate the entire landscape as a protected area or a national park. So how might the most aesthetically and culturally valuable landscapes be identified and protected for the appreciation of today's tourists and tomorrow's Islanders? This paper, based on primary research (Horne, 2007), demonstrates an efficient and effective landscape identification tool and shows that agreement can be reached on the relative scenic value of viewscapes. A review of landscape management techniques used elsewhere reveals a variety of systems and processes that take into consideration the aesthetic and cultural values of landscape. PEI lags behind most of the areas studied, and there are many critical issues to be addressed with respect to land use and coastal zone management. The research leads to the following conclusions:

Landscape is essential to culture and identity; this connection may be especially acute on small islands.

Where the tourism industry is key to economic health, landscape also has significant economic value.

Institutional frameworks, public support and appropriate evaluative tools are necessary to identify and protect valued landscapes.

Balancing landscape protection and competing development options is challenging; it requires publicly supported, holistic and consistently applied policy and planning. Cost-effective tools for measuring relative landscape value do exist.

Other jurisdictions have approaches to landscape management that provide guidance for improving PEI’s landscape management processes.

Some regions have obtained national recognition for valued working landscapes.

Some practical steps are suggested to advance landscape management and to better protect one of Prince Edward Island's most valuable (and non-renewable) resources.

 


Landscape and Development on Prince Edward Island

Landscape as a Research Subject

The economy of Prince Edward Island (PEI) depends to a large degree on the tourism industry and thus on the appeal of the Island’s scenic viewscapes. Few would deny that PEI’s scenic beauty is a major tourism draw, and that visual appeal plays a large part in both the quality of visitor experience and the quality of life of residents. Yet, land use decisions are typically based on agricultural productivity, potential for development or ecological importance. Aesthetics are rarely taken into consideration as a separate, inherently valuable factor.

At first glance, the value of beauty seems very difficult to measure. There is no consensus on how to accurately assess the economic value of a PEI viewscape, and the sense of well-being imparted by aesthetically pleasing surroundings seems even more challenging to evaluate. Assuming that there is some interest in preserving Prince Edward Island’s scenic resource, is there also a way to measure the relative value of scenic viewscapes? Can we integrate such a tool in planning processes so that land use and development can proceed while maintaining landscape values?

We can begin our exploration of the issue by asking: what is landscape? The Oxford English Dictionary definition limits the meaning to “a view or prospect of natural inland scenery such as can be taken in at a glance from a single point of view; a piece of country scenery”. However, the relevant published literature clearly states that ‘landscape’ does not exist without the ‘experience of it’. The New Zealand Landscape Protection Act explains it this way:

Landscape is a concept which refers to the broader physical environment, including land and water areas, and people's perception and appreciation of that environment. ... Landscape as a human experience combines both aesthetic values and other values which humans attribute to landscape. Used in this sense, landscape is not only the physical appearance of land but also the subjective baggage each person carries with them ... Everyone's landscape is different. What we perceive depends on our experience, knowledge, expectation and role. (Government of New Zealand, 2006)

The European Landscape Convention defines landscape in similar terms: “an area, as perceived by people, whose character is the result of the action and interaction of natural and/or human factors” (Natural England, n.d.3). The observation that landscape is experiential provides the challenge for measuring relative landscape values. If each individual’s experience of the landscape differs, how is it possible to find agreement among diverse members of the public? How can landscape management decisions affecting the daily lives of citizens be made in a way that fairly reflects the general preferences and concerns of the local populace? The interdisciplinary nature of the field of landscape research causes it to be both fascinating and difficult to grasp. As a result, landscape research has often been neglected in favour of seeking answers to environmental questions that seem more amenable to straightforward scientific analysis.


On Prince Edward Island, a series of studies by land use commissions and committees has noted the need to grasp and untangle the thorny issue of landscape management. For example, a primary recommendation in the Landscape chapter of the Report of the Royal Commission on the Land, titled Everything Before Us, reads as follows: "THAT the landscape be of paramount consideration in government activities" (Boylan, 1990: 311). A further eight recommendations deal with topics such as promoting greater public appreciation of the landscape, preparing an extensive inventory of special landscapes, and designating a generous number of scenic look-out points. As of 2009, these recommendations have received scant (if any) attention.

The 1999 report by the Standing Committee on Agriculture, Forestry, and Environment included Recommendation number 76, which suggests

...that the Department of Economic Development and Tourism assess the opinions of residents and visitors to Prince Edward Island regarding the quality and attractiveness of the landscape and the impact of changes thereto. (Government of PEI, 1999a: 39)

In the subsequent progress report, it is implied that the Economic Impact Survey conducted annually by the Department of Tourism somehow assesses the opinions of residents and visitors regarding landscape quality. However, these surveys are inadequate to assess attractiveness of the landscape; they merely indicate that visitors like to sightsee. Since 1999, there has been no effort to develop tools to help manage the province’s scenic resource. Meanwhile, this scenic resource has undergone significant change — not always, we would argue, for the better.

The Master of Arts in Island Studies thesis Measuring Landscape Preferences (Horne, 2007), on which this document is based, reports on how the scenic resource is managed in selected island and mainland jurisdictions. The analysis is summarized below. The thesis also demonstrates a method for evaluating PEI's scenic resource by collecting the opinions of residents and other "scenery consumers" such as tourists. The method is straightforward and can lead to consistent results that define and rate the relative attractiveness of viewscapes.

The goal of the research — and this document — is to stimulate development of an objective and practical methodology for identifying scenic viewscapes which could then be designated for protection under the provincial Planning Act. This would in turn facilitate the definition and promotion of appropriate development that could enhance PEI’s tourism product and contribute to the quality of life of Island residents. Horne’s research begins to address the 1999 recommendation “to assess the opinions of residents and visitors to Prince Edward Island regarding the quality and attractiveness of the landscape”. The second part of the recommendation, to assess “the impact of changes thereto,” requires further research. Finally, to take action based on such research will require an updated institutional decision-making framework for development ― one that consciously takes into account the many benefits of scenic landscapes.


The Value of Landscape on Prince Edward Island

While most Island residents may take for granted the scenic beauty that surrounds them, they are also the first to boast about the Island landscape to visitors. If asked to describe PEI to outsiders, Islanders will often begin with the word “beautiful”. The Island’s scenic landscape is a key element of the province’s overall image, and has contributed greatly to the ‘brand’ that is Prince Edward Island.

The landscape has clear economic value, being a major motivator for visiting tourists. In fact, in the 2007 Tourism PEI Exit Survey, 36.7% of visitors cited "natural beauty and pastoral setting" as the primary features that attracted them to PEI. Another 23.2% cited beaches and coastline (UPEI Tourism Research Centre, 2008). Not surprisingly, the top activity for visitors was sightseeing (74.8%). In other research (Government of PEI, 2005b), respondents were prompted to mention other things they liked about PEI, and once again "scenery" topped the list at 51%, while a beaches/oceans/coastline grouping (also key island landscape features) came in second at 34% (ibid: 4–5, 14). While it may not be possible to put an exact dollar value on PEI's scenic resource, it is clearly a significant factor in the success of tourism, one of the province's primary industries.

PEI’s other primary industry, agriculture, is the source of much of the visually pleasing patchwork quality of the landscape. Visitors describe the landscape with great admiration as a rolling and colourful quilt, made up as it is of fields of pasture, potatoes, grain and hay; demarcated by rivers, hedgerows and woodlots; and of course defined by the coastline, the ocean and the multitude of bays and inlets. This “working landscape” is visually appealing to many, and the often-noted neatness of the homesteads implies a culture of care that is also very attractive (Nassauer, 1997). Perhaps famous Island author L.M. Montgomery said it best:

Prince Edward Island, however, is really a beautiful province — the most beautiful place in America I believe. Elsewhere are more lavish landscapes and grander scenery; but for the chaste, restful loveliness it is unsurpassed.

'Compassed by the inviolate sea,' it floats on the waves of the blue gulf, a green seclusion and 'haunt of ancient peace'. (Montgomery, 1917: 11)

The Island Landscape and Agricultural Development

Out of a total Island land base of 5,656 sq. km or 1.4 million acres, almost half is currently covered in forest, scrub or tree plantations (Government of PEI, n.d.2) and 620,000 acres are "cleared for farm use" (Government of PEI, 2008b). Some of the changes in the PEI landscape since the mid-20th century are due to a shift in the agricultural industry. The number of farms has declined dramatically, from more than 10,000 in 1951 to fewer than 2,000 in 1996 (Fig. 1). The decline continues, with the overall number of farms decreasing by 16.8% in the five years between 1996 and 2001, and by a further 7.9% from 2001 to 2006 (Statistics Canada, 2006). At the same time, average farm size has expanded from the traditional 90 acres at the turn of the 20th century to operations averaging more than 300 acres in 2001 (MacKinnon, 2005; Statistics Canada, 2003).

 


This trend to larger farms has been especially rapid since the 1960s.

For primarily economic reasons, the agricultural industry in PEI has shifted away from small mixed farms producing local food, and toward industrial-scale monoculture for export. Not only did the average size of farms increase, but the average size of farm fields expanded, a trend accompanied by the removal of hedgerows and the infilling of wetlands, seeps and small streams. The increased size of farms and fields, and the size and prevalence of related infrastructure such as large, roadside, climate-controlled warehouses for potato crop storage, are changes that could be perceived to have negative impacts on the overall scenic resource.

Figure 1. Decline in the number of PEI farms, 1951–2006. (Source: Statistics Canada, 2006)
The Island Landscape and Tourism Development

Often, the economic health of an island depends heavily on the tourism industry. The “lure of the island” is undeniably appealing for vacationers, and tourists who come to PEI often cite “scenic touring” as a preferred activity. In the research conducted for this project (see below), both locals and tourists exhibited a clear preference for less developed, coastal scenes.

The sea-bounded nature of island landscapes both expands and adds value to the view. The ocean also limits the amount and extent of the land, making it seem somehow all the more valuable. However, although visitors have a great appreciation for the beauty of the PEI landscape, the rate of development, especially on shore frontage, may change that perception. While visitors are among the greatest appreciators of Island scenery, the very fact that almost one million people visit PEI each year means that increasing numbers and varieties of landscape-altering services have been developed to meet the needs of tourists.

 


Thus, the very landscape that has lured visitors may be placed at risk by the actions or demands of those same visitors.

Many island tourism destinations have found themselves faced with this dilemma: a growing tourism industry, visitors searching for simpler times and restful scenery, and island residents scrambling to keep up with the demands of tourists. These factors combine to generate questions and potential conflicts. A key question for discussion is: if an island places great value on, or depends to a large extent on, income from tourism, should expectations for scenery be given more weight than the needs and wishes of residents who want to pursue other development projects or activities that may degrade landscape value?

Managing a Working Landscape

PEI’s landscape is a “working landscape” and so is, by definition, not static. Unlike a National Park or a designated protected area, a working landscape — no matter how scenic — will be subject to change. That change may enhance or detract from the overall scenic value. Lived-in landscapes are bound to change; no one expects or wishes to live in a museum. The goal must therefore be balanced development, which requires first of all an accurate assessment of the needs and values in play.

The question remains: how can one assess whether, or to what degree, a scene could support development without detracting from its overall appeal? There is ample evidence in the field of landscape research that community members can, in fact, agree on scenic preferences. Given a reliable and convenient measurement tool, perhaps Prince Edward Islanders can agree on which geographic areas and types of landscape have greatest aesthetic or cultural value. It is a logical extension to suppose that through a facilitated process, people could also come to agreement on the acceptability of a proposed change to a valued viewscape. This could be the basis for a landscape management system.

Non-resident Land Ownership

In June 1997, a 13-kilometre bridge connecting Prince Edward Island to the mainland of Canada was completed. That year, the total number of pleasure visitors increased by a spectacular 62%, from 711,200 in 1996 to 1,155,600 (Government of PEI, 1999b). The tourism industry continued to push the expansion of facilities throughout the first years of the 21st century. Cottages were constructed, new attractions opened, and big box stores were built on the fringes of Charlottetown and Summerside. As a result of the easier access to PEI and the greater awareness of the landscape resources that developed among residents from mainland communities, relatively low-priced property was purchased for development of seasonal homes, often on prime shore frontage. At $60,000-$100,000, the current (2008) prices of cottage lots advertised on various PEI real estate websites (e.g. michaelshomes.com, 5662121.com) may be out of reach for some Islanders, but appear to be bargains for many affluent North Americans.

Land ownership statistics are gathered by the Island Regulatory and Appeals Commission (IRAC). IRAC highlights an area as a "special interest area" once non-resident ownership reaches 25 percent.


Many of these areas are located along the coastline of the province. IRAC's first published report on Trends in Non-resident Land Ownership included data for 1994–2000 (IRAC, 2000). The prevalence of non-resident land ownership on the coast can be clearly seen on the map provided in that report. The report covering 2001–2004 has also been released, but as of November 2008, no more recent reports on non-resident land ownership were available.

In 2003-2004, 15 miles of shorefront were sold to non-residents (IRAC, 2005). It was noted that the proportion of non-resident land purchase applications involving shore frontage had increased (from 47% to 54%) as compared to the previous two-year period. IRAC reported that:

“The Commission believes that this indicates that shore property continues to be an attraction for non-residents. This report shows an increase in the number of applications for shore property although the number of actual miles decreased from 17 in 2002 to approximately 15 miles during this two year period.” (IRAC, 2005: 22)

As seen in Figure 2, the total acreage of PEI land owned by non-residents climbed steadily until 1988 and then declined, perhaps reflecting a tendency for non-resident landowners to transition into residents. More current data are not yet publicly available.

Figure 2. Acres of PEI land held by non-residents, 1970-2004. Sources: IRAC, 2000; IRAC, 2005.

Creeping Exurbia

The migration of Island residents to lots of land or small estates just beyond urban boundaries, in the countryside, or on the coast, leads to still more change to the traditional Island “working landscape”. Research performed in other parts of the world generally indicates that most people prefer less developed, country scenery. Any trend to suburban-style development lining country roads certainly has the potential to detract from the overall aesthetic appeal of the PEI countryside.

Over the past decade, an average of 420 single-family dwelling permits and 221 summer cottage permits have been issued annually by the PEI Department of Community and Cultural Affairs. In the 2001-2002 fiscal year alone, approvals were granted for 434 subdivisions, containing a total of 675 lots (Government of PEI, 2002: 30). Much of this development occurs outside of municipal boundaries.

While there are 75 municipalities in the province (Government of PEI, n.d.1; 2005a), most are too small to have the capacity for land use management. Some 82% of the land base is either unincorporated or is not subject to any municipal planning regulations. Taxes are lower in the unincorporated areas, where the provincial Planning Act is the only means to manage infrastructure development.

Some development restrictions have been imposed in special planning areas surrounding several larger municipalities, but the intent — to provide green space on the fringes of the urban areas — has been lost as developers take their subdivisions and infrastructure projects to lands just outside of the municipal buffer zones, where lower taxes create an incentive for strip development. New construction that occurs outside of municipal boundaries may meet the letter of the law, but the net effect is creeping ‘exurbia’, a term coined to describe the increasing trend of settlement on large plots beyond the suburbs. Even the provincial government itself stands accused of erecting public infrastructure on unincorporated land rather than within municipalities.

 


Legislative Protections for PEI Landscape

Scenic Viewscape Zones and Special Planning Areas

The concept of coastal touring routes has recently been developed, supported by local groups and two levels of government. Regional tourism associations have identified scenic lookouts, and the provincial Department of Transportation and Public Works has created pull-off areas along various roads so that people can admire these views. It is likely that other “lookout” spots will be identified in future. But there seems to be a step missing, or perhaps further coordination required, in this tourism product development process. The problem is that, as of March 2008, there was no indication that these lookouts and the associated viewscapes would be afforded any special protection (email communication, PEI Dept of Tourism).

PEI legislation does allow for the designation of Special Planning Areas, and areas of scenic beauty are recognized under the PEI Planning Act as Scenic Viewscape Zones. Once a scenic viewscape has been formally identified under the Planning Act, the Act can be used to prevent electrical poles and built structures from marring the view from the road (Government of PEI, n.d.3).

As yet, it appears that only two Scenic Viewscape Zones have been identified: Campbell’s Pond/New London and Borden-Carleton. It is not clear what criteria were used to designate these viewscapes; or what mechanism might be employed to apply this categorization in a broader way (for example through designation of Special Planning Areas); or how selection and protection of additional scenic viewscapes might proceed. An objective, acceptable and agreed-upon methodology is needed to enable identification of a greater number of scenic viewscapes for protection under the Act. Such a tool could also greatly assist in determining what is or is not appropriate development.

The Lands Protection Act

In the mid-1760s, while under England’s control, Prince Edward Island was divided into sixty-seven 20,000-acre lots or townships. These lots were allocated in advance of settlement to an elite group of absentee proprietors. Eventually, the land was purchased from the British landlords and returned to local ownership, but the issue of land control proved to be a sore point in the Island psyche.

In the 1960s, non-residents began buying up Island properties and as a result, substantial amounts of land in the coastal areas fell out of the control of local residents. The resultant surge in public concern spurred the government to action.

“In 1972, Government introduced amendments to the Real Property Act to restrict the purchase of land by non-residents. In 1981, an application by a non-resident corporation to acquire 6,000 acres prompted public concern over the perceived impacts of land ownership distribution and its implications for the future. Some viewed this as giving pre-eminent control of the province’s agricultural industry to one company. This led to the introduction of Bill 37, the Lands Protection Act (the Act) in 1982.” (IRAC, n.d.).

Under the Act, there is a limit to the total amount of land any individual or corporation may hold: a person may not own more than 1,000 acres, while a corporation is limited to 3,000 acres. The Act requires a non-resident individual or any corporation (local or off-Island) to apply to the Island Regulatory Appeals Commission (IRAC) for permission to purchase five acres or more, or to purchase shore frontage in excess of 165 feet. Government may approve a land purchase unconditionally, or impose certain restrictions. Exchanges of property within a municipality that has an official plan are not subject to the regulations of the Lands Protection Act.
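
The numeric thresholds above lend themselves to a compact illustration. The following is a minimal sketch, assuming only the figures quoted in this section; the function names and structure are hypothetical, do not represent any official IRAC tool, and omit the many further conditions and exemptions involved in real applications.

```python
# Illustrative sketch of the Lands Protection Act thresholds described above.
# All names are hypothetical; this is not an official IRAC implementation.
HOLDING_LIMITS_ACRES = {"person": 1_000, "corporation": 3_000}

def needs_irac_application(buyer: str, resident: bool,
                           acres: float, shore_frontage_ft: float) -> bool:
    """Non-resident individuals and all corporations (local or off-Island)
    must apply to IRAC to purchase five acres or more, or shore frontage
    in excess of 165 feet."""
    covered_buyer = (buyer == "corporation") or not resident
    return covered_buyer and (acres >= 5 or shore_frontage_ft > 165)

def within_holding_limit(buyer: str, total_acres_after_purchase: float) -> bool:
    """A person may hold at most 1,000 acres; a corporation at most 3,000."""
    return total_acres_after_purchase <= HOLDING_LIMITS_ACRES[buyer]

# Example: a non-resident individual buying 10 shore acres must apply.
assert needs_irac_application("person", resident=False,
                              acres=10, shore_frontage_ft=200)
```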

Conditions imposed under the Act may involve particular land management measures, or may state that the land may not be subdivided except for particular purposes: i.e. agricultural, forestry or fisheries production; conservation or parks use; or a residential use by the owner (IRAC, n.d.). Once a conditional purchase is approved, the land in question becomes “identified” for non-development under the land identification program. IRAC clearly states that “The object of the land identification program is to prevent development of land identified for non-development use” (IRAC, n.d.). However, the definition of “non-development” seems exceptionally liberal, allowing almost any type of agricultural or fishery use, as well as individual residential and recreational uses. In addition, it is not clear exactly how the Act as it is currently used actually protects land from non-resident purchase or from inappropriate development. IRAC appears to refuse only about 4-7% of applications from non-residents: in 2001, 4 of 104 applications were turned down; in 2002, 8 of 118 were refused; and in 2002-2004, 9 of 199 (IRAC, n.d.). Furthermore, IRAC sends its recommendations to Executive Council, where the ultimate decision is made by the Lieutenant Governor in Council. Refusals and conditions recommended by IRAC are often overturned at the Executive Council level.

The Island Regulatory Appeals Commission publishes a list of exemptions to the Act that have been granted. Most exemptions appear to allow for expansion of food processing or forestry activities, but some involve land holdings that are designated as natural areas under the Natural Areas Protection Act. In addition, it is important to note that under the current legislative framework, some lands formerly identified under the Lands Protection Act for non-development are now being de-identified. For example, a recent Order in Council dated October 24, 2006 reads:

Pursuant to subsection 9(2) of the Prince Edward Island Lands Protection Act R.S.P.E.I. 1988, Cap. L-5, Council amended the condition of non-development use made pursuant to section 2 of the Land Identification Regulations (EC606/95) in respect of approximately ten (10) acres of land, being Provincial Property No. 683367 located in Lot 21, Queens County, Prince Edward Island and currently owned by Kenneth Montgomery of Park Corner, Prince Edward Island. Council noted that this amendment will enable subdivision of the parcel of land into a maximum of eleven lots [bold added]. This Order-in-Council comes into force on 24 October, 2006. (IRAC, n.d.2)

 


Alterations such as the above would seem to be completely contrary to the stated objective of the Act.

Alternative Land Use Service (ALUS)

As recently as March 2008, a new program was announced by the PEI Department of Environment, Energy and Forestry. The Alternative Land Use Service:

...is an incentive-based approach for the conservation and protection of key environmental assets and services on PEI. The PEI ALUS Program seeks to reward landowners for public environmental benefits.... include clean water, improved flood control, fish and wildlife habitat conservation and carbon sequestration among others. The principle of ALUS is that it seeks to reward landowners when the public benefits are in excess of existing government regulations. (Government of PEI, 2008a)

This program, in keeping with other incentives for soil and water conservation, is not specifically designed to preserve landscape values. However, it may nevertheless help to restore attractive elements of the agricultural “working landscape” of PEI, such as riparian shrubs and trees.

 


Landscape Management Tools in Other Jurisdictions

Overview

Landscape policies of selected jurisdictions around the world were reviewed to gather information on ways and means that have been developed to manage scenic resources and address land issues from an aesthetic standpoint. Three islands were included in the review in an effort to identify island-specific concerns as they relate to protection of coastlines and conservation of scenery as a tourism resource. The examples documented included international, European, North American, New Zealand and Australian programs that involved at least two categories of process — legislative and community-based. These categories are not necessarily exclusive of one another. From the international level on down, approaches varied widely among continents and countries, with some regions leading the way in terms of both appreciation and protection of landscapes.

It was determined that Europe, and specifically the United Kingdom, may be among the most advanced practitioners of landscape management. This observation suggests that there is a different approach in European nations as compared to their former colonies. The more densely populated European states, with their long histories of human infrastructural development, are characterised by a keen appreciation for heritage that includes a relatively greater interest in landscape. By contrast, in the “New World”, wide-open space may be more taken for granted and a shorter span of development history may result in landscapes having less perceived heritage value. Perhaps as well, the concept of the rights of private landowners was strongly entrenched during the period of colonial occupation and development by European settlers who had been largely landless in their countries of origin.

International and European Landscape Management

The UNESCO World Heritage program and the International Union for the Conservation of Nature (IUCN) are two international agencies that have addressed concerns around landscape. These agencies perform research and identify valued landscapes for the purposes of protection. Of particular interest for PEI is that in 1992, UNESCO’s World Heritage Convention introduced a category called Cultural Landscapes (UNESCO, 2004). Also, the IUCN Category V Protected Landscape/ Seascape designation acknowledges the interaction of people and nature, and could potentially be applied to our working landscapes and seascapes. The IUCN definition for Category V reads as follows:

Area of land, with coast and sea as appropriate, where the interaction of people and nature over time has produced an area of distinct character with significant aesthetic, ecological and/or cultural value, and often with high biological diversity. Safeguarding the integrity of this traditional interaction is vital to the protection, maintenance and evolution of such an area. (IUCN, 1994)

 


Safeguarding biodiversity is a management objective under IUCN Category V designations, but is not a factor in the case of the Cultural Landscapes category of UNESCO, the criteria for which are as follows:

...to be an outstanding example of a traditional human settlement, land-use, or sea-use which is representative of a culture (or cultures), or human interaction with the environment especially when it has become vulnerable under the impact of irreversible change. (UNESCO, 2008)

There is also a European Landscape Convention (Natural England, n.d.3), a primarily policy- and research-oriented instrument involving a network of national research institutes. A leader in this sector in Europe is certainly the United Kingdom. The legendary beauty of the British countryside is well protected by programs supported by extensive community involvement.

United Kingdom

In the UK, a well-established and seemingly successful model features an umbrella organization now called Natural England, which brings together English Nature, the Countryside Agency and the Rural Development Service. This holistic approach combines considerations of economic development, the environment, and scenery as a resource. The UK institutional framework is well supported by active community movements (Campaign to Protect Rural England, Eat the View, Quality of Life Assessment, and Countryside Character Assessment). The concept of landscape character assessment seems to have caught on; as of 2007, 83% of English counties had completed assessments. England has been divided into more than 150 “Joint Character Areas”, each of which exhibits a distinctive landscape type. The description for the sub-region “Isle of Wight”, for example, begins like this:

“A small-scale island landscape with an often intimate feel and an overwhelming sense of discovery. There is a juxtaposition of varied and distinctive landforms, diverse land cover types and often sudden and dramatic views of the sea. The close relationship of the area to the sea is a vital ingredient of the island's cultural heritage from prehistoric times.” (Natural England, n.d.2)

Obviously, administrators in the UK have given a great deal of thought to the landscape, as there are more than 150 of these very detailed descriptions. As well, 36 Areas of Outstanding Natural Beauty, covering 15% of England’s land, have been designated for the purpose of conserving and enhancing their natural beauty (which includes landform and geology, plants and animals, landscape features, and the rich history of human settlement over the centuries) (Natural England, n.d.4).

In the face of continuing land use changes and development pressures, new laws were incorporated into the Countryside and Rights of Way Act to help protect Areas of Outstanding Natural Beauty. Much of the responsibility under the Act lies with local authorities, who must prepare management plans that detail how each area will be protected. As well, funding was put in place to support delivery of the program’s objectives.

 


A less official designation has been developed for the British coastline: 33% (1,057 km) of scenic English coastline is now conserved as Heritage Coasts. These special coastlines are managed so that their natural beauty is conserved and, where appropriate, the accessibility for visitors is improved (Natural England, n.d.1).

Rounding out the British land use planning toolkit is the Quality of Life Capital program which is often integrated into the Landscape Character Assessment process. Quality of Life Assessment is a tool for maximizing environmental, economic and social benefits as part of land-use and community planning.

A review of the extensive range of landscape evaluation processes in place in the UK did not uncover a comparably extensive set of regulatory systems. It appears that the landscape management successes in the UK are due to widespread public participation in landscape categorization, quality of life assessments and sustainable planning, rather than to a complex set of bylaws or regulations. The general trend seems to be away from the numerical rating of landscapes, and much more towards a community planning process. This process engages local residents in defining the special features present in their own region.

United States and Canada

Land policy researchers Bengston, Fletcher and Nelson (2003), who summarized the United States’ (US) public policies for managing urban growth and protecting open space, noted the lack of a comprehensive national land use policy. They proposed that land conservation measures may be categorized into three general groupings: acquisition of property, regulatory approaches, and incentives. Included in the category of property acquisition were parks and other lands acquired for the public good. Many land protection measures involve regulation to impose zoning, greenbelts, and development moratoria. Governments can also offer incentives, such as tax deferrals and tax credits for undeveloped land.

Government-initiated land protection efforts include acquisition of land for National Parks and National Heritage Areas. Perhaps most relevant to PEI is the inclusion of a separate category for valued shoreline and barrier islands. Assateague, Cape Cod and Fire Island are among this collection of National Seashores in the American National Parks system. The geographic area and the range of protection and services vary from place to place.

Several federal, state, and local tax policies provide incentives for taxpayers to donate land, to employ easements for conservation purposes, or to commit to keeping their forest land intact. These policies include income tax deductions and credits at the federal and state levels, estate tax exemptions and use-valuation at the federal level, and use-value property tax at the local level. As well, some jurisdictions allocate a portion of real estate transaction taxes to land preservation.

While there does not seem to be any kind of European-style, pan-continental landscape convention in the United States, there are several national groups concerned with landscape and scenery. The groups are volunteer/non-profit, or government-initiated, or a combination. An example of a joint citizen-government effort is the National Scenic Byways program, which is “a voluntary, grassroots program that recognizes and supports outstanding roads” (National Scenic Byways Program, 2007). The Department of Transportation’s Federal Highway Administration provides resources to help manage the designated Byways, for which local groups make application.

Land trusts are the traditional volunteer approach to landscape preservation in the US, but other non-profits have also been effective. Scenic America, for example:

...is the only national non-profit organization dedicated solely to preserving and enhancing the visual character of America's communities and countryside. We accomplish this mission through national advocacy efforts and technical assistance services, local and national projects, and the support of our state affiliates. (Scenic America, 2008)

The American Farmland Trust is another non-profit volunteer organization “founded in 1980 by a group of farmers and conservationists concerned about the rapid loss of the nation's farmland to development” (American Farmland Trust, 2007).

Another agent of landscape preservation, The Cultural Landscape Foundation, is:

...the only not-for-profit foundation in America dedicated to increasing the public’s awareness of the importance and irreplaceable legacy of cultural landscapes. Through education, technical assistance, and outreach, the Cultural Landscape Foundation broadens the support and understanding for cultural landscapes nationwide. (Cultural Landscape Foundation, 2007)

These are just a few of the social economy sector programs operating to preserve open spaces and promote sustainable development in the US.

When compared with the US situation, the list of non-profit organizations and government programs devoted to preservation of Canadian landscapes seems much shorter. The primary means of scenery (and environmental) protection has been the National Parks system. The Canadian Heritage Rivers program augments this by offering a degree of non-regulatory protection to about 30 river systems. Land trusts are operating in some regions, and provincial efforts such as Ontario’s greenbelt program also acknowledge the value of open space. The land trust movement does not appear to be as organized or as sophisticated as in the US.

Islands

Islands of three jurisdictional levels were reviewed: the island nation of New Zealand; the Australian island province of Tasmania; and Quadra Island within the province of British Columbia, Canada. In each case the islands take stock of their scenic resources, but management efforts vary.

In New Zealand, an island nation that relies heavily on the tourism industry, the 1991 Resource Management Act is the primary means of landscape protection. A user-friendly guide to the NZ RMA begins thus (and resonates with the PEI situation):


New Zealand is just beautiful, isn't it? And this beauty is part of our everyday lives. On Saturday mornings we might be lying in bed in a suburb and by lunchtime be swimming at the beach, tramping through the bush or snowboarding down a mountain on fresh snow. We have more gardens and parks than you can poke a stick at. People come from all over the world to marvel at what some of us tend to take for granted. But if we look a bit closer at “clean and green New Zealand” we see things aren't that way at all. For years we've been damaging our soil, air and water in ways that couldn't be restored. This approach just wasn't sustainable and meant our grandchildren could have missed out on the quality of life that we have enjoyed. The Resource Management Act (usually called the 'RMA') is the main piece of legislation that sets out how we should manage our environment. It's based on the idea of the sustainable management of our resources, and it encourages us (as communities and as individuals) to plan for the future of our environment. (Government of New Zealand, n.d.)

A coastal policy established in 1994 under the Resource Management Act is currently (2008) under review, because New Zealand has recognized the need for more careful consideration of management plans for coastal areas. A proposed New Zealand Coastal Policy Statement details objectives and policies to achieve the purpose of the Resource Management Act. The proposal addresses subdivision, land use, and development; natural character including biodiversity and landscapes; public access; water quality; coastal hazards; historic heritage; and the definition of restricted coastal activities (Government of New Zealand, n.d.). While the country has policies in place, it is often up to local levels of government to administer regulations. These regulations may be difficult to interpret; defining ‘scenic’ or ‘appropriate development’ is not at all simple. Also, authorities have found that applying land use restrictions in communities made up of independent-minded islanders is a challenge.

Tasmania, an island province of Australia, is also reviewing its coastal policy. As in New Zealand, this policy is a decade old and by Tasmanian law, all state policies require review on a regular basis to ensure that they remain relevant and up to date. The policy aims to conserve the coastal area and promote its sustainable use or development. The objectives include the conservation or enhancement of public amenity and scenic values.

The expected outcome in Tasmania is that the visual values of the landscape will be protected and enhanced. In the response document filed during the Tasmanian community consultation process, there appeared to be a certain amount of frustration with the high-level nature of the policy statements. These are not meant to be self-executing, and the implementation role rests in the hands of local planning boards. The respondents to the consultation identified, among other things, a lack of “visual amenity tools” (Government of Tasmania, 1996:14). In both Tasmania and New Zealand, responsibility for managing the landscape resource seems to be in the hands of local administrations, and in both cases, people appear to be searching for workable scenic amenity tools which would help them develop and enforce regulations to preserve scenic values.

The island of Quadra is located just off the coast of (and belongs to the jurisdiction of) the province of British Columbia, Canada. It has no powers to enact legislation or control landscape development, but depends on provincial regulations and regional planning boards for any scenic protection activity. The planning process often takes into account scenic quality, and where no official planning body is in charge, District Managers in the Ministry of Forests and Range can define visual quality objectives (VQOs) according to a Visual Landscape Design Training Manual. VQOs are based on both planning documents and community input, and may have arisen as a result of public pressure to resist impacts of extensive logging in many areas of British Columbia. A provincial inventory, which maps forestry and tourism capability in BC, has been developed to:

...provide the necessary information to identify the location of sensitive landscapes and known scenic areas, as well as provide management direction. Where landscape unit planning is initiated or under way, scenic areas may be identified and made known and VQOs may be established through this planning process. In the absence of higher-level plans, or for higher-level plans not specifically addressing the management of scenic areas and visual quality, district managers can use their own statutory authority to identify and make scenic areas known, and establish VQOs. (BC Ministry of Forests, 2001)

The BC policy regarding scenic resources defines visual quality objectives as:

...resource management objectives established by the district manager or contained in a higher level plan that reflects the desired level of visual quality based on the physical characteristics and social values for the area. There are five categories; preservation, retention, partial retention, modification, and maximum modification. (WCELA, n.d.)

In other words, gradations of protection (or lack thereof) are applied according to planning objectives, which may be imposed by civil servants such as district forestry managers.

Applying Global Lessons to Islands

Islands, having a relatively high ratio of coastline to land mass, contain some of the world’s most treasured scenery — that of coastal views. But although feelings for land run strong, development pressures are also powerful. The intensity of those pressures varies with circumstances, for example: native population density; proximity to heavily populated mainlands; degree of linkage to the mainland; perceived rarity or exoticism; and geographic location, i.e. whether the island is tropical, temperate or polar.

From the review of landscape management in various jurisdictions it can be seen that appreciation for the scenic value of the landscape is neither universal nor managed in any uniform manner. Some islands and island nations, such as Britain, boast comprehensive landscape management. The amenity of scenery is uppermost in the intent of environmental management plans and legislation. At the other extreme are islands where the value of landscape is rarely or not at all mentioned; or, scenery is valued but only as a secondary feature of ecological resources. Without doubt, all jurisdictions face challenges in managing landscape resources.

One consistent finding in the review is that an engaged local community is key to effecting shifts in land use policy. That a highly engaged local community can spell success in this sector is demonstrated in the UK’s Campaign to Protect Rural England. With a membership of more than 60,000, the Campaign has appreciable impact, such that public opinion is likely to come down in support of protection measures rather than against. Even in the US, the extensive range of volunteer efforts (Scenic America, American Farmland Trust, Cultural Landscape Foundation and so on) serves to bolster or even influence the regulatory process.

Where international landscape protection programs exist, such as the European Landscape Convention or the UNESCO and IUCN landscape categories, these serve mainly to provide guidance or validity to more local efforts. But, within those local jurisdictions, landscape management efforts at times appear to be disjointed. Even with clear-cut legislation and a full set of regulations, the administration of programs is difficult. Volunteers, environmental lobby groups, real estate developers and planning professionals all have opinions about the value of specific scenes, or conversely the value of specific development projects. However, certain trends may be perceived and valuable lessons learned from leaders in the field. In some jurisdictions landscape preservation is achieved by zoning; in other instances, to preserve landscape values, property owners are compensated financially for development rights; or, properties are purchased outright by the state.

As for assessment methods, no one system prevails for determining the ‘value of the view’; nor is it clear that any one system of landscape evaluation has been most successful. The coastal policy reviews underway in New Zealand and Tasmania reveal a common problem: there is a lack of clear direction when deciding on landscape protection actions. Terms such as ‘appropriate’ are too vague when considering specific development proposals. Even where reasonably hard calculations of scenic quotient were performed, the route to achieving the ‘perfect landscape’ was not obvious.

The IUCN Category V Protected Landscape/Seascape designation is one option that could prove very suitable to many small island situations where the land base is limited and the scenery is not untouched or pristine. As in Europe, where the bulk of the designated Category V protected landscapes now exist, small islands may exhibit relatively high population densities; a highly developed tourism industry; domesticated, working landscapes; and the absence of large expanses of wilderness. These conditions invite application of the Category V designation, though IUCN planners caution that the category is not meant to be a catch-all. Areas that have exceptional scenic qualities; where there is a strong connection between culture and nature; that demonstrate sustainable use of resources; and have maintained their integrity and traditional industries, are good candidates for Protected Landscape designation (Phillips, 2002). Currently, there are several islands that contain Category V designated landscapes, including the Vinales Valley in Cuba; the vineyards of Pico, Portugal; Öland Island in Sweden; the Cordilleras in the Philippines; and the first designated Cultural Landscape, Tongariro National Park on the North Island of New Zealand. It is important to note that in most of the above cases, the protected landscapes comprise a working agricultural scene. As well, it is significant that in 1994 the IUCN definition explicitly included “All areas of land and/or sea ...” (ibid: 2.2.3), opening the door to greater attention for island systems, marine environments and other coastal zones.

The suggested management guidelines for the special areas designated as Category V (quoted below) deserve consideration by jurisdictions such as Prince Edward Island where it is the working landscape, rather than pristine scenes, that forms the basis of the scenic resource. The IUCN guidelines for management of Category V landscapes (Phillips, 2002) read as follows:

 


* to maintain the harmonious interaction of nature and culture through the protection of landscape and/or seascape and the continuation of traditional land uses, building practices and social and cultural manifestations;

* to support lifestyles and economic activities which are in harmony with nature and the preservation of the social and cultural fabric of the communities concerned;

* to maintain the diversity of landscape and habitat, and of associated species and ecosystems;

* to eliminate where necessary, and thereafter prevent, land uses and activities which are inappropriate in scale and/or character;

* to provide opportunities for public enjoyment through recreation and tourism appropriate in type and scale to the essential qualities of the areas;

* to encourage scientific and educational activities which will contribute to the long term well-being of resident populations and to the development of public support for the environmental protection of such areas; and

* to bring benefits to, and contribute to the welfare of, the local community through the provision of natural products (such as forest and fisheries products) and services (such as clean water or income derived from sustainable forms of tourism).

A great deal of expertise lies within the IUCN and it is an excellent source for guidance in this field, but ultimately the responsibility for the land lies within local jurisdictions. In the case of PEI, some local level of designation and concrete management provisions would need to be in place before appealing to the IUCN or any international body for recognition.

Without further research it is impossible to say whether island administrations are any more or less concerned with landscape than those of mainland jurisdictions. The notion of landscape character may be more clearly articulated within island communities, where a sense of place often thrives. It is notable that coastal policies have often been developed on islands to protect marine ecology, beach access and, of course, the view. It is not only obvious, but has been confirmed by research, that water views are generally considered superior to land-based scenes (Government of Scotland, 2006). This places a great responsibility in the hands of island administrations to properly take into account the special value placed on seascapes.

The politics of defining the most valuable viewscapes on islands can be fraught with difficulties, especially where a variety of jurisdictional levels confront the issues. While it may be admirable that island administrations hand off the landscape planning function to local communities, they must not do so without providing adequate direction or clear policy. In small island societies where the government fails to provide legitimate, clearly articulated leadership, land use debates can turn into a popularity contest wherein highly engaged individuals on both sides of the issue gain public prominence and tussle it out. This usurps the political space and often leads to a stalemate and consequent inaction. Meanwhile, the march of development continues.

 


Moving Towards Landscape Management on PEI

Filling Gaps in the Management Framework

In comparison to all the other jurisdictions under review, Prince Edward Island seems to have one of the least effective means, and has taken the least action, to protect its valuable landscape resource. The review of the Island’s legislation, planning regulations and the application of the regulations revealed that scenic viewscapes as such receive no special attention. With the exception of two small special planning areas, development permits are awarded or refused on the basis of issues other than how they would affect the aesthetic resources of PEI.

As will be seen in the research results described below, undeveloped coastal scenes of PEI are valued highly by both residents and tourists. However, current legislation does not reflect that preference. From studying related legislation and land-use reports produced over the past 20 years, as well as through interviews with bureaucrats active in the planning sector, it is apparent that aesthetic value has not been a conscious consideration in most decision-making for developments approved to date. Some progress has been made in maintaining safety on the highways by regulating the placement of private lanes; recent policies encourage protection of water tables and riparian zones; but in general there is no ‘standard of beauty’ that developers must meet in Prince Edward Island. Yet, this is the province that makes a good deal of its living from its scenic attributes. At the very least, the adoption of a specific coastal policy to treat the sensitive issues around galloping shoreline development would seem to be a wise course of action for PEI administrators.

The IUCN Category V Protected Landscape/Seascape designation, described earlier, is certainly intriguing in the context of Prince Edward Island. The UNESCO Cultural Landscapes category is also of interest and, in this case, does not require consideration of biodiversity. In either instance, landscape preservation must first be addressed at a local level. Policies and land use plans must be in place to allow for the proper management of any internationally designated cultural landscapes or seascapes.

The Public’s Role in Protecting Prince Edward Island Landscapes

In the jurisdictions studied, it was apparent that a healthy grassroots movement was often the instigation for greater governmental involvement in land use issues. Tourists and short-term visitors may have an active interest in the scenery of a destination, and travel decisions and activities are very often based on the physical attractiveness of the place. However, visitors are not generally in a position to influence local policy making. As pointed out earlier, visitors may actually be the landscape offenders, whether indirectly by demanding services and attractions that have not been part of the traditional viewscape, or more directly by choosing to purchase and develop the landscape that attracted them to the destination in the first place.

 


Where other jurisdictions have successfully applied controls, purchased development rights or made outright land purchases, they have had the support of their residents, and local people have often been vocal about what they value in their surroundings. In addition, local benefits have often been clearly articulated. For example, where scenic roadways have been developed, the net result — economic benefit for the region — has been explicit. Whether the approach is top-down, bottom-up or some combination of efforts, some modicum of public interest must exist, or some potential public benefit must have been identified, for viable changes to land-use policy to ensue.

At first glance, land issues would seem to have preoccupied the PEI public over the past 30 years, beginning with a 1973 Royal Commission on Land Use and Land Ownership, followed in 1990 by a Royal Commission on the Land, and in 1996-97 by a government-appointed Round Table on the Land. The Round Table held 36 meetings and heard from 70 local groups and individuals. However, all this activity led to little real action. Public discourse concerning landscape on PEI has not resulted in any strong, coherent movement comparable to that in Britain. To a certain degree, this may be explained by historical factors. During the Island’s colonial period, a backlash against absentee landownership engendered a sense of independence that is often exhibited in the form of fiercely protected private property rights. As a result, local and provincial governments have been reluctant to try to impose limitations on landowners.

For example, most forested land on PEI is privately owned, and in the late 1990s an effort to regulate private woodlot management in PEI met with resistance and limited success. A series of recommendations by the Forest Partnership Council covered such aspects as “intent to harvest” forms, minimum age standards for harvest of softwood, and pre-harvest assessment by Forestry staff. These recommendations were rejected at meetings organized across the province by a small but very vocal group of woodlot owners (MacDonald, 2001).

Regulations controlling agricultural practices have also evolved very slowly despite clear evidence of fish kills, eutrophication and pollution of groundwater linked to industrial farming inputs. In general, individual farm owners do not appreciate the imposition of regulations or directives that would control the use of the farmland or the operation of the business. The 1993 Final Report of the Special Legislative Committee on the Lands Protection Act implies that some progress may have been made since the 1973 Royal Commission on Land Use and Land Ownership:

Over the years, meanwhile, a more favourable climate for land use restrictions has begun to emerge. There is a growing awareness that an individual’s use of his or her land can have social, economic, or environmental impacts that go far beyond the boundaries of the land. (Government of PEI, 1993)

However, the 1990 Royal Commission on the Land recommended province-wide land use planning, and zoning of farmlands. As of 2009, no such action has yet been taken outside of municipal boundaries. Some would suggest that this is the result of local resistance by strong-minded individuals.

The close-knit connections often found in island communities also mean that residents feel it perfectly within their rights to expect to circumvent certain regulations. What is more, they just may find the means to do so, making use of family ties or political favours. At the same time, in a close-knit community, strong dissenting opinions that suggest a change in the status quo are often poorly received.

A case in point is that of the well-known land activist of the 1980s, Marc Gallant, who was both revered and reviled for his outspoken opinions on the visual future of the province. He is quoted in a 2004 CBC Compass story (from 1996 archive footage of an interview on a PEI shoreline):

If we don’t have the courage, if we don’t have the determination to protect this province, however difficult that might be, we’re going to end up losing it. Twenty years from now there’ll be none of this left; Islanders will have no access to it. There’ll be ‘No Trespassing’ signs and we’ll have no access to our own beaches. (RUK, 2006)

A YouTube video recently made at the same site confirms that Gallant’s fears were well justified: Montgomery Land Trust member Peter Rukavina has videotaped a series of shoreline “monster homes” built in the central north shore area that the Land Trust is attempting to preserve.

Marc Gallant’s actions did lead to the protection of specific landscapes as well as the preservation of other aspects of Island heritage such as the wooden bait sheds on PEI wharves. His legacy lives on in the Montgomery Land Trust, but no individual activists have since surfaced to take his place at the head of a wider landscape protection movement.

A PEI volunteer group, the L.M. Montgomery Land Trust, was founded in 1994 to work to preserve the scenic agricultural coastal lands on Prince Edward Island's north shore. The group noted the downturn in the Island’s agricultural economy and the growing pressure to sell farmland for development. They identified for protection the area between French River and Sea View, which has special significance for the heritage of Anne of Green Gables author L.M. Montgomery. The Trust works with landowners to identify alternatives to selling land for development, using a variety of mechanisms to preserve the region’s precious viewscapes. The purchase of development rights for some properties has led to the successful preservation of a few parcels that were at risk, and the Montgomery Land Trust has just announced (March 2008) a new project known as the L.M. Montgomery Seashore. The project identifies as a priority for conservation five kilometres of shoreline and 622 acres of land made up of 11 privately owned parcels. The Trust hopes to work with landowners and various levels of government to preserve the area as a heritage landscape (L.M. Montgomery Land Trust, n.d.).

The Institute of Island Studies has also been active in community engagement on the topic of land use, organizing numerous forums and publishing research papers and reports (e.g. Land Policy for a Small Island; Focus on the Land, Preparing for the Royal Commission; Forests and Forestry in Prince Edward Island; Agriculture on PEI ― Papers from a Symposium).

In 2005, the Iris Group organized a public forum on the land called “The Narrowing Landscape”, which attracted excellent speakers and a large group of concerned citizens, and served to point out that none of the underlying issues have changed in the past three or four decades.

 


Beyond public agitation and numerous commissions and reports, other agencies have taken their own routes to protect some portions of the province’s land resource. The Island Nature Trust, in particular, has had considerable success in purchasing environmentally significant property across the Island, but their impact on the visual resource of scenery is indirect. The PEI National Park and provincial parks also play a role in land protection but once again, the motivation is not generally the scenic resource.

Assuming, based on the above evidence, that a certain level of public interest in and commitment to preservation of the landscape does exist on PEI, the government could reasonably instigate a process for public involvement in landscape management. Although experts such as staff geographers or land use consulting companies may take on the role of assessing regional landscape resources, it is rare that they could do so in an institutional vacuum. Mechanisms for coherent community participation are essential for effective land use planning and management. A first step could be to gauge the level of interest and commitment within the resident community. Once it has been established that public interest is engaged, a variety of potential tools and methods for inclusion of the populace in policy development are available.

Tools for Evaluating Scenic Viewscapes

Several key factors must combine before a community can successfully take stock of and protect its scenic resource. Most importantly, residents and policy makers must acknowledge the value of the landscape and commit to its protection. This goes beyond day-to-day pride of place; if a location is deemed to be beautiful, then proposed changes must be carefully evaluated and consideration given to the net effect on overall scenic worth.

If it is agreed that efforts must be made to preserve viewscapes, then research is needed to determine which views have the greatest value. Research is also required to determine the point at which alterations to the landscape detract from its original appeal. When island jurisdictions attempt to measure and place value on aesthetic surroundings, how should they best deal with the related challenges of protecting those landscapes, and especially coastlines? As noted, IUCN offers a category of landscape protection that could provide a solution for islands that exhibit a high scenic value but whose landscapes are active and lived-in. Within such a framework, should the measure of scenic value be based on a majority opinion or consensus, or is there some objective standard that can be applied? Do valuable viewscapes comprise more than an aesthetic quotient? Should cultural, personal and historic reasons also trigger protection?

If the local population grasps the concept of scenic value, and legislative or other management processes are available to manage landscape and scenic viewscapes, some simple evaluation tools do exist that can be applied to support the landscape protection effort. Whereas in the past it may have been difficult to determine landscape preferences or assess the effects of proposed developments, current technologies permit swift and efficient feedback using extremely effective visualization techniques.

The local landscape preferences in Prince Edward Island — a Canadian island province generally renowned for its scenery — were assessed in 2006-07 using an on-line viewscape preference survey with both on- and off-Island respondents. The survey was augmented by a qualitative focus group session that discussed attitudes to the local landscape (Horne, 2007). In spite of the many possible individual responses to landscapes and scenery, it was hypothesized that general preferences would emerge for some landscapes over others.

The on-line survey proved the efficacy of the Internet as a tool for this type of study. A set of 31 PEI landscape snapshots, such as might be seen from a car window, was posted online. Both residents and non-residents were invited to score the scenes in terms of scenic beauty. The scoring offered nine gradations from “not at all” (scenic) to “very.” Subscribers to the Tourism PEI e-mail newsletter (most of whom live off-Island), as well as subscribers to the Tourism Industry Association of PEI’s E-news, received the invitation to participate in the survey. In total, almost 1,200 responses to the survey were collected; by far the largest share of the respondents lived off-Island (1,148 vs. 46). The only out-of-pocket expense was the survey software subscription. Completion rates for surveys of local and off-Island residents were 82% and 91%, respectively; that is, of those who opened and began the survey, 82% and 91% answered all questions. While a survey can remain posted indefinitely, in this case most responses to the e-mailed invitation to participate in the survey were received within the first week; thus, results can be obtained very quickly. The trial proved that internet surveys and digital photo alterations can allow communities to participate directly in a landscape evaluation and planning process.
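
For planners wishing to replicate this kind of analysis, the arithmetic is simple enough to sketch. The following is a minimal illustration assuming a hypothetical CSV of responses; the file name and column names are invented for the example and do not correspond to the actual survey software used in the study.

```python
# Minimal sketch of the survey arithmetic described above. The CSV file name
# and column names are hypothetical, invented for illustration only.
import csv
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev

ratings = defaultdict(list)               # (scene_id, group) -> list of ratings
with open("viewscape_survey.csv", newline="") as f:
    for row in csv.DictReader(f):         # columns: scene_id, group, rating
        ratings[(int(row["scene_id"]), row["group"])].append(int(row["rating"]))

# Mean score and standard error of the mean for each scene and respondent
# group, mirroring the per-scene agreement statistics reported in the study.
for (scene, group), vals in sorted(ratings.items()):
    sem = stdev(vals) / sqrt(len(vals)) if len(vals) > 1 else float("nan")
    print(f"scene {scene:2d} ({group}): mean={mean(vals):.2f}, sem={sem:.3f}")
```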

The results revealed that the notion that “beauty is in the eye of the beholder” is not particularly relevant here. The respondents demonstrated clear preferences for some types of landscape scenes and agreed on the relative scenic value of various viewscapes. For both residents and non-residents, the standard errors around the mean landscape value scores for individual scenes were very small (ranging between 0.017 and 0.165), suggesting a high level of agreement among participants. One category of landscape, that of undeveloped coastal scenes, clearly received the highest approval rating. In a graph displaying the descending mean scores for each of the 31 scenes used in the survey, no undeveloped coastal scenes rated in the bottom third of the mean ratings. Unobstructed views to the sea had the greatest appeal. While ratings for some individual scenes did differ somewhat between the two groups of respondents, the descending mean-score curves for the two groups, when graphed, followed the same slope.
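
The claim that the two groups’ rating curves share the same slope can be checked with a simple rank comparison. Below is a hedged sketch using a hand-computed Spearman rank correlation; the scene identifiers and mean scores are toy values, not the study’s data.

```python
# Sketch: compare the rank order of scenes between resident and non-resident
# respondents. Input dicts map scene_id -> mean rating (toy values only).
def rank_order(means):
    ordered = sorted(means, key=means.get, reverse=True)
    return {scene: rank for rank, scene in enumerate(ordered, start=1)}

def spearman(ranks_a, ranks_b):
    # Spearman's rho for rankings without ties: 1 - 6*sum(d^2) / (n*(n^2-1))
    n = len(ranks_a)
    d2 = sum((ranks_a[s] - ranks_b[s]) ** 2 for s in ranks_a)
    return 1 - 6 * d2 / (n * (n * n - 1))

on_island = {1: 8.1, 2: 6.4, 3: 7.9}     # toy mean scores for three scenes
off_island = {1: 8.4, 2: 6.0, 3: 7.7}
print(spearman(rank_order(on_island), rank_order(off_island)))  # 1.0: same order
```

A rank correlation near 1.0 would indicate that the two groups order the scenes the same way even where their absolute mean scores differ, which is the pattern reported above.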

The survey results supported the findings of other research in the area of landscape preferences. Firstly, clear indications of preference can be determined; overall, people do agree on which landscapes rate more highly than others in terms of scenic beauty. Secondly, place of residence is not necessarily a factor in the overall ratings; the landscape scores of Prince Edward Island residents followed the same slope as the ratings made by non-residents, although the mean scores did differ for some scenes. Thirdly, when scenes were categorized into four general types of landscape, one category — undeveloped coastal scenes — had more appeal than the others.

The Internet-based survey methodology tested in Prince Edward Island effectively and efficiently collected data on preferences for one scene over another. In small islands such as PEI, Internet landscape surveys could become a standard tool of planners because of the easy adaptability of the survey tool; the quick and inexpensive administration of a web-based survey; and efficient collection of significant amounts of data. In larger jurisdictions, it might be necessary to develop a method whereby the survey results could be applied to broader landscape groupings, or adapted to geographic information mapping systems used in land-use planning.

 


An on-line survey allows for anonymity, which is an important factor in small, close-knit communities. However, it is also true that Internet access, especially high-speed, may be less readily available in rural areas. The use of the Internet does, however, allow for efficient inclusion of opinions of non-residents, if that is the wish of the planners.

Beyond the Aesthetic Value of Landscape

While the survey data indicated that one category of landscape had more appeal than others, focus group discussions in the community of Tyne Valley revealed that landscape elicits strong feelings that go beyond strictly aesthetic ratings.

The intent of the focus group was to obtain more in-depth, qualitative information from local residents about their attitudes to the landscape around them. The discussion guide centred on their feelings about special places in the neighbourhood as well as their responses to changes in the landscape. The focus group participants scored paired sets of photos, one of each pair having been altered to include development. This allowed the researcher to draw some conclusions about attitudes to construction, land subdivision, and other land development activities.

In the focus group setting, when asked to rate the scenes included in the online survey, participants showed general agreement on the relative beauty of the 31 scenes. The group also revealed clear preferences for undeveloped versus developed scenes. Moreover, the 11 members of the group felt strongly about the landscape in their community and around the Island. Their discussion revealed clear emotional associations with landscape that must in some way be taken into account in land use planning. The vocabulary and tone of voice recorded in Tyne Valley communicated a strong connection with the land: “gorgeous” and “beautiful” were descriptors employed. One respondent went so far as to say that both her arrivals to and departures from the Island caused her to “howl and bawl.” Moments that inspired the most emotion often seemed to be related to a return home, arriving by either bridge or ferry. When asked where they would take visitors, almost all the responses involved shore locations or water scenes. Nostalgia coloured the comments and responses, which led to some consideration of the necessity of change. Examples were provided of places that could be at risk, such as the Experimental Farm in Charlottetown, which has been subject to frequent threats of development, or the lupines in Indian River, which could be ploughed under. There was much appreciation for an old farm site in the neighbourhood, and some wondered when the barns would finally fall down. The group agreed, though, that this would be a natural process as opposed to alterations imposed on the landscape. There was great pride among the members of the group regarding the care taken by Island residents when it comes to their properties and their surroundings. “People take pride in their homes; no matter what economic strata you are from people try to take care of their homes... they maintain as best as they can, the lawns are so nice, well kept, the cemeteries, churches too....” (Horne, 2007, from transcript of Tyne Valley focus group).

Possible Next Steps

Adequate evidence exists as to the high degree of attractiveness of the province’s scenic resource. It is apparent that continuing development is changing that scenic resource irretrievably; some might say the situation is urgent. One possible approach is to identify the necessary steps involved in placing parts or all of PEI on a list for potential designation by the UN or the IUCN. Adrian Phillips has suggested that addressing lived-in landscapes as IUCN Category V protected areas may be “...an approach whose time has come” (Phillips, 2002: 13). He sees a possible policy shift from the current focus on saving natural areas to paying greater attention to working landscapes.

Another possible approach to acknowledging the net value of the landscape and the farms that define the PEI scene is to consider appraising the farm community as “natural capital,” whereby farmers would be paid for conserving valued landscapes as well as ecological goods and services. Pilot projects are underway in some parts of Canada to assess the practicalities of this system; Canada’s island province would seem an ideal laboratory for such a project. This approach has been used successfully by the European Union.

While less than ideal as a solution, some consideration should be given to the very simple technique of mitigation. Vegetative cover has been proven to lessen the detrimental effects of inappropriate development. Islanders are already keen gardeners and the soil and climate lend themselves to propagation of all sorts of bushes and trees. This approach could be used after the fact, and could also be included as a requirement in new development proposals.

In no way are any of the above ideas to be interpreted as “silver bullet” solutions to what is acknowledged as a complex issue, one that involves far more than the forces of pro- and anti-development. This document seeks to present a preliminary but textured view of landscape resource issues on Prince Edward Island, pointing out possible problems, suggesting tools to measure the resource, and reminding readers of the notion of “place” which, it is proposed, is highly developed in Prince Edward Island. How closely this sense of place is truly related to the physical aspect of the land, and how well it can survive in the face of “creeping sameness”, are topics for future research. Such research would address Recommendation #76 in the 1999 Standing Committee’s report, which called for an assessment of the impacts of changes to the landscape. It would also be of interest to test proposed Island development projects with three-dimensional visioning technology such as that available in the Collaborative for Advanced Landscape Planning laboratory at the University of British Columbia. In a more academic, Island Studies sphere, it could also be useful to measure the relative importance placed on landscape by islanders as compared with mainland residents.

Some very practical actions could be taken on a local front to address the issue of the changing Prince Edward Island landscape. These include the following:

a) Government and industry should consider the brand value of the Prince Edward Island scenic resource from a tourism marketing standpoint.

b) Government agencies could standardize the measurement tools tested in this research, to provide planners with reliable systems for community input during planning processes. Internet surveys and other graphic tools that can be used to determine landscape preferences would not only serve to assist planners, but could be used to draw local attention to scenery as a valuable resource.


c) To expand this research in a Prince Edward Island setting, a province-wide open-ended survey (also by Internet) could be used to invite individual residents, and visitors if desired, to nominate outstanding scenes.

d) At the same time, the province could undertake a “smart growth” pilot project. A regional planning exercise could extend the scope of the current land-use planning approach by including the quality of the landscape. This should be supplemented by research into landscape as a resource for community well-being, and by an assessment of the impacts of the loss of scenic Island landscapes.

e) PEI could investigate the possibilities for IUCN Category V Protected Landscape/Seascape designation.

f) Research into the National Seashore model currently in place in some US states is recommended, especially in relation to the L. M. Montgomery Land Trust region on PEI’s central north shore.

g) It would be advisable for the province, through a public process, to develop a comprehensive coastal policy that would identify the most valuable viewscapes in addition to addressing related, pressing issues such as: beach access for shellfishers, recreational fishers and beachgoers; increases in cottage lot subdivisions; potential for salt water intrusion where coastal groundwater is heavily exploited; and climate change preparedness in light of escalating rates of coastal erosion.

h) A review of a wider range of jurisdictions might identify other potential solutions applicable to the Prince Edward Island situation.

Conclusions

Yearning for an ideal and humane habitat is perhaps universal. Such a habitat must be able to support a livelihood and yet cater to our moral and aesthetic nature. (Meinig, 1979: 101)

Without doubt, the places we live in affect us in very many ways, and in ways extending well beyond basic physical needs. Comments at the Tyne Valley focus group session revealed that the connection with landscape is deep and personal, and a sense of place thrives where this connection exists. Can this connection between physical surroundings and personal identity survive if land use policies ignore the more metaphysical aspects of community living — the elements of living that support our “moral and aesthetic nature”? The musing may seem rhetorical, but the question is key as one considers how to value the view, or find useful measurement tools to assess the scenic resource.

The review of global best practices and current approaches included in this paper has afforded us a better understanding of how the landscape resource is managed in Prince Edward Island and in other jurisdictions, including other islands. It is disappointing that, in terms of managing the scenic resource, Prince Edward Island appears to lag behind many other island jurisdictions that depend on scenery for their economic future. Other jurisdictions provided useful examples and success stories that demonstrate the benefit of approaching the coastline as a valuable resource in and of itself.

Both a quantitative on-line survey and a qualitative focus group method proved practical and efficient to use in measuring landscape preferences. The survey methodology successfully collected a significant amount of data concerning the landscape preferences of Prince Edward Island residents and visitors, and the Internet survey technique proved simple and efficient to repeat. The data confirm, in a defensible way, that there exists a strong preference for undeveloped PEI coastline scenes. Use of these tools would certainly enhance the ability of policy makers to make decisions on the relative values of viewscapes. Landscape surveys could be used to further refine preference data either by landscape type (agricultural vs. urban, coastline vs. inland, and so on) or by specific scene. They could also be used to decide on special designations for particular viewscapes, or as tools for assessing public opinion on the impacts of individual development proposals.

The value of landscape goes well beyond the aesthetic. Economic, cultural and ecological worth must also be taken into consideration. When landscape changes, as it must, the changes should be evaluated in more holistic terms. Measurements must reflect not only environmental and economic values, but also the many values inherent in the view, including its effect on the culture and identity of the community.

There is no doubt that where we live defines us, much as we define our place. The importance of this relationship with place should not be underestimated. In the case of islands, and in the case of Prince Edward Island, the concept of place might well be considered the primary defining factor in community identity. If, through lack of adequate land use and development planning and control, home is allowed to become ‘placeless’, where does that leave an island people?

 


References

American Farmland Trust (2007). About Us. http://www.farmland.org/about/default.asp

Bengston, D., Fletcher, J., & Nelson, K. (2003). “Public policies for managing urban growth and protecting open spaces: policy instruments and lessons learned in the United States”, Landscape and Urban Planning, 69(2-3).

Boylan, D. (1990). Everything before us: Volume 1. Royal Commission on the Land. Charlottetown, PE: Queen’s Printer.

Cultural Landscape Foundation (2007). http://www.tclf.org/organization.htm

Glen, B. (2005). Presentation to the Narrowing Landscape Symposium. http://www.landtrust.ca/system/files/NarrowingProgram.pdf

Government of British Columbia: Ministry of Forests. (2001). Visual Impact Assessment Guide – Appendix 1. http://www.for.gov.bc.ca/tasb/legsregs/fpc/fpcguide/visual/via10011.htm

Government of New Zealand: Ministry for the Environment (n.d.). The Resource Management Act. http://www.mfe.govt.nz/publications/rma/everyday/overview-jun06/html/page2.html

Government of New Zealand: Department of Conservation (2006). Review of the New Zealand Coastal Policy Statement. http://www.doc.govt.nz/conservation/marine-and-coastal/coastal-management/nz-coastal-policy-statement/review-of-the-nz-coastal-policy-statement/

Government of Prince Edward Island (n.d.1). Municipalities. http://www.gov.pe.ca/municipalities

Government of Prince Edward Island: Department of Environment, Energy and Forestry (n.d.2). Public Forests, a Valued Island Resource. http://www.gov.pe.ca/envengfor/index.php3?number=71976

Government of Prince Edward Island (n.d.3). Prince Edward Island Statutes: Planning Act and Regulations. http://www.gov.pe.ca/law/statutes/pdf/p08.pdf

Government of Prince Edward Island (1993). Special Legislative Committee on the Lands Protection Act. Charlottetown: Queen’s Printer.

Government of Prince Edward Island (1999a). Report of the Standing Committee on Agriculture, Forestry, and Environment. Charlottetown: Queen’s Printer.

Government of Prince Edward Island (1999b). Tourism PEI Economic Impact Study. http://www.gov.pe.ca/photos/original/EcoImpactTour99.pdf

 


Government of Prince Edward Island (2002). PEI Department of Community and Cultural Affairs Annual Report 2001-2002. http://www.gov.pe.ca/photos/original/cca_annual01-02.pdf

Government of Prince Edward Island: Department of Community and Cultural Affairs. (2005a). Municipal Legislative Reform. http://www.gov.pe.ca/photos/original/ccamunreview.pdf

Government of Prince Edward Island (2005b). Tourism PEI Conversion Report. Charlottetown: Queen’s Printer.

Government of Prince Edward Island: Department of Environment, Energy and Forestry (2008a). March 3, 2008 News Release. New Alternate Land Use Services (ALUS) Coordinator to Promote Environmentally Sound Land Use Practices. http://www.gov.pe.ca/news/getrelease.php3?number=5598

Government of Prince Edward Island (2008b). Factsheet: Agriculture. http://www.peigov.ca/af/agweb/index.php3?number=71208&lang=E

Government of Scotland: Scottish Executive Research (2006). Valuing the water environment: A review of international literature. http://www.scotland.gov.uk/Publications/2006/11/17092457/6

Government of Tasmania: Department of Environment, Parks, Heritage and the Arts (1996). Review of the State Coastal Policy – Response Paper. http://www.environment.tas.gov.au/file.aspx?id=1891

Horne, C. (2007). Measuring Landscape Preferences: The identification and evaluation of Island viewscapes. Master’s thesis, Island Studies. Charlottetown: University of Prince Edward Island.

IRAC (Island Regulatory and Appeals Commission) (n.d.1). FAQ. http://www.irac.pe.ca/document.asp?file=faq/documents/lpafaq.asp

IRAC (n.d.2). LPA Alterations Databank. http://www.irac.pe.ca/land/data/alterationsearch_rss.asp?pid=683367

IRAC (2000). Report on Trends in Non-resident Land Ownership 1994-2000. http://www.irac.pe.ca/generalpublications/documents/NonResidentOwnershipTrends1994-2000.pdf

IRAC (2005). Report on Non-resident Land Ownership 2003-2004. http://www.irac.pe.ca/content/searchable/generalpublications/documents/IRAC-Report-Non-Resident-Land-Ownership-2003-2004.pdf

IUCN (1994). Protected Areas. http://www.unep-wcmc.org/protected_areas/categories/index.html

L.M. Montgomery Land Trust (n.d.). www.landtrust.ca

MacDonald, W. (2001). Forests and Forestry in Prince Edward Island; Backgrounder to a Public Forum. Institute of Island Studies, UPEI.

 


MacKinnon, W. (2005). Between Two Cultures – The Alex Campbell Years. Stratford, PE: Tea Hill Press.

Meinig, D.W. (Ed.) (1979). “Reading the Landscape”, in The Interpretation of Ordinary Landscapes – Geographical Essays, pp. 195-244. New York, NY: Oxford University Press.

Montgomery, L.M. (1917). The Alpine Path: The story of my career. Don Mills, ON: Fitzhenry & Whiteside.

Nassauer, J.I. (1997). Placing Nature: culture & landscape ecology. Covelo, CA: Island Press.

National Scenic Byways Program (2007). http://www.byways.org/learn/program.html

Natural England (n.d.1). Heritage Coasts. http://www.naturalengland.org.uk/ourwork/conservation/designatedareas/heritagecoasts/default.aspx

Natural England (n.d.2). Landscape Character. http://www.naturalengland.org.uk/ourwork/landscape/englands/character/assessment/

Natural England (n.d.3). European Landscape Convention. http://www.naturalengland.org.uk/ourwork/landscape/protection/europeanconvention/default.aspx

Natural England (n.d.4). Areas of Outstanding Natural Beauty. http://www.countryside.gov.uk/LAR/Landscape/DL/aonbs/index.asp and http://www.countryside.gov.uk/LAR/Landscape/DL/aonbs/aonbs2.asp

Phillips, A. (2002). Management guidelines for IUCN category V protected landscapes/seascapes. World Commission on Protected Areas. http://app.iucn.org/dbtw-wpd/edocs/PAG-009.pdf

RUK (2006). Peter Rukavina blog. http://ruk.ca/article/3640

Scenic America (2008). http://www.scenic.org/about_us

Statistics Canada (2003). Canadian Farm Operations in the 21st Century. http://www.statcan.ca/english/agcensus2001/first/farmop/01front.htm

Statistics Canada (2006). Snapshot of Canadian Agriculture. http://www.statcan.ca/english/agcensus2006/articles/snapshot.htm

UNESCO (2004). Protected Areas International Categories. http://portal.unesco.org/en/ev.php-URL_ID=21553&URL_DO=DO_TOPIC&URL_SECTION=201.html

UNESCO (2008). World Heritage. http://whc.unesco.org/en/criteria/

 


UPEI Tourism Research Centre (2008). Overall Results of the 2007 Prince Edward Island Exit Survey: Results for the Main Season, from June 27 to September 30. http://www.trc.upei.ca/files/Exit-Survey-Report1.pdf

WCELA (West Coast Environmental Law Association) (n.d.). British Columbia Guide to Watershed Law and Planning. http://www.bcwatersheds.org/issues/water/bcgwlp/index.shtml 

 


 

 

Paper presented at APEC Conference on Replicating Exemplary Practices in Mathematics Education, Koh Samui, Thailand, 7-12 Mar. 2010

Principles and Processes for Publishing Textbooks and Alignment with Standards: A Case in Singapore

Lianghuo Fan, Nanyang Technological University

March 2010

Abstract: Mainly drawing on the author’s experience in textbook development for Singapore schools and research in this area, this paper presents six principles and discusses relevant processes for developing mathematics textbooks. These principles include curriculum principle, discipline principle, pedagogy principle, technology principle, context principle, and presentation principle. For each principle, the author briefly explains what it means, why it is important, and how it can be implemented for the development of mathematics textbooks.

Key Words: Singapore mathematics; Mathematics curriculum; Textbook development

Over the last 15 or so years, Singapore students’ outstanding performance in mathematics in large-scale international comparative studies has generated considerable interest among educational researchers and policy makers in its approaches to school mathematics education. Its modern mathematics textbooks, as a most important resource in support of teaching and learning in the mathematics classroom, have also received much attention. In fact, Singapore school mathematics textbooks, considered to some extent exemplary, have been adopted, with or without modification, in quite a number of economies over the last decade (e.g., see Quek, 2002). Having said this, I must add that, as researchers have found, there is still much room in Singapore mathematics textbooks for further improvement (e.g., see Ng, 2002; Fan & Zhu, 2007).

In this paper, I shall propose six inter-related principles and discuss relevant processes for developing (publishing) mathematics textbooks. For brevity, these basic principles are termed the curriculum principle, discipline principle, pedagogy principle, technology principle, context principle, and presentation principle, respectively, as shown in Figure 1 below. For each principle, I shall briefly explain what it means and/or signifies, why it is important, and how it can be implemented for the development of mathematics textbooks. The discussions are mainly based on relevant research work that my co-researchers and I have done (e.g., see Fan & Zhu, 2000, 2002, 2007; Ng, 2002; Lee & Fan, 2004), and on personal experience I gained as consultant/editor-in-chief for both primary and secondary mathematics textbooks over the last ten years. In particular, where helpful I will draw on examples from the new series of secondary school textbooks, New Express Mathematics (see Fan, Cheng, Dong, Leong, Lim-Teo, Ng, et al., 2007, 2008), to illustrate my discussion.

 


Fig. 1 Six Principles for Developing Mathematics Textbooks

It should be noted that Singapore is an island city-state and has a highly developed economy with a GDP per capita of US$38,904 in 2008 (Source: Singapore Department of Statistics). It has a population of about 5 million and an area of about 700 square kilometers. Chinese, Malay, Tamil, and English are all official languages, but English is the most widely used working language and the medium of instruction in schools. Singapore adopts a highly centralized education system.

Principle 1: Curriculum Principle

The curriculum principle requires that textbooks must be developed for the implementation and realization of the intended curriculum.

In a broad sense, curriculum is a course of study, or all the experiences a student will have and achieve in school. School textbook developers (or authors/writers) must have a certain “intended” curriculum, explicitly or implicitly, in their minds before they develop their textbooks.

In modern societies, for a variety of reasons and needs, many (if not all) economies have developed a so-called national curriculum (syllabus or standards) for school mathematics. In Singapore, as in many other Asian economies, the national mathematics curriculum (syllabus) is developed and issued by the Ministry of Education (MOE), and all schools are required to follow the syllabus in teaching, learning, and assessment. Accordingly, textbooks must align themselves with the syllabus. Below is the well-known Singapore mathematics curriculum framework, also known as the pentagon framework, stated in the national syllabus (MOE, 2006a, 2006b).


 

Fig. 2 Singapore Mathematics Curriculum Framework

Aligning textbooks with the curriculum in terms of the coverage of contents, as roughly reflected in “Concepts” and “Skills” in the pentagon framework and detailed in the syllabus, is important and relatively easy. What is more challenging is for textbooks to reflect the other aspects that the curriculum intends to achieve, for example, developing students’ high-order thinking skills, critical thinking skills and creativity, and positive attitudes towards mathematics. In fact, Ng (2002) found that the whole series of primary textbooks developed by CDIS (see below) introduced only 11 of the 14 problem solving heuristics listed in the syllabus. A similar inconsistency was also found in secondary mathematics textbooks for the lower grade levels, i.e., Grade 7 and Grade 8 (Fan & Zhu, 2007).
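As an aside, an alignment audit of the kind behind Ng’s (2002) count reduces to a simple set comparison. The sketch below is illustrative only; the heuristic names and the particular omissions are hypothetical, not Ng’s actual findings.

# Hypothetical heuristic names; the real syllabus list and the omissions
# Ng (2002) identified are not reproduced here.
syllabus_heuristics = {
    "act it out", "draw a diagram", "look for patterns", "make a systematic list",
    "work backwards", "guess and check", "use a model", "simplify the problem",
    "make a table", "use equations", "restate the problem", "use logical reasoning",
    "think of a related problem", "use the before-after concept",
}
# Suppose the textbook series introduces all but three of them.
textbook_heuristics = syllabus_heuristics - {
    "restate the problem", "think of a related problem", "use the before-after concept",
}

missing = sorted(syllabus_heuristics - textbook_heuristics)
print(f"Introduced {len(textbook_heuristics)} of {len(syllabus_heuristics)} heuristics")
print("Not covered:", missing)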

As textbooks are essentially the developers’ own interpretation and reflection of the intended curriculum, textbook developers must study the curriculum and establish good knowledge of it and, more importantly, work together with curriculum developers and obtain information and feedback from them.

With regard to the alignment of textbooks with the national curriculum, as pointed out by Kho, a most senior curriculum specialist of MOE, during an interview conducted in preparation for this paper, an exceptional case in Singapore was that in the 1980s and early 1990s, all the primary mathematics textbooks and a set of secondary mathematics textbooks were developed by two specially appointed teams who also developed the syllabus in the Curriculum Development Institute of Singapore (CDIS) under MOE (Kho, personal communication, Jan. 31, 2010). In other words, the curriculum developers were also the textbook developers.1

1 A similar practice existed in China for a long time, where the People’s Education Press was the organization that both stipulated the national syllabus and developed/published the textbooks based on it. The situation changed in the late 1990s, when a new round of curriculum reform was launched.

Since the mid-1990s, the development and publication of textbooks in Singapore have been decentralized for a number of reasons. Nevertheless, the curriculum developers have always maintained a very close working connection and interaction with the textbook developers through a variety of activities and channels, including curriculum briefings (e.g., see Mathematics Unit, 2004, 2005), seminars, and meetings. More importantly, all school textbooks in Singapore must be reviewed and approved, primarily based on the curriculum, by an evaluation committee appointed by the Ministry of Education before they can be published and used in schools, and understandably, the curriculum developers have always played a leading role in the evaluation committee. In addition, the textbook developers must revise their textbooks according to the feedback given in the review report. I think these practices have worked very successfully in Singapore and are worth recommending.

Principle 2: Discipline Principle

There is no doubt that mathematics is a very mature and well-established scientific or academic discipline. The discipline principle requires that school mathematics textbooks must provide a solid foundation for students to understand, apply, and study mathematics in their daily life, further learning, and workplace. In terms of content, textbooks must correctly present mathematical knowledge (including mathematical concepts, facts, and methods). Furthermore, and more challengingly, textbooks should properly represent and reflect the nature, structure, and epistemology of mathematics as a discipline.

The importance of the discipline principle in developing textbooks is easy to see, but to implement it is not as easy as people might think. Many studies on textbooks have indicated a surprisingly large number of cases in which the textbooks presented the content improperly or incorrectly (e.g., see Levin, 1998).

My own experiences in textbook study and development also suggest that many problems found in textbooks are technical and can be corrected easily, but there are still many that are non-technical or conceptual, and these point to problems or weaknesses in the knowledge base of the textbook developers.

To give one case, I shall use an example from the topic of synthetic division in algebra. Many advanced school and college algebra textbooks explicitly state that this method is applicable only to a divisor of the form x − a, and that it cannot be extended to a divisor that is a polynomial of degree higher than 1 (e.g., see Larson & Hostetler, 1997), which is incorrect.

In connection with this topic, when introducing the long division shown below, some textbooks place the quotient above 0x^3 + 2x^2 − 7x − 10 instead of above 5x^5 + 13x^4 + 0x^3 + 2x^2. Although either way will produce the correct answer at this level, the former placement will hinder students’ further learning about how the method can be generalized to other kinds of divisors with degree more than 1, and hence the latter should be used (for more details about synthetic division and its generalization, see Fan, 2003).

[Figure: the worked long division referred to here is not reproduced in this transcription.]
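To see why the restriction to divisors of the form x − a is unnecessary, here is a minimal sketch in Python, my own illustration rather than Fan’s (2003) formulation, of polynomial division that accepts a divisor of any degree. The example divisor x^2 + 2x − 1 is an assumption chosen purely for illustration.

# Coefficients are listed from the highest-degree term down; e.g.
# [5, 13, 0, 2, -7, -10] stands for 5x^5 + 13x^4 + 0x^3 + 2x^2 - 7x - 10.
def poly_divmod(dividend, divisor):
    """Return (quotient, remainder) coefficient lists for dividend / divisor."""
    if not divisor or divisor[0] == 0:
        raise ValueError("divisor must have a nonzero leading coefficient")
    rem = list(dividend)
    deg_q = len(dividend) - len(divisor)
    if deg_q < 0:
        return [0], rem
    quot = [0] * (deg_q + 1)
    for i in range(deg_q + 1):
        coef = rem[i] / divisor[0]       # next quotient coefficient
        quot[i] = coef
        for j, d in enumerate(divisor):  # subtract coef * divisor, shifted by i
            rem[i + j] -= coef * d
    return quot, rem[deg_q + 1:]         # trailing entries form the remainder

# Example with a degree-2 divisor (hypothetical numbers for illustration):
q, r = poly_divmod([5, 13, 0, 2, -7, -10], [1, 2, -1])
print(q, r)  # quotient 5x^3 + 3x^2 - x + 7, remainder -22x - 3

Synthetic division is simply a compact bookkeeping of the same subtract-and-shift loop, which is why it, too, can be generalized beyond linear divisors.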

 

Given that numerous studies have consistently revealed that many mathematics teachers do not have sufficient knowledge for effective teaching of mathematics (e.g., Carpenter, Fennema, Peterson, & Carey, 1988; Fan, 1998; Ma, 1999), the situation here is not surprising, although it should be noted that virtually no study has been done on what knowledge textbook developers need and have. This is worth attention from researchers as well as policy makers.

With regard to the discipline principle, it is clear that textbook developers must have a sound knowledge base in mathematics as a discipline. It is also very helpful, whenever possible, to have mathematicians on the textbook development team, particularly for secondary and higher level textbooks. In fact, in the latest series of secondary mathematics textbooks (of which I served as chief editor), 10 of my 16 fellow developers (authors) are trained mathematicians holding PhD degrees in mathematics from reputable universities. This gives us more confidence in claiming that one of the key features of the textbooks is that the “content is mathematically sound” (Fan, et al., 2007, 2008).

Another relevant point is that textbook developers must carefully collect feedback from teachers and students after they have used the textbooks. Often, the problems and mistakes in textbooks cannot be fully detected until the books are actually used in schools. This implies that textbook development should ideally be an ongoing process.

Having textbooks reviewed, especially by mathematicians and school teachers, is also important in terms of this principle. In Singapore, as Kho pointed out, being mathematically correct is one of the basic criteria for the reviewers to make a recommendation for the approval of the textbooks (Kho, personal communication, Jan. 31, 2010), and the textbook developers must correct the mistakes and address the concerns, if any, raised by the reviewers in this respect.

Principle 3: Pedagogy Principle


The pedagogy principle requires that textbooks must be developed to facilitate the teaching, learning, and assessment in mathematics.

As Fan and Kaeley (2000) indicated, textbooks as a learning tool or resource can convey different pedagogical messages to teachers (and students) and provide them with an encouraging or discouraging curricular environment, promoting different teaching (and learning) strategies. In fact, as available studies have consistently revealed, textbooks can, to different extents, affect not only what to teach but also how to teach, which will ultimately affect students’ learning in mathematics (Zhu & Fan, 2002; Fan, Chen, Zhu, Qiu, & Hu, 2004).

Usually, the pedagogical orientation is provided in textbooks implicitly, but sometimes it is helpful to make some pedagogical messages explicit. For example, in New Express Mathematics, the authors labeled some sections with headings such as “Let us try”, “Looking back”, “In-class activity”, and “Project task” to make the message about the teaching and learning process more explicit. For assessment, the textbooks classified mathematics questions into Groups A, B, and C. Journal writing tasks and other kinds of so-called alternative assessment tasks are also provided in the textbooks (Fan, et al., 2007, 2008).

Regarding this principle, as found in the case of Singapore, textbook developers are often given more room to be flexible in pedagogical matters. It is important that textbook developers keep abreast of new developments in the practice, theories, and research of pedagogy and learning. It is also very helpful to have mathematics educators and mathematics teachers on the textbook development team, particularly when developing textbooks for students at lower grade levels. While mathematics educators have strengths in pedagogical theory and research, school teachers often know better the practices and needs of teachers and students in schools. In New Express Mathematics mentioned above, all the other authors are mathematics educators, and most of them have school teaching experience (Fan, et al., 2007, 2008).

It is worth mentioning that in the process of developing the primary mathematics textbooks by CDIS in the 1980s, as noted earlier, all the content and activities designed by the project team, which was led by Kho, were piloted in classrooms in a number of schools that volunteered to participate in the trial, and then revised according to the feedback from the try-out before they were finalized and published. According to Kho (personal communication, Jan. 31, 2010), this process was unique and very effective in ensuring that the textbooks being developed would suit the needs of teaching and learning in classrooms.

Having textbooks reviewed by mathematics educators, or pedagogical experts, and school teachers before the textbooks are published, and listening to teachers’ feedback after publication, is also important for improvement with respect to the pedagogy principle.

Principle 4: Technology Principle

The meaning of technology in mathematics education has expanded over time, from calculators, to calculators and computers, and now more commonly to information and communication technology (ICT).

About 15 years ago, I criticized, with good intentions, that mathematics education in China, including its textbooks, was largely isolated from modern technology, and that technology was virtually absent from the mathematics textbooks (Fan, 1995). I must say that this criticism is no longer valid, as China has made dramatic progress in this aspect in the new wave of curriculum reform, most visibly in the new textbooks developed. In Singapore, much progress has also been made over the last decade or so.

Undoubtedly, the advent of modern technology has had significant influences on modern society. In the field of mathematics education, technology has affected what to teach and how to teach, and moreover, why to teach. In relation to this, technology must be reflected in and, more importantly, embedded into the teaching and learning of mathematics. Textbooks, as a most important pedagogical resource, must integrate technology to support and facilitate the teaching and learning of mathematics. With the rapid development of technology, it appears that technology will play an increasingly important role in the next generation of mathematics textbooks.

Let me briefly share some examples from New Express Mathematics to illustrate how the technology principle is, to different extents, reflected in textbook development. The first example: in the older mathematics textbooks of which I was also a consultant/general editor, the approximate value of π, most commonly 22/7 and sometimes 3.142, was provided mainly for easy calculation. This is no longer the case in the new textbooks, because all students are expected to use calculators, in which keying in 22/7 is not only redundant and less accurate, but also less efficient than directly keying in the symbol “π” or “pi” (a similar idea applies to other special values in mathematics, e.g., e).
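A quick calculation shows what is at stake; the snippet below is a simple illustration of the accuracy point, not taken from the textbooks themselves.

import math

approx = 22 / 7
print(f"22/7 = {approx:.6f}")    # 3.142857
print(f"pi   = {math.pi:.6f}")   # 3.141593
print(f"absolute error = {abs(approx - math.pi):.6f}")            # about 0.001264
print(f"relative error = {abs(approx - math.pi) / math.pi:.4%}")  # about 0.0402%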

Another example is that, as all students in Singapore are expected to have access to ICT, including calculators, computers, and the internet, the textbook developers have developed more authentic and challenging problems, including investigative and project tasks. In working on these problems, students will focus more on conceptual understanding, information gathering, logical reasoning, and data analysis, rather than tedious calculation, complex algebraic manipulation, or time-consuming drawing. By doing so, technology can make mathematics teaching and learning not only more efficient, but also more effective. In fact, many questions in the textbooks that are targeted to develop students’ high-order thinking and problem solving abilities are ICT-embedded.

In addition, many topics covered in the textbooks, especially those in geometry (e.g., for measuring and construction) and statistics (e.g., for statistical diagrams, graphical representation and data analysis), were introduced with the use of available mathematics software to facilitate students’ learning (Fan, et al., 2007, 2008).

The technology principle requires that the textbook developers be familiar with the development of technology. In particular, having experts in the use of ICT in mathematics teaching and learning on board would be most helpful in this aspect. In addition, feedback from teachers and students is also helpful with regard to this principle.

Finally, I shall very briefly mention the context principle and the presentation principle. Although I think in some sense they are less important and more technical compared to those described earlier, they are still worth reasonable attention in textbook development.

Principle 5: Context Principle


School mathematics textbooks are not research publications in pure mathematics, which can be almost completely abstract. School mathematics is often contextualized, and cannot be free from the social and cultural background under which school education takes place.

The context principle requires that textbook developers provide adequate cultural, social, and even historical contexts when introducing mathematical concepts and contents. This principle is particularly important where the application of mathematics is concerned. In the case of New Express Mathematics, many examples and problems use Singapore’s local context as background. For example, they use authentic information about Singapore’s geography (e.g., for distance, speed, and time), demography and economy (e.g., for statistics), architecture (e.g., for geometrical shapes), and society (e.g., the social welfare and public housing systems for financial mathematics). The contextualized information is provided to motivate and engage students in their learning of mathematics, as they are familiar with the contexts and can make connections with them. In this sense, as found in the case of Singapore mathematics textbooks, modification or localization of the textbooks is necessary when they are developed in one economy but used in other economies.

The context principle requires that textbook developers have reasonable knowledge of local contexts. Having local mathematics education experts and school teachers on the development team is important in this respect. Searching for information in local newspapers and other sources can also be very helpful.

Principle 6: Presentation Principle

This principle requires that the presentation of the contents in textbooks must suit the level and needs of teaching and learning. This principle is meaningful in the textbook development as well-designed presentation can make the reading and use of textbooks easy and pleasant, and facilitate teaching and learning.

The principle is more about the technical aspects of developing and publishing a textbook, “design and physical features”. In Singapore, the mathematics unit of the Ministry of Education once recommended the following four aspects for textbook developers/publishers to consider: 1. Real-life pictures and realistic drawings, 2. Clear layout and illustrations, 3. Use of colors, and 4. Simple language. (Mathematics Unit, 2004).

Largely consistently, in developing the New Express Mathematics series, the developers’ guidelines in this aspect were: “1. Use clear and concise language to describe mathematics concepts and processes, so it is easy for students to understand; and 2. Use diagrams, pictures, and other visual representations, whenever possible, to make the textbooks more interesting and visually appealing to students and hence enrich and enhance students’ learning experiences in mathematics” (New Express Mathematics Project Team, 2004).

To implement this principle, textbook developers and publishers should work together (and share the strengths and responsibilities), have experts or specialists in relevant areas, and most importantly, pay reasonable attention to the aspects as highlighted above.

The following table provides a summary of the principles and processes/recommendations for developing mathematics textbooks, as presented and discussed in this article. Readers are reminded again that they are based on my own experience and mainly on the Singapore context.

 


Table 1
A Summary of Principles and Processes for Publishing Mathematics Textbooks

Curriculum Principle: Textbooks must be developed for the implementation and realization of the intended curriculum.
  1. Textbook developers have good knowledge of the curriculum.
  2. Textbook developers work closely and interact with curriculum developers.
  3. Textbooks are reviewed by reviewers including curriculum developers.

Discipline Principle: Textbooks must provide a solid foundation for students to understand, apply, and study mathematics.
  1. Textbook developers have a sound knowledge base in mathematics.
  2. The development team includes mathematicians.
  3. Textbook developers collect ongoing feedback from the users.
  4. Textbooks are reviewed by reviewers including mathematicians and school teachers.

Pedagogy Principle: Textbooks must be developed to facilitate the teaching, learning, and assessment in mathematics.
  1. Textbook developers have good knowledge in pedagogy.
  2. The development team includes mathematics educators and school teachers.
  3. Textbook developers collect ongoing feedback from the users.
  4. Textbooks are reviewed by reviewers including pedagogical experts and school teachers.

Technology Principle: Textbooks must integrate technology to support and facilitate the teaching and learning of mathematics.
  1. Textbook developers are familiar with the development of technology.
  2. The development team includes experts in the use of ICT in mathematics teaching and learning.
  3. Textbook developers collect feedback from the users.

Context Principle: Textbooks must provide adequate cultural, social and even historical contexts when introducing mathematics concepts and contents.
  1. Textbook developers have good knowledge of local contexts.
  2. The development team includes local mathematics experts and school teachers.

Presentation Principle: Textbooks must suit the level and needs of teaching and learning.
  1. Textbook developers and publishers work together.
  2. The development team includes experts in relevant areas.


 


 

References

Carpenter, T. P., Fennema, E., Peterson, P. L., & Carey, D. A. (1988). Teachers' pedagogical content knowledge of students' problem solving in elementary arithmetic. Journal for Research in Mathematics Education, 19, 385-401.

Fan, L. (1995). A review of the recent development of mathematics education in the U.S. and its comparison with China. Mathematics Teaching (Shanghai, China), 140, 1-3 & 141, 4-6.

Fan, L. (1998). The development of teachers' pedagogical knowledge: An investigation of mathematics teachers in three high-performing high schools. Ann Arbor, MI: UMI Dissertation Service, No. 9841511.


Fan, L., Cheng, W. K., Dong, F. M., Leong, Y. H., Lim-Teo, S. K., Ng, W. L., Quek, K. S., Tang, W. K., Tay, E. G., Teo, K. M., Toh, T. L., Yap, S. F., Yeap, B. H., & Zhao, D. (2007). New Express Mathematics (Book 1 & 3). Singapore: Multimedia Communications.

Fan, L., Cheng, W. K., Dong, F. M., Leong, Y. H., Lim-Teo, S. K., Ng, W. L., Quek, K. S., Tang, W. K., Tay, E. G., Teo, K. M., Toh, T. L., Yap, S. F., Yeap, B. H., & Zhao, D. (2008). New Express Mathematics (Book 2 & 4). Singapore: Multimedia Communications.

Fan, L., & Kaeley, G. (2000). The influence of textbooks on teaching strategies: An empirical study. Mid-Western Educational Researcher, 13(4), 2-9.

Fan, L., & Zhu, Y. (2000). Problem solving in Singaporean secondary mathematics textbooks. The Mathematics Educator, 5(1/2), 117-141.

Fan, L. (2003). A generalization of synthetic division and a general theorem of division of polynomials. Mathematical Medley. 30(1), 30-37.

Fan, L., Chen, J., Zhu, Y., Qiu, X., & Hu, Q. (2004). Textbook use within and beyond Chinese mathematics classrooms: A study of 12 secondary schools in Kunming and Fuzhou of China. In L. Fan, N. Y. Wong, J. Cai, & S. Li (Eds.). How Chinese learn mathematics: Perspectives from insiders (pp. 228-261). Singapore: World Scientific.

Fan, L., & Zhu, Y. (2007). Representation of problem-solving procedures: A comparative look at China, Singapore, and US mathematics textbooks. Educational Studies in Mathematics, 66(1), 61-75.

Larson, R. E., & Hostetler, P. R. (1997). Algebra and Trigonometry (4th ed.). Boston: Houghton Mifflin.

Lee, P. Y., & Fan, L. (2004). The development of Singapore mathematics curriculum: Understanding the changes in syllabus, textbook and approaches. In P. Y. Lee, D. Zhang, & N. Song (Eds.), Proceedings of the International Congress of Mathematicians (ICM 2002) Satellite Conference on the Reform of Mathematics Curriculum and its Education in the 21st Century (pp. 31-36). Chongqing, China: Chongqing Publishing House.

Levin, S. W. (1998). Fractions and division: Research conceptualizations, textbook presentations, and student performances (doctoral dissertation, University of Chicago, 1998). Dissertation Abstracts International 59, 1089A.

Ma, L. (1999). Knowing and teaching elementary mathematics: Teachers' understanding of fundamental mathematics in China and the United States. Mahwah, NJ: Lawrence Erlbaum Associates.

Mathematics Unit. (2004). 1st briefing to publishers on O-level mathematics. Science Branch, Curriculum Planning and Development Division, Ministry of Education, Singapore.

Mathematics Unit. (2005). 2nd briefing to publishers on O-level mathematics. Science Branch, Curriculum Planning and Development Division, Ministry of Education, Singapore.

Ministry of Education. (2006a). Mathematics syllabus (Primary). Singapore: Author.

Ministry of Education. (2006b). Mathematics syllabus (Secondary). Singapore: Author.

New Express Mathematics Project Team. (2005). Conceptual Framework for New Express Mathematics. Internal Working Document (Available upon request)

Ng, L. E. (2002). Representation of problem solving in Singaporean primary mathematics textbooks with respect to types, Pólya’s model, and heuristics. Unpublished master’s thesis, National Institute of Education, Singapore.

Quek, T. (2002). US, Malaysia, Thailand, Viet Nam, India, Pakistan, Bangladesh, Finland ... Now, Israel uses S'pore maths textbooks too. The Straits Times, Sept. 23, 2002, Singapore.


Zhu, Y., & Fan, L. (2002). Textbook use by Singaporean mathematics teachers at lower secondary level. In D. Edge & B. H. Yap (Eds.), Mathematics education for a knowledge-based era (Vol. 2) (pp. 194-201). Singapore: Association of Mathematics Educators.

 


Information Brochure

M.Tech. 

2005-2006

 

Indian Institute of Technology, Bombay

 


 


Important Information: 

1. PLEASE READ THE INSTRUCTIONS GIVEN IN THE BROCHURE CAREFULLY BEFORE YOU FILL IN THE APPLICATION FORM.

2. Each candidate may submit only one application form. If more than one application is submitted, only one will be considered.

3. Along with the Application form, candidates are provided with the following.

i) Information Brochure

ii) Two white envelopes

iii) One brown envelope

iv) Statement of Purpose (SoP), to be submitted only if you are applying to the Department of Aerospace Engineering

4. Application form, Information Brochure & Specimen of envelopes are also available on our website.

Please mail the completed application form to the following address.

Deputy Registrar (Academic)

Indian Institute of Technology Bombay, Powai, Mumbai – 400 076.

Website: www.iitb.ac.in/admissions 

E-Mail address: pgadm@iitb.ac.in

Telephone No. for contact: 022-25767066/7046

 


 


 


 


 


 


 


 


 


 


 


 


It is noted that normal instructional hours are from 8.30 a.m. to 5.00 p.m., and that some courses are held in evening slots. We shall bear the total expenses of his/her studies.

Date:                                    Signature and Seal of the Sponsoring Authority

Certificate for Project-RA, Project-TA & Project Staff

This is to certify that Shri/Smt/Kum. ______________________ has been working in Project ______________________ from dt. ____________. I have no objection if he/she registers for the M.Tech. Programme in ______________________ Department under ____________ category. The duration of the project is ______ years. The appointment of Shri/Smt/Kum ______________________ is for a period of ______ years. The appointment is likely to be extended for a further period.

Signature

Prof. Project Investigator: Project Code:

Project Name:

UNDERTAKING

I, Shri/Smt/Kum ______________________, hereby state that in the event my appointment in the project is terminated, I shall continue as a self-financed student for the remaining period.

Name of Student:

Signature:

Date:

 


 


iii) a) The Application Form & Information Brochure are available at the Academic Office counter (II Floor, Main Building) and can be collected in person against a D.D.

b) The brochure and application forms may also be obtained by sending a Demand Draft to the Deputy Registrar (Academic), IIT Bombay, Powai, Mumbai – 400 076. For this purpose, enclose a self-addressed, strong, cloth-lined, duly stamped (Rs. 25/-) envelope of size 27 cm × 12 cm.

c) The application form & brochure are also available on the IIT website. Candidates can download the application form, fill it in, and submit it (at the counter or by post) along with the Demand Draft. Forms without a Demand Draft will not be considered.

d) Candidates can use the on-line form and submit it on-line. However, they must also send a printed copy of the application form, duly signed, along with the Demand Draft to the Deputy Registrar (Academic), IIT Bombay, Powai, Mumbai – 400 076.

8.2 Filling the forms

ALL ENTRIES SHOULD BE IN CAPITAL LETTERS

1. Name in full: Write your name in capital letters as it appears in your qualifying degree certificate.

Paste (do not pin) passport size photograph in the space provided and get it attested by Principal/Head of College/University last attended or a Gazetted Officer.

2. (a) GATE Details: (Attach copy of the GATE Score Card)

Write your GATE registration no., GATE year and GATE percentile.

 


 

Percentile score: ________    Performance Index (PI): ________    All India Rank (AIR): ________

(b) Competitive (All India) Exams Score, if any (like NCST (D level)/AIIMS/MCI/JIPMER/PGI-Chandigarh/AFMC-Pune) – write the name of the examination, year of passing & score.

3. Write your address for communication at (a) and permanent address at (b). It is the responsibility of the candidate to inform change of address, if any, to the Deputy Registrar (Academic), IIT Bombay, Powai, Mumbai – 400 076.

4. Write your telephone number with STD code and e-mail address for contact.


 

5. Date of Birth:   D D M M Y Y Y Y

 

6. Sex- M for Male, F for Female.

 


7. Fill in:
   General Candidates            G N
   Scheduled Caste               S C
   Scheduled Tribe               S T
   Candidates with Disabilities  P D
Attach caste/tribe/disability certificate.

8. Marital Status: (Write ‘M’ for Married/ ‘S’ for Single)

9. Indicate your qualifying degree as follows:

B.Tech. B T

B.E. B E

B.Sc.(Engg.) B S

M.Sc. M S

Professional P F

(AMIE, MBBS, etc.)

10. Indicate the qualifying discipline in which you have pursued your degree programme by writing the appropriate code from the following table.

Engineering/ Technology

Aeronautical/ Aerospace Engg. A E

Agriculture Engg. A G

Applied Mechanics A M

Architecture A R

Automobile Engg. A U

Biochemical Engg. B I

Biomedical Engg. B M

Biotechnology B T

Ceramic & Glass Tech C G

Chemical Engg./Chemical Tech. C H

Civil Engg./Civil & Environmental/Structural Engg. C E

Computer Sc. & Engg. C S

Electrical Engg. E E

Electronics & Comm. Engg. E C

Energy Engg. E N

Environmental Engg. E V

Food Technology F T

Industrial Engg. I E

Instrumentation Engg./Instrumentation Tech./Instrumentation I N

Manufacturing Engg. M F

Mechanical Engg. M E

Metallurgical Engg. M T

Mineral Engg. M R

 


 



11. For examination pending 1
    For result pending 2
    For result out 3

12. No. of years of experience – indicate the years in the boxes provided (attach a separate sheet with details): a) Teaching   b) Industry

13. Details of Qualifying Examination: please fill in the various columns of the table given here.

14. *Departments/Interdisciplinary Groups/Centre and Schools applied for: Candidates can fill in any three Departments/Interdisciplinary Groups/Centres/Schools (in column I) in order of preference. Write the specialization of your interest (if any) against the Departments you have applied for in column II (under 1 to 5).

Specimen for filling in choices:

      I. Department/Interdisciplinary Groups/Centre/Schools     II. Specializations (1 to 5)
(i)   Aerospace Engg.       AE     AE1   AE4   AE3   AE5   AE2
(ii)  Mechanical Engg.      ME     ME2   ME1   ME3   --
(iii) Energy Systems Engg.  EN     --    --    --

15. For Category write – TA-Institute Teaching Assistantship, TAP–Teaching Assistantship through Project, RA–Institute Research Assistantship, PA–Project Research Assistantship, PS–Project Staff, IS–Institute Staff, SF-Self-finance, SW–Sponsored, in order of preference.

16. Papers published, if any (please attach extract): No. of Papers: Published – ____   Under publication – ____

17. Declaration: Must be given duly signed.

Additional Information (please attach a separate sheet): Candidates applying for Chemical (CH), Biomedical (BM), or Electrical (EE1, EE2, EE3, EE4, EE5) should provide the following information on the back of the application form.

i) Title of the final year project.

ii) List of four courses which you think are relevant to the M.Tech. programme.

iii) Short-term courses attended.

 


 


 


 


Areas of Research

The research activities of the department encompass Process Modelling, Simulation, Computer Aided Design, Optimization and Control, Particulate Systems, Fluid Mechanics, Separation Processes, Petrochemicals, Electrochemical Processes, Thermodynamics, Colloid and Interfacial Science, Microstructure Engineering, Supercritical Fluid Extraction, Membrane Processes, Polymers and Advanced Materials, Reaction Engineering, Catalysis, Environmental Engineering, Bioprocess Engineering and Food Engineering.

3. CIVIL ENGINEERING CE

This programme is particularly geared to meet the growing demand in the country for designers, consultants, development engineers, research-scientists and faculty.

Areas of Specialization

A student entering the M.Tech. programme in Civil Engineering can follow one of the following streams:

1. Transportation Systems Engineering CE 1

2. Geotechnical Engineering CE 2

3. Water Resources CE 3

4. Structural Engineering CE 4

5. Remote Sensing CE 6


Eligibility for Admission

Candidates with a Bachelor’s degree in Civil Engineering or equivalent are eligible to apply for any of the streams listed above.

Areas of Research

Transportation Systems Engineering:

Modelling traffic flow, Urban regional transport network design, Transport planning models, Urban public transport operation and management; Economic evaluation, analysis and impact assessment; Land-use transport planning, Pavement analysis and design, Pavement maintenance management. FS, ANN, AI, GA, ES, GIS applications to transport modelling.

Geotechnical Engineering:

Geotechnical properties of soils, Soil-structure interaction, Foundation for offshore structures, Earth dam problems, Stability and Seepage, Mechanics of Swelling Soils, Rock Mechanics and tunneling, Soil dynamics, Pile foundations, Soil stabilization, Anchored geosynthetics, Reinforced soil structures and geosynthetics, Geotechnical centrifuge study, Optimization techniques and environmental geotechniques; Landslides.

 


 


 


 


 


 


 


 


 


 


 


 


 


 


 


 


 


 


(C) APPENDIX

1. Table I(a): Fees, Deposits & Hostel Rent

GENERAL:

Particulars                        Staying in Hostel (Rs.)   Not Staying in Hostel (Rs.)

At the time of admission*
  Admission Fee                            150                       150
  Provisional Certificate                  100                       100
  Medical Examination                      100                       100
  Student Welfare Fund                     200                       200
  Modernization                            500                       500
  Grade Card Fees                          150                       150
  Identity Card                             25                        25
  Courses of Study Bulletin                100                       100
  Institute Day Fee                         50                        50
  Valedictory Function Fee                  50                        50
  Total (I)                               1425                      1425

DEPOSITS:
  Institute Security Deposit              1000                      1000
  Library Security Deposit                1000                      1000
  Mess Security Deposit                   1000                        --
  Total (II)                              3000                      2000

ANNUAL CHARGES:*
  Annual Insurance Premium                 126                       126
  Total (III)                              126                       126

PER SEMESTER:**
  Gymkhana                                 100                       100
  Examination Fee                          300                       300
  Registration Fee                         300                       300
  Tuition Fee                             2500                      2500
  Medical Fee                               50                        50
  Hostel Rent                              550                        --
  Hostel Est. Charges                      300                        --
  Contribution to:**
    Hostel Subsidy                        1750                        --
    Fan, Elect., Water Charges             300                        --
  Student Benevolent Fund                  100                       100
  Medical Fund                              50                        50
  Internet Charges                         250                       250
  Hostel Maintenance Fee                   500                        --
  Total (IV)                              7050                      3650

Grand Total (I+II+III+IV)                11601                      7201


* FEES ARE NOT REFUNDABLE UNDER ANY CIRCUMSTANCES.
** LIKELY TO BE REVISED.

A mess advance of Rs. 2000/- is to be paid before occupying the hostel by those who opt to stay in a hostel.

Candidates from reserved categories (SC/ST) are fully exempted from paying the Tuition Fees.

Students staying in quarters such as the Tansa, Tulsi and OIP Quarters are required to pay licence fees, water & electricity charges, F.R., etc., as applicable for occupation of the quarters, as per the Estate Office Rules.

Keeping or using any motorized vehicle on the campus is prohibited.

2. Table I (b): Payment of Fees for Students of Information Technology

GENERAL:

                                    Students staying     Students not staying
Particulars                         in hostels (Rs.)     in hostels (Rs.)

At the time of admission*
  Admission Fee                           150                  150
  Provisional Certificate                 100                  100
  Medical Examination                     100                  100
  Student Welfare Fund                    200                  200
  Modernization                           500                  500
  Grade Card Fees                         150                  150
  Identity Card                            25                   25
  Courses of Study Bulletin               100                  100
  Institute Day Fee                        50                   50
  Valedictory Function Fee                 50                   50
  Total (I)                              1425                 1425

DEPOSITS:
  Institute Security Deposits            1000                 1000
  Library Security Deposits              1000                 1000
  Mess Security Deposits                 1000                    -
  Total (II)                             3000                 2000

ANNUAL CHARGES:*
  Annual Insurance Premium                126                  126
  Total (III)                             126                  126

PER SEMESTER:**
  Gymkhana                                100                  100
  Examination Fee                         300                  300
  Registration Fee                        300                  300
  Tuition Fee                           21050                21050
  Medical Fee                              50                   50
  Hostel Rent                             550                    -
  Hostel Est. Charges                     300                    -
  Contribution to:**
    Hostel Subsidy                       1750                    -
    Fan, Elect., Water Charges            300                    -
  Student Benevolent Fund                 100                  100
  Medical Fund                             50                   50
  Internet Charges                        250                  250
  Hostel Maintenance Fee                  500                    -
  Total (IV)                            25600                22200

Grand Total (I+II+III+IV)               30151                25751


* FEES ARE NOT REFUNDABLE UNDER ANY CIRCUMSTANCES.
** LIKELY TO BE REVISED.

A mess advance of Rs. 2000/- is to be paid before occupying the hostel by those who opt to stay in a hostel.

Candidates from reserved Categories (SC/ST) are fully exempted from paying the Tuition Fees.

 


3. Table II

SUMMARY OF M.TECH. PROGRAMMES

Department / Interdisciplinary Group / Centre        Specialization                          Code

Departments:
1. Aerospace Engineering                             Aerodynamics                            AE1
                                                     Dynamics and Control                    AE2
                                                     Aerospace Propulsion                    AE3
                                                     Aerospace Structures                    AE4
                                                     Systems Design and Engineering          AE5
2. Chemical Engineering                              Chemical Engineering                    CH
3. Civil Engineering                                 Transportation Systems Engg.            CE1
                                                     Geotechnical Engineering                CE2
                                                     Water Resources                         CE3
                                                     Structural Engineering                  CE4
                                                     Remote Sensing                          CE6
4. Computer Science & Engineering                    Computer Science & Engg.                CS
5. Earth Sciences                                    Geoexploration                          GS
6. Electrical Engineering                            Communication Engineering               EE1
                                                     Control & Computing                     EE2
                                                     Power Electronics & Power Systems       EE3
                                                     Micro Electronics                       EE4
                                                     Electronic Systems                      EE5
7. Mechanical Engineering                            Thermal & Fluids Engineering            ME1
                                                     Design Engineering                      ME2
                                                     Manufacturing Engineering               ME3
8. Metallurgical Engineering & Materials Science     Materials Science and Engg.             MM1
                                                     Process Metallurgy                      MM2

Interdisciplinary Groups:
9. Corrosion Science & Engineering                                                           CO
10. Energy Systems Engineering                                                               EN
11. Industrial Engg. & Operations Research                                                   IO
12. Reliability Engineering                                                                  RE
13. Systems and Control Engg.                                                                SC

Centre / School:
14. Environmental Science & Engineering                                                      EV
15. School of Information Technology                                                         IT
16. School of Biosciences & Bioengineering (Biomedical Engineering)                          BM

The PGDIIT programme is offered by the School of Information Technology.

 


4. Table III

ELIGIBILITY FOR SEEKING ADMISSION TO DIFFERENT DISCIPLINES

If you have a degree or equivalent in:    You can seek admission to the following disciplines (refer to Table II for codes):

Agricultural Engg. EN, IT, IO, EV, CS

Aerospace/Aeronautical Engg. AE1, AE2, AE3, AE4, AE5, CO, CS, EN, IO, ME1, RE, SC, IT, EV.

Automobile Engg. ME1, ME2, EN, IT, IO, CS

Biomedical Engg. BM, IT, IO, EE5, CS

Ceramic and Glass Tech. MM1, IT, IO, CS

Chemical Engg. BM, CH, CO, CS, EN, EV, IO, MM1, MM2, RE, SC, IT.

Civil Engg. CE1, CE2, CE3, CE4, CE6, CO, CS, EN, EV, IO, RE, IT, AE1, AE4, AE5.

Computer Science & Engineering. BM, CS, RE, EE1, EE2, EE3,EE4, IT, IO, AE5.

Energy Systems Engg. EN, EV, IT, IO, CS

Electrical/Electronics Engg. EE1, EE2, EE3, EE4, EE5, BM, CO, CS, EN, IO, MM1, RE, IT, SC, AE2, AE5.

Telecommunication Engg. EE1, EE2, EE3, EE4, EE5, BM, CO, CS, IO, RE, IT

Electrochemical CO, MM1, MM2, IT, IO

Engineering Physics BM, EE1, EE2, EE4, EE5, MM1, CS, IT, IO

Industrial Engineering ME3, IO, RE, CS, IT

Instrumentation Engg. BM, CS, EE2, EE3, EE4, EE5, RE, SC, IT, IO.

Machine Tool Engg. ME2, ME3, IT, IO, CS

Thermal Power Engg. EN, IT, IO, CS, AE2

Mechanical Engg. ME1, ME2, ME3, AE1, AE3, AE4, AE5, BM, CO, CS, EN, EV, IO, MM1, MM2, RE, SC, IT

Metallurgical Engineering MM1, MM2, CO, CS, EN, EV, IO, RE, IT

Mining Engineering. EV, RE, IT, CS, IO

Production Engineering. ME2, ME3, CS, EV, IO, RE, IT

Petrochemical CH, EV, IT, IO, CS

MCA (with B.Sc. with Physics & Maths.) CS, IT

M.Sc. or equivalent in:

Atmospheric Science EV, CS

Biochemistry BM, IT, EV, CS


Biophysics BM, IT, CS

Biotechnology BM, IT, EV, CS

Chemistry BM, CO, EV, MM1, MM2, IT, CS

Ceramics BM, IT, CS

Environmental Sciences EV, IT, CS

Environmental Toxicology EV, CS

Ergonomics BM, IT, CS

Earth Sciences (Geology/Applied Geology/Geochemistry/Geophysics) EV, GS, IT, CS

Life Sciences EV, IT, CS

Materials Science BM, CO, MM1, MM2, IT, CS

Mathematics BM, CS, IT

Microbiology EV, IT, CS

Meteorology EV, CS

Molecular Biology BM, IT, CS

Physics BM, MM1, IT, CS, EV

 


Physics with Electronics as a Special Subject BM, CS, EE4, MM1, IT

Electronic Science BM, EE4, IT, CS

Physiology BM

M.B.B.S. (Medicine) BM

M.Pharm. BM


Note: 1. For admission to Computer Science & Engineering (CS), the candidate must have taken the GATE Specialization Paper CS.

2. For admission to EV, the candidate must have passed Mathematics in the plus two stage (i.e. in 12th standard).

3. For admission to Chemical Engineering, candidates must have taken GATE specialization paper CH.

4. For admission to Systems & Control Engg., candidates should have undergone a course in Control Theory.

5. For admission to MM1 and MM2, the candidates with M.Sc. must have passed Mathematics as a subject at the B.Sc. degree level.

6. For admission to MM2, the candidates with M.Sc., should have taken either general/physical/inorganic Chemistry as specialization at the M.Sc. level.

7. For admission to various specializations in Aerospace & Electrical Engineering, there are specific GATE paper requirements. The information is given under the respective departments in this brochure.

 


TABLE IV

GATE & OTHER REQUIREMENTS OF DIFFERENT DISCIPLINES

Discipline: Aerospace Engineering
Specializations: Aerodynamics (AE 1), Dynamics & Control (AE 2), Aerospace Propulsion (AE 3), Aerospace Structures (AE 4), Systems Design & Engg. (AE 5)
Eligibility: B.E./B.Tech. or equivalent in Aeronautical/Aerospace Engineering (eligible for all five specializations). B.E./B.Tech. or equivalent in Mechanical/Civil/Computer/Electrical/Electronics or allied branches are eligible for specific specializations as follows:
  Aerodynamics (AE 1): ME, CE
  Dynamics & Control (AE 2): ME, EE, EC
  Aerospace Propulsion (AE 3): ME
  Aerospace Structures (AE 4): ME, CE
  Systems Design & Engg. (AE 5): ME, EE, EC, CE, CS, XE
GATE requirement: Valid GATE score in any subject.

Discipline: Chemical Engineering
Eligibility: B.E./B.Tech. in Chemical Engineering or equivalent.
GATE requirement: Valid GATE score in Chemical Engg. (CH) only.

Discipline: Civil Engineering
Specializations: Transportation Systems Engg. (CE 1), Geotechnical Engg. (CE 2), Water Resources (CE 3), Structural Engg. (CE 4), Remote Sensing (CE 6)
Eligibility: B.E./B.Tech. in Civil Engineering or equivalent.
GATE requirement: Valid GATE score in any subject.

Discipline: Computer Sc. & Engg.
Eligibility: Bachelor's degree in Engineering/Master's degree in Science/MCA or equivalent degree.
GATE requirement: Valid GATE score in Computer Sc. (CS) only.

 

Discipline: Earth Sciences (Geoexploration)
Eligibility: Master's degree or equivalent in Geology/Applied Geology/Geophysics/Geochemistry.

Discipline: Electrical Engineering
Specializations: Communication Engg. (EE 1), Control & Computing (EE 2), Power Electronics & Power Systems (EE 3), Microelectronics (EE 4), Electronic Systems (EE 5)
Eligibility:
EE 1: Bachelor's degree in Computer Sc./Electrical Engineering/Electronics/Telecommunications/Engineering Physics or Master's degree in Physics (with specialization in Electronics/Electronics Sciences)

 


 

EE 2: Bachelor's degree in Aeronautical/Aerospace, Biotechnical, Chemical, Communication, Electrical, Electronics, Telecommunication, Instrumentation, Mechanical, Metallurgical Engineering, Engineering Physics or M.Sc. in Physics or Mathematics

EE 3: Bachelor’s degree in Computer Science/ Electrical Engineering/ Electronics/ Telecommunications/ Instrumentation

EE 4: Bachelor’s degree in Computer Science/ Electrical Engineering/ Electronics/ Telecommunications/ Engineering Physics/ Instrumentation OR Master’s degree in Physics or Electronics Sciences

EE 5: Bachelor's degree in Biomedical/Electrical Engineering/Electronics/Instrumentation/Telecommunications/Engineering Physics

Discipline: Mechanical Engineering
Specializations: Thermal & Fluids Engineering (ME 1), Design Engineering (ME 2), Manufacturing Engineering (ME 3)
Eligibility:
ME 1: Bachelor's degree in Mechanical/Aeronautical/Aerospace/Automobile Engineering
ME 2: Bachelor's degree in Mechanical/Production/Automobile/Machine Tool Engineering
ME 3: Bachelor's degree in Mechanical/Production/Machine Tool/Industrial Engineering

Discipline: Metallurgical Engineering & Materials Science
Specializations: Materials Science and Engg. (MM 1), Process Metallurgy (MM 2)
Eligibility:
MM 1: Bachelor's degree in Ceramic/Chemical/Electrical/Electronics/Electrochemical/Mechanical/Metallurgical Engineering/Engineering Physics or M.Sc. in Chemistry/Materials Sciences/Physics
MM 2: Bachelor's degree in Chemical/Electrochemical/Mechanical/Metallurgical Engineering or M.Sc. in Chemistry (General or specialization in Physical or Inorganic Chemistry)/Materials Science
In addition, in either specialization, candidates with M.Sc. must have passed Mathematics as a subject at the B.Sc. degree level.

 


 

Discipline: Corrosion Sc. & Engg.
Eligibility: Bachelor's degree in Aeronautical/Aerospace/Chemical/Civil/Electrical/Electrochemical/Mechanical/Metallurgical Engineering/Ceramics & Petrochemical or Master's degree in Chemistry/Material Science.

Discipline: Energy Systems Engg.
Eligibility: Bachelor's degree in Mechanical/Electrical/Chemical/Energy Systems/Thermal Power/Automobile/Aerospace/Aeronautical/Metallurgical/Civil Engineering.

Discipline: Industrial Engg. & Operations Research
Eligibility: Bachelor's degree in any branch of Engineering.

Discipline: Reliability Engineering
Eligibility: Bachelor's degree in Aeronautical/Aerospace/Chemical/Civil/Electrical/Mechanical/Production/Industrial/Metallurgical Engineering/Computer Science/Electronics/Instrumentation/Mining Engineering.

Discipline: Systems & Control Engg.
Eligibility: Bachelor's degree in Aeronautical/Aerospace/Chemical/Communication/Electrical/Electronics/Instrumentation/Mechanical/Metallurgical Engineering or M.Sc. in Mathematics. Candidates should have undergone a course in Control Theory.

Discipline: Environmental Sc. & Engg.
Eligibility: Bachelor's degree in Aeronautical/Aerospace/Agricultural/Chemical/Civil/Energy Systems/Environmental/Mechanical/Metallurgical/Mining/Production/Petrochemical Engineering or M.Sc. in Atmospheric Science/Biochemistry/Biotechnology/Chemistry/Earth Sciences/Environmental Toxicology/Environmental Science/Life Science/Meteorology/Microbiology/Physics. Candidates with M.Sc. must have passed Mathematics as a subject at the 10+2 level.

 


 

Discipline: School of Information Technology (IT)
Eligibility: B.E./B.Tech./B.Sc. (Engg.)/M.Sc. in any discipline or MCA.

Discipline: Biomedical Engineering (BM)
Eligibility: B.E./B.Tech. in Biomedical/Chemical/Computer Science/Electrical/Electronics/Telecommunications/Instrumentation/Mechanical Engg./Engineering Physics or M.Sc. or equivalent in Biochemistry/Biophysics/Biotechnology/Ceramics/Ergonomics/Chemistry/Electronics/Materials Science/Mathematics/Molecular Biology/Physics/Physiology or MBBS or M.Pharm.

 


Statement of Purpose (SoP): only for candidates applying to the Aerospace Engineering Department

INDIAN INSTITUTE OF TECHNOLOGY, BOMBAY


STATEMENT OF PURPOSE

The Statement of Purpose (SOP) is your opportunity to share with the admission committee your thoughts and feelings about postgraduate studies at IIT Bombay, including your preparation for the same. Briefly describe past project/research work done by you. Restrict yourself to 500-600 words. The personal SOP will aid the admission committee in evaluating your application.

1. Name:

2. Programme of study: M.Tech.    Department: Aerospace Engineering

Signature: Date:

 

SPARSE AND LOW-RANK MATRIX DECOMPOSITION VIA ALTERNATING DIRECTION METHODS

XIAOMING YUAN* AND JUNFENG YANG†

Abstract. The problem of recovering the sparse and low-rank components of a matrix captures a broad spectrum of applications. The authors of [4] proposed the concept of "rank-sparsity incoherence" to characterize the fundamental identifiability of the recovery, and derived practical sufficient conditions that guarantee exact recovery. This exact recovery is achieved by solving a convex relaxation problem in which the l1 norm and the nuclear norm serve as surrogates for sparsity and low rank, respectively. Numerically, this convex relaxation problem was reformulated in [4] into a semi-definite programming (SDP) problem of considerably enlarged dimension, and it was proposed there to solve the SDP reformulation with generic interior-point solvers.

This paper focuses on algorithmic improvements for sparse and low-rank recovery. In particular, we observe that the convex relaxation problem generated by the approach of [4] is in fact well-structured in both the objective function and the constraint, and that it fits perfectly within the applicable range of the classical alternating direction method (ADM). Hence, we propose an ADM approach for accomplishing the sparse and low-rank recovery, fully exploiting the high-level separable structure of the convex relaxation problem. Preliminary numerical results are reported to verify the efficiency of the ADM approach for recovering the sparse and low-rank components of matrices.

Key words. Matrix decomposition, sparse, low rank, alternating direction method, l1 norm, nuclear norm.

AMS subject classifications. 90C06, 90C22, 90C25, 90C59, 93B30

1. Introduction. Matrix representations of complex systems and models arising in various areas often have the property that the matrix is composed of a sparse component and a low-rank component. Such applications include model selection in statistics, system identification in engineering, partially coherent decomposition in optical systems, and matrix rigidity in computer science; see e.g. [15, 16, 17, 30, 35, 37, 41]. To better understand the behavior and properties of such a complex system and to handle it more efficiently, it is of significant interest to take advantage of its decomposable character. A necessary step towards this goal is to recover the sparse and low-rank components of the given matrix without any prior knowledge about the sparsity pattern or the rank. Despite its powerful applicability, the problem of sparse and low-rank matrix decomposition (SLRMD) was highlighted and intensively studied only very recently, by [4].

Obviously, the SLRMD problem is in general ill-posed (NP-hard) and thus not tractable. In the spirit of the influential work of [5, 6, 9, 10, 11, 36] in the areas of compressed sensing and statistics, the authors of [4] insightfully proposed the concept of "rank-sparsity incoherence" to characterize the fundamental identifiability of the recovery of sparse and low-rank components. Accordingly, a simple deterministic condition for the eligibility of exact recovery was derived therein. Note that the concept of "rank-sparsity incoherence" relates, via an uncertainty principle, the sparsity pattern of a matrix to its row and column spaces. Throughout, we assume by default that the matrix under consideration is recoverable into sparse and low-rank components.

*Department of Mathematics, Hong Kong Baptist University, Hong Kong, China (xmyuan@hkbu.edu.hk). This author was supported by the RGC grant 203009 and the NSFC grant 10701055.

†Department of Mathematics, Nanjing University, Nanjing, Jiangsu, China (jfyang2992@gmail.com).


Following the widely used heuristics of taking the l1 norm as a proxy for sparsity and the nuclear norm as a surrogate for low rank in areas such as statistics and image processing (see e.g. [6, 8, 14, 36]), the authors of [4] suggested accomplishing the sparse and low-rank recovery by solving the following convex optimization problem:

$$\min_{A,B}\ \gamma \|A\|_{\ell_1} + \|B\|_* \quad \text{s.t.}\quad A + B = C, \tag{1.1}$$

where $C \in \mathbb{R}^{m \times n}$ is the given matrix to be recovered; $A \in \mathbb{R}^{m \times n}$ is the sparse component of $C$; $B \in \mathbb{R}^{m \times n}$ is the low-rank component of $C$; $\|\cdot\|_{\ell_1}$ is the $\ell_1$ norm, defined as the sum of absolute values of all entries; $\|\cdot\|_*$ is the nuclear norm, defined as the sum of all singular values; and $\gamma > 0$ is a constant providing a trade-off between the sparse and low-rank components. Let $A^*$ and $B^*$ be the true sparse and low-rank components of $C$ which are to be recovered. Conditions on $A^*$ and $B^*$ were proposed in [4] ensuring that the unique solution of (1.1) is exactly $(A^*, B^*)$ for a range of $\gamma$, i.e., that the exact sparse and low-rank recovery of $C$ can be accomplished. We refer to Theorem 2 in [4] for the delineation of these conditions, and we emphasize that these conditions are practical, as they are satisfied by many real applications; see also [4].

Therefore, efficient numerical solvability of (1.1) is crucial for the task of recovering the sparse and low-rank components of C. In particular, it was suggested in [4] to apply interior-point solvers such as SDPT3 [40] to a semi-definite programming (SDP) reformulation of (1.1). In many cases, however, the matrices to be decomposed are large-scale, and this feature significantly increases the difficulty of the recovery. In fact, it is well known that the interior-point approach is computationally impractical for large-scale (indeed, even medium-scale) optimization problems with matrix variables. In particular, for solving (1.1) via the interior-point approach of [4], the dimension of the SDP reformulation is considerably magnified compared to that of (1.1); see (A.1) in [4].

Hence, as the authors of [4] pointed out at the end of their paper, it is of particular interest to develop efficient numerical algorithms for solving (1.1), especially for large-scale matrices. To this end, we observe that (1.1) is well-structured in the sense that a separable structure emerges in both the objective function and the constraint, and there is every reason to take advantage of this favorable structure in the algorithmic design. Recall that (1.1) was treated in [4] as a generic convex problem, and its beneficial structure was ignored there. In fact, the high-level separable structure of (1.1) can be readily exploited by the well-known alternating direction method (ADM). We therefore devote this paper to presenting an ADM approach for solving (1.1) that takes full advantage of this structure. The rationale of exploiting the particular structure of (1.1) for algorithmic purposes also conforms to what Nesterov observed in [33]: "It was becoming more and more clear that the proper use of the problem's structure can lead to very efficient optimization methods." As we will analyze in detail, the ADM approach is attractive for sparse and low-rank recovery because the main computational load of each iteration is dominated by only one singular value decomposition.

2. The ADM approach. Generally speaking, ADM is a practical improvement of the classical augmented Lagrangian method for solving convex programming problems with linear constraints, obtained by fully taking advantage of their high-level separable structure. We refer to the wide applications of ADM in areas such as convex programming, variational inequalities and image processing; see, e.g., [2, 7, 12, 13, 18, 19, 20, 21, 22, 24, 28, 39]. In particular, novel applications of ADM to some interesting optimization problems have been discovered very recently; see e.g. [13, 25, 32, 43, 44].

The augmented Lagrangian function of (1.1) is

$$\mathcal{L}(A, B, Z) := \gamma\|A\|_{\ell_1} + \|B\|_* - \langle Z,\, A + B - C\rangle + \frac{\beta}{2}\|A + B - C\|_F^2,$$

where $Z \in \mathbb{R}^{m \times n}$ is the multiplier of the linear constraint; $\beta > 0$ is the penalty parameter for the violation of the linear constraint; $\langle \cdot, \cdot \rangle$ denotes the standard trace inner product; and $\|\cdot\|_F$ is the induced Frobenius norm. Obviously, the classical augmented Lagrangian method (see e.g. [1, 34]) is applicable, and its iterative scheme is

$$(A^{k+1}, B^{k+1}) \in \arg\min_{A,B}\ \mathcal{L}(A, B, Z^k), \qquad Z^{k+1} = Z^k - \beta(A^{k+1} + B^{k+1} - C), \tag{2.1}$$

where $(A^k, B^k, Z^k)$ is the current iterate. The direct application of the augmented Lagrangian method, however, treats (1.1) as a generic minimization problem and ignores the favorable separable structure emerging in both the objective function and the constraint: in (2.1), the variables $A$ and $B$ are minimized simultaneously.

Indeed, this shortcoming of the direct application of the augmented Lagrangian method to (1.1) is remedied by the well-known ADM, which minimizes the variables A and B sequentially. More specifically, the original ADM (see e.g. [19, 20, 21, 22]) solves the following problems to generate the new iterate:


$$\begin{aligned}
A^{k+1} &\in \arg\min_{A}\ \mathcal{L}(A, B^k, Z^k),\\
B^{k+1} &\in \arg\min_{B}\ \mathcal{L}(A^{k+1}, B, Z^k),\\
Z^{k+1} &= Z^k - \beta(A^{k+1} + B^{k+1} - C),
\end{aligned}$$

which is equivalent to

$$0 \in \gamma\,\partial\big(\|A^{k+1}\|_{\ell_1}\big) - \big[Z^k - \beta(A^{k+1} + B^k - C)\big], \tag{2.2a}$$
$$0 \in \partial\big(\|B^{k+1}\|_*\big) - \big[Z^k - \beta(A^{k+1} + B^{k+1} - C)\big], \tag{2.2b}$$
$$Z^{k+1} = Z^k - \beta(A^{k+1} + B^{k+1} - C), \tag{2.2c}$$

where $\partial(\cdot)$ denotes the subdifferential operator of a convex function.

We now elaborate on strategies for solving the subproblems (2.2a) and (2.2b). First, consider (2.2a), which turns out to be the widely used shrinkage problem (see e.g. [38]). In fact, (2.2a) admits the explicit solution

$$A^{k+1} = \Big(\tfrac{1}{\beta}Z^k - B^k + C\Big) - P_{\Omega_{\gamma/\beta}}\Big[\tfrac{1}{\beta}Z^k - B^k + C\Big],$$

where $P_{\Omega_{\gamma/\beta}}$ denotes the Euclidean projection onto

$$\Omega_{\gamma/\beta} := \big\{X \in \mathbb{R}^{m \times n} : -\gamma/\beta \le X_{ij} \le \gamma/\beta\big\}.$$

 


For the second subproblem (2.2b), it is easy to verify that it is equivalent to the following minimization problem:

$$B^{k+1} = \arg\min_{B}\ \Big\{\|B\|_* + \frac{\beta}{2}\Big\|B - \Big[C - A^{k+1} + \tfrac{1}{\beta}Z^k\Big]\Big\|_F^2\Big\}.$$

Then, according to [3, 31], $B^{k+1}$ admits the explicit solution

$$B^{k+1} = U^{k+1}\,\mathrm{diag}\big(\max\{\sigma_i^{k+1} - \tfrac{1}{\beta},\, 0\}\big)\,(V^{k+1})^T,$$

where $U^{k+1} \in \mathbb{R}^{m \times r}$ and $V^{k+1} \in \mathbb{R}^{n \times r}$ are obtained via the singular value decomposition of $C - A^{k+1} + \tfrac{1}{\beta}Z^k$:

$$C - A^{k+1} + \tfrac{1}{\beta}Z^k = U^{k+1}\Sigma^{k+1}(V^{k+1})^T, \quad \text{with } \Sigma^{k+1} = \mathrm{diag}\big(\{\sigma_i^{k+1}\}_{i=1}^{r}\big).$$
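This singular value shrinkage is equally short in MATLAB; a sketch under the same naming caveat (svt is our name):

    % Singular value thresholding: solves min_B ||B||_* + (beta/2)||B - G||_F^2,
    % where G = C - A^{k+1} + Z^k/beta and the threshold is 1/beta.
    function B = svt(G, beta)
        [U, S, V] = svd(G, 'econ');      % economy-size SVD of G
        s = max(diag(S) - 1/beta, 0);    % shrink the singular values
        B = U * diag(s) * V';            % rebuild the low-rank component
    end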

Based on the above analysis, we now delineate the procedure of applying ADM to solve (1.1). Given $(A^k, B^k, Z^k)$, the ADM takes the following steps to generate the new iterate.

Algorithm: the ADM for the SLRMD problem.

Step 1. Generate $A^{k+1}$:

$$A^{k+1} = \Big(\tfrac{1}{\beta}Z^k - B^k + C\Big) - P_{\Omega_{\gamma/\beta}}\Big[\tfrac{1}{\beta}Z^k - B^k + C\Big].$$

Step 2. Generate $B^{k+1}$:

$$B^{k+1} = U^{k+1}\,\mathrm{diag}\big(\max\{\sigma_i^{k+1} - \tfrac{1}{\beta},\, 0\}\big)\,(V^{k+1})^T,$$

where $U^{k+1}$, $V^{k+1}$ and $\{\sigma_i^{k+1}\}$ are generated by the singular value decomposition of $C - A^{k+1} + \tfrac{1}{\beta}Z^k$, i.e.,

$$C - A^{k+1} + \tfrac{1}{\beta}Z^k = U^{k+1}\Sigma^{k+1}(V^{k+1})^T, \quad \text{with } \Sigma^{k+1} = \mathrm{diag}\big(\{\sigma_i^{k+1}\}_{i=1}^{r}\big).$$

Step 3. Update the multiplier:

$$Z^{k+1} = Z^k - \beta(A^{k+1} + B^{k+1} - C).$$
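Assembled into a single routine, one possible MATLAB transcription of the three steps reads as follows; the function and variable names are ours, and the stopping test anticipates criterion (3.4) of Section 3:

    % A minimal MATLAB sketch of the ADM for the SLRMD problem (1.1).
    function [A, B] = adm_slrmd(C, gamma, beta, maxit, tol)
        [m, n] = size(C);
        A = zeros(m, n); B = zeros(m, n); Z = zeros(m, n);
        for k = 1:maxit
            Aold = A; Bold = B;
            % Step 1: shrinkage (soft-thresholding at level gamma/beta)
            M = Z/beta - B + C;
            A = M - min(max(M, -gamma/beta), gamma/beta);
            % Step 2: singular value shrinkage at level 1/beta
            [U, S, V] = svd(C - A + Z/beta, 'econ');
            B = U * diag(max(diag(S) - 1/beta, 0)) * V';
            % Step 3: multiplier update
            Z = Z - beta*(A + B - C);
            % relative change of the iterates, cf. (3.4)
            relchg = norm([A - Aold, B - Bold], 'fro') / ...
                     (norm([Aold, Bold], 'fro') + 1);
            if relchg < tol, break; end
        end
    end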

Remark 1. It is easy to see that when the ADM approach is applied to solve (1.1), the computational load of each iteration is dominated by one singular value decomposition (SVD), whose complexity is $O(n^3)$; see e.g. [23]. In particular, existing subroutines for efficient SVD (e.g. [29, 31]) guarantee the efficiency of the proposed ADM approach for sparse and low-rank recovery of large-scale matrices.

Remark 2. Some more general ADM methods can readily be extended to solve the SLRMD problem: for example, the general ADM proposed by Glowinski [21, 22], which modifies Step 3 of the original ADM with a relaxation parameter in the interval $(0, \frac{\sqrt{5}+1}{2})$, and the ADM-based descent method developed in [42]. We omit the details of these general ADM-type methods for succinctness. Convergence results for ADM-type methods are available in the literature, e.g., [19, 20, 21, 22, 26, 42].

 


Remark 3. The penalty parameter β can also be adjusted dynamically. We refer to, e.g., [24, 26, 27, 28] for the convergence of ADM with a dynamically adjusted penalty parameter and for some concrete, effective strategies for adjusting it.
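Purely as an illustration of the idea (this rule is our own simplification, not one of the strategies of [24, 26, 27, 28]), such an adjustment might look like:

    % Illustrative only: enlarge beta when the constraint violation
    % res = ||A + B - C||_F decreases too slowly between iterations.
    function beta = adjust_beta(beta, res, res_old)
        if res > 0.9 * res_old    % violation stagnates
            beta = 2 * beta;      % penalize the violation more heavily
        end
    end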

3. Numerical results. In this section, we present experimental results that show the efficiency of ADM applied to (1.1). Let $C = A^* + B^*$ be the available data, where $A^*$ and $B^*$ are, respectively, the original sparse and low-rank matrices that we wish to recover. For convenience, we let $\gamma = t/(1-t)$, so that problem (1.1) can be equivalently transformed to

$$\min_{A,B}\ \big\{\, t\|A\|_{\ell_1} + (1-t)\|B\|_* \,:\, A + B = C \,\big\}. \tag{3.1}$$

The advantage of the reformulation (3.1) is that the parameter $t$ varies in the finite interval $(0, 1)$, as compared to $\gamma \in (0, +\infty)$ in (1.1). We let $(\hat{A}_t, \hat{B}_t)$ be a numerical solution of (3.1) obtained by ADM. We measure the quality of recovery by the relative error to $(A^*, B^*)$, defined as

$$\mathrm{RelErr} := \frac{\|(\hat{A}_t, \hat{B}_t) - (A^*, B^*)\|_F}{\|(A^*, B^*)\|_F + 1}, \tag{3.2}$$

where $\|\cdot\|_F$ denotes the Frobenius norm. All experiments were performed under Windows Vista Premium and MATLAB v7.8 (R2009a), running on a Lenovo laptop with an Intel Core 2 Duo CPU at 1.8 GHz and 2 GB of memory.

3.1. Experimental settings. Given a small constant $\tau > 0$, we define

$$\mathrm{diff}_t := \|\hat{A}_t - \hat{A}_{t-\tau}\|_F + \|\hat{B}_t - \hat{B}_{t-\tau}\|_F. \tag{3.3}$$

It is clear that $\hat{B}_t$ approaches the zero matrix as $t$ approaches 0, while $\hat{A}_t$ approaches the zero matrix as $t$ becomes close to 1. Therefore, $\mathrm{diff}_t$ stabilizes near the boundaries of $(0, 1)$. As suggested in [4], a suitable value of $t$ should result in a recovery for which $\mathrm{diff}_t$ is stabilized while $t$ stays away from both 0 and 1. To determine a suitable value of $t$, we set $\tau = 0.01$ in (3.3), which, based on our experimental results, is sufficiently small for measuring the sensitivity of $(\hat{A}_t, \hat{B}_t)$ with respect to $t$, and ran a set of experiments with different combinations of sparsity ratios of $A^*$ and ranks of $B^*$. In the following, $r$ and $\mathrm{spr}$ denote, respectively, the matrix rank and the sparsity ratio. We tested two types of sparse matrices: impulsive and Gaussian sparse matrices. The MATLAB scripts for generating the matrix $C$ are as follows:

    B = randn(m,r)*randn(r,n); mgB = max(abs(B(:)));     % low-rank component B*
    A = zeros(m,n); p = randperm(m*n); L = round(spr*m*n);
    A(p(1:L)) = mgB.*sign(randn(L,1));                   % impulsive sparse matrix, or
    A(p(1:L)) = randn(L,1);                              % Gaussian sparse matrix
    C = A + B;

Specifically, we set m = n = 100 in all experiments and tested (r, spr) = (1, 1%), (2, 2%), ..., (20, 20%), for which the decomposition problem roughly changes from easy to hard. Based on our experimental results, the set of suitable values of t shrinks from [0.05, 0.2] to [0.09, 0.1] as (r, spr) increases from (1, 1%) to (20, 20%). Therefore, we set t = 0.1 throughout. In the following, we present experimental results for the two types of sparse matrices. We simply set $\beta = 0.25\,mn/\|C\|_{\ell_1}$ and terminated ADM when the relative change of the iterates falls below $10^{-6}$, i.e.,

$$\mathrm{RelChg} := \frac{\|(A^{k+1}, B^{k+1}) - (A^k, B^k)\|_F}{\|(A^k, B^k)\|_F + 1} \le 10^{-6}. \tag{3.4}$$
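For reference, a hypothetical end-to-end run (names ours) can be assembled from the generation script above and the adm_slrmd sketch of Section 2:

    % Hypothetical driver: generate a test instance and recover it with ADM.
    m = 100; n = 100; r = 10; spr = 0.10; t = 0.1;
    B = randn(m,r)*randn(r,n);                     % low-rank component B*
    A = zeros(m,n); p = randperm(m*n); L = round(spr*m*n);
    A(p(1:L)) = randn(L,1);                        % Gaussian sparse component A*
    C = A + B;
    gamma = t/(1 - t);                             % as in (3.1)
    beta  = 0.25*m*n/norm(C(:), 1);                % penalty parameter, as above
    [Ah, Bh] = adm_slrmd(C, gamma, beta, 1000, 1e-6);
    RelErr = norm([Ah - A, Bh - B], 'fro')/(norm([A, B], 'fro') + 1)   % cf. (3.2)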

 


3.2. Exact recoverability. In this section, we present numerical results that demonstrate the exact recoverability of model (3.1). For each pair (r, spr), we ran 50 trials and claimed successful recovery when $\mathrm{RelErr} < \epsilon$ for some $\epsilon > 0$. The probability of exact recovery is then defined as the rate of successful recovery. The following results show the recoverability of (3.1) for $r$ varying from 1 to 40 and $\mathrm{spr}$ from 1% to 35%.

[Figure omitted: three panels titled "Recoverability, RelErr $\le 10^{-6}$", "Recoverability, RelErr $\le 10^{-5}$" and "Recoverability, RelErr $\le 10^{-3}$", each plotting the recovery rate against rank (1 to 40).]

FIG. 3.1. Recoverability results from impulsive sparse matrices. Number of trials is 50.

[Figure omitted: three panels titled "Recoverability, RelErr $\le 10^{-4}$", "Recoverability, RelErr $\le 10^{-3}$" and "Recoverability, RelErr $\le 10^{-2}$", each plotting the recovery rate against rank (1 to 40).]

FIG. 3.2. Recoverability results from Gaussian sparse matrices. Number of trials is 50.

It can be seen from Figures 3.1 and 3.2 that model (3.1) exhibits exact recoverability when either the sparsity ratio of $A^*$ or the rank of $B^*$ is suitably small. Specifically, for impulsive sparse matrices, the resulting relative errors are as small as $10^{-5}$ for a large number of test problems. When $r$ is small, say $r < 5$, ADM applied to (3.1) produces faithful recoveries with $\mathrm{spr}$ as high as 35%, while, on the other hand, highly accurate recovery is attainable for $r$ as high as 30 when $\mathrm{spr} < 5\%$. Comparing Figures 3.1 and 3.2, it is clear that, under the same conditions on sparsity ratio and matrix rank, the probability of highly accurate recovery is smaller when $A^*$ is a Gaussian sparse matrix, which is quite reasonable since impulsive errors are generally easier to eliminate than Gaussian errors.

3.3. Recovery results. In this section, we present two classes of recovery results, on both impulsive and Gaussian sparse matrices. Figure 3.3 shows the results of 100 trials for several pairs $(r, \mathrm{spr})$, where the x-axes represent the resulting relative errors (relative errors falling into $[10^{-(i+1)}, 10^{-i})$ are averaged) and the y-axes represent the number of trials in which ADM attains a relative error in the corresponding interval. It can be seen from Figure 3.3 that, for impulsive sparse matrices (left plot), the resulting relative errors are mostly quite small, and poor-quality recovery appears in fewer than 20% of the 100 random trials. In comparison, for Gaussian sparse matrices (right plot), the resulting relative errors are mostly between $10^{-3}$ and $10^{-4}$, and it is generally rather difficult to obtain higher accuracy unless both $r$ and $\mathrm{spr}$ are quite small.

[Figure omitted: two histogram panels.]

FIG. 3.3. Recovery results of 100 trials. Left: results for impulsive sparse matrices; right: results for Gaussian sparse matrices. In both plots, the x-axes represent the relative error of recovery (relative errors falling into $[10^{-(i+1)}, 10^{-i})$ are averaged) and the y-axes represent the number of trials in which ADM attains a relative error in the corresponding interval.

In Figures 3.4 to 3.7, we present two results for each type of sparse matrix; the rank of $B^*$, the sparsity ratio of $A^*$, the number of iterations used by ADM, the resulting relative errors to $A^*$ and $B^*$ (defined in a similar way as in (3.2)), and the total relative error RelErr defined in (3.2) are given in the captions. The sparsity ratios and matrix ranks used in all these tests lie near the boundary of highly accurate recovery, as can be seen in Figures 3.1 and 3.2. These results show that high accuracy is attainable even in such boundary cases. They also suggest that problem (3.1) is usually much easier to solve when $C$ is corrupted by impulsive sparse errors than by Gaussian errors.

 


[Figure omitted: panels "True Sparse", "Recovered Sparse", "True LowRank", "Recovered LowRank".]

FIG. 3.4. Results #1 from impulsive sparse matrix. Rank: 10, sparsity ratio: 20%, number of iterations: 39, relative error in sparse matrix: $1.06 \times 10^{-6}$, relative error in low rank: $2.22 \times 10^{-6}$, relative error in total: $1.33 \times 10^{-6}$.

[Figure omitted: panels "True Sparse", "Recovered Sparse", "True LowRank", "Recovered LowRank".]

FIG. 3.5. Results #2 from impulsive sparse matrix. Rank: 20, sparsity ratio: 10%, number of iterations: 45, relative error in sparse matrix: $1.55 \times 10^{-6}$, relative error in low rank: $1.79 \times 10^{-6}$, relative error in total: $1.63 \times 10^{-6}$.

 


[Figure omitted: panels "True Sparse", "Recovered Sparse", "True LowRank", "Recovered LowRank".]

FIG. 3.6. Results #1 from Gaussian sparse matrix. Rank: 10, sparsity ratio: 20%, number of iterations: 163, relative error in sparse matrix: $1.88 \times 10^{-3}$, relative error in low rank: $2.26 \times 10^{-4}$, relative error in total: $3.37 \times 10^{-4}$.

[Figure omitted: panels "True Sparse", "Recovered Sparse", "True LowRank", "Recovered LowRank".]

FIG. 3.7. Results #2 from Gaussian sparse matrix. Rank: 20, sparsity ratio: 10%, number of iterations: 390, relative error in sparse matrix: $7.84 \times 10^{-4}$, relative error in low rank: $5.82 \times 10^{-5}$, relative error in total: $8.28 \times 10^{-5}$.

 


4. Conclusions. This paper emphasizes the applicability of the alternating direction method (ADM) to the sparse and low-rank matrix decomposition (SLRMD) problem, and thus numerically reinforces the pioneering work of [4] on this topic. It has been shown that existing ADM-type methods can readily be extended to solve the SLRMD problem, and that the implementation is straightforward, since both subproblems generated at each iteration admit explicit solutions. Preliminary numerical results affirm the efficiency of ADM for the SLRMD problem.

REFERENCES

[1] D. P. BERTSEKAS, Constrained Optimization and Lagrange Multiplier methods, Academic Press, 1982.

[2] D. P. BERTSEKAS AND J. N. TSITSIKLIS, Parallel and distributed computation: Numerical methods, Prentice-Hall, Englewood Cliffs, NJ, 1989.

[3] J. F. CAI, E. J. CANDÈS AND Z. W. SHEN, A singular value thresholding algorithm for matrix completion, preprint, available at http://arxiv.org/abs/0810.3286.

[4] V. CHANDRASEKARAN, S. SANGHAVI, P. A. PARRILO AND A. S. WILLSKY, Rank-sparsity incoherence for matrix decomposition, manuscript, http://arxiv.org/abs/0906.2220.

[5] E. J. CANDÈS, J. ROMBERG AND T. TAO, Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information, IEEE Transactions on Information Theory, 52 (2), pp. 489-509, 2006.

[6] E. J. CANDÈS AND B. RECHT, Exact matrix completion via convex optimization, manuscript, 2008.

[7] G. CHEN, AND M. TEBOULLE, A proximal-based decomposition method for convex minimization problems, Mathematical Programming, 64 (1994), pp. 81-101.

[8] S. CHEN, D. DONOHO, AND M. SAUNDERS, Atomic decomposition by basis pursuit, SIAM Journal on Scientific Computing, 20 (1998), pp. 33-61.

[9] D. L. DONOHO AND X. HUO, Uncertainty principles and ideal atomic decomposition, IEEE Transactions on Information Theory, 47 (7), pp. 2845-2862, 2001.

[10] D. L. DONOHO AND M. ELAD, Optimal sparse representation in general (nonorthogonal) dictionaries via l1 minimization, Proceedings of the National Academy of Sciences, 100, pp. 2197-2202, 2003.

[11] D. L. DONOHO, Compressed sensing, IEEE Transactions on Information Theory, 52 (4), pp. 1289-1306, 2006.

[12] J. ECKSTEIN AND M. FUKUSHIMA, Some reformulations and applications of the alternating direction method of multipliers, In: Hager, W. W. et al. eds., Large Scale Optimization: State of the Art, Kluwer Academic Publishers, pp. 115-134, 1994.

[13] E. ESSER, Applications of Lagrangian-based alternating direction methods and connections to split Bregman, Manuscript, http://www.math.ucla.edu/applied/cam/.

[14] M. FAZEL, Matrix rank minimization with applications, PhD thesis, Stanford University, 2002.

[15] M. FAZEL AND J. GOODMAN, Approximations for partially coherent optical imaging systems, Technical Report, Stanford University, 1998.

[16] M. FAZEL, H. HINDI, AND S. BOYD, A rank minimization heuristic with application to minimum order system approximation, Proceedings American Control Conference, 6 (2001), pp. 4734-4739.

[17] M. FAZEL, H. HINDI AND S. BOYD, Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices, Proceedings of the American Control Conference, 2003.

[18] M. FUKUSHIMA, Application of the alternating direction method of multipliers to separable convex programming problems, Computational Optimization and Applications, 1(1992), pp. 93-111.

[19] D. GABAY, Application of the method of multipliers to variational inequalities, In: Fortin, M., Glowinski, R., eds., Augmented Lagrangian Methods: Application to the Numerical Solution of Boundary-Value Problems, North-Holland, Amsterdam, The Netherlands, pp. 299-331, 1983.

[20] D. GABAY AND B. MERCIER, A dual algorithm for the solution of nonlinear variational problems via finite element approximations, Computational Mathematics with Applications, 2 (1976), pp. 17-40.

 


[21] R. GLOWINSKI, Numerical methods for nonlinear variational problems, Springer-Verlag, New York, Berlin, Heidelberg, Tokyo, 1984.

[22] R. GLOWINSKI AND P. LE TALLEC, Augmented Lagrangian and Operator Splitting Methods in Nonlinear Mechanics, SIAM Studies in Applied Mathematics, Philadelphia, PA, 1989.

[23] G. H. GOLUB AND C. F. VAN LOAN, Matrix Computation, third edition, The Johns Hopkins University Press, 1996.

[24] B. S. HE, L. Z. LIAO, D. HAN AND H. YANG, A new inexact alternating directions method for monotone variational inequalities, Mathematical Programming, 92 (2002), pp. 103-118.

[25] B. S. HE, M. H. XU AND X. M. YUAN, Solving large-scale least squares covariance matrix problems by alternating direction methods, Submission, 2009.

[26] B.S. HE AND H. YANG, Some convergence properties of a method of multipliers for linearly constrained monotone variational inequalities, Operations Research Letters, 23 (1998), pp. 151-161.

[27] B. S. HE AND X. M. YUAN, The unified framework of some proximal-based decomposition methods for monotone variational inequalities with separable structure, Submission, 2009.

[28] S. KONTOGIORGIS AND R. R. MEYER, A variable-penalty alternating directions method for convex optimization, Mathematical Programming, 83(1998), pp. 29-53.

[29] R. M. LARSEN, PROPACK-Software for large and sparse SVD calculations, Available from http://sun.stanford.edu/srmunk/PROPACK/.

[30] S. L. LAURITZEN, Graphical Models, Oxford University Press, 1996.

[31] S. Q. MA, D. GOLDFARB AND L. F. CHEN, Fixed point and Bregman iterative methods for matrix rank minimization, preprint, 2008.

[32] M. NG, P. A. WEISS AND X. M. YUAN, Solving constrained total-Variation problems via alter-nating direction methods, Manuscript, 2009.

[33] Y. NESTEROV, Gradient methods for minimizing composite objective function, CORE Discus-sion Paper 2007/76, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, Belgium, 2007.

[34] J. NOCEDAL AND S. J. WRIGHT, Numerical Optimization, Second Edition, Springer-Verlag, 2006.

[35] Y. C. PATI AND T. KAILATH, Phase-shifting masks for microlithography: Automated design and mask requirements, Journal of the Optical Society of America A, 11 (9), 1994.

[36] B. RECHT, M. FAZEL AND P. A. PARRILO, Guaranteed Minimum Rank Solutions to Linear Matrix Equations via Nuclear Norm Minimization, submitted to SIAM Review, 2007.

[37] E. SONTAG, Mathematical Control Theory, Springer-Verlag, New York, 1998.

[38] R. TIBSHIRANI, Regression shrinkage and selection via the LASSO, Journal of the Royal Statistical Society, Series B, 58 (1996), pp. 267-288.

[39] P. TSENG, Alternating projection-proximal methods for convex programming and variational inequalities, SIAM Journal on Optimization, 7(1997), pp. 951-965.

[40] R. H. TÜTÜNCÜ, K. C. TOH AND M. J. TODD, Solving semidefinite-quadratic-linear programs using SDPT3, Mathematical Programming, 95 (2003), pp. 189-217.

[41] L. G. VALIANT, Graph-theoretic arguments in low-level complexity, 6th Symposium on Mathematical Foundations of Computer Science, pp. 162-176, 1977.

[42] C. H. YE AND X. M. YUAN, A descent method for structured monotone variational inequalities, Optimization Methods and Software, 22 (2007), pp. 329-338.

[43] Z. W. WEN, D. GOLDFARB AND W. YIN, Alternating direction augmented Lagrangian methods for semidefinite programming, manuscript, 2009.

[44] X. M. YUAN, Alternating Direction Methods for Sparse Covariance Selection, Submission, 2009.

 
