SvmFu Documentation

SvmFu 3


Last modified: Wed Oct 12 2005
ATTENTION! As of October 2005, I believe SvmFu is primarily of historical interest. When I first wrote SvmFu, there was a dearth of available SVM software. Since I finished my PhD, I have not had the time to maintain SvmFu. I understand that it does not compile with recent versions of gcc. I do not know if I will ever have time to update this code. You are welcome to use it "as is," but I am not able to provide support. In the interim, I recommend SVMlight or LIBSVM. --rif


Introduction

SvmFu 3 is a software package for training and applying Support Vector Machines (SVMs). It is written in C++, does not require any third-party optimization engine, and is very fast. The algorithm is loosely inspired by Platt's SVM algorithm, in that it optimizes a pair of points at each iteration, but it adds many important twists of my own. SvmFu handles classification only, not regression. There will be a paper or tech report about SvmFu soon.

This is version 3.1 of SvmFu. It is a complete rewrite, and it is much faster and has more features than earlier versions. If you have been using a previous version, you should definitely read the changes section, as the user interface has changed substantially.

If you are interested in hearing about upgrades and new features, drop me an email. If you use the software and find it useful, or have any suggestions for improvements, drop me an email. If you have questions, check out the FAQ before sending me email. Thanks, and enjoy!

License

SvmFu is © 2000 Ryan Rifkin and MIT and is released under the GPL. In particular,

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.


Getting SvmFu

SvmFu is currently available as source code only. It should compile with g++ version 2.95 or above on any Unix-like system, such as Linux or Solaris, or under the Cygwin system for Windows (but see the compiler notes under Compiling and Installing).

Downloading

SvmFu 3 can be downloaded via this link.

Compiling and Installing

SvmFu uses the Autoconf system for auto-configuration. To build the package, unpack the distribution into an empty directory and execute the following commands:

     ./configure
     make
(See the INSTALL file for generic information about the configure script and the options you can pass to it.)

This will create the following programs under the src/clients subdirectory:

     svmfutrain
     svmfutest
     svmfutrainmulti
     mrep.pl

Note that you will almost certainly need the g++ compiler to compile SvmFu. The library's particular use of templates, combined with the general state of flux in C++ template support, means that you will probably not have much luck compiling with other C++ implementations. SvmFu has been tested with g++ versions from 2.95 up to, but not including, GCC 3.0.

The SvmFu binaries are standalone and do not depend on any particular directory structure or location. You can place them anywhere you wish.

(Note: SvmFu no longer has a "tests" subdirectory. The effort to maintain these checks produced more pain than the checks produced value.)


Using SvmFu

Overview

Working with SVMs is a two step process: training and testing. The training and testing programs are separate. The training program reads in a set of training points (vectors) and saves the trained Support Vector Machine to a file. The testing program reads in a set of testing points and loads the saved SVM.

Data Types

SvmFu supports different data types for the data elements (i.e., the points read in from the supplied training or testing file) and the internal kernel values. Support for each combination of data element type and kernel value type must be explicitly compiled into SvmFu. By default, the following four combinations are supported:

     data element    kernel value
     int             int
     float           float
     double          double
     int             float

where int is a 32-bit integer, float is a 32-bit IEEE single-precision floating point number, and double is a 64-bit IEEE double-precision floating point number. If other combinations of types are required, it is necessary to add two lines to the SvmFu source code and recompile; this does not require any knowledge of C or C++. The relevant location is in src/lib/SvmFuSvmTypes.h around line 50, and relevant documentation is inside the code.

Types are specified to the svmfutrain family of programs by the -D and -K parameters, as described in the clients portion of this documentation.
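
To get a feel for what the three element types can hold before choosing one, the following minimal Python sketch may help. It is not part of SvmFu. The helper names fits_int32 and to_float32 are hypothetical, and it assumes the 32-bit int is signed, which the text above does not state.

```python
import struct

# Range of a signed 32-bit integer (assumption: SvmFu's "int" is signed).
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def fits_int32(value):
    """Return True if value is an integer representable in 32 signed bits."""
    return isinstance(value, int) and INT32_MIN <= value <= INT32_MAX


def to_float32(value):
    """Round a Python float (a 64-bit double) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]


if __name__ == "__main__":
    print(fits_int32(2**31 - 1), fits_int32(2**31))
    # 0.5 survives the round trip exactly; 0.1 does not.
    print(to_float32(0.5) == 0.5, to_float32(0.1) == 0.1)
```

If your data values change when passed through to_float32 in ways that matter to you, the double/double combination is probably the safer choice.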

Data Format

The training and testing sets are lists of points and their respective classes. They are supplied to SvmFu as plain ASCII text files in one of three formats: dense, sparseN, or sparse01. For all formats, the data consists of (possibly signed and possibly floating-point) numbers separated by whitespace. The particular choice or amount of whitespace does not matter.
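
As an illustration of the whitespace rule only, here is a minimal Python sketch of how such a file can be tokenized. It is not SvmFu's own parser, the function name parse_numbers is hypothetical, and it deliberately says nothing about how the numbers are arranged in the dense, sparseN, or sparse01 formats.

```python
def parse_numbers(text):
    """Split text on any run of whitespace and convert each token to a number.

    Tokens that look like integers become int; everything else is parsed
    as a float (which also accepts forms such as "-2.5" or "4e-1").
    """
    values = []
    for token in text.split():  # split() treats spaces, tabs, newlines alike
        try:
            values.append(int(token))
        except ValueError:
            values.append(float(token))
    return values


if __name__ == "__main__":
    # Spaces, tabs, and blank lines all separate numbers equally well.
    print(parse_numbers("1 -2\t3.5\n\n   4e-1"))
```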


Clients

A typical and simple usage of the SvmFu clients is as follows:

svmfutrain -f dense.train -C 2 -s svm.out
svmfutest -f dense.test -s svm.out
Each of the programs has a number of command-line options, detailed below.

Svmfutrain

svmfutrain is used to train binary classifiers. The input data is expected to consist of points classified as either belonging to class "1" or "-1". The following command-line options are supported:

Svmfutest

svmfutest takes an input file containing test points and their correct classifications and measures the performance of a trained SVM on them. It takes only three arguments; the SVM, kernel, and data type parameters are all read from the saved SVM file.

Svmfutrainmulti

svmfutrainmulti is the SVM client for doing multiclass classification. The datasets for svmfutrainmulti are identical in form to those for the other clients, except that the class values at the end of each line are non-negative integers (if you have 10 classes, use the integers 0 through 9 for labels). Most of the options are the same as for svmfutrain, and are described there. Here we describe options unique to svmfutrainmulti.
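
If your own labels are strings or arbitrary integers, they need to be mapped to 0 through (number of classes - 1) before the data file is written. The following Python sketch shows one way to do that. It is a preprocessing aid rather than part of SvmFu, and the function name remap_labels is hypothetical.

```python
def remap_labels(labels):
    """Map arbitrary class labels to the integers 0..k-1.

    Returns (new_labels, classes), where classes[i] is the original label
    that was assigned the integer i, so predictions can be mapped back.
    """
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels], classes


if __name__ == "__main__":
    new_labels, classes = remap_labels(["cat", "dog", "cat", "bird"])
    print(new_labels)  # integer label per point
    print(classes)     # lookup table back to the original names
```

Keep the returned classes list around: it is the only record of which integer stands for which original class.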

Mrep

mrep is the program you use to interpret the results of svmfutrainmulti. Technically, it is not an SvmFu client: it is a Perl script and does not use the SvmFu library. Nevertheless, this seems the place to describe it. Usage:

     mrep [options] splitsFile [infile, -]

The required argument, splitsFile, is the same splitsFile that was specified using the -p option to svmfutrainmulti. The infile is the file that was output from svmfutrainmulti via either the -l or -F options. The other options are:

     --report=[list, errors, confusion, accuracy] default: accuracy
     --loss=[hinge, zero-one] default: hinge
     --error-type=[no-tie,allow-tie,ratio] default: no-tie
     --nolabel

The report option controls the form of the output: list gives the class and the prediction for each point, errors gives just the errors, confusion gives the confusion matrix, and accuracy, the default, gives the overall accuracy. The loss option controls how errors in individual classifiers contribute to the error for a column in the splits matrix (if this doesn't make sense, read the paper by Allwein, Schapire, and Singer, "Reducing Multiclass to Binary"). The error-type option controls what happens in the case of ties: no-tie treats all ties as errors, allow-tie treats ties as correct, and ratio treats ties as partially correct. The nolabel option generates predictions without scoring them, and is useful when the labels are unknown. It can only be used with --report=list.
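
To make the loss and error-type options more concrete, here is a minimal Python sketch of loss-based decoding in the style of the "Reducing Multiclass to Binary" paper. It is not mrep's code. Several things in it are assumptions: the code matrix is laid out with one row per class and one entry per binary classifier (the orientation of SvmFu's splits file may differ), a zero margin costs 0.5 under the zero-one loss, and "ratio" is taken to mean a credit of 1/(number of tied classes). All function names are hypothetical.

```python
def hinge(z):
    """Hinge loss of a margin z = (code entry) * (classifier output)."""
    return max(0.0, 1.0 - z)


def zero_one(z):
    """Zero-one loss of a margin; a margin of exactly 0 costs 0.5 (assumption)."""
    if z > 0:
        return 0.0
    if z < 0:
        return 1.0
    return 0.5


def tied_best_classes(code_matrix, outputs, loss=hinge):
    """Return every class index whose total loss is minimal.

    code_matrix[k][s] is the label (+1, -1, or 0) that binary classifier s
    should give to class k; outputs[s] is classifier s's real-valued output.
    """
    totals = [sum(loss(m * f) for m, f in zip(row, outputs))
              for row in code_matrix]
    best = min(totals)
    return [k for k, total in enumerate(totals) if total == best]


def credit(tied, true_class, error_type="no-tie"):
    """Score one prediction: 1.0 is fully correct, 0.0 is an error."""
    if true_class not in tied:
        return 0.0
    if len(tied) == 1 or error_type == "allow-tie":
        return 1.0
    if error_type == "ratio":
        return 1.0 / len(tied)  # assumption: equal share among tied classes
    return 0.0  # "no-tie": any tie counts as an error


if __name__ == "__main__":
    one_vs_all = [[1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
    tied = tied_best_classes(one_vs_all, [0.5, 0.5, -2.0])
    for error_type in ("no-tie", "allow-tie", "ratio"):
        print(error_type, credit(tied, 0, error_type))
```

In the example, classes 0 and 1 end up with the same total hinge loss, so the three error types give the point three different scores even though the classifier outputs are identical.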

Tips

Chunking

In general, if you're solving on 2,000 or fewer points, you probably don't want to use chunking. If you've got more than 2,000 points, chunking with a chunksize of 2,000 is probably reasonable. Your own mileage may vary; play with the numbers a bit and see what you get. Note that if you use chunking, performance will become very slow if the chunksize is smaller than the number of unbounded support vectors.

Choosing values for tolerance and epsilon

If you do not ensure that tol >> eps, the algorithm will abort many of its steps because they end up being indistinguishable from zero. Conventional wisdom is that a tol of 10E-3 is small enough. I have often found that when trying to get the outputs from different SVM solvers to match closely (for comparison purposes), a tol of 10E-6 produces much better results.

"WARNING: eta too small in takeStep." and related errors

SvmFu can get unhappy if your dataset contains identical data points from different classes (that is, two data points x and y from different classes for which K(x,x) = K(y,y) = K(x,y)). The message should help you diagnose this problem. If you get warnings like this, your answers are probably not trustworthy. If you are sure there are no identical data points in differing classes and you get this message, please email me.
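
One way to check a dataset for this problem before training is sketched below in Python. It is not part of SvmFu and the function names are hypothetical. The eta helper uses the usual definition from Platt's SMO algorithm, K(x,x) + K(y,y) - 2K(x,y); that SvmFu computes the same quantity is an assumption based on the wording of the warning. The duplicate check compares raw input vectors exactly, so it finds identical points but not points that merely coincide under a particular kernel.

```python
from collections import defaultdict


def conflicting_duplicates(points, labels):
    """Return each point that appears with more than one class label."""
    seen = defaultdict(set)
    for x, y in zip(points, labels):
        seen[tuple(x)].add(y)
    return [list(x) for x, ys in seen.items() if len(ys) > 1]


def linear_kernel(x, y):
    """Plain dot product, used here only as an example kernel."""
    return sum(a * b for a, b in zip(x, y))


def eta(x, y, kernel=linear_kernel):
    """Platt-style eta for a pair of points; it is 0 when
    K(x,x) = K(y,y) = K(x,y), the situation described above."""
    return kernel(x, x) + kernel(y, y) - 2 * kernel(x, y)


if __name__ == "__main__":
    points = [[1, 2], [3, 4], [1, 2]]
    labels = [1, 1, -1]
    print(conflicting_duplicates(points, labels))  # points to inspect
    print(eta(points[0], points[2]))               # eta for the duplicate pair
```

Removing or relabeling the points this reports is up to you; the sketch only locates them.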

Changes

The biggest user-visible changes relate to compilation and file formats. Previous to 3.0, SvmFu was templated, but was compiled for a single type of data and kernel value at a time. Now, instantiations for multiple types are compiled into a library, and the user can select among them at runtime.

The svm save files now contain information about what kernel and parameters were used by svmfutrain, so these options no longer need to be specified to svmfutest, eliminating a major source of error.

The sparse file formats no longer require padding of the data, resulting in much shorter files when the data points have widely varying numbers of non-zero elements. See the data formats section for more info.


FAQ

Can SvmFu handle regression? If not, can you point me at another code that can?

SvmFu cannot do regression. It only does classification. There are no immediate plans to add regression. I do not recommend other codes, but a wide variety of codes are available at kernel-machines.org. I have not evaluated these codes, so I cannot comment on their quality.

Is there a Windows or Windows NT Version of SvmFu?

SvmFu will work under Cygwin on Windows, so yes. If what you are really asking is, "Is there a Visual C++ version?", then the answer is no, and I'm not sure when it will happen. Some other people in the lab were supposedly working on it, but I don't think that project is currently active. If you send me an email asking to be put on the SvmFu mailing list (VERY low traffic), you'll get occasional updates on the situation. Asking me when it's going to happen won't make it happen any sooner. Of course, since SvmFu is open source, you could put together a Windows version yourself...

Help! My code doesn't compile! Here's the output...

Have you checked to make sure you have the Standard Template Library properly installed where the compiler can find it? Are you using a recent version of g++? If you don't know the answer to either of these questions, talk to your local system administrator. If the answer to both these questions is "yes", then you can go ahead and email me and I'll try to help you.

When will a publication be ready?

I'm not sure. Soon I hope. Asking me in an email won't make it happen faster.

Will you modify SvmFu to do XXXX?

You can ask, but probably not. Particularly if the modification you want is a simple modification of the output format, I suggest you write a perl script to postprocess, or go into the code and tinker with it yourself. It is open source, after all. If you need more substantial modifications or assistance and are interested in paying consulting rates, drop me an email.

Credits and Acknowledgements

This software was developed by Ryan Rifkin at MIT's Center for Biological and Computational Learning, and at Compaq's Cambridge Research Lab. Many thanks to Pedro Moreno, Henry Nicponski, Mariano Alvira, Jim Paris, Virgil King, Michelle Nadermann and others whose names I'm likely forgetting.