SvmFu 3 is a package for training and testing Support Vector
Machines. It is written in C++, does not
require any third-party optimization engine, and is very fast. The
algorithm is somewhat inspired by Platt's SVM algorithm, in that it
optimizes a pair of points at each iteration, with many important
twists of my own. SvmFu will only handle classification, not
regression. There will be a paper or tech report about SvmFu soon.
This is version 3.1 of SvmFu. It is a complete rewrite. It is much
faster and more featureful. If you've been using a previous version,
you should definitely read the changes section,
as the user interface has changed substantially.
If you're interested in hearing about upgrades and new features, drop me an email. If you use the
software and find it useful, or have any suggestions for improvements,
drop me an email. If you have questions, check out the
FAQ before sending me email. Thanks, and enjoy!
SvmFu is © 2000 Ryan Rifkin and MIT and is released under the
GPL.
In particular,
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General
Public License for more details.
SvmFu is currently available as source code only. It should
compile with g++ version 2.95 or above on any Unix-like
system, such as Linux, Solaris, or under the
Cygwin system for
Windows.
SvmFu 3 can be
downloaded via this link.
SvmFu uses the Autoconf system for auto-configuration. To build the
package, unpack the distribution into an empty directory and execute
the following commands:
This will create the following programs under the src/clients subdirectory:
Note that you will almost certainly need the g++ compiler
to compile SvmFu. The particular use of templates in a library
combined with the general state of flux with regards to C++ templates
means that you will probably not have much luck compiling with other
C++ implementations. SvmFu has been tested with g++ versions
2.95 and up to (but not including) GCC 3.0.
The SvmFu binaries are standalone and do not depend on any particular
directory structure or location. You can place them anywhere you wish.
(Note: SvmFu no longer has a "tests" subdirectory. The effort to
maintain these checks produced more pain than the checks produced value.)
Working with SVMs is a two step process: training and testing. The
training and testing programs are separate. The training program
reads in a set of training points (vectors) and saves the trained
Support Vector Machine to a file. The testing program reads in a set
of testing points and loads the saved SVM.
SvmFu supports different data types for the data elements (i.e., the
points read in from the supplied training or testing file) and the
internal kernel values. Support for each combination of data element
and kernel value type must be explicitly compiled into SvmFu. By
default, the following four combinations are supported:
License
Getting SvmFu
Downloading
Compiling and Installing
./configure
make
(See the INSTALL file for generic information about the
configure script and options you can pass to it).
svmfutrain
svmfutest
svmfutrainmulti
mrep.pl
Using SvmFu
Overview
Data Types
| data element | kernel value |
| int | int |
| float | float |
| double | double |
| int | float |
where int is a 32-bit integer, float is a 32-bit IEEE single-precision floating point number, and double is a 64-bit IEEE double-precision floating point number. If other combinations of types are required, it is necessary to add two lines to the SvmFu source code and recompile; this does not require any knowledge of C or C++. The relevant location is in src/lib/SvmFuSvmTypes.h around line 50, and relevant documentation is inside the code.
Types are specified to the svmfutrain family of programs
by the -D and -K parameters, as described in the
clients portion of this documentation.
The training and testing sets are lists of points and their
respective classes. They are supplied to SvmFu as plain ASCII
text files in one of three formats: dense, sparseN, or sparse01.
For all formats, the data consists of (possibly signed and possibly
floating-point) numbers separated by whitespace. The particular
choice or amount of whitespace does not matter.
For example:
This example represents the same data as the
dense case:
This example represents data similar to the
dense case,
except that non-zero component values have been set to 1.
A typical and simple usage of the SvmFu clients is as follows:
svmfutrain is used to train binary classifiers. The input
data is expected to consist of points classified as either belonging to
class "1" or "-1". The following command-line options are supported:
svmfutest takes an input file containing the test
points and their correct classifications and tests the performance
of a trained SVM on them. It takes only three arguments; the SVM,
kernel, and data type parameters are all taken from the saved SVM
file.
svmfutrainmulti is the SVM client for doing multiclass
classification. The datasets for svmfutrainmulti are
identical in form to those for the other clients, except that the
class values at the end of each line are non-negative integers (if you
have 10 classes, use the integers 0 through 9 for labels).
Most of the options are the same as for svmfutrain, and are
described there. Here we describe options unique to
svmfutrainmulti.
For example, the following file would specify one-vs-all classifiers
on five classes:
To get all-pairs on four classes, we'd use:
Arbitrary combinations of 1's, 0's, and -1's are possible, allowing
for the implementation of the coding scheme of your choice.
svmfutrainmulti will train one classifier for each line of
your splits file.
The svm save files now contain information about what kernel and
parameters were used by svmfutrain, so these options no
longer need to be specified to svmfutest, eliminating a major
source of error.
The sparse file formats no longer require padding of the data,
resulting in much shorter files when the data points have widely
varying numbers of non-zero elements. See the data formats section for more info.
Data Format
The input file contains, in order,
3 5
1 0 5 0 9 -1
2 4 0 0 0 1
0 0 7 6 2 1
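As a rough illustration of the dense layout, here is a minimal Python sketch of a reader. It assumes, based only on the example above, that the file starts with the number of points and the dimension, and that each point is its components followed by its class. The function name parse_dense is hypothetical and is not part of SvmFu.

```python
def parse_dense(text):
    """Parse dense-format text into (points, labels).

    Assumed layout (inferred from the example, not from the SvmFu source):
    number of points, dimension, then for each point its components
    followed by its class label. Whitespace layout is irrelevant.
    """
    tokens = text.split()
    n_points, dim = int(tokens[0]), int(tokens[1])
    values = tokens[2:]
    row_len = dim + 1  # components plus the trailing class label
    if len(values) != n_points * row_len:
        raise ValueError("file length does not match its header")
    points, labels = [], []
    for i in range(n_points):
        row = values[i * row_len:(i + 1) * row_len]
        points.append([float(v) for v in row[:dim]])
        labels.append(int(row[dim]))
    return points, labels


example = """3 5
1 0 5 0 9 -1
2 4 0 0 0 1
0 0 7 6 2 1"""
points, labels = parse_dense(example)
```

Because the tokens are split on arbitrary whitespace, the same function accepts a file written on a single line.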
This format is useful for when the data is sparse; that is, when the
majority of the components of each point are zero. You need only
specify the non-zero components, giving both the dimension index and
the value for each one. The input file
contains, in order,
3
6 1 1 3 5 5 9 -1
4 1 2 2 4 1
6 3 7 4 6 5 2 1
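To make the sparseN layout concrete, the following Python sketch expands sparseN text into dense rows. It assumes, from the example above, that indices are 1-based and that the number preceding each point counts index and value tokens together (so it is twice the number of non-zero components). The dimension is passed in by the caller because the example file does not carry it. The name parse_sparse_n is hypothetical.

```python
def parse_sparse_n(text, dim):
    """Expand sparseN-format text into dense (points, labels).

    Assumptions (inferred from the example, not from the SvmFu source):
    1-based dimension indices, and a per-point count equal to twice the
    number of non-zero components.
    """
    tokens = text.split()
    n_points = int(tokens[0])
    pos = 1
    points, labels = [], []
    for _ in range(n_points):
        count = int(tokens[pos])  # index tokens plus value tokens
        pos += 1
        row = [0.0] * dim
        for k in range(0, count, 2):
            index = int(tokens[pos + k])
            row[index - 1] = float(tokens[pos + k + 1])
        pos += count
        labels.append(int(tokens[pos]))  # class label closes the point
        pos += 1
        points.append(row)
    return points, labels


sparse_example = """3
6 1 1 3 5 5 9 -1
4 1 2 2 4 1
6 3 7 4 6 5 2 1"""
sparse_points, sparse_labels = parse_sparse_n(sparse_example, dim=5)
```

Under these assumptions the result is the same three points as in the dense example.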
Like sparseN, but useful
for when all of the vector components are either zero or one.
Basically identical to sparseN except that all values are assumed
to be 1 (and the number preceding each point is not doubled). The
input file contains, in order,
3
3 1 3 5 -1
2 1 2 1
3 3 4 5 1
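Going the other way, this Python sketch writes 0/1 data in the sparse01 layout. It assumes 1-based indices and a per-point count equal to the number of non-zero components, as the example above suggests. The name to_sparse01 is hypothetical.

```python
def to_sparse01(points, labels):
    """Format 0/1 points as sparse01 text.

    Assumed layout (inferred from the example): number of points, then
    for each point the count of non-zero components, their 1-based
    indices, and the class label.
    """
    lines = [str(len(points))]
    for row, label in zip(points, labels):
        indices = [i + 1 for i, v in enumerate(row) if v != 0]
        fields = [len(indices), *indices, label]
        lines.append(" ".join(str(f) for f in fields))
    return "\n".join(lines)


binary_points = [[1, 0, 1, 0, 1], [1, 1, 0, 0, 0], [0, 0, 1, 1, 1]]
binary_labels = [-1, 1, 1]
sparse01_text = to_sparse01(binary_points, binary_labels)
```

With this input, sparse01_text reproduces the sparse01 example above line for line.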
Clients
svmfutrain -f dense.train -C 2 -s svm.out
svmfutest -f dense.test -s svm.out
Each of the programs has a number of command-line options, detailed below.
Svmfutrain
Svmfutest
Svmfutrainmulti
5 5
1 -1 -1 -1 -1
-1 1 -1 -1 -1
-1 -1 1 -1 -1
-1 -1 -1 1 -1
-1 -1 -1 -1 1
6 4
1 -1 0 0
1 0 -1 0
1 0 0 -1
0 1 -1 0
0 1 0 -1
0 0 1 -1
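The two splits files above follow a regular pattern, so they can be generated rather than typed. The Python sketch below assumes the header line is the number of classifiers followed by the number of classes, which matches both examples. The function names are hypothetical.

```python
from itertools import combinations


def one_vs_all(n_classes):
    """One row per class: that class is +1, every other class is -1."""
    return [[1 if c == r else -1 for c in range(n_classes)]
            for r in range(n_classes)]


def all_pairs(n_classes):
    """One row per pair (a, b): a is +1, b is -1, the rest are 0 (unused)."""
    rows = []
    for a, b in combinations(range(n_classes), 2):
        row = [0] * n_classes
        row[a], row[b] = 1, -1
        rows.append(row)
    return rows


def format_splits(rows):
    """Render rows with an assumed 'classifiers classes' header line."""
    header = f"{len(rows)} {len(rows[0])}"
    body = [" ".join(str(v) for v in row) for row in rows]
    return "\n".join([header] + body)


ova_text = format_splits(one_vs_all(5))
pairs_text = format_splits(all_pairs(4))
```

For k classes, one-vs-all yields k classifiers and all-pairs yields k(k-1)/2, which is worth keeping in mind for training time.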
Mrep
mrep is the program you use to interpret the results of
svmfutrainmulti. Technically, it's not an SvmFu client per
se --- it's a perl script, and doesn't use the SvmFu library.
Nevertheless, this seems the place to describe it.
Usage: mrep [options] splitsFile [infile, -]
The required argument, splitsFile, is the same splitsFile
that was specified using the -p option to
svmfutrainmulti. The infile is the file that was
output from svmfutrainmulti via either the -l or
-F options. The other options are:
--report=[list, errors, confusion, accuracy] default: accuracy
--loss=[hinge, zero-one] default: hinge
--error-type=[no-tie,allow-tie,ratio] default: no-tie
--nolabel
The report option controls the form of the output.
list gives the class and the prediction for each point,
errors gives just the errors, confusion gives the
confusion matrix, and accuracy, the default, gives the
overall accuracy.
The loss option controls how errors in individual classifiers
contribute to the error for a column in the splits matrix (if this
doesn't make sense, read Allwein, Schapire, and Singer's paper, "Reducing
Multiclass to Binary").
The error-type option controls what happens in the case of ties.
no-tie treats all ties as errors, allow-tie treats
ties as correct, and ratio treats ties as partially correct.
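The following Python sketch shows one way loss-based decoding and tie handling can work. It is an illustration of the general technique from "Reducing Multiclass to Binary", not a transcription of mrep. Several details are assumptions: each class's loss is the sum over classifiers of loss(split entry * classifier output), the zero-one loss counts a non-positive margin as an error, and ratio credits 1/(number of tied classes). All function names are hypothetical.

```python
def hinge(z):
    """Hinge loss on the margin z."""
    return max(0.0, 1.0 - z)


def zero_one(z):
    """Zero-one loss; treating z <= 0 as an error is an assumption."""
    return 1.0 if z <= 0 else 0.0


def class_losses(splits, outputs, loss=hinge):
    """Total loss per class (column) given one output per classifier (row)."""
    n_classes = len(splits[0])
    return [sum(loss(row[c] * f) for row, f in zip(splits, outputs))
            for c in range(n_classes)]


def score(losses, true_class, error_type="no-tie"):
    """Credit for one point: 1.0 correct, 0.0 wrong, fractional for ratio."""
    best = min(losses)
    tied = [c for c, v in enumerate(losses) if v == best]
    if true_class not in tied:
        return 0.0
    if len(tied) == 1:
        return 1.0
    if error_type == "no-tie":
        return 0.0
    if error_type == "allow-tie":
        return 1.0
    if error_type == "ratio":
        return 1.0 / len(tied)  # assumed meaning of "partially correct"
    raise ValueError(f"unknown error type: {error_type}")


# One-vs-all on three classes, with made-up classifier outputs.
splits = [[1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
losses = class_losses(splits, [0.8, -0.5, -1.2])
predicted = losses.index(min(losses))
```

Here class 0 has the smallest total hinge loss, so it is the prediction. A zero entry in the splits matrix contributes a constant loss, so that classifier has no say about that class.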
The nolabel option generates predictions only, and is useful when the
labels are unknown. It can only be used with --report=list.
Tips
Chunking
In general, if you're solving on 2,000 or fewer points, you probably
don't want to use chunking. If you've got more than 2,000 points,
chunking with a chunksize of 2,000 is probably reasonable. Your own
mileage may vary; play with the numbers a bit and see what you get.
Note that if you use chunking, performance will become very slow if
the chunksize is smaller than the number of unbounded support
vectors.
Choosing values for tolerance and epsilon
If you do not
ensure that tol >> eps, the algorithm will
abort many of its steps because they end up being indistinguishable
from zero. Conventional wisdom is that a tol of
10E-3 is small enough. I have often found that when trying
to get the outputs from different SVM solvers to match closely (for
comparison purposes), a tol of 10E-6 produces much
better results.
WARNING: eta too small in takeStep.
and related errors.
SvmFu can get unhappy if your dataset contains identical data points
(two data points x and y for which K(x,x) = K(y,y) = K(x,y))
from different classes. The message should help you diagnose
this problem. If you get warnings like this, your answers are
probably not trustworthy. If you are sure there are no identical data
points in differing classes and you get this message, please email me.
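Before emailing, it may help to check the data directly. This Python sketch reports points that appear with more than one class label. It only catches exact coordinate duplicates; depending on the kernel, other pairs of points could also satisfy the kernel condition above, and this check will not find those. The function name is hypothetical.

```python
from collections import defaultdict


def conflicting_duplicates(points, labels):
    """Return {point: sorted labels} for points seen with several labels."""
    seen = defaultdict(set)
    for point, label in zip(points, labels):
        seen[tuple(point)].add(label)  # tuples are hashable, lists are not
    return {p: sorted(ys) for p, ys in seen.items() if len(ys) > 1}


sample_points = [[1, 0], [1, 0], [0, 1]]
sample_labels = [1, -1, 1]
conflicts = conflicting_duplicates(sample_points, sample_labels)
```

An empty result means no exact duplicates with conflicting labels were found.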
Changes
The biggest user-visible changes relate to compilation and file
formats. Previous to 3.0, SvmFu was templated, but was compiled for a
single type of data and kernel value at a time. Now, instantiations
for multiple types are compiled into a library, and the user can
select among them at runtime.
FAQ
Can SvmFu handle regression? If not, can you point me at another code that can?
SvmFu cannot do regression. It only does classification. There are no
immediate plans to add regression. I do not recommend other codes,
but a wide variety of codes are available at kernel-machines.org. I have
not evaluated these codes, so I cannot comment on their quality.