◆ Common Lisp Testing Frameworks
I’ve been evaluating the various Common Lisp testing frameworks. I have a large body of code in my Project Euler stuff, and I use unit testing extensively; extensively enough that the simplistic unit testing package I had been using was getting unwieldy. So I figured I’d take a look at what was available and see how everything stacked up.
Since this page is long, I’ll present my summary up front: if your testing needs are simple, use lisp-unit. If you need many, well-organized tests (as I now do), use either Stefil or FiveAM, depending on whether you want highly interactive or noninteractive tests, respectively.
First, a bit about my requirements. I use packages extensively in my Project Euler setup. I have a main package, :euler, which contains “common” code: both things that are used extensively, like my primality-testing function, and things that might only be used in one place but seemed like they belonged in the common section, like the implementation of Dijkstra’s shortest-path algorithm or the function for finding a minimum spanning tree. I have unit tests for most of these functions. (Not all yet, but I’m working on it.) There is a separate package for each problem, into which goes at least a function named get-answer as well as any other problem-specific code and tests. The problem-specific tests include at least tests to check my code against the examples given in the problems and (when I’ve gotten it) a test for the answer to the problem. I generally want to do one of three things: test everything, test just one problem’s code, or test the library functions. I occasionally want to test just the answers to all of the problems, too.
In setting up the tests, I wrote a few macros and put all of my tests in those so I could redefine the test framework easily. When I could use hierarchies, I set them up as follows: suite euler-tests is the root of the tree. It contains euler-functions and euler-problems. euler-functions contains a suite/test (depending on the particular framework) for each common function. euler-problems contains one suite for each problem, of the form euler-<problem-number>-tests. That suite contains a test/suite for the examples given in the problem (euler-<problem-number>-givens), one for the answer (euler-<problem-number>-answer), and possibly one for each internal function I want to test (euler-<problem-number>-<function-name>). In my example runs, I’ve deliberately introduced several errors: a function signalling an error when it shouldn’t and vice versa, a function returning t when it should return nil and vice versa, and a couple of functions that return an incorrect (numerical) value.
§ lisp-unit
lisp-unit was the package I’d been using, and I like it because it’s dead simple to work with. It uses packages to group the tests, so your existing package structure plays the part of test suites. Obviously, those groups aren’t hierarchical, let alone composable. There are no fixtures. While the design loses some flexibility by being tied directly to the package system, it makes up for it in simplicity. Because there’s no need to declare and organize your test suites, adding a test is as simple as
(lisp-unit:define-test <test-name>
  <test-clauses>)
The test clauses use the fairly obvious assert-true, assert-false, assert-equal, and assert-error tests (plus a couple of others).
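For instance, a test for my primality functions might look something like this (the clauses are illustrative; whether primep signals an error on a non-number is an assumption for the sake of the example):
(lisp-unit:define-test euler-primep
  (lisp-unit:assert-true (primep 7))
  (lisp-unit:assert-false (primep 8))
  (lisp-unit:assert-equal 5 (next-prime 3))
  (lisp-unit:assert-error 'error (primep "seven")))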
Running the tests is either a matter of (run-tests <test-names>), which runs tests in the current package (if no names are given, all tests are run); or (run-all-tests <package> <test-names>) to use a different package (again, listing no tests will run them all).
The main drawbacks for large bodies of tests are that there’s no composition of test suites, so you have to give explicit commands to run all of your tests, and that it prints one line for each test. Here’s a sample run:
* (eval
   `(progn
      (lisp-unit:run-all-tests euler)
      ,@(loop for i from 1 to 142
              collect `(lisp-unit:run-all-tests
                        ,(intern (format nil "E~3,'0D" i))))))
EULER-BOUNCYP: 17 assertions passed, 0 failed.
EULER-LEAST-DIVISIBLE-REPUNIT: 2 assertions passed, 0 failed.
EULER-MODPOWER: 6 assertions passed, 0 failed.
EULER-NEXT-PRIME: 9 assertions passed, 0 failed.
EULER-NUM-PARTITIONS: 12 assertions passed, 0 failed.
EULER-PALINDROMEP: (PALINDROMEP 10221) failed:
Expected T but saw NIL
EULER-PALINDROMEP: 35 assertions passed, 1 failed.
EULER-PREV-PRIME: 7 assertions passed, 0 failed.
EULER-PRIMEP: (PRIMEP 2) failed:
Should have signalled ERROR but saw T
EULER-PRIMEP: 11 assertions passed, 1 failed.
EULER-UNIQUE-DIGITS-P: (UNIQUE-DIGITS-P 1) failed:
Expected NIL but saw T
EULER-UNIQUE-DIGITS-P: 9 assertions passed, 1 failed.
TOTAL: 108 assertions passed, 3 failed, 0 execution errors.
EULER-001-ANSWER: 1 assertions passed, 0 failed.
EULER-001-GIVENS: (GET-ANSWER 10) failed:
Expected 24 but saw 23
EULER-001-GIVENS: 0 assertions passed, 1 failed.
TOTAL: 1 assertions passed, 1 failed, 0 execution errors.
EULER-002-ANSWER: 1 assertions passed, 0 failed.
EULER-002-GIVENS: 1 assertions passed, 0 failed.
TOTAL: 2 assertions passed, 0 failed, 0 execution errors.
[ 72 lines snipped ]
EULER-035-GIVENS: 1 assertions passed, 0 failed.
EULER-035-ROTATE-NUM: (ROTATE-NUM 1 1) failed:
Expected 2 but saw 1
EULER-035-ROTATE-NUM: 9 assertions passed, 1 failed.
TOTAL: 11 assertions passed, 1 failed, 0 execution errors.
[ 123 lines snipped ]
EULER-095-ANSWER: The value -1 is not of type (MOD 536870911).
[ 87 lines snipped ]
In most cases, the failure report is very informative: it gives the code that was run, the expected return value, and the actual value. The only exception is when an unexpected error was signalled; in that case, only the test name is given. The problem, as you can see, is that these informative failure messages get lost among all the noise of the successful tests. The lack of composition also means that you don’t even get a count of total failed tests at the end. (That is a failing of many of the available Common Lisp test frameworks.)
lisp-unit’s main strength is its ease of use. For simple testing situations, it’s easily the best framework available, because there’s no barrier to writing all the tests you need, and its failure messages are useful and informative. Its main drawback is that it doesn’t scale well. If your testing needs are simple, then lisp-unit is probably the best choice for you.
§ LIFT
Initially, LIFT looked good. It’s obviously influenced by JUnit’s way of doing things. Test suites are defined very much like CLOS classes; they have a hierarchical inheritance structure (with multiple inheritance, even) and can have slots just like CLOS classes. Fixtures for setting up and tearing down resources for the tests are supported (with the resources referenced by custom slots). The test running function simply returns a test result object, but the package uses the pretty-printing function for that object to detail the test results, which is rather neat and would lend itself well to automated processing of the test results. Another interesting feature is randomized testing based on a specification, similar to QuickCheck for Haskell.
Setting up tests requires defining a test suite and adding tests to it, which can either be done in one statement:
(lift:deftestsuite <suite-name> (<parent-suites>)
  (<slots>)
  (:setup
   <forms to set up the slots>)
  (:teardown
   <forms to tear down slots>)
  (:tests
   (<test-clause>)
   (<test-clause>)
   (<test-clause>)))
or, since everything after the slot definitions is optional, in multiple statements:
(lift:deftestsuite <suite-name> (<parent-suites>)
  (<slots>))

(lift:addtest (<suite-name>)
  <test-clause>)
(Note that having the suite name in a list for addtest is a special case. Normally, that parameter would be a symbol giving the name of the test, which would be placed in the most recently defined test suite. Using a list with a suite name in it overrides that behavior, assigning the test to the specified suite and giving it a name of the form test-<number>. Another special case is that you can omit both the name and the suite; when the first form is a list of more than one element, it is treated as a test clause and the test is given an autogenerated name (as above) and placed in the most recently defined test suite.)
Test clauses are very much like lisp-unit’s, except that the test macros have slightly different names: ensure (analogous to lisp-unit:assert-true), ensure-same, and ensure-error. The analog to (lisp-unit:assert-false <form>) is (ensure (not <form>)).
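Putting the clause macros and addtest’s naming rules together, here’s a sketch (the clauses are illustrative, using functions from my :euler package):
(lift:addtest (euler-functions)         ; explicit suite; the test gets an
  (lift:ensure (primep 7)))             ; autogenerated test-<number> name

(lift:addtest primep-rejects-eight      ; named test, placed in the most
  (lift:ensure (not (primep 8))))       ; recently defined suite

(lift:addtest                           ; name and suite both omitted; the
  (lift:ensure-same (next-prime 3) 5))  ; first form is a test clause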
The fact that test suites support multiple inheritance is a nice touch. In my setup, it means that I could have an additional suite named euler-answers, and every answer suite would inherit from both its problem’s test suite and the euler-answers suite. That way I would have a suite that would run just the tests on the problems’ answers without interfering with the rest of my test hierarchy. Of the test frameworks I looked at, LIFT was the only one with multiple inheritance for test suites.
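Here’s roughly how that would look; the mixin suite costs nothing, and running it runs every answer test:
(lift:deftestsuite euler-answers () ())

;; Each answer suite inherits from its problem's suite and the mixin:
(lift:deftestsuite euler-001-answer (euler-001-tests euler-answers)
  ())

;; (lift:run-tests :suite 'euler-answers) then runs just the answers.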
One feature that’s not in LIFT just yet but which the author has mentioned adding is support for benchmarking tests. This is something that would really interest me, especially in conjunction with being able to run just the tests on the problem answers. (I’ve actually got some code that did benchmarking of my lisp-unit code, and I’ve used it to target my optimization of problems. Having support for that sort of thing baked into the testing framework would be very nice.)
Test and suite names appear to be effectively global; as far as I can tell, they’re compared by string= on symbol-name. This means that it’s easy to call tests that were defined in a different package than the current one, but it also means that the normal package exporting rules don’t apply, and all names must be globally unique.
That’s all good stuff, but there were a few problems I ran into. First, whenever I defined a test, SBCL raised a style-warning about redefining things in defmethod, even if it was the first time I’d defined the test. SLIME didn’t quite understand the warnings, so every time I compiled a test suite or file, I got an extra SLIME window listing all the spurious warnings.
The larger problem, though, was its speed and memory footprint. Defining tests is very slow; when using LIFT, the time necessary to compile and load all of my Project Euler code jumped from the other frameworks' average of about 1.5 minutes to over nine minutes. Redefining tests felt even slower than defining them initially, but I don’t have solid numbers on that. After loading everything, memory usage was more than twice that of other frameworks. Running all of the tests took more than a minute longer than other frameworks, though that seems mostly to be a result of swapping induced by LIFT’s greater memory requirements.
Here’s a sample run:
* (lift:run-tests :suite 'euler-tests)
#<Results for EULER-TESTS 380 Tests, 5 Failures, 1 Error.
Failure: euler-primep : test-12
Condition: Expected ERROR but got NIL
Failure: euler-palindromep : test-36
Condition: Ensure failed: (EULER:PALINDROMEP 10221) ()
Failure: euler-unique-digits-p : test-10
Condition: Ensure failed: (NOT (EULER:UNIQUE-DIGITS-P 1)) ()
Failure: euler-001-givens : test-1
Condition: Ensure-same: 24 is not EQUAL to 23
Failure: euler-035-rotate-num : test-1
Condition: Ensure-same: 2 is not EQUAL to 1
ERROR : euler-095-answer : test-1
Condition: The value -1 is not of type (MOD 536870911).
>
As I mentioned before, the use of the print-object generic function to give a summary of the tests is very nice. The fact that it only details failures is also quite nice; failures aren’t lost in the noise of successful tests. The reporting of the failures isn’t quite as nice as in lisp-unit, though: while failures of ensure show the code that failed, failures of ensure-same merely show the mismatched values, not the code that generated them.
There’s no feedback while the test routines are running, which is a minor annoyance, but completely in keeping with the nature of functional programming.
Overall, LIFT is an intriguing framework with a lot of good things going for it, but the speed and memory issues killed it for me. It might work better with fewer test cases and suites, but in that case I’d probably recommend just using lisp-unit and its elegant simplicity.
§ XLUnit
XLUnit bills itself as a low-programmer-overhead adaptation of JUnit. It’s based on both XPTEST and clos-unit (so I didn’t bother trying either of those individually). It follows the approach of JUnit in using native OO classes as test suites; each suite is simply a CLOS class that inherits from the test-case class. It does have its own macro for declaring test cases, but its format reads very much like a generic method being specialized on a particular class. Likewise, fixtures are supported by implementing the set-up and tear-down generic functions for your test suite class. The main drawback is that, although you can create hierarchies of test suites (since they’re just CLOS classes), the suites aren’t composable.
Setting up tests is pretty simple:
(defclass <suite-name> (xlunit:test-case)
  (<slots>))

(xlunit:def-test-method <test-name> ((test <suite-name>))
  <test-clauses>)
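Fixtures fall out of the same CLOS machinery; here’s a hedged sketch, assuming the set-up and tear-down generic functions are exported from the xlunit package (the database helpers are hypothetical):
(defclass db-tests (xlunit:test-case)
  ((conn :accessor conn)))

(defmethod xlunit:set-up ((test db-tests))
  (setf (conn test) (connect-to-test-db)))  ; hypothetical helper

(defmethod xlunit:tear-down ((test db-tests))
  (disconnect (conn test)))                 ; hypothetical helper

(xlunit:def-test-method connection-live ((test db-tests))
  (xlunit:assert-true (ping (conn test))))  ; hypothetical helper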
XLUnit uses fairly standard names for its test clauses, too: assert-true, assert-false, assert-equal, assert-condition, assert-eql, and assert-not-eql.
The UI is pretty similar to the standard *Unit text interface, too:
* (xlunit:textui-test-run (xlunit:get-suite euler-tests))
..F...F..F..
Time: 0.072
There were 3 failure:
1) EULER-UNIQUE-DIGITS-P: Assert false: (UNIQUE-DIGITS-P 1)
2) EULER-PALINDROMEP: Assert true: (PALINDROMEP 10221)
3) EULER-PRIMEP: Assert condition ERROR, but no condition signaled
FAILURES!!!
Run: 9 Failures: 3 Errors: 0
#<XLUNIT:TEST-RESULTS {AC399D9}>
* (xlunit:textui-test-run (xlunit:get-suite e001::euler-001-tests))
..F
Time: 0.002
There was 1 failure:
1) EULER-001-GIVENS: Assert equal: 24 23
FAILURES!!!
Run: 2 Failures: 1 Errors: 0
#<XLUNIT:TEST-RESULTS {B79A7F9}>
* (xlunit:textui-test-run (xlunit:get-suite e095::euler-095-tests))
.E
Time: 0.138
There was 1 error:
1) EULER-095-ANSWER: The value -1 is not of type (MOD 536870911).
FAILURES!!!
Run: 1 Failures: 0 Errors: 1
#<XLUNIT:TEST-RESULTS {B82F53A}>
It prints dots, ‘F’s, and ‘E’s as it runs each test and then summarizes the failures. As with LIFT, it gives details of the code that failed in some cases (assert-true, assert-false), but not in others (assert-equal, assert-condition, unexpected condition).
The def-test-method macro feels a little clunky to me. I’d rather see the class instance and test suite named separately from each other, but I think part of that is just experience with other test frameworks that worked differently; it does make plenty of sense when viewed as just an extension of CLOS. The non-composability of suites is what really kills XLUnit for me. If I didn’t want composition, I’d still be using lisp-unit, which has even less syntactic overhead and comparable resource usage. I’d say XLUnit has some usefulness for situations where you need fixtures but still want a simple programming interface to the framework.
§ FiveAM
Initially, FiveAM reminded me a bit of LIFT with fewer gee-whiz features. It, too, does hierarchical, composable test suites. (In fact, it was the only other framework that supported composing suites.) It also does fixtures, but somewhat differently from everyone else: in FiveAM, fixtures are written as macros with a special keyword indicating where the test body will go. On one hand, it’s a very Lispy approach to the issue. (In particular, it lets you keep all of your state in local variables, which is a bit more functional in style than the OO approach of using class slots.) On the other hand, you have to wrap every set of tests that uses the fixture in a with-fixture macro (sketched below), while most other approaches tie the fixtures to the test suite so they’re set up and torn down automatically for each test in the suite.
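Here’s a hedged sketch of that style, using FiveAM’s def-fixture spelling; &body marks where the test body is spliced in, and the database helpers are hypothetical:
(5am:def-fixture test-db ()
  (let ((db (open-db)))           ; hypothetical setup
    (unwind-protect
         (&body)                  ; the wrapped test clauses run here
      (close-db db))))            ; teardown runs even on failure

(5am:test lookup
  (5am:with-fixture test-db ()
    (5am:is (not (null (db-lookup db 42))))))  ; hypothetical query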
Like LIFT, FiveAM has support for specification-based random testing. I don’t have anywhere I use things like that, so I again didn’t test it.
FiveAM also has fewer test clauses; most cases will work with the is macro, which codewalks its body to determine what exactly is being tested.
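For example (borrowing the modpower check from the Stefil section later on; the test-defining syntax is covered just below):
(5am:test euler-modpower
  ;; is codewalks the (= ...) form, so a failure reports both operands:
  (5am:is (= 399 (modpower 123456 123456 789))))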
Test suites are pretty easy to create:
(5am:def-suite <suite-name>)
It supports inheritance (and documentation) with keyword parameters:
(5am:def-suite <suite-name> :in <parent-suite>
  :documentation <doc-string>)
Tests may either be bound to suites explicitly:
(5am:test (<test-name> <dependencies> <suite>)
  <test-clauses>)
or implicitly, as with the in-package form:
(5am:in-suite <suite-name>)

(5am:test <test-name>
  <test-clauses>)
The <dependencies> parameter is for declaring that a given test depends on another test completing successfully (or failing) first.
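A hedged sketch of the explicit form, assuming FiveAM’s :depends-on and :suite keywords and using my suite names:
(5am:test (euler-001-answer :depends-on euler-001-givens
                            :suite euler-001-tests)
  (5am:is (= <answer> (get-answer))))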
Sample run:
* (time (5am:run! '5am-euler-tests))
.................f...................................................f..........
...........f....................f...............................................
...........f....................................................................
..................................................X.............................
.............................................................
Did 381 checks.
Pass: 375 (98%)
Skip: 0 ( 0%)
Fail: 6 ( 1%)
Failure Details:
--------------------------------
EULER-PRIMEP []:
Failed to signal a ERROR.
--------------------------------
--------------------------------
EULER-PALINDROMEP []:
10221 did not satisfy PALINDROMEP.
--------------------------------
--------------------------------
EULER-UNIQUE-DIGITS-P []:
1 satisfied UNIQUE-DIGITS-P.
--------------------------------
--------------------------------
EULER-001-GIVENS []:
23 was not = to 24.
--------------------------------
--------------------------------
EULER-035-ROTATE-NUM []:
1 was not = to 2.
--------------------------------
--------------------------------
EULER-095-ANSWER []:
Unexpected Error: #<TYPE-ERROR {B2B7639}>
The value -1 is not of type (MOD 536870911)...
--------------------------------
NIL
As with other *Unit frameworks, it prints one character for each test run, giving a bit of visual feedback. The failure messages are decent, but (as with most frameworks) no code is shown for unequal values.
Overall, I like FiveAM slightly more than the other frameworks I tried. (Though I like lisp-unit’s failure messages, XLUnit’s approach to fixtures, and several of LIFT’s current and proposed features more than FiveAM’s equivalents.) FiveAM does support the one killer feature I was looking for, hierarchical and composable test suites, as well as a number of smaller things that seem very nice and in the spirit of Lisp.
§ ptester
ptester is barely a test framework. It has no test suites and no test functions. All it provides is a set of macros for checking function results (test (analogous to lisp-unit:assert-equal), test-error, test-no-error, test-warning, and test-no-warning) and a wrapper macro designed to enclose the test clauses, which merely provides a count of successes and failures at the end. ptester expects that all testing is done in predefined functions and lacks the dynamic approach present in other frameworks.
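Usage would look roughly like this; with-tests is the wrapper macro, a name I’m assuming ptester keeps from the Franz tester API it ports:
(ptester:with-tests (:name "arithmetic")
  (ptester:test 2 (+ 1 1))              ; analogous to assert-equal
  (ptester:test-error (error "boom")))  ; passes only if the form signals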
I didn’t bother trying ptester with my code. The lack of any sort of organization meant that I would have had to rewrite most of my tests to get them all called properly. (I would have to put each set of test clauses in a function and then have the function called from somewhere. Compare that to most frameworks, where each test is told which suite or class it belongs to, and running the suite automatically runs all of the tests it contains.)
§ CLUnit
Although there’s a Debian package for CLUnit, it didn’t load properly into my environment (SBCL 1.0.0.0); it complained that the function deftest was undefined. A survey of the documentation indicates that it’s lacking my single most important requirement: a composable test suite hierarchy. I thus didn’t spend much time trying to get it to work.
It does support grouping of tests, but not hierarchically. There are no fixtures. Tests are named with strings, which bypasses some of the scope problems that can arise when using symbols. The use of function passing looks a little clunky to me, but it could be abstracted away with a macro if necessary. It does support multiple value returns, which sets it apart from most of the other frameworks I looked at (MSL.TEST handles them, as does ptester, though ptester only considers the first return value by default).
A simple test definition looks roughly like this:
(clunit:deftest "<test-name>"
  :category "<suite-name>"
  :test-fn (lambda () <test-clauses>))
The test is considered to have failed if test-fn returns nil; otherwise, it succeeded.
The functionality of other frameworks’ test clause macros (lisp-unit:assert-equal, etc.) is accomplished through other keyword arguments passed to the deftest clause. For example, this lisp-unit test
(lisp-unit:define-test addition
  (lisp-unit:assert-equal 2 (+ 1 1)))
could be rendered as
(clunit:deftest "addition"
  :input-form '(1 1)
  :test-fn (lambda (args)
             (destructuring-bind (a b) args
               (+ a b)))
  :output-form 2)
which, yes, is wordier, but the various keywords allow for some interesting combinations of tests.
When the tests are run, the names of failed tests are given, but nothing more. The setup appears to highly encourage one clause per test, so that imprecision shouldn’t be too much of an impediment.
I didn’t really experiment with this one much, since it failed to compile and was missing a key feature, but it does seem to have an interesting approach to testing. I fear, though, that its approach adds too much complexity to the process.
§ FReT
FReT also did not compile for me; it ran into problems with some of its internal macros. It says it is based on LIFT, which I’d already dismissed, and it has no documentation available aside from the source. I opted not to spend the time investigating it further.
§ MSL.TEST
MSL.TEST is another basic test framework. It checks multiple value returns by default, which is nice (though it made a bunch of my previously-passing tests fail; all the other frameworks I tried only compared the primary return value by default). The framework only has test groups, not a hierarchy, but it uses the return value of its test clauses to determine success or failure, and the run-tests function returns t only if all of its tests succeeded, so you can fake a hierarchy by having one of your test clauses run the tests in a different group (see the sketch below). The test results are not aggregated, though, and its output format is very noisy, printing one line per passing test and more for failing tests.
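The subgroup tests in the transcript below work exactly that way; schematically, something like this, where define-msl-test is a placeholder for MSL.TEST’s actual test-defining form (only msl.test:run-tests is taken from the real API):
;; define-msl-test is a placeholder name, not MSL.TEST's real macro.
(define-msl-test euler-001-subgroup
  ;; run-tests returns t only if every test in the group passed:
  (msl.test:run-tests 'euler-001::euler-001-tests))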
Excerpt of a sample run:
* (time (msl.test:run-tests 'euler-tests))
** PASS: EULER-MODPOWER [00:00:00.00]
** FAIL: EULER-PRIMEP [00:00:00.00]
** PASS: EULER-PREV-PRIME [00:00:00.00]
** PASS: EULER-NEXT-PRIME [00:00:00.00]
Test EULER-PALINDROMEP failed
Form: (PALINDROMEP 10221)
Expected value: T
Actual value: NIL.
** FAIL: EULER-PALINDROMEP [00:00:00.00]
** PASS: EULER-NUM-PARTITIONS [00:00:00.05]
Test EULER-UNIQUE-DIGITS-P failed
Form: (UNIQUE-DIGITS-P 1)
Expected value: NIL
Actual value: T.
** FAIL: EULER-UNIQUE-DIGITS-P [00:00:0.004]
** PASS: EULER-BOUNCYP [00:00:00.00]
** PASS: EULER-LEAST-DIVISIBLE-REPUNIT [00:00:00.00]
Ran 9 of 9 tests in group EULER-FUNCTIONS
The following tests failed:
(EULER-UNIQUE-DIGITS-P EULER-PALINDROMEP EULER-PRIMEP)
Totals -- Passed: 6 67%
Failed: 3 33%
** PASS: EULER-001-ANSWER [00:00:00.00]
Test EULER-001-GIVENS failed
Form: (EULER-001:GET-ANSWER 10)
Expected value: 24
Actual value: 23.
** FAIL: EULER-001-GIVENS [00:00:00.00]
Ran 2 of 2 tests in group EULER-001::EULER-001-TESTS
The following tests failed:
(EULER-001::EULER-001-GIVENS)
Totals -- Passed: 1 50%
Failed: 1 50%
** FAIL: EULER-001-SUBGROUP [00:00:00.00]
** PASS: EULER-002-ANSWER [00:00:00.00]
** PASS: EULER-002-GIVENS [00:00:00.00]
Ran 2 of 2 tests in group EULER-002::EULER-002-TESTS
Totals -- Passed: 2 100%
Failed: 0 0%
** PASS: EULER-002-SUBGROUP [00:00:00.00]
[ 78 lines snipped ]
Test EULER-013-ANSWER failed
Form: (EULER-013:GET-ANSWER)
Expected value: 5537376230
Actual values: 5537376230
6107447457844511669265405808963434873323/15625000000000000000000000000000000000000.
** FAIL: EULER-013-ANSWER [00:00:0.004]
Ran 1 of 1 test in group EULER-013::EULER-013-TESTS
The following tests failed:
(EULER-013::EULER-013-ANSWER)
Totals -- Passed: 0 0%
Failed: 1 100%
** FAIL: EULER-013-SUBGROUP [00:00:00.01]
[ 651 lines snipped ]
Test EULER-095-ANSWER failed
Form: (EULER-095:GET-ANSWER)
The value -1 is not of type (MOD 536870911).
** FAIL: EULER-095-ANSWER [00:00:00.22]
Ran 1 of 1 test in group EULER-095::EULER-095-TESTS
The following tests failed:
(EULER-095::EULER-095-ANSWER)
Totals -- Passed: 0 0%
Failed: 1 100%
** FAIL: EULER-095-SUBGROUP [00:00:00.22]
[ 333 lines snipped ]
Ran 140 of 140 tests in group EULER-PROBLEMS
The following tests failed:
(EULER-142::EULER-142-SUBGROUP EULER-141::EULER-141-SUBGROUP
EULER-140::EULER-140-SUBGROUP EULER-138::EULER-138-SUBGROUP
EULER-137::EULER-137-SUBGROUP EULER-128::EULER-128-SUBGROUP
EULER-126::EULER-126-SUBGROUP EULER-121::EULER-121-SUBGROUP
EULER-103::EULER-103-SUBGROUP EULER-095::EULER-095-SUBGROUP
EULER-090::EULER-090-SUBGROUP EULER-084::EULER-084-SUBGROUP
EULER-069::EULER-069-SUBGROUP EULER-062::EULER-062-SUBGROUP
EULER-013::EULER-013-SUBGROUP EULER-001::EULER-001-SUBGROUP)
Totals -- Passed: 124 89%
Failed: 16 11%
** FAIL: EULER-TEST-CHILDREN [00:03:26.29]
Ran 1 of 1 test in group EULER-TESTS
The following tests failed:
(EULER-TEST-CHILDREN)
Totals -- Passed: 0 0%
Failed: 1 100%
MSL.TEST is exceedingly wordy, too much so for feasibly running any large number of tests. As usual, I feel that in situations where the requirements are simple enough that MSL.TEST would work well, lisp-unit would work equally well, if not better.
§ Stefil
Stefil takes a much more interactive approach to testing. It is intended primarily for use from the REPL, preferably running within SLIME. Test failures are reported by raising conditions with information about the failing test:
Test assertion failed:
Binary predicate (= X Y) failed.
x: 24 => 24
y: (EULER-001:GET-ANSWER 10) => 23
[Condition of type STEFIL::ASSERTION-FAILED]
Restarts:
0: [CONTINUE] Record the failure and continue
1: [CONTINUE-WITHOUT-DEBUGGING] Record the failure, turn off debugging for this test session and continue
2: [CONTINUE] Skip the rest of the test (STEFIL::NAME-OF STEFIL::TEST) and continue
3: [RETEST] Rerun the test (STEFIL::NAME-OF STEFIL::TEST)
4: [CONTINUE] Skip the rest of the test (STEFIL::NAME-OF STEFIL::TEST) and continue
5: [RETEST] Rerun the test (STEFIL::NAME-OF STEFIL::TEST)
6: [CONTINUE-WITHOUT-DEBUGGING] Turn off debugging for this test session and invoke the first CONTINUE restart
7: [ABORT-TESTING] Abort the entire test session started with EULER-001::TEST-ALL
8: [ABORT-REQUEST] Abort handling SLIME request.
9: [TERMINATE-THREAD] Terminate this thread (#<THREAD "repl-thread" {AF0CB91}>)
Backtrace:
0: (STEFIL::RECORD-FAILURE* STEFIL::FAILED-ASSERTION :SIGNAL-ASSERTION-FAILED T :DESCRIPTION-INITARGS (:FORM (STEFIL:IS (= 24 #1=#)) :FORMAT-CONTROL "Binary predicate ~A failed.~%~
x: ~S => ~S~%~
y: ~S => ~S" :FORMAT-ARGUMENTS ((= STEFIL::X STEFIL::Y) 24 24 #1# 23)))
-- more --
Defining a test (via deftest) results in a function with the same name as the test; running that function runs the test. Tests are completely composable; you can call one test from within another and their results are aggregated. Test suites are defined by defsuite, but internally suites are just tests that call other tests. Adding a new test to a suite just adds that test’s function to the functions called by the suite’s function. Likewise, creating a suite adds it to the current suite, giving you a hierarchy. My only problem with the setup is that I don’t see a way to explicitly assign tests to suites, aside from dynamically binding stefil::*suite*. Normally, the current suite is set by in-suite, which requires careful attention if you’re jumping between different suites often. (A somewhat mitigating factor is that tests remember which suite they were created in, so the current suite only matters for newly-defined tests.)
As an example of a hierarchy, here’s roughly how I set up things for my tests:
(in-package :euler)

(stefil:in-root-suite)
(stefil:defsuite* test-all)
(stefil:defsuite test-functions)
(stefil:defsuite test-problems)

(stefil:in-suite test-functions)

(stefil:deftest test-modpower ()
  (stefil:is (= 399 (modpower 123456 123456 789))))

(in-package :e001)

(stefil:in-suite euler:test-problems)
(stefil:defsuite* test-all)

(stefil:deftest test-answer ()
  (stefil:is (= <answer> (get-answer))))

(stefil:deftest test-givens ()
  (stefil:is (= 24 (get-answer 10))))
Note that (defsuite* <suite-name>) is equivalent to (defsuite <suite-name>) followed by (in-suite <suite-name>).
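Since deftest and defsuite define ordinary functions, any piece of the hierarchy above can be run directly (package-qualified here because I haven’t shown the exports):
(euler::test-modpower)    ; run a single test
(euler::test-functions)   ; run one suite
(euler::test-all)         ; run everything from the root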
Fixtures are supported in pretty much the same way that FiveAM does them: you define them as macros and then run your tests within a with-fixture macro.
Running the tests looks like this:
* (test-all)
.................X...................................................X.....................X........
............X..........................................................X............................
..........................................................................................E.........
.................................................................................
#<test-run 362 tests, 381 assertions, 6 failures in 169.139 sec>
That is, of course, only half the picture, since failures are reported by raised conditions, and that returned object is designed to be browsed with the SLIME inspector. I showed the full debugger response to a condition earlier; here are the other messages from test failures I saw in my run:
Test assertion failed:
(PROGN (PRIMEP 2)) failed to signal condition ERROR
[Condition of type STEFIL::ASSERTION-FAILED]
...
Expression (PALINDROMEP 10221) evaluated to false
0: 10221 => 10221
[Condition of type STEFIL::ASSERTION-FAILED]
...
Expression (NOT (UNIQUE-DIGITS-P 1)) evaluated to true
0: 1 => 1
[Condition of type STEFIL::ASSERTION-FAILED]
...
Binary predicate (= X Y) failed.
x: 24 => 24
y: (EULER-001:GET-ANSWER 10) => 23
[Condition of type STEFIL::ASSERTION-FAILED]
...
The value -1 is not of type (MOD 536870911).
[Condition of type TYPE-ERROR]
I like that you get lots of information about the code that was run, the result it produced, and the result that was expected. The only thing that could be clearer is which test actually failed: you either have to pore through the backtrace or wait for everything to finish and pore through the test result objects.
Speaking of test result objects, here’s the one returned by the test. It’s got a lot of information in it.
An object.
[type: STEFIL::GLOBAL-CONTEXT]
--------------------
Class: #<STANDARD-CLASS STEFIL::GLOBAL-CONTEXT>
Slots:
FAILURE-DESCRIPTIONS = #(#<STEFIL::MISSING-CONDITION {C3F7741}> #<failure (STEFIL:IS (PALINDROMEP 10221)) backtrace: TEST-PALINDROMEP,TEST-FUNCTIONS,TEST-ALL> #<failure (STEFIL:IS (NOT (UNIQUE-DIGITS-P 1))) backtrace: TEST-UNIQUE-DIGITS-P,TEST-FUNCTIONS,TEST-ALL> #<failure (STEFIL:IS (= 24 (EULER-001:GET-ANSWER 10))) backtrace: TEST-GIVENS,TEST-ALL,TEST-PROBLEMS,TEST-ALL> #<failure (STEFIL:IS (= 2 (EULER-035::ROTATE-NUM 1 1))) backtrace: TEST-ROTATE-NUM,TEST-ALL,TEST-PROBLEMS,TEST-ALL> #<error TEST-ALL,TEST-PROBLEMS,TEST-ALL,TEST-ANSWER: #<TYPE-ERROR {A8FE971}>>)
ASSERTION-COUNT = 381
PROGRESS-CHAR-COUNT = 381
PRINT-TEST-RUN-PROGRESS-P = T
DEBUG-ON-UNEXPECTED-ERROR-P = T
DEBUG-ON-ASSERTION-FAILURE-P = T
TOPLEVEL-CONTEXT = #<test-run (TEST-ALL)>
CURRENT-TEST = #<test EULER-142::TEST-ALL>
RUN-TESTS = #<HASH-TABLE :TEST EQL :COUNT 362 {B129149}>
RUN-FIXTURES = #<HASH-TABLE :TEST EQL :COUNT 0 {B129341}>
TEST-LAMBDAS = #<HASH-TABLE :TEST EQL :COUNT 0 {B129539}>
failure-descriptions is an array containing an object for each failed test. run-tests is a hash table with one entry for each test run; the keys are the test objects and the values are “test context” objects.
Here’s what one of the “assertion failed” objects looks like:
An object.
[type: STEFIL::FAILED-ASSERTION]
--------------------
Class: #<STANDARD-CLASS STEFIL::FAILED-ASSERTION>
Slots:
TEST-CONTEXT-BACKTRACE = (#<test-run (EULER-001::TEST-GIVENS)> #<test-run (EULER-001:TEST-ALL)> #<test-run (TEST-PROBLEMS)> #<test-run (TEST-ALL)>)
PROGRESS-CHAR = #\X
FORM = (STEFIL:IS (= 24 (EULER-001:GET-ANSWER 10)))
FORMAT-CONTROL = "Binary predicate ~A failed.~%~
x: ~S => ~S~%~
y: ~S => ~S"
FORMAT-ARGUMENTS = ((= STEFIL::X STEFIL::Y) 24 24 (EULER-001:GET-ANSWER 10) 23)
And here’s what a test context object looks like.
An object.
[type: STEFIL::CONTEXT]
--------------------
Class: #<STANDARD-CLASS STEFIL::CONTEXT>
Slots:
PARENT-CONTEXT = #7=#<test-run (EULER-001:TEST-ALL)>
TEST = #<test EULER-001::TEST-GIVENS>
INTERNAL-REALTIME-SPENT-WITH-TEST = 16
TEST-ARGUMENTS = NIL
NUMBER-OF-ADDED-FAILURE-DESCRIPTIONS = 1
I like the touch of having the test’s runtime available. As I mentioned in the LIFT section, I’d written some functions to gather that data myself, but having the test collect it for me is even nicer.
I’m fairly impressed with Stefil. It’s got a very interesting interface, and the use of the debugger and inspector lends itself well to the interactive, short feedback cycle of Lisp programming. I’m not sure if the approach will work for everyone, but I’m going to give Stefil an extended try and see how well its testing approach works for me. If the live testing proves too bothersome, I’ll just use FiveAM.