Saturday, June 08, 2013

"This time, fer shur!" --Bullwinkle Moose

OK. I really will go on to doing something else in Haskell; I'm thinking of something that will actually get me into the M-word.

(Though in the back of my mind I will still wonder whether there's a way to make justOnes generate values in ascending order other than just sorting the darn things...)

I want to give a little more data.

I went back, compiled, and ran the C solution to the "Fair and Square" problem I'd grabbed, to get time output so I could remember something more precise than a vague "about 0.4 seconds". The variance seems to be less than it is for "my" Haskell program (scare quotes because the integer square root and semilattice search tree code aren't mine), so I'll just give one output, as usual from the data set with values up to a googol, taken from the middle of the runs:

real    0m0.455s
user    0m0.432s
sys     0m0.020s

In comparison, the best value from a Haskell run:

real    0m0.224s
user    0m0.176s
sys     0m0.044s

If I had to guess, I'd say the difference in sys time is due to memory allocation; the C program visits the heap for I/O buffers for the input and output files, but that's about it. The rest it does in fixed-size declared arrays.

How much memory, then? What dominates for the C program is a single array, a two-dimensional 50000 x 200 array of ints. That comes to 24 MB. The graph generated from profiling the current version of the Haskell program shows it using somewhere between 6.25 and 6.5 MB (I really wish they'd show the scale all the way up on that). So half the time, and given the other arrays in the C code, I think it's safe to say about a quarter of the RAM.

The C code definitely wins on size of compiled program, though; with both stripped, it's 860K for Haskell, a hair over 10K for the C program. To be sure, that's taking advantage of shared libraries. I wonder whether ghc pays much attention to only including what's necessary in a compiled program? (Or maybe it's my doing, and I should be more specific about what I import.)

Thursday, June 06, 2013

Skating around the edges

Still thinking a bit about how to churn out those palindromes in order, or at least take advantage of what order there is. If you take a look at justOnes, it may be that each of the k values gives a hunk in ascending order, so you may be able to foldl' merge [] them and get them in order for cheaper... (UPDATE: no such luck) though that may not actually be cheaper than a single O(n log n) sort.

But now we take another peek at the output. The slowness of piecing together Strings to generate output has been noticed... oh, yes, it has been noticed... and people have come up with a way to cut down on it: using shows or something like it.

shows returns a function that takes a String as a parameter and returns a String--the returned String is a printable representation of the thing (which must be in the Show type class) with the String parameter appended at the end. (That function type, String -> String, has the synonym ShowS.) Instead of generating the String, you create a function that will generate it when called.
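For instance (a toy example of mine, not code from the program):

```haskell
-- shows x yields a String -> String (synonym: ShowS) that prepends
-- x's rendering to whatever String it's handed; composing such
-- functions builds the output right-to-left without repeated (++)
demo :: String
demo = (shows (42 :: Int) . showString " <-> " . shows (3 :: Int)) ""
-- applying the composed function to "" yields "42 <-> 3"
```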

So... remember

format :: (Int, Int) -> String

format (lineNum, result) = "Case #" ++ show lineNum ++ ": " ++ show result

? Well, now we have

showsResult :: (Int, Int) -> String -> String

showsResult (c, n) = ("Case #" ++) . shows c . (": " ++) . shows n

The use of "sections" makes the correspondence clear. Since we're now passing along functions instead of strings, they're used a little differently:

main = do
    s <- B.getContents
    let r = map format $ zip [1..] (map process (tail $ B.lines s))
    putStr (unlines r)


main = do
    s <- B.getContents
    let r = map showsResult $ zip [1..] (map process (tail $ B.lines s))
    mapM putStr $ map ($ "\n") r

Now, putStr, if you look it up, has the type String -> IO (), but it has a reassuring name. mapM, though, hints at the dreaded "M" word... monad. [insert dramatic chipmunk sound clip here] All it is here, though, is a monad-flavored map. Before, r was a [String]; now it's a [String -> String]. To get the [String], we use map with a section. (In Haskell, $ is the function application operator. We've used it before; its low precedence makes it attractive as a way to avoid parentheses, but here it's more than a syntactic convenience.) Tacking on the "\n" means we don't have to use unlines; mapM putStr just runs putStr on each of the Strings in the list.
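In miniature, with made-up values (and using mapM_, the variant that discards the results; plain mapM behaves the same apart from collecting a list of ()s):

```haskell
-- a list of String -> String functions, as r is in the new main
-- (the values here are hypothetical, just to show the shape)
rs :: [String -> String]
rs = [("Case #" ++) . shows (1 :: Int), ("Case #" ++) . shows (2 :: Int)]

-- apply each function to "\n", then print each resulting String
demo :: IO ()
demo = mapM_ putStr (map ($ "\n") rs)
```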

Does it help? Well... where time output is concerned, it's a wash; the profiler output shows it taking 0.01 seconds longer. Ah, but the memory graph! The -hc graph shows the format version somewhere around 7 MB, but the showsResult version at maybe 6.25 MB. That's got to count for something.

Tuesday, June 04, 2013

"Blessed rage for order, pale Ramon..."

So, the big cost center is nDigitYs. All it does is concatenate our three flavors of Y (the ones with no 2s, the ones with one 2, and the ones with two 2s) and sort them. Of the d-digit Ys, there are at most two with two 2s, and there are d `div` 2 with one 2. By far the majority are those with no 2s; for d == 49, of the 4,122 Ys, 4,122 - 2 - 24 = 4,096 have no 2s.

We know that the two-2 Ys are greater than all the others, so we could go with

nDigitYs n = sort (noTwos n ++ oneTwos n) ++ twoTwos n

...for what little good that does.

If you take a look at the result of oneTwos, you'll find that they're always in ascending order; we grind them out so that the optional 1 is first not there, and then in ascending order in the most significant half, which suffices to order them. noTwos, on the other hand, isn't in order for d > 7. If there's some way to generate them in ascending order, we could go with

nDigitYs n = merge (noTwos n) (oneTwos n) ++ twoTwos n

(be sure to import Data.List.Utils) and O(n) beats O(n log n) any day. Unfortunately, I don't know of a good way to do that.
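For reference, merge can be hand-rolled in a few lines if Data.List.Utils (from the MissingH package) isn't at hand; a sketch:

```haskell
-- merge two ascending lists into one ascending list in O(n),
-- taking the smaller head at each step
merge :: Ord a => [a] -> [a] -> [a]
merge [] ys = ys
merge xs [] = xs
merge xs@(x:xt) ys@(y:yt)
    | x <= y    = x : merge xt ys
    | otherwise = y : merge xs yt
```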

Sunday, June 02, 2013

OK, I have to take one more look...

What can I say? I can sympathize with Lot's wife.

I have yet to show you the spiffy graphic you can get showing memory usage.

What does this tell us?

First, we can say that it looks like we're down about a megabyte from earlier. (Alas, they don't give the scale all the way up the vertical axis; if only they'd move the word "bytes" out of the way!) The other thing is, it looks like it's accumulating thunks until about halfway through when something finally demands evaluation and the RAM soufflé collapses.

(What's a thunk? The name goes back to the days when the world was new and people were figuring out how to generate code for Algol 60 programs. When the defining committee set up the rules for "call by name" they perhaps didn't realize the full consequences of their actions (vide Jensen's Device). Implementing it required generating a little function for each call by name parameter that returned a reference to it, and those little functions were called "thunks". In the context of Haskell and other lazy programming languages, a thunk represents a value that hasn't yet been evaluated.)
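A small demonstration (mine; Debug.Trace lets us see the moment of evaluation):

```haskell
import Debug.Trace (trace)

-- 'expensive' is a thunk: nothing happens when it's bound or passed
-- around, only when its value is finally demanded
expensive :: Int
expensive = trace "evaluating now!" (sum [1 .. 100])

demo :: IO ()
demo = do
  let pair = (expensive, expensive)  -- still just a pair of thunks
  print (fst pair)                   -- demand forces the sum: 5050
```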

Real World Haskell has a chapter on optimization that we've leaned heavily on, and it has a section on memory usage, thunks, and strictness issues. I will look more closely at that section and take one last shot at the palindrome program--after all, these issues will come up in the future as well. (No, really, this will be the last time...)

After the smoke clears...

OK. I think I've taken this problem as far as I can... though someday I will set it up to read and write named files so I can run it in ghci. After all, this all started because, after reading Clif Reeder's "Ruby is Too Slow for Programming Competitions", I wanted to demonstrate that it's the algorithm that matters, not the language or whether it's interpreted or compiled. On the other hand, I now have a Haskell program within spitting distance of running twice as fast as a C++ solution to the problem, taking a bit over 0.2 seconds on the hairier official input file. Eight minutes, the time Code Jam gives you to run and send back results when it counts, is 480 seconds; I tend to doubt that interpreter overhead would come near a factor of 2,400, in which case the point is made.

This started with the less-than-noble motivation of "Well, I'll show him...", but quickly changed, for a couple of reasons.
  • The unalloyed pleasure of problem solving always wins.
  • Haskell is a sheer delight to use.
It's been quite a while since I've had this much fun programming. Haskell made it easy to clearly and simply say what I meant, and when it came time to speed it up, the tools were there, and even the optimized versions were still intelligible. Not to mention that while there were bugs, there weren't many and they were easy to find. I could pull pieces into ghci and test if need be, and need wasn't all that often. All this (aside from Twan van Laarhoven's semilattice search trees) done by a Haskell newbie and duffer pushing 60 (with 40+ years of indoctrination in imperative programming aside from a bit of Lisp here and there), in spare time on evenings and (mostly) weekends.

Here's the result. You'll need the semilattice code, which is available at the link above. I compiled with -O2 in the interests of speed.

  OK, here's what I hope is the ultimate "Fair and Square" program.
  (The core of the problem: given two positive integers, M and N, determine
  the number of base 10 palindromes between M and N inclusive that are
  squares of base 10 palindromes.)

  Best time output, reading the L2 input file, on a 2.8 GHz AMD Athlon II
  X4 630:

  real    0m0.227s
  user    0m0.208s
  sys     0m0.016s

  [Ultimate? Silly me; there's always something else to try. However, I think
  I'm done with it, for now at least.]

  The main insights to speed up the process follow.
  (4)-(6) are directly from the Google Code Jam analysis of the problem at
  (9) is from Twan van Laarhoven's "Search trees without sorting", on his blog

  1. Look for palindromes in [ceil(sqrt(M))..floor(sqrt(N))] whose squares are
     palindromes. (We'll refer to one of the former set as X, and one of the
     latter set as Y, as is done in Google's analysis.)
  2. Don't look for palindromes, generate them. They're determined (aside from
     the middle digit if there are an odd number) completely by half their
     digits, so in a range of 10^(2*n) integers, there are only on the order
     of 10^n palindromes--a square-root-sized reduction of the search in (1).
  3. If Y^2 is to be a palindrome, its most/least significant digit can't be
     greater than 3.
  4. In particular, the Ys are precisely those that, when you square them via
     long multiplication, have no carries in the addition of partial products.
  5. (3) implies that the Xs have an odd number of digits.
  6. (3) also implies that the middle digit of X, which is the sum of the
     squares of the digits of the palindrome being squared, must be no larger
     than 9. Therefore, *none* of the digits of Y can be greater than 3, and
     3 only appears in Y = 3.
  7. (6) implies that Ys come in three flavors: those with two 2s, those with
     one 2, and those with no 2s.
  8. Save for very small n, it's easier to count d-digit Ys than to generate
     them--and remember, the problem only asks for the count.
  9. By treating a binary tree as a sublattice, labeling branches with the
     upper bounds of the corresponding subtrees, one can quickly search for
     the first value in the tree >= a specified value.
 10. By labelling the sorted Ys for a given number of digits with their ordinal
     positions, one can use those tree searches to quickly determine how many
     Ys are in a given interval.

  NOTE: 0 is a one-digit palindrome, and its square is, too, but since we are
  only dealing with positive integers, we'll ignore zero.

import Prelude
import Data.List
import SemilatticeSearchTree
import qualified Data.ByteString.Char8 as B

-- It's easy to calculate the number of n-digit Ys.

numNDigitYs :: Int -> Int

numNDigitYs n = if n == 1 then 3 else numTwoTwos n + numOneTwos n + numNoTwos n
    where numTwoTwos d = if even d then 1 else 2
          numOneTwos d = if even d then 0 else d `div` 2
          numNoTwos  d = let d' = d `div` 2 - 1
                             n' = sum [d' `choose` i | i <- [0..min d' 3]]
                         in if even d then n' else 2 * n'
choose :: Int -> Int -> Int

m `choose` 0 = 1
m `choose` n = product [m - n + 1..m] `div` product [1..n]                            

-- We may wish to memoize numNDigitYs...

ysInDigitRange d1 d2 = sum [numNDigitYs d | d <- [d1..d2]]

  Now the slow part: generating and counting Ys in a range other than "all
  d-digit numbers".

powersOfTen = [10 ^ n | n <- [0..]] -- or, if you wish, map (10 ^) [0..]

tenToThe :: Int -> Integer

tenToThe n = powersOfTen !! n

twoTwos :: Int -> [Integer]

twoTwos n
    | n == 1    = []
    | even n    = [twoTwosNoOnes]
    | otherwise = [twoTwosNoOnes, twoTwosNoOnes + tenToThe (n `div` 2)]               
    where twoTwosNoOnes = 2 * tenToThe (n - 1) + 2

oneTwos :: Int -> [Integer]

oneTwos n
    | n == 1     = [2]
    | even n     = []
    | otherwise  = [p + 2 * tenToThe halfN
                      | p <- justOnes n (min 1 (halfN - 1))]
    where halfN = n `div` 2

noTwos :: Int -> [Integer]

noTwos n
    | n == 1     = [1,3]
    | even n     = base
    | otherwise  = concat [[p, p + tenToThe halfN] | p <- base]
    where halfN = n `div` 2
          base = justOnes n (min 3 (halfN - 1))

  justOnes -- generate the palindromes with a specified number of digits all of
  whose non-zero digits are 1s. 1 must be the most/least significant digit; up
  to a specified number of 1s are scattered through the rest of each half.
  If the number of digits requested is odd, we leave a 0 in the middle digit,
  for the caller to either replace with another digit or not.

justOnes :: Int -> Int -> [Integer]

justOnes n o =
    let halfN = n `div` 2
        shell = tenToThe (n - 1) + 1
        innards = concat [(halfN - 1) `choices` k | k <- [0..o]]
        -- originally pair i = tenToThe i + tenToThe (n - (i + 1))
        ascending  = take (halfN - 1) (drop 1 powersOfTen)
        descending = take (halfN - 1) (reverse (take (n - 1) powersOfTen))
        memoPair = zipWith (+) ascending descending
        pair i = memoPair !! (i - 1)
    in [foldl' (+) shell (map pair c) | c <- innards]

choices :: Int -> Int -> [[Int]]

m `choices` n
    | n == 0           = [[]]
    | m == n           = [[m, m-1..1]]
    | otherwise        = [m : c | c <- (m - 1) `choices` (n - 1)]
                         ++ ((m - 1) `choices` n)

nDigitYs :: Int -> [Integer]

nDigitYs n = sort (noTwos n ++ oneTwos n ++ twoTwos n)

-- We turn the sorted lists from nDigitYs into search trees.
-- Here we use van Laarhoven's semilattice search tree routines/types.
-- Sigh... we're doing a bit of cargo cult programming here, because even
-- though we'll never search on the ordinal index, we'll follow the blog
-- post example.

yTree :: Int -> SearchTree (Max Integer, Max Int)

yTree n = fromList [(Max x, Max y) | (x, y) <- zip (nDigitYs n) [0..]]

yTrees = map yTree [1..]

dDigitYTree :: Int -> SearchTree (Max Integer, Max Int)

dDigitYTree d = yTrees !! (d - 1)

-- ...leading up to this function.

ysInRange :: Integer -> Integer -> Int -> Int

ysInRange m n d
      | nVal == n = nPos - mPos + 1
      | otherwise = nPos - mPos
      where (mVal, mPos) = findFirst' m
            (nVal, nPos) = findFirst' n
            findFirst' x = case findFirst (Ge x, Any) (dDigitYTree d) of
                               Just (Max i, Max j) -> (i, j)
                               Nothing             -> (tenToThe d, numNDigitYs d)

-- counting decimal digits and bits
-- the numbers here are big enough to nudge us past the obvious

powersOfTwo = map (2^) [0..]

bigTwos = map (2^) powersOfTwo
bigTens = map (10^) powersOfTwo

bitsIn n   = bDigits n bigTwos
digitsIn n = bDigits n bigTens

bDigits :: Integer -> [Integer] -> Int

bDigits n xs = bDigits' n 1
    where bDigits' n s = let upToN = takeWhile (<= n) xs
                         in case upToN of
                            [] -> s
                            _  -> bDigits' (n `div` (last upToN))
                                           (s + powersOfTwo !! (length upToN - 1))

-- The following is derived from a response to a Stack Overflow question
-- which in turn is taken from Crandall & Pomerance, "Prime Numbers: A
-- Computational Perspective".

floorSqrt :: Integer -> Integer

floorSqrt 0 = 0
floorSqrt n = floorSqrt' (2 ^ ((1 + bitsIn n) `div` 2))
    where floorSqrt' x =
               let y = (x + n `div` x) `div` 2
               in if y >= x then x else floorSqrt' y

-- then the obvious...

ceilSqrt :: Integer -> Integer

ceilSqrt n =
    let y = floorSqrt n
    in if y * y == n then y else y + 1

-- Finally, the top level, which counts where it can, and generates, filters,
-- and counts where it must. Note that we break the interval down into ranges
-- that are subsets of d-digit numbers for appropriate values of d.

numYs :: Integer -> Integer -> Int

numYs m n
    | mDigits == nDigits   = ysInRange m n mDigits
    | mPower10 && nPower10 = ysInDigitRange mDigits (nDigits - 1)
    | mPower10             = ysInDigitRange mDigits (nDigits - 1)
                             + ysInRange (tenToThe (nDigits - 1) + 1) n nDigits
    | otherwise            = ysInRange m (tenToThe mDigits - 1) mDigits
                             + ysInDigitRange (mDigits + 1) (nDigits - 1)
                             + ysInRange (tenToThe (nDigits - 1) + 1) n nDigits
    where mDigits    = digitsIn m
          nDigits    = digitsIn n
          mPower10 = m == tenToThe (mDigits - 1)
          nPower10 = n == tenToThe (nDigits - 1)

numXs :: Integer -> Integer -> Int

numXs m n = numYs (ceilSqrt m) (floorSqrt n)

-- A new main, heavily inspired by "Haskell I/O for Imperative Programmers".
-- (What's with that tail, you ask? We're skipping the first line, which says
-- how many more lines there are. They could have been persnickety and padded
-- the file with additional lines to make sure you have to pay attention to
-- that first line, but they weren't.)

main = do
    s <- B.getContents
    let r = map format $ zip [1..] (map process (tail $ B.lines s))
    putStr (unlines r)

process :: B.ByteString -> Int

process line = case map B.readInteger (B.words line) of
               [Just (m, _), Just (n, _)] -> numXs m n
               _                          -> -1

format :: (Int, Int) -> String

format (lineNum, result) = "Case #" ++ show lineNum ++ ": " ++ show result

O(n) good, O(log n) better

What we need is a list/array of b, b^2, b^4, b^8, b^16, ... to count base b digits. The initial loop unrolling should have given us the idea already.

powersOfTwo = map (2^) [0..]
bigTwos = map (2^) powersOfTwo
bigTens = map (10^) powersOfTwo

bitsIn n   = bDigits n bigTwos
digitsIn n = bDigits n bigTens

bDigits :: Integer -> [Integer] -> Int

bDigits n xs = bDigits' n 1
    where bDigits' n s = let upToN = takeWhile (<= n) xs
                         in case upToN of
                            [] -> s
                            _  -> bDigits' (n `div` (last upToN))
                                           (s + powersOfTwo !! (length upToN - 1))

Whew! OK, so does it help? Best observed time output:

real    0m0.227s
user    0m0.208s
sys     0m0.016s

Worst observed time output:

real    0m0.238s
user    0m0.212s
sys     0m0.024s

Profiling output claims 0.20 seconds runtime (still; the difference is a few "ticks" that aren't noticeable in the calculated time). It's a little better, but we are definitely in the realm of diminishing returns. Let's check out the cost centers:

COST CENTRE            MODULE                %time %alloc

nDigitYs               Main                   17.3   20.4
fromList.(...)         SemilatticeSearchTree  15.7   21.6
justOnes               Main                   10.7   12.9
process                Main                    5.1    8.9
bDigits.bDigits'       Main                    5.1    2.1
fromList               SemilatticeSearchTree   4.6    0.5
justOnes.pair          Main                    4.1    0.0
floorSqrt.floorSqrt'.y Main                    4.1    4.5
choices                Main                    3.6    6.8
bDigits.bDigits'.upToN Main                    3.6    2.8
meet                   SemilatticeSearchTree   3.0    0.0
choose                 Main                    2.5    2.3
yTree                  Main                    2.5    3.3
findFirst              SemilatticeSearchTree   2.5    0.0
justOnes.innards       Main                    2.0    1.2
mkBranch               SemilatticeSearchTree   2.0    2.5
main.r                 Main                    1.5    0.3
noTwos                 Main                    1.5    2.3
main                   Main                    1.0    0.6
numYs                  Main                    1.0    0.1
floorSqrt              Main                    1.0    0.7
satisfy                SemilatticeSearchTree   1.0    0.0
satisfy                SemilatticeSearchTree   1.0    0.6
meet                   SemilatticeSearchTree   0.0    2.8

Before, we had digitsIn and digitsIn.digitsIn', which collectively consumed 12.5% of the time and 9.3% of allocation. Now we have bDigits.bDigits' and bDigits.bDigits'.upToN, which just eat 8.7% of time and 4.9% of allocation (though speaking of allocation, of course we now have those three new lists sitting around).
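As a quick sanity check of the digit counters (the definitions are repeated so the snippet stands alone; the expected values are my own arithmetic):

```haskell
powersOfTwo :: [Int]
powersOfTwo = map (2 ^) [0 ..]

bigTwos, bigTens :: [Integer]
bigTwos = map (2 ^) powersOfTwo
bigTens = map (10 ^) powersOfTwo

bitsIn, digitsIn :: Integer -> Int
bitsIn n   = bDigits n bigTwos
digitsIn n = bDigits n bigTens

-- climb the b, b^2, b^4, ... ladder: divide out the largest power
-- that fits and add the corresponding count of base-b digits
bDigits :: Integer -> [Integer] -> Int
bDigits n xs = bDigits' n 1
  where bDigits' n s = case takeWhile (<= n) xs of
          []    -> s
          upToN -> bDigits' (n `div` last upToN)
                            (s + powersOfTwo !! (length upToN - 1))
```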

UPDATE: It bugged me that I was grabbing a whole prefix of a list when I really only wanted the last item of that prefix (along with something like its length), so I added yet another helper function, resulting in

bDigits n xs = bDigits' n 1
    where bDigits' n s = case lastLE n xs of
                           Nothing     -> s
                           Just (m, p) -> bDigits' (n `div` m) (s + twoToThe p)

lastLE :: Integer -> [Integer] -> Maybe (Integer, Int)

lastLE n xs =
    let lastLE' (x:xt) _ prevIndex
           | x <= n = lastLE' xt x (prevIndex + 1)
        lastLE' _ prevVal prevIndex =
            if prevIndex < 0 then Nothing
                             else Just (prevVal, prevIndex)
    in lastLE' xs (-1) (-1)

where twoToThe is just like tenToThe, using powersOfTwo instead of powersOfTen.
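Spelled out, twoToThe looks like this by analogy with tenToThe (the Int result type is my assumption, matching its addition to s in bDigits'):

```haskell
-- cached powers of two, as used elsewhere in the program
powersOfTwo :: [Int]
powersOfTwo = map (2 ^) [0 ..]

-- twoToThe, built just like tenToThe but indexing powersOfTwo
twoToThe :: Int -> Int
twoToThe n = powersOfTwo !! n
```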

That helped; with the change, the cost center listing shows bDigits, bDigits.bDigits', and lastLE.lastLE' collectively chowing down on just 3.3% of total time and 2.1% of allocation, which beats the heck out of 8.7 and 4.9. The -hc heap graph shows the peak at around 6.5 MB rather than the 8 it sailed up to when we went to the semilattice trees.

A little more stuff

So, we know that things get slow when we actually have to generate some of the Ys for a range check. The range checking is now O(log n) where n is the number of Ys we have to check. twoTwos is trivial, so that leaves oneTwos and noTwos, and they mostly call justOnes. Of course there's also the sorting, the labeling with ordinal position, and the tree building. sort doesn't even appear in the .prof file, so it must be down in the noise (or perhaps considered a primitive; after all, we don't see the additions split out). What we do see in the "cost center" list are
  • nDigitYs
  • fromList
  • digitsIn.digitsIn'
  • justOnes
  • justOnes.pair
  • floorSqrt.floorSqrt'.y
One small change to justOnes turned out to be surprisingly effective: the value returned was calculated as

[shell + sum (map pair c) | c <- innards]

It turns out to chop an easy 10% off the execution time, according to profiling output, to recast that as

[foldl' (+) shell (map pair c) | c <- innards]

"WTF?" you say? Well, folding a list takes several things:
  • a function (we're using (+); wrapping an operator in parentheses turns it into an ordinary prefix function that can be passed as an argument)
  • a starting value (we're using shell)
  • the list to be folded
Folds can get you sums (+) or products (*) of lists of values; you'll often find the identity for the function used as the starting value for that kind of use. On the other hand, the function doesn't necessarily have to have type

a -> a -> a

Instead, you could have a starting value of type b and then the function would have type

b -> a -> b

which you'll see for things like fromList, which builds one data structure from another. Since we want shell plus a sum, we might as well put shell in as the starting value rather than using 0 and adding shell when the smoke clears.
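A tiny illustration (my example, not from the program) of a fold whose accumulator type differs from the element type:

```haskell
import Data.List (foldl')

-- the b -> a -> b shape: the accumulator b is a String, the list
-- elements a are Ints
render :: [Int] -> String
render = foldl' (\acc n -> acc ++ show n ++ ";") ""
```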

Probably more important is the fact that foldl' is strict. Laziness is Haskell's outstanding virtue, but there are times when you want strictness, and avoiding the build-up of a potentially large stack of thunks awaiting evaluation in a fold is one of those times.
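To make the contrast concrete (a sketch; both compute the same answer, but foldl builds a tower of thunks first):

```haskell
import Data.List (foldl')

-- foldl  (+) 0 [1,2,3] builds ((0 + 1) + 2) + 3 as nested thunks,
-- only collapsing them when the result is demanded;
-- foldl' (+) 0 [1,2,3] forces each partial sum as it goes,
-- keeping the accumulator in constant space
lazySum, strictSum :: [Int] -> Int
lazySum   = foldl  (+) 0
strictSum = foldl' (+) 0
```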

With that one change, the profile files say the runtime drops from 0.23 s to 0.20 s. The new cost centers:

COST CENTRE            MODULE                %time %alloc

fromList.(...)         SemilatticeSearchTree  18.4   20.6
nDigitYs               Main                   15.9   19.5
digitsIn.digitsIn'     Main                   10.0    9.1
justOnes               Main                    8.5   12.3
justOnes.pair          Main                    4.5    0.0
floorSqrt.floorSqrt'.y Main                    4.0    4.3
fromList               SemilatticeSearchTree   4.0    0.5
process                Main                    3.5    8.4
choices                Main                    3.5    6.5
findFirst              SemilatticeSearchTree   3.0    0.0
digitsIn               Main                    2.5    0.2
satisfy                SemilatticeSearchTree   2.5    0.0
meet                   SemilatticeSearchTree   2.5    2.7
justOnes.innards       Main                    2.0    1.1
floorSqrt              Main                    2.0    0.6
main                   Main                    1.5    0.5
dDigitYTree            Main                    1.5    0.0
yTree                  Main                    1.5    3.2
meet                   SemilatticeSearchTree   1.5    0.0
mkBranch               SemilatticeSearchTree   1.5    2.4
noTwos                 Main                    0.5    2.2
choose                 Main                    0.0    2.2

Peeking at the call stacks shows that by far the majority of calls to digitsIn.digitsIn' take place in floorSqrt, where we're counting bits and the values passed range up to a googol, which means we're talking up to maybe 330-bit numbers (100 / .30103). Going eight at a time helped, but apparently not enough if we're now taking ten percent of our time counting bits.

A flashback and analogy

You've probably heard about how the notion of sum types (e.g. Algol 68 union s, Rust enum s, Haskell type s) and product types (e.g. tup...