Saturday, June 01, 2013

So that's why they are taking time...

We have yet to peek further into the profiling output; it's overdue.

The "cost center" output tells you the total time spent in particular functions, and that helps find slow functions--but while being slow is sufficient to make a function a prominent cost center, it isn't necessary. It may just be a function that someone's calling a lot.

Perhaps that's the case with tenToThe. The call tree shows calls in context, and when we look at it we see the following (edited slightly due to line length):

     noTwos                       Main 184          43    0.8    2.0    31.9   28.6
      tenToThe                    Main 229          21    0.0    0.0     0.0    0.0
      noTwos.base                 Main 186          42    0.0    0.0    31.2   26.6
       justOnes                   Main 187          42    8.8   13.3    31.2   26.6
        justOnes.pair             Main 195       85582    7.3    6.4    17.3    6.4
         tenToThe                 Main 196      171164   10.0    0.0    10.0    0.0
        justOnes.shell            Main 191          42    0.0    0.0     0.0    0.0
         tenToThe                 Main 192          42    0.0    0.0     0.0    0.0
        justOnes.halfN            Main 190          42    0.0    0.0     0.0    0.0
        justOnes.innards          Main 188          42    0.4    1.0     5.0    7.0
         choices                  Main 189       60804    4.6    5.9     4.6    5.9
      noTwos.halfN                Main 185          42    0.0    0.0     0.0    0.0


pair is the hot spot for tenToThe calls. Let's see what we can do about pair.

Originally:

justOnes :: Int -> Int -> [Integer]

justOnes n o =
    let halfN = n `div` 2
        shell = tenToThe (n - 1) + 1
        innards = concat [(halfN - 1) `choices` k | k <- [0..o]]
        pair i = tenToThe i + tenToThe (n - (i + 1))
    in [shell + sum (map pair c) | c <- innards]


First time I ever used "innards" in a program! Let's memoize pair (are you starting to notice a theme here?), giving

        ascending  = take (halfN - 1) (drop 1 powersOfTen)
        descending = take (halfN - 1) (reverse (take (n - 1) powersOfTen))
        memoPair = zipWith (+) ascending descending
        pair i = memoPair !! (i - 1)


zipWith generalizes zip; in fact, you could write zip as

zipWith (\x y -> (x, y))
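To see both claims in one place, here's a small self-contained sketch: zip' built from zipWith, plus a standalone reconstruction of the memoized pair table. Note that powersOfTen isn't shown in this post, so its definition below (and the name memoPairFor) is my assumption:

```haskell
-- zip is just zipWith with the pairing function.
zip' :: [a] -> [b] -> [(a, b)]
zip' = zipWith (\x y -> (x, y))

-- Assumed definition: powersOfTen !! i == 10^i.
powersOfTen :: [Integer]
powersOfTen = iterate (* 10) 1

tenToThe :: Int -> Integer
tenToThe = (powersOfTen !!)

-- Standalone version of the memoized pair table for a given n:
-- element (i - 1) is tenToThe i + tenToThe (n - (i + 1)).
memoPairFor :: Int -> [Integer]
memoPairFor n = zipWith (+) ascending descending
  where
    halfN      = n `div` 2
    ascending  = take (halfN - 1) (drop 1 powersOfTen)
    descending = take (halfN - 1) (reverse (take (n - 1) powersOfTen))
```

The point of the memoization is that each sum tenToThe i + tenToThe (n - (i + 1)) is computed once when the list is built, rather than once per call to pair.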

Results? Best observed time output:

real    0m0.234s
user    0m0.216s
sys     0m0.016s


Worst observed time output:

real    0m0.242s
user    0m0.228s
sys     0m0.012s


And the cost centers:

COST CENTRE              MODULE                %time %alloc

nDigitYs                 Main                   21.8   19.1
fromList.(...)           SemilatticeSearchTree  11.4   20.2
justOnes                 Main                   10.0   14.2
digitsIn.digitsIn'       Main                    9.5    8.9
fromList                 SemilatticeSearchTree   4.7    0.5
justOnes.pair            Main                    4.3    0.0
choices                  Main                    4.3    6.4
process                  Main                    3.8    8.3
floorSqrt.floorSqrt'.y   Main                    3.8    4.2
digitsIn                 Main                    2.4    0.2
satisfy                  SemilatticeSearchTree   2.4    0.6
mkBranch                 SemilatticeSearchTree   2.4    2.4
choose                   Main                    1.9    2.2
meet                     SemilatticeSearchTree   1.9    0.0
satisfy                  SemilatticeSearchTree   1.9    0.0
findFirst                SemilatticeSearchTree   1.9    0.0
numNDigitYs.numNoTwos.n' Main                    1.4    0.9
noTwos                   Main                    0.9    2.1
meet                     SemilatticeSearchTree   0.9    2.6
yTree                    Main                    0.5    3.1
justOnes.innards         Main                    0.0    1.1


tenToThe is out of the running, and other things are moving up in percentage of time. Even with the new list, the total allocation is down from about 165 MB to 155 MB.

It's important to be guided by data. Looking at another part of the code,

noTwos :: Int -> [Integer]

noTwos n
    | n == 1     = [1,3]
    | even n     = base
    | otherwise  = concat [[p, p + tenToThe halfN] | p <- base]
    where halfN = n `div` 2
          base = justOnes n (min 3 (halfN - 1))


The otherwise (actually odd n) case seems like it's doing a bunch of work, creating a bunch of two-item lists and concatenating them... but changing it to

    | otherwise  = base ++ [p + tenToThe halfN | p <- base]

turned out to be slower.

I would say that this is the first optimization I've done that is arguably less than clear. (OK, maybe the switch from String to ByteString, but I would disagree.)

Thursday, May 30, 2013

Up to date... or not

Well... I have ghc 7.4.1. It may be that more recent versions of ghc do a better job of optimization; they're up to 7.6.something.

Bodhi Linux is based on LTS releases of Ubuntu, and it doesn't look like the 12.04 repositories have been keeping up. I'd hate to have to switch distributions to get the latest ghc, but it may come to that.

UPDATE: BWAHAHAHA! I thought that the more recent version was better because it was allocating half as much memory. It was, instead, that I had a 32-bit version in one case, 64-bit in the other. Now I get to decide whether to plow onward with Ubuntu 13.04, which I installed to get ghc 7.6.something but which is flaky in some respects, or drop back to Bodhi Linux.

Wednesday, May 29, 2013

String: elegant but deadly

In Haskell, String is a synonym for [Char], i.e. a list of Char. This is a very elegant way to deal with strings--but it's a killer for performance.

The recommended way around this is to use the ByteString type. It's not a drop-in replacement for String, sad to say; in particular, replacing read and show is problematic. I lucked out on reading Integers, though the returned type is different, and I wimped out on generating the output. You also get to add an import line, in this case

import Data.ByteString.Char8 as B

and qualify some function names. Here's the revised main and supporting functions:

main = do
    s <- B.getContents
    let r = map format $ zip [1..] (map process (tail $ B.lines s))
    putStr (unlines r)

process :: B.ByteString -> Int

process line = case map B.readInteger (B.split ' ' line) of
               [Just (m, _), Just (n, _)] -> numXs m n
               _                          -> -1

format :: (Int, Int) -> String

format (lineNum, result) = "Case #" ++ show lineNum ++ ": " ++ show result
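For the curious, B.readInteger parses a leading integer and hands back the unconsumed remainder, which is why process pattern-matches on Just (m, _). Here's a minimal sketch of just the parsing step, with Maybe standing in for the -1 sentinel (parseTwo is a hypothetical name, not from the program):

```haskell
import qualified Data.ByteString.Char8 as B

-- Parse a line of the form "m n" into its two integers.
-- B.readInteger returns Just (value, rest) on success, Nothing otherwise.
parseTwo :: B.ByteString -> Maybe (Integer, Integer)
parseTwo line = case map B.readInteger (B.split ' ' line) of
                  [Just (m, _), Just (n, _)] -> Just (m, n)
                  _                          -> Nothing
```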


Is it worth it? Indeed it is. time output:

real    0m0.271s
user    0m0.240s
sys     0m0.028s


The cost centers:

COST CENTRE            MODULE                %time %alloc

nDigitYs               Main                   16.6   17.9
fromList.(...)         SemilatticeSearchTree  13.0   18.9
tenToThe               Main                   12.6    0.0
justOnes               Main                    8.3   13.3
digitsIn.digitsIn'     Main                    7.9    8.4
justOnes.pair          Main                    6.7    6.4
fromList               SemilatticeSearchTree   4.0    0.4
process                Main                    3.6    7.8
satisfy                SemilatticeSearchTree   3.6    0.0
meet                   SemilatticeSearchTree   3.2    0.0
yTree                  Main                    2.8    2.9
choices                Main                    2.8    6.0
floorSqrt.floorSqrt'.y Main                    2.8    4.0
numNDigitYs            Main                    1.2    0.1
choose                 Main                    1.2    2.0
noTwos                 Main                    1.2    2.0
digitsIn               Main                    1.2    0.2
meet                   SemilatticeSearchTree   1.2    2.5
bound                  SemilatticeSearchTree   1.2    0.0
findFirst              SemilatticeSearchTree   1.2    0.0
mkBranch               SemilatticeSearchTree   0.4    2.2
justOnes.innards       Main                    0.0    1.0


main and main.r are nowhere to be seen. process is down from 35.9% time, 41.7% allocation, to 3.6% and 7.8% respectively.

Total allocation: 165,375,112 bytes, less than half of what it was. Not shabby.

P.S. Just before this change, the input file with much smaller values but ten times as many lines took well over half the time of the input file with huge values. That's no longer the case:

real    0m0.083s
user    0m0.076s
sys     0m0.004s

 
P.P.S. process is a bit brittle; if someone left more than one space between the two values on a line, B.split would obligingly represent the stuff between the spaces as an empty ByteString, and the parse would fail. We really should filter those out. UPDATE: turns out ByteString has words, which splits on runs of whitespace; we're safe.
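A sketch of what the words-based fix might look like (again with Maybe standing in for the -1 sentinel, and processWords as a hypothetical name). B.words splits on runs of whitespace, so a doubled space that would trip up B.split ' ' is harmless here:

```haskell
import qualified Data.ByteString.Char8 as B

-- Like process, but tolerant of repeated spaces: B.words never yields
-- empty fields, while B.split ' ' would.
processWords :: B.ByteString -> Maybe (Integer, Integer)
processWords line = case map B.readInteger (B.words line) of
                      [Just (m, _), Just (n, _)] -> Just (m, n)
                      _                          -> Nothing
```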

Tuesday, May 28, 2013

Things that make you go "Hmmm..."

OK. I rolled my own main, with serious help from "Haskell I/O for Imperative Programmers"--and I hasten to add that if you, like me, are learning Haskell, run, do not walk, to this page and read it.

Doing I/O, at least of the sort that programs like this do, in Haskell requires an attitude adjustment if you want to work with Haskell rather than against it. If you're a shell scripter, you already have the idea: you've worked with pipes, and pipes are like function composition.

So, here's my main. I toyed with using interact, but having to skip the first line (the one that tells you how many lines are left) made it problematic, as did the issue of adding a "case number" to get the output into the form that the problem statement requires.

-- A new main, heavily inspired by "Haskell I/O for Imperative Programmers"
-- http://www.haskell.org/haskellwiki/Haskell_IO_for_Imperative_Programmers
-- (What's with that tail, you ask? We're skipping that first line that says
-- how many more lines there are. They could have been persnickety and padded
-- the file with additional lines to make sure you have to pay attention to
-- that first line, but they weren't.)


main = do
    s <- getContents
    let r = map format $ zip [1..] (map process (tail $ lines s))
    putStr (unlines r)

process :: String -> Int

process line = let [m, n] = map read (words line)
               in  numXs m n

format :: (Int, Int) -> String

format (lineNum, result) = "Case #" ++ show lineNum ++ ": " ++ show result
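The pipe analogy is literal: the let binding in main is a chain of functions. With a stand-in process (summing the numbers on each line, since numXs lives elsewhere), the whole transformation can be written as one composition:

```haskell
-- Stand-in for the real per-line solver (numXs isn't defined here).
process :: String -> Int
process = sum . map read . words

format :: (Int, Int) -> String
format (i, r) = "Case #" ++ show i ++ ": " ++ show r

-- The whole input-to-output transformation as a single pipeline:
-- split into lines, drop the count, solve, number, format, rejoin.
pipeline :: String -> String
pipeline = unlines . map format . zip [1 ..] . map process . tail . lines
```

Reading the composition right to left mirrors reading a shell pipeline left to right.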


So, it compiled, it ran, it gave the same results as before. What about the time and the cost centers? The time seems to be a little bit slower, though there's some overlap between it and the previous version. This version seems to take somewhere between 0.38 and 0.4 seconds. This despite the following profiler output:

COST CENTRE            MODULE                %time %alloc

process.(...)          Main                   35.9   41.7
nDigitYs               Main                   11.1    8.4
tenToThe               Main                    8.1    0.0
fromList.(...)         SemilatticeSearchTree   7.6    8.9
digitsIn.digitsIn'     Main                    5.5    3.9
justOnes.pair          Main                    5.1    3.0
main.r                 Main                    4.1    6.0
justOnes               Main                    3.9    6.2
main                   Main                    3.2    9.5
fromList               SemilatticeSearchTree   1.8    0.2
floorSqrt.floorSqrt'.y Main                    1.4    1.8
meet                   SemilatticeSearchTree   1.4    0.0
choose                 Main                    1.2    1.0
choices                Main                    1.2    2.8
mkBranch               SemilatticeSearchTree   0.7    1.0
yTree                  Main                    0.2    1.4
meet                   SemilatticeSearchTree   0.2    1.1


Note that process only chews up 35.9% of the time, 41.7% of the allocation, compared with 39.5% and 56.0% respectively for solve under the former regime. main and its immediate helpers are all that changed. How could execution time increase if the only part that changed decreased the percentage of execution time it took? It is, as someone once said, a puzzlement. I think I like this main, though, and will see if there's anything I can do to improve it.

UPDATE: Aha! Note that we've added a couple of cost centers: main.r and main. They collectively add up to 43.2% of time and 57.2% of allocation. That's how the result can take longer despite the distribution being flatter.

Monday, May 27, 2013

Talk about TMTOWTDI...

At least from my newbie POV, the variety of ways people set up the top levels of their "Fair and Square" programs is pretty amazing.

Some use the standard issue recursive versions of top-level loops you see in texts; some use replicateM. Some ignore the first line, some don't.

I have some reading to do.

The remaining time sinks

Counting Digits

We unrolled digitsIn' before; perhaps we could use another shot. Add another couple of lines for base^8 and the cost centers change:

COST CENTRE            MODULE                %time %alloc

solve                  Main                   39.5   56.0
nDigitYs               Main                   11.2    8.6
fromList.(...)         SemilatticeSearchTree   8.2    9.1
tenToThe               Main                    7.9    0.0
justOnes.pair          Main                    4.2    3.1
justOnes               Main                    3.7    6.4
digitsIn.digitsIn'     Main                    3.0    4.1
floorSqrt.floorSqrt'.y Main                    2.8    1.9
meet                   SemilatticeSearchTree   2.3    0.0
satisfy                SemilatticeSearchTree   2.1    0.0
fromList               SemilatticeSearchTree   1.9    0.2
digitsIn               Main                    1.6    0.1
findFirst              SemilatticeSearchTree   1.2    0.0
choices                Main                    0.7    2.9
mkBranch               SemilatticeSearchTree   0.7    1.1
yTree                  Main                    0.5    1.4
meet                   SemilatticeSearchTree   0.5    1.2


Not shabby. It's down from 11.5% of the time to 3%, and 7.6% of allocations to 4.1%. If we were hard core, we'd go ahead and do specific base 10 and base 2 versions and avoid recalculating the powers of the base, but there are bigger fish to fry, especially the elephant in the room (metaphor blender set to purée!), solve.
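The post doesn't show digitsIn itself, so here's a guess at the shape of the unrolled version: stride by base^8 and base^4 before falling back to counting one digit at a time (the name, signature, and details are assumptions, not the actual code):

```haskell
-- Count the digits of n when written in the given base, taking big
-- strides first so that huge numbers need few division steps.
digitsIn :: Integer -> Integer -> Int
digitsIn base n = go n 1
  where
    b8 = base ^ (8 :: Int)
    b4 = base ^ (4 :: Int)
    go m acc
      | m >= b8   = go (m `div` b8) (acc + 8)
      | m >= b4   = go (m `div` b4) (acc + 4)
      | m >= base = go (m `div` base) (acc + 1)
      | otherwise = acc
```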

I/O Bound?

So here's that outer layer, courtesy of jagd:

solve i num = do
    [a, b] <- getLine >>= return . map read . words
    putStrLn $ "Case #" ++ show i ++ ": " ++ (show $ numXs a b)
    if i < num
       then solve (i+1) num
       else return ()


main = do
   num <- getLine >>= return.read
   solve 1 num


The input is reminiscent of the old days of FORTRAN 66 where, rather than trying to detect end of file, you'd start your input card deck with a card that told your code how many more cards there were to read.

We've moved from generating and testing a huge list of palindromes to the point that the above code takes two fifths of the CPU time and does nearly three fifths of the trips to the heap. What is this doing to take so much time with an input of just 1,001 lines comprising a little over 160K of data?

Once I thought it was the repeated walks to concatenate up a String to feed to putStrLn, so I broke the one putStrLn down into three putStr calls and a putStrLn, but it made no difference that I could see, so I suspect it's the input.

I've heard that one issue is the use of String, a synonym for [Char], and that there's a ByteString type that carries less overhead... but that doesn't suggest anything, except maybe trying to ask for the flavor of getLine that hands back a ByteString. I'll investigate that, and also take a look at how some of the other posted solutions read and parse the input, to see what I can learn. It bugs the heck out of me to think that I'm spending that much time and taking so many trips to the well to read such a dinky file.

UPDATE: Out of idle curiosity, I ran the first large input file, 10,000 ranges bounded above by 10^14. time output:

real    0m0.300s
user    0m0.288s
sys     0m0.008s


The profiler output shows solve taking not quite three fourths of the time and doing 90% of the allocation! There's got to be a better way.

Qapla'!

I finally got things in sync with van Laarhoven's module, and the output is correct. (Just to make sure, I went back to the Code Jam site and submitted the output. I think that being up until nearly 3:00 a.m. caused my initial reaction to
Judge's response for last submission: Correct.
to be "Oh, [expletive]! The judge says I need to correct my last submission!" Then I realized it was an adjective, not a command, and relaxed.)

I'm a little bummed at using someone else's code for a nontrivial portion of the problem, but OTOH, I did learn more Haskell, which is one of my real goals.

But back to the results: the best time output is

real    0m0.361s
user    0m0.336s
sys     0m0.024s


and the worst is

real    0m0.377s
user    0m0.364s
sys     0m0.008s


Both are less than the 0.4 seconds I saw for the C++ solution I tried.

[We pause here to do a victory dance.]

Where are our cost centers, and how about that memory usage?

COST CENTRE            MODULE                %time %alloc

solve                  Main                   37.7   54.0
digitsIn.digitsIn'     Main                   11.5    7.6
nDigitYs               Main                    9.5    8.3
fromList.(...)         SemilatticeSearchTree   6.8    8.8
tenToThe               Main                    6.2    0.0
justOnes               Main                    4.6    6.2
justOnes.pair          Main                    3.3    3.0
satisfy                SemilatticeSearchTree   2.0    0.0
choices                Main                    1.8    2.8
floorSqrt.floorSqrt'.y Main                    1.8    1.8
fromList               SemilatticeSearchTree   1.8    0.2
yTree                  Main                    1.5    1.4
floorSqrt              Main                    1.1    0.3
digitsIn               Main                    1.1    0.1
meet                   SemilatticeSearchTree   1.1    1.1
mkBranch               SemilatticeSearchTree   1.1    1.0
findFirst              SemilatticeSearchTree   1.1    0.0


Aside from solve and digitsIn.digitsIn', the time distribution is a lot flatter. Now it's even more astonishing just how much time and allocation solve, which just reads the file, calls the function, and prints out the results, takes! Any serious improvement is going to have to come from there.

We are definitely trading space for time. Previous graphs of memory usage peaked at around four megabytes, but this version just about doubles that.

UPDATE: ...and you know, we did tack those ordinal positions on with each Y value and added the tree structure and the upper bound info, so doubling sounds plausible.

Sunday, May 26, 2013

Almost there...

I have something that compiles... but doesn't give the correct result. I suspect I've blown the interface to van Laarhoven's SemilatticeSearchTree module in a way that, while it passes syntactic muster, is keeping the search from working right. The presence of a bunch of zeroes in the output might indicate that I'm consistently getting back Nothing.

What does the time output (0.324-0.347s real time) tell us about what we might see when it's correct? Well, it's certainly a lower bound--but how much longer will the correct version run? The buggy one has presumably built the trees, and is doing all the work, but if the maxima are taking it immediately to Nothing, we're only skipping a couple of O(log(n)) operations times a thousand input lines, where n is no bigger than a bit over 4,000 (the number of 49-digit Ys). I'm hoping the corrected version won't take all that much longer... but we can't know until we have it right. More news as it happens.

A flashback and analogy

You've probably heard about how the notion of sum types (e.g. Algol 68 unions, Rust enums, Haskell types) and product types (e.g. tup...