Saturday, May 04, 2013

Palindromes again?

Actually, it's more Haskell than palindromes.

I've been looking over other people's solutions to the "Fair and Square" problem, and they are pretty impressive. As a still-inexperienced Haskeller, I am taking inspiration from then rather than feeling discouraged.

In particular, jagd's solution reminded me that I was not thinking Haskell. Haskell is lazy; values aren't calculated until you require them. That lets you define infinite data structures! You only get into trouble if you do something that requires the whole thing.

For example, [1..] is the infinite list of all positive integers. You can safely use it, as long as you only use a finite amount of it. For example, if you try

uptoSqrt n = [x | x <- [1..], x * x <= n]

and then ask for

uptoSqrt 25

you will get back 1, 2, 3, 4, and 5, and then the function call will hang. You and I  realize that \x -> x * x is monotonically increasing for positive numbers, but Haskell doesn't, and obligingly tries to check the infinitely many integers greater than 5 to see whether any have squares no greater than 25. That takes forever, so it never finishes.

To get the result you want, you will have to write something like

uptoSqrt n = [x | x <- (takeWhile (\y -> y * y <= n) [1..])]

What's takeWhile? Here's a declaration for it:

takeWhile :: (a -> Bool) -> [a] -> [a]

Give it a function that takes an a and returns a Bool, and a list of as, and it will hand you back a prefix of that list--all the items up to, but not including, the first one for which the function returns false.

With that definition,

uptoSqrt 25

gives you the

[1,2,3,4,5]

you expect, because we know that [1..] is in ascending order and the square function is monotonically increasing.

(There's nothing magic about takeWhile; if you ask for

takeWhile even [2, 4..]

it will print even numbers forever, because [2, 4..] is the infinite list of positive even integers, and hence takeWhile will never find an odd number to make it stop.)

That means that I can simply define the whole infinite list of "fair and square" numbers. To make it possible to get those in a certain range, I will have to generate them in ascending order, so back comes the sort that we omitted for a while. I'm going to do something that clashes a bit with a Haskell convention for patterns in function definitions, because it matches the nomenclature of the problem discussion, in which the "fair and square" numbers are referred to as X and the palindromes they are squares of as Y.

possibles nDigits =
    if      nDigits == 1 then [0..3]
    else if nDigits == 2 then [11, 22]
                         else sort (concat [noTwos nDigits, oneTwo nDigits, twoTwos nDigits])


ys = concat [possibles n | n <- [1..]]
xs = [y * y | y <- ys]

palSquares m n = takeWhile (<= n) (dropWhile (< m) xs)

dropWhile is kind of like takeWhile, but it gives you a suffix of the list you hand it, namely everything from the first item for which the function returns false on. If you drop the values less than m from the front of a sorted list, and then grab the values <= n from the front of what's left, you have precisely those values between m and n inclusive. (I'm being a bit free with my language here, of course. Haskell doesn't destroy the original list--at least not if anyone's looking.)

Note that we aren't using floorSqrt or ceilSqrt any more, so they can go away, and we thus aren't counting bits any more.  We're defining the whole list, so we aren't counting decimal digits, either, and digitsIn can go away as well.

So do infinite data structures provide any advantage? Well, consider that I have yet to actually solve the Code Jam problem as such. Remember that it requires reading a file with ranges of values m and n for which we have to print a line of a given format with the value of

length (palSquares m n)

Defining the infinite list of Xs is a form of memoization, remembering the values of a function you calculate so you don't have to do the work if you invoke it again with the same arguments. Suppose two of the input lines are

200 1000
40 500

Evaluating

palSquares 200 1000

will evaluate all the Xs between 0 and 1000 (actually one more past that, because you have to find the stopping place), so that when it comes time to process the next line, all the values between 40 and 500 will have already been determined, so all we have to do is select the values in range.

It's not what you'd want for serious memoization; after all, the input lines could all be within a small interval of the upper bound, so we would arguably have wasted the determination of all those lesser "fair and square" numbers. But it does have the potential of saving time. Let's see what happens with the Code Jam inputs (and with jagd's version of main)...

Cool! Of course, I didn't submit these officially--I'm using a portion of someone else's solution, and peeked at the analysis of the problem, and took far longer to write the program than the rules permit, so it would in no way be fair to take credit--but Google let me download the inputs, and confirmed that the output generated is correct. Here's the time output, first for large input 1, 10K lines with ranges a subset of [1, 1014]:

real    0m0.229s
user    0m0.220s
sys     0m0.008s


and then for large input 2, 1K lines with a range of [1, 10100]:

real    0m1.796s
user    0m1.784s
sys     0m0.012s


Not shabby for time--and to refer back to Mr. Reeder's blog post, I'm sure that with those times, an interpreted version would have no trouble running in the eight minute limit Code Jam imposes--but how about memory? Time to instrument again... and sure enough, keeping that list, or perhaps those lists, around eats RAM. With large input 1, it uses 60+ kilobytes of RAM (the graph generated doesn't show the scale all the way up the Y (RAM) axis), and with large input 2, it uses about five megabytes of RAM. Even the five megabytes is small change these days, and now I suspect I wasn't looking closely at which run I was checking memory usage for last time around. I bet I thought a 10100 run was a 1014 run, and I shouldn't have blamed the sort at all.

UPDATE: I'm truly amazed at the brevity of jagd's solution; you really should check it out. I will have a good time studying it further.  Just for the heck of it, I compiled it with ghc -O2, and tried it on the large input files.  For large input 1,

real    0m0.226s
user    0m0.220s
sys     0m0.008s


For large input 2,

real    0m14.410s
user    0m14.200s
sys     0m0.184s


I wonder what it's doing differently to make it run 63.8 times slower on the larger input rather than 7.84 times slower?

UPDATE:  part of it is the way it's keeping the list of Ys both in numeric and string form, with lots of conversions back and forth. That means more memory; on large input 2, jagd's program uses over 55 Mbytes of RAM, with a very interesting shape to the plot of RAM versus time.

I see that jagd also at least thinks he or she is generating some values that might not be valid Ys, given the

filter (ispalindrome . square)

applied to potential Ys. Maybe that's part of the RAM usage fluctuation.

("." is function composition in Haskell, and filter has a signature that should be familiar to you from takeWhile:

filter :: (a -> Bool) -> [a] -> [a]

filter takes a function and a list and hands  you back all the items in the list for which the function returns true.)

No comments:

Riddler Classic, May 23, 2020—Holy Mackerel!

Another one using Peter Norvig's word list . It turns out that the word "mackerel" has a curious property: there is exactly ...