?

Log in

No account? Create an account
color cycle (slow)

Kistaro Windrider, Reptillian Situation Assessor

Unfortunately, I Really Am That Nerdy

Previous Entry Share Next Entry
Victory
color cycle (slow)
kistaro
There was a TopCoder match tonight (for those of you not aware, my handle there is Windrider), and the NSA was sponsoring a $5,000 prize pool. 70% of the pool goes to Division 1, with 30% going to Division 2. To give everybody at least some chance (not just the people who consistently score highly for the whole round), the prize is further divided by room, so your position in your room- not your overall rank- determines your payout. (Rooms are filled randomly, 20 people each.) The room winner scores 50% of the pot (in Division 1- D2 uses a different distribution), second place gets 30%, and third gets 20%. There are no extra prizes for overall rank, but overall is what decides your rating.

I came in 2nd in Room 13, with a score of 1107.33 points. I came in 27th overall. This is my highest score for a Division 1 problem set, my highest ranking for a Division 1 problem set, and the only time I've won money in an SRM. (I've won cash in previous tournaments- we'll see if I can do it again when the TCCC Round 1 begins this Saturday.) This is also the first time I've swept the problems in Division 1, solving all three of them correctly.

I'm happy about the (approximately) $45, but getting every problem (including the Hard) in Division 1 is what really has me in a good mood. That's the first time I've done that in a real match. (Early tournament rounds are more like D2 in difficulty and don't count.) This match did play to my strengths- discrete mathematics- but still, I'll take what I can get.

My rating is now at 1891, the highest it's ever been. It'll probably drop across the next few matches, but it's fun to have a rating like that, at least for a while. But getting the Hard problem right with the first algorithm I thought of... that's the really cool part.

The punchline is that I coded up my 1000 and submited it as a joke. I just thought up what I thought was a crazy, impossible algorithm, which I wrote mostly so I could use the variable name "antimanhattan". I just had this guess as to what might work, what might exhibit some properties of the right answer- which points I could ignore and still get the right answer. (The problem: Out of 250,000 points generated by an algorithm with fairly messy output, find the pair with the longest Manhattan distance in two seconds or less.) My "only use the manhattanest, the antimanhattanest, the unmanhattanest, and the unantimanhattanest; return longest corner-to-corner" was just this crazy idea I thought of, but then it passed all the example cases. So I submitted it... and it accidentally worked.

A lot of good ideas are like that.

Did I get it by accident? Not really. I coded it right. And I know discrete mathematics well enough that apparently, even when I don't know why, my intution can be correct- and good intuition requires at least some level of knowing what I'm doing.

But anyway. 27th place of 470. SQUEE.

  • 1
Very nice! Congratulations!

Awesome, congratulations! :D

Mmm, bacon userpic.

...now I'm hungry.

*giggle* Uh... sorry? ;D

Congratulations!

The scary part here is that I think I understand what you mean by

> "only use the manhattanest, the antimanhattanest, the unmanhattanest, and the unantimanhattanest; return longest corner-to-corner"

... and my own brain chipped away at the problem.

I'm unfamiliar offhand with this Manhattan thing. Does the idea of "manhattanness" and "unmanhattanness" map to a standard X/Y coordinate plane with "Manhattan distance" mapping to straight-line distance?

Because if so, I think the algorithm breaks. Not always -- it depends on the data set -- but let's say you have the following data set:

(-10, 0)
(10, 0)
(0, -10)
(0, 10)
(9, 9)
(-9, -9)

The four points your algorithm selects are the first four, with corner-to-corner distances of 20. But the longest linear distance is between the last two, 18 sqrt 2.

answering my own question

... trivial Google search: http://www.nist.gov/dads/HTML/manhattanDistance.html

OK, so I think that refined my understanding. You're not mapping "Manhattan" to X and "unManhattan" to Y; you're mapping them respectively to [ M(x,y): x>=0, y>=0, M = x + y ] and [ M(x,y): x>=0, y<0, M = x + |y| ], yes? Because that specifically addresses my sample failure.

I don't have a ready proof that that catches all such gotcha cases, but it's a much stronger contender.

... Actually, wait, no, that does work. If you pick one point from each quadrant with the highest M-distance to (0,0), then every most-efficient corner-to-corner path between them has a path of the same length which passes through (0,0). Thus the corner-to-corner distance is equivalent to the corner-to-origin-to-corner distance and there can be no gotcha relying on selection of points such that the most efficient path doesn't go through the origin.

Re: answering my own question

The problem constraints included a constraint functionally equivalent to "all points are in the range (0, 0) to (107, 107), lower inclusive", so there are no negative points.

That's not my calculation of the Manhattanest and Unmanhattanest points. The Manhattanest pointis the point with maximal (X+Y). The Antimanhattanest point is the point with maximal (Y-X), which is equivalent to mirroring the board left to right. The Unmanhattanest had minimal (X+Y), and the Unantimanhattanest had minimal (Y-X).

The only candidate pairs I needed to consider were the Manhattanest and the Unmanhattanest, and the Antimanhattanest and the Unantimanhattanest.

My solution stood through the Challenge Phase and the System Tests; it's mathematically correct. But yes, as you noticed, the Manhattan distance is very different from the Euclid distance!

Using the Manhattan distance is what makes the problem solvable within two seconds- it becomes an O(n)-complexity problem. The Euclidian-distance variant can be solved by brute force in O(n2) time, but when n=250K this would time out relative to TopCoder's two-second time limit. (Anything involving 250,000 points generally has to be O(n) to pass.) I know that for the Euclidian-distance variant you can discard any point not on the convex hull of the collection of points-so-far, but the first trick is figuring out whether it's on the chull or not, and the second is "this doesn't help you at all when the points are in a ring".

It's also the Manhattan distance that let me solve it! It converts the problem from traditional mathematics to combinitorics- from my most serious weakness (relative to computer programming) to my greatest strength. This problem took a lot more insight than ability to pound out a thousand lines of code in ten seconds, which is a pleasant change for Hard problems.

The Nasty Little Algorithms I still really need to learn to get and maintain a red-bracket rating in TopCoder are Bipartite Matching and Max Flow. I'd still need to improve my math intuition and mathematical programming skills to have a stable red rating (rating >= 2200 and your username shows up in red; the 1500-2199 bracket gives you a yellow username, which is still considered skilled- my rating is 94th %ile and of 788 active American TopCoder members where "active" is defined as "played a match for rating points within the last six months", I'm in 32nd place; 288 of 4876 worldwide), but I think I could keep a low red rating with just those two. They solve a lot of problems, and they're the only graph theory problems I consistently miss. Of course, both of the problems are a huge pain in the tail...

Re: answering my own question

What you said convinces me we're talking about the same algorithm over different data sets. If you somehow felt the urge to add (-5x10e6, -5x10e6) to all of your data points (which is, if I remember my set theory terms correctly, trivially a homeomorphism), your method would map to the one I discussed.

As for the Euclidean variant -- this might sound unorthodox, but, I suspect that for integral coordinates, the pair with the greatest Euclidean distance will be in the subset of points with the greatest Manhattan distance to the center of your data set, i.e., (Xmax - Xmin, Ymax - Ymin). [A proof would involve showing that all points within a ring of radius r from that origin have Manhattan distance to it bounded by some F(r), and you can find a larger Euclidean distance by selecting at least one point outside the ring than you can by staying iside of it.] If N is very large, you can then use Manhattan distances to identify the top candidates, and then brute-force among a very small subset of N.

Re: answering my own question

The algorithm you're describing seems to rely on the faulty assumption that there will always be points in all four quadrants. If you have at least four points (not a given) that are not colinear, this is possible by redefining the origin somewhere in the middle. Otherwise, the problem is isomorphic to any coordinate space- that's why I can get away with Y-X (or X-Y would work just as well) for antimanhattanism rather than needing to add a max-X offset to it; it just doesn't make any difference at all. So corner-to-origin-to-corner? Almost, for some arbitrary pseudo-origin that is "between" the points. Note that the points are not guaranteed to be distinct; one of the example cases had all the points actually on the same point. The correct answer for that case was the first two points- the two points are required to be distinct, but ties are to be broken lexicographically by index of point as generated by the generator function (250K is too large for a TopCoder input, so they gave a "random" number algorithm and the input is the seed).

I can disprove your suspicion: A=(-4,0), B=(1,4), C=(1,-4), D=(4, 0). The Manhattanest distance is AB or AC, but the Eulerest distance is AD, and D is not among any of the Manhattan extrema relative to the origin or to point A.

Re: answering my own question

> The algorithm you're describing seems to rely on the faulty assumption that there will always be points in all four quadrants.

I realized this at work today. The whole quadrants thing was just me being sloppy with rigor in order to get the thought experiment working a little more quickly.

Obviously, it's not needed at all; there's nothing about "the point with the maximum value of X+Y" that implies either or both X and Y have to be positive. In terms of helping me track in to the "corners" idea, it was helpful, though. The more we talk about the origin, the more confusing it'll get, so I'll just concede the model has a limited utility and try to move on to your terminology.

Anyway.

Your example as given doesn't actually disprove my suspicion. By my algorithm, the two points with the greatest Manhattan distance from the origin are the subset I'm checking for greatest Euclidean distance in. So I measure between B and C: Eu.Dist. = 8. A and D's Eu.Dist also = 8 and thus isn't longer.

However, if you double all the distances and nudge A, you have your disproof: (-9, 0), (2, 8), (2, -8), (8, 0). AD = 17, BC = 16, and B and C are chosen because MDist from origin are both 10.

However however, this counterexample fails (i.e., the algorithm works) if you simply move the origin one positive unit along the X-axis, showing that A) the algorithm is very sensitive to origin position, and B) thus shouldn't be tied to an origin at all. So I ask you:

If we revise my idea to base it on your formulations, can you still find a counterexample? (In other words, "The pair with the greatest Euclidean distance will be in the subset of {Manhattanest, anti-Manhattanest, unManhattanest, anti-unManhattanest}.")

Re: answering my own question


Every point on the blue circle is equidistant from the origin. Every point in the green and red quadrilaterals has the same Manhattan distance from the origin.

I don't have time to get an actual set of counterexample points, but those regions between the larger diamond and the circle are where my suspicions lie. With points at all four locations where the large diamond is tangent to the circle, put a pair of points on the X-axis such that the first is in the left-side region bounded by the circle and the outer Manhattan diamond and the other is in the right-side such region. That pair has the greatest Eucidian distance to each other, but not any of the Manhattan extrema.

Re: answering my own question

Alright, question answered.

The green square reminds me that there's still room for a weaker version of my conjecture. If you set the radius and center of the red square s.t. that your points with greatest Manhattan distance are on a pair of its opposite faces, any point-set within the green square is guaranteed to not be a greatest-Euclidean-distance solution. (Proof: The minimum EuDist between facing sides of the red square is where it's tangent to the circle and is 2r [r = circle radius]. The maximum EuDist between facing sides of the green square is between corners and is 2r. But you already have a solution of that length.)

So this wouldn't be much of a help in a TopCoder problem, but might have some weird, abstruse applications in the "real world". Assuming normal distribution of points (this is the key phrase, and why it's useless for TopCoder), this can whack off 25% of the work of brute-forcing a large dataset. t(.75n^2 + n) is better than t(n^2) (well, for n>4). You run the risk that you're just wasting t(n)'s worth of work, but programmers take that risk with compression algorithms all the time. :P

Re: answering my own question

Actually, you can prove a stronger claim than the red and green diamonds. The only points you need to consider are on the convex hull of the set(the shape you'd get by making the points nails in a board and snapping a rubber band around the whole thing). It's useless in TopCoder because you can engineer an input such that all the points are on the convex hull (ring-shape, for example), but a real-world application would definitely find it a major optimization- for any real-world application that needs "longest pair", which is a questionable proposition to start with. (Shortest pair has a clean O(n log n) algorithm that was my first assignment in my Intro to Algorithms class.) Quickhull (which is at least as good as Jarvis' March in the worst case) performs well in scattered cases, and then it would take most of the points out of consideration. From there, your algorithm might as well just check all pairs, because anything more intricate would wind up having to iterate over the remaining points to check them for possibility and wind up taking the same amount of time, if not more, than just checking everything.

But would calculating the Manhattan extrema (to give us the red diamond from which the blue circle and green diamond can be calculated fairly easily) help us anyway? We know that calculating the Manhattan extrema is O(n), and then checking for "inside diamond" is O(1) per point so we'd simplify the problem drastically for O(n) work. However, the simplification would make quick-hull much less efficient because it would leave a data set shaped like the worst case! Graham-scan would work better than quick-hull in that case, and it would probably be faster than a quick-hull over all the data points would have been.

Simplifying the problem overall is likely to produce better results (with regard to time complexity) than saving multiplications by using Manhattan distance comparisons rather than Euclidian distance comparisons (remember, the algorithm can save a lot of time by not executing the square root and just working with the squares of the distances- it's good enough for greater than/less than comparision purposes), so here's something I'm not sure of: What can we prove about the blue circle? Can we prove that the only points of interest are between the blue circle and the red diamond (inclusive)?

I'm not actually a mathemetician, I just sort of look like one when I'm doing things with combinatorics...

This discussion is taking on a life of its own

Quickhull, Jarvis and Graham are Greek to me (as I never actually took any formal computing classes -- just their underlying mathematical ones). So I don't really have any comment on that part of the post.

> Can we prove that the only points of interest are between the blue circle and the red diamond (inclusive)?

I think this is the best we can do without having special-case evaluation based on the data:

The first step would be noting that by our definitions, we already have a path of at least 2r EuDist. So any point that is less than Z EuDist from every point inside the red square (and thus from every other point in our data set) is not a point of interest. These points are inside a circle of radius (2-sqrt2)*r centered on the blue circle's center.

If the particular Manhattanest points have a EuDist of Xr (>2r), we can substitute that in instead and the radius increases to (X-sqrt2)r. But the subset of points we can throw out entirely via this method still seems small.

Hmm ... on the other hand.

I'm thinking there might be some potential to taking the ... um ... diago-Manhattanest points, as well? i.e., the points with the greatest Manhattan distance if the coordinate system were to be rotated by 45 degrees. This should be still only a t(N) problem, since the algorithm for diago-Manhattanism can, I'm pretty sure, be dervied just from combinations of X1, X2, Y1, and Y2 (plus a single multiplication by sqrt 2 once you've figured the biggest pairs). This would give us a second set of red/green squares at 45-degree angles to the ones here, and I'm wondering if a combination of those and the originals would be enough to catch the exceptions that Manhattan points alone let through.

(On further thought, I don't think this would be sufficient, since there would be smaller slivers at the 22.5/67.5/etc. degree marks, and so on and so forth. But I don't have a formal proof of that yet, just intuition.)

Re: This discussion is taking on a life of its own

Quickhull, Jarvis and Graham are Greek to me (as I never actually took any formal computing classes -- just their underlying mathematical ones). So I don't really have any comment on that part of the post.

They're all different algorithms for the same thing: find the convex hull. They all have different time complexities and perform well in different situations.

The first step would be noting that by our definitions, we already have a path of at least 2r EuDist.

No we don't. Consider an equilateral triangle.

Iterating the algorithm by rotating the coordinate system is going to run into the law of diminishing returns after about the third attempt. Finding out what the overhead is and how much of a difference it makes is probably interesting.

I have more thoughts on this, but it's too late and I need sleep and I just wanted to say "Won't somebody please think of the triangles?" before I forgot.

Re: This discussion is taking on a life of its own

>> The first step would be noting that by our definitions, we already have a path of at least 2r EuDist.

> No we don't. Consider an equilateral triangle.


um ...? Consider what about it? The red square isn't arbitrary; we select its center and size based on your algorithm that finds the two points with the greatest Manhattan distance. Therefore, the red square is guaranteed to either have a pair of points on sides 1 and 3 respectively, or on sides 2 and 4 respectively (our pair with the greatest Manhattan distance), depending on whether your manhattan-antimanhattan pair or your unmanhattan-antiunmanhattan pair won out.

Since the length of the red square's sides are 2r, points on opposite sides must be at least 2r apart in EuDist.

Re: This discussion is taking on a life of its own

Let's say we have the following three points:
(1,0), (0,1), and (-.4,-.4)

Now, in the manhatten metric, the distance between the first two is two, and the other two pairings both have a distance of 1.8, so the 'red square' would go to the unit points. That puts (-.4,-.4) inside the Blue Circle.

I think that your conjecture is based on the notion of a well-defined 'manhatten center', when, in fact, for the points, (a,b), (c,d), any point along the line segment (a,d)-(c,b) is a midpoint of sorts.

If the manhatten diameter of the point set is d_m, and the number of dimensions is n, then the radius of the inner circle should be:
\frac{1}{2} d_m (1 - \frac{1}{\sqrt{n})
which is not all that large.


  • 1