How good is the estimate in what case? Worst case? Best case? Expected case?
And what do you mean by 'good'? Absolute error? Absolute positive error? Error margin with a certain % of confidence?
And in the case of any answer involving probabilities, are there any assumptions about the kind of input files to be expected? Or are all 1 GB strings of 0s and 1s equally probable?
Also, do the 1024 bytes imply any limit on code size, or could I potentially include 1 GB of precomputed values in the source code? If there is no limit on the size of the source code, the problem becomes meaningless, since it can then be answered precisely with just 1 bit of writable extra space.
In case you are not allowed an unreasonable code size, I don't think you can have any 'good' estimate when there are significantly more distinct values than can be represented as a set in 1024 bytes. I tend to believe the best you can hope for is to be able to say 'there are more than x distinct values' for a very small x. I think you have to actually represent them somehow; after thorough reflection I find it unlikely that you can gain any useful information by 'seeing' a new number without having a representation of all that you have seen up to that point. Without any statistical particularities of the input file, the best you can achieve is 1 bit per stored number, so you can store no more than 8192 numbers. That's 2^13. There can be as many as 2^27 distinct values. So even if you're lucky and manage 1 bit per number, your error will be huge (up to about 99.994%). It is likely that you will achieve much less than 1 bit per number, unless you plan on getting lucky.
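A quick sanity check on this arithmetic (the 1-bit-per-number figure is the optimistic assumption from above):

```python
# Back-of-the-envelope check: how many numbers fit in 1024 bytes,
# and how large can the resulting underestimate be?

state_bits = 1024 * 8              # 8192 bits of writable state
max_stored = state_bits            # optimistic: 1 bit per stored number
file_numbers = (2**30) // 8        # 1 GB of 8-byte integers -> 2^27 numbers

worst_error = 1 - max_stored / file_numbers
print(max_stored)                  # 8192 = 2^13
print(file_numbers)                # 134217728 = 2^27
print(f"{worst_error:.5%}")        # 99.99390% underestimate in the worst case
```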
In case one cares about worst-case positive error (i.e., underestimating the number of distinct integers by the least possible amount) with 100% confidence, the problem reduces to finding a cute data structure that can represent the largest number of distinct 64-bit integers, fits in 1024 bytes, and can be updated by exposing it to an 8-byte number. A lot of interesting ideas come to mind. Mine are essentially based on a 'compacted' trie and some counters.
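As a deliberately naive baseline (much weaker than the trie-based ideas, and with Python's own object overhead ignored for the byte accounting), here is a sketch of a structure that yields such a 100%-confidence lower bound:

```python
class ExactLowerBound:
    """Report a guaranteed lower bound on the distinct count.

    Conceptually stores up to 128 distinct 64-bit values
    (128 * 8 bytes = 1024 bytes) verbatim and saturates after that;
    the reported bound is exact until saturation.
    """
    CAPACITY = 1024 // 8  # 128 eight-byte slots

    def __init__(self):
        self.values = set()
        self.saturated = False

    def add(self, x):
        if len(self.values) < self.CAPACITY:
            self.values.add(x)
        elif x not in self.values:
            # We saw a value beyond the 128 we can store, so there are
            # at least CAPACITY + 1 distinct values in the stream.
            self.saturated = True

    def lower_bound(self):
        return len(self.values) + (1 if self.saturated else 0)
```

The bound never overshoots the true distinct count, which is exactly the worst-case-positive-error guarantee described above; it just saturates at a very small x.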
Also, keep in mind that the number of distinct integers such a data structure can store can depend on the actual distribution of the integers it is exposed to! So the actual sampling from the input file could matter. The sampling can easily be randomized, however, or you could do even cuter things to make sure your structure counts as many distinct integers as it possibly can, regardless of how they are ordered in the input file.
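One way to make the stored sample independent of the ordering in the file is to key admission on a hash of the value itself rather than on arrival order. This is an illustrative sketch (the `keep` rule and the 0.001 threshold are my assumptions, not part of the original idea; blake2b is just a convenient keyed-size hash from the standard library):

```python
import hashlib

def keep(value: int, threshold: float = 0.001) -> bool:
    """Decide whether to admit `value`, based only on the value itself.

    Because the decision depends on hash(value) and not on when the value
    arrives, the final contents of the structure are a function of the
    multiset of values in the file, not of their ordering.
    """
    digest = hashlib.blake2b(value.to_bytes(8, "big"), digest_size=8).digest()
    h = int.from_bytes(digest, "big") / 2**64  # roughly uniform in [0, 1)
    return h < threshold
```

Shuffling the input file cannot change which values pass this test, so the structure's final state is order-independent by construction.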
Here are a few other interesting observations:
- An input file, from the problem statement's point of view, is equivalent to a partition of the number 2^64 into at most 2^27 components.
- So, for any correct answer r, there are around (2^64)^(r-2) possible choices of r distinct integers (see http://en.wikipedia.org/wiki/Partition_(number_theory)#Asymptotics_of_restricted_partitions). A particular configuration can therefore be represented using around 64*(r-2) bits.
- In case you don't 'cheat' by keeping any information regarding the input file in the actual code, your program is essentially (at most) a finite automaton with 2^(1024*8) states and up to 2^64 possible transitions from each state, where each state must represent a result.
- So, there are 2^(1024*8) = 2^8192 possible states. Each state has to be mapped to a possible result in 1..2^27.
- Transitioning from a state representing a result r to a state representing r+1 by exposing the automaton to a 64-bit number implies that it should somehow be able to discern, based on 1024*8 = 8192 bits of state, between the roughly 2^(64*(r-3)) cases where the 'new number' is a duplicate and the remaining 2^(64*(r-2)) - 2^(64*(r-3)) cases where it is not. So even if r is a small 100,003, the sheer magnitude of 64*100,000 (about 6.4 million bits) compared to 8192 is overwhelming, and a sufficient argument to support the case that no good approximation is possible.
- So, with a limited program size and so little memory, I think a strategy that is almost as good as the best strategy is to ignore the input file completely and just answer with the expected number of distinct values among 2^27 uniformly random 64-bit numbers.
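Under the assumption that all input files are equally probable (i.e., 2^27 independent uniform 64-bit values), that constant answer is easy to compute with the first-order birthday bound:

```python
n = 2**27   # count of 8-byte integers in a 1 GB file
d = 2**64   # number of possible 64-bit values

# Birthday bound: expected number of colliding pairs is C(n, 2) / d.
# Since n^2 is far below d, this first-order term is essentially exact.
expected_dupes = n * (n - 1) / (2 * d)
expected_distinct = n - expected_dupes

print(expected_dupes)               # ~0.000488: duplicates are vanishingly rare
print(n - expected_distinct < 1)    # True: answering the constant 2^27 is
                                    # off by less than one in expectation
```

In other words, under the uniform-input assumption a program that prints 2^27 without reading the file is, on average, correct to within a fraction of a single value.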