How good is the estimate in what case? Worst case? Best case? Expected case?
And what do you mean by 'good'? Absolute error? Absolute positive error? Error margin with a certain % of confidence?
And in the case of any answer involving probabilities, are there any assumptions about the kind of input files to be expected? Or are all 1 GB strings of 0s and 1s equally probable?
Also, do the 1024 bytes imply any limit on code size, or could I potentially include 1 GB of precomputed values in the source code? If there is no limit on the size of the source code, the problem becomes meaningless, since it can then be answered precisely with just 1 bit of writable extra space.
In case you are not allowed an unreasonable code size, I don't think you can have any 'good' estimate when there are significantly more distinct values than can be represented as a set in 1024 bytes. I tend to believe the best you can hope for is to be able to say 'there are more than x distinct values' for a very small x. I think you have to actually represent them somehow; after thorough reflection I find it unlikely that you can gain any useful information by 'seeing' a new number without having a representation of all that you have seen up to that point. Without any statistical particularities of the input file, the best you can achieve is 1 bit per stored number, so you can store no more than 8192 numbers. That's 2^13. There can be as many as 2^27 distinct values. So even if you're lucky and manage 1 bit per number, your error will be huge (up to about 99.994%). It is likely that you will achieve much less than 1 bit per number, unless you plan on getting lucky.
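A quick sanity check on this arithmetic (the 1-bit-per-number figure is the optimistic assumption from above):

```python
# Back-of-the-envelope check: how many numbers fit in 1024 bytes,
# and how large can the resulting underestimate be?

state_bits = 1024 * 8              # 8192 bits of writable state
max_stored = state_bits            # optimistic: 1 bit per stored number
file_numbers = (2**30) // 8        # 1 GB of 8-byte integers -> 2^27 numbers

worst_error = 1 - max_stored / file_numbers
print(max_stored)                  # 8192 = 2^13
print(file_numbers)                # 134217728 = 2^27
print(f"{worst_error:.5%}")        # 99.99390% underestimate in the worst case
```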
In case one cares about worst-case positive error (i.e., underestimating the number of distinct integers by the least possible amount) with 100% confidence, the problem reduces to finding a cute data structure that can represent the largest number of distinct 64-bit integers, fits in 1024 bytes, and can be updated by exposing it to an 8-byte number. A lot of interesting ideas come to mind. Mine are essentially based on a 'compacted' trie and some counters.
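As a deliberately naive baseline (much weaker than the trie-based ideas, and with Python's own object overhead ignored for the byte accounting), here is a sketch of a structure that yields such a 100%-confidence lower bound:

```python
class ExactLowerBound:
    """Report a guaranteed lower bound on the distinct count.

    Conceptually stores up to 128 distinct 64-bit values
    (128 * 8 bytes = 1024 bytes) verbatim and saturates after that;
    the reported bound is exact until saturation.
    """
    CAPACITY = 1024 // 8  # 128 eight-byte slots

    def __init__(self):
        self.values = set()
        self.saturated = False

    def add(self, x):
        if len(self.values) < self.CAPACITY:
            self.values.add(x)
        elif x not in self.values:
            # We saw a value beyond the 128 we can store, so there are
            # at least CAPACITY + 1 distinct values in the stream.
            self.saturated = True

    def lower_bound(self):
        return len(self.values) + (1 if self.saturated else 0)
```

The bound never overshoots the true distinct count, which is exactly the worst-case-positive-error guarantee described above; it just saturates at a very small x.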
Also, keep in mind that the number of distinct integers such a data structure can store can depend on the actual distribution of the integers it is exposed to! So the actual sampling from the input file could matter. The sampling can easily be randomized, however, or you could do even cuter things to make sure your structure counts as many distinct integers as it possibly can, regardless of how they are ordered in the input file.
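One way to make the stored sample independent of the ordering in the file is to key admission on a hash of the value itself rather than on arrival order. This is an illustrative sketch (the `keep` rule and the 0.001 threshold are my assumptions, not part of the original idea; blake2b is just a convenient keyed-size hash from the standard library):

```python
import hashlib

def keep(value: int, threshold: float = 0.001) -> bool:
    """Decide whether to admit `value`, based only on the value itself.

    Because the decision depends on hash(value) and not on when the value
    arrives, the final contents of the structure are a function of the
    multiset of values in the file, not of their ordering.
    """
    digest = hashlib.blake2b(value.to_bytes(8, "big"), digest_size=8).digest()
    h = int.from_bytes(digest, "big") / 2**64  # roughly uniform in [0, 1)
    return h < threshold
```

Shuffling the input file cannot change which values pass this test, so the structure's final state is order-independent by construction.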
Here are a few other interesting observations:
- An input file, from the problem statement's point of view, is equivalent to a partition of the number 2^64 into at most 2^27 components.
- So, for any correct answer r, there are around (2^64)^(r-2) possible choices of r distinct integers (see http://en.wikipedia.org/wiki/Partition_(number_theory)#Asymptotics_of_restricted_partitions). A particular configuration can therefore be represented using around 64*(r-2) bits.
- In case you don't 'cheat' by keeping any information regarding the input file in the actual code, your program is essentially (at most) a finite automaton with 2^(1024*8) states and up to 2^64 possible transitions from each state, where each state must represent a result.
- So, there are 2^(1024*8) = 2^8192 possible states. Each state has to be mapped to a possible result in 1..2^27.
- Transitioning from a state representing a result r to a state representing r+1 by exposing the automaton to a 64-bit number implies that it should somehow be able to discern, based on 1024*8 = 8192 bits of state, between the roughly 2^(64*(r-3)) cases where the 'new number' is a duplicate and the remaining 2^(64*(r-2)) - 2^(64*(r-3)) cases where it is not. So even if r is a small 100,003, the sheer magnitude of 64*100,000 (about 6.4 million bits) compared to 8192 is overwhelming, and a sufficient argument to support the case that no good approximation is possible.
- So, with a limited program size and so little memory, I think a strategy that is almost as good as the best strategy is to ignore the input file completely and just answer with the expected number of distinct values among 2^27 uniformly random 64-bit numbers.
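Under the assumption that all input files are equally probable (i.e., 2^27 independent uniform 64-bit values), that constant answer is easy to compute with the first-order birthday bound:

```python
n = 2**27   # count of 8-byte integers in a 1 GB file
d = 2**64   # number of possible 64-bit values

# Birthday bound: expected number of colliding pairs is C(n, 2) / d.
# Since n^2 is far below d, this first-order term is essentially exact.
expected_dupes = n * (n - 1) / (2 * d)
expected_distinct = n - expected_dupes

print(expected_dupes)               # ~0.000488: duplicates are vanishingly rare
print(n - expected_distinct < 1)    # True: answering the constant 2^27 is
                                    # off by less than one in expectation
```

In other words, under the uniform-input assumption a program that prints 2^27 without reading the file is, on average, correct to within a fraction of a single value.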