Let's map every user_id to a smaller value in such a way that their relative order is preserved (i.e. we normalize them). Every timestamp query then describes a contiguous range of our log, so we can replace the timestamps with array indices into the log.
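As a sketch, the normalization step might look like the following (the `normalize` helper is a hypothetical name, not from the original): each distinct user_id is replaced by its rank among the sorted distinct ids, which keeps the order intact.

```python
def normalize(user_ids):
    """Map each user_id to its rank among the distinct ids (order-preserving)."""
    distinct = sorted(set(user_ids))
    rank = {uid: i for i, uid in enumerate(distinct)}
    return [rank[uid] for uid in user_ids]

# e.g. normalize([1000, 42, 1000, 7]) -> [2, 1, 2, 0]
```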
We can solve the problem offline first. Sort the queries (each of the form
(left, right) -> count the number of distinct user_ids in the range [left, right] of the log) by their right endpoint, increasingly. We can then process them like this:
j = 0
for i in range(0, q):  # queries are sorted by their right endpoint
    while j <= query[i].right:
        if last_position[value[j]] != -1:
            # value[j] appeared before: its old mark is no longer the last occurrence
            BIT.add(last_position[value[j]], -1)
        BIT.add(j, 1)  # mark j as the current last occurrence of value[j]
        last_position[value[j]] = j
        j = j + 1
    answer[query[i].original_index] = BIT.query(query[i].right) - BIT.query(query[i].left - 1)
Here value contains the normalized user_ids in their original order, and BIT is a binary indexed tree (Fenwick tree). The time complexity is O((Q + N) * log(N)).
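The offline sweep above can be written as a self-contained sketch. The `BIT` class and the `distinct_in_ranges` helper below are illustrative names, not from the original; queries are (left, right) pairs, inclusive and 0-indexed.

```python
class BIT:
    """Binary indexed tree over n positions, 0-indexed point updates and prefix sums."""
    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)

    def add(self, i, delta):
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)

    def query(self, i):
        """Sum of positions [0, i]; returns 0 for i < 0."""
        i += 1
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s

def distinct_in_ranges(value, queries):
    """Offline answers for 'count distinct values in value[left..right]'."""
    n = len(value)
    # process queries in increasing order of their right endpoint
    order = sorted(range(len(queries)), key=lambda q: queries[q][1])
    bit = BIT(n)
    last_position = {}
    answer = [0] * len(queries)
    j = 0
    for qi in order:
        left, right = queries[qi]
        while j <= right:
            v = value[j]
            if v in last_position:          # v appeared before: unmark old occurrence
                bit.add(last_position[v], -1)
            bit.add(j, 1)                   # j is now the last occurrence of v
            last_position[v] = j
            j += 1
        answer[qi] = bit.query(right) - bit.query(left - 1)
    return answer

# distinct_in_ranges([1, 2, 1, 3], [(0, 3), (1, 2), (2, 3)]) -> [3, 2, 2]
```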
We can improve this idea to handle online queries; the approach is very similar. Suppose
segTree is a persistent segment tree that supports point assignment and range-sum queries, map() converts a timestamp to its array index, and segTree[-1] denotes the empty (all-zero) tree.
for i in range(0, n):
    if last_position[value[i]] != -1:
        # value[i] appeared before: unmark its previous occurrence in the new version
        segTree[i] = segTree[i - 1].update(last_position[value[i]], 0)
    else:
        segTree[i] = segTree[i - 1]
    last_position[value[i]] = i
    segTree[i] = segTree[i].update(i, 1)  # mark i as the last occurrence of value[i]
for i in range(0, q):
    print(segTree[map(query_right)].sumOfRange(map(query_left), map(query_right)))
Every update creates O(log(N)) new nodes, so the space complexity is O(N * log(N)); the time complexity is the same as in the offline solution.
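A minimal persistent segment tree along these lines can be sketched as follows; the names `Node`, `build`, `update`, `preprocess`, and `count_distinct` are assumptions for illustration. Each `update` copies only the O(log N) nodes on the path to the changed leaf, sharing the rest with the previous version.

```python
class Node:
    __slots__ = ("left", "right", "sum")
    def __init__(self, left=None, right=None, s=0):
        self.left, self.right, self.sum = left, right, s

def build(lo, hi):
    """All-zero tree over positions [lo, hi]."""
    if lo == hi:
        return Node()
    mid = (lo + hi) // 2
    return Node(build(lo, mid), build(mid + 1, hi), 0)

def update(node, lo, hi, pos, val):
    """Return a new version with position pos set to val (old version untouched)."""
    if lo == hi:
        return Node(s=val)
    mid = (lo + hi) // 2
    if pos <= mid:
        l = update(node.left, lo, mid, pos, val)
        return Node(l, node.right, l.sum + node.right.sum)
    r = update(node.right, mid + 1, hi, pos, val)
    return Node(node.left, r, node.left.sum + r.sum)

def query(node, lo, hi, ql, qr):
    """Sum over [ql, qr] in this version."""
    if qr < lo or hi < ql:
        return 0
    if ql <= lo and hi <= qr:
        return node.sum
    mid = (lo + hi) // 2
    return query(node.left, lo, mid, ql, qr) + query(node.right, mid + 1, hi, ql, qr)

def preprocess(value):
    """versions[i + 1] marks, for the prefix value[0..i], the last occurrence of each value."""
    n = len(value)
    versions = [build(0, n - 1)]  # empty tree plays the role of segTree[-1]
    last_position = {}
    for i, v in enumerate(value):
        cur = versions[-1]
        if v in last_position:
            cur = update(cur, 0, n - 1, last_position[v], 0)
        cur = update(cur, 0, n - 1, i, 1)
        last_position[v] = i
        versions.append(cur)
    return versions

def count_distinct(versions, n, left, right):
    """Online query: number of distinct values in value[left..right]."""
    return query(versions[right + 1], 0, n - 1, left, right)
```

Because each query only reads the version indexed by its right endpoint, queries can arrive in any order, which is what makes the solution online.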