Pagini: [1]   În jos
  Imprimă  
Ajutor Subiect: Shortest Snippet  (Citit de 6801 ori)
0 Utilizatori şi 1 Vizitator pe acest subiect.
Cosmin
Echipa infoarena
Nu mai tace
*****

Karma: 351
Deconectat Deconectat

Mesaje: 1.799



Vezi Profilul
« : Iunie 11, 2015, 05:46:13 »

http://www.infoarena.ro/blog/snippet
Memorat
retrograd
Client obisnuit
**

Karma: 3
Deconectat Deconectat

Mesaje: 50



Vezi Profilul
« Răspunde #1 : Iunie 11, 2015, 09:05:25 »

I am thinking of a solution that sounds like this:

We first preprocess the text: For each charater of T, we mark the position of the next 'a', 'b', 'c', ...
This step can be done via DP in O(T*SIGMA), in the following manner:

Next[ last ][ c ] = oo, for c in a..z
Next[ i ][ c ] = ( i + 1 if T[ i+1 ] == c else Next[ i+1 ][ c ] )

Next, we find out each encounter of the first letter, and we set a weight of 1
We make a MIN-heap out of all those and at each step we try to add one more character to the heap top, by looking at the Next table
(basically in the heap we have the length and index, and in separate arrays we can track how much the index has advanced.

Once the first index has advanced to the P-th position, the algorithm is over.

The complexity I think is O(T*SIGMA + T*P*logT), which is pretty bad, but in practice it should work out better.

I am thinking of a divide-et-impera type of solution. Let's see if that path leads me anywhere...
« Ultima modificare: Iunie 11, 2015, 09:22:56 de către Lucian Bicsi » Memorat
retrograd
Client obisnuit
**

Karma: 3
Deconectat Deconectat

Mesaje: 50



Vezi Profilul
« Răspunde #2 : Iunie 11, 2015, 09:22:07 »

In case the letters are distinct, we can do this in linear time, by tracking for each character c, which is the substring starting at the right-most position that will be able to use c. (basically the position itself)

Basically, we can precompute for each character what its position is in the pattern, let's call it Pos[c], and then by iterating through the text, we do this:
   if the current character is the first one, we set Right[T[t]] = t;
   else Right[T[t]] = Right[P[Pos[t] - 1]] (we could use the max function on this one, but it would be redundant)

Then, if T[t] is the last character and Right[T[t]] != 0 (that is, if we index T from 1), we try to update the solution with the value t - Right[T[t]] + 1.
I put the condition Right[T[t]] != 0, because if Right[T[t]] = 0, that means we haven't found any substrings that contain all the pattern.

The total extra memory is O(SIGMA).

 
Edit: This method can be used for a pattern P without distinct characters, by keeping a list of all the positions a character occurs in the pattern, but that would force us to update for a position t, the values for all the positions of the character T[t].

In other words, we keep for each character c (except the first one) a list of all positions, let's call this Positions[c], and our algorithm should work out as following:

Cod:
for i in range(1, len(T) + 1):
   if T[ i ] == P[ 0 ]:
      Right[ 0 ] = i
   for each pos in Positions[ T[ i ] ]:
      Right[ pos ] = Right[ pos-1 ]
  
   if Right[ len(P)-1 ] > 0:
      best = min(best, i - Right[len(P)-1] + 1)

We can see here that, if P contains a lot of the same characters, the complexity shoots up to O(T*P), however it is still better than my previous one, so I thought I should mention it Smile

« Ultima modificare: Iunie 11, 2015, 09:39:56 de către Lucian Bicsi » Memorat
Cosmin
Echipa infoarena
Nu mai tace
*****

Karma: 351
Deconectat Deconectat

Mesaje: 1.799



Vezi Profilul
« Răspunde #3 : Iunie 11, 2015, 20:53:25 »

The first solution is a good attempt, but sigma can be big and you can improve on that.

I didn't understand much in your second solution. (right isn't defined)
Memorat
dutzul
De-al casei
***

Karma: 42
Deconectat Deconectat

Mesaje: 119



Vezi Profilul
« Răspunde #4 : Iunie 13, 2015, 21:39:32 »

Hey,  are 2 of my ideeas, the first one works in T*P time and memory

The second one is slower than the first one , but uses only O(T) memory;

1)

The first ideea is kind of a brute-forces , try to match the string P at every position of T

 the question that arises is :

 i matched X characters of P , and i am at position i now , where is the closest position J such that J>i and P[X+1]=T[J];
 in other words where do i need to "jump" next.

 we can keep a matrix which looks kind of : Next [ T ][ SigmaP ] where is the closest position that i need;
 so that way we can achieve O(T*P) time and memory

2)

  second ideea optimizes the memory but runs slower :
  Again , we try the same thing try to match the string P at every position of T

  How to get rid o matrix Next[][]?

  well first i divide string T into sqrt(n) "chunks"
 
  for each chunk , i create a list of pair<character_at_position,position> and sort the list;
 
  now i can make the folowing query on a chunk , does character C exist in the chunk? if so where is the smallest index that i need (the one that is higher than the character i just mathed)

   this can be done by a binary search
   
The time complexity looks bad  lol , but atleast i improved the memory, i will think of a better ideea later;

Memorat
dutzul
De-al casei
***

Karma: 42
Deconectat Deconectat

Mesaje: 119



Vezi Profilul
« Răspunde #5 : Iunie 13, 2015, 21:56:22 »

Oh , yes you dont need the sqrt root trick actualy , i am tired lol

3rd ideea.
Create a List of length T with indices 1...len(T) lets call it V;

sort it using the folowing comparition function pair<T,i> (first we compare by the character and then by the position)

now , when i need "to jump" just like in sol 1 i can binary-search my next position in Vector A;

Time : O(T*P*log(T)) Memory O(T)

Memorat
gogu
Client obisnuit
**

Karma: 42
Deconectat Deconectat

Mesaje: 98



Vezi Profilul
« Răspunde #6 : Iunie 13, 2015, 22:39:48 »

To me "very big file" sounds like "we can't store it in memory", so even O(T) memory is too much.
I'd try a simple online algorithm, that goes something like this:
For each length L from 1 to P-1, keep track of the earliest index where you could find the first L characters in order. When you process a character c, just try to update all position in your string where c occurs. This is worst case O(T * P) but on average you could get it very close to O(T), with maybe a couple optimization for frequent letters.
In fact, it is O(T) when all the P letters are distinct.
Memorat
Vman
Echipa infoarena
Vorbaret
*****

Karma: 45
Deconectat Deconectat

Mesaje: 176



Vezi Profilul
« Răspunde #7 : Iunie 14, 2015, 06:33:09 »

If you're sliding let's say the middle you could launch a few threads. Or even a threadpool with threads for every letter and a wait/notify mechanism although there's some overhead in this situation.
« Ultima modificare: Iunie 14, 2015, 06:38:32 de către Duta Vlad » Memorat
loses
Strain


Karma: 0
Deconectat Deconectat

Mesaje: 14



Vezi Profilul
« Răspunde #8 : Iunie 15, 2015, 21:47:36 »

Gogu is right.
The most basic aproach to this problem is the T*P DP.
DP[len][itr] = biggest index such that [DP[len][itr] ... len] containts P[1 .. itr]

Actually on every iteration the DP uses nrOfTimesThisLetterIsInP steps to recalculate the DP
So if P is made out of unique characters, it's O(T)

At the end of the day, IMO, you can't use some kind of KMP shenanigans since the pattern can be really messy.
If you have some kind of abbabaabbaababba as P and T is made of abbabaabbaababbCabbabaabbaababbCabbabaabbaababba
I don't think you can do better than O(T*P), since every letter REALLY CHANGES half of the values in the DP .. but you need to compute it all, hoping that you will find the pattern (in this case, at the end of the string)

It seems strange that a trivial solution runs so good, and it's kinda intuitive.
Maybe there's another way around it. I will think about it, because this seems kinda simple, to be honest.

Memorat
retrograd
Client obisnuit
**

Karma: 3
Deconectat Deconectat

Mesaje: 50



Vezi Profilul
« Răspunde #9 : Iunie 15, 2015, 21:57:46 »

Ok, so let me polish up the explanation to my second attempt:
Let's assume the characters of P are unique (for the other case, see my older post)

Basically, say the pattern is P[ 1...len(P) ] (we index it from 1)

Right will be an array that will help us analyze the best solution, in the following manner:
Right[ i ] is the right-most position from where a substring that contains, in order, P[1..i], starts.

We choose this approach because, at any time, if we get the last character of the pattern at any time, we compare the minimum length with the current solution, which would be (i - Right[ len(P) - 1 ] + 1).

Now, to get to the part of constructing Right and checking:

We iterate through the Text.
At each step, we find out at which position(s) (see my older post after this) the current character is in the pattern, let pos be that one.
Then we set Right[ pos ] = Right[ pos - 1 ], because we have just found P[ pos ] at the current position

There are two special cases to this:
  • pos == 1, in which case, we have just found the start of a new, better "solution" -----> we set Right[1] = i (i indicates the current position in Text)
  • pos == len(P), in which case we have found another substring, namely the one starting at Right[ len(P) - 1 ] -----> we update our min-length

The cute thing about this is that it uses only O(P) memory.

EDIT: As I have seen here, the big guys attacked the problem in pretty-much the same manner, so now I think there shouldn't be much ambiguity in my solution anymore Smile.
« Ultima modificare: Iunie 15, 2015, 22:20:47 de către Lucian Bicsi » Memorat
StarGold2
Strain
*

Karma: 11
Deconectat Deconectat

Mesaje: 46



Vezi Profilul
« Răspunde #10 : Septembrie 11, 2015, 12:09:00 »

I am thinking of a solution which sounds like this:

We first preprocess a matrix D(i, j) which means the rightmost position of the character 'i' in the first 'j' elements of the string T.

This matrix can be preprocesses in O(T * SIGMA) where SIGMA is the length of the alphabet.

Next, we will calculate a matrix Dp(i, j) which means the shortest substring in the first 'j' elements of the string T which contains the first 'i' elements of the pattern P.

The solution will be in Dp(len(P), len(T)) where len(string) means the length of a string 'string'.

And now about how to calculate the matrix Dp: Dp(i, j) =:

if(T[j] == P) Dp(i, j) = the length rightmost substring which contains the first (i-1) characters of the pattern + (j - D(P[i-1] , j) ) // we add j to our rightmost substring because that one is optim

But now appears a question: "How to calculate the len of the rightmost substring, because our Dp matrix keeps the best one, not the rightmost one ?"
And the answer: "We first calculate the rightmost one and then the best one"

The code looks like this:

for(int i = 1; i <= SIGMA; i ++)
    for(int j = 1; j <= len(T); j ++)
        if(P == T[j])
            D(i, j) = j;
        else
            D(i, j) = D(i, j-1);

memset(D, INF);

for(int i = 1; i <= len(P); i ++)
    for(int j = 1; j <= len(T); j ++)
        if(P == T[j])
            Dp(i, j) = Dp(i-1, D(i-1, j-1)) + (j - D(i-1, j-1);

The answer is the smallest number on the line len(P)

I hope this algorithm runs well, I think that the first line should be preprocessed, else the entire matrix will be filled with INF, also make sure exist a D(i-1, j-1) character, else put INF.

The algorithm uses O(T*(SIGMA + len(P)) memory, but can be optimized by keeping the last 2 lines, and O(T*(SIGMA + len(P)) time.

I am searching for a better algorithm which uses less memory and runs faster.
If you think something is wrong with this algorithm, let me know as soon as possible.
Memorat
Pagini: [1]   În sus
  Imprimă  
 
Schimbă forumul:  

Powered by SMF 1.1.19 | SMF © 2006-2013, Simple Machines