Streaming algorithm

In computer science, **streaming algorithms** are algorithms for processing data streams in which the input is presented as a sequence of items and can be examined in only a few passes (typically just one). In most models, these algorithms have access to limited memory (generally logarithmic in the size of and/or the maximum value in the stream). They may also have limited processing time per item.

These constraints may mean that an algorithm produces an approximate answer based on a summary or "sketch" of the data stream.

Though streaming algorithms had already been studied by Munro and Paterson^{[1]} as early as 1978, as well as Philippe Flajolet and G. Nigel Martin in 1982/83, the field of streaming algorithms was first formalized and popularized in a 1996 paper by Noga Alon, Yossi Matias, and Mario Szegedy. For this paper, the authors later won the Gödel Prize in 2005 "for their foundational contribution to streaming algorithms." There has since been a large body of work centered around data streaming algorithms that spans a diverse spectrum of computer science fields such as theory, databases, networking, and natural language processing.

Semi-streaming algorithms were introduced in 2005 as a relaxation of streaming algorithms for graphs,^{[2]} in which the space allowed is linear in the number of vertices, but only logarithmic in the number of edges. This relaxation is still meaningful for dense graphs, and can solve interesting problems (such as connectivity) that are insoluble in *o(n)* space.

In the data stream model, some or all of the input is represented as a finite sequence of integers (from some finite domain) which is generally not available for random access, but instead arrives one at a time in a "stream".^{[3]} If the stream has length *n* and the domain has size *m*, algorithms are generally constrained to use space that is logarithmic in *m* and *n*. They can generally make only some small constant number of passes over the stream, sometimes just one.^{[4]}

Much of the streaming literature is concerned with computing statistics on frequency distributions that are too large to be stored. For this class of problems, there is a vector **a** = (*a*_{1}, ..., *a*_{n}), initialized to the zero vector **0**, that has updates presented to it in a stream. The goal of these algorithms is to compute functions of **a** using considerably less space than it would take to represent **a** precisely.

In the cash register model, each update is of the form ⟨*i*, *c*⟩, so that *a*_{i} is incremented by some positive integer *c*. A notable special case is when *c* = 1 (only unit insertions are permitted).

In the turnstile model, each update is of the form ⟨*i*, *c*⟩, so that *a*_{i} is updated by some (possibly negative) integer *c*. In the "strict turnstile" model, no *a*_{i} may ever be less than zero.
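To pin down the semantics of the two update models, here is a toy reference implementation that materializes the vector **a** explicitly (something a real streaming algorithm cannot afford; it is shown only for exposition):

```python
from collections import defaultdict

def apply_updates(stream, allow_negative=True):
    """Apply <i, c> updates to an (implicit) frequency vector a.

    allow_negative=False corresponds to the cash register model
    (every c must be positive); allow_negative=True corresponds to
    the turnstile model, where entries may also be decremented.
    """
    a = defaultdict(int)
    for i, c in stream:
        if not allow_negative and c <= 0:
            raise ValueError("cash register model requires c > 0")
        a[i] += c
    return dict(a)

# Turnstile stream: element 2 is inserted twice, then deleted once.
print(apply_updates([(2, 1), (5, 3), (2, 1), (2, -1)]))
# {2: 1, 5: 3}
```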

Several papers also consider the "sliding window" model. In this model, the function of interest is computed over a fixed-size window in the stream. As the stream progresses, items from the end of the window are removed from consideration while new items from the stream take their place.
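As a toy illustration of the sliding window model (using a hypothetical running-sum statistic; the model itself is agnostic to the function computed), a sum over the most recent *w* items can be maintained in O(*w*) space with a deque:

```python
from collections import deque

def sliding_window_sums(stream, w):
    """Yield the sum of the last w items after each arrival."""
    window = deque()
    total = 0
    for x in stream:
        window.append(x)
        total += x
        if len(window) > w:
            total -= window.popleft()  # expire the oldest item
        yield total

print(list(sliding_window_sums([1, 2, 3, 4, 5], 3)))
# [1, 3, 6, 9, 12]
```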

Besides the above frequency-based problems, some other types of problems have also been studied. Many graph problems are solved in the setting where the adjacency matrix or the adjacency list of the graph is streamed in some unknown order. There are also some problems that are very dependent on the order of the stream (i.e., asymmetric functions), such as counting the number of inversions in a stream and finding the longest increasing subsequence.

The performance of an algorithm that operates on data streams is measured by three basic factors:

- The number of passes the algorithm must make over the stream.
- The available memory.
- The running time of the algorithm.

These algorithms have many similarities with online algorithms since they both require decisions to be made before all data are available, but they are not identical. Data stream algorithms only have limited memory available but they may be able to defer action until a group of points arrive, while online algorithms are required to take action as soon as each point arrives.

If the algorithm is an approximation algorithm then the accuracy of the answer is another key factor. The accuracy is often stated as an *(ε, δ)* approximation, meaning that the algorithm achieves an error of less than *ε* with probability 1 − *δ*.

Streaming algorithms have several applications in networking, such as monitoring network links for elephant flows, counting the number of distinct flows, estimating the distribution of flow sizes, and so on. They also have applications in databases, such as estimating the size of a join.

The *k*th frequency moment of a set of frequencies **a** is defined as

*F*_{k}(**a**) = \sum_{i=1}^{n} a_{i}^{k}

The first moment *F*_{1} is simply the sum of the frequencies (i.e., the total count). The second moment *F*_{2} is useful for computing statistical properties of the data, such as the Gini coefficient of variation. *F*_{∞} is defined as the frequency of the most frequent items.

The seminal paper of Alon, Matias, and Szegedy dealt with the problem of estimating the frequency moments.

A direct approach to finding the frequency moments requires maintaining a register for all distinct elements, which requires memory of at least order *Ω(N)*; this is infeasible when memory is limited, so the moments are instead approximated.

See main article: Count-distinct problem.

Flajolet et al. introduced a probabilistic method of counting which was inspired by a paper by Robert Morris. Morris's paper shows that if the requirement of exactness is dropped, a counter *n* can be replaced by a counter of log *n*, which can be stored in log log *n* bits.^{[7]} Flajolet et al. improved this method by using a hash function *h* which is assumed to uniformly distribute the elements in the hash space (a binary string of length *L*):

*h* : [*m*] → [0, 2^{L} − 1]

Let bit(*y*, *k*) represent the *k*th bit in the binary representation of *y*, so that

*y* = \sum_{k \geq 0} bit(*y*, *k*) 2^{k}

Let *ρ*(*y*) represent the position of the least significant 1-bit in the binary representation of *y*, with a suitable convention for *ρ*(0):

*ρ*(*y*) = \begin{cases} \min(k : bit(y, k) = 1) & \text{if } y > 0 \\ L & \text{if } y = 0 \end{cases}

Let *A* be the sequence of the data stream, of length *M*, whose cardinality needs to be determined. Let *BITMAP*[0...*L* − 1] be the hash space where the hashed values *ρ*(hash(*x*)) are recorded. The algorithm below then determines the approximate cardinality of *A*.

Procedure FM-Sketch:

    for i in 0 to L − 1 do
        BITMAP[i] := 0
    end for
    for x in A do
        index := ρ(hash(x))
        if BITMAP[index] = 0 then
            BITMAP[index] := 1
        end if
    end for
    B := position of leftmost 0 bit of BITMAP[]
    return 2 ^ B

If there are *N* distinct elements in the data stream:

- For *i* ≫ log(*N*), *BITMAP*[*i*] is certainly 0.
- For *i* ≪ log(*N*), *BITMAP*[*i*] is certainly 1.
- For *i* ≈ log(*N*), *BITMAP*[*i*] is a fringe of 0's and 1's.
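The FM-Sketch procedure can be sketched in Python as follows (using SHA-1 as a stand-in for the idealized uniform hash function; the Flajolet–Martin correction factor and averaging over several independent hash functions are omitted for brevity):

```python
import hashlib

L = 32  # number of bits in each hash value

def h(x):
    """Hash x to an L-bit integer (stand-in for an ideal uniform hash)."""
    digest = hashlib.sha1(str(x).encode()).digest()
    return int.from_bytes(digest[:4], "big")

def rho(y):
    """Position of the least significant 1-bit of y; L if y == 0."""
    if y == 0:
        return L
    return (y & -y).bit_length() - 1

def fm_estimate(stream):
    """FM-Sketch: estimate the number of distinct elements in the stream."""
    bitmap = [0] * (L + 1)  # extra slot so rho == L (hash value 0) fits
    for x in stream:
        bitmap[rho(h(x))] = 1
    b = bitmap.index(0) if 0 in bitmap else L + 1  # leftmost 0 bit
    return 2 ** b

stream = [i % 100 for i in range(10_000)]  # 100 distinct elements
print(fm_estimate(stream))  # a rough power-of-two estimate of 100
```

Averaging many such estimates over independent hash functions, and applying the Flajolet–Martin correction constant, tightens the estimate.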

The previous algorithm describes the first attempt to approximate *F*_{0} in the data stream by Flajolet and Martin. Their algorithm picks a random hash function which they assume uniformly distributes the hash values in the hash space.

Bar-Yossef et al. introduced the k-minimum value algorithm for determining the number of distinct elements in a data stream. They used a similar hash function *h*, which can be normalized to [0, 1] as *h* : [*m*] → [0, 1], but fixed a limit *t* on the number of values kept from the hash space. The value of *t* is assumed to be of the order *O*(1/ε^{2}) (i.e., a smaller approximation error ε requires a larger *t*). The KMV algorithm keeps only the *t* smallest hash values seen. After all *m* values of the stream have arrived, *υ* = Max(*h*(*a*_{i})) over the retained values is used to calculate *F*′_{0} = *t*/*υ*. That is, in a close-to-uniform hash space, they expect at least *t* elements to be less than *O*(*t*/*F*_{0}).

Procedure 2: K-Minimum Value

    Initialize first t values of KMV
    for a in a1 to an do
        if h(a) < Max(KMV) then
            Remove Max(KMV) from KMV set
            Insert h(a) to KMV
        end if
    end for
    return t/Max(KMV)
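A minimal Python sketch of the KMV procedure (using SHA-1 normalized to [0, 1) as a stand-in for the idealized hash function *h*, and filling the KMV set lazily rather than pre-initializing it):

```python
import hashlib
import heapq

def h(x):
    """Hash x to a float in [0, 1) (stand-in for an ideal uniform hash)."""
    digest = hashlib.sha1(str(x).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def kmv_estimate(stream, t):
    """Estimate the number of distinct elements from the t smallest hashes."""
    kmv = []       # max-heap (negated values) of the t smallest hashes
    seen = set()   # hash values currently retained, for duplicate skipping
    for a in stream:
        v = h(a)
        if v in seen:
            continue  # this hash is already in the KMV set
        if len(kmv) < t:
            heapq.heappush(kmv, -v)
            seen.add(v)
        elif v < -kmv[0]:
            seen.discard(-heapq.heappushpop(kmv, -v))  # evict current max
            seen.add(v)
    if len(kmv) < t:
        return len(kmv)  # fewer than t distinct elements: count is exact
    return t / -kmv[0]   # t divided by the t-th smallest hash value

print(kmv_estimate([i % 500 for i in range(5000)], t=256))
# close to 500 in expectation
```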

The KMV algorithm can be implemented in *O*((1/ε^{2}) · log(*m*)) memory bits: each hash value requires space of order *O*(log(*m*)) memory bits, and there are *O*(1/ε^{2}) hash values. The access time can be reduced by storing the *t* hash values in a binary tree, which reduces the time complexity to *O*(log(1/ε) · log(*m*)).

Alon et al. estimate the frequency moments by defining random variables that can be computed within given space and time; the expected value of these random variables gives the approximate value of *F*_{k}.

Assume the length of the sequence, *m*, is known in advance. Then construct a random variable *X* as follows:

- Select *a*_{p} to be a random member of the sequence, with value *l* ∈ (1, 2, 3, ..., *n*).
- Let *r* = |{*q* : *q* ≥ *p*, *a*_{q} = *l*}| represent the number of occurrences of *l* among the members of the sequence from position *p* onward.
- Define the random variable *X* = *m*(*r*^{k} − (*r* − 1)^{k}).

Assume *S*_{1} is of the order *O*(*n*^{1−1/k}/λ^{2}) and *S*_{2} is of the order *O*(log(1/ε)). The algorithm takes *S*_{2} random variables *Y*_{1}, *Y*_{2}, ..., *Y*_{S2} and outputs the median *Y*, where *Y*_{i} is the average of *X*_{ij} for 1 ≤ *j* ≤ *S*_{1}.

Now calculate the expectation of the random variable, *E*(*X*):

\begin{array}{lll}
E(X) &=& \sum_{i=1}^{n}\sum_{j=1}^{m_i}\left(j^{k}-(j-1)^{k}\right)\\
&=& \frac{m}{m}\left[\left(1^{k}+\left(2^{k}-1^{k}\right)+\ldots+\left(m_{1}^{k}-\left(m_{1}-1\right)^{k}\right)\right)\right.\\
&& +\left(1^{k}+\left(2^{k}-1^{k}\right)+\ldots+\left(m_{2}^{k}-\left(m_{2}-1\right)^{k}\right)\right)+\ldots\\
&& \left.+\left(1^{k}+\left(2^{k}-1^{k}\right)+\ldots+\left(m_{n}^{k}-\left(m_{n}-1\right)^{k}\right)\right)\right]\\
&=& \sum_{i=1}^{n} m_{i}^{k} = F_{k}
\end{array}

where *m*_{i} is the number of occurrences of value *i* in the stream; each inner sum telescopes to *m*_{i}^{k}.
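A toy single-machine sketch of this construction (illustrative only: it stores the stream in order to sample the position *p*, whereas the actual algorithm samples positions online using the known length *m*, and the median-of-averages step over *S*_{1}·*S*_{2} copies is reduced here to plain averaging):

```python
import random

def ams_single_estimate(stream, k, rng):
    """One sample of the AMS random variable X with E[X] = F_k.

    Requires the stream length m to be known (here, len(stream)).
    """
    m = len(stream)
    p = rng.randrange(m)   # random position in the stream
    l = stream[p]          # the sampled element a_p = l
    r = sum(1 for q in range(p, m) if stream[q] == l)  # occurrences from p on
    return m * (r**k - (r - 1)**k)

def ams_estimate(stream, k, s1=400, rng=None):
    """Average s1 independent copies of X (the inner averaging step)."""
    rng = rng or random.Random(42)
    return sum(ams_single_estimate(stream, k, rng) for _ in range(s1)) / s1

stream = [1, 1, 1, 2, 2, 3]
# Exact F_2 = 3^2 + 2^2 + 1^2 = 14
print(ams_estimate(stream, k=2))  # close to 14 in expectation
```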

From the algorithm to calculate *F*_{k} discussed above, we can see that each random variable *X* stores the values of *a*_{p} and *r*. So, to compute *X* we need to maintain only log(*n*) bits for storing *a*_{p} and log(*n*) bits for storing *r*. The total number of random variables *X* is *S*_{1} · *S*_{2}.

Hence the total space complexity the algorithm takes is of the order of

*O*\left(\dfrac{k \log\frac{1}{\varepsilon}}{\lambda^{2}} n^{1-\frac{1}{k}} \left(\log n + \log m\right)\right)

The previous algorithm calculates *F*_{2} in order of *O*(√*n* (log *m* + log *n*)) memory bits. Alon et al. simplified this algorithm using four-wise independent random variables with values mapped to {−1, 1}. This further reduces the complexity of calculating *F*_{2} to

*O*\left(\dfrac{\log\frac{1}{\varepsilon}}{\lambda^{2}} \left(\log n + \log m\right)\right)
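The ±1 idea can be sketched as follows (a seeded PRNG with memoized signs stands in for a four-wise independent hash family, and plain averaging replaces the median-of-means step): each counter *z*_{j} accumulates the signed stream, and *z*_{j}^{2} has expectation *F*_{2}.

```python
import random

def f2_sketch_estimate(stream, num_counters=64, seed=0):
    """Estimate F_2 by averaging squared +/-1-signed counters."""
    rng = random.Random(seed)
    signs = [{} for _ in range(num_counters)]  # per-counter sign memo
    z = [0] * num_counters

    def sign(j, x):
        # Stand-in for a four-wise independent hash to {-1, +1}.
        if x not in signs[j]:
            signs[j][x] = rng.choice((-1, 1))
        return signs[j][x]

    for x in stream:
        for j in range(num_counters):
            z[j] += sign(j, x)
    return sum(zj * zj for zj in z) / num_counters  # each zj^2 has E = F_2

stream = [1] * 3 + [2] * 2 + [3]   # F_2 = 9 + 4 + 1 = 14
print(f2_sketch_estimate(stream))  # close to 14 in expectation
```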

In the data stream model, the **frequent elements problem** is to output a set of elements that constitute more than some fixed fraction of the stream. A special case is the **majority problem**, which is to determine whether or not any value constitutes a majority of the stream.

More formally, fix some positive constant *c* > 1, let the length of the stream be *m*, and let *f*_{i} denote the frequency of value *i* in the stream. The frequent elements problem is to output the set {*i* : *f*_{i} > *m*/*c*}.^{[8]}

Some notable algorithms are:

- Boyer–Moore majority vote algorithm
- Count-Min sketch
- Lossy counting
- Multi-stage Bloom filters
- Misra–Gries summary
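Of these, the Boyer–Moore majority vote algorithm is the simplest: it solves the majority special case in one pass with O(1) memory. A sketch:

```python
def majority_candidate(stream):
    """One-pass Boyer-Moore majority vote.

    Returns the only possible majority element; a second pass (or prior
    knowledge that a majority exists) is needed to confirm it.
    """
    candidate, count = None, 0
    for x in stream:
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    return candidate

print(majority_candidate([3, 1, 3, 2, 3, 3, 2]))  # 3
```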

Detecting events in data streams is often done using a heavy hitters algorithm as listed above: the most frequent items and their frequencies are determined using one of these algorithms, and the largest increase over the previous time point is reported as the trend. This approach can be refined by using exponentially weighted moving averages and variance for normalization.^{[9]}

Counting the number of distinct elements in a stream (sometimes called the *F*_{0} moment) is another problem that has been well studied. The first algorithm for it was proposed by Flajolet and Martin. In 2010, Daniel Kane, Jelani Nelson and David Woodruff found an asymptotically optimal algorithm for this problem. It uses *O*(ε^{−2} + log *n*) space, with *O*(1) worst-case update and reporting times, as well as universal hash functions and an *r*-wise independent hash family where *r* = Ω(log(1/ε) / log log(1/ε)).

The (empirical) entropy of a set of frequencies **a** is defined as

*F*_{k}(**a**) = \sum_{i=1}^{n} \dfrac{a_i}{m} \log{\dfrac{a_i}{m}}, where *m* = \sum_{i=1}^{n} a_i.
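For reference, this sum can be computed exactly offline in one pass over the frequency vector (a streaming algorithm can only approximate it, since **a** itself is too large to store):

```python
import math

def empirical_entropy_term(freqs):
    """Compute sum_i (a_i/m) * log(a_i/m) for a frequency vector a."""
    m = sum(freqs)
    return sum((a / m) * math.log(a / m) for a in freqs if a > 0)

print(empirical_entropy_term([2, 2]))  # ≈ -0.6931 (= -ln 2)
```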

See main article: Online machine learning. Here the goal is to learn a model (e.g. a classifier) in a single pass over a training set.

Lower bounds have been computed for many of the data streaming problems that have been studied. By far, the most common technique for computing these lower bounds has been communication complexity.

- Munro, J. Ian; Paterson, Mike (1978). "Selection and Sorting with Limited Storage". 19th Annual Symposium on Foundations of Computer Science, Ann Arbor, Michigan, USA, 16–18 October 1978. IEEE Computer Society. pp. 253–258. doi:10.1109/SFCS.1978.32.
- Feigenbaum, Joan; Kannan, Sampath (2005). "On graph problems in a semi-streaming model". Theoretical Computer Science. 348 (2): 207–216. doi:10.1016/j.tcs.2005.09.013.
- Babcock, Brian; Babu, Shivnath; Datar, Mayur; Motwani, Rajeev; Widom, Jennifer (2002). "Models and Issues in Data Stream Systems". Proceedings of the Twenty-first ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS '02). New York, NY, USA: ACM. pp. 1–16. doi:10.1145/543613.543615. ISBN 978-1581135077.
- Bar-Yossef, Ziv; Jayram, T. S.; Kumar, Ravi; Sivakumar, D.; Trevisan, Luca (2002). "Counting Distinct Elements in a Data Stream". Randomization and Approximation Techniques in Computer Science. Lecture Notes in Computer Science. Springer, Berlin, Heidelberg. pp. 1–10. doi:10.1007/3-540-45726-7_1. ISBN 978-3540457268.
- Indyk, Piotr; Woodruff, David (2005). "Optimal Approximations of the Frequency Moments of Data Streams". Proceedings of the Thirty-seventh Annual ACM Symposium on Theory of Computing (STOC '05). New York, NY, USA: ACM. pp. 202–208. doi:10.1145/1060590.1060621. ISBN 978-1-58113-960-0.
- Bar-Yossef, Ziv; Jayram, T. S.; Kumar, Ravi; Sivakumar, D.; Trevisan, Luca (2002). "Counting Distinct Elements in a Data Stream". In Rolim, José D. P.; Vadhan, Salil (eds.), Randomization and Approximation Techniques in Computer Science. Lecture Notes in Computer Science. Springer Berlin Heidelberg. pp. 1–10. doi:10.1007/3-540-45726-7_1. ISBN 978-3-540-44147-2.
- Flajolet, Philippe (1985). "Approximate counting: A detailed analysis". BIT Numerical Mathematics. 25 (1): 113–134. doi:10.1007/BF01934993.
- Cormode, Graham (2014). "Misra–Gries Summaries". In Kao, Ming-Yang (ed.), Encyclopedia of Algorithms. Springer US. pp. 1–5. doi:10.1007/978-3-642-27848-8_572-1. ISBN 9783642278488.
- Schubert, E.; Weiler, M.; Kriegel, H.-P. (2014). "SigniTrend: scalable detection of emerging topics in textual streams by hashed significance thresholds". Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14). pp. 871–880. doi:10.1145/2623330.2623740. ISBN 9781450329569.

- Kane, Daniel M.; Nelson, Jelani; Woodruff, David P. (2010). "An optimal algorithm for the distinct elements problem". Proceedings of the Twenty-ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS '10). New York, NY, USA: ACM. pp. 41–52. doi:10.1145/1807085.1807094. ISBN 978-1-4503-0033-9.
- Heath, D.; Kasif, S.; Kosaraju, R.; Salzberg, S.; Sullivan, G. (1991). "Learning Nested Concepts with Limited Storage". Proceedings of the 12th International Joint Conference on Artificial Intelligence (IJCAI '91), Volume 2. Morgan Kaufmann. pp. 777–782.