Suppose you’re learning a new language and want to boost your vocabulary in a very time-efficient way.
Everyone learns a language in their own way. Suppose you want to improve your vocabulary by reading books in that language. To get the most impact, you'd like to pick books that cover as many common words in the language as possible.
Here is a formalization. Suppose you have a large set of m books with an average length of n words, and you want to pick the single book from the set with the highest vocabulary impact. The vocabulary impact of a book is a weighted sum over the distinct vocabulary words it contains, each word weighted by how common it is in the language, as measured by its frequency across all the books; this frequency essentially serves as the probability weight of the word in the language.
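To make this concrete, here is one way to write the objective down (the notation is mine, not part of the original statement). Let V(B) be the set of distinct vocabulary words in a book B, after stop words are removed, and let p(w) be the relative frequency of word w across all m books. Then the vocabulary impact of B is

    f(B) = \sum_{w \in V(B)} p(w),

and the single-book problem asks for the book that maximizes f.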
That’s an easy problem to solve. First, optionally filter out stop words, words such as (in the case of English) “the” and “and” that are considered in some sense to carry little meaning. Second, build a list of the unique words across all books, along with a count of the number of occurrences of each word. Finally, for each book evaluate its coverage of those words, computing a score as described above.
Computing the unique word list and the counts can be done by hashing in O(mn) time on average, O(mn log(mn)) in the worst case. Computing the scores also takes linear time on average. So the entire process runs in linear time.
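As a concrete sketch (the code and names below are mine, not from the post), the whole single-book computation might look like this in Python, with each book represented as a list of lowercase word tokens:

    from collections import Counter

    STOP_WORDS = {"the", "and"}  # illustrative; a real stop-word list is longer

    def word_weights(books):
        """Relative frequency of each non-stop word across all books."""
        counts = Counter(w for book in books for w in book if w not in STOP_WORDS)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def vocabulary_impact(book, weights):
        """Weighted coverage score: each distinct word counts once."""
        return sum(weights.get(w, 0.0) for w in set(book))

    def best_single_book(books):
        weights = word_weights(books)
        scores = [vocabulary_impact(book, weights) for book in books]
        return max(range(len(books)), key=scores.__getitem__)

    books = [
        "the cat sat on the mat".split(),
        "dogs and cats play in the park".split(),
        "a brief history of time and space".split(),
    ]
    print(best_single_book(books))  # index of the highest-impact book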
What if you want to find the best two books to read, the pair with the best joint vocabulary coverage? For the general case, the best known way to find the optimal pair takes quadratic rather than linear time.
How about the best k books for arbitrary k > 0?
This is an NP-hard problem, meaning that for the best known exact algorithms in the general case, the compute time grows exponentially as the size k of your desired set of reading books increases.
So you cannot expect to solve this problem exactly for large k. All hope is not lost, however. This is a maximum weighted coverage problem, which lies in a subset of the NP-hard problems known as submodular problems. Because of this, approximation algorithms are known that are guaranteed to come within a certain known factor of the true optimum (see [1-5]).
These algorithms are described in [6], [7], and [8]. The basic idea is to add one high-impact book at a time to the running set of books: a greedy algorithm. The result is not guaranteed to be the best book list, but it is reasonable: for an objective of this kind the greedy solution achieves at least a 1 − 1/e ≈ 63% fraction of the optimal coverage [6].
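Here is a minimal sketch of that greedy procedure (my own illustration, reusing the toy representation of books from the earlier sketch):

    from collections import Counter

    STOP_WORDS = {"the", "and"}

    def word_weights(books):
        counts = Counter(w for book in books for w in book if w not in STOP_WORDS)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def greedy_books(books, k):
        """Pick k books, each time adding the one with the largest marginal gain."""
        weights = word_weights(books)
        vocab = [set(b) - STOP_WORDS for b in books]
        chosen, covered = [], set()
        for _ in range(min(k, len(books))):
            def gain(i):
                return sum(weights[w] for w in vocab[i] - covered)
            best = max((i for i in range(len(books)) if i not in chosen), key=gain)
            chosen.append(best)
            covered |= vocab[best]
        return chosen

    books = [
        "the cat sat on the mat".split(),
        "dogs and cats play in the park".split(),
        "a brief history of time and space".split(),
    ]
    print(greedy_books(books, 2))  # indices of the two books chosen greedily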
The helpful Python submodlib package runs very fast on this problem.
One can improve the quality of the result by spending more on computation. First, you could use a blocking strategy: at each step compute exactly the best pair of books to add, or the best triple, and so forth (at quadratic, cubic, and so forth computational cost). Similarly, one could use a look-ahead strategy: add the two books that together are best by an exact computation, then drop the second one and find the best second and third books, and so forth. In general these refinements do not improve the submodularity bound, but in practice the result is more accurate.
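As an illustration of the blocking idea, here is a compact sketch of the pairwise variant (again my own code, using the same toy representation of books as above):

    from collections import Counter
    from itertools import combinations

    def greedy_pairs(books, k, stop_words=frozenset({"the", "and"})):
        """At each step add the exactly-best pair of books, at quadratic cost per step."""
        counts = Counter(w for b in books for w in b if w not in stop_words)
        total = sum(counts.values())
        weights = {w: c / total for w, c in counts.items()}
        vocab = [set(b) - stop_words for b in books]
        chosen, covered = [], set()
        while len(chosen) + 2 <= min(k, len(books)):
            remaining = [i for i in range(len(books)) if i not in chosen]
            def pair_gain(pair):
                i, j = pair
                return sum(weights[w] for w in (vocab[i] | vocab[j]) - covered)
            i, j = max(combinations(remaining, 2), key=pair_gain)
            chosen += [i, j]
            covered |= vocab[i] | vocab[j]
        return chosen

    books = [
        "the cat sat on the mat".split(),
        "dogs and cats play in the park".split(),
        "a brief history of time and space".split(),
    ]
    print(greedy_pairs(books, 2))  # best pair found by exact pairwise search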
One can also use various heuristics to improve performance in some cases. For example, if a book contains little or no vocabulary that is not already present in the other books, it can sometimes be safely discarded. In general, however, the exact problem remains hard (unless P = NP).
References
[1] Abhimanyu Das and David Kempe. Submodular meets Spectral: Greedy Algorithms for Subset Selection, Sparse Approximation and Dictionary Selection. Proceedings of ICML, 2011.
[2] Abhimanyu Das and David Kempe. Approximate Submodularity and its Applications: Subset Selection, Sparse Approximation and Dictionary Selection. Journal of Machine Learning Research (JMLR), 19(3):1–35, 2018.
[3] M. Conforti and G. Cornuéjols. Submodular set functions, matroids and the greedy algorithm: tight worst-case bounds. Discrete Applied Mathematics, 7(3):251–274, 1984.
[4] Maxim Sviridenko, Jan Vondrák, and Justin Ward. Optimal approximation for submodular and supermodular optimization with bounded curvature. Proceedings of SODA, 2013.
[5] Rishabh Iyer, Stefanie Jegelka, and Jeff Bilmes. Curvature and Optimal Algorithms for Learning and Minimizing Submodular Functions. 2013. https://arxiv.org/abs/1311.2110
[6] George L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher. An analysis of approximations for maximizing submodular set functions—I. Mathematical Programming, 14(1):265–294, 1978.
[7] “Maximum coverage problem,” https://en.wikipedia.org/wiki/Maximum_coverage_problem.
[8] Kai Wei, Rishabh Iyer, and Jeff Bilmes. Fast multi-stage submodular maximization. Proceedings of ICML, pp. 1494–1502. PMLR, 2014.