CS246, Mining Massive Data Sets (Stanford)

The course will discuss data mining and machine learning algorithms for analyzing very large amounts of data. The emphasis will be on MapReduce and Spark as tools for creating parallel algorithms that can process very large amounts of data.
Topics include: Frequent itemsets and Association rules, Near Neighbor Search in High Dimensional Data, Locality Sensitive Hashing (LSH), Dimensionality reduction, Recommendation Systems, Clustering, Link Analysis, Large-scale Supervised Machine Learning, Data streams, Mining the Web for Structured Data, Web Advertising.



Replies, comments and Discussions:

  • Work & Study / Academics & Technology / CS246, Mining Massive Data Sets (Stanford) +13



    • Lecture 01 +9


      • Slide: +6
      • Suggested Readings - Chapter 1: Data Mining +6
        • The most commonly accepted definition of “data mining” is the discovery of “models” for data. +1
          • However, more generally, the objective of data mining is an algorithm. +1
        • Bonferroni’s Principle is really a warning about overusing the ability to mine data. +1
          • If you look in your data for too many things at the same time, you will see things that look interesting but are in fact simply statistical artifacts with no significance.
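A minimal sketch of Bonferroni's Principle in action: mine pure random noise for many "patterns" at once, and some will look interesting anyway. All the numbers here (1000 tests, 100 coin flips, the cutoff of 13) are illustrative assumptions, not from the course.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

n_tests = 1000  # number of "patterns" we mine for in pure noise

def looks_interesting():
    """Flip 100 fair coins; flag the run as 'interesting' if heads
    deviate far from the expected 50 (roughly the 1% two-sided tail)."""
    heads = sum(random.random() < 0.5 for _ in range(100))
    return abs(heads - 50) > 13

false_discoveries = sum(looks_interesting() for _ in range(n_tests))
print(false_discoveries)  # a handful of "findings" despite there being nothing to find
```

Every flagged run here is a statistical artifact, since the data is random by construction; that is exactly the warning Bonferroni's Principle encodes.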
        • There are many phenomena that relate two variables by a power law, that is, a linear relationship between the logarithms of the variables.


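A small sketch of that log-log linearity: generate data from a hypothetical power law y = c·x^a (the exponent a = 2 and constant c = 3 are made-up values) and check that the slope between consecutive points in log-log space is constant and equal to the exponent.

```python
import math

a, c = 2.0, 3.0                 # hypothetical exponent and constant
xs = [1, 2, 4, 8, 16, 32]
ys = [c * x ** a for x in xs]   # y = c * x^a

# Slope between consecutive points in log-log space; a constant slope
# means log y is linear in log x.
slopes = [
    (math.log(ys[i + 1]) - math.log(ys[i])) / (math.log(xs[i + 1]) - math.log(xs[i]))
    for i in range(len(xs) - 1)
]
print(slopes)  # every slope equals the exponent a = 2.0
```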

      • Excellent; I actually made it through this whole lecture. The lecturer's coughing was pretty alarming. +1
      • Suggested Readings - Chapter 2: Large-Scale File Systems and Map-Reduce +4
        • Schematic of a MapReduce computation



        • To deal with applications such as these, a new software stack has evolved. These programming systems are designed to get their parallelism not from a “supercomputer,” but from “computing clusters”: large collections of commodity hardware, including conventional processors (“compute nodes”) connected by Ethernet cables or inexpensive switches.
      • In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers. —Grace Hopper +1

        The volume of data being made publicly available increases every year, too. Organizations no longer have to merely manage their own data; success in the future will be dictated to a large extent by their ability to extract value from other organizations’ data.


        • Concepts developed by Google: (1) achieve parallel computing on a large array of inexpensive machines; (2) tolerate hardware failures.
      • MapReduce is a style of programming. +1



        • (1) Bring computation close to the data; (2) store files multiple times for reliability.
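The MapReduce programming style above can be sketched in-memory with the classic word-count example: a map phase emits (key, value) pairs, a shuffle groups values by key, and a reduce phase aggregates each group. This is an illustrative toy, not a real framework; the function names and sample documents are made up.

```python
from collections import defaultdict

def map_phase(documents):
    """Emit (word, 1) pairs, like a MapReduce map task."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    """Group values by key, like the shuffle/sort step between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum each key's list of values, like a MapReduce reduce task."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big ideas", "big clusters"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 3, 'data': 1, 'ideas': 1, 'clusters': 1}
```

In a real cluster the map and reduce tasks run in parallel on many compute nodes and the shuffle moves data across the network; the logical structure is the same.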
    • Lecture 02 +8


    • Lecture 03 +7


    • My kid happens to be taking this course this semester.