CS246, Mining Massive Data Sets (Stanford)

The course will discuss data mining and machine learning algorithms for analyzing very large amounts of data. The emphasis will be on MapReduce and Spark as tools for creating parallel algorithms that can process very large amounts of data.
Topics include: Frequent itemsets and Association rules, Near Neighbor Search in High Dimensional Data, Locality Sensitive Hashing (LSH), Dimensionality reduction, Recommendation Systems, Clustering, Link Analysis, Large-scale Supervised Machine Learning, Data streams, Mining the Web for Structured Data, Web Advertising.



Replies, comments and Discussions:

  • Work & Study / Academics & Technology / CS246, Mining Massive Data Sets (Stanford) +13



    • Lecture 01 +9


      • Slide: +6
      • Suggested Readings - Chapter 1: Data Mining +6
        • The most commonly accepted definition of “data mining” is the discovery of “models” for data. +1
          • However, more generally, the objective of data mining is an algorithm. +1
        • Bonferroni’s Principle is really a warning about overusing the ability to mine data. +1
          • If you look in your data for too many things at the same time, you will see things that look interesting but are in fact simply statistical artifacts with no significance.
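A minimal sketch of Bonferroni's Principle in action: mine pure random noise for many "patterns" at once, and some will look interesting anyway. All the numbers here (1000 tests, 100 coin flips, the cutoff of 13) are illustrative assumptions, not from the course.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

n_tests = 1000  # number of "patterns" we mine for in pure noise

def looks_interesting():
    """Flip 100 fair coins; flag the run as 'interesting' if heads
    deviate far from the expected 50 (roughly the 1% two-sided tail)."""
    heads = sum(random.random() < 0.5 for _ in range(100))
    return abs(heads - 50) > 13

false_discoveries = sum(looks_interesting() for _ in range(n_tests))
print(false_discoveries)  # a handful of "findings" despite there being nothing to find
```

Every flagged run here is a statistical artifact, since the data is random by construction; that is exactly the warning Bonferroni's Principle encodes.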
        • There are many phenomena that relate two variables by a power law, that is, a linear relationship between the logarithms of the variables.


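A small sketch of that log-log linearity: generate data from a hypothetical power law y = c·x^a (the exponent a = 2 and constant c = 3 are made-up values) and check that the slope between consecutive points in log-log space is constant and equal to the exponent.

```python
import math

a, c = 2.0, 3.0                 # hypothetical exponent and constant
xs = [1, 2, 4, 8, 16, 32]
ys = [c * x ** a for x in xs]   # y = c * x^a

# Slope between consecutive points in log-log space; a constant slope
# means log y is linear in log x.
slopes = [
    (math.log(ys[i + 1]) - math.log(ys[i])) / (math.log(xs[i + 1]) - math.log(xs[i]))
    for i in range(len(xs) - 1)
]
print(slopes)  # every slope equals the exponent a = 2.0
```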

      • Excellent; I actually made it through this whole lecture. The lecturer's coughing was pretty alarming. +1
      • Suggested Readings - Chapter 2: Large-Scale File Systems and Map-Reduce +4
        • Schematic of a MapReduce computation



        • To deal with applications such as these, a new software stack has evolved. These programming systems are designed to get their parallelism not from a “supercomputer,” but from “computing clusters”: large collections of commodity hardware, including conventional processors (“compute nodes”) connected by Ethernet cables or inexpensive switches.
      • In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers. —Grace Hopper +1

        The volume of data being made publicly available increases every year, too. Organizations no longer have to merely manage their own data; success in the future will be dictated to a large extent by their ability to extract value from other organizations’ data.


        • Concepts developed by Google: (1) achieve parallel computing on a large array of inexpensive machines; (2) tolerate hardware failures.
      • MapReduce is a style of programming. +1



        • (1) Bring computation close to the data; (2) store files multiple times for reliability.
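The MapReduce programming style above can be sketched in-memory with the classic word-count example: a map phase emits (key, value) pairs, a shuffle groups values by key, and a reduce phase aggregates each group. This is an illustrative toy, not a real framework; the function names and sample documents are made up.

```python
from collections import defaultdict

def map_phase(documents):
    """Emit (word, 1) pairs, like a MapReduce map task."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    """Group values by key, like the shuffle/sort step between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum each key's list of values, like a MapReduce reduce task."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big ideas", "big clusters"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 3, 'data': 1, 'ideas': 1, 'clusters': 1}
```

In a real cluster the map and reduce tasks run in parallel on many compute nodes and the shuffle moves data across the network; the logical structure is the same.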
    • Lecture 02 +8


    • Lecture 03 +7


    • My kid happens to be taking this course this semester.