A PhD Data Scientist: Jack of All trades, master of one.
Published exclusively on 1point3acres (一亩三分地)

???????ȡ???ݣ?ϴ???ݣ???????Prototype?Լ???data solution?????????????ݹ???ԭ??(MapReduce))??

    t-test, Regression, ANOVA, Logistic Regression, DOE, Machine Learning, Data Mining, MapReduce, SQL, R/Matlab, Python, Java


??????Ҫ???IT????ҵ?????ݿ?ѧ It does not define a data engineer. Rather, it's a close call to a "full-stack data scientist". Master this list and you will not only be able to work for established firms, but startups too.
Ҫѧ????????Ҫ?ܶ?ʱ?䣬???ϣ????̫????????data scientist, OK, dream on!
?о?????data scientist/researcher֮??ְλ????Ը?ǿ???ܸ?????????????׶Է???Ҫ????ʲô?????ˣ???ɶ????һ??ģ????ǻ??ͳ?Ƶ???ũ??????Machine learning???????Ż???logistics ??Ӧ???????ǻ???̵?ͳ??ʦ??
(data business person һ?㲻??data scientist) ??Ҫ??SQL??????????BI analyst Ҳ???ڴ??С?

ѧϰ?б?һ????׼???????ã?????????ƽʱ????Ҫ?õġ????Լ?ѧ???mark as green
????????ղر????????¥?·??ġ??ղء? -?? ȷ?? -?? Ȼ?????»?????? ????ݵ?????-???ղ?????

???????ijУ??Data Science??Ŀ??Σ?????Χ????ܷ???ijУ??I have no idea.

????????must have:
Statistics: statistics and machine learning
  hypothesis testing, point/interval estimation
  pvalue, power, (type 1/2 error)
  clt, delta method, derive coef and var(coef) etc
  t-test: assumptions, remedy. ???????ⷶΧbasics listed above ?뿴????? http://onlinestatbook.com/2/index.html
  glm (lm, logistic regression, anova etc):asssumptions, model selection and validation, diagnostics, remedy ???????ⷶΧ

  time series          Forecast with R
         Time Series Analysis and Its Applications: With R Examples (Springer Texts in Statistics)

         Bayesian Methods for Hackers (python)
         Coursera Graphical Model (VERY nicely explained)
         Bayesian reasoning and machine learning book (quite difficult to read)
         ???ţ?A first course in Bayes һ?¾Ϳ????ˣ??ܲ???

  longitudinal, mixed model
  doe:all kinds of design, response surface

Machine Learning        Coursera Andrew Ng
    Stanford Statistical Learning (Tibshirani & Hastie)
        -- ???黹????һ?????ư棬???ض???ʵ????????R, very easy to read. recommend starting from here.
    Caltech?Ǹ?learning from Data??û?ܸ?????Please, make sure you know your logistic regression inside and out!

Deep Learning:
See my separate thread here: http://www.1point3acres.com/bbs/ ... 1&extra=#pid2601595
Learn recommender system
Learn some NLP
Make sure you KNOW how things work, not just how to call a certain package in a certain language!!!
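In that spirit, here is logistic regression written out by hand instead of called from a package. This is only a minimal sketch: one feature plus an intercept, fitted by plain batch gradient ascent on the log-likelihood. The function names, the learning rate, the number of epochs, and the toy data are all assumptions chosen for illustration, not anything prescribed in this thread.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit P(y=1 | x) = sigmoid(b0 + b1 * x) by batch gradient ascent.

    The gradient of the log-likelihood is sum((y - p) * x_j), so each
    step moves the coefficients in the direction of the residuals.
    """
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # residual on the probability scale
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


# Toy, deliberately non-separable data (hypothetical).
xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic(xs, ys)
p_low = sigmoid(b0 + b1 * 0)
p_high = sigmoid(b0 + b1 * 5)
```

With separable data the maximum-likelihood coefficients diverge, which is one of the "inside and out" details worth being able to explain in an interview.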
Experimental Design / Causal Inference
This is somewhat of a niche area. But as a DS, you will most likely deal with some A/B tests if you are with a reputable internet company. It is not just using some tool to compute power for a chi-square test or t-test. Be sure you know the difference between an observational study and a designed experiment, and when to use which. Students from a biostat/epi background will have an edge here. If you are able to handle very complex experimental designs, then you are opening many doors: think multi-sided marketplaces with interfering subjects (Uber/Lyft, Airbnb, eBay), social network problems (Snap/FB/LinkedIn), and problems where users can't be cleanly randomized (opt-in, marketing campaigns, mobile app feature rollouts).
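The power computation itself is the easy part, but it is worth doing once by hand. A minimal sketch, assuming a two-sided comparison of two means with equal group sizes, a known common standard deviation, and the normal approximation (so it slightly understates what an exact t-test calculation would give). The function name and parameters are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect, sd, alpha=0.05, power=0.8):
    """Approximate n per arm to detect a mean difference `effect`.

    n = 2 * ((z_{1 - alpha/2} + z_{power}) * sd / effect) ** 2
    """
    z = NormalDist()  # standard normal
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    n = 2 * ((z_alpha + z_beta) * sd / effect) ** 2
    return math.ceil(n)


# Half a standard deviation of lift, 5% two-sided alpha, 80% power.
n_half_sd = sample_size_per_group(effect=0.5, sd=1.0)  # 63 per arm
```

Halving the detectable effect roughly quadruples the required sample, which is usually the first thing to explain to a product team. None of this addresses interference or non-random assignment, which is where the harder design work lies.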

Optimization & More
Intro to linear programming https://www.math.ucla.edu/~tom/LP.pdf Good and easy read.
See see.stanford for additional courses on convex opt.
Prof. Ferguson also has some good reading material on game theory https://www.math.ucla.edu/~tom/Game_Theory/
Udacity Intro to AI is a great course (also one of the very first MOOCs in the world) that connects many concepts together, including particle filters, Kalman filters, HMMs, etc.

ͳ??????Statistical Computing: R/Matlab/Python. SAS(?)
    R and Matlab are considered equivalent in industry. However, Matlab is not free; Octave is free but not as pleasant to use. Please consider learning R. If you already know Matlab, picking up R takes no time at all.
    ???????????һ???????ᣬֻ??SAS Base/Stat????????Ҳ????ѧ?????ģ???Ҳ?????ݿ?ѧ???ʺ??㡣??????Ҫ??SAS???ɣ?????????д??macro??SAS??ȷ?ڴ????ݵĽ?ģ????dz????ã????Ǹ???????ҵ???ϴ?????????????˶???R/Py/Java ??????ǽ??????????쳣???ѡ??????????ܹ󣬺ܶ?ط?δ??Ը???? 来源一亩.三分地论坛.
    Python: Data Analysis with Python (book), pandas
    R: data.table, or plyr, lubridate, reshape2; build an R package. There are now lots of such courses on both Udacity and Coursera. Start from any.
        know how to get data from any source (DB, web, xml, plain text, etc)
EDA (exploratory) - Descriptive stats udacity
Inference - udacity
read code from your favorite packages

??? : A compiled language, and a scripting language
    ?ұȽ?ƫ??Udacityһ???һ????quiz ?ķ?ʽ???????ⲻ??(codecademy?????Լ?????ѧ?????
    Udacity CS101
    Udacity CS 215 (Algorithm, ??Coursera Princeton and StanfordҪ?򵥣????ٹ?һ?鲻????
    Udacity (Peter Norvig) CS212 Design of a Computer Program ?dz??ã?ǿ???Ƽ?

Java ???ݽṹ???㷨
1. Udacity java ?????ſ??һ???40Сʱѧ?꣩?ʺ???ʲô?Ǻ???ʲô?Ǹ?ֵ????֪?????ˡ?
2. Data structure
  Java:  Berkeley 61B http://www.cs.berkeley.edu/~jrs/61b/
        ?̲???Head First Java & Data Structures and Algorithms in Java??
       my progress bar: week 5, lab1, hw1
3. Algorithm:                  Udacity Algo in Python ?Ƚ?laid back???????̫ϣ???Ѿ?????????????Σ??????????????á?????
Java Coursera Algo I&II (Princeton)??????????????????Ȥ??
                  ???????? Stanford Algo I&IIҲ?ܺã????߲????໥???档

???ٻ?????ѧ?ĵ?һ????????C#??????C#????û??ʲô?ر????ŵ??飬???Ƽ??????û??ǰûѧC, java, C++ֱ?ӿ?C#?????ֱ?޷?????
C++?Ƚ??ѣ???data scientist ??˵Ӧ??Ҳû??java?㡣??Ȼ??????Ǵ?ţ??plz????û˵??

Design pattern:????ͬѧ?Ƽ??ģ?
http://courses.caveofprogramming ... ns-and-architecture
?Ҽٶ?????????????????ȫ?????ף?????ְλ?ر?ǿ????ͳ??ʦ?????߽?Data scientist, statistics/analytics??????ְλ˵??????Դ?????ȫһ?????????㶼???Լ??裬????ҪһЩ?????????? ??
????ˮƽ?ǣ?. from: 1point3acres.com/bbs
IT??˾???ƣ?Leetcode MediumҪ?????????ԣ?ˢ???ɡ?. 1point3acres.com/bbs

?????????ũ??????????????ƫ??data engineer?ģ?Ҫ??????

. visit 1point3acres.com for more.?漰֪ʶ????????Ҳ????ڣ?
     ?Ľ?MapReduce?㷨??beyond brute force)
     ????漰?????ݣ???ʱ?临?Ӷ?Ҫ???Ƚϸ?     Binary search, and be prepared to talk about complexity
     very basic DFS/BFS
     reservoir sampling
     string manipulation
     if DP (dynamic programming) is ever asked, it will be very basic
     basic data structures
Most likely you don't need leetcode hard
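Of the topics in the list above, reservoir sampling is the one people most often have not seen before. A minimal sketch of the classic single-pass version (often called Algorithm R), which keeps a uniform random sample of k items from a stream of unknown length. The function name and the optional `rng` parameter are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=random):
    """Return k items chosen uniformly at random from an iterable.

    The first k items fill the reservoir. Item i (0-based, i >= k) then
    replaces a random slot with probability k / (i + 1).
    """
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1000), 5)
short = reservoir_sample(range(3), 5)  # stream shorter than k: keep everything
```

Memory is O(k) regardless of stream length, which is why this comes up in data-heavy interviews.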
-- ????????????????????
Regex (a couple of hours) http://deerchao.net/tutorials/regex/regex.htm

SQL (a week) http://www.w3schools.com/sql/    Coursera: Intro to DB

SQL ????Ҫ????ʲô?̶ȣ?
???JD??DS, ???Լ?û?????ر??ر???ĵģ????ǿ϶???Ҫ??
JOIN ?? subquery
WHERE
advanced: windowing function
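To make the checklist above concrete, here is a small sketch using Python's built-in sqlite3 module. The `users` and `orders` tables and their contents are made up for illustration. The window-function query is guarded because it assumes the underlying SQLite library is version 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
INSERT INTO orders VALUES (1, 1, 30.0), (2, 1, 20.0), (3, 2, 5.0);
""")

# JOIN + WHERE + GROUP BY: total spend per user.
totals = conn.execute("""
    SELECT u.name, SUM(o.amount) AS total
    FROM users u
    JOIN orders o ON o.user_id = u.id
    WHERE o.amount > 1
    GROUP BY u.name
    ORDER BY total DESC
""").fetchall()

# Subquery: users who never ordered.
no_orders = conn.execute("""
    SELECT name FROM users
    WHERE id NOT IN (SELECT user_id FROM orders)
""").fetchall()

# Window function: rank each user's orders by amount.
ranked = None
if sqlite3.sqlite_version_info >= (3, 25, 0):
    ranked = conn.execute("""
        SELECT user_id, amount,
               RANK() OVER (PARTITION BY user_id ORDER BY amount DESC) AS rnk
        FROM orders
        ORDER BY user_id, rnk
    """).fetchall()
```

Running each query and looking at the rows it returns is a quick way to check that you understand what every clause does to the table.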
???googleһ??׼??sql???Ե?link ??????Щ??Ϣ.留学论坛-一亩-三分地


MapReduce: some knowledge
    Udacity series: http://blog.udacity.com/2013/11/sebastian-thrun-launching-our-data.html
    Coursera: Intro to Data Science
    Coursera: Big data and web intelligence
    learning by doing --- yes! wrote my very first reducer for real life projects!
    MongoDB (udacity) (NoSQL)

Spark/Scala - try this book: Advanced Analytics with Spark (very doable and easy to follow, superb examples)
Scala?Ƽ? Coursera: functional programming in scala - ??????

Spark MOOC http://www.1point3acres.com/bbs/thread-135600-2-1.html
Book: Learning spark
Basic Engineering: https://see.stanford.edu/Course
It also has great content on optimization, which is harder to find elsewhere.

If you want to be a DS for IT firms, then maybe:
   jquery/ajax (start from codecademy's very simple js and jquery intro, then find books); the w3schools one is also really good.
web services   get basic idea of how browsers work (udacity - Website Performance Optimization)
   udacity web development (build a blog) (40 hours)
SE
   Software Development Life Cycles (udacity, mostly videos, as a quick intro only), amazingly, this one filled lots of holes in my knowledge base. Highly recommend
   A book is also mentioned there that is worth a quick flip through (unfortunately, I found no ebook that works): Refactoring: Improving the Design of Existing Code, by Martin Fowler, Kent Beck, John Brant, William Opdyke, and Don Roberts

-- this is helpful not only for working in IT, but for overall coding style/efficiency as well. I wish I had known it earlier.
   Many servers run Linux, so at least familiarize yourself with the command line. There's a not-so-good course on edX.
Basic shell script or similar
jq, sed, awk

????Ϊ?򣬸??ݽ?겵ı?????????hide complex formula/engineering details??????????big picture
    ???˾????ǣ?ϰ????Щ??????õİ취?ǣ?ȥ??????Ҫ?Թ??ԵĽ?????????ʱ??ע?????Ƿ????????????Է????????ʣ??ش?????Ҫѡȡ???϶Է??????Ĺؼ??֣??????ǡ??Լ???Ϥ???Ĺؼ??֡???Ҫ????д??С??Χ????ི???intuition???ٶѻ???ʽ??. 围观我们@1point 3 acres
    1. ??һ???Լ?רҵ?????ſΣ?e.g ͳ??ѧ????ȥ??????רҵ???˽?????ͳ?ƣ????ӣ??????ȫ????ͳ?Ƶ??˽???ʲô??pvalue, power, false positive, randomization, inference etc.
    2. Consulting - ??ЩѧУ????????session????????˷?ʱ?䣬ȥ?ѱ??˽?????ȥ?????????????רҵ??????ʲô???⣬???ǵ?˼·???????ﲻͬ??????????????ǣ???????????????㡣. 围观我们@1point 3 acres
    3. ??presentation - ??Ҫ??רҵѧ????????????ȥ????Ҫ?????????101??????????????Ŀ?ģ?????չʾ???רҵ??ô??????£?????Ϊ??impress others with your techinal prowess???????öԷ???????????ȡ??Ľ??顣.留学论坛-一亩-三分地
    Data Journalism (course, starting early 2014) --- it was not as good as I expected. I do not recommend it.

??ͼ????̬??????ܻ?ggplot (a few hours), ??̬??d3????????javascript, also great!?? ?Ƽ???. 1point3acres.com/bbs
     Nathan Yau: the books Visualize This and Data Points, and his FlowingData blog
     for d3: Interactive Data Visualization for the Web . free online tutorial by author: http://alignedleft.com/tutorials/d3/about ???û??ô??. 围观我们@1point 3 acres
html (a few hours, w3c)
css (a few hours, w3c), or codecademy, or the d3 book mentioned above
javascript (codecademy as a start, a book to follow later)

Udacity also has a new course on vis

Prototype your data products:
    mean stack. https://thinkster.io/angulartutorial/mean-stack-tutorial/
    R open CPU. R Shiny (limited usage with free version).     If you are not into Angular, try the flask+React stack, ???ֵ?ȷ?ܿ?
  ??????flask, udacity?пΣ?react??ѧ???ɣ????Բο?udacity ????components?ĿΣ?

??Ȼ???Dz???Ҫ??ǰ?ο????????ǿ?????Ҳ???????и??????ǰ?Σ???ѧϰ??MM?ľ??飬???? http://www.1point3acres.com/bbs/thread-104335-1-1.html
Design:  (optional but nice to know) ???û????Ȥ?????ٿ???????????ÿ?????ɫ??  ?????????Ȥ??ͼ?ÿ????뻨һ????ĩ?????⼸????
    1. Before and After
    2. Nondesigner's design book
    3. Don't make me think
    4. The Wall Street Journal Guide to Information Graphics
    sharelatex (invite enough users to get free versioning) /writelatex.com
    Go to conferences, see what people are working on. Read their papers.
Domain Knowledge: google/wikipedia is your friend

    Doing Data science (book)
    Data Science in Business
other һЩ?Ҹо???̫??ʱ?䵫?ǻ????õ?С????
   excel, power pivot etc

?????ݵ?????ɶ????????Big Data: A Revolution That Will Transform How We Live, Work, and Think??. 一亩-三分-地,独家发布
?ͺܽ??Ƶ?һ?? ??Automate This: How Algorithms Took Over Our Markets, Our Jobs, and the World??
Ȼ??Ȼ????Nate Silver ??The Signal and the Noise: Why So Many Predictions Fail-but Some Don't??. more info on 1point3acres.com
Case study:  Twitter data analytics http://tweettracker.fulton.asu.edu/tda/
?????Ƽ??? MS  data science ѧϰcurriculum  http://datasciencemasters.org/
??Ҹ????Ƽ??İ???????˼·??????ȷ?ķ?ʽ???µĹ??ߣ?It's more important than you think!!. 留学申请论坛-一亩三分地
coursera reproducible research??ѧתknitr????Ҫcopy paste anything

Udacity Git Course (??ã?û??֮һ??
???ݿ?ѧ??һ?? apprenticeship model???Һ??ʵ??˴??????£??ɳ???ܿ졣



earlgrey posted on 2016-3-10 12:04:23
li3939108 posted on 2016-3-4 15:00
K????ã?????ECE??CE track?? PhD??coding?????????ԣ??ײ㵽Ӧ?ò?ɶ??֪??һ?㣬?Լ???Ruby on Railsд ...

Casella and Berger??????ܺã??????????ҹ???׼?????Բ?̫??

All of Statistics ?Ȿ??Ҳ??master levelͳ?ƿεĽ̳̣??????topic????һЩ??Ҳ???ִ?һ?㣬??Ȼ????????topic??????

linear regression, DOE, ML ?????ⶼ?п??ܳ???

ML: ?????? stanford cs 229?? ????notesд?Ļ?ͦ???


Zzzed posted on 2017-8-24 22:51:16

????????MOOC????Data Science?????෱?࣬ ѡ??̫???????޴????֣?????ת?е?ͬѧ????????̵?ʱ???ڻ???????????????ʵ?õ?֪ʶ?? ????Ϊ????֮һ??????ᣬ???ھͽ?????Լ??ľ???˵???Լ????ⷽ????ĵá?
???ҿ????? Data Science/Analytics ??????Ҫ???????¼?????ļ??ܣ?

1. SQL and database-related skills
SQL???ѣ???????Ҫ?????????????չ⿿?????? select, from, where, group by ??ԶԶ?????ģ???õ???ϵ????????һ??дһ?߿??ó??Ľ?????Ӷ??????ÿ?????ʵ???ڱ????????????ʲô???????߼???ʲô??
SQLҲ?????ݷ???????ʱ?ص㿼??ķ??棬Google, Facebook, Uber, Slack?ȵ???Щ??ĿƼ???˾????ȥ???ؿ??죬????Ҫ????fancy????????䣬???ǻ????????ü򵥵????????ȥʵ?ֺܸ??ӵ??߼???ϵ?? ?ⷽ?????Դ?Ƚ????ż????? SQLZOO ?? W3 School??SQL???֣????????????˵?ÿ??????֣????Ҷ?????ǰ??˵?Ŀ???????һ??дSQLһ?߿???query?????Ľ???? ?????????????????????????ݱ???????ʲô??
???׵???Դ??΢????edx?ϵ?һ??MOOC??Querying with Transact-SQL, ???ſ?Ҳ?????ڳ?ѧ?ߣ?????ѧϰ??ʱ??Ҫ??һЩ????Ϊ???ݻὲ????һЩ??????window function ?? table expression??

2. Basic principles of statistics
?󲿷ִ?ͳ?Ļ???ѧϰ???㷨??????ͳ??ѧ??????ͳ??ѧ??֪ʶҲ??????????????????̽???׶Σ?Explanatory Data Analysis?? ?͹????и??ָ?????Statistical Testing????. 一亩-三分-地,独家发布

3. Data Science/Machine Learning Modeling
Udemy: Python for Data Science and Machine Learning Bootcamp
edX: The Analytics Edge

Udacity: Intro to Machine Learning

???ſ???Google X ʵ???ҵĴ?ʼ?? Sebastian Thrun ??ͬʱҲ??Udacity?Ĵ?ʼ?ˣ????ڵģ?ȫ??ĺ?????????ML???㷨???м?ÿ??һ???µ??㷨?????ᴩ???˺ܶ?С??ϰ?????㹮????ѧ????֪ʶ??????Sebastian??Ϊҵ???ţ????ML?Ľ???Ҳ??????ֱ???׶???
Stanford Online:  Statistical Learning

???????̽??ML?㷨???????ѧ???ۻ???????????????˹̹??????ξ?????IJ???ѡ????Ȼ?γ̵???ѧ?????漰?϶࣬????ֻҪ??????λ????(??λ??ţ??????һλ?????˴?????????LASSO Regression)һ???????????DZȽ????׶??ģ????γ?Ҳ?????˽??????ֵ?R??Ϊ???????
??????Щ????Data Science/Analytics ?????ſΣ???ӭ??λ??ţ???????䣡
??Ȼ???к???????coursea?ϵ?JHU??data scienceϵ?У?????????Ͳ????????ˡ?

黑夜雪 posted on 2014-9-13 16:36:02
Last edited by 黑夜雪 on 2014-9-13 16:38

??????ѧ??C, javascript???????DZ???ǰ???꣬???????⡣MATLAB?õĶ࣬????STATA?ľ???????֪???????????ɶ???ô??ѧУΪɶ??ôϲ??????EE PHD+ECON MASTER DOUBLE MAJOR, some experience in econometrics
Ŀ?꣺һ??֮??ѧ??Python, Java, R, HTML5, Javascript, CSS, Machine Learning, MapReduce, SQL
????Python??JavaԤ?ƻ?ʱ????࣬???ڿ?ʼѧϰ??????R?????㻨?޶?ʱ?䣬׼???и???ŵ??˽⡣HTML5ϵ???????ٴ???????ʱ??׼???????վ??ѧ??Ū??Machine learning׼????Python??ʵ?֣???Ҫ????????ӣ?http://blog.renren.com/share/231 ... ose_time=1410188191?????Ứһ??ʱ?䡣ʣ?µ???ʱ?ƻ???????
???ڼƻ???3???£???Python??ѧϰ1???£?????google?Ŀγ?+СK???????㷨?????ݽṹ??graphûѧ?꣩???Լ?дcode?????ݽṹ??ʵ????һ?飨CSͬѧ˵??Ҫ??ѧ???ݽṹ????ѧ??recursive programming????һ??д?˸????????ij???graphѧ???תս<Python for data analysis>??ͬʱ??ʼ????JAVA(??Core Java????),????JAVAʵ?ֻ??????ݽṹ???????ʱ?俴????ǰC?Ŀμ?????Ҫ??Ϥpointer??Ȼ??һЩ???ܸ???Ĺ????Ρ?
Note: ??ʵ?ʼ??ʱ??ִܵ?coding??????????????coding?????ǻῪ??һ??????֪?????????һ??????ļ??ܡ?data science??Ҫ??????EE??Ӳ??Ҳϣ????ᣬ??????Ҳ?Ҳ????????ӱܵ??????ˡ?SIGH


OP | 小K posted on 2016-12-24 00:12:03
https://www.quora.com/Are-there-good-online-courses-for-Operations-Research

Nikos Makrymanolakis, M.Sc., ph.d (cand.) in the area
Some very good and relevant courses about OR subjects in coursera:

* Discrete Optimization (https://www.coursera.org/course/... ) by Professor Pascal Van Hentenryck
* Algorithms, Part I (https://www.coursera.org/course/... ) by Kevin Wayne and Robert Sedgewick
* Algorithms, Part II (https://www.coursera.org/course/... ) by Kevin Wayne and Robert Sedgewick
* Algorithms on Graphs and Trees (https://www.coursera.org/learn/a... ) by Alexander S. Kulikov and Michael Levin
* Algorithms: Design and Analysis, Part 1 (https://www.coursera.org/course/... ) by Tim Roughgarden
* Algorithms: Design and Analysis, Part 2 (https://www.coursera.org/course/... ) by Tim Roughgarden

Most of the algorithms covered in the courses above are used in OR. The discrete optimization course is excellent and focuses entirely on optimization (you will love the professor).
Feng Mai, ‎Assistant Professor at Stevens Institute of Technology

Operations Research is a broad field. For optimization I would recommend
Prof. Stephen Boyd's convex optimization (available on YouTube) and
Prof. Pascal Van Hentenryck's discrete optimization (coursera).


Somewhat older list

The list from stanford


OP | 小K posted on 2015-6-29 11:22:09
\ (•◡•) /  scala ??????Կ?ʼ???????ˣ?????????
OP | 小K posted on 2014-9-3 09:09:23
?Է???ʱ????һ??udacity??software development life cycles???ſΣ?actually?dz???????????С????˵?????֪ʶ???ϵ?һЩ©????
wenbo5565 posted on 2017-7-24 13:33:36
???ѧϰ???????? ֮ǰ????һ??ͳ?Ƶ?master??????????????1??࣬??Ҫ?????????ݺ???tableau??visualization. ????׼?????ڶ???master,??????machine learning???????????????ҵ?Ժ?????º?predictive modeling/machine learning?йصĹ????????ڵ?ˮƽ??????python??sklearn??package?μ?һЩkaggle?ı??? ??óɼ??ܴﵽ15%-20%??ˮƽ ??????Ȼ?о????????algorithm??blackbox. ??̸?ѧϰ??????ô????? (֮ǰ??ͳ????Ҫ??frequentist?ĽǶȣ????׼??????һЩBaysian??CS????ĿΡ?????û??data structure??algorithm?ľ????ʺ?????????ҵ??????ѧ?Ƶ???design and analysis of algo?Ŀ?ô?????Ҫ????data structure?ڿ?????design and analysis of algo) ??л??
iverson1122 posted on 2016-1-27 07:43:05

EE phd?ڶ??????جج?????꣬?????????ʼ???Ժ??????????ˡ??ܽ???һ???Լ???skill set???????Ӱ?ûһ??????ϵ?ġ????ǵ????Ķ?????machine learningմ??ߣ?ֻ???ñ??˵?ģ?????ã?????˵????һƪ?õ?svm??paper??????ҲŸ????svm?Ĺ???ԭ???????????һ??????Ȥ???ټ??ϵ?ʦ??ҽѧǰ??????Ĺ?עԶԶ???ڼ??????棬?????¶???????ѧ????data science??

????????machine learning??????ģ??ԭ????Ŷ????ף??߽?һ??ı???PGM?й???dz?????ţ??˴??????⣬??Ҿ???MLҪѧ??ʲô?̶??أ?????ͼģ??????ģ?ͣ?ѧ??֮??о?Ҳ????û?õ?????ͣ????һ?????е?ӡ?󡱵Ľ׶Σ???ͳ??ѧ?Ϲ?????graduate level??ͳ?ƿΣ??????ĵ???ƣ????????䣬t-test, F-test, ANOVA, PCA, Regression??ֻ??˵????֪???Ƶ????̣????ھʹ???и?ӡ??֪????ʲô??????õõ??? ʱ?????з????ͱ?Ҷ˹û̫?Ӵ??????㷨??????ԱȽ????ţ?֮ǰ??Ů????ˢ?㷨???ʱ???Լ?Ҳż????????һ?£??о???ʽ?ҹ???֮ǰͻ??һ??Ӧ?û??á???ļ?????????ԣ?java, R, Matlab, Python????????ֻ????ˢ??/???ݷ??????棬??û??ʲô????/Ӧ?ÿ????ľ????????ݿ?/?????? ????Web????о????Լ??????̰壨??ȫС?ף??????????ѧMySQL??ѧ??֮??׼????Udacity??hadoop???ܿλ???coursera??python ??ȡweb?????????ſΣ???????????????￴???ģ???лK??~????. 牛人云集,一亩三分地

pureds posted on 2013-11-24 13:15:42
nibuxing posted on 2013-11-20 10:49:57
??K????????= =??һֱҲ????һ???ٶ?һ??дһƪ?Լ????ɵ????ɼ?¼??????ллK??ķ?????????Ŭ????ϣ???ڲ??õĽ??????и?????ջ?
OP | 小K posted on 2013-11-27 03:03:33
Bayesian Methods for Hackers
https://github.com/CamDavidsonPi ... Methods-for-Hackers
OP | 小K posted on 2013-12-12 09:13:19
Introduction to Hadoop and MapReduce


Install the VM as instructed (use WinRAR on Windows; 7-Zip will fail)

next steps:
read this:
http://forums.udacity.com/questi ... at-to-do-next#ud617
especially this http://www.youtube.com/watch?v=c_cJKZ4vzhA&t=57

watch this:
https://www.udacity.com/course/v ... 8873795/m-309382595
the difference between the file system on Linux and on HDFS!!!!

even if there's a local directory called data, you still need to create one on HDFS:

hadoop fs -mkdir data

hadoop fs -ls
Found 1 items
drwxr-xr-x   - training supergroup          0 2013-12-11 17:16 data

then there's an HDFS folder called data.

now put the actual data into HDFS:
hadoop fs -put purchases.txt data (the 1st argument, purchases.txt, is the file on your local filesystem; the 2nd, data, is the HDFS folder)
then you can check that it is there:
hadoop fs -ls data
Found 1 items
-rw-r--r--   1 training supergroup  211312924 2013-12-11 17:17 data/purchases.txt

hs ../code/mapper.py ../code/reducer_f2.py data/purchases.txt outdata2
packageJobJar: [../code/mapper.py, ../code/reducer_f2.py, /tmp/hadoop-training/hadoop-unjar8573115774818496995/] [] /tmp/streamjob8981780528938293292.jar tmpDir=null
13/12/11 17:33:16 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/12/11 17:33:17 WARN snappy.LoadSnappy: Snappy native library is available
13/12/11 17:33:17 INFO snappy.LoadSnappy: Snappy native library loaded
13/12/11 17:33:17 INFO mapred.FileInputFormat: Total input paths to process : 1
13/12/11 17:33:17 INFO streaming.StreamJob: getLocalDirs(): [/var/lib/hadoop-hdfs/cache/training/mapred/local]
13/12/11 17:33:17 INFO streaming.StreamJob: Running job: job_201312111650_0004
13/12/11 17:33:17 INFO streaming.StreamJob: To kill this job, run:
13/12/11 17:33:17 INFO streaming.StreamJob: UNDEF/bin/hadoop job  -Dmapred.job.tracker= -kill job_201312111650_0004
13/12/11 17:33:17 INFO streaming.StreamJob: Tracking URL:
13/12/11 17:33:18 INFO streaming.StreamJob:  map 0%  reduce 0%
13/12/11 17:33:30 INFO streaming.StreamJob:  map 12%  reduce 0%
13/12/11 17:33:33 INFO streaming.StreamJob:  map 19%  reduce 0%
13/12/11 17:33:36 INFO streaming.StreamJob:  map 26%  reduce 0%
13/12/11 17:33:40 INFO streaming.StreamJob:  map 32%  reduce 0%
13/12/11 17:33:43 INFO streaming.StreamJob:  map 40%  reduce 0%
13/12/11 17:33:46 INFO streaming.StreamJob:  map 47%  reduce 0%
13/12/11 17:33:49 INFO streaming.StreamJob:  map 50%  reduce 0%
13/12/11 17:34:01 INFO streaming.StreamJob:  map 75%  reduce 0%
13/12/11 17:34:02 INFO streaming.StreamJob:  map 81%  reduce 17%
13/12/11 17:34:05 INFO streaming.StreamJob:  map 88%  reduce 17%
13/12/11 17:34:08 INFO streaming.StreamJob:  map 95%  reduce 25%
13/12/11 17:34:11 INFO streaming.StreamJob:  map 100%  reduce 25%
13/12/11 17:34:17 INFO streaming.StreamJob:  map 100%  reduce 69%
13/12/11 17:34:20 INFO streaming.StreamJob:  map 100%  reduce 75%

hadoop fs -cat outdata1/part-00000
Baby      57491808.44
Books      57450757.91

get data out from HDFS to local
code  data
[training@localhost udacity_training]$ mkdir outdata
[training@localhost udacity_training]$ cd outdata/
[training@localhost outdata]$ hadoop fs -get outdata2a/part-00000
[training@localhost outdata]$ ls

For quick tests, make some sample data (I just copied 20 lines from ~/udacity_training/data/purchases.txt). Save it as sampleData.txt in your code directory.

head -40 purchase.txt > sample.txt

Then in a terminal, in the code directory, you can run
./mapper.py <sampleData.txt >mappedData.txt

and then
./reducer.py <mappedData.txt

for me it's more like this:
python ./mapper_f3a.py <../data/sample.txt >../data/mappedData.txt
python ./reducer_f3a.py <../data/mappedData.txt
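
For readers who have not seen the course scripts, here is a hedged sketch of what a streaming mapper/reducer pair for the per-category totals shown earlier might look like. It is not the course's code. It assumes each line of purchases.txt has six tab-separated fields (date, time, store, item category, cost, payment method), and the function names are hypothetical. In real streaming scripts the two functions would live in separate files, read sys.stdin, and print their output; here they take iterables so the whole pipeline can be tried locally, with sorted() standing in for Hadoop's shuffle.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'category<TAB>cost' for every well-formed input line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 6:
            continue  # skip malformed records instead of crashing the job
        _date, _time, _store, item, cost, _payment = fields
        yield f"{item}\t{cost}"


def reduce_lines(sorted_lines):
    """Reducer: sum costs per key; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(float(value) for _, value in group)
        yield f"{key}\t{total:.2f}"


# Local dry run on made-up records.
sample = [
    "2012-01-01\t09:00\tSan Jose\tBaby\t10.50\tVisa",
    "2012-01-01\t09:01\tAustin\tBooks\t5.25\tCash",
    "2012-01-02\t10:00\tReno\tBaby\t4.50\tVisa",
    "a malformed line",
]
mapped = sorted(map_lines(sample))  # stands in for the shuffle/sort phase
result = list(reduce_lines(mapped))  # ['Baby\t15.00', 'Books\t5.25']
```

Testing on a small sample this way before submitting to the cluster is exactly what the shell pipeline above does.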

OP | 小K posted on 2013-12-12 13:23:14

?߷????? udacity Hadoop with python ??debug

Final ??ҵ?????һ??
chained MR

python mapper_f23.py <../outdata2/part-00000 >toreducer2; python reducer_f23.py <toreducer2



hs mapper_f23.py reducer_f23.py data/part-00000  outdataF2outc
packageJobJar: [mapper_f23.py, reducer_f23.py, /tmp/hadoop-training/hadoop-unjar3723990990451201080/] [] /tmp/streamjob2507367586069302810.jar tmpDir=null
13/12/12 00:20:04 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/12/12 00:20:05 WARN snappy.LoadSnappy: Snappy native library is available
13/12/12 00:20:05 INFO snappy.LoadSnappy: Snappy native library loaded
13/12/12 00:20:05 INFO mapred.FileInputFormat: Total input paths to process : 1
13/12/12 00:20:05 INFO streaming.StreamJob: getLocalDirs(): [/var/lib/hadoop-hdfs/cache/training/mapred/local]
13/12/12 00:20:05 INFO streaming.StreamJob: Running job: job_201312111650_0036
13/12/12 00:20:05 INFO streaming.StreamJob: To kill this job, run:
13/12/12 00:20:05 INFO streaming.StreamJob: UNDEF/bin/hadoop job  -Dmapred.job.tracker= -kill job_201312111650_0036
13/12/12 00:20:05 INFO streaming.StreamJob: Tracking URL:
13/12/12 00:20:06 INFO streaming.StreamJob:  map 0%  reduce 0%
13/12/12 00:20:11 INFO streaming.StreamJob:  map 100%  reduce 0%
13/12/12 00:20:36 INFO streaming.StreamJob:  map 100%  reduce 100%
13/12/12 00:20:36 INFO streaming.StreamJob: To kill this job, run:
13/12/12 00:20:36 INFO streaming.StreamJob: UNDEF/bin/hadoop job  -Dmapred.job.tracker= -kill job_201312111650_0036
13/12/12 00:20:36 INFO streaming.StreamJob: Tracking URL:
13/12/12 00:20:36 ERROR streaming.StreamJob: Job not successful. Error: NA
13/12/12 00:20:36 INFO streaming.StreamJob: killJob...
Streaming Command Failed!

OP | 小K posted on 2013-12-12 13:28:44

mapper1: just output the filename and 1
reducer1: simple word count reducer

--> output of step 1 MR is (filename, count)

mapper2: read output of step 1, output (_dummy_, filename, count)
reducer2: get all the keys called _dummy_, get the max of count, output


    if thisCount > maxcnt:
        maxcnt = thisCount
        maxfile = thisKey
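
A fuller version of the reducer2 fragment above might look like the sketch below. It assumes mapper2 emits three tab-separated fields per line (_dummy_, filename, count). The function name is hypothetical, and the variable names follow the fragment. A real streaming reducer would iterate over sys.stdin and print the result.

```python
def max_count(lines):
    """Return (filename, count) for the largest count seen, or (None, 0)."""
    maxfile, maxcnt = None, 0
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # ignore malformed lines
        _dummy, thisKey, rawCount = parts
        thisCount = int(rawCount)
        if maxfile is None or thisCount > maxcnt:
            maxcnt = thisCount
            maxfile = thisKey
    return maxfile, maxcnt


# Made-up output of the first MapReduce step, after mapper2.
step1_output = [
    "_dummy_\ta.txt\t3",
    "_dummy_\tb.txt\t7",
    "_dummy_\tc.txt\t5",
]
best = max_count(step1_output)  # ('b.txt', 7)
```

Because every record shares the single key _dummy_, all of them reach one reducer, which is what makes a global maximum possible in the second job.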

???޷?????Ϊʲôlocal run works but hs fail ??

13/12/12 00:20:36 ERROR streaming.StreamJob: Job not successful. Error: NA
13/12/12 00:20:36 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
OP | 小K posted on 2013-12-12 13:40:27
holy cow.........

#!/usr/bin/env python


error msg not helpful at all
http://stackoverflow.com/questio ... example-not-working
OP | 小K posted on 2013-12-28 04:29:45
??????Udacity ?????ŵ?java??????????ǽ??ķdz??õ?

??ʱ????????Berkeley 61B, ?Լ???Head first Java
http://www.cs.berkeley.edu/~jrs/61b/
http://www.youtube.com/watch?v=Q ... 2A1049C&index=1

Joyce????share???ҿγ??????Ѿ??????ˣ????ڲ?ѧ?????Dz?????. 围观我们@1point 3 acres
nibuxing posted on 2013-12-28 04:58:30

??Ҳ???ڸ?CS61B???????ſεĻ??Ҿ???Head first??ʱ???ÿ?????һ??ʼ???ڻ????½?һ??java??????????ͦ?õġ?
OP | 小K posted on 2013-12-28 05:00:37
nibuxing posted on 2013-12-27 15:58
??Ҳ???ڸ?CS61B???????ſεĻ??Ҿ???Head first??ʱ???ÿ?????һ??ʼ???ڻ????½?һ??java??????????ͦ?? ...

head first????εĽ̲ģ??Ҵ?ǰ????????Ϊ??ʱû???ֱ?̣????üDz???
nibuxing posted on 2013-12-28 05:22:35
小K posted on 2013-12-28 05:00
head first????εĽ̲ģ??Ҵ?ǰ????????Ϊ??ʱû???ֱ?̣????üDz???


?ڶ????????ǣ??????????ĸ??£????㿪ʼ??ʵϰ?ˣ?Ҳ???????????Լ??ļ??ܣ?רҵ??IEOR???????Կڣ??????data scientist????ʵϰ?͹???Ӱ?????
OP | 小K posted on 2013-12-28 05:27:45
nibuxing posted on 2013-12-27 16:22
python???棬?????ڻ?????﷨??һЩ?㷨??Ҳ????????CS101??CS215??????֪???Dz?????Ϊ???ϵò?????ϸ?? ...
????ˮƽ??????? :P. 围观我们@1point 3 acres

scrape data û??̰?
check "visualize this", python beautiful soup ?÷?
i just googled some "how to" posts ????access??API??ʲô??ò?ƻ?û????
OP | 小K posted on 2013-12-28 05:28:27
IEOR?? data scientist ͦ?Կڵ?
nibuxing posted on 2013-12-28 05:29:38
小K posted on 2013-12-28 05:27
????ˮƽ??????? :P

scrape data û??̰?

OP | 小K posted on 2013-12-28 05:45:21
nibuxing posted on 2013-12-27 16:29

?ҵ?python??ȷҲ????ѧ????ô???ſ? :D


(exercise on py beautifulsoup and regex)
LIIIIIIING posted on 2013-12-28 11:46:22
EroicaCMCS posted on 2013-12-28 12:33:29
Last edited by EroicaCMCS on 2013-12-28 12:39
小K posted on 2013-12-28 05:45
?ҵ?python??ȷҲ????ѧ????ô???ſ? :D

K?㣬??̼???ץ???ݵ????⣺. 一亩-三分-地,独家发布

1 ??ץ??ҳһ????py httplib/urllib2 + regex,û?ù?beautifulsoup????????beautifulsoup?ĺô??Dz??ǽ????ǿ???????һ???нṹ?Ķ??󣿻??????????ܺõ?feature? 来源一亩.三分地论坛.

2 ????һ????վ(????˵ץȡij?????????ҳ??)????ֻ???Լ??????ײ??post????Ĺ???(????˵url????page=1,page=2,etc)??????˵?б?ĸ??õķ?????


BTW, thanks for sharing, D3 looks good
OP | 小K posted on 2013-12-28 12:50:33
ͬ????·??. from: 1point3acres.com/bbs
?Ҷ?scrape web data??ʵû??̫?ྭ??
visualize this???????????????????վץȡ??ȥ????ÿ?????£??????????ץ???Ӧ???Ǹ?һ?л???????һ????????״??????. visit 1point3acres.com for more.

coursera??????get twitter data, get location, time etc and do some analysis

¥??˵???໥?ĺؿ???????Ҫȡ??ij??regex pattern??¥???idֵȻ??randomize

??????˵??????Щ???Լ??????õ????ҳ???????ȥijweb app?ռ???????????ȡ?????????ݽ??з??????߶?д???????ҳ??õ?Ч??????rtm, toggl gmail gcal, douban????Щ?????ֳ?api ?Ƚ?????Ū???롣. 牛人云集,一亩三分地

????Ŀǰ?ij̶ȣ?˵?????????????ʲô??????= =


1 ?鿴ȫ??????
EroicaCMCS posted on 2013-12-28 13:41:02
Last edited by EroicaCMCS on 2013-12-28 13:47
小K posted on 2013-12-28 12:50
?Ҷ?scrape web data??ʵû??̫?ྭ??
-google 1point3acres????ȡ??????Ҫץȡʲô???ݣ?

???ץ????ѧУ??ѡ?????? ????????????(sina??opta)

