All posts by Huazhi

Down-scaled Big Data Modeling

This is an idea based on rigorous statistical thinking. I need to assess how promising it is before sharing more details. Experiments are planned for both linear models and generalized linear models, so as to provide empirical evidence. I am also thinking about using certain CNN bases/features for image classification, once I better understand the underlying model and have done some coding exercises.

Paul Embrechts

Prof. Paul Embrechts is visiting HKU as a Hung Hing Ying Distinguished Visiting Professor in Science and Technology. He will give a public lecture on the Thursday after next. He is a co-author of the celebrated QRM book Quantitative Risk Management: Concepts, Techniques and Tools. For this book (now in its 2nd edition, published 2015), there is an excellent website called QRM Tutorial (and its GitHub repository), with slides and R code available.

Today I happened to attend the biweekly time series seminar organized within the department, and for the first time listened seriously to talks on the GARCH model (cf. Francq and Zakoïan (2010)), stationarity, the portmanteau test, etc. There is a great introduction to these concepts in the QRM book (Chapter 4: Financial time series; PDF slides).
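To make these concepts concrete, here is a minimal Python sketch (standard library only) that simulates a GARCH(1,1) process and computes the Ljung-Box portmanteau statistic. The function names and the parameter values (omega=0.1, alpha=0.1, beta=0.8) are my own illustrative assumptions, not taken from the seminar or the QRM book.

```python
import random


def simulate_garch11(n, omega=0.1, alpha=0.1, beta=0.8, seed=0):
    """Simulate n observations x_t = sigma_t * z_t of a GARCH(1,1) process.

    sigma_t^2 = omega + alpha * x_{t-1}^2 + beta * sigma_{t-1}^2,
    with z_t i.i.d. standard normal. The condition alpha + beta < 1
    gives a second-order stationary process with unconditional
    variance omega / (1 - alpha - beta).
    """
    if omega <= 0 or alpha < 0 or beta < 0 or alpha + beta >= 1:
        raise ValueError("need omega > 0, alpha, beta >= 0 and alpha + beta < 1")
    rng = random.Random(seed)
    sigma2 = omega / (1.0 - alpha - beta)  # start at the unconditional variance
    x = []
    for _ in range(n):
        x_t = (sigma2 ** 0.5) * rng.gauss(0.0, 1.0)
        x.append(x_t)
        sigma2 = omega + alpha * x_t ** 2 + beta * sigma2  # next conditional variance
    return x


def ljung_box(x, max_lag):
    """Ljung-Box statistic Q = n (n + 2) * sum_{k=1}^{h} rho_k^2 / (n - k)."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    c0 = sum(v * v for v in d)
    total = 0.0
    for k in range(1, max_lag + 1):
        ck = sum(d[t] * d[t - k] for t in range(k, n))
        rho_k = ck / c0  # sample autocorrelation at lag k
        total += rho_k ** 2 / (n - k)
    return n * (n + 2) * total


if __name__ == "__main__":
    returns = simulate_garch11(2000)
    squared = [r * r for r in returns]
    print("Q(returns, 10 lags):", round(ljung_box(returns, 10), 2))
    print("Q(squared, 10 lags):", round(ljung_box(squared, 10), 2))
```

Under independence, Q is approximately chi-square distributed with max_lag degrees of freedom. GARCH returns are uncorrelated but not independent, so that reference distribution is only a rough guide for the raw series. For the squared series one would expect a much larger Q, reflecting volatility clustering.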

PS: Prof. Paul Embrechts also authored another well-known and influential book, published in 1997 and titled “Modelling Extremal Events for Insurance and Finance”, which I have not read yet.

Machine Learning Resources

A list of online resources, updated from time to time:

Data Science and Machine Learning

Optimization and Computing

Deep Learning, NLP, AI, etc

Learning Analytics

Start with some general introduction, and

  • US Dept. Education: briefing

Groups that work in this area:

New trend towards using modern decision-analytic approaches:

  • Large amounts of person-click data are generated by online platforms, including both LBS and MOOC systems
  • Modern development of decision-analytic methods and tools, like matrix factorization, deep learning, sparse models, social network analysis

Finally, come up with a brief proposal with nice image(s):

Learning Analytics by Statistical and Machine Learning Techniques

  1. Person-click data structure: both behaviors and feedback
  2. Learning pathways through longitudinal and survival analysis: to measure activity and engagement (person-anchored)
  3. Dropout prediction and retention analysis through machine learning techniques
  4. Social network analysis of linked users and peer interactions
  5. Content analysis through ????, e.g. effectiveness of 6-minute micro-videos (content-anchored)
  6. Recommendation system for online quizzes
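As a small illustration of the survival-analysis item above, the sketch below computes a Kaplan-Meier estimate of learner retention. The data layout (weeks until a learner was last seen, plus a flag saying whether dropout was actually observed) is a hypothetical example of mine, not a data structure from any specific platform.

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier estimate of the survival (retention) function.

    durations: time until dropout or until the end of observation
    observed:  True if dropout was observed, False if censored
               (e.g. the learner was still active when the course ended)
    Returns a list of (time, S(time)) at each time with an observed dropout.
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        dropouts = sum(1 for d, e in zip(durations, observed) if d == t and e)
        censored = sum(1 for d, e in zip(durations, observed) if d == t and not e)
        if dropouts > 0:
            survival *= 1.0 - dropouts / at_risk  # product-limit update
            curve.append((t, survival))
        at_risk -= dropouts + censored  # these learners leave the risk set
    return curve


if __name__ == "__main__":
    # Hypothetical learners: weeks of activity and whether dropout was observed.
    weeks = [1, 2, 2, 3, 4]
    dropped = [True, True, False, True, False]
    for t, s in kaplan_meier(weeks, dropped):
        print(f"week {t}: estimated retention {s:.2f}")
```

Censored learners still count toward the risk set up to their last observed week, which is the main reason to prefer this estimator over a naive fraction of learners still active.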

References:

  1. Bienkowski, M., Feng, M. and Means, B. (2012). Enhancing Teaching and Learning Through Educational Data Mining and Learning Analytics: An Issue Brief. U.S. Office of Educational Technology, Department of Education. PDF
  2. Guo, P.J., Kim, J. and Rubin, R. (2014). How Video Production Affects Student Engagement: An Empirical Study of MOOC Videos. ACM Conference on Learning at Scale, March 2014. PDF

AWS for Teaching and Research

Three weeks ago, when I moved from HKBU to HKU, one of my first priorities was to purchase and set up a new high-performance computing server. The price quoted by DELL was as high as $70K+ in Hong Kong dollars. We then turned to MICROWARE for a quotation but are still waiting for their response. Only about one week remains before the first week of class!

This morning I started googling AWS EC2, in particular spot instances. I came across the so-called AWS Grants Program for Research and Education. See here. Applications for AWS research grants (also called AWS Cloud Credits for Research) open every 3 months, and the deadline for the next round is September 30. One may also try AWS Educate by providing information such as the institution and course website. An educator may receive about $200 in credits (or only $75 if the institution is not a member).

Besides possible credits to cover a low-cost AWS EC2 solution, we may also look into the possibility of using research grants to cover high-end AWS servers (including clusters). Here is part of a letter in which I explained this to our department admin:

Alternatively, the “cloud server” provided by Amazon Web Services is much cheaper, since one only pays monthly rental/usage fees. It has more or less equal computing power and is easier to maintain. The main difference is that such a “cloud server” is a virtual machine, rather than a physical machine like those we see in the lab. Such a virtual cloud server is more like a software service.

The good news is that some types of research grants are flexible! This is great! If it really works, we may even try a cloud cluster! Below is a tentative plan based on Amazon Elastic MapReduce (EMR):

Name: Amazon EMR with 1 master and 2 nodes
Includes:
1 Master: m3.xlarge, 4 vCPU, 15GB Memory, 2x40GB SSD, 100% Utilization
2 Nodes: c4.4xlarge, each with 16 vCPU, 30GB Memory, 2x160GB SSD, 5% Utilization (expected)
Description: Amazon EMR cluster for Big data analytics
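
The utilization figures in this plan translate into a monthly bill roughly as sketched below. The hourly rates are PLACEHOLDERS chosen only to show the arithmetic, not actual AWS prices. The 730 hours per month figure (8760 / 12) is also an assumption of this sketch.

```python
HOURS_PER_MONTH = 730  # assumption: 8760 hours per year / 12 months


def monthly_cost(count, hourly_rate, utilization):
    """Monthly cost of `count` instances billed at `hourly_rate` per hour
    while running for a fraction `utilization` of the month."""
    return count * hourly_rate * utilization * HOURS_PER_MONTH


def cluster_monthly_cost(cluster):
    """Sum the monthly cost over all instance groups in the cluster."""
    return sum(
        monthly_cost(g["count"], g["hourly_rate"], g["utilization"])
        for g in cluster
    )


# Instance groups from the plan above. The hourly rates are hypothetical
# placeholders (in USD) and must be replaced with current regional prices.
cluster = [
    {"type": "m3.xlarge", "count": 1, "hourly_rate": 0.50, "utilization": 1.00},
    {"type": "c4.4xlarge", "count": 2, "hourly_rate": 1.00, "utilization": 0.05},
]

if __name__ == "__main__":
    for g in cluster:
        cost = monthly_cost(g["count"], g["hourly_rate"], g["utilization"])
        print(f"{g['count']} x {g['type']}: {cost:.2f} per month")
    print(f"total: {cluster_monthly_cost(cluster):.2f} per month")
```

The point of the low utilization on the two nodes is that they are paid for only while jobs run, so the always-on master can dominate the bill even though it is the smaller machine.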


I chose the “Asia Pacific (Singapore)” region, as it is the fastest from Hong Kong in terms of the latency estimated by http://www.cloudping.info/.

Let’s wait and see if it can work out!

STAT3622 Data Visualization

I will teach STAT3622 “Data Visualization” in the coming 2016/17 Fall semester. This will be the first course I teach at HKU.

Some main references ranging from R to Python to D3 are selected and listed as follows:

  1. R Graphics Cookbook
    By Winston Chang, O’Reilly 2013
    Book website: http://www.cookbook-r.com/
  2. ggplot2: Elegant Graphics for Data Analysis
    By Hadley Wickham, Springer (2009, 1ed; 2016, 2ed)
    Book website: http://ggplot2.org/book/; http://hadley.nz/
  3. Learning IPython for Interactive Computing and Data Visualization
    By Cyrille Rossant, Packt Publishing (2013, 1ed; 2015, 2ed)
    Book website: http://ipython-books.github.io/minibook/
  4. Interactive Data Visualization for the Web: An Introduction to Designing with D3
    By Scott Murray, O’Reilly 2013
    Online Read: http://chimera.labs.oreilly.com/books/1230000000345
  5. Visualize This: The FlowingData Guide to Design, Visualization, and Statistics
    By Nathan Yau, Wiley 2011
    Book website: http://book.flowingdata.com/; http://as.wiley.com/WileyCDA/WileyTitle/productCd-0470944889.html
  6. Interactive Data Visualization: Foundations, Techniques, and Applications
    By Matthew O. Ward, Georges Grinstein, Daniel Keim, CRC Press (2015, 2ed)
    Book website: http://www.idvbook.com/teaching-aid/

Other thoughts: What is certain is that we will use RStudio Server for weekly teaching. Some IT-like but useful skills such as GitHub and AWS EC2 could be added to either lectures or tutorials. Moodle is the LMS preferred by the university, and we shall also design a course web page under statsoft/teaching.