# Machine Learning & Stats:

• Algorithms:
• Meta:
• cheatsheets:
• https://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
• sci-kit learn flowchart: http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
• SAS more readable / general: https://blogs.sas.com/content/subconsciousmusings/2017/04/12/machine-learning-algorithm-use/
• multi-class classification:
• just use one-vs-all (aka one-vs-rest)
• train a model for each class where the negative labels are all the other classes
• regression:
• linear vs logistic: linear is continuous, logistic is categorical
• finding coefficients for standard linear equation: y = a + bx
• for logistic, we use logistic function (s-shaped curve) to convert output into 0 - 1 value, then we can apply a rule to get a binary label
• like >0.5 then yes else no
• decision trees:
• start with most important feature, split based on that, then continue splitting train set on successive features (where each split has the same value for that feature) until get to leaf nodes (which are the output, e.g., the yes/no decision)
• just like any binary tree
• https://medium.com/@chiragsehra42/decision-trees-explained-easily-28f23241248
• prone to overfitting (high variance), so some ways to account for that are:
• bagging:
• similar to bootstrapping, construct multiple tree models for different samples of input, then for each new data point, average all the predictions
• random forests:
• instead of selecting optimal split (i.e., instead of selecting splits that best result in even "sides"), introduce randomness
• boosting (think gradient boosted decision trees / GBDT):
• combine weak learners
• each successive learner attempts to correct for the errors of the previous one
• place more weight on data points where the prev learner went wrong
• collaborative filtering:
• based on assumption that people like things similar to other things they like, and things that are liked by other people with similar taste
• types of filtering:
• A user-item filtering takes a particular user, finds users who are similar to that user based on similarity of ratings, and recommend items that those similar users liked. In contrast, item-item filtering will take an item, find users who liked that item, and find other items that those users or similar users also liked. It takes items and outputs other items as recommendations.
https://towardsdatascience.com/various-implementations-of-collaborative-filtering-100385c6dfe0
• User-Item Collaborative Filtering: “Users who are similar to you also liked …”
• Item-Item Collaborative Filtering: “Users who liked this item also liked …”
• SVM or support vector machines:
• basically trying to find the hyperplane (non-linear line) that best separates the input into separate classes. do this by maximizing the margin, i.e., the distance from the hyperplace to the closest data points (the support vectors)
• k-means classification:
• not to be confused w k nearest neighbors, which is quite different https://www.wikiwand.com/en/K-nearest_neighbors_algorithm
• algorithm:
• iterative, start with k randomly assigned centroid points, then iterate by:
• for each point, assign to the closest centroid
• for each centroid, update/move to the middle (mean) of the points assigned to it
• stop iterating when no more point assignments or centroid movements need to be made
• Feature engineering:
• cross-validation
• 1-hot encoding
• Performance metrics / evaluation:
• confusion matrix:
• actual vs predicted https://rasbt.github.io/mlxtend/user_guide/evaluate/confusion_matrix_files/confusion_matrix_1.png
• Predicted
• P N
• P: TP FN
• N: FP TN
• recall or sensitivity = TP / (TP + FN)
or true positives
• of all actual positive labels, how many did we successfully predict?
• precision / true positive rate = TP / (TP + FP)
• of all predicted positive labels, how many were actually positive?
• false positive rate = FP / (FP + TN)
• specificity or true negatives = TN / (TN + FP)
• F1 is the harmonic mean of precision and recall
• 2TP / (2TP + FP + FN)
• ROC curve is true positives (recall or sensitivity) vs false positives rate
• we want the curve to be at the top left—this would indicate 0 FPs
• sensitivity as a function of fallout (another name for FPR)
• AUC (area under [ROC] curve):
• ROC curve is graph of recall/sensitivity vs [1- ]specificity, which is the false positive rate
• in other words, graph of true positives vs false positives
• so area under the curve measures how well the positive probabilities are separated from negative
• intuitively, measures the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example
• 0.5 = random predictor
• better for skewed or imbalanced classes, where it will reward discriminative classifiers over representative (eg punishes a model that only outputs ham for problem where 90% is ham)
• also it's a measure of the entire classifier and not dependent on the specific threshold you choose
• Learning to rank:
• used for ranking a list of items, eg search engine results or FB newsfeed
• differs from traditional ML classification/regression bc prediction is on more than a single instance at a time:
• Traditional ML solves a prediction problem (classification or regression) on a single instance at a time. E.g. if you are doing spam detection on email, you will look at all the features associated with that email and classify it as spam or not. The aim of traditional ML is to come up with a class (spam or no-spam) or a single numerical score for that instance.
• LTR solves a ranking problem on a list of items. The aim of LTR is to come up with optimal ordering of those items. As such, LTR doesn't care much about the exact score that each item gets, but cares more about the relative ordering among all the items.
• 3 main approaches: pointwise, pairwise, and listwise
• RankNet, LambdaRank and LambdaMART all transform ranking into a pairwise classification or regression problem. That means you look at pairs of items at a time, come up with the optimal ordering for that pair of items, and then use it to come up with the final ranking for all the results.
• common loss function is minimizing # of inversions: cases where a pair's order should be inverted
• Normalized Discounted Cumulative Gain (NDCG) is a measure of ranking quality:
• https://www.wikiwand.com/en/Discounted_cumulative_gain
• emphasizes highly relevant documents appearing earlier in list
• Optimization:
• Literature
• Machine Learning Open Course/Reading Materials
• http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning
• http://web.stanford.edu/class/cs246/handouts.html
• http://cs229.stanford.edu/materials.html
• http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-867-machine-learning-fall-2006/index.htm
• https://work.caltech.edu/lectures.html#lectures
• Survey
• Recommender systems survey http://www.sciencedirect.com/science/article/pii/S0950705113001044
• Towards the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions http://www.stanford.edu/class/ee378b/papers/adomavicius-recsys.pdf
• A survey of collaborative filtering techniques http://dl.acm.org/citation.cfm?id=1722966
• The YouTube video recommendation system http://dl.acm.org/citation.cfm?id=1864770
• Sparse Models
• http://www.ideal.ece.utexas.edu/seminar/Deepak_slides.pdf
• Learning to Rank
• Learning to rank using gradient descent http://dl.acm.org/citation.cfm?id=1102363
• Learning to Rank for Information Retrieval http://research.microsoft.com/en-us/um/beijing/events/lr4ir-2008/proceedings-lr4ir%202008.pdf
• Learning to rank: from pairwise approach to listwise approach http://dl.acm.org/citation.cfm?id=1273513
• CLiMF: learning to maximize reciprocal rank with collaborative less-is-more filtering
• Learning to rank recommendations with the k-order statistic loss
• Latent Collaborative Retrieval
• Efficient Retrieval
• Fast top-k retrieval for model based recommendation
• Item-based top-n recommendation algorithms
• Exploration/Exploitation:
• A contextual-bandit approach to personalized news article recommendation
• Hierarchical exploration for accelerating contextual bandits
• An Empirical Evaluation of Thompson Sampling
• Document/Language Representation
• Word2Vec
• Efficient Estimation of Word Representations in Vector Space
• Distributed Representations of Words and Phrases and their Compositionality
• Linguistic Regularities in Continuous Space Word Representations
• Collaborative topic modeling for recommending scientific articles
• Representing Documents Through Their Readers
• TagSpace (EMNLP 2014)
• Distributed Representations of Sentences and Documents (ICML 2014)
• Large-Scale High-Precision Topic Modeling on Twitter (KDD 2014)
• Collaborative Filtering/Matrix Factorization:
• Matrix factorization techniques for recommender systems
• A matrix factorization technique with trust propagation for recommendation in social networks
• Probabilistic matrix factorization
• Pairwise interaction tensor factorization for personalized tag recommendation
• fLDA: matrix factorization through latent dirichlet allocation
• Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model
• Nonlinear Latent Factorization by Embedding Multiple User Interests (RecSys 2013)
• Scalable Recommendation with Poisson Factorization
• Factorization Machines
• Latent Factor Models
• Regression-based latent factor models
• Latent factor models with additive and hierarchically-smoothed user preferences
• Latent Structured Ranking
• User Modeling
• Learning Relevance from a Heterogeneous Social Network and Its Application in Online Targeting (Rajat Raina's work on our data)
• Understanding the Interaction between Interests, Conversations and Friendships in Facebook
• Turning down the noise in the blogosphere
• Style in the Long Tail: Discovering Unique Interests with Latent Variable Models in Large Scale Social E-commerce (KDD 2014)
• Evaluation:
• Large-scale validation and analysis of interleaved search evaluation
• Deep Learning
• Learning Deep Architectures for AI
• Wide & Deep Learning for Recommender Systems
• Deep Neural Networks for YouTube Recommendations
• General Machine Learning
• Learning Classifiers from Only Positive and Unlabeled Data (KDD 2008, useful for lookalike)
darkmode