Machine Learning Study online at quizlet.com/_49727f
1. Actor-Critic Algorithm:
   1. Produce action, a, for the current state, s
   2. Observe the next state, s', and the reward, r
   3. Update the utility of state s (critic)
   4. Update the probability of the action

2. Actor-Critic Methods: A meta-architecture that splits learning into two elements:
   - Actor: a function that selects actions from states
   - Critic: a function that provides a value estimate for a state and action

3. Actor-Only Methods:
   i) Search in policy space
      - Run n policies for M episodes
      - Keep the best one
      - Evolve
   ii) Ignore state-action value information
   iii) Potentially less efficient

4. Addition Law of Probability: $P(A \cup B) = P(A) + P(B) - P(A \land B)$
   For example, picking an ace or a diamond:
   $P(A) = \frac{4}{52}$, $P(B) = \frac{13}{52}$, $P(A \land B) = \frac{1}{52}$
   Therefore: $\frac{4 + 13 - 1}{52} = \frac{16}{52} = \frac{4}{13}$

5. Advantages of Temporal Difference Learning:
   i) Does not need a model of the environment
   ii) Bootstraps - learns a guess from a guess
   iii) Works in an on-line fashion - no need to wait for a terminal state; updates every time step

6. Auto-encoder: Trained with a standard training algorithm and learns to map the input back onto itself.

7. Bayes Rule Formula: Given A:
   $P(A|X) = \frac{P(X|A)P(A)}{P(X)}$
   $P(X) = P(X,A) + P(X,B)$
   $P(X,A) = P(A)P(X|A)$
   $P(X,B) = P(B)P(X|B)$
   For a cluster a given the data point $x_i$ it is:
   $p(a|x_i) = \frac{p(x_i|a)p(a)}{p(x_i|a)p(a) + p(x_i|b)p(b)}$

8. Belief State: Because the state, s, cannot be accessed directly, a belief state is introduced. It is essentially a probability distribution over states.

9. Belief Updates: The estimated belief state can be updated using the probabilities in the observation and transfer functions. Point-based value iteration.

10. Bootstrap:
    i) A good method for small datasets
    ii) Creates a training set by sampling with replacement
    iii) Samples a dataset of N instances N times to create a new training set of N instances - test on all instances

11. Bootstrap Properties:
    i) Pessimistic, due to the high probability of single instances not making it into the training set
    ii) Combine with the re-substitution error for a realistic error estimate

12. Calculating Variance: $\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$

13. Classification Using SOMs:
    i) Map training data classes to winning nodes
    ii) Find the winning node for new test data
    iii) Classify according to the node's majority class from the training data

14. Clustering: The idea of grouping patterns so that their vectors are similar to one another within the same region. All methods require a way to determine similarity.

15. Constructing a Decision Tree: The dataset is split recursively into subsets based on a single input variable, until:
    - All subsets have the same class
    - Splitting no longer improves prediction
    - Subsets are too small
    Top-down induction of decision trees; greedy.

16. Constructivist Approach: Doesn't require access to the state space or transition functions.

17. Cost Function for a Data Point: $e_i = y_i - (mx_i + c)$

18. Cost Function over a Data Set: $e = \sum_{i=1}^{n} (y_i - (mx_i + c))^2$
    (See the sketch after item 21.)

19. Covariance: Measures the strength of the linear relationship between two variables. It is given by: $E[(x - \mu_x)(y - \mu_y)]$

20. Critic-Only Methods:
    i) Value function estimation
    ii) Trivial action selection - TD-learning, Q-learning

21. Decision Tree:
    i) A set of ordering rules for classifying data
    ii) Each node addresses an input variable
    iii) Leaves assign labels or values to the data
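A minimal Python sketch of the line-fitting cost in items 17 and 18; the slope, intercept, and data values below are hypothetical and only illustrate how the per-point error feeds the summed squared cost.

```python
def point_error(x, y, m, c):
    """Item 17: e_i = y_i - (m*x_i + c)."""
    return y - (m * x + c)

def total_cost(xs, ys, m, c):
    """Item 18: e = sum of squared per-point errors over the data set."""
    return sum(point_error(x, y, m, c) ** 2 for x, y in zip(xs, ys))

# Hypothetical data lying roughly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]
print(total_cost(xs, ys, m=2.0, c=1.0))  # 0.1  - the line fits well
print(total_cost(xs, ys, m=0.0, c=0.0))  # 82.9 - the line fits poorly
```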
22. Decision Tree Advantages:
    i) Simple and intuitive
    ii) Good computational performance
    iii) Possible to validate a model using statistical tests

23. Decision Tree Limitations:
    i) Greedy algorithms can get stuck in local optima
    ii) The simple decision structure does not represent all problems effectively - XOR/parity - producing large trees
    iii) Can produce overly complex trees that do not generalise well - overfitting

24. Deep Learning:
    i) Using a neural network with several layers of nodes between the input and output
    ii) The series of layers between the input and output performs feature detection and processing
    iii) Models the human visual system

25. Dimensionality Reduction:
    i) Inputs are of dimension i - the number of input features
    ii) Outputs are of dimension 2 - can be more
    A SOM is an example of vector quantization - 2 < i

26. Discounted Future Reward: All sequence information is stored. All sequences can be reconstructed and the discounted reward calculated for all bottom-level o/a/r tuples.

27. Distance Equations for Clustering:
    Euclidean distance: $d_{ij} = \sqrt{(x_i - c_j)^T (x_i - c_j)}$
    Mahalanobis distance: $(x - y)^T \Sigma^{-1} (x - y)$
    - Accounts for the fact that the variances in each direction are different
    - Accounts for covariance between variables
    - Reduces to the Euclidean distance for uncorrelated variables with unit variance

28. Eligibility Traces:
    i) A record of the most recently experienced states, actions, and rewards
       - A sequence of tuples <s,a,r> in chronological order
       - Updated on each step of the algorithm
    ii) An improved mechanism for temporal credit assignment
       - Spreads new information about state-action values through the value function

29. Error Rate: The fraction of misclassified data points, $1 - p_m$

30. Evaluating a Decision Tree:
    i) Error rate - proportion of errors across all instances
    ii) Resubstitution error - error rate on the training data - tends to be optimistic due to overfitting
    iii) Test set error - performance on a previously unseen test data set
    iv) Holdout - reserve some data, often 20%, for testing - N-fold cross validation

31. Function Approximation: Approximates a function from a subset of inputs and outputs.

32. Gini Impurity: $Gini(S) = 1 - \sum_{c=1}^{l} p_c^2$ (see the sketch after item 37)

33. Greedy Layer-wise Training: Steps for training:
    1) Train the first layer unsupervised, without labels
    2) Use abundant unlabeled data which is not part of the training set
    3) Freeze the first layer's parameters and start training the second layer, using the output of the first layer as unsupervised input to the second layer
    4) Repeat for the remaining layers
    5) Unfreeze all weights and fine-tune the network

34. Greedy Layer-wise Training Advantages:
    i) Avoids many problems of training deep networks in a supervised fashion
    ii) Each layer gets full learning focus, since it is the top layer during its training
    iii) Takes advantage of unlabelled data
    iv) Helps with the problem of ineffective early-layer learning
    v) Helps with deep-net local minima

35. Hidden State Identification: The match between STM and LTM provides an estimate of how close traces in LTM are to STM.

36. Hierarchical Clustering:
    i) Breaks down the dataset into a series of nested clusters
    ii) A single cluster at the top containing all the data
    iii) N clusters at the bottom, one for each data point
    iv) Can be displayed as a dendrogram

37. Hierarchical Encoding of Sequence Data:
    i) Encode all observation-action-reward tuples in a SOM
    ii) Record STM as decaying activation of augmented SOM nodes
    iii) Use another SOM to learn the sequence information in the node activations
    iv) Auto-encoding
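A minimal Python sketch of the Gini impurity in item 32; the class labels below are hypothetical and simply show that a pure subset scores 0 while an evenly mixed two-class subset scores 0.5.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini(S) = 1 - sum over classes c of p_c^2, where p_c is the fraction of class c in S."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

# Hypothetical subsets
print(gini_impurity(["a", "a", "a", "a"]))  # 0.0   - pure subset
print(gini_impurity(["a", "a", "b", "b"]))  # 0.5   - evenly mixed, two classes
print(gini_impurity(["a", "a", "a", "b"]))  # 0.375
```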
38. Improvement in Decision Trees: The improvement $I(S_{i1}, S_{i2})$ is the difference in quality between the original subset $S_i$ and the joint quality of the two new subsets $S_{i1}, S_{i2}$:
    $I(S_{i1}, S_{i2}) = Q(S_i) - \sum_{n=1}^{2} \frac{|S_{in}|}{|S_i|} Q(S_{in})$
    - Gini gain - improvement based on Gini impurity
    - Information gain - improvement based on entropy

39. Instance-Based Methods:
    i) Don't construct a theory
    ii) Do classification/regression from the raw data each time
    iii) Complexity is limited by the size of the training set
    iv) Nearest sequence memory - match the current episode against past episodes

40. Instance-Based Solutions:
    i) Nearest sequence memory - keep the current episode in STM - remember the last N episodes
    ii) Can learn very quickly - potentially one-shot
    iii) Fixed memory requirements - can forget solutions

41. Interpreting Covariance:
    If cov(x,y) > 0, then x and y are positively correlated
    If cov(x,y) < 0, then x and y are inversely correlated
    If cov(x,y) = 0, then x and y are uncorrelated

42. Issues with Stopping Criteria: Deciding on a stopping criterion:
    a) Too strict and it will produce small, underfitted trees - local minima
    b) Too loose and it will produce large, overfitted trees

43. Kernel Trick:
    i) Transforming some datasets can make them linearly separable
    ii) This can be done more rigorously using a mapping function (kernel) which transforms one given space into another
    iii) A transformation of functions to give easier-to-solve problems

44. K-fold Cross Validation Variants:
    K is commonly set to 5 or 10, based on experience
    - Stratification - each subset has the same proportion of data from each class
    - Leave-one-out CV - fold size 1 - best use of the data and the most expensive

45. K-means Clustering: K is the number of clusters and is set by hand. (See the sketch after item 50.)
    1. Set a suitable number of clusters, K
    2. Randomly assign K data points to be the centroids of the K clusters
    3. Loop:
       3.1. Assign each data point to its closest cluster centre
       3.2. Re-compute the cluster centres as the mean of the assigned points
       3.3. Terminate the loop if no assignment changes

46. K-Means Formula: Selecting which of two clusters to allocate vector $x_i$ to. Calculate the point's distance to the mean of each cluster:
    $d_{i1} = \sqrt{(x_i - c_1)^T (x_i - c_1)}$
    $d_{i2} = \sqrt{(x_i - c_2)^T (x_i - c_2)}$
    If $d_{i1} < d_{i2}$, allocate $x_i$ to $c_1$; if $d_{i1} > d_{i2}$, allocate $x_i$ to $c_2$

47. K-Nearest Neighbour Learning:
    i) Select the K points in the training set that are nearest to the data point you need to classify
    - Requires a distance metric for deciding how close points are
    - Euclidean distance: $d(p,q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$

48. KNN Sample Space:
    i) Distances come from the distance metric
    ii) The classification is the majority class of the nearest neighbours
    iii) Neighbours can be weighted by their proximity

49. Kohonen Self-Organizing Maps:
    i) Preserve similarity between examples - examples that are close in n-dimensional space will be close in m-dimensional space
    Structure:
    - $X_i$ - input vector features
    - $N_j$ - nodes / map
    - $W_{i,j}$ - the input vector is fully connected to each node
    Node weights are similar to K-means centroids

50. Limitations of Back-propagation:
    i) Multiple hidden layers
    ii) Gets stuck in local optima - start weights from random positions
    iii) Slow convergence to the optimum - a large training set is needed
    iv) Only uses labelled data
    v) Error attenuation with deep nets
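A minimal NumPy sketch of the K-means loop in item 45, using the Euclidean distance from item 46; the two-dimensional data, the random seed, and the iteration cap are hypothetical choices for illustration only.

```python
import numpy as np

def k_means(data, k, max_iters=100, seed=0):
    """Minimal K-means loop following item 45: initialise, assign, re-compute, stop when stable."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick K data points as the initial centroids.
    centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    assignments = None
    for _ in range(max_iters):
        # Step 3.1: assign each point to its closest centre (Euclidean distance, item 46).
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        # Step 3.3: terminate if no assignment changed.
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # Step 3.2: re-compute each centre as the mean of its assigned points.
        for j in range(k):
            members = data[assignments == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, assignments

# Hypothetical 2-D data with two obvious groups
data = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                 [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
centroids, labels = k_means(data, k=2)
print(centroids)  # roughly [0.1, 0.1] and [5.0, 5.0], in some order
print(labels)
```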
51. Linear Decision Boundary: Partitions the input space into discrete regions - each region is associated with a class label
    i) In 2D it is a line
    ii) In 3D it is a plane
    iii) In n dimensions it is a hyperplane
    $g(x) = w^T x + b$, where w is the weight vector and b is a constant
    For boundary classification:
    if $g(x) > 0$ then class 1
    if $g(x) < 0$ then class 0

52. A Markov Decision Problem consists of:
    S - a set of states
    A - a set of actions
    T - a transition function
    R - a reward function

53. Model-Based Algorithms:
    i) Explicitly estimate T
    ii) Use T to improve the policy

54. Model-Free Methods:
    i) Approximate value functions from experience
    ii) Do not model the transition function explicitly

55. Modelling a 1D Gaussian: Given a model a with mean $\mu_a$ and variance $\sigma_a^2$, we can compute the probability of a data point under the Gaussian model. The probability of data point $x_i$ for model a is:
    $P(x_i | a) = \frac{1}{\sqrt{2 \pi \sigma_a^2}} \exp\left(-\frac{(x_i - \mu_a)^2}{2 \sigma_a^2}\right)$
    (See the sketch after item 57.)

56. Models:
    i) Anything that can be used to predict the results of actions - simulated experience
    ii) Distribution models - provide probability distributions over successor states
    iii) Sample models - return a single successor state drawn from the appropriate probability distribution

57. Monte-Carlo Methods: Assume we don't know the transition function, T - model free
    Evaluate the policy based on experience:
    - Obtain episodes by following policy $\pi$
    - Start at state $s$ and take actions until the goal is reached
    Use the average discounted reward
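A minimal Python sketch combining the 1D Gaussian likelihood of item 55 with the two-cluster Bayes posterior of item 7; the means, variances, priors, and test point are hypothetical values chosen for illustration.

```python
import math

def gaussian_pdf(x, mu, var):
    """Item 55: P(x | a) for a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior_a(x, mu_a, var_a, p_a, mu_b, var_b, p_b):
    """Item 7: p(a | x) = p(x|a)p(a) / (p(x|a)p(a) + p(x|b)p(b)) for two clusters a and b."""
    num_a = gaussian_pdf(x, mu_a, var_a) * p_a
    num_b = gaussian_pdf(x, mu_b, var_b) * p_b
    return num_a / (num_a + num_b)

# Hypothetical clusters: a centred at 0, b centred at 4, equal priors
print(posterior_a(0.5, mu_a=0.0, var_a=1.0, p_a=0.5, mu_b=4.0, var_b=1.0, p_b=0.5))
# ~0.998: the point is far more probable under cluster a
```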