**Statistics is the use of math to do technical analysis of data**. Instead of guesstimating, we can use data to get concrete, factual information.

The most widely used statistical concept in data science is *statistical features*. It covers important measurements such as bias, variance, mean, median and percentiles, all of which are easy to compute in code.
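As a minimal sketch of how code-friendly these measurements are, the snippet below computes several of them with Python's standard `statistics` module. The sample values are made up for illustration, and bias is left out because it needs a model and a true value to compare against.

```python
import statistics

# Hypothetical sample: six similar values and one outlier (90).
data = [12, 15, 11, 14, 13, 15, 90]

mean = statistics.mean(data)           # arithmetic average, pulled up by the outlier
median = statistics.median(data)       # middle value of the sorted data
variance = statistics.pvariance(data)  # population variance: spread around the mean

# Cut points that split the sorted data into four parts:
# the 25th, 50th and 75th percentiles.
q1, q2, q3 = statistics.quantiles(data, n=4)

print(f"mean={mean:.2f} median={median} variance={variance:.2f}")
print(f"percentiles 25/50/75: {q1}, {q2}, {q3}")
```

Note how the single outlier drags the mean well above the median.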

A typical data set diagram (box plot) carries a lot of information.

- If the box is short, the data points are similar; if it is tall, the data has a wide range and a lot of variance.
- The median (the line in the middle of the box) gives a more robust reading of the center than the mean, because outlier values barely affect it.
- The lower regions of the box plot represent lower percentiles (such as the 25th percentile), and the higher regions represent higher ones.
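The numbers a box plot draws can be sketched in a few lines. The helper `box_summary` and both sample datasets are hypothetical; the interquartile range (Q3 minus Q1) stands in for the height of the box.

```python
import statistics

def box_summary(values):
    """Return the numbers a box plot draws, plus the interquartile range."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,          # 25th percentile: bottom edge of the box
        "median": med,     # line in the middle of the box
        "q3": q3,          # 75th percentile: top edge of the box
        "max": max(values),
        "iqr": q3 - q1,    # height of the box
    }

tight = [48, 49, 50, 50, 51, 52]    # similar values: short box
spread = [10, 30, 50, 50, 70, 90]   # wide range: tall box

for name, values in (("tight", tight), ("spread", spread)):
    print(name, box_summary(values))
```

Both samples share the same median, so only the height of the box tells them apart.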

In data science, probability is the chance that something will happen, expressed as a number between 0 and 1 (or as a percentage). A probability of 0 means we are certain the event will not occur, while 1 means we are certain it will happen.
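A small sketch of the 0-to-1 scale, assuming a fair six-sided die where every face is equally likely. The `probability` helper is hypothetical, not a library function.

```python
from fractions import Fraction

def probability(favorable, total):
    """Favorable outcomes divided by equally likely total outcomes."""
    if total <= 0 or not 0 <= favorable <= total:
        raise ValueError("need total > 0 and 0 <= favorable <= total")
    return Fraction(favorable, total)

die_faces = [1, 2, 3, 4, 5, 6]

p_even = probability(sum(1 for f in die_faces if f % 2 == 0), len(die_faces))
p_seven = probability(0, len(die_faces))                # impossible event
p_any = probability(len(die_faces), len(die_faces))     # certain event

print(p_even, p_seven, p_any)
```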

The common probability distributions are:

- **Uniform Distribution:** A simple on-or-off distribution: it takes a single constant value inside a given range and is 0 everywhere outside it.
- **Normal (or Gaussian) Distribution:** This distribution is symmetric: it has the same standard deviation in all directions around the mean. It tells us the average value of the dataset along with the spread of the data.
- **Poisson Distribution:** This is similar to the Normal distribution but also has skewness, so the spread of the data differs in different directions.
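The three distributions can be sketched with their standard textbook formulas, using only the `math` module. The function names are chosen here for illustration. Note that the Poisson distribution is defined on whole-number counts, so it has a probability mass function rather than a density.

```python
import math

def uniform_pdf(x, low, high):
    """Constant density inside [low, high], 0 everywhere else."""
    return 1.0 / (high - low) if low <= x <= high else 0.0

def normal_pdf(x, mu, sigma):
    """Bell curve centered on mu with standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def poisson_pmf(k, lam):
    """Probability of exactly k events when lam events are expected."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Uniform: on inside the range, off outside it.
print(uniform_pdf(0.5, 0, 2), uniform_pdf(5, 0, 2))

# Normal: the same distance left or right of the mean gives the same density.
print(normal_pdf(-1, 0, 1), normal_pdf(1, 0, 1))

# Poisson: the same distance either side of the mean gives different values (skew).
print(poisson_pmf(1, 3.0), poisson_pmf(5, 3.0))
```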

Reducing the number of dimensions (or feature variables) in a dataset is known as dimensionality reduction.

If a 3D cube contains 1000 data points, we can reduce its dimensionality by projecting the 3D data onto a 2D plane. We can also remove *feature variables* to reduce the data volume. This is generally done with features that have a low correlation with the output we want to predict and is called **feature pruning.**
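A rough sketch of both ideas follows. It assumes that "low correlation" means a low Pearson correlation with a target column, and the 0.3 cut-off, the feature names and the numbers are all invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def prune_features(features, target, threshold=0.3):
    """Keep only features whose |correlation| with the target meets the threshold."""
    return {
        name: column
        for name, column in features.items()
        if abs(pearson(column, target)) >= threshold
    }

# Viewing 3D data as 2D: drop the z coordinate of every point.
points_3d = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
points_2d = [(x, y) for x, y, _z in points_3d]

# Feature pruning on a hypothetical housing table.
features = {
    "size": [50, 60, 80, 100, 120],
    "rooms": [2, 2, 3, 4, 5],
    "door_number": [7, 3, 9, 3, 7],   # barely related to price
}
price = [150, 180, 240, 300, 360]

kept = prune_features(features, price)
print(points_2d)
print(sorted(kept))
```

Dropping a coordinate is the crudest possible projection; it only works well when that dimension carries little information.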

Sometimes we want to compare two datasets, or classify data that has an uneven number of samples for the different classes. By taking fewer samples from the larger class (*undersampling*), one can even out a dataset.

*Oversampling* instead copies examples of the smaller class until it has the same number of examples as the other class. The copies are made so that the distribution of the smaller class is maintained.
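Both resampling approaches can be sketched with the standard `random` module. The helper names and the toy classes are hypothetical, and drawing copies at random is only one simple way to roughly keep the smaller class's distribution.

```python
import random

def undersample(majority, minority, rng):
    """Shrink the larger class by keeping a random subset of it."""
    return rng.sample(majority, len(minority)), list(minority)

def oversample(majority, minority, rng):
    """Grow the smaller class by copying randomly chosen examples from it."""
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return list(majority), list(minority) + extra

rng = random.Random(0)  # fixed seed so the sketch is repeatable

majority = [("a", i) for i in range(10)]   # 10 examples of class "a"
minority = [("b", i) for i in range(3)]    # 3 examples of class "b"

under_a, under_b = undersample(majority, minority, rng)
over_a, over_b = oversample(majority, minority, rng)

print(len(under_a), len(under_b))
print(len(over_a), len(over_b))
```

Undersampling throws data away, while oversampling repeats it, so which one fits depends on how much data there is to spare.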

Based on the concept of probability, **Bayesian Statistics combines prior data with new evidence to forecast the future trend**. This matters because if there is a specific change in the present, the prior data alone will not reflect it.

Frequency analysis, by contrast, computes the likelihood of a specific occurrence from prior data only; new information is not taken into account.
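The updating step can be sketched with Bayes' theorem, P(H|E) = P(E|H) P(H) / P(E). The function name and the numbers below are made up for illustration: a prior belief of 1% is revised after one piece of new evidence arrives.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """Updated probability of a hypothesis H after seeing evidence E.

    prior:               P(H), belief before the new evidence
    likelihood:          P(E | H), chance of the evidence if H is true
    false_positive_rate: P(E | not H), chance of the evidence if H is false
    """
    # Total probability of seeing the evidence at all.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

prior = 0.01
posterior = bayes_posterior(prior, likelihood=0.9, false_positive_rate=0.05)

print(f"before the evidence: {prior:.3f}")
print(f"after the evidence:  {posterior:.3f}")
```

A purely frequency-based estimate would stay at the prior value, because the new evidence never enters the calculation.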
