# 6 Months Data Science with NLP


• Statistics
• Data visualization in Python
• EDA
• Regression
• Supervised Machine Learning
• Unsupervised Machine Learning
• Ensemble Techniques
• Association Rule
• Recommendation system
• Artificial Neural Network
• Introduction to NLP
• Preprocessing of data
• Feature extraction
• POS
• NER
• How to implement spam detection
• How to implement sentiment analysis
• How to implement an article spinner
• How to implement text summarization
• How to implement latent semantic indexing
• How to implement topic modelling
• Hugging Face Transformers
• Assignments for assessment
• Projects
• Internship

Course Outline

Statistical Foundations

In this module, you will learn the statistical methods used for decision making throughout this Data Science course.

• Probability distribution – Binomial, Poisson, and Normal Distribution in Python.
• Bayes’ theorem – Bayes’ theorem is a mathematical formula, named after Thomas Bayes, for calculating conditional probability. Conditional probability is the probability of an outcome given that another event has already occurred.
• Central limit theorem – This module will teach you how the Central Limit Theorem (CLT) lets you approximate the distribution of a sample mean with a normal distribution.
• Hypothesis testing – This module will teach you about hypothesis testing in statistics, including the one-sample t-test, ANOVA, and the chi-square test.
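As a small illustration of Bayes’ theorem from the list above, the sketch below computes a conditional probability in plain Python. The function name `bayes_posterior` and the numbers (a 1% base rate, a 95% detection rate, a 5% false-positive rate) are made up for the example and are not part of the course material.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """Return P(A | B) given P(A), P(B | A), and P(B | not A)."""
    # Total probability of observing B, whether or not A is true.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 1% of people have a condition, the test detects it
# 95% of the time, and it gives a false positive 5% of the time.
posterior = bayes_posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(posterior, 3))  # about 0.161
```

Even with an accurate test, the probability of having the condition after a positive result stays low here because the condition itself is rare.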

Exploratory Data Analysis (EDA)

This module will teach you Exploratory Data Analysis using Pandas, Seaborn, Matplotlib, and summary statistics.

• Pandas – Pandas is one of the most widely used Python libraries. Pandas is used to analyze and manipulate data. This module will give you a deep understanding of exploring data sets using Pandas.
• Summary statistics (mean, median, mode, variance, standard deviation) – In this module, you will learn about various statistical formulas and implement them using Python.
• Seaborn – Seaborn is also one of the most widely used Python libraries. Seaborn is a Matplotlib-based data visualization library in Python. This module will give you a deep understanding of exploring data sets using Seaborn.
• Matplotlib – Matplotlib is another widely used Python library. Matplotlib is a library for creating static, animated, and interactive visualizations. This module will give you a deep understanding of exploring data sets using Matplotlib.
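The summary statistics named above can be computed with Python's standard `statistics` module, as in the minimal sketch below. The sample values are made up for illustration, and the population versions of variance and standard deviation are used as an assumption. In practice, Pandas offers equivalents such as `DataFrame.describe()`.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample data

summary = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "mode": statistics.mode(values),
    "variance": statistics.pvariance(values),  # population variance
    "std_dev": statistics.pstdev(values),      # population standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```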

Regression – Linear Regression

This module will get you comfortable with the techniques used in linear and logistic regression.

• Multiple linear regression – Multiple Linear Regression is used for predicting one dependent variable using various independent variables.
• Fitted regression lines – A fitted regression line is a mathematical regression equation on a graph for your data.
• AIC, BIC, Model Fitting, Training and Test Data – In this module, you will learn how to fit models, split data into training and test sets, and compare candidate models using the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC).
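To make the idea of a fitted regression line and a train/test split concrete, here is a minimal sketch of ordinary least squares for the one-predictor case (multiple linear regression extends this to several predictors). The helper names `fit_line` and `mean_squared_error` and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def mean_squared_error(xs, ys, intercept, slope):
    """Average squared difference between observed and predicted y."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data that follows y = 1 + 2x exactly.
train_x, train_y = [1, 2, 3, 4, 5], [3, 5, 7, 9, 11]
test_x, test_y = [6, 7], [13, 15]

intercept, slope = fit_line(train_x, train_y)
test_mse = mean_squared_error(test_x, test_y, intercept, slope)
print(intercept, slope, test_mse)
```

The model is fitted on the training points only, and the error is measured on the held-out test points, which is the purpose of the split.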

Regression – Logistic Regression

• Introduction to Logistic regression, interpretation, odds ratio – Logistic regression is a simple classification algorithm that predicts a categorical dependent variable from one or more independent variables.
• Misclassification, Probability, AUC, R-Square – This module will teach you how to evaluate a classifier using misclassification rates, predicted probabilities, AUC, and R-Square.
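The pieces named above can be sketched in a few lines of Python: the sigmoid turns a linear score into a probability, exponentiating a coefficient gives an odds ratio, and AUC can be computed as the fraction of positive/negative pairs that the model ranks correctly. The function names and the labels and scores below are illustrative assumptions, not the course's own code.

```python
import math


def sigmoid(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def odds_ratio(coefficient):
    """Multiplicative change in the odds for a one-unit increase in the feature."""
    return math.exp(coefficient)


def auc(labels, scores):
    """Share of (positive, negative) pairs where the positive scores higher."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]            # made-up true classes
scores = [0.1, 0.4, 0.35, 0.8]   # made-up predicted probabilities
print(sigmoid(0.0), odds_ratio(0.5), auc(labels, scores))
```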

Supervised Machine Learning

In this module, you will learn the supervised learning techniques used in machine learning.

• CART – CART (Classification and Regression Trees) is a predictive machine learning model that predicts the value of an outcome variable from the values of other variables.
• KNN – KNN is one of the most straightforward machine learning algorithms for solving regression and classification problems.
• Decision Trees – Decision Tree is a Supervised Machine Learning algorithm used for both classification and regression problems. It is a hierarchical structure where internal nodes indicate the dataset features, branches represent the decision rules, and each leaf node indicates the result.
• Naive Bayes – The Naive Bayes algorithm solves classification problems using Bayes’ theorem.

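As an example of one technique from this module, the sketch below implements a basic KNN classifier: it finds the `k` training points closest to a query and returns the most common label among them. The function name `knn_predict`, the choice of Euclidean distance, and the toy points are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Two made-up clusters of 2-D points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (2, 2)))
print(knn_predict(points, labels, (8.5, 8.5)))
```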
Unsupervised Learning

In this module, you will learn the unsupervised learning techniques used in machine learning.

• Clustering – K-Means & Hierarchical – Clustering is an unsupervised learning technique involving the grouping of data. In this module, you will learn everything you need to know about the method and its types, like K-means clustering and hierarchical clustering.
• Distance methods – This module will teach you how to work with all the distance methods or measures such as Euclidean, Manhattan, Cosine.
• Features of a Cluster – Labels, Centroids, Inertia – This module will drive you through all the features of a Cluster like Labels, Centroids, and Inertia.
• Eigenvectors and eigenvalues – In this module, you will learn how to compute the eigenvectors and eigenvalues of a matrix.
• Principal component analysis – Principal Component Analysis is a dimensionality-reduction technique that reduces the number of input variables for a predictive model, which can help avoid overfitting.
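The distance measures and the cluster inertia mentioned in this module can be written directly in Python. This is a minimal sketch with hypothetical function names. Inertia is computed here as the sum of squared Euclidean distances from each point to its assigned centroid, which is the usual definition for K-means.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def inertia(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(euclidean(p, centroids[label]) ** 2 for p, label in zip(points, labels))


print(euclidean((0, 0), (3, 4)))          # 5.0
print(manhattan((0, 0), (3, 4)))          # 7
print(cosine_similarity((1, 0), (0, 1)))  # 0.0
```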

Ensemble Techniques

In this module, we discuss the shortcomings of standalone supervised models and learn ensemble techniques that overcome them.

• Bagging & Boosting – Bagging is a meta-algorithm in machine learning that improves the stability and accuracy of algorithms used in statistical classification and regression.
Boosting is a meta-algorithm in machine learning that combines several weak classifiers into a strong classifier.
• Random Forest – Random Forest builds several decision trees on different subsets of the provided dataset. It then averages their predictions (or takes a majority vote for classification) to improve predictive accuracy.
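The bagging idea can be sketched as follows: train several simple models on bootstrap samples (drawn with replacement) and combine their predictions by majority vote. The base learner here is a hypothetical one-feature threshold classifier, chosen only to keep the example short. It assumes class 1 has the larger feature values, and it is not the decision tree used by Random Forest.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def train_stump(sample):
    """Fit a threshold classifier on (x, label) pairs with labels 0 or 1."""
    zeros = [x for x, y in sample if y == 0]
    ones = [x for x, y in sample if y == 1]
    if not zeros or not ones:
        # The sample contains a single class, so always predict it.
        constant = 1 if ones else 0
        return lambda x: constant
    # Threshold halfway between the two class means.
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: 1 if x > threshold else 0


def bagging_fit(data, n_models=25, seed=0):
    rng = random.Random(seed)
    return [train_stump(bootstrap_sample(data, rng)) for _ in range(n_models)]


def bagging_predict(models, x):
    """Majority vote over all base models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Made-up one-dimensional data: small values are class 0, large values class 1.
data = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 0), (7.0, 1), (7.5, 1), (8.0, 1), (8.5, 1)]
models = bagging_fit(data)
print(bagging_predict(models, 1.2), bagging_predict(models, 8.2))
```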

Association Rules Mining & Recommendation Systems

Association rule mining is the data mining process of finding rules that describe which items tend to occur together in a dataset, such as products that are frequently bought together.
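Association rules are usually scored with support, confidence, and lift. The sketch below computes these three measures on a made-up set of shopping baskets. The function names and the transactions are illustrative assumptions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence relative to how common the consequent is overall."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)


transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

print(support(transactions, {"bread", "milk"}))
print(confidence(transactions, {"bread"}, {"milk"}))
print(lift(transactions, {"bread"}, {"milk"}))
```

A lift below 1, as in this toy data, suggests the two items appear together less often than they would if they were independent.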

Recommendation engines are a subclass of machine learning systems that generally deal with ranking or rating products for users. Loosely defined, a recommender system is a system that predicts the ratings a user might give to a specific item. These predictions are then ranked and returned to the user.

Introduction to Deep Learning – Single Layer Perceptron

Artificial neural networks, usually simply called neural networks or neural nets, are computing systems inspired by the biological neural networks that constitute animal brains. An ANN is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain.
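A single-layer perceptron is the simplest such unit: it computes a weighted sum of its inputs and applies a step function. The sketch below trains one with the classic perceptron update rule to reproduce a logical AND gate. The function names, learning rate, and number of epochs are assumptions chosen for the example.

```python
def predict(weights, bias, inputs):
    """Step activation on the weighted sum of the inputs."""
    activation = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation > 0 else 0


def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Perceptron learning rule: nudge weights by the prediction error."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Truth table of a logical AND gate.
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(and_gate)
predictions = [predict(weights, bias, inputs) for inputs, _ in and_gate]
print(predictions)  # [0, 0, 0, 1]
```

A single perceptron can only learn linearly separable functions such as AND, which is one motivation for multilayer networks.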

Convolutional Neural Network

A convolutional neural network is a feed-forward neural network that is generally used to analyze visual images by processing data with grid-like topology. It’s also known as a ConvNet. A convolutional neural network is used to detect and classify objects in an image.
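The core operation of a convolutional layer is sliding a small kernel over a grid of values. The sketch below does this in plain Python without padding or stride, using cross-correlation, which is what deep learning libraries commonly call convolution. The function name `conv2d_valid`, the tiny image, and the edge-detecting kernel are made up for illustration.

```python
def conv2d_valid(image, kernel):
    """Slide kernel over image (no padding) and sum the element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            row.append(total)
        output.append(row)
    return output


# A tiny image that is dark (0) on the left and bright (1) on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
vertical_edge = [[-1, 1], [-1, 1]]  # responds to dark-to-bright transitions

feature_map = conv2d_valid(image, vertical_edge)
print(feature_map)  # [[0, 2, 0], [0, 2, 0]]
```

The feature map is non-zero only where the kernel straddles the edge, which is how a ConvNet filter localizes a pattern.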

• This topic will introduce some basic NLP concepts, such as Text preprocessing, tokenization, stopwords, lemmatization, and stemming.
• This topic will introduce you to topic identification, which you can apply to any text you encounter in the wild. Using basic NLP models, you will identify topics from texts based on term frequencies. You’ll experiment with and compare two simple methods, bag-of-words and TF-IDF, using NLTK, and learn to convert text into vectors using CountVectorizer, TF-IDF, word2vec, and GloVe.
• This topic will introduce slightly more advanced topics: part-of-speech tagging and named-entity recognition. You’ll learn how to identify the who, what, and where of your texts using pre-trained models. You’ll also learn how to use some new libraries, like spaCy, to add to your NLP toolbox.
• How to implement a document retrieval system / search engine / similarity search / vector similarity
• In this topic, we will learn about probability models, language models, and Markov models, which are prerequisites for Transformers, BERT, and GPT-3.
• How to implement a cipher decryption algorithm using genetic algorithms and language modeling
• How to implement spam detection

Spam detection is a supervised machine learning problem. This means you must provide your machine learning model with a set of examples of spam and ham messages and let it find the relevant patterns that separate the two different categories.
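One common way to approach this is a bag-of-words Naive Bayes classifier. The sketch below is a minimal version in plain Python with Laplace smoothing. The tiny stopword list, the four training messages, and the function names are all made up for illustration, and a real system would need far more data.

```python
import math
import re
from collections import Counter

STOPWORDS = {"a", "the", "to", "is", "you", "for", "at", "and"}  # tiny illustrative list


def tokenize(text):
    """Lowercase, split into words, and drop stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def train(messages):
    """messages: list of (text, label) pairs with label 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, label_counts, vocabulary


def classify(text, model):
    word_counts, label_counts, vocabulary = model
    total_messages = sum(label_counts.values())
    scores = {}
    for label in word_counts:
        # Start from the log prior, then add the log likelihood of each word.
        score = math.log(label_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


training_messages = [
    ("Win a free prize now", "spam"),
    ("Free cash, claim your prize", "spam"),
    ("Are we meeting for lunch today", "ham"),
    ("Please review the project report", "ham"),
]

model = train(training_messages)
print(classify("Claim your free prize", model))
print(classify("Lunch meeting today", model))
```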

• How to implement sentiment analysis

Sentiment analysis is a machine learning technique that analyzes text for polarity, from positive to negative. By training a model on labeled examples of sentiment in text, it learns to detect sentiment in new text automatically.

• How to implement an article spinner

An article spinner is a tool whose primary function is to rewrite text so that the overall message and meaning are left intact while the wording is changed significantly.

• How to implement text summarization

Text summarization refers to the technique of shortening long pieces of text. The intention is to create a coherent and fluent summary having only the main points outlined in the document. Automatic text summarization is a common problem in machine learning and natural language processing.
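A simple form of this is extractive summarization, which keeps the highest-scoring sentences from the original text instead of generating new ones. The sketch below scores each sentence by the average frequency of its non-stopword terms. The stopword list, the scoring rule, and the sample text are assumptions for illustration, and this is only one of many possible approaches.

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "the", "is", "was", "for", "of", "to", "in", "and"}


def content_words(text):
    """Lowercase words with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(text, max_sentences=1):
    """Return the top-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    frequencies = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(frequencies[w] for w in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)


text = (
    "Python is popular for data science. "
    "Data science uses Python libraries for data analysis. "
    "The weather was nice yesterday."
)
print(summarize(text, max_sentences=1))
```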

• How to implement latent semantic indexing

Latent semantic indexing (also referred to as Latent Semantic Analysis) is a method of analyzing a set of documents in order to discover statistical co-occurrences of words that appear together which then give insights into the topics of those words and documents.

• How to implement topic modeling with LDA, NMF, and SVD

Topic modeling in natural language processing is a technique that assigns topics to the documents in a corpus based on the words they contain. Topic modeling is important because, as the volume of text data grows, it becomes increasingly necessary to categorize documents.

• Machine learning (Naive Bayes, Logistic Regression, PCA, SVD, LDA)

Singular Value Decomposition, or SVD, has a wide array of applications. These include dimensionality reduction, image compression, and denoising data. In natural language processing, Latent Dirichlet Allocation is a generative statistical model that explains a set of observations through unobserved groups, and each group explains why some parts of the data are similar. The LDA is an example of a topic model.

• Deep learning

NLP stands for natural language processing and refers to the ability of computers to process text and analyze human language. Deep learning refers to the use of multilayer neural networks in machine learning.

• Hugging Face Transformers

The Hugging Face transformers package is an immensely popular Python library providing pretrained models that are extraordinarily useful for a variety of natural language processing (NLP) tasks.

• How to use Python, Scikit-Learn, Tensorflow for NLP

Scikit-learn is mostly used for classical machine learning and provides a wide range of models behind a common interface. TensorFlow is used mainly for building and training neural networks. Used together, they make it easier to compare classical models with neural network models on the same NLP task.

• Assignments for assessment
• Projects
• Internship