RWTHx: Basics of Data Science

"Basics of Data Science" gives a comprehensible overview of many fundamental concepts and tools of data science, including data quality and data preprocessing, supervised and unsupervised learning techniques including their evaluation, frequent itemsets and association rules, sequence mining, process mining, text mining, and responsible data science.

Basics of Data Science
9 weeks
5–11 hours per week
Self-paced
Progress at your own speed
Free
Optional upgrade available

There is one session available:

After a course session ends, it will be archived.
Starts Jun 24
Ends Dec 11

About this course

"Basics of Data Science" is designed to provide participants with a comprehensive overview of the fundamental challenges, concepts and tools of data science. The content can be organized in three main areas of data science:

First, a brief overview is given of data science infrastructure, which is concerned with volume and velocity. Topics include instrumentation, big data infrastructures and distributed systems, and databases and data management. The main challenge here is to make things scalable and instant.

The main focus of the course is on data analysis, which is concerned with extracting knowledge from data. Key topics covered are data exploration and visualization, data preprocessing, data quality issues and transformations, various supervised learning techniques with a focus on their evaluation, unsupervised learning, clustering, pattern mining, process mining, and text mining. The main challenge of data analysis is to provide answers to both known and unknown unknowns.

Finally, data science affects people, organizations, and society. The course concludes by discussing challenges and providing guidelines and techniques for applying data science responsibly, with a focus on confidentiality and fairness. Topics include ethics & privacy, IT law, human-technology interaction, operations management, business models, and entrepreneurship. The main challenge is to do all of the above in a responsible manner.

Throughout the course, the ideas and concepts conveyed in the videos are complemented by hands-on exercises using Python (Jupyter notebooks). Participants are guided to apply the presented techniques to artificial and real-life data sets to gain valuable hands-on experience.

After the course, participants should have a good overview of the best practices, challenges, goals, and concepts of the broader data science field, providing a strong foundation for further study or professional development in this rapidly evolving field. Combined with hands-on experience with commonly used Python libraries, this enables participants to conceptualize and implement various basic data analysis techniques in their own projects and to accurately evaluate and interpret the results.

At a glance

  • Institution: RWTHx
  • Subject: Computer Science
  • Level: Intermediate
  • Prerequisites:

    Anyone from any discipline with an interest in data science can start this course, and we expect it to be useful for everyone. Prior knowledge of mathematics (e.g., mathematical notation, linear algebra, probability, and statistics) is an advantage, but not mandatory.

  • Language: English
  • Video Transcript: English

What you'll learn

After taking this course, participants will have gained:

  • Understanding of the role of data science in today’s society and businesses, including challenges and opportunities
  • Good general overview of a broad range of data science techniques
  • Ability to conceptualize and implement basic data analyses and to accurately evaluate and interpret the outcomes
  • Understanding of the challenges of responsible data science (fairness, accuracy, confidentiality, transparency) and possible solutions
  • Understanding of the limitations of machine learning, data mining and AI techniques
  • Ability to write short Python programs and use mainstream Python libraries
  • In particular, understanding of and ability to apply the following data analysis concepts and techniques:
      • data visualization and exploration techniques
      • decision trees
      • linear and logistic regression (basic overview)
      • support vector machines (basic overview)
      • neural networks (basic overview)
      • naive Bayesian classification (basic overview)
      • evaluation and interpretation of the results obtained using supervised learning
      • clustering techniques
      • frequent itemsets
      • association rules
      • sequence mining
      • process mining
      • text mining
      • data preprocessing, data transformation, spotting and handling of data quality problems
  • Application of data analysis techniques without violating confidentiality and fairness

Week 1: Introduction, Data Exploration & Visualization

In the first half of the week, we provide an overview of the course and illustrate the advantages of and challenges in applying data science techniques. Students will get an overview of the data science pipeline, data sources and data types, and data analysis techniques, as well as the challenges related to their application.

The second half of the week focuses on basic data exploration, visualization and transformation techniques.
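
To give a flavor of the hands-on Jupyter exercises, here is a minimal exploration sketch using pandas and matplotlib; the columns and values are invented for illustration and are not taken from the course materials:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Invented example data standing in for a course data set.
    df = pd.DataFrame({
        "age":    [23, 35, 31, 52, 46, 23, 61, 35],
        "income": [1800, 3200, 2900, 5100, 4700, 2100, 6000, 3300],
    })

    print(df.describe())   # summary statistics per column
    print(df.corr())       # pairwise correlations

    df.hist(bins=5)        # quick visual overview of the distributions
    plt.show()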

Week 2: Supervised Learning Techniques

In the first half of this week, students will delve into data analysis using decision trees. We introduce the basic ID3 algorithm and its extensions to different notions of information gain, as well as pruning techniques, random forests, and the applicability of decision trees to continuous data.
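
To make the notion of information gain concrete, here is a minimal sketch of the entropy-based gain that ID3 maximizes when choosing a split; the labels and the candidate split are invented toy data:

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Shannon entropy of a list of class labels."""
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(labels, groups):
        """Entropy reduction when 'labels' is split into 'groups'."""
        n = len(labels)
        remainder = sum(len(g) / n * entropy(g) for g in groups)
        return entropy(labels) - remainder

    # Toy example: splitting 8 labels on a binary attribute.
    labels = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes"]
    split  = [["yes", "yes", "yes", "yes"], ["no", "no", "yes", "no"]]
    print(information_gain(labels, split))   # about 0.55 bits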

The second half of the week is dedicated to a brief overview of other supervised learning techniques (students interested in details are referred to the "Basics of Machine Learning" course which is also part of the BridgingAI course series). These techniques include Linear Regression, Logistic Regression, Support Vector Machines (SVMs), Neural Networks and Naive Bayesian Classification.
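
These techniques are not implemented from scratch in this overview; as a sketch of how they can be tried out, scikit-learn exposes them behind a common fit/predict interface (the synthetic data set below is an assumption for illustration):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import train_test_split

    # Invented synthetic classification data.
    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Three of the techniques named above, behind one interface.
    for model in (LogisticRegression(max_iter=1000), SVC(), GaussianNB()):
        model.fit(X_tr, y_tr)
        print(type(model).__name__, model.score(X_te, y_te))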

Week 3: Evaluation of Supervised Learning, Data Quality & Preprocessing

The first half of this week is dedicated to the evaluation of supervised learning techniques and the models they produce. We introduce the confusion matrix, the ROC curve, the R² coefficient, and cross-validation, including their extension and adaptation to specific goals or contexts. Furthermore, challenges and pitfalls regarding the evaluation and interpretation of supervised learning techniques are highlighted.
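
A minimal sketch of these evaluation tools using scikit-learn, again on an invented synthetic data set:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix, roc_auc_score
    from sklearn.model_selection import cross_val_score, train_test_split

    # Invented synthetic classification data.
    X, y = make_classification(n_samples=300, random_state=1)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    pred = model.predict(X_te)
    print(confusion_matrix(y_te, pred))                          # correct and wrong predictions per class
    print(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))  # area under the ROC curve
    print(cross_val_score(model, X, y, cv=5))                    # 5-fold cross-validation scores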

In the second half of the week, students will learn about data quality issues, their causes and avoidance strategies, as well as possible approaches to dealing with outliers or missing values. Furthermore, an overview of data transformation, data reduction, and normalization techniques is given.
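
As a small illustration of these preprocessing steps, a sketch using pandas and scikit-learn; the table and the chosen strategies (median imputation, min-max normalization) are illustrative assumptions:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    # Invented data with a missing value and widely differing scales.
    df = pd.DataFrame({"height_cm": [170, 182, np.nan, 165],
                       "income":    [1800, 250000, 3200, 2900]})

    df["height_cm"] = df["height_cm"].fillna(df["height_cm"].median())  # impute missing value
    df[["income"]] = MinMaxScaler().fit_transform(df[["income"]])       # normalize to [0, 1]
    print(df)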

Week 4: Clustering, Frequent Itemsets

In the first half of this week, clustering is introduced as the first unsupervised learning technique. In particular, we present various similarity measures, the k-means and k-medoids algorithms, and density-based clustering (DBSCAN), and give an overview of agglomerative clustering techniques and self-organizing maps (SOM).
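
A minimal k-means sketch with scikit-learn, run on two invented point clouds:

    import numpy as np
    from sklearn.cluster import KMeans

    # Two invented Gaussian point clouds in 2-D.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.cluster_centers_)   # one centroid per cluster
    print(km.labels_[:10])       # cluster assignments of the first points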

The second half of the week focuses on the introduction of frequent itemsets. Two algorithms to compute such itemsets are explained: the straightforward Apriori approach as well as the more efficient FP-Growth algorithm.
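
To illustrate what a frequent itemset is, here is a brute-force support count in plain Python; Apriori and FP-Growth exist precisely to avoid such exhaustive counting. The transactions and the support threshold are invented:

    from collections import Counter
    from itertools import combinations

    # Invented transactions; an itemset is frequent if it occurs in
    # at least min_support of the five transactions.
    transactions = [{"bread", "milk"}, {"bread", "butter"},
                    {"bread", "milk", "butter"}, {"milk", "butter"},
                    {"bread", "milk", "butter"}]
    min_support = 3

    # Count every 1- and 2-itemset (Apriori would prune candidates instead).
    counts = Counter()
    for t in transactions:
        for k in (1, 2):
            for itemset in combinations(sorted(t), k):
                counts[itemset] += 1

    frequent = {s: c for s, c in counts.items() if c >= min_support}
    print(frequent)   # e.g. ('bread', 'milk') occurs in 3 of 5 transactions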

Week 5: Association Rule Mining, Sequence Mining

In the first half of this week, we build upon the concepts of frequent itemsets to generate and evaluate association rules. Furthermore, we use association rules to illustrate Simpson's paradox.
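
A sketch of the rule metrics involved, computing support, confidence, and lift for one candidate rule over invented transactions:

    # Invented transactions (one set of items per purchase).
    transactions = [{"bread", "milk"}, {"bread", "butter"},
                    {"bread", "milk", "butter"}, {"milk", "butter"},
                    {"bread", "milk", "butter"}]

    def support(itemset):
        """Fraction of transactions containing all items of 'itemset'."""
        return sum(itemset <= t for t in transactions) / len(transactions)

    # Candidate rule: {bread} -> {milk}
    antecedent, consequent = {"bread"}, {"milk"}
    confidence = support(antecedent | consequent) / support(antecedent)
    lift = confidence / support(consequent)
    print(f"confidence={confidence:.2f}, lift={lift:.2f}")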

The second half of the week revolves around sequence mining, in particular the AprioriAll algorithm. The relationships between frequent itemsets, association rules, sequence mining and process mining (introduced in Week 6) are clarified.
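
The core step of algorithms such as AprioriAll is counting how many sequences contain a candidate pattern as an ordered subsequence (gaps allowed); a minimal sketch with invented sequences:

    def contains(sequence, pattern):
        """True if 'pattern' occurs in 'sequence' in order, gaps allowed."""
        it = iter(sequence)
        return all(item in it for item in pattern)

    # Invented event sequences, e.g. one per customer.
    sequences = [["a", "b", "c"], ["a", "c", "b"], ["c", "b", "a"]]
    pattern = ["a", "c"]

    support = sum(contains(s, pattern) for s in sequences) / len(sequences)
    print(support)   # fraction of sequences containing <a, c>: 2/3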

Week 6: Process Mining

The whole week is dedicated to various aspects of process mining. We start out with an extensive introduction to the topic, including various types of models, tools and applications. Next, various approaches to process discovery are presented as the most prominent example of unsupervised learning in the context of process mining. Finally, supervised problems in process mining are discussed with the main focus on conformance checking techniques.
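
A building block of many discovery approaches is the directly-follows relation, which counts how often one activity is immediately followed by another; a minimal sketch over an invented event log:

    from collections import Counter

    # Invented event log: one trace (ordered list of activities) per case.
    log = [["register", "check", "pay"],
           ["register", "check", "reject"],
           ["register", "pay"]]

    # Count each directly-follows pair (a, b) across all traces.
    dfg = Counter((a, b) for trace in log for a, b in zip(trace, trace[1:]))
    print(dfg)   # e.g. ('register', 'check') occurs twice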

Week 7: Text Mining

In this week, we explore the topic of text mining. Various approaches to text preprocessing are discussed, including corpus annotation, tokenization, stop word removal, token normalization, and stemming and lemmatization, followed by an overview of modelling techniques such as bag-of-words (BoW), the document-term matrix, and TF-IDF scoring. We briefly discuss the inclusion of semantics using public databases (Linked Open Data) before proceeding with a detailed introduction to N-grams and their application to word prediction and text generation. These concepts are then extended in a discussion of word embeddings, in particular autoencoders, Word2vec, CBoW, and Doc2vec.
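
A minimal sketch of a TF-IDF-weighted document-term matrix using scikit-learn, on invented example documents:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Invented toy corpus.
    docs = ["the cat sat on the mat",
            "the dog sat on the log",
            "cats and dogs"]

    vectorizer = TfidfVectorizer()             # tokenization + TF-IDF weighting in one step
    X = vectorizer.fit_transform(docs)         # sparse document-term matrix
    print(vectorizer.get_feature_names_out())  # the learned vocabulary
    print(X.toarray().round(2))                # TF-IDF weight per document and term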

Week 8: Responsible Data Science

In this week, we discuss challenges and solution approaches to confidentiality and fairness in data science. The first half of the week is dedicated to confidentiality. We give a brief overview of data encryption before introducing various techniques to anonymize data while maintaining its usefulness for analysis, as well as measures to objectively evaluate the level of anonymization.
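
One such evaluation notion is k-anonymity: every combination of quasi-identifier values must occur at least k times. A minimal check in pandas, on an invented table:

    import pandas as pd

    # Invented table; 'age_group' and 'zip' act as quasi-identifiers.
    df = pd.DataFrame({"age_group": ["20-29", "20-29", "30-39", "30-39", "30-39"],
                       "zip":       ["52062", "52062", "52064", "52064", "52064"],
                       "diagnosis": ["A", "B", "A", "C", "B"]})

    # The size of the smallest quasi-identifier group determines k.
    k = df.groupby(["age_group", "zip"]).size().min()
    print(f"The table is {k}-anonymous")   # here: 2-anonymous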

The second part of the week, focusing on fairness, introduces various metrics to objectively measure fairness and explores approaches to reduce discrimination by data science models and techniques. We conclude with a discussion of the potential trade-offs between model performance and model fairness.
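
As one example of such a metric, here is a sketch of the demographic parity difference, i.e., the gap in positive-outcome rates between groups, computed on invented predictions:

    import pandas as pd

    # Invented model decisions for applicants from two groups.
    df = pd.DataFrame({"group":    ["A", "A", "A", "B", "B", "B"],
                       "accepted": [1,   1,   0,   1,   0,   0]})

    rates = df.groupby("group")["accepted"].mean()   # acceptance rate per group
    print(rates)
    print("parity difference:", abs(rates["A"] - rates["B"]))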

Week 9: The Bigger Picture

In the final week, we briefly recap the contents of the course and discuss connections, trade-offs, conflicts, and interactions between the various topics, as well as their context and impact within the bigger picture of data science. An outlook on further perspectives and on topics omitted from this introductory course is given.
