RWTHx: Basics of Data Science

"Basics of Data Science" gives a comprehensible overview of many fundamental concepts and tools of data science, including data quality and data preprocessing, supervised and unsupervised learning techniques including their evaluation, frequent itemsets and association rules, sequence mining, process mining, text mining, and responsible data science.

Basics of Data Science
9 weeks
5–11 hours per week
Self-paced
Progress at your own speed
Free
Optional upgrade available

There is one session available:

After a course session ends, it will be archived.
Starts Jun 24
Ends Dec 11

About this course

"Basics of Data Science" is designed to provide participants with a comprehensive overview of the fundamental challenges, concepts and tools of data science. The content can be organized in three main areas of data science:

First, a brief overview is given of data science infrastructure, which is concerned with volume and velocity. Topics include instrumentation, big data infrastructures and distributed systems, and databases and data management. The main challenge here is to make things scalable and instant.

The main focus of the course is on data analysis, which is concerned with extracting knowledge from data. Key topics covered are data exploration and visualization, data preprocessing, data quality issues and transformations, various supervised learning techniques with a focus on their evaluation, unsupervised learning, clustering, pattern mining, process mining, and text mining. The main challenge of data analysis is to provide answers to both known and unknown unknowns.

Finally, data science affects people, organizations, and society. The course concludes by discussing challenges and providing guidelines and techniques for applying data science responsibly, with a focus on confidentiality and fairness. Topics include ethics & privacy, IT law, human-technology interaction, operations management, business models, and entrepreneurship. The main challenge is to do all of the above in a responsible manner.

Throughout the course, the ideas and concepts conveyed in the videos are complemented by hands-on exercises using Python (Jupyter notebooks). Participants are guided to apply the presented techniques to artificial and real-life data sets to gain valuable hands-on experience.

After the course, participants should have a good overview of the best practices, challenges, goals, and concepts of the broader data science field, providing a strong foundation for further study or professional development in this rapidly evolving field. Combined with hands-on experience with commonly used Python libraries, this enables participants to conceptualize and implement various basic data analysis techniques in their own projects and to accurately evaluate and interpret the results.

At a glance

  • Institution: RWTHx
  • Subject: Computer Science
  • Level: Intermediate
  • Prerequisites:

    Anyone from any discipline with an interest in data science can start this course, and we expect it to be useful for everyone. Prior knowledge of mathematics (e.g., mathematical notation, linear algebra, probability, and statistics) is an advantage, but not mandatory.

  • Language: English
  • Video Transcript: English

What you'll learn

After taking this course, participants will have gained:

  • Understanding of the role of data science in today’s society and businesses, including challenges and opportunities
  • Good general overview of a broad range of data science techniques
  • Ability to conceptualize and implement basic data analyses and to accurately evaluate and interpret the outcomes
  • Understanding of the challenges of responsible data science (fairness, accuracy, confidentiality, transparency) and possible solutions
  • Understanding of the limitations of machine learning, data mining and AI techniques
  • Ability to write short Python programs and use mainstream Python libraries
  • In particular, understanding of and ability to apply the following data analysis concepts and techniques:
      • data visualization and exploration techniques
      • decision trees
      • linear and logistic regression (basic overview)
      • support vector machines (basic overview)
      • neural networks (basic overview)
      • naive Bayesian classification (basic overview)
      • evaluation and interpretation of the results obtained using supervised learning
      • clustering techniques
      • frequent itemsets
      • association rules
      • sequence mining
      • process mining
      • text mining
      • data preprocessing, data transformation, spotting and handling of data quality problems
  • Application of data analysis techniques without violating confidentiality and fairness

Week 1: Introduction, Data Exploration & Visualization

In the first half of the week, we provide an overview of the course and illustrate the advantages of and challenges in applying data science techniques. Students will get an overview of the data science pipeline, data sources and data types, and data analysis techniques, as well as the challenges related to their application.

The second half of the week focuses on basic data exploration, visualization and transformation techniques.
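
To give a flavor of the hands-on Jupyter exercises, here is a minimal exploration sketch using pandas and matplotlib; the columns and values are invented for illustration and are not taken from the course materials:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Invented example data standing in for a course data set.
    df = pd.DataFrame({
        "age":    [23, 35, 31, 52, 46, 23, 61, 35],
        "income": [1800, 3200, 2900, 5100, 4700, 2100, 6000, 3300],
    })

    print(df.describe())   # summary statistics per column
    print(df.corr())       # pairwise correlations

    df.hist(bins=5)        # quick visual overview of the distributions
    plt.show()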

Week 2: Supervised Learning Techniques

In the first half of this week, students will delve into data analysis using decision trees. We introduce the basic ID3 algorithm and its extensions to different notions of information gain, as well as pruning techniques, random forests, and the applicability of decision trees to continuous data.
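
To make the notion of information gain concrete, here is a minimal sketch of the entropy-based gain that ID3 maximizes when choosing a split; the labels and the candidate split are invented toy data:

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Shannon entropy of a list of class labels."""
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(labels, groups):
        """Entropy reduction when 'labels' is split into 'groups'."""
        n = len(labels)
        remainder = sum(len(g) / n * entropy(g) for g in groups)
        return entropy(labels) - remainder

    # Toy example: splitting 8 labels on a binary attribute.
    labels = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes"]
    split  = [["yes", "yes", "yes", "yes"], ["no", "no", "yes", "no"]]
    print(information_gain(labels, split))   # about 0.55 bits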

The second half of the week is dedicated to a brief overview of other supervised learning techniques (students interested in details are referred to the "Basics of Machine Learning" course which is also part of the BridgingAI course series). These techniques include Linear Regression, Logistic Regression, Support Vector Machines (SVMs), Neural Networks and Naive Bayesian Classification.
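
These techniques are not implemented from scratch in this overview; as a sketch of how they can be tried out, scikit-learn exposes them behind a common fit/predict interface (the synthetic data set below is an assumption for illustration):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import train_test_split

    # Invented synthetic classification data.
    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Three of the techniques named above, behind one interface.
    for model in (LogisticRegression(max_iter=1000), SVC(), GaussianNB()):
        model.fit(X_tr, y_tr)
        print(type(model).__name__, model.score(X_te, y_te))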

Week 3: Evaluation of Supervised Learning, Data Quality & Preprocessing

The first half of this week is dedicated to the evaluation of supervised learning techniques and the models they produce. We introduce the confusion matrix, the ROC curve, the R² coefficient, and cross-validation, including their extension and adaptation to specific goals or contexts. Furthermore, challenges and pitfalls regarding the evaluation and interpretation of supervised learning techniques are highlighted.
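
A minimal sketch of these evaluation tools using scikit-learn, again on an invented synthetic data set:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix, roc_auc_score
    from sklearn.model_selection import cross_val_score, train_test_split

    # Invented synthetic classification data.
    X, y = make_classification(n_samples=300, random_state=1)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    pred = model.predict(X_te)
    print(confusion_matrix(y_te, pred))                          # correct and wrong predictions per class
    print(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))  # area under the ROC curve
    print(cross_val_score(model, X, y, cv=5))                    # 5-fold cross-validation scores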

In the second half of the week, students will learn about data quality issues, their causes and avoidance strategies, as well as possible approaches to dealing with outliers or missing values. Furthermore, an overview of data transformation, data reduction, and normalization techniques is given.
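
As a small illustration of these preprocessing steps, a sketch using pandas and scikit-learn; the table and the chosen strategies (median imputation, min-max normalization) are illustrative assumptions:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    # Invented data with a missing value and widely differing scales.
    df = pd.DataFrame({"height_cm": [170, 182, np.nan, 165],
                       "income":    [1800, 250000, 3200, 2900]})

    df["height_cm"] = df["height_cm"].fillna(df["height_cm"].median())  # impute missing value
    df[["income"]] = MinMaxScaler().fit_transform(df[["income"]])       # normalize to [0, 1]
    print(df)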

Week 4: Clustering, Frequent Itemsets

In the first half of this week, clustering is introduced as the first unsupervised learning technique. In particular, we present various similarity measures, the k-means and k-medoids algorithms, and density-based clustering (DBSCAN), and give an overview of agglomerative clustering techniques and self-organizing maps (SOM).
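
A minimal k-means sketch with scikit-learn, run on two invented point clouds:

    import numpy as np
    from sklearn.cluster import KMeans

    # Two invented Gaussian point clouds in 2-D.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.cluster_centers_)   # one centroid per cluster
    print(km.labels_[:10])       # cluster assignments of the first points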

The second half of the week focuses on the introduction of frequent itemsets. Two algorithms to compute such itemsets are explained: the straightforward Apriori approach as well as the more efficient FP-Growth algorithm.
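
To illustrate what a frequent itemset is, here is a brute-force support count in plain Python; Apriori and FP-Growth exist precisely to avoid such exhaustive counting. The transactions and the support threshold are invented:

    from collections import Counter
    from itertools import combinations

    # Invented transactions; an itemset is frequent if it occurs in
    # at least min_support of the five transactions.
    transactions = [{"bread", "milk"}, {"bread", "butter"},
                    {"bread", "milk", "butter"}, {"milk", "butter"},
                    {"bread", "milk", "butter"}]
    min_support = 3

    # Count every 1- and 2-itemset (Apriori would prune candidates instead).
    counts = Counter()
    for t in transactions:
        for k in (1, 2):
            for itemset in combinations(sorted(t), k):
                counts[itemset] += 1

    frequent = {s: c for s, c in counts.items() if c >= min_support}
    print(frequent)   # e.g. ('bread', 'milk') occurs in 3 of 5 transactions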

Week 5: Association Rule Mining, Sequence Mining

In the first half of this week, we build upon the concepts of frequent itemsets to generate and evaluate association rules. Furthermore, we use association rules to illustrate Simpson's paradox.
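
A sketch of the rule metrics involved, computing support, confidence, and lift for one candidate rule over invented transactions:

    # Invented transactions (one set of items per purchase).
    transactions = [{"bread", "milk"}, {"bread", "butter"},
                    {"bread", "milk", "butter"}, {"milk", "butter"},
                    {"bread", "milk", "butter"}]

    def support(itemset):
        """Fraction of transactions containing all items of 'itemset'."""
        return sum(itemset <= t for t in transactions) / len(transactions)

    # Candidate rule: {bread} -> {milk}
    antecedent, consequent = {"bread"}, {"milk"}
    confidence = support(antecedent | consequent) / support(antecedent)
    lift = confidence / support(consequent)
    print(f"confidence={confidence:.2f}, lift={lift:.2f}")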

The second half of the week revolves around sequence mining, in particular the AprioriAll algorithm. The relationships between frequent itemsets, association rules, sequence mining and process mining (introduced in Week 6) are clarified.
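
The core step of algorithms such as AprioriAll is counting how many sequences contain a candidate pattern as an ordered subsequence (gaps allowed); a minimal sketch with invented sequences:

    def contains(sequence, pattern):
        """True if 'pattern' occurs in 'sequence' in order, gaps allowed."""
        it = iter(sequence)
        return all(item in it for item in pattern)

    # Invented event sequences, e.g. one per customer.
    sequences = [["a", "b", "c"], ["a", "c", "b"], ["c", "b", "a"]]
    pattern = ["a", "c"]

    support = sum(contains(s, pattern) for s in sequences) / len(sequences)
    print(support)   # fraction of sequences containing <a, c>: 2/3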

Week 6: Process Mining

The whole week is dedicated to various aspects of process mining. We start out with an extensive introduction to the topic, including various types of models, tools and applications. Next, various approaches to process discovery are presented as the most prominent example of unsupervised learning in the context of process mining. Finally, supervised problems in process mining are discussed with the main focus on conformance checking techniques.
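
A building block of many discovery approaches is the directly-follows relation, which counts how often one activity is immediately followed by another; a minimal sketch over an invented event log:

    from collections import Counter

    # Invented event log: one trace (ordered list of activities) per case.
    log = [["register", "check", "pay"],
           ["register", "check", "reject"],
           ["register", "pay"]]

    # Count each directly-follows pair (a, b) across all traces.
    dfg = Counter((a, b) for trace in log for a, b in zip(trace, trace[1:]))
    print(dfg)   # e.g. ('register', 'check') occurs twice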

Week 7: Text Mining

In this week, we explore the topic of text mining. Various approaches to text preprocessing are discussed, including corpus annotation, tokenization, stop word removal, token normalization, and stemming and lemmatization, followed by an overview of modelling techniques such as bag-of-words (BoW), the document-term matrix, and TF-IDF scoring. We briefly discuss the inclusion of semantics using public databases (Linked Open Data) before proceeding with a detailed introduction to N-grams and their application to word prediction and text generation. These concepts are then extended in a discussion of word embeddings, in particular autoencoders, Word2vec, CBoW, and Doc2vec.
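
A minimal sketch of a TF-IDF-weighted document-term matrix using scikit-learn, on invented example documents:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Invented toy corpus.
    docs = ["the cat sat on the mat",
            "the dog sat on the log",
            "cats and dogs"]

    vectorizer = TfidfVectorizer()             # tokenization + TF-IDF weighting in one step
    X = vectorizer.fit_transform(docs)         # sparse document-term matrix
    print(vectorizer.get_feature_names_out())  # the learned vocabulary
    print(X.toarray().round(2))                # TF-IDF weight per document and term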

Week 8: Responsible Data Science

In this week, we discuss challenges and solution approaches to confidentiality and fairness in data science. The first half of the week is dedicated to confidentiality. We give a brief overview of data encryption before introducing various techniques to anonymize data while maintaining its usefulness for analysis, as well as measures to objectively evaluate the level of anonymization.
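
One such evaluation notion is k-anonymity: every combination of quasi-identifier values must occur at least k times. A minimal check in pandas, on an invented table:

    import pandas as pd

    # Invented table; 'age_group' and 'zip' act as quasi-identifiers.
    df = pd.DataFrame({"age_group": ["20-29", "20-29", "30-39", "30-39", "30-39"],
                       "zip":       ["52062", "52062", "52064", "52064", "52064"],
                       "diagnosis": ["A", "B", "A", "C", "B"]})

    # The size of the smallest quasi-identifier group determines k.
    k = df.groupby(["age_group", "zip"]).size().min()
    print(f"The table is {k}-anonymous")   # here: 2-anonymous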

The second part of the week, focusing on fairness, introduces various metrics to objectively measure fairness and explores approaches to reduce discrimination by data science models and techniques. We conclude with a discussion of the potential trade-offs between model performance and model fairness.
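
As one example of such a metric, here is a sketch of the demographic parity difference, i.e., the gap in positive-outcome rates between groups, computed on invented predictions:

    import pandas as pd

    # Invented model decisions for applicants from two groups.
    df = pd.DataFrame({"group":    ["A", "A", "A", "B", "B", "B"],
                       "accepted": [1,   1,   0,   1,   0,   0]})

    rates = df.groupby("group")["accepted"].mean()   # acceptance rate per group
    print(rates)
    print("parity difference:", abs(rates["A"] - rates["B"]))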

Week 9: The Bigger Picture

In the final week, we briefly recap the contents of the course and discuss connections, trade-offs, conflicts, and interactions between the various topics, as well as their context and impact within the bigger picture of data science. An outlook on further perspectives and on topics omitted from this introductory course is given.
