# Working with Data

*Originally written: September 2023*

This workshop will provide you with an overview of some common data formats, as well as tools and methods available to help you transform data. It is not an exhaustive list and is not focused on any particular area (e.g. networking APIs).
## Prerequisites

If you intend to run this lab on your own machine you should have the following installed:

- Git
- Python 3
- Pip
- Python packages: `requests`, `pyyaml`, `pandas`, `numpy`, `openpyxl`, `jinja2`, `matplotlib`, `plotly`, `faker` (`json` is part of the Python standard library, so it doesn't need to be installed separately)
- VSCode
- VSCode extensions: `YAML`, `Jupyter`
## Labs

- **Tools Overview** - Introduction to the primary tools (VS Code, Jupyter Notebook) you will be using to run the lab
- **Markdown** - Overview of some Markdown syntax which can be helpful when writing documentation (including a `mermaid` diagram)
- **Data Formats** - Introduction to JSON and YAML
- **CLI Tools** - The same JSON and YAML examples, but using the `jq` command line tool to work with the data
- **Pandas and Numpy** - Transforming datasets such as Excel or CSV files using the `Pandas` library. Working with `Numpy` arrays and `Matplotlib` to visualise arrays
- **Jinja** - Building templates with the `Jinja` library
## Getting Started

If you are using the Cisco dCloud lab you will be provided with access credentials by a proctor.

> **Regarding the `03. CLI Tools` lab:** If you are running Windows, some of the commands such as `sed` may not be available. You can either read through the lab, skip it altogether, or use a cloud hosted platform like this one: https://killercoda.com/playgrounds/scenario/ubuntu. You will need to clone the Git repo (steps below) in the Ubuntu playground and copy/paste the commands into the Ubuntu terminal.

- If you have VSCode installed, the next steps can be performed through the VSCode built-in terminal window. Select the `Terminal` menu and then `New Terminal`
- Clone the lab repository
- Change the directory to the first lab
## Lab 0: Tools

This activity will provide an overview of the relevant sections in VSCode and Jupyter notebooks.

- Open `tools.pdf` and have a look at the screenshots to familiarise yourself with the tools

> **Info:** The PDF was created with `pandoc` using the following command
## Lab 1: Markdown

> **Info:** You won't need to run any code in this exercise.

Although this lab is focused on working with different tools to transform data, as you take these lessons into your daily activities it's always helpful to have supporting documentation. This could be in the form of an installation guide, diagrams, or even just comments in some code. Markdown is a great way to easily style documentation, and recall from the first lab that Jupyter notebooks combine both Markdown and code within a single file.

- Change directory to start the markdown lab
- Read through the `markdown.md` file and note the syntax used to format the document
- If you want to see the formatted Markdown you can right-click the file in VSCode and select `Open Preview to the Side`. If you can't see the Mermaid diagram in the VSCode Markdown preview, you can view the diagram here
- Try it out yourself by adding some content using the syntax you find in the `markdown.md` file
## Lab 2: Data Formats

Data can come in a variety of formats, each with unique characteristics best suited to different applications.

- **JSON** is a lightweight data-interchange format that's commonly used for transmitting data between a server and a web application in a human-readable format. You might see JSON used as a response when querying an API.
- **YAML**, which stands for "YAML Ain't Markup Language", is another human-friendly data serialization standard that is often used in configuration files and in applications where data is being stored or transmitted. You would see YAML when using Ansible or Kubernetes.
- **CSV**, or Comma Separated Values, is a simple file format that is widely used for storing tabular data, such as in a spreadsheet or database; its simplicity and wide support make it a common format for data exchange between applications.

The choice of a data format largely depends on the nature of the data, the specific requirements of the application, and the systems that will be used to process the data.

In this lab you will look at the JSON and YAML structures, including how to work with these formats in Python.

- Change directory to start the data formats lab
- Open `data_formats_json.ipynb` and proceed through the activities
- Open `data_formats_yaml.ipynb` and proceed through the activities
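The notebooks cover this in detail, but as a quick preview (the record below is made up for illustration), the standard-library `json` module and the third-party `pyyaml` package both parse their respective formats into ordinary Python data structures:

```python
import json
import yaml  # provided by the pyyaml package

# The same record expressed in both formats
json_text = '{"hostname": "router01", "interfaces": ["Gi0/0", "Gi0/1"]}'
yaml_text = """
hostname: router01
interfaces:
  - Gi0/0
  - Gi0/1
"""

from_json = json.loads(json_text)
from_yaml = yaml.safe_load(yaml_text)

# Both parse to identical Python dictionaries
print(from_json == from_yaml)  # True

# And a dictionary can be serialised back out to either format
print(yaml.dump(from_json, default_flow_style=False))
```

Because both formats land in the same Python dictionaries and lists, converting between JSON and YAML is just a parse followed by a dump.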
## Lab 3: CLI Tools

> **Regarding the `03. CLI Tools` lab:** If you are running Windows, some of the commands such as `sed` may not be available. You can either read through the lab, skip it altogether, or use a cloud hosted platform like this one: https://killercoda.com/playgrounds/scenario/ubuntu. You will need to clone the Git repo (steps below) in the Ubuntu playground and copy/paste the commands into the Ubuntu terminal.

```
git clone https://github.com/conmurphy/working-with-data.git
```

The first few activities of this lab will seem familiar, as they are a repeat of the previous lab. Sometimes you might want to process data through the CLI rather than a dedicated script, and this may involve one or more tools.

This lab provides an overview of common CLI tools such as `jq`, `tr`, `cut`, and `awk`. These are by no means the only tools available, and this is only a very brief introduction. The idea is for you to understand what is possible so that you have a starting point for future activities.
- Change directory to start the CLI Tools lab

> **Info:** These activities will need to be performed in a terminal, not by using the Jupyter Notebook play button.

- Open `cli_tools.ipynb` and proceed through the activities by copying and pasting the commands into a terminal
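If you're on a machine where tools like `jq` aren't available, it can help to see the same idea in Python, which ships with the `json` module. This hypothetical example (the data is made up) extracts one field from each object in a JSON array, roughly what a `jq '.[] | .name'` filter does:

```python
import json

# Sample JSON, similar to what an API might return
raw = '[{"name": "switch01", "os": "iosxe"}, {"name": "switch02", "os": "nxos"}]'

devices = json.loads(raw)             # parse the JSON string into a list of dicts
names = [d["name"] for d in devices]  # pull out one field per object
print(names)                          # ['switch01', 'switch02']
```

The CLI tools shine because they chain together with pipes, but it's useful to know the same transformations are always a few lines of Python away.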
## Lab 4: Pandas and Numpy

> **Info:** If you ran the previous lab `03. CLI Tools` in Killercoda, you should change back to the Windows pod to run the remaining exercises.

**Pandas** is a very popular Python library for data manipulation and analysis, often used for tasks such as data cleaning, data wrangling, and statistical analysis, and in use cases such as building machine learning models, data visualization, and creating complex data structures.

This lab demonstrates a use case I encounter very often: working with large Excel spreadsheets. Before you can run the lab you will need to generate some sample data using the `faker` Python library.

The second activity covers **Numpy**, a Python library for scientific computing that supports large, multi-dimensional arrays and matrices. This makes it suitable for use cases such as numerical analysis, linear algebra, statistical operations, and even simulating physical and mathematical models. You'll often see Pandas and Numpy used together.

This lab gives an overview of some common Numpy functions, such as calculating the square of each element in an array. It also has a couple of activities using the `matplotlib` library to plot a simple line graph.
- Change directory to start the Pandas and Numpy lab
- Open `generate_sample_data.ipynb` and generate the sample data. The file should be ~40MB-50MB in size
- Once the `.xlsx` file is generated, open it in Microsoft Excel and confirm the sample data has been generated
- Open `pandas.ipynb` and proceed through the activities
With a few lines of code you were able to reduce a 40MB-50MB Excel file to a much smaller, more manageable size. Pandas is my go-to library any time I'm working with large Excel files. But remember, it's a very powerful library and can be used to transform data of many different formats, as demonstrated with the API call and nested JSON.
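The kind of reduction described above can be sketched in a few lines. This is only an illustration, assuming a made-up dataset and hypothetical column names, not the exact steps from `pandas.ipynb`:

```python
import pandas as pd

# Hypothetical data standing in for the generated spreadsheet
# (in the lab you would load it with pd.read_excel instead)
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "country": ["DE", "DE", "FR", "DE"],
    "amount": [120.5, 80.0, 45.25, 210.0],
})

# Keep only the rows and columns you actually need...
subset = df.loc[df["country"] == "DE", ["name", "amount"]]

# ...then write the much smaller result back out, e.g. as CSV
subset.to_csv("filtered.csv", index=False)
print(len(subset))  # 3
```

Filtering early like this, before any further processing, is usually what turns an unwieldy spreadsheet into something manageable.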
Next, have a look at the Numpy lab.

- Open `numpy.ipynb` and proceed through the activities
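As a taste of the Numpy activities, the element-wise square mentioned earlier is a single vectorised operation, with no explicit loop required:

```python
import numpy as np

arr = np.array([1, 2, 3, 4])

# Square every element of the array at once
squared = arr ** 2
print(squared)  # [ 1  4  9 16]
```

This vectorised style is what makes Numpy fast on large arrays, and it is the same pattern the plotting activities feed into `matplotlib`.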
## Lab 5: Jinja

**Jinja** is a modern and designer-friendly templating language for Python. It is often used for creating dynamic web pages, generating configuration files, or producing automated emails, but it has many more use cases beyond these areas.

This lab demonstrates Jinja templates using two basic examples: an email template and a configuration file template. As with the previous labs, the activities here are really only an introduction to the concepts. Have a look at the official documentation to see what's possible with Jinja templates.

- Open `jinja.ipynb` and proceed through the activities
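As a minimal sketch of the configuration-file use case (the template and variable names here are illustrative, not the ones from the notebook), a Jinja template is just text with placeholders that you fill in at render time:

```python
from jinja2 import Template

# A tiny device-configuration template with a variable and a loop
template = Template(
    "hostname {{ hostname }}\n"
    "{% for intf in interfaces %}interface {{ intf }}\n{% endfor %}"
)

# Rendering substitutes the variables into the template text
config = template.render(hostname="router01", interfaces=["Gi0/0", "Gi0/1"])
print(config)
```

The same mechanism works for emails, HTML pages, or any other text output: keep the structure in the template and pass in only the data that changes.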
## Summary

Hopefully you now have an understanding of a few of the methods and tools available to transform data. There are many more out there to help you with whatever tasks you have.