Working with Data

  • Originally Written : September, 2023

This workshop gives you an overview of some common data formats, along with tools and methods that can help you transform data. It is not an exhaustive list and is not focused on any particular area (e.g. networking APIs).

Prerequisites

If you intend to run this lab on your own machine, you should have the following installed:

  • Git
  • Python 3
  • Pip
  • Python Packages:

    • requests
    • json (part of the Python standard library, no separate install needed)
    • yaml (provided by the PyYAML package)
    • pandas
    • numpy
    • openpyxl
    • jinja2
    • matplotlib
    • plotly
    • faker
  • VSCode

  • VSCode extensions:
    • yaml
    • jupyter

Labs

  0. Tools Overview - Introduction to the primary tools (VS Code, Jupyter Notebook) you will be using to run the lab
  1. Markdown - Overview of some Markdown syntax which can be helpful when writing documentation (including a mermaid diagram)
  2. Data Formats - Introduction to JSON and YAML
  3. CLI Tools - The same JSON and YAML examples but instead using the jq command line tool to work with the data
  4. Pandas and Numpy - Transforming datasets such as Excel or CSV files using the Pandas library. Working with Numpy arrays and Matplotlib to visualise arrays
  5. Jinja - Building templates with the Jinja library

Getting Started

If you are using the Cisco dCloud lab you will be provided access credentials from a proctor.

Regarding the 03. CLI Tools lab

If you are running Windows, some commands, such as sed, may not be available. You can read through the lab, skip it altogether, or use a cloud-hosted platform such as this one:

https://killercoda.com/playgrounds/scenario/ubuntu

You will need to clone the Git repo (steps below) in the Ubuntu playground and copy/paste the commands into the Ubuntu terminal.

  • If you have VSCode installed, the next steps can be performed through the VSCode built-in terminal window. Select the Terminal menu and then New Terminal.

Overview of VSCode

  • Clone the lab repository
git clone https://github.com/conmurphy/working-with-data
  • Change the directory to the first lab
cd working-with-data/00.tools

Lab 0: Tools

This activity will provide an overview of the relevant sections in VSCode and Jupyter notebooks.

  • Open tools.pdf and have a look at the screenshots to familiarise yourself with the tools

Info

The PDF was created with pandoc using the following command:

pandoc -s -f markdown-implicit_figures -o tools.pdf tools.md

Lab 1: Markdown

Info

You won't need to run any code in this exercise

Although this lab is focused on working with different tools to transform data, supporting documentation is always helpful as you take these lessons into your daily activities. This could be an installation guide, diagrams, or even just comments in some code. Markdown is a great way to easily style documentation. Recall from the first lab that Jupyter notebooks combine both Markdown and code within a single file.

  • Change directory to start the markdown lab
cd 01.markdown
  • Read through the markdown.md file and note the syntax used to format the document

  • If you want to see the formatted Markdown you can right-click the file in VSCode and select Open Preview to the Side.

If you can't see the Mermaid diagram in the VSCode Markdown preview you can view the diagram here

01.markdown/markdown.md

Markdown Preview

  • Try it out yourself by adding some content using the syntax you find in the markdown.md file

Lab 2: Data Formats

Data can come in a variety of formats, each with its own characteristics and each best suited to different applications.

  • JSON is a lightweight data-interchange format that's commonly used for transmitting data between a server and a web application in a human-readable format. You might see JSON used as a response when querying an API.
  • YAML, which stands for "YAML Ain't Markup Language," is another human-friendly data serialization standard that is often used in configuration files and in applications where data is being stored or transmitted. You would see YAML when using Ansible or Kubernetes.
  • CSV, or Comma Separated Values, is a simple file format that is widely used for storing tabular data, such as in a spreadsheet or database. Its simplicity and wide support make it a common format for data exchange between applications.

The choice of a data format largely depends on the nature of the data, the specific requirements of the application, and the systems that will be used to process the data.

In this lab you will look at the JSON and YAML structure, including how to work with these formats in Python.
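
As a small preview of what working with these formats in Python can look like, here is a minimal sketch using only the standard library. The sample payload, the field names, and the variable names are made up for illustration and are not taken from the lab notebooks. YAML is handled in a similar way with the PyYAML package (`yaml.safe_load` and `yaml.safe_dump`), which is left out here so the example has no third-party dependencies.

```python
import csv
import io
import json

# A hypothetical API response, as a JSON string.
API_RESPONSE = (
    '{"devices": ['
    '{"name": "sw1", "vlans": [10, 20]}, '
    '{"name": "sw2", "vlans": [30]}'
    "]}"
)

# JSON text -> Python dictionaries and lists.
data = json.loads(API_RESPONSE)
device_names = [device["name"] for device in data["devices"]]

# Python objects -> nicely indented JSON text.
pretty = json.dumps(data, indent=2)

# The same data flattened into CSV, written to an in-memory buffer.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["name", "vlan_count"])
for device in data["devices"]:
    writer.writerow([device["name"], len(device["vlans"])])
csv_text = buffer.getvalue()

print(device_names)
print(csv_text)
```

Note how the nested list of VLANs has to be summarised (here as a count) before it fits into a CSV row, which is one reason nested data is more often exchanged as JSON or YAML.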


  • Change directory to start the data formats lab
cd 02.data-formats
  • Open data_formats_json.ipynb and proceed through the activities

  • Open data_formats_yaml.ipynb and proceed through the activities

Lab 3: CLI Tools

Regarding the 03. CLI Tools lab

If you are running Windows, some commands, such as sed, may not be available. You can read through the lab, skip it altogether, or use a cloud-hosted platform such as this one:

https://killercoda.com/playgrounds/scenario/ubuntu

You will need to clone the Git repo (steps below) in the Ubuntu playground and copy/paste the commands into the Ubuntu terminal.

git clone https://github.com/conmurphy/working-with-data.git

The first few activities of this lab will seem familiar as they repeat those of the previous lab. Sometimes you might want to process data through the CLI rather than a dedicated script, and this may involve one or more tools.

This lab provides an overview of common CLI tools such as jq, tr, cut, and awk. These are by no means the only tools available, and this is only a very brief introduction. The idea is for you to understand what is possible so that you have a starting point for future activities.
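
For comparison, here is a rough sketch of what the "dedicated script" alternative might look like in Python. The sample data and variable names are invented for illustration, and the shell commands in the comments are only approximate equivalents, not the commands used in the lab notebook.

```python
import json

# Hypothetical sample data.
RAW_JSON = '[{"name": "sw1", "site": "syd"}, {"name": "sw2", "site": "mel"}]'
LINES = ["sw1,syd,48", "sw2,mel,24"]

# Roughly what `jq -r '.[].name'` does: pull one key out of every object.
names = [item["name"] for item in json.loads(RAW_JSON)]

# Roughly what `cut -d, -f2` does: take the second comma-separated field.
sites = [line.split(",")[1] for line in LINES]

# Roughly what `tr a-z A-Z` does: convert lowercase letters to uppercase.
upper_lines = [line.upper() for line in LINES]

# Roughly what `awk -F, '{ sum += $3 } END { print sum }'` does:
# add up the third field across all lines.
total_ports = sum(int(line.split(",")[2]) for line in LINES)

print(names, sites, upper_lines, total_ports)
```

The CLI tools are often quicker for one-off tasks, while a script like this is easier to extend and reuse.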

  • Change directory to start the CLI Tools lab
cd 03.cli

Info

These activities will need to be performed in a Terminal and not using the Jupyter Notebook play button.

  • Open cli_tools.ipynb and proceed through the activities by copying and pasting the commands into a terminal

Lab 4: Pandas and Numpy

Info

If you ran the previous lab (03. CLI Tools) in Killercoda, switch back to the Windows pod to run the remaining exercises.

Pandas is a very popular Python library for data manipulation and analysis. It is often used for tasks such as data cleaning, data wrangling, and statistical analysis, and it supports use cases such as building machine learning models, data visualization, and creating complex data structures.

This lab demonstrates a use case I encounter very often: working with large Excel spreadsheets. Before you can run the lab you will need to generate some sample data using the faker Python library.

The second activity covers Numpy which is a Python library for scientific computing that supports large, multi-dimensional arrays and matrices. This makes it suitable for use cases such as numerical analysis, linear algebra, statistical operations, and even simulating physical and mathematical models.

You'll often see Pandas and Numpy used together.

This lab gives an overview of some common Numpy functions such as calculating the square of each element in an array. This lab also has a couple of activities using the matplotlib library to plot a simple line graph.
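
To give a feel for the element-wise style of Numpy, here is a minimal sketch. The array contents and variable names are made up for illustration and are not taken from the lab notebook.

```python
import numpy as np

# A one-dimensional array of sample values.
values = np.array([1, 2, 3, 4])

# Operators apply to every element at once, with no explicit loop.
squares = values ** 2

# np.square gives the same result as the ** operator above.
squares_again = np.square(values)

# A 2 x 3 matrix built from the numbers 0 to 5.
matrix = np.arange(6).reshape(2, 3)

# Sum down each column (axis=0) and take the mean of each row (axis=1).
column_sums = matrix.sum(axis=0)
row_means = matrix.mean(axis=1)

print(squares, column_sums, row_means)
```

Arrays like `values` and `squares` can be passed directly to matplotlib as the x and y values of a line graph.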

  • Change directory to start the Pandas and Numpy lab

cd 04.pandas-and-numpy
  • Open generate_sample_data.ipynb and generate the sample data

The file should be approximately 40MB-50MB in size.

  • Once the .xlsx file is generated, open it in Microsoft Excel and confirm the sample data has been generated

  • Open pandas.ipynb and proceed through the activities

With a few lines of code you were able to reduce a 40MB-50MB Excel file to a much smaller, more manageable size. Pandas is my go-to library any time I'm working with large Excel files.

But remember, it's a very powerful library and can be used to transform data in many different formats, as demonstrated with the API call and nested JSON.
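
As a reminder of the general pattern, here is a minimal sketch of flattening nested JSON and then keeping only the rows and columns of interest. The records, field names, and variable names are hypothetical and are not the data used in the lab notebook.

```python
import pandas as pd

# Hypothetical nested records, similar in shape to an API response.
records = [
    {"name": "sw1", "site": {"city": "Sydney", "floor": 1}, "ports": 48},
    {"name": "sw2", "site": {"city": "Melbourne", "floor": 3}, "ports": 24},
    {"name": "rtr1", "site": {"city": "Sydney", "floor": 2}, "ports": 8},
]

# Flatten the nested "site" dictionary into columns such as "site.city".
flat = pd.json_normalize(records)

# Keep only the Sydney rows, and only the columns we care about.
sydney = flat[flat["site.city"] == "Sydney"][["name", "ports"]]

print(sydney)
```

The same filter-then-select approach is what shrinks a large spreadsheet once it has been loaded into a DataFrame.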

Next have a look at the Numpy lab.

  • Open numpy.ipynb and proceed through the activities

Lab 5: Jinja

Jinja is a modern and designer-friendly templating language for Python. It is often used in creating dynamic web pages, generating configuration files, or producing automated emails, but has many more use cases outside of these areas.

This lab will demonstrate Jinja templates using two basic examples, an email template and a configuration file template. As with the previous labs, the activities in this lab are really only an introduction to the concepts. Have a look at the official documentation for what's possible with Jinja templates.
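
To illustrate the two kinds of template mentioned above, here is a minimal sketch. The template text, variable names, and values are invented for illustration and are not the templates used in the lab notebook. The `trim_blocks` and `lstrip_blocks` options are one way to stop the loop tags from leaving blank lines in the output; this is a choice made for this example.

```python
from jinja2 import Environment

# trim_blocks removes the newline after a block tag and lstrip_blocks
# removes whitespace before one, so {% for %} lines leave no blank lines.
env = Environment(trim_blocks=True, lstrip_blocks=True)

# A hypothetical email template with simple variable substitution.
EMAIL_TEMPLATE = "Hi {{ name }}, your report for {{ month }} is ready."

# A hypothetical configuration template that loops over a list of VLANs.
CONFIG_TEMPLATE = (
    "hostname {{ hostname }}\n"
    "{% for vlan in vlans %}\n"
    "vlan {{ vlan.id }}\n"
    " name {{ vlan.name }}\n"
    "{% endfor %}\n"
)

email = env.from_string(EMAIL_TEMPLATE).render(name="Alex", month="September")

config = env.from_string(CONFIG_TEMPLATE).render(
    hostname="sw1",
    vlans=[{"id": 10, "name": "users"}, {"id": 20, "name": "voice"}],
)

print(email)
print(config)
```

Keeping the template text separate from the data means the same template can be rendered again for every device or recipient.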

  • Change directory to start the Jinja lab
cd 05.jinja
  • Open jinja.ipynb and proceed through the activities

Summary

Hopefully you now have an understanding of a few different methods and tools available to transform data. There are many more available to help you with any tasks you have.


Last update: February 12, 2024