🌑

Stephen's Blog

Data Collection and File Handling in Python

 

Stephen Cheng

Intro

Gathering data is a critical first step in any data science project. Python provides versatile tools for collecting, importing, and saving data from various sources, such as CSV, Excel, JSON, and databases. This guide highlights key techniques and libraries to help you manage, process, and store data efficiently for analysis and modeling.

How to Gather Data

Here’s an improved and detailed breakdown of the different methods to gather data.

  • Through Surveys and Forms: Online surveys, interviews, focus groups.
  • Internal Data Sources: Company databases, CRM systems.
  • Data from Sensors and IoT Devices: Smart devices, IoT platforms.
  • Public Datasets: Government websites, research institutions, open data portals.
  • Web Scraping: Automated tools, APIs, ethical considerations.
  • Collaborations and Partnerships: Academic and industry partnerships.
  • Purchase Data: Data vendors, market research firms.
  • Crowdsourcing: Crowdsourcing platforms, open innovation challenges.
  • Simulated Data: Synthetic data generation, simulation models.

Importing Data from Different Types of Files

In data science projects, data often comes from various sources and formats. Python provides several libraries to efficiently handle different types of files. In this guide, we’ll explore how to import data from CSV, Excel, JSON, and SQLite databases.

  • Importing Data from CSV Files

CSV (Comma-Separated Values) files are simple text files used to store tabular data. Python’s pandas library is highly efficient for handling CSV files.

1
2
3
4
5
import pandas as pd
# Load data from a CSV file
csv_data = pd.read_csv('books.csv')
# Display the first few rows of the DataFrame
print(csv_data.head())
  • Importing Data from Excel Files

Excel files are common for data storage and analysis. The pandas library, can be used to read Excel files.

1
2
3
4
5
import pandas as pd
# Load data from an Excel file
excel_data = pd.read_excel('weather.xlsx')
# Display the first few rows of the DataFrame
print(excel_data.head())
  • Importing Data from JSON Files

JSON (JavaScript Object Notation) is a widely-used format for data interchange, especially with APIs. Python’s json library and pandas can handle JSON files efficiently.

1
2
3
4
5
6
7
8
9
import pandas as pd
import json
# Load data from a JSON file
with open('latest_launch.json', 'r') as file:
json_data = json.load(file)
# Convert JSON data to a DataFrame
json_df = pd.json_normalize(json_data)
# Display the DataFrame
print(json_df.head())
  • Importing Data from SQLite Databases

SQLite is a lightweight, disk-based database. The sqlite3 module in Python, along with pandas, can be used to query and import data from SQLite databases.

1
2
3
4
5
6
7
8
9
10
11
import sqlite3
import pandas as pd
# Connect to the SQLite database
conn = sqlite3.connect('users.db')
# Query the data and load it into a DataFrame
sql_query = "SELECT * FROM users"
db_data = pd.read_sql_query(sql_query, conn)
# Close the database connection
conn.close()
# Display the DataFrame
print(db_data.head())
  • Gathering Data by Web scraping
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
import requests
from bs4 import BeautifulSoup
import csv

url = 'http://books.toscrape.com/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Find all book items
books = soup.find_all('article', class_='product_pod')

# Open a CSV file to write the data
with open('books.csv', 'w', newline='') as file:
writer = csv.writer(file)
writer.writerow(['Title', 'Price'])
# Write data to CSV
for book in books:
title = book.h3.a['title']
price = book.find('p', class_='price_color').text
writer.writerow([title, price])
print("Data has been written to books.csv")

Saving Data to Different Types of Files in Python

  • Saving Data to CSV Files

CSV (Comma-Separated Values) files are simple text files used to store tabular data. Python’s pandas library is highly efficient for handling CSV files.

1
2
3
4
5
6
7
8
9
10
11
import pandas as pd
# Create a sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [30, 25, 35],
'Email': ['alice@example.com', 'bob@example.com', 'charlie@example.com']
}
df = pd.DataFrame(data)
# Save the DataFrame to a CSV file
df.to_csv('users.csv', index=False)
print("Data has been written to users.csv")
  • Saving Data to Excel Files

Excel files are common for data storage and analysis. The pandas library can be used to write Excel files.

1
2
3
4
5
6
7
8
9
10
11
import pandas as pd
# Create a sample DataFrame
data = {
'City': ['New York', 'Los Angeles', 'Chicago'],
'Temperature': [25, 30, 20],
'Description': ['Sunny', 'Cloudy', 'Rainy']
}
df = pd.DataFrame(data)
# Save the DataFrame to an Excel file
df.to_excel('weather.xlsx', index=False)
print("Data has been written to weather.xlsx")
  • Saving Data to JSON Files

JSON (JavaScript Object Notation) is a widely-used format for data interchange, especially with APIs. Python’s json library and pandas can handle JSON files efficiently.

1
2
3
4
5
6
7
8
9
10
11
import pandas as pd
# Create a sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [30, 25, 35],
'Email': ['alice@example.com', 'bob@example.com', 'charlie@example.com']
}
df = pd.DataFrame(data)
# Save the DataFrame to a JSON file
df.to_json('users.json', orient='records', indent=4)
print("Data has been written to users.json")
  • Saving Data to SQLite Databases

SQLite is a lightweight, disk-based database. The sqlite3 module in Python, along with pandas, can be used to store data in SQLite databases.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
import sqlite3
import pandas as pd
# Create a sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [30, 25, 35],
'Email': ['alice@example.com', 'bob@example.com', 'charlie@example.com']
}
df = pd.DataFrame(data)
# Connect to the SQLite database (or create it)
conn = sqlite3.connect('users.db')
# Save the DataFrame to the SQLite database
df.to_sql('users', conn, if_exists='replace', index=False)
# Close the database connection
conn.close()
print("Data has been written to users.db")

, , — Dec 21, 2023

Search

    Made with ❤️ and ☀️ on Earth.