Machine Learning Day 2: Reading Different Dataset Formats Using Pandas
A short and crisp article on reading data for ML models
Hey folks, this is day 2 of my machine learning journey. In this article I will show you how to read different dataset formats using pandas.
DIFFERENT WAYS TO EXTRACT DATA FOR MACHINE LEARNING MODELS
There are mainly five ways to get data for machine learning models:
- using a CSV file
- using JSON
- using a DBMS
- using an API
- using web scraping
Let's discuss these methods one by one.
CSV FORMAT
CSV is one of the most common formats for supplying data to a machine learning model. CSV stands for comma-separated values: each line of the file is one record, and the fields in a record are separated by commas.
CSV FILE EXAMPLE
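As a minimal sketch, a CSV file is plain text with a header row followed by one record per line. The sample below is hypothetical: the column names, the values, and the variable name `sample_csv` are made up for illustration and are not taken from the IMDB dataset used in this article. `io.StringIO` wraps the text so pandas can read it as if it were a file.

```python
import io

import pandas as pd

# Hypothetical CSV content: a header row, then one record per line.
sample_csv = """title,year,rating
Movie A,1994,9.3
Movie B,1972,9.2
Movie C,2008,9.0
"""

# StringIO turns the string into a file-like object that read_csv accepts.
data = pd.read_csv(io.StringIO(sample_csv))

print(data.head())  # first rows of the DataFrame
print(data.shape)   # (number of rows, number of columns)
```

With a real file on disk, the `io.StringIO(...)` argument would simply be replaced by the file path.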
CODE
import numpy as np  # import the NumPy library as np
import pandas as pd  # import the pandas library as pd
data = pd.read_csv("/IMDB_Top250Engmovies2_OMDB_Detailed.csv")  # read_csv is a pandas function that loads a CSV file into a DataFrame
data.head()  # head() displays the first 5 rows of the dataset
OUTPUT
TSV FORMAT
TSV is also a popular format in machine learning. TSV stands for tab-separated values: the fields are separated by tabs instead of commas.
EXAMPLE OF TSV FILE
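As a rough illustration, the hypothetical sample below (the names, values, and the variable `sample_tsv` are invented for this sketch) shows tab-separated text being read with `read_csv` and `sep="\t"`. One field deliberately contains a comma to show that commas are ordinary characters when the separator is a tab.

```python
import io

import pandas as pd

# Hypothetical TSV content; "\t" is the tab character between fields.
sample_tsv = (
    "name\tcity\tscore\n"
    "Asha\tPune, MH\t81\n"
    "Ravi\tDelhi\t77\n"
)

# sep="\t" tells read_csv to split each line on tabs instead of commas.
data = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

print(data.head())
```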
CODE
import pandas as pd  # import the pandas library
data = pd.read_csv('file.tsv', sep='\t')  # read_csv also reads TSV files; sep='\t' sets the separator to a tab
data.head()  # display the first 5 rows
JSON FILE FORMAT
JSON is one of the most widely used file formats in the industry, so let's study some basics. JSON (JavaScript Object Notation) files are plain text files that you can open in any text editor.
EXAMPLE OF JSON FILE
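A minimal sketch of what such a file can look like: a JSON array where each object is one record. The restaurant names, keys, and the variable `sample_json` are hypothetical and are not the contents of the file used in this article. When the JSON is a list of objects like this, `read_json` turns each object into a row and each key into a column.

```python
import io

import pandas as pd

# Hypothetical JSON content: a list of objects, one object per record.
sample_json = """[
  {"name": "Cafe One", "city": "Pune", "tables": 12},
  {"name": "Cafe Two", "city": "Delhi", "tables": 8}
]"""

# StringIO makes the text look like a file to read_json.
data = pd.read_json(io.StringIO(sample_json))

print(data.head())
```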
CODE
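import numpy as np  # import the NumPy library as np
import pandas as pd  # import the pandas library as pd
data = pd.read_json('restuarant_details.json')  # read_json loads a JSON file into a DataFrame
data.head()  # display the first 5 rows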
OUTPUT
READING DATA THROUGH URL IN JSON
Sometimes a JSON file is too big to keep on our local machine. In that case we can read the data directly from a server by passing its URL to read_json.
CODE
import pandas as pd  # import the pandas library
import numpy as np  # import the NumPy library
data = pd.read_json('file_url')  # 'file_url' is a placeholder for the URL of the JSON file
data.head()  # display the first 5 rows
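To try the URL form without a network connection, here is a small sketch that writes a temporary JSON file and passes its `file://` URL to `read_json`. The records, the file name `records.json`, and the variable names are hypothetical. The `file://` URL is only a stand-in: an `https://` URL pointing at a JSON file would be passed to `read_json` in the same way.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical records written to a temporary file so the sketch runs offline.
records = [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]
json_path = Path(tempfile.mkdtemp()) / "records.json"
json_path.write_text(json.dumps(records))

# as_uri() builds a file:// URL from the absolute path.
file_url = json_path.as_uri()

# read_json accepts the URL string directly and fetches the content itself.
data = pd.read_json(file_url)

print(data.head())
```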
I will cover the rest of the methods in a separate article.
For more information, read the pandas documentation: pandas.pydata.org/docs/user_guide/io.html