Monday, July 24, 2017

pyspark

A few points:

1. By default, pyspark on a cluster reads input from HDFS, and a plain local path will not work. To read from the local file system, prefix the path with file:// (and the file must exist on every worker node), as shown below.
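
A quick sketch of the difference (the paths here are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-example").getOrCreate()

hdfs_df = spark.read.text("hdfs:///user/me/sample.txt")    # the default on a cluster
local_df = spark.read.text("file:///tmp/sample.txt")       # explicit file:// scheme for local files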

PySpark Notes

To install Packages

Python Installation
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python3 get-pip.py --user

Packages Installation
pip3 install <PACKAGENAME>
pip3 install requests --user
pip3 install pandas


Requests Example:
The requests package is used when working with REST APIs.

import requests

# client_id_value, client_secret_value and url are assumed to be defined earlier
payload = f"grant_type=client_credentials&client_id={client_id_value}&client_secret={client_secret_value}"
headers = {
    'Content-Type': "application/x-www-form-urlencoded",   # typical for form-encoded payloads
    'Accept': "application/json;v=1"
}

response = requests.request("POST", url, data=payload, headers=headers, verify=False)   # verify=False skips SSL verification
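
If the endpoint follows the usual OAuth client-credentials pattern, the token can be pulled out of the JSON body (a sketch; the access_token field name is an assumption about this API):

if response.status_code == 200:
    token = response.json().get("access_token")   # field name assumed
else:
    print(response.status_code, response.text)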

Pandas Examples:

import pandas as pd
pdf = pd.DataFrame(df_list)   # df_list is assumed to be a list of rows/dicts
pdf.to_csv(PATH, sep='\u0001', index=False, header=False)   # \u0001 (Ctrl-A) is the default Hive field delimiter
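
To read such a file back (a sketch; the column names are placeholders):

cols = ['Name', 'Address', 'Id']
pdf2 = pd.read_csv(PATH, sep='\u0001', header=None, names=cols)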

SnowFlake connection:

import snowflake.connector
import pandas as pd

sfconnec = snowflake.connector.connect(user=user,
                                       password=pwd,
                                       account=account,
                                       warehouse=warehouse,
                                       database=dbname,
                                       schema=schema,
                                       role=role)
sfJdbcConnection = sfconnec.cursor()

sqldata = sfJdbcConnection.execute("SELECT * FROM TBL1").fetchall()
tableschema = ['Name', 'Address', 'Id']
panda_df = pd.DataFrame(sqldata, columns=tableschema)   # Converts to Pandas DF
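
Once done, release the resources:

sfJdbcConnection.close()
sfconnec.close()

Newer versions of the connector can also hand back a DataFrame directly (assuming pyarrow is installed):

panda_df = sfJdbcConnection.execute("SELECT * FROM TBL1").fetch_pandas_all()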


Transpose/Pivot  using Pandas:

The advantage of using Pandas is that it is powerful and comes with numerous built-in functions. One such example is the transpose/unpivot operation, where columns are to be converted into rows.
For this we do not need to write any complex logic; in pandas, 'melt' is the function that does this job for us.
Eg:
import pandas as pd
tablecols = ['Name', 'Address', 'Id']
melted_df = pd.melt(csv_df, id_vars=tablecols)

Later, rename the resulting 'variable' and 'value' columns as needed.

source: https://www.journaldev.com/33398/pandas-melt-unmelt-pivot-function
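
A self-contained sketch of the whole flow (the data is made up for illustration):

import pandas as pd

csv_df = pd.DataFrame({'Name': ['A', 'B'],
                       'Address': ['X', 'Y'],
                       'Id': [1, 2],
                       'Jan': [10, 20],
                       'Feb': [30, 40]})

melted_df = pd.melt(csv_df, id_vars=['Name', 'Address', 'Id'])
melted_df = melted_df.rename(columns={'variable': 'Month', 'value': 'Sales'})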


Steps for AWS Lambda creation:

1. Create Function
2. Add Tags.
3. Add VPC Settings
4. Create Layer
5. Start writing function

To create a layer

pip3 install -U pandas -t ./

Now the Pandas library will be saved in the current directory; zip it and upload it as a layer to the Lambda function. (For Python layers, AWS expects the packages to sit inside a python/ directory in the zip.)
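
A minimal handler that uses the layer (a sketch; the event shape and field names are made up):

import json
import pandas as pd   # resolved from the layer, not the deployment package

def lambda_handler(event, context):
    df = pd.DataFrame(event.get('records', []))
    return {
        'statusCode': 200,
        'body': json.dumps({'rows': len(df)})
    }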

Python Iteration through list

numbers = (1,2,3,4)

def addition_op(n):
    return n + n

out_num = map(addition_op, numbers)

out_nums2 = map(lambda x: x + x, numbers)

# map() returns a lazy iterator; wrap it in list() to materialize the values
print(list(out_num))     # [2, 4, 6, 8]
print(list(out_nums2))   # [2, 4, 6, 8]


Parallelism in Python (an alternative to Futures in Scala)

use ThreadPoolExecutor

Eg:
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def get_Tbl1_data():
    run_snow(conn, Tbl1_sql)   # conn and Tbl1_sql are assumed to be defined earlier

# get_Tbl2_data .. get_Tbl5_data are defined the same way

with ThreadPoolExecutor(max_workers=4) as exe:
    exe.submit(get_Tbl1_data)
    exe.submit(get_Tbl2_data)
    exe.submit(get_Tbl3_data)
    exe.submit(get_Tbl4_data)
    exe.submit(get_Tbl5_data)
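
submit() hands back Future objects, so results (or exceptions) can be collected as tasks finish. A sketch with a placeholder task:

from concurrent.futures import ThreadPoolExecutor, as_completed

tables = ['TBL1', 'TBL2', 'TBL3', 'TBL4', 'TBL5']

def get_data(table):
    return f"SELECT * FROM {table}"   # placeholder for the real query call

with ThreadPoolExecutor(max_workers=4) as exe:
    futures = [exe.submit(get_data, t) for t in tables]
    for fut in as_completed(futures):
        print(fut.result())   # re-raises here if the task raised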

Snowflake Connection via sqlalchemy:
Used sqlalchemy for connecting to Snowflake from Python.

eg:
from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL

engine = create_engine(URL(account='hostname',
                           user='user',
                           password='password',
                           warehouse='warehousename',
                           database='databasename',
                           schema='schemaname'))
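
Queries can then go through the engine, e.g. straight into pandas:

import pandas as pd

df = pd.read_sql("SELECT * FROM TBL1", engine)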

Install multiple packages in Python

pip3 install --upgrade -r /path/requirements.txt --user

requirements.txt
pandas==1.1.2
snowflake-sqlalchemy==1.2.3
