spark_db Flashcards

1
Q

topandas

A

sparkdf.toPandas()   # toPandas() is a DataFrame method, not a SparkSession method

2
Q

from pandas

A

spark.createDataFrame(pandasdf, schema)

3
Q

read and write

A

spark.read           # reads start from the SparkSession
df.write             # writes start from the DataFrame
spark.read.jdbc(url=jdbc_url, table="tablename")
df.write.mode("overwrite").jdbc(url=jdbc_url, table="tablename")   # or mode("append")

#csv write
#save without changing the schema
??
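A minimal sketch of the csv write noted above, with a placeholder output path (the schema-preserving overwrite is covered in card 31):

# write csv with a header row, overwriting any previous output at that path
df.write.format("csv").option("header", "true").mode("overwrite").save("/tmp/output_csv")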
4
Q

show

A

df.show(n)   # n = number of rows to show, as an integer (show() is a DataFrame method)

5
Q

filter based on condition

A

df.filter("experiment_id = 1").show()
df.filter((df.col1 == condition) & (df.col2 == condition))
df.filter((df.col1 == condition) & (df.col2 == condition)).select(colname)

6
Q

replace values on condition

A

df.replace(old,new)
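df.replace handles direct value-for-value swaps; for a replacement that depends on a condition, the usual pattern is when/otherwise. A sketch with hypothetical column and value names:

from pyspark.sql.functions import when, col

# replace values in col1 with "new_value" only where the condition on col2 holds
df2 = df.withColumn(
    "col1",
    when(col("col2") == "some_condition", "new_value").otherwise(col("col1"))
)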

7
Q

sparkdf select

A

sparkdf.select("col name")

8
Q

select/drop a column by index

A

df = spark.createDataFrame([[1,2], [3,4]], ['a', 'b'])

n=1
df.select(df.columns[n]).show()
+---+
|  b|
+---+
|  2|
|  4|
+---+
df.drop(df.columns[n]).show()
+---+
|  a|
+---+
|  1|
|  3|
+---+

df.select(df.columns[:n] + df.columns[n+1:]).show()

+---+
|  a|
+---+
|  1|
|  3|
+---+
9
Q

import pyspark types to construct schema

A

from pyspark.sql.types import *

10
Q

run sql query

A

spark.sql("select * from table limit 100")

11
Q

connect to an external database using pyodbc

A

import pyodbc

Driver = '{ODBC Driver 17 for SQL Server}'
Server = sql_server
Database = sql_db

cnxn = pyodbc.connect('DRIVER=' + Driver + ';SERVER=' + Server.split(":")[0] +
                      ';DATABASE=' + Database + ';UID=' + sql_user + ';PWD=' + sql_pass)

cursor = cnxn.cursor()

cursor.execute("update Experiment set experimentStatus = 'Completed' where experimentId = {}".format(experiment_id))
cnxn.commit()
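The string-formatted UPDATE above is open to SQL injection; a hedged alternative using pyodbc's ? parameter markers (same table and columns assumed):

cursor.execute(
    "update Experiment set experimentStatus = ? where experimentId = ?",
    ('Completed', experiment_id)
)
cnxn.commit()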

12
Q

get secrets in databricks

A

dbutils.secrets.get('nlpexplorer-secret-scope', 'sql-username')

First create the key:value pair in Key Vault.

13
Q

create secrets in databricks

A

https://docs.microsoft.com/en-us/azure/databricks/security/secrets/secret-scopes

14
Q

databricks jdbc url

A

Obtain it from the database (for Azure SQL, the JDBC connection string is shown under the database's Connection strings blade in the portal).
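For reference, a typical Azure SQL Database JDBC url has this shape (all values are placeholders):

jdbc_url = (
    "jdbc:sqlserver://<server>.database.windows.net:1433;"
    "database=<database>;user=<user>@<server>;password=<password>;"
    "encrypt=true;trustServerCertificate=false;loginTimeout=30;"
)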

15
Q

curl commands for installing the sql driver

A

%sh
curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
curl https://packages.microsoft.com/config/ubuntu/16.04/prod.list > /etc/apt/sources.list.d/mssql-release.list
sudo apt-get update
sudo ACCEPT_EULA=Y apt-get -q -y install msodbcsql17

16
Q

change datatype in pyspark

A

# needs col from pyspark.sql.functions and StringType from pyspark.sql.types
df2 = df.withColumn("age", col("age").cast(StringType()))

https://sparkbyexamples.com/pyspark/pyspark-cast-column-type/

17
Q

change datatype in pyspark.

with withColumn
with select
with selectExpr
with spark sql

A

# Convert String to Integer Type

df.withColumn("salary", col("salary").cast(IntegerType()))
df.withColumn("salary", col("salary").cast("int"))
df.withColumn("salary", col("salary").cast("integer"))

# Using select
df.select(col("salary").cast("int").alias("salary"))

# Using selectExpr()
df.selectExpr("cast(salary as int) salary", "isGraduated")
df.selectExpr("INT(salary)", "isGraduated")

# Using spark.sql()
spark.sql("SELECT INT(salary), BOOLEAN(isGraduated), gender from CastExample")
spark.sql("SELECT cast(salary as int) salary, BOOLEAN(isGraduated), gender from CastExample")

18
Q

databricks dbutils to get parameters

A

dbutils.widgets.text("ExperimentId", "DefaultName")

experiment_id = dbutils.widgets.get("ExperimentId")

19
Q

call db notebook from another notebook

A

dbutils.notebook.run("notebook name", timeout, {})   # timeout in seconds; {} is the parameters dict

20
Q

send results out of databricks after run

A

dbutils.notebook.exit("value")

import json
dbutils.notebook.exit(json.dumps({key: value}))

21
Q

add arbitrary column to spark

A

from pyspark.sql.functions import lit

df_with_x4 = df.withColumn("x4", lit(0))
df_with_x4.show()

22
Q

get no of rows

A

sparkdf.count()

23
Q

get column

A

sparkdf.columns            # list of column names
sparkdf.colname            # or sparkdf["colname"] to reference a single column

24
Q

make a databricks notebook cell run sql commands

A

%sql

sql query

25
Q
#http requests
Using job id
A
  1. create a job in databricks using Jobs
  2. copy the job id
  3. in Postman create a POST request
  4. add the authorisation token (get it from User Settings)
  5. add the API request:
     https://adb-3184120056443208.8.azuredatabricks.net/api/2.0/jobs/run-now
  6. set the body to JSON
  7. add the JSON payload (a Python requests equivalent is sketched after the JSON below)
{
"job_id": 5670,
"notebook_params": {
    "lob": "4",
    "dataSource": "CAP",
    "customerGeos": "107,1",
    "textFields": "9",
    "channel": "13",
    "cssProducts": "16",
    "issueCodes": "37056"}
}
(run-now takes the notebook from the job definition, so no notebook_task block is needed here)
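The same run-now call from Python instead of Postman, a minimal sketch; the token is a placeholder and the notebook_params are trimmed:

import requests

host = "https://adb-3184120056443208.8.azuredatabricks.net"
token = "<personal-access-token>"   # placeholder; generate one under User Settings

payload = {
    "job_id": 5670,
    "notebook_params": {"lob": "4", "dataSource": "CAP"},   # trimmed for brevity
}

resp = requests.post(
    f"{host}/api/2.0/jobs/run-now",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
print(resp.json())   # includes the run_id of the triggered run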
26
Q
#http requests
Using run id
A

https://adb-3184120056443208.8.azuredatabricks.net/api/2.0/jobs/runs/submit

{
"existing_cluster_id": "0510-115652-stop848",
"notebook_task": {
    "notebook_path": "/Nlp_explorer/Casecount",
    "base_parameters": {
        "lob": "4",
        "dataSource": "CAP",
        "customerGeos": "107,1",
        "textFields": "9",
        "channel": "13",
        "cssProducts": "16",
        "issueCodes": "37056"}
    }
}
(for runs/submit, notebook parameters go inside notebook_task as base_parameters rather than a top-level notebook_params)
27
Q

get a run's status by run id

A

https://adb-3184120056443208.8.azuredatabricks.net/api/2.0/jobs/runs/get

{
"run_id": 5670
}
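A matching Python sketch (host and token placeholders as before):

import requests

host = "https://adb-3184120056443208.8.azuredatabricks.net"
token = "<personal-access-token>"   # placeholder

resp = requests.get(
    f"{host}/api/2.0/jobs/runs/get",
    headers={"Authorization": f"Bearer {token}"},
    params={"run_id": 5670},
)
print(resp.json()["state"])   # life_cycle_state / result_state of the run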

28
Q

create empty dataframe

A

spark.createDataFrame([], "col1 string, col2 string")   # an explicit schema is required: types can't be inferred from empty data

29
Q

how to check if df is empty

A

df.rdd.isEmpty()

30
Q

read csv with header

A

A Medium article covers read and write in more detail, including JSON.

spark.read.format("csv").option("header", "true").load(filepath)
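Following the JSON note above, a brief read/write sketch with placeholder paths:

# read JSON (one object per line by default)
json_df = spark.read.json("/path/to/input.json")

# write JSON, overwriting any previous output
json_df.write.mode("overwrite").json("/path/to/output_json")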

31
Q

overwrite table without changing schema

A

Probably the JDBC truncate option: with mode("overwrite"), .option("truncate", "true") truncates the existing table instead of dropping and recreating it, so the table's schema is preserved.
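A sketch of that option, reusing the hypothetical jdbc_url and table name from earlier cards:

df.write.mode("overwrite") \
    .option("truncate", "true") \
    .jdbc(url=jdbc_url, table="tablename")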

32
Q

storing the password or key in Azure Key Vault

A

Store the password or key in Azure Key Vault as a secret instead of hard-coding it in a notebook (e.g., in a SQL query).

dbutils.secrets.get(scope = "bitools_secrets", key = "blobkey")