14) Data Engineering:

  • SQL:

    • Cheatsheet: http://files.zeroturnaround.com/pdf/zt_sql_cheat_sheet.pdf
    • Use HAVING to filter on aggregate values; WHERE can't do that because it runs before aggregation.
      • WHERE comes before GROUP BY; HAVING after.
      • HAVING requires aggregation, like "HAVING sum(population) > 256000".
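A runnable sketch of the WHERE-before-grouping vs. HAVING-after-grouping distinction, using Python's sqlite3 with an invented city/population table (names and numbers are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (state TEXT, city TEXT, population INTEGER)")
conn.executemany("INSERT INTO cities VALUES (?, ?, ?)", [
    ("CA", "Los Angeles", 4000000),
    ("CA", "San Diego", 1400000),
    ("VT", "Burlington", 45000),
])

# WHERE filters individual rows before grouping; HAVING filters the groups
# themselves on aggregate values -- WHERE cannot see SUM(population).
rows = conn.execute("""
    SELECT state, SUM(population) AS pop
    FROM cities
    WHERE population > 10000          -- row-level filter
    GROUP BY state
    HAVING SUM(population) > 256000   -- aggregate-level filter
""").fetchall()
print(rows)  # [('CA', 5400000)]
```

VT survives the WHERE but its group total fails the HAVING, so only CA comes back.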
    • "SELECT * FROM a LEFT OUTER JOIN b ON a.id = b.a_id WHERE b IS NULL;" gets you the records that are only in table a; the null part excludes the right-hand table's records / matching records are excluded
      • https://blog.codinghorror.com/a-visual-explanation-of-sql-joins/
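The anti-join pattern above, runnable via sqlite3 on two toy tables (a and b are illustrative stand-ins):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER)")
conn.execute("CREATE TABLE b (a_id INTEGER)")
conn.executemany("INSERT INTO a VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO b VALUES (?)", [(2,)])

# The LEFT JOIN keeps every row of a; rows with no match in b get NULLs for
# b's columns, so filtering on b.a_id IS NULL leaves only the unmatched rows.
rows = conn.execute("""
    SELECT a.id
    FROM a LEFT OUTER JOIN b ON a.id = b.a_id
    WHERE b.a_id IS NULL
    ORDER BY a.id
""").fetchall()
print(rows)  # [(1,), (3,)]
```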
    • Order of clauses:
      • SELECT
      • FROM
      • WHERE
      • GROUP BY
      • HAVING
      • ORDER BY
    • When using a subquery with conditional logic, ensure the subquery is either scalar (returns just 1 value) or the operator is IN:
      • SELECT *
      • FROM tutorial.sf_crime_incidents_2014_01
      • WHERE Date = (SELECT MIN(date)
        • FROM tutorial.sf_crime_incidents_2014_01
        • )
      • or
      • SELECT *
      • FROM tutorial.sf_crime_incidents_2014_01
      • WHERE Date IN (SELECT date
        • FROM tutorial.sf_crime_incidents_2014_01
        • ORDER BY date
        • LIMIT 5
        • )
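Both subquery forms above, runnable in sqlite3; the incident rows are invented stand-ins for the tutorial.sf_crime_incidents_2014_01 table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE incidents (id INTEGER, date TEXT)")
conn.executemany("INSERT INTO incidents VALUES (?, ?)",
                 [(1, "2014-01-01"), (2, "2014-01-01"), (3, "2014-01-05")])

# Scalar subquery: "=" works because MIN() returns exactly one value.
earliest = conn.execute("""
    SELECT id FROM incidents
    WHERE date = (SELECT MIN(date) FROM incidents)
    ORDER BY id
""").fetchall()
print(earliest)  # [(1,), (2,)]

# Multi-row subquery: must use IN, since it can return several values.
in_first_two_dates = conn.execute("""
    SELECT id FROM incidents
    WHERE date IN (SELECT DISTINCT date FROM incidents ORDER BY date LIMIT 2)
    ORDER BY id
""").fetchall()
print(in_first_two_dates)  # [(1,), (2,), (3,)]
```

(Note: some engines, e.g. older MySQL, reject LIMIT inside an IN subquery; SQLite accepts it.)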
    • Case details:
      • Useful to "bucket" values in a column (e.g., output 1 when the value is X, 0 otherwise).
      • The CASE expression typically goes in the SELECT clause (strictly, it's an expression and can appear anywhere one is allowed)
      • CASE must include the following components: WHEN, THEN, and END. ELSE is an optional component.
      • You can make any conditional statement using any conditional operator (like WHERE) between WHEN and THEN. This includes stringing together multiple conditional statements using AND and OR.
      • You can include multiple WHEN statements, as well as an ELSE statement to deal with any unaddressed conditions.
      • Example:
        • SELECT player_name,
        • CASE WHEN year = 'FR' AND position = 'WR' THEN 'frosh_wr'
        • WHEN year = 'SO' AND position = 'WR' THEN 'soph_wr'
        • ELSE NULL END AS sample_case_statement
        • FROM benn.college_football_players
      • With aggregate functions:
        • But what if you also wanted to count a couple of other conditions? Using the WHERE clause only allows you to count one condition. Here's an example of counting multiple conditions in one query:
        • SELECT CASE WHEN year = 'FR' THEN 'FR'
          • WHEN year = 'SO' THEN 'SO'
          • WHEN year = 'JR' THEN 'JR'
          • WHEN year = 'SR' THEN 'SR'
          • ELSE 'No Year Data' END AS year_group,
          • COUNT(1) AS count
        • FROM benn.college_football_players
        • GROUP BY 1
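The CASE-bucketing-plus-GROUP BY pattern above, runnable in sqlite3; the roster rows are invented, not the benn.college_football_players table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (player_name TEXT, year TEXT)")
conn.executemany("INSERT INTO players VALUES (?, ?)", [
    ("a", "FR"), ("b", "FR"), ("c", "SO"), ("d", None),
])

# CASE buckets each row into a label; GROUP BY 1 then counts per bucket,
# with ELSE catching the unaddressed (NULL year) rows.
rows = conn.execute("""
    SELECT CASE WHEN year = 'FR' THEN 'FR'
                WHEN year = 'SO' THEN 'SO'
                ELSE 'No Year Data' END AS year_group,
           COUNT(1) AS count
    FROM players
    GROUP BY 1
    ORDER BY 1
""").fetchall()
print(rows)  # [('FR', 2), ('No Year Data', 1), ('SO', 1)]
```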
    • Window functions:
      • https://www.compose.com/articles/metrics-maven-window-functions-in-postgresql/
      • Provide the ability to perform computations across sets of rows that are related to the current query row, like GROUP BY, but without having to collapse the set of rows into a single output row and with the ability to look at arbitrary overlapping windows of rows (vs. exclusive "groups").
      • E.g., this gives us the total national population alongside each state's population:
        • SELECT name AS state_name,
          • popestimate2015 AS state_population,
          • SUM(popestimate2015) OVER() AS national_population
            no partition by so this is over all the data
        • FROM population
        • WHERE state > 0 -- only state-level rows
        • ORDER BY name
      • Without window function, would have to either get the national population by itself (without the state-level information) or use a subquery:
        • SELECT name AS state_name,
          • popestimate2015 AS state_population,
          • x.national_population
        • FROM population,
        • (
          • SELECT SUM(popestimate2015) AS national_population
          • FROM population
          • WHERE state > 0 -- only state-level rows
        • ) x
        • WHERE state > 0 -- only state-level rows
        • ORDER BY name
      • Can apply conditions with PARTITION BY (a la GROUP BY) and ORDER BY:
        • PARTITION BY specifies the subset of rows considered (in the window).
        • ORDER BY orders the rows in the specified window frame, useful for RANK() and running sum-type information.
          • ORDER BY can act as an implicit partition:
            • Query:
              • SELECT start_time, duration_seconds,
              • SUM(duration_seconds) OVER (ORDER BY start_time) AS running_total
              • FROM tutorial.dc_bikeshare_q1_2012
            • Results:
              • start_time | duration_seconds | running_total
              • 00:04:00   |              475 |           475
              • 00:10:00   |             1162 |          2782
              • 00:10:00   |             1145 |          2782
              • 00:15:00   |              485 |          3738
              • 00:15:00   |              471 |          3738
              • 00:17:00   |              358 |          4096
              • 00:18:00   |             1754 |          5850
          • Also note that the frame runs "up to" the current row in terms of the ORDER BY column; that is, each sum is the current row plus all previous rows (hence running total). Because the default frame is RANGE, peer rows with the same ORDER BY value share the same running total (see the repeated 2782 and 3738 above).
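The running-total behavior, including the tie case, runnable in sqlite3 (window functions need SQLite >= 3.25; the ride rows are a made-up subset):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (start_time TEXT, duration_seconds INTEGER)")
conn.executemany("INSERT INTO rides VALUES (?, ?)", [
    ("00:04:00", 475), ("00:10:00", 1162), ("00:10:00", 1145),
])

# ORDER BY alone implies a RANGE frame from the start of the data to the
# current row, so the two 00:10:00 peer rows get the same running total.
rows = conn.execute("""
    SELECT start_time,
           SUM(duration_seconds) OVER (ORDER BY start_time) AS running_total
    FROM rides
    ORDER BY start_time
""").fetchall()
print(rows)  # [('00:04:00', 475), ('00:10:00', 2782), ('00:10:00', 2782)]
```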
      • Evaluated after the JOIN, GROUP BY, and HAVING clauses, at the same time as the other SELECT expressions, which is why window functions can't refer to other fields in the SELECT list.
      • Running average of daily active users example:
        • Daily active users (DAU):
          • select date(created_at), count(distinct user_id)
          • from activity
          • group by 1
        • Lots of noise, so let's do a daily running average:
          • select
            • date(created_at) d,
            • avg(count(distinct user_id)) over (
              • order by date(created_at) -- most engines won't accept the select alias d here
              • rows between 7 preceding and current row -- an 8-row trailing window
            • )
          • from activity
          • group by 1
    • Specific unique window functions:
      • RANK vs DENSE_RANK vs ROW_NUMBER:
        • | V | ROW_NUMBER | RANK | DENSE_RANK |
        • |---|------------|------|------------|
        • | a |          1 |    1 |          1 |
        • | a |          2 |    1 |          1 |
        • | a |          3 |    1 |          1 |
        • | b |          4 |    4 |          2 |
        • | c |          5 |    5 |          3 |
        • | c |          6 |    5 |          3 |
        • | d |          7 |    7 |          4 |
        • | e |          8 |    8 |          5 |
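The comparison table above reproduced in sqlite3 on a toy column (window functions need SQLite >= 3.25):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v TEXT)")
conn.executemany("INSERT INTO t VALUES (?)",
                 [("a",), ("a",), ("a",), ("b",), ("c",), ("c",), ("d",), ("e",)])

# ROW_NUMBER never repeats; RANK repeats for ties and then skips;
# DENSE_RANK repeats for ties without skipping.
rows = conn.execute("""
    SELECT v,
           ROW_NUMBER() OVER (ORDER BY v) AS row_num,
           RANK()       OVER (ORDER BY v) AS rnk,
           DENSE_RANK() OVER (ORDER BY v) AS drnk
    FROM t
""").fetchall()
for r in sorted(rows):
    print(r)
```

Sorting the output makes the tie rows deterministic; 'b' lands at ROW_NUMBER 4, RANK 4, DENSE_RANK 2, matching the table.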
      • NTILE:
        • You can use window functions to identify what percentile (or quartile, or any other subdivision) a given row falls into. The syntax is NTILE(# of buckets). ORDER BY determines the column used to assign the 'tiles.
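A minimal NTILE sketch in sqlite3 (needs SQLite >= 3.25), splitting eight invented values into quartiles:

```python
import sqlite3

# NTILE(4) over 8 rows puts 2 rows in each quartile, ordered by x.
rows = sqlite3.connect(":memory:").execute("""
    WITH vals(x) AS (VALUES (10), (20), (30), (40), (50), (60), (70), (80))
    SELECT x, NTILE(4) OVER (ORDER BY x) AS quartile
    FROM vals
    ORDER BY x
""").fetchall()
print(rows)
# [(10, 1), (20, 1), (30, 2), (40, 2), (50, 3), (60, 3), (70, 4), (80, 4)]
```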
      • LAG / LEAD:
        • It can often be useful to compare rows to preceding or following rows, especially if you’ve got the data in an order that makes sense. You can use LAG or LEAD to create columns that pull values from other rows—all you need to do is enter which column to pull from and how many rows away you’d like to do the pull. LAG pulls from previous rows and LEAD pulls from following rows:
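A small LAG/LEAD sketch in sqlite3 (needs SQLite >= 3.25); the daily-user numbers are invented:

```python
import sqlite3

# LAG pulls the value from 1 row back, LEAD from 1 row ahead, in ORDER BY d
# order; the edges have no neighbor, so they come back NULL.
rows = sqlite3.connect(":memory:").execute("""
    WITH daily(d, users) AS (VALUES (1, 10), (2, 12), (3, 9))
    SELECT d, users,
           LAG(users, 1)  OVER (ORDER BY d) AS prev_users,
           LEAD(users, 1) OVER (ORDER BY d) AS next_users
    FROM daily
    ORDER BY d
""").fetchall()
print(rows)
# [(1, 10, None, 12), (2, 12, 10, 9), (3, 9, 12, None)]
```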
    • Using count and case:
      • SELECT
        • COUNT(CASE WHEN rsp_ind = 0 then 1 ELSE NULL END) as "New",
        • COUNT(CASE WHEN rsp_ind = 1 then 1 ELSE NULL END) as "Accepted"
      • from tb_a
    • Unions vs. joins:
      • SQL joins allow you to combine two datasets side-by-side, but UNION allows you to stack one dataset on top of the other. Put differently, UNION allows you to write two separate SELECT statements, and to have the results of one statement display in the same table as the results from the other statement.
      • SELECT *
      • FROM tutorial.crunchbase_investments_part1
        • UNION
      • SELECT *
      • FROM tutorial.crunchbase_investments_part2
    • If every record in A necessarily includes information in B, A should have a foreign key that references a primary key on B to maintain referential integrity. E.g., if a row in A is inserted but doesn't have the B-related columns (the FK), it should be rejected.
      • There can be multiple unique keys, but just one primary key. Unique keys are nullable, primary keys aren't.
    • In a 1:many relationship, the foreign key is the "anchor" on the many side (the primary key that it references is the "anchor" on the 1 side).
      1:many = PK:FK (multiple records on FK side have PK values).
      • 1:1 relationships are rare and should usually just merge the tables together. But if not, could do a primary:unique foreign key setup.
        1:1 = PK:FK (one record on FK side has PK values).
      • many:many relationships should call for a 3rd (junction) table containing the primary keys of the two tables.
        With junction table, 1:many and many:1.
    • Normalized (no dupes) schema is better for writes; denormalized better for reads.
    • LIMIT:
      • SELECT * FROM tbl LIMIT 5,10;
      • returns rows 6-15
        The offset (the first number) is 0-indexed.
    • Get second highest value:
      • SELECT MAX( col )
      • FROM table
      • WHERE col < ( SELECT MAX( col ) FROM table );
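The second-highest pattern, runnable in sqlite3 on invented values; note that duplicates of the max are skipped, so this returns the second highest distinct value:

```python
import sqlite3

# Rows below the overall max are 10 and 30; MAX of those is 30,
# even though 40 appears twice.
rows = sqlite3.connect(":memory:").execute("""
    WITH t(col) AS (VALUES (10), (40), (40), (30))
    SELECT MAX(col) FROM t
    WHERE col < (SELECT MAX(col) FROM t)
""").fetchall()
print(rows)  # [(30,)]
```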
    • Delete duplicate rows:
      Where cols 1-3 unique the table.
      • DELETE FROM MyTable
      • WHERE RowId NOT IN
      • (SELECT MIN(RowId) FROM MyTable GROUP BY Col1, Col2, Col3);
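The dedup-by-MIN(RowId) pattern, runnable in sqlite3; SQLite's implicit rowid stands in for the RowId column above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (c1, c2, c3)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)",
                 [(1, 1, 1), (1, 1, 1), (2, 2, 2)])

# Keep the lowest rowid per (c1, c2, c3) group; delete everything else.
conn.execute("""
    DELETE FROM t
    WHERE rowid NOT IN (SELECT MIN(rowid) FROM t GROUP BY c1, c2, c3)
""")
print(conn.execute("SELECT COUNT(*) FROM t").fetchone())  # (2,)
```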
    • Order of operations:
      • FROM clause
      • WHERE clause
        This is why you can't reference a SELECT alias in the WHERE clause.
      • GROUP BY clause
      • HAVING clause
      • SELECT clause
      • ORDER BY clause
        You can reference a SELECT alias in the ORDER BY.
    • In JOIN, can filter one table but not the other: https://community.modeanalytics.com/sql/tutorial/sql-joins-where-vs-on/
      • Putting the condition in the WHERE clause would filter it out completely. Putting the condition in the JOIN just removes it from the specified table.
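The ON-clause vs. WHERE-clause difference, runnable in sqlite3 on two toy tables (company/investment names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE c (name TEXT)")                 # companies
conn.execute("CREATE TABLE i (company TEXT, amount INT)")  # investments
conn.executemany("INSERT INTO c VALUES (?)", [("acme",), ("globex",)])
conn.executemany("INSERT INTO i VALUES (?, ?)", [("acme", 100), ("globex", 5)])

# Condition in ON: every company survives; non-matching rows just get NULLs.
on_rows = conn.execute("""
    SELECT c.name, i.amount FROM c
    LEFT JOIN i ON i.company = c.name AND i.amount > 50
    ORDER BY c.name
""").fetchall()
print(on_rows)  # [('acme', 100), ('globex', None)]

# Condition in WHERE: applied after the join, so 'globex' is dropped entirely.
where_rows = conn.execute("""
    SELECT c.name, i.amount FROM c
    LEFT JOIN i ON i.company = c.name
    WHERE i.amount > 50
    ORDER BY c.name
""").fetchall()
print(where_rows)  # [('acme', 100)]
```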
    • VARCHAR vs. CHAR:
      • VARCHAR is variable-length.
      • CHAR is fixed length.
      • If your content is a fixed size, you'll get better performance with CHAR.
    • MySQL single quotes vs. backticks vs. double quotes:
      • http://stackoverflow.com/questions/11321491/when-to-use-single-quotes-double-quotes-and-backticks-in-mysql
    • SQL Cookbook:
      • emp:
        • create table emp (
          • empno int unsigned,
          • ename varchar(128),
          • job varchar(128),
          • mgr int unsigned,
          • hiredate date,
          • sal int unsigned,
          • comm int unsigned,
          • deptno int unsigned
        • );
        • insert into emp (empno, ename, job, mgr, hiredate, sal, comm, deptno) values
        • (7369, 'smith', 'clerk', 7902, '1980-12-17', 800, null, 20),
        • (7499, 'allen', 'salesman', 7698, '1981-02-20', 1600, 300, 30),
        • (7521, 'ward', 'salesman', 7698, '1981-02-22', 1250, 500, 30);
  • MapReduce:
    • purpose:
      • scalable, parallelizable computations on big data
      • abstracts away looping
      • simple to write programs, just need to supply mapper and reducer functions
    • algorithm:
      • take a bunch (a list) of data items
      • system/master program splits list and gives a sublist to each (mapper) node
      • mapper takes its list and applies some map function to it that transforms the list of items into a list of key, value pairs
      • list of (k,v) pairs output to some buffer
      • system takes the lists of (k,v) pairs and filters them such that it can give each reducer node a list where all the keys are the same
      • reducer aggregates values (since system ensured they all share the same key)
    • example: the canonical word count, where mappers emit a (word, 1) pair per word and reducers sum the counts for each word
    • operates on a file in HDFS in a similar way: in HDFS a file is split into chunks (blocks), so many mapper nodes can work on different blocks of the same file, and then reducer nodes can do the same; scales essentially linearly
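The map, shuffle, and reduce steps of the algorithm above can be simulated in a few lines of plain Python with the canonical word count; a real framework would distribute the sublists across nodes instead of looping locally:

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # transform an input record into a list of (key, value) pairs
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # group all values by key, so each reducer sees a single key's values
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reducer(key, values):
    # aggregate the values that share a key
    return key, sum(values)

lines = ["the quick fox", "the lazy dog", "the fox"]
pairs = chain.from_iterable(mapper(l) for l in lines)
counts = dict(reducer(k, vs) for k, vs in shuffle(pairs).items())
print(counts["the"], counts["fox"])  # 3 2
```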
    • Tez is Apache's MapReduce successor:
      • better performance
        • intermediate data doesn't need to be written to HDFS (can be cached in-memory)
        • YARN containers can be reused
      • simpler "API"
        • computations expressed as workflow DAG vs multiple MR jobs
        • supports Hive, Pig, etc.

  • Hive:

    • Performance tips:
      https://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/
      • use the Tez execution engine (vs Map-Reduce)
        • set hive.execution.engine=tez;
      • use ORCFile storage format (better than Parquet or RCFile)
        https://www.dropbox.com/s/irhhjl6ivgiak0z/Screenshot%202017-04-22%2017.52.50.png?dl=0
        • create table driver_mileage
        • stored as orc
        • as
        • select driverid, sum(miles) as total_miles
        • from truck_mileage
        • group by driverid;
      • use vectorization (performs operations in batches of 1024 rows)
        • set hive.vectorized.execution.enabled = true;
        • set hive.vectorized.execution.reduce.enabled = true;
      • use cost-based optimization (CBO)
        • Meta:
          • Hive optimizes each query’s logical and physical execution plan before submitting for final execution. These optimizations can include the cost of the query if CBO is enabled.
        • set hive.cbo.enable=true;
        • set hive.compute.query.using.stats=true;
        • set hive.stats.fetch.column.stats=true;
        • set hive.stats.fetch.partition.stats=true;
        • then:
        • analyze table tweets compute statistics for columns;
      • avoid joins where possible
        • Consider a click-stream event table:
          • CREATE TABLE clicks (
          • timestamp date, sessionID string, url string, source_ip string
          • ) STORED as ORC tblproperties ("orc.compress" = "SNAPPY");
        • Each record represents a click event, and we would like to find the latest URL for each sessionID.
        • Could use a join:
          • SELECT clicks.* FROM clicks inner join
          • (select sessionID, max(timestamp) as max_ts from clicks
          • group by sessionID) latest
          • ON clicks.sessionID = latest.sessionID and
          • clicks.timestamp = latest.max_ts;
        • In the above query, we build a sub-query to collect the timestamp of the latest event in each session, and then use an inner join to filter out the rest.
        • Can use over and rank to rewrite more efficiently as:
          • SELECT * FROM
          • (SELECT *, RANK() over (partition by sessionID
          • order by timestamp desc) as rank
          • FROM clicks) ranked_clicks
          • WHERE ranked_clicks.rank=1;
    • In Hive, columns can store arrays or maps. We can use what are called table-generating functions to "unpack" these multivalued columns.
      • Example built-in table-generating functions include explode and stack: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inTable-GeneratingFunctions(UDTF)
      • Often used in conjunction with lateral view as a type of join:
        • "Lateral view is used in conjunction with user-defined table generating functions such as explode(). As mentioned in Built-in Table-Generating Functions, a UDTF generates zero or more output rows for each input row. A lateral view first applies the UDTF to each row of base table and then joins resulting output rows to the input rows to form a virtual table having the supplied table alias."
          https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView
        • Lateral view is basically a lateral.. table view, made by expanding / unpacking a column or columns into multiple rows.
      • http://www.ericlin.me/how-to-use-hive-lateral-view-in-your-query
        • Imagine we start out with an answer table like:
          • qId cId vId
          • 1   1   v1
          • 1   2   v1
          • 1   2   v2
          • 1   2   v3
        • We can combine the vId values into an array per (qId, cId) like so:
          • qId cId vIds
          • 1   1   ["v1"]
          • 1   2   ["v1", "v2", "v3"]
        • And then selectively unpack (with a "where" filter):
          • SELECT qId, cId, vId
          • FROM answer
          • LATERAL VIEW explode(vIds) visitor AS vId
          • WHERE cId = 2
        • “vIds” is the field we're exploding, “visitor” is the LATERAL VIEW TABLE alias and “vId” is the new column alias so that it can be selected in the query.
        • To get:
          • qId cId vId
          • 1   2   v1
          • 1   2   v2
          • 1   2   v3

  • Spark:

    • Spark is in-memory and more functional in style: it uses RDDs, which distinguish between transformations and actions (enabling lazy evaluation), and the lineage recorded for lazy evaluation gives it playback/recomputation abilities for fault tolerance.
    • In-memory as in it doesn't write intermediate results to disk/HDFS like map-reduce does.

  • Dimensional Modeling:

    • Facts vs. dimensions:
      • Facts are (almost always) numeric measurements / low-level metrics we use for calculation purposes
        • Represent a granular measurement event
      • Dimensions are the (usually non-numeric) attributes we use to filter or group by
        • The "by" nouns, e.g., "revenue by product", so product is a dimension
      • Facts = business process metrics; dimensions = descriptive attributes
      • A fact table has foreign keys to dimension tables
    • In a star schema, key assumption is that the fact::dimension relationship is many::1, e.g., many transactions can have a shared product dimension
      • If many::many, then aggregations will be wonky (think about the cardinalities in the join: 1 * N = N, but M * N = MN)
    • Generally, need to represent many::many relationships as two many::1 and 1::many relationships using a bridge/junction table:
      https://www.dropbox.com/s/kra73kh4l9tm468/Screenshot%202017-03-04%2016.24.31.png?dl=0
      • "Think about a simple relationship like the one between Authors and Books. An author can write many books. A book could have many authors. Now, without a bridge table to resolve the many-to-many relationship, what would the alternative be? You'd have to add multiple Author_ID columns to the Books table, one for each author. But how many do you add? 2? 3? 10? However many you choose, you'll probably end up with a lot of sparse rows where many of the Author_ID values are NULL and there's a good chance that you'll run across a case where you need "just one more." So then you're either constantly modifying the schema to try to accommodate or you're imposing some artificial restriction ("no book can have more than 3 authors") to force things to fit."
      • In other words, with a many::many relationship, there's no meaningful parent-child relationship, no way to place foreign keys that makes sense. So with a bridge/junction/linking table, each record of the bridge table represents each unique combination of, say, author and book. There can be many authors and many books, but each combination only shows up once in the bridge.
        • select *
        • from author a join author_book ab on a.id = ab.author_id
          • join book b on b.id = ab.book_id;
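The author/book bridge in miniature, runnable in sqlite3; the schema follows the query above, and the sample rows are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT);
    -- each bridge row is one unique (author, book) combination
    CREATE TABLE author_book (author_id INTEGER REFERENCES author(id),
                              book_id   INTEGER REFERENCES book(id),
                              PRIMARY KEY (author_id, book_id));
    INSERT INTO author VALUES (1, 'Kimball'), (2, 'Ross');
    INSERT INTO book VALUES (1, 'The Data Warehouse Toolkit');
    INSERT INTO author_book VALUES (1, 1), (2, 1);  -- two authors, one book
""")
rows = conn.execute("""
    SELECT a.name, b.title
    FROM author a
    JOIN author_book ab ON a.id = ab.author_id
    JOIN book b ON b.id = ab.book_id
    ORDER BY a.name
""").fetchall()
print(rows)
# [('Kimball', 'The Data Warehouse Toolkit'), ('Ross', 'The Data Warehouse Toolkit')]
```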
    • Database modeling notation:
      • https://www.dropbox.com/s/dsld9gs3frocxxi/Screenshot%202017-03-04%2016.34.02.png?dl=0
      • https://www.dropbox.com/s/24l3nwsogzxgxhm/Screenshot%202017-03-04%2016.33.27.png?dl=0
        https://www.smartdraw.com/entity-relationship-diagram/
    • Snowflake vs. star schema:
      • Star schema has a fact table and then dimension tables directly associated with it through foreign keys. Denormalized for BI query performance. Simpler queries as well.
        Looks like, well, a star: with the dimension tables surrounding a central fact table.
      • Snowflake schema normalizes the dimension tables, e.g., nests categorical / hierarchical data. Less duplication and more referential integrity protection. Also uses less space.
        The dimension tables branching out is reminiscent of a snowflake.
      • http://www.vertabelo.com/blog/technical-articles/data-warehouse-modeling-star-schema-vs-snowflake-schema
        • https://www.dropbox.com/s/l680pyvr8gxn9ue/Screenshot%202017-03-04%2016.55.32.png?dl=0
        • https://www.dropbox.com/s/573602bdme4vex2/Screenshot%202017-03-04%2016.55.57.png?dl=0
    • Dealing with Slowly Changing Dimensions (SCDs):
      https://www.dropbox.com/s/xfvf0556idivn75/Screenshot%202017-03-05%2011.59.38.png?dl=0
      • 5 methods / types of change tracking:
        • Type 0 - The passive method (just keep the original value)
        • Type 1 - Overwriting the old value
        • Type 2 - Creating a new additional record (use date effective columns)
        • Type 3 - Adding a new column
        • Type 4 - Using historical table (or split off the rapidly changing dimension columns into a mini dimension table, the primary key of which will be a FK in the fact table)
        • Type 6 - Combine approaches of types 1,2,3 (overwrite current value columns in historical records, new current row (current value col = historical value col), there's a current flag column and/or effective date columns)
          6 = 1 + 2 + 3
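A Type 2 change sketched in sqlite3: close out the old row with an end date, then insert a new current row. Table and column names (dim_customer, is_current, the 9999-12-31 sentinel) are illustrative conventions, not from the source:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dim_customer (
    customer_id INTEGER, city TEXT,
    effective_date TEXT, end_date TEXT, is_current INTEGER)""")
conn.execute("INSERT INTO dim_customer VALUES "
             "(42, 'Boston', '2015-01-01', '9999-12-31', 1)")

# Customer 42 moves to Chicago on 2017-03-05: expire the old row...
conn.execute("""UPDATE dim_customer
                SET end_date = '2017-03-04', is_current = 0
                WHERE customer_id = 42 AND is_current = 1""")
# ...and add a new current row, preserving full history.
conn.execute("INSERT INTO dim_customer VALUES "
             "(42, 'Chicago', '2017-03-05', '9999-12-31', 1)")

print(conn.execute("""SELECT city FROM dim_customer
                      WHERE customer_id = 42 AND is_current = 1""").fetchone())
# ('Chicago',)
```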
    • Conformed dimensions = master dimensions that can be used across multiple fact tables (meaning is identical, think date dimension)
    • Representing inheritance / subtype relationships (think FB actions: comments, likes, shares (of, say, a news story)):
      • http://stackoverflow.com/questions/3579079/how-can-you-represent-inheritance-in-a-database
      • 3 methods:
      • 1 table = just 1 base / parent table, lots of null values for the attributes that only apply to some subtypes
      • 3 tables (assuming 3 subtypes) = 1 table for each subtype / child table, hard to get the set of all elements across all subtypes
      • 4 tables (assuming 3 subtypes) = 1 base table and 3 derived subtype tables, base table contains common attributes and child tables have an FK and PK that is the base table's PK
        Base table passes down common attributes and PK to subtype child tables.
      • http://www.vertabelo.com/blog/technical-articles/inheritance-in-a-relational-database
      • http://searchsqlserver.techtarget.com/feature/Supertype-and-subtype-tables-in-SQL-Server
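The "base table plus subtype tables" layout (the third method above) in miniature, runnable in sqlite3: an FB-style action base table with comment and like subtypes. All names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE action (id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT);
    -- each subtype's PK is also an FK to the base table's PK
    CREATE TABLE comment (action_id INTEGER PRIMARY KEY REFERENCES action(id),
                          body TEXT);
    CREATE TABLE like_ (action_id INTEGER PRIMARY KEY REFERENCES action(id));
    INSERT INTO action VALUES (1, 7, '2017-01-01'), (2, 8, '2017-01-02');
    INSERT INTO comment VALUES (1, 'nice story');
    INSERT INTO like_ VALUES (2);
""")

# All actions regardless of subtype come from the base table alone; subtype
# details join in via the shared primary key.
rows = conn.execute("""
    SELECT a.id, c.body
    FROM action a LEFT JOIN comment c ON c.action_id = a.id
    ORDER BY a.id
""").fetchall()
print(rows)  # [(1, 'nice story'), (2, None)]
```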
    • Conditional join on column value:
      • http://stackoverflow.com/questions/25255240/sql-conditional-join-column
      • http://www.mysqldiary.com/conditional-joins-in-mysql/
      • basically, do left joins for each case and then combine the results into one column (e.g., via COALESCE) to get rid of the nulls
    • Freeform text comments should arguably be in a dimension table
    • Modeling groups (e.g., FB users and groups of users):
      • just model a many::many relationship using a bridge table like UserGroup
      • then User::UserGroup would be 1::Many (UserGroup FK to User PK) and Group::UserGroup would also be 1::Many (UserGroup FK to Group PK)

  • Clickstream data dimensional modeling:

    • Unique extra dimensions: visitor/user, page, event/action, session, referral/origin
    • https://www.slideshare.net/mobile/alberthui/albert-hui-clickstreamdwturningclicksintocustomers-19174248

  • ETL optimization:

    • Extract operational data from business sources, transform into the required end format (e.g., facts & dimensions), load into a data warehouse or mart.
    • Incremental loading:
      • https://dwbi.org/etl/etl/53-methods-of-incremental-loading-in-data-warehouse