Cse 373 - University of Washington

CSE 344
APRIL 13TH
SEMI-STRUCTURED DATA

ADMINISTRATIVE MINUTIAE

HW3 due Wednesday
  Pull new upstream: correct schema
OQ4 due Wednesday
HW4 out Wednesday (Datalog)
Midterm Exam: Wednesday, May 9th

CLASS OVERVIEW

Unit 1: Intro
Unit 2: Relational Data Models and Query Languages
Unit 3: Non-relational data
  NoSQL
  JSON
  SQL++
Unit 4: RDBMS internals and query optimization
Unit 5: Parallel query processing
Unit 6: DBMS usability, conceptual design
Unit 7: Transactions
Unit 8: Advanced topics (time permitting)

TWO CLASSES OF DATABASE APPLICATIONS

OLTP (Online Transaction Processing)
  Queries are simple lookups: 0 or 1 join
    E.g., find customer by ID and their orders
  Many updates
    E.g., insert order, update payment
  Consistency is critical: transactions (more later)
OLAP (Online Analytical Processing), aka Decision Support
  Queries have many joins and group-bys
    E.g., sum revenues by store, product, clerk, date
  No updates

NOSQL MOTIVATION

Originally motivated by Web 2.0 applications
  E.g., Facebook, Amazon, Instagram, etc.
  Web startups need to scale up from 10 to 100,000 users very quickly
Needed: very large scale OLTP workloads
Give up on consistency
Give up OLAP

WHAT IS THE PROBLEM?

Single-server DBMSs are too small for Web data
Solution: scale out to multiple servers
This is hard for the entire functionality of a DBMS
NoSQL: reduce functionality for easier scale-up
  Simpler data model
  Very restricted updates

RDBMS REVIEW: SERVERLESS

[Diagram: Desktop; User; DBMS Application (SQLite); File; Data file on Disk]

SQLite:
  One data file
  One user
  One DBMS application
Consistency is easy
But only a limited number of scenarios work with such a model

RDBMS REVIEW: CLIENT-SERVER

[Diagram: Client Applications; Connection (JDBC, ODBC); Server Machine running the DB Server; File 1, File 2, File 3]

One server running the database
Many clients, connecting via the ODBC or JDBC (Java Database Connectivity) protocol
Many users and apps
Consistency is harder: transactions

CLIENT-SERVER

One server that runs the DBMS (or RDBMS):
  Your own desktop, or
  Some beefy system, or
  A cloud service (SQL Azure)
Many clients run apps and connect to the DBMS
  Microsoft's Management Studio (for SQL Server), or
  psql (for postgres)
  Some Java program (HW8) or some C++ program

Clients talk to the server using the JDBC/ODBC protocol

WEB APPS: 3 TIER

[Diagram: Browser; HTTP/SSL; App+Web Server (three replicas); Connection (e.g., JDBC); DB Server; File 1, File 2, File 3]

Web-based applications
Replicate the App server for scale-up
Why not replicate the DB server? Consistency!

REPLICATING THE DATABASE

Two basic approaches:
  Scale up through partitioning
  Scale up through replication
Consistency is much harder to enforce

SCALE THROUGH PARTITIONING

Partition the database across many machines in a cluster
  Database now fits in main memory
  Queries spread across these machines
Can increase throughput
Easy for writes but reads become expensive!

[Diagram: three partitions; the application updates one partition and may also update another]

SCALE THROUGH REPLICATION

Create multiple copies of each database partition
Spread queries across these replicas
Can increase throughput and lower latency
Can also improve fault-tolerance
Easy for reads but writes become expensive!

[Diagram: three replicas; App 1 updates one replica only, App 2 updates another replica only]
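
The read/write trade-off on the two slides above can be made concrete by counting how many servers each operation touches. The following is only a rough sketch under stated assumptions: three in-memory dicts stand in for servers, and NUM_SERVERS and every function name are hypothetical, not taken from the lecture.

```python
# Hypothetical sketch: count how many servers each operation touches.
# NUM_SERVERS, the dict-per-server layout, and all function names are
# illustrative assumptions, not part of the lecture.

NUM_SERVERS = 3


def make_servers():
    """One in-memory dict per server."""
    return [dict() for _ in range(NUM_SERVERS)]


# --- Partitioning: each key lives on exactly one server ---

def partitioned_write(servers, key, value):
    """Write goes to the single partition that owns the key."""
    target = hash(key) % NUM_SERVERS
    servers[target][key] = value
    return 1  # servers touched


def partitioned_scan(servers, predicate):
    """A read that is not a lookup by key must visit every partition."""
    matches = []
    for server in servers:
        matches.extend(v for v in server.values() if predicate(v))
    return matches, len(servers)


# --- Replication: every server holds a full copy ---

def replicated_write(servers, key, value):
    """Write must reach every replica to keep the copies identical."""
    for server in servers:
        server[key] = value
    return len(servers)  # servers touched


def replicated_read(servers, key, replica=0):
    """Read can be answered by any single replica."""
    return servers[replica].get(key), 1
```

In this toy model a partitioned write touches one server while a scan touches all three, and a replicated read touches one server while a write touches all three, which is the asymmetry the slides point out.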

RELATIONAL MODEL → NOSQL

Relational DB: difficult to replicate/partition
  Given Supplier(sno, ...), Part(pno, ...), Supply(sno, pno)
  Partition: we may be forced to join across servers
  Replication: local copy has inconsistent versions
  Consistency is hard in both cases (why?)
NoSQL: simplified data model
  Given up on functionality
  Application must now handle joins and consistency
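
To illustrate what "application must now handle joins" can look like, here is a minimal sketch in which the join between Supply and Supplier is done in application code. It assumes the two relations sit on different servers, modeled here as plain Python containers. The sname attribute, the sample rows, and the function name are made up for illustration, since the slide elides the remaining Supplier attributes.

```python
# Hypothetical layout: Supplier rows live on one server, Supply rows on
# another. The containers stand in for remote servers; the sname attribute
# and all sample values are made up for illustration.

supplier_server = {          # sno -> Supplier row
    1: {"sno": 1, "sname": "Acme"},
    2: {"sno": 2, "sname": "Globex"},
}

supply_server = [            # Supply(sno, pno) rows
    {"sno": 1, "pno": 10},
    {"sno": 1, "pno": 11},
    {"sno": 2, "pno": 10},
]


def suppliers_of_part(pno):
    """Join Supply with Supplier in application code.

    Step 1: ask the Supply server for the rows with this pno.
    Step 2: for each row, look up the supplier on the other server.
    """
    matching = [row for row in supply_server if row["pno"] == pno]
    names = []
    for row in matching:
        supplier = supplier_server.get(row["sno"])
        if supplier is not None:  # the other server may not have the row
            names.append(supplier["sname"])
    return sorted(names)
```

Each lookup in step 2 would be a separate network request in a real deployment, and nothing guarantees the two servers reflect the same point in time. That is the join and consistency burden the slide says moves to the application.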

DATA MODELS

Taxonomy based on data models:
  Key-value stores
    e.g., Project Voldemort, Memcached
  Document stores
    e.g., SimpleDB, CouchDB, MongoDB
  Extensible Record Stores
    e.g., HBase, Cassandra, PNUTS

KEY-VALUE STORES FEATURES

Data model: (key, value) pairs
  Key = string/integer, unique for the entire data
  Value = can be anything (very complex object)
Operations
  get(key), put(key, value)
  Operations on value not supported
Distribution / Partitioning with a hash function
  No replication: key k is stored at server h(k)
  3-way replication: key k stored at h1(k), h2(k), h3(k)

How does get(k) work? How does put(k,v) work?
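
One possible answer to the two questions above, as a toy sketch rather than a description of any real system: put(k, v) writes the pair to servers h1(k), h2(k), h3(k), and get(k) asks those same servers in turn. The class name, the choice of SHA-256 with a numeric salt as the hash family, and the default of five servers are all assumptions made for this example.

```python
import hashlib


class ReplicatedKVStore:
    """Toy key-value store: key k is stored at servers h1(k), h2(k), h3(k)."""

    def __init__(self, num_servers=5, replicas=3):
        # Each server is modeled as an in-memory dict.
        self.servers = [dict() for _ in range(num_servers)]
        self.replicas = replicas

    def _h(self, i, key):
        """i-th hash function: salt the key with i, then reduce mod #servers."""
        digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
        return int(digest, 16) % len(self.servers)

    def locations(self, key):
        """Server indexes h1(key), ..., hR(key); two of them may coincide."""
        return [self._h(i, key) for i in range(1, self.replicas + 1)]

    def put(self, key, value):
        """Store the pair on every server chosen by the hash functions."""
        for s in self.locations(key):
            self.servers[s][key] = value

    def get(self, key):
        """Ask the responsible servers in turn; return the first copy found."""
        for s in self.locations(key):
            if key in self.servers[s]:
                return self.servers[s][key]
        return None
```

Setting replicas=1 gives the "no replication" case, where key k lives only at h(k). This sketch updates all copies synchronously; it does not model the stale reads that arise when updates propagate lazily.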

EXAMPLE

Flights(fid, date, carrier, flight_num, origin, dest, ...)
Carriers(cid, name)

How would you represent the Flights data as (key, value) pairs?
  Option 1: key=fid, value=entire flight record
  Option 2: key=date, value=all flights that day
  Option 3: key=(origin,dest), value=all flights between
How does query processing work?

KEY-VALUE STORES INTERNALS

Partitioning:
  Use a hash function h, and store every (key, value) pair on server h(key)
  In class: discuss get(key) and put(key, value)
Replication:
  Store each key on (say) three servers
  On update, propagate change to the other servers; eventual consistency
  Issue: when an app reads one replica, it may be stale
Usually: combine partitioning + replication

MOTIVATION

In key-value stores, the value is often a very complex object
  Key = 2010/7/1, Value = [all flights that date]
Better: allow the DBMS to understand the value
  Represent value as a JSON (or XML...) document
  [all flights on that date] = a JSON file
  May search for all flights on a given date

DOCUMENT STORES FEATURES

Data model: (key, document) pairs
  Key = string/integer, unique for the entire data
  Document = JSON or XML
Operations
  Get/put document by key
  Query language over JSON
Distribution / Partitioning
  Entire documents, as for key/value pairs

We will discuss JSON

EXTENSIBLE RECORD STORES

Based on Google's BigTable
Data model is rows and columns
Scalability by splitting rows and columns over nodes
  Rows partitioned through sharding on primary key
  Columns of a table are distributed over multiple nodes by using column groups
HBase is an open source implementation of BigTable

WHERE WE ARE

So far we have studied the relational data model
  Data is stored in tables (= relations)
  Queries are expressions in SQL, relational algebra, or Datalog
Next week: Semistructured data model
  Popular formats today: XML, JSON, protobuf
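
As a small preview of the semistructured formats listed above, the sketch below stores "all flights on a date" as one JSON document under the date key and then searches inside it, which a plain key-value store could not do. The field names follow the Flights schema from the lecture; the flight values, the dict used as a stand-in store, and the function name are invented for illustration.

```python
import json

# Hypothetical document for key "2010/7/1": all flights on that date.
# Field names follow the Flights schema; every value is made up.
doc_text = """
{
  "date": "2010/7/1",
  "flights": [
    {"fid": 1, "carrier": "AS", "flight_num": 12, "origin": "SEA", "dest": "SFO"},
    {"fid": 2, "carrier": "UA", "flight_num": 345, "origin": "SEA", "dest": "JFK"},
    {"fid": 3, "carrier": "AS", "flight_num": 78, "origin": "PDX", "dest": "SFO"}
  ]
}
"""

# A document store keeps (key, document) pairs; a dict stands in for it here.
store = {"2010/7/1": json.loads(doc_text)}


def flights_to(store, date, dest):
    """Look inside the document stored under `date` and filter by destination."""
    doc = store.get(date)
    if doc is None:
        return []
    return [f["fid"] for f in doc["flights"] if f["dest"] == dest]
```

Because the store can parse the document, a query such as flights_to(store, "2010/7/1", "SFO") filters on a field inside the value instead of returning the whole blob to the application.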
