Mergin two tables in proc sql

The data that you need for a report could be located in more than one table. In order to select the data from the tables, join the tables in a query. Joining tables enables you to select data from multiple tables as if the data were contained in one table. Joins do not alter the original tables.

The most basic type of join is simply two tables that are listed in the FROM clause of a SELECT statement. The following query joins the two tables that are shown in Table One and Table Two and creates Cartesian Product of Table One and Table Two.

proc sql; title 'Table One and Table Two'; select * from one, two;
Table One X Y ------------------ 1 2 2 3
Table Two X Z ------------------ 2 5 3 6 4 9
Table One and Table Two X Y X Z -------------------------------------- 1 2 2 5 1 2 3 6 1 2 4 9 2 3 2 5 2 3 3 6 2 3 4 9

Joining tables in this way returns the Cartesian product of the tables. Each row from the first table is combined with every row from the second table. When you run this query, the following message is written to the SAS log:

NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized.

The Cartesian product of large tables can be huge. Typically, you want a subset of the Cartesian product. You specify the subset by declaring the join type.

There are two types of joins:

An inner join returns only the subset of rows from the first table that matches rows from the second table. You can specify the columns that you want to be compared for matching values in a WHERE clause.

The following code adds a WHERE clause to the previous query. The WHERE clause specifies that only rows whose values in column X of Table One match values in column X of Table Two should appear in the output. Compare this query's output to Cartesian Product of Table One and Table Two.

proc sql; select * from one, two where one.x=two.x;
Table One and Table Two X Y X Z -------------------------------------- 2 3 2 5

The output contains only one row because only one value in column X matches from each table. In an inner join, only the matching rows are selected. Outer joins can return nonmatching rows; they are covered in Outer Joins.

Note that the column names in the WHERE clause are prefixed by their table names. This is known as qualifying the column names, and it is necessary when you specify columns that have the same name from more than one table. Qualifying the column name avoids creating an ambiguous column reference.

A table alias is a temporary, alternate name for a table. You specify table aliases in the FROM clause. Table aliases are used in joins to qualify column names and can make a query easier to read by abbreviating table names.

The following example compares the oil production of countries to their oil reserves by joining the OILPROD and OILRSRVS tables on their Country columns. Because the Country columns are common to both tables, they are qualified with their table aliases. You could also qualify the columns by prefixing the column names with the table names.

Note: The AS keyword is optional.

proc sql outobs=6; title 'Oil Production/Reserves of Countries'; select * from sql.oilprod as p, sql.oilrsrvs as r where p.country = r.country;
Oil Production/Reserves of Countries Barrels Country PerDay Country Barrels --------------------------------------------------------------------------- Algeria 1,400,000 Algeria 9,200,000,000 Canada 2,500,000 Canada 7,000,000,000 China 3,000,000 China 25,000,000,000 Egypt 900,000 Egypt 4,000,000,000 Indonesia 1,500,000 Indonesia 5,000,000,000 Iran 4,000,000 Iran 90,000,000,000

Note that each table's Country column is displayed. Typically, once you have determined that a join is functioning correctly, you include just one of the matching columns in the SELECT clause.

You can order the output of joined tables by one or more columns from either table. The next example's output is ordered in descending order by the BarrelsPerDay column. It is not necessary to qualify BarrelsPerDay, because the column exists only in the OILPROD table.

proc sql outobs=6; title 'Oil Production/Reserves of Countries'; select p.country, barrelsperday 'Production', barrels 'Reserves' from sql.oilprod p, sql.oilrsrvs r where p.country = r.country order by barrelsperday desc;
Oil Production/Reserves of Countries Country Production Reserves ------------------------------------------------------------ Saudi Arabia 9,000,000 260,000,000,000 United States of America 8,000,000 30,000,000,000 Iran 4,000,000 90,000,000,000 Norway 3,500,000 11,000,000,000 Mexico 3,400,000 50,000,000,000 China 3,000,000 25,000,000,000

The INNER JOIN keywords can be used to join tables. The ON clause replaces the WHERE clause for specifying columns to join. PROC SQL provides these keywords primarily for compatibility with the other joins (OUTER, RIGHT, and LEFT JOIN). Using INNER JOIN with an ON clause provides the same functionality as listing tables in the FROM clause and specifying join columns with a WHERE clause.

This code produces the same output as the previous code but uses the INNER JOIN construction.

proc sql ; select p.country, barrelsperday 'Production', barrels 'Reserves' from sql.oilprod p inner join sql.oilrsrvs r on p.country = r.country order by barrelsperday desc;

Tables can be joined by using comparison operators other than the equal sign ( = ) in the WHERE clause (for a list of comparison operators, see Retrieving Rows Based on a Comparison). In this example, all U.S. cities in the USCITYCOORDS table are selected that are south of Cairo, Egypt. The compound WHERE clause specifies the city of Cairo in the WORLDCITYCOORDS table and joins USCITYCOORDS and WORLDCITYCOORDS on their Latitude columns, using a less-than ( lt ) operator.

proc sql; title 'US Cities South of Cairo, Egypt'; select us.City, us.State, us.Latitude, world.city, world.latitude from sql.worldcitycoords world, sql.uscitycoords us where world.city = 'Cairo' and us.latitude lt world.latitude;
US Cities South of Cairo, Egypt City State Latitude City Latitude ------------------------------------------------------------------------- Honolulu HI 21 Cairo 30 Key West FL 24 Cairo 30 Miami FL 26 Cairo 30 San Antonio TX 29 Cairo 30 Tampa FL 28 Cairo 30

When you run this query, the following message is written to the SAS log:

NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized.

Recall that you see this message when you run a query that joins tables without specifying matching columns in a WHERE clause. PROC SQL also displays this message whenever tables are joined by using an inequality operator.

Most database products treat nulls as distinct entities and do not match them in joins. PROC SQL treats nulls as missing values and as matches for joins. Any null will match with any other null of the same type (character or numeric) in a join.

The following example joins Table One and Table Two on column B. There are null values in column B of both tables. Notice in the output that the null value in row c of Table One matches all the null values in Table Two. This is probably not the intended result for the join.

proc sql; title 'One and Two Joined'; select one.a 'One', one.b, two.a 'Two', two.b from one, two where one.b=two.b;
Table One a b ------------------ a 1 b 2 c . d 4
Table Two a b ------------------ a 1 b 2 c . d 4 e . f .
One and Two Joined One b Two b -------------------------------------- a 1 a 1 b 2 b 2 c . c . d 4 d 4 c . e . c . f .

In order to specify only the nonmissing values for the join, use the IS NOT MISSING operator:

proc sql; select one.a 'One', one.b, two.a 'Two', two.b from one, two where one.b=two.b and one.b is not missing;
One and Two Joined One b Two b -------------------------------------- a 1 a 1 b 2 b 2 d 4 d 4

When a row is distinguished by a combination of values in more than one column, use all the necessary columns in the join. For example, a city name could exist in more than one country. To select the correct city, you must specify both the city and country columns in the joining query's WHERE clause.

This example displays the latitude and longitude of capital cities by joining the COUNTRIES table with the WORLDCITYCOORDS table. To minimize the number of rows in the example output, the first part of the WHERE expression selects capitals with names that begin with the letter L from the COUNTRIES table.

proc sql; title 'Coordinates of Capital Cities'; select Capital format=$12., Name format=$12., City format=$12., Country format=$12., Latitude, Longitude from sql.countries, sql.worldcitycoords where Capital like 'L%' and Capital = City;

London occurs once as a capital city in the COUNTRIES table. However, in WORLDCITYCOORDS, London is found twice: as a city in England and again as a city in Canada. Specifying only Capital = City in the WHERE expression yields the following incorrect output:

Coordinates of Capital Cities Capital Name City Country Latitude Longitude --------------------------------------------------------------------------- La Paz Bolivia La Paz Bolivia -16 -69 London England London Canada 43 -81 Lima Peru Lima Peru -13 -77 Lisbon Portugal Lisbon Portugal 39 -10 London England London England 51 0

Notice in the output that the inner join incorrectly matches London, England, to both London, Canada, and London, England. By also joining the country name columns together (COUNTRIES.Name to WORLDCITYCOORDS.Country), the rows match correctly.

proc sql; title 'Coordinates of Capital Cities'; select Capital format=$12., Name format=$12., City format=$12., Country format=$12., latitude, longitude from sql.countries, sql.worldcitycoords where Capital like 'L%' and Capital = City and Name = Country;
Coordinates of Capital Cities Capital Name City Country Latitude Longitude --------------------------------------------------------------------------- La Paz Bolivia La Paz Bolivia -16 -69 Lima Peru Lima Peru -13 -77 Lisbon Portugal Lisbon Portugal 39 -10 London England London England 51 0

The data that you need could be located in more than two tables. For example, if you want to show the coordinates of the capitals of the states in the United States, then you need to join the UNITEDSTATES table, which contains the state capitals, with the USCITYCOORDS table, which contains the coordinates of cities in the United States. Because cities must be joined along with their states for an accurate join (similarly to the previous example), you must join the tables on both the city and state columns of the tables.

Joining the cities, by joining the UNITEDSTATES.Capital column to the USCITYCOORDS.City column, is straightforward. However, in the UNITEDSTATES table the Name column contains the full state name, while in USCITYCOORDS the states are specified by their postal code. It is therefore impossible to directly join the two tables on their state columns. To solve this problem, it is necessary to use the POSTALCODES table, which contains both the state names and their postal codes, as an intermediate table to make the correct relationship between UNITEDSTATES and USCITYCOORDS. The correct solution joins the UNITEDSTATES.Name column to the POSTALCODES.Name column (matching the full state names), and the POSTALCODES.Code column to the USCITYCOORDS.State column (matching the state postal codes).

title 'Coordinates of State Capitals'; proc sql outobs=10; select us.Capital format=$15., us.Name 'State' format=$15., pc.Code, c.Latitude, c.Longitude from sql.unitedstates us, sql.postalcodes pc, sql.uscitycoords c where us.Capital = c.City and us.Name = pc.Name and pc.Code = c.State;
Coordinates of State Capitals Capital State Code Latitude Longitude ----------------------------------------------------------- Albany New York NY 43 -74 Annapolis Maryland MD 39 -77 Atlanta Georgia GA 34 -84 Augusta Maine ME 44 -70 Austin Texas TX 30 -98 Baton Rouge Louisiana LA 31 -91 Bismarck North Dakota ND 47 -101 Boise Idaho ID 43 -116 Boston Massachusetts MA 42 -72 Carson City Nevada NV 39 -120

When you need to show comparative relationships between values in a table, it is sometimes necessary to join columns within the same table. Joining a table to itself is called a self-join , or reflexive join . You can think of a self-join as PROC SQL making an internal copy of a table and joining the table to its copy.

For example, the following code uses a self-join to select cities that have average yearly high temperatures equal to the average yearly low temperatures of other cities.

proc sql; title "Cities' High Temps = Cities' Low Temps"; select High.City format $12., High.Country format $12., High.AvgHigh, ' | ', Low.City format $12., Low.Country format $12., Low.AvgLow from sql.worldtemps High, sql.worldtemps Low where High.AvgHigh = Low.AvgLow and High.city ne Low.city and High.country ne Low.country;

Notice that the WORLDTEMPS table is assigned two aliases, High and Low . Conceptually, this makes a copy of the table so that a join can be made between the table and its copy. The WHERE clause selects those rows that have high temperature equal to low temperature.

The WHERE clause also prevents a city from being joined to itself ( City ne City and Country ne Country ), although, in this case, it is highly unlikely that the high temperature would be equal to the low temperature for the same city.

Cities' High Temps = Cities' Low Temps City Country AvgHigh City Country AvgLow ------------------------------------------------------------------------------- Amsterdam Netherlands 70 | San Juan Puerto Rico 70 Auckland New Zealand 75 | Lagos Nigeria 75 Auckland New Zealand 75 | Manila Philippines 75 Berlin Germany 75 | Lagos Nigeria 75 Berlin Germany 75 | Manila Philippines 75 Bogota Colombia 69 | Bangkok Thailand 69 Cape Town South Africa 70 | San Juan Puerto Rico 70 Copenhagen Denmark 73 | Singapore Singapore 73 Dublin Ireland 68 | Bombay India 68 Glasgow Scotland 65 | Nassau Bahamas 65 London England 73 | Singapore Singapore 73 Oslo Norway 73 | Singapore Singapore 73 Reykjavik Iceland 57 | Caracas Venezuela 57 Stockholm Sweden 70 | San Juan Puerto Rico 70

Outer joins are inner joins that are augmented with rows from one table that do not match any row from the other table in the join. The resulting output includes rows that match and rows that do not match from the join's source tables. Nonmatching rows have null values in the columns from the unmatched table. Use the ON clause instead of the WHERE clause to specify the column or columns on which you are joining the tables. However, you can continue to use the WHERE clause to subset the query result.

A left outer join lists matching rows and rows from the left-hand table (the first table listed in the FROM clause) that do not match any row in the right-hand table. A left join is specified with the keywords LEFT JOIN and ON.

For example, to list the coordinates of the capitals of international cities, join the COUNTRIES table, which contains capitals, with the WORLDCITYCOORDS table, which contains cities' coordinates, by using a left join. The left join lists all capitals, regardless of whether the cities exist in WORLDCITYCOORDS. Using an inner join would list only capital cities for which there is a matching city in WORLDCITYCOORDS.

proc sql outobs=10; title 'Coordinates of Capital Cities'; select Capital format=$20., Name 'Country' format=$20., Latitude, Longitude from sql.countries a left join sql.worldcitycoords b on a.Capital = b.City and a.Name = b.Country order by Capital;
Coordinates of Capital Cities Capital Country Latitude Longitude --------------------------------------------------------------- Channel Islands . . Abu Dhabi United Arab Emirates . . Abuja Nigeria . . Accra Ghana 5 0 Addis Ababa Ethiopia 9 39 Algiers Algeria 37 3 Almaty Kazakhstan . . Amman Jordan 32 36 Amsterdam Netherlands 52 5 Andorra la Vella Andorra . .

A right join, specified with the keywords RIGHT JOIN and ON, is the opposite of a left join: nonmatching rows from the right-hand table (the second table listed in the FROM clause) are included with all matching rows in the output. This example reverses the join of the last example; it uses a right join to select all the cities from the WORLDCITYCOORDS table and displays the population only if the city is the capital of a country (that is, if the city exists in the COUNTRIES table).

proc sql outobs=10; title 'Populations of Capitals Only'; select City format=$20., Country 'Country' format=$20., Population from sql.countries right join sql.worldcitycoords on Capital = City and Name = Country order by City;
Populations of Capitals Only City Country Population ------------------------------------------------------ Abadan Iran . Acapulco Mexico . Accra Ghana 17395511 Adana Turkey . Addis Ababa Ethiopia 59291170 Adelaide Australia . Aden Yemen . Ahmenabad India . Algiers Algeria 28171132 Alice Springs Australia .

A full outer join, specified with the keywords FULL JOIN and ON, selects all matching and nonmatching rows. This example displays the first ten matching and nonmatching rows from the City and Capital columns of WORLDCITYCOORDS and COUNTRIES. Note that the pound sign ( # ) is used as a line split character in the labels.

proc sql outobs=10; title 'Populations and/or Coordinates of World Cities'; select City '#City#(WORLDCITYCOORDS)' format=$20., Capital '#Capital#(COUNTRIES)' format=$20., Population, Latitude, Longitude from sql.countries full join sql.worldcitycoords on Capital = City and Name = Country;
Populations and/or Coordinates of World Cities City Capital (WORLDCITYCOORDS) (COUNTRIES) Population Latitude Longitude --------------------------------------------------------------------------- 146436 . . Abadan . 30 48 Abu Dhabi 2818628 . . Abuja 99062003 . . Acapulco . 17 -100 Accra Accra 17395511 5 0 Adana . 37 35 Addis Ababa Addis Ababa 59291170 9 39 Adelaide . -35 138 Aden . 13 45

Three types of joins--cross joins, union joins, and natural joins--are special cases of the standard join types.

A cross join is a Cartesian product; it returns the product of two tables. Like a Cartesian product, a cross join's output can be limited by a WHERE clause.

This example shows a cross join of the tables One and Two:

Table One X Y ------------------ 1 2 2 3
Table Two W Z ------------------ 2 5 3 6 4 9
proc sql; select * from one cross join two;
The SAS System X Y W Z -------------------------------------- 1 2 2 5 1 2 3 6 1 2 4 9 2 3 2 5 2 3 3 6 2 3 4 9

Like a conventional Cartesian product, a cross join causes a note regarding Cartesian products in the SAS log.

A union join combines two tables without attempting to match rows. All columns and rows from both tables are included. Combining tables with a union join is similar to combining them with the OUTER UNION set operator (see Combining Queries with Set Operators). A union join's output can be limited by a WHERE clause.

This example shows a union join of the same One and Two tables that were used earlier to demonstrate a cross join:

proc sql; select * from one union join two;
X Y W Z -------------------------------------- . 2 5 . 3 6 . 4 9 1 2 . 2 3 .

A natural join automatically selects columns from each table to use in determining matching rows. With a natural join, PROC SQL identifies columns in each table that have the same name and type; rows in which the values of these columns are equal are returned as matching rows. The ON clause is implied.

This example produces the same results as the example in Specifying the Order of Join Output:

proc sql outobs=6; title 'Oil Production/Reserves of Countries'; select country, barrelsperday 'Production', barrels 'Reserve' from sql.oilprod natural join sql.oilrsrvs order by barrelsperday desc;
Oil Production/Reserves of Countries Country Production Reserve --------------------------------------------------------------- Saudi Arabia 9,000,000 260,000,000,000 United States of America 8,000,000 30,000,000,000 Iran 4,000,000 90,000,000,000 Norway 3,500,000 11,000,000,000 Mexico 3,400,000 50,000,000,000 China 3,000,000 25,000,000,000

The advantage of using a natural join is that the coding is streamlined. The ON clause is implied, and you do not need to use table aliases to qualify column names that are common to both tables. These two queries return the same results:

proc sql; select a.W, a.X, Y, Z from table1 a left join table2 b on a.W=b.W and a.X=b.X order by a.W;
proc sql; select W, X, Y, Z from table1 natural left join table2 order by W;

If you specify a natural join on tables that do not have at least one column with a common name and type, then the result is a Cartesian product. You can use a WHERE clause to limit the output.

Because the natural join makes certain assumptions about what you want to accomplish, you should know your data thoroughly before using it. You could get unexpected or incorrect results if, for example, you are expecting two tables to have only one column in common when they actually have two. You can use the FEEDBACK option to see exactly how PROC SQL is implementing your query. See Using PROC SQL Options to Create and Debug Queries for more information about the FEEDBACK option.

A natural join assumes that you want to base the join on equal values of all pairs of common columns. To base the join on inequalities or other comparison operators, use standard inner or outer join syntax.

As you can see from the previous examples, the nonmatching rows in outer joins contain missing values. By using the COALESCE function, you can overlay columns so that only the row from the table that contains data is listed. Recall that COALESCE takes a list of columns as its arguments and returns the first nonmissing value that it encounters.

This example adds the COALESCE function to the previous example to overlay the COUNTRIES.Capital, WORLDCITYCOORDS.City, and COUNTRIES.Name columns. COUNTRIES.Name is supplied as an argument to COALESCE because some islands do not have capitals.

proc sql outobs=10; title 'Populations and/or Coordinates of World Cities'; select coalesce(Capital, City,Name)format=$20. 'City', coalesce(Name, Country) format=$20. 'Country', Population, Latitude, Longitude from sql.countries full join sql.worldcitycoords on Capital = City and Name = Country;
Populations and/or Coordinates of World Cities City Country Population Latitude Longitude --------------------------------------------------------------------------- Channel Islands Channel Islands 146436 . . Abadan Iran . 30 48 Abu Dhabi United Arab Emirates 2818628 . . Abuja Nigeria 99062003 . . Acapulco Mexico . 17 -100 Accra Ghana 17395511 5 0 Adana Turkey . 37 35 Addis Ababa Ethiopia 59291170 9 39 Adelaide Australia . -35 138 Aden Yemen . 13 45

COALESCE can be used in both inner and outer joins. For more information about COALESCE, see Replacing Missing Values.

Many SAS users are familiar with using a DATA step to merge data sets. This section compares merges to joins. DATA step match-merges and PROC SQL joins can produce the same results. However, a significant difference between a match-merge and a join is that you do not have to sort the tables before you join them.

When all of the values match in the BY variable and there are no duplicate BY variables, you can use an inner join to produce the same result as a match-merge. To demonstrate this result, here are two tables that have the column Flight in common. The values of Flight are the same in both tables:

FLTSUPER FLTDEST Flight Supervisor Flight Destination 145 Kang 145 Brussels 150 Miller 150 Paris 155 Evanko 155 Honolulu

FLTSUPER and FLTDEST are already sorted by the matching column Flight. A DATA step merge produces Merged Tables When All the Values Match.

data merged; merge FltSuper FltDest; by Flight; run; proc print data=merged noobs; title 'Table MERGED'; run;
Table MERGED Flight Supervisor Destination 145 Kang Brussels 150 Miller Paris 155 Evanko Honolulu

With PROC SQL, presorting the data is not necessary. The following PROC SQL join gives the same result as that shown in Merged Tables When All the Values Match.

proc sql; title 'Table MERGED'; select s.flight, Supervisor, Destination from fltsuper s, fltdest d where s.Flight=d.Flight;

When only some of the values match in the BY variable, you can use an outer join to produce the same result as a match-merge. To demonstrate this result, here are two tables that have the column Flight in common. The values of Flight are not the same in both tables:

FLTSUPER FLTDEST Flight Supervisor Flight Destination 145 Kang 145 Brussels 150 Miller 150 Paris 155 Evanko 165 Seattle 157 Lei
data merged; merge fltsuper fltdest; by flight; run; proc print data=merged noobs; title 'Table MERGED'; run;
Table MERGED Flight Supervisor Destination 145 Kang Brussels 150 Miller Paris 155 Evanko 157 Lei 165 Seattle

To get the same result with PROC SQL, use an outer join so that the query result will contain the nonmatching rows from the two tables. In addition, use the COALESCE function to overlay the Flight columns from both tables. The following PROC SQL join gives the same result as that shown in Merged Tables When Some of the Values Match:

proc sql; select coalesce(s.Flight,d.Flight) as Flight, Supervisor, Destination from fltsuper s full join fltdest d on s.Flight=d.Flight;

When you want to merge two tables and the position of the values is important, you might need to use a DATA step merge. To demonstrate this idea, here are two tables to consider:

FLTSUPER FLTDEST Flight Supervisor Flight Destination 145 Kang 145 Brussels 145 Ramirez 145 Edmonton 150 Miller 150 Paris 150 Picard 150 Madrid 155 Evanko 165 Seattle 157 Lei

For Flight 145, Kang matches with Brussels and Ramirez matches with Edmonton . Because the DATA step merges data based on the position of values in BY groups, the values of Supervisor and Destination match appropriately. A DATA step merge produces Match-Merge of the FLTSUPER and FLTDEST Tables:

data merged; merge fltsuper fltdest; by flight; run; proc print data=merged noobs; title 'Table MERGED'; run;
Table MERGED Flight Supervisor Destination 145 Kang Brussels 145 Ramirez Edmonton 150 Miller Paris 150 Picard Madrid 155 Evanko 157 Lei 165 Seattle

PROC SQL does not process joins according to the position of values in BY groups. Instead, PROC SQL processes data only according to the data values. Here is the result of an inner join for FLTSUPER and FLTDEST:

proc sql; title 'Table JOINED'; select * from fltsuper s, fltdest d where s.Flight=d.Flight;
Table JOINED Flight Supervisor Flight Destination ------------------------------------------- 145 Kang 145 Brussels 145 Kang 145 Edmonton 145 Ramirez 145 Brussels 145 Ramirez 145 Edmonton 150 Miller 150 Paris 150 Miller 150 Madrid 150 Picard 150 Paris 150 Picard 150 Madrid

PROC SQL builds the Cartesian product and then lists the rows that meet the WHERE clause condition. The WHERE clause returns two rows for each supervisor, one row for each destination. Because Flight has duplicate values and there is no other matching column, there is no way to associate Kang only with Brussels , Ramirez only with Edmonton , and so on.

For more information about DATA step match-merges, see SAS Language Reference: Dictionary .