博客
关于我
强烈建议你试试无所不能的chatGPT,快点击我
【面经】Epic: 数据库去重
阅读量:6532 次
发布时间:2019-06-24

本文共 4156 字,大约阅读时间需要 13 分钟。

题目是:有2个10G的数据库,存储了一些string. 2者之间有一些重复的数据。请把它们合并为一个数据库,并且去除重复。

限制:内存是4G

例如: DB1: cmu, ucb, stanford, nyu

        DB2: ucsb, ucb, ucsd, cmu.

两者合并后,应该是: DB: cmu, ucb, stanford, nyu, ucsb, ucsd.

作法:把DB1分为5个小的数据库,分别是DB11, DB12, DB13, DB14, DB15

        把DB2分为5个小的数据库,分别是DB22, DB22, DB23, DB24, DB25

把DB11 与 DB22, DB22, DB23, DB24, DB25 分别进行Union操作,生成DB11Merge.

把DB12 与 DB22, DB22, DB23, DB24, DB25 分别进行Union操作,生成DB12Merge.

......

最后再把DB11Merge, DB12Merge, DB13Merge, DB14Merge, DB15Merge 合并在一起即可

用以下语句即可:

mysql> insert into merge select * from persons2;

1. 

ref: 

以下是实验结果:

A UNION query returns only distinct rows. (There is also UNION ALL, but that would include duplicate rows, so you don't want it here.)

1 mysql> select * from persons2;                                                  +-----------+ 2  3 | FirstName | 4  5 +-----------+ 6  7 | zelin     | 8  9 | qihao     |10 11 +-----------+12 13 2 rows in set (0.00 sec)14 15  16 17 mysql> select * from persons;18 19 +-----------+20 21 | FirstName |22 23 +-----------+24 25 | yu        |26 27 | zhixu     |28 29 | zelin     |30 31 +-----------+32 33 3 rows in set (0.00 sec)34 35  36 37 mysql> 38 39 mysql> select * from persons union select * from persons2;40 41 +-----------+42 43 | FirstName |44 45 +-----------+46 47 | yu        |48 49 | zhixu     |50 51 | zelin     |52 53 | qihao     |54 55 +-----------+56 57 4 rows in set (0.00 sec)
View Code

 

2. Join

顺便介绍几个DB常用的merge用的语句:

An SQL JOIN clause is used to combine rows from two or more tables, based on a common field between them.

The most common type of join is: SQL INNER JOIN (simple join). An SQL INNER JOIN return all rows from multiple tables where the join condition is met.

Let's look at a selection from the "Orders" table:

OrderID CustomerID OrderDate
10308 2 1996-09-18
10309 37 1996-09-19
10310 77 1996-09-20

Then, have a look at a selection from the "Customers" table:

CustomerID CustomerName ContactName Country
1 Alfreds Futterkiste Maria Anders Germany
2 Ana Trujillo Emparedados y helados Ana Trujillo Mexico
3 Antonio Moreno Taquería Antonio Moreno Mexico

Notice that the "CustomerID" column in the "Orders" table refers to the "CustomerID" in the "Customers" table. The relationship between the two tables above is the "CustomerID" column.

Then, if we run the following SQL statement (that contains an INNER JOIN):

Example

SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate
FROM Orders
INNER JOIN Customers
ON Orders.CustomerID=Customers.CustomerID;

it will produce something like this:

OrderID CustomerName OrderDate
10308 Ana Trujillo Emparedados y helados 9/18/1996
10365 Antonio Moreno Taquería 11/27/1996
10383 Around the Horn 12/16/1996
10355 Around the Horn 11/15/1996
10278 Berglunds snabbköp 8/12/1996

 


Different SQL JOINs

Before we continue with examples, we will list the types the different SQL JOINs you can use:

    • INNER JOIN: Returns all rows when there is at least one match in BOTH tables
    • LEFT JOIN: Return all rows from the left table, and the matched rows from the right table
    • RIGHT JOIN: Return all rows from the right table, and the matched rows from the left table
    • FULL JOIN: Return all rows when there is a match in ONE of the tables

3. Full Join

 在mysql中没有full join语句,我们需要用union:

mysql> SELECT * FROM persons LEFT JOIN persons2 ON persons.firstName=persons2.firstName UNION SELECT * FROM persons RIGHT JOIN persons2 ON persons.firstName=persons2.firstName;

+-----------+-----------+

| FirstName | FirstName |

+-----------+-----------+

| zelin     | zelin     |

| yu        | NULL      |

| zhixu     | NULL      |

| NULL      | qihao     |

+-----------+-----------+

4 rows in set (0.00 sec)

4.  REPLACE Syntax

使用replace语句也可以达到去重的效果。前提是,我们把想要去重的项目设置为primary key即可。

REPLACE [LOW_PRIORITY | DELAYED] [INTO] tbl_name [(col_name,...)] {VALUES | VALUE} ({

expr | DEFAULT},...),(...),...

Or:

REPLACE [LOW_PRIORITY | DELAYED]    [INTO] tbl_name    SET col_name={
expr | DEFAULT}, ...

Or:

REPLACE [LOW_PRIORITY | DELAYED]    [INTO] tbl_name [(col_name,...)]    SELECT ...

 works exactly like , except that if an old row in the table has the same value as a new row for aPRIMARY KEY or a UNIQUE index, the old row is deleted before the new row is inserted. See .

 

转载地址:http://xoqbo.baihongyu.com/

你可能感兴趣的文章
java 操作 RabbitMQ 发送、接受消息
查看>>
go run main.go undefined? golang main包那点事
查看>>
前端进阶(13) - 搭建自己的前端脚手架
查看>>
数据挖掘(二):认识数据
查看>>
从零开始写一个npm包,一键生成react组件(偷懒==提高效率)
查看>>
Golang中的路由
查看>>
【期末考试季】JAVA进阶复习提纲
查看>>
Volley(二)—— 基本Request对象 & RequestQueue&请求取消
查看>>
2017中国系统架构师大会“盛装”来袭
查看>>
Google插件switchysharp的用法
查看>>
中国最强的人工智能学术会议来了
查看>>
Metasploit的射频收发器功能 | Metasploit’s RF Transceiver Capabilities
查看>>
Osmocom-BB中cell_log的多种使用姿势
查看>>
主库 归档 删除策略
查看>>
linux服务器多网卡bond
查看>>
Chrome 更新策略大变:优先安装 64 位版本
查看>>
《Linux从入门到精通(第2版)》——导读
查看>>
路过下载攻击利用旧版 Android 漏洞安装勒索软件
查看>>
《ANTLR 4权威指南》——1.2 运行ANTLR并测试识别程序
查看>>
ThinkSNS 六大子版本体验及源码下载
查看>>