Data Analyst Written Test, Paper 4: SQL, Video (KS)
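Welcome to the data analysis written test. For this position, the test is scored out of 100 points and the time limit is 60 minutes.
Requirement: no external lookups are allowed during the test; it must be completed independently.
One table is given: jf.creative_element_di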

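1. Count the video enqueue events that occurred during the 17:00 hour on 2024-07-02 (15 points)
-- Assumes operation_time is the event timestamp (as in the later answers) and element_type '2024518' denotes video
SELECT COUNT(*) AS enqueue_count
FROM jf.creative_element_di
WHERE operation_time >= '2024-07-02 17:00:00' AND operation_time < '2024-07-02 18:00:00'
AND element_type = '2024518' AND event_type = 'ENQUEUE';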
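2. Extract all creative IDs from cover elements, with exactly one creative ID per row (20 points)
-- Assumes each params JSON holds a single creative_id at $.extraInfo2.creative_id and element_type '2024519' denotes cover
SELECT DISTINCT JSON_UNQUOTE(JSON_EXTRACT(params, '$.extraInfo2.creative_id')) AS creative_id
FROM jf.creative_element_di
WHERE element_type = '2024519';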
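3. Count the advertisers for which every video has a delay within 4 hours (Note: delay = submit time - enqueue time) (20 points)
WITH video_times AS (
-- One row per video: its enqueue time and its submit time
SELECT video_id,
account_id,
MAX(CASE WHEN event_type = 'ENQUEUE' THEN operation_time END) AS enqueue_time,
MAX(CASE WHEN event_type = 'SUBMIT' THEN operation_time END) AS submit_time
FROM jf.creative_element_di
WHERE element_type = '2024518'
GROUP BY video_id, account_id
),
account_delays AS (
-- Seconds rather than hours: TIMESTAMPDIFF truncates, so a 4h59m delay would otherwise count as 4
SELECT account_id,
TIMESTAMPDIFF(SECOND, enqueue_time, submit_time) AS delay_seconds
FROM video_times
WHERE enqueue_time IS NOT NULL AND submit_time IS NOT NULL
),
qualified_accounts AS (
-- Keep an advertiser only if its slowest video is still within 4 hours
SELECT account_id
FROM account_delays
GROUP BY account_id
HAVING MAX(delay_seconds) <= 4 * 3600
)
SELECT COUNT(*) AS advertiser_count
FROM qualified_accounts;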
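The "every video within 4 hours" condition can be checked on toy data with a short Python sketch. The rows and the helper name advertisers_within are hypothetical and only illustrate the logic: take the worst delay per advertiser and compare the exact duration against the limit. Whole-hour differences truncate, so a 4h59m delay would read as 4; comparing exact durations avoids that.

```python
from datetime import datetime, timedelta

# Hypothetical toy rows: (account_id, video_id, enqueue_time, submit_time).
rows = [
    ("a1", "v1", datetime(2024, 7, 2, 10, 0), datetime(2024, 7, 2, 13, 30)),
    ("a1", "v2", datetime(2024, 7, 2, 11, 0), datetime(2024, 7, 2, 15, 0)),   # exactly 4h
    ("a2", "v3", datetime(2024, 7, 2, 9, 0), datetime(2024, 7, 2, 10, 0)),
    ("a2", "v4", datetime(2024, 7, 2, 9, 0), datetime(2024, 7, 2, 13, 59)),   # 4h59m
]


def advertisers_within(rows, limit=timedelta(hours=4)):
    """Return advertisers whose slowest video delay does not exceed limit."""
    worst = {}
    for account_id, _video_id, enqueue_time, submit_time in rows:
        delay = submit_time - enqueue_time
        # Track the largest delay seen so far for this advertiser.
        worst[account_id] = max(worst.get(account_id, delay), delay)
    return sorted(a for a, d in worst.items() if d <= limit)


qualified = advertisers_within(rows)
print(len(qualified), qualified)
```

Here a2 is excluded because one of its videos took 4h59m, even though its other video finished within an hour.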
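4. Sample 1000 submit-event video records from the full set, such that the sampling ratio of every reviewer ID is similar (Note: sampling ratio = sampled count / total count) (20 points; tests: logic*)
a) Determine the distribution of reviewer IDs: first find how many times each reviewer ID appears in the full set, i.e. how many videos each reviewer ID was assigned.
b) Compute the quota for each reviewer ID: work out how many rows to draw per reviewer ID so that the sampling ratios are as close to equal as possible. Suppose there are N video records in total and M distinct reviewer IDs, and reviewer ID i accounts for n_i records; its quota is then floor(1000 * n_i/N), where floor rounds down.
c) Handle the rounding gap: because every quota is rounded down, the quotas can add up to less than 1000 (with floor they can never add up to more). Handle this as follows:
i. Compute the actual sampled total, i.e. the sum of floor(1000 * n_i/N) over all reviewer IDs.
ii. If the actual total is below 1000, make up the difference: rank the reviewer IDs (for example by the fractional part of 1000 * n_i/N, largest first) and add one row to each in turn until the total reaches 1000.
iii. If a different rounding rule is used and the total exceeds 1000, reduce the quotas starting from the reviewer ID with the largest quota until the total is back to 1000.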
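-- Assumes MySQL 8.0+ (window functions) and at least 1000 qualifying rows in the table
WITH total_submissions AS (
SELECT operator_id,
COUNT(*) AS total_count, -- n_i
SUM(COUNT(*)) OVER () AS grand_total -- N
FROM jf.creative_element_di
WHERE event_type = 'SUBMIT' AND element_type = '2024518'
GROUP BY operator_id
),
quotas AS (
SELECT operator_id,
(1000 * total_count) DIV grand_total AS base_quota, -- floor(1000 * n_i / N)
-- Rank reviewers by remainder so the rounding shortfall can be handed out one row at a time
ROW_NUMBER() OVER (ORDER BY MOD(1000 * total_count, grand_total) DESC, operator_id) AS remainder_rank,
1000 - SUM((1000 * total_count) DIV grand_total) OVER () AS shortfall
FROM total_submissions
),
ranked_submissions AS (
SELECT d.*,
-- Ordering by operation_time keeps the earliest rows; ORDER BY RAND() would draw a random sample instead
ROW_NUMBER() OVER (PARTITION BY d.operator_id ORDER BY d.operation_time) AS row_num,
q.base_quota + IF(q.remainder_rank <= q.shortfall, 1, 0) AS sample_count
FROM jf.creative_element_di d
JOIN quotas q ON d.operator_id = q.operator_id
WHERE d.event_type = 'SUBMIT' AND d.element_type = '2024518'
)
SELECT * FROM ranked_submissions
WHERE row_num <= sample_count
ORDER BY operator_id, row_num;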
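The quota allocation in steps a) to c) can be checked on small numbers with a Python sketch. It is one possible reading of step ii, a largest-remainder rule, and not necessarily the intended answer; the function name allocate_quotas and the reviewer counts are hypothetical. Integer arithmetic is used so that the floor and the remainders are exact, and the sketch assumes the total row count is at least the sample size.

```python
def allocate_quotas(counts, sample_size=1000):
    """Split sample_size across reviewers in proportion to their row counts.

    counts maps reviewer ID -> number of submit rows (n_i).
    Assumes sum(counts.values()) >= sample_size.
    """
    total = sum(counts.values())  # N
    # floor(sample_size * n_i / N), computed exactly with integer division.
    quotas = {op: sample_size * n // total for op, n in counts.items()}
    # Numerator of the fractional part lost by rounding down.
    remainders = {op: sample_size * n % total for op, n in counts.items()}
    shortfall = sample_size - sum(quotas.values())
    # Give one extra row to the reviewers with the largest remainders;
    # ties are broken by reviewer ID so the result is deterministic.
    top_up = sorted(counts, key=lambda op: (-remainders[op], op))[:shortfall]
    for op in top_up:
        quotas[op] += 1
    return quotas


# Hypothetical reviewer counts.
counts = {"r1": 5000, "r2": 3000, "r3": 1001}
quotas = allocate_quotas(counts)
print(quotas, sum(quotas.values()))
```

With these counts the floored quotas are 555, 333 and 111, which sum to 999; the single missing row goes to r1, whose remainder is the largest.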
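5. A reviewer has two pages open at the same time; the part of the review time that gets counted twice is called multi-open time (e.g. x is the claim time and y is the dequeue time; the time between x2 and y1 that is counted twice is the multi-open time). Compute each reviewer's review time with the multi-open time removed (Note: review time = submit time - claim time) (25 points; tests: logic***)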
-- Build one review interval per reviewer and video.
-- Assumes MySQL 8.0+, that video_id pairs the two events, and that ENQUEUE stands in for the claim event (the schema is not given)
WITH review_intervals AS (
SELECT operator_id,
video_id,
MIN(CASE WHEN event_type = 'ENQUEUE' THEN operation_time END) AS enqueue_time,
MAX(CASE WHEN event_type = 'SUBMIT' THEN operation_time END) AS submit_time
FROM jf.creative_element_di
WHERE event_type IN ('ENQUEUE', 'SUBMIT')
GROUP BY operator_id, video_id
),
-- For each interval, find the latest submit time among all earlier intervals of the same reviewer
with_previous AS (
SELECT operator_id, enqueue_time, submit_time,
MAX(submit_time) OVER (
PARTITION BY operator_id ORDER BY enqueue_time, submit_time
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
) AS prev_max_submit_time
FROM review_intervals
WHERE enqueue_time IS NOT NULL AND submit_time IS NOT NULL
),
-- Move the start forward past time that earlier intervals already cover, so overlap is not counted twice
merged_intervals AS (
SELECT operator_id, submit_time,
GREATEST(enqueue_time, COALESCE(prev_max_submit_time, enqueue_time)) AS adjusted_enqueue_time
FROM with_previous
),
-- Sum the remaining time per reviewer; an interval fully covered by earlier ones contributes 0
review_times AS (
SELECT operator_id,
SUM(GREATEST(TIMESTAMPDIFF(SECOND, adjusted_enqueue_time, submit_time), 0)) AS total_review_time_seconds
FROM merged_intervals
GROUP BY operator_id
)
SELECT operator_id,
total_review_time_seconds,
SEC_TO_TIME(total_review_time_seconds) AS total_review_time -- HH:MM:SS; the TIME type tops out at 838:59:59
FROM review_times;
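Removing multi-open time amounts to measuring the union of a reviewer's review intervals instead of their sum. A minimal Python sketch of that idea, on hypothetical timestamps and with a hypothetical function name, sorts the intervals by start time and only counts the part of each interval that lies beyond what earlier intervals already cover.

```python
from datetime import datetime, timedelta


def review_time_without_overlap(intervals):
    """Total length of the union of (start, end) intervals for one reviewer."""
    total = timedelta(0)
    covered_until = None  # latest end time already counted
    for start, end in sorted(intervals):
        # Skip the part of this interval that earlier intervals already cover.
        if covered_until is not None and start < covered_until:
            start = covered_until
        if end > start:
            total += end - start
            covered_until = end
    return total


day = datetime(2024, 7, 2)
# Hypothetical intervals for one reviewer, as (claim time, submit time).
intervals = [
    (day.replace(hour=10, minute=0), day.replace(hour=10, minute=30)),   # x1..y1
    (day.replace(hour=10, minute=20), day.replace(hour=10, minute=50)),  # x2..y2 overlaps x2..y1
    (day.replace(hour=10, minute=5), day.replace(hour=10, minute=10)),   # fully inside the first
    (day.replace(hour=11, minute=0), day.replace(hour=11, minute=15)),   # no overlap
]

naive = sum((end - start for start, end in intervals), timedelta(0))
deduplicated = review_time_without_overlap(intervals)
print(naive, deduplicated)
```

The naive sum is 80 minutes, while the union is 65 minutes: 10:00 to 10:50 plus 11:00 to 11:15. The 15-minute difference is the multi-open time.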