Duplicate Leads in Salesforce? It’s not just messy — it’s dangerous!

As a Salesforce Architect, one of the most underestimated pain points I see across orgs is poor duplicate management. It silently:
⚠️ Breaks automation
📉 Skews reporting
❌ Slows down sales
🛑 Violates GDPR rules

Last week, I worked on optimizing a lead flow where 80,000+ leads were sitting unvalidated — many of them potential contacts already in the system. Here’s how we tackled it:

📌 Step 1: Smart Duplicate Check
→ We built a flow that compares incoming Leads to existing Contacts & Leads using fuzzy logic (Email, Phone, Name, etc.).

📌 Step 2: Decision Branch
→ If a duplicate is found, we flag it or merge it automatically (using Apex + native merge tools).
→ If not, we convert the Lead cleanly to a Contact, ensuring no clutter.

📌 Step 3: Automation with Guardrails
→ All of this runs inside a scalable Salesforce Flow — enriched with Apex where needed — and leaves a full audit trail.

💡 Architecture isn’t just about building — it’s about protecting your data layer. If you're still relying on name-only matching or manual checks, you're setting your CRM up for failure.

Let’s talk if you want a duplicate management framework that scales 👇

#Salesforce #CRMStrategy #DuplicateCheck #SalesforceFlow #Architect #RevOps #DataIntegrity
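The post names the matching signals (Email, Phone, Name) but shows no code. Here is a minimal, hypothetical Python sketch of the fuzzy-matching idea — in a real org this logic would live in Apex or Flow, and the field names, threshold, and scoring are illustrative assumptions, not Salesforce APIs:

```python
import difflib
import re

def normalize_email(email):
    # Lower-case and trim so formatting differences don't hide duplicates.
    return email.strip().lower()

def normalize_phone(phone):
    # Keep digits only so "+1 (555) 010-2000" matches "15550102000".
    return re.sub(r"\D", "", phone)

def name_similarity(a, b):
    # Fuzzy ratio in [0, 1] between two names, case-insensitive.
    return difflib.SequenceMatcher(
        None, a.strip().lower(), b.strip().lower()
    ).ratio()

def is_probable_duplicate(lead, contact, name_threshold=0.85):
    # A lead is a probable duplicate if email or phone match exactly after
    # normalization, or the names are very similar (threshold is a guess).
    if normalize_email(lead["email"]) == normalize_email(contact["email"]):
        return True
    lead_phone = normalize_phone(lead["phone"])
    if lead_phone and lead_phone == normalize_phone(contact["phone"]):
        return True
    return name_similarity(lead["name"], contact["name"]) >= name_threshold

lead = {"name": "Jon Doe", "email": "JOHN.DOE@EX.COM ", "phone": "+1 555 010 2000"}
contact = {"name": "John Doe", "email": "john.doe@ex.com", "phone": "15550102000"}
print(is_probable_duplicate(lead, contact))  # True: emails match after normalization
```

Note how name-only matching (the anti-pattern the post warns about) would be just the last check; layering normalized exact keys first is what makes the branch decision reliable.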
Email File Management
-
SQL Interview Series – Day 12

Task: Write a SQL query to identify email addresses that appear more than once in the customers table.

This type of question is commonly asked in interviews at companies like PwC, KPMG, and Infosys, especially when the role involves data quality audits, reporting, or data migration tasks. The focus here is on identifying duplicates — an essential skill in data cleaning and preprocessing workflows.

How to frame it: Start by grouping the table by the email column. Then apply the COUNT(*) function to count how many times each email appears in the dataset. To find duplicates, use a HAVING clause to return only those email groups where the count is greater than one. This logic helps detect data integrity issues such as multiple records with the same email due to failed validations or duplicate imports.

Concepts used:
1. GROUP BY Clause: Groups records by email so that aggregation functions can be used to count how many times each unique email appears.
2. COUNT(*) Function: Counts the total number of records for each grouped email value. If an email appears more than once, its count will be greater than one.
3. Aliasing Aggregates: The result of COUNT(*) is aliased as occurrences for better readability and downstream usage in reporting or debugging queries.
4. HAVING Clause to Filter Aggregates: Since WHERE cannot be used with aggregated values, the HAVING clause is applied to filter the grouped data. It returns only those emails with more than one record.
5. Data Quality Relevance: Identifying duplicates is critical in ETL pipelines, CRM data syncs, and compliance checks. Interviewers expect you to write efficient queries that surface these issues clearly.
6. Advanced Follow-Up Strategies: Once duplicates are identified, interviewers may ask how to remove or resolve them. You can suggest using ROW_NUMBER() to isolate the latest record, or DISTINCT to retain unique values based on business rules.
7. Practical Application: This pattern is useful in fraud detection, contact deduplication, lead cleanup, or preparing customer data for machine learning models.

Why this is asked: This question tests your understanding of SQL’s grouping and filtering logic, and how to detect and report anomalies in data. Clean data is the foundation of every analytics project, and the ability to identify duplicates is a must-have skill for any analyst or engineer. Interviewers look for candidates who can think practically and solve messy data problems with precision.

#SQL #SQLInterview #PwCInterview #DataCleaning #HAVINGClause #DataAnalytics #SQLQuery #LearnSQL #DataQuality #BusinessIntelligence #InterviewPreparation #DataEngineering #AnalyticsJobs #SQLTips
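The post describes the query but never shows it. Here is one way to write it, run against an in-memory SQLite database with made-up sample data (the post only guarantees an email column on customers; the rest is illustrative):

```python
import sqlite3

# In-memory SQLite table standing in for the interview's customers table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO customers (email) VALUES (?)",
    [("a@x.com",), ("b@x.com",), ("a@x.com",), ("c@x.com",), ("a@x.com",)],
)

# The query the post describes: GROUP BY email, COUNT(*) aliased as
# occurrences, and HAVING to keep only groups with more than one record.
rows = conn.execute("""
    SELECT email, COUNT(*) AS occurrences
    FROM customers
    GROUP BY email
    HAVING COUNT(*) > 1
""").fetchall()
print(rows)  # [('a@x.com', 3)]
```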
-
PySpark scenario-based interview questions & answers:

1) Deduplicate and normalize messy user data (Beginner)

Scenario: You receive a user signup CSV with messy names, mixed-case emails, and multiple signups per email. Keep the most recent signup per email and normalize fields.

Purpose: Data hygiene — prevents duplicate users and inconsistent keys that break joins and metrics.

Schema: user_id: int, full_name: string, email: string, signup_ts: string, country: string

Sample rows:
(1, " john DOE ", "JOHN@EX.COM ", "2025-11-20 10:00", "US")
(2, "John Doe", "john@ex.com", "2025-11-21 09:00", "US")
(3, "alice", "alice@mail.com", "11/20/2025 12:00", "IN")

Approach:
- Read the CSV with a header.
- Trim & normalize (full_name → title case, email → lower case).
- Parse multiple timestamp formats to a timestamp.
- Filter obviously invalid emails (basic regex).
- Deduplicate by email, keeping the row with the latest signup_ts.

Explanation: Lower-casing emails and trimming prevents false-unique keys. Multiple to_timestamp attempts handle variable input formats. A window + row_number() deterministically selects the most recent record per email. Caveat: a basic regex filters obvious invalid addresses but not full RFC validation.

Karthik K.
#PySpark #DataCleaning #ETL #DataEngineering #ApacheSpark
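The code block referenced at the end of the post did not survive extraction. As a stand-in, here is the same dedup logic sketched in plain Python (no Spark required) — in PySpark the final step would be a Window partitioned by email with row_number() ordered by signup_ts descending. The two timestamp formats are taken from the sample rows; everything else mirrors the listed approach:

```python
import re
from datetime import datetime

ROWS = [
    (1, " john DOE ", "JOHN@EX.COM ", "2025-11-20 10:00", "US"),
    (2, "John Doe", "john@ex.com", "2025-11-21 09:00", "US"),
    (3, "alice", "alice@mail.com", "11/20/2025 12:00", "IN"),
]

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")   # basic, not full RFC
TS_FORMATS = ("%Y-%m-%d %H:%M", "%m/%d/%Y %H:%M")      # formats seen in the sample

def parse_ts(raw):
    # Try each known format in turn, like chained to_timestamp attempts.
    for fmt in TS_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            pass
    return None

def clean(rows):
    latest = {}  # email -> normalized row with the most recent signup_ts
    for user_id, full_name, email, signup_ts, country in rows:
        email = email.strip().lower()      # normalize the dedup key
        name = full_name.strip().title()   # " john DOE " -> "John Doe"
        ts = parse_ts(signup_ts)
        if ts is None or not EMAIL_RE.match(email):
            continue                       # drop unparseable/invalid rows
        if email not in latest or ts > latest[email]["signup_ts"]:
            latest[email] = {"user_id": user_id, "full_name": name,
                             "email": email, "signup_ts": ts, "country": country}
    return list(latest.values())

result = clean(ROWS)
print(sorted(r["user_id"] for r in result))  # [2, 3]
```

Rows 1 and 2 share the key john@ex.com after normalization, so only the later signup (id 2) survives, exactly as the scenario asks.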
-
SQL interview question: How to Identify and Delete Duplicates (with Code)

Handling duplicate records in SQL is a common task, especially when dealing with raw or legacy datasets, and interviewers love to ask this. Here are 3 reliable methods to identify and delete duplicates using SQL:

1. Using ROW_NUMBER() (best for complex duplicate conditions)

WITH CTE AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY name, email ORDER BY id) AS rn
  FROM users
)
DELETE FROM users
WHERE id IN (SELECT id FROM CTE WHERE rn > 1);

✅ Why this works: It keeps the first occurrence (based on id) and removes the rest. Super handy when deduplication depends on multiple columns.

2. Using GROUP BY with MIN() or MAX()

DELETE FROM users
WHERE id NOT IN (
  SELECT MIN(id) FROM users GROUP BY name, email
);

✅ Why this works: Best for simple datasets with clear duplicate keys. It just keeps the record with the smallest id.

3. Using SELF JOIN

DELETE u1
FROM users u1
JOIN users u2
  ON u1.name = u2.name AND u1.email = u2.email
WHERE u1.id > u2.id;

✅ Why this works: No CTE required — straightforward and readable.

How would you answer this? Drop it in the comments!
------------------------------------------------------------------
#SQL #interviewquestions
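Method 1 can be verified end to end in SQLite (which supports both window functions and a WITH clause on DELETE); the users table and its rows here are made up for the demo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    INSERT INTO users (id, name, email) VALUES
        (1, 'John',  'j@a.com'),
        (2, 'John',  'j@a.com'),   -- duplicate of id 1
        (3, 'Rahul', 'r@b.com'),
        (4, 'John',  'j@a.com');   -- another duplicate of id 1
""")

# Method 1 from the post: rank rows per (name, email), delete where rn > 1.
conn.execute("""
    WITH CTE AS (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY name, email ORDER BY id) AS rn
        FROM users
    )
    DELETE FROM users WHERE id IN (SELECT id FROM CTE WHERE rn > 1)
""")

remaining = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
print(remaining)  # [(1, 'John'), (3, 'Rahul')]
```

One portability caveat worth mentioning in an interview: the exact DELETE syntax varies by engine (e.g. SQL Server lets you DELETE from the CTE directly, and the multi-table DELETE in method 3 is MySQL-flavored), but the ranking idea is the same everywhere.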
-
One of the most underrated features of Snowflake that can instantly enhance your data quality and pipeline efficiency is the use of HASH() or CHECKSUM() to detect duplicate rows.

Why duplicates become a headache:
- Multiple systems sending the same record
- Late-arriving data
- Ingestion retries
- Missing primary keys
- Manual file loads

Traditional duplicate detection methods often involve long JOIN conditions, row-by-row comparisons, and complex WHERE clauses, which can become slow and costly as datasets grow. Snowflake offers a far simpler solution. Instead of manually comparing each column, you can generate a fingerprint of a row using:
- HASH(col1, col2, col3, ...)
- CHECKSUM(col1, col2, col3, ...)

This fingerprint condenses the entire row into a single numeric value. If two rows have the same hash value, they are duplicates (barring a hash collision, which is vanishingly rare in practice).

For example, consider a table with the following data:

NAME   EMAIL     CITY
John   j@a.com   Pune
John   j@a.com   Pune
Rahul  r@b.com   Delhi

You can create a row signature with the following query:

SELECT *, HASH(name, email, city) AS row_hash
FROM customers;

The output will show:

NAME   EMAIL     CITY    ROW_HASH
John   j@a.com   Pune    84739281
John   j@a.com   Pune    84739281
Rahul  r@b.com   Delhi   12849372

Now duplicates become obvious: same data equals same hash. To find duplicates in one line, you can use:

SELECT row_hash, COUNT(*)
FROM (
    SELECT HASH(name, email, city) AS row_hash
    FROM customers
)
GROUP BY row_hash
HAVING COUNT(*) > 1;

This query provides all duplicate groups instantly. Snowflake hides so much power in small functions. HASH() is one of those features: easy to use, zero maintenance, and extremely effective. If you aren’t using hash-based deduplication yet, it’s one of the quickest ways to improve your pipelines.

Follow Sajal Agarwal for more such content.
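Outside Snowflake, the same fingerprint trick can be sketched with Python's hashlib as a stand-in for HASH() (the column values match the example above; the separator byte and SHA-256 choice are my assumptions, not Snowflake's algorithm):

```python
import hashlib
from collections import Counter

ROWS = [
    ("John", "j@a.com", "Pune"),
    ("John", "j@a.com", "Pune"),
    ("Rahul", "r@b.com", "Delhi"),
]

def row_hash(*cols):
    # Condense a whole row into one fingerprint, like HASH(col1, col2, ...).
    # A separator byte guards against ("ab", "c") colliding with ("a", "bc").
    payload = "\x1f".join(str(c) for c in cols)
    return hashlib.sha256(payload.encode()).hexdigest()

# Same data -> same hash, so the duplicate groups are the hashes seen > 1 time
# (the GROUP BY ... HAVING COUNT(*) > 1 step, done with a Counter).
counts = Counter(row_hash(*row) for row in ROWS)
dupes = {h: n for h, n in counts.items() if n > 1}
print(len(dupes))  # 1 duplicate group: the two John/Pune rows
```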