What is a valid object hierarchy when building a Snowflake environment?
Account --> Database --> Schema --> Warehouse
Organization --> Account --> Database --> Schema --> Stage
Account --> Schema > Table --> Stage
Organization --> Account --> Stage --> Table --> View
Option B (Organization --> Account --> Database --> Schema --> Stage) is the valid object hierarchy when building a Snowflake environment, according to the Snowflake documentation. Snowflake is a cloud data platform that supports various types of objects, such as databases, schemas, tables, views, stages, warehouses, and more. These objects are organized hierarchically: an organization contains one or more accounts, an account contains databases, a database contains schemas, and schemas contain schema-level objects such as tables, views, and stages.
The other options are not valid object hierarchies because they omit or misplace objects in the structure. Option A places the warehouse under the schema level, which is incorrect because warehouses are account-level compute objects, not schema-level objects. Option C omits the database level and places the stage under the table, which is incorrect. Option D omits the database and schema levels and places the stage directly under the account, with the table and view nested beneath it, which is incorrect.
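As a hedged illustration of this hierarchy (all object names below are hypothetical), the schema-level objects are created inside a database within an existing account; the organization and account levels themselves are provisioned by an ORGADMIN, not by these statements:
CREATE DATABASE sales_db;
CREATE SCHEMA sales_db.raw;
CREATE STAGE sales_db.raw.landing_stage;  -- a stage lives under a schema, which lives under a database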
How do Snowflake databases that are created from shares differ from standard databases that are not created from shares? (Choose three.)
Shared databases are read-only.
Shared databases must be refreshed in order for new data to be visible.
Shared databases cannot be cloned.
Shared databases are not supported by Time Travel.
Shared databases will have the PUBLIC or INFORMATION_SCHEMA schemas without explicitly granting these schemas to the share.
Shared databases can also be created as transient databases.
According to the SnowPro Advanced: Architect documents and learning resources, the ways that Snowflake databases that are created from shares differ from standard databases that are not created from shares are: they are read-only (option A), they cannot be cloned (option C), and they are not supported by Time Travel (option D).
The other options are incorrect because they are not ways that Snowflake databases that are created from shares differ from standard databases that are not created from shares. Option B is incorrect because shared databases do not need to be refreshed in order for new data to be visible. The data consumers who access the shared databases can see the latest data as soon as the data providers update the data1. Option E is incorrect because shared databases will not have the PUBLIC or INFORMATION_SCHEMA schemas without explicitly granting these schemas to the share. The data consumers who access the shared databases can only see the objects that the data providers grant to the share, and the PUBLIC and INFORMATION_SCHEMA schemas are not granted by default4. Option F is incorrect because shared databases cannot be created as transient databases. Transient databases are databases that do not support Time Travel or Fail-safe, and can be dropped without affecting the retention period of the data. Shared databases are always created as permanent databases, regardless of the type of the source database5. References: Introduction to Secure Data Sharing | Snowflake Documentation, Cloning Objects | Snowflake Documentation, Time Travel | Snowflake Documentation, Working with Shares | Snowflake Documentation, CREATE DATABASE | Snowflake Documentation
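A minimal consumer-side sketch of this behavior, with hypothetical provider account, share, and object names:
CREATE DATABASE shared_sales FROM SHARE provider_account.sales_share;
-- The resulting database is read-only: DML, cloning, and Time Travel queries against it are not supported.
SELECT * FROM shared_sales.public.orders;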
What are some of the characteristics of result set caches? (Choose three.)
Time Travel queries can be executed against the result set cache.
Snowflake persists the data results for 24 hours.
Each time persisted results for a query are used, a 24-hour retention period is reset.
The data stored in the result cache will contribute to storage costs.
The retention period can be reset for a maximum of 31 days.
The result set cache is not shared between warehouses.
In Snowflake, the characteristics of result set caches include persistence of data results for 24 hours (B), resetting of the 24-hour retention period each time the persisted results for a query are reused (C), and the result set cache not being shared between warehouses (F). The result set cache avoids repeated execution of the same query within this timeframe, reducing computational overhead and speeding up query responses. These cached results do not contribute to storage costs. References: Snowflake Documentation on Result Set Caching.
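A small hedged sketch of how the result cache can be toggled when testing query performance:
ALTER SESSION SET USE_CACHED_RESULT = FALSE;  -- bypass the result cache and force re-execution
ALTER SESSION SET USE_CACHED_RESULT = TRUE;   -- default: reuse persisted results when the query and underlying data are unchanged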
What are purposes for creating a storage integration? (Choose three.)
Control access to Snowflake data using a master encryption key that is maintained in the cloud provider’s key management service.
Store a generated identity and access management (IAM) entity for an external cloud provider regardless of the cloud provider that hosts the Snowflake account.
Support multiple external stages using one single Snowflake object.
Avoid supplying credentials when creating a stage or when loading or unloading data.
Create private VPC endpoints that allow direct, secure connectivity between VPCs without traversing the public internet.
Manage credentials from multiple cloud providers in one single Snowflake object.
The purposes of creating a storage integration in Snowflake include:
B. Store a generated identity and access management (IAM) entity for an external cloud provider. This helps manage authentication and authorization with external cloud storage without embedding credentials in Snowflake, and it supports various cloud providers such as AWS, Azure, and GCP, ensuring that identity management is streamlined across platforms.
C. Support multiple external stages using one single Snowflake object. Storage integrations allow you to set up an access configuration that can be reused across multiple external stages, simplifying the management of external data integrations.
D. Avoid supplying credentials when creating a stage or when loading or unloading data. By using a storage integration, Snowflake can interact with external storage without the need to continuously manage or expose sensitive credentials, enhancing security and ease of operations.
References: Snowflake documentation on storage integrations, found within the SnowPro Advanced: Architect course materials.
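A hedged AWS-flavored sketch of these purposes (the role ARN, bucket paths, and object names are placeholders):
CREATE STORAGE INTEGRATION s3_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake_access'
  STORAGE_ALLOWED_LOCATIONS = ('s3://my-bucket/load/', 's3://my-bucket/unload/');
-- Multiple external stages can reuse the same integration, and no credentials are supplied when creating them:
CREATE STAGE load_stage STORAGE_INTEGRATION = s3_int URL = 's3://my-bucket/load/';
CREATE STAGE unload_stage STORAGE_INTEGRATION = s3_int URL = 's3://my-bucket/unload/';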
An Architect needs to design a Snowflake account and database strategy to store and analyze large amounts of structured and semi-structured data. There are many business units and departments within the company. The requirements are scalability, security, and cost efficiency.
What design should be used?
Create a single Snowflake account and database for all data storage and analysis needs, regardless of data volume or complexity.
Set up separate Snowflake accounts and databases for each department or business unit, to ensure data isolation and security.
Use Snowflake's data lake functionality to store and analyze all data in a central location, without the need for structured schemas or indexes
Use a centralized Snowflake database for core business data, and use separate databases for departmental or project-specific data.
The best design to store and analyze large amounts of structured and semi-structured data for different business units and departments is to use a centralized Snowflake database for core business data, and use separate databases for departmental or project-specific data. This design allows for scalability, security, and cost efficiency by leveraging Snowflake features such as role-based access control to isolate each department's data, independently sized virtual warehouses so each business unit scales compute to its own workload, and zero-copy cloning and secure data sharing to reuse the core business data without duplicating storage.
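A minimal, hedged sketch of that layout (database and role names are hypothetical; schema- and table-level grants are omitted for brevity):
CREATE DATABASE core_db;       -- centralized core business data
CREATE DATABASE marketing_db;  -- departmental or project-specific data
CREATE DATABASE finance_db;
GRANT USAGE ON DATABASE core_db TO ROLE marketing_role;       -- departments get access to the shared core data
GRANT USAGE ON DATABASE marketing_db TO ROLE marketing_role;  -- and get access to their own departmental database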
How can an Architect enable optimal clustering to enhance performance for different access paths on a given table?
Create multiple clustering keys for a table.
Create multiple materialized views with different cluster keys.
Create super projections that will automatically create clustering.
Create a clustering key that contains all columns used in the access paths.
According to the SnowPro Advanced: Architect documents and learning resources, the best way to enable optimal clustering to enhance performance for different access paths on a given table is to create multiple materialized views with different cluster keys. A materialized view is a pre-computed result set that is derived from a query on one or more base tables. A materialized view can be clustered by specifying a clustering key, which is a subset of columns or expressions that determines how the data in the materialized view is co-located in micro-partitions. By creating multiple materialized views with different cluster keys, an Architect can optimize the performance of queries that use different access paths on the same base table. For example, if a base table has columns A, B, C, and D, and there are queries that filter on A and B, or on C and D, or on A and C, the Architect can create three materialized views, each with a different cluster key: (A, B), (C, D), and (A, C). This way, each query can leverage the optimal clustering of the corresponding materialized view and achieve faster scan efficiency and better compression.
References:
https://www.snowflake.com/blog/using-materialized-views-to-solve-multi-clustering-performance-problems/
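A hedged sketch of the pattern described above, assuming a hypothetical base table t1(a, b, c, d):
CREATE MATERIALIZED VIEW mv_ab CLUSTER BY (a, b) AS SELECT a, b, c, d FROM t1;
CREATE MATERIALIZED VIEW mv_cd CLUSTER BY (c, d) AS SELECT a, b, c, d FROM t1;
CREATE MATERIALIZED VIEW mv_ac CLUSTER BY (a, c) AS SELECT a, b, c, d FROM t1;
-- Queries filtering on (a, b), (c, d), or (a, c) can then be served by the correspondingly clustered view.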
A table contains five columns and has millions of records. The cardinality distribution of the columns is shown below:
Columns C4 and C5 are mostly used by SELECT queries in the GROUP BY and ORDER BY clauses, whereas columns C1, C2, and C3 are heavily used in filter and join conditions of SELECT queries.
The Architect must design a clustering key for this table to improve the query performance.
Based on Snowflake recommendations, how should the clustering key columns be ordered while defining the multi-column clustering key?
C5, C4, C2
C3, C4, C5
C1, C3, C2
C2, C1, C3
According to the Snowflake documentation, the main considerations for choosing a clustering key are to select columns that are most actively used in selective filter and join predicates, and, for a multi-column key, to order the columns from lowest cardinality to highest cardinality1.
Based on these considerations, the best option for the clustering key columns is C (C1, C3, C2): these are the columns heavily used in filter and join conditions, ordered from lowest to highest cardinality according to the distribution shown above, whereas C4 and C5 are used only in GROUP BY and ORDER BY clauses and would contribute little to partition pruning.
References: 1: Considerations for Choosing Clustering for a Table | Snowflake Documentation
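A hedged sketch of applying that key (the table name is hypothetical and the column order assumes the cardinality distribution above):
ALTER TABLE my_table CLUSTER BY (C1, C3, C2);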
Which of the below commands will use warehouse credits?
SHOW TABLES LIKE 'SNOWFL%';
SELECT MAX(FLAKE_ID) FROM SNOWFLAKE;
SELECT COUNT(*) FROM SNOWFLAKE;
SELECT COUNT(FLAKE_ID) FROM SNOWFLAKE GROUP BY FLAKE_ID;
Queries that can be satisfied entirely from metadata do not require a running warehouse and therefore do not consume warehouse credits: SHOW TABLES is a metadata operation, and simple MAX and COUNT(*) aggregates can be answered from micro-partition metadata in the cloud services layer. SELECT COUNT(FLAKE_ID) FROM SNOWFLAKE GROUP BY FLAKE_ID must scan and group the table data, so it requires a running warehouse and will use warehouse credits. References: Understanding Compute Cost; MAX Function; COUNT Function; GROUP BY Clause; SHOW TABLES
An Architect has chosen to separate their Snowflake Production and QA environments using two separate Snowflake accounts.
The QA account is intended to run and test changes on data and database objects before pushing those changes to the Production account. It is a requirement that all database objects and data in the QA account be an exact copy of the database objects (including privileges) and data in the Production account, on at least a nightly basis.
Which is the LEAST complex approach to use to populate the QA account with the Production account’s data and database objects on a nightly basis?
1) Create a share in the Production account for each database
2) Share access to the QA account as a Consumer
3) The QA account creates a database directly from each share
4) Create clones of those databases on a nightly basis
5) Run tests directly on those cloned databases
1) Create a stage in the Production account
2) Create a stage in the QA account that points to the same external object-storage location
3) Create a task that runs nightly to unload each table in the Production account into the stage
4) Use Snowpipe to populate the QA account
1) Enable replication for each database in the Production account
2) Create replica databases in the QA account
3) Create clones of the replica databases on a nightly basis
4) Run tests directly on those cloned databases
1) In the Production account, create an external function that connects into the QA account and returns all the data for one specific table
2) Run the external function as part of a stored procedure that loops through each table in the Production account and populates each table in the QA account
This approach is the least complex because it uses Snowflake’s built-in replication feature to copy the data and database objects from the Production account to the QA account. Replication is a fast and efficient way to synchronize data across accounts, regions, and cloud platforms. It also preserves the privileges and metadata of the replicated objects. By creating clones of the replica databases, the QA account can run tests on the cloned data without affecting the original data. Clones are also zero-copy, meaning they do not consume any additional storage space unless the data is modified. This approach does not require any external stages, tasks, Snowpipe, or external functions, which can add complexity and overhead to the data transfer process.
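A hedged sketch of these steps (organization, account, and database names are hypothetical):
-- In the Production account:
ALTER DATABASE prod_db ENABLE REPLICATION TO ACCOUNTS myorg.qa_account;
-- In the QA account:
CREATE DATABASE prod_db_replica AS REPLICA OF myorg.prod_account.prod_db;
ALTER DATABASE prod_db_replica REFRESH;                 -- run nightly, for example from a scheduled task
CREATE DATABASE nightly_test_db CLONE prod_db_replica;  -- zero-copy clone to run the tests against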
How can the Snowpipe REST API be used to keep a log of data load history?
Call insertReport every 20 minutes, fetching the last 10,000 entries.
Call loadHistoryScan every minute for the maximum time range.
Call insertReport every 8 minutes for a 10-minute time range.
Call loadHistoryScan every 10 minutes for a 15-minute range.
The Snowpipe REST API provides two endpoints for retrieving the data load history: insertReport and loadHistoryScan. The insertReport endpoint returns the status of the files that were submitted to the insertFiles endpoint, while the loadHistoryScan endpoint returns the history of the files that were actually loaded into the table by Snowpipe. To keep a log of data load history, it is recommended to use the loadHistoryScan endpoint, which provides more accurate and complete information about the data ingestion process. The loadHistoryScan endpoint accepts a start time and an end time as parameters, and returns the files that were loaded within that time range. The maximum time range that can be specified is 15 minutes, and the maximum number of files that can be returned is 10,000. Therefore, to keep a log of data load history, the best option is to call the loadHistoryScan endpoint every 10 minutes for a 15-minute time range, and store the results in a log file or a table. This way, the log will capture all the files that were loaded by Snowpipe and avoid any gaps in coverage. The other options are incorrect because insertReport only reports on files submitted through insertFiles, retains events for a short window, and returns at most 10,000 entries, so options A and C risk missing files, while calling loadHistoryScan every minute for the maximum range (option B) generates far more API calls than necessary and produces heavily overlapping results.
An Architect is designing a pipeline to stream event data into Snowflake using the Snowflake Kafka connector. The Architect’s highest priority is to configure the connector to stream data in the MOST cost-effective manner.
Which of the following is recommended for optimizing the cost associated with the Snowflake Kafka connector?
Utilize a higher Buffer.flush.time in the connector configuration.
Utilize a higher Buffer.size.bytes in the connector configuration.
Utilize a lower Buffer.size.bytes in the connector configuration.
Utilize a lower Buffer.count.records in the connector configuration.
The minimum value supported for the buffer.flush.time property is 1 (in seconds). For higher average data flow rates, we suggest that you decrease the default value for improved latency. If cost is a greater concern than latency, you could increase the buffer flush time. Be careful to flush the Kafka memory buffer before it becomes full to avoid out of memory exceptions. https://docs.snowflake.com/en/user-guide/data-load-snowpipe-streaming-kafka
An Architect for a multi-national transportation company has a system that is used to check the weather conditions along vehicle routes. The data is provided to drivers.
The weather information is delivered regularly by a third-party company and is generated as a JSON structure. The data is then loaded into Snowflake into a column with a VARIANT data type. This table is directly queried to deliver the statistics to the drivers with minimum time lapse.
A single entry includes (but is not limited to):
- Weather condition; cloudy, sunny, rainy, etc.
- Degree
- Longitude and latitude
- Timeframe
- Location address
- Wind
The table holds more than 10 years' worth of data in order to deliver the statistics from different years and locations. The amount of data on the table increases every day.
The drivers report that they are not receiving the weather statistics for their locations in time.
What can the Architect do to deliver the statistics to the drivers faster?
Create an additional table in the schema for longitude and latitude. Determine a regular task to fill this information by extracting it from the JSON dataset.
Add search optimization service on the variant column for longitude and latitude in order to query the information by using specific metadata.
Divide the table into several tables for each year by using the timeframe information from the JSON dataset in order to process the queries in parallel.
Divide the table into several tables for each location by using the location address information from the JSON dataset in order to process the queries in parallel.
To improve the performance of queries on semi-structured data, such as JSON stored in a VARIANT column, Snowflake’s search optimization service can be utilized. By adding search optimization specifically for the longitude and latitude fields within the VARIANT column, the system can perform point lookups and substring queries more efficiently. This will allow for faster retrieval of weather statistics, which is critical for the drivers to receive timely updates.
References: The solution is supported by Snowflake documentation that details how search optimization can enhance query performance for semi-structured data1.
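A hedged sketch (the table name, VARIANT column name, and JSON paths are assumptions based on the scenario):
ALTER TABLE weather_events ADD SEARCH OPTIMIZATION
  ON EQUALITY(weather_json:longitude, weather_json:latitude);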
Company A would like to share data in Snowflake with Company B. Company B is not on the same cloud platform as Company A.
What is required to allow data sharing between these two companies?
Create a pipeline to write shared data to a cloud storage location in the target cloud provider.
Ensure that all views are persisted, as views cannot be shared across cloud platforms.
Setup data replication to the region and cloud platform where the consumer resides.
Company A and Company B must agree to use a single cloud platform: Data sharing is only possible if the companies share the same cloud provider.
According to the SnowPro Advanced: Architect documents and learning resources, the requirement to allow data sharing between two companies that are not on the same cloud platform is to set up data replication to the region and cloud platform where the consumer resides. Data replication is a feature of Snowflake that enables copying databases across accounts in different regions and cloud platforms. Data replication allows data providers to securely share data with data consumers across different regions and cloud platforms by creating a replica database in the consumer’s account. The replica database is read-only and automatically synchronized with the primary database in the provider’s account. Data replication is useful for scenarios where data sharing is not possible or desirable due to latency, compliance, or security reasons1. The other options are incorrect because they are not required or feasible to allow data sharing between two companies that are not on the same cloud platform. Option A is incorrect because creating a pipeline to write shared data to a cloud storage location in the target cloud provider is not a secure or efficient way of sharing data. It would require additional steps to load the data from the cloud storage to the consumer’s account, and it would not leverage the benefits of Snowflake’s data sharing features. Option B is incorrect because ensuring that all views are persisted is not relevant for data sharing across cloud platforms. Views can be shared across cloud platforms as long as they reference objects in the same database. Persisting views is an option to improve the performance of querying views, but it is not required for data sharing2. Option D is incorrect because Company A and Company B do not need to agree to use a single cloud platform. Data sharing is possible across different cloud platforms using data replication or other methods, such as listings or auto-fulfillment3. References: Replicating Databases Across Multiple Accounts | Snowflake Documentation, Persisting Views | Snowflake Documentation, Sharing Data Across Regions and Cloud Platforms | Snowflake Documentation
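A hedged sketch of the sequence in option C (organization, account, database, and share names are hypothetical; schema- and table-level grants are omitted):
-- In Company A's primary account in AWS us-west-2:
ALTER DATABASE sales_db ENABLE REPLICATION TO ACCOUNTS companya_org.azure_eastus2_acct;
-- In Company A's new account in Azure East US 2:
CREATE DATABASE sales_db_replica AS REPLICA OF companya_org.aws_uswest2_acct.sales_db;
ALTER DATABASE sales_db_replica REFRESH;
CREATE SHARE sales_share;
GRANT USAGE ON DATABASE sales_db_replica TO SHARE sales_share;
ALTER SHARE sales_share ADD ACCOUNTS = companyb_acct;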
A company wants to integrate its main enterprise identity provider with federated authentication with Snowflake.
The authentication integration has been configured and roles have been created in Snowflake. However, the users are not automatically appearing in Snowflake when created, and their group membership is not reflected in their assigned roles.
How can the missing functionality be enabled with the LEAST amount of operational overhead?
OAuth must be configured between the identity provider and Snowflake. Then the authorization server must be configured with the right mapping of users and roles.
OAuth must be configured between the identity provider and Snowflake. Then the authorization server must be configured with the right mapping of users, and the resource server must be configured with the right mapping of role assignment.
SCIM must be enabled between the identity provider and Snowflake. Once both are synchronized through SCIM, their groups will get created as group accounts in Snowflake and the proper roles can be granted.
SCIM must be enabled between the identity provider and Snowflake. Once both are synchronized through SCIM, users will automatically get created and their group membership will be reflected as roles in Snowflake.
The best way to integrate an enterprise identity provider with federated authentication and enable automatic user creation and role assignment in Snowflake is to use SCIM (System for Cross-domain Identity Management). SCIM allows Snowflake to synchronize with the identity provider and create users and groups based on the information provided by the identity provider. The groups are mapped to roles in Snowflake, and the users are assigned the roles based on their group membership. This way, the identity provider remains the source of truth for user and group management, and Snowflake automatically reflects the changes without manual intervention. The other options are either incorrect or incomplete, as they involve using OAuth, which is a protocol for authorization, not authentication or user provisioning, and require additional configuration of authorization and resource servers.
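A hedged sketch of enabling SCIM provisioning, shown for Azure AD (the integration name is hypothetical; the provisioner role follows Snowflake's documented convention):
CREATE ROLE IF NOT EXISTS AAD_PROVISIONER;
GRANT CREATE USER ON ACCOUNT TO ROLE AAD_PROVISIONER;
GRANT CREATE ROLE ON ACCOUNT TO ROLE AAD_PROVISIONER;
CREATE SECURITY INTEGRATION aad_scim_integration
  TYPE = SCIM
  SCIM_CLIENT = 'AZURE'
  RUN_AS_ROLE = 'AAD_PROVISIONER';
-- The identity provider then provisions users and groups through SCIM; groups arrive as roles that can be granted further privileges.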
How is the change of local time due to daylight savings time handled in Snowflake tasks? (Choose two.)
A task scheduled in a UTC-based schedule will have no issues with the time changes.
Task schedules can be designed to follow specified or local time zones to accommodate the time changes.
A task will move to a suspended state during the daylight savings time change.
A frequent task execution schedule like minutes may not cause a problem, but will affect the task history.
A task schedule will follow only the specified time and will fail to handle lost or duplicated hours.
According to the Snowflake documentation and related learning resources, statements A and B are correct: a task on a UTC-based schedule has no issues with the time changes, and task schedules can be designed to follow a specified or local time zone to accommodate them. A task is a feature that allows scheduling and executing SQL statements or stored procedures in Snowflake, and a task can be scheduled using a cron expression that specifies the frequency and the time zone of the task execution.
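A hedged sketch contrasting the two scheduling styles (task, warehouse, and statement are hypothetical):
CREATE TASK utc_task
  WAREHOUSE = wh1
  SCHEDULE = 'USING CRON 0 2 * * * UTC'                    -- UTC-based: unaffected by daylight savings shifts
AS
  SELECT 1;
CREATE TASK local_task
  WAREHOUSE = wh1
  SCHEDULE = 'USING CRON 0 2 * * * America/Los_Angeles'    -- follows the local time zone, including its DST changes
AS
  SELECT 1;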
Database DB1 has schema S1 which has one table, T1.
DB1 --> S1 --> T1
The retention period of DB1 is set to 10 days.
The retention period of S1 is set to 20 days.
The retention period of T1 is set to 30 days.
The user runs the following command:
Drop Database DB1;
What will the Time Travel retention period be for T1?
10 days
20 days
30 days
37 days
The Time Travel retention period for T1 will be 30 days, which is the retention period set at the table level. The Time Travel retention period determines how long the historical data is preserved and accessible for an object after it is modified or dropped. The Time Travel retention period can be set at the account level, the database level, the schema level, or the table level. The retention period set at the lowest level of the hierarchy takes precedence over the higher levels. Therefore, the retention period set at the table level overrides the retention periods set at the schema level, the database level, or the account level. When the user drops the database DB1, the table T1 is also dropped, but the historical data is still preserved for 30 days, which is the retention period set at the table level. The user can use the UNDROP command to restore the table T1 within the 30-day period. The other options are incorrect because 10 days and 20 days are the retention periods set at the database and schema levels, which are overridden by the table-level setting, and 37 days does not correspond to any retention period defined in the scenario.
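A hedged sketch of the scenario's settings and recovery path:
ALTER DATABASE DB1 SET DATA_RETENTION_TIME_IN_DAYS = 10;
ALTER SCHEMA DB1.S1 SET DATA_RETENTION_TIME_IN_DAYS = 20;
ALTER TABLE DB1.S1.T1 SET DATA_RETENTION_TIME_IN_DAYS = 30;
DROP DATABASE DB1;
UNDROP DATABASE DB1;  -- restores DB1, and with it T1, while still inside the retention window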
A company has a table named Data that contains corrupted data. The company wants to recover the data as it was 5 minutes ago using cloning and Time Travel.
What command will accomplish this?
CREATE CLONE TABLE Recover_Data FROM Data AT(OFFSET => -60*5);
CREATE CLONE Recover_Data FROM Data AT(OFFSET => -60*5);
CREATE TABLE Recover_Data CLONE Data AT(OFFSET => -60*5);
CREATE TABLE Recover Data CLONE Data AT(TIME => -60*5);
This is the correct command to create a clone of the table Data as it was 5 minutes ago using cloning and Time Travel. Cloning is a feature that allows creating a copy of a database, schema, table, or view without duplicating the data or metadata. Time Travel is a feature that enables accessing historical data (i.e. data that has been changed or deleted) at any point within a defined period. To create a clone of a table at a point in time in the past, the syntax is:
CREATE TABLE <new_table_name> CLONE <source_table_name> AT (TIMESTAMP => <timestamp> | OFFSET => <time_difference_in_seconds>);
The OFFSET parameter specifies the time difference in seconds from the present time. A negative value indicates a point in the past. For example, -60*5 means 5 minutes ago. Alternatively, the TIMESTAMP parameter can be used to specify an exact timestamp in the past. The clone will contain the data as it existed in the source table at the specified point in time12.
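Applied to this scenario, that resolves to the statement in option C (the timestamp in the second form is purely illustrative):
CREATE TABLE Recover_Data CLONE Data AT(OFFSET => -60*5);
-- Equivalent form using an explicit point in time:
CREATE TABLE Recover_Data CLONE Data AT(TIMESTAMP => '2024-05-01 10:00:00'::TIMESTAMP_LTZ);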
What Snowflake system functions are used to view and or monitor the clustering metadata for a table? (Select TWO).
SYSTEM$CLUSTERING
SYSTEM$TABLE_CLUSTERING
SYSTEM$CLUSTERING_DEPTH
SYSTEM$CLUSTERING_RATIO
SYSTEM$CLUSTERING_INFORMATION
The Snowflake system functions used to view and monitor the clustering metadata for a table are SYSTEM$CLUSTERING_DEPTH, which returns the average depth of overlapping micro-partitions for the table (or for a specified set of columns), and SYSTEM$CLUSTERING_INFORMATION, which returns clustering details for the table, including the average clustering depth, as a JSON object. SYSTEM$CLUSTERING and SYSTEM$TABLE_CLUSTERING are not Snowflake system functions, and SYSTEM$CLUSTERING_RATIO is deprecated.
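A hedged usage sketch (the table and column names are hypothetical):
SELECT SYSTEM$CLUSTERING_DEPTH('my_table', '(C1, C2)');
SELECT SYSTEM$CLUSTERING_INFORMATION('my_table', '(C1, C2)');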
A company has a Snowflake environment running in AWS us-west-2 (Oregon). The company needs to share data privately with a customer who is running their Snowflake environment in Azure East US 2 (Virginia).
What is the recommended sequence of operations that must be followed to meet this requirement?
1. Create a share and add the database privileges to the share
2. Create a new listing on the Snowflake Marketplace
3. Alter the listing and add the share
4. Instruct the customer to subscribe to the listing on the Snowflake Marketplace
1. Ask the customer to create a new Snowflake account in Azure EAST US 2 (Virginia)
2. Create a share and add the database privileges to the share
3. Alter the share and add the customer's Snowflake account to the share
1. Create a new Snowflake account in Azure East US 2 (Virginia)
2. Set up replication between AWS us-west-2 (Oregon) and Azure East US 2 (Virginia) for the database objects to be shared
3. Create a share and add the database privileges to the share
4. Alter the share and add the customer's Snowflake account to the share
1. Create a reader account in Azure East US 2 (Virginia)
2. Create a share and add the database privileges to the share
3. Add the reader account to the share
4. Share the reader account's URL and credentials with the customer
Option C is the correct answer because it allows the company to share data privately with the customer across different cloud platforms and regions. The company can create a new Snowflake account in Azure East US 2 (Virginia) and set up replication between AWS us-west-2 (Oregon) and Azure East US 2 (Virginia) for the database objects to be shared. This way, the company can ensure that the data is always up to date and consistent in both accounts. The company can then create a share and add the database privileges to the share, and alter the share and add the customer’s Snowflake account to the share. The customer can then access the shared data from their own Snowflake account in Azure East US 2 (Virginia).
Option A is incorrect because the Snowflake Marketplace is not a private way of sharing data. The Snowflake Marketplace is a public data exchange platform that allows anyone to browse and subscribe to data sets from various providers. The company would not be able to control who can access their data if they use the Snowflake Marketplace.
Option B is incorrect because it requires the customer to create a new Snowflake account in Azure East US 2 (Virginia), which may not be feasible or desirable for the customer. The customer may already have an existing Snowflake account in a different cloud platform or region, and may not want to incur additional costs or complexity by creating a new account.
Option D is incorrect because it involves creating a reader account in Azure East US 2 (Virginia), which is a limited way of sharing data. A reader account is a special type of Snowflake account that can only consume data shared by the provider that created it, and the provider must provision, administer, and pay for all compute that the reader account uses. The company would have to manage the reader account's URL, users, and warehouses, and the customer would not be able to use their own Snowflake account to access the shared data.
In a managed access schema, what are characteristics of the roles that can manage object privileges? (Select TWO).
Users with the SYSADMIN role can grant object privileges in a managed access schema.
Users with the SECURITYADMIN role or higher, can grant object privileges in a managed access schema.
Users who are database owners can grant object privileges in a managed access schema.
Users who are schema owners can grant object privileges in a managed access schema.
Users who are object owners can grant object privileges in a managed access schema.
In a managed access schema, the privilege management is centralized with the schema owner, who has the authority to grant object privileges within the schema. Additionally, the SECURITYADMIN role has the capability to manage object grants globally, which includes within managed access schemas. Other roles, such as SYSADMIN or database owners, do not inherently have this privilege unless explicitly granted.
References: The verified answers are based on Snowflake’s official documentation, which outlines the roles and privileges associated with managed access schemas12.
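A hedged sketch (database, schema, table, and role names are hypothetical):
CREATE SCHEMA analytics.reporting WITH MANAGED ACCESS;
-- In a managed access schema, object owners cannot grant privileges on their own objects;
-- only the schema owner, or a role with the MANAGE GRANTS privilege such as SECURITYADMIN, can:
GRANT SELECT ON TABLE analytics.reporting.sales TO ROLE analyst_role;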
Assuming all Snowflake accounts are using an Enterprise edition or higher, in which development and testing scenarios would copying of data be required, and zero-copy cloning not be suitable? (Select TWO).
Developers create their own datasets to work against transformed versions of the live data.
Production and development run in different databases in the same account, and Developers need to see production-like data but with specific columns masked.
Data is in a production Snowflake account that needs to be provided to Developers in a separate development/testing Snowflake account in the same cloud region.
Developers create their own copies of a standard test database previously created for them in the development account, for their initial development and unit testing.
The release process requires pre-production testing of changes with data of production scale and complexity. For security reasons, pre-production also runs in the production account.
Zero-copy cloning is a feature that allows creating a clone of a table, schema, or database without physically copying the data. Zero-copy cloning is suitable for scenarios where the cloned object needs to have the same data and metadata as the original object, and where the cloned object does not need to be modified or updated frequently. Zero-copy cloning is also suitable for scenarios where the cloned object needs to be shared within the same Snowflake account or across different accounts in the same cloud region2
However, zero-copy cloning is not suitable for scenarios where the cloned object needs to have different data or metadata than the original object, or where the cloned object needs to be modified or updated frequently. Zero-copy cloning is also not suitable for scenarios where the cloned object needs to be shared across different accounts in different cloud regions. In these scenarios, copying of data would be required, either by using the COPY INTO command or by using data sharing with secure views3
The following are development and testing scenarios where copying of data would be required and zero-copy cloning would not be suitable: Developers creating their own datasets to work against transformed versions of the live data (option A), because the transformations produce new data that must be written out rather than cloned; and data in a production Snowflake account that needs to be provided to Developers in a separate development/testing Snowflake account (option C), because clones cannot be created across account boundaries, so the data must be replicated or copied instead.
The following are development and testing scenarios where zero-copy cloning would be suitable and copying of data would not be required: providing production-like data with specific columns masked within the same account (option B, by combining cloning with masking policies or secure views), Developers creating their own copies of a standard test database previously created for them in the development account (option D), and pre-production testing of changes with data of production scale and complexity inside the production account (option E).
What integration object should be used to place restrictions on where data may be exported?
Stage integration
Security integration
Storage integration
API integration
In Snowflake, a storage integration is used to define and configure external cloud storage that Snowflake will interact with. This includes specifying security policies for access control. One of the main features of storage integrations is the ability to set restrictions on where data may be exported. This is done by binding the storage integration to specific cloud storage locations, thereby ensuring that Snowflake can only access those locations. It helps to maintain control over the data and complies with data governance and security policies by preventing unauthorized data exports to unspecified locations.
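A hedged sketch of account-level settings that, combined with the integration's STORAGE_ALLOWED_LOCATIONS, restrict where data can be exported:
ALTER ACCOUNT SET REQUIRE_STORAGE_INTEGRATION_FOR_STAGE_CREATION = TRUE;  -- stages must reference a storage integration
ALTER ACCOUNT SET PREVENT_UNLOAD_TO_INLINE_URL = TRUE;                    -- block ad hoc COPY INTO <url> unloads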
A media company needs a data pipeline that will ingest customer review data into a Snowflake table, and apply some transformations. The company also needs to use Amazon Comprehend to do sentiment analysis and make the de-identified final data set available publicly for advertising companies who use different cloud providers in different regions.
The data pipeline needs to run continuously and efficiently as new records arrive in the object storage, leveraging event notifications. Also, the operational complexity, maintenance of the infrastructure (including platform upgrades and security), and the development effort should be minimal.
Which design will meet these requirements?
Ingest the data using COPY INTO and use streams and tasks to orchestrate transformations. Export the data into Amazon S3 to do model inference with Amazon Comprehend and ingest the data back into a Snowflake table. Then create a listing in the Snowflake Marketplace to make the data available to other companies.
Ingest the data using Snowpipe and use streams and tasks to orchestrate transformations. Create an external function to do model inference with Amazon Comprehend and write the final records to a Snowflake table. Then create a listing in the Snowflake Marketplace to make the data available to other companies.
Ingest the data into Snowflake using Amazon EMR and PySpark using the Snowflake Spark connector. Apply transformations using another Spark job. Develop a python program to do model inference by leveraging the Amazon Comprehend text analysis API. Then write the results to a Snowflake table and create a listing in the Snowflake Marketplace to make the data available to other companies.
Ingest the data using Snowpipe and use streams and tasks to orchestrate transformations. Export the data into Amazon S3 to do model inference with Amazon Comprehend and ingest the data back into a Snowflake table. Then create a listing in the Snowflake Marketplace to make the data available to other companies.
This design meets all the requirements for the data pipeline. Snowpipe is a feature that enables continuous data loading into Snowflake from object storage using event notifications. It is efficient, scalable, and serverless, meaning it does not require any infrastructure or maintenance from the user. Streams and tasks are features that enable automated data pipelines within Snowflake, using change data capture and scheduled execution. They are also efficient, scalable, and serverless, and they simplify the data transformation process. External functions are functions that can invoke external services or APIs from within Snowflake. They can be used to integrate with Amazon Comprehend and perform sentiment analysis on the data. The results can be written back to a Snowflake table using standard SQL commands. Snowflake Marketplace is a platform that allows data providers to share data with data consumers across different accounts, regions, and cloud platforms. It is a secure and easy way to make data publicly available to other companies.
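A hedged sketch of the main pieces of this design (object names are hypothetical; the API integration behind the external function and the raw_reviews table with its single VARIANT column v are assumed to exist):
CREATE PIPE reviews_pipe AUTO_INGEST = TRUE AS
  COPY INTO raw_reviews FROM @reviews_stage FILE_FORMAT = (TYPE = JSON);
CREATE STREAM raw_reviews_stream ON TABLE raw_reviews;
CREATE TASK transform_reviews
  WAREHOUSE = etl_wh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('raw_reviews_stream')
AS
  INSERT INTO scored_reviews
  SELECT v:review_id, v:review_text, detect_sentiment(v:review_text::string)  -- detect_sentiment is a hypothetical external function wrapping Amazon Comprehend
  FROM raw_reviews_stream;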
A company is storing large numbers of small JSON files (ranging from 1-4 bytes) that are received from IoT devices and sent to a cloud provider. In any given hour, 100,000 files are added to the cloud provider.
What is the MOST cost-effective way to bring this data into a Snowflake table?
An external table
A pipe
A stream
A copy command at regular intervals
A pipe (Snowpipe) is the most cost-effective option for this pattern: Snowpipe loads files continuously as object-storage event notifications arrive, using serverless compute that is billed per second of actual use rather than requiring a warehouse to be kept running or resumed on a schedule. An external table would leave the data outside Snowflake rather than bringing it into a Snowflake table, a stream does not load data by itself, and running COPY commands at regular intervals consumes warehouse credits even when few files have arrived. References: Pipes; Loading Data Using Snowpipe; External Tables; Streams; COPY INTO